Message ID | 20230113035000.480021-7-ricarkol@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: arm64: Eager Huge-page splitting for dirty-logging |
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote: > > Split huge pages eagerly when enabling dirty logging. The goal is to > avoid doing it while faulting on write-protected pages, which > negatively impacts guest performance. > > A memslot marked for dirty logging is split in 1GB pieces at a time. > This is in order to release the mmu_lock and give other kernel threads > the opportunity to run, and also in order to allocate enough pages to > split a 1GB range worth of huge pages (or a single 1GB huge page). > Note that these page allocations can fail, so eager page splitting is > best-effort. This is not a correctness issue though, as huge pages > can still be split on write-faults. > > The benefits of eager page splitting are the same as in x86, added > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > the TDP MMU when dirty logging is enabled"). For example, when running > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > all of their memory after dirty logging is enabled decreased by 44% > from 2.58s to 1.42s. > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > --- > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > 2 files changed, 138 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index 35a159d131b5..6ab37209b1d1 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > /* The last vcpu id that ran on each physical CPU */ > int __percpu *last_vcpu_ran; > > + /* > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > + * pages. It is used to allocate stage2 page tables while splitting > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > + * > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > + * all the available huge-page sizes, and be a multiple of all the > + * other ones; for example, 1GB when all the available huge-page sizes > + * are (1GB, 2MB, 32MB, 512MB). This feels a little fragile to link scheduling decisions into the batch size. (I'm not saying we don't do the same thing on x86, but it's a mistake in either case.) I'd prefer if we could yield more granularly in a way that's not tied to splitting a whole EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages. Tuning the chunk size to balance memory overhead with batch performance makes sense, but it becomes more difficult to also balance with the needs of the scheduler. Could we add some yield point in kvm_pgtable_stage2_split or give it an early return path or something? > + * > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > + * example, 1GB requires the following number of PAGE_SIZE-pages: > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > + * granules. 
> + * > + * Protected by kvm->slots_lock. > + */ > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) > + struct kvm_mmu_memory_cache split_page_cache; > + > struct kvm_arch *arch; > }; > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 700c5774b50d..41ee330edae3 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > static unsigned long io_map_base; > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > +bool __read_mostly eager_page_split = true; > +module_param(eager_page_split, bool, 0644); > + > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end, > + phys_addr_t size) > { > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > phys_addr_t boundary = ALIGN_DOWN(addr + size, size); > > return (boundary - 1 < end - 1) ? boundary : end; > } > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > +{ > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > + > + return __stage2_range_addr_end(addr, end, size); > +} > + > /* > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise, > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK, > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr, > return ret; > } > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min) > +{ > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min; > +} > + > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm) > +{ > + struct kvm_mmu_memory_cache *cache; > + > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) > + return true; > + > + cache = &kvm->arch.mmu.split_page_cache; > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY); > +} > + > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr, > + phys_addr_t end) > +{ > + struct kvm_mmu_memory_cache *cache; > + struct kvm_pgtable *pgt; > + int ret; > + u64 next; > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY; > + > + lockdep_assert_held_write(&kvm->mmu_lock); > + > + lockdep_assert_held(&kvm->slots_lock); > + > + cache = &kvm->arch.mmu.split_page_cache; > + > + do { > + if (need_topup_split_page_cache_or_resched(kvm)) { > + write_unlock(&kvm->mmu_lock); > + cond_resched(); > + /* Eager page splitting is best-effort. 
*/ > + ret = __kvm_mmu_topup_memory_cache(cache, > + cache_capacity, > + cache_capacity); > + write_lock(&kvm->mmu_lock); > + if (ret) > + break; > + } > + > + pgt = kvm->arch.mmu.pgt; > + if (!pgt) > + return -EINVAL; > + > + next = __stage2_range_addr_end(addr, end, > + EAGER_PAGE_SPLIT_CHUNK_SIZE); > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache); > + if (ret) > + break; > + } while (addr = next, addr != end); > + > + return ret; > +} > + > #define stage2_apply_range_resched(kvm, addr, end, fn) \ > stage2_apply_range(kvm, addr, end, fn, true) > > @@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t > for_each_possible_cpu(cpu) > *per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1; > > + mmu->split_page_cache.gfp_zero = __GFP_ZERO; > + > mmu->pgt = pgt; > mmu->pgd_phys = __pa(pgt->pgd); > return 0; > @@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t > void kvm_uninit_stage2_mmu(struct kvm *kvm) > { > kvm_free_stage2_pgd(&kvm->arch.mmu); > + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache); > } > > static void stage2_unmap_memslot(struct kvm *kvm, > @@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, > stage2_wp_range(&kvm->arch.mmu, start, end); > } > > +/** > + * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE > + * pages for memory slot > + * @kvm: The KVM pointer > + * @slot: The memory slot to split > + * > + * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired, > + * serializing operations for VM memory regions. > + */ > +static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot) > +{ > + struct kvm_memslots *slots = kvm_memslots(kvm); > + struct kvm_memory_slot *memslot = id_to_memslot(slots, slot); > + phys_addr_t start, end; > + > + start = memslot->base_gfn << PAGE_SHIFT; > + end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT; > + > + write_lock(&kvm->mmu_lock); > + kvm_mmu_split_huge_pages(kvm, start, end); > + write_unlock(&kvm->mmu_lock); > +} > + > /* > * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected > * dirty pages. > @@ -1783,7 +1877,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, > if (kvm_dirty_log_manual_protect_and_init_set(kvm)) > return; > > + if (READ_ONCE(eager_page_split)) > + kvm_mmu_split_memory_region(kvm, new->id); > + > kvm_mmu_wp_memory_region(kvm, new->id); > + } else { > + /* > + * Free any leftovers from the eager page splitting cache. Do > + * this when deleting, moving, disabling dirty logging, or > + * creating the memslot (a nop). Doing it for deletes makes > + * sure we don't leak memory, and there's no need to keep the > + * cache around for any of the other cases. > + */ > + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache); > } > } > > -- > 2.39.0.314.g84b9a713c41-goog >
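A usage aside on the eager_page_split knob added above: because it is registered with mode 0644, it can be flipped at runtime through sysfs rather than only at boot, which is what makes the A/B performance comparisons mentioned later in this thread easy. Assuming the parameter lands under the kvm module directory as it does on x86 (that path is an assumption, not something stated in the patch):

    echo N > /sys/module/kvm/parameters/eager_page_split

This disables eager splitting for memslots that have dirty logging enabled afterwards (the flag is read at commit time), and writing Y turns it back on.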
On Tue, Jan 24, 2023 at 09:52:49AM -0800, Ben Gardon wrote: > On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote: > > > > Split huge pages eagerly when enabling dirty logging. The goal is to > > avoid doing it while faulting on write-protected pages, which > > negatively impacts guest performance. > > > > A memslot marked for dirty logging is split in 1GB pieces at a time. > > This is in order to release the mmu_lock and give other kernel threads > > the opportunity to run, and also in order to allocate enough pages to > > split a 1GB range worth of huge pages (or a single 1GB huge page). > > Note that these page allocations can fail, so eager page splitting is > > best-effort. This is not a correctness issue though, as huge pages > > can still be split on write-faults. > > > > The benefits of eager page splitting are the same as in x86, added > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > > the TDP MMU when dirty logging is enabled"). For example, when running > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > > all of their memory after dirty logging is enabled decreased by 44% > > from 2.58s to 1.42s. > > > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > > --- > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > > 2 files changed, 138 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index 35a159d131b5..6ab37209b1d1 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > > /* The last vcpu id that ran on each physical CPU */ > > int __percpu *last_vcpu_ran; > > > > + /* > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > > + * pages. It is used to allocate stage2 page tables while splitting > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > > + * > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > > + * all the available huge-page sizes, and be a multiple of all the > > + * other ones; for example, 1GB when all the available huge-page sizes > > + * are (1GB, 2MB, 32MB, 512MB). > > This feels a little fragile to link scheduling decisions into the > batch size. Completely agree. > (I'm not saying we don't do the same thing on x86, but > it's a mistake in either case.) I'd prefer if we could yield more > granularly in a way that's not tied to splitting a whole > EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages. > Tuning the chunk size to balance memory overhead with batch > performance makes sense, but it becomes more difficult to also balance > with the needs of the scheduler. Could we add some yield point in > kvm_pgtable_stage2_split or give it an early return path or something? We definitely need to move in the direction of early returns, but I'd rather not build it purely for the sake of eager page spliting. A better approach would likely be to add it to the core page table walker code such that all walkers that operate on a swath of memory can use it. 
In the interest of keeping this series manageable, I'd prefer yielding walkers be done in a separate series. -- Thanks, Oliver
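As a rough illustration of the deferred idea (a sketch only; the helper below and its -EAGAIN re-walk convention are made up, not part of the kvm_pgtable API), a shared yield point for long stage-2 walks could be built from the same primitives the posted patch already uses:

        /*
         * Sketch of a generic yield point a range-based stage-2 walker could
         * call between entries.  Drops and re-takes the MMU lock when the
         * scheduler or a lock waiter needs it, and returns -EAGAIN so the
         * caller restarts the walk from the current address.
         */
        static int stage2_walk_try_yield(struct kvm *kvm)
        {
                if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
                        return 0;

                write_unlock(&kvm->mmu_lock);
                cond_resched();
                write_lock(&kvm->mmu_lock);

                return -EAGAIN;
        }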
Hi Ricardo, On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote: > Split huge pages eagerly when enabling dirty logging. The goal is to > avoid doing it while faulting on write-protected pages, which > negatively impacts guest performance. > > A memslot marked for dirty logging is split in 1GB pieces at a time. > This is in order to release the mmu_lock and give other kernel threads > the opportunity to run, and also in order to allocate enough pages to > split a 1GB range worth of huge pages (or a single 1GB huge page). > Note that these page allocations can fail, so eager page splitting is > best-effort. This is not a correctness issue though, as huge pages > can still be split on write-faults. > > The benefits of eager page splitting are the same as in x86, added > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > the TDP MMU when dirty logging is enabled"). For example, when running > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > all of their memory after dirty logging is enabled decreased by 44% > from 2.58s to 1.42s. > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > --- > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > 2 files changed, 138 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index 35a159d131b5..6ab37209b1d1 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > /* The last vcpu id that ran on each physical CPU */ > int __percpu *last_vcpu_ran; > > + /* > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > + * pages. It is used to allocate stage2 page tables while splitting > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > + * > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > + * all the available huge-page sizes, and be a multiple of all the > + * other ones; for example, 1GB when all the available huge-page sizes > + * are (1GB, 2MB, 32MB, 512MB). > + * > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > + * example, 1GB requires the following number of PAGE_SIZE-pages: > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > + * granules. > + * > + * Protected by kvm->slots_lock. > + */ > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL as the batch size? 513 pages across all page sizes is a non-negligible amount of memory that goes largely unused when PAGE_SIZE != 4K. With that change it is a lot easier to correctly match the cache capacity to the selected page size. 
Additionally, we continue to have a single set of batching logic that we can improve later on. > + struct kvm_mmu_memory_cache split_page_cache; > + > struct kvm_arch *arch; > }; > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 700c5774b50d..41ee330edae3 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > static unsigned long io_map_base; > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > +bool __read_mostly eager_page_split = true; > +module_param(eager_page_split, bool, 0644); > + Unless someone is really begging for it I'd prefer we not add a module parameter for this. > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end, > + phys_addr_t size) > { > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > phys_addr_t boundary = ALIGN_DOWN(addr + size, size); > > return (boundary - 1 < end - 1) ? boundary : end; > } > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > +{ > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > + > + return __stage2_range_addr_end(addr, end, size); > +} > + > /* > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise, > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK, > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr, > return ret; > } > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min) > +{ > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min; > +} I don't think the helper is adding too much here. > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm) > +{ > + struct kvm_mmu_memory_cache *cache; > + > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) > + return true; > + > + cache = &kvm->arch.mmu.split_page_cache; > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY); > +} > + > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr, > + phys_addr_t end) > +{ > + struct kvm_mmu_memory_cache *cache; > + struct kvm_pgtable *pgt; > + int ret; > + u64 next; > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY; > + > + lockdep_assert_held_write(&kvm->mmu_lock); Rather than having the caller acquire the lock, can you instead do it here? It would appear that the entire critical section is enclosed within this function. > + lockdep_assert_held(&kvm->slots_lock); This function doesn't depend on anything guarded by the slots_lock, can you move this to kvm_mmu_split_memory_region()? > + cache = &kvm->arch.mmu.split_page_cache; > + > + do { > + if (need_topup_split_page_cache_or_resched(kvm)) { > + write_unlock(&kvm->mmu_lock); > + cond_resched(); > + /* Eager page splitting is best-effort. */ > + ret = __kvm_mmu_topup_memory_cache(cache, > + cache_capacity, > + cache_capacity); > + write_lock(&kvm->mmu_lock); > + if (ret) > + break; > + } > + > + pgt = kvm->arch.mmu.pgt; > + if (!pgt) > + return -EINVAL; > + > + next = __stage2_range_addr_end(addr, end, > + EAGER_PAGE_SPLIT_CHUNK_SIZE); > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache); > + if (ret) > + break; > + } while (addr = next, addr != end); > + > + return ret; > +} -- Thanks, Oliver
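To make the suggestion above concrete, here is an editorial sketch (not code from any posted version of the series) of how the cache capacity could be derived from the page-table geometry once the batch is a single block at KVM_PGTABLE_MIN_BLOCK_LEVEL (level 1 with 4K granules, level 2 otherwise). The helper name is invented; the macros it uses are the existing ones:

        /*
         * Number of page-table pages needed to split one block at
         * KVM_PGTABLE_MIN_BLOCK_LEVEL all the way down to PAGE_SIZE PTEs:
         * each block at an intermediate level becomes one table whose
         * entries are in turn split at the next level down.
         */
        static int eager_split_cache_capacity(void)
        {
                int level, tables = 0, blocks = 1;

                for (level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
                     level < KVM_PGTABLE_MAX_LEVELS - 1; level++) {
                        tables += blocks;       /* one new table per block here */
                        blocks *= PTRS_PER_PTE; /* blocks exposed one level down */
                }

                return tables;  /* 513 with 4K granules, 1 with 16K or 64K */
        }

With something along these lines the worst case for 16K/64K granules shrinks from 513 pages to a single page, which addresses the wasted-memory concern directly.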
On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > Hi Ricardo, > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote: > > Split huge pages eagerly when enabling dirty logging. The goal is to > > avoid doing it while faulting on write-protected pages, which > > negatively impacts guest performance. > > > > A memslot marked for dirty logging is split in 1GB pieces at a time. > > This is in order to release the mmu_lock and give other kernel threads > > the opportunity to run, and also in order to allocate enough pages to > > split a 1GB range worth of huge pages (or a single 1GB huge page). > > Note that these page allocations can fail, so eager page splitting is > > best-effort. This is not a correctness issue though, as huge pages > > can still be split on write-faults. > > > > The benefits of eager page splitting are the same as in x86, added > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > > the TDP MMU when dirty logging is enabled"). For example, when running > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > > all of their memory after dirty logging is enabled decreased by 44% > > from 2.58s to 1.42s. > > > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > > --- > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > > 2 files changed, 138 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index 35a159d131b5..6ab37209b1d1 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > > /* The last vcpu id that ran on each physical CPU */ > > int __percpu *last_vcpu_ran; > > > > + /* > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > > + * pages. It is used to allocate stage2 page tables while splitting > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > > + * > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > > + * all the available huge-page sizes, and be a multiple of all the > > + * other ones; for example, 1GB when all the available huge-page sizes > > + * are (1GB, 2MB, 32MB, 512MB). > > + * > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > > + * example, 1GB requires the following number of PAGE_SIZE-pages: > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > > + * granules. > > + * > > + * Protected by kvm->slots_lock. > > + */ > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL > as the batch size? 
513 pages across all page sizes is a non-negligible > amount of memory that goes largely unused when PAGE_SIZE != 4K. > Sounds good, will refine this for v2. > With that change it is a lot easier to correctly match the cache > capacity to the selected page size. Additionally, we continue to have a > single set of batching logic that we can improve later on. > > > + struct kvm_mmu_memory_cache split_page_cache; > > + > > struct kvm_arch *arch; > > }; > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index 700c5774b50d..41ee330edae3 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > > > static unsigned long io_map_base; > > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > +bool __read_mostly eager_page_split = true; > > +module_param(eager_page_split, bool, 0644); > > + > > Unless someone is really begging for it I'd prefer we not add a module > parameter for this. It was mainly to match x86 and because it makes perf testing a bit simpler. What do others think? > > > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end, > > + phys_addr_t size) > > { > > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > > phys_addr_t boundary = ALIGN_DOWN(addr + size, size); > > > > return (boundary - 1 < end - 1) ? boundary : end; > > } > > > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > +{ > > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > > + > > + return __stage2_range_addr_end(addr, end, size); > > +} > > + > > /* > > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise, > > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK, > > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr, > > return ret; > > } > > > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min) > > +{ > > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min; > > +} > > I don't think the helper is adding too much here. Will try how it looks without. > > > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm) > > +{ > > + struct kvm_mmu_memory_cache *cache; > > + > > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) > > + return true; > > + > > + cache = &kvm->arch.mmu.split_page_cache; > > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY); > > +} > > + > > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr, > > + phys_addr_t end) > > +{ > > + struct kvm_mmu_memory_cache *cache; > > + struct kvm_pgtable *pgt; > > + int ret; > > + u64 next; > > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY; > > + > > + lockdep_assert_held_write(&kvm->mmu_lock); > > Rather than having the caller acquire the lock, can you instead do it > here? It would appear that the entire critical section is enclosed > within this function. Sure. I will first double check things related to perf and correctness just in case. I'm not sure if the increase of acquire/releases makes any difference in perf. Also, not sure if there's a correctness issue because of releasing the lock between WP and split (I think it should be fine, but not 100% sure). > > > + lockdep_assert_held(&kvm->slots_lock); > > This function doesn't depend on anything guarded by the slots_lock, can > you move this to kvm_mmu_split_memory_region()? kvm_mmu_split_memory_region() takes a memslot. 
That works in this case, eager splitting when enabling dirty logging, but won't work in the next commit when spliting on the CLEAR ioctl. > > > + cache = &kvm->arch.mmu.split_page_cache; > > + > > + do { > > + if (need_topup_split_page_cache_or_resched(kvm)) { > > + write_unlock(&kvm->mmu_lock); > > + cond_resched(); > > + /* Eager page splitting is best-effort. */ > > + ret = __kvm_mmu_topup_memory_cache(cache, > > + cache_capacity, > > + cache_capacity); > > + write_lock(&kvm->mmu_lock); > > + if (ret) > > + break; > > + } > > + > > + pgt = kvm->arch.mmu.pgt; > > + if (!pgt) > > + return -EINVAL; > > + > > + next = __stage2_range_addr_end(addr, end, > > + EAGER_PAGE_SPLIT_CHUNK_SIZE); > > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache); > > + if (ret) > > + break; > > + } while (addr = next, addr != end); > > + > > + return ret; > > +} > > -- > Thanks, > Oliver
On Thu, Jan 26, 2023 at 10:45 AM Ricardo Koller <ricarkol@google.com> wrote: > > On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Hi Ricardo, > > > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote: > > > Split huge pages eagerly when enabling dirty logging. The goal is to > > > avoid doing it while faulting on write-protected pages, which > > > negatively impacts guest performance. > > > > > > A memslot marked for dirty logging is split in 1GB pieces at a time. > > > This is in order to release the mmu_lock and give other kernel threads > > > the opportunity to run, and also in order to allocate enough pages to > > > split a 1GB range worth of huge pages (or a single 1GB huge page). > > > Note that these page allocations can fail, so eager page splitting is > > > best-effort. This is not a correctness issue though, as huge pages > > > can still be split on write-faults. > > > > > > The benefits of eager page splitting are the same as in x86, added > > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > > > the TDP MMU when dirty logging is enabled"). For example, when running > > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > > > all of their memory after dirty logging is enabled decreased by 44% > > > from 2.58s to 1.42s. > > > > > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > > > --- > > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > > > 2 files changed, 138 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > index 35a159d131b5..6ab37209b1d1 100644 > > > --- a/arch/arm64/include/asm/kvm_host.h > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > > > /* The last vcpu id that ran on each physical CPU */ > > > int __percpu *last_vcpu_ran; > > > > > > + /* > > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > > > + * pages. It is used to allocate stage2 page tables while splitting > > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > > > + * > > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > > > + * all the available huge-page sizes, and be a multiple of all the > > > + * other ones; for example, 1GB when all the available huge-page sizes > > > + * are (1GB, 2MB, 32MB, 512MB). > > > + * > > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > > > + * example, 1GB requires the following number of PAGE_SIZE-pages: > > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > > > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). > > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > > > + * granules. > > > + * > > > + * Protected by kvm->slots_lock. 
> > > + */ > > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) > > > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL > > as the batch size? 513 pages across all page sizes is a non-negligible > > amount of memory that goes largely unused when PAGE_SIZE != 4K. > > > > Sounds good, will refine this for v2. > > > With that change it is a lot easier to correctly match the cache > > capacity to the selected page size. Additionally, we continue to have a > > single set of batching logic that we can improve later on. > > > > > + struct kvm_mmu_memory_cache split_page_cache; > > > + > > > struct kvm_arch *arch; > > > }; > > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > index 700c5774b50d..41ee330edae3 100644 > > > --- a/arch/arm64/kvm/mmu.c > > > +++ b/arch/arm64/kvm/mmu.c > > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > > > > > static unsigned long io_map_base; > > > > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > > +bool __read_mostly eager_page_split = true; > > > +module_param(eager_page_split, bool, 0644); > > > + > > > > Unless someone is really begging for it I'd prefer we not add a module > > parameter for this. > > It was mainly to match x86 and because it makes perf testing a bit > simpler. What do others think? > > > > > > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end, > > > + phys_addr_t size) > > > { > > > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > > > phys_addr_t boundary = ALIGN_DOWN(addr + size, size); > > > > > > return (boundary - 1 < end - 1) ? boundary : end; > > > } > > > > > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > > +{ > > > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL); > > > + > > > + return __stage2_range_addr_end(addr, end, size); > > > +} > > > + > > > /* > > > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise, > > > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK, > > > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr, > > > return ret; > > > } > > > > > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min) > > > +{ > > > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min; > > > +} > > > > I don't think the helper is adding too much here. > > Will try how it looks without. > > > > > > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm) > > > +{ > > > + struct kvm_mmu_memory_cache *cache; > > > + > > > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) > > > + return true; > > > + > > > + cache = &kvm->arch.mmu.split_page_cache; > > > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY); > > > +} > > > + > > > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr, > > > + phys_addr_t end) > > > +{ > > > + struct kvm_mmu_memory_cache *cache; > > > + struct kvm_pgtable *pgt; > > > + int ret; > > > + u64 next; > > > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY; > > > + > > > + lockdep_assert_held_write(&kvm->mmu_lock); > > > > Rather than having the caller acquire the lock, can you instead do it > > here? It would appear that the entire critical section is enclosed > > within this function. > > Sure. 
I will first double check things related to perf and correctness > just in case. > I'm not sure if the increase of acquire/releases makes any difference in perf. > Also, not sure if there's a correctness issue because of releasing the lock > between WP and split (I think it should be fine, but not 100% sure). > > > > > > + lockdep_assert_held(&kvm->slots_lock); > > > > This function doesn't depend on anything guarded by the slots_lock, can > > you move this to kvm_mmu_split_memory_region()? > > kvm_mmu_split_memory_region() takes a memslot. > That works in this case, eager splitting when enabling dirty logging, but won't > work in the next commit when spliting on the CLEAR ioctl. > Ahh, you meant just the "lockdep" line. Yes, that makes sense. Will do. > > > > > + cache = &kvm->arch.mmu.split_page_cache; > > > + > > > + do { > > > + if (need_topup_split_page_cache_or_resched(kvm)) { > > > + write_unlock(&kvm->mmu_lock); > > > + cond_resched(); > > > + /* Eager page splitting is best-effort. */ > > > + ret = __kvm_mmu_topup_memory_cache(cache, > > > + cache_capacity, > > > + cache_capacity); > > > + write_lock(&kvm->mmu_lock); > > > + if (ret) > > > + break; > > > + } > > > + > > > + pgt = kvm->arch.mmu.pgt; > > > + if (!pgt) > > > + return -EINVAL; > > > + > > > + next = __stage2_range_addr_end(addr, end, > > > + EAGER_PAGE_SPLIT_CHUNK_SIZE); > > > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache); > > > + if (ret) > > > + break; > > > + } while (addr = next, addr != end); > > > + > > > + return ret; > > > +} > > > > -- > > Thanks, > > Oliver
On Thu, 26 Jan 2023 18:45:43 +0000, Ricardo Koller <ricarkol@google.com> wrote: > > On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Hi Ricardo, > > > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote: > > > Split huge pages eagerly when enabling dirty logging. The goal is to > > > avoid doing it while faulting on write-protected pages, which > > > negatively impacts guest performance. > > > > > > A memslot marked for dirty logging is split in 1GB pieces at a time. > > > This is in order to release the mmu_lock and give other kernel threads > > > the opportunity to run, and also in order to allocate enough pages to > > > split a 1GB range worth of huge pages (or a single 1GB huge page). > > > Note that these page allocations can fail, so eager page splitting is > > > best-effort. This is not a correctness issue though, as huge pages > > > can still be split on write-faults. > > > > > > The benefits of eager page splitting are the same as in x86, added > > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > > > the TDP MMU when dirty logging is enabled"). For example, when running > > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > > > all of their memory after dirty logging is enabled decreased by 44% > > > from 2.58s to 1.42s. > > > > > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > > > --- > > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > > > 2 files changed, 138 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > index 35a159d131b5..6ab37209b1d1 100644 > > > --- a/arch/arm64/include/asm/kvm_host.h > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > > > /* The last vcpu id that ran on each physical CPU */ > > > int __percpu *last_vcpu_ran; > > > > > > + /* > > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > > > + * pages. It is used to allocate stage2 page tables while splitting > > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > > > + * > > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > > > + * all the available huge-page sizes, and be a multiple of all the > > > + * other ones; for example, 1GB when all the available huge-page sizes > > > + * are (1GB, 2MB, 32MB, 512MB). > > > + * > > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > > > + * example, 1GB requires the following number of PAGE_SIZE-pages: > > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > > > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). > > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > > > + * granules. > > > + * > > > + * Protected by kvm->slots_lock. 
> > > + */ > > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) > > > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL > > as the batch size? 513 pages across all page sizes is a non-negligible > > amount of memory that goes largely unused when PAGE_SIZE != 4K. > > > > Sounds good, will refine this for v2. > > > With that change it is a lot easier to correctly match the cache > > capacity to the selected page size. Additionally, we continue to have a > > single set of batching logic that we can improve later on. > > > > > + struct kvm_mmu_memory_cache split_page_cache; > > > + > > > struct kvm_arch *arch; > > > }; > > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > index 700c5774b50d..41ee330edae3 100644 > > > --- a/arch/arm64/kvm/mmu.c > > > +++ b/arch/arm64/kvm/mmu.c > > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > > > > > static unsigned long io_map_base; > > > > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > > +bool __read_mostly eager_page_split = true; > > > +module_param(eager_page_split, bool, 0644); > > > + > > > > Unless someone is really begging for it I'd prefer we not add a module > > parameter for this. > > It was mainly to match x86 and because it makes perf testing a bit > simpler. What do others think? From my PoV this is a no. If you have a flag because this is an experimental feature (like NV), then this is a kernel option, and you taint the kernel when it is set. If you have a flag because this is a modal option that makes different use of the HW which cannot be exposed to userspace (like GICv4), then this also is a kernel option. This is neither. The one thing that would convince me to make it an option is the amount of memory this thing consumes. 512+ pages is a huge amount, and I'm not overly happy about that. Why can't this be a userspace visible option, selectable on a per VM (or memslot) basis? Thanks, M.
Hi Marc, On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 26 Jan 2023 18:45:43 +0000, > Ricardo Koller <ricarkol@google.com> wrote: > > > > On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > Hi Ricardo, > > > > > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote: > > > > Split huge pages eagerly when enabling dirty logging. The goal is to > > > > avoid doing it while faulting on write-protected pages, which > > > > negatively impacts guest performance. > > > > > > > > A memslot marked for dirty logging is split in 1GB pieces at a time. > > > > This is in order to release the mmu_lock and give other kernel threads > > > > the opportunity to run, and also in order to allocate enough pages to > > > > split a 1GB range worth of huge pages (or a single 1GB huge page). > > > > Note that these page allocations can fail, so eager page splitting is > > > > best-effort. This is not a correctness issue though, as huge pages > > > > can still be split on write-faults. > > > > > > > > The benefits of eager page splitting are the same as in x86, added > > > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by > > > > the TDP MMU when dirty logging is enabled"). For example, when running > > > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, > > > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access > > > > all of their memory after dirty logging is enabled decreased by 44% > > > > from 2.58s to 1.42s. > > > > > > > > Signed-off-by: Ricardo Koller <ricarkol@google.com> > > > > --- > > > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++ > > > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++- > > > > 2 files changed, 138 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > > index 35a159d131b5..6ab37209b1d1 100644 > > > > --- a/arch/arm64/include/asm/kvm_host.h > > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu { > > > > /* The last vcpu id that ran on each physical CPU */ > > > > int __percpu *last_vcpu_ran; > > > > > > > > + /* > > > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge > > > > + * pages. It is used to allocate stage2 page tables while splitting > > > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY. > > > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both > > > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often > > > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high. > > > > + * > > > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than > > > > + * all the available huge-page sizes, and be a multiple of all the > > > > + * other ones; for example, 1GB when all the available huge-page sizes > > > > + * are (1GB, 2MB, 32MB, 512MB). > > > > + * > > > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for > > > > + * example, 1GB requires the following number of PAGE_SIZE-pages: > > > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB). > > > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)). > > > > + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB). > > > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB). 
> > > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB > > > > + * granules. > > > > + * > > > > + * Protected by kvm->slots_lock. > > > > + */ > > > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G > > > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \ > > > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \ > > > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M)) > > > > > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL > > > as the batch size? 513 pages across all page sizes is a non-negligible > > > amount of memory that goes largely unused when PAGE_SIZE != 4K. > > > > > > > Sounds good, will refine this for v2. > > > > > With that change it is a lot easier to correctly match the cache > > > capacity to the selected page size. Additionally, we continue to have a > > > single set of batching logic that we can improve later on. > > > > > > > + struct kvm_mmu_memory_cache split_page_cache; > > > > + > > > > struct kvm_arch *arch; > > > > }; > > > > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > > index 700c5774b50d..41ee330edae3 100644 > > > > --- a/arch/arm64/kvm/mmu.c > > > > +++ b/arch/arm64/kvm/mmu.c > > > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector; > > > > > > > > static unsigned long io_map_base; > > > > > > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end) > > > > +bool __read_mostly eager_page_split = true; > > > > +module_param(eager_page_split, bool, 0644); > > > > + > > > > > > Unless someone is really begging for it I'd prefer we not add a module > > > parameter for this. > > > > It was mainly to match x86 and because it makes perf testing a bit > > simpler. What do others think? > > From my PoV this is a no. > > If you have a flag because this is an experimental feature (like NV), > then this is a kernel option, and you taint the kernel when it is set. > > If you have a flag because this is a modal option that makes different > use of the HW which cannot be exposed to userspace (like GICv4), then > this also is a kernel option. > > This is neither. Ah, I see. Thanks for the explanation. > > The one thing that would convince me to make it an option is the > amount of memory this thing consumes. 512+ pages is a huge amount, and > I'm not overly happy about that. Why can't this be a userspace visible > option, selectable on a per VM (or memslot) basis? > It should be possible. I am exploring a couple of ideas that could help when the hugepages are not 1G (e.g., 2M). However, they add complexity and I'm not sure they help much. (will be using PAGE_SIZE=4K to make things simpler) This feature pre-allocates 513 pages before splitting every 1G range. For example, it converts 1G block PTEs into trees made of 513 pages. When not using this feature, the same 513 pages would be allocated, but lazily over a longer period of time. Eager-splitting pre-allocates those pages in order to split huge-pages into fully populated trees. Which is needed in order to use FEAT_BBM and skipping the expensive TLBI broadcasts. 513 is just the number of pages needed to break a 1G huge-page. We could optimize for smaller huge-pages, like 2M by splitting 1 huge-page at a time: only preallocate one 4K page at a time. The trick is how to know that we are splitting 2M huge-pages. We could either get the vma pagesize or use hints from userspace. I'm not sure that this is worth it though. 
The user will most likely want to split big ranges of memory (>1G), so optimizing for smaller huge-pages only converts the left into the right:

    alloc 1 page        |    | alloc 512 pages
    split 2M huge-page  |    | split 2M huge-page
    alloc 1 page        |    | split 2M huge-page
    split 2M huge-page  | => | split 2M huge-page
    ...
    alloc 1 page        |    | split 2M huge-page
    split 2M huge-page  |    | split 2M huge-page

Still thinking of what else to do.

Thanks!
Ricardo

> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote: > Hi Marc, > > On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote: [...] > > > > The one thing that would convince me to make it an option is the > > amount of memory this thing consumes. 512+ pages is a huge amount, and > > I'm not overly happy about that. Why can't this be a userspace visible > > option, selectable on a per VM (or memslot) basis? > > > > It should be possible. I am exploring a couple of ideas that could > help when the hugepages are not 1G (e.g., 2M). However, they add > complexity and I'm not sure they help much. > > (will be using PAGE_SIZE=4K to make things simpler) > > This feature pre-allocates 513 pages before splitting every 1G range. > For example, it converts 1G block PTEs into trees made of 513 pages. > When not using this feature, the same 513 pages would be allocated, > but lazily over a longer period of time. > > Eager-splitting pre-allocates those pages in order to split huge-pages > into fully populated trees. Which is needed in order to use FEAT_BBM > and skipping the expensive TLBI broadcasts. 513 is just the number of > pages needed to break a 1G huge-page. > > We could optimize for smaller huge-pages, like 2M by splitting 1 > huge-page at a time: only preallocate one 4K page at a time. The > trick is how to know that we are splitting 2M huge-pages. We could > either get the vma pagesize or use hints from userspace. I'm not sure > that this is worth it though. The user will most likely want to split > big ranges of memory (>1G), so optimizing for smaller huge-pages only > converts the left into the right: > > alloc 1 page | | alloc 512 pages > split 2M huge-page | | split 2M huge-page > alloc 1 page | | split 2M huge-page > split 2M huge-page | => | split 2M huge-page > ... > alloc 1 page | | split 2M huge-page > split 2M huge-page | | split 2M huge-page > > Still thinking of what else to do. I think that Marc's suggestion of having userspace configure this is sound. After all, userspace _should_ know the granularity of the backing source it chose for guest memory. We could also interpret a cache size of 0 to signal that userspace wants to disable eager page split for a VM altogether. It is entirely possible that the user will want a differing QoS between slice-of-hardware and overcommitted VMs.
On Mon, Jan 30, 2023, Oliver Upton wrote: > I think that Marc's suggestion of having userspace configure this is > sound. After all, userspace _should_ know the granularity of the backing > source it chose for guest memory. > > We could also interpret a cache size of 0 to signal that userspace wants > to disable eager page split for a VM altogether. It is entirely possible that > the user will want a differing QoS between slice-of-hardware and > overcommitted VMs. Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS guarantees for all VMs on a system are better met by enabling eager splitting across the board. There are other reasons to use module/kernel params beyond what Marc listed, e.g. to let the user opt out even when something is on by default. x86's TDP MMU has benefited greatly from downstream users being able to do A/B performance testing this way. I suspect x86's eager_page_split knob was added largely for this reason, e.g. to easily see how a specific workload is affected by eager splitting. That seems like a reasonable fit on the ARM side as well. IMO, we should try to avoid new uAPI without a proven need, especially if we're considering throwing something into memslots. AFAIK, module params, at least on the x86 side of things, aren't considered immutable ABI (or maybe it's just that we haven't modified/removed one recently that folks care about).
On Fri, 27 Jan 2023 15:45:15 +0000, Ricardo Koller <ricarkol@google.com> wrote: > > > The one thing that would convince me to make it an option is the > > amount of memory this thing consumes. 512+ pages is a huge amount, and > > I'm not overly happy about that. Why can't this be a userspace visible > > option, selectable on a per VM (or memslot) basis? > > > > It should be possible. I am exploring a couple of ideas that could > help when the hugepages are not 1G (e.g., 2M). However, they add > complexity and I'm not sure they help much. > > (will be using PAGE_SIZE=4K to make things simpler) > > This feature pre-allocates 513 pages before splitting every 1G range. > For example, it converts 1G block PTEs into trees made of 513 pages. > When not using this feature, the same 513 pages would be allocated, > but lazily over a longer period of time. This is an important difference. It avoids the upfront allocation "thermal shock", giving time to the kernel to reclaim memory from somewhere else. Doing it upfront means you *must* have 2MB+ of immediately available memory for each GB of RAM you guest uses. > > Eager-splitting pre-allocates those pages in order to split huge-pages > into fully populated trees. Which is needed in order to use FEAT_BBM > and skipping the expensive TLBI broadcasts. 513 is just the number of > pages needed to break a 1G huge-page. I understand that. But it also clear that 1GB huge pages are unlikely to be THPs, and I wonder if we should treat the two differently. Using HugeTLBFS pages is significant here. > > We could optimize for smaller huge-pages, like 2M by splitting 1 > huge-page at a time: only preallocate one 4K page at a time. The > trick is how to know that we are splitting 2M huge-pages. We could > either get the vma pagesize or use hints from userspace. I'm not sure > that this is worth it though. The user will most likely want to split > big ranges of memory (>1G), so optimizing for smaller huge-pages only > converts the left into the right: > > alloc 1 page | | alloc 512 pages > split 2M huge-page | | split 2M huge-page > alloc 1 page | | split 2M huge-page > split 2M huge-page | => | split 2M huge-page > ... > alloc 1 page | | split 2M huge-page > split 2M huge-page | | split 2M huge-page > > Still thinking of what else to do. I think the 1G case fits your own use case, but I doubt this covers the majority of the users. Most people rely on the kernel ability to use THPs, which are capped at the first level of block mapping. 2MB (and 32MB for 16kB base pages) are the most likely mappings in my experience (512MB with 64kB pages are vanishingly rare). Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it fits the model. For THPs, where everything is opportunistic and the user not involved, this is a lot more debatable. This is why I'd like this behaviour to be a buy-in, either directly (a first class userspace API) or indirectly (the provenance of the memory). Thanks, M.
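To quantify the up-front cost being discussed (simple arithmetic from the numbers already quoted in this thread, assuming 4KB pages and the worst-case 513-page capacity):

    513 pages * 4 KiB/page = 2052 KiB, i.e. just over 2 MiB pinned per 1 GiB
    chunk while it is being split (roughly 0.2% of the chunk), which is where
    the "2MB+ of immediately available memory for each GB" figure above comes
    from.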
On Mon, 30 Jan 2023 21:18:32 +0000, Oliver Upton <oliver.upton@linux.dev> wrote: > > On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote: > > Hi Marc, > > > > On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote: > > [...] > > > > > > > The one thing that would convince me to make it an option is the > > > amount of memory this thing consumes. 512+ pages is a huge amount, and > > > I'm not overly happy about that. Why can't this be a userspace visible > > > option, selectable on a per VM (or memslot) basis? > > > > > > > It should be possible. I am exploring a couple of ideas that could > > help when the hugepages are not 1G (e.g., 2M). However, they add > > complexity and I'm not sure they help much. > > > > (will be using PAGE_SIZE=4K to make things simpler) > > > > This feature pre-allocates 513 pages before splitting every 1G range. > > For example, it converts 1G block PTEs into trees made of 513 pages. > > When not using this feature, the same 513 pages would be allocated, > > but lazily over a longer period of time. > > > > Eager-splitting pre-allocates those pages in order to split huge-pages > > into fully populated trees. Which is needed in order to use FEAT_BBM > > and skipping the expensive TLBI broadcasts. 513 is just the number of > > pages needed to break a 1G huge-page. > > > > We could optimize for smaller huge-pages, like 2M by splitting 1 > > huge-page at a time: only preallocate one 4K page at a time. The > > trick is how to know that we are splitting 2M huge-pages. We could > > either get the vma pagesize or use hints from userspace. I'm not sure > > that this is worth it though. The user will most likely want to split > > big ranges of memory (>1G), so optimizing for smaller huge-pages only > > converts the left into the right: > > > > alloc 1 page | | alloc 512 pages > > split 2M huge-page | | split 2M huge-page > > alloc 1 page | | split 2M huge-page > > split 2M huge-page | => | split 2M huge-page > > ... > > alloc 1 page | | split 2M huge-page > > split 2M huge-page | | split 2M huge-page > > > > Still thinking of what else to do. > > I think that Marc's suggestion of having userspace configure this is > sound. After all, userspace _should_ know the granularity of the backing > source it chose for guest memory. Only if it is not using anonymous memory. That's the important distinction. > > We could also interpret a cache size of 0 to signal that userspace wants > to disable eager page split for a VM altogether. It is entirely possible > that the user will want a differing QoS between slice-of-hardware and > overcommitted VMs. Absolutely. The overcommited case would suffer from the upfront allocation (these systems are usually very densely packed). Thanks, M.
On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > On Mon, Jan 30, 2023, Oliver Upton wrote: > > I think that Marc's suggestion of having userspace configure this is > > sound. After all, userspace _should_ know the granularity of the backing > > source it chose for guest memory. > > > > We could also interpret a cache size of 0 to signal that userspace wants > > to disable eager page split for a VM altogether. It is entirely possible that > > the user will want a differing QoS between slice-of-hardware and > > overcommitted VMs. > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > guarantees for all VMs on a system are better met by enabling eager splitting > across the board. > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > to let the user opt out even when something is on by default. x86's TDP MMU has > benefited greatly from downstream users being able to do A/B performance testing > this way. I suspect x86's eager_page_split knob was added largely for this > reason, e.g. to easily see how a specific workload is affected by eager splitting. > That seems like a reasonable fit on the ARM side as well. There's a rather important distinction here in that we'd allow userspace to select the page split cache size, which should be correctly sized for the backing memory source. Considering the break-before-make rules of the architecture, the only way eager split is performant on arm64 is by replacing a block entry with a fully populated table hierarchy in one operation. AFAICT, you don't have this problem on x86, as the architecture generally permits a direct valid->valid transformation without an intermediate invalidation. Well, ignoring iTLB multihit :) So, the largest transformation we need to do right now is on a PUD w/ PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing that configuration option in a module parameter is presumptive that all VMs on a host use the exact same memory configuration, which doesn't feel right to me. -- Thanks, Oliver
On Tue, Jan 31, 2023, Oliver Upton wrote: > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > > On Mon, Jan 30, 2023, Oliver Upton wrote: > > > I think that Marc's suggestion of having userspace configure this is > > > sound. After all, userspace _should_ know the granularity of the backing > > > source it chose for guest memory. > > > > > > We could also interpret a cache size of 0 to signal that userspace wants > > > to disable eager page split for a VM altogether. It is entirely possible that > > > the user will want a differing QoS between slice-of-hardware and > > > overcommitted VMs. > > > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > > guarantees for all VMs on a system are better met by enabling eager splitting > > across the board. > > > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > > to let the user opt out even when something is on by default. x86's TDP MMU has > > benefited greatly from downstream users being able to do A/B performance testing > > this way. I suspect x86's eager_page_split knob was added largely for this > > reason, e.g. to easily see how a specific workload is affected by eager splitting. > > That seems like a reasonable fit on the ARM side as well. > > There's a rather important distinction here in that we'd allow userspace > to select the page split cache size, which should be correctly sized for > the backing memory source. Considering the break-before-make rules of > the architecture, the only way eager split is performant on arm64 is by > replacing a block entry with a fully populated table hierarchy in one > operation. AFAICT, you don't have this problem on x86, as the > architecture generally permits a direct valid->valid transformation > without an intermediate invalidation. Well, ignoring iTLB multihit :) > > So, the largest transformation we need to do right now is on a PUD w/ > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing > that configuration option in a module parameter is presumptive that all > VMs on a host use the exact same memory configuration, which doesn't > feel right to me. Can you elaborate on the cache size needing to be tied to the backing source? Do the issues arise if you get to a point where KVM can have PGD-sized hugepages with PAGE_SIZE=4KiB? Or do you want to let userspace optimize _now_ for PMD+4KiB?
On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > > On Mon, Jan 30, 2023, Oliver Upton wrote: > > > I think that Marc's suggestion of having userspace configure this is > > > sound. After all, userspace _should_ know the granularity of the backing > > > source it chose for guest memory. > > > > > > We could also interpret a cache size of 0 to signal that userspace wants > > > to disable eager page split for a VM altogether. It is entirely possible that > > > the user will want a differing QoS between slice-of-hardware and > > > overcommitted VMs. > > > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > > guarantees for all VMs on a system are better met by enabling eager splitting > > across the board. > > > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > > to let the user opt out even when something is on by default. x86's TDP MMU has > > benefited greatly from downstream users being able to do A/B performance testing > > this way. I suspect x86's eager_page_split knob was added largely for this > > reason, e.g. to easily see how a specific workload is affected by eager splitting. > > That seems like a reasonable fit on the ARM side as well. > > There's a rather important distinction here in that we'd allow userspace > to select the page split cache size, which should be correctly sized for > the backing memory source. Considering the break-before-make rules of > the architecture, the only way eager split is performant on arm64 is by > replacing a block entry with a fully populated table hierarchy in one > operation. I don't see how this can be true if we are able to tolerate splitting 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make operations per 1GiB region of guest physical memory. If we had a cache size of 1 and were splitting a 1GiB region, we would then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per 1GiB, why not 513? It seems more like the 513 cache size is more to optimize splitting 1GiB pages. I agree it can turn those 513 int 1, but future versions of the architecture also elide BBM requirements which is another way to optimize 1GiB pages. > AFAICT, you don't have this problem on x86, as the > architecture generally permits a direct valid->valid transformation > without an intermediate invalidation. Well, ignoring iTLB multihit :) > > So, the largest transformation we need to do right now is on a PUD w/ > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing > that configuration option in a module parameter is presumptive that all > VMs on a host use the exact same memory configuration, which doesn't > feel right to me. > > -- > Thanks, > Oliver
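For reference, a standalone sketch (illustrative only, assuming 4KB granules and a 1GiB region; the names and constants are invented for the example) of the break-before-make counts being compared in this sub-thread:

#include <stdio.h>

#define PMDS_PER_PUD	512	/* 4KB granule */

int main(void)
{
	/* 1GiB backed by 2MB blocks: one BBM per block, whatever the cache size. */
	unsigned int bbm_2m_blocks = PMDS_PER_PUD;

	/* One 1GB block split one level at a time (tiny cache):
	 * PUD -> PMDs, then each PMD -> PTEs. */
	unsigned int bbm_1g_stepdown = 1 + PMDS_PER_PUD;

	/* One 1GB block replaced by a fully populated tree (513-page cache). */
	unsigned int bbm_1g_one_shot = 1;

	printf("2MB blocks:          %u BBMs\n", bbm_2m_blocks);	/* 512 */
	printf("1GB block, stepwise: %u BBMs\n", bbm_1g_stepdown);	/* 513 */
	printf("1GB block, one shot: %u BBMs\n", bbm_1g_one_shot);	/* 1 */
	return 0;
}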
On Tue, Jan 31, 2023 at 10:02 AM David Matlack <dmatlack@google.com> wrote: > > On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > > > On Mon, Jan 30, 2023, Oliver Upton wrote: > > > > I think that Marc's suggestion of having userspace configure this is > > > > sound. After all, userspace _should_ know the granularity of the backing > > > > source it chose for guest memory. > > > > > > > > We could also interpret a cache size of 0 to signal that userspace wants > > > > to disable eager page split for a VM altogether. It is entirely possible that > > > > the user will want a differing QoS between slice-of-hardware and > > > > overcommitted VMs. > > > > > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > > > guarantees for all VMs on a system are better met by enabling eager splitting > > > across the board. > > > > > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > > > to let the user opt out even when something is on by default. x86's TDP MMU has > > > benefited greatly from downstream users being able to do A/B performance testing > > > this way. I suspect x86's eager_page_split knob was added largely for this > > > reason, e.g. to easily see how a specific workload is affected by eager splitting. > > > That seems like a reasonable fit on the ARM side as well. > > > > There's a rather important distinction here in that we'd allow userspace > > to select the page split cache size, which should be correctly sized for > > the backing memory source. Considering the break-before-make rules of > > the architecture, the only way eager split is performant on arm64 is by > > replacing a block entry with a fully populated table hierarchy in one > > operation. > > I don't see how this can be true if we are able to tolerate splitting > 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make > operations per 1GiB region of guest physical memory. > > If we had a cache size of 1 and were splitting a 1GiB region, we would > then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per > 1GiB, why not 513? > > It seems more like the 513 cache size is more to optimize splitting > 1GiB pages. That's correct. Although there's also a slight benefit for the 2M huge-pages case as the 512 huge-pages can be split without having to walk down the tree from the root every time. > I agree it can turn those 513 int 1, but future versions > of the architecture also elide BBM requirements which is another way > to optimize 1GiB pages. There's also the CPU cost needed to walk down the tree from the root 513 times (same as above). > > > > AFAICT, you don't have this problem on x86, as the > > architecture generally permits a direct valid->valid transformation > > without an intermediate invalidation. Well, ignoring iTLB multihit :) > > > > So, the largest transformation we need to do right now is on a PUD w/ > > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing > > that configuration option in a module parameter is presumptive that all > > VMs on a host use the exact same memory configuration, which doesn't > > feel right to me. > > > > -- > > Thanks, > > Oliver
On Tue, Jan 31, 2023 at 10:01:45AM -0800, David Matlack wrote: > On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > > > On Mon, Jan 30, 2023, Oliver Upton wrote: > > > > I think that Marc's suggestion of having userspace configure this is > > > > sound. After all, userspace _should_ know the granularity of the backing > > > > source it chose for guest memory. > > > > > > > > We could also interpret a cache size of 0 to signal that userspace wants > > > > to disable eager page split for a VM altogether. It is entirely possible that > > > > the user will want a differing QoS between slice-of-hardware and > > > > overcommitted VMs. > > > > > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > > > guarantees for all VMs on a system are better met by enabling eager splitting > > > across the board. > > > > > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > > > to let the user opt out even when something is on by default. x86's TDP MMU has > > > benefited greatly from downstream users being able to do A/B performance testing > > > this way. I suspect x86's eager_page_split knob was added largely for this > > > reason, e.g. to easily see how a specific workload is affected by eager splitting. > > > That seems like a reasonable fit on the ARM side as well. > > > > There's a rather important distinction here in that we'd allow userspace > > to select the page split cache size, which should be correctly sized for > > the backing memory source. Considering the break-before-make rules of > > the architecture, the only way eager split is performant on arm64 is by > > replacing a block entry with a fully populated table hierarchy in one > > operation. > > I don't see how this can be true if we are able to tolerate splitting > 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make > operations per 1GiB region of guest physical memory. 'Only' was definitely not the right word here. I fear that there is a rather limited degree of portability for any observation that we derive from a particular system. Assuming the absolute worst of hardware, TLBIs + serialization will become even more of a bounding issue as the number of affected entities on the interconnect scales. So in that vein I don't believe it is easy to glean what may or may not be tolerable. > It seems more like the 513 cache size is more to optimize splitting > 1GiB pages. I don't believe anyone is particularly jazzed about this detail, so I would be surprised if we accepted this as the default configuration. > I agree it can turn those 513 int 1, but future versions > of the architecture also elide BBM requirements which is another way > to optimize 1GiB pages. Level 2 break-before-make behavior is helpful but certainly not a panacea. Most worrying is that implementations may generate TLB conflict aborts, at which point the only remediation is to invalidate the entire affected context.
On Tue, Jan 31, 2023 at 05:54:45PM +0000, Sean Christopherson wrote: > On Tue, Jan 31, 2023, Oliver Upton wrote: > > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote: > > > On Mon, Jan 30, 2023, Oliver Upton wrote: > > > > I think that Marc's suggestion of having userspace configure this is > > > > sound. After all, userspace _should_ know the granularity of the backing > > > > source it chose for guest memory. > > > > > > > > We could also interpret a cache size of 0 to signal that userspace wants > > > > to disable eager page split for a VM altogether. It is entirely possible that > > > > the user will want a differing QoS between slice-of-hardware and > > > > overcommitted VMs. > > > > > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS > > > guarantees for all VMs on a system are better met by enabling eager splitting > > > across the board. > > > > > > There are other reasons to use module/kernel params beyond what Marc listed, e.g. > > > to let the user opt out even when something is on by default. x86's TDP MMU has > > > benefited greatly from downstream users being able to do A/B performance testing > > > this way. I suspect x86's eager_page_split knob was added largely for this > > > reason, e.g. to easily see how a specific workload is affected by eager splitting. > > > That seems like a reasonable fit on the ARM side as well. > > > > There's a rather important distinction here in that we'd allow userspace > > to select the page split cache size, which should be correctly sized for > > the backing memory source. Considering the break-before-make rules of > > the architecture, the only way eager split is performant on arm64 is by > > replacing a block entry with a fully populated table hierarchy in one > > operation. AFAICT, you don't have this problem on x86, as the > > architecture generally permits a direct valid->valid transformation > > without an intermediate invalidation. Well, ignoring iTLB multihit :) > > > > So, the largest transformation we need to do right now is on a PUD w/ > > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing > > that configuration option in a module parameter is presumptive that all > > VMs on a host use the exact same memory configuration, which doesn't > > feel right to me. > > Can you elaborate on the cache size needing to be tied to the backing source? The proposed eager split mechanism attempts to replace a block with a a fully populated page table hierarchy (i.e. mapped at PTE granularity) in order to avoid successive break-before-make invalidations. The cache size must be >= the number of pages required to build out that fully mapped page table hierarchy. > Do the issues arise if you get to a point where KVM can have PGD-sized hugepages > with PAGE_SIZE=4KiB? Those problems when splitting any hugepage larger than a PMD. It just so happens that the only configuration that supports larger mappings is 4K at the moment. If we were to take the step-down approach to eager page splitting, there will be a lot of knock-on break-before-make operations as we go PUD -> PMD -> PTE. > Or do you want to let userspace optimize _now_ for PMD+4KiB? The default cache value should probably optimize for PMD splitting and give userspace the option to scale that up for PUD or greater if it sees fit.
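To restate the sizing rule above in code: a hypothetical helper (not part of the series, assuming 4KB granules) returning the minimum split-cache capacity needed to replace the largest block with a fully populated hierarchy in a single break-before-make:

/* Hypothetical helper, not from the series: minimum number of PAGE_SIZE
 * pages the split cache must hold, assuming 4KB granules (512 entries
 * per PMD table). */
#define SKETCH_PTRS_PER_PMD	512ULL
#define SKETCH_PMD_SIZE		(2ULL << 20)	/* 2MB */
#define SKETCH_PUD_SIZE		(1ULL << 30)	/* 1GB */

unsigned long long split_cache_min_pages(unsigned long long max_block_size)
{
	if (max_block_size >= SKETCH_PUD_SIZE)
		return 1 + SKETCH_PTRS_PER_PMD;	/* PMD table + 512 PTE tables = 513 */
	if (max_block_size >= SKETCH_PMD_SIZE)
		return 1;			/* a single PTE table */
	return 0;				/* nothing to split */
}

A PMD-only default keeps the cache tiny, and only a VM whose backing source can produce PUD-sized blocks has to pay for the full 513 pages.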
Hi Marc, On Tue, Jan 31, 2023 at 2:28 AM Marc Zyngier <maz@kernel.org> wrote: > > On Fri, 27 Jan 2023 15:45:15 +0000, > Ricardo Koller <ricarkol@google.com> wrote: > > > > > The one thing that would convince me to make it an option is the > > > amount of memory this thing consumes. 512+ pages is a huge amount, and > > > I'm not overly happy about that. Why can't this be a userspace visible > > > option, selectable on a per VM (or memslot) basis? > > > > > > > It should be possible. I am exploring a couple of ideas that could > > help when the hugepages are not 1G (e.g., 2M). However, they add > > complexity and I'm not sure they help much. > > > > (will be using PAGE_SIZE=4K to make things simpler) > > > > This feature pre-allocates 513 pages before splitting every 1G range. > > For example, it converts 1G block PTEs into trees made of 513 pages. > > When not using this feature, the same 513 pages would be allocated, > > but lazily over a longer period of time. > > This is an important difference. It avoids the upfront allocation > "thermal shock", giving time to the kernel to reclaim memory from > somewhere else. Doing it upfront means you *must* have 2MB+ of > immediately available memory for each GB of RAM you guest uses. > > > > > Eager-splitting pre-allocates those pages in order to split huge-pages > > into fully populated trees. Which is needed in order to use FEAT_BBM > > and skipping the expensive TLBI broadcasts. 513 is just the number of > > pages needed to break a 1G huge-page. > > I understand that. But it also clear that 1GB huge pages are unlikely > to be THPs, and I wonder if we should treat the two differently. Using > HugeTLBFS pages is significant here. > > > > > We could optimize for smaller huge-pages, like 2M by splitting 1 > > huge-page at a time: only preallocate one 4K page at a time. The > > trick is how to know that we are splitting 2M huge-pages. We could > > either get the vma pagesize or use hints from userspace. I'm not sure > > that this is worth it though. The user will most likely want to split > > big ranges of memory (>1G), so optimizing for smaller huge-pages only > > converts the left into the right: > > > > alloc 1 page | | alloc 512 pages > > split 2M huge-page | | split 2M huge-page > > alloc 1 page | | split 2M huge-page > > split 2M huge-page | => | split 2M huge-page > > ... > > alloc 1 page | | split 2M huge-page > > split 2M huge-page | | split 2M huge-page > > > > Still thinking of what else to do. > > I think the 1G case fits your own use case, but I doubt this covers > the majority of the users. Most people rely on the kernel ability to > use THPs, which are capped at the first level of block mapping. > > 2MB (and 32MB for 16kB base pages) are the most likely mappings in my > experience (512MB with 64kB pages are vanishingly rare). > > Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it > fits the model. For THPs, where everything is opportunistic and the > user not involved, this is a lot more debatable. > > This is why I'd like this behaviour to be a buy-in, either directly (a > first class userspace API) or indirectly (the provenance of the > memory). This all makes sense, thanks for the explanation. I decided to implement something for both cases: small caches (~1 page) where the PUDs are split one PMD at a time, and bigger caches (>513) where the PUDs can be split with a single replacement. The user specifies the size of the cache via a capability, and size of 0 implies no eager splitting (the feature is off). 
Thanks,
Ricardo

>
> Thanks,
>
> 	M.
>
> --
> Without deviation from the norm, progress is not possible.
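A hypothetical userspace sketch of the buy-in Ricardo describes: the VMM opts in with a cache size when it knows the backing source, and 0 leaves eager splitting off. The capability name, its number, and the argument layout are invented for illustration; only KVM_ENABLE_CAP and struct kvm_enable_cap are existing KVM interfaces here.

#include <linux/kvm.h>
#include <sys/ioctl.h>

#define KVM_CAP_ARM_EAGER_SPLIT_CACHE	999	/* placeholder capability number */

int set_eager_split_cache(int vm_fd, unsigned long long nr_pages)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ARM_EAGER_SPLIT_CACHE,
		.args[0] = nr_pages,	/* 0: leave eager splitting disabled */
	};

	/* e.g. 513 for 1GB blocks with 4KB granules, 1 for 2MB blocks */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}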
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..6ab37209b1d1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -153,6 +153,36 @@ struct kvm_s2_mmu {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;
 
+	/*
+	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
+	 * pages. It is used to allocate stage2 page tables while splitting
+	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
+	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
+	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
+	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
+	 *
+	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
+	 * all the available huge-page sizes, and be a multiple of all the
+	 * other ones; for example, 1GB when all the available huge-page sizes
+	 * are (1GB, 2MB, 32MB, 512MB).
+	 *
+	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
+	 * example, 1GB requires the following number of PAGE_SIZE-pages:
+	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
+	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
+	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
+	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
+	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
+	 * granules.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+#define EAGER_PAGE_SPLIT_CHUNK_SIZE		SZ_1G
+#define EAGER_PAGE_SPLIT_CACHE_CAPACITY				\
+	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +	\
+	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
+	struct kvm_mmu_memory_cache split_page_cache;
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 700c5774b50d..41ee330edae3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
 
 static unsigned long io_map_base;
 
-static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+bool __read_mostly eager_page_split = true;
+module_param(eager_page_split, bool, 0644);
+
+static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
+					   phys_addr_t size)
 {
-	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
 	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
 
 	return (boundary - 1 < end - 1) ? boundary : end;
 }
 
+static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+
+	return __stage2_range_addr_end(addr, end, size);
+}
+
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
  * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
@@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 	return ret;
 }
 
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
+{
+	struct kvm_mmu_memory_cache *cache;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	cache = &kvm->arch.mmu.split_page_cache;
+	return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
+}
+
+static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+				    phys_addr_t end)
+{
+	struct kvm_mmu_memory_cache *cache;
+	struct kvm_pgtable *pgt;
+	int ret;
+	u64 next;
+	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	cache = &kvm->arch.mmu.split_page_cache;
+
+	do {
+		if (need_topup_split_page_cache_or_resched(kvm)) {
+			write_unlock(&kvm->mmu_lock);
+			cond_resched();
+			/* Eager page splitting is best-effort. */
+			ret = __kvm_mmu_topup_memory_cache(cache,
+							   cache_capacity,
+							   cache_capacity);
+			write_lock(&kvm->mmu_lock);
+			if (ret)
+				break;
+		}
+
+		pgt = kvm->arch.mmu.pgt;
+		if (!pgt)
+			return -EINVAL;
+
+		next = __stage2_range_addr_end(addr, end,
+					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
+		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+		if (ret)
+			break;
+	} while (addr = next, addr != end);
+
+	return ret;
+}
+
 #define stage2_apply_range_resched(kvm, addr, end, fn)			\
 	stage2_apply_range(kvm, addr, end, fn, true)
 
@@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
@@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 void kvm_uninit_stage2_mmu(struct kvm *kvm)
 {
 	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 }
 
 static void stage2_unmap_memslot(struct kvm *kvm,
@@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
+/**
+ * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
+ *				   pages for memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to split
+ *
+ * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+	phys_addr_t start, end;
+
+	start = memslot->base_gfn << PAGE_SHIFT;
+	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	write_lock(&kvm->mmu_lock);
+	kvm_mmu_split_huge_pages(kvm, start, end);
+	write_unlock(&kvm->mmu_lock);
+}
+
 /*
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * dirty pages.
@@ -1783,7 +1877,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		if (READ_ONCE(eager_page_split))
+			kvm_mmu_split_memory_region(kvm, new->id);
+
 		kvm_mmu_wp_memory_region(kvm, new->id);
+	} else {
+		/*
+		 * Free any leftovers from the eager page splitting cache. Do
+		 * this when deleting, moving, disabling dirty logging, or
+		 * creating the memslot (a nop). Doing it for deletes makes
+		 * sure we don't leak memory, and there's no need to keep the
+		 * cache around for any of the other cases.
+		 */
+		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 	}
 }
Split huge pages eagerly when enabling dirty logging. The goal is to avoid doing it while faulting on write-protected pages, which negatively impacts guest performance.

A memslot marked for dirty logging is split in 1GB pieces at a time. This is in order to release the mmu_lock and give other kernel threads the opportunity to run, and also in order to allocate enough pages to split a 1GB range worth of huge pages (or a single 1GB huge page). Note that these page allocations can fail, so eager page splitting is best-effort. This is not a correctness issue though, as huge pages can still be split on write-faults.

The benefits of eager page splitting are the same as in x86, added with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled"). For example, when running dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access all of their memory after dirty logging is enabled decreased by 44% from 2.58s to 1.42s.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  30 ++++++++
 arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
 2 files changed, 138 insertions(+), 2 deletions(-)