Message ID | 20240402213656.3068504-1-dmatlack@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG |
On 2024/4/3 5:36 AM, David Matlack wrote: > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid > blocking other threads (e.g. vCPUs taking page faults) for too long. > > Specifically, change kvm_clear_dirty_log_protect() to acquire/release > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(), > rather than around the entire for loop. This ensures that KVM will only > hold mmu_lock for the time it takes the architecture-specific code to > process up to 64 pages, rather than holding mmu_lock for log->num_pages, > which is controllable by userspace. This also avoids holding mmu_lock > when processing parts of the dirty_bitmap that are zero (i.e. when there > is nothing to clear). > > Moving the acquire/release points for mmu_lock should be safe since > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap > is already accessed with atomic_long_fetch_andnot(). And at least on x86 > holding mmu_lock doesn't even serialize access to the memslot dirty > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding > mmu_lock. > > This change eliminates dips in guest performance during live migration > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which Frequently dropping and reacquiring mmu_lock will cause the userspace migration process issuing CLEAR ioctls to contend with the 160 vCPUs, and migration may become slower. In theory the priority of the userspace migration thread should be higher than that of the vCPU threads. Dropping and reacquiring mmu_lock every 64 pages may also be a bit too small a granularity; in general one huge page size would be more natural. However, that should be decided by the framework maintainers :) Regards Bibo Mao > would also reduce contention on mmu_lock, but doing so will increase the > rate of remote TLB flushing. And there's really no reason to punt this > problem to userspace since KVM can just drop and reacquire mmu_lock more > frequently.
> > Cc: Marc Zyngier <maz@kernel.org> > Cc: Oliver Upton <oliver.upton@linux.dev> > Cc: Tianrui Zhao <zhaotianrui@loongson.cn> > Cc: Bibo Mao <maobibo@loongson.cn> > Cc: Huacai Chen <chenhuacai@kernel.org> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Anup Patel <anup@brainfault.org> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Janosch Frank <frankja@linux.ibm.com> > Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> > Cc: Sean Christopherson <seanjc@google.com> > Signed-off-by: David Matlack <dmatlack@google.com> > --- > v2: > - Rebase onto kvm/queue [Marc] > > v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/ > > virt/kvm/kvm_main.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index fb49c2a60200..0a8b25a52c15 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) > return -EFAULT; > > - KVM_MMU_LOCK(kvm); > for (offset = log->first_page, i = offset / BITS_PER_LONG, > n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--; > i++, offset += BITS_PER_LONG) { > @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > */ > if (mask) { > flush = true; > + KVM_MMU_LOCK(kvm); > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, > offset, mask); > + KVM_MMU_UNLOCK(kvm); > } > } > - KVM_MMU_UNLOCK(kvm); > > if (flush) > kvm_flush_remote_tlbs_memslot(kvm, memslot); > > base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402 >
On Wed, Apr 03, 2024, maobibo wrote: > > On 2024/4/3 上午5:36, David Matlack wrote: > > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid > > blocking other threads (e.g. vCPUs taking page faults) for too long. > > > > Specifically, change kvm_clear_dirty_log_protect() to acquire/release > > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(), > > rather than around the entire for loop. This ensures that KVM will only > > hold mmu_lock for the time it takes the architecture-specific code to > > process up to 64 pages, rather than holding mmu_lock for log->num_pages, > > which is controllable by userspace. This also avoids holding mmu_lock > > when processing parts of the dirty_bitmap that are zero (i.e. when there > > is nothing to clear). > > > > Moving the acquire/release points for mmu_lock should be safe since > > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap > > is already accessed with atomic_long_fetch_andnot(). And at least on x86 > > holding mmu_lock doesn't even serialize access to the memslot dirty > > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding > > mmu_lock. > > > > This change eliminates dips in guest performance during live migration > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which > Frequently drop/reacquire mmu_lock will cause userspace migration process > issuing CLEAR ioctls to contend with 160 vCPU, migration speed maybe become > slower. Only if vCPUs actually acquire mmu_lock. E.g. on x86, KVM fixes/handles faults due to write-protection for dirty logging without acquiring mmu_lock. So for x86, taking faults that need to acquire mmu_lock while dirty logging should be fairly uncommon (and if vCPUs are taking lots of faults, performance is likely going to be bad no matter what). > In theory priority of userspace migration thread should be higher than vcpu > thread. That's very debatable. And it's not an apples-to-apples comparison, because CLEAR_DIRTY_LOG can hold mmu_lock for a very long time, probably orders of magnitude longer than a vCPU will hold mmu_lock when handling a page fault. And on x86 and ARM, page faults can be resolved while hold mmu_lock for read. As a result, holding mmu_lock in CLEAR_DIRTY_LOG (for write) is effectively more costly than holding it in vCPUs. > Drop and reacquire mmu_lock with 64-pages may be a little too smaller, > in generic it is one huge page size. However it should be decided by > framework maintainer:) We could tweak the batching, but my very strong preference would be to do that only as a last resort, i.e. if and only if some magic batch number provides waaay better performance in all scenarios. Maintaining code with arbitrary magic numbers is a pain, e.g. KVM x86's MMU has arbitrary batching in kvm_zap_obsolete_pages(), and the justification for the number is extremely handwavy (paraphasing the changelog): Zap at least 10 shadow pages before releasing mmu_lock to reduce the overhead associated with re-acquiring the lock. Note: "10" is an arbitrary number, speculated to be high enough so that a vCPU isn't stuck zapping obsolete pages for an extended period, but small enough so that other vCPUs aren't starved waiting for mmu_lock. I.e. we're stuck with someone's best guess from years ago without any real idea if 10 is a good number, let alone optimal. 
Obviously that doesn't mean 64 pages is optimal, but it's at least not arbitrary; it's just an artifact of how KVM processes the bitmap. To be clear, I'm not opposed to per-arch behavior, nor do I think x86 should dictate how all other architectures should behave. But I would strongly prefer to avoid per-arch behavior unless it's actually necessary (doubly so for batching). In other words, if we do need per-arch behavior, e.g. because aggressively dropping mmu_lock causes performance issues on other architectures that need to take mmu_lock for write to handle faults, I would prefer to have the arch knob control whether the lock+unlock is outside versus inside the loop, not control an arbitrary batch size.
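A condensed, illustrative sketch of the bitmap walk referenced above, simplified from the kvm_clear_dirty_log_protect() loop in the patch (the flush tracking is omitted and the page-size arithmetic in the comments assumes 4KiB base pages):

/*
 * Simplified sketch of the CLEAR_DIRTY_LOG bitmap walk. Each unsigned
 * long of the dirty bitmap covers BITS_PER_LONG (64 on 64-bit hosts)
 * guest pages, so taking mmu_lock per non-zero word batches the arch
 * callback at up to 64 base pages per acquisition. With 4KiB pages
 * that is 256KiB of guest memory per lock hold; a 2MiB huge page spans
 * 512 base pages, i.e. 8 consecutive bitmap words.
 */
for (offset = log->first_page, i = offset / BITS_PER_LONG,
     n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
     i++, offset += BITS_PER_LONG) {
	unsigned long mask = *dirty_bitmap_buffer++;

	if (!mask)
		continue;	/* nothing dirty in this word, no lock taken */

	/* Clear the requested bits atomically vs. vCPUs dirtying pages. */
	mask &= atomic_long_fetch_andnot(mask, (atomic_long_t *)&dirty_bitmap[i]);
	if (!mask)
		continue;

	KVM_MMU_LOCK(kvm);
	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
	KVM_MMU_UNLOCK(kvm);
}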
On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: > > > > On 2024/4/3 上午5:36, David Matlack wrote: > > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid > > blocking other threads (e.g. vCPUs taking page faults) for too long. > > > > Specifically, change kvm_clear_dirty_log_protect() to acquire/release > > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(), > > rather than around the entire for loop. This ensures that KVM will only > > hold mmu_lock for the time it takes the architecture-specific code to > > process up to 64 pages, rather than holding mmu_lock for log->num_pages, > > which is controllable by userspace. This also avoids holding mmu_lock > > when processing parts of the dirty_bitmap that are zero (i.e. when there > > is nothing to clear). > > > > Moving the acquire/release points for mmu_lock should be safe since > > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap > > is already accessed with atomic_long_fetch_andnot(). And at least on x86 > > holding mmu_lock doesn't even serialize access to the memslot dirty > > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding > > mmu_lock. > > > > This change eliminates dips in guest performance during live migration > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which > Frequently drop/reacquire mmu_lock will cause userspace migration > process issuing CLEAR ioctls to contend with 160 vCPU, migration speed > maybe become slower. In practice we have not found this to be the case. With this patch applied we see a significant improvement in guest workload throughput while userspace is issuing CLEAR ioctls without any change to the overall migration duration. For example, here[1] is a graph showing the effect of this patch on a160 vCPU VM being Live Migrated while running a MySQL workload. Y-Axis is transactions, blue line is without the patch, red line is with the patch. [1] https://drive.google.com/file/d/1zSKtc6wOQqfQrAQ4O9xlFmuyJ2-Iah0k/view > In theory priority of userspace migration thread > should be higher than vcpu thread. I don't think it's black and white. If prioritizing migration threads causes vCPU starvation (which is the type of issue this patch is fixing), then that's probably not the right trade-off. IMO the ultimate goal of live migration is that it's completely transparent to the guest workload, i.e. we should aim to minimize guest disruption as much as possible. A change that migration go 2x as fast but reduces vCPU performance by 10x during the migration is probably not worth it. And the reverse is also true, a change that increases vCPU performance by 10% during migration but makes the migration take 10x longer is also probably not worth it. In the case of this patch, there doesn't seem to be a trade-off. We see an improvement to vCPU performance without any regression in migration duration or other metrics. > > Drop and reacquire mmu_lock with 64-pages may be a little too smaller, > in generic it is one huge page size. However it should be decided by > framework maintainer:) > > Regards > Bibo Mao > > would also reduce contention on mmu_lock, but doing so will increase the > > rate of remote TLB flushing. And there's really no reason to punt this > > problem to userspace since KVM can just drop and reacquire mmu_lock more > > frequently. 
> > > > Cc: Marc Zyngier <maz@kernel.org> > > Cc: Oliver Upton <oliver.upton@linux.dev> > > Cc: Tianrui Zhao <zhaotianrui@loongson.cn> > > Cc: Bibo Mao <maobibo@loongson.cn> > > Cc: Huacai Chen <chenhuacai@kernel.org> > > Cc: Michael Ellerman <mpe@ellerman.id.au> > > Cc: Anup Patel <anup@brainfault.org> > > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > > Cc: Janosch Frank <frankja@linux.ibm.com> > > Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> > > Cc: Sean Christopherson <seanjc@google.com> > > Signed-off-by: David Matlack <dmatlack@google.com> > > --- > > v2: > > - Rebase onto kvm/queue [Marc] > > > > v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/ > > > > virt/kvm/kvm_main.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index fb49c2a60200..0a8b25a52c15 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > > if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) > > return -EFAULT; > > > > - KVM_MMU_LOCK(kvm); > > for (offset = log->first_page, i = offset / BITS_PER_LONG, > > n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--; > > i++, offset += BITS_PER_LONG) { > > @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > > */ > > if (mask) { > > flush = true; > > + KVM_MMU_LOCK(kvm); > > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, > > offset, mask); > > + KVM_MMU_UNLOCK(kvm); > > } > > } > > - KVM_MMU_UNLOCK(kvm); > > > > if (flush) > > kvm_flush_remote_tlbs_memslot(kvm, memslot); > > > > base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402 > > >
On Thu, Apr 04, 2024, David Matlack wrote: > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: > > > This change eliminates dips in guest performance during live migration > > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which > > Frequently drop/reacquire mmu_lock will cause userspace migration > > process issuing CLEAR ioctls to contend with 160 vCPU, migration speed > > maybe become slower. > > In practice we have not found this to be the case. With this patch > applied we see a significant improvement in guest workload throughput > while userspace is issuing CLEAR ioctls without any change to the > overall migration duration. ... > In the case of this patch, there doesn't seem to be a trade-off. We > see an improvement to vCPU performance without any regression in > migration duration or other metrics. For x86. We need to keep in mind that not all architectures have x86's optimization around dirty logging faults, or around faults in general. E.g. LoongArch's (which I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock. And if the fault can't be handled in the fast path, KVM will actually acquire mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after the mmu_seq and fault-in pfn stuff). So for x86, I think we can comfortably state that this change is a net positive for all scenarios. But for other architectures, that might not be the case. I'm not saying this isn't a good change for other architectures, just that we don't have relevant data to really know for sure. Absent performance data for other architectures, which is likely going to be difficult/slow to get, it might make sense to have this be opt-in to start. We could even do it with minimal #ifdeffery, e.g. something like the below would allow x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked() (I assume we want to give kvm_get_dirty_log_protect() similar treatment?). I don't love the idea of adding more arch specific MMU behavior (going the wrong direction), but it doesn't seem like an unreasonable approach in this case. 
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 86d267db87bb..5eb1ce83f29d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
 		return;
 
-	KVM_MMU_LOCK(kvm);
+	KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
-	KVM_MMU_UNLOCK(kvm);
+	KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 }
 
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..74ae844e4ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
-		KVM_MMU_LOCK(kvm);
+		KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
@@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
-		KVM_MMU_UNLOCK(kvm);
+		KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 	}
 
 	if (flush)
@@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	KVM_MMU_LOCK(kvm);
+	KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 	     n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 							offset, mask);
 		}
 	}
-	KVM_MMU_UNLOCK(kvm);
+	KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 
 	if (flush)
 		kvm_flush_remote_tlbs_memslot(kvm, memslot);
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index ecefc7ec51af..39d8b809c303 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,6 +20,11 @@
 #define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
+#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
+#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT	KVM_MMU_LOCK
+#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT	KVM_MMU_UNLOCK
+#endif
+
 kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 		     bool *async, bool write_fault, bool *writable);
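To make the intent of the knob concrete, a purely hypothetical sketch of how an architecture could opt in. Where the #defines would live (somewhere visible to virt/kvm/kvm_mm.h) and the body of the arch hook are assumptions for illustration, not part of the proposal:

/*
 * Hypothetical opt-in: the arch defines the dirty-log lock variants as
 * no-ops so the generic loop takes no lock at all, and instead acquires
 * mmu_lock itself, per 64-page mask, inside its arch hook.
 */
#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm)		do { } while (0)
#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm)	do { } while (0)

void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
					     struct kvm_memory_slot *slot,
					     gfn_t gfn_offset, unsigned long mask)
{
	write_lock(&kvm->mmu_lock);
	/* ... existing per-arch write-protection / dirty clearing ... */
	write_unlock(&kvm->mmu_lock);
}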
On Thu, Apr 4, 2024 at 10:10 AM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Apr 04, 2024, David Matlack wrote: > > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: > > > > This change eliminates dips in guest performance during live migration > > > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > > > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which > > > Frequently drop/reacquire mmu_lock will cause userspace migration > > > process issuing CLEAR ioctls to contend with 160 vCPU, migration speed > > > maybe become slower. > > > > In practice we have not found this to be the case. With this patch > > applied we see a significant improvement in guest workload throughput > > while userspace is issuing CLEAR ioctls without any change to the > > overall migration duration. > > ... > > > In the case of this patch, there doesn't seem to be a trade-off. We > > see an improvement to vCPU performance without any regression in > > migration duration or other metrics. > > For x86. We need to keep in mind that not all architectures have x86's optimization > around dirty logging faults, or around faults in general. E.g. LoongArch's (which > I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock. > And if the fault can't be handled in the fast path, KVM will actually acquire > mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after > the mmu_seq and fault-in pfn stuff). > > So for x86, I think we can comfortably state that this change is a net positive > for all scenarios. But for other architectures, that might not be the case. > I'm not saying this isn't a good change for other architectures, just that we > don't have relevant data to really know for sure. I do not have data for other architectures, but may be able to get data on ARM in the next few weeks. I believe we saw similar benefits when testing on ARM. > > Absent performance data for other architectures, which is likely going to be > difficult/slow to get, it might make sense to have this be opt-in to start. We > could even do it with minimal #ifdeffery, e.g. something like the below would allow > x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked() > (I assume we want to give kvm_get_dirty_log_protect() similar treatment?). I don't see any reason not to give kvm_get_dirty_log_protect() the same treatment, but it's less important since kvm_get_dirty_log_protect() does not take the mmu_lock at all when manual-protect is enabled. > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > direction), but it doesn't seem like an unreasonable approach in this case. I wonder if this is being overly cautious. I would expect only more benefit on architectures that more aggressively take the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU threads, the more this patch will help reduce vCPU starvation during CLEAR_DIRTY_LOG. Hm, perhaps testing with ept=N (which will use the write-lock for even dirty logging faults) would be a way to increase confidence in the effect on other architectures? 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > index 86d267db87bb..5eb1ce83f29d 100644 > --- a/virt/kvm/dirty_ring.c > +++ b/virt/kvm/dirty_ring.c > @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask) > if (!memslot || (offset + __fls(mask)) >= memslot->npages) > return; > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask); > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > } > > int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size) > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index d1fd9cb5d037..74ae844e4ed0 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log) > dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot); > memset(dirty_bitmap_buffer, 0, n); > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > for (i = 0; i < n / sizeof(long); i++) { > unsigned long mask; > gfn_t offset; > @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log) > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, > offset, mask); > } > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > } > > if (flush) > @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) > return -EFAULT; > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > for (offset = log->first_page, i = offset / BITS_PER_LONG, > n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--; > i++, offset += BITS_PER_LONG) { > @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > offset, mask); > } > } > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > > if (flush) > kvm_flush_remote_tlbs_memslot(kvm, memslot); > diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h > index ecefc7ec51af..39d8b809c303 100644 > --- a/virt/kvm/kvm_mm.h > +++ b/virt/kvm/kvm_mm.h > @@ -20,6 +20,11 @@ > #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) > #endif /* KVM_HAVE_MMU_RWLOCK */ > > +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT > +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT KVM_MMU_LOCK > +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT KVM_MMU_UNLOCK > +#endif > + > kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, > bool *async, bool write_fault, bool *writable); > >
On Thu, Apr 04, 2024, David Matlack wrote: > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > > direction), but it doesn't seem like an unreasonable approach in this case. > > I wonder if this is being overly cautious. Probably. "Lazy" is another word for it ;-) > I would expect only more benefit on architectures that more aggressively take > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU > threads, the more this patch will help reduce vCPU starvation during > CLEAR_DIRTY_LOG. > > Hm, perhaps testing with ept=N (which will use the write-lock for even > dirty logging faults) would be a way to increase confidence in the > effect on other architectures? Turning off the TDP MMU would be more representative, just manually disable the fast-path, e.g. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 992e651540e8..532c24911f39 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3371,7 +3371,7 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault) * Note, instruction fetches and writes are mutually exclusive, ignore * the "exec" flag. */ - return fault->write; + return false;//fault->write; } /*
On 2024/4/5 上午12:29, David Matlack wrote: > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: >> >> >> >> On 2024/4/3 上午5:36, David Matlack wrote: >>> Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid >>> blocking other threads (e.g. vCPUs taking page faults) for too long. >>> >>> Specifically, change kvm_clear_dirty_log_protect() to acquire/release >>> mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(), >>> rather than around the entire for loop. This ensures that KVM will only >>> hold mmu_lock for the time it takes the architecture-specific code to >>> process up to 64 pages, rather than holding mmu_lock for log->num_pages, >>> which is controllable by userspace. This also avoids holding mmu_lock >>> when processing parts of the dirty_bitmap that are zero (i.e. when there >>> is nothing to clear). >>> >>> Moving the acquire/release points for mmu_lock should be safe since >>> dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap >>> is already accessed with atomic_long_fetch_andnot(). And at least on x86 >>> holding mmu_lock doesn't even serialize access to the memslot dirty >>> bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding >>> mmu_lock. >>> >>> This change eliminates dips in guest performance during live migration >>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with >>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which >> Frequently drop/reacquire mmu_lock will cause userspace migration >> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed >> maybe become slower. > > In practice we have not found this to be the case. With this patch > applied we see a significant improvement in guest workload throughput > while userspace is issuing CLEAR ioctls without any change to the > overall migration duration. > > For example, here[1] is a graph showing the effect of this patch on > a160 vCPU VM being Live Migrated while running a MySQL workload. > Y-Axis is transactions, blue line is without the patch, red line is > with the patch. > > [1] https://drive.google.com/file/d/1zSKtc6wOQqfQrAQ4O9xlFmuyJ2-Iah0k/view It is great that there is obvious improvement with the workload. > >> In theory priority of userspace migration thread >> should be higher than vcpu thread. > > I don't think it's black and white. If prioritizing migration threads > causes vCPU starvation (which is the type of issue this patch is > fixing), then that's probably not the right trade-off. IMO the > ultimate goal of live migration is that it's completely transparent to > the guest workload, i.e. we should aim to minimize guest disruption as > much as possible. A change that migration go 2x as fast but reduces > vCPU performance by 10x during the migration is probably not worth it. > And the reverse is also true, a change that increases vCPU performance > by 10% during migration but makes the migration take 10x longer is > also probably not worth it. I am not migration expert, IIRC there is migration timeout value in cloud stack. If migration timeout reached, migration will be abort. > > In the case of this patch, there doesn't seem to be a trade-off. We > see an improvement to vCPU performance without any regression in > migration duration or other metrics. I will be ok about this patch if there is no any regression in migration. If x86 migration takes more time with the patch on x86, I suggest migration experts give some comments. 
Regards Bibo Mao > >> >> Drop and reacquire mmu_lock with 64-pages may be a little too smaller, >> in generic it is one huge page size. However it should be decided by >> framework maintainer:) >> >> Regards >> Bibo Mao >>> would also reduce contention on mmu_lock, but doing so will increase the >>> rate of remote TLB flushing. And there's really no reason to punt this >>> problem to userspace since KVM can just drop and reacquire mmu_lock more >>> frequently. >>> >>> Cc: Marc Zyngier <maz@kernel.org> >>> Cc: Oliver Upton <oliver.upton@linux.dev> >>> Cc: Tianrui Zhao <zhaotianrui@loongson.cn> >>> Cc: Bibo Mao <maobibo@loongson.cn> >>> Cc: Huacai Chen <chenhuacai@kernel.org> >>> Cc: Michael Ellerman <mpe@ellerman.id.au> >>> Cc: Anup Patel <anup@brainfault.org> >>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> >>> Cc: Janosch Frank <frankja@linux.ibm.com> >>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> >>> Cc: Sean Christopherson <seanjc@google.com> >>> Signed-off-by: David Matlack <dmatlack@google.com> >>> --- >>> v2: >>> - Rebase onto kvm/queue [Marc] >>> >>> v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/ >>> >>> virt/kvm/kvm_main.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >>> index fb49c2a60200..0a8b25a52c15 100644 >>> --- a/virt/kvm/kvm_main.c >>> +++ b/virt/kvm/kvm_main.c >>> @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, >>> if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) >>> return -EFAULT; >>> >>> - KVM_MMU_LOCK(kvm); >>> for (offset = log->first_page, i = offset / BITS_PER_LONG, >>> n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--; >>> i++, offset += BITS_PER_LONG) { >>> @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, >>> */ >>> if (mask) { >>> flush = true; >>> + KVM_MMU_LOCK(kvm); >>> kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, >>> offset, mask); >>> + KVM_MMU_UNLOCK(kvm); >>> } >>> } >>> - KVM_MMU_UNLOCK(kvm); >>> >>> if (flush) >>> kvm_flush_remote_tlbs_memslot(kvm, memslot); >>> >>> base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402 >>> >>
On 2024/4/5 上午1:10, Sean Christopherson wrote: > On Thu, Apr 04, 2024, David Matlack wrote: >> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: >>>> This change eliminates dips in guest performance during live migration >>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with >>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which >>> Frequently drop/reacquire mmu_lock will cause userspace migration >>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed >>> maybe become slower. >> >> In practice we have not found this to be the case. With this patch >> applied we see a significant improvement in guest workload throughput >> while userspace is issuing CLEAR ioctls without any change to the >> overall migration duration. > > ... > >> In the case of this patch, there doesn't seem to be a trade-off. We >> see an improvement to vCPU performance without any regression in >> migration duration or other metrics. > > For x86. We need to keep in mind that not all architectures have x86's optimization > around dirty logging faults, or around faults in general. E.g. LoongArch's (which > I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock. > And if the fault can't be handled in the fast path, KVM will actually acquire > mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after > the mmu_seq and fault-in pfn stuff). yes, there is no lock in function fast_page_fault on x86, however mmu spinlock is held on fast path on LoongArch. I do not notice this, originally I think that there is read_lock on x86 fast path for pte modification, write lock about page table modification. > > So for x86, I think we can comfortably state that this change is a net positive > for all scenarios. But for other architectures, that might not be the case. > I'm not saying this isn't a good change for other architectures, just that we > don't have relevant data to really know for sure. From the code there is contention between migration assistant thread and vcpu thread in slow path where huge page need be split if memslot is dirty log enabled, however there is no contention between migration assistant thread and vcpu fault fast path. If it is right, negative effect is small on x86. > > Absent performance data for other architectures, which is likely going to be > difficult/slow to get, it might make sense to have this be opt-in to start. We > could even do it with minimal #ifdeffery, e.g. something like the below would allow > x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked() > (I assume we want to give kvm_get_dirty_log_protect() similar treatment?). > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > direction), but it doesn't seem like an unreasonable approach in this case. No special modification for this, it is a little strange. LoongArch page fault fast path can improve later. 
Regards Bibo Mao > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c > index 86d267db87bb..5eb1ce83f29d 100644 > --- a/virt/kvm/dirty_ring.c > +++ b/virt/kvm/dirty_ring.c > @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask) > if (!memslot || (offset + __fls(mask)) >= memslot->npages) > return; > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask); > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > } > > int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size) > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index d1fd9cb5d037..74ae844e4ed0 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log) > dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot); > memset(dirty_bitmap_buffer, 0, n); > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > for (i = 0; i < n / sizeof(long); i++) { > unsigned long mask; > gfn_t offset; > @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log) > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, > offset, mask); > } > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > } > > if (flush) > @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) > return -EFAULT; > > - KVM_MMU_LOCK(kvm); > + KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm); > for (offset = log->first_page, i = offset / BITS_PER_LONG, > n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--; > i++, offset += BITS_PER_LONG) { > @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm, > offset, mask); > } > } > - KVM_MMU_UNLOCK(kvm); > + KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm); > > if (flush) > kvm_flush_remote_tlbs_memslot(kvm, memslot); > diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h > index ecefc7ec51af..39d8b809c303 100644 > --- a/virt/kvm/kvm_mm.h > +++ b/virt/kvm/kvm_mm.h > @@ -20,6 +20,11 @@ > #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) > #endif /* KVM_HAVE_MMU_RWLOCK */ > > +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT > +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT KVM_MMU_LOCK > +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT KVM_MMU_UNLOCK > +#endif > + > kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, > bool *async, bool write_fault, bool *writable); > >
On Sat, Apr 6, 2024 at 7:26 PM maobibo <maobibo@loongson.cn> wrote: > > > > On 2024/4/5 上午1:10, Sean Christopherson wrote: > > On Thu, Apr 04, 2024, David Matlack wrote: > >> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: > >>>> This change eliminates dips in guest performance during live migration > >>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with > >>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which > >>> Frequently drop/reacquire mmu_lock will cause userspace migration > >>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed > >>> maybe become slower. > >> > >> In practice we have not found this to be the case. With this patch > >> applied we see a significant improvement in guest workload throughput > >> while userspace is issuing CLEAR ioctls without any change to the > >> overall migration duration. > > > > ... > > > >> In the case of this patch, there doesn't seem to be a trade-off. We > >> see an improvement to vCPU performance without any regression in > >> migration duration or other metrics. > > > > For x86. We need to keep in mind that not all architectures have x86's optimization > > around dirty logging faults, or around faults in general. E.g. LoongArch's (which > > I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock. > > And if the fault can't be handled in the fast path, KVM will actually acquire > > mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after > > the mmu_seq and fault-in pfn stuff). > yes, there is no lock in function fast_page_fault on x86, however mmu > spinlock is held on fast path on LoongArch. I do not notice this, > originally I think that there is read_lock on x86 fast path for pte > modification, write lock about page table modification. Most[*] vCPU faults on KVM/x86 are handled as follows: - vCPU write-protection and access-tracking faults are handled locklessly (fast_page_fault()). - All other vCPU faults are handled under read_lock(&kvm->mmu_lock). [*] The locking is different when nested virtualization is in use, TDP (i.e. stage-2 hardware) is disabled, and/or kvm.tdp_mmu=N. > > > > So for x86, I think we can comfortably state that this change is a net positive > > for all scenarios. But for other architectures, that might not be the case. > > I'm not saying this isn't a good change for other architectures, just that we > > don't have relevant data to really know for sure. > From the code there is contention between migration assistant thread > and vcpu thread in slow path where huge page need be split if memslot is > dirty log enabled, however there is no contention between migration > assistant thread and vcpu fault fast path. If it is right, negative > effect is small on x86. Right there is no contention between CLEAR_DIRTY_LOG and vCPU write-protection faults on KVM/x86. KVM/arm64 does write-protection faults under the read-lock. KVM/x86 and KVM/arm64 also both have eager page splitting, where the huge pages are split during CLEAR_DIRTY_LOG, so there are also no vCPU faults to split huge pages. > > > > > Absent performance data for other architectures, which is likely going to be > > difficult/slow to get, it might make sense to have this be opt-in to start. We > > could even do it with minimal #ifdeffery, e.g. 
something like the below would allow > > x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked() > > (I assume we want to give kvm_get_dirty_log_protect() similar treatment?). > > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > > direction), but it doesn't seem like an unreasonable approach in this case. > No special modification for this, it is a little strange. LoongArch page > fault fast path can improve later. Sorry, I don't follow exactly what you mean here. Are you saying Sean's patch is not required for LoongArch and, instead, LoongArch should/will avoid acquiring the mmu_lock when handling write-protection faults?
On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Apr 04, 2024, David Matlack wrote: > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > > > direction), but it doesn't seem like an unreasonable approach in this case. > > > > I wonder if this is being overly cautious. > > Probably. "Lazy" is another word for it ;-) > > > I would expect only more benefit on architectures that more aggressively take > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU > > threads, the more this patch will help reduce vCPU starvation during > > CLEAR_DIRTY_LOG. > > > > Hm, perhaps testing with ept=N (which will use the write-lock for even > > dirty logging faults) would be a way to increase confidence in the > > effect on other architectures? > > Turning off the TDP MMU would be more representative, just manually disable the > fast-path, e.g. Good idea. I'm actually throwing in some writable module parameters too to make it easy to toggle between configurations. I'll report back when I have some data.
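A rough sketch of the kind of writable module parameter being described. The parameter name matches the knob mentioned later in the thread, but the wrapper split and placement are invented for illustration; none of this is upstream:

/*
 * Illustrative only: force vCPU write-protection faults to take
 * mmu_lock for read, emulating architectures without x86's lockless
 * fast path. Writable at runtime (0644) so configurations can be
 * toggled between test runs without reloading the module.
 */
static bool fast_page_fault_read_lock;
module_param(fast_page_fault_read_lock, bool, 0644);

static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	bool lock = READ_ONCE(fast_page_fault_read_lock);
	int ret;

	if (lock)
		read_lock(&vcpu->kvm->mmu_lock);

	ret = __fast_page_fault(vcpu, fault);	/* hypothetical: the existing lockless body */

	if (lock)
		read_unlock(&vcpu->kvm->mmu_lock);

	return ret;
}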
On 2024/4/13 上午12:12, David Matlack wrote: > On Sat, Apr 6, 2024 at 7:26 PM maobibo <maobibo@loongson.cn> wrote: >> >> >> >> On 2024/4/5 上午1:10, Sean Christopherson wrote: >>> On Thu, Apr 04, 2024, David Matlack wrote: >>>> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote: >>>>>> This change eliminates dips in guest performance during live migration >>>>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with >>>>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which >>>>> Frequently drop/reacquire mmu_lock will cause userspace migration >>>>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed >>>>> maybe become slower. >>>> >>>> In practice we have not found this to be the case. With this patch >>>> applied we see a significant improvement in guest workload throughput >>>> while userspace is issuing CLEAR ioctls without any change to the >>>> overall migration duration. >>> >>> ... >>> >>>> In the case of this patch, there doesn't seem to be a trade-off. We >>>> see an improvement to vCPU performance without any regression in >>>> migration duration or other metrics. >>> >>> For x86. We need to keep in mind that not all architectures have x86's optimization >>> around dirty logging faults, or around faults in general. E.g. LoongArch's (which >>> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock. >>> And if the fault can't be handled in the fast path, KVM will actually acquire >>> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after >>> the mmu_seq and fault-in pfn stuff). >> yes, there is no lock in function fast_page_fault on x86, however mmu >> spinlock is held on fast path on LoongArch. I do not notice this, >> originally I think that there is read_lock on x86 fast path for pte >> modification, write lock about page table modification. > > Most[*] vCPU faults on KVM/x86 are handled as follows: > > - vCPU write-protection and access-tracking faults are handled > locklessly (fast_page_fault()). > - All other vCPU faults are handled under read_lock(&kvm->mmu_lock). > > [*] The locking is different when nested virtualization is in use, TDP > (i.e. stage-2 hardware) is disabled, and/or kvm.tdp_mmu=N. > >>> >>> So for x86, I think we can comfortably state that this change is a net positive >>> for all scenarios. But for other architectures, that might not be the case. >>> I'm not saying this isn't a good change for other architectures, just that we >>> don't have relevant data to really know for sure. >> From the code there is contention between migration assistant thread >> and vcpu thread in slow path where huge page need be split if memslot is >> dirty log enabled, however there is no contention between migration >> assistant thread and vcpu fault fast path. If it is right, negative >> effect is small on x86. > > Right there is no contention between CLEAR_DIRTY_LOG and vCPU > write-protection faults on KVM/x86. KVM/arm64 does write-protection > faults under the read-lock. > > KVM/x86 and KVM/arm64 also both have eager page splitting, where the > huge pages are split during CLEAR_DIRTY_LOG, so there are also no vCPU > faults to split huge pages. > >> >>> >>> Absent performance data for other architectures, which is likely going to be >>> difficult/slow to get, it might make sense to have this be opt-in to start. We >>> could even do it with minimal #ifdeffery, e.g. 
something like the below would allow >>> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked() >>> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?). >>> >>> I don't love the idea of adding more arch specific MMU behavior (going the wrong >>> direction), but it doesn't seem like an unreasonable approach in this case. >> No special modification for this, it is a little strange. LoongArch page >> fault fast path can improve later. > > Sorry, I don't follow exactly what you mean here. Are you saying > Sean's patch is not required for LoongArch and, instead, LoongArch > should/will avoid acquiring the mmu_lock when handling > write-protection faults? > Yes, LoongArch needs some optimization in its secondary MMU fast path, such as RCU locking, taking mmu_lock for read, or a spinlock on the PTE table page; that has not been decided yet. If the patch works well on the x86 platform, just go ahead; LoongArch KVM will improve later. Regards Bibo Mao
On 2024-04-12 09:14 AM, David Matlack wrote: > On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@google.com> wrote: > > > > On Thu, Apr 04, 2024, David Matlack wrote: > > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong > > > > direction), but it doesn't seem like an unreasonable approach in this case. > > > > > > I wonder if this is being overly cautious. > > > > Probably. "Lazy" is another word for it ;-) > > > > > I would expect only more benefit on architectures that more aggressively take > > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU > > > threads, the more this patch will help reduce vCPU starvation during > > > CLEAR_DIRTY_LOG. > > > > > > Hm, perhaps testing with ept=N (which will use the write-lock for even > > > dirty logging faults) would be a way to increase confidence in the > > > effect on other architectures? > > > > Turning off the TDP MMU would be more representative, just manually disable the > > fast-path, e.g. > > Good idea. I'm actually throwing in some writable module parameters > too to make it easy to toggle between configurations. > > I'll report back when I have some data.

tl;dr

* My patch likely _will_ regress migration performance on other architectures. Thank you Bibo and Sean for keeping me honest here.
* I suspect the original issue my patch is trying to fix is actually specific to the way the TDP MMU does eager page splitting, and a more targeted fix is warranted.

---

To evaluate my patch I tested on x86 with different mmu_lock configurations to simulate other architectures.

Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N

Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to add a read_lock/unlock() in fast_page_fault().

Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates LoongArch if LoongArch added support for lockless write-protection fault handling.

The test I ran was a Live Migration of a 16-vCPU 64GB VM running an aggressive write-heavy workload. To compare runs I evaluated 3 metrics:

* Duration of pre-copy.
* Amount of dirty memory going into post-copy.
* Total CPU usage of CLEAR_DIRTY_LOG.

The following table shows how each metric changed after adding my patch to drop mmu_lock during CLEAR_DIRTY_LOG.

         | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
---------|------------------|-----------------|---------------------
Config 1 | -1%              | -1%             | +6%
Config 2 | -1%              | +1%             | +123%
Config 3 | +32%             | +158%           | +5200%

Configs 2 and 3 both show regressions, with Config 3 severely regressed in all 3 dimensions.

Given these regressions, I started rethinking the original issue this patch is trying to fix. The dips in guest performance during CLEAR_DIRTY_LOG occurred during the first pre-copy pass but not during subsequent passes. One thing unique to the first pass is eager page splitting. Ah ha, a theory!

The TDP MMU allocates shadow pages while holding the mmu_lock during eager page splitting and only drops the lock if need_resched=True or a GFP_NOWAIT allocation fails. If neither occurs, CLEAR_DIRTY_LOG could potentially hold mmu_lock in write-mode for a long time.

Second, the host platform where we saw the dips has nx_huge_pages=Y. I suspect the long CLEAR_DIRTY_LOG calls are blocking vCPUs taking steady-state faults for NX Huge Pages, causing the dips in performance.
This theory also explains why we (Google) haven't seen similar drops in guest performance when using !manual-protect, as in that configuration the TDP MMU does eager page splitting under the read-lock instead of the write-lock.

If this is all true, then a better / more targeted fix for this issue would be to drop mmu_lock in the TDP MMU eager page splitting path. For example, we could limit the "allocate under lock" behavior to only when the read-lock is held, e.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7dfdc49a6ade..ea34f8232d97 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
-	if (sp)
-		return sp;
+	if (shared) {
+		sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
+		if (sp)
+			return sp;
+	}
 
 	rcu_read_unlock();

I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to allocate page tables. So I suspect no fix is needed there and this is, in fact, purely an x86-specific issue.
On Mon, Apr 15, 2024, David Matlack wrote:
> * I suspect the original issue my patch is trying to fix is actually specific
>   to the way the TDP MMU does eager page splitting and a more targeted fix is
>   warranted.
>
> ---
>
> To evaluate my patch I tested on x86 with different mmu_lock configurations
> to simulate other architectures.
>
>  Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
>  Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
>  Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N
>
> Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
> add a read_lock/unlock() in fast_page_fault().
>
> Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
> LoongArch if LoongArch added support for lockless write-protection fault
> handling.
>
> The test I ran was a Live Migration of a 16 vCPU, 64GB VM running an
> aggressive write-heavy workload. To compare runs I evaluated 3 metrics:
>
>  * Duration of pre-copy.
>  * Amount of dirty memory going into post-copy.
>  * Total CPU usage of CLEAR_DIRTY_LOG.
>
> The following table shows how each metric changed after adding my patch to
> drop mmu_lock during CLEAR_DIRTY_LOG.
>
>           | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
>  ---------|------------------|-----------------|---------------------
>  Config 1 | -1%              | -1%             | +6%
>  Config 2 | -1%              | +1%             | +123%
>  Config 3 | +32%             | +158%           | +5200%
>
> Configs 2 and 3 both show regressions, with Config 3 severely regressed in
> all 3 dimensions.

...

> If this is all true, then a better / more targeted fix for this issue would be
> to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> could limit the "allocate under lock" behavior to only when the read-lock is
> held, e.g.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7dfdc49a6ade..ea34f8232d97 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  	 * If this allocation fails we drop the lock and retry with reclaim
>  	 * allowed.
>  	 */
> -	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> -	if (sp)
> -		return sp;
> +	if (shared) {
> +		sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> +		if (sp)
> +			return sp;
> +	}
>
>  	rcu_read_unlock();
>
> I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> allocate page tables. So I suspect no fix is needed there and this is, in fact,
> purely an x86-specific issue.

Hmm, it'd be nice if we didn't have to rely on random arch flows to coincidentally
do the optimal thing for eager page splitting. Not sure how best to document the
"best known practice" though.

As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock for
read is measurable, I would prefer to unconditionally drop mmu_lock, and delete
the GFP_NOWAIT allocation. There can be lock contention when mmu_lock is held
for read, it's just less common.

On a related topic, I think we should take a hard look at the rwlock_needbreak()
usage in tdp_mmu_iter_cond_resched(). Dropping when allocating is really just
speculatively dropping mmu_lock because it _might_ be contended, but doing so at
a batch size that provides a good balance between doing enough work under
mmu_lock and providing low latency for vCPUs. I.e. in theory, we should be able
to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
near smart enough and it's currently only for preemptible kernels (or at least,
it's supposed to be only for preemptible kernels).

Simply yielding on contention is not at all optimal, as evidenced by the whole
dynamic preemption debacle[1][2]. The immediate issue was "fixed" by having vCPUs
avoid taking mmu_lock, but KVM really shouldn't get into a situation where KVM is
pathologically dropping mmu_lock to the point where a non-vCPU action grinds to
a halt.

The contention logic fails to take into account many things:

 (1) Is the other task higher priority?

 (2) Is the other task a vCPU, or something else?

 (3) Will yielding actually allow the other task to make forward progress?

 (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?

 (5) How long is this task expected to hold mmu_lock?

 (6) Has this task made enough progress for yielding to be a decent choice?

and probably many more than that.

As we discussed off-list, I think there are two viable approaches:

 (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
     contention logic, and improve the logic (especially the forward progress
     guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
     in all cases.

 (b) We completely remove the rwlock_needbreak() checks from
     tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
     mmu_lock in flows where doing so provides the best overall balance, e.g. as
     in the eager page split case.

I don't have a strong preference between (a) and (b), though I think I'd lean
towards (b), because it's simpler. My guess is that we can achieve similar
performance results with both. E.g. odds are decent that the "best" batch size
(see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
in the noise when it's not contended.

The main argument I see for (b) is that it's simpler, as only code that actually
has a justified need to drop mmu_lock does so. The advantage I see with (a) is
that it would provide structure and documentation for choosing when to drop
mmu_lock (or not).

E.g. dropping in the eager page split path makes sense because KVM does so at a
large batch size, odds are good that the contending task is a vCPU, there's no
TLB flush required, the total hold time of mmu_lock is high, and we know that
dropping mmu_lock will allow vCPUs to make forward progress. (a) would do a much
better job of capturing all that in code, albeit with quite a bit more complexity.

Regardless of which option we choose, I think we should drop the preemptible
kernel dependency from the lock contention logic in tdp_mmu_iter_cond_resched(),
i.e. manually check if mmu_lock is contended instead of bouncing through
rwlock_needbreak().

The current approach essentially means that there's zero testing of the
performance of the yield-on-contention logic. E.g. the complaints about the TDP
MMU yielding too aggressively only popped up when commit c597bfddc9e9 ("sched:
Provide Kconfig support for default dynamic preempt mode") unintentionally
enabled rwlock_needbreak() by default.

That's definitely easier said than done though, as I suspect that if we switched
to rwlock_is_contended(), i.e. dropped the preemptible requirement, without also
enhancing tdp_mmu_iter_cond_resched() to make it smarter as above, we'd see a lot
of performance regressions.

[1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@google.com
[2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@google.com
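[Editor's note: to make the "unconditionally drop mmu_lock and delete the
GFP_NOWAIT allocation" alternative concrete, here is a rough sketch of what
tdp_mmu_alloc_sp_for_split() could look like with that change. The parameter
list and body are approximated from the hunk quoted above (including the
non-upstream "nid" argument); this is an illustration of the idea, not a
proposed patch.]

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
							bool shared, int nid)
{
	struct kvm_mmu_page *sp;

	/*
	 * Unconditionally drop mmu_lock (and the RCU read lock) so that the
	 * allocation can block, i.e. can use GFP_KERNEL_ACCOUNT and do direct
	 * reclaim, instead of first attempting a GFP_NOWAIT allocation while
	 * the lock is held.
	 */
	rcu_read_unlock();

	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

	if (shared)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	rcu_read_lock();

	/* The caller must restart its walk, as mmu_lock was dropped. */
	return sp;
}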
On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 15, 2024, David Matlack wrote:
> > If this is all true, then a better / more targeted fix for this issue would be
> > to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> > could limit the "allocate under lock" behavior to only when the read-lock is
> > held, e.g.
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 7dfdc49a6ade..ea34f8232d97 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >  	 * If this allocation fails we drop the lock and retry with reclaim
> >  	 * allowed.
> >  	 */
> > -	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > -	if (sp)
> > -		return sp;
> > +	if (shared) {
> > +		sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > +		if (sp)
> > +			return sp;
> > +	}
> >
> >  	rcu_read_unlock();
> >
> > I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> > allocate page tables. So I suspect no fix is needed there and this is, in fact,
> > purely an x86-specific issue.
>
> Hmm, it'd be nice if we didn't have to rely on random arch flows to coincidentally
> do the optimal thing for eager page splitting. Not sure how best to document the
> "best known practice" though.

We discussed upstreaming Google's mmu_lock timing stats yesterday. Once
that's in place, KVM could WARN_ON_ONCE() if the mmu_lock is held for an
excessive amount of time to help flag these kinds of issues. That might
trigger false positives though, as holding mmu_lock for a long time is no
big deal if there's no contention.

> As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock for
> read is measurable, I would prefer to unconditionally drop mmu_lock, and delete
> the GFP_NOWAIT allocation. There can be lock contention when mmu_lock is held
> for read, it's just less common.

SGTM. I'll do some testing. Unfortunately, the original MySQL workload that
led to this patch has bitrotted, so I'm having some trouble reproducing the
original results and confirming the TDP MMU fix. Sigh.

> On a related topic, I think we should take a hard look at the rwlock_needbreak()
> usage in tdp_mmu_iter_cond_resched(). Dropping when allocating is really just
> speculatively dropping mmu_lock because it _might_ be contended, but doing so at
> a batch size that provides a good balance between doing enough work under
> mmu_lock and providing low latency for vCPUs. I.e. in theory, we should be able
> to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> near smart enough and it's currently only for preemptible kernels (or at least,
> it's supposed to be only for preemptible kernels).

Dropping the lock for allocating is also about dropping GFP_NOWAIT, i.e.
allowing direct reclaim and other blocking operations. This is valuable for
"cry-for-help" type migrations where the host is under intense memory
pressure. I'd rather do the reclaim on the eager page splitting thread than
on a vCPU.

But I agree that the rwlock_needbreak() checks are pretty much untested and
likely super nonoptimal.

> Simply yielding on contention is not at all optimal, as evidenced by the whole
> dynamic preemption debacle[1][2]. The immediate issue was "fixed" by having vCPUs
> avoid taking mmu_lock, but KVM really shouldn't get into a situation where KVM is
> pathologically dropping mmu_lock to the point where a non-vCPU action grinds to
> a halt.
>
> The contention logic fails to take into account many things:
>
>  (1) Is the other task higher priority?
>
>  (2) Is the other task a vCPU, or something else?
>
>  (3) Will yielding actually allow the other task to make forward progress?
>
>  (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?
>
>  (5) How long is this task expected to hold mmu_lock?
>
>  (6) Has this task made enough progress for yielding to be a decent choice?
>
> and probably many more than that.
>
> As we discussed off-list, I think there are two viable approaches:
>
>  (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
>      contention logic, and improve the logic (especially the forward progress
>      guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
>      in all cases.

The only way I can think of to universally measure forward progress would be
by wall time. Again, that becomes more possible with the mmu_lock timing
stats. But we'll have to hand-pick some thresholds, and that feels wrong...

>  (b) We completely remove the rwlock_needbreak() checks from
>      tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
>      mmu_lock in flows where doing so provides the best overall balance, e.g. as
>      in the eager page split case.
>
> I don't have a strong preference between (a) and (b), though I think I'd lean
> towards (b), because it's simpler. My guess is that we can achieve similar
> performance results with both. E.g. odds are decent that the "best" batch size
> (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> in the noise when it's not contended.
>
> The main argument I see for (b) is that it's simpler, as only code that actually
> has a justified need to drop mmu_lock does so. The advantage I see with (a) is
> that it would provide structure and documentation for choosing when to drop
> mmu_lock (or not).

I need to think it through more, but I'm leaning toward (b), using the
mmu_lock stats to flag potential flows that hold the lock too long. With (b)
we can make each flow incrementally better and don't have to pick any magic
numbers.

> E.g. dropping in the eager page split path makes sense because KVM does so at a
> large batch size, odds are good that the contending task is a vCPU, there's no
> TLB flush required, the total hold time of mmu_lock is high, and we know that
> dropping mmu_lock will allow vCPUs to make forward progress. (a) would do a much
> better job of capturing all that in code, albeit with quite a bit more complexity.
>
> Regardless of which option we choose, I think we should drop the preemptible
> kernel dependency from the lock contention logic in tdp_mmu_iter_cond_resched(),
> i.e. manually check if mmu_lock is contended instead of bouncing through
> rwlock_needbreak().

+1

> The current approach essentially means that there's zero testing of the
> performance of the yield-on-contention logic. E.g. the complaints about the TDP
> MMU yielding too aggressively only popped up when commit c597bfddc9e9 ("sched:
> Provide Kconfig support for default dynamic preempt mode") unintentionally
> enabled rwlock_needbreak() by default.
>
> That's definitely easier said than done though, as I suspect that if we switched
> to rwlock_is_contended(), i.e. dropped the preemptible requirement, without also
> enhancing tdp_mmu_iter_cond_resched() to make it smarter as above, we'd see a lot
> of performance regressions.
>
> [1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@google.com
> [2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@google.com
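[Editor's note: a minimal sketch of the WARN_ON_ONCE() idea mentioned above,
assuming some form of mmu_lock hold-time instrumentation were added. The
mmu_lock_acquired field, the helper names, and the 20ms threshold are all
invented here for illustration; nothing like this exists in upstream KVM.]

/*
 * Purely hypothetical: KVM has no mmu_lock timing stats upstream.  The
 * mmu_lock_acquired field and the 20ms threshold are invented for this
 * sketch, which also assumes the rwlock_t flavor of mmu_lock.
 */
static inline void kvm_mmu_write_lock_timed(struct kvm *kvm)
{
	write_lock(&kvm->mmu_lock);
	kvm->mmu_lock_acquired = ktime_get();
}

static inline void kvm_mmu_write_unlock_timed(struct kvm *kvm)
{
	s64 held_us = ktime_us_delta(ktime_get(), kvm->mmu_lock_acquired);

	write_unlock(&kvm->mmu_lock);

	/* Flag flows that held mmu_lock for an excessive amount of time. */
	WARN_ON_ONCE(held_us > 20 * USEC_PER_MSEC);
}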
On Thu, Apr 18, 2024, David Matlack wrote:
> On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@google.com> wrote:
> > On a related topic, I think we should take a hard look at the rwlock_needbreak()
> > usage in tdp_mmu_iter_cond_resched(). Dropping when allocating is really just
> > speculatively dropping mmu_lock because it _might_ be contended, but doing so at
> > a batch size that provides a good balance between doing enough work under
> > mmu_lock and providing low latency for vCPUs. I.e. in theory, we should be able
> > to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> > near smart enough and it's currently only for preemptible kernels (or at least,
> > it's supposed to be only for preemptible kernels).
>
> Dropping the lock for allocating is also about dropping GFP_NOWAIT, i.e.
> allowing direct reclaim and other blocking operations. This is valuable for
> "cry-for-help" type migrations where the host is under intense memory
> pressure. I'd rather do the reclaim on the eager page splitting thread than
> on a vCPU.

...

> > (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
> >     contention logic, and improve the logic (especially the forward progress
> >     guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
> >     in all cases.
>
> The only way I can think of to universally measure forward progress would be
> by wall time. Again, that becomes more possible with the mmu_lock timing
> stats. But we'll have to hand-pick some thresholds, and that feels wrong...

Yeah, hard pass on anything based on arbitrary time thresholds. And I think we
should have the mmu_lock stats be buried behind a Kconfig, i.e. we shouldn't use
them in KVM to guide behavior in any way.

> > (b) We completely remove the rwlock_needbreak() checks from
> >     tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
> >     mmu_lock in flows where doing so provides the best overall balance, e.g. as
> >     in the eager page split case.
> >
> > I don't have a strong preference between (a) and (b), though I think I'd lean
> > towards (b), because it's simpler. My guess is that we can achieve similar
> > performance results with both. E.g. odds are decent that the "best" batch size
> > (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> > in the noise when it's not contended.
> >
> > The main argument I see for (b) is that it's simpler, as only code that actually
> > has a justified need to drop mmu_lock does so. The advantage I see with (a) is
> > that it would provide structure and documentation for choosing when to drop
> > mmu_lock (or not).
>
> I need to think it through more, but I'm leaning toward (b), using the
> mmu_lock stats to flag potential flows that hold the lock too long. With (b)
> we can make each flow incrementally better and don't have to pick any magic
> numbers.

I'm fully in the (b) boat given the GFP_NOWAIT => direct reclaim piece of the
puzzle. A few other relevant developments:

 1. The TDP MMU now frees and allocates roots with mmu_lock held for read.
 2. The "zap everything" MTRR trainwreck is likely on its way out the door[1].
 3. The change_pte() mmu_notifier is gone[2].
 4. We're working on aging GFNs with mmu_lock held for read, or not at all[3].

As a result, the only paths that take mmu_lock for write are "zap all",
mmu_notifier invalidation events, and APICv inhibit toggling, which does
kvm_zap_gfn_range() on a single page, i.e. won't ever need to yield.

The "slow" zap all is mutually exclusive with anything interesting because it
only runs when the process is exiting. The "fast" zap all should never yield
when it holds mmu_lock for write, because the part that runs with mmu_lock held
for write is <drum roll> fast. Maaaybe the slow part that runs with mmu_lock
held for read could drop mmu_lock on contention.

That just leaves mmu_notifier invalidations, and the mess with dynamic
preemption that resulted in KVM bouncing mmu_lock between invalidations and
vCPUs is pretty strong evidence that yielding on mmu_lock contention when
zapping SPTEs is a terrible idea.

And yielding from most flows that hold mmu_lock for read is also a net negative,
as (a) *all* readers need to go away before a writer can get the lock, (b) the
majority of such flows are short-lived and operate in vCPU context, and (c) the
remaining flows that take mmu_lock are either slow paths (zap all, mmu_notifier
invalidations) or rare (APICv toggling).

In short, I highly doubt we'll ever have more than a few flows where yielding
because mmu_lock is contended is actually desirable. As a result, odds are good
that the reasons for dropping mmu_lock in the middle of a sequence are going to
be unique each and every time. E.g. eager page splitting wants to drop it in
order to do direct reclaim, and because it can be super long running, *and*
doesn't need to flush TLBs when yielding.

kvm_tdp_mmu_zap_invalidated_roots() is the only flow I can think of where
dropping mmu_lock on contention might make sense, e.g. to allow an mmu_notifier
invalidation to go through. Again, that's a very long-running flow that doesn't
need to flush TLBs, is (hopefully) not being done from vCPU context, and could
otherwise put KVM into a scenario where it blocks an mmu_notifier invalidation,
which in turn blocks vCPUs that are trying to rebuild SPTEs.

[1] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
[2] https://lore.kernel.org/all/20240405115815.3226315-1-pbonzini@redhat.com
[3] https://lore.kernel.org/all/20240401232946.1837665-1-jthoughton@google.com