Message ID: 20190205210137.1377-11-sean.j.christopherson@intel.com (mailing list archive)
State: New, archived
Series: KVM: x86/mmu: Remove fast invalidate mechanism
On Tue, 5 Feb 2019 13:01:21 -0800 Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> that actually belong to the memslot being removed.  This improves
> performance, especially when the deleted memslot has only a few shadow
> entries, or even no entries.  E.g. a microbenchmark to access regular
> memory while concurrently reading PCI ROM to trigger memslot deletion
> showed a 5% improvement in throughput.
>
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)

A number of vfio users are reporting VM instability issues since v5.1, some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"), which I've confirmed via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and re-verified on current 5.3-rc4 using the below patch to toggle the broken behavior.

My reproducer is a Windows 10 VM with an assigned GeForce GPU running a variety of tests, including FurMark and PassMark Performance Test.  With the code enabled as it exists upstream currently, PassMark will generally introduce graphics glitches or hangs.  Sometimes it's necessary to reboot the VM to see these issues.  Flipping the 0/1 in the below patch appears to resolve the issue.

I'd appreciate any insights into further debugging this block of code so that we can fix this regression.

Thanks,
Alex

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..6a77306940f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
+#if 0
+	kvm_mmu_zap_all(kvm);
+#else
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	unsigned long i;
@@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+#endif
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> On Tue, 5 Feb 2019 13:01:21 -0800
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > that actually belong to the memslot being removed.  This improves
> > performance, especially when the deleted memslot has only a few shadow
> > entries, or even no entries.  E.g. a microbenchmark to access regular
> > memory while concurrently reading PCI ROM to trigger memslot deletion
> > showed a 5% improvement in throughput.
> >
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
>
> A number of vfio users are reporting VM instability issues since v5.1,
> some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> only the relevant pages when removing a memslot"), which I've confirmed
> via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> re-verified on current 5.3-rc4 using the below patch to toggle the
> broken behavior.
>
> My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> variety of tests, including FurMark and PassMark Performance Test.
> With the code enabled as exists in upstream currently, PassMark will
> generally introduce graphics glitches or hangs.  Sometimes it's
> necessary to reboot the VM to see these issues.

As in, the issue only shows up when the VM is rebooted?  Just want to
double check that that's not a typo.

> Flipping the 0/1 in the below patch appears to resolve the issue.
>
> I'd appreciate any insights into further debugging this block of code
> so that we can fix this regression.  Thanks,

If it's not too painful to reproduce, I'd say start by determining whether
it's a problem with the basic logic or if the cond_resched_lock() handling
is wrong.  I.e. comment/ifdef out this chunk:

	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
		flush = false;
		cond_resched_lock(&kvm->mmu_lock);
	}

> Alex
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..6a77306940f7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  			struct kvm_memory_slot *slot,
>  			struct kvm_page_track_notifier_node *node)
>  {
> +#if 0
> +	kvm_mmu_zap_all(kvm);
> +#else
>  	struct kvm_mmu_page *sp;
>  	LIST_HEAD(invalid_list);
>  	unsigned long i;
> @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> +#endif
>  }
>  
>  void kvm_mmu_init_vm(struct kvm *kvm)
On Tue, 13 Aug 2019 10:04:41 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> > On Tue, 5 Feb 2019 13:01:21 -0800
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >
> > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > that actually belong to the memslot being removed.  This improves
> > > performance, especially when the deleted memslot has only a few shadow
> > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > showed a 5% improvement in throughput.
> > >
> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > >  1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > A number of vfio users are reporting VM instability issues since v5.1,
> > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > only the relevant pages when removing a memslot"), which I've confirmed
> > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > re-verified on current 5.3-rc4 using the below patch to toggle the
> > broken behavior.
> >
> > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > variety of tests, including FurMark and PassMark Performance Test.
> > With the code enabled as exists in upstream currently, PassMark will
> > generally introduce graphics glitches or hangs.  Sometimes it's
> > necessary to reboot the VM to see these issues.
>
> As in, the issue only shows up when the VM is rebooted?  Just want to
> double check that that's not a typo.

No, it can occur on the first boot as well, it's just that the recipe to induce a failure is not well understood and manifests itself in different ways.  I generally run the tests, then if it still hasn't reproduced, I reboot the VM a couple times, running a couple apps in between to try to trigger/notice bad behavior.

> > Flipping the 0/1 in the below patch appears to resolve the issue.
> >
> > I'd appreciate any insights into further debugging this block of code
> > so that we can fix this regression.  Thanks,
>
> If it's not too painful to reproduce, I'd say start by determining whether
> it's a problem with the basic logic or if the cond_resched_lock() handling
> is wrong.  I.e. comment/ifdef out this chunk:
>
> 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> 		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> 		flush = false;
> 		cond_resched_lock(&kvm->mmu_lock);
> 	}

If anything, removing this chunk seems to make things worse.

Thanks,
Alex
On Tue, 13 Aug 2019 11:57:37 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 13 Aug 2019 10:04:41 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> > > On Tue, 5 Feb 2019 13:01:21 -0800
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >
> > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > > that actually belong to the memslot being removed.  This improves
> > > > performance, especially when the deleted memslot has only a few shadow
> > > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > > showed a 5% improvement in throughput.
> > > >
> > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 32 insertions(+), 1 deletion(-)
> > >
> > > A number of vfio users are reporting VM instability issues since v5.1,
> > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > > only the relevant pages when removing a memslot"), which I've confirmed
> > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > > re-verified on current 5.3-rc4 using the below patch to toggle the
> > > broken behavior.
> > >
> > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > > variety of tests, including FurMark and PassMark Performance Test.
> > > With the code enabled as exists in upstream currently, PassMark will
> > > generally introduce graphics glitches or hangs.  Sometimes it's
> > > necessary to reboot the VM to see these issues.
> >
> > As in, the issue only shows up when the VM is rebooted?  Just want to
> > double check that that's not a typo.
>
> No, it can occur on the first boot as well, it's just that the recipe
> to induce a failure is not well understood and manifests itself in
> different ways.  I generally run the tests, then if it still hasn't
> reproduced, I reboot the VM a couple times, running a couple apps in
> between to try to trigger/notice bad behavior.
>
> > > Flipping the 0/1 in the below patch appears to resolve the issue.
> > >
> > > I'd appreciate any insights into further debugging this block of code
> > > so that we can fix this regression.  Thanks,
> >
> > If it's not too painful to reproduce, I'd say start by determining whether
> > it's a problem with the basic logic or if the cond_resched_lock() handling
> > is wrong.  I.e. comment/ifdef out this chunk:
> >
> > 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> > 		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> > 		flush = false;
> > 		cond_resched_lock(&kvm->mmu_lock);
> > 	}
>
> If anything, removing this chunk seems to make things worse.

Could it be something with the gfn test:

	if (sp->gfn != gfn)
		continue;

If I remove it, I can't trigger the misbehavior.
If I log it, I only get hits on VM boot/reboot and some of the gfns look suspiciously like they could be the assigned GPU BARs and maybe MSI mappings:

(sp->gfn) != (gfn)
[ 71.829450] gfn fec00 != c02c4
[ 71.835554] gfn ffe00 != c046f
[ 71.841664] gfn 0 != c0720
[ 71.847084] gfn 0 != c0720
[ 71.852489] gfn 0 != c0720
[ 71.857899] gfn 0 != c0720
[ 71.863306] gfn 0 != c0720
[ 71.868717] gfn 0 != c0720
[ 71.874122] gfn 0 != c0720
[ 71.879531] gfn 0 != c0720
[ 71.884937] gfn 0 != c0720
[ 71.890349] gfn 0 != c0720
[ 71.895757] gfn 0 != c0720
[ 71.901163] gfn 0 != c0720
[ 71.906569] gfn 0 != c0720
[ 71.911980] gfn 0 != c0720
[ 71.917387] gfn 0 != c0720
[ 71.922808] gfn fee00 != c0edc
[ 71.928915] gfn fee00 != c0edc
[ 71.935018] gfn fee00 != c0edc
[ 71.941730] gfn c1000 != 8002d7
[ 71.948039] gfn 80000 != 8006d5
[ 71.954328] gfn 80000 != 8006d5
[ 71.960600] gfn 80000 != 8006d5
[ 71.966874] gfn 80000 != 8006d5
[ 71.992272] gfn 0 != c0720
[ 71.997683] gfn 0 != c0720
[ 72.003725] gfn 80000 != 8006d5
[ 72.044333] gfn 0 != c0720
[ 72.049743] gfn 0 != c0720
[ 72.055846] gfn 80000 != 8006d5
[ 72.177341] gfn ffe00 != c046f
[ 72.183453] gfn 0 != c0720
[ 72.188864] gfn 0 != c0720
[ 72.194290] gfn 0 != c0720
[ 72.200308] gfn 80000 != 8006d5
[ 82.539023] gfn fec00 != c02c4
[ 82.545142] gfn 40000 != c0377
[ 82.551343] gfn ffe00 != c046f
[ 82.557466] gfn 100000 != c066f
[ 82.563839] gfn 800000 != c06ec
[ 82.570133] gfn 800000 != c06ec
[ 82.576408] gfn 0 != c0720
[ 82.581850] gfn 0 != c0720
[ 82.587275] gfn 0 != c0720
[ 82.592685] gfn 0 != c0720
[ 82.598131] gfn 0 != c0720
[ 82.603552] gfn 0 != c0720
[ 82.608978] gfn 0 != c0720
[ 82.614419] gfn 0 != c0720
[ 82.619844] gfn 0 != c0720
[ 82.625291] gfn 0 != c0720
[ 82.630791] gfn 0 != c0720
[ 82.636208] gfn 0 != c0720
[ 82.641635] gfn 80a000 != c085e
[ 82.647929] gfn fee00 != c0edc
[ 82.654062] gfn fee00 != c0edc
[ 82.660504] gfn 100000 != c066f
[ 82.666800] gfn 0 != c0720
[ 82.672211] gfn 0 != c0720
[ 82.677635] gfn 0 != c0720
[ 82.683060] gfn 0 != c0720
[ 82.689209] gfn c1000 != 8002d7
[ 82.695501] gfn 80000 != 8006d5
[ 82.701796] gfn 80000 != 8006d5
[ 82.708092] gfn 100000 != 80099b
[ 82.714547] gfn 0 != 800a4c
[ 82.720154] gfn 0 != 800a4c
[ 82.725752] gfn 0 != 800a4c
[ 82.731370] gfn 0 != 800a4c
[ 82.738705] gfn 100000 != 80099b
[ 82.745201] gfn 0 != 800a4c
[ 82.750793] gfn 0 != 800a4c
[ 82.756381] gfn 0 != 800a4c
[ 82.761979] gfn 0 != 800a4c
[ 82.768122] gfn 100000 != 8083a4
[ 82.774605] gfn 0 != 8094aa
[ 82.780196] gfn 0 != 8094aa
[ 82.785796] gfn 0 != 8094aa
[ 82.791412] gfn 0 != 8094aa
[ 82.797523] gfn 100000 != 8083a4
[ 82.803977] gfn 0 != 8094aa
[ 82.809576] gfn 0 != 8094aa
[ 82.815193] gfn 0 != 8094aa
[ 82.820809] gfn 0 != 8094aa

(GPU has a BAR mapped at 0x80000000)

Is this gfn optimization correct?  Overzealous?  Doesn't account correctly for something about MMIO mappings?  Thanks,

Alex
On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 11:57:37 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
> Could it be something with the gfn test:
>
> 	if (sp->gfn != gfn)
> 		continue;
>
> If I remove it, I can't trigger the misbehavior.  If I log it, I only
> get hits on VM boot/reboot and some of the gfns look suspiciously like
> they could be the assigned GPU BARs and maybe MSI mappings:
>
> (sp->gfn) != (gfn)

Hits at boot/reboot makes sense, memslots get zapped when userspace removes a memory region/slot, e.g. remaps BARs and whatnot.

...

> Is this gfn optimization correct?  Overzealous?  Doesn't account
> correctly for something about MMIO mappings?  Thanks,

Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks all entries for a given gfn.  The sp->gfn check is there to skip entries that hashed to the same list but for a completely different gfn.

Skipping the gfn check would be sort of a lightweight zap all in the sense that it would zap shadow pages that happened to collide with the target memslot/gfn but are otherwise unrelated.

What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:

	if (sp->gfn != gfn && sp->gfn != 0x80000)
		continue;

If that doesn't work, it might be worth trying other gfns to see if you can pinpoint which sp is being zapped as collateral damage.

It's possible there is a pre-existing bug somewhere else that was being hidden because KVM was effectively zapping all SPTEs during (re)boot, and the hash collision is also hiding the bug by zapping the stale entry.

Of course it's also possible this code is wrong, :-)
On 13/08/19 22:19, Sean Christopherson wrote:
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
>
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
>
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
>
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
>
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
>
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
>
> Of course it's also possible this code is wrong, :-)

Also, can you reproduce it with one vCPU?  This could (though not really 100%) distinguish a missing invalidation from a race condition.

Do we even need the call to slot_handle_all_level?  The rmap update should be done already by kvm_mmu_prepare_zap_page (via kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).

Alex, can you replace it with just "flush = false;"?

Thanks,

Paolo
On Tue, 13 Aug 2019 22:37:14 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 22:19, Sean Christopherson wrote:
> > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > that hashed to the same list but for a completely different gfn.
> >
> > Skipping the gfn check would be sort of a lightweight zap all in the
> > sense that it would zap shadow pages that happened to collide with the
> > target memslot/gfn but are otherwise unrelated.
> >
> > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> >
> > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > 		continue;

Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.

> > If that doesn't work, it might be worth trying other gfns to see if you
> > can pinpoint which sp is being zapped as collateral damage.
> >
> > It's possible there is a pre-existing bug somewhere else that was being
> > hidden because KVM was effectively zapping all SPTEs during (re)boot,
> > and the hash collision is also hiding the bug by zapping the stale entry.
> >
> > Of course it's also possible this code is wrong, :-)
>
> Also, can you reproduce it with one vCPU?  This could (though not really
> 100%) distinguish a missing invalidation from a race condition.

That's a pretty big change, I'll give it a shot, but not sure how conclusive it would be.

> Do we even need the call to slot_handle_all_level?  The rmap update
> should be done already by kvm_mmu_prepare_zap_page (via
> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
>
> Alex, can you replace it with just "flush = false;"?

Replace the continue w/ flush = false?  I'm not clear on this suggestion.  Thanks,

Alex
On 13/08/19 23:14, Alex Williamson wrote:
>> Do we even need the call to slot_handle_all_level?  The rmap update
>> should be done already by kvm_mmu_prepare_zap_page (via
>> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
>>
>> Alex, can you replace it with just "flush = false;"?
> Replace the continue w/ flush = false?  I'm not clear on this
> suggestion.  Thanks,

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..382b3ee303e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
 	if (list_empty(&kvm->arch.active_mmu_pages))
 		goto out_unlock;
 
-	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+	flush = false;
 
 	for (i = 0; i < slot->npages; i++) {
 		gfn = slot->base_gfn + i;
On Tue, 13 Aug 2019 23:15:38 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 23:14, Alex Williamson wrote:
> >> Do we even need the call to slot_handle_all_level?  The rmap update
> >> should be done already by kvm_mmu_prepare_zap_page (via
> >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> >>
> >> Alex, can you replace it with just "flush = false;"?
> > Replace the continue w/ flush = false?  I'm not clear on this
> > suggestion.  Thanks,
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..382b3ee303e3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
>  	if (list_empty(&kvm->arch.active_mmu_pages))
>  		goto out_unlock;
>  
> -	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
> +	flush = false;
>  
>  	for (i = 0; i < slot->npages; i++) {
>  		gfn = slot->base_gfn + i;

Things are not happy with that, it managed to crash the host:

[  871.244789] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  871.258680] #PF: supervisor read access in kernel mode
[  871.268939] #PF: error_code(0x0000) - not-present page
[  871.279202] PGD 0 P4D 0
[  871.284264] Oops: 0000 [#1] SMP PTI
[  871.291232] CPU: 3 PID: 1954 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  871.304264] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  871.322523] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  871.332085] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  871.369560] RSP: 0018:ffffc10300a0bb18 EFLAGS: 00010202
[  871.379995] RAX: 00000000000c1040 RBX: ffffc103007cd000 RCX: 0000000000000014
[  871.394243] RDX: ffff9b3bbff18130 RSI: 00000000000c1080 RDI: 0000000000000004
[  871.408491] RBP: ffff9b3bced15400 R08: ffff9b3b94a90008 R09: ffff9b3b94a90000
[  871.422739] R10: ffff9b3b94a90168 R11: 0000000000000004 R12: ffff9b3bbff18130
[  871.436986] R13: ffffc103007cd000 R14: 0000000000000000 R15: 00000000000c1000
[  871.451234] FS:  00007fc27b37d700(0000) GS:ffff9b3f9f4c0000(0000) knlGS:0000000000000000
[  871.467390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.478865] CR2: 0000000000000000 CR3: 000000003bb2a004 CR4: 00000000001626e0
[  871.493113] Call Trace:
[  871.498012]  drop_spte+0x77/0xa0 [kvm]
[  871.505500]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  871.514200]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  871.524809]  kvm_mmu_invalidate_zap_pages_in_memslot+0xb7/0x130 [kvm]
[  871.537670]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  871.547928]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  871.558191]  kvm_set_memory_region+0x26/0x40 [kvm]
[  871.567759]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  871.576106]  ? __seccomp_filter+0x7a/0x680
[  871.584287]  do_vfs_ioctl+0xa4/0x630
[  871.591430]  ? security_file_ioctl+0x32/0x50
[  871.599954]  ksys_ioctl+0x60/0x90
[  871.606575]  __x64_sys_ioctl+0x16/0x20
[  871.614066]  do_syscall_64+0x5f/0x1a0
[  871.621379]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It also failed with the 1 vcpu test.

Thanks,
Alex
On Tue, Aug 13, 2019 at 03:14:17PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 22:37:14 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> > On 13/08/19 22:19, Sean Christopherson wrote:
> > > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > > that hashed to the same list but for a completely different gfn.
> > >
> > > Skipping the gfn check would be sort of a lightweight zap all in the
> > > sense that it would zap shadow pages that happened to collide with the
> > > target memslot/gfn but are otherwise unrelated.
> > >
> > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > >
> > > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > > 		continue;
>
> Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.

I've no idea if it would actually be interesting, but something to try would be to zap only emulated mmio SPTEs (in addition to the memslot).  If that test passes then I think it might indicate a problem with device enumeration as opposed to the mapping of the device itself ("think" and "might" being the operative words).

Patch attached.
On Tue, 13 Aug 2019 13:19:14 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> > On Tue, 13 Aug 2019 11:57:37 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > Could it be something with the gfn test:
> >
> > 	if (sp->gfn != gfn)
> > 		continue;
> >
> > If I remove it, I can't trigger the misbehavior.  If I log it, I only
> > get hits on VM boot/reboot and some of the gfns look suspiciously like
> > they could be the assigned GPU BARs and maybe MSI mappings:
> >
> > (sp->gfn) != (gfn)
>
> Hits at boot/reboot makes sense, memslots get zapped when userspace
> removes a memory region/slot, e.g. remaps BARs and whatnot.
>
> ...
>
> > Is this gfn optimization correct?  Overzealous?  Doesn't account
> > correctly for something about MMIO mappings?  Thanks,
>
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
>
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
>
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
>
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
>
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
>
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
>
> Of course it's also possible this code is wrong, :-)

Ok, fun day of trying to figure out which ranges are relevant, I've narrowed it down to all of these:

0xffe00
0xfee00
0xfec00
0xc1000
0x80a000
0x800000
0x100000

i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and 0x80000 can take the continue branch without seeing bad behavior in the VM.

The assigned GPU has BARs at GPAs:

0xc0000000-0xc0ffffff
0x800000000-0x808000000
0x808000000-0x809ffffff

And the assigned companion audio function is at GPA:

0xc1080000-0xc1083fff

Only one of those seems to align very well with a gfn base involved here.  The virtio ethernet has an mmio range at GPA 0x80a000000, otherwise I don't find any other I/O devices coincident with the gfns above.

I'm running the VM with 2MB hugepages, but I believe the issue still occurs with standard pages.  When run with standard pages I see more hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to the set above that cannot take the continue branch.  I don't know if that means anything.

Any further ideas what to look for?  Thanks,

Alex

PS - I see the posted workaround patch, I'll test that in the interim.
On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> Ok, fun day of trying to figure out which ranges are relevant, I've
> narrowed it down to all of these:
>
> 0xffe00
> 0xfee00
> 0xfec00

APIC and I/O APIC stuff

> 0xc1000

Assigned audio

> 0x80a000

?

> 0x800000

GPU BAR

> 0x100000

?

The APIC ranges are puzzling, I wouldn't expect their mappings to change.

> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
>
> The assigned GPU has BARs at GPAs:
>
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
>
> And the assigned companion audio function is at GPA:
>
> 0xc1080000-0xc1083fff
>
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.
>
> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
>
> Any further ideas what to look for?  Thanks,

Maybe try isolating which memslot removal causes problems?  E.g. flush the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush only the memslot's gfns.  Based on the log you sent a while back for gfn mismatches, I'm guessing the culprits are all GPU BARs, but it's probably worth confirming.  That might also explain why gfn == 0x80000 can take the continue branch, i.e. if removing the corresponding memslot is what's causing problems, then it's being flushed and not actually taking the continue path.

One other thought would be to force a call to kvm_flush_remote_tlbs(kvm), e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().  Maybe it's a case where there are no SPTEs for the memslot, but the TLB flush is needed for some reason.
On Thu, 15 Aug 2019 09:00:06 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> > Ok, fun day of trying to figure out which ranges are relevant, I've
> > narrowed it down to all of these:
> >
> > 0xffe00
> > 0xfee00
> > 0xfec00
>
> APIC and I/O APIC stuff
>
> > 0xc1000
>
> Assigned audio
>
> > 0x80a000
>
> ?
>
> > 0x800000
>
> GPU BAR
>
> > 0x100000
>
> ?
>
> The APIC ranges are puzzling, I wouldn't expect their mappings to change.
>
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> >
> > The assigned GPU has BARs at GPAs:
> >
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> >
> > And the assigned companion audio function is at GPA:
> >
> > 0xc1080000-0xc1083fff
> >
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> >
> > I'm running the VM with 2MB hugepages, but I believe the issue still
> > occurs with standard pages.  When run with standard pages I see more
> > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> > the set above that cannot take the continue branch.  I don't know if
> > that means anything.
> >
> > Any further ideas what to look for?  Thanks,
>
> Maybe try isolating which memslot removal causes problems?  E.g. flush
> the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
> only the memslot's gfns.  Based on the log you sent a while back for gfn
> mismatches, I'm guessing the culprits are all GPU BARs, but it's
> probably worth confirming.  That might also explain why gfn == 0x80000
> can take the continue branch, i.e. if removing the corresponding memslot
> is what's causing problems, then it's being flushed and not actually
> taking the continue path.

If I print out the memslot base_gfn, it seems pretty evident that only the assigned device mappings are triggering this branch.  The base_gfns exclusively include:

0x800000
0x808000
0xc0089

Where the first two clearly match the 64bit BARs and the last is the result of a page that we need to emulate within the BAR @0xc0000000 at offset 0x88000, so the base_gfn is the remaining direct mapping.

I don't know if this implies we're doing something wrong for assigned device slots, but maybe a more targeted workaround would be if we could specifically identify these slots, though there's no special registration of them versus other slots.  Did you have any non-device assignment test cases that took this branch when developing the series?

> One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> flush is needed for some reason.

This doesn't work.  Thanks,

Alex
On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote: > On Thu, 15 Aug 2019 09:00:06 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > If I print out the memslot base_gfn, it seems pretty evident that only > the assigned device mappings are triggering this branch. The base_gfns > exclusively include: > > 0x800000 > 0x808000 > 0xc0089 > > Where the first two clearly match the 64bit BARs and the last is the > result of a page that we need to emulate within the BAR @0xc0000000 at > offset 0x88000, so the base_gfn is the remaining direct mapping. That's consistent with my understanding of userspace, e.g. normal memory regions aren't deleted until the VM is shut down (barring hot unplug). > I don't know if this implies we're doing something wrong for assigned > device slots, but maybe a more targeted workaround would be if we could > specifically identify these slots, though there's no special > registration of them versus other slots. What is triggering the memslot removal/update? Is it possible that whatever action is occurring is modifying multiple memslots? E.g. KVM's memslot-only zapping is allowing the guest to access stale entries for the unzapped-but-related memslots, whereas the full zap does not. FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my ideas are nonsensical. > Did you have any non-device > assignment test cases that took this branch when developing the series? The primary testing was performance oriented, using a slightly modified version of a synthetic benchmark[1] from a previous series[2] that touched the memslot flushing flow. From a functional perspective, I highly doubt that test would have been able to expose an improper zapping bug. We do have some amount of coverage via kvm-unit-tests, as an EPT test was triggering a slab bug due to not actually zapping the collected SPTEs[3].
[1] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2 [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com [3] https://patchwork.kernel.org/patch/10899283/ > > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm), > > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap(). > > Maybe it's a case where there are no SPTEs for the memslot, but the TLB > > flush is needed for some reason. > > This doesn't work. Thanks, > > Alex
On Thu, 15 Aug 2019 12:25:31 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote: > > On Thu, 15 Aug 2019 09:00:06 -0700 > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > If I print out the memslot base_gfn, it seems pretty evident that only > > the assigned device mappings are triggering this branch. The base_gfns > > exclusively include: > > > > 0x800000 > > 0x808000 > > 0xc0089 > > > > Where the first two clearly match the 64bit BARs and the last is the > > result of a page that we need to emulate within the BAR @0xc0000000 at > > offset 0x88000, so the base_gfn is the remaining direct mapping. > > That's consistent with my understanding of userspace, e.g. normal memory > regions aren't deleted until the VM is shut down (barring hot unplug). > > > I don't know if this implies we're doing something wrong for assigned > > device slots, but maybe a more targeted workaround would be if we could > > specifically identify these slots, though there's no special > > registration of them versus other slots. > > What is triggering the memslot removal/update? Is it possible that > whatever action is occuring is modifying multiple memslots? E.g. KVM's > memslot-only zapping is allowing the guest to access stale entries for > the unzapped-but-related memslots, whereas the full zap does not. > > FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my > ideas are nonsensical. The memory bit in the PCI command register of config space for each device controls whether the device decoders are active for the MMIO BAR ranges. These get flipped as both the guest firmware and guest OS enumerate and assign resources to the PCI subsystem. Generally these are not further manipulated while the guest OS is running except for hotplug operations. 
The guest OS device driver will generally perform the final enable of these ranges and they'll remain enabled until the guest is rebooted. I recall somewhere in this thread you referenced reading the ROM as part of the performance testing of this series. The ROM has its own enable bit within the ROM BAR as the PCI spec allows devices to share decoders between the standard BARs and the ROM BAR. Enabling and disabling the enable bit in the ROM BAR should be very similar in memslot behavior to the overall memory enable bit for the other BARs within the device. Note that often the faults that I'm seeing occur a long time after BAR mappings are finalized, usually (not always) the VM boots to a fully functional desktop and it's only as I run various tests that the glitches start to appear. For instance, when I allowed sp->gfn 0xfec00 to take the continue branch, I got an OpenCL error. For either 0xffee00 or 0xc1000 I got graphics glitches, for example stray geometric artifacts flashed on the screen. For 0x100000 and 0x800000 I'd get a black screen or blank 3D graphics window. For 0x80a000 the VM froze (apparently). I can't say whether each of these is a consistent failure mode, I only tested to the point of determining whether a range generates an error. > > Did you have any non-device > > assignment test cases that took this branch when developing the series? > > The primary testing was performance oriented, using a slightly modified > version of a synthetic benchmark[1] from a previous series[2] that touched > the memslot flushing flow. From a functional perspective, I highly doubt > that test would have been able expose an improper zapping bug. :-\ It seems like there's probably some sort of inflection point where it becomes faster to zap all pages versus the overhead of walking every page in a memory slot, was that evaluated? Not sure if that's relevant here, but curious. Thanks, Alex
On 15/08/19 17:23, Alex Williamson wrote: > 0xffe00 > 0xfee00 > 0xfec00 > 0xc1000 > 0x80a000 > 0x800000 > 0x100000 > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > 0x80000 can take the continue branch without seeing bad behavior in the > VM. > > The assigned GPU has BARs at GPAs: > > 0xc0000000-0xc0ffffff > 0x800000000-0x808000000 > 0x808000000-0x809ffffff > > And the assigned companion audio function is at GPA: > > 0xc1080000-0xc1083fff > > Only one of those seems to align very well with a gfn base involved > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > otherwise I don't find any other I/O devices coincident with the gfns > above. The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR is also userspace-MMIO. It seems like the problem occurs when the sp->gfn you "continue over" includes a userspace-MMIO gfn. But since I have no better ideas right now, I'm going to apply the revert (we don't know for sure that it only happens with assigned devices). Paolo > I'm running the VM with 2MB hugepages, but I believe the issue still > occurs with standard pages. When run with standard pages I see more > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to > the set above that cannot take the continue branch. I don't know if > that means anything. > > Any further ideas what to look for? Thanks, > > Alex > > PS - I see the posted workaround patch, I'll test that in the interim. >
On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > On 15/08/19 17:23, Alex Williamson wrote: > > 0xffe00 > > 0xfee00 > > 0xfec00 > > 0xc1000 > > 0x80a000 > > 0x800000 > > 0x100000 > > > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > > 0x80000 can take the continue branch without seeing bad behavior in the > > VM. > > > > The assigned GPU has BARs at GPAs: > > > > 0xc0000000-0xc0ffffff > > 0x800000000-0x808000000 > > 0x808000000-0x809ffffff > > > > And the assigned companion audio function is at GPA: > > > > 0xc1080000-0xc1083fff > > > > Only one of those seems to align very well with a gfn base involved > > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > > otherwise I don't find any other I/O devices coincident with the gfns > > above. > > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 > includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR > is also userspace-MMIO. > > It seems like the problem occurs when the sp->gfn you "continue over" > includes a userspace-MMIO gfn. But since I have no better ideas right > now, I'm going to apply the revert (we don't know for sure that it only > happens with assigned devices). After many hours of staring, I've come to the conclusion that kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. a revert is definitely in order for 5.3 and stable. mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of the shadow ptes, e.g. for_each_valid_sp() will find page tables, page directories, etc... Passing in the raw gfns of the memslot doesn't work because the gfn isn't properly adjusted/aligned to match how KVM tracks gfns for shadow pages, e.g. removal of the companion audio memslot that occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find the shadow page table containing the relevant sptes. 
This is why Paolo's suggestion to remove slot_handle_all_level() on kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's accounting without actually zapping the relevant sptes. All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 were sensitive to (lack of) zapping. My theory is that zapping what were effectively random-but-interesting shadow pages cleaned things up enough to avoid noticeable badness. Alex, Can you please test the attached patch? It implements a very slimmed down version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes pointing at the memslot being removed, which was the original intent of kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it crashes the host. I'm hopeful it's correct, but given how broken the previous version was, I'm not exactly confident.
On Tue, 20 Aug 2019 13:03:19 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > > On 15/08/19 17:23, Alex Williamson wrote: > > > 0xffe00 > > > 0xfee00 > > > 0xfec00 > > > 0xc1000 > > > 0x80a000 > > > 0x800000 > > > 0x100000 > > > > > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > > > 0x80000 can take the continue branch without seeing bad behavior in the > > > VM. > > > > > > The assigned GPU has BARs at GPAs: > > > > > > 0xc0000000-0xc0ffffff > > > 0x800000000-0x808000000 > > > 0x808000000-0x809ffffff > > > > > > And the assigned companion audio function is at GPA: > > > > > > 0xc1080000-0xc1083fff > > > > > > Only one of those seems to align very well with a gfn base involved > > > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > > > otherwise I don't find any other I/O devices coincident with the gfns > > > above. > > > > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The > > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 > > includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR > > is also userspace-MMIO. > > > > It seems like the problem occurs when the sp->gfn you "continue over" > > includes a userspace-MMIO gfn. But since I have no better ideas right > > now, I'm going to apply the revert (we don't know for sure that it only > > happens with assigned devices). > > After many hours of staring, I've come to the conclusion that > kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. > a revert is definitely in order for 5.3 and stable. > > mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of > the shadow ptes, e.g. for_each_valid_sp() will find page tables, page > directories, etc... Passing in the raw gfns of the memslot doesn't work > because the gfn isn't properly adjusted/aligned to match how KVM tracks > gfns for shadow pages, e.g. 
removal of the companion audio memslot that > occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find > the shadow page table containing the relevant sptes. > > This is why Paolo's suggestion to remove slot_handle_all_level() on > kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's > accounting without actually zapping the relevant sptes. > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > were sensitive to (lack of) zapping. My theory is that zapping what were > effectively random-but-interesting shadow pages cleaned things up enough > to avoid noticeable badness. > > > Alex, > > Can you please test the attached patch? It implements a very slimmed down > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > pointing at the memslot being removed, which was the original intent of > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > crashes the host. I'm hopeful it's correct, but given how broken the > previous version was, I'm not exactly confident. It doesn't crash the host, but the guest is not happy, failing to boot the desktop in one case and triggering errors in the guest w/o even running test programs in another case. Seems like it might be worse than previous. Thanks, Alex
On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > On Tue, 20 Aug 2019 13:03:19 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > were sensitive to (lack of) zapping. My theory is that zapping what were > > effectively random-but-interesting shadow pages cleaned things up enough > > to avoid noticeable badness. > > > > > > Alex, > > > > Can you please test the attached patch? It implements a very slimmed down > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > pointing at the memslot being removed, which was the original intent of > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > crashes the host. I'm hopeful it's correct, but given how broken the > > previous version was, I'm not exactly confident. > > It doesn't crash the host, but the guest is not happy, failing to boot > the desktop in one case and triggering errors in the guest w/o even > running test programs in another case. Seems like it might be worse > than previous. Thanks, Hrm, I'm back to being completely flummoxed. Would you be able to generate a trace of all events/kvmmmu, using the latest patch? I'd like to rule out a stupid code bug if it's not too much trouble.
On Tue, 20 Aug 2019 14:02:45 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > > On Tue, 20 Aug 2019 13:03:19 -0700 > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > > were sensitive to (lack of) zapping. My theory is that zapping what were > > > effectively random-but-interesting shadow pages cleaned things up enough > > > to avoid noticeable badness. > > > > > > > > > Alex, > > > > > > Can you please test the attached patch? It implements a very slimmed down > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > > pointing at the memslot being removed, which was the original intent of > > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > > crashes the host. I'm hopeful it's correct, but given how broken the > > > previous version was, I'm not exactly confident. > > > > It doesn't crash the host, but the guest is not happy, failing to boot > > the desktop in one case and triggering errors in the guest w/o even > > running test programs in another case. Seems like it might be worse > > than previous. Thanks, > > Hrm, I'm back to being completely flummoxed. > > Would you be able to generate a trace of all events/kvmmmu, using the > latest patch? I'd like to rule out a stupid code bug if it's not too > much trouble. I tried to simplify the patch, making it closer to zap_all, so I removed the max_level calculation and exclusion based on s->role.level, as well as the gfn range filtering. For good measure I even removed the sp->root_count test, so any sp not marked invalid is zapped. This works, and I can also add back the sp->root_count test and things remain working. From there I added back the gfn range test, but I left out the gfn_mask because I'm not doing the level filtering and I think(?) this is just another optimization. 
So essentially I only add: if (sp->gfn < slot->base_gfn || sp->gfn > (slot->base_gfn + slot->npages - 1)) continue; Not only does this not work, the host will sometimes oops: [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 808.555065] #PF: supervisor read access in kernel mode [ 808.565326] #PF: error_code(0x0000) - not-present page [ 808.575588] PGD 0 P4D 0 [ 808.580649] Oops: 0000 [#1] SMP PTI [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 [ 808.789499] Call Trace: [ 808.794399] drop_spte+0x77/0xa0 [kvm] [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] [ 808.863971] 
kvm_vm_ioctl+0x59a/0x940 [kvm] [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 [ 808.880846] ? __seccomp_filter+0x7a/0x680 [ 808.889028] do_vfs_ioctl+0xa4/0x630 [ 808.896168] ? security_file_ioctl+0x32/0x50 [ 808.904695] ksys_ioctl+0x60/0x90 [ 808.911316] __x64_sys_ioctl+0x16/0x20 [ 808.918807] do_syscall_64+0x5f/0x1a0 [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 808.936209] RIP: 0033:0x7f28ebf2b0fb Does this suggest something is still fundamentally wrong with the premise of this change or have I done something stupid? Thanks, Alex
On Wed, 21 Aug 2019 13:08:59 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 20 Aug 2019 14:02:45 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > > > On Tue, 20 Aug 2019 13:03:19 -0700 > > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > > > were sensitive to (lack of) zapping. My theory is that zapping what were > > > > effectively random-but-interesting shadow pages cleaned things up enough > > > > to avoid noticeable badness. > > > > > > > > > > > > Alex, > > > > > > > > Can you please test the attached patch? It implements a very slimmed down > > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > > > pointing at the memslot being removed, which was the original intent of > > > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > > > crashes the host. I'm hopeful it's correct, but given how broken the > > > > previous version was, I'm not exactly confident. > > > > > > It doesn't crash the host, but the guest is not happy, failing to boot > > > the desktop in one case and triggering errors in the guest w/o even > > > running test programs in another case. Seems like it might be worse > > > than previous. Thanks, > > > > Hrm, I'm back to being completely flummoxed. > > > > Would you be able to generate a trace of all events/kvmmmu, using the > > latest patch? I'd like to rule out a stupid code bug if it's not too > > much trouble. > > I tried to simplify the patch, making it closer to zap_all, so I > removed the max_level calculation and exclusion based on s->role.level, > as well as the gfn range filtering. For good measure I even removed > the sp->root_count test, so any sp not marked invalid is zapped. This > works, and I can also add back the sp->root_count test and things > remain working. 
> > From there I added back the gfn range test, but I left out the gfn_mask > because I'm not doing the level filtering and I think(?) this is just > another optimization. So essentially I only add: > > if (sp->gfn < slot->base_gfn || > sp->gfn > (slot->base_gfn + slot->npages - 1)) > continue; > > Not only does this not work, the host will sometimes oops: > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 808.555065] #PF: supervisor read access in kernel mode > [ 808.565326] #PF: error_code(0x0000) - not-present page > [ 808.575588] PGD 0 P4D 0 > [ 808.580649] Oops: 0000 [#1] SMP PTI > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > [ 808.789499] Call Trace: > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > [ 808.821196] 
kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > [ 808.896168] ? security_file_ioctl+0x32/0x50 > [ 808.904695] ksys_ioctl+0x60/0x90 > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > [ 808.918807] do_syscall_64+0x5f/0x1a0 > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > Does this suggests something is still fundamentally wrong with the > premise of this change or have I done something stupid? Seems the latter, particularly your comment that we're looking for pages pointing to the gfn range to be removed, not just those in the range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or c0000-c0000, zapping zero or c0000, and I think one of the ones you were looking for c1080-c1083 is reduced to c1000-c1000 and therefore zaps sp->gfn c1000. I'll keep looking. Thanks, Alex
On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote: > On Tue, 20 Aug 2019 14:02:45 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > Hrm, I'm back to being completely flummoxed. > > > > Would you be able to generate a trace of all events/kvmmmu, using the > > latest patch? I'd like to rule out a stupid code bug if it's not too > > much trouble. > > I tried to simplify the patch, making it closer to zap_all, so I > removed the max_level calculation and exclusion based on s->role.level, > as well as the gfn range filtering. For good measure I even removed > the sp->root_count test, so any sp not marked invalid is zapped. This > works, and I can also add back the sp->root_count test and things > remain working. The sp->root_count test is necessary to avoid an infinite loop when not using tdp, e.g. EPT. The root sp can't be moved off active_mmu_pages and thus freed until all references to the root are gone. When not using tdp, the loop can restart in response to __kvm_mmu_prepare_zap_page(), even if the root sp page was already invalid. Functionally, skipping an invalid sp *shouldn't* change anything, as the actual ptes have been zapped, it's only the kernel resources that aren't yet freed. TL;DR: completely expected that removing the sp->root_count didn't affect the result. > From there I added back the gfn range test, but I left out the gfn_mask > because I'm not doing the level filtering and I think(?) this is just > another optimization. So essentially I only add: > > if (sp->gfn < slot->base_gfn || > sp->gfn > (slot->base_gfn + slot->npages - 1)) > continue; This doesn't work because of what sp->gfn contains. The shadow page is the page/table holding the spt entries, e.g. a level 1 sp refers to a page table for entries defining 4k pages. sp->gfn holds the base gfn of the range of gfns covered by the sp, which allows finding all parent sps for a given gfn. E.g. 
for level 1 sp, sp->gfn is found by masking the gfn of a given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so on up the chain. This is what the original code botched, as it would only find all sps for a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb aligned for 2mb pages, etc... This just happened to mostly work since memslots are usually aligned to such requirements. In the original code, removing the "sp->gfn != gfn" check caused KVM to zap random sps that just happened to hash to the same entry. Likewise, omitting the gfn filter in this code means everything gets zapped. What I don't understand is why zapping everything, or at least userspace MMIO addresses, is necessary when removing the GPU BAR memslot. The IOAPIC sp in particular makes no sense. AIUI, the IOAPIC is always emulated, i.e. not passed through, and its base/size is static (or at least static after init). Zapping an IOAPIC sp will send KVM down different paths but AFAIK the final outcome should be unaffected.
> Not only does this not work, the host will sometimes oops: > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 808.555065] #PF: supervisor read access in kernel mode > [ 808.565326] #PF: error_code(0x0000) - not-present page > [ 808.575588] PGD 0 P4D 0 > [ 808.580649] Oops: 0000 [#1] SMP PTI > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > [ 808.789499] Call Trace: > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > [ 808.872318] ? 
pagevec_lru_move_fn+0xb8/0xd0 > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > [ 808.896168] ? security_file_ioctl+0x32/0x50 > [ 808.904695] ksys_ioctl+0x60/0x90 > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > [ 808.918807] do_syscall_64+0x5f/0x1a0 > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > Does this suggests something is still fundamentally wrong with the > premise of this change or have I done something stupid? Thanks, The NULL pointer thing is unexpected, it means we have a spte, i.e. the actual entry seen/used by hardware, that KVM thinks is present but doesn't have the expected KVM tracking. I'll take a look, my understanding is that zapping shadow pages at random shouldn't cause problems.
On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > On Wed, 21 Aug 2019 13:08:59 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > > Does this suggests something is still fundamentally wrong with the > > premise of this change or have I done something stupid? > > Seems the latter, particularly your comment that we're looking for > pages pointing to the gfn range to be removed, not just those in the > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > c0000-c0000, zapping zero or c0000, and I think one of the ones you > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > zaps sp->gfn c1000. I'll keep looking. Thanks, Ya. As far as where to look, at this point I don't think it's an issue of incorrect zapping. Not because I'm 100% confident the zapping logic is correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and not being able to exclude APIC/IOAPIC ranges, suggest that the badness is 'fixed' by zapping seemingly unrelated sps. In other words, it may be fundamentally wrong to zap only the memslot being removed, but I really want to know why. History isn't helpful as KVM has always zapped all pages when removing a memslot (on x86), and the introduction of the per-memslot flush hook in commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") was all about refactoring generic code, and doesn't have any information on whether per-memslot flushing was actually tried for x86.
On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote: > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > > On Wed, 21 Aug 2019 13:08:59 -0600 > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > Does this suggests something is still fundamentally wrong with the > > > premise of this change or have I done something stupid? > > > > Seems the latter, particularly your comment that we're looking for > > pages pointing to the gfn range to be removed, not just those in the > > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > > c0000-c0000, zapping zero or c0000, and I think one of the ones you > > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > > zaps sp->gfn c1000. I'll keep looking. Thanks, > > Ya. As far as where to look, at this point I don't think it's an issue of > incorrect zapping. Not because I'm 100% confident the zapping logic is > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is > 'fixed' by zapping seemingly unrelated sps. > > In other words, it may be fundamentally wrong to zap only the memslot > being removed, but I really want to know why. History isn't helpful as > KVM has always zapped all pages when removing a memslot (on x86), and the > introduction of the per-memslot flush hook in commit > > 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") > > was all about refactoring generic code, and doesn't have any information > on whether per-memslot flushing was actually tried for x86. One semi-random idea would be to zap mmio pages, i.e. don't skip pages for which sp->mmio_cached is true, regardless of their gfn or level. I don't expect it to make a difference, but it would shrink the haystack on the off chance it does "fix" the issues.
On Thu, 22 Aug 2019 19:25:02 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote: > > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > > > On Wed, 21 Aug 2019 13:08:59 -0600 > > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > > Does this suggests something is still fundamentally wrong with the > > > > premise of this change or have I done something stupid? > > > > > > Seems the latter, particularly your comment that we're looking for > > > pages pointing to the gfn range to be removed, not just those in the > > > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > > > c0000-c0000, zapping zero or c0000, and I think one of the ones you > > > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > > > zaps sp->gfn c1000. I'll keep looking. Thanks, > > > > Ya. As far as where to look, at this point I don't think it's an issue of > > incorrect zapping. Not because I'm 100% confident the zapping logic is > > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and > > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is > > 'fixed' by zapping seemingly unrelated sps. > > > > In other words, it may be fundamentally wrong to zap only the memslot > > being removed, but I really want to know why. History isn't helpful as > > KVM has always zapped all pages when removing a memslot (on x86), and the > > introduction of the per-memslot flush hook in commit > > > > 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") > > > > was all about refactoring generic code, and doesn't have any information > > on whether per-memslot flushing was actually tried for x86. > > One semi-random idea would be to zap mmio pages, i.e. don't skip pages > for which sp->mmio_cached is true, regardless of their gfn or level. 
> I don't expect it to make a difference, but it would shrink the haystack on > the off chance it does "fix" the issues. You're right, it doesn't fix it. All of the logging I've been staring at suggests your patch does exactly what it's intended to do, but it still breaks GPU assignment in weird ways. I have no idea why. Thanks, Alex
> From: Sean Christopherson > Sent: Thursday, August 22, 2019 4:11 AM > [...] > > > From there I added back the gfn range test, but I left out the gfn_mask > > because I'm not doing the level filtering and I think(?) this is just > > another optimization. So essentially I only add: > > > > if (sp->gfn < slot->base_gfn || > > sp->gfn > (slot->base_gfn + slot->npages - 1)) > > continue; > > This doesn't work because of what sp->gfn contains. The shadow page is > the page/table holding the spt entries, e.g. a level 1 sp refers to a page > table for entries defining 4k pages. sp->gfn holds the base gfn of the > range of gfns covered by the sp, which allows finding all parent sps for a > given gfn. E.g. for a level 1 sp, sp->gfn is found by masking the gfn of a > given page by 0xFFFFFFFFFFFFFE00, a level 2 sp by 0xFFFFFFFFFFFC0000, and so > on up the chain. > > This is what the original code botched, as it would only find all sps for > a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb > aligned for 2mb pages, etc... This just happened to mostly work since > memslots are usually aligned to such requirements. In the original code, > removing the "sp->gfn != gfn" check caused KVM to zap random sps that just > happened to hash to the same entry. Likewise, omitting the gfn filter in > this code means everything gets zapped. > > What I don't understand is why zapping everything, or at least userspace > MMIO addresses, is necessary when removing the GPU BAR memslot. The > IOAPIC sp in particular makes no sense. AIUI, the IOAPIC is always > emulated, i.e. not passed through, and its base/size is static (or at > least static after init). Zapping an IOAPIC sp will send KVM down > different paths but AFAIK the final outcome should be unaffected. > I have the same feeling. The base address of the IOAPIC is described by ACPI MADT and is thus static. It is possible to reprogram the LAPIC base, but no commodity OS does it today.
btw, I found Alex mentioned gfn 0x100000 in an earlier post. That one is usually the start of high memory. If true, it becomes even more weird, not just an MMIO thing. Thanks Kevin
On Wed, Aug 21, 2019 at 01:10:43PM -0700, Sean Christopherson wrote: > On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote: > > Not only does this not work, the host will sometimes oops: > > > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > > [ 808.555065] #PF: supervisor read access in kernel mode > > [ 808.565326] #PF: error_code(0x0000) - not-present page > > [ 808.575588] PGD 0 P4D 0 > > [ 808.580649] Oops: 0000 [#1] SMP PTI > > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > > [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > > [ 808.789499] Call Trace: > > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > > [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > > [ 
808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > > [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 > > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > > [ 808.896168] ? security_file_ioctl+0x32/0x50 > > [ 808.904695] ksys_ioctl+0x60/0x90 > > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > > [ 808.918807] do_syscall_64+0x5f/0x1a0 > > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > > > Does this suggests something is still fundamentally wrong with the > > premise of this change or have I done something stupid? Thanks, > > The NULL pointer thing is unexpected, it means we have a spte, i.e. the > actual entry seen/used by hardware, that KVM thinks is present but doesn't > have the expected KVM tracking. I'll take a look, my understanding is > that zapping shadow pages at random shouldn't cause problems. The NULL pointer dereference is expected given the flawed implementation, i.e. there isn't another bug lurking for that particular problem. The issue isn't zapping random sptes, but rather that the flawed logic leaves dangling sptes. When a different action, e.g. zapping all memslots, triggers zapping of the dangling spte(s), gfn_to_rmap() attempts to find the corresponding memslot and hits the above BUG because the memslot no longer exists. On the flip side, not hitting that condition provides additional confidence in the reworked flow, i.e. proves to some degree that it's zapping all sptes in the to-be-removed memslot.
/cast <thread necromancy> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > > It seems like the problem occurs when the sp->gfn you "continue over" > > includes a userspace-MMIO gfn. But since I have no better ideas right > > now, I'm going to apply the revert (we don't know for sure that it only > > happens with assigned devices). > > After many hours of staring, I've come to the conclusion that > kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. > a revert is definitely in order for 5.3 and stable. Whelp, past me was wrong. The patch wasn't completely broken, as the rmap zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was sufficient to satisfy removal of the memslot. I.e. zapping leaf PTEs (including large pages) should prevent referencing the old memslot, the fact that zapping upper level shadow pages was broken was irrelevant because there should be no need to zap non-leaf PTEs. The one thing that sticks out as potentially concerning is passing %false for @lock_flush_tlb. Dropping mmu_lock during slot_handle_level_range() without flushing would allow a vCPU to create and use a new entry while a different vCPU has the old entry cached in its TLB. I think that could even happen with a single vCPU if the memslot is deleted by a helper task, and the zapped page was a large page that was fractured into small pages when inserted into the TLB. 
TL;DR: Assuming no other bugs in the kernel, this should be sufficient if the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of > the shadow ptes, e.g. for_each_valid_sp() will find page tables, page > directories, etc... Passing in the raw gfns of the memslot doesn't work > because the gfn isn't properly adjusted/aligned to match how KVM tracks > gfns for shadow pages, e.g. removal of the companion audio memslot that > occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find > the shadow page table containing the relevant sptes. > > This is why Paolo's suggestion to remove slot_handle_all_level() on > kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's > accounting without actually zapping the relevant sptes. This is just straight up wrong, zapping the rmaps does zap the leaf sptes. The BUG() occurs because gfn_to_rmap() works on the _new_ memslots instance, and if a memslot is being deleted there is no memslot for the gfn, hence the NULL pointer bug when mmu_page_zap_pte() attempts to zap a PTE. Zapping the rmaps (leaf/last PTEs) first "fixes" the issue by making it so that mmu_page_zap_pte() will never see a present PTE for a non-existent memslot. I don't think any of this explains the pass-through GPU issue. But, we have a few use cases where zapping the entire MMU is undesirable, so I'm going to retry upstreaming this patch with per-VM opt-in. I wanted to set the record straight for posterity before doing so.
On 26.06.20 19:32, Sean Christopherson wrote: > /cast <thread necromancy> > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: [...] > I don't think any of this explains the pass-through GPU issue. But, we > have a few use cases where zapping the entire MMU is undesirable, so I'm > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > set the record straight for posterity before doing so. Hey Sean, Did you ever get around to upstream or rework the zap optimization? The way I read current upstream, a memslot change still always wipes all SPTEs, not only the ones that were changed. Thanks, Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Thu, Oct 20, 2022, Alexander Graf wrote: > On 26.06.20 19:32, Sean Christopherson wrote: > > /cast <thread necromancy> > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > > [...] > > > I don't think any of this explains the pass-through GPU issue. But, we > > have a few use cases where zapping the entire MMU is undesirable, so I'm > > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > > set the record straight for posterity before doing so. > > Hey Sean, > > Did you ever get around to upstream or rework the zap optimization? The way > I read current upstream, a memslot change still always wipes all SPTEs, not > only the ones that were changed. Nope, I've more or less given up hope on zapping only the deleted/moved memslot. TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very much a special case. Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?
On 20.10.22 22:37, Sean Christopherson wrote: > On Thu, Oct 20, 2022, Alexander Graf wrote: >> On 26.06.20 19:32, Sean Christopherson wrote: >>> /cast <thread necromancy> >>> >>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: >> [...] >> >>> I don't think any of this explains the pass-through GPU issue. But, we >>> have a few use cases where zapping the entire MMU is undesirable, so I'm >>> going to retry upstreaming this patch as with per-VM opt-in. I wanted to >>> set the record straight for posterity before doing so. >> Hey Sean, >> >> Did you ever get around to upstream or rework the zap optimization? The way >> I read current upstream, a memslot change still always wipes all SPTEs, not >> only the ones that were changed. > Nope, I've more or less given up hope on zapping only the deleted/moved memslot. > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very > much a special case. > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? Yeah, we're looking at adding support for the Hyper-V VSM extensions which Windows uses to implement Credential Guard. With that, the guest gets access to hypercalls that allow it to set reduced permissions for arbitrary gfns. To ensure that user space has full visibility into those for live migration, memory slots to model access would be a great fit. But it means we'd do ~100k memslot modifications on boot. Alex
On Thu, Oct 20, 2022, Alexander Graf wrote: > > On 20.10.22 22:37, Sean Christopherson wrote: > > On Thu, Oct 20, 2022, Alexander Graf wrote: > > > On 26.06.20 19:32, Sean Christopherson wrote: > > > > /cast <thread necromancy> > > > > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > > > [...] > > > > > > > I don't think any of this explains the pass-through GPU issue. But, we > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm > > > > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > > > > set the record straight for posterity before doing so. > > > Hey Sean, > > > > > > Did you ever get around to upstream or rework the zap optimization? The way > > > I read current upstream, a memslot change still always wipes all SPTEs, not > > > only the ones that were changed. > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot. > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very > > much a special case. > > > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? > > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which > Windows uses to implement Credential Guard. With that, the guest gets access > to hypercalls that allow it to set reduced permissions for arbitrary gfns. > To ensure that user space has full visibility into those for live migration, > memory slots to model access would be a great fit. But it means we'd do > ~100k memslot modifications on boot. Oof. 100k memslot updates is going to be painful irrespective of flushing. And memslots (in their current form) won't work if the guest can drop executable permissions. 
Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve the "KVM flushes everything on memslot deletion" problem, I think we should instead properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing userspace to delete the memslot. Commit 75d61fbcf563 ("KVM: set_memory_region: Disallow changing read-only attribute later") was just a quick-and-dirty fix; there's no fundamental problem that makes it impossible (or even all that difficult) to support toggling permissions. The ABI would be that KVM only guarantees the new permissions take effect when the ioctl() returns, i.e. KVM doesn't need to ensure there are no writable SPTEs when the memslot is installed, just that there are no writable SPTEs before userspace regains control. E.g. sans sanity checking and whatnot, I think x86 support would be something like:

@@ -12669,9 +12667,16 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
	 * MOVE/DELETE: The old mappings will already have been cleaned up by
	 * kvm_arch_flush_shadow_memslot().
	 */
-	if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
+	if (change != KVM_MR_FLAGS_ONLY)
		return;

+	if ((old_flags ^ new_flags) & KVM_MEM_READONLY) {
+		if ((new_flags & KVM_MEM_READONLY) &&
+		    kvm_mmu_slot_write_protect(kvm, new))
+			kvm_arch_flush_remote_tlbs_memslot(kvm, new);
+		return;
+	}
+
	/*
	 * READONLY and non-flags changes were filtered out above, and the only
	 * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty
Hey Sean, On 21.10.22 21:40, Sean Christopherson wrote: > > On Thu, Oct 20, 2022, Alexander Graf wrote: >> On 20.10.22 22:37, Sean Christopherson wrote: >>> On Thu, Oct 20, 2022, Alexander Graf wrote: >>>> On 26.06.20 19:32, Sean Christopherson wrote: >>>>> /cast <thread necromancy> >>>>> >>>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: >>>> [...] >>>> >>>>> I don't think any of this explains the pass-through GPU issue. But, we >>>>> have a few use cases where zapping the entire MMU is undesirable, so I'm >>>>> going to retry upstreaming this patch as with per-VM opt-in. I wanted to >>>>> set the record straight for posterity before doing so. >>>> Hey Sean, >>>> >>>> Did you ever get around to upstream or rework the zap optimization? The way >>>> I read current upstream, a memslot change still always wipes all SPTEs, not >>>> only the ones that were changed. >>> Nope, I've more or less given up hope on zapping only the deleted/moved memslot. >>> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very >>> much a special case. >>> >>> Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? >> >> Yeah, we're looking at adding support for the Hyper-V VSM extensions which >> Windows uses to implement Credential Guard. With that, the guest gets access >> to hypercalls that allow it to set reduced permissions for arbitrary gfns. >> To ensure that user space has full visibility into those for live migration, >> memory slots to model access would be a great fit. But it means we'd do >> ~100k memslot modifications on boot. > Oof. 100k memslot updates is going to be painful irrespective of flushing. And > memslots (in their current form) won't work if the guest can drop executable > permissions. 
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve > the "KVM flushes everything on memslot deletion", I think we should instead > properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing > userspace to delete the memslot. Commit 75d61fbcf563 ("KVM: set_memory_region: That would be a cute acceleration for the case where we have to change permissions for a full slot. Unfortunately, the bulk of the changes are slot splits. Let me explain with numbers from a 1 vcpu, 8GB Windows Server 2019 boot:

GFN permission modification requests: 46294
Unique GFNs: 21200

That means on boot, we start off with a few huge memslots for guest RAM. Then down the road, we need to change permissions for individual pages inside these larger regions. The obvious option for that is a memslot split - delete, create, create, create. Now we have 2 large memslots and 1 that only spans a single page. Later in the boot process, Windows sometimes also toggles permissions for pages that it already split off earlier. That's the case we can optimize with the modify optimization you described in the previous email. But that's only about half the requests. The other half are memslot split requests. We already built a prototype implementation of an atomic memslot update ioctl that allows us to keep other vCPUs running while we do the delete/create/create/create operation. But even with that, we see up to 30 min boot times for larger guests that most of the time are stuck in zapping pages. I guess we have 2 options to make this viable:

1) Optimize memslot splits + modifications to a point where they're fast enough
2) Add a different, faster mechanism on top of memslots for page granular permission bits

Also sorry for not posting the underlying credguard and atomic memslot patches yet. I wanted to kick off this conversation before sending them out - they're still too raw for upstream review atm :).
Thanks, Alex
On Mon, Oct 24, 2022, Alexander Graf wrote: > Hey Sean, > > On 21.10.22 21:40, Sean Christopherson wrote: > > > > On Thu, Oct 20, 2022, Alexander Graf wrote: > > > On 20.10.22 22:37, Sean Christopherson wrote: > > > > On Thu, Oct 20, 2022, Alexander Graf wrote: > > > > > On 26.06.20 19:32, Sean Christopherson wrote: > > > > > > /cast <thread necromancy> > > > > > > > > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > > > > > [...] > > > > > > > > > > > I don't think any of this explains the pass-through GPU issue. But, we > > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm > > > > > > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > > > > > > set the record straight for posterity before doing so. > > > > > Hey Sean, > > > > > > > > > > Did you ever get around to upstream or rework the zap optimization? The way > > > > > I read current upstream, a memslot change still always wipes all SPTEs, not > > > > > only the ones that were changed. > > > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot. > > > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very > > > > much a special case. > > > > > > > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? > > > > > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which > > > Windows uses to implement Credential Guard. With that, the guest gets access > > > to hypercalls that allow it to set reduced permissions for arbitrary gfns. > > > To ensure that user space has full visibility into those for live migration, > > > memory slots to model access would be a great fit. But it means we'd do > > > ~100k memslot modifications on boot. > > Oof. 100k memslot updates is going to be painful irrespective of flushing. And > > memslots (in their current form) won't work if the guest can drop executable > > permissions. 
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve > > the "KVM flushes everything on memslot deletion", I think we should instead > > properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing > > userspace to delete the memslot. Commit 75d61fbcf563 ("KVM: set_memory_region: > > > That would be a cute acceleration for the case where we have to change > permissions for a full slot. Unfortunately, the bulk of the changes are slot > splits. Ah, right, the guest will be operating on per-page granularity. > We already built a prototype implementation of an atomic memslot update > ioctl that allows us to keep other vCPUs running while we do the > delete/create/create/create operation. Please weigh in with your use case on a relevant upstream discussion regarding "atomic" memslot updates[*]. I suspect we'll end up with a different solution for this use case (see below), but we should at least capture all potential use cases and ideas for modifying memslots without pausing vCPUs. [*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com > But even with that, we see up to 30 min boot times for larger guests that > most of the time are stuck in zapping pages. Out of curiosity, did you measure runtime performance? I would expect some amount of runtime overhead as well due to fragmenting memslots to that degree. > I guess we have 2 options to make this viable: > > 1) Optimize memslot splits + modifications to a point where they're fast > enough > 2) Add a different, faster mechanism on top of memslots for page granular > permission bits #2 crossed my mind as well. This is actually nearly identical to the confidential VM use case, where KVM needs to handle guest-initiated conversions of memory between "private" and "shared" on a per-page granularity.
The proposed solution for that is indeed a layer on top of memslots[*], which we arrived at in no small part because splitting memslots was going to be a bottleneck. Extending the proposed mem_attr_array to support additional state should be quite easy. The framework is all there, KVM just needs a few extra flag values, e.g.

KVM_MEM_ATTR_SHARED	BIT(0)
KVM_MEM_ATTR_READONLY	BIT(1)
KVM_MEM_ATTR_NOEXEC	BIT(2)

and then new ioctls to expose the functionality to userspace. Actually, if we want to go this route, it might even make sense to define a new generic MEM_ATTR ioctl() right away instead of repurposing KVM_MEMORY_ENCRYPT_(UN)REG_REGION for the private vs. shared use case. [*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@linux.intel.com
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b6f362a988fd..ba96bf9cf706 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5619,7 +5619,38 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 					struct kvm_memory_slot *slot,
 					struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_invalidate_zap_all_pages(kvm);
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+	unsigned long i;
+	bool flush;
+	gfn_t gfn;
+
+	spin_lock(&kvm->mmu_lock);
+
+	if (list_empty(&kvm->arch.active_mmu_pages))
+		goto out_unlock;
+
+	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+
+	for (i = 0; i < slot->npages; i++) {
+		gfn = slot->base_gfn + i;
+
+		for_each_valid_sp(kvm, sp, gfn) {
+			if (sp->gfn != gfn)
+				continue;
+
+			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		}
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+			flush = false;
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+
+out_unlock:
+	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs that actually belong to the memslot being removed. This improves performance, especially when the deleted memslot has only a few shadow entries, or even no entries. E.g. a microbenchmark to access regular memory while concurrently reading PCI ROM to trigger memslot deletion showed a 5% improvement in throughput. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> --- arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-)