
[v2,11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot

Message ID 20190205210137.1377-11-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Remove fast invalidate mechanism

Commit Message

Sean Christopherson Feb. 5, 2019, 9:01 p.m. UTC
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
that actually belong to the memslot being removed.  This improves
performance, especially when the deleted memslot has only a few shadow
entries, or even no entries.  E.g. a microbenchmark to access regular
memory while concurrently reading PCI ROM to trigger memslot deletion
showed a 5% improvement in throughput.

Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)
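
The patch body isn't inlined in this archive, only the diffstat.  Pieced
together from the fragments quoted in the replies below, the new handler
reads approximately as follows (a reconstruction for readability, not
the authoritative diff):

	static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
				struct kvm_memory_slot *slot,
				struct kvm_page_track_notifier_node *node)
	{
		struct kvm_mmu_page *sp;
		LIST_HEAD(invalid_list);
		unsigned long i;
		bool flush;
		gfn_t gfn;

		spin_lock(&kvm->mmu_lock);

		if (list_empty(&kvm->arch.active_mmu_pages))
			goto out_unlock;

		/* zap the rmaps (leaf sptes) for every gfn in the memslot */
		flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);

		/* zap shadow pages whose sp->gfn falls in the memslot */
		for (i = 0; i < slot->npages; i++) {
			gfn = slot->base_gfn + i;

			for_each_valid_sp(kvm, sp, gfn) {
				if (sp->gfn != gfn)
					continue;

				kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
				flush = true;
			}
			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
				kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
				flush = false;
				cond_resched_lock(&kvm->mmu_lock);
			}
		}
		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

	out_unlock:
		spin_unlock(&kvm->mmu_lock);
	}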

Comments

Alex Williamson Aug. 13, 2019, 4:04 p.m. UTC | #1
On Tue,  5 Feb 2019 13:01:21 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> that actually belong to the memslot being removed.  This improves
> performance, especially when the deleted memslot has only a few shadow
> entries, or even no entries.  E.g. a microbenchmark to access regular
> memory while concurrently reading PCI ROM to trigger memslot deletion
> showed a 5% improvement in throughput.
> 
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)

A number of vfio users are reporting VM instability issues since v5.1;
some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
only the relevant pages when removing a memslot"), which I've confirmed
via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
re-verified on current 5.3-rc4 using the below patch to toggle the
broken behavior.

My reproducer is a Windows 10 VM with assigned GeForce GPU running a
variety of tests, including FurMark and PassMark Performance Test.
With the code enabled as it exists upstream currently, PassMark will
generally introduce graphics glitches or hangs.  Sometimes it's
necessary to reboot the VM to see these issues.  Flipping the 0/1 in
the below patch appears to resolve the issue.

I'd appreciate any insights into further debugging this block of code
so that we can fix this regression.  Thanks,

Alex

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..6a77306940f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
+#if 0
+	kvm_mmu_zap_all(kvm);
+#else
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	unsigned long i;
@@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+#endif
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
Sean Christopherson Aug. 13, 2019, 5:04 p.m. UTC | #2
On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> On Tue,  5 Feb 2019 13:01:21 -0800
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > that actually belong to the memslot being removed.  This improves
> > performance, especially when the deleted memslot has only a few shadow
> > entries, or even no entries.  E.g. a microbenchmark to access regular
> > memory while concurrently reading PCI ROM to trigger memslot deletion
> > showed a 5% improvement in throughput.
> > 
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> A number of vfio users are reporting VM instability issues since v5.1;
> some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> only the relevant pages when removing a memslot"), which I've confirmed
> via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> re-verified on current 5.3-rc4 using the below patch to toggle the
> broken behavior.
> 
> My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> variety of tests, including FurMark and PassMark Performance Test.
> With the code enabled as it exists upstream currently, PassMark will
> generally introduce graphics glitches or hangs.  Sometimes it's
> necessary to reboot the VM to see these issues.

As in, the issue only shows up when the VM is rebooted?  Just want to
double check that that's not a typo.

> Flipping the 0/1 in the below patch appears to resolve the issue.
> 
> I'd appreciate any insights into further debugging this block of code
> so that we can fix this regression.  Thanks,

If it's not too painful to reproduce, I'd say start by determining whether
it's a problem with the basic logic or if the cond_resched_lock() handling
is wrong.  I.e. comment/ifdef out this chunk:

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			flush = false;
			cond_resched_lock(&kvm->mmu_lock);
		}


> 
> Alex
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..6a77306940f7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  			struct kvm_memory_slot *slot,
>  			struct kvm_page_track_notifier_node *node)
>  {
> +#if 0
> +	kvm_mmu_zap_all(kvm);
> +#else
>  	struct kvm_mmu_page *sp;
>  	LIST_HEAD(invalid_list);
>  	unsigned long i;
> @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> +#endif
>  }
>  
>  void kvm_mmu_init_vm(struct kvm *kvm)
Alex Williamson Aug. 13, 2019, 5:57 p.m. UTC | #3
On Tue, 13 Aug 2019 10:04:41 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> > On Tue,  5 Feb 2019 13:01:21 -0800
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >   
> > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > that actually belong to the memslot being removed.  This improves
> > > performance, especially when the deleted memslot has only a few shadow
> > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > showed a 5% improvement in throughput.
> > > 
> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > >  1 file changed, 32 insertions(+), 1 deletion(-)  
> > 
> > A number of vfio users are reporting VM instability issues since v5.1;
> > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > only the relevant pages when removing a memslot"), which I've confirmed
> > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > re-verified on current 5.3-rc4 using the below patch to toggle the
> > broken behavior.
> > 
> > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > variety of tests, including FurMark and PassMark Performance Test.
> > With the code enabled as it exists upstream currently, PassMark will
> > generally introduce graphics glitches or hangs.  Sometimes it's
> > necessary to reboot the VM to see these issues.  
> 
> As in, the issue only shows up when the VM is rebooted?  Just want to
> double check that that's not a typo.

No, it can occur on the first boot as well; it's just that the recipe
to induce a failure is not well understood and manifests itself in
different ways.  I generally run the tests, then if it still hasn't
reproduced, I reboot the VM a couple times, running a couple apps in
between to try to trigger/notice bad behavior.

> > Flipping the 0/1 in the below patch appears to resolve the issue.
> > 
> > I'd appreciate any insights into further debugging this block of code
> > so that we can fix this regression.  Thanks,  
> 
> If it's not too painful to reproduce, I'd say start by determining whether
> it's a problem with the basic logic or if the cond_resched_lock() handling
> is wrong.  I.e. comment/ifdef out this chunk:
> 
> 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> 			flush = false;
> 			cond_resched_lock(&kvm->mmu_lock);
> 		}

If anything, removing this chunk seems to make things worse.  Thanks,

Alex
Alex Williamson Aug. 13, 2019, 7:33 p.m. UTC | #4
On Tue, 13 Aug 2019 11:57:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 13 Aug 2019 10:04:41 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:  
> > > On Tue,  5 Feb 2019 13:01:21 -0800
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >     
> > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > > that actually belong to the memslot being removed.  This improves
> > > > performance, especially when the deleted memslot has only a few shadow
> > > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > > showed a 5% improvement in throughput.
> > > > 
> > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 32 insertions(+), 1 deletion(-)    
> > > 
> > > A number of vfio users are reporting VM instability issues since v5.1;
> > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > > only the relevant pages when removing a memslot"), which I've confirmed
> > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > > re-verified on current 5.3-rc4 using the below patch to toggle the
> > > broken behavior.
> > > 
> > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > > variety of tests, including FurMark and PassMark Performance Test.
> > > With the code enabled as it exists upstream currently, PassMark will
> > > generally introduce graphics glitches or hangs.  Sometimes it's
> > > necessary to reboot the VM to see these issues.    
> > 
> > As in, the issue only shows up when the VM is rebooted?  Just want to
> > double check that that's not a typo.  
> 
> No, it can occur on the first boot as well; it's just that the recipe
> to induce a failure is not well understood and manifests itself in
> different ways.  I generally run the tests, then if it still hasn't
> reproduced, I reboot the VM a couple times, running a couple apps in
> between to try to trigger/notice bad behavior.
> 
> > > Flipping the 0/1 in the below patch appears to resolve the issue.
> > > 
> > > I'd appreciate any insights into further debugging this block of code
> > > so that we can fix this regression.  Thanks,    
> > 
> > If it's not too painful to reproduce, I'd say start by determining whether
> > it's a problem with the basic logic or if the cond_resched_lock() handling
> > is wrong.  I.e. comment/ifdef out this chunk:
> > 
> > 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> > 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> > 			flush = false;
> > 			cond_resched_lock(&kvm->mmu_lock);
> > 		}  
> 
> If anything, removing this chunk seems to make things worse.

Could it be something with the gfn test:

                        if (sp->gfn != gfn)
                                continue;

If I remove it, I can't trigger the misbehavior.  If I log it, I only
get hits on VM boot/reboot and some of the gfns look suspiciously like
they could be the assigned GPU BARs and maybe MSI mappings:

               (sp->gfn) != (gfn)
[   71.829450] gfn fec00 != c02c4
[   71.835554] gfn ffe00 != c046f
[   71.841664] gfn 0 != c0720
[   71.847084] gfn 0 != c0720
[   71.852489] gfn 0 != c0720
[   71.857899] gfn 0 != c0720
[   71.863306] gfn 0 != c0720
[   71.868717] gfn 0 != c0720
[   71.874122] gfn 0 != c0720
[   71.879531] gfn 0 != c0720
[   71.884937] gfn 0 != c0720
[   71.890349] gfn 0 != c0720
[   71.895757] gfn 0 != c0720
[   71.901163] gfn 0 != c0720
[   71.906569] gfn 0 != c0720
[   71.911980] gfn 0 != c0720
[   71.917387] gfn 0 != c0720
[   71.922808] gfn fee00 != c0edc
[   71.928915] gfn fee00 != c0edc
[   71.935018] gfn fee00 != c0edc
[   71.941730] gfn c1000 != 8002d7
[   71.948039] gfn 80000 != 8006d5
[   71.954328] gfn 80000 != 8006d5
[   71.960600] gfn 80000 != 8006d5
[   71.966874] gfn 80000 != 8006d5
[   71.992272] gfn 0 != c0720
[   71.997683] gfn 0 != c0720
[   72.003725] gfn 80000 != 8006d5
[   72.044333] gfn 0 != c0720
[   72.049743] gfn 0 != c0720
[   72.055846] gfn 80000 != 8006d5
[   72.177341] gfn ffe00 != c046f
[   72.183453] gfn 0 != c0720
[   72.188864] gfn 0 != c0720
[   72.194290] gfn 0 != c0720
[   72.200308] gfn 80000 != 8006d5
[   82.539023] gfn fec00 != c02c4
[   82.545142] gfn 40000 != c0377
[   82.551343] gfn ffe00 != c046f
[   82.557466] gfn 100000 != c066f
[   82.563839] gfn 800000 != c06ec
[   82.570133] gfn 800000 != c06ec
[   82.576408] gfn 0 != c0720
[   82.581850] gfn 0 != c0720
[   82.587275] gfn 0 != c0720
[   82.592685] gfn 0 != c0720
[   82.598131] gfn 0 != c0720
[   82.603552] gfn 0 != c0720
[   82.608978] gfn 0 != c0720
[   82.614419] gfn 0 != c0720
[   82.619844] gfn 0 != c0720
[   82.625291] gfn 0 != c0720
[   82.630791] gfn 0 != c0720
[   82.636208] gfn 0 != c0720
[   82.641635] gfn 80a000 != c085e
[   82.647929] gfn fee00 != c0edc
[   82.654062] gfn fee00 != c0edc
[   82.660504] gfn 100000 != c066f
[   82.666800] gfn 0 != c0720
[   82.672211] gfn 0 != c0720
[   82.677635] gfn 0 != c0720
[   82.683060] gfn 0 != c0720
[   82.689209] gfn c1000 != 8002d7
[   82.695501] gfn 80000 != 8006d5
[   82.701796] gfn 80000 != 8006d5
[   82.708092] gfn 100000 != 80099b
[   82.714547] gfn 0 != 800a4c
[   82.720154] gfn 0 != 800a4c
[   82.725752] gfn 0 != 800a4c
[   82.731370] gfn 0 != 800a4c
[   82.738705] gfn 100000 != 80099b
[   82.745201] gfn 0 != 800a4c
[   82.750793] gfn 0 != 800a4c
[   82.756381] gfn 0 != 800a4c
[   82.761979] gfn 0 != 800a4c
[   82.768122] gfn 100000 != 8083a4
[   82.774605] gfn 0 != 8094aa
[   82.780196] gfn 0 != 8094aa
[   82.785796] gfn 0 != 8094aa
[   82.791412] gfn 0 != 8094aa
[   82.797523] gfn 100000 != 8083a4
[   82.803977] gfn 0 != 8094aa
[   82.809576] gfn 0 != 8094aa
[   82.815193] gfn 0 != 8094aa
[   82.820809] gfn 0 != 8094aa

(GPU has a BAR mapped at 0x80000000)

Is this gfn optimization correct?  Overzealous?  Doesn't account
correctly for something about MMIO mappings?  Thanks,

Alex
Sean Christopherson Aug. 13, 2019, 8:19 p.m. UTC | #5
On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 11:57:37 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:

> Could it be something with the gfn test:
> 
>                         if (sp->gfn != gfn)
>                                 continue;
> 
> If I remove it, I can't trigger the misbehavior.  If I log it, I only
> get hits on VM boot/reboot and some of the gfns look suspiciously like
> they could be the assigned GPU BARs and maybe MSI mappings:
> 
>                (sp->gfn) != (gfn)

Hits at boot/reboot make sense; memslots get zapped when userspace
removes a memory region/slot, e.g. remaps BARs and whatnot.

...
 
> Is this gfn optimization correct?  Overzealous?  Doesn't account
> correctly for something about MMIO mappings?  Thanks,

Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
all entries for a given gfn.  The sp->gfn check is there to skip entries
that hashed to the same list but for a completely different gfn.

Skipping the gfn check would be sort of a lightweight zap all in the
sense that it would zap shadow pages that happened to collide with the
target memslot/gfn but are otherwise unrelated.
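
For reference, the walk being discussed has roughly this shape
(paraphrased from mmu.c of this era; take the exact macro body as
approximate):

	/*
	 * Shadow pages are hashed by sp->gfn, so one bucket can hold
	 * pages for unrelated gfns; the caller's "sp->gfn != gfn" check
	 * is what filters the hash collisions back out.
	 */
	#define for_each_valid_sp(_kvm, _sp, _gfn)				\
		hlist_for_each_entry(_sp,					\
		  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)],	\
		  hash_link)							\
			if ((_sp)->role.invalid) {				\
			} else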

What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:

	if (sp->gfn != gfn && sp->gfn != 0x80000)
		continue;

If that doesn't work, it might be worth trying other gfns to see if you
can pinpoint which sp is being zapped as collateral damage.

It's possible there is a pre-existing bug somewhere else that was being
hidden because KVM was effectively zapping all SPTEs during (re)boot,
and the hash collision is also hiding the bug by zapping the stale entry.

Of course it's also possible this code is wrong, :-)
Paolo Bonzini Aug. 13, 2019, 8:37 p.m. UTC | #6
On 13/08/19 22:19, Sean Christopherson wrote:
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)

Also, can you reproduce it with one vCPU?  This could (though not really
100%) distinguish a missing invalidation from a race condition.

Do we even need the call to slot_handle_all_level?  The rmap update
should be done already by kvm_mmu_prepare_zap_page (via
kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).

Alex, can you replace it with just "flush = false;"?

Thanks,

Paolo
Alex Williamson Aug. 13, 2019, 9:14 p.m. UTC | #7
On Tue, 13 Aug 2019 22:37:14 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 22:19, Sean Christopherson wrote:
> > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > that hashed to the same list but for a completely different gfn.
> > 
> > Skipping the gfn check would be sort of a lightweight zap all in the
> > sense that it would zap shadow pages that happened to collide with the
> > target memslot/gfn but are otherwise unrelated.
> > 
> > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > 
> > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > 		continue;

Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.
 
> > If that doesn't work, it might be worth trying other gfns to see if you
> > can pinpoint which sp is being zapped as collateral damage.
> > 
> > It's possible there is a pre-existing bug somewhere else that was being
> > hidden because KVM was effectively zapping all SPTEs during (re)boot,
> > and the hash collision is also hiding the bug by zapping the stale entry.
> > 
> > Of course it's also possible this code is wrong, :-)  
> 
> Also, can you reproduce it with one vCPU?  This could (though not really
> 100%) distinguish a missing invalidation from a race condition.

That's a pretty big change; I'll give it a shot, but I'm not sure how
conclusive it would be.

> Do we even need the call to slot_handle_all_level?  The rmap update
> should be done already by kvm_mmu_prepare_zap_page (via
> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> 
> Alex, can you replace it with just "flush = false;"?

Replace the continue w/ flush = false?  I'm not clear on this
suggestion.  Thanks,

Alex
Paolo Bonzini Aug. 13, 2019, 9:15 p.m. UTC | #8
On 13/08/19 23:14, Alex Williamson wrote:
>> Do we even need the call to slot_handle_all_level?  The rmap update
>> should be done already by kvm_mmu_prepare_zap_page (via
>> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
>>
>> Alex, can you replace it with just "flush = false;"?
> Replace the continue w/ flush = false?  I'm not clear on this
> suggestion.  Thanks,

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..382b3ee303e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
 	if (list_empty(&kvm->arch.active_mmu_pages))
 		goto out_unlock;

-	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+	flush = false;

 	for (i = 0; i < slot->npages; i++) {
 		gfn = slot->base_gfn + i;
Alex Williamson Aug. 13, 2019, 10:10 p.m. UTC | #9
On Tue, 13 Aug 2019 23:15:38 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 23:14, Alex Williamson wrote:
> >> Do we even need the call to slot_handle_all_level?  The rmap update
> >> should be done already by kvm_mmu_prepare_zap_page (via
> >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> >>
> >> Alex, can you replace it with just "flush = false;"?  
> > Replace the continue w/ flush = false?  I'm not clear on this
> > suggestion.  Thanks,  
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..382b3ee303e3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
>  	if (list_empty(&kvm->arch.active_mmu_pages))
>  		goto out_unlock;
> 
> -	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
> +	flush = false;
> 
>  	for (i = 0; i < slot->npages; i++) {
>  		gfn = slot->base_gfn + i;
> 

Things are not happy with that; it managed to crash the host:

[  871.244789] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  871.258680] #PF: supervisor read access in kernel mode
[  871.268939] #PF: error_code(0x0000) - not-present page
[  871.279202] PGD 0 P4D 0 
[  871.284264] Oops: 0000 [#1] SMP PTI
[  871.291232] CPU: 3 PID: 1954 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  871.304264] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  871.322523] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  871.332085] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  871.369560] RSP: 0018:ffffc10300a0bb18 EFLAGS: 00010202
[  871.379995] RAX: 00000000000c1040 RBX: ffffc103007cd000 RCX: 0000000000000014
[  871.394243] RDX: ffff9b3bbff18130 RSI: 00000000000c1080 RDI: 0000000000000004
[  871.408491] RBP: ffff9b3bced15400 R08: ffff9b3b94a90008 R09: ffff9b3b94a90000
[  871.422739] R10: ffff9b3b94a90168 R11: 0000000000000004 R12: ffff9b3bbff18130
[  871.436986] R13: ffffc103007cd000 R14: 0000000000000000 R15: 00000000000c1000
[  871.451234] FS:  00007fc27b37d700(0000) GS:ffff9b3f9f4c0000(0000) knlGS:0000000000000000
[  871.467390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.478865] CR2: 0000000000000000 CR3: 000000003bb2a004 CR4: 00000000001626e0
[  871.493113] Call Trace:
[  871.498012]  drop_spte+0x77/0xa0 [kvm]
[  871.505500]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  871.514200]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  871.524809]  kvm_mmu_invalidate_zap_pages_in_memslot+0xb7/0x130 [kvm]
[  871.537670]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  871.547928]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  871.558191]  kvm_set_memory_region+0x26/0x40 [kvm]
[  871.567759]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  871.576106]  ? __seccomp_filter+0x7a/0x680
[  871.584287]  do_vfs_ioctl+0xa4/0x630
[  871.591430]  ? security_file_ioctl+0x32/0x50
[  871.599954]  ksys_ioctl+0x60/0x90
[  871.606575]  __x64_sys_ioctl+0x16/0x20
[  871.614066]  do_syscall_64+0x5f/0x1a0
[  871.621379]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It also failed with the 1 vcpu test.  Thanks,

Alex
Sean Christopherson Aug. 15, 2019, 2:46 p.m. UTC | #10
On Tue, Aug 13, 2019 at 03:14:17PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 22:37:14 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 13/08/19 22:19, Sean Christopherson wrote:
> > > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > > that hashed to the same list but for a completely different gfn.
> > > 
> > > Skipping the gfn check would be sort of a lightweight zap all in the
> > > sense that it would zap shadow pages that happened to collide with the
> > > target memslot/gfn but are otherwise unrelated.
> > > 
> > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > > 
> > > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > > 		continue;
> 
> Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.

I've no idea if it would actually be interesting, but something to try
would be to zap only emulated mmio SPTEs (in addition to the memslot).
If that test passes then I think it might indicate a problem with device
enumeration as opposed to the mapping of the device itself ("think" and
"might" being the operative words).  Patch attached.
Alex Williamson Aug. 15, 2019, 3:23 p.m. UTC | #11
On Tue, 13 Aug 2019 13:19:14 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> > On Tue, 13 Aug 2019 11:57:37 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:  
> 
> > Could it be something with the gfn test:
> > 
> >                         if (sp->gfn != gfn)
> >                                 continue;
> > 
> > If I remove it, I can't trigger the misbehavior.  If I log it, I only
> > get hits on VM boot/reboot and some of the gfns look suspiciously like
> > they could be the assigned GPU BARs and maybe MSI mappings:
> > 
> >                (sp->gfn) != (gfn)  
> 
> Hits at boot/reboot make sense; memslots get zapped when userspace
> removes a memory region/slot, e.g. remaps BARs and whatnot.
> 
> ...
>  
> > Is this gfn optimization correct?  Overzealous?  Doesn't account
> > correctly for something about MMIO mappings?  Thanks,  
> 
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)

Ok, after a fun day of trying to figure out which ranges are relevant, I've
narrowed it down to all of these:

0xffe00
0xfee00
0xfec00
0xc1000
0x80a000
0x800000
0x100000

i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
0x80000 can take the continue branch without seeing bad behavior in the
VM.

The assigned GPU has BARs at GPAs:

0xc0000000-0xc0ffffff
0x800000000-0x808000000
0x808000000-0x809ffffff

And the assigned companion audio function is at GPA:

0xc1080000-0xc1083fff

Only one of those seems to align very well with a gfn base involved
here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
otherwise I don't find any other I/O devices coincident with the gfns
above.

I'm running the VM with 2MB hugepages, but I believe the issue still
occurs with standard pages.  When run with standard pages I see more
hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
the set above that cannot take the continue branch.  I don't know if
that means anything.

Any further ideas what to look for?  Thanks,

Alex

PS - I see the posted workaround patch, I'll test that in the interim.
Sean Christopherson Aug. 15, 2019, 4 p.m. UTC | #12
On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> Ok, after a fun day of trying to figure out which ranges are relevant, I've
> narrowed it down to all of these:
> 
> 0xffe00
> 0xfee00
> 0xfec00

APIC and I/O APIC stuff

> 0xc1000

Assigned audio

> 0x80a000

?

> 0x800000

GPU BAR

> 0x100000

?

The APIC ranges are puzzling; I wouldn't expect their mappings to change.

> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
> 
> The assigned GPU has BARs at GPAs:
> 
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
> 
> And the assigned companion audio function is at GPA:
> 
> 0xc1080000-0xc1083fff
> 
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.
> 
> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
> 
> Any further ideas what to look for?  Thanks,

Maybe try isolating which memslot removal causes problems?  E.g. flush
the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
only the memslot's gfns.  Based on the log you sent a while back for gfn
mismatches, I'm guessing the culprits are all GPU BARs, but it's
probably worth confirming.  That might also explain why gfn == 0x80000
can take the continue branch, i.e. if removing the corresponding memslot
is what's causing problems, then it's being flushed and not actually
taking the continue path.

One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
Maybe it's a case where there are no SPTEs for the memslot, but the TLB
flush is needed for some reason.
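
Concretely, that experiment is a one-liner just ahead of the final call
(a sketch of the suggestion, not a posted patch):

	/* hypothetical: force the remote TLB flush even if no sptes
	 * were collected for this memslot */
	flush = true;
	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);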
Alex Williamson Aug. 15, 2019, 6:16 p.m. UTC | #13
On Thu, 15 Aug 2019 09:00:06 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> > Ok, after a fun day of trying to figure out which ranges are relevant, I've
> > narrowed it down to all of these:
> > 
> > 0xffe00
> > 0xfee00
> > 0xfec00  
> 
> APIC and I/O APIC stuff
> 
> > 0xc1000  
> 
> Assigned audio
> 
> > 0x80a000  
> 
> ?
> 
> > 0x800000  
> 
> GPU BAR
> 
> > 0x100000  
> 
> ?
> 
> The APIC ranges are puzzling; I wouldn't expect their mappings to change.
> 
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> > 
> > The assigned GPU has BARs at GPAs:
> > 
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> > 
> > And the assigned companion audio function is at GPA:
> > 
> > 0xc1080000-0xc1083fff
> > 
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> > 
> > I'm running the VM with 2MB hugepages, but I believe the issue still
> > occurs with standard pages.  When run with standard pages I see more
> > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> > the set above that cannot take the continue branch.  I don't know if
> > that means anything.
> > 
> > Any further ideas what to look for?  Thanks,  
> 
> Maybe try isolating which memslot removal causes problems?  E.g. flush
> the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
> only the memslot's gfns.  Based on the log you sent a while back for gfn
> mismatches, I'm guessing the culprits are all GPU BARs, but it's
> probably worth confirming.  That might also explain why gfn == 0x80000
> can take the continue branch, i.e. if removing the corresponding memslot
> is what's causing problems, then it's being flushed and not actually
> taking the continue path.

If I print out the memslot base_gfn, it seems pretty evident that only
the assigned device mappings are triggering this branch.  The base_gfns
exclusively include:

 0x800000
 0x808000
 0xc0089

Where the first two clearly match the 64bit BARs and the last is the
result of a page that we need to emulate within the BAR @0xc0000000 at
offset 0x88000, so the base_gfn is the remaining direct mapping.

I don't know if this implies we're doing something wrong for assigned
device slots, but maybe a more targeted workaround would be if we could
specifically identify these slots, though there's no special
registration of them versus other slots.  Did you have any non-device
assignment test cases that took this branch when developing the series?

> One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> flush is needed for some reason.

This doesn't work.  Thanks,

Alex
Sean Christopherson Aug. 15, 2019, 7:25 p.m. UTC | #14
On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> On Thu, 15 Aug 2019 09:00:06 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> If I print out the memslot base_gfn, it seems pretty evident that only
> the assigned device mappings are triggering this branch.  The base_gfns
> exclusively include:
> 
>  0x800000
>  0x808000
>  0xc0089
> 
> Where the first two clearly match the 64bit BARs and the last is the
> result of a page that we need to emulate within the BAR @0xc0000000 at
> offset 0x88000, so the base_gfn is the remaining direct mapping.

That's consistent with my understanding of userspace, e.g. normal memory
regions aren't deleted until the VM is shut down (barring hot unplug).

> I don't know if this implies we're doing something wrong for assigned
> device slots, but maybe a more targeted workaround would be if we could
> specifically identify these slots, though there's no special
> registration of them versus other slots.

What is triggering the memslot removal/update?  Is it possible that
whatever action is occurring is modifying multiple memslots?  E.g. KVM's
memslot-only zapping is allowing the guest to access stale entries for
the unzapped-but-related memslots, whereas the full zap does not.

FYI, my VFIO/GPU/PCI knowledge is abysmal; please speak up if any of my
ideas are nonsensical.

> Did you have any non-device
> assignment test cases that took this branch when developing the series?

The primary testing was performance oriented, using a slightly modified
version of a synthetic benchmark[1] from a previous series[2] that touched
the memslot flushing flow.  From a functional perspective, I highly doubt
that test would have been able to expose an improper zapping bug.

We do have some amount of coverage via kvm-unit-tests, as an EPT test was
triggering a slab bug due not actually zapping the collected SPTEs[3].

[1] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2
[2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[3] https://patchwork.kernel.org/patch/10899283/

> > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> > Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> > flush is needed for some reason.
> 
> This doesn't work.  Thanks,
> 
> Alex
Alex Williamson Aug. 15, 2019, 8:11 p.m. UTC | #15
On Thu, 15 Aug 2019 12:25:31 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> > On Thu, 15 Aug 2019 09:00:06 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > If I print out the memslot base_gfn, it seems pretty evident that only
> > the assigned device mappings are triggering this branch.  The base_gfns
> > exclusively include:
> > 
> >  0x800000
> >  0x808000
> >  0xc0089
> > 
> > Where the first two clearly match the 64bit BARs and the last is the
> > result of a page that we need to emulate within the BAR @0xc0000000 at
> > offset 0x88000, so the base_gfn is the remaining direct mapping.  
> 
> That's consistent with my understanding of userspace, e.g. normal memory
> regions aren't deleted until the VM is shut down (barring hot unplug).
> 
> > I don't know if this implies we're doing something wrong for assigned
> > device slots, but maybe a more targeted workaround would be if we could
> > specifically identify these slots, though there's no special
> > registration of them versus other slots.  
> 
> What is triggering the memslot removal/update?  Is it possible that
> whatever action is occurring is modifying multiple memslots?  E.g. KVM's
> memslot-only zapping is allowing the guest to access stale entries for
> the unzapped-but-related memslots, whereas the full zap does not.
> 
> FYI, my VFIO/GPU/PCI knowledge is abysmal; please speak up if any of my
> ideas are nonsensical.

The memory bit in the PCI command register of config space for each
device controls whether the device decoders are active for the MMIO BAR
ranges.  These get flipped as both the guest firmware and guest OS
enumerate and assign resources to the PCI subsystem.  Generally these
are not further manipulated while the guest OS is running except for
hotplug operations.  The guest OS device driver will generally perform
the final enable of these ranges and they'll remain enabled until the
guest is rebooted.

I recall somewhere in this thread you referenced reading the ROM as
part of the performance testing of this series.  The ROM has its own
enable bit within the ROM BAR as the PCI spec allows devices to share
decoders between the standard BARs and the ROM BAR.  Enabling and
disabling the enable bit in the ROM BAR should be very similar in
memslot behavior to the overall memory enable bit for the other BARs
within the device.
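
For illustration, the two decode-enable bits described above are the
following; the constants are the standard ones from <linux/pci_regs.h>
and the accessors are the stock kernel config-space helpers (this is a
sketch of what a guest driver does, not code from this thread):

	u16 cmd;
	u32 rom;

	/* BARs: bit 1 of the command register gates memory decode */
	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);

	/* ROM: its BAR carries a separate enable in bit 0 */
	pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &rom);
	pci_write_config_dword(pdev, PCI_ROM_ADDRESS, rom | PCI_ROM_ADDRESS_ENABLE);

In a QEMU/KVM guest, each such toggle surfaces to KVM as a memslot
update, which is what drives kvm_arch_flush_shadow_memslot() here.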

Note that often the faults that I'm seeing occur a long time after BAR
mappings are finalized; usually (not always) the VM boots to a fully
functional desktop and it's only as I run various tests that the glitches
start to appear.  For instance, when I allowed sp->gfn 0xfec00 to take
the continue branch, I got an OpenCL error.  For either 0xffee00 or
0xc1000 I got graphics glitches, for example stray geometric artifacts
flashed on the screen.  For 0x100000 and 0x800000 I'd get a black
screen or blank 3D graphics window.  For 0x80a000 the VM froze
(apparently).  I can't say whether each of these is a consistent failure
mode, I only tested to the point of determining whether a range
generates an error.

> > Did you have any non-device
> > assignment test cases that took this branch when developing the series?  
> 
> The primary testing was performance oriented, using a slightly modified
> version of a synthetic benchmark[1] from a previous series[2] that touched
> the memslot flushing flow.  From a functional perspective, I highly doubt
> that test would have been able to expose an improper zapping bug.

:-\

It seems like there's probably some sort of inflection point where it
becomes faster to zap all pages versus the overhead of walking every
page in a memory slot; was that evaluated?  Not sure if that's relevant
here, but curious.  Thanks,

Alex
Paolo Bonzini Aug. 19, 2019, 4:03 p.m. UTC | #16
On 15/08/19 17:23, Alex Williamson wrote:
> 0xffe00
> 0xfee00
> 0xfec00
> 0xc1000
> 0x80a000
> 0x800000
> 0x100000
> 
> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
> 
> The assigned GPU has BARs at GPAs:
> 
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
> 
> And the assigned companion audio function is at GPA:
> 
> 0xc1080000-0xc1083fff
> 
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.

The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
is also userspace-MMIO.

It seems like the problem occurs when the sp->gfn you "continue over"
includes a userspace-MMIO gfn.  But since I have no better ideas right
now, I'm going to apply the revert (we don't know for sure that it only
happens with assigned devices).

Paolo

> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
> 
> Any further ideas what to look for?  Thanks,
> 
> Alex
> 
> PS - I see the posted workaround patch, I'll test that in the interim.
>
Sean Christopherson Aug. 20, 2019, 8:03 p.m. UTC | #17
On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> On 15/08/19 17:23, Alex Williamson wrote:
> > 0xffe00
> > 0xfee00
> > 0xfec00
> > 0xc1000
> > 0x80a000
> > 0x800000
> > 0x100000
> > 
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> > 
> > The assigned GPU has BARs at GPAs:
> > 
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> > 
> > And the assigned companion audio function is at GPA:
> > 
> > 0xc1080000-0xc1083fff
> > 
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> 
> The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
> audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
> includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
> is also userspace-MMIO.
> 
> It seems like the problem occurs when the sp->gfn you "continue over"
> includes a userspace-MMIO gfn.  But since I have no better ideas right
> now, I'm going to apply the revert (we don't know for sure that it only
> happens with assigned devices).

After many hours of staring, I've come to the conclusion that
kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
a revert is definitely in order for 5.3 and stable.

mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
directories, etc...  Passing in the raw gfns of the memslot doesn't work
because the gfn isn't properly adjusted/aligned to match how KVM tracks
gfns for shadow pages, e.g. removal of the companion audio memslot that
occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
the shadow page table containing the relevant sptes.
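
That alignment is easy to sanity-check in isolation.  A minimal
userspace sketch of the arithmetic (the 9 bits per level and the 0xc1000
result match the example above; everything else is illustrative):

	#include <stdio.h>
	#include <stdint.h>

	typedef uint64_t gfn_t;

	/* a level-N page table covers 512^N gfns */
	static gfn_t sp_span(int level)
	{
		return 1ULL << (9 * level);
	}

	/* the gfn KVM records for the table, aligned to its span */
	static gfn_t sp_base(gfn_t gfn, int level)
	{
		return gfn & ~(sp_span(level) - 1);
	}

	int main(void)
	{
		/* the audio memslot spans gfns 0xc1080-0xc1083, but its
		 * 4KiB sptes live in a table recorded under gfn 0xc1000 */
		printf("%llx\n", (unsigned long long)sp_base(0xc1080, 1));
		return 0;
	}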

This is why Paolo's suggestion to remove slot_handle_all_level() on
kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
accounting without actually zapping the relevant sptes.

All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
were sensitive to (lack of) zapping.  My theory is that zapping what were
effectively random-but-interesting shadow pages cleaned things up enough
to avoid noticeable badness.


Alex,

Can you please test the attached patch?  It implements a very slimmed down
version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
pointing at the memslot being removed, which was the original intent of
kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
crashes the host.  I'm hopeful it's correct, but given how broken the
previous version was, I'm not exactly confident.
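
(The attachment itself isn't preserved in this archive.  Going by the
description here and by Alex's follow-ups below, its core was presumably
a kvm_mmu_zap_all()-style walk with a filter, roughly:

	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);
	int ign;

	spin_lock(&kvm->mmu_lock);
restart:
	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
		if (sp->role.invalid && sp->root_count)
			continue;
		/* the level and gfn-range filtering that Alex later
		 * strips out went here */
		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
			goto restart;
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);

Again, a hedged reconstruction for readability, not the attached patch.)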
Alex Williamson Aug. 20, 2019, 8:42 p.m. UTC | #18
On Tue, 20 Aug 2019 13:03:19 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > On 15/08/19 17:23, Alex Williamson wrote:
> > > 0xffe00
> > > 0xfee00
> > > 0xfec00
> > > 0xc1000
> > > 0x80a000
> > > 0x800000
> > > 0x100000
> > > 
> > > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > > 0x80000 can take the continue branch without seeing bad behavior in the
> > > VM.
> > > 
> > > The assigned GPU has BARs at GPAs:
> > > 
> > > 0xc0000000-0xc0ffffff
> > > 0x800000000-0x808000000
> > > 0x808000000-0x809ffffff
> > > 
> > > And the assigned companion audio function is at GPA:
> > > 
> > > 0xc1080000-0xc1083fff
> > > 
> > > Only one of those seems to align very well with a gfn base involved
> > > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > > otherwise I don't find any other I/O devices coincident with the gfns
> > > above.
> > 
> > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
> > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
> > includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
> > is also userspace-MMIO.
> > 
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
> 
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.
> 
> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
> 
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.
> 
> All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> were sensitive to (lack of) zapping.  My theory is that zapping what were
> effectively random-but-interesting shadow pages cleaned things up enough
> to avoid noticeable badness.
> 
> 
> Alex,
> 
> Can you please test the attached patch?  It implements a very slimmed down
> version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> pointing at the memslot being removed, which was the original intent of
> kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> crashes the host.  I'm hopeful it's correct, but given how broken the
> previous version was, I'm not exactly confident.

It doesn't crash the host, but the guest is not happy, failing to boot
the desktop in one case and triggering errors in the guest w/o even
running test programs in another case.  Seems like it might be worse
than previous.  Thanks,

Alex
Sean Christopherson Aug. 20, 2019, 9:02 p.m. UTC | #19
On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:
> On Tue, 20 Aug 2019 13:03:19 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > effectively random-but-interesting shadow pages cleaned things up enough
> > to avoid noticeable badness.
> > 
> > 
> > Alex,
> > 
> > Can you please test the attached patch?  It implements a very slimmed down
> > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > pointing at the memslot being removed, which was the original intent of
> > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > crashes the host.  I'm hopeful it's correct, but given how broken the
> > previous version was, I'm not exactly confident.
> 
> It doesn't crash the host, but the guest is not happy, failing to boot
> the desktop in one case and triggering errors in the guest w/o even
> running test programs in another case.  Seems like it might be worse
> than previous.  Thanks,

Hrm, I'm back to being completely flummoxed.

Would you be able to generate a trace of all events/kvmmmu, using the
latest patch?  I'd like to rule out a stupid code bug if it's not too
much trouble.
Alex Williamson Aug. 21, 2019, 7:08 p.m. UTC | #20
On Tue, 20 Aug 2019 14:02:45 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:
> > On Tue, 20 Aug 2019 13:03:19 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:  
> > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > > effectively random-but-interesting shadow pages cleaned things up enough
> > > to avoid noticeable badness.
> > > 
> > > 
> > > Alex,
> > > 
> > > Can you please test the attached patch?  It implements a very slimmed down
> > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > > pointing at the memslot being removed, which was the original intent of
> > > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > > crashes the host.  I'm hopeful it's correct, but given how broken the
> > > previous version was, I'm not exactly confident.  
> > 
> > It doesn't crash the host, but the guest is not happy, failing to boot
> > the desktop in one case and triggering errors in the guest w/o even
> > running test programs in another case.  Seems like it might be worse
> > than previous.  Thanks,  
> 
> Hrm, I'm back to being completely flummoxed.
> 
> Would you be able to generate a trace of all events/kvmmmu, using the
> latest patch?  I'd like to rule out a stupid code bug if it's not too
> much trouble.

I tried to simplify the patch, making it closer to zap_all, so I
removed the max_level calculation and exclusion based on sp->role.level,
as well as the gfn range filtering.  For good measure I even removed
the sp->root_count test, so any sp not marked invalid is zapped.  This
works, and I can also add back the sp->root_count test and things
remain working.

From there I added back the gfn range test, but I left out the gfn_mask
because I'm not doing the level filtering and I think(?) this is just
another optimization.  So essentially I only add:

	if (sp->gfn < slot->base_gfn ||
            sp->gfn > (slot->base_gfn + slot->npages - 1))
		continue;

Not only does this not work, the host will sometimes oops:

[  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  808.555065] #PF: supervisor read access in kernel mode
[  808.565326] #PF: error_code(0x0000) - not-present page
[  808.575588] PGD 0 P4D 0 
[  808.580649] Oops: 0000 [#1] SMP PTI
[  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
[  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
[  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
[  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
[  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
[  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
[  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
[  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
[  808.789499] Call Trace:
[  808.794399]  drop_spte+0x77/0xa0 [kvm]
[  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
[  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
[  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
[  808.880846]  ? __seccomp_filter+0x7a/0x680
[  808.889028]  do_vfs_ioctl+0xa4/0x630
[  808.896168]  ? security_file_ioctl+0x32/0x50
[  808.904695]  ksys_ioctl+0x60/0x90
[  808.911316]  __x64_sys_ioctl+0x16/0x20
[  808.918807]  do_syscall_64+0x5f/0x1a0
[  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  808.936209] RIP: 0033:0x7f28ebf2b0fb

Does this suggest something is still fundamentally wrong with the
premise of this change or have I done something stupid?  Thanks,

Alex
Alex Williamson Aug. 21, 2019, 7:35 p.m. UTC | #21
On Wed, 21 Aug 2019 13:08:59 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 20 Aug 2019 14:02:45 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:  
> > > On Tue, 20 Aug 2019 13:03:19 -0700
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:    
> > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > > > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > > > effectively random-but-interesting shadow pages cleaned things up enough
> > > > to avoid noticeable badness.
> > > > 
> > > > 
> > > > Alex,
> > > > 
> > > > Can you please test the attached patch?  It implements a very slimmed down
> > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > > > pointing at the memslot being removed, which was the original intent of
> > > > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > > > crashes the host.  I'm hopeful it's correct, but given how broken the
> > > > previous version was, I'm not exactly confident.    
> > > 
> > > It doesn't crash the host, but the guest is not happy, failing to boot
> > > the desktop in one case and triggering errors in the guest w/o even
> > > running test programs in another case.  Seems like it might be worse
> > > than previous.  Thanks,    
> > 
> > Hrm, I'm back to being completely flummoxed.
> > 
> > Would you be able to generate a trace of all events/kvmmmu, using the
> > latest patch?  I'd like to rule out a stupid code bug if it's not too
> > much trouble.  
> 
> I tried to simplify the patch, making it closer to zap_all, so I
> removed the max_level calculation and exclusion based on sp->role.level,
> as well as the gfn range filtering.  For good measure I even removed
> the sp->root_count test, so any sp not marked invalid is zapped.  This
> works, and I can also add back the sp->root_count test and things
> remain working.
> 
> From there I added back the gfn range test, but I left out the gfn_mask
> because I'm not doing the level filtering and I think(?) this is just
> another optimization.  So essentially I only add:
> 
> 	if (sp->gfn < slot->base_gfn ||
>             sp->gfn > (slot->base_gfn + slot->npages - 1))
> 		continue;
> 
> Not only does this not work, the host will sometimes oops:
> 
> [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  808.555065] #PF: supervisor read access in kernel mode
> [  808.565326] #PF: error_code(0x0000) - not-present page
> [  808.575588] PGD 0 P4D 0 
> [  808.580649] Oops: 0000 [#1] SMP PTI
> [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> [  808.789499] Call Trace:
> [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> [  808.880846]  ? __seccomp_filter+0x7a/0x680
> [  808.889028]  do_vfs_ioctl+0xa4/0x630
> [  808.896168]  ? security_file_ioctl+0x32/0x50
> [  808.904695]  ksys_ioctl+0x60/0x90
> [  808.911316]  __x64_sys_ioctl+0x16/0x20
> [  808.918807]  do_syscall_64+0x5f/0x1a0
> [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> 
> Does this suggest something is still fundamentally wrong with the
> premise of this change or have I done something stupid?

Seems the latter, particularly your comment that we're looking for
pages pointing to the gfn range to be removed, not just those in the
range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
c0000-c0000, zapping zero or c0000, and I think one of the ones you
were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
zaps sp->gfn c1000.  I'll keep looking.  Thanks,

Alex
Sean Christopherson Aug. 21, 2019, 8:10 p.m. UTC | #22
On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote:
> On Tue, 20 Aug 2019 14:02:45 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > Hrm, I'm back to being completely flummoxed.
> > 
> > Would you be able to generate a trace of all events/kvmmmu, using the
> > latest patch?  I'd like to rule out a stupid code bug if it's not too
> > much trouble.
> 
> I tried to simplify the patch, making it closer to zap_all, so I
> removed the max_level calculation and exclusion based on s->role.level,
> as well as the gfn range filtering.  For good measure I even removed
> the sp->root_count test, so any sp not marked invalid is zapped.  This
> works, and I can also add back the sp->root_count test and things
> remain working.

The sp->root_count test is necessary to avoid an infinite loop when not
using tdp, e.g. with EPT disabled.  The root sp can't be moved off
active_mmu_pages and
thus freed until all references to the root are gone.  When not using tdp,
the loop can restart in response to __kvm_mmu_prepare_zap_page(), even if
the root sp page was already invalid.

Functionally, skipping an invalid sp *shouldn't* change anything, as the
actual ptes have been zapped, it's only the kernel resources that aren't
yet freed.

TL;DR: completely expected that removing the sp->root_count didn't
affect the result.

> From there I added back the gfn range test, but I left out the gfn_mask
> because I'm not doing the level filtering and I think(?) this is just
> another optimization.  So essentially I only add:
> 
> 	if (sp->gfn < slot->base_gfn ||
>             sp->gfn > (slot->base_gfn + slot->npages - 1))
> 		continue;

This doesn't work because of what sp->gfn contains.  The shadow page is
the page/table holding the spt entries, e.g. a level 1 sp refers to a page
table for entries defining 4k pages.  sp->gfn holds the base gfn of the
range of gfns covered by the sp, which allows finding all parent sps for a
given gfn.  E.g. for level 1 sp, sp->gfn is found by masking the gfn of a
given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so
on up the chain.
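
In code, that masking is roughly the following (a sketch, not the
upstream helper; it assumes 9 gfn bits per paging level, i.e. 512-entry
page tables):

static inline gfn_t sp_base_gfn(gfn_t gfn, int level)
{
	/* Level 1 clears 9 bits (...FE00), level 2 clears 18, and so on. */
	return gfn & ~((1ULL << (9 * level)) - 1);
}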

This is what the original code botched, as it would only find all sps for
a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb
aligned for 2mb pages, etc...  This just happened to mostly work since
memslots are usually aligned to such requirements.  In the original code,
removing the "sp->gfn != gfn" check caused KVM to zap random sps that just
happened to hash to the same entry.  Likewise, omitting the gfn filter in
this code means everything gets zapped.
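
Put differently, a filter that finds every sp which can point into the
slot has to treat sp->gfn as the base of the region the sp covers,
along the lines of (sketch only, same 9-bits-per-level assumption as
above, not the exact code):

	gfn_t sp_last = sp->gfn + (1ULL << (9 * sp->role.level)) - 1;

	if (sp_last < slot->base_gfn ||
	    sp->gfn > slot->base_gfn + slot->npages - 1)
		continue;	/* can't hold sptes pointing into the slot */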

What I don't understand is why zapping everything, or at least userspace
MMIO addresses, is necessary when removing the GPU BAR memslot.  The
IOAPIC sp in particular makes no sense.  AIUI, the IOAPIC is always
emulated, i.e. not passed through, and its base/size is static (or at
least static after init).  Zapping an IOAPIC sp will send KVM down
different paths but AFAIK the final outcome should be unaffected.

> Not only does this not work, the host will sometimes oops:
> 
> [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  808.555065] #PF: supervisor read access in kernel mode
> [  808.565326] #PF: error_code(0x0000) - not-present page
> [  808.575588] PGD 0 P4D 0 
> [  808.580649] Oops: 0000 [#1] SMP PTI
> [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> [  808.789499] Call Trace:
> [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> [  808.880846]  ? __seccomp_filter+0x7a/0x680
> [  808.889028]  do_vfs_ioctl+0xa4/0x630
> [  808.896168]  ? security_file_ioctl+0x32/0x50
> [  808.904695]  ksys_ioctl+0x60/0x90
> [  808.911316]  __x64_sys_ioctl+0x16/0x20
> [  808.918807]  do_syscall_64+0x5f/0x1a0
> [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> 
> > Does this suggest something is still fundamentally wrong with the
> premise of this change or have I done something stupid?  Thanks,

The NULL pointer thing is unexpected, it means we have a spte, i.e. the
actual entry seen/used by hardware, that KVM thinks is present but doesn't
have the expected KVM tracking.  I'll take a look, my understanding is
that zapping shadow pages at random shouldn't cause problems.
Sean Christopherson Aug. 21, 2019, 8:30 p.m. UTC | #23
On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:
> On Wed, 21 Aug 2019 13:08:59 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> > Does this suggest something is still fundamentally wrong with the
> > premise of this change or have I done something stupid?
> 
> Seems the latter, particularly your comment that we're looking for
> pages pointing to the gfn range to be removed, not just those in the
> range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> c0000-c0000, zapping zero or c0000, and I think one of the ones you
> were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> zaps sp->gfn c1000.  I'll keep looking.  Thanks,

Ya.  As far as where to look, at this point I don't think it's an issue of
incorrect zapping.  Not because I'm 100% confident the zapping logic is
correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
'fixed' by zapping seemingly unrelated sps.

In other words, it may be fundamentally wrong to zap only the memslot
being removed, but I really want to know why.  History isn't helpful as
KVM has always zapped all pages when removing a memslot (on x86), and the
introduction of the per-memslot flush hook in commit

  2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")

was all about refactoring generic code, and doesn't have any information
on whether per-memslot flushing was actually tried for x86.
Sean Christopherson Aug. 23, 2019, 2:25 a.m. UTC | #24
On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote:
> On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:
> > On Wed, 21 Aug 2019 13:08:59 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > Does this suggest something is still fundamentally wrong with the
> > > premise of this change or have I done something stupid?
> > 
> > Seems the latter, particularly your comment that we're looking for
> > pages pointing to the gfn range to be removed, not just those in the
> > range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> > c0000-c0000, zapping zero or c0000, and I think one of the ones you
> > were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> > zaps sp->gfn c1000.  I'll keep looking.  Thanks,
> 
> Ya.  As far as where to look, at this point I don't think it's an issue of
> incorrect zapping.  Not because I'm 100% confident the zapping logic is
> correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
> not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
> 'fixed' by zapping seemingly unrelated sps.
> 
> In other words, it may be fundamentally wrong to zap only the memslot
> being removed, but I really want to know why.  History isn't helpful as
> KVM has always zapped all pages when removing a memslot (on x86), and the
> introduction of the per-memslot flush hook in commit
> 
>   2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> 
> was all about refactoring generic code, and doesn't have any information
> on whether per-memslot flushing was actually tried for x86.

One semi-random idea would be to zap mmio pages, i.e. don't skip pages
for which sp->mmio_cached is true, regardless of their gfn or level.  I
don't expect it to make a difference, but it would shrink the haystack on
the off chance it does "fix" the issues.
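
Concretely, something along the lines of a kvm_mmu_zap_mmio_sptes()
style walk (a sketch with mmu_lock assumed held, not a tested patch):

	struct kvm_mmu_page *sp, *nsp;
	LIST_HEAD(invalid_list);

	/* Zap every mmio-cached sp, regardless of its gfn or level. */
	list_for_each_entry_safe(sp, nsp, &kvm->arch.active_mmu_pages, link) {
		if (sp->mmio_cached)
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);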
Alex Williamson Aug. 23, 2019, 10:05 p.m. UTC | #25
On Thu, 22 Aug 2019 19:25:02 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote:
> > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:  
> > > On Wed, 21 Aug 2019 13:08:59 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:  
> > > > Does this suggest something is still fundamentally wrong with the
> > > > premise of this change or have I done something stupid?  
> > > 
> > > Seems the latter, particularly your comment that we're looking for
> > > pages pointing to the gfn range to be removed, not just those in the
> > > range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> > > c0000-c0000, zapping zero or c0000, and I think one of the ones you
> > > were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> > > zaps sp->gfn c1000.  I'll keep looking.  Thanks,  
> > 
> > Ya.  As far as where to look, at this point I don't think it's an issue of
> > incorrect zapping.  Not because I'm 100% confident the zapping logic is
> > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
> > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
> > 'fixed' by zapping seemingly unrelated sps.
> > 
> > In other words, it may be fundamentally wrong to zap only the memslot
> > being removed, but I really want to know why.  History isn't helpful as
> > KVM has always zapped all pages when removing a memslot (on x86), and the
> > introduction of the per-memslot flush hook in commit
> > 
> >   2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> > 
> > was all about refactoring generic code, and doesn't have any information
> > on whether per-memslot flushing was actually tried for x86.  
> 
> One semi-random idea would be to zap mmio pages, i.e. don't skip pages
> for which sp->mmio_cached is true, regardless of their gfn or level.  I
> don't expect it to make a difference, but it would shrink the haystack on
> the off chance it does "fix" the issues.

You're right, it doesn't fix it.  All of the logging I've been staring
at suggests your patch does exactly what it's intended to do, but it
still breaks GPU assignment in weird ways.  I have no idea why.  Thanks,

Alex
Tian, Kevin Aug. 26, 2019, 7:36 a.m. UTC | #26
> From: Sean Christopherson
> Sent: Thursday, August 22, 2019 4:11 AM
> 
[...]
> 
> > From there I added back the gfn range test, but I left out the gfn_mask
> > because I'm not doing the level filtering and I think(?) this is just
> > another optimization.  So essentially I only add:
> >
> > 	if (sp->gfn < slot->base_gfn ||
> >             sp->gfn > (slot->base_gfn + slot->npages - 1))
> > 		continue;
> 
> This doesn't work because of what sp->gfn contains.  The shadow page is
> the page/table holding the spt entries, e.g. a level 1 sp refers to a page
> table for entries defining 4k pages.  sp->gfn holds the base gfn of the
> range of gfns covered by the sp, which allows finding all parent sps for a
> given gfn.  E.g. for level 1 sp, sp->gfn is found by masking the gfn of a
> given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so
> on up the chain.
> 
> This is what the original code botched, as it would only find all sps for
> a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb
> aligned for 2mb pages, etc...  This just happened to mostly work since
> memslots are usually aligned to such requirements.  In the original code,
> removing the "sp->gfn != gfn" check caused KVM to zap random sps that just
> happened to hash to the same entry.  Likewise, omitting the gfn filter in
> this code means everything gets zapped.
> 
> What I don't understand is why zapping everything, or at least userspace
> MMIO addresses, is necessary when removing the GPU BAR memslot.  The
> IOAPIC sp in particular makes no sense.  AIUI, the IOAPIC is always
> emulated, i.e. not passed through, and its base/size is static (or at
> least static after init).  Zapping an IOAPIC sp will send KVM down
> different paths but AFAIK the final outcome should be unaffected.
> 

I have the same feeling. The base address of the IOAPIC is described by
the ACPI MADT and is thus static. It is possible to reprogram the LAPIC
base, but no commodity OS does that today.

btw, I found Alex mentioned gfn 0x100000 in an earlier post. That one is
usually the start of high memory. If true, it becomes even weirder, and
not just an MMIO thing.

Thanks
Kevin
Sean Christopherson Aug. 26, 2019, 2:56 p.m. UTC | #27
On Wed, Aug 21, 2019 at 01:10:43PM -0700, Sean Christopherson wrote:
> On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote:
> > Not only does this not work, the host will sometimes oops:
> > 
> > [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [  808.555065] #PF: supervisor read access in kernel mode
> > [  808.565326] #PF: error_code(0x0000) - not-present page
> > [  808.575588] PGD 0 P4D 0 
> > [  808.580649] Oops: 0000 [#1] SMP PTI
> > [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> > [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> > [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> > [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> > [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> > [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> > [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> > [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> > [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> > [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> > [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> > [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> > [  808.789499] Call Trace:
> > [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> > [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> > [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> > [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> > [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> > [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> > [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> > [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> > [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> > [  808.880846]  ? __seccomp_filter+0x7a/0x680
> > [  808.889028]  do_vfs_ioctl+0xa4/0x630
> > [  808.896168]  ? security_file_ioctl+0x32/0x50
> > [  808.904695]  ksys_ioctl+0x60/0x90
> > [  808.911316]  __x64_sys_ioctl+0x16/0x20
> > [  808.918807]  do_syscall_64+0x5f/0x1a0
> > [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> > 
> > Does this suggest something is still fundamentally wrong with the
> > premise of this change or have I done something stupid?  Thanks,
> 
> The NULL pointer thing is unexpected, it means we have a spte, i.e. the
> actual entry seen/used by hardware, that KVM thinks is present but doesn't
> have the expected KVM tracking.  I'll take a look, my understanding is
> that zapping shadow pages at random shouldn't cause problems.

The NULL pointer dereference is expected given the flawed implementation,
i.e. there isn't another bug lurking for that particular problem.  The
issue isn't zapping random sptes, but rather that the flawed logic leaves
dangling sptes.  When a different action, e.g. zapping all memslots,
triggers zapping of the dangling spte(s), gfn_to_rmap() attempts to find
the corresponding memslot and hits the above BUG because the memslot no
longer exists.
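
Roughly, the failing path is the following (a simplified sketch of
gfn_to_rmap(); the real code computes the rmap index in a level-aware
way):

	struct kvm_memory_slot *slot;

	slot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
	/* NULL for a gfn whose memslot was just deleted -> oops above. */
	return &slot->arch.rmap[level - 1][gfn - slot->base_gfn];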

On the flip side, not hitting that condition provides additional confidence
in the reworked flow, i.e. proves to some degree that it's zapping all
sptes in the to-be-removed memslot.
Sean Christopherson June 26, 2020, 5:32 p.m. UTC | #28
/cast <thread necromancy>

On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
> 
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.

Whelp, past me was wrong.  The patch wasn't completely broken, as the rmap
zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was sufficient
to satisfy removal of the memslot.  I.e. zapping leaf PTEs (including large
pages) should prevent referencing the old memslot, the fact that zapping
upper level shadow pages was broken was irrelevant because there should be
no need to zap non-leaf PTEs.

The one thing that sticks out as potentially concerning is passing %false
for @lock_flush_tlb.  Dropping mmu_lock during slot_handle_level_range()
without flushing would allow a vCPU to create and use a new entry while a
different vCPU has the old entry cached in its TLB.  I think that could
even happen with a single vCPU if the memslot is deleted by a helper task,
and the zapped page was a large page that was fractured into small pages
when inserted into the TLB.

TL;DR: Assuming no other bugs in the kernel, this should be sufficient if
the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
> 
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.

This is just straight up wrong, zapping the rmaps does zap the leaf sptes.
The BUG() occurs because gfn_to_rmap() works on the _new_ memslots instance,
and if a memslot is being deleted there is no memslot for the gfn, hence the
NULL pointer bug when mmu_page_zap_pte() attempts to zap a PTE.  Zapping the
rmaps (leaf/last PTEs) first "fixes" the issue by making it so that
mmu_page_zap_pte() will never see a present PTE for a non-existent memslot.

I don't think any of this explains the pass-through GPU issue.  But, we
have a few use cases where zapping the entire MMU is undesirable, so I'm
going to retry upstreaming this patch as a per-VM opt-in.  I wanted to
set the record straight for posterity before doing so.
Alexander Graf Oct. 20, 2022, 6:31 p.m. UTC | #29
On 26.06.20 19:32, Sean Christopherson wrote:
> /cast <thread necromancy>
> 
> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:

[...]

> I don't think any of this explains the pass-through GPU issue.  But, we
> have a few use cases where zapping the entire MMU is undesirable, so I'm
> going to retry upstreaming this patch as a per-VM opt-in. I wanted to
> set the record straight for posterity before doing so.

Hey Sean,

Did you ever get around to upstream or rework the zap optimization? The 
way I read current upstream, a memslot change still always wipes all 
SPTEs, not only the ones that were changed.


Thanks,

Alex




Sean Christopherson Oct. 20, 2022, 8:37 p.m. UTC | #30
On Thu, Oct 20, 2022, Alexander Graf wrote:
> On 26.06.20 19:32, Sean Christopherson wrote:
> > /cast <thread necromancy>
> > 
> > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> 
> [...]
> 
> > I don't think any of this explains the pass-through GPU issue.  But, we
> > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > going to retry upstreaming this patch as a per-VM opt-in.  I wanted to
> > set the record straight for posterity before doing so.
> 
> Hey Sean,
> 
> Did you ever get around to upstream or rework the zap optimization? The way
> I read current upstream, a memslot change still always wipes all SPTEs, not
> only the ones that were changed.

Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
much a special case.  

Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
Alexander Graf Oct. 20, 2022, 9:06 p.m. UTC | #31
On 20.10.22 22:37, Sean Christopherson wrote:
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 26.06.20 19:32, Sean Christopherson wrote:
>>> /cast <thread necromancy>
>>>
>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>> [...]
>>
>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>> going to retry upstreaming this patch as a per-VM opt-in. I wanted to
>>> set the record straight for posterity before doing so.
>> Hey Sean,
>>
>> Did you ever get around to upstream or rework the zap optimization? The way
>> I read current upstream, a memslot change still always wipes all SPTEs, not
>> only the ones that were changed.
> Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> much a special case.
>
> Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?


Yeah, we're looking at adding support for the Hyper-V VSM extensions 
which Windows uses to implement Credential Guard. With that, the guest 
gets access to hypercalls that allow it to set reduced permissions for 
arbitrary gfns. To ensure that user space has full visibility into those 
for live migration, memory slots to model access would be a great fit. 
But it means we'd do ~100k memslot modifications on boot.


Alex





Sean Christopherson Oct. 21, 2022, 7:40 p.m. UTC | #32
On Thu, Oct 20, 2022, Alexander Graf wrote:
> 
> On 20.10.22 22:37, Sean Christopherson wrote:
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > /cast <thread necromancy>
> > > > 
> > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > [...]
> > > 
> > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > going to retry upstreaming this patch as a per-VM opt-in.  I wanted to
> > > > set the record straight for posterity before doing so.
> > > Hey Sean,
> > > 
> > > Did you ever get around to upstream or rework the zap optimization? The way
> > > I read current upstream, a memslot change still always wipes all SPTEs, not
> > > only the ones that were changed.
> > Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> > much a special case.
> > 
> > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
> 
> 
> Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> Windows uses to implement Credential Guard. With that, the guest gets access
> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> To ensure that user space has full visibility into those for live migration,
> memory slots to model access would be a great fit. But it means we'd do
> ~100k memslot modifications on boot.

Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
memslots (in their current form) won't work if the guest can drop executable
permissions.

Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
the "KVM flushes everything on memslot deletion", I think we should instead
properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:
Disallow changing read-only attribute later") was just a quick-and-dirty fix,
there's no fundemental problem that makes it impossible (or even all that difficult)
to support toggling permissions.

The ABI would be that KVM only guarantees the new permissions take effect when
the ioctl() returns, i.e. KVM doesn't need to ensure there are no writable SPTEs
when the memslot is installed, just that there are no writable SPTEs before
userspace regains control.

E.g. sans sanity checking and whatnot, I think x86 support would be something like:

@@ -12669,9 +12667,16 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
         * MOVE/DELETE: The old mappings will already have been cleaned up by
         *              kvm_arch_flush_shadow_memslot().
         */
-       if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
+       if (change != KVM_MR_FLAGS_ONLY)
                return;
 
+       if ((old_flags ^ new_flags) & KVM_MEM_READONLY) {
+               if ((new_flags & KVM_MEM_READONLY) &&
+                   kvm_mmu_slot_write_protect(kvm, new))
+                       kvm_arch_flush_remote_tlbs_memslot(kvm, new);
+               return;
+       }
+
        /*
         * READONLY and non-flags changes were filtered out above, and the only
         * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty
Alexander Graf Oct. 24, 2022, 6:12 a.m. UTC | #33
Hey Sean,

On 21.10.22 21:40, Sean Christopherson wrote:
>
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 20.10.22 22:37, Sean Christopherson wrote:
>>> On Thu, Oct 20, 2022, Alexander Graf wrote:
>>>> On 26.06.20 19:32, Sean Christopherson wrote:
>>>>> /cast <thread necromancy>
>>>>>
>>>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>>>> [...]
>>>>
>>>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>>>> going to retry upstreaming this patch as a per-VM opt-in. I wanted to
>>>>> set the record straight for posterity before doing so.
>>>> Hey Sean,
>>>>
>>>> Did you ever get around to upstream or rework the zap optimization? The way
>>>> I read current upstream, a memslot change still always wipes all SPTEs, not
>>>> only the ones that were changed.
>>> Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
>>> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
>>> much a special case.
>>>
>>> Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
>>
>> Yeah, we're looking at adding support for the Hyper-V VSM extensions which
>> Windows uses to implement Credential Guard. With that, the guest gets access
>> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
>> To ensure that user space has full visibility into those for live migration,
>> memory slots to model access would be a great fit. But it means we'd do
>> ~100k memslot modifications on boot.
> Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> memslots (in their current form) won't work if the guest can drop executable
> permissions.
>
> Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> the "KVM flushes everything on memslot deletion", I think we should instead
> properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:


That would be a cute acceleration for the case where we have to change 
permissions for a full slot. Unfortunately, the bulk of the changes are 
slot splits. Let me explain with numbers from a 1 vcpu, 8GB Windows 
Server 2019 boot:

GFN permission modification requests: 46294
Unique GFNs: 21200

That means on boot, we start off with a few huge memslots for guest RAM. 
Then down the road, we need to change permissions for individual pages 
inside these larger regions. The obvious option for that is a memslot 
split - delete, create, create, create. Now we have 2 large memslots and 
1 that only spans a single page.
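
In userspace terms, one such split is roughly the below (error handling
omitted; slot ids, sizes and variable names are illustrative, only
KVM_SET_USER_MEMORY_REGION itself is real ABI):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Split [gpa, gpa + size) so the page at gpa + off becomes read-only. */
static void split_slot(int vm_fd, __u64 gpa, __u64 size, __u64 hva, __u64 off)
{
	struct kvm_userspace_memory_region r;

	/* memory_size == 0 deletes the original slot. */
	r = (struct kvm_userspace_memory_region){ .slot = 1 };
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

	/* Re-create the low part... */
	r = (struct kvm_userspace_memory_region){
		.slot = 1, .guest_phys_addr = gpa,
		.memory_size = off, .userspace_addr = hva,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

	/* ...the single read-only page... */
	r = (struct kvm_userspace_memory_region){
		.slot = 2, .flags = KVM_MEM_READONLY,
		.guest_phys_addr = gpa + off, .memory_size = 4096,
		.userspace_addr = hva + off,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

	/* ...and the high part. */
	r = (struct kvm_userspace_memory_region){
		.slot = 3, .guest_phys_addr = gpa + off + 4096,
		.memory_size = size - off - 4096,
		.userspace_addr = hva + off + 4096,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
}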

Later in the boot process, Windows then sometimes also toggles 
permissions for pages that it already split off earlier. That's the case 
we can optimize with the modify optimization you described in the 
previous email. But that's only about half the requests. The other half 
are memslot split requests.

We already built a prototype implementation of an atomic memslot update 
ioctl that allows us to keep other vCPUs running while we do the 
delete/create/create/create operation. But even with that, we see up to 
30 min boot times for larger guests that spend most of their time stuck 
zapping pages.

I guess we have 2 options to make this viable:

   1) Optimize memslot splits + modifications to a point where they're 
fast enough
   2) Add a different, faster mechanism on top of memslots for page 
granular permission bits

Also sorry for not posting the underlying credguard and atomic memslot 
patches yet. I wanted to kick off this conversation before sending them 
out - they're still too raw for upstream review atm :).


Thanks,

Alex




Sean Christopherson Oct. 24, 2022, 3:55 p.m. UTC | #34
On Mon, Oct 24, 2022, Alexander Graf wrote:
> Hey Sean,
> 
> On 21.10.22 21:40, Sean Christopherson wrote:
> > 
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 20.10.22 22:37, Sean Christopherson wrote:
> > > > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > > > /cast <thread necromancy>
> > > > > > 
> > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > > > [...]
> > > > > 
> > > > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > > > going to retry upstreaming this patch as a per-VM opt-in.  I wanted to
> > > > > > set the record straight for posterity before doing so.
> > > > > Hey Sean,
> > > > > 
> > > > > Did you ever get around to upstream or rework the zap optimization? The way
> > > > > I read current upstream, a memslot change still always wipes all SPTEs, not
> > > > > only the ones that were changed.
> > > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> > > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> > > > much a special case.
> > > > 
> > > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
> > > 
> > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> > > Windows uses to implement Credential Guard. With that, the guest gets access
> > > to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> > > To ensure that user space has full visibility into those for live migration,
> > > memory slots to model access would be a great fit. But it means we'd do
> > > ~100k memslot modifications on boot.
> > Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> > memslots (in their current form) won't work if the guest can drop executable
> > permissions.
> > 
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> > the "KVM flushes everything on memslot deletion", I think we should instead
> > properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> > userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:
> 
> 
> That would be a cute acceleration for the case where we have to change
> permissions for a full slot. Unfortunately, the bulk of the changes are slot
> splits.

Ah, right, the guest will be operating on per-page granularity.

> We already built a prototype implementation of an atomic memslot update
> ioctl that allows us to keep other vCPUs running while we do the
> delete/create/create/create operation.

Please weigh in with your use case on a relevant upstream discussion regarding
"atomic" memslot updates[*].  I suspect we'll end up with a different solution
for this use case (see below), but we should at least capture all potential use
cases and ideas for modifying memslots without pausing vCPUs.

[*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com

> But even with that, we see up to 30 min boot times for larger guests that
> spend most of their time stuck zapping pages.

Out of curiosity, did you measure runtime performance?  I would expect some amount
of runtime overhead as well due to fragmenting memslots to that degree.

> I guess we have 2 options to make this viable:
> 
>   1) Optimize memslot splits + modifications to a point where they're fast
> enough
>   2) Add a different, faster mechanism on top of memslots for page granular
> permission bits

#2 crossed my mind as well.  This is actually nearly identical to the confidential
VM use case, where KVM needs to handle guest-initiated conversions of memory between
"private" and "shared" on a per-page granularity.  The proposed solution for that
is indeed a layer on top of memslots[*], which we arrived at in no small part because
splitting memslots was going to be a bottleneck.

Extending the proposed mem_attr_array to support additional state should be quite
easy.  The framework is all there, KVM just needs a few extra flags values, e.g.

	KVM_MEM_ATTR_SHARED	BIT(0)
	KVM_MEM_ATTR_READONLY	BIT(1)
	KVM_MEM_ATTR_NOEXEC	BIT(2)

and then new ioctls to expose the functionality to userspace.  Actually, if we
want to go this route, it might even make sense to define new a generic MEM_ATTR
ioctl() right away instead of repurposing KVM_MEMORY_ENCRYPT_(UN)REG_REGION for
the private vs. shared use case.

[*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@linux.intel.com
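
For illustration, extending that could look something like the below
(hypothetical names, with ioctl plumbing and locking omitted; only the
xarray API itself is real):

#include <linux/bits.h>
#include <linux/kvm_types.h>
#include <linux/xarray.h>

#define KVM_MEM_ATTR_SHARED	BIT(0)
#define KVM_MEM_ATTR_READONLY	BIT(1)
#define KVM_MEM_ATTR_NOEXEC	BIT(2)

/* Record @attrs for every gfn in [start, end), keyed by gfn. */
static int kvm_set_mem_attr(struct xarray *mem_attr_array, gfn_t start,
			    gfn_t end, unsigned long attrs)
{
	gfn_t gfn;
	int r;

	for (gfn = start; gfn < end; gfn++) {
		r = xa_err(xa_store(mem_attr_array, gfn,
				    xa_mk_value(attrs), GFP_KERNEL));
		if (r)
			return r;
	}
	return 0;
}
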
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b6f362a988fd..ba96bf9cf706 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5619,7 +5619,38 @@  static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_invalidate_zap_all_pages(kvm);
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+	unsigned long i;
+	bool flush;
+	gfn_t gfn;
+
+	spin_lock(&kvm->mmu_lock);
+
+	if (list_empty(&kvm->arch.active_mmu_pages))
+		goto out_unlock;
+
+	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+
+	for (i = 0; i < slot->npages; i++) {
+		gfn = slot->base_gfn + i;
+
+		for_each_valid_sp(kvm, sp, gfn) {
+			if (sp->gfn != gfn)
+				continue;
+
+			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		}
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+			flush = false;
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+
+out_unlock:
+	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)