
[v3,00/17] KVM: Add Xen event channel acceleration

Message ID 20220303154127.202856-1-dwmw2@infradead.org (mailing list archive)

Message

David Woodhouse March 3, 2022, 3:41 p.m. UTC
This series adds event channel acceleration for Xen guests. In particular
it allows guest vCPUs to send each other IPIs without having to bounce
all the way out to the userspace VMM in order to do so. Likewise, the
Xen singleshot timer is added, and a version of SCHEDOP_poll. Those
major features are based on Joao and Boris' patches from 2019.

Cleaning up the event delivery into the vcpu_info involved using the new
gfn_to_pfn_cache for that, and that means I ended up doing so for *all*
the places the guest can have a pvclock.
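
(For illustration only, not part of the series itself: a rough sketch of
how a VMM might probe for and opt in to the in-kernel event channel
delivery, assuming the KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag and the
struct kvm_xen_hvm_config layout that this series adds to the uapi
headers; vm_fd is a previously created KVM VM file descriptor.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_xen_evtchn_send(int vm_fd)
{
	struct kvm_xen_hvm_config cfg = { 0 };
	int caps;

	/* KVM_CAP_XEN_HVM reports a bitmask of supported KVM_XEN_HVM_CONFIG_* flags */
	caps = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM);
	if (caps < 0 || !(caps & KVM_XEN_HVM_CONFIG_EVTCHN_SEND))
		return -1;	/* keep handling EVTCHNOP_send in userspace */

	/* Intercept Xen hypercalls and let the kernel deliver evtchn sends directly */
	cfg.flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
	return ioctl(vm_fd, KVM_XEN_HVM_CONFIG, &cfg);
}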

v0: Proof-of-concept RFC

v1:
 • Drop the runstate fix which is merged now.
 • Add Sean's gfn_to_pfn_cache API change at the start of the series.
 • Add KVM self tests
 • Minor bug fixes

v2:
 • Drop dirty handling from gfn_to_pfn_cache
 • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()

v3:
 • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
 • Clean up a stray #if 1 in a part of the test case that was once
   being recalcitrant.
 • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not*
   kvm_xen_has_pending_timer() which is checked from elsewhere.
 • Fix warnings noted by the kernel test robot <lkp@intel.com>:
    • Make kvm_xen_init_timer() static.
    • Make timer delta calculation use an explicit s64 to fix 32-bit build.

Boris Ostrovsky (1):
      KVM: x86/xen: handle PV spinlocks slowpath

David Woodhouse (12):
      KVM: Remove dirty handling from gfn_to_pfn_cache completely
      KVM: x86/xen: Use gfn_to_pfn_cache for runstate area
      KVM: x86: Use gfn_to_pfn_cache for pv_time
      KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info
      KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info
      KVM: x86/xen: Make kvm_xen_set_evtchn() reusable from other places
      KVM: x86/xen: Support direct injection of event channel events
      KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID
      KVM: x86/xen: Kernel acceleration for XENVER_version
      KVM: x86/xen: Support per-vCPU event channel upcall via local APIC
      KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND
      KVM: x86/xen: Add self tests for KVM_XEN_HVM_CONFIG_EVTCHN_SEND

Joao Martins (3):
      KVM: x86/xen: intercept EVTCHNOP_send from guests
      KVM: x86/xen: handle PV IPI vcpu yield
      KVM: x86/xen: handle PV timers oneshot mode

Sean Christopherson (1):
      KVM: Use enum to track if cached PFN will be used in guest and/or host

 Documentation/virt/kvm/api.rst                     |  133 +-
 arch/x86/include/asm/kvm_host.h                    |   23 +-
 arch/x86/kvm/irq.c                                 |   11 +-
 arch/x86/kvm/irq_comm.c                            |    2 +-
 arch/x86/kvm/x86.c                                 |  119 +-
 arch/x86/kvm/xen.c                                 | 1271 ++++++++++++++++----
 arch/x86/kvm/xen.h                                 |   67 +-
 include/linux/kvm_host.h                           |   26 +-
 include/linux/kvm_types.h                          |   11 +-
 include/uapi/linux/kvm.h                           |   44 +
 .../testing/selftests/kvm/x86_64/xen_shinfo_test.c |  337 +++++-
 virt/kvm/pfncache.c                                |   53 +-
 12 files changed, 1722 insertions(+), 375 deletions(-)

Comments

Paolo Bonzini March 8, 2022, 4:33 p.m. UTC | #1
On 3/3/22 16:41, David Woodhouse wrote:
> This series adds event channel acceleration for Xen guests. In particular
> it allows guest vCPUs to send each other IPIs without having to bounce
> all the way out to the userspace VMM in order to do so. Likewise, the
> Xen singleshot timer is added, and a version of SCHEDOP_poll. Those
> major features are based on Joao and Boris' patches from 2019.
> 
> Cleaning up the event delivery into the vcpu_info involved using the new
> gfn_to_pfn_cache for that, and that means I ended up doing so for *all*
> the places the guest can have a pvclock.
> 
> v0: Proof-of-concept RFC
> 
> v1:
>   • Drop the runstate fix which is merged now.
>   • Add Sean's gfn_to_pfn_cache API change at the start of the series.
>   • Add KVM self tests
>   • Minor bug fixes
> 
> v2:
>   • Drop dirty handling from gfn_to_pfn_cache
>   • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
> 
> v3:
>   • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
>   • Clean up a stray #if 1 in a part of the test case that was once
>     being recalcitrant.
>   • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not*
>     kvm_xen_has_pending_timer() which is checked from elsewhere.
>   • Fix warnings noted by the kernel test robot <lkp@intel.com>:
>      • Make kvm_xen_init_timer() static.
>      • Make timer delta calculation use an explicit s64 to fix 32-bit build.
> 
> Boris Ostrovsky (1):
>        KVM: x86/xen: handle PV spinlocks slowpath
> 
> David Woodhouse (12):
>        KVM: Remove dirty handling from gfn_to_pfn_cache completely
>        KVM: x86/xen: Use gfn_to_pfn_cache for runstate area
>        KVM: x86: Use gfn_to_pfn_cache for pv_time
>        KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info
>        KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info
>        KVM: x86/xen: Make kvm_xen_set_evtchn() reusable from other places
>        KVM: x86/xen: Support direct injection of event channel events
>        KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID
>        KVM: x86/xen: Kernel acceleration for XENVER_version
>        KVM: x86/xen: Support per-vCPU event channel upcall via local APIC
>        KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND
>        KVM: x86/xen: Add self tests for KVM_XEN_HVM_CONFIG_EVTCHN_SEND
> 
> Joao Martins (3):
>        KVM: x86/xen: intercept EVTCHNOP_send from guests
>        KVM: x86/xen: handle PV IPI vcpu yield
>        KVM: x86/xen: handle PV timers oneshot mode
> 
> Sean Christopherson (1):
>        KVM: Use enum to track if cached PFN will be used in guest and/or host
> 
>   Documentation/virt/kvm/api.rst                     |  133 +-
>   arch/x86/include/asm/kvm_host.h                    |   23 +-
>   arch/x86/kvm/irq.c                                 |   11 +-
>   arch/x86/kvm/irq_comm.c                            |    2 +-
>   arch/x86/kvm/x86.c                                 |  119 +-
>   arch/x86/kvm/xen.c                                 | 1271 ++++++++++++++++----
>   arch/x86/kvm/xen.h                                 |   67 +-
>   include/linux/kvm_host.h                           |   26 +-
>   include/linux/kvm_types.h                          |   11 +-
>   include/uapi/linux/kvm.h                           |   44 +
>   .../testing/selftests/kvm/x86_64/xen_shinfo_test.c |  337 +++++-
>   virt/kvm/pfncache.c                                |   53 +-
>   12 files changed, 1722 insertions(+), 375 deletions(-)
> 
> 
> 

Queued, thanks.

Paolo
David Woodhouse March 8, 2022, 4:41 p.m. UTC | #2
On Tue, 2022-03-08 at 17:33 +0100, Paolo Bonzini wrote:
> 
> Queued, thanks.

Thanks. I've literally just a couple of minutes ago finished diagnosing
a sporadic live migration / live update bug which seems to happen
because adding an hrtimer in the past *sometimes* seems not to work,
although it always worked in my dev testing. 

Incremental diff to the 'oneshot timers' patch looks like the first
hunk of this. I'm also pondering the second hunk which actively
*cancels* the pending timer on serialization.

Do you want a repost, or a proper incremental patch on top of kvm/queue
when it becomes visible?

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index a044cedca73f..9e1a8fd0c9c5 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -163,17 +163,19 @@ void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu)
 		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, u64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
-	ktime_t ktime_now;
-
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	ktime_now = ktime_get();
-	hrtimer_start(&vcpu->arch.xen.timer,
-		      ktime_add_ns(ktime_now, delta_ns),
-		      HRTIMER_MODE_ABS_PINNED);
+	if (delta_ns < 0) {
+		atomic_inc(&vcpu->arch.xen.timer_pending);
+	} else {
+		ktime_t ktime_now = ktime_get();
+		hrtimer_start(&vcpu->arch.xen.timer,
+			      ktime_add_ns(ktime_now, delta_ns),
+			      HRTIMER_MODE_ABS_PINNED);
+	}
 }
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
@@ -837,7 +839,10 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
-		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		if (hrtimer_cancel(&vcpu->arch.xen.timer))
+			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		else
+			data->u.timer.expires_ns = 0;
 		r = 0;
 		break;
Paolo Bonzini March 8, 2022, 4:49 p.m. UTC | #3
On 3/8/22 17:41, David Woodhouse wrote:
> Thanks. I've literally just a couple of minutes ago finished diagnosing
> a sporadic live migration / live update bug which seems to happen
> because adding an hrtimer in the past *sometimes* seems not to work,
> although it always worked in my dev testing.
> 
> Incremental diff to the 'oneshot timers' patch looks like the first
> hunk of this. I'm also pondering the second hunk which actively
> *cancels*  the pending timer on serialization.

Hmm, why so?

> Do you want a repost, or a proper incremental patch on top of kvm/queue
> when it becomes visible?

kvm/queue is rebased routinely, so I'll just squash (only the first 
hunk, methinks).  Got a testcase, though?  That might be better as a 
separate patch.

Paolo
David Woodhouse March 8, 2022, 4:59 p.m. UTC | #4
On Tue, 2022-03-08 at 17:49 +0100, Paolo Bonzini wrote:
> On 3/8/22 17:41, David Woodhouse wrote:
> > Thanks. I've literally just a couple of minutes ago finished diagnosing
> > a sporadic live migration / live update bug which seems to happen
> > because adding an hrtimer in the past *sometimes* seems not to work,
> > although it always worked in my dev testing.
> > 
> > Incremental diff to the 'oneshot timers' patch looks like the first
> > hunk of this. I'm also pondering the second hunk which actively
> > *cancels*  the pending timer on serialization.
> 
> Hmm, why so?

Don't know yet. But as I added the save/restore support to Joao's patch
I had *assumed* that it would fail when the delta was negative, and was
kind of surprised when it worked in the first place. So I'm sticking
with "Don't Do That Then" as my initial response to fix it.

> > Do you want a repost, or a proper incremental patch on top of kvm/queue
> > when it becomes visible?
> 
> kvm/queue is rebased routinely, so I'll just squash (only the first 
> hunk, methinks).  Got a testcase, though?  That might be better as a 
> separate patch.

My test case right now is to run 'while true; do time sleep 1; done' in
the guest while I repeatedly kexec / live update the host. 

After a kexec, the deadline for the timer is past, and that's why it
ends up getting restored with a negative delta. After a *few* cycles of
this it usually ends up with the timer callback never triggering.

I'll stick a negative delta into the KVM selftest included in the patch
series, instead of the nice polite '100ms in the future' that it uses
at the moment. That ought to trigger it too, and I can instrument the
hrtimer code to work out what's going on. Either way, I think 'Don't Do
That Then' will continue to be the right answer :)
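
For concreteness, the sort of selftest tweak I have in mind (a sketch
only; the port number and the 'past' expiry value are placeholders, and
the attribute layout is the KVM_XEN_VCPU_ATTR_TYPE_TIMER one added in
this series):

	struct kvm_xen_vcpu_attr timer = {
		.type = KVM_XEN_VCPU_ATTR_TYPE_TIMER,
		.u.timer.port = EVTCHN_TIMER,	/* placeholder port number */
		.u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
		/* deliberately already in the past w.r.t. the guest's Xen clock */
		.u.timer.expires_ns = past_guest_time_ns,
	};
	vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &timer);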
David Woodhouse March 8, 2022, 5:10 p.m. UTC | #5
On Tue, 2022-03-08 at 17:59 +0100, David Woodhouse wrote:
> > kvm/queue is rebased routinely, so I'll just squash (only the first 
> > hunk, methinks).  Got a testcase, though?  That might be better as a 
> > separate patch.
> 
> My test case right now is to run 'while true; do time sleep 1; done' in
> the guest while I repeatedly kexec / live update the host. 
> 
> After a kexec, the deadline for the timer is past, and that's why it
> ends up getting restored with a negative delta. After a *few* cycles of
> this it usually ends up with the timer callback never triggering.
> 
> I'll stick a negative delta into the KVM selftest included in the patch
> series, instead of the nice polite '100ms in the future' that it uses
> at the moment. That ought to trigger it too, and I can instrument the
> hrtimer code to work out what's going on. Either way, I think 'Don't Do
> That Then' will continue to be the right answer :)

Oh, I bet that won't show it and the kexec is actually needed. I bet it
only happens when the timer expiry is actually at a time *before* zero
on the new kernel's clock.
Paolo Bonzini March 8, 2022, 5:13 p.m. UTC | #6
On 3/8/22 17:59, David Woodhouse wrote:
>>> Incremental diff to the 'oneshot timers' patch looks like the first
>>> hunk of this. I'm also pondering the second hunk which actively
>>> *cancels*  the pending timer on serialization.
>> 
>> Hmm, why so?
>
> Don't know yet. But as I added the save/restore support to Joao's patch
> I had *assumed* that it would fail when the delta was negative, and was
> kind of surprised when it worked in the first place. So I'm sticking
> with "Don't Do That Then" as my initial response to fix it.

Yes, I'm just talking about the second hunk.  The first is clear(ish).

> After a kexec, the deadline for the timer is past, and that's why it
> ends up getting restored with a negative delta. After a *few*  cycles of
> this it usually ends up with the timer callback never triggering.
> 
> I'll stick a negative delta into the KVM selftest included in the patch
> series, instead of the nice polite '100ms in the future' that it uses
> at the moment. That ought to trigger it too, and I can instrument the
> hrtimer code to work out what's going on. Either way, I think 'Don't Do
> That Then' will continue to be the right answer :)

Yep, let's keep both testcases, though.

Paolo
David Woodhouse March 8, 2022, 5:20 p.m. UTC | #7
On Tue, 2022-03-08 at 18:13 +0100, Paolo Bonzini wrote:
> On 3/8/22 17:59, David Woodhouse wrote:
> > > > Incremental diff to the 'oneshot timers' patch looks like the first
> > > > hunk of this. I'm also pondering the second hunk which actively
> > > > *cancels*  the pending timer on serialization.
> > > 
> > > Hmm, why so?
> > 
> > Don't know yet. But as I added the save/restore support to Joao's patch
> > I had *assumed* that it would fail when the delta was negative, and was
> > kind of surprised when it worked in the first place. So I'm sticking
> > with "Don't Do That Then" as my initial response to fix it.
> 
> Yes, I'm just talking about the second hunk.  The first is clear(ish).

Oh, I see.

When the oneshot timer has expired and hasn't been re-armed by the
guest, we should return zero as 'expires_ns' so that it doesn't get re-
armed in the past (and, hopefully, immediately retriggered) when the
guest is restored.

Also, we don't really want the timer firing *after* the guest vCPU
state has been serialized, since the newly-injected interrupt might not
get migrated. Hence using hrtimer_cancel() as our check for whether
it's still pending or not.

> > After a kexec, the deadline for the timer is past, and that's why it
> > ends up getting restored with a negative delta. After a *few*  cycles of
> > this it usually ends up with the timer callback never triggering.
> > 
> > I'll stick a negative delta into the KVM selftest included in the patch
> > series, instead of the nice polite '100ms in the future' that it uses
> > at the moment. That ought to trigger it too, and I can instrument the
> > hrtimer code to work out what's going on. Either way, I think 'Don't Do
> > That Then' will continue to be the right answer :)
> 
> Yep, let's keep both testcases through.

Ultimately I think the negative one only ends up testing the kernel's
hrtimer behaviour rather than KVM functionality, but I'll play with
that some more and make sure I understand it.
Paolo Bonzini March 8, 2022, 5:25 p.m. UTC | #8
On 3/8/22 18:20, David Woodhouse wrote:
>> Yes, I'm just talking about the second hunk.  The first is clear(ish).
> Oh, I see.
> 
> When the oneshot timer has expired and hasn't been re-armed by the
> guest, we should return zero as 'expires_ns' so that it doesn't get re-
> armed in the past (and, hopefully, immediately retriggered) when the
> guest is restored.
> 
> Also, we don't really want the timer firing *after* the guest vCPU
> state has been serialized, since the newly-injected interrupt might not
> get migrated. Hence using hrtimer_cancel() as our check for whether
> it's still pending or not.
> 

I think the two are different.  The first is also clear, and that should 
be fixed with a separate bool or possibly a special meaning for 
expires_ns == -1 (or INT64_MAX, I don't speak Xen hypercalls :)).

The second should not be a problem.  The newly-injected interrupt might 
not get migrated, but it will be injected on the destination.  So it 
shouldn't be a problem, in fact anything that relies on that is probably 
going to be racy.

Paolo
David Woodhouse March 8, 2022, 5:40 p.m. UTC | #9
On Tue, 2022-03-08 at 18:25 +0100, Paolo Bonzini wrote:
> On 3/8/22 18:20, David Woodhouse wrote:
> > > Yes, I'm just talking about the second hunk.  The first is clear(ish).
> > Oh, I see.
> > 
> > When the oneshot timer has expired and hasn't been re-armed by the
> > guest, we should return zero as 'expires_ns' so that it doesn't get re-
> > armed in the past (and, hopefully, immediately retriggered) when the
> > guest is restored.
> > 
> > Also, we don't really want the timer firing *after* the guest vCPU
> > state has been serialized, since the newly-injected interrupt might not
> > get migrated. Hence using hrtimer_cancel() as our check for whether
> > it's still pending or not.
> > 
> 
> I think the two are different.  The first is also clear, and that should 
> be fixed with a separate bool or possibly a special meaning for 
> expires_ns == -1 (or INT64_MAX, I don't speak Xen hypercalls :)).

The value you're looking for is zero; that's what "not set" means.

The issue wasn't the Xen ABI or even the KVM_XEN_VCPU_ATTR ABI; it's
just that we weren't *returning* zero when the timer has already fired
and thus it isn't considered to be 'set' any more.

> The second should not be a problem.  The newly-injected interrupt might 
> not get migrated, but it will be injected on the destination.  So it 
> shouldn't be a problem, in fact anything that relies on that is probably 
> going to be racy.

Sure. But if we consider it acceptable for the timer to fire again
after live migration when it already fired on the source host why did
we even bother fixing the first part above? :)
Paolo Bonzini March 8, 2022, 5:54 p.m. UTC | #10
On 3/8/22 18:40, David Woodhouse wrote:
> On Tue, 2022-03-08 at 18:25 +0100, Paolo Bonzini wrote:
>> On 3/8/22 18:20, David Woodhouse wrote:
>>>> Yes, I'm just talking about the second hunk.  The first is clear(ish).
>>> Oh, I see.
>>>
>>> When the oneshot timer has expired and hasn't been re-armed by the
>>> guest, we should return zero as 'expires_ns' so that it doesn't get re-
>>> armed in the past (and, hopefully, immediately retriggered) when the
>>> guest is restored.
>>>
>>> Also, we don't really want the timer firing *after* the guest vCPU
>>> state has been serialized, since the newly-injected interrupt might not
>>> get migrated. Hence using hrtimer_cancel() as our check for whether
>>> it's still pending or not.
>>
> Sure. But if we consider it acceptable for the timer to fire again
> after live migration when it already fired on the source host why did
> we even bother fixing the first part above? :)

In the first case the timer has expired and the event has been injected 
by the time the state is retrieved, so you'd get a double event.

In the second the timer expires and the event is injected _on the
source_ only.  This is required because, if the destination for
whatever reason aborts the migration, your change would introduce a
missed event.

Paolo
David Woodhouse March 8, 2022, 6:05 p.m. UTC | #11
On Tue, 2022-03-08 at 18:54 +0100, Paolo Bonzini wrote:
> In the first case the timer has expired and the event has been injected 
> by the time the state is retrieved, so you'd get a double event.
> 
> In the second the timer expires and the event is injected _on the 
> source_  only.  This is required because, if the destination for 
> whatever reason aborts the migration, your change would introduce a
> missed event.

True. I'll change it to use hrtimer_active(). But first I'll
concentrate on the one that was *really* the cause of the bug I've been
chasing today.
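
Concretely, that would be the earlier get_attr hunk with hrtimer_cancel()
swapped for hrtimer_active(), something along these lines (an untested
sketch, not the final patch):

	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
		data->u.timer.port = vcpu->arch.xen.timer_virq;
		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
		/* Report 0 ("not set") once the timer has fired, without cancelling it */
		if (hrtimer_active(&vcpu->arch.xen.timer))
			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
		else
			data->u.timer.expires_ns = 0;
		r = 0;
		break;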
Paolo Bonzini March 25, 2022, 6:19 p.m. UTC | #12
On 3/3/22 16:41, David Woodhouse wrote:
> This series adds event channel acceleration for Xen guests. In particular
> it allows guest vCPUs to send each other IPIs without having to bounce
> all the way out to the userspace VMM in order to do so. Likewise, the
> Xen singleshot timer is added, and a version of SCHEDOP_poll. Those
> major features are based on Joao and Boris' patches from 2019.
> 
> Cleaning up the event delivery into the vcpu_info involved using the new
> gfn_to_pfn_cache for that, and that means I ended up doing so for *all*
> the places the guest can have a pvclock.
> 
> v0: Proof-of-concept RFC
> 
> v1:
>   • Drop the runstate fix which is merged now.
>   • Add Sean's gfn_to_pfn_cache API change at the start of the series.
>   • Add KVM self tests
>   • Minor bug fixes
> 
> v2:
>   • Drop dirty handling from gfn_to_pfn_cache
>   • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
> 
> v3:
>   • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
> >   • Clean up a stray #if 1 in a part of the test case that was once
>     being recalcitrant.
>   • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not*
>     kvm_xen_has_pending_timer() which is checked from elsewhere.
>   • Fix warnings noted by the kernel test robot <lkp@intel.com>:
>      • Make kvm_xen_init_timer() static.
>      • Make timer delta calculation use an explicit s64 to fix 32-bit build.

I've seen this:

[1790637.031490] BUG: Bad page state in process qemu-kvm  pfn:03401
[1790637.037503] page:0000000077fc41af refcount:0 mapcount:1 
mapping:0000000000000000 index:0x7f4ab7e01 pfn:0x3401
[1790637.047592] head:0000000032101bf5 order:9 compound_mapcount:1 
compound_pincount:0
[1790637.055250] anon flags: 
0xfffffc009000e(referenced|uptodate|dirty|head|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
[1790637.065949] raw: 000fffffc0000000 ffffda4b800d0001 0000000000000903 
dead000000000200
[1790637.073869] raw: 0000000000000100 0000000000000000 00000000ffffffff 
0000000000000000
[1790637.081791] head: 000fffffc009000e dead000000000100 
dead000000000122 ffffa0636279fb01
[1790637.089797] head: 00000007f4ab7e00 0000000000000000 
00000000ffffffff 0000000000000000
[1790637.097795] page dumped because: nonzero compound_mapcount
[1790637.103455] Modules linked in: kvm_intel(OE) kvm(OE) overlay tun 
tls ib_core rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd 
grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common 
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal 
intel_powerclamp coretemp ipmi_ssif iTCO_wdt intel_pmc_bxt irqbypass 
iTCO_vendor_support acpi_ipmi rapl dell_smbios ipmi_si mei_me 
intel_cstate dcdbas ipmi_devintf i2c_i801 intel_uncore 
dell_wmi_descriptor wmi_bmof mei lpc_ich intel_pch_thermal i2c_smbus 
ipmi_msghandler acpi_power_meter xfs crct10dif_pclmul i40e crc32_pclmul 
crc32c_intel megaraid_sas ghash_clmulni_intel tg3 mgag200 wmi fuse [last 
unloaded: kvm]
[1790637.162636] CPU: 12 PID: 3056318 Comm: qemu-kvm Kdump: loaded 
Tainted: G        W IOE    --------- ---  5.16.0-0.rc6.41.fc36.x86_64 #1
[1790637.174878] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 
1.6.11 11/20/2018
[1790637.182618] Call Trace:
[1790637.185246]  <TASK>
[1790637.187524]  dump_stack_lvl+0x48/0x5e
[1790637.191373]  bad_page.cold+0x63/0x94
[1790637.195123]  free_tail_pages_check+0xbb/0x110
[1790637.199656]  free_pcp_prepare+0x270/0x310
[1790637.203843]  free_unref_page+0x1d/0x120
[1790637.207856]  kvm_gfn_to_pfn_cache_refresh+0x2c2/0x400 [kvm]
[1790637.213662]  kvm_setup_guest_pvclock+0x4b/0x180 [kvm]
[1790637.218913]  kvm_guest_time_update+0x26d/0x330 [kvm]
[1790637.224080]  vcpu_enter_guest+0x31c/0x1390 [kvm]
[1790637.228908]  kvm_arch_vcpu_ioctl_run+0x132/0x830 [kvm]
[1790637.234254]  kvm_vcpu_ioctl+0x270/0x680 [kvm]

followed by other badness with the same call stack:

[1790637.376127] page dumped because: 
VM_BUG_ON_PAGE(page_ref_count(page) == 0)

I am absolutely not sure that this series is the culprit in any way, but 
anyway I'll try to reproduce (it happened at the end of a RHEL7.2 
installation) and let you know.  If not, it is something that already 
made its way to Linus.

Paolo
David Woodhouse March 25, 2022, 7:57 p.m. UTC | #13
On Fri, 2022-03-25 at 19:19 +0100, Paolo Bonzini wrote:
> On 3/3/22 16:41, David Woodhouse wrote:
> > This series adds event channel acceleration for Xen guests. In particular
> > it allows guest vCPUs to send each other IPIs without having to bounce
> > all the way out to the userspace VMM in order to do so. Likewise, the
> > Xen singleshot timer is added, and a version of SCHEDOP_poll. Those
> > major features are based on Joao and Boris' patches from 2019.
> > 
> > Cleaning up the event delivery into the vcpu_info involved using the new
> > gfn_to_pfn_cache for that, and that means I ended up doing so for *all*
> > the places the guest can have a pvclock.
> > 
> > v0: Proof-of-concept RFC
> > 
> > v1:
> >   • Drop the runstate fix which is merged now.
> >   • Add Sean's gfn_to_pfn_cache API change at the start of the series.
> >   • Add KVM self tests
> >   • Minor bug fixes
> > 
> > v2:
> >   • Drop dirty handling from gfn_to_pfn_cache
> >   • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
> > 
> > v3:
> >   • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
> >   • Clean up a stray #if 1 in a part of the test case that was once
> >     being recalcitrant.
> >   • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not*
> >     kvm_xen_has_pending_timer() which is checked from elsewhere.
> >   • Fix warnings noted by the kernel test robot <lkp@intel.com>:
> >      • Make kvm_xen_init_timer() static.
> >      • Make timer delta calculation use an explicit s64 to fix 32-bit build.
> 
> I've seen this:
> 
> [1790637.031490] BUG: Bad page state in process qemu-kvm  pfn:03401
> [1790637.037503] page:0000000077fc41af refcount:0 mapcount:1 
> mapping:0000000000000000 index:0x7f4ab7e01 pfn:0x3401
> [1790637.047592] head:0000000032101bf5 order:9 compound_mapcount:1 
> compound_pincount:0
> [1790637.055250] anon flags: 
> 0xfffffc009000e(referenced|uptodate|dirty|head|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
> [1790637.065949] raw: 000fffffc0000000 ffffda4b800d0001 0000000000000903 
> dead000000000200
> [1790637.073869] raw: 0000000000000100 0000000000000000 00000000ffffffff 
> 0000000000000000
> [1790637.081791] head: 000fffffc009000e dead000000000100 
> dead000000000122 ffffa0636279fb01
> [1790637.089797] head: 00000007f4ab7e00 0000000000000000 
> 00000000ffffffff 0000000000000000
> [1790637.097795] page dumped because: nonzero compound_mapcount
> [1790637.103455] Modules linked in: kvm_intel(OE) kvm(OE) overlay tun 
> tls ib_core rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd 
> grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common 
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal 
> intel_powerclamp coretemp ipmi_ssif iTCO_wdt intel_pmc_bxt irqbypass 
> iTCO_vendor_support acpi_ipmi rapl dell_smbios ipmi_si mei_me 
> intel_cstate dcdbas ipmi_devintf i2c_i801 intel_uncore 
> dell_wmi_descriptor wmi_bmof mei lpc_ich intel_pch_thermal i2c_smbus 
> ipmi_msghandler acpi_power_meter xfs crct10dif_pclmul i40e crc32_pclmul 
> crc32c_intel megaraid_sas ghash_clmulni_intel tg3 mgag200 wmi fuse [last 
> unloaded: kvm]
> [1790637.162636] CPU: 12 PID: 3056318 Comm: qemu-kvm Kdump: loaded 
> Tainted: G        W IOE    --------- ---  5.16.0-0.rc6.41.fc36.x86_64 #1
> [1790637.174878] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 
> 1.6.11 11/20/2018
> [1790637.182618] Call Trace:
> [1790637.185246]  <TASK>
> [1790637.187524]  dump_stack_lvl+0x48/0x5e
> [1790637.191373]  bad_page.cold+0x63/0x94
> [1790637.195123]  free_tail_pages_check+0xbb/0x110
> [1790637.199656]  free_pcp_prepare+0x270/0x310
> [1790637.203843]  free_unref_page+0x1d/0x120
> [1790637.207856]  kvm_gfn_to_pfn_cache_refresh+0x2c2/0x400 [kvm]
> [1790637.213662]  kvm_setup_guest_pvclock+0x4b/0x180 [kvm]
> [1790637.218913]  kvm_guest_time_update+0x26d/0x330 [kvm]
> [1790637.224080]  vcpu_enter_guest+0x31c/0x1390 [kvm]
> [1790637.228908]  kvm_arch_vcpu_ioctl_run+0x132/0x830 [kvm]
> [1790637.234254]  kvm_vcpu_ioctl+0x270/0x680 [kvm]
> 
> followed by other badness with the same call stack:
> 
> [1790637.376127] page dumped because: 
> VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> 
> I am absolutely not sure that this series is the culprit in any way, but 
> anyway I'll try to reproduce (it happened at the end of a RHEL7.2 
> installation) and let you know.  If not, it is something that already 
> made its way to Linus.
> 

Hrm.... could it be a double/multiple free? This will come from
__release_gpc() which is called from the end of
kvm_gfn_to_pfn_cache_refresh() and which releases the *old* PFN.

How could we get there without... oh... could it be this?

--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
                gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
                if (kvm_is_error_hva(gpc->uhva)) {
+                       gpc->pfn = KVM_PFN_ERR_FAULT;
                        ret = -EFAULT;
                        goto out;
                }
David Woodhouse March 25, 2022, 8:02 p.m. UTC | #14
On Fri, 2022-03-25 at 19:57 +0000, David Woodhouse wrote:
> On Fri, 2022-03-25 at 19:19 +0100, Paolo Bonzini wrote:
> > I am absolutely not sure that this series is the culprit in any way, but 
> > anyway I'll try to reproduce (it happened at the end of a RHEL7.2 
> > installation) and let you know.  If not, it is something that already 
> > made its way to Linus.
> > 
> 
> Hrm.... could it be a double/multiple free? This will come from
> __release_gpc() which is called from the end of
> kvm_gfn_to_pfn_cache_refresh() and which releases the *old* PFN.
> 
> How could we get there without... oh... could it be this?
> 
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>                 gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
>  
>                 if (kvm_is_error_hva(gpc->uhva)) {
> +                       gpc->pfn = KVM_PFN_ERR_FAULT;
>                         ret = -EFAULT;
>                         goto out;
>                 }
> 
> 

If you're going to try to reproduce, better to do it like this instead I suppose:

--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
                gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
                if (kvm_is_error_hva(gpc->uhva)) {
+                       printk("Imma free PFN %llx again later. Oops!\n", gpc->pfn);
                        ret = -EFAULT;
                        goto out;
                }