Message ID | 20220303154127.202856-1-dwmw2@infradead.org (mailing list archive)
---|---
Series | KVM: Add Xen event channel acceleration
On 3/3/22 16:41, David Woodhouse wrote:
> This series adds event channel acceleration for Xen guests. In particular it allows guest vCPUs to send each other IPIs without having to bounce all the way out to the userspace VMM in order to do so. Likewise, the Xen singleshot timer is added, and a version of SCHEDOP_poll. Those major features are based on Joao and Boris' patches from 2019.
>
> Cleaning up the event delivery into the vcpu_info involved using the new gfn_to_pfn_cache for that, and that means I ended up doing so for *all* the places the guest can have a pvclock.
>
> v0: Proof-of-concept RFC
>
> v1:
>  • Drop the runstate fix which is merged now.
>  • Add Sean's gfn_to_pfn_cache API change at the start of the series.
>  • Add KVM self tests
>  • Minor bug fixes
>
> v2:
>  • Drop dirty handling from gfn_to_pfn_cache
>  • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
>
> v3:
>  • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
>  • Clean up a stray #if 1 in a part of the test case that was once being recalcitrant.
>  • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not* kvm_xen_has_pending_timer() which is checked from elsewhere.
>  • Fix warnings noted by the kernel test robot <lkp@intel.com>:
>    • Make kvm_xen_init_timer() static.
>    • Make timer delta calculation use an explicit s64 to fix 32-bit build.
>
> Boris Ostrovsky (1):
>   KVM: x86/xen: handle PV spinlocks slowpath
>
> David Woodhouse (12):
>   KVM: Remove dirty handling from gfn_to_pfn_cache completely
>   KVM: x86/xen: Use gfn_to_pfn_cache for runstate area
>   KVM: x86: Use gfn_to_pfn_cache for pv_time
>   KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info
>   KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info
>   KVM: x86/xen: Make kvm_xen_set_evtchn() reusable from other places
>   KVM: x86/xen: Support direct injection of event channel events
>   KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID
>   KVM: x86/xen: Kernel acceleration for XENVER_version
>   KVM: x86/xen: Support per-vCPU event channel upcall via local APIC
>   KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND
>   KVM: x86/xen: Add self tests for KVM_XEN_HVM_CONFIG_EVTCHN_SEND
>
> Joao Martins (3):
>   KVM: x86/xen: intercept EVTCHNOP_send from guests
>   KVM: x86/xen: handle PV IPI vcpu yield
>   KVM: x86/xen: handle PV timers oneshot mode
>
> Sean Christopherson (1):
>   KVM: Use enum to track if cached PFN will be used in guest and/or host
>
>  Documentation/virt/kvm/api.rst                     |  133 +-
>  arch/x86/include/asm/kvm_host.h                    |   23 +-
>  arch/x86/kvm/irq.c                                 |   11 +-
>  arch/x86/kvm/irq_comm.c                            |    2 +-
>  arch/x86/kvm/x86.c                                 |  119 +-
>  arch/x86/kvm/xen.c                                 | 1271 ++++++++++++++++----
>  arch/x86/kvm/xen.h                                 |   67 +-
>  include/linux/kvm_host.h                           |   26 +-
>  include/linux/kvm_types.h                          |   11 +-
>  include/uapi/linux/kvm.h                           |   44 +
>  .../testing/selftests/kvm/x86_64/xen_shinfo_test.c |  337 +++++-
>  virt/kvm/pfncache.c                                |   53 +-
>  12 files changed, 1722 insertions(+), 375 deletions(-)

Queued, thanks.

Paolo
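For illustration, here is a minimal userspace sketch of the direct event channel injection path that the series advertises via KVM_XEN_HVM_CONFIG_EVTCHN_SEND. The ioctl and struct names (KVM_XEN_HVM_EVTCHN_SEND, struct kvm_irq_routing_xen_evtchn) follow the interface as it eventually landed in mainline rather than this v3 posting, so treat them as assumptions; kvm_fd and vm_fd are assumed to be the open /dev/kvm and VM file descriptors.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Inject a 2-level event channel port directly into the guest, bypassing
 * the userspace VMM's own delivery path. */
static int send_guest_evtchn(int kvm_fd, int vm_fd, uint32_t port, uint32_t dest_vcpu)
{
	struct kvm_irq_routing_xen_evtchn evt = {
		.port = port,			/* guest event channel port */
		.vcpu = dest_vcpu,		/* vCPU whose vcpu_info receives the event */
		.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
	};

	/* KVM_CHECK_EXTENSION on KVM_CAP_XEN_HVM returns a flag mask; the
	 * EVTCHN_SEND bit advertises the in-kernel delivery support. */
	if (!(ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM) &
	      KVM_XEN_HVM_CONFIG_EVTCHN_SEND))
		return -1;

	return ioctl(vm_fd, KVM_XEN_HVM_EVTCHN_SEND, &evt);
}

The payload is the same struct used by KVM_IRQ_ROUTING_XEN_EVTCHN routing entries, which is why the priority field reuses KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL.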
On Tue, 2022-03-08 at 17:33 +0100, Paolo Bonzini wrote:
> Queued, thanks.

Thanks. I've literally just a couple of minutes ago finished diagnosing a sporadic live migration / live update bug which seems to happen because adding an hrtimer in the past *sometimes* seems not to work, although it always worked in my dev testing.

Incremental diff to the 'oneshot timers' patch looks like the first hunk of this. I'm also pondering the second hunk which actively *cancels* the pending timer on serialization.

Do you want a repost, or a proper incremental patch on top of kvm/queue when it becomes visible?

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index a044cedca73f..9e1a8fd0c9c5 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -163,17 +163,19 @@ void __kvm_migrate_xen_timer(struct kvm_vcpu *vcpu)
 	hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, u64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
-	ktime_t ktime_now;
-
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	ktime_now = ktime_get();
-	hrtimer_start(&vcpu->arch.xen.timer,
-		      ktime_add_ns(ktime_now, delta_ns),
-		      HRTIMER_MODE_ABS_PINNED);
+	if (delta_ns < 0) {
+		atomic_inc(&vcpu->arch.xen.timer_pending);
+	} else {
+		ktime_t ktime_now = ktime_get();
+		hrtimer_start(&vcpu->arch.xen.timer,
+			      ktime_add_ns(ktime_now, delta_ns),
+			      HRTIMER_MODE_ABS_PINNED);
+	}
 }
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
@@ -837,7 +839,10 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
-		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		if (hrtimer_cancel(&vcpu->arch.xen.timer))
+			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+		else
+			data->u.timer.expires_ns = 0;
 		r = 0;
 		break;
On 3/8/22 17:41, David Woodhouse wrote:
> Thanks. I've literally just a couple of minutes ago finished diagnosing a sporadic live migration / live update bug which seems to happen because adding an hrtimer in the past *sometimes* seems not to work, although it always worked in my dev testing.
>
> Incremental diff to the 'oneshot timers' patch looks like the first hunk of this. I'm also pondering the second hunk which actively *cancels* the pending timer on serialization.

Hmm, why so?

> Do you want a repost, or a proper incremental patch on top of kvm/queue when it becomes visible?

kvm/queue is rebased routinely, so I'll just squash (only the first hunk, methinks). Got a testcase, though? That might be better as a separate patch.

Paolo
On Tue, 2022-03-08 at 17:49 +0100, Paolo Bonzini wrote:
> On 3/8/22 17:41, David Woodhouse wrote:
> > Thanks. I've literally just a couple of minutes ago finished diagnosing a sporadic live migration / live update bug which seems to happen because adding an hrtimer in the past *sometimes* seems not to work, although it always worked in my dev testing.
> >
> > Incremental diff to the 'oneshot timers' patch looks like the first hunk of this. I'm also pondering the second hunk which actively *cancels* the pending timer on serialization.
>
> Hmm, why so?

Don't know yet. But as I added the save/restore support to Joao's patch I had *assumed* that it would fail when the delta was negative, and was kind of surprised when it worked in the first place. So I'm sticking with "Don't Do That Then" as my initial response to fix it.

> > Do you want a repost, or a proper incremental patch on top of kvm/queue when it becomes visible?
>
> kvm/queue is rebased routinely, so I'll just squash (only the first hunk, methinks). Got a testcase, though? That might be better as a separate patch.

My test case right now is to run 'while true; time sleep 1; done' in the guest while I repeatedly kexec / live update the host.

After a kexec, the deadline for the timer is past, and that's why it ends up getting restored with a negative delta. After a *few* cycles of this it usually ends up with the timer callback never triggering.

I'll stick a negative delta into the KVM selftest included in the patch series, instead of the nice polite '100ms in the future' that it uses at the moment. That ought to trigger it too, and I can instrument the hrtimer code to work out what's going on. Either way, I think 'Don't Do That Then' will continue to be the right answer :)
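For context, here is a minimal sketch (not the selftest from the series) of how a VMM's restore path ends up handing KVM a Xen oneshot timer whose deadline is already in the past, which is exactly the negative-delta case the first hunk guards against. The KVM_XEN_VCPU_SET_ATTR ioctl and the kvm_xen_vcpu_attr timer fields follow the UAPI quoted in the diff above; vcpu_fd and timer_virq are assumed to come from the caller.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Restore a Xen oneshot timer; expires_ns is in guest (Xen) time and may
 * already be in the past after a host kexec / live update. */
static int restore_xen_oneshot_timer(int vcpu_fd, uint32_t timer_virq,
				     uint64_t expires_ns)
{
	struct kvm_xen_vcpu_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = KVM_XEN_VCPU_ATTR_TYPE_TIMER;
	attr.u.timer.port = timer_virq;
	attr.u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
	/* With the fix above, a deadline that is already behind the host
	 * clock is treated as "already pending" rather than being armed
	 * as an hrtimer in the past. */
	attr.u.timer.expires_ns = expires_ns;

	return ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &attr);
}

A selftest along these lines would set expires_ns well before the current Xen system time and verify that the timer event channel still fires, though as noted in the follow-up below, reproducing the original failure may genuinely require the kexec.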
On Tue, 2022-03-08 at 17:59 +0100, David Woodhouse wrote:
> > kvm/queue is rebased routinely, so I'll just squash (only the first hunk, methinks). Got a testcase, though? That might be better as a separate patch.
>
> My test case right now is to run 'while true; time sleep 1; done' in the guest while I repeatedly kexec / live update the host.
>
> After a kexec, the deadline for the timer is past, and that's why it ends up getting restored with a negative delta. After a *few* cycles of this it usually ends up with the timer callback never triggering.
>
> I'll stick a negative delta into the KVM selftest included in the patch series, instead of the nice polite '100ms in the future' that it uses at the moment. That ought to trigger it too, and I can instrument the hrtimer code to work out what's going on. Either way, I think 'Don't Do That Then' will continue to be the right answer :)

Oh, I bet that won't show it and the kexec is actually needed. I bet it only happens when the timer expiry is actually at a time *before* zero on the new kernel's clock.
On 3/8/22 17:59, David Woodhouse wrote:
>>> Incremental diff to the 'oneshot timers' patch looks like the first hunk of this. I'm also pondering the second hunk which actively *cancels* the pending timer on serialization.
>>
>> Hmm, why so?
>
> Don't know yet. But as I added the save/restore support to Joao's patch I had *assumed* that it would fail when the delta was negative, and was kind of surprised when it worked in the first place. So I'm sticking with "Don't Do That Then" as my initial response to fix it.

Yes, I'm just talking about the second hunk. The first is clear(ish).

> After a kexec, the deadline for the timer is past, and that's why it ends up getting restored with a negative delta. After a *few* cycles of this it usually ends up with the timer callback never triggering.
>
> I'll stick a negative delta into the KVM selftest included in the patch series, instead of the nice polite '100ms in the future' that it uses at the moment. That ought to trigger it too, and I can instrument the hrtimer code to work out what's going on. Either way, I think 'Don't Do That Then' will continue to be the right answer :)

Yep, let's keep both testcases, though.

Paolo
On Tue, 2022-03-08 at 18:13 +0100, Paolo Bonzini wrote:
> On 3/8/22 17:59, David Woodhouse wrote:
> > > > Incremental diff to the 'oneshot timers' patch looks like the first hunk of this. I'm also pondering the second hunk which actively *cancels* the pending timer on serialization.
> > >
> > > Hmm, why so?
> >
> > Don't know yet. But as I added the save/restore support to Joao's patch I had *assumed* that it would fail when the delta was negative, and was kind of surprised when it worked in the first place. So I'm sticking with "Don't Do That Then" as my initial response to fix it.
>
> Yes, I'm just talking about the second hunk. The first is clear(ish).

Oh, I see.

When the oneshot timer has expired and hasn't been re-armed by the guest, we should return zero as 'expires_ns' so that it doesn't get re-armed in the past (and, hopefully, immediately retriggered) when the guest is restored.

Also, we don't really want the timer firing *after* the guest vCPU state has been serialized, since the newly-injected interrupt might not get migrated. Hence using hrtimer_cancel() as our check for whether it's still pending or not.

> > After a kexec, the deadline for the timer is past, and that's why it ends up getting restored with a negative delta. After a *few* cycles of this it usually ends up with the timer callback never triggering.
> >
> > I'll stick a negative delta into the KVM selftest included in the patch series, instead of the nice polite '100ms in the future' that it uses at the moment. That ought to trigger it too, and I can instrument the hrtimer code to work out what's going on. Either way, I think 'Don't Do That Then' will continue to be the right answer :)
>
> Yep, let's keep both testcases, though.

Ultimately I think the negative one only ends up testing the kernel's hrtimer behaviour rather than KVM functionality, but I'll play with that some more and make sure I understand it.
On 3/8/22 18:20, David Woodhouse wrote:
>> Yes, I'm just talking about the second hunk. The first is clear(ish).
> Oh, I see.
>
> When the oneshot timer has expired and hasn't been re-armed by the guest, we should return zero as 'expires_ns' so that it doesn't get re-armed in the past (and, hopefully, immediately retriggered) when the guest is restored.
>
> Also, we don't really want the timer firing *after* the guest vCPU state has been serialized, since the newly-injected interrupt might not get migrated. Hence using hrtimer_cancel() as our check for whether it's still pending or not.

I think the two are different. The first is also clear, and that should be fixed with a separate bool or possibly a special meaning for expires_ns == -1 (or INT64_MAX, I don't speak Xen hypercalls :)).

The second should not be a problem. The newly-injected interrupt might not get migrated, but it will be injected on the destination. So it shouldn't be a problem; in fact anything that relies on that is probably going to be racy.

Paolo
On Tue, 2022-03-08 at 18:25 +0100, Paolo Bonzini wrote:
> On 3/8/22 18:20, David Woodhouse wrote:
> > > Yes, I'm just talking about the second hunk. The first is clear(ish).
> > Oh, I see.
> >
> > When the oneshot timer has expired and hasn't been re-armed by the guest, we should return zero as 'expires_ns' so that it doesn't get re-armed in the past (and, hopefully, immediately retriggered) when the guest is restored.
> >
> > Also, we don't really want the timer firing *after* the guest vCPU state has been serialized, since the newly-injected interrupt might not get migrated. Hence using hrtimer_cancel() as our check for whether it's still pending or not.
>
> I think the two are different. The first is also clear, and that should be fixed with a separate bool or possibly a special meaning for expires_ns == -1 (or INT64_MAX, I don't speak Xen hypercalls :)).

The value you're looking for is zero; that's what means "not set". The issue wasn't the Xen ABI or even the KVM_XEN_VCPU_ATTR ABI; it's just that we weren't *returning* zero when the timer has already fired and thus isn't considered to be 'set' any more.

> The second should not be a problem. The newly-injected interrupt might not get migrated, but it will be injected on the destination. So it shouldn't be a problem; in fact anything that relies on that is probably going to be racy.

Sure. But if we consider it acceptable for the timer to fire again after live migration when it already fired on the source host, why did we even bother fixing the first part above? :)
On 3/8/22 18:40, David Woodhouse wrote:
> On Tue, 2022-03-08 at 18:25 +0100, Paolo Bonzini wrote:
>> On 3/8/22 18:20, David Woodhouse wrote:
>>>> Yes, I'm just talking about the second hunk. The first is clear(ish).
>>> Oh, I see.
>>>
>>> When the oneshot timer has expired and hasn't been re-armed by the guest, we should return zero as 'expires_ns' so that it doesn't get re-armed in the past (and, hopefully, immediately retriggered) when the guest is restored.
>>>
>>> Also, we don't really want the timer firing *after* the guest vCPU state has been serialized, since the newly-injected interrupt might not get migrated. Hence using hrtimer_cancel() as our check for whether it's still pending or not.
>>
> Sure. But if we consider it acceptable for the timer to fire again after live migration when it already fired on the source host, why did we even bother fixing the first part above? :)

In the first case the timer has expired and the event has been injected by the time the state is retrieved, so you'd get a double event.

In the second the timer expires and the event is injected _on the source_ only. This is required because, if the destination for whatever reason aborts the migration, your change would introduce a missed event.

Paolo
On Tue, 2022-03-08 at 18:54 +0100, Paolo Bonzini wrote:
> In the first case the timer has expired and the event has been injected by the time the state is retrieved, so you'd get a double event.
>
> In the second the timer expires and the event is injected _on the source_ only. This is required because, if the destination for whatever reason aborts the migration, your change would introduce a missed event.

True. I'll change it to use hrtimer_active(). But first I'll concentrate on the one that was *really* the cause of the bug I've been chasing today.
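For reference, here is a sketch of what the KVM_XEN_VCPU_ATTR_TYPE_TIMER read-out could look like with hrtimer_active() in place of hrtimer_cancel(), as agreed above. This is not code from the posted series, only an illustration of the direction discussed: an already-expired timer is reported back as expires_ns == 0, while a still-pending one keeps running on the source so no event is lost if the migration aborts.

	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
		data->u.timer.port = vcpu->arch.xen.timer_virq;
		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
		/*
		 * Report zero once the oneshot timer has fired, so that
		 * userspace does not re-arm it in the past on restore.
		 * Unlike hrtimer_cancel(), hrtimer_active() does not stop
		 * a still-pending timer, so it can still fire and inject
		 * its event on the source.
		 */
		if (hrtimer_active(&vcpu->arch.xen.timer))
			data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
		else
			data->u.timer.expires_ns = 0;
		r = 0;
		break;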
On 3/3/22 16:41, David Woodhouse wrote:
> This series adds event channel acceleration for Xen guests. In particular it allows guest vCPUs to send each other IPIs without having to bounce all the way out to the userspace VMM in order to do so. Likewise, the Xen singleshot timer is added, and a version of SCHEDOP_poll. Those major features are based on Joao and Boris' patches from 2019.
>
> Cleaning up the event delivery into the vcpu_info involved using the new gfn_to_pfn_cache for that, and that means I ended up doing so for *all* the places the guest can have a pvclock.
>
> v0: Proof-of-concept RFC
>
> v1:
>  • Drop the runstate fix which is merged now.
>  • Add Sean's gfn_to_pfn_cache API change at the start of the series.
>  • Add KVM self tests
>  • Minor bug fixes
>
> v2:
>  • Drop dirty handling from gfn_to_pfn_cache
>  • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
>
> v3:
>  • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
>  • Clean up a stray #if 1 in a part of the test case that was once being recalcitrant.
>  • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not* kvm_xen_has_pending_timer() which is checked from elsewhere.
>  • Fix warnings noted by the kernel test robot <lkp@intel.com>:
>    • Make kvm_xen_init_timer() static.
>    • Make timer delta calculation use an explicit s64 to fix 32-bit build.

I've seen this:

[1790637.031490] BUG: Bad page state in process qemu-kvm pfn:03401
[1790637.037503] page:0000000077fc41af refcount:0 mapcount:1 mapping:0000000000000000 index:0x7f4ab7e01 pfn:0x3401
[1790637.047592] head:0000000032101bf5 order:9 compound_mapcount:1 compound_pincount:0
[1790637.055250] anon flags: 0xfffffc009000e(referenced|uptodate|dirty|head|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
[1790637.065949] raw: 000fffffc0000000 ffffda4b800d0001 0000000000000903 dead000000000200
[1790637.073869] raw: 0000000000000100 0000000000000000 00000000ffffffff 0000000000000000
[1790637.081791] head: 000fffffc009000e dead000000000100 dead000000000122 ffffa0636279fb01
[1790637.089797] head: 00000007f4ab7e00 0000000000000000 00000000ffffffff 0000000000000000
[1790637.097795] page dumped because: nonzero compound_mapcount
[1790637.103455] Modules linked in: kvm_intel(OE) kvm(OE) overlay tun tls ib_core rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif iTCO_wdt intel_pmc_bxt irqbypass iTCO_vendor_support acpi_ipmi rapl dell_smbios ipmi_si mei_me intel_cstate dcdbas ipmi_devintf i2c_i801 intel_uncore dell_wmi_descriptor wmi_bmof mei lpc_ich intel_pch_thermal i2c_smbus ipmi_msghandler acpi_power_meter xfs crct10dif_pclmul i40e crc32_pclmul crc32c_intel megaraid_sas ghash_clmulni_intel tg3 mgag200 wmi fuse [last unloaded: kvm]
[1790637.162636] CPU: 12 PID: 3056318 Comm: qemu-kvm Kdump: loaded Tainted: G W IOE --------- --- 5.16.0-0.rc6.41.fc36.x86_64 #1
[1790637.174878] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.6.11 11/20/2018
[1790637.182618] Call Trace:
[1790637.185246] <TASK>
[1790637.187524] dump_stack_lvl+0x48/0x5e
[1790637.191373] bad_page.cold+0x63/0x94
[1790637.195123] free_tail_pages_check+0xbb/0x110
[1790637.199656] free_pcp_prepare+0x270/0x310
[1790637.203843] free_unref_page+0x1d/0x120
[1790637.207856] kvm_gfn_to_pfn_cache_refresh+0x2c2/0x400 [kvm]
[1790637.213662] kvm_setup_guest_pvclock+0x4b/0x180 [kvm]
[1790637.218913] kvm_guest_time_update+0x26d/0x330 [kvm]
[1790637.224080] vcpu_enter_guest+0x31c/0x1390 [kvm]
[1790637.228908] kvm_arch_vcpu_ioctl_run+0x132/0x830 [kvm]
[1790637.234254] kvm_vcpu_ioctl+0x270/0x680 [kvm]

followed by other badness with the same call stack:

[1790637.376127] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

I am absolutely not sure that this series is the culprit in any way, but anyway I'll try to reproduce (it happened at the end of a RHEL7.2 installation) and let you know. If not, it is something that already made its way to Linus.

Paolo
On Fri, 2022-03-25 at 19:19 +0100, Paolo Bonzini wrote:
> On 3/3/22 16:41, David Woodhouse wrote:
> > This series adds event channel acceleration for Xen guests. In particular it allows guest vCPUs to send each other IPIs without having to bounce all the way out to the userspace VMM in order to do so. Likewise, the Xen singleshot timer is added, and a version of SCHEDOP_poll. Those major features are based on Joao and Boris' patches from 2019.
> >
> > Cleaning up the event delivery into the vcpu_info involved using the new gfn_to_pfn_cache for that, and that means I ended up doing so for *all* the places the guest can have a pvclock.
> >
> > v0: Proof-of-concept RFC
> >
> > v1:
> >  • Drop the runstate fix which is merged now.
> >  • Add Sean's gfn_to_pfn_cache API change at the start of the series.
> >  • Add KVM self tests
> >  • Minor bug fixes
> >
> > v2:
> >  • Drop dirty handling from gfn_to_pfn_cache
> >  • Fix !CONFIG_KVM_XEN build and duplicate call to kvm_xen_init_vcpu()
> >
> > v3:
> >  • Add KVM_XEN_EVTCHN_RESET to clear all outbound ports.
> >  • Clean up a stray #if 1 in a part of the test case that was once being recalcitrant.
> >  • Check kvm_xen_has_pending_events() in kvm_vcpu_has_events() and *not* kvm_xen_has_pending_timer() which is checked from elsewhere.
> >  • Fix warnings noted by the kernel test robot <lkp@intel.com>:
> >    • Make kvm_xen_init_timer() static.
> >    • Make timer delta calculation use an explicit s64 to fix 32-bit build.
>
> I've seen this:
>
> [1790637.031490] BUG: Bad page state in process qemu-kvm pfn:03401
> [1790637.037503] page:0000000077fc41af refcount:0 mapcount:1 mapping:0000000000000000 index:0x7f4ab7e01 pfn:0x3401
> [1790637.047592] head:0000000032101bf5 order:9 compound_mapcount:1 compound_pincount:0
> [1790637.055250] anon flags: 0xfffffc009000e(referenced|uptodate|dirty|head|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
> [1790637.065949] raw: 000fffffc0000000 ffffda4b800d0001 0000000000000903 dead000000000200
> [1790637.073869] raw: 0000000000000100 0000000000000000 00000000ffffffff 0000000000000000
> [1790637.081791] head: 000fffffc009000e dead000000000100 dead000000000122 ffffa0636279fb01
> [1790637.089797] head: 00000007f4ab7e00 0000000000000000 00000000ffffffff 0000000000000000
> [1790637.097795] page dumped because: nonzero compound_mapcount
> [1790637.103455] Modules linked in: kvm_intel(OE) kvm(OE) overlay tun tls ib_core rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif iTCO_wdt intel_pmc_bxt irqbypass iTCO_vendor_support acpi_ipmi rapl dell_smbios ipmi_si mei_me intel_cstate dcdbas ipmi_devintf i2c_i801 intel_uncore dell_wmi_descriptor wmi_bmof mei lpc_ich intel_pch_thermal i2c_smbus ipmi_msghandler acpi_power_meter xfs crct10dif_pclmul i40e crc32_pclmul crc32c_intel megaraid_sas ghash_clmulni_intel tg3 mgag200 wmi fuse [last unloaded: kvm]
> [1790637.162636] CPU: 12 PID: 3056318 Comm: qemu-kvm Kdump: loaded Tainted: G W IOE --------- --- 5.16.0-0.rc6.41.fc36.x86_64 #1
> [1790637.174878] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.6.11 11/20/2018
> [1790637.182618] Call Trace:
> [1790637.185246] <TASK>
> [1790637.187524] dump_stack_lvl+0x48/0x5e
> [1790637.191373] bad_page.cold+0x63/0x94
> [1790637.195123] free_tail_pages_check+0xbb/0x110
> [1790637.199656] free_pcp_prepare+0x270/0x310
> [1790637.203843] free_unref_page+0x1d/0x120
> [1790637.207856] kvm_gfn_to_pfn_cache_refresh+0x2c2/0x400 [kvm]
> [1790637.213662] kvm_setup_guest_pvclock+0x4b/0x180 [kvm]
> [1790637.218913] kvm_guest_time_update+0x26d/0x330 [kvm]
> [1790637.224080] vcpu_enter_guest+0x31c/0x1390 [kvm]
> [1790637.228908] kvm_arch_vcpu_ioctl_run+0x132/0x830 [kvm]
> [1790637.234254] kvm_vcpu_ioctl+0x270/0x680 [kvm]
>
> followed by other badness with the same call stack:
>
> [1790637.376127] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>
> I am absolutely not sure that this series is the culprit in any way, but anyway I'll try to reproduce (it happened at the end of a RHEL7.2 installation) and let you know. If not, it is something that already made its way to Linus.

Hrm.... could it be a double/multiple free? This will come from __release_gpc() which is called from the end of kvm_gfn_to_pfn_cache_refresh() and which releases the *old* PFN.

How could we get there without... oh... could it be this?

--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
 	if (kvm_is_error_hva(gpc->uhva)) {
+		gpc->pfn = KVM_PFN_ERR_FAULT;
 		ret = -EFAULT;
 		goto out;
 	}
On Fri, 2022-03-25 at 19:57 +0000, David Woodhouse wrote:
> On Fri, 2022-03-25 at 19:19 +0100, Paolo Bonzini wrote:
> > I am absolutely not sure that this series is the culprit in any way, but anyway I'll try to reproduce (it happened at the end of a RHEL7.2 installation) and let you know. If not, it is something that already made its way to Linus.
>
> Hrm.... could it be a double/multiple free? This will come from __release_gpc() which is called from the end of kvm_gfn_to_pfn_cache_refresh() and which releases the *old* PFN.
>
> How could we get there without... oh... could it be this?
>
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>  	gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
> 
>  	if (kvm_is_error_hva(gpc->uhva)) {
> +		gpc->pfn = KVM_PFN_ERR_FAULT;
>  		ret = -EFAULT;
>  		goto out;
>  	}

If you're going to try to reproduce, better to do it like this instead I suppose:

--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -176,6 +176,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
 	if (kvm_is_error_hva(gpc->uhva)) {
+		printk("Imma free PFN %llx again later. Oops!\n", gpc->pfn);
 		ret = -EFAULT;
 		goto out;
 	}
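To make the suspected double-release sequence easier to follow, here is a tiny user-space model of the pattern described above. It is not KVM code and none of these names come from pfncache.c; it only illustrates why leaving gpc->pfn stale on the failed-lookup path lets the next refresh release the same page a second time.

#include <stdio.h>

struct page { int refcount; };
struct cache { struct page *page; };

static struct page pool[2];
static int next_page;

static struct page *get_page(void)
{
	struct page *p = &pool[next_page++ % 2];
	p->refcount++;
	return p;
}

static void release_page(struct page *p)
{
	if (!p)
		return;
	if (--p->refcount < 0)
		printf("BUG: page released twice (refcount %d)\n", p->refcount);
}

static int refresh(struct cache *c, int fail_lookup)
{
	struct page *old = c->page;	/* previous mapping captured up front */
	int ret = 0;

	if (fail_lookup) {
		/* The fix corresponds to clearing the stale pointer here
		 * (c->page = NULL); without it, 'old' is released below and
		 * the same stale pointer is released again next time. */
		ret = -1;
		goto out;
	}
	c->page = get_page();
out:
	release_page(old);	/* the old mapping is always dropped on exit */
	return ret;
}

int main(void)
{
	struct cache c = { 0 };

	refresh(&c, 0);	/* maps a page: refcount 1 */
	refresh(&c, 1);	/* lookup fails: old page dropped to refcount 0 */
	refresh(&c, 1);	/* stale pointer dropped again: refcount -1, BUG */
	return 0;
}

Running it prints the "released twice" line on the third refresh, which is the analogue of free_unref_page() in the trace above being handed a page whose reference had already been dropped.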