Message ID | 20220530170739.19072-10-avihaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | vfio/migration: Implement VFIO migration protocol v2 |
On 2022/5/31 1:07, Avihai Horon wrote:
> If vfio_migration_set_state() fails to set the device in the requested
> state it tries to put it in a recover state. If setting the device in
> the recover state fails as well, hw_error is triggered and the VM is
> aborted.
>
> To improve user experience and avoid VM data loss, reset the device with
> VFIO_DEVICE_RESET instead of aborting the VM.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/migration.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 852759e6ca..6c34502611 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -89,8 +89,16 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>          /* Try to put the device in some good state */
>          mig_state->device_state = recover_state;
>          if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -            hw_error("%s: Device in error state, can't recover",
> -                     vbasedev->name);
> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> +                hw_error("%s: Device in error state, can't recover",
> +                         vbasedev->name);
> +            }
> +
> +            error_report(
> +                "%s: Device was reset due to failure in changing device state to recover state %s",
> +                vbasedev->name, mig_state_to_str(recover_state));
> +
> +            return -1;
>          }
>

I tested QEMU 7.1.50 built with this patch series. After a live migration
failed because the destination VM was disconnected during the migration,
exiting the source QEMU produced the following error:

[100337.287047] BUG: Bad page state in process qemu-system-aar pfn:82199518
[100337.295815] page:00000000356de4da refcount:-2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x82199518
[100337.306403] flags: 0xbfff80000000000(node=0|zone=2|lastcpupid=0x7fff)
[100337.314091] raw: 0bfff80000000000 dead000000000100 dead000000000122 0000000000000000
[100337.322589] raw: 0000000000000000 0000000000000000 fffffffeffffffff 0000000000000000
[100337.330630] page dumped because: nonzero _refcount
[100337.335840] Modules linked in: hisi_acc_vfio_pci hisi_sec2 hisi_zip hisi_hpre hisi_qm uacce vfio_iommu_type1 vfio_pci vfio_pci_core vfio_virqfd vfio pv680_mii(O) [last unloaded: hisi_sec2]
[100337.354564] CPU: 1 PID: 786 Comm: qemu-system-aar Tainted: G B O 6.0.0-rc4+ #1
[100337.377378] Call trace:
[100337.380382] dump_backtrace.part.0+0xc4/0xd0
[100337.385791] show_stack+0x24/0x40
[100337.389478] dump_stack_lvl+0x68/0x84
[100337.394155] dump_stack+0x18/0x34
[100337.398006] bad_page+0xf0/0x120
[100337.401796] check_free_page_bad+0x84/0x90
[100337.406404] free_pcppages_bulk+0x1bc/0x2b0
[100337.411126] free_unref_page_commit+0x120/0x15c
[100337.416935] free_unref_page+0x15c/0x254
[100337.421436] free_compound_page+0x6c/0x100
[100337.425868] free_transhuge_page+0xd4/0x140
[100337.430535] destroy_large_folio+0x30/0x40
[100337.434953] release_pages+0x1bc/0x4d0
[100337.439268] free_pages_and_swap_cache+0x68/0x80
[100337.444224] tlb_batch_pages_flush+0x5c/0x94
[100337.448976] tlb_flush_mmu+0x4c/0xd4
[100337.453062] unmap_page_range+0x8d0/0xbd0
[100337.457432] unmap_single_vma+0x90/0x12c
[100337.461673] unmap_vmas+0x84/0xfc
[100337.465354] exit_mmap+0x88/0x1b0
[100337.469008] __mmput+0x48/0x134
[100337.472637] mmput+0x44/0x50
[100337.475857] do_exit+0x2b8/0x970
[100337.479641] do_group_exit+0x40/0xac
[100337.484079] get_signal+0x8c0/0x934
[100337.488215] do_notify_resume+0x1d0/0x1570
[100337.492795] el0_svc+0xa8/0xc0
[100337.496452] el0t_64_sync_handler+0x1ac/0x1b0
[100337.501187] el0t_64_sync+0x19c/0x1a0

Can anyone see what is causing this error?

>          error_report("%s: Failed changing device state to %s", vbasedev->name,
>

Thanks
Longfang.
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 852759e6ca..6c34502611 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -89,8 +89,16 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
         /* Try to put the device in some good state */
         mig_state->device_state = recover_state;
         if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-            hw_error("%s: Device in error state, can't recover",
-                     vbasedev->name);
+            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+                hw_error("%s: Device in error state, can't recover",
+                         vbasedev->name);
+            }
+
+            error_report(
+                "%s: Device was reset due to failure in changing device state to recover state %s",
+                vbasedev->name, mig_state_to_str(recover_state));
+
+            return -1;
         }
 
         error_report("%s: Failed changing device state to %s", vbasedev->name,
If vfio_migration_set_state() fails to set the device in the requested
state, it tries to put it in a recover state. If setting the device in
the recover state fails as well, hw_error is triggered and the VM is
aborted.

To improve user experience and avoid VM data loss, reset the device with
VFIO_DEVICE_RESET instead of aborting the VM.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
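
The fallback order described in the commit message (requested state, then recover state, then device reset, and only then a fatal error) can be sketched in isolation as follows. This is a minimal illustration and not the QEMU code: `set_state_with_fallback`, `struct device_ops`, `struct fake_device` and the result enum are hypothetical names introduced here, and the callbacks merely stand in for the `VFIO_DEVICE_FEATURE` and `VFIO_DEVICE_RESET` ioctls so that the control flow can be exercised without a real device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome codes; the real function returns an int error code. */
enum set_state_result {
    SET_STATE_OK,        /* requested state was reached */
    SET_STATE_RECOVERED, /* request failed, recover state was reached */
    SET_STATE_RESET,     /* recover state failed too, device was reset */
    SET_STATE_FATAL      /* reset failed as well (the patch calls hw_error() here) */
};

/* Hypothetical stand-ins for the two ioctls; each returns 0 on success. */
struct device_ops {
    int (*set_state)(void *opaque, int state);
    int (*reset)(void *opaque);
};

/* Try the requested state, then the recover state, then a device reset. */
static enum set_state_result
set_state_with_fallback(const struct device_ops *ops, void *opaque,
                        int new_state, int recover_state)
{
    assert(ops && ops->set_state && ops->reset);

    if (ops->set_state(opaque, new_state) == 0) {
        return SET_STATE_OK;
    }
    if (ops->set_state(opaque, recover_state) == 0) {
        fprintf(stderr, "failed changing device state to %d\n", new_state);
        return SET_STATE_RECOVERED;
    }
    if (ops->reset(opaque) == 0) {
        fprintf(stderr, "device was reset, recover state %d not reachable\n",
                recover_state);
        return SET_STATE_RESET;
    }
    return SET_STATE_FATAL;
}

/* Fake device for experiments: up to two states that always fail (-1 = none). */
struct fake_device {
    int failing[2];
    bool reset_fails;
    int reset_attempts;
};

static int fake_set_state(void *opaque, int state)
{
    struct fake_device *dev = opaque;
    return (state == dev->failing[0] || state == dev->failing[1]) ? -1 : 0;
}

static int fake_reset(void *opaque)
{
    struct fake_device *dev = opaque;
    dev->reset_attempts++;
    return dev->reset_fails ? -1 : 0;
}
```

With a `struct device_ops` built from `fake_set_state` and `fake_reset`, a fake device whose requested and recover states both fail yields `SET_STATE_RESET` after exactly one reset attempt. In the patch this corresponds to the `error_report()` followed by `return -1` path. Only when the reset itself fails does the sketch report `SET_STATE_FATAL`, the case where the patch still calls `hw_error()`.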