
[v2,09/11] vfio/migration: Reset device if setting recover state fails

Message ID 20220530170739.19072-10-avihaih@nvidia.com (mailing list archive)
State New, archived
Series vfio/migration: Implement VFIO migration protocol v2

Commit Message

Avihai Horon May 30, 2022, 5:07 p.m. UTC
If vfio_migration_set_state() fails to set the device in the requested
state it tries to put it in a recover state. If setting the device in
the recover state fails as well, hw_error is triggered and the VM is
aborted.

To improve user experience and avoid VM data loss, reset the device with
VFIO_DEVICE_RESET instead of aborting the VM.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
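
The fallback order described in the commit message (requested state, then
recover state, then device reset, and only then abort) can be sketched in
isolation as below. This is a minimal illustration, not the QEMU code: the
names `mock_dev`, `set_state_result` and `set_state_with_fallback` are
hypothetical, and the mock flags stand in for the results of the
VFIO_DEVICE_FEATURE and VFIO_DEVICE_RESET ioctl() calls. The real function
reports an error on the reset path (it returns -1 in the patch); the
distinct result codes here exist only to make each path visible.

```c
#include <stdio.h>

/* Hypothetical outcome codes for this sketch; not part of QEMU. */
enum set_state_result {
    SET_STATE_OK = 0,     /* requested state was set */
    SET_STATE_RECOVERED,  /* request failed, recover state was set */
    SET_STATE_RESET,      /* recover state failed too, device was reset */
    SET_STATE_FATAL       /* reset failed as well; QEMU would call hw_error() */
};

/* Hypothetical mock device: each flag is the value the matching
 * ioctl() would return (0 on success, non-zero on failure). */
struct mock_dev {
    int requested_fails;  /* setting the requested state */
    int recover_fails;    /* setting the recover state */
    int reset_fails;      /* resetting the device */
};

static int mock_set_state(const struct mock_dev *dev, int is_recover)
{
    return is_recover ? dev->recover_fails : dev->requested_fails;
}

static int mock_reset(const struct mock_dev *dev)
{
    return dev->reset_fails;
}

/* Try each step in order and stop at the first one that succeeds. */
static enum set_state_result set_state_with_fallback(const struct mock_dev *dev)
{
    if (mock_set_state(dev, 0) == 0) {
        return SET_STATE_OK;
    }

    /* Try to put the device in some good state. */
    if (mock_set_state(dev, 1) == 0) {
        fprintf(stderr, "failed changing device state, recover state set\n");
        return SET_STATE_RECOVERED;
    }

    /* New with this patch: reset the device before giving up. */
    if (mock_reset(dev) == 0) {
        fprintf(stderr, "device was reset after recover state failed\n");
        return SET_STATE_RESET;
    }

    /* Nothing worked; this is where the VM would be aborted. */
    return SET_STATE_FATAL;
}
```

With this ordering the abort is reached only when all three attempts fail,
which is the behavioral change the patch aims for.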

Comments

liulongfang Oct. 11, 2022, 1:41 a.m. UTC | #1
On 2022/5/31 1:07, Avihai Horon wrote:
> If vfio_migration_set_state() fails to set the device in the requested
> state it tries to put it in a recover state. If setting the device in
> the recover state fails as well, hw_error is triggered and the VM is
> aborted.
> 
> To improve user experience and avoid VM data loss, reset the device with
> VFIO_DEVICE_RESET instead of aborting the VM.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/migration.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 852759e6ca..6c34502611 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -89,8 +89,16 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>          /* Try to put the device in some good state */
>          mig_state->device_state = recover_state;
>          if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -            hw_error("%s: Device in error state, can't recover",
> -                     vbasedev->name);
> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> +                hw_error("%s: Device in error state, can't recover",
> +                         vbasedev->name);
> +            }
> +
> +            error_report(
> +                "%s: Device was reset due to failure in changing device state to recover state %s",
> +                vbasedev->name, mig_state_to_str(recover_state));
> +
> +            return -1;
>          }
> 

With QEMU 7.1.50 built with this patch series, I hit a problem: if live
migration fails because the destination VM is disconnected mid-migration,
then exiting the source QEMU afterwards produces the following kernel error:

[100337.287047] BUG: Bad page state in process qemu-system-aar  pfn:82199518
[100337.295815] page:00000000356de4da refcount:-2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x82199518
[100337.306403] flags: 0xbfff80000000000(node=0|zone=2|lastcpupid=0x7fff)
[100337.314091] raw: 0bfff80000000000 dead000000000100 dead000000000122 0000000000000000
[100337.322589] raw: 0000000000000000 0000000000000000 fffffffeffffffff 0000000000000000
[100337.330630] page dumped because: nonzero _refcount
[100337.335840] Modules linked in: hisi_acc_vfio_pci hisi_sec2 hisi_zip hisi_hpre hisi_qm uacce vfio_iommu_type1 vfio_pci vfio_pci_core vfio_virqfd vfio pv680_mii(O) [last unloaded: hisi_sec2]
[100337.354564] CPU: 1 PID: 786 Comm: qemu-system-aar Tainted: G    B      O      6.0.0-rc4+ #1
[100337.377378] Call trace:
[100337.380382]  dump_backtrace.part.0+0xc4/0xd0
[100337.385791]  show_stack+0x24/0x40
[100337.389478]  dump_stack_lvl+0x68/0x84
[100337.394155]  dump_stack+0x18/0x34
[100337.398006]  bad_page+0xf0/0x120
[100337.401796]  check_free_page_bad+0x84/0x90
[100337.406404]  free_pcppages_bulk+0x1bc/0x2b0
[100337.411126]  free_unref_page_commit+0x120/0x15c
[100337.416935]  free_unref_page+0x15c/0x254
[100337.421436]  free_compound_page+0x6c/0x100
[100337.425868]  free_transhuge_page+0xd4/0x140
[100337.430535]  destroy_large_folio+0x30/0x40
[100337.434953]  release_pages+0x1bc/0x4d0
[100337.439268]  free_pages_and_swap_cache+0x68/0x80
[100337.444224]  tlb_batch_pages_flush+0x5c/0x94
[100337.448976]  tlb_flush_mmu+0x4c/0xd4
[100337.453062]  unmap_page_range+0x8d0/0xbd0
[100337.457432]  unmap_single_vma+0x90/0x12c
[100337.461673]  unmap_vmas+0x84/0xfc
[100337.465354]  exit_mmap+0x88/0x1b0
[100337.469008]  __mmput+0x48/0x134
[100337.472637]  mmput+0x44/0x50
[100337.475857]  do_exit+0x2b8/0x970
[100337.479641]  do_group_exit+0x40/0xac
[100337.484079]  get_signal+0x8c0/0x934
[100337.488215]  do_notify_resume+0x1d0/0x1570
[100337.492795]  el0_svc+0xa8/0xc0
[100337.496452]  el0t_64_sync_handler+0x1ac/0x1b0
[100337.501187]  el0t_64_sync+0x19c/0x1a0

Can anyone see what is causing this error?

>          error_report("%s: Failed changing device state to %s", vbasedev->name,
> 
Thanks
Longfang.

Patch

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 852759e6ca..6c34502611 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -89,8 +89,16 @@  static int vfio_migration_set_state(VFIODevice *vbasedev,
         /* Try to put the device in some good state */
         mig_state->device_state = recover_state;
         if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-            hw_error("%s: Device in error state, can't recover",
-                     vbasedev->name);
+            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+                hw_error("%s: Device in error state, can't recover",
+                         vbasedev->name);
+            }
+
+            error_report(
+                "%s: Device was reset due to failure in changing device state to recover state %s",
+                vbasedev->name, mig_state_to_str(recover_state));
+
+            return -1;
         }
 
         error_report("%s: Failed changing device state to %s", vbasedev->name,