mbox series

[v2,0/2] Postcopy migration and vhost-user errors

Message ID 20240828100914.105728-1-ppandit@redhat.com (mailing list archive)
Headers show
Series Postcopy migration and vhost-user errors | expand

Message

Prasad Pandit Aug. 28, 2024, 10:09 a.m. UTC
From: Prasad Pandit <pjp@fedoraproject.org>

Hello,

* virsh(1) offers multiple options to initiate Postcopy migration:

    1) virsh migrate --postcopy --postcopy-after-precopy
    2) virsh migrate --postcopy + virsh migrate-postcopy
    3) virsh migrate --postcopy --timeout <N> --timeout-postcopy

When Postcopy migration is invoked via method (2) or (3) above,
the migrated guest on the destination host hangs sometimes.

* During Postcopy migration, multiple threads are spawned on the destination
host to start the guest and setup devices. One such thread starts vhost
device via vhost_dev_start() function and another called fault_thread handles
page faults in user space using kernel's userfaultfd(2) system.

* When fault_thread exits upon completion of Postcopy migration, it sends a
'postcopy_end' message to the vhost-user device. But sometimes 'postcopy_end'
message is sent while vhost device is being setup via vhost_dev_start().

     Thread-1                                  Thread-2

vhost_dev_start                        postcopy_ram_incoming_cleanup
 vhost_device_iotlb_miss                postcopy_notify
  vhost_backend_update_device_iotlb      vhost_user_postcopy_notifier
   vhost_user_send_device_iotlb_msg       vhost_user_postcopy_end
    process_message_reply                  process_message_reply
     vhost_user_read                        vhost_user_read
      vhost_user_read_header                 vhost_user_read_header
       "Fail to update device iotlb"          "Failed to receive reply to postcopy_end"

This creates confusion when vhost device receives 'postcopy_end' message while
it is still trying to update IOTLB entries.

This seems to leave the guest in a stranded/hung state because fault_thread
has exited saying Postcopy migration has ended, but vhost-device is probably
still expecting updates. QEMU logs following errors on the destination host
===
...
qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
qemu-kvm: vhost_user_postcopy_end: 700871,700900: Failed to receive reply to postcopy_end
qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x8 instead of 0x5.
qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x16 instead of 0x5.
qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
===

* Couple of patches here help to fix/handle these errors.

Thank you.
---
Prasad Pandit (2):
  vhost: fail device start if iotlb update fails
  vhost-user: add a request-reply lock

 hw/virtio/vhost-user.c         | 74 ++++++++++++++++++++++++++++++++++
 hw/virtio/vhost.c              |  6 ++-
 include/hw/virtio/vhost-user.h |  3 ++
 3 files changed, 82 insertions(+), 1 deletion(-)

--
2.46.0

Comments

Michael S. Tsirkin Sept. 10, 2024, 5:10 p.m. UTC | #1
On Wed, Aug 28, 2024 at 03:39:12PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit <pjp@fedoraproject.org>
> 
> Hello,
> 
> * virsh(1) offers multiple options to initiate Postcopy migration:
> 
>     1) virsh migrate --postcopy --postcopy-after-precopy
>     2) virsh migrate --postcopy + virsh migrate-postcopy
>     3) virsh migrate --postcopy --timeout <N> --timeout-postcopy
> 
> When Postcopy migration is invoked via method (2) or (3) above,
> the migrated guest on the destination host hangs sometimes.
> 
> * During Postcopy migration, multiple threads are spawned on the destination
> host to start the guest and setup devices. One such thread starts vhost
> device via vhost_dev_start() function and another called fault_thread handles
> page faults in user space using kernel's userfaultfd(2) system.
> 
> * When fault_thread exits upon completion of Postcopy migration, it sends a
> 'postcopy_end' message to the vhost-user device. But sometimes 'postcopy_end'
> message is sent while vhost device is being setup via vhost_dev_start().
> 
>      Thread-1                                  Thread-2
> 
> vhost_dev_start                        postcopy_ram_incoming_cleanup
>  vhost_device_iotlb_miss                postcopy_notify
>   vhost_backend_update_device_iotlb      vhost_user_postcopy_notifier
>    vhost_user_send_device_iotlb_msg       vhost_user_postcopy_end
>     process_message_reply                  process_message_reply
>      vhost_user_read                        vhost_user_read
>       vhost_user_read_header                 vhost_user_read_header
>        "Fail to update device iotlb"          "Failed to receive reply to postcopy_end"
> 
> This creates confusion when vhost device receives 'postcopy_end' message while
> it is still trying to update IOTLB entries.
> 
> This seems to leave the guest in a stranded/hung state because fault_thread
> has exited saying Postcopy migration has ended, but vhost-device is probably
> still expecting updates. QEMU logs following errors on the destination host
> ===
> ...
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_postcopy_end: 700871,700900: Failed to receive reply to postcopy_end
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x8 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x16 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> ===


So are we going to see a version with BQL?

> * Couple of patches here help to fix/handle these errors.
> 
> Thank you.
> ---
> Prasad Pandit (2):
>   vhost: fail device start if iotlb update fails
>   vhost-user: add a request-reply lock
> 
>  hw/virtio/vhost-user.c         | 74 ++++++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c              |  6 ++-
>  include/hw/virtio/vhost-user.h |  3 ++
>  3 files changed, 82 insertions(+), 1 deletion(-)
> 
> --
> 2.46.0
Prasad Pandit Sept. 11, 2024, 7:14 a.m. UTC | #2
Hello Michael,

On Tue, 10 Sept 2024 at 22:40, Michael S. Tsirkin <mst@redhat.com> wrote:
> So are we going to see a version with BQL?

===
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cdf9af4a4b..96ac0ed93b 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2079,6 +2079,7 @@ static int vhost_user_postcopy_end(struct
vhost_dev *dev, Error **errp)

     trace_vhost_user_postcopy_end_entry();

+    BQL_LOCK_GUARD();
     ret = vhost_user_write(dev, &msg, NULL, 0);
     if (ret < 0) {
         error_setg(errp, "Failed to send postcopy_end to vhost");
Michael S. Tsirkin Sept. 11, 2024, 9:46 a.m. UTC | #3
On Wed, Sep 11, 2024 at 12:44:59PM +0530, Prasad Pandit wrote:
> Hello Michael,
> 
> On Tue, 10 Sept 2024 at 22:40, Michael S. Tsirkin <mst@redhat.com> wrote:
> > So are we going to see a version with BQL?
> 
> ===
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index cdf9af4a4b..96ac0ed93b 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2079,6 +2079,7 @@ static int vhost_user_postcopy_end(struct
> vhost_dev *dev, Error **errp)
> 
>      trace_vhost_user_postcopy_end_entry();
> 
> +    BQL_LOCK_GUARD();
>      ret = vhost_user_write(dev, &msg, NULL, 0);
>      if (ret < 0) {
>          error_setg(errp, "Failed to send postcopy_end to vhost");
> -- 
> 2.43.5
> ===
> 
> * We ran the test with the above BQL patch, but it did not help to fix
> race errors. I'm continuing to debug it, will update here soon.
> 
> Thank you.
> ---
>   - Prasad


Thanks! I have a suspicion there's a path where we wait for the
migration thread until BQL.