[v4] vdpa: reset the backend device in the end of vhost_net_stop()

Message ID 1648776683-23739-1-git-send-email-qiudayu@archeros.com (mailing list archive)
State New, archived
Series [v4] vdpa: reset the backend device in the end of vhost_net_stop()

Commit Message

Michael Qiu April 1, 2022, 1:31 a.m. UTC
Currently, when a VM powers off, it triggers the vdpa
device (such as a mlx bluefield2 VF) to reset many times (with 1
datapath queue pair and one control queue, the reset is triggered 3
times), which leads to the following issue:

vhost VQ 2 ring restore failed: -22: Invalid argument (22)

This happens because vhost_net_stop() stops all vhost devices bound to
this virtio device, and in vhost_dev_stop() qemu tries to stop the
device and then stop the queue: vhost_virtqueue_stop().

In vhost_dev_stop(), qemu resets the device, which clears some flags
in the low-level driver; in the next loop iteration (stopping the
other vhost backends), qemu tries to stop the queue corresponding to
that vhost backend, and the driver finds that the VQ is invalid. This
is the root cause.

To solve the issue, vdpa should set the vring unready and remove the
reset operation from device stop: vhost_dev_start(hdev, false).

Also implement a new function, vhost_dev_reset(), which resets the
backend device only after all per-queue vhost devices have stopped.
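
In outline, the stop path for a vDPA backend then becomes:

    vhost_net_stop()
        for each queue pair (and cvq):
            vhost_net_stop_one()
                vhost_dev_stop()   /* set vrings unready, no reset */
        vhost_dev_reset()          /* one reset of the backend     */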

Signed-off-by: Michael Qiu <qiudayu@archeros.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
v4 --> v3
    Nothing changed. Because of an issue with mimecast,
    when the From: tag is different from the sender,
    some mail clients will take the patch as an
    attachment; RESEND v3 did not work, so resend
    the patch as v4

v3 --> v2:
    Call vhost_dev_reset() at the end of vhost_net_stop().

    Since the vDPA device needs to re-add the status bits
    VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
    simply add them inside vhost_vdpa_reset_device(); as the
    only caller of vhost_vdpa_reset_device() is
    vhost_net_stop(), this keeps the same behavior as before.

v2 --> v1:
   Implement a new function, vhost_dev_reset(), and
   reset the backend kernel device last.
---
 hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
 hw/virtio/vhost-vdpa.c    | 15 +++++++++------
 hw/virtio/vhost.c         | 15 ++++++++++++++-
 include/hw/virtio/vhost.h |  1 +
 4 files changed, 45 insertions(+), 10 deletions(-)

Comments

Jason Wang April 1, 2022, 2:53 a.m. UTC | #1
On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>
> Currently, when VM poweroff, it will trigger vdpa
> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
> queue pair and one control queue, triggered 3 times), this
> leads to below issue:
>
> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>
> This because in vhost_net_stop(), it will stop all vhost device bind to
> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
> , then stop the queue: vhost_virtqueue_stop().
>
> In vhost_dev_stop(), it resets the device, which clear some flags
> in low level driver, and in next loop(stop other vhost backends),
> qemu try to stop the queue corresponding to the vhost backend,
>  the driver finds that the VQ is invalied, this is the root cause.
>
> To solve the issue, vdpa should set vring unready, and
> remove reset ops in device stop: vhost_dev_start(hdev, false).
>
> and implement a new function vhost_dev_reset, only reset backend
> device after all vhost(per-queue) stoped.

Typo.

>
> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

Rethinking this patch: considering there are devices that don't
support set_vq_ready(), I wonder if we need

1) uAPI to tell the user space whether or not it supports set_vq_ready()
2) userspace to call SET_VRING_ENABLE() when the device supports it,
and otherwise use RESET.
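
Roughly, userspace could then do something like the sketch below. The
ioctls shown exist today, but VHOST_BACKEND_F_STOP is a made-up
feature bit used purely for illustration:

    uint64_t features = 0;

    /* Hypothetical capability probe; the bit name is illustrative. */
    if (!ioctl(vdpa_fd, VHOST_GET_BACKEND_FEATURES, &features) &&
        (features & (1ULL << VHOST_BACKEND_F_STOP))) {
        /* Per-vq suspend is supported: disable the vq individually. */
        struct vhost_vring_state state = { .index = idx, .num = 0 };
        ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &state);
    } else {
        /* No per-vq suspend: fall back to a full device reset. */
        uint8_t status = 0;
        ioctl(vdpa_fd, VHOST_VDPA_SET_STATUS, &status);
    }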

And for safety, I suggest tagging this as 7.1.

> ---
> v4 --> v3
>     Nothing changed, becasue of issue with mimecast,
>     when the From: tag is different from the sender,
>     the some mail client will take the patch as an
>     attachment, RESEND v3 does not work, So resend
>     the patch as v4
>
> v3 --> v2:
>     Call vhost_dev_reset() at the end of vhost_net_stop().
>
>     Since the vDPA device need re-add the status bit
>     VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
>     simply, add them inside vhost_vdpa_reset_device, and
>     the only way calling vhost_vdpa_reset_device is in
>     vhost_net_stop(), so it keeps the same behavior as before.
>
> v2 --> v1:
>    Implement a new function vhost_dev_reset,
>    reset the backend kernel device at last.
> ---
>  hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
>  hw/virtio/vhost-vdpa.c    | 15 +++++++++------
>  hw/virtio/vhost.c         | 15 ++++++++++++++-
>  include/hw/virtio/vhost.h |  1 +
>  4 files changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index 30379d2..422c9bf 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>      int total_notifiers = data_queue_pairs * 2 + cvq;
>      VirtIONet *n = VIRTIO_NET(dev);
>      int nvhosts = data_queue_pairs + cvq;
> -    struct vhost_net *net;
> +    struct vhost_net *net = NULL;
>      int r, e, i, index_end = data_queue_pairs * 2;
>      NetClientState *peer;
>
> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>  err_start:
>      while (--i >= 0) {
>          peer = qemu_get_peer(ncs , i);
> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> +
> +        net = get_vhost_net(peer);
> +
> +        vhost_net_stop_one(net, dev);
>      }
> +
> +    /* We only reset backend vdpa device */
> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> +        vhost_dev_reset(&net->dev);
> +    }
> +
>      e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>      if (e < 0) {
>          fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
>      VirtIONet *n = VIRTIO_NET(dev);
>      NetClientState *peer;
> +    struct vhost_net *net = NULL;
>      int total_notifiers = data_queue_pairs * 2 + cvq;
>      int nvhosts = data_queue_pairs + cvq;
>      int i, r;
> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>          } else {
>              peer = qemu_get_peer(ncs, n->max_queue_pairs);
>          }
> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> +
> +        net = get_vhost_net(peer);
> +
> +        vhost_net_stop_one(net, dev);
> +    }
> +
> +    /* We only reset backend vdpa device */
> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> +        vhost_dev_reset(&net->dev);
>      }

So we've already reset the device in vhost_vdpa_dev_start(), any
reason we need to do it again here?

>
>      r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index c5ed7a3..3ef0199 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>
>      ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>      trace_vhost_vdpa_reset_device(dev, status);
> +
> +    /* Add back this status, so that the device could work next time*/
> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> +                               VIRTIO_CONFIG_S_DRIVER);

This seems to contradict the semantics of reset.

> +
>      return ret;
>  }
>
> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>      return idx;
>  }
>
> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
>  {
>      int i;
>      trace_vhost_vdpa_set_vring_ready(dev);
>      for (i = 0; i < dev->nvqs; ++i) {
>          struct vhost_vring_state state = {
>              .index = dev->vq_index + i,
> -            .num = 1,
> +            .num = ready,
>          };
>          vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>      }
> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>          if (unlikely(!ok)) {
>              return -1;
>          }
> -        vhost_vdpa_set_vring_ready(dev);
> +        vhost_vdpa_set_vring_ready(dev, 1);
>      } else {
> +        vhost_vdpa_set_vring_ready(dev, 0);
>          ok = vhost_vdpa_svqs_stop(dev);
>          if (unlikely(!ok)) {
>              return -1;
> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>          memory_listener_register(&v->listener, &address_space_memory);
>          return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>      } else {
> -        vhost_vdpa_reset_device(dev);
> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> -                                   VIRTIO_CONFIG_S_DRIVER);
>          memory_listener_unregister(&v->listener);
>
>          return 0;
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index b643f42..7e0cdb6 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1820,7 +1820,6 @@ fail_features:
>  void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>  {
>      int i;
> -

Unnecessary changes.

>      /* should only be called after backend is connected */
>      assert(hdev->vhost_ops);
>
> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>
>      return -ENOSYS;
>  }
> +
> +int vhost_dev_reset(struct vhost_dev *hdev)
> +{

Let's use a separate patch for this.

Thanks

> +    int ret = 0;
> +
> +    /* should only be called after backend is connected */
> +    assert(hdev->vhost_ops);
> +
> +    if (hdev->vhost_ops->vhost_reset_device) {
> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
> +    }
> +
> +    return ret;
> +}
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 58a73e7..b8b7c20 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>  void vhost_dev_cleanup(struct vhost_dev *hdev);
>  int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
>  void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
> +int vhost_dev_reset(struct vhost_dev *hdev);
>  int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>  void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>
> --
> 1.8.3.1
>
>
>
Michael Qiu April 1, 2022, 3:20 a.m. UTC | #2
On 2022/4/1 10:53, Jason Wang wrote:
> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>>
>> Currently, when VM poweroff, it will trigger vdpa
>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>> queue pair and one control queue, triggered 3 times), this
>> leads to below issue:
>>
>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>
>> This because in vhost_net_stop(), it will stop all vhost device bind to
>> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
>> , then stop the queue: vhost_virtqueue_stop().
>>
>> In vhost_dev_stop(), it resets the device, which clear some flags
>> in low level driver, and in next loop(stop other vhost backends),
>> qemu try to stop the queue corresponding to the vhost backend,
>>   the driver finds that the VQ is invalied, this is the root cause.
>>
>> To solve the issue, vdpa should set vring unready, and
>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>
>> and implement a new function vhost_dev_reset, only reset backend
>> device after all vhost(per-queue) stoped.
> 
> Typo.
> 
>>
>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>> Acked-by: Jason Wang <jasowang@redhat.com>
> 
> Rethink this patch, consider there're devices that don't support
> set_vq_ready(). I wonder if we need
> 
> 1) uAPI to tell the user space whether or not it supports set_vq_ready()
> 2) userspace will call SET_VRING_ENABLE() when the device supports
> otherwise it will use RESET.

If the device does not support set_vq_ready() in the kernel, it will
trigger a kernel oops; at least the current kernel does not check
whether set_vq_ready has been implemented.

And I checked all the vdpa drivers in the kernel; all of them have
implemented this op.

So I think it is OK to call set_vq_ready() without a check.
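
For reference, the dispatch in drivers/vhost/vdpa.c looks roughly like
the following (a paraphrase, not an exact copy of the kernel source);
the op is invoked unconditionally:

    case VHOST_VDPA_SET_VRING_ENABLE:
            if (copy_from_user(&s, argp, sizeof(s)))
                    return -EFAULT;
            /* No NULL check: a parent that left set_vq_ready
             * unimplemented would oops right here. */
            ops->set_vq_ready(vdpa, idx, s.num);
            break;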

> 
> And for safety, I suggest tagging this as 7.1.
> 
>> ---
>> v4 --> v3
>>      Nothing changed, becasue of issue with mimecast,
>>      when the From: tag is different from the sender,
>>      the some mail client will take the patch as an
>>      attachment, RESEND v3 does not work, So resend
>>      the patch as v4
>>
>> v3 --> v2:
>>      Call vhost_dev_reset() at the end of vhost_net_stop().
>>
>>      Since the vDPA device need re-add the status bit
>>      VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
>>      simply, add them inside vhost_vdpa_reset_device, and
>>      the only way calling vhost_vdpa_reset_device is in
>>      vhost_net_stop(), so it keeps the same behavior as before.
>>
>> v2 --> v1:
>>     Implement a new function vhost_dev_reset,
>>     reset the backend kernel device at last.
>> ---
>>   hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
>>   hw/virtio/vhost-vdpa.c    | 15 +++++++++------
>>   hw/virtio/vhost.c         | 15 ++++++++++++++-
>>   include/hw/virtio/vhost.h |  1 +
>>   4 files changed, 45 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>> index 30379d2..422c9bf 100644
>> --- a/hw/net/vhost_net.c
>> +++ b/hw/net/vhost_net.c
>> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>       int total_notifiers = data_queue_pairs * 2 + cvq;
>>       VirtIONet *n = VIRTIO_NET(dev);
>>       int nvhosts = data_queue_pairs + cvq;
>> -    struct vhost_net *net;
>> +    struct vhost_net *net = NULL;
>>       int r, e, i, index_end = data_queue_pairs * 2;
>>       NetClientState *peer;
>>
>> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>   err_start:
>>       while (--i >= 0) {
>>           peer = qemu_get_peer(ncs , i);
>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>> +
>> +        net = get_vhost_net(peer);
>> +
>> +        vhost_net_stop_one(net, dev);
>>       }
>> +
>> +    /* We only reset backend vdpa device */
>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>> +        vhost_dev_reset(&net->dev);
>> +    }
>> +
>>       e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>       if (e < 0) {
>>           fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
>> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>       VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
>>       VirtIONet *n = VIRTIO_NET(dev);
>>       NetClientState *peer;
>> +    struct vhost_net *net = NULL;
>>       int total_notifiers = data_queue_pairs * 2 + cvq;
>>       int nvhosts = data_queue_pairs + cvq;
>>       int i, r;
>> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>           } else {
>>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>           }
>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>> +
>> +        net = get_vhost_net(peer);
>> +
>> +        vhost_net_stop_one(net, dev);
>> +    }
>> +
>> +    /* We only reset backend vdpa device */
>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>> +        vhost_dev_reset(&net->dev);
>>       }
> 
> So we've already reset the device in vhost_vdpa_dev_start(), any
> reason we need to do it again here?

The device is reset in vhost_vdpa_dev_start() only if there is an
error during start.


> 
>>
>>       r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index c5ed7a3..3ef0199 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>>
>>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>       trace_vhost_vdpa_reset_device(dev, status);
>> +
>> +    /* Add back this status, so that the device could work next time*/
>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>> +                               VIRTIO_CONFIG_S_DRIVER);
> 
> This seems to contradict the semantic of reset

Yes, but it's hard to put it anywhere else; it seems only vhost-vdpa
needs it. And for VM shutdown, qemu_del_nic() will do cleanup such as
closing the vhost fds, which triggers a reset in kernel space without
setting those status bits.

So in the end I put it here, with no other impact.

Thanks,
Michael
> 
>> +
>>       return ret;
>>   }
>>
>> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>>       return idx;
>>   }
>>
>> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
>> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
>>   {
>>       int i;
>>       trace_vhost_vdpa_set_vring_ready(dev);
>>       for (i = 0; i < dev->nvqs; ++i) {
>>           struct vhost_vring_state state = {
>>               .index = dev->vq_index + i,
>> -            .num = 1,
>> +            .num = ready,
>>           };
>>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>       }
>> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>           if (unlikely(!ok)) {
>>               return -1;
>>           }
>> -        vhost_vdpa_set_vring_ready(dev);
>> +        vhost_vdpa_set_vring_ready(dev, 1);
>>       } else {
>> +        vhost_vdpa_set_vring_ready(dev, 0);
>>           ok = vhost_vdpa_svqs_stop(dev);
>>           if (unlikely(!ok)) {
>>               return -1;
>> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>           memory_listener_register(&v->listener, &address_space_memory);
>>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>       } else {
>> -        vhost_vdpa_reset_device(dev);
>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>> -                                   VIRTIO_CONFIG_S_DRIVER);
>>           memory_listener_unregister(&v->listener);
>>
>>           return 0;
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index b643f42..7e0cdb6 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -1820,7 +1820,6 @@ fail_features:
>>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>>   {
>>       int i;
>> -
> 
> Unnecessary changes.
> 
>>       /* should only be called after backend is connected */
>>       assert(hdev->vhost_ops);
>>
>> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>>
>>       return -ENOSYS;
>>   }
>> +
>> +int vhost_dev_reset(struct vhost_dev *hdev)
>> +{
> 
> Let's use a separate patch for this.
> 
> Thanks
> 
>> +    int ret = 0;
>> +
>> +    /* should only be called after backend is connected */
>> +    assert(hdev->vhost_ops);
>> +
>> +    if (hdev->vhost_ops->vhost_reset_device) {
>> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
>> +    }
>> +
>> +    return ret;
>> +}
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index 58a73e7..b8b7c20 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>   void vhost_dev_cleanup(struct vhost_dev *hdev);
>>   int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
>>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
>> +int vhost_dev_reset(struct vhost_dev *hdev);
>>   int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>   void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>
>> --
>> 1.8.3.1
>>
>>
>>
> 
> 
>
Si-Wei Liu April 1, 2022, 11:07 p.m. UTC | #3
On 3/31/2022 7:53 PM, Jason Wang wrote:
> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>> Currently, when VM poweroff, it will trigger vdpa
>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>> queue pair and one control queue, triggered 3 times), this
>> leads to below issue:
>>
>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>
>> This because in vhost_net_stop(), it will stop all vhost device bind to
>> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
>> , then stop the queue: vhost_virtqueue_stop().
>>
>> In vhost_dev_stop(), it resets the device, which clear some flags
>> in low level driver, and in next loop(stop other vhost backends),
>> qemu try to stop the queue corresponding to the vhost backend,
>>   the driver finds that the VQ is invalied, this is the root cause.
>>
>> To solve the issue, vdpa should set vring unready, and
>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>
>> and implement a new function vhost_dev_reset, only reset backend
>> device after all vhost(per-queue) stoped.
> Typo.
>
>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>> Acked-by: Jason Wang <jasowang@redhat.com>
> Rethink this patch, consider there're devices that don't support
> set_vq_ready(). I wonder if we need
>
> 1) uAPI to tell the user space whether or not it supports set_vq_ready()
I guess what's more relevant here is to define the uAPI semantics for
unready, i.e. set_vq_ready(0), for stopping/resuming virtqueue
processing, as starting a vq is comparatively less ambiguous.
Considering the likelihood that this interface may be used for live
migration, it would be nice to come up with variants such as
1) discard inflight requests vs. 2) wait for inflight processing to
be done, and 3) a timeout on the wait.
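
Purely as an illustration of those semantics (none of these names
exist in the uAPI today), the variants could be encoded as:

    enum vdpa_vq_stop_mode {
        VDPA_VQ_STOP_DISCARD,   /* 1) drop inflight requests       */
        VDPA_VQ_STOP_DRAIN,     /* 2) wait for inflight completion */
    };

    struct vdpa_vq_stop {
        __u32 mode;             /* enum vdpa_vq_stop_mode          */
        __u32 timeout_ms;       /* 3) bound the wait; 0 = no limit */
    };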

> 2) userspace will call SET_VRING_ENABLE() when the device supports
> otherwise it will use RESET.
Are you looking to make the virtqueue resumable through the new
SET_VRING_ENABLE() uAPI?

I think RESET is inevitable in some cases, e.g. when the guest
initiates a device reset by writing 0 to the status register. For the
suspend/resume and live migration use cases, indeed RESET can be
substituted with SET_VRING_ENABLE. Again, it'd need quite some code
refactoring to accommodate this change. Although I'm all for it, it'd
be best to lay out the plan in multiple phases rather than overload
this single patch too much. You can count on my time for this endeavor
if you don't mind. :)

>
> And for safety, I suggest tagging this as 7.1.
+1

Regards,
-Siwei

>
>> ---
>> v4 --> v3
>>      Nothing changed, becasue of issue with mimecast,
>>      when the From: tag is different from the sender,
>>      the some mail client will take the patch as an
>>      attachment, RESEND v3 does not work, So resend
>>      the patch as v4
>>
>> v3 --> v2:
>>      Call vhost_dev_reset() at the end of vhost_net_stop().
>>
>>      Since the vDPA device need re-add the status bit
>>      VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
>>      simply, add them inside vhost_vdpa_reset_device, and
>>      the only way calling vhost_vdpa_reset_device is in
>>      vhost_net_stop(), so it keeps the same behavior as before.
>>
>> v2 --> v1:
>>     Implement a new function vhost_dev_reset,
>>     reset the backend kernel device at last.
>> ---
>>   hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
>>   hw/virtio/vhost-vdpa.c    | 15 +++++++++------
>>   hw/virtio/vhost.c         | 15 ++++++++++++++-
>>   include/hw/virtio/vhost.h |  1 +
>>   4 files changed, 45 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>> index 30379d2..422c9bf 100644
>> --- a/hw/net/vhost_net.c
>> +++ b/hw/net/vhost_net.c
>> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>       int total_notifiers = data_queue_pairs * 2 + cvq;
>>       VirtIONet *n = VIRTIO_NET(dev);
>>       int nvhosts = data_queue_pairs + cvq;
>> -    struct vhost_net *net;
>> +    struct vhost_net *net = NULL;
>>       int r, e, i, index_end = data_queue_pairs * 2;
>>       NetClientState *peer;
>>
>> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>   err_start:
>>       while (--i >= 0) {
>>           peer = qemu_get_peer(ncs , i);
>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>> +
>> +        net = get_vhost_net(peer);
>> +
>> +        vhost_net_stop_one(net, dev);
>>       }
>> +
>> +    /* We only reset backend vdpa device */
>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>> +        vhost_dev_reset(&net->dev);
>> +    }
>> +
>>       e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>       if (e < 0) {
>>           fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
>> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>       VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
>>       VirtIONet *n = VIRTIO_NET(dev);
>>       NetClientState *peer;
>> +    struct vhost_net *net = NULL;
>>       int total_notifiers = data_queue_pairs * 2 + cvq;
>>       int nvhosts = data_queue_pairs + cvq;
>>       int i, r;
>> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>           } else {
>>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>           }
>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>> +
>> +        net = get_vhost_net(peer);
>> +
>> +        vhost_net_stop_one(net, dev);
>> +    }
>> +
>> +    /* We only reset backend vdpa device */
>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>> +        vhost_dev_reset(&net->dev);
>>       }
> So we've already reset the device in vhost_vdpa_dev_start(), any
> reason we need to do it again here?
>
>>       r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index c5ed7a3..3ef0199 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>>
>>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>       trace_vhost_vdpa_reset_device(dev, status);
>> +
>> +    /* Add back this status, so that the device could work next time*/
>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>> +                               VIRTIO_CONFIG_S_DRIVER);
> This seems to contradict the semantic of reset.
>
>> +
>>       return ret;
>>   }
>>
>> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>>       return idx;
>>   }
>>
>> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
>> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
>>   {
>>       int i;
>>       trace_vhost_vdpa_set_vring_ready(dev);
>>       for (i = 0; i < dev->nvqs; ++i) {
>>           struct vhost_vring_state state = {
>>               .index = dev->vq_index + i,
>> -            .num = 1,
>> +            .num = ready,
>>           };
>>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>       }
>> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>           if (unlikely(!ok)) {
>>               return -1;
>>           }
>> -        vhost_vdpa_set_vring_ready(dev);
>> +        vhost_vdpa_set_vring_ready(dev, 1);
>>       } else {
>> +        vhost_vdpa_set_vring_ready(dev, 0);
>>           ok = vhost_vdpa_svqs_stop(dev);
>>           if (unlikely(!ok)) {
>>               return -1;
>> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>           memory_listener_register(&v->listener, &address_space_memory);
>>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>       } else {
>> -        vhost_vdpa_reset_device(dev);
>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>> -                                   VIRTIO_CONFIG_S_DRIVER);
>>           memory_listener_unregister(&v->listener);
>>
>>           return 0;
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index b643f42..7e0cdb6 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -1820,7 +1820,6 @@ fail_features:
>>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>>   {
>>       int i;
>> -
> Unnecessary changes.
>
>>       /* should only be called after backend is connected */
>>       assert(hdev->vhost_ops);
>>
>> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>>
>>       return -ENOSYS;
>>   }
>> +
>> +int vhost_dev_reset(struct vhost_dev *hdev)
>> +{
> Let's use a separate patch for this.
>
> Thanks
>
>> +    int ret = 0;
>> +
>> +    /* should only be called after backend is connected */
>> +    assert(hdev->vhost_ops);
>> +
>> +    if (hdev->vhost_ops->vhost_reset_device) {
>> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
>> +    }
>> +
>> +    return ret;
>> +}
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index 58a73e7..b8b7c20 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>   void vhost_dev_cleanup(struct vhost_dev *hdev);
>>   int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
>>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
>> +int vhost_dev_reset(struct vhost_dev *hdev);
>>   int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>   void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>
>> --
>> 1.8.3.1
>>
>>
>>
Jason Wang April 2, 2022, 1:48 a.m. UTC | #4
On Fri, Apr 1, 2022 at 11:22 AM Michael Qiu <qiudayu@archeros.com> wrote:
>
>
>
> On 2022/4/1 10:53, Jason Wang wrote:
> > On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
> >>
> >> Currently, when VM poweroff, it will trigger vdpa
> >> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
> >> queue pair and one control queue, triggered 3 times), this
> >> leads to below issue:
> >>
> >> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
> >>
> >> This because in vhost_net_stop(), it will stop all vhost device bind to
> >> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
> >> , then stop the queue: vhost_virtqueue_stop().
> >>
> >> In vhost_dev_stop(), it resets the device, which clear some flags
> >> in low level driver, and in next loop(stop other vhost backends),
> >> qemu try to stop the queue corresponding to the vhost backend,
> >>   the driver finds that the VQ is invalied, this is the root cause.
> >>
> >> To solve the issue, vdpa should set vring unready, and
> >> remove reset ops in device stop: vhost_dev_start(hdev, false).
> >>
> >> and implement a new function vhost_dev_reset, only reset backend
> >> device after all vhost(per-queue) stoped.
> >
> > Typo.
> >
> >>
> >> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
> >> Acked-by: Jason Wang <jasowang@redhat.com>
> >
> > Rethink this patch, consider there're devices that don't support
> > set_vq_ready(). I wonder if we need
> >
> > 1) uAPI to tell the user space whether or not it supports set_vq_ready()
> > 2) userspace will call SET_VRING_ENABLE() when the device supports
> > otherwise it will use RESET.
>
> if the device does not support set_vq_ready() in kernel, it will trigger
> kernel oops, at least in current kernel, it does not check where
> set_vq_ready has been implemented.
>
> And I checked all vdpa driver in kernel, all drivers has implemented
> this ops.

Actually, it's not about whether or not set_vq_ready() is
implemented. It's about whether the parent supports it correctly:

the ability to suspend and resume a virtqueue is currently beyond the
ability of some transports (e.g. PCI).

For IFCVF:

static void ifcvf_vdpa_set_vq_ready(struct vdpa_device *vdpa_dev,
                                    u16 qid, bool ready)
{
        struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);

        vf->vring[qid].ready = ready;
}

It seems to follow the PCI transport, so if you just set it to zero,
it simply doesn't work at all. I can see some tricks that are used in
the DPDK driver; maybe we can use the same ones to "fix" this.

For VDUSE, we are basically the same:

static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
                                        u16 idx, bool ready)
{
        struct vduse_dev *dev = vdpa_to_vduse(vdpa);
        struct vduse_virtqueue *vq = &dev->vqs[idx];

        vq->ready = ready;
}

It can't be stopped correctly if we just set it to zero.

For vp_vdpa, it basically wants to abuse queue_enable, which may
result in a warning in QEMU (and the device isn't stopped).

static void vp_vdpa_set_vq_ready(struct vdpa_device *vdpa,
                                 u16 qid, bool ready)
{
        struct virtio_pci_modern_device *mdev = vdpa_to_mdev(vdpa);

        vp_modern_set_queue_enable(mdev, qid, ready);
}

ENI does a trick of writing 0 to the virtqueue address, so it works
for stop but not for start.

static void eni_vdpa_set_vq_ready(struct vdpa_device *vdpa, u16 qid,
                                  bool ready)
{
        struct virtio_pci_legacy_device *ldev = vdpa_to_ldev(vdpa);

        /* ENI is a legacy virtio-pci device. This is not supported
         * by specification. But we can disable virtqueue by setting
         * address to 0.
         */
        if (!ready)
                vp_legacy_set_queue_address(ldev, qid, 0);
}

mlx5 calls suspend_vq(), which should be fine.

The simulator is probably fine.

So I worry that if we use set_vq_ready(0) it won't work correctly and
will have other issues. The idea is:

- advertise the suspend/resume capability via a uAPI, so mlx5_vdpa and
the simulator can go with set_vq_ready()
- the others can still go with reset(), and we can try to fix them
gradually (and introduce this in the virtio spec).
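
On the QEMU side that could look something like the sketch below,
where vhost_vdpa_has_suspend() is a made-up probe for the capability
advertised via the new uAPI:

    static int vhost_vdpa_stop_vrings(struct vhost_dev *dev)
    {
        if (vhost_vdpa_has_suspend(dev)) {             /* hypothetical */
            return vhost_vdpa_set_vring_ready(dev, 0); /* per-vq unready */
        }
        return vhost_vdpa_reset_device(dev);           /* full reset */
    }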

>
> So I think it is OK to call set_vq_ready without check.
>
> >
> > And for safety, I suggest tagging this as 7.1.
> >
> >> ---
> >> v4 --> v3
> >>      Nothing changed, becasue of issue with mimecast,
> >>      when the From: tag is different from the sender,
> >>      the some mail client will take the patch as an
> >>      attachment, RESEND v3 does not work, So resend
> >>      the patch as v4
> >>
> >> v3 --> v2:
> >>      Call vhost_dev_reset() at the end of vhost_net_stop().
> >>
> >>      Since the vDPA device need re-add the status bit
> >>      VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
> >>      simply, add them inside vhost_vdpa_reset_device, and
> >>      the only way calling vhost_vdpa_reset_device is in
> >>      vhost_net_stop(), so it keeps the same behavior as before.
> >>
> >> v2 --> v1:
> >>     Implement a new function vhost_dev_reset,
> >>     reset the backend kernel device at last.
> >> ---
> >>   hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
> >>   hw/virtio/vhost-vdpa.c    | 15 +++++++++------
> >>   hw/virtio/vhost.c         | 15 ++++++++++++++-
> >>   include/hw/virtio/vhost.h |  1 +
> >>   4 files changed, 45 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >> index 30379d2..422c9bf 100644
> >> --- a/hw/net/vhost_net.c
> >> +++ b/hw/net/vhost_net.c
> >> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>       int total_notifiers = data_queue_pairs * 2 + cvq;
> >>       VirtIONet *n = VIRTIO_NET(dev);
> >>       int nvhosts = data_queue_pairs + cvq;
> >> -    struct vhost_net *net;
> >> +    struct vhost_net *net = NULL;
> >>       int r, e, i, index_end = data_queue_pairs * 2;
> >>       NetClientState *peer;
> >>
> >> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>   err_start:
> >>       while (--i >= 0) {
> >>           peer = qemu_get_peer(ncs , i);
> >> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> >> +
> >> +        net = get_vhost_net(peer);
> >> +
> >> +        vhost_net_stop_one(net, dev);
> >>       }
> >> +
> >> +    /* We only reset backend vdpa device */
> >> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> >> +        vhost_dev_reset(&net->dev);
> >> +    }
> >> +
> >>       e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
> >>       if (e < 0) {
> >>           fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
> >> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
> >>       VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
> >>       VirtIONet *n = VIRTIO_NET(dev);
> >>       NetClientState *peer;
> >> +    struct vhost_net *net = NULL;
> >>       int total_notifiers = data_queue_pairs * 2 + cvq;
> >>       int nvhosts = data_queue_pairs + cvq;
> >>       int i, r;
> >> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
> >>           } else {
> >>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >>           }
> >> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> >> +
> >> +        net = get_vhost_net(peer);
> >> +
> >> +        vhost_net_stop_one(net, dev);
> >> +    }
> >> +
> >> +    /* We only reset backend vdpa device */
> >> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> >> +        vhost_dev_reset(&net->dev);
> >>       }
> >
> > So we've already reset the device in vhost_vdpa_dev_start(), any
> > reason we need to do it again here?
>
> reset device in vhost_vdpa_dev_start if there is some error with start.

The reset should have been done in vhost_net_stop_one()?

>
>
> >
> >>
> >>       r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index c5ed7a3..3ef0199 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
> >>
> >>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
> >>       trace_vhost_vdpa_reset_device(dev, status);
> >> +
> >> +    /* Add back this status, so that the device could work next time*/
> >> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >> +                               VIRTIO_CONFIG_S_DRIVER);
> >
> > This seems to contradict the semantic of reset
>
> Yes, but it's hard to put it in other place, seems only vhost-vdpa need
> it, and for VM shutdown, qemu_del_nic() will do cleanup this like close
> vhost fds, which will call reset in kernel space without set those features.
>
> So at last I put it here with no other inpact.

Can we move this to a suitable caller of this function?

Thanks

>
> Thanks,
> Michael
> >
> >> +
> >>       return ret;
> >>   }
> >>
> >> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
> >>       return idx;
> >>   }
> >>
> >> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
> >> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
> >>   {
> >>       int i;
> >>       trace_vhost_vdpa_set_vring_ready(dev);
> >>       for (i = 0; i < dev->nvqs; ++i) {
> >>           struct vhost_vring_state state = {
> >>               .index = dev->vq_index + i,
> >> -            .num = 1,
> >> +            .num = ready,
> >>           };
> >>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> >>       }
> >> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>           if (unlikely(!ok)) {
> >>               return -1;
> >>           }
> >> -        vhost_vdpa_set_vring_ready(dev);
> >> +        vhost_vdpa_set_vring_ready(dev, 1);
> >>       } else {
> >> +        vhost_vdpa_set_vring_ready(dev, 0);
> >>           ok = vhost_vdpa_svqs_stop(dev);
> >>           if (unlikely(!ok)) {
> >>               return -1;
> >> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>           memory_listener_register(&v->listener, &address_space_memory);
> >>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >>       } else {
> >> -        vhost_vdpa_reset_device(dev);
> >> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >> -                                   VIRTIO_CONFIG_S_DRIVER);
> >>           memory_listener_unregister(&v->listener);
> >>
> >>           return 0;
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index b643f42..7e0cdb6 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -1820,7 +1820,6 @@ fail_features:
> >>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
> >>   {
> >>       int i;
> >> -
> >
> > Unnecessary changes.
> >
> >>       /* should only be called after backend is connected */
> >>       assert(hdev->vhost_ops);
> >>
> >> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >>
> >>       return -ENOSYS;
> >>   }
> >> +
> >> +int vhost_dev_reset(struct vhost_dev *hdev)
> >> +{
> >
> > Let's use a separate patch for this.
> >
> > Thanks
> >
> >> +    int ret = 0;
> >> +
> >> +    /* should only be called after backend is connected */
> >> +    assert(hdev->vhost_ops);
> >> +
> >> +    if (hdev->vhost_ops->vhost_reset_device) {
> >> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >> index 58a73e7..b8b7c20 100644
> >> --- a/include/hw/virtio/vhost.h
> >> +++ b/include/hw/virtio/vhost.h
> >> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >>   void vhost_dev_cleanup(struct vhost_dev *hdev);
> >>   int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
> >> +int vhost_dev_reset(struct vhost_dev *hdev);
> >>   int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>   void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>
> >> --
> >> 1.8.3.1
> >>
> >>
> >>
> >
> >
> >
>
>
>
Jason Wang April 2, 2022, 2:20 a.m. UTC | #5
Adding Michael.

On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 3/31/2022 7:53 PM, Jason Wang wrote:
> > On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
> >> Currently, when VM poweroff, it will trigger vdpa
> >> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
> >> queue pair and one control queue, triggered 3 times), this
> >> leads to below issue:
> >>
> >> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
> >>
> >> This because in vhost_net_stop(), it will stop all vhost device bind to
> >> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
> >> , then stop the queue: vhost_virtqueue_stop().
> >>
> >> In vhost_dev_stop(), it resets the device, which clear some flags
> >> in low level driver, and in next loop(stop other vhost backends),
> >> qemu try to stop the queue corresponding to the vhost backend,
> >>   the driver finds that the VQ is invalied, this is the root cause.
> >>
> >> To solve the issue, vdpa should set vring unready, and
> >> remove reset ops in device stop: vhost_dev_start(hdev, false).
> >>
> >> and implement a new function vhost_dev_reset, only reset backend
> >> device after all vhost(per-queue) stoped.
> > Typo.
> >
> >> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
> >> Acked-by: Jason Wang <jasowang@redhat.com>
> > Rethink this patch, consider there're devices that don't support
> > set_vq_ready(). I wonder if we need
> >
> > 1) uAPI to tell the user space whether or not it supports set_vq_ready()
> I guess what's more relevant here is to define the uAPI semantics for
> unready i.e. set_vq_ready(0) for resuming/stopping virtqueue processing,
> as starting vq is comparatively less ambiguous.

Yes.

> Considering the
> likelihood that this interface may be used for live migration, it would
> be nice to come up with variants such as 1) discard inflight request
> v.s. 2) waiting for inflight processing to be done,

Or inflight descriptor reporting (which seems to be tricky). But we
can start with net, where discarding may just work.

>and 3) timeout in
> waiting.

Actually, that's the plan, and Eugenio is proposing something like
this via the virtio spec:

https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html
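
Very roughly, the direction of that proposal is a new device status
bit that the driver sets and then polls until the device confirms the
stop; the bit value below is a placeholder, not taken from the
proposal:

    #define VIRTIO_CONFIG_S_STOP  0x20   /* placeholder value */

    /* Request a stop; unlike reset, device and vq state is kept. */
    vdev->set_status(vdev, vdev->get_status(vdev) | VIRTIO_CONFIG_S_STOP);
    while (!(vdev->get_status(vdev) & VIRTIO_CONFIG_S_STOP)) {
        /* wait for the device to acknowledge the stop */
    }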

>
> > 2) userspace will call SET_VRING_ENABLE() when the device supports
> > otherwise it will use RESET.
> Are you looking to making virtqueue resume-able through the new
> SET_VRING_ENABLE() uAPI?
>
> I think RESET is inevitable in some case, i.e. when guest initiates
> device reset by writing 0 to the status register.

Yes, that's exactly my plan.

> For suspend/resume and
> live migration use cases, indeed RESET can be substituted with
> SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
> accommodate this change. Although I'm all for it, it'd be the best to
> lay out the plan for multiple phases rather than overload this single
> patch too much. You can count my time on this endeavor if you don't mind. :)

You're welcome. I agree we should first choose a way to go:

1) manage to use SET_VRING_ENABLE (more of a workaround anyway)
2) go with the virtio spec (may take a while)
3) don't wait for the spec, and have a vDPA-specific uAPI first. Note
that I've chatted with most of the vendors and they seem to be fine
with _S_STOP. If we go this way, we can still provide forward
compatibility for _S_STOP
4) or do them all (in parallel)

Any thoughts?

Thanks

>
> >
> > And for safety, I suggest tagging this as 7.1.
> +1
>
> Regards,
> -Siwei
>
> >
> >> ---
> >> v4 --> v3
> >>      Nothing changed, becasue of issue with mimecast,
> >>      when the From: tag is different from the sender,
> >>      the some mail client will take the patch as an
> >>      attachment, RESEND v3 does not work, So resend
> >>      the patch as v4
> >>
> >> v3 --> v2:
> >>      Call vhost_dev_reset() at the end of vhost_net_stop().
> >>
> >>      Since the vDPA device need re-add the status bit
> >>      VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
> >>      simply, add them inside vhost_vdpa_reset_device, and
> >>      the only way calling vhost_vdpa_reset_device is in
> >>      vhost_net_stop(), so it keeps the same behavior as before.
> >>
> >> v2 --> v1:
> >>     Implement a new function vhost_dev_reset,
> >>     reset the backend kernel device at last.
> >> ---
> >>   hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
> >>   hw/virtio/vhost-vdpa.c    | 15 +++++++++------
> >>   hw/virtio/vhost.c         | 15 ++++++++++++++-
> >>   include/hw/virtio/vhost.h |  1 +
> >>   4 files changed, 45 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >> index 30379d2..422c9bf 100644
> >> --- a/hw/net/vhost_net.c
> >> +++ b/hw/net/vhost_net.c
> >> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>       int total_notifiers = data_queue_pairs * 2 + cvq;
> >>       VirtIONet *n = VIRTIO_NET(dev);
> >>       int nvhosts = data_queue_pairs + cvq;
> >> -    struct vhost_net *net;
> >> +    struct vhost_net *net = NULL;
> >>       int r, e, i, index_end = data_queue_pairs * 2;
> >>       NetClientState *peer;
> >>
> >> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
> >>   err_start:
> >>       while (--i >= 0) {
> >>           peer = qemu_get_peer(ncs , i);
> >> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> >> +
> >> +        net = get_vhost_net(peer);
> >> +
> >> +        vhost_net_stop_one(net, dev);
> >>       }
> >> +
> >> +    /* We only reset backend vdpa device */
> >> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> >> +        vhost_dev_reset(&net->dev);
> >> +    }
> >> +
> >>       e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
> >>       if (e < 0) {
> >>           fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
> >> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
> >>       VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
> >>       VirtIONet *n = VIRTIO_NET(dev);
> >>       NetClientState *peer;
> >> +    struct vhost_net *net = NULL;
> >>       int total_notifiers = data_queue_pairs * 2 + cvq;
> >>       int nvhosts = data_queue_pairs + cvq;
> >>       int i, r;
> >> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
> >>           } else {
> >>               peer = qemu_get_peer(ncs, n->max_queue_pairs);
> >>           }
> >> -        vhost_net_stop_one(get_vhost_net(peer), dev);
> >> +
> >> +        net = get_vhost_net(peer);
> >> +
> >> +        vhost_net_stop_one(net, dev);
> >> +    }
> >> +
> >> +    /* We only reset backend vdpa device */
> >> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
> >> +        vhost_dev_reset(&net->dev);
> >>       }
> > So we've already reset the device in vhost_vdpa_dev_start(), any
> > reason we need to do it again here?
> >
> >>       r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index c5ed7a3..3ef0199 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
> >>
> >>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
> >>       trace_vhost_vdpa_reset_device(dev, status);
> >> +
> >> +    /* Add back this status, so that the device could work next time*/
> >> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >> +                               VIRTIO_CONFIG_S_DRIVER);
> > This seems to contradict the semantic of reset.
> >
> >> +
> >>       return ret;
> >>   }
> >>
> >> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
> >>       return idx;
> >>   }
> >>
> >> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
> >> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
> >>   {
> >>       int i;
> >>       trace_vhost_vdpa_set_vring_ready(dev);
> >>       for (i = 0; i < dev->nvqs; ++i) {
> >>           struct vhost_vring_state state = {
> >>               .index = dev->vq_index + i,
> >> -            .num = 1,
> >> +            .num = ready,
> >>           };
> >>           vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
> >>       }
> >> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>           if (unlikely(!ok)) {
> >>               return -1;
> >>           }
> >> -        vhost_vdpa_set_vring_ready(dev);
> >> +        vhost_vdpa_set_vring_ready(dev, 1);
> >>       } else {
> >> +        vhost_vdpa_set_vring_ready(dev, 0);
> >>           ok = vhost_vdpa_svqs_stop(dev);
> >>           if (unlikely(!ok)) {
> >>               return -1;
> >> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>           memory_listener_register(&v->listener, &address_space_memory);
> >>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >>       } else {
> >> -        vhost_vdpa_reset_device(dev);
> >> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >> -                                   VIRTIO_CONFIG_S_DRIVER);
> >>           memory_listener_unregister(&v->listener);
> >>
> >>           return 0;
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index b643f42..7e0cdb6 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -1820,7 +1820,6 @@ fail_features:
> >>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
> >>   {
> >>       int i;
> >> -
> > Unnecessary changes.
> >
> >>       /* should only be called after backend is connected */
> >>       assert(hdev->vhost_ops);
> >>
> >> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >>
> >>       return -ENOSYS;
> >>   }
> >> +
> >> +int vhost_dev_reset(struct vhost_dev *hdev)
> >> +{
> > Let's use a separate patch for this.
> >
> > Thanks
> >
> >> +    int ret = 0;
> >> +
> >> +    /* should only be called after backend is connected */
> >> +    assert(hdev->vhost_ops);
> >> +
> >> +    if (hdev->vhost_ops->vhost_reset_device) {
> >> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >> index 58a73e7..b8b7c20 100644
> >> --- a/include/hw/virtio/vhost.h
> >> +++ b/include/hw/virtio/vhost.h
> >> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
> >>   void vhost_dev_cleanup(struct vhost_dev *hdev);
> >>   int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
> >> +int vhost_dev_reset(struct vhost_dev *hdev);
> >>   int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>   void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
> >>
> >> --
> >> 1.8.3.1
> >>
> >>
> >>
>
Michael Qiu April 2, 2022, 3:43 a.m. UTC | #6
On 2022/4/2 9:48, Jason Wang wrote:
> On Fri, Apr 1, 2022 at 11:22 AM Michael Qiu <qiudayu@archeros.com> wrote:
>>
>>
>>
>> On 2022/4/1 10:53, Jason Wang wrote:
>>> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>>>>
>>>> Currently, when VM poweroff, it will trigger vdpa
>>>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>>>> queue pair and one control queue, triggered 3 times), this
>>>> leads to below issue:
>>>>
>>>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>>>
>>>> This because in vhost_net_stop(), it will stop all vhost device bind to
>>>> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
>>>> , then stop the queue: vhost_virtqueue_stop().
>>>>
>>>> In vhost_dev_stop(), it resets the device, which clear some flags
>>>> in low level driver, and in next loop(stop other vhost backends),
>>>> qemu try to stop the queue corresponding to the vhost backend,
>>>>    the driver finds that the VQ is invalied, this is the root cause.
>>>>
>>>> To solve the issue, vdpa should set vring unready, and
>>>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>>>
>>>> and implement a new function vhost_dev_reset, only reset backend
>>>> device after all vhost(per-queue) stoped.
>>>
>>> Typo.
>>>
>>>>
>>>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>>
>>> Rethink this patch, consider there're devices that don't support
>>> set_vq_ready(). I wonder if we need
>>>
>>> 1) uAPI to tell the user space whether or not it supports set_vq_ready()
>>> 2) userspace will call SET_VRING_ENABLE() when the device supports
>>> otherwise it will use RESET.
>>
>> if the device does not support set_vq_ready() in kernel, it will trigger
>> kernel oops, at least in current kernel, it does not check where
>> set_vq_ready has been implemented.
>>
>> And I checked all vdpa driver in kernel, all drivers has implemented
>> this ops.
> 
> Actually, it's not about whether or not the set_vq_ready() is
> implemented. It's about whether the parent supports it correctly:
> 
> The ability to suspend and resume a virtqueue is currently beyond the
> ability of some transport (e.g PCI).
> 

OK, got it.

> For IFCVF:
> 
> static void ifcvf_vdpa_set_vq_ready(struct vdpa_device *vdpa_dev,
>                                      u16 qid, bool ready)
> {
>          struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);
> 
>          vf->vring[qid].ready = ready;
> }
> 
> It seems to follow the PCI transport, so if you just set it to zero,
> it simply doesn't work at all. I can see some tricks that are used in
> the DPDK driver, maybe we can use the same to "fix" this.
> 
> For VDUSE, we are basically the same:
> 
> static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>                                          u16 idx, bool ready)
> {
>          struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>          struct vduse_virtqueue *vq = &dev->vqs[idx];
> 
>          vq->ready = ready;
> }
> 
> It can't be stopped correctly if we just set it to zero.
> 
> For vp_vdpa, it basically wants to abuse the queue_enable, which may
> result in a warning in Qemu (and the device isn't stopped).
> 
> static void vp_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>                                   u16 qid, bool ready)
> {
>          struct virtio_pci_modern_device *mdev = vdpa_to_mdev(vdpa);
> 
>          vp_modern_set_queue_enable(mdev, qid, ready);
> }
> 
> ENI did a trick in writing 0 to virtqueue address, so it works for
> stop but not the start.
> 
> static void eni_vdpa_set_vq_ready(struct vdpa_device *vdpa, u16 qid,
>                                    bool ready)
> {
>          struct virtio_pci_legacy_device *ldev = vdpa_to_ldev(vdpa);
> 
>          /* ENI is a legacy virtio-pci device. This is not supported
>           * by specification. But we can disable virtqueue by setting
>           * address to 0.
>           */
>          if (!ready)
>                  vp_legacy_set_queue_address(ldev, qid, 0);
> }
> 
> mlx5 call suspend_vq() which should be fine.
> 
> Simulator is probably fine.
> 
> So I worry if we use the set_vq_ready(0) it won't work correctly and
> will have other issues. The idea is:
> 
> - advertise the suspend/resume capability via uAPI, then mlx5_vdpa and
> simulator can go with set_vq_ready()
> - others can still go with reset(), and we can try to fix them
> gradually (and introduce this in the virtio spec).
> 

Totally agree.

>>
>> So I think it is OK to call set_vq_ready without check.
>>
>>>
>>> And for safety, I suggest tagging this as 7.1.
>>>
>>>> ---
>>>> v4 --> v3
>>>>       Nothing changed, becasue of issue with mimecast,
>>>>       when the From: tag is different from the sender,
>>>>       the some mail client will take the patch as an
>>>>       attachment, RESEND v3 does not work, So resend
>>>>       the patch as v4
>>>>
>>>> v3 --> v2:
>>>>       Call vhost_dev_reset() at the end of vhost_net_stop().
>>>>
>>>>       Since the vDPA device need re-add the status bit
>>>>       VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
>>>>       simply, add them inside vhost_vdpa_reset_device, and
>>>>       the only way calling vhost_vdpa_reset_device is in
>>>>       vhost_net_stop(), so it keeps the same behavior as before.
>>>>
>>>> v2 --> v1:
>>>>      Implement a new function vhost_dev_reset,
>>>>      reset the backend kernel device at last.
>>>> ---
>>>>    hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
>>>>    hw/virtio/vhost-vdpa.c    | 15 +++++++++------
>>>>    hw/virtio/vhost.c         | 15 ++++++++++++++-
>>>>    include/hw/virtio/vhost.h |  1 +
>>>>    4 files changed, 45 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>> index 30379d2..422c9bf 100644
>>>> --- a/hw/net/vhost_net.c
>>>> +++ b/hw/net/vhost_net.c
>>>> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>        int total_notifiers = data_queue_pairs * 2 + cvq;
>>>>        VirtIONet *n = VIRTIO_NET(dev);
>>>>        int nvhosts = data_queue_pairs + cvq;
>>>> -    struct vhost_net *net;
>>>> +    struct vhost_net *net = NULL;
>>>>        int r, e, i, index_end = data_queue_pairs * 2;
>>>>        NetClientState *peer;
>>>>
>>>> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>    err_start:
>>>>        while (--i >= 0) {
>>>>            peer = qemu_get_peer(ncs , i);
>>>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>>>> +
>>>> +        net = get_vhost_net(peer);
>>>> +
>>>> +        vhost_net_stop_one(net, dev);
>>>>        }
>>>> +
>>>> +    /* We only reset backend vdpa device */
>>>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>>>> +        vhost_dev_reset(&net->dev);
>>>> +    }
>>>> +
>>>>        e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>>>        if (e < 0) {
>>>>            fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
>>>> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>>>        VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
>>>>        VirtIONet *n = VIRTIO_NET(dev);
>>>>        NetClientState *peer;
>>>> +    struct vhost_net *net = NULL;
>>>>        int total_notifiers = data_queue_pairs * 2 + cvq;
>>>>        int nvhosts = data_queue_pairs + cvq;
>>>>        int i, r;
>>>> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>>>            } else {
>>>>                peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>            }
>>>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>>>> +
>>>> +        net = get_vhost_net(peer);
>>>> +
>>>> +        vhost_net_stop_one(net, dev);
>>>> +    }
>>>> +
>>>> +    /* We only reset backend vdpa device */
>>>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>>>> +        vhost_dev_reset(&net->dev);
>>>>        }
>>>
>>> So we've already reset the device in vhost_vdpa_dev_start(), any
>>> reason we need to do it again here?
>>
>> reset device in vhost_vdpa_dev_start if there is some error with start.
> 
> The rest should have been done in vhost_net_stop_one()?
> 
>>
>>
>>>
>>>>
>>>>        r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>> index c5ed7a3..3ef0199 100644
>>>> --- a/hw/virtio/vhost-vdpa.c
>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>>>>
>>>>        ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>>>        trace_vhost_vdpa_reset_device(dev, status);
>>>> +
>>>> +    /* Add back this status, so that the device could work next time*/
>>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>> +                               VIRTIO_CONFIG_S_DRIVER);
>>>
>>> This seems to contradict the semantic of reset
>>
>> Yes, but it's hard to put it in other place, seems only vhost-vdpa need
>> it, and for VM shutdown, qemu_del_nic() will do cleanup this like close
>> vhost fds, which will call reset in kernel space without set those features.
>>
>> So at last I put it here with no other inpact.
> 
> Can we move this to the suitable caller of this function?
> 

This is a vhost-vdpa backend specific requirement; if we move it to the 
caller, we need a backend check after each vhost_dev_reset() call.

Otherwise we need a new vhost API, vhost_add_status(); that is a bit
complex because we should consider live migration and error recovery, as 
Si-Wei mentioned.
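
Just to make that concrete, a minimal sketch of what the new API would 
be (the vhost_add_status() name and the ->vhost_add_status callback are 
hypothetical here, nothing like them exists in vhost_ops yet):

/* Hypothetical sketch only: neither this helper nor the callback it
 * forwards to exists today; this is what would need to be added. */
int vhost_add_status(struct vhost_dev *hdev, uint8_t status)
{
    /* should only be called after backend is connected */
    assert(hdev->vhost_ops);

    if (hdev->vhost_ops->vhost_add_status) {
        return hdev->vhost_ops->vhost_add_status(hdev, status);
    }

    return 0;
}

and every call site of vhost_dev_reset() would then have to repeat the 
backend check to re-arm the device:

/* hypothetical follow-up needed at each vhost_dev_reset() call site */
if (net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
    vhost_add_status(&net->dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
}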

Could we just leave it here and move it to the right place next time?

Thanks,
Michael
Michael Qiu April 2, 2022, 3:53 a.m. UTC | #7
On 2022/4/2 10:20, Jason Wang wrote:
> Adding Michael.
> 
> On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 3/31/2022 7:53 PM, Jason Wang wrote:
>>> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>>>> Currently, when VM poweroff, it will trigger vdpa
>>>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>>>> queue pair and one control queue, triggered 3 times), this
>>>> leads to below issue:
>>>>
>>>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>>>
>>>> This because in vhost_net_stop(), it will stop all vhost device bind to
>>>> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
>>>> , then stop the queue: vhost_virtqueue_stop().
>>>>
>>>> In vhost_dev_stop(), it resets the device, which clear some flags
>>>> in low level driver, and in next loop(stop other vhost backends),
>>>> qemu try to stop the queue corresponding to the vhost backend,
>>>>    the driver finds that the VQ is invalied, this is the root cause.
>>>>
>>>> To solve the issue, vdpa should set vring unready, and
>>>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>>>
>>>> and implement a new function vhost_dev_reset, only reset backend
>>>> device after all vhost(per-queue) stoped.
>>> Typo.
>>>
>>>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>> Rethink this patch, consider there're devices that don't support
>>> set_vq_ready(). I wonder if we need
>>>
>>> 1) uAPI to tell the user space whether or not it supports set_vq_ready()
>> I guess what's more relevant here is to define the uAPI semantics for
>> unready i.e. set_vq_ready(0) for resuming/stopping virtqueue processing,
>> as starting vq is comparatively less ambiguous.
> 
> Yes.
> 
>> Considering the
>> likelihood that this interface may be used for live migration, it would
>> be nice to come up with variants such as 1) discard inflight request
>> v.s. 2) waiting for inflight processing to be done,
> 
> Or inflight descriptor reporting (which seems to be tricky). But we
> can start from net that a discarding may just work.
> 
>> and 3) timeout in
>> waiting.
> 
> Actually, that's the plan and Eugenio is proposing something like this
> via virtio spec:
> 
> https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html
> 
>>
>>> 2) userspace will call SET_VRING_ENABLE() when the device supports
>>> otherwise it will use RESET.
>> Are you looking to making virtqueue resume-able through the new
>> SET_VRING_ENABLE() uAPI?
>>
>> I think RESET is inevitable in some case, i.e. when guest initiates
>> device reset by writing 0 to the status register.
> 
> Yes, that's all my plan.
> 
>> For suspend/resume and
>> live migration use cases, indeed RESET can be substituted with
>> SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
>> accommodate this change. Although I'm all for it, it'd be the best to
>> lay out the plan for multiple phases rather than overload this single
>> patch too much. You can count my time on this endeavor if you don't mind. :)
> 
> You're welcome, I agree we should choose a way to go first:
> 
> 1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
> 2) go with virtio-spec (may take a while)
> 3) don't wait for the spec, have a vDPA specific uAPI first. Note that
> I've chatted with most of the vendors and they seem to be fine with
> the _S_STOP. If we go this way, we can still provide the forward
> compatibility of _S_STOP
> 4) or do them all (in parallel)
> 
> Any thoughts?
> 

virtio-spec should be the long-term plan, not only because the spec 
moves very slowly, but also because hardware upgrades would be a problem.

For the short term, better to take the first one?

Thanks,
Michael
Si-Wei Liu April 6, 2022, 12:56 a.m. UTC | #8
On 4/1/2022 7:20 PM, Jason Wang wrote:
> Adding Michael.
>
> On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 3/31/2022 7:53 PM, Jason Wang wrote:
>>> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> wrote:
>>>> Currently, when VM poweroff, it will trigger vdpa
>>>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>>>> queue pair and one control queue, triggered 3 times), this
>>>> leads to below issue:
>>>>
>>>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>>>
>>>> This because in vhost_net_stop(), it will stop all vhost device bind to
>>>> this virtio device, and in vhost_dev_stop(), qemu tries to stop the device
>>>> , then stop the queue: vhost_virtqueue_stop().
>>>>
>>>> In vhost_dev_stop(), it resets the device, which clear some flags
>>>> in low level driver, and in next loop(stop other vhost backends),
>>>> qemu try to stop the queue corresponding to the vhost backend,
>>>>    the driver finds that the VQ is invalied, this is the root cause.
>>>>
>>>> To solve the issue, vdpa should set vring unready, and
>>>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>>>
>>>> and implement a new function vhost_dev_reset, only reset backend
>>>> device after all vhost(per-queue) stoped.
>>> Typo.
>>>
>>>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>> Rethink this patch, consider there're devices that don't support
>>> set_vq_ready(). I wonder if we need
>>>
>>> 1) uAPI to tell the user space whether or not it supports set_vq_ready()
>> I guess what's more relevant here is to define the uAPI semantics for
>> unready i.e. set_vq_ready(0) for resuming/stopping virtqueue processing,
>> as starting vq is comparatively less ambiguous.
> Yes.
>
>> Considering the
>> likelihood that this interface may be used for live migration, it would
>> be nice to come up with variants such as 1) discard inflight request
>> v.s. 2) waiting for inflight processing to be done,
> Or inflight descriptor reporting (which seems to be tricky). But we
> can start from net that a discarding may just work.
>
>> and 3) timeout in
>> waiting.
> Actually, that's the plan and Eugenio is proposing something like this
> via virtio spec:
>
> https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html
Thanks for the pointer, I seem to recall seeing it some time back, 
though I wonder if there's a follow-up to the v3? My impression was that 
this is still a work-in-progress spec proposal, the semantics of the 
various F_STOP scenarios are still unclear, and not all of the 
requirements (e.g. STOP_FAILED, rewind & !IN_ORDER) for live migration 
seem to be accommodated yet?

>
>>> 2) userspace will call SET_VRING_ENABLE() when the device supports
>>> otherwise it will use RESET.
>> Are you looking to making virtqueue resume-able through the new
>> SET_VRING_ENABLE() uAPI?
>>
>> I think RESET is inevitable in some case, i.e. when guest initiates
>> device reset by writing 0 to the status register.
> Yes, that's all my plan.
>
>> For suspend/resume and
>> live migration use cases, indeed RESET can be substituted with
>> SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
>> accommodate this change. Although I'm all for it, it'd be the best to
>> lay out the plan for multiple phases rather than overload this single
>> patch too much. You can count my time on this endeavor if you don't mind. :)
> You're welcome, I agree we should choose a way to go first:
>
> 1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
For networking devices and the vq suspend/resume and live migration use 
cases to support, I thought it might suffice? We may drop inflight or 
unused descriptors for Ethernet... What other part do you think may 
limit its extension to become a general uAPI, or require a new uAPI to 
address similar VQ stop requirements if need be? Or we might well define 
a subsystem-specific uAPI to stop the virtqueue, for vDPA devices 
specifically? I think the point here is that, given we would like to 
avoid guest-side modification to support live migration, we can define 
specific uAPIs for specific live migration requirements without having 
to involve a guest driver change. It'd be easy to get started this way 
and generalize them all to a full-blown _S_STOP when things are 
eventually settled.
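
(For reference, the per-vq stop this patch relies on already maps onto a 
tiny piece of the existing vhost-vdpa uAPI; a minimal userspace sketch, 
with vdpa_fd and vq_index as placeholders for the caller's state:)

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Minimal sketch: stop (ready = 0) or start (ready = 1) a single
 * virtqueue through the existing VHOST_VDPA_SET_VRING_ENABLE ioctl. */
static int vdpa_set_vq_ready(int vdpa_fd, unsigned int vq_index, int ready)
{
    struct vhost_vring_state state = {
        .index = vq_index,
        .num   = ready,
    };

    return ioctl(vdpa_fd, VHOST_VDPA_SET_VRING_ENABLE, &state);
}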

> 2) go with virtio-spec (may take a while)
I feel it might still be quite early to get to a full-blown _S_STOP 
spec-level amendment that works for all types of virtio (vendor) 
devices. Generally there can be very specific subsystem-dependent ways 
to stop each type of virtio device that satisfy the live migration of 
virtio subsystem devices. For now the discussion mostly concerns 
virtio-level things such as vq index rewind, inflight handling, 
notification interrupts and configuration space, but the real device 
backend has implications on other parts, such as the order of IO/DMA 
quiescing and interrupt masking. If the subsystem virtio guest drivers 
today don't support any of those new _S_STOP behaviors, I guess there is 
little point in introducing the same or similar _S_STOP functionality to 
the guest driver to effectively support live migration.


Thanks,
-Siwei
Jason Wang April 7, 2022, 7:50 a.m. UTC | #9
On 2022/4/6 8:56 AM, Si-Wei Liu wrote:
>
>
> On 4/1/2022 7:20 PM, Jason Wang wrote:
>> Adding Michael.
>>
>> On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>
>>>
>>> On 3/31/2022 7:53 PM, Jason Wang wrote:
>>>> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> 
>>>> wrote:
>>>>> Currently, when VM poweroff, it will trigger vdpa
>>>>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>>>>> queue pair and one control queue, triggered 3 times), this
>>>>> leads to below issue:
>>>>>
>>>>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>>>>
>>>>> This because in vhost_net_stop(), it will stop all vhost device 
>>>>> bind to
>>>>> this virtio device, and in vhost_dev_stop(), qemu tries to stop 
>>>>> the device
>>>>> , then stop the queue: vhost_virtqueue_stop().
>>>>>
>>>>> In vhost_dev_stop(), it resets the device, which clear some flags
>>>>> in low level driver, and in next loop(stop other vhost backends),
>>>>> qemu try to stop the queue corresponding to the vhost backend,
>>>>>    the driver finds that the VQ is invalied, this is the root cause.
>>>>>
>>>>> To solve the issue, vdpa should set vring unready, and
>>>>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>>>>
>>>>> and implement a new function vhost_dev_reset, only reset backend
>>>>> device after all vhost(per-queue) stoped.
>>>> Typo.
>>>>
>>>>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>>>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>>> Rethink this patch, consider there're devices that don't support
>>>> set_vq_ready(). I wonder if we need
>>>>
>>>> 1) uAPI to tell the user space whether or not it supports 
>>>> set_vq_ready()
>>> I guess what's more relevant here is to define the uAPI semantics for
>>> unready i.e. set_vq_ready(0) for resuming/stopping virtqueue 
>>> processing,
>>> as starting vq is comparatively less ambiguous.
>> Yes.
>>
>>> Considering the
>>> likelihood that this interface may be used for live migration, it would
>>> be nice to come up with variants such as 1) discard inflight request
>>> v.s. 2) waiting for inflight processing to be done,
>> Or inflight descriptor reporting (which seems to be tricky). But we
>> can start from net that a discarding may just work.
>>
>>> and 3) timeout in
>>> waiting.
>> Actually, that's the plan and Eugenio is proposing something like this
>> via virtio spec:
>>
>> https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html
>>
> Thanks for the pointer, I seem to recall I saw it some time back 
> though I wonder if there's follow-up for the v3? My impression was 
> that this is still a work-in-progress spec proposal, while the 
> semantics of various F_STOP scenario is unclear yet and not all of the 
> requirements (ex: STOP_FAILED, rewind & !IN_ORDER) for live migration 
> do seem to get accommodated?


My understanding is that the reason for STOP_FAILED and IN_ORDER is 
that we don't have a way to report inflight descriptors. We will try 
to overcome those by allowing the device to report inflight descriptors 
in the next version.


>
>>
>>>> 2) userspace will call SET_VRING_ENABLE() when the device supports
>>>> otherwise it will use RESET.
>>> Are you looking to making virtqueue resume-able through the new
>>> SET_VRING_ENABLE() uAPI?
>>>
>>> I think RESET is inevitable in some case, i.e. when guest initiates
>>> device reset by writing 0 to the status register.
>> Yes, that's all my plan.
>>
>>> For suspend/resume and
>>> live migration use cases, indeed RESET can be substituted with
>>> SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
>>> accommodate this change. Although I'm all for it, it'd be the best to
>>> lay out the plan for multiple phases rather than overload this single
>>> patch too much. You can count my time on this endeavor if you don't 
>>> mind. :)
>> You're welcome, I agree we should choose a way to go first:
>>
>> 1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
> For networking device and the vq suspend/resume and live migration use 
> cases to support, I thought it might suffice?


Without a config space change it would be sufficient. And anyhow the vDPA 
parent can prevent config changes while all the virtqueues are disabled.
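
E.g. a parent could simply gate its config-changed callback; a sketch, 
where my_vdpa and all_vqs_disabled are made-up names and only struct 
vdpa_callback is the real kernel structure:

#include <linux/vdpa.h>

struct my_vdpa {
    struct vdpa_callback config_cb;   /* registered by the vdpa bus driver */
    bool all_vqs_disabled;            /* hypothetical parent-side flag */
};

/* hold config-changed events while every virtqueue is disabled */
static void my_parent_config_changed(struct my_vdpa *p)
{
    if (p->all_vqs_disabled)
        return;

    if (p->config_cb.callback)
        p->config_cb.callback(p->config_cb.private);
}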


> We may drop inflight or unused ones for Ethernet...


Yes.


> What other part do you think may limit its extension to become a 
> general uAPI or add new uAPI to address similar VQ stop requirement if 
> need be? 


For networking, we don't need anything else.


> Or we might well define subsystem specific uAPI to stop the virtqueue, 
> for vdpa device specifically?


Anyhow we need a uAPI, considering we have some parents that don't support 
virtqueue stop. So this could be another way to go.

But if we decide to bother with a new uAPI, I would rather go with a new 
uAPI for stopping the whole device. It can help with the config space 
change as well.
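
Purely as a sketch of what I mean (the ioctl name and request number 
below are made up, no such uAPI exists yet):

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Hypothetical device-level stop uAPI -- placeholder name/number. */
#define VHOST_VDPA_STOP _IOW(VHOST_VIRTIO, 0x7f, int)

/* stop = 1 quiesces the whole device (rings and config space),
 * stop = 0 resumes it */
static int vdpa_stop(int vdpa_fd, int stop)
{
    return ioctl(vdpa_fd, VHOST_VDPA_STOP, &stop);
}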


> I think the point here is given that we would like to avoid guest side 
> modification to support live migration, we can define specific uAPI 
> for specific live migration requirement without having to involve 
> guest driver change. It'd be easy to get started this way and 
> generalize them all to a full blown _S_STOP when things are eventually 
> settled.


Yes, note that the status seen by the guest is mediated by the hypervisor. 
So the hypervisor can choose to hide the _S_STOP bit from the guest, 
keeping the migration working without modifications in the guest driver.
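
Conceptually something like this (the stop bit is only a proposal, so 
the value below is a placeholder, not a ratified VIRTIO_CONFIG_S_* 
constant):

#include <stdint.h>

#define VIRTIO_CONFIG_S_STOP_PROPOSED 0x10   /* placeholder value */

/* the status the guest reads never carries the stop bit the
 * hypervisor drives on the backend */
static uint8_t guest_visible_status(uint8_t backend_status)
{
    return backend_status & ~VIRTIO_CONFIG_S_STOP_PROPOSED;
}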


>
>> 2) go with virtio-spec (may take a while)
> I feel it might be still quite early for now to get to a full blown 
> _S_STOP spec level amendment that works for all types of virtio 
> (vendor) devices. Generally there can be very specific 
> subsystem-dependent ways to stop each type of virtio devices that 
> satisfies the live migration of virtio subsystem devices.


Yes.


> For now the discussion mostly concerns with vq index rewind, inflight 
> handling, notification interrupt and configuration space such kind of 
> virtio level things, but real device backend has implication on the 
> other parts such as the order of IO/DMA quiescing and interrupt masking.


It's the vDPA parent's responsibility to perform any necessary quiescing 
to satisfy the semantics of _S_STOP; that's an implementation detail 
which is out of the scope of the spec.


> If the subsystem virtio guest drivers today somehow don't support any 
> of those _S_STOP new behaviors, I guess it's with little point to 
> introduce the same or similar _S_STOP functionality to the guest 
> driver to effectively support live migration.


See above, the live migration is transparent to the guest. For a 
driver that doesn't support _S_STOP, we can still live migrate it. The 
only interesting part is nesting: if we want to live migrate a nested 
guest, the guest driver must support _S_STOP.

Thanks


Jason Wang April 7, 2022, 7:52 a.m. UTC | #10
On 2022/4/2 11:53 AM, Michael Qiu wrote:
>
>
> On 2022/4/2 10:20, Jason Wang wrote:
>> Adding Michael.
>>
>> On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>
>>>
>>>
>>> On 3/31/2022 7:53 PM, Jason Wang wrote:
>>>> On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu <qiudayu@archeros.com> 
>>>> wrote:
>>>>> Currently, when VM poweroff, it will trigger vdpa
>>>>> device(such as mlx bluefield2 VF) reset many times(with 1 datapath
>>>>> queue pair and one control queue, triggered 3 times), this
>>>>> leads to below issue:
>>>>>
>>>>> vhost VQ 2 ring restore failed: -22: Invalid argument (22)
>>>>>
>>>>> This because in vhost_net_stop(), it will stop all vhost device 
>>>>> bind to
>>>>> this virtio device, and in vhost_dev_stop(), qemu tries to stop 
>>>>> the device
>>>>> , then stop the queue: vhost_virtqueue_stop().
>>>>>
>>>>> In vhost_dev_stop(), it resets the device, which clear some flags
>>>>> in low level driver, and in next loop(stop other vhost backends),
>>>>> qemu try to stop the queue corresponding to the vhost backend,
>>>>>    the driver finds that the VQ is invalied, this is the root cause.
>>>>>
>>>>> To solve the issue, vdpa should set vring unready, and
>>>>> remove reset ops in device stop: vhost_dev_start(hdev, false).
>>>>>
>>>>> and implement a new function vhost_dev_reset, only reset backend
>>>>> device after all vhost(per-queue) stoped.
>>>> Typo.
>>>>
>>>>> Signed-off-by: Michael Qiu<qiudayu@archeros.com>
>>>>> Acked-by: Jason Wang <jasowang@redhat.com>
>>>> Rethink this patch, consider there're devices that don't support
>>>> set_vq_ready(). I wonder if we need
>>>>
>>>> 1) uAPI to tell the user space whether or not it supports 
>>>> set_vq_ready()
>>> I guess what's more relevant here is to define the uAPI semantics for
>>> unready i.e. set_vq_ready(0) for resuming/stopping virtqueue 
>>> processing,
>>> as starting vq is comparatively less ambiguous.
>>
>> Yes.
>>
>>> Considering the
>>> likelihood that this interface may be used for live migration, it would
>>> be nice to come up with variants such as 1) discard inflight request
>>> v.s. 2) waiting for inflight processing to be done,
>>
>> Or inflight descriptor reporting (which seems to be tricky). But we
>> can start from net that a discarding may just work.
>>
>>> and 3) timeout in
>>> waiting.
>>
>> Actually, that's the plan and Eugenio is proposing something like this
>> via virtio spec:
>>
>> https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html
>>
>>>
>>>> 2) userspace will call SET_VRING_ENABLE() when the device supports
>>>> otherwise it will use RESET.
>>> Are you looking to making virtqueue resume-able through the new
>>> SET_VRING_ENABLE() uAPI?
>>>
>>> I think RESET is inevitable in some case, i.e. when guest initiates
>>> device reset by writing 0 to the status register.
>>
>> Yes, that's all my plan.
>>
>>> For suspend/resume and
>>> live migration use cases, indeed RESET can be substituted with
>>> SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
>>> accommodate this change. Although I'm all for it, it'd be the best to
>>> lay out the plan for multiple phases rather than overload this single
>>> patch too much. You can count my time on this endeavor if you don't 
>>> mind. :)
>>
>> You're welcome, I agree we should choose a way to go first:
>>
>> 1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
>> 2) go with virtio-spec (may take a while)
>> 3) don't wait for the spec, have a vDPA specific uAPI first. Note that
>> I've chatted with most of the vendors and they seem to be fine with
>> the _S_STOP. If we go this way, we can still provide the forward
>> compatibility of _S_STOP
>> 4) or do them all (in parallel)
>>
>> Any thoughts?
>>
>
> virtio-spec should be long-term, not only because the spec goes very 
> slowly, but also the hardware upgrade should be a problem.
>
> For short-term, better to take the first one?


Considering we need a new uAPI anyhow, I prefer 2), but you can try 1) 
and see what people think.

Thanks


>
> Thanks,
> Michael
>> Thanks
>>
>>>
>>>>
>>>> And for safety, I suggest tagging this as 7.1.
>>> +1
>>>
>>> Regards,
>>> -Siwei
>>>
>>>>
>>>>> ---
>>>>> v4 --> v3
>>>>>       Nothing changed, becasue of issue with mimecast,
>>>>>       when the From: tag is different from the sender,
>>>>>       the some mail client will take the patch as an
>>>>>       attachment, RESEND v3 does not work, So resend
>>>>>       the patch as v4
>>>>>
>>>>> v3 --> v2:
>>>>>       Call vhost_dev_reset() at the end of vhost_net_stop().
>>>>>
>>>>>       Since the vDPA device need re-add the status bit
>>>>>       VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
>>>>>       simply, add them inside vhost_vdpa_reset_device, and
>>>>>       the only way calling vhost_vdpa_reset_device is in
>>>>>       vhost_net_stop(), so it keeps the same behavior as before.
>>>>>
>>>>> v2 --> v1:
>>>>>      Implement a new function vhost_dev_reset,
>>>>>      reset the backend kernel device at last.
>>>>> ---
>>>>>    hw/net/vhost_net.c        | 24 +++++++++++++++++++++---
>>>>>    hw/virtio/vhost-vdpa.c    | 15 +++++++++------
>>>>>    hw/virtio/vhost.c         | 15 ++++++++++++++-
>>>>>    include/hw/virtio/vhost.h |  1 +
>>>>>    4 files changed, 45 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>>> index 30379d2..422c9bf 100644
>>>>> --- a/hw/net/vhost_net.c
>>>>> +++ b/hw/net/vhost_net.c
>>>>> @@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, 
>>>>> NetClientState *ncs,
>>>>>        int total_notifiers = data_queue_pairs * 2 + cvq;
>>>>>        VirtIONet *n = VIRTIO_NET(dev);
>>>>>        int nvhosts = data_queue_pairs + cvq;
>>>>> -    struct vhost_net *net;
>>>>> +    struct vhost_net *net = NULL;
>>>>>        int r, e, i, index_end = data_queue_pairs * 2;
>>>>>        NetClientState *peer;
>>>>>
>>>>> @@ -391,8 +391,17 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
>>>>>    err_start:
>>>>>        while (--i >= 0) {
>>>>>            peer = qemu_get_peer(ncs , i);
>>>>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>>>>> +
>>>>> +        net = get_vhost_net(peer);
>>>>> +
>>>>> +        vhost_net_stop_one(net, dev);
>>>>>        }
>>>>> +
>>>>> +    /* We only reset backend vdpa device */
>>>>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>>>>> +        vhost_dev_reset(&net->dev);
>>>>> +    }
>>>>> +
>>>>>        e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>>>>        if (e < 0) {
>>>>>            fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
>>>>> @@ -410,6 +419,7 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>>>>        VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
>>>>>        VirtIONet *n = VIRTIO_NET(dev);
>>>>>        NetClientState *peer;
>>>>> +    struct vhost_net *net = NULL;
>>>>>        int total_notifiers = data_queue_pairs * 2 + cvq;
>>>>>        int nvhosts = data_queue_pairs + cvq;
>>>>>        int i, r;
>>>>> @@ -420,7 +430,15 @@ void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
>>>>>            } else {
>>>>>                peer = qemu_get_peer(ncs, n->max_queue_pairs);
>>>>>            }
>>>>> -        vhost_net_stop_one(get_vhost_net(peer), dev);
>>>>> +
>>>>> +        net = get_vhost_net(peer);
>>>>> +
>>>>> +        vhost_net_stop_one(net, dev);
>>>>> +    }
>>>>> +
>>>>> +    /* We only reset backend vdpa device */
>>>>> +    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
>>>>> +        vhost_dev_reset(&net->dev);
>>>>>        }
>>>> So we've already reset the device in vhost_vdpa_dev_start(); any
>>>> reason we need to do it again here?
>>>>
>>>>>        r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index c5ed7a3..3ef0199 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -708,6 +708,11 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>>>>>
>>>>>        ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>>>>        trace_vhost_vdpa_reset_device(dev, status);
>>>>> +
>>>>> +    /* Add back this status, so that the device could work next time*/
>>>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>>> +                               VIRTIO_CONFIG_S_DRIVER);
>>>> This seems to contradict the semantics of reset.
>>>>
>>>>> +
>>>>>        return ret;
>>>>>    }
>>>>>
>>>>> @@ -719,14 +724,14 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>>>>>        return idx;
>>>>>    }
>>>>>
>>>>> -static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
>>>>> +static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
>>>>>    {
>>>>>        int i;
>>>>>        trace_vhost_vdpa_set_vring_ready(dev);
>>>>>        for (i = 0; i < dev->nvqs; ++i) {
>>>>>            struct vhost_vring_state state = {
>>>>>                .index = dev->vq_index + i,
>>>>> -            .num = 1,
>>>>> +            .num = ready,
>>>>>            };
>>>>>            vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
>>>>>        }
>>>>> @@ -1088,8 +1093,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>>>>            if (unlikely(!ok)) {
>>>>>                return -1;
>>>>>            }
>>>>> -        vhost_vdpa_set_vring_ready(dev);
>>>>> +        vhost_vdpa_set_vring_ready(dev, 1);
>>>>>        } else {
>>>>> +        vhost_vdpa_set_vring_ready(dev, 0);
>>>>>            ok = vhost_vdpa_svqs_stop(dev);
>>>>>            if (unlikely(!ok)) {
>>>>>                return -1;
>>>>> @@ -1105,9 +1111,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>>>>            memory_listener_register(&v->listener, &address_space_memory);
>>>>>            return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>>>>        } else {
>>>>> -        vhost_vdpa_reset_device(dev);
>>>>> -        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>>> -                                   VIRTIO_CONFIG_S_DRIVER);
>>>>>          memory_listener_unregister(&v->listener);
>>>>>
>>>>>            return 0;
>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>> index b643f42..7e0cdb6 100644
>>>>> --- a/hw/virtio/vhost.c
>>>>> +++ b/hw/virtio/vhost.c
>>>>> @@ -1820,7 +1820,6 @@ fail_features:
>>>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
>>>>>    {
>>>>>        int i;
>>>>> -
>>>> Unnecessary changes.
>>>>
>>>>>        /* should only be called after backend is connected */
>>>>>        assert(hdev->vhost_ops);
>>>>>
>>>>> @@ -1854,3 +1853,17 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>>>>>
>>>>>        return -ENOSYS;
>>>>>    }
>>>>> +
>>>>> +int vhost_dev_reset(struct vhost_dev *hdev)
>>>>> +{
>>>> Let's use a separate patch for this.
>>>>
>>>> Thanks
>>>>
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    /* should only be called after backend is connected */
>>>>> +    assert(hdev->vhost_ops);
>>>>> +
>>>>> +    if (hdev->vhost_ops->vhost_reset_device) {
>>>>> +        ret = hdev->vhost_ops->vhost_reset_device(hdev);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>>> index 58a73e7..b8b7c20 100644
>>>>> --- a/include/hw/virtio/vhost.h
>>>>> +++ b/include/hw/virtio/vhost.h
>>>>> @@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>>>>>    void vhost_dev_cleanup(struct vhost_dev *hdev);
>>>>>    int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
>>>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
>>>>> +int vhost_dev_reset(struct vhost_dev *hdev);
>>>>>    int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>>>>    void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
>>>>>
>>>>> -- 
>>>>> 1.8.3.1
>>>>>
>>>>>
>>>>>
>>>
>>
>>
>
>
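To make the enable/disable semantics debated above concrete, here is a
minimal userspace sketch against the existing vhost-vdpa uAPI
(VHOST_VDPA_SET_VRING_ENABLE with struct vhost_vring_state). The device
node path and queue count are illustrative placeholders, and error
handling is reduced to the essentials; this shows the ioctl semantics
only, not QEMU's internal code path.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Set every vring of the device ready (1) or not ready (0),
 * mirroring what vhost_vdpa_set_vring_ready() does per queue. */
static int set_vrings_ready(int fd, unsigned int nvqs, unsigned int ready)
{
    unsigned int i;

    for (i = 0; i < nvqs; i++) {
        struct vhost_vring_state state = {
            .index = i,
            .num = ready,
        };

        if (ioctl(fd, VHOST_VDPA_SET_VRING_ENABLE, &state) < 0) {
            perror("VHOST_VDPA_SET_VRING_ENABLE");
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    /* Placeholder path; real deployments enumerate /dev/vhost-vdpa-N. */
    int fd = open("/dev/vhost-vdpa-0", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Mark vrings not ready before stopping, as the patch does. */
    set_vrings_ready(fd, 3, 0);
    close(fd);
    return 0;
}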

Patch

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2..422c9bf 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -325,7 +325,7 @@  int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
     int total_notifiers = data_queue_pairs * 2 + cvq;
     VirtIONet *n = VIRTIO_NET(dev);
     int nvhosts = data_queue_pairs + cvq;
-    struct vhost_net *net;
+    struct vhost_net *net = NULL;
     int r, e, i, index_end = data_queue_pairs * 2;
     NetClientState *peer;
 
@@ -391,8 +391,17 @@  int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 err_start:
     while (--i >= 0) {
         peer = qemu_get_peer(ncs , i);
-        vhost_net_stop_one(get_vhost_net(peer), dev);
+
+        net = get_vhost_net(peer);
+
+        vhost_net_stop_one(net, dev);
     }
+
+    /* We only reset backend vdpa device */
+    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
+        vhost_dev_reset(&net->dev);
+    }
+
     e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
     if (e < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", e);
@@ -410,6 +419,7 @@  void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(vbus);
     VirtIONet *n = VIRTIO_NET(dev);
     NetClientState *peer;
+    struct vhost_net *net = NULL;
     int total_notifiers = data_queue_pairs * 2 + cvq;
     int nvhosts = data_queue_pairs + cvq;
     int i, r;
@@ -420,7 +430,15 @@  void vhost_net_stop(VirtIODevice *dev, NetClientState *ncs,
         } else {
             peer = qemu_get_peer(ncs, n->max_queue_pairs);
         }
-        vhost_net_stop_one(get_vhost_net(peer), dev);
+
+        net = get_vhost_net(peer);
+
+        vhost_net_stop_one(net, dev);
+    }
+
+    /* We only reset backend vdpa device */
+    if (net && net->dev.vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA) {
+        vhost_dev_reset(&net->dev);
     }
 
     r = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index c5ed7a3..3ef0199 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -708,6 +708,11 @@  static int vhost_vdpa_reset_device(struct vhost_dev *dev)
 
     ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
     trace_vhost_vdpa_reset_device(dev, status);
+
+    /* Add back this status, so that the device could work next time*/
+    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
+                               VIRTIO_CONFIG_S_DRIVER);
+
     return ret;
 }
 
@@ -719,14 +724,14 @@  static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
     return idx;
 }
 
-static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
+static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, unsigned int ready)
 {
     int i;
     trace_vhost_vdpa_set_vring_ready(dev);
     for (i = 0; i < dev->nvqs; ++i) {
         struct vhost_vring_state state = {
             .index = dev->vq_index + i,
-            .num = 1,
+            .num = ready,
         };
         vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
     }
@@ -1088,8 +1093,9 @@  static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         if (unlikely(!ok)) {
             return -1;
         }
-        vhost_vdpa_set_vring_ready(dev);
+        vhost_vdpa_set_vring_ready(dev, 1);
     } else {
+        vhost_vdpa_set_vring_ready(dev, 0);
         ok = vhost_vdpa_svqs_stop(dev);
         if (unlikely(!ok)) {
             return -1;
@@ -1105,9 +1111,6 @@  static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
         memory_listener_register(&v->listener, &address_space_memory);
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     } else {
-        vhost_vdpa_reset_device(dev);
-        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
-                                   VIRTIO_CONFIG_S_DRIVER);
         memory_listener_unregister(&v->listener);
 
         return 0;
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index b643f42..7e0cdb6 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1820,7 +1820,6 @@  fail_features:
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i;
-
     /* should only be called after backend is connected */
     assert(hdev->vhost_ops);
 
@@ -1854,3 +1853,17 @@  int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -ENOSYS;
 }
+
+int vhost_dev_reset(struct vhost_dev *hdev)
+{
+    int ret = 0;
+
+    /* should only be called after backend is connected */
+    assert(hdev->vhost_ops);
+
+    if (hdev->vhost_ops->vhost_reset_device) {
+        ret = hdev->vhost_ops->vhost_reset_device(hdev);
+    }
+
+    return ret;
+}
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 58a73e7..b8b7c20 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -114,6 +114,7 @@  int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
+int vhost_dev_reset(struct vhost_dev *hdev);
 int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
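
As a companion to the patch above, the status sequence that the reworked
vhost_vdpa_reset_device() performs can be written as a standalone sketch:
write 0 to reset, then restore ACKNOWLEDGE|DRIVER so that a later start
only needs to add DRIVER_OK. The helper name below is made up for
illustration, and unlike QEMU's vhost_vdpa_add_status() it does not read
back the current status before OR-ing in new bits.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>

/* Illustrative helper: reset the vDPA device via the status register,
 * then re-acknowledge it, mirroring vhost_vdpa_reset_device(). */
static int vdpa_reset_and_reack(int fd)
{
    uint8_t status = 0;

    /* Writing status 0 resets the device (virtio device status protocol). */
    if (ioctl(fd, VHOST_VDPA_SET_STATUS, &status) < 0) {
        return -1;
    }

    /* Add back ACKNOWLEDGE|DRIVER so the device can be started again. */
    status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
    return ioctl(fd, VHOST_VDPA_SET_STATUS, &status);
}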