[RFC,00/27] vDPA software assisted live migration

Message ID 20201120185105.279030-1-eperezma@redhat.com (mailing list archive)
Series vDPA software assisted live migration

Message

Eugenio Perez Martin Nov. 20, 2020, 6:50 p.m. UTC
This series enables vDPA software assisted live migration for vhost-net
devices. This is a new method of vhost device migration: instead of
relying on the vDPA device's dirty logging capability, SW assisted LM
intercepts the dataplane, forwarding the descriptors between VM and device.

In this migration mode, qemu offers a new vring to the device to
read from and write into, and disables the vhost notifiers, processing
guest and vhost notifications in qemu. When relaying used buffers, qemu
marks the dirty memory just as a plain virtio-net device would. This way,
devices do not need to have dirty page logging capability.
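
A minimal sketch of the used-buffer relay (names such as
VhostShadowVirtqueue and shadow_vq_relay_used are hypothetical, not this
series' code; virtqueue_push() and virtio_notify() are the existing qemu
helpers that make the dirty marking happen):

    #include "qemu/osdep.h"
    #include "hw/virtio/virtio.h"

    /* Hypothetical shadow virtqueue: each shadow-ring head remembers the
     * guest VirtQueueElement it was built from. */
    typedef struct VhostShadowVirtqueue {
        VirtQueue *guest_vq;             /* guest-visible virtqueue */
        VirtQueueElement **ring_id_map;  /* shadow head -> guest element */
    } VhostShadowVirtqueue;

    static void shadow_vq_relay_used(VirtIODevice *vdev,
                                     VhostShadowVirtqueue *svq,
                                     unsigned int shadow_head,
                                     unsigned int len)
    {
        VirtQueueElement *elem = svq->ring_id_map[shadow_head];

        /* virtqueue_push() fills the guest used ring and unmaps the
         * element's buffers through qemu's memory API, which marks the
         * touched pages dirty just as an emulated virtio-net device
         * would. */
        virtqueue_push(svq->guest_vq, elem, len);
        virtio_notify(vdev, svq->guest_vq);
        g_free(elem);
    }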

This series is a POC doing SW LM for vhost-net devices, which already
have dirty page logging capabilities. None of the changes have any actual
effect with current devices until the last two patches (26 and 27) are
applied, but they can be rebased on top of any of the others. These check
that the device meets all requirements and disable vhost-net device
logging so that migration goes through SW LM. The last patch is not meant
to be applied in the final revision; it is in the series just for testing
purposes.

To use SW assisted LM, these vhost-net devices need to be instantiated
(an example invocation is sketched below):
* With IOMMU (iommu_platform=on,ats=on)
* Without event_idx (event_idx=off)
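
For reference, a vhost-net device meeting both constraints could be
instantiated roughly like this (machine and netdev options are
illustrative, not taken from the series):

    qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on \
        -netdev tap,id=net0,vhost=on \
        -device virtio-net-pci,netdev=net0,iommu_platform=on,ats=on,event_idx=off \
        ...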

Notification forwarding alone (with no descriptor relay) can be achieved
by applying patches 7 and 9 and starting migration. Partial applies
between patches 13 and 24 will not work while migrating on the source,
and patch 25 is needed for the destination to resume network activity.

It is based on the ideas of DPDK SW assisted LM, from the DPDK series at
https://patchwork.dpdk.org/cover/48370/ .

Comments are welcome.

Thanks!

Eugenio Pérez (27):
  vhost: Add vhost_dev_can_log
  vhost: Add device callback in vhost_migration_log
  vhost: Move log resize/put to vhost_dev_set_log
  vhost: add vhost_kernel_set_vring_enable
  vhost: Add hdev->dev.sw_lm_vq_handler
  virtio: Add virtio_queue_get_used_notify_split
  vhost: Route guest->host notification through qemu
  vhost: Add a flag for software assisted Live Migration
  vhost: Route host->guest notification through qemu
  vhost: Allocate shadow vring
  virtio: const-ify all virtio_tswap* functions
  virtio: Add virtio_queue_full
  vhost: Send buffers to device
  virtio: Remove virtio_queue_get_used_notify_split
  vhost: Do not invalidate signalled used
  virtio: Expose virtqueue_alloc_element
  vhost: add vhost_vring_set_notification_rcu
  vhost: add vhost_vring_poll_rcu
  vhost: add vhost_vring_get_buf_rcu
  vhost: Return used buffers
  vhost: Add vhost_virtqueue_memory_unmap
  vhost: Add vhost_virtqueue_memory_map
  vhost: unmap qemu's shadow virtqueues on sw live migration
  vhost: iommu changes
  vhost: Do not commit vhost used idx on vhost_virtqueue_stop
  vhost: Add vhost_hdev_can_sw_lm
  vhost: forbid vhost devices logging

 hw/virtio/vhost-sw-lm-ring.h      |  39 +++
 include/hw/virtio/vhost.h         |   5 +
 include/hw/virtio/virtio-access.h |   8 +-
 include/hw/virtio/virtio.h        |   4 +
 hw/net/virtio-net.c               |  39 ++-
 hw/virtio/vhost-backend.c         |  29 ++
 hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
 hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
 hw/virtio/virtio.c                |  18 +-
 hw/virtio/meson.build             |   2 +-
 10 files changed, 758 insertions(+), 85 deletions(-)
 create mode 100644 hw/virtio/vhost-sw-lm-ring.h
 create mode 100644 hw/virtio/vhost-sw-lm-ring.c

Comments

Eugenio Perez Martin Nov. 20, 2020, 7:03 p.m. UTC | #1
The main intention with this POC/RFC is to serve as a base for
implementing SW LM in vdpa devices.

To implement it in vhost-vdpa devices, the high priority is to define an
interface for vdpa to stop the device without losing its state, i.e.,
the avail_idx from which the destination device should start fetching
descriptors. Following this POC, an implementation on vdpa_sim will be
done.

Apart from that, there is a TODO list for this series; the items will be
solved as the code is validated. They don't affect the device, just
qemu's internal code, and in case of a change of direction they are easy
to modify or delete. Comments about these are welcome.

- Currently, it hijacks the log mechanism to know when migration is
starting/done. Maybe it would be cleaner to forward the migration status
from virtio_vmstate_change, since there is no need for the memory
listener. However, this could make the "memory backend" abstraction (also
a TODO) more complicated. This would drop patches 2 and 3 entirely.
- There is a reverse search in a list in "vhost_dev_from_virtio" for
each notification. This is not really efficient, and it leads to a race
condition at device destruction.
- Implement new capabilities (no iommu, packed vq, event_idx, ...).
- A lot of assertions need to be converted to proper error handling.

Thanks!

On Fri, Nov 20, 2020 at 8:02 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net devices, which already
> have dirty page logging capabilities. None of the changes have actual
> effect with current devices until last two patches (26 and 27) are
> applied, but they can be rebased on top of any other. These checks the
> device to meet all requirements, and disable vhost-net devices logging
> so migration goes through SW LM. This last patch is not meant to be
> applied in the final revision, it is in the series just for testing
> purposes.
>
> For use SW assisted LM these vhost-net devices need to be instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 9, and starting migration. Partial applies
> between 13 and 24 will not work while migrating on source, and patch
> 25 is needed for the destination to resume network activity.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
> Comments are welcome.
>
> Thanks!
>
> Eugenio Pérez (27):
>   vhost: Add vhost_dev_can_log
>   vhost: Add device callback in vhost_migration_log
>   vhost: Move log resize/put to vhost_dev_set_log
>   vhost: add vhost_kernel_set_vring_enable
>   vhost: Add hdev->dev.sw_lm_vq_handler
>   virtio: Add virtio_queue_get_used_notify_split
>   vhost: Route guest->host notification through qemu
>   vhost: Add a flag for software assisted Live Migration
>   vhost: Route host->guest notification through qemu
>   vhost: Allocate shadow vring
>   virtio: const-ify all virtio_tswap* functions
>   virtio: Add virtio_queue_full
>   vhost: Send buffers to device
>   virtio: Remove virtio_queue_get_used_notify_split
>   vhost: Do not invalidate signalled used
>   virtio: Expose virtqueue_alloc_element
>   vhost: add vhost_vring_set_notification_rcu
>   vhost: add vhost_vring_poll_rcu
>   vhost: add vhost_vring_get_buf_rcu
>   vhost: Return used buffers
>   vhost: Add vhost_virtqueue_memory_unmap
>   vhost: Add vhost_virtqueue_memory_map
>   vhost: unmap qemu's shadow virtqueues on sw live migration
>   vhost: iommu changes
>   vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>   vhost: Add vhost_hdev_can_sw_lm
>   vhost: forbid vhost devices logging
>
>  hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>  include/hw/virtio/vhost.h         |   5 +
>  include/hw/virtio/virtio-access.h |   8 +-
>  include/hw/virtio/virtio.h        |   4 +
>  hw/net/virtio-net.c               |  39 ++-
>  hw/virtio/vhost-backend.c         |  29 ++
>  hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>  hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>  hw/virtio/virtio.c                |  18 +-
>  hw/virtio/meson.build             |   2 +-
>  10 files changed, 758 insertions(+), 85 deletions(-)
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>  create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>
> --
> 2.18.4
>
>
no-reply@patchew.org Nov. 20, 2020, 7:30 p.m. UTC | #2
Patchew URL: https://patchew.org/QEMU/20201120185105.279030-1-eperezma@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201120185105.279030-1-eperezma@redhat.com
Subject: [RFC PATCH 00/27] vDPA software assisted live migration

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20201117173635.29101-1-alex.bennee@linaro.org -> patchew/20201117173635.29101-1-alex.bennee@linaro.org
 * [new tag]         patchew/20201120185105.279030-1-eperezma@redhat.com -> patchew/20201120185105.279030-1-eperezma@redhat.com
Switched to a new branch 'test'
af2fe22 vhost: forbid vhost devices logging
405925c vhost: Add vhost_hdev_can_sw_lm
7f2955b vhost: Do not commit vhost used idx on vhost_virtqueue_stop
b68d3d5 vhost: iommu changes
74f282a vhost: unmap qemu's shadow virtqueues on sw live migration
c999e86 vhost: Add vhost_virtqueue_memory_map
6e5f219 vhost: Add vhost_virtqueue_memory_unmap
d5054c8 vhost: Return used buffers
806db46 vhost: add vhost_vring_get_buf_rcu
b6b8168 vhost: add vhost_vring_poll_rcu
ca44882 vhost: add vhost_vring_set_notification_rcu
3183f62 virtio: Expose virtqueue_alloc_element
4ead0ac vhost: Do not invalidate signalled used
ceb76a4 virtio: Remove virtio_queue_get_used_notify_split
6aacdfe vhost: Send buffers to device
9fcc98d virtio: Add virtio_queue_full
41da0f8 virtio: const-ify all virtio_tswap* functions
a3c92f1 vhost: Allocate shadow vring
b40b3f7 vhost: Route host->guest notification through qemu
92ea117 vhost: Add a flag for software assisted Live Migration
ace8c10 vhost: Route guest->host notification through qemu
6be11a6 virtio: Add virtio_queue_get_used_notify_split
47d4485 vhost: Add hdev->dev.sw_lm_vq_handler
7dbc8e5 vhost: add vhost_kernel_set_vring_enable
4e51c5c vhost: Move log resize/put to vhost_dev_set_log
b964abc vhost: Add device callback in vhost_migration_log
264504e vhost: Add vhost_dev_can_log

=== OUTPUT BEGIN ===
1/27 Checking commit 264504ee0018 (vhost: Add vhost_dev_can_log)
2/27 Checking commit b964abc315bd (vhost: Add device callback in vhost_migration_log)
3/27 Checking commit 4e51c5cec56d (vhost: Move log resize/put to vhost_dev_set_log)
4/27 Checking commit 7dbc8e5c64c3 (vhost: add vhost_kernel_set_vring_enable)
5/27 Checking commit 47d4485458d1 (vhost: Add hdev->dev.sw_lm_vq_handler)
6/27 Checking commit 6be11a63e3b4 (virtio: Add virtio_queue_get_used_notify_split)
7/27 Checking commit ace8c1034f83 (vhost: Route guest->host notification through qemu)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#24: 
new file mode 100644

total: 0 errors, 1 warnings, 245 lines checked

Patch 7/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/27 Checking commit 92ea11700e32 (vhost: Add a flag for software assisted Live Migration)
WARNING: Block comments use a leading /* on a separate line
#46: FILE: hw/virtio/vhost.c:1581:
+        /* We've been called after migration is completed, so no need to

WARNING: Block comments use * on subsequent lines
#47: FILE: hw/virtio/vhost.c:1582:
+        /* We've been called after migration is completed, so no need to
+           disable it again

total: 0 errors, 2 warnings, 45 lines checked

Patch 8/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
9/27 Checking commit b40b3f79355c (vhost: Route host->guest notification through qemu)
10/27 Checking commit a3c92f15b554 (vhost: Allocate shadow vring)
11/27 Checking commit 41da0f8e02b2 (virtio: const-ify all virtio_tswap* functions)
12/27 Checking commit 9fcc98da9cd9 (virtio: Add virtio_queue_full)
13/27 Checking commit 6aacdfe0e0ba (vhost: Send buffers to device)
ERROR: memory barrier without comment
#50: FILE: hw/virtio/vhost-sw-lm-ring.c:45:
+    smp_rmb();

WARNING: Block comments use a leading /* on a separate line
#98: FILE: hw/virtio/vhost-sw-lm-ring.c:93:
+/* virtqueue_add:

WARNING: Block comments use a leading /* on a separate line
#125: FILE: hw/virtio/vhost-sw-lm-ring.c:120:
+    /* Put entry in available array (but don't update avail->idx until they

WARNING: Block comments use a trailing */ on a separate line
#126: FILE: hw/virtio/vhost-sw-lm-ring.c:121:
+     * do sync). */

ERROR: g_free(NULL) is safe this check is probably not required
#147: FILE: hw/virtio/vhost-sw-lm-ring.c:142:
+    if (vq->ring_id_maps[host_head]) {
+        g_free(vq->ring_id_maps[host_head]);

ERROR: braces {} are necessary for all arms of this statement
#185: FILE: hw/virtio/vhost-sw-lm-ring.c:181:
+    for (i = 0; i < num - 1; i++)
[...]

ERROR: code indent should never use tabs
#207: FILE: hw/virtio/vhost-sw-lm-ring.h:23:
+^I                    struct vhost_vring_addr *addr);$

ERROR: space required before the open parenthesis '('
#247: FILE: hw/virtio/vhost.c:986:
+    } while(!virtio_queue_empty(vq));

total: 5 errors, 3 warnings, 275 lines checked

Patch 13/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

14/27 Checking commit ceb76a4401b8 (virtio: Remove virtio_queue_get_used_notify_split)
15/27 Checking commit 4ead0ac8457f (vhost: Do not invalidate signalled used)
16/27 Checking commit 3183f62db3dc (virtio: Expose virtqueue_alloc_element)
17/27 Checking commit ca44882af152 (vhost: add vhost_vring_set_notification_rcu)
ERROR: memory barrier without comment
#45: FILE: hw/virtio/vhost-sw-lm-ring.c:83:
+    smp_mb();

total: 1 errors, 0 warnings, 45 lines checked

Patch 17/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

18/27 Checking commit b6b8168b9fe7 (vhost: add vhost_vring_poll_rcu)
ERROR: memory barrier without comment
#37: FILE: hw/virtio/vhost-sw-lm-ring.c:98:
+    smp_rmb();

total: 1 errors, 0 warnings, 38 lines checked

Patch 18/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

19/27 Checking commit 806db46e9194 (vhost: add vhost_vring_get_buf_rcu)
20/27 Checking commit d5054c8556bf (vhost: Return used buffers)
21/27 Checking commit 6e5f2192254a (vhost: Add vhost_virtqueue_memory_unmap)
22/27 Checking commit c999e86cf7f0 (vhost: Add vhost_virtqueue_memory_map)
23/27 Checking commit 74f282a80019 (vhost: unmap qemu's shadow virtqueues on sw live migration)
24/27 Checking commit b68d3d5fb839 (vhost: iommu changes)
25/27 Checking commit 7f2955b8e788 (vhost: Do not commit vhost used idx on vhost_virtqueue_stop)
WARNING: Block comments use a leading /* on a separate line
#40: FILE: hw/virtio/vhost.c:1442:
+        /* Connection to the backend is unusable, so let's sync internal

total: 0 errors, 1 warnings, 27 lines checked

Patch 25/27 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
26/27 Checking commit 405925c6b54a (vhost: Add vhost_hdev_can_sw_lm)
27/27 Checking commit af2fe2219d39 (vhost: forbid vhost devices logging)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201120185105.279030-1-eperezma@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Jason Wang Nov. 25, 2020, 7:08 a.m. UTC | #3
On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net devices, which already
> have dirty page logging capabilities. None of the changes have actual
> effect with current devices until last two patches (26 and 27) are
> applied, but they can be rebased on top of any other. These checks the
> device to meet all requirements, and disable vhost-net devices logging
> so migration goes through SW LM. This last patch is not meant to be
> applied in the final revision, it is in the series just for testing
> purposes.
>
> For use SW assisted LM these vhost-net devices need to be instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)


So a question is at what level we want to implement qemu assisted
live migration. To me it could be done at two levels:

1) the generic vhost level, which makes it work for both
vhost-net/vhost-user and vhost-vDPA
2) a specific type of vhost

To me, having a generic one looks better, but it would be much more
complicated. So what I read from this series is that it is a vhost-kernel
specific software assisted live migration, which is a good start.
Actually it may even have a real use case, e.g. it can save dirty bitmaps
for guests with large memory. But we need to address the above
limitations first.

So I would like to know the reason for mandating the iommu platform
and ats. And I think we need to fix the case of event idx support.


>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 9, and starting migration. Partial applies
> between 13 and 24 will not work while migrating on source, and patch
> 25 is needed for the destination to resume network activity.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of


Actually we're better than that, since there's no need for tricks like a
hardcoded IOVA for the mediated (shadow) virtqueue.


> DPDK's https://patchwork.dpdk.org/cover/48370/ .


I notice that you do GPA->VA translations and try to establish a VA->VA
(using VA as IOVA) mapping via device IOTLB. This shortcut should work
for vhost-kernel/user but not for vhost-vDPA. The reason is that there's
no guarantee that the whole 64-bit address range can be used as IOVA. One
example is that a hardware IOMMU like Intel's usually has 47 or 52
bits of address width.

So we probably need an IOVA allocator that can make sure the IOVAs do not
overlap and fit within [1]. We can probably build the IOVAs for guest VA
via memory listeners. Then we have

1) IOVA for GPA
2) IOVA for shadow VQ

And advertise the IOVA-to-VA mappings to vhost.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
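
As a purely illustrative sketch (not an existing qemu API), such an
allocator could start as simple as:

    #include <stdbool.h>
    #include <stdint.h>

    /* Linear IOVA allocator bounded by the device's usable address width
     * (e.g. 47 or 52 bits for an Intel IOMMU). A real implementation
     * would also need to reclaim ranges when memory regions go away. */
    typedef struct IOVAAllocator {
        uint64_t next;   /* next free IOVA */
        uint64_t limit;  /* first IOVA beyond the usable range */
    } IOVAAllocator;

    static inline void iova_allocator_init(IOVAAllocator *a,
                                           unsigned int width_bits)
    {
        a->next = 0;
        a->limit = 1ULL << width_bits;
    }

    /* Reserve 'size' bytes of IOVA, both for guest memory regions (from
     * the memory listener) and for the shadow virtqueue itself. */
    static inline bool iova_alloc(IOVAAllocator *a, uint64_t size,
                                  uint64_t *iova)
    {
        if (size == 0 || size > a->limit - a->next) {
            return false;
        }
        *iova = a->next;
        a->next += size;
        return true;
    }

The resulting IOVA-to-VA (or IOVA-to-GPA) mappings would then be
advertised to vhost via the device IOTLB, as described above.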


>
> Comments are welcome.
>
> Thanks!
>
> Eugenio Pérez (27):
>    vhost: Add vhost_dev_can_log
>    vhost: Add device callback in vhost_migration_log
>    vhost: Move log resize/put to vhost_dev_set_log
>    vhost: add vhost_kernel_set_vring_enable
>    vhost: Add hdev->dev.sw_lm_vq_handler
>    virtio: Add virtio_queue_get_used_notify_split
>    vhost: Route guest->host notification through qemu
>    vhost: Add a flag for software assisted Live Migration
>    vhost: Route host->guest notification through qemu
>    vhost: Allocate shadow vring
>    virtio: const-ify all virtio_tswap* functions
>    virtio: Add virtio_queue_full
>    vhost: Send buffers to device
>    virtio: Remove virtio_queue_get_used_notify_split
>    vhost: Do not invalidate signalled used
>    virtio: Expose virtqueue_alloc_element
>    vhost: add vhost_vring_set_notification_rcu
>    vhost: add vhost_vring_poll_rcu
>    vhost: add vhost_vring_get_buf_rcu
>    vhost: Return used buffers
>    vhost: Add vhost_virtqueue_memory_unmap
>    vhost: Add vhost_virtqueue_memory_map
>    vhost: unmap qemu's shadow virtqueues on sw live migration
>    vhost: iommu changes
>    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>    vhost: Add vhost_hdev_can_sw_lm
>    vhost: forbid vhost devices logging
>
>   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>   include/hw/virtio/vhost.h         |   5 +
>   include/hw/virtio/virtio-access.h |   8 +-
>   include/hw/virtio/virtio.h        |   4 +
>   hw/net/virtio-net.c               |  39 ++-
>   hw/virtio/vhost-backend.c         |  29 ++
>   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>   hw/virtio/virtio.c                |  18 +-
>   hw/virtio/meson.build             |   2 +-
>   10 files changed, 758 insertions(+), 85 deletions(-)
>   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>   create mode 100644 hw/virtio/vhost-sw-lm-ring.c


So this looks like a pretty huge patchset, which I'm trying to think of
ways to split. An idea is to do this in two steps:

1) implement a shadow virtqueue mode for vhost first (w/o live
migration). Then we can test descriptor relay, IOVA allocation, etc.
2) add live migration support on top

And it looks to me it's better to split the shadow virtqueue (virtio
driver part) into an independent file, and use a generic name (w/o
"shadow") so that it can be reused by other use cases as well.

Thoughts?
Eugenio Perez Martin Nov. 25, 2020, 12:03 p.m. UTC | #4
On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
> > This series enable vDPA software assisted live migration for vhost-net
> > devices. This is a new method of vhost devices migration: Instead of
> > relay on vDPA device's dirty logging capability, SW assisted LM
> > intercepts dataplane, forwarding the descriptors between VM and device.
> >
> > In this migration mode, qemu offers a new vring to the device to
> > read and write into, and disable vhost notifiers, processing guest and
> > vhost notifications in qemu. On used buffer relay, qemu will mark the
> > dirty memory as with plain virtio-net devices. This way, devices does
> > not need to have dirty page logging capability.
> >
> > This series is a POC doing SW LM for vhost-net devices, which already
> > have dirty page logging capabilities. None of the changes have actual
> > effect with current devices until last two patches (26 and 27) are
> > applied, but they can be rebased on top of any other. These checks the
> > device to meet all requirements, and disable vhost-net devices logging
> > so migration goes through SW LM. This last patch is not meant to be
> > applied in the final revision, it is in the series just for testing
> > purposes.
> >
> > For use SW assisted LM these vhost-net devices need to be instantiated:
> > * With IOMMU (iommu_platform=on,ats=on)
> > * Without event_idx (event_idx=off)
>
>
> So a question is at what level do we want to implement qemu assisted
> live migration. To me it could be done at two levels:
>
> 1) generic vhost level which makes it work for both vhost-net/vhost-user
> and vhost-vDPA
> 2) a specific type of vhost
>
> To me, having a generic one looks better but it would be much more
> complicated. So what I read from this series is it was a vhost kernel
> specific software assisted live migration which is a good start.
> Actually it may even have real use case, e.g it can save dirty bitmaps
> for guest with large memory. But we need to address the above
> limitations first.
>
> So I would like to know what's the reason for mandating iommu platform
> and ats? And I think we need to fix case of event idx support.
>

There is no specific reason for mandating iommu & ats; it just started
that way.

I will extend the patch to support those cases too.

>
> >
> > Just the notification forwarding (with no descriptor relay) can be
> > achieved with patches 7 and 9, and starting migration. Partial applies
> > between 13 and 24 will not work while migrating on source, and patch
> > 25 is needed for the destination to resume network activity.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
>
>
> Actually we're better than that since there's no need the trick like
> hardcoded IOVA for mediated(shadow) virtqueue.
>
>
> > DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
>
> I notice that you do GPA->VA translations and try to establish a VA->VA
> (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
> vhost-kernel/user but not vhost-vDPA. The reason is that there's no
> guarantee that the whole 64bit address range could be used as IOVA. One
> example is that for hardware IOMMU like intel, it usually has 47 or 52
> bits of address width.
>
> So we probably need an IOVA allocator that can make sure the IOVA is not
> overlapped and fit for [1]. We can probably build the IOVA for guest VA
> via memory listeners. Then we have
>
> 1) IOVA for GPA
> 2) IOVA for shadow VQ
>
> And advertise IOVA to VA mapping to vhost.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
>

Got it, I will take care of that too.

Maybe for vhost-net we could directly send an iotlb miss for [0, ~0ULL].

>
> >
> > Comments are welcome.
> >
> > Thanks!
> >
> > Eugenio Pérez (27):
> >    vhost: Add vhost_dev_can_log
> >    vhost: Add device callback in vhost_migration_log
> >    vhost: Move log resize/put to vhost_dev_set_log
> >    vhost: add vhost_kernel_set_vring_enable
> >    vhost: Add hdev->dev.sw_lm_vq_handler
> >    virtio: Add virtio_queue_get_used_notify_split
> >    vhost: Route guest->host notification through qemu
> >    vhost: Add a flag for software assisted Live Migration
> >    vhost: Route host->guest notification through qemu
> >    vhost: Allocate shadow vring
> >    virtio: const-ify all virtio_tswap* functions
> >    virtio: Add virtio_queue_full
> >    vhost: Send buffers to device
> >    virtio: Remove virtio_queue_get_used_notify_split
> >    vhost: Do not invalidate signalled used
> >    virtio: Expose virtqueue_alloc_element
> >    vhost: add vhost_vring_set_notification_rcu
> >    vhost: add vhost_vring_poll_rcu
> >    vhost: add vhost_vring_get_buf_rcu
> >    vhost: Return used buffers
> >    vhost: Add vhost_virtqueue_memory_unmap
> >    vhost: Add vhost_virtqueue_memory_map
> >    vhost: unmap qemu's shadow virtqueues on sw live migration
> >    vhost: iommu changes
> >    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
> >    vhost: Add vhost_hdev_can_sw_lm
> >    vhost: forbid vhost devices logging
> >
> >   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
> >   include/hw/virtio/vhost.h         |   5 +
> >   include/hw/virtio/virtio-access.h |   8 +-
> >   include/hw/virtio/virtio.h        |   4 +
> >   hw/net/virtio-net.c               |  39 ++-
> >   hw/virtio/vhost-backend.c         |  29 ++
> >   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
> >   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
> >   hw/virtio/virtio.c                |  18 +-
> >   hw/virtio/meson.build             |   2 +-
> >   10 files changed, 758 insertions(+), 85 deletions(-)
> >   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
> >   create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>
>
> So this looks like a pretty huge patchset which I'm trying to think of
> ways to split. An idea is to do this is two steps
>
> 1) implement a shadow virtqueue mode for vhost first (w/o live
> migration). Then we can test descriptors relay, IOVA allocating, etc.

How would that mode be activated if it is not tied to live migration?
New backend/command line switch?

Maybe it is better to also start with no iommu & ats support and add it on top.

> 2) add live migration support on top
>
> And it looks to me it's better to split the shadow virtqueue (virtio
> driver part) into an independent file. And use generic name (w/o
> "shadow") in order to be reused by other use cases as well.
>

I think the same.

Thanks!

> Thoughts?
>
Eugenio Perez Martin Nov. 25, 2020, 12:14 p.m. UTC | #5
On Wed, Nov 25, 2020 at 1:03 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
> > > This series enable vDPA software assisted live migration for vhost-net
> > > devices. This is a new method of vhost devices migration: Instead of
> > > relay on vDPA device's dirty logging capability, SW assisted LM
> > > intercepts dataplane, forwarding the descriptors between VM and device.
> > >
> > > In this migration mode, qemu offers a new vring to the device to
> > > read and write into, and disable vhost notifiers, processing guest and
> > > vhost notifications in qemu. On used buffer relay, qemu will mark the
> > > dirty memory as with plain virtio-net devices. This way, devices does
> > > not need to have dirty page logging capability.
> > >
> > > This series is a POC doing SW LM for vhost-net devices, which already
> > > have dirty page logging capabilities. None of the changes have actual
> > > effect with current devices until last two patches (26 and 27) are
> > > applied, but they can be rebased on top of any other. These checks the
> > > device to meet all requirements, and disable vhost-net devices logging
> > > so migration goes through SW LM. This last patch is not meant to be
> > > applied in the final revision, it is in the series just for testing
> > > purposes.
> > >
> > > For use SW assisted LM these vhost-net devices need to be instantiated:
> > > * With IOMMU (iommu_platform=on,ats=on)
> > > * Without event_idx (event_idx=off)
> >
> >
> > So a question is at what level do we want to implement qemu assisted
> > live migration. To me it could be done at two levels:
> >
> > 1) generic vhost level which makes it work for both vhost-net/vhost-user
> > and vhost-vDPA
> > 2) a specific type of vhost
> >
> > To me, having a generic one looks better but it would be much more
> > complicated. So what I read from this series is it was a vhost kernel
> > specific software assisted live migration which is a good start.
> > Actually it may even have real use case, e.g it can save dirty bitmaps
> > for guest with large memory. But we need to address the above
> > limitations first.
> >
> > So I would like to know what's the reason for mandating iommu platform
> > and ats? And I think we need to fix case of event idx support.
> >
>
> There is no specific reason for mandating iommu & ats, it was just
> started that way.
>
> I will extend the patch to support those cases too.
>
> >
> > >
> > > Just the notification forwarding (with no descriptor relay) can be
> > > achieved with patches 7 and 9, and starting migration. Partial applies
> > > between 13 and 24 will not work while migrating on source, and patch
> > > 25 is needed for the destination to resume network activity.
> > >
> > > It is based on the ideas of DPDK SW assisted LM, in the series of
> >
> >
> > Actually we're better than that since there's no need the trick like
> > hardcoded IOVA for mediated(shadow) virtqueue.
> >
> >
> > > DPDK's https://patchwork.dpdk.org/cover/48370/ .
> >
> >
> > I notice that you do GPA->VA translations and try to establish a VA->VA
> > (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
> > vhost-kernel/user but not vhost-vDPA. The reason is that there's no
> > guarantee that the whole 64bit address range could be used as IOVA. One
> > example is that for hardware IOMMU like intel, it usually has 47 or 52
> > bits of address width.
> >
> > So we probably need an IOVA allocator that can make sure the IOVA is not
> > overlapped and fit for [1]. We can probably build the IOVA for guest VA
> > via memory listeners. Then we have
> >
> > 1) IOVA for GPA
> > 2) IOVA for shadow VQ
> >
> > And advertise IOVA to VA mapping to vhost.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
> >
>
> Got it, will control it too.
>
> Maybe for vhost-net we could directly send iotlb miss for [0,~0ULL].
>

Sorry, this was intended to be a question :).

Given a vhost-* device's usable IOVA range, is it ok to expose all of
qemu's overlapping VA to it? With the iotlb miss, for example. Would it
be acceptable from a security point of view? The device would have access
to all of qemu's VA, but on the other hand devices like vhost-net already
have that.

> >
> > >
> > > Comments are welcome.
> > >
> > > Thanks!
> > >
> > > Eugenio Pérez (27):
> > >    vhost: Add vhost_dev_can_log
> > >    vhost: Add device callback in vhost_migration_log
> > >    vhost: Move log resize/put to vhost_dev_set_log
> > >    vhost: add vhost_kernel_set_vring_enable
> > >    vhost: Add hdev->dev.sw_lm_vq_handler
> > >    virtio: Add virtio_queue_get_used_notify_split
> > >    vhost: Route guest->host notification through qemu
> > >    vhost: Add a flag for software assisted Live Migration
> > >    vhost: Route host->guest notification through qemu
> > >    vhost: Allocate shadow vring
> > >    virtio: const-ify all virtio_tswap* functions
> > >    virtio: Add virtio_queue_full
> > >    vhost: Send buffers to device
> > >    virtio: Remove virtio_queue_get_used_notify_split
> > >    vhost: Do not invalidate signalled used
> > >    virtio: Expose virtqueue_alloc_element
> > >    vhost: add vhost_vring_set_notification_rcu
> > >    vhost: add vhost_vring_poll_rcu
> > >    vhost: add vhost_vring_get_buf_rcu
> > >    vhost: Return used buffers
> > >    vhost: Add vhost_virtqueue_memory_unmap
> > >    vhost: Add vhost_virtqueue_memory_map
> > >    vhost: unmap qemu's shadow virtqueues on sw live migration
> > >    vhost: iommu changes
> > >    vhost: Do not commit vhost used idx on vhost_virtqueue_stop
> > >    vhost: Add vhost_hdev_can_sw_lm
> > >    vhost: forbid vhost devices logging
> > >
> > >   hw/virtio/vhost-sw-lm-ring.h      |  39 +++
> > >   include/hw/virtio/vhost.h         |   5 +
> > >   include/hw/virtio/virtio-access.h |   8 +-
> > >   include/hw/virtio/virtio.h        |   4 +
> > >   hw/net/virtio-net.c               |  39 ++-
> > >   hw/virtio/vhost-backend.c         |  29 ++
> > >   hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
> > >   hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
> > >   hw/virtio/virtio.c                |  18 +-
> > >   hw/virtio/meson.build             |   2 +-
> > >   10 files changed, 758 insertions(+), 85 deletions(-)
> > >   create mode 100644 hw/virtio/vhost-sw-lm-ring.h
> > >   create mode 100644 hw/virtio/vhost-sw-lm-ring.c
> >
> >
> > So this looks like a pretty huge patchset which I'm trying to think of
> > ways to split. An idea is to do this is two steps
> >
> > 1) implement a shadow virtqueue mode for vhost first (w/o live
> > migration). Then we can test descriptors relay, IOVA allocating, etc.
>
> How would that mode be activated if it is not tied to live migration?
> New backend/command line switch?
>
> Maybe it is better to also start with no iommu & ats support and add it on top.
>
> > 2) add live migration support on top
> >
> > And it looks to me it's better to split the shadow virtqueue (virtio
> > driver part) into an independent file. And use generic name (w/o
> > "shadow") in order to be reused by other use cases as well.
> >
>
> I think the same.
>
> Thanks!
>
> > Thoughts?
> >
Jason Wang Nov. 26, 2020, 3:07 a.m. UTC | #6
On 2020/11/25 8:03 PM, Eugenio Perez Martin wrote:
> On Wed, Nov 25, 2020 at 8:09 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/11/21 2:50 AM, Eugenio Pérez wrote:
>>> This series enable vDPA software assisted live migration for vhost-net
>>> devices. This is a new method of vhost devices migration: Instead of
>>> relay on vDPA device's dirty logging capability, SW assisted LM
>>> intercepts dataplane, forwarding the descriptors between VM and device.
>>>
>>> In this migration mode, qemu offers a new vring to the device to
>>> read and write into, and disable vhost notifiers, processing guest and
>>> vhost notifications in qemu. On used buffer relay, qemu will mark the
>>> dirty memory as with plain virtio-net devices. This way, devices does
>>> not need to have dirty page logging capability.
>>>
>>> This series is a POC doing SW LM for vhost-net devices, which already
>>> have dirty page logging capabilities. None of the changes have actual
>>> effect with current devices until last two patches (26 and 27) are
>>> applied, but they can be rebased on top of any other. These checks the
>>> device to meet all requirements, and disable vhost-net devices logging
>>> so migration goes through SW LM. This last patch is not meant to be
>>> applied in the final revision, it is in the series just for testing
>>> purposes.
>>>
>>> For use SW assisted LM these vhost-net devices need to be instantiated:
>>> * With IOMMU (iommu_platform=on,ats=on)
>>> * Without event_idx (event_idx=off)
>>
>> So a question is at what level do we want to implement qemu assisted
>> live migration. To me it could be done at two levels:
>>
>> 1) generic vhost level which makes it work for both vhost-net/vhost-user
>> and vhost-vDPA
>> 2) a specific type of vhost
>>
>> To me, having a generic one looks better but it would be much more
>> complicated. So what I read from this series is it was a vhost kernel
>> specific software assisted live migration which is a good start.
>> Actually it may even have real use case, e.g it can save dirty bitmaps
>> for guest with large memory. But we need to address the above
>> limitations first.
>>
>> So I would like to know what's the reason for mandating iommu platform
>> and ats? And I think we need to fix case of event idx support.
>>
> There is no specific reason for mandating iommu & ats, it was just
> started that way.
>
> I will extend the patch to support those cases too.
>
>>> Just the notification forwarding (with no descriptor relay) can be
>>> achieved with patches 7 and 9, and starting migration. Partial applies
>>> between 13 and 24 will not work while migrating on source, and patch
>>> 25 is needed for the destination to resume network activity.
>>>
>>> It is based on the ideas of DPDK SW assisted LM, in the series of
>>
>> Actually we're better than that since there's no need the trick like
>> hardcoded IOVA for mediated(shadow) virtqueue.
>>
>>
>>> DPDK's https://patchwork.dpdk.org/cover/48370/ .
>>
>> I notice that you do GPA->VA translations and try to establish a VA->VA
>> (use VA as IOVA) mapping via device IOTLB. This shortcut should work for
>> vhost-kernel/user but not vhost-vDPA. The reason is that there's no
>> guarantee that the whole 64bit address range could be used as IOVA. One
>> example is that for hardware IOMMU like intel, it usually has 47 or 52
>> bits of address width.
>>
>> So we probably need an IOVA allocator that can make sure the IOVA is not
>> overlapped and fit for [1]. We can probably build the IOVA for guest VA
>> via memory listeners. Then we have
>>
>> 1) IOVA for GPA
>> 2) IOVA for shadow VQ
>>
>> And advertise IOVA to VA mapping to vhost.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1b48dc03e575a872404f33b04cd237953c5d7498
>>
> Got it, will control it too.
>
> Maybe for vhost-net we could directly send iotlb miss for [0,~0ULL].


It works but it means vhost-net needs some special care. To me a generic 
IOVA allocator looks better.


>
>>> Comments are welcome.
>>>
>>> Thanks!
>>>
>>> Eugenio Pérez (27):
>>>     vhost: Add vhost_dev_can_log
>>>     vhost: Add device callback in vhost_migration_log
>>>     vhost: Move log resize/put to vhost_dev_set_log
>>>     vhost: add vhost_kernel_set_vring_enable
>>>     vhost: Add hdev->dev.sw_lm_vq_handler
>>>     virtio: Add virtio_queue_get_used_notify_split
>>>     vhost: Route guest->host notification through qemu
>>>     vhost: Add a flag for software assisted Live Migration
>>>     vhost: Route host->guest notification through qemu
>>>     vhost: Allocate shadow vring
>>>     virtio: const-ify all virtio_tswap* functions
>>>     virtio: Add virtio_queue_full
>>>     vhost: Send buffers to device
>>>     virtio: Remove virtio_queue_get_used_notify_split
>>>     vhost: Do not invalidate signalled used
>>>     virtio: Expose virtqueue_alloc_element
>>>     vhost: add vhost_vring_set_notification_rcu
>>>     vhost: add vhost_vring_poll_rcu
>>>     vhost: add vhost_vring_get_buf_rcu
>>>     vhost: Return used buffers
>>>     vhost: Add vhost_virtqueue_memory_unmap
>>>     vhost: Add vhost_virtqueue_memory_map
>>>     vhost: unmap qemu's shadow virtqueues on sw live migration
>>>     vhost: iommu changes
>>>     vhost: Do not commit vhost used idx on vhost_virtqueue_stop
>>>     vhost: Add vhost_hdev_can_sw_lm
>>>     vhost: forbid vhost devices logging
>>>
>>>    hw/virtio/vhost-sw-lm-ring.h      |  39 +++
>>>    include/hw/virtio/vhost.h         |   5 +
>>>    include/hw/virtio/virtio-access.h |   8 +-
>>>    include/hw/virtio/virtio.h        |   4 +
>>>    hw/net/virtio-net.c               |  39 ++-
>>>    hw/virtio/vhost-backend.c         |  29 ++
>>>    hw/virtio/vhost-sw-lm-ring.c      | 268 +++++++++++++++++++
>>>    hw/virtio/vhost.c                 | 431 +++++++++++++++++++++++++-----
>>>    hw/virtio/virtio.c                |  18 +-
>>>    hw/virtio/meson.build             |   2 +-
>>>    10 files changed, 758 insertions(+), 85 deletions(-)
>>>    create mode 100644 hw/virtio/vhost-sw-lm-ring.h
>>>    create mode 100644 hw/virtio/vhost-sw-lm-ring.c
>>
>> So this looks like a pretty huge patchset which I'm trying to think of
>> ways to split. An idea is to do this is two steps
>>
>> 1) implement a shadow virtqueue mode for vhost first (w/o live
>> migration). Then we can test descriptors relay, IOVA allocating, etc.
> How would that mode be activated if it is not tied to live migration?
> New backend/command line switch?


Either a new cli option or even a qmp command can work.


>
> Maybe it is better to also start with no iommu & ats support and add it on top.


Yes.


>
>> 2) add live migration support on top
>>
>> And it looks to me it's better to split the shadow virtqueue (virtio
>> driver part) into an independent file. And use generic name (w/o
>> "shadow") in order to be reused by other use cases as well.
>>
> I think the same.
>
> Thanks!
>
>> Thoughts?
>>
>
Stefano Garzarella Nov. 27, 2020, 3:44 p.m. UTC | #7
On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
>This series enable vDPA software assisted live migration for vhost-net
>devices. This is a new method of vhost devices migration: Instead of
>relay on vDPA device's dirty logging capability, SW assisted LM
>intercepts dataplane, forwarding the descriptors between VM and device.
>
>In this migration mode, qemu offers a new vring to the device to
>read and write into, and disable vhost notifiers, processing guest and
>vhost notifications in qemu. On used buffer relay, qemu will mark the
>dirty memory as with plain virtio-net devices. This way, devices does
>not need to have dirty page logging capability.
>
>This series is a POC doing SW LM for vhost-net devices, which already
>have dirty page logging capabilities. None of the changes have actual
>effect with current devices until last two patches (26 and 27) are
>applied, but they can be rebased on top of any other. These checks the
>device to meet all requirements, and disable vhost-net devices logging
>so migration goes through SW LM. This last patch is not meant to be
>applied in the final revision, it is in the series just for testing
>purposes.
>
>For use SW assisted LM these vhost-net devices need to be instantiated:
>* With IOMMU (iommu_platform=on,ats=on)
>* Without event_idx (event_idx=off)
>
>Just the notification forwarding (with no descriptor relay) can be
>achieved with patches 7 and 9, and starting migration. Partial applies
>between 13 and 24 will not work while migrating on source, and patch
>25 is needed for the destination to resume network activity.
>
>It is based on the ideas of DPDK SW assisted LM, in the series of
>DPDK's https://patchwork.dpdk.org/cover/48370/ .
>
>Comments are welcome.

Hi Eugenio,
I took a look, and I think the idea of the shadow queue is the right way
to go. It's very similar to what Stefan and I had thought of for io_uring
passthrough and vdpa-blk.

IIUC, when the migration starts, the notifications from the guest to
vhost are disabled, so QEMU starts to intercept them through the
custom_handler installed in virtio-net (we need to understand how to
generalize this).
At this point QEMU starts to use the shadow queues and exposes them to
vhost.
The opposite is done for vhost-to-guest notifications, where
vhost_handle_call is installed on the masked_notifier to intercept the
notifications.
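
A rough sketch of that call-side interception (the struct and the relay
helper are hypothetical; vhost_virtqueue_mask(), virtio_notify_irqfd()
and the event-notifier helpers are existing qemu functions):

    #include "qemu/osdep.h"
    #include "qemu/event_notifier.h"
    #include "hw/virtio/virtio.h"
    #include "hw/virtio/vhost.h"

    typedef struct VhostShadowVirtqueue {
        VirtIODevice *vdev;
        VirtQueue *guest_vq;
        EventNotifier call_notifier;  /* masked notifier vhost writes to */
    } VhostShadowVirtqueue;

    /* Hypothetical: relay used buffers from the shadow ring back to the
     * guest ring, dirtying the pages in the process. */
    static void shadow_vq_flush_used(VhostShadowVirtqueue *svq);

    /* Handler attached to the masked call notifier. */
    static void shadow_vq_handle_call(EventNotifier *n)
    {
        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                                 call_notifier);

        if (event_notifier_test_and_clear(n)) {
            shadow_vq_flush_used(svq);
            virtio_notify_irqfd(svq->vdev, svq->guest_vq);
        }
    }

    /* Enabling the interception: masking the vhost virtqueue points the
     * device's call fd at the masked notifier handled above. */
    static void shadow_vq_start_call_relay(struct vhost_dev *hdev,
                                           VirtIODevice *vdev, int idx)
    {
        vhost_virtqueue_mask(hdev, vdev, idx, true);
    }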

I hope to give better feedback when I get a complete overview ;-)

Anyway, as Jason suggested, we should split this series, so maybe we can
merge some preparation patches (e.g. 1, 11, 21, 22) regardless of the
other patches.

Thanks,
Stefano
Stefan Hajnoczi Dec. 8, 2020, 9:37 a.m. UTC | #8
On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> This series enable vDPA software assisted live migration for vhost-net
> devices. This is a new method of vhost devices migration: Instead of
> relay on vDPA device's dirty logging capability, SW assisted LM
> intercepts dataplane, forwarding the descriptors between VM and device.

Pros:
+ vhost/vDPA devices don't need to implement dirty memory logging
+ Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends

Cons:
- Not generic, relies on vhost-net-specific ioctls
- Doesn't support VIRTIO Shared Memory Regions
  https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
- Performance (see below)

I think performance will be significantly lower when the shadow vq is
enabled. Imagine a vDPA device with hardware vq doorbell registers
mapped into the guest so the guest driver can directly kick the device.
When the shadow vq is enabled a vmexit is needed to write to the shadow
vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
to read the ioeventfd, the descriptors are translated, QEMU writes to
the vhost hdev kick fd, the host kernel scheduler switches to the vhost
worker thread, vhost/vDPA notifies the virtqueue, and finally the
vDPA driver writes to the hardware vq doorbell register. That is a lot
of overhead compared to writing to an exitless MMIO register!
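
Concretely, the kick-side relay being described would look roughly like
the sketch below (handler and struct names are hypothetical; only the
event-notifier and virtqueue helpers are existing qemu API), and every
step after the vmexit runs in a qemu thread:

    #include "qemu/osdep.h"
    #include "qemu/event_notifier.h"
    #include "hw/virtio/virtio.h"

    typedef struct VhostShadowVirtqueue {
        VirtQueue *guest_vq;
        EventNotifier guest_kick;  /* ioeventfd written on the vmexit */
        EventNotifier hdev_kick;   /* real vhost/vDPA kick fd */
    } VhostShadowVirtqueue;

    /* Hypothetical: translate one guest element into the shadow ring. */
    static void shadow_vq_add(VhostShadowVirtqueue *svq,
                              VirtQueueElement *elem);

    /* Runs in a qemu thread for every guest notification. */
    static void shadow_vq_handle_guest_kick(EventNotifier *n)
    {
        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                                 guest_kick);
        VirtQueueElement *elem;

        event_notifier_test_and_clear(n);

        /* Relay every available descriptor chain. */
        while ((elem = virtqueue_pop(svq->guest_vq, sizeof(*elem)))) {
            shadow_vq_add(svq, elem);
        }

        /* Only now is the device's real doorbell written. */
        event_notifier_set(&svq->hdev_kick);
    }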

If the shadow vq was implemented in drivers/vhost/ and QEMU used the
existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
dirty memory logging happens asynchronously and isn't in the dataplane.

In addition, hardware that supports dirty memory logging as well as
software vDPA devices could completely eliminate the shadow vq for even
better performance.

But performance is a question of "is it good enough?". Maybe this
approach is okay and users don't expect good performance while dirty
memory logging is enabled. I just wanted to share the idea of moving the
shadow vq into the kernel in case you like that approach better.

Stefan
Jason Wang Dec. 9, 2020, 9:26 a.m. UTC | #9
----- Original Message -----
> On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> > This series enable vDPA software assisted live migration for vhost-net
> > devices. This is a new method of vhost devices migration: Instead of
> > relay on vDPA device's dirty logging capability, SW assisted LM
> > intercepts dataplane, forwarding the descriptors between VM and device.
> 
> Pros:
> + vhost/vDPA devices don't need to implement dirty memory logging
> + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
> 
> Cons:
> - Not generic, relies on vhost-net-specific ioctls
> - Doesn't support VIRTIO Shared Memory Regions
>   https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex

I may be missing something, but my understanding is that it's the
responsibility of the device to migrate this part?

> - Performance (see below)
> 
> I think performance will be significantly lower when the shadow vq is
> enabled. Imagine a vDPA device with hardware vq doorbell registers
> mapped into the guest so the guest driver can directly kick the device.
> When the shadow vq is enabled a vmexit is needed to write to the shadow
> vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
> to read the ioeventfd, the descriptors are translated, QEMU writes to
> the vhost hdev kick fd, the host kernel scheduler switches to the vhost
> worker thread, vhost/vDPA notifies the virtqueue, and finally the
> vDPA driver writes to the hardware vq doorbell register. That is a lot
> of overhead compared to writing to an exitless MMIO register!

I think it's a balance. E.g. we can poll the virtqueue to have an
exitless doorbell.

> 
> If the shadow vq was implemented in drivers/vhost/ and QEMU used the
> existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
> reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
> dirty memory logging happens asynchronously and isn't in the dataplane.
> 
> In addition, hardware that supports dirty memory logging as well as
> software vDPA devices could completely eliminate the shadow vq for even
> better performance.

Yes. That's our plan. But the interface might require more thought.

E.g. is the bitmap a good approach? To me, reporting dirty pages via a
virtqueue is better since it has a smaller footprint and is
self-throttled.

And we need an address space other than the one used by the guest for
either the bitmap or the virtqueue.

> 
> But performance is a question of "is it good enough?". Maybe this
> approach is okay and users don't expect good performance while dirty
> memory logging is enabled.

Yes, and actually such a slowdown may help the migration converge.

Note that the whole idea is to try to have a generic solution for all
types of devices. It's good to consider the performance, but for the
first stage it should be sufficient to make it work and consider
optimizing on top.

> I just wanted to share the idea of moving the
> shadow vq into the kernel in case you like that approach better.

My understanding is to keep the kernel as simple as possible and leave
the policies to userspace as much as possible. E.g. it requires us to
disable doorbell mapping and irq offloading, all of which are under
the control of userspace.

Thanks

> 
> Stefan
>
Stefan Hajnoczi Dec. 9, 2020, 3:57 p.m. UTC | #10
On Wed, Dec 09, 2020 at 04:26:50AM -0500, Jason Wang wrote:
> ----- Original Message -----
> > On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
> > > This series enable vDPA software assisted live migration for vhost-net
> > > devices. This is a new method of vhost devices migration: Instead of
> > > relay on vDPA device's dirty logging capability, SW assisted LM
> > > intercepts dataplane, forwarding the descriptors between VM and device.
> > 
> > Pros:
> > + vhost/vDPA devices don't need to implement dirty memory logging
> > + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
> > 
> > Cons:
> > - Not generic, relies on vhost-net-specific ioctls
> > - Doesn't support VIRTIO Shared Memory Regions
> >   https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
> 
> I may miss something but my understanding is that it's the
> responsiblity of device to migrate this part?

Good point. You're right.

> > - Performance (see below)
> > 
> > I think performance will be significantly lower when the shadow vq is
> > enabled. Imagine a vDPA device with hardware vq doorbell registers
> > mapped into the guest so the guest driver can directly kick the device.
> > When the shadow vq is enabled a vmexit is needed to write to the shadow
> > vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
> > to read the ioeventfd, the descriptors are translated, QEMU writes to
> > the vhost hdev kick fd, the host kernel scheduler switches to the vhost
> > worker thread, vhost/vDPA notifies the virtqueue, and finally the
> > vDPA driver writes to the hardware vq doorbell register. That is a lot
> > of overhead compared to writing to an exitless MMIO register!
> 
> I think it's a balance. E.g we can poll the virtqueue to have an
> exitless doorbell.
> 
> > 
> > If the shadow vq was implemented in drivers/vhost/ and QEMU used the
> > existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
> > reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
> > dirty memory logging happens asynchronously and isn't in the dataplane.
> > 
> > In addition, hardware that supports dirty memory logging as well as
> > software vDPA devices could completely eliminate the shadow vq for even
> > better performance.
> 
> Yes. That's our plan. But the interface might require more thought.
> 
> E.g is the bitmap a good approach? To me reporting dirty pages via
> virqueue is better since it get less footprint and is self throttled.
> 
> And we need an address space other than the one used by guest for
> either bitmap for virtqueue.
> 
> > 
> > But performance is a question of "is it good enough?". Maybe this
> > approach is okay and users don't expect good performance while dirty
> > memory logging is enabled.
> 
> Yes, and actually such slow down may help for the converge of the
> migration.
> 
> Note that the whole idea is try to have a generic solution for all
> types of devices. It's good to consider the performance but for the
> first stage, it should be sufficient to make it work and consider to
> optimize on top.

Moving the shadow vq to the kernel later would be quite a big change
requiring rewriting much of the code. That's why I mentioned this now
before a lot of effort is invested in a QEMU implementation.

> > I just wanted to share the idea of moving the
> > shadow vq into the kernel in case you like that approach better.
> 
> My understanding is to keep kernel as simple as possible and leave the
> polices to userspace as much as possible. E.g it requires us to
> disable doorbell mapping and irq offloading, all of which were under
> the control of userspace.

If the performance is acceptable with the QEMU approach then I think
that's the best place to implement it. It looks high-overhead though so
maybe one of the first things to do is to run benchmarks to collect data
on how it performs?

Stefan
Jason Wang Dec. 10, 2020, 9:12 a.m. UTC | #11
On 2020/12/9 11:57 PM, Stefan Hajnoczi wrote:
> On Wed, Dec 09, 2020 at 04:26:50AM -0500, Jason Wang wrote:
>> ----- Original Message -----
>>> On Fri, Nov 20, 2020 at 07:50:38PM +0100, Eugenio Pérez wrote:
>>>> This series enable vDPA software assisted live migration for vhost-net
>>>> devices. This is a new method of vhost devices migration: Instead of
>>>> relay on vDPA device's dirty logging capability, SW assisted LM
>>>> intercepts dataplane, forwarding the descriptors between VM and device.
>>> Pros:
>>> + vhost/vDPA devices don't need to implement dirty memory logging
>>> + Obsoletes ioctl(VHOST_SET_LOG_BASE) and friends
>>>
>>> Cons:
>>> - Not generic, relies on vhost-net-specific ioctls
>>> - Doesn't support VIRTIO Shared Memory Regions
>>>    https://github.com/oasis-tcs/virtio-spec/blob/master/shared-mem.tex
>> I may miss something but my understanding is that it's the
>> responsiblity of device to migrate this part?
> Good point. You're right.
>
>>> - Performance (see below)
>>>
>>> I think performance will be significantly lower when the shadow vq is
>>> enabled. Imagine a vDPA device with hardware vq doorbell registers
>>> mapped into the guest so the guest driver can directly kick the device.
>>> When the shadow vq is enabled a vmexit is needed to write to the shadow
>>> vq ioeventfd, then the host kernel scheduler switches to a QEMU thread
>>> to read the ioeventfd, the descriptors are translated, QEMU writes to
>>> the vhost hdev kick fd, the host kernel scheduler switches to the vhost
>>> worker thread, vhost/vDPA notifies the virtqueue, and finally the
>>> vDPA driver writes to the hardware vq doorbell register. That is a lot
>>> of overhead compared to writing to an exitless MMIO register!
>> I think it's a balance. E.g we can poll the virtqueue to have an
>> exitless doorbell.
>>
>>> If the shadow vq was implemented in drivers/vhost/ and QEMU used the
>>> existing ioctl(VHOST_SET_LOG_BASE) approach, then the overhead would be
>>> reduced to just one set of ioeventfd/irqfd. In other words, the QEMU
>>> dirty memory logging happens asynchronously and isn't in the dataplane.
>>>
>>> In addition, hardware that supports dirty memory logging as well as
>>> software vDPA devices could completely eliminate the shadow vq for even
>>> better performance.
>> Yes. That's our plan. But the interface might require more thought.
>>
>> E.g is the bitmap a good approach? To me reporting dirty pages via
>> virqueue is better since it get less footprint and is self throttled.
>>
>> And we need an address space other than the one used by guest for
>> either bitmap for virtqueue.
>>
>>> But performance is a question of "is it good enough?". Maybe this
>>> approach is okay and users don't expect good performance while dirty
>>> memory logging is enabled.
>> Yes, and actually such slow down may help for the converge of the
>> migration.
>>
>> Note that the whole idea is try to have a generic solution for all
>> types of devices. It's good to consider the performance but for the
>> first stage, it should be sufficient to make it work and consider to
>> optimize on top.
> Moving the shadow vq to the kernel later would be quite a big change
> requiring rewriting much of the code. That's why I mentioned this now
> before a lot of effort is invested in a QEMU implementation.


Right.


>
>>> I just wanted to share the idea of moving the
>>> shadow vq into the kernel in case you like that approach better.
>> My understanding is to keep kernel as simple as possible and leave the
>> polices to userspace as much as possible. E.g it requires us to
>> disable doorbell mapping and irq offloading, all of which were under
>> the control of userspace.
> If the performance is acceptable with the QEMU approach then I think
> that's the best place to implement it. It looks high-overhead though so
> maybe one of the first things to do is to run benchmarks to collect data
> on how it performs?


Yes, I agree.

Thanks


>
> Stefan