Message ID: 20230411150515.14020-1-hreitz@redhat.com (mailing list archive)
Series: vhost-user-fs: Internal migration
Hi,

Is there a vhost-user.rst spec patch?

Thanks,
Stefan
On 12.04.23 23:00, Stefan Hajnoczi wrote:
> Hi,
> Is there a vhost-user.rst spec patch?

Ah, right, I forgot. Will add!

Hanna
On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:
> RFC: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
>
> Hi,
>
> Patch 2 of this series adds new vhost methods (only for vhost-user at this point) for transferring the back-end’s internal state to/from qemu during migration, so that this state can be stored in the migration stream. (This is what we call “internal migration”, because the state is internally available to qemu; this is in contrast to “external migration”, which Anton is working on, where the back-end’s state is handled by the back-end itself without involving qemu.)
>
> For this, the state is handled as a binary blob by qemu, and it is transferred over a pipe that is established via a new vhost method.
>
> Patch 3 adds two high-level helper functions to (A) fetch any vhost back-end’s internal state and store it in a migration stream (a `QEMUFile`), and (B) load such state from a migration stream and send it to a vhost back-end. These build on the low-level interface introduced in patch 2.
>
> Patch 4 then uses these functions to implement internal migration for vhost-user-fs. Note that this of course depends on support in the back-end (virtiofsd), which is not yet ready.
>
> Finally, patch 1 fixes a bug around migrating vhost-user devices: To enable/disable logging[1], the VHOST_F_LOG_ALL feature must be set/cleared, via the SET_FEATURES call. Another, technically unrelated, feature exists, VHOST_USER_F_PROTOCOL_FEATURES, which indicates support for vhost-user protocol features. Naturally, qemu wants to keep that other feature enabled, so it will set it (when possible) in every SET_FEATURES call. However, a side effect of setting VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.

I didn't get this part. Two questions:

Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the ring starts directly in the enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is initialized in a disabled state and is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.

So VHOST_USER_F_PROTOCOL_FEATURES only controls the initial state of rings; it does not disable rings.

> This causes any enabling (done at the start of migration) or disabling (done on the source after a cancelled/failed migration) of logging to make the back-end hang. Without patch 1, therefore, starting a migration will have any vhost-user back-end that supports both VHOST_F_LOG_ALL and VHOST_USER_F_PROTOCOL_FEATURES immediately hang completely, and unless execution is transferred to the destination, it will continue to hang.
>
> [1] Logging here means logging writes to guest memory pages in a dirty bitmap so that these dirty pages are flushed to the destination. qemu cannot monitor the back-end’s writes to guest memory, so the back-end has to do so itself, and log its writes in a dirty bitmap shared with qemu.
>
> Changes in v1 compared to the RFC:
> - Patch 1 added
>
> - Patch 2: Interface is different, now uses a pipe instead of shared memory (as suggested by Stefan); also, this is now a generic vhost-user interface, and not just for vhost-user-fs
>
> - Patches 3 and 4: Because this is now supposed to be a generic migration method for vhost-user back-ends, most of the migration code has been moved from vhost-user-fs.c to vhost.c so it can be shared between different back-ends. The vhost-user-fs code is now a rather thin wrapper around the common code.
>
> - Note also (as suggested by Anton) that the back-end’s migration state is now in a subsection, and that it is technically optional. “Technically” means that with this series, it is always used (unless the back-end doesn’t support migration, in which case migration is just blocked), but Anton’s series for external migration would make it optional. (I.e., the subsection would be skipped for external migration, and mandatorily included for internal migration.)
>
> Hanna Czenczek (4):
>   vhost: Re-enable vrings after setting features
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
>
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h         | 124 +++++++++++++++
>  hw/virtio/vhost-user-fs.c         | 101 +++++++++++-
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++
>  hw/virtio/vhost.c                 | 246 ++++++++++++++++++++++++++++++
>  5 files changed, 641 insertions(+), 1 deletion(-)
>
> --
> 2.39.1
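For reference, the workaround described in the cover letter for patch 1 amounts to roughly the following; this is a sketch only, with illustrative helper names rather than the actual QEMU code:

    /*
     * Rough sketch of the patch-1 workaround: after a SET_FEATURES that
     * includes VHOST_USER_F_PROTOCOL_FEATURES, re-enable the vrings that
     * some back-ends disable as a side effect.  Helper names are
     * illustrative, not the actual QEMU code.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define VHOST_USER_F_PROTOCOL_FEATURES 30

    struct vhost_dev; /* opaque here; QEMU's per-device vhost state */
    int vhost_user_set_features(struct vhost_dev *dev, uint64_t features);
    int vhost_user_set_vring_enable(struct vhost_dev *dev, int idx, bool on);

    static int set_features_and_reenable(struct vhost_dev *dev,
                                         uint64_t features, int nvqs)
    {
        /* May disable all vrings in back-ends that interpret
         * F_PROTOCOL_FEATURES that way: */
        int r = vhost_user_set_features(dev, features);
        if (r < 0) {
            return r;
        }

        if (features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) {
            /* Re-enable the vrings that SET_FEATURES may have disabled */
            for (int i = 0; i < nvqs; i++) {
                r = vhost_user_set_vring_enable(dev, i, true);
                if (r < 0) {
                    return r;
                }
            }
        }
        return 0;
    }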
On 13.04.23 18:11, Michael S. Tsirkin wrote:
> On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:

[...]

>> However, a side effect of setting VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.
>
> I didn't get this part. Two questions:
>
> Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>
> If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the ring starts directly in the enabled state.
>
> If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is initialized in a disabled state and is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
>
> So VHOST_USER_F_PROTOCOL_FEATURES only controls the initial state of rings; it does not disable rings.

Oh. Thanks. :)

That’s indeed a valid and more sensible interpretation. I know that the vhost-user-backend crate virtiofsd uses has interpreted it differently. Looking into libvhost-user and DPDK, both have decided to instead have all vrings be disabled at reset, and to enable them only when a SET_FEATURES without F_PROTOCOL_FEATURES comes in. That doesn’t sound quite literally to spec either, but it adheres to the interpretation of not disabling any rings just because F_PROTOCOL_FEATURES appears. (I thought of proposing this (“only disable vrings for a `false` -> `true` flag state transition”), but thought that’d be too complicated. Oh, well. :))

So the fix will go to the vhost-user-backend crate instead of qemu. That’s good!

Still, I will also prepare a patch to vhost-user.rst for this, because I still don’t find the specification clear on this. The thing is, nobody interprets it as “negotiating this feature will decide whether, when all rings are initialized, they will be initialized in disabled or enabled state”, which is how I think you’ve interpreted it. The problem is that “initialization” isn’t well-defined here. Even libvhost-user and DPDK initialize the rings always in disabled state, regardless of this feature, but will put them into an enabled state later on if the feature isn’t negotiated. I think this exact behavior should be precisely described in the spec, like:

    Between initialization and ``VHOST_USER_SET_FEATURES``, it is
    implementation-defined whether each ring is enabled or disabled.

    If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, each
    ring, when started, will be enabled immediately.

    If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, each ring
    will remain in the disabled state until ``VHOST_USER_SET_VRING_ENABLE``
    enables it with parameter 1.

Hanna
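For reference, the libvhost-user/DPDK behavior described above amounts to roughly the following; this is a minimal sketch, not taken from either code base:

    /*
     * Sketch of the back-end ring-state handling described above: rings
     * start out disabled regardless of features, and SET_FEATURES without
     * F_PROTOCOL_FEATURES marks them enabled (since SET_VRING_ENABLE will
     * never come).  All names here are made up for illustration.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define VHOST_USER_F_PROTOCOL_FEATURES 30
    #define MAX_QUEUES 8

    struct backend {
        uint64_t acked_features;
        bool vring_enabled[MAX_QUEUES];
    };

    /* On reset, all rings are disabled, whatever the features say. */
    static void backend_reset(struct backend *b)
    {
        b->acked_features = 0;
        for (int i = 0; i < MAX_QUEUES; i++) {
            b->vring_enabled[i] = false;
        }
    }

    static void on_set_features(struct backend *b, uint64_t features)
    {
        b->acked_features = features;

        if (!(features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
            /* No protocol features: SET_VRING_ENABLE will never arrive,
             * so rings must count as enabled from here on. */
            for (int i = 0; i < MAX_QUEUES; i++) {
                b->vring_enabled[i] = true;
            }
        }
        /* With protocol features negotiated, rings keep their current
         * state and are only flipped by explicit SET_VRING_ENABLE
         * messages; they are not force-disabled here. */
    }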
On 11.04.23 17:05, Hanna Czenczek wrote:

[...]

> Hanna Czenczek (4):
>   vhost: Re-enable vrings after setting features
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration

I’m trying to write v2, and my intention was to keep the code conceptually largely the same, but include in the documentation change thoughts and notes on how this interface is to be used in the future, when e.g. vDPA “extensions” come over to vhost-user. My plan was to, based on that documentation, discuss further.

But now I’m struggling to even write that documentation because it’s not clear to me what exactly the result of the discussion was, so I need to stop even before that.

So as far as I understand, we need/want SUSPEND/RESUME for two reasons:

1. As a signal to the back-end when virt queues are no longer to be processed, so that it is clear that it will not do that when asked for migration state.

2. Stateful devices that support SET_STATUS receive a status of 0 when the VM is stopped, which supposedly resets the internal state. While suspended, device state is frozen, so as far as I understand, SUSPEND before SET_STATUS would have the status change be deferred until RESUME.

I don’t want to hang myself up on 2 because it doesn’t really seem important to this series, but: Why does a status of 0 reset the internal state? [Note: This is all virtio_reset() seems to do, set the status to 0.] The vhost-user specification only points to the virtio specification, which doesn’t say anything to that effect. Instead, an explicit device reset is mentioned, which would be VHOST_USER_RESET_DEVICE, i.e. something completely different. Because RESET_DEVICE directly contradicts SUSPEND’s description, I would like to think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.

Is it that a status 0 won’t explicitly reset the internal state, but because it does mean that the driver is unbound, the state should implicitly be reset?

Anyway. 1 seems to be the relevant point for migration. As far as I understand, currently, a vhost-user back-end has no way of knowing when to stop processing virt queues. Basically, rings can only transition from stopped to started, but not vice versa. The vhost-user specification has this bit: “Once the source has finished migration, rings will be stopped by the source. No further update must be done before rings are restarted.” It just doesn’t say how the front-end lets the back-end know that the rings are (to be) stopped. So this seems like a pre-existing problem for stateless migration. Unless this is communicated precisely by setting the device status to 0?

Naturally, what I want to know most of all is whether you believe I can get away without SUSPEND/RESUME for now. To me, it seems like honestly not really, only when turning two blind eyes, because otherwise we can’t ensure that virtiofsd isn’t still processing pending virt queue requests when the state transfer is begun, even when the guest CPUs are already stopped. Of course, virtiofsd could stop queue processing right there and then, but… That feels like a hack that in the grand scheme of things just isn’t necessary when we could “just” introduce SUSPEND/RESUME into vhost-user for exactly this.

Beyond the SUSPEND/RESUME question, I understand everything can stay as-is for now, as the design doesn’t seem to conflict too badly with possible future extensions for other migration phases or more finely grained migration phase control between front-end and back-end.

Did I at least roughly get the gist?

Hanna
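To make point 1 concrete, a SUSPEND handler in a back-end could look like the following if such a message were added to vhost-user. This is purely illustrative; no such message exists in the protocol at this point, and all names are made up:

    /*
     * Hypothetical back-end SUSPEND handling: stop pulling requests and
     * drain in-flight ones before acknowledging, so the front-end knows
     * the device state is stable once it asks for it.
     */
    #include <stdatomic.h>
    #include <stdbool.h>

    void wait_for_inflight_requests(void); /* hypothetical helper */

    static atomic_bool suspended;

    static void handle_suspend(void)
    {
        /* Stop pulling new requests from the virt queues... */
        atomic_store(&suspended, true);
        /* ...and drain requests already in flight before replying. */
        wait_for_inflight_requests();
    }

    static void process_queue_kick(void)
    {
        if (atomic_load(&suspended)) {
            return; /* kicks are ignored while suspended */
        }
        /* ... normal virtqueue processing ... */
    }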
On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
>
> 1. As a signal to the back-end when virt queues are no longer to be processed, so that it is clear that it will not do that when asked for migration state.
>
> 2. Stateful devices that support SET_STATUS receive a status of 0 when the VM is stopped, which supposedly resets the internal state. While suspended, device state is frozen, so as far as I understand, SUSPEND before SET_STATUS would have the status change be deferred until RESUME.

I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the device would be reset right away and it would either remain suspended or be resumed as part of reset :).

Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.

> I don’t want to hang myself up on 2 because it doesn’t really seem important to this series, but: Why does a status of 0 reset the internal state? [...] Because RESET_DEVICE directly contradicts SUSPEND’s description, I would like to think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.

The vhost-user protocol didn't have the concept of the VIRTIO Device Status Field until SET_STATUS was added.

In order to participate in the VIRTIO device lifecycle to some extent, the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific messages like RESET_DEVICE.

At the VIRTIO level, devices are reset by setting the Device Status Field to 0. All state is lost and the Device Initialization process must be followed to make the device operational again.

Existing vhost-user backends don't implement SET_STATUS 0 (it's new).

It's messy and not your fault. I think QEMU should solve this by treating stateful devices differently from non-stateful devices. That way existing vhost-user backends continue to work and new stateful devices can also be supported.

> Is it that a status 0 won’t explicitly reset the internal state, but because it does mean that the driver is unbound, the state should implicitly be reset?

I think the fundamental problem is that transports like virtio-pci put registers back in their initialization state upon reset, so internal state is lost.

The VIRTIO spec does not go into detail on device state across reset though, so I don't think much more can be said about the semantics.

> Anyway. 1 seems to be the relevant point for migration. As far as I understand, currently, a vhost-user back-end has no way of knowing when to stop processing virt queues. [...] It just doesn’t say how the front-end lets the back-end know that the rings are (to be) stopped. So this seems like a pre-existing problem for stateless migration. Unless this is communicated precisely by setting the device status to 0?

No, my understanding is different. The vhost-user spec says the backend must "stop [the] ring upon receiving ``VHOST_USER_GET_VRING_BASE``". The "Ring states" section goes into more detail and adds the concept of enabled/disabled too.

SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only applies to a single virtqueue, whereas SUSPEND acts upon the entire device, including non-virtqueue aspects like Configuration Change Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).

You can approximate SUSPEND today by sending GET_VRING_BASE for all virtqueues. I think in practice this does fully stop the device even if the spec doesn't require it.

If we want minimal changes to vhost-user, then we could rely on GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And SET_STATUS 0 must not be sent so that the device's state is not lost.

However, this approach means this effort needs to be redone when it's time to add stateful device support to vDPA and the QEMU vhost code will become more complex. I think it's better to agree on a proper model that works for both vhost-user and vhost-vdpa now to avoid more hacks/special cases.

> Naturally, what I want to know most of all is whether you believe I can get away without SUSPEND/RESUME for now. [...]
>
> Did I at least roughly get the gist?

One part we haven't discussed much: I'm not sure how much trouble you'll face due to the fact that QEMU assumes vhost devices can be reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we should keep a copy of the state in-memory just so it can be restored in vhost_dev_start(). I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?

Stefan
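To illustrate the "approximate SUSPEND" idea above, a hypothetical front-end helper might look like this. vhost_dev_suspend() does not exist in QEMU at this point; the field and callback names follow QEMU's vhost code, but the helper itself is only a sketch:

    /*
     * Sketch: stop every ring with GET_VRING_BASE and deliberately skip
     * SET_STATUS 0 so the back-end keeps its internal state.  Assumes
     * QEMU's vhost headers (struct vhost_dev, struct vhost_vring_state).
     */
    static int vhost_dev_suspend(struct vhost_dev *dev)
    {
        for (int i = 0; i < dev->nvqs; i++) {
            struct vhost_vring_state state = {
                .index = dev->vq_index + i,
            };
            /* GET_VRING_BASE stops the ring and returns its avail index */
            int r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
            if (r < 0) {
                return r;
            }
            /* state.num now holds the index to restart the ring from */
        }
        /* No SET_STATUS 0 here: the device's state must survive */
        return 0;
    }

There would be no direct vhost_dev_resume() equivalent in this scheme: as noted later in the thread, back-ends are expected to restart rings when the next kick arrives.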
On 04.05.23 23:14, Stefan Hajnoczi wrote:
> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the device would be reset right away and it would either remain suspended or be resumed as part of reset :).
>
> Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.

Ah, OK. So I guess it’s up to the implementation to decide whether the virtio device status counts as part of the “configuration” that “[it] must not change”.

[...]

> At the VIRTIO level, devices are reset by setting the Device Status Field to 0.

(I didn’t find this in the virtio specification until today; turns out it’s under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device Status Field”, where I was looking.)

> All state is lost and the Device Initialization process must be followed to make the device operational again.
>
> Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
>
> It's messy and not your fault. I think QEMU should solve this by treating stateful devices differently from non-stateful devices. That way existing vhost-user backends continue to work and new stateful devices can also be supported.

It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for stateful devices. In a previous email, you wrote that these should implement SUSPEND+RESUME so qemu can use those instead. But those are separate things, so I assume we just use SET_STATUS 0 when stopping the VM because this happens to also stop processing vrings as a side effect?

I.e. I understand “treating stateful devices differently” to mean that qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports it, and stateful back-ends should support it.

[...]

> No, my understanding is different. The vhost-user spec says the backend must "stop [the] ring upon receiving ``VHOST_USER_GET_VRING_BASE``".

Yes, I missed that part!

[...]

> If we want minimal changes to vhost-user, then we could rely on GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And SET_STATUS 0 must not be sent so that the device's state is not lost.

So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because we have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all vrings?

> However, this approach means this effort needs to be redone when it's time to add stateful device support to vDPA and the QEMU vhost code will become more complex. I think it's better to agree on a proper model that works for both vhost-user and vhost-vdpa now to avoid more hacks/special cases.

Agreeing is easy, actually adding SUSPEND+RESUME to the vhost-user protocol is what I’d prefer to avoid. :)

The question is whether it’s really effort if we were (in qemu) to just implement SUSPEND as a GET_VRING_BASE to all vrings for vhost-user. I don’t think there is a direct equivalent to RESUME, because the back-end is supposed to start rings automatically when it receives a kick, but that will only happen when the vCPUs run, so that should be fine.

[...]

> One part we haven't discussed much: I'm not sure how much trouble you'll face due to the fact that QEMU assumes vhost devices can be reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we should keep a copy of the state in-memory just so it can be restored in vhost_dev_start().

All I can report is that virtiofsd continues to work fine after a cancelled/failed migration.

> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?

Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

Hanna
(By the way, thanks for the explanations :))

On 05.05.23 11:03, Hanna Czenczek wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:

[...]

>> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

So, now looking at vhost_dev_stop(), one problem I can see is that depending on the back-end, different operations it does will do different things.

It tries to stop the whole device via vhost_ops->vhost_dev_start(), which for vDPA will suspend the device, but for vhost-user will reset it (if F_STATUS is there).

It disables all vrings, which doesn’t mean stopping, but may be necessary, too. (I haven’t yet really understood the use of disabled vrings; I heard that virtio-net would have a need for it.)

It then also stops all vrings, though, so that’s OK. And because this will always do GET_VRING_BASE, this is actually always the same regardless of transport.

Finally (for this purpose), it resets the device status via vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and this is what resets the device there.

So vhost-user resets the device in .vhost_dev_start, but vDPA only does so in .vhost_reset_status. It would seem better to me if vhost-user would also reset the device only in .vhost_reset_status, not in .vhost_dev_start. .vhost_dev_start seems precisely like the place to run SUSPEND/RESUME.

Another question I have (but this is basically what I wrote in my last email) is why we even call .vhost_reset_status here. If the device and/or all of the vrings are already stopped, why do we need to reset it? Naïvely, I had assumed we only really need to reset the device if the guest changes, so that a new guest driver sees a freshly initialized device.

Hanna
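For reference, the sequence described above can be outlined like this; it is a paraphrase of the discussion, not the literal QEMU vhost_dev_stop() code:

    /*
     * Outline of the vhost_dev_stop() order of operations under
     * discussion (paraphrased; assumes QEMU's vhost types).
     */
    static void vhost_dev_stop_outline(struct vhost_dev *hdev,
                                       VirtIODevice *vdev)
    {
        /* 1. "Stop" the device: vDPA suspends here, while vhost-user
         *    with F_STATUS resets the status instead. */
        hdev->vhost_ops->vhost_dev_start(hdev, false);

        /* 2. Disable all vrings (not the same as stopping them). */
        if (hdev->vhost_ops->vhost_set_vring_enable) {
            hdev->vhost_ops->vhost_set_vring_enable(hdev, false);
        }

        /* 3. Stop all vrings; this issues GET_VRING_BASE per queue and
         *    is the same for every transport. */
        for (int i = 0; i < hdev->nvqs; i++) {
            vhost_virtqueue_stop(hdev, vdev, hdev->vqs + i,
                                 hdev->vq_index + i);
        }

        /* 4. Reset the device status; implemented only by vDPA, where
         *    this is what actually resets the device. */
        if (hdev->vhost_ops->vhost_reset_status) {
            hdev->vhost_ops->vhost_reset_status(hdev);
        }
    }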
On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:

[...]

>> Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.
>
> Ah, OK. So I guess it’s up to the implementation to decide whether the virtio device status counts as part of the “configuration” that “[it] must not change”.

That's a very good point indeed. I think the easier way to think about it is that reset must be able to recover the device always, so it must take precedence over the suspension. But I think it is good to make it explicit, at least in current vDPA headers.

[...]

> (I didn’t find this in the virtio specification until today; turns out it’s under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device Status Field”, where I was looking.)

Yes, but you had a point. That section is only for the PCI transport, not a generic way to reset the device. Channel I/O uses an explicit CCW_CMD_VDEV_RESET command for reset, more similar to VHOST_USER_RESET_DEVICE.

[...]

> I.e. I understand “treating stateful devices differently” to mean that qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports it, and stateful back-ends should support it.

Honestly I cannot think of any use case where the vhost-user backend did not ignore set_status(0) and had to retrieve vq states. So maybe we can totally remove that call from qemu?

[...]

> All I can report is that virtiofsd continues to work fine after a cancelled/failed migration.

Isn't the device reset after a failed migration? At least net devices are reset before sending VMState. If it cannot be applied at the destination, the device is already reset...

> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

... unless we perform these changes of course :).

Thanks!
On 05.05.23 11:53, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> Honestly I cannot think of any use case where the vhost-user backend did not ignore set_status(0) and had to retrieve vq states. So maybe we can totally remove that call from qemu?

I don’t know so I can’t really say; but I don’t quite understand why qemu would reset a device at any point but perhaps VM reset (and even then I’d expect the post-reset guest to just reset the device on boot by itself, too).

[...]

> Isn't the device reset after a failed migration? At least net devices are reset before sending VMState. If it cannot be applied at the destination, the device is already reset...

It doesn’t look like the Rust crate virtiofsd uses for vhost-user supports either F_STATUS or F_RESET_DEVICE, so I think this just doesn’t affect virtiofsd.

Hanna
On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> So vhost-user resets the device in .vhost_dev_start, but vDPA only does so in .vhost_reset_status. It would seem better to me if vhost-user would also reset the device only in .vhost_reset_status, not in .vhost_dev_start. .vhost_dev_start seems precisely like the place to run SUSPEND/RESUME.

I think the same. I just saw it's been proposed at [1].

> Another question I have (but this is basically what I wrote in my last email) is why we even call .vhost_reset_status here. If the device and/or all of the vrings are already stopped, why do we need to reset it? Naïvely, I had assumed we only really need to reset the device if the guest changes, so that a new guest driver sees a freshly initialized device.

I don't know why we didn't need to call it :). I'm assuming the previous vhost-user net did fine resetting vq indexes, using VHOST_USER_SET_VRING_BASE. But I don't know about more complex devices.

The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?

Thanks!

[1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/
On 05.05.23 16:26, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?

Honestly a good question because virtiofsd implements neither SET_STATUS nor RESET_DEVICE. I’ll have to investigate that.

I think when the guest resets the device, SET_VRING_BASE always comes along some way or another, so that’s how the vrings are reset. Maybe the internal state is reset only following more high-level FUSE commands like INIT.

Hanna
On 05.05.23 16:37, Hanna Czenczek wrote:
> On 05.05.23 16:26, Eugenio Perez Martin wrote:

[...]

>> The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?
>
> Honestly a good question because virtiofsd implements neither SET_STATUS nor RESET_DEVICE. I’ll have to investigate that.
>
> I think when the guest resets the device, SET_VRING_BASE always comes along some way or another, so that’s how the vrings are reset. Maybe the internal state is reset only following more high-level FUSE commands like INIT.

So a meeting and one session of looking-into-the-code later:

We reset every virt queue on GET_VRING_BASE, which is wrong, but happens to serve the purpose. (German is currently on that.)

In our meeting, German said the reset would occur when the memory regions are changed, but I can’t see that in the code. I think it only happens implicitly through the SET_VRING_BASE call, which resets the internal avail/used pointers.

[This doesn’t seem different from libvhost-user, though, which implements neither SET_STATUS nor RESET_DEVICE, and which pretends to reset the device on RESET_OWNER, but really doesn’t (its vu_reset_device_exec() function just disables all vrings, doesn’t reset or even stop them).]

Consequently, the internal state is never reset. It would be cleared on a FUSE Destroy message, but if you just force-reset the system, the state remains into the next reboot. Not even FUSE Init clears it, which seems weird. It happens to work because it’s still the same filesystem, so the existing state fits, but it kind of seems dangerous to keep e.g. files open. I don’t think it’s really exploitable because everything still goes through the guest kernel, but, well. We should clear the state on Init, and probably also implement SET_STATUS and clear the state there.

Hanna
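A sketch of what "clear the state on Init, and also on SET_STATUS" could look like in a stateful back-end; nothing here exists in virtiofsd, and all names are made up for illustration:

    /*
     * Hypothetical state clearing for a stateful vhost-user-fs back-end,
     * on both SET_STATUS 0 and a fresh FUSE INIT.
     */
    #include <stdint.h>

    struct fs_backend {
        uint8_t device_status;
        /* ... inode table, open file handles, ... */
    };

    void close_all_open_handles(struct fs_backend *b); /* hypothetical */
    void reset_inode_table(struct fs_backend *b);      /* hypothetical */

    static void clear_internal_state(struct fs_backend *b)
    {
        /* Drop everything held on behalf of the previous guest driver */
        close_all_open_handles(b);
        reset_inode_table(b);
    }

    static void handle_set_status(struct fs_backend *b, uint8_t status)
    {
        if (status == 0) {
            /* Device reset: per VIRTIO, all internal state is lost */
            clear_internal_state(b);
        }
        b->device_status = status;
    }

    static void handle_fuse_init(struct fs_backend *b)
    {
        /* A new INIT means a new (or rebooted) guest driver; stale
         * state must not leak into it. */
        clear_internal_state(b);
        /* ... continue with normal INIT feature negotiation ... */
    }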
On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > On 05.05.23 16:37, Hanna Czenczek wrote: > > On 05.05.23 16:26, Eugenio Perez Martin wrote: > >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> > >> wrote: > >>> (By the way, thanks for the explanations :)) > >>> > >>> On 05.05.23 11:03, Hanna Czenczek wrote: > >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: > >>> [...] > >>> > >>>>> I think it's better to change QEMU's vhost code > >>>>> to leave stateful devices suspended (but not reset) across > >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > >>>>> this aspect? > >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s > >>>> meant by suspending instead of resetting when the VM is stopped. > >>> So, now looking at vhost_dev_stop(), one problem I can see is that > >>> depending on the back-end, different operations it does will do > >>> different things. > >>> > >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), > >>> which for vDPA will suspend the device, but for vhost-user will > >>> reset it > >>> (if F_STATUS is there). > >>> > >>> It disables all vrings, which doesn’t mean stopping, but may be > >>> necessary, too. (I haven’t yet really understood the use of disabled > >>> vrings, I heard that virtio-net would have a need for it.) > >>> > >>> It then also stops all vrings, though, so that’s OK. And because this > >>> will always do GET_VRING_BASE, this is actually always the same > >>> regardless of transport. > >>> > >>> Finally (for this purpose), it resets the device status via > >>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > >>> this is what resets the device there. > >>> > >>> > >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does > >>> so in .vhost_reset_status. It would seem better to me if vhost-user > >>> would also reset the device only in .vhost_reset_status, not in > >>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to > >>> run SUSPEND/RESUME. > >>> > >> I think the same. I just saw It's been proposed at [1]. > >> > >>> Another question I have (but this is basically what I wrote in my last > >>> email) is why we even call .vhost_reset_status here. If the device > >>> and/or all of the vrings are already stopped, why do we need to reset > >>> it? Naïvely, I had assumed we only really need to reset the device if > >>> the guest changes, so that a new guest driver sees a freshly > >>> initialized > >>> device. > >>> > >> I don't know why we didn't need to call it :). I'm assuming the > >> previous vhost-user net did fine resetting vq indexes, using > >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex > >> devices. > >> > >> The guest can reset the device, or write 0 to the PCI config status, > >> at any time. How does virtiofs handle it, being stateful? > > > > Honestly a good question because virtiofsd implements neither > > SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. > > > > I think when the guest resets the device, SET_VRING_BASE always comes > > along some way or another, so that’s how the vrings are reset. Maybe > > the internal state is reset only following more high-level FUSE > > commands like INIT. > > So a meeting and one session of looking-into-the-code later: > > We reset every virt queue on GET_VRING_BASE, which is wrong, but happens > to serve the purpose. 
(German is currently on that.) > > In our meeting, German said the reset would occur when the memory > regions are changed, but I can’t see that in the code. That would imply that the status is reset when the guest's memory is added or removed? > I think it only > happens implicitly through the SET_VRING_BASE call, which resets the > internal avail/used pointers. > > [This doesn’t seem different from libvhost-user, though, which > implements neither SET_STATUS nor RESET_DEVICE, and which pretends to > reset the device on RESET_OWNER, but really doesn’t (its > vu_reset_device_exec() function just disables all vrings, doesn’t reset > or even stop them).] > > Consequently, the internal state is never reset. It would be cleared on > a FUSE Destroy message, but if you just force-reset the system, the > state remains into the next reboot. Not even FUSE Init clears it, which > seems weird. It happens to work because it’s still the same filesystem, > so the existing state fits, but it kind of seems dangerous to keep e.g. > files open. I don’t think it’s really exploitable because everything > still goes through the guest kernel, but, well. We should clear the > state on Init, and probably also implement SET_STATUS and clear the > state there. > I see. That's in line with assuming GET_VRING_BASE is the last message received from qemu. Thanks!
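For context, the implicit “reset” via SET_VRING_BASE discussed here amounts to no more than rewinding the ring indices. A reduced sketch (illustrative field names, not libvhost-user’s actual structures):

    #include <stdint.h>

    struct vring_state {
        uint16_t last_avail_idx;  /* next descriptor index to fetch */
        uint16_t last_used_idx;   /* next used-ring entry to publish */
    };

    /* VHOST_USER_SET_VRING_BASE only rewinds the ring positions; any
     * device-internal state (open files, FUSE session state, ...) is
     * left untouched, which is why this is not a real device reset. */
    static void handle_set_vring_base(struct vring_state *vq, uint16_t base)
    {
        vq->last_avail_idx = base;
        vq->last_used_idx  = base;
    }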
On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > On 05.05.23 16:37, Hanna Czenczek wrote: > > > On 05.05.23 16:26, Eugenio Perez Martin wrote: > > >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> > > >> wrote: > > >>> (By the way, thanks for the explanations :)) > > >>> > > >>> On 05.05.23 11:03, Hanna Czenczek wrote: > > >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: > > >>> [...] > > >>> > > >>>>> I think it's better to change QEMU's vhost code > > >>>>> to leave stateful devices suspended (but not reset) across > > >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > >>>>> this aspect? > > >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > >>>> meant by suspending instead of resetting when the VM is stopped. > > >>> So, now looking at vhost_dev_stop(), one problem I can see is that > > >>> depending on the back-end, different operations it does will do > > >>> different things. > > >>> > > >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > >>> which for vDPA will suspend the device, but for vhost-user will > > >>> reset it > > >>> (if F_STATUS is there). > > >>> > > >>> It disables all vrings, which doesn’t mean stopping, but may be > > >>> necessary, too. (I haven’t yet really understood the use of disabled > > >>> vrings, I heard that virtio-net would have a need for it.) > > >>> > > >>> It then also stops all vrings, though, so that’s OK. And because this > > >>> will always do GET_VRING_BASE, this is actually always the same > > >>> regardless of transport. > > >>> > > >>> Finally (for this purpose), it resets the device status via > > >>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > >>> this is what resets the device there. > > >>> > > >>> > > >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > >>> so in .vhost_reset_status. It would seem better to me if vhost-user > > >>> would also reset the device only in .vhost_reset_status, not in > > >>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > >>> run SUSPEND/RESUME. > > >>> > > >> I think the same. I just saw It's been proposed at [1]. > > >> > > >>> Another question I have (but this is basically what I wrote in my last > > >>> email) is why we even call .vhost_reset_status here. If the device > > >>> and/or all of the vrings are already stopped, why do we need to reset > > >>> it? Naïvely, I had assumed we only really need to reset the device if > > >>> the guest changes, so that a new guest driver sees a freshly > > >>> initialized > > >>> device. > > >>> > > >> I don't know why we didn't need to call it :). I'm assuming the > > >> previous vhost-user net did fine resetting vq indexes, using > > >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex > > >> devices. > > >> > > >> The guest can reset the device, or write 0 to the PCI config status, > > >> at any time. How does virtiofs handle it, being stateful? > > > > > > Honestly a good question because virtiofsd implements neither > > > SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. > > > > > > I think when the guest resets the device, SET_VRING_BASE always comes > > > along some way or another, so that’s how the vrings are reset. 
Maybe > > > the internal state is reset only following more high-level FUSE > > > commands like INIT. > > > > So a meeting and one session of looking-into-the-code later: > > > > We reset every virt queue on GET_VRING_BASE, which is wrong, but happens > > to serve the purpose. (German is currently on that.) > > > > In our meeting, German said the reset would occur when the memory > > regions are changed, but I can’t see that in the code. > > That would imply that the status is reset when the guest's memory is > added or removed? > > > I think it only > > happens implicitly through the SET_VRING_BASE call, which resets the > > internal avail/used pointers. > > > > [This doesn’t seem different from libvhost-user, though, which > > implements neither SET_STATUS nor RESET_DEVICE, and which pretends to > > reset the device on RESET_OWNER, but really doesn’t (its > > vu_reset_device_exec() function just disables all vrings, doesn’t reset > > or even stop them).] > > > > Consequently, the internal state is never reset. It would be cleared on > > a FUSE Destroy message, but if you just force-reset the system, the > > state remains into the next reboot. Not even FUSE Init clears it, which > > seems weird. It happens to work because it’s still the same filesystem, > > so the existing state fits, but it kind of seems dangerous to keep e.g. > > files open. I don’t think it’s really exploitable because everything > > still goes through the guest kernel, but, well. We should clear the > > state on Init, and probably also implement SET_STATUS and clear the > > state there. > > > > I see. That's in the line of assuming GET_VRING_BASE is the last > message received from qemu. > Actually, does it prevent device recovery after a failure in migration? Is the same state set for the device? Thanks!
On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: > On 05.05.23 11:53, Eugenio Perez Martin wrote: > > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > [...] > > > > > All state is lost and the Device Initialization process > > > > must be followed to make the device operational again. > > > > > > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > > > > > It's messy and not your fault. I think QEMU should solve this by > > > > treating stateful devices differently from non-stateful devices. That > > > > way existing vhost-user backends continue to work and new stateful > > > > devices can also be supported. > > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > > > stateful devices. In a previous email, you wrote that these should > > > implement SUSPEND+RESUME so qemu can use those instead. But those are > > > separate things, so I assume we just use SET_STATUS 0 when stopping the > > > VM because this happens to also stop processing vrings as a side effect? > > > > > > I.e. I understand “treating stateful devices differently” to mean that > > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end > > > supports it, and stateful back-ends should support it. > > > > > Honestly I cannot think of any use case where the vhost-user backend > > did not ignore set_status(0) and had to retrieve vq states. So maybe > > we can totally remove that call from qemu? > > I don’t know so I can’t really say; but I don’t quite understand why qemu > would reset a device at any point but perhaps VM reset (and even then I’d > expect the post-reset guest to just reset the device on boot by itself, > too). DPDK stores the Device Status field value and uses it later: https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 While DPDK performs no immediate action upon SET_STATUS 0, omitting the message will change the behavior of other DPDK code like virtio_is_ready(). Changing the semantics of the vhost-user protocol in a way that's not backwards compatible is something we should avoid unless there is no other way. The fundamental problem is that QEMU's vhost code is designed to reset vhost devices because it assumes they are stateless. If an F_SUSPEND protocol feature bit is added, then it becomes possible to detect new backends and suspend/resume them rather than reset them. That's the solution that I favor because it's backwards compatible and the same model can be applied to stateful vDPA devices in the future. Stefan
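The DPDK pattern Stefan describes reduces to something like the following sketch (heavily simplified, with made-up names; the real logic lives in lib/vhost/vhost_user.c): the status byte is merely stored on SET_STATUS, but later readiness checks read it, so dropping the SET_STATUS 0 message is observable even though the message itself triggers no action.

    #include <stdbool.h>
    #include <stdint.h>

    #define STATUS_DRIVER_OK 0x04  /* VIRTIO_CONFIG_S_DRIVER_OK */

    struct virtio_net_dev {
        uint8_t status;        /* last value written via SET_STATUS */
        bool    rings_mapped;  /* all vrings set up and mapped */
    };

    /* SET_STATUS performs no immediate action... */
    static void handle_set_status(struct virtio_net_dev *dev, uint8_t status)
    {
        dev->status = status;
    }

    /* ...but other code paths consult the stored value, so omitting
     * SET_STATUS 0 changes what this returns after a VM stop. */
    static bool device_is_ready(const struct virtio_net_dev *dev)
    {
        return dev->rings_mapped && (dev->status & STATUS_DRIVER_OK);
    }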
On 08.05.23 23:10, Stefan Hajnoczi wrote: > On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: >> On 05.05.23 11:53, Eugenio Perez Martin wrote: >>> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: >>>>> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: >> [...] >> >>>>> All state is lost and the Device Initialization process >>>>> must be followed to make the device operational again. >>>>> >>>>> Existing vhost-user backends don't implement SET_STATUS 0 (it's new). >>>>> >>>>> It's messy and not your fault. I think QEMU should solve this by >>>>> treating stateful devices differently from non-stateful devices. That >>>>> way existing vhost-user backends continue to work and new stateful >>>>> devices can also be supported. >>>> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for >>>> stateful devices. In a previous email, you wrote that these should >>>> implement SUSPEND+RESUME so qemu can use those instead. But those are >>>> separate things, so I assume we just use SET_STATUS 0 when stopping the >>>> VM because this happens to also stop processing vrings as a side effect? >>>> >>>> I.e. I understand “treating stateful devices differently” to mean that >>>> qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end >>>> supports it, and stateful back-ends should support it. >>>> >>> Honestly I cannot think of any use case where the vhost-user backend >>> did not ignore set_status(0) and had to retrieve vq states. So maybe >>> we can totally remove that call from qemu? >> I don’t know so I can’t really say; but I don’t quite understand why qemu >> would reset a device at any point but perhaps VM reset (and even then I’d >> expect the post-reset guest to just reset the device on boot by itself, >> too). > DPDK stores the Device Status field value and uses it later: > https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 > > While DPDK performs no immediate action upon SET_STATUS 0, omitting the > message will change the behavior of other DPDK code like > virtio_is_ready(). > > Changing the semantics of the vhost-user protocol in a way that's not > backwards compatible is something we should avoid unless there is no > other way. Well, I have two opinions on this: First, that behavior in DPDK sounds wrong. vhost_dev_stop() is called mostly by devices that call it when set_status is called on them. But they don’t call it if status == 0, they call it if virtio_device_should_start() returns false, which is the case when the VM is stopped. So basically we set a status value on the back-end device that is not the status value that is set in qemu. If DPDK makes actual use of this status value that differs from that of the front-end in qemu, that sounds like it is probably wrong. Second, it’s entirely possible, and indeed probable, that DPDK doesn’t make “actual use of this status value”; the only use it probably has is to determine whether the device is supposed to be stopped, which is exactly what qemu has tried to confer by setting it to 0. So it’s basically two implementations that have agreed on abusing a value to emulate behavior that isn’t otherwise implemented (SUSPEND), and that works because all devices are stateless. Then, I agree, we can’t change this until it gets SUSPEND support. > The fundamental problem is that QEMU's vhost code is designed to reset > vhost devices because it assumes they are stateless.
If an F_SUSPEND > protocol feature bit is added, then it becomes possible to detect new > backends and suspend/resume them rather than reset them. > > That's the solution that I favor because it's backwards compatible and > the same model can be applied to stateful vDPA devices in the future. So basically the idea is the following: vhost_dev_stop() should just suspend the device, not reset it. For devices that don’t support SUSPEND, we still need to do something, and just calling GET_VRING_BASE on all vrings is deemed inadequate, so they are reset (SET_STATUS 0) as a work-around, assuming that stateful devices that care (i.e. implement SET_STATUS) will also implement SUSPEND to not have this “legacy reset” happen to them. Sounds good to me. (If I understood that right. :)) Hanna
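Spelled out as a sketch (hypothetical helper names, not QEMU’s actual vhost API), the stop path summarized above would be:

    #include <stdbool.h>

    struct vhost_dev;

    /* Assumed front-end helpers wrapping the corresponding vhost-user
     * messages; these are stand-ins, not QEMU's real functions. */
    bool backend_has_suspend(struct vhost_dev *dev);
    void backend_suspend(struct vhost_dev *dev);
    void backend_get_all_vring_bases(struct vhost_dev *dev); /* stops rings */
    void backend_set_status(struct vhost_dev *dev, unsigned char status);

    void vhost_dev_stop_sketch(struct vhost_dev *dev)
    {
        if (backend_has_suspend(dev)) {
            /* Stateful back-end: freeze it, keep all internal state. */
            backend_suspend(dev);
            backend_get_all_vring_bases(dev);
        } else {
            /* Legacy back-end: save the ring indices first, then apply
             * the reset work-around; stateful devices that care are
             * expected to implement SUSPEND and never take this path. */
            backend_get_all_vring_bases(dev);
            backend_set_status(dev, 0);
        }
    }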
On 08.05.23 21:31, Eugenio Perez Martin wrote: > On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >> On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: >>> On 05.05.23 16:37, Hanna Czenczek wrote: >>>> On 05.05.23 16:26, Eugenio Perez Martin wrote: >>>>> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> >>>>> wrote: >>>>>> (By the way, thanks for the explanations :)) >>>>>> >>>>>> On 05.05.23 11:03, Hanna Czenczek wrote: >>>>>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: >>>>>> [...] >>>>>> >>>>>>>> I think it's better to change QEMU's vhost code >>>>>>>> to leave stateful devices suspended (but not reset) across >>>>>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing >>>>>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about >>>>>>>> this aspect? >>>>>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s >>>>>>> meant by suspending instead of resetting when the VM is stopped. >>>>>> So, now looking at vhost_dev_stop(), one problem I can see is that >>>>>> depending on the back-end, different operations it does will do >>>>>> different things. >>>>>> >>>>>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), >>>>>> which for vDPA will suspend the device, but for vhost-user will >>>>>> reset it >>>>>> (if F_STATUS is there). >>>>>> >>>>>> It disables all vrings, which doesn’t mean stopping, but may be >>>>>> necessary, too. (I haven’t yet really understood the use of disabled >>>>>> vrings, I heard that virtio-net would have a need for it.) >>>>>> >>>>>> It then also stops all vrings, though, so that’s OK. And because this >>>>>> will always do GET_VRING_BASE, this is actually always the same >>>>>> regardless of transport. >>>>>> >>>>>> Finally (for this purpose), it resets the device status via >>>>>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and >>>>>> this is what resets the device there. >>>>>> >>>>>> >>>>>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does >>>>>> so in .vhost_reset_status. It would seem better to me if vhost-user >>>>>> would also reset the device only in .vhost_reset_status, not in >>>>>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to >>>>>> run SUSPEND/RESUME. >>>>>> >>>>> I think the same. I just saw It's been proposed at [1]. >>>>> >>>>>> Another question I have (but this is basically what I wrote in my last >>>>>> email) is why we even call .vhost_reset_status here. If the device >>>>>> and/or all of the vrings are already stopped, why do we need to reset >>>>>> it? Naïvely, I had assumed we only really need to reset the device if >>>>>> the guest changes, so that a new guest driver sees a freshly >>>>>> initialized >>>>>> device. >>>>>> >>>>> I don't know why we didn't need to call it :). I'm assuming the >>>>> previous vhost-user net did fine resetting vq indexes, using >>>>> VHOST_USER_SET_VRING_BASE. But I don't know about more complex >>>>> devices. >>>>> >>>>> The guest can reset the device, or write 0 to the PCI config status, >>>>> at any time. How does virtiofs handle it, being stateful? >>>> Honestly a good question because virtiofsd implements neither >>>> SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. >>>> >>>> I think when the guest resets the device, SET_VRING_BASE always comes >>>> along some way or another, so that’s how the vrings are reset. 
Maybe >>>> the internal state is reset only following more high-level FUSE >>>> commands like INIT. >>> So a meeting and one session of looking-into-the-code later: >>> >>> We reset every virt queue on GET_VRING_BASE, which is wrong, but happens >>> to serve the purpose. (German is currently on that.) >>> >>> In our meeting, German said the reset would occur when the memory >>> regions are changed, but I can’t see that in the code. >> That would imply that the status is reset when the guest's memory is >> added or removed? No, but it would imply that whenever the memory containing a vring is changed, or whenever a vring’s address is changed, that vring is reset. >>> I think it only >>> happens implicitly through the SET_VRING_BASE call, which resets the >>> internal avail/used pointers. >>> >>> [This doesn’t seem different from libvhost-user, though, which >>> implements neither SET_STATUS nor RESET_DEVICE, and which pretends to >>> reset the device on RESET_OWNER, but really doesn’t (its >>> vu_reset_device_exec() function just disables all vrings, doesn’t reset >>> or even stop them).] >>> >>> Consequently, the internal state is never reset. It would be cleared on >>> a FUSE Destroy message, but if you just force-reset the system, the >>> state remains into the next reboot. Not even FUSE Init clears it, which >>> seems weird. It happens to work because it’s still the same filesystem, >>> so the existing state fits, but it kind of seems dangerous to keep e.g. >>> files open. I don’t think it’s really exploitable because everything >>> still goes through the guest kernel, but, well. We should clear the >>> state on Init, and probably also implement SET_STATUS and clear the >>> state there. >>> >> I see. That's in the line of assuming GET_VRING_BASE is the last >> message received from qemu. >> > Actually, does it prevent device recovery after a failure in > migration? Is the same state set for the device? In theory no, because GET_VRING_BASE will return the current index, so it’ll be restored by SET_VRING_BASE even if the vring is reset in between. In practice yes, because the current implementation has GET_VRING_BASE reset the vring before fetching the index, so the returned index is always 0, and it can’t be restored. But this also prevents restoring the vring state after a successful migration, not just after a failed one. German has sent a pull request to fix that: https://github.com/rust-vmm/vhost/pull/154 Hanna
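The ordering bug is easy to see in reduced form (an illustrative C sketch; the actual code being fixed is the Rust vhost crate linked above):

    #include <stdint.h>

    struct vring_state {
        uint16_t last_avail_idx;
    };

    static void vring_reset(struct vring_state *vq)
    {
        vq->last_avail_idx = 0;
    }

    /* Buggy: the ring is reset before the index is read, so the reply to
     * GET_VRING_BASE is always 0 and SET_VRING_BASE cannot restore it. */
    static uint16_t get_vring_base_buggy(struct vring_state *vq)
    {
        vring_reset(vq);
        return vq->last_avail_idx;
    }

    /* Fixed: read the current index first, then reset the ring. */
    static uint16_t get_vring_base_fixed(struct vring_state *vq)
    {
        uint16_t base = vq->last_avail_idx;
        vring_reset(vq);
        return base;
    }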
On Tue, May 09, 2023 at 10:53:35AM +0200, Hanna Czenczek wrote: > On 08.05.23 23:10, Stefan Hajnoczi wrote: > > On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: > > > On 05.05.23 11:53, Eugenio Perez Martin wrote: > > > > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > > [...] > > > > > > > > > All state is lost and the Device Initialization process > > > > > > must be followed to make the device operational again. > > > > > > > > > > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > > > > > > > > > It's messy and not your fault. I think QEMU should solve this by > > > > > > treating stateful devices differently from non-stateful devices. That > > > > > > way existing vhost-user backends continue to work and new stateful > > > > > > devices can also be supported. > > > > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > > > > > stateful devices. In a previous email, you wrote that these should > > > > > implement SUSPEND+RESUME so qemu can use those instead. But those are > > > > > separate things, so I assume we just use SET_STATUS 0 when stopping the > > > > > VM because this happens to also stop processing vrings as a side effect? > > > > > > > > > > I.e. I understand “treating stateful devices differently” to mean that > > > > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end > > > > > supports it, and stateful back-ends should support it. > > > > > > > > > Honestly I cannot think of any use case where the vhost-user backend > > > > did not ignore set_status(0) and had to retrieve vq states. So maybe > > > > we can totally remove that call from qemu? > > > I don’t know so I can’t really say; but I don’t quite understand why qemu > > > would reset a device at any point but perhaps VM reset (and even then I’d > > > expect the post-reset guest to just reset the device on boot by itself, > > > too). > > DPDK stores the Device Status field value and uses it later: > > https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 > > > > While DPDK performs no immediate action upon SET_STATUS 0, omitting the > > message will change the behavior of other DPDK code like > > virtio_is_ready(). > > > > Changing the semantics of the vhost-user protocol in a way that's not > > backwards compatible is something we should avoid unless there is no > > other way. > > Well, I have two opinions on this: > > First, that in DPDK sounds wrong. vhost_dev_stop() is called mostly by > devices that call it when set_status is called on them. But they don’t call > it if status == 0, they call it if virtio_device_should_start() returns > false, which is the case when the VM is stopped. So basically we set a > status value on the back-end device that is not the status value that is set > in qemu. If DPDK makes actual use of this status value that differs from > that of the front-end in qemu, that sounds like it probably actually wrong. > > Second, it’s entirely possible and probably probable that DPDK doesn’t make > “actual use of this status value”; the only use it probably has is to > determine whether the device is supposed to be stopped, which is exactly > what qemu has tried to confer by setting it to 0. 
So it’s basically two > implementations that have agreed on abusing a value to emulate behavior that > isn’t otherwise implement (SUSPEND), and that works because all devices are > stateless. Then, I agree, we can’t change this until it gets SUSPEND > support. > > > The fundamental problem is that QEMU's vhost code is designed to reset > > vhost devices because it assumes they are stateless. If an F_SUSPEND > > protocol feature bit is added, then it becomes possible to detect new > > backends and suspend/resume them rather than reset them. > > > > That's the solution that I favor because it's backwards compatible and > > the same model can be applied to stateful vDPA devices in the future. > > So basically the idea is the following: vhost_dev_stop() should just suspend > the device, not reset it. For devices that don’t support SUSPEND, we still > need to do something, and just calling GET_VRING_BASE on all vrings is > deemed inadequate, so they are reset (SET_STATUS 0) as a work-around, > assuming that stateful devices that care (i.e. implement SET_STATUS) will > also implement SUSPEND to not have this “legacy reset” happen to them. > > Sounds good to me. (If I understood that right. :)) Yes. Stefan
On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote: > On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > (By the way, thanks for the explanations :)) > > > > On 05.05.23 11:03, Hanna Czenczek wrote: > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > [...] > > > > >> I think it's better to change QEMU's vhost code > > >> to leave stateful devices suspended (but not reset) across > > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > >> this aspect? > > > > > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > > meant by suspending instead of resetting when the VM is stopped. > > > > So, now looking at vhost_dev_stop(), one problem I can see is that > > depending on the back-end, different operations it does will do > > different things. > > > > It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > which for vDPA will suspend the device, but for vhost-user will reset it > > (if F_STATUS is there). > > > > It disables all vrings, which doesn’t mean stopping, but may be > > necessary, too. (I haven’t yet really understood the use of disabled > > vrings, I heard that virtio-net would have a need for it.) > > > > It then also stops all vrings, though, so that’s OK. And because this > > will always do GET_VRING_BASE, this is actually always the same > > regardless of transport. > > > > Finally (for this purpose), it resets the device status via > > vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > this is what resets the device there. > > > > > > So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > so in .vhost_reset_status. It would seem better to me if vhost-user > > would also reset the device only in .vhost_reset_status, not in > > .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > run SUSPEND/RESUME. > > > > I think the same. I just saw It's been proposed at [1]. > > > Another question I have (but this is basically what I wrote in my last > > email) is why we even call .vhost_reset_status here. If the device > > and/or all of the vrings are already stopped, why do we need to reset > > it? Naïvely, I had assumed we only really need to reset the device if > > the guest changes, so that a new guest driver sees a freshly initialized > > device. > > > > I don't know why we didn't need to call it :). I'm assuming the > previous vhost-user net did fine resetting vq indexes, using > VHOST_USER_SET_VRING_BASE. But I don't know about more complex > devices. It was added so DPDK can batch rx virtqueue RSS updates: commit 923b8921d210763359e96246a58658ac0db6c645 Author: Yajun Wu <yajunw@nvidia.com> Date: Mon Oct 17 14:44:52 2022 +0800 vhost-user: Support vhost_dev_start The motivation of adding vhost-user vhost_dev_start support is to improve backend configuration speed and reduce live migration VM downtime. Today VQ configuration is issued one by one. For virtio net with multi-queue support, backend needs to update RSS (Receive side scaling) on every rx queue enable. Updating RSS is time-consuming (typical time like 7ms). Implement already defined vhost status and message in the vhost specification [1]. (a) VHOST_USER_PROTOCOL_F_STATUS (b) VHOST_USER_SET_STATUS (c) VHOST_USER_GET_STATUS Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for device start and reset(0) for device stop. 
On reception of the DRIVER_OK message, backend can apply the needed setting only once (instead of incremental) and also utilize parallelism on enabling queues. This improves QEMU's live migration downtime with vhost user backend implementation by great margin, specially for the large number of VQs of 64 from 800 msec to 250 msec. [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html Signed-off-by: Yajun Wu <yajunw@nvidia.com> Acked-by: Parav Pandit <parav@nvidia.com> Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > > Thanks! > > [1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/ >
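The batching this commit message describes can be pictured with a sketch like the following (simplified, hypothetical names; not DPDK’s actual code): per-queue enables no longer trigger the expensive RSS update, which instead runs once on SET_STATUS(DRIVER_OK).

    #include <stdbool.h>
    #include <stdint.h>

    #define STATUS_DRIVER_OK 0x04  /* VIRTIO_CONFIG_S_DRIVER_OK */

    struct net_backend {
        unsigned rx_queues_enabled;
        bool     rss_applied;
    };

    /* Reconfiguring RSS is the expensive step (~7 ms per the commit
     * message), so it should run once, not once per rx queue. */
    static void apply_rss(struct net_backend *b)
    {
        b->rss_applied = true;
    }

    /* Per-queue enable no longer triggers an RSS update by itself. */
    static void handle_vring_enable(struct net_backend *b)
    {
        b->rx_queues_enabled++;
    }

    /* One SET_STATUS(DRIVER_OK) applies the whole configuration in one go. */
    static void handle_set_status(struct net_backend *b, uint8_t status)
    {
        if (status & STATUS_DRIVER_OK) {
            apply_rss(b);
        }
    }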
On Fri, May 05, 2023 at 11:03:16AM +0200, Hanna Czenczek wrote: > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > > On 11.04.23 17:05, Hanna Czenczek wrote: > > > > > > [...] > > > > > > > Hanna Czenczek (4): > > > > vhost: Re-enable vrings after setting features > > > > vhost-user: Interface for migration state transfer > > > > vhost: Add high-level state save/load functions > > > > vhost-user-fs: Implement internal migration > > > I’m trying to write v2, and my intention was to keep the code > > > conceptually largely the same, but include in the documentation change > > > thoughts and notes on how this interface is to be used in the future, > > > when e.g. vDPA “extensions” come over to vhost-user. My plan was to, > > > based on that documentation, discuss further. > > > > > > But now I’m struggling to even write that documentation because it’s not > > > clear to me what exactly the result of the discussion was, so I need to > > > stop even before that. > > > > > > So as far as I understand, we need/want SUSPEND/RESUME for two reasons: > > > 1. As a signal to the back-end when virt queues are no longer to be > > > processed, so that it is clear that it will not do that when asked for > > > migration state. > > > 2. Stateful devices that support SET_STATUS receive a status of 0 when > > > the VM is stopped, which supposedly resets the internal state. While > > > suspended, device state is frozen, so as far as I understand, SUSPEND > > > before SET_STATUS would have the status change be deferred until RESUME. > > I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the > > device would be reset right away and it would either remain suspended > > or be resumed as part of reset :). > > > > Unfortunately the concepts of SUSPEND/RESUME and the Device Status > > Field are orthogonal and there is no spec that explains how they > > interact. > > Ah, OK. So I guess it’s up to the implementation to decide whether the > virtio device status counts as part of the “configuration” that “[it] must > not change”. > > > > I don’t want to hang myself up on 2 because it doesn’t really seem > > > important to this series, but: Why does a status of 0 reset the internal > > > state? [Note: This is all virtio_reset() seems to do, set the status to > > > 0.] The vhost-user specification only points to the virtio > > > specification, which doesn’t say anything to that effect. Instead, an > > > explicit device reset is mentioned, which would be > > > VHOST_USER_RESET_DEVICE, i.e. something completely different. Because > > > RESET_DEVICE directly contradicts SUSPEND’s description, I would like to > > > think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid. > > The vhost-user protocol didn't have the concept of the VIRTIO Device > > Status Field until SET_STATUS was added. > > > > In order to participate in the VIRTIO device lifecycle to some extent, > > the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific > > messages like RESET_DEVICE. > > > > At the VIRTIO level, devices are reset by setting the Device Status > > Field to 0. > > (I didn’t find this in the virtio specification until today, turns out it’s > under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device > Status Field”, where I was looking.) > > > All state is lost and the Device Initialization process > > must be followed to make the device operational again. 
> > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > It's messy and not your fault. I think QEMU should solve this by > > treating stateful devices differently from non-stateful devices. That > > way existing vhost-user backends continue to work and new stateful > > devices can also be supported. > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > stateful devices. In a previous email, you wrote that these should > implement SUSPEND+RESUME so qemu can use those instead. But those are > separate things, so I assume we just use SET_STATUS 0 when stopping the VM > because this happens to also stop processing vrings as a side effect? SET_STATUS 0 doesn't do anything in most existing vhost-user backends and QEMU's vhost code doesn't rely on it doing anything. It was added as an optimization hint for DPDK's vhost-user-net implementation without noticing that it breaks stateful devices (see commit 923b8921d210). > > I.e. I understand “treating stateful devices differently” to mean that qemu > should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports > it, and stateful back-ends should support it. > > > > Is it that a status 0 won’t explicitly reset the internal state, but > > > because it does mean that the driver is unbound, the state should > > > implicitly be reset? > > I think the fundamental problem is that transports like virtio-pci put > > registers back in their initialization state upon reset, so internal > > state is lost. > > > > The VIRTIO spec does not go into detail on device state across reset > > though, so I don't think much more can be said about the semantics. > > > > > Anyway. 1 seems to be the relevant point for migration. As far as I > > > understand, currently, a vhost-user back-end has no way of knowing when > > > to stop processing virt queues. Basically, rings can only transition > > > from stopped to started, but not vice versa. The vhost-user > > > specification has this bit: “Once the source has finished migration, > > > rings will be stopped by the source. No further update must be done > > > before rings are restarted.” It just doesn’t say how the front-end lets > > > the back-end know that the rings are (to be) stopped. So this seems > > > like a pre-existing problem for stateless migration. Unless this is > > > communicated precisely by setting the device status to 0? > > No, my understanding is different. The vhost-user spec says the > > backend must "stop [the] ring upon receiving > > ``VHOST_USER_GET_VRING_BASE``". > > Yes, I missed that part! > > > The "Ring states" section goes into > > more detail and adds the concept of enabled/disabled too. > > > > SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only > > applies to a single virtqueue, whereas SUSPEND acts upon the entire > > device, including non-virtqueue aspects like Configuration Change > > Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG). > > > > You can approximate SUSPEND today by sending GET_VRING_BASE for all > > virtqueues. I think in practice this does fully stop the device even > > if the spec doesn't require it. > > > > If we want minimal changes to vhost-user, then we could rely on > > GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And > > SET_STATUS 0 must not be sent so that the device's state is not lost. > > So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because we > have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all > vrings? Yes. 
I prefer adding SUSPEND+RESUME to vhost-user, but if we were limited to today's vhost-user commands, then relying on GET_VRING_BASE and skipping SET_STATUS calls for vhost_dev_start/stop() would come close to achieving the behavior needed by stateful backends. Stefan
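Spelling out that approximation as a sketch (hypothetical helpers standing in for the vhost-user messages): issue GET_VRING_BASE for every virtqueue, which the spec requires to stop each ring, and never send SET_STATUS 0.

    #include <stdint.h>

    struct vhost_dev {
        unsigned nvqs;
    };

    /* Stand-in for sending VHOST_USER_GET_VRING_BASE for one virtqueue;
     * per the spec, the back-end must stop the ring before replying. */
    uint16_t backend_get_vring_base(struct vhost_dev *dev, unsigned vq);

    static void approximate_suspend(struct vhost_dev *dev, uint16_t *bases)
    {
        for (unsigned i = 0; i < dev->nvqs; i++) {
            bases[i] = backend_get_vring_base(dev, i);
        }
        /* Unlike a real SUSPEND, this does not stop non-virtqueue
         * activity such as configuration-change notifications. */
    }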
On Tue, May 9, 2023 at 5:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote: > > On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > (By the way, thanks for the explanations :)) > > > > > > On 05.05.23 11:03, Hanna Czenczek wrote: > > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > > > [...] > > > > > > >> I think it's better to change QEMU's vhost code > > > >> to leave stateful devices suspended (but not reset) across > > > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > > >> this aspect? > > > > > > > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > > > meant by suspending instead of resetting when the VM is stopped. > > > > > > So, now looking at vhost_dev_stop(), one problem I can see is that > > > depending on the back-end, different operations it does will do > > > different things. > > > > > > It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > > which for vDPA will suspend the device, but for vhost-user will reset it > > > (if F_STATUS is there). > > > > > > It disables all vrings, which doesn’t mean stopping, but may be > > > necessary, too. (I haven’t yet really understood the use of disabled > > > vrings, I heard that virtio-net would have a need for it.) > > > > > > It then also stops all vrings, though, so that’s OK. And because this > > > will always do GET_VRING_BASE, this is actually always the same > > > regardless of transport. > > > > > > Finally (for this purpose), it resets the device status via > > > vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > > this is what resets the device there. > > > > > > > > > So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > > so in .vhost_reset_status. It would seem better to me if vhost-user > > > would also reset the device only in .vhost_reset_status, not in > > > .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > > run SUSPEND/RESUME. > > > > > > > I think the same. I just saw It's been proposed at [1]. > > > > > Another question I have (but this is basically what I wrote in my last > > > email) is why we even call .vhost_reset_status here. If the device > > > and/or all of the vrings are already stopped, why do we need to reset > > > it? Naïvely, I had assumed we only really need to reset the device if > > > the guest changes, so that a new guest driver sees a freshly initialized > > > device. > > > > > > > I don't know why we didn't need to call it :). I'm assuming the > > previous vhost-user net did fine resetting vq indexes, using > > VHOST_USER_SET_VRING_BASE. But I don't know about more complex > > devices. > > It was added so DPDK can batch rx virtqueue RSS updates: > > commit 923b8921d210763359e96246a58658ac0db6c645 > Author: Yajun Wu <yajunw@nvidia.com> > Date: Mon Oct 17 14:44:52 2022 +0800 > > vhost-user: Support vhost_dev_start > > The motivation of adding vhost-user vhost_dev_start support is to > improve backend configuration speed and reduce live migration VM > downtime. > > Today VQ configuration is issued one by one. For virtio net with > multi-queue support, backend needs to update RSS (Receive side > scaling) on every rx queue enable. Updating RSS is time-consuming > (typical time like 7ms). > > Implement already defined vhost status and message in the vhost > specification [1]. 
> (a) VHOST_USER_PROTOCOL_F_STATUS > (b) VHOST_USER_SET_STATUS > (c) VHOST_USER_GET_STATUS > > Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for > device start and reset(0) for device stop. > > On reception of the DRIVER_OK message, backend can apply the needed setting > only once (instead of incremental) and also utilize parallelism on enabling > queues. > > This improves QEMU's live migration downtime with vhost user backend > implementation by great margin, specially for the large number of VQs of 64 > from 800 msec to 250 msec. > > [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html > > Signed-off-by: Yajun Wu <yajunw@nvidia.com> > Acked-by: Parav Pandit <parav@nvidia.com> > Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com> > Reviewed-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Sorry for the confusion; what I was wondering is why vhost-user devices do not need any signal to reset the device before VHOST_USER_SET_STATUS. And my guess is that it is enough to get/set the vq indexes. vhost_user_reset_device() is limited to SCSI, so it would not work for the rest of the devices. Thanks!
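Eugenio’s guess, made concrete as a sketch (hypothetical front-end helpers, not QEMU’s actual functions): for a stateless back-end the ring indices are the only state that matters, so a stop/reset/restart cycle can round-trip them through GET_VRING_BASE and SET_VRING_BASE.

    #include <stdint.h>

    struct vhost_dev;

    /* Stand-ins for the corresponding vhost-user messages. */
    uint16_t front_end_get_vring_base(struct vhost_dev *dev, unsigned vq);
    void front_end_set_status(struct vhost_dev *dev, uint8_t status);
    void front_end_set_vring_base(struct vhost_dev *dev, unsigned vq,
                                  uint16_t base);

    static void stop_reset_restart_vq(struct vhost_dev *dev, unsigned vq)
    {
        /* GET_VRING_BASE stops the ring and returns its index ... */
        uint16_t base = front_end_get_vring_base(dev, vq);

        /* ... so a stateless back-end loses nothing on reset ... */
        front_end_set_status(dev, 0);

        /* ... and the index is restored before the ring restarts. */
        front_end_set_vring_base(dev, vq, base);
    }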