Message ID: 20230411150515.14020-1-hreitz@redhat.com (mailing list archive)
Series: vhost-user-fs: Internal migration
Hi,

Is there a vhost-user.rst spec patch?

Thanks,
Stefan
On 12.04.23 23:00, Stefan Hajnoczi wrote:
> Hi,
> Is there a vhost-user.rst spec patch?

Ah, right, I forgot. Will add!

Hanna
On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:
> RFC: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
>
> Hi,
>
> Patch 2 of this series adds new vhost methods (only for vhost-user at this point) for transferring the back-end’s internal state to/from qemu during migration, so that this state can be stored in the migration stream. (This is what we call “internal migration”, because the state is internally available to qemu; this is in contrast to “external migration”, which Anton is working on, where the back-end’s state is handled by the back-end itself without involving qemu.)
>
> For this, the state is handled as a binary blob by qemu, and it is transferred over a pipe that is established via a new vhost method.
>
> Patch 3 adds two high-level helper functions to (A) fetch any vhost back-end’s internal state and store it in a migration stream (a `QEMUFile`), and (B) load such state from a migration stream and send it to a vhost back-end. These build on the low-level interface introduced in patch 2.
>
> Patch 4 then uses these functions to implement internal migration for vhost-user-fs. Note that this of course depends on support in the back-end (virtiofsd), which is not yet ready.
>
> Finally, patch 1 fixes a bug around migrating vhost-user devices: To enable/disable logging[1], the VHOST_F_LOG_ALL feature must be set/cleared, via the SET_FEATURES call. Another, technically unrelated, feature exists, VHOST_USER_F_PROTOCOL_FEATURES, which indicates support for vhost-user protocol features. Naturally, qemu wants to keep that other feature enabled, so it will set it (when possible) in every SET_FEATURES call. However, a side effect of setting VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.

I didn't get this part. Two questions:

Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the ring starts directly in the enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is initialized in a disabled state and is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.

So VHOST_USER_F_PROTOCOL_FEATURES only controls the initial state of rings; it does not disable rings.

> This causes any enabling (done at the start of migration) or disabling (done on the source after a cancelled/failed migration) of logging to make the back-end hang. Without patch 1, therefore, starting a migration will have any vhost-user back-end that supports both VHOST_F_LOG_ALL and VHOST_USER_F_PROTOCOL_FEATURES immediately hang completely, and unless execution is transferred to the destination, it will continue to hang.
>
> [1] Logging here means logging writes to guest memory pages in a dirty bitmap so that these dirty pages are flushed to the destination. qemu cannot monitor the back-end’s writes to guest memory, so the back-end has to do so itself, and log its writes in a dirty bitmap shared with qemu.
>
> Changes in v1 compared to the RFC:
> - Patch 1 added
>
> - Patch 2: Interface is different, now uses a pipe instead of shared memory (as suggested by Stefan); also, this is now a generic vhost-user interface, and not just for vhost-user-fs
>
> - Patches 3 and 4: Because this is now supposed to be a generic migration method for vhost-user back-ends, most of the migration code has been moved from vhost-user-fs.c to vhost.c so it can be shared between different back-ends. The vhost-user-fs code is now a rather thin wrapper around the common code.
>
> - Note also (as suggested by Anton) that the back-end’s migration state is now in a subsection, and that it is technically optional. “Technically” means that with this series, it is always used (unless the back-end doesn’t support migration, in which case migration is just blocked), but Anton’s series for external migration would make it optional. (I.e., the subsection would be skipped for external migration, and mandatorily included for internal migration.)
>
> Hanna Czenczek (4):
>   vhost: Re-enable vrings after setting features
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
>
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h         | 124 +++++++++++++++
>  hw/virtio/vhost-user-fs.c         | 101 +++++++++++-
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++
>  hw/virtio/vhost.c                 | 246 ++++++++++++++++++++++++++++++
>  5 files changed, 641 insertions(+), 1 deletion(-)
>
> --
> 2.39.1
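For reference, the workaround described in the cover letter for patch 1 amounts to roughly the following; this is a sketch only, with illustrative helper names rather than the actual QEMU code:

    /*
     * Rough sketch of the patch-1 workaround: after a SET_FEATURES that
     * includes VHOST_USER_F_PROTOCOL_FEATURES, re-enable the vrings that
     * some back-ends disable as a side effect.  Helper names are
     * illustrative, not the actual QEMU code.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define VHOST_USER_F_PROTOCOL_FEATURES 30

    struct vhost_dev; /* opaque here; QEMU's per-device vhost state */
    int vhost_user_set_features(struct vhost_dev *dev, uint64_t features);
    int vhost_user_set_vring_enable(struct vhost_dev *dev, int idx, bool on);

    static int set_features_and_reenable(struct vhost_dev *dev,
                                         uint64_t features, int nvqs)
    {
        /* May disable all vrings in back-ends that interpret
         * F_PROTOCOL_FEATURES that way: */
        int r = vhost_user_set_features(dev, features);
        if (r < 0) {
            return r;
        }

        if (features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) {
            /* Re-enable the vrings that SET_FEATURES may have disabled */
            for (int i = 0; i < nvqs; i++) {
                r = vhost_user_set_vring_enable(dev, i, true);
                if (r < 0) {
                    return r;
                }
            }
        }
        return 0;
    }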
On 13.04.23 18:11, Michael S. Tsirkin wrote:
> On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:

[...]

>> However, a side effect of setting VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.
>
> I didn't get this part. Two questions:
>
> Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>
> If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the ring starts directly in the enabled state.
>
> If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is initialized in a disabled state and is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
>
> So VHOST_USER_F_PROTOCOL_FEATURES only controls the initial state of rings; it does not disable rings.

Oh. Thanks. :)

That’s indeed a valid and more sensible interpretation. I know that the vhost-user-backend crate virtiofsd uses has interpreted it differently. Looking into libvhost-user and DPDK, both have decided to instead have all vrings be disabled at reset, and to enable them only when a SET_FEATURES without F_PROTOCOL_FEATURES comes in. That doesn’t sound quite literally to spec either, but it adheres to the interpretation of not disabling any rings just because F_PROTOCOL_FEATURES appears. (I thought of proposing this (“only disable vrings for a `false` -> `true` flag state transition”), but thought that’d be too complicated. Oh, well. :))

So the fix will go to the vhost-user-backend crate instead of qemu. That’s good!

Still, I will also prepare a patch to vhost-user.rst for this, because I still don’t find the specification clear on this. The thing is, nobody interprets it as “negotiating this feature will decide whether, when all rings are initialized, they will be initialized in disabled or enabled state”, which is how I think you’ve interpreted it. The problem is that “initialization” isn’t well-defined here. Even libvhost-user and DPDK initialize the rings always in disabled state, regardless of this feature, but will put them into an enabled state later on if the feature isn’t negotiated. I think this exact behavior should be precisely described in the spec, like:

    Between initialization and ``VHOST_USER_SET_FEATURES``, it is
    implementation-defined whether each ring is enabled or disabled.

    If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, each
    ring, when started, will be enabled immediately.

    If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, each ring
    will remain in the disabled state until ``VHOST_USER_SET_VRING_ENABLE``
    enables it with parameter 1.

Hanna
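For reference, the libvhost-user/DPDK behavior described above amounts to roughly the following; this is a minimal sketch, not taken from either code base:

    /*
     * Sketch of the back-end ring-state handling described above: rings
     * start out disabled regardless of features, and SET_FEATURES without
     * F_PROTOCOL_FEATURES marks them enabled (since SET_VRING_ENABLE will
     * never come).  All names here are made up for illustration.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define VHOST_USER_F_PROTOCOL_FEATURES 30
    #define MAX_QUEUES 8

    struct backend {
        uint64_t acked_features;
        bool vring_enabled[MAX_QUEUES];
    };

    /* On reset, all rings are disabled, whatever the features say. */
    static void backend_reset(struct backend *b)
    {
        b->acked_features = 0;
        for (int i = 0; i < MAX_QUEUES; i++) {
            b->vring_enabled[i] = false;
        }
    }

    static void on_set_features(struct backend *b, uint64_t features)
    {
        b->acked_features = features;

        if (!(features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
            /* No protocol features: SET_VRING_ENABLE will never arrive,
             * so rings must count as enabled from here on. */
            for (int i = 0; i < MAX_QUEUES; i++) {
                b->vring_enabled[i] = true;
            }
        }
        /* With protocol features negotiated, rings keep their current
         * state and are only flipped by explicit SET_VRING_ENABLE
         * messages; they are not force-disabled here. */
    }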
On 11.04.23 17:05, Hanna Czenczek wrote:

[...]

> Hanna Czenczek (4):
>   vhost: Re-enable vrings after setting features
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration

I’m trying to write v2, and my intention was to keep the code conceptually largely the same, but include in the documentation change thoughts and notes on how this interface is to be used in the future, when e.g. vDPA “extensions” come over to vhost-user. My plan was to, based on that documentation, discuss further.

But now I’m struggling to even write that documentation because it’s not clear to me what exactly the result of the discussion was, so I need to stop even before that.

So as far as I understand, we need/want SUSPEND/RESUME for two reasons:

1. As a signal to the back-end when virt queues are no longer to be processed, so that it is clear that it will not do that when asked for migration state.

2. Stateful devices that support SET_STATUS receive a status of 0 when the VM is stopped, which supposedly resets the internal state. While suspended, device state is frozen, so as far as I understand, SUSPEND before SET_STATUS would have the status change be deferred until RESUME.

I don’t want to hang myself up on 2 because it doesn’t really seem important to this series, but: Why does a status of 0 reset the internal state? [Note: This is all virtio_reset() seems to do, set the status to 0.] The vhost-user specification only points to the virtio specification, which doesn’t say anything to that effect. Instead, an explicit device reset is mentioned, which would be VHOST_USER_RESET_DEVICE, i.e. something completely different. Because RESET_DEVICE directly contradicts SUSPEND’s description, I would like to think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.

Is it that a status 0 won’t explicitly reset the internal state, but because it does mean that the driver is unbound, the state should implicitly be reset?

Anyway. 1 seems to be the relevant point for migration. As far as I understand, currently, a vhost-user back-end has no way of knowing when to stop processing virt queues. Basically, rings can only transition from stopped to started, but not vice versa. The vhost-user specification has this bit: “Once the source has finished migration, rings will be stopped by the source. No further update must be done before rings are restarted.” It just doesn’t say how the front-end lets the back-end know that the rings are (to be) stopped. So this seems like a pre-existing problem for stateless migration. Unless this is communicated precisely by setting the device status to 0?

Naturally, what I want to know most of all is whether you believe I can get away without SUSPEND/RESUME for now. To me, it seems like honestly not really, only when turning two blind eyes, because otherwise we can’t ensure that virtiofsd isn’t still processing pending virt queue requests when the state transfer is begun, even when the guest CPUs are already stopped. Of course, virtiofsd could stop queue processing right there and then, but… That feels like a hack that in the grand scheme of things just isn’t necessary when we could “just” introduce SUSPEND/RESUME into vhost-user for exactly this.

Beyond the SUSPEND/RESUME question, I understand everything can stay as-is for now, as the design doesn’t seem to conflict too badly with possible future extensions for other migration phases or more finely grained migration phase control between front-end and back-end.

Did I at least roughly get the gist?

Hanna
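To make point 1 concrete, a SUSPEND handler in a back-end could look like the following if such a message were added to vhost-user. This is purely illustrative; no such message exists in the protocol at this point, and all names are made up:

    /*
     * Hypothetical back-end SUSPEND handling: stop pulling requests and
     * drain in-flight ones before acknowledging, so the front-end knows
     * the device state is stable once it asks for it.
     */
    #include <stdatomic.h>
    #include <stdbool.h>

    void wait_for_inflight_requests(void); /* hypothetical helper */

    static atomic_bool suspended;

    static void handle_suspend(void)
    {
        /* Stop pulling new requests from the virt queues... */
        atomic_store(&suspended, true);
        /* ...and drain requests already in flight before replying. */
        wait_for_inflight_requests();
    }

    static void process_queue_kick(void)
    {
        if (atomic_load(&suspended)) {
            return; /* kicks are ignored while suspended */
        }
        /* ... normal virtqueue processing ... */
    }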
On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
>
> 1. As a signal to the back-end when virt queues are no longer to be processed, so that it is clear that it will not do that when asked for migration state.
>
> 2. Stateful devices that support SET_STATUS receive a status of 0 when the VM is stopped, which supposedly resets the internal state. While suspended, device state is frozen, so as far as I understand, SUSPEND before SET_STATUS would have the status change be deferred until RESUME.

I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the device would be reset right away and it would either remain suspended or be resumed as part of reset :).

Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.

> I don’t want to hang myself up on 2 because it doesn’t really seem important to this series, but: Why does a status of 0 reset the internal state? [...] Because RESET_DEVICE directly contradicts SUSPEND’s description, I would like to think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.

The vhost-user protocol didn't have the concept of the VIRTIO Device Status Field until SET_STATUS was added.

In order to participate in the VIRTIO device lifecycle to some extent, the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific messages like RESET_DEVICE.

At the VIRTIO level, devices are reset by setting the Device Status Field to 0. All state is lost and the Device Initialization process must be followed to make the device operational again.

Existing vhost-user backends don't implement SET_STATUS 0 (it's new).

It's messy and not your fault. I think QEMU should solve this by treating stateful devices differently from non-stateful devices. That way existing vhost-user backends continue to work and new stateful devices can also be supported.

> Is it that a status 0 won’t explicitly reset the internal state, but because it does mean that the driver is unbound, the state should implicitly be reset?

I think the fundamental problem is that transports like virtio-pci put registers back in their initialization state upon reset, so internal state is lost.

The VIRTIO spec does not go into detail on device state across reset though, so I don't think much more can be said about the semantics.

> Anyway. 1 seems to be the relevant point for migration. As far as I understand, currently, a vhost-user back-end has no way of knowing when to stop processing virt queues. [...] It just doesn’t say how the front-end lets the back-end know that the rings are (to be) stopped. So this seems like a pre-existing problem for stateless migration. Unless this is communicated precisely by setting the device status to 0?

No, my understanding is different. The vhost-user spec says the backend must "stop [the] ring upon receiving ``VHOST_USER_GET_VRING_BASE``". The "Ring states" section goes into more detail and adds the concept of enabled/disabled too.

SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only applies to a single virtqueue, whereas SUSPEND acts upon the entire device, including non-virtqueue aspects like Configuration Change Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).

You can approximate SUSPEND today by sending GET_VRING_BASE for all virtqueues. I think in practice this does fully stop the device even if the spec doesn't require it.

If we want minimal changes to vhost-user, then we could rely on GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And SET_STATUS 0 must not be sent so that the device's state is not lost.

However, this approach means this effort needs to be redone when it's time to add stateful device support to vDPA and the QEMU vhost code will become more complex. I think it's better to agree on a proper model that works for both vhost-user and vhost-vdpa now to avoid more hacks/special cases.

> Naturally, what I want to know most of all is whether you believe I can get away without SUSPEND/RESUME for now. [...]
>
> Did I at least roughly get the gist?

One part we haven't discussed much: I'm not sure how much trouble you'll face due to the fact that QEMU assumes vhost devices can be reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we should keep a copy of the state in-memory just so it can be restored in vhost_dev_start(). I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?

Stefan
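To illustrate the "approximate SUSPEND" idea above, a hypothetical front-end helper might look like this. vhost_dev_suspend() does not exist in QEMU at this point; the field and callback names follow QEMU's vhost code, but the helper itself is only a sketch:

    /*
     * Sketch: stop every ring with GET_VRING_BASE and deliberately skip
     * SET_STATUS 0 so the back-end keeps its internal state.  Assumes
     * QEMU's vhost headers (struct vhost_dev, struct vhost_vring_state).
     */
    static int vhost_dev_suspend(struct vhost_dev *dev)
    {
        for (int i = 0; i < dev->nvqs; i++) {
            struct vhost_vring_state state = {
                .index = dev->vq_index + i,
            };
            /* GET_VRING_BASE stops the ring and returns its avail index */
            int r = dev->vhost_ops->vhost_get_vring_base(dev, &state);
            if (r < 0) {
                return r;
            }
            /* state.num now holds the index to restart the ring from */
        }
        /* No SET_STATUS 0 here: the device's state must survive */
        return 0;
    }

There would be no direct vhost_dev_resume() equivalent in this scheme: as noted later in the thread, back-ends are expected to restart rings when the next kick arrives.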
On 04.05.23 23:14, Stefan Hajnoczi wrote:
> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the device would be reset right away and it would either remain suspended or be resumed as part of reset :).
>
> Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.

Ah, OK. So I guess it’s up to the implementation to decide whether the virtio device status counts as part of the “configuration” that “[it] must not change”.

[...]

> At the VIRTIO level, devices are reset by setting the Device Status Field to 0.

(I didn’t find this in the virtio specification until today; turns out it’s under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device Status Field”, where I was looking.)

> All state is lost and the Device Initialization process must be followed to make the device operational again.
>
> Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
>
> It's messy and not your fault. I think QEMU should solve this by treating stateful devices differently from non-stateful devices. That way existing vhost-user backends continue to work and new stateful devices can also be supported.

It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for stateful devices. In a previous email, you wrote that these should implement SUSPEND+RESUME so qemu can use those instead. But those are separate things, so I assume we just use SET_STATUS 0 when stopping the VM because this happens to also stop processing vrings as a side effect?

I.e. I understand “treating stateful devices differently” to mean that qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports it, and stateful back-ends should support it.

[...]

> No, my understanding is different. The vhost-user spec says the backend must "stop [the] ring upon receiving ``VHOST_USER_GET_VRING_BASE``".

Yes, I missed that part!

[...]

> If we want minimal changes to vhost-user, then we could rely on GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And SET_STATUS 0 must not be sent so that the device's state is not lost.

So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because we have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all vrings?

> However, this approach means this effort needs to be redone when it's time to add stateful device support to vDPA and the QEMU vhost code will become more complex. I think it's better to agree on a proper model that works for both vhost-user and vhost-vdpa now to avoid more hacks/special cases.

Agreeing is easy, actually adding SUSPEND+RESUME to the vhost-user protocol is what I’d prefer to avoid. :)

The question is whether it’s really effort if we were (in qemu) to just implement SUSPEND as a GET_VRING_BASE to all vrings for vhost-user. I don’t think there is a direct equivalent to RESUME, because the back-end is supposed to start rings automatically when it receives a kick, but that will only happen when the vCPUs run, so that should be fine.

[...]

> One part we haven't discussed much: I'm not sure how much trouble you'll face due to the fact that QEMU assumes vhost devices can be reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we should keep a copy of the state in-memory just so it can be restored in vhost_dev_start().

All I can report is that virtiofsd continues to work fine after a cancelled/failed migration.

> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?

Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

Hanna
(By the way, thanks for the explanations :))

On 05.05.23 11:03, Hanna Czenczek wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:

[...]

>> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

So, now looking at vhost_dev_stop(), one problem I can see is that depending on the back-end, different operations it does will do different things.

It tries to stop the whole device via vhost_ops->vhost_dev_start(), which for vDPA will suspend the device, but for vhost-user will reset it (if F_STATUS is there).

It disables all vrings, which doesn’t mean stopping, but may be necessary, too. (I haven’t yet really understood the use of disabled vrings; I heard that virtio-net would have a need for it.)

It then also stops all vrings, though, so that’s OK. And because this will always do GET_VRING_BASE, this is actually always the same regardless of transport.

Finally (for this purpose), it resets the device status via vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and this is what resets the device there.

So vhost-user resets the device in .vhost_dev_start, but vDPA only does so in .vhost_reset_status. It would seem better to me if vhost-user would also reset the device only in .vhost_reset_status, not in .vhost_dev_start. .vhost_dev_start seems precisely like the place to run SUSPEND/RESUME.

Another question I have (but this is basically what I wrote in my last email) is why we even call .vhost_reset_status here. If the device and/or all of the vrings are already stopped, why do we need to reset it? Naïvely, I had assumed we only really need to reset the device if the guest changes, so that a new guest driver sees a freshly initialized device.

Hanna
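For reference, the sequence described above can be outlined like this; it is a paraphrase of the discussion, not the literal QEMU vhost_dev_stop() code:

    /*
     * Outline of the vhost_dev_stop() order of operations under
     * discussion (paraphrased; assumes QEMU's vhost types).
     */
    static void vhost_dev_stop_outline(struct vhost_dev *hdev,
                                       VirtIODevice *vdev)
    {
        /* 1. "Stop" the device: vDPA suspends here, while vhost-user
         *    with F_STATUS resets the status instead. */
        hdev->vhost_ops->vhost_dev_start(hdev, false);

        /* 2. Disable all vrings (not the same as stopping them). */
        if (hdev->vhost_ops->vhost_set_vring_enable) {
            hdev->vhost_ops->vhost_set_vring_enable(hdev, false);
        }

        /* 3. Stop all vrings; this issues GET_VRING_BASE per queue and
         *    is the same for every transport. */
        for (int i = 0; i < hdev->nvqs; i++) {
            vhost_virtqueue_stop(hdev, vdev, hdev->vqs + i,
                                 hdev->vq_index + i);
        }

        /* 4. Reset the device status; implemented only by vDPA, where
         *    this is what actually resets the device. */
        if (hdev->vhost_ops->vhost_reset_status) {
            hdev->vhost_ops->vhost_reset_status(hdev);
        }
    }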
On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:

[...]

>> Unfortunately the concepts of SUSPEND/RESUME and the Device Status Field are orthogonal and there is no spec that explains how they interact.
>
> Ah, OK. So I guess it’s up to the implementation to decide whether the virtio device status counts as part of the “configuration” that “[it] must not change”.

That's a very good point indeed. I think the easier way to think about it is that reset must be able to recover the device always, so it must take precedence over the suspension. But I think it is good to make it explicit, at least in current vDPA headers.

[...]

> (I didn’t find this in the virtio specification until today; turns out it’s under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device Status Field”, where I was looking.)

Yes, but you had a point. That section is only for the PCI transport, not a generic way to reset the device. Channel I/O uses an explicit CCW_CMD_VDEV_RESET command for reset, more similar to VHOST_USER_RESET_DEVICE.

[...]

> I.e. I understand “treating stateful devices differently” to mean that qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports it, and stateful back-ends should support it.

Honestly I cannot think of any use case where the vhost-user backend did not ignore set_status(0) and had to retrieve vq states. So maybe we can totally remove that call from qemu?

[...]

> All I can report is that virtiofsd continues to work fine after a cancelled/failed migration.

Isn't the device reset after a failed migration? At least net devices are reset before sending VMState. If it cannot be applied at the destination, the device is already reset...

> I think it's better to change QEMU's vhost code to leave stateful devices suspended (but not reset) across vhost_dev_stop() -> vhost_dev_start(), maybe by introducing vhost_dev_suspend() and vhost_dev_resume(). Have you thought about this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s meant by suspending instead of resetting when the VM is stopped.

... unless we perform these changes of course :).

Thanks!
On 05.05.23 11:53, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> Honestly I cannot think of any use case where the vhost-user backend did not ignore set_status(0) and had to retrieve vq states. So maybe we can totally remove that call from qemu?

I don’t know so I can’t really say; but I don’t quite understand why qemu would reset a device at any point but perhaps VM reset (and even then I’d expect the post-reset guest to just reset the device on boot by itself, too).

[...]

> Isn't the device reset after a failed migration? At least net devices are reset before sending VMState. If it cannot be applied at the destination, the device is already reset...

It doesn’t look like the Rust crate virtiofsd uses for vhost-user supports either F_STATUS or F_RESET_DEVICE, so I think this just doesn’t affect virtiofsd.

Hanna
On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> So vhost-user resets the device in .vhost_dev_start, but vDPA only does so in .vhost_reset_status. It would seem better to me if vhost-user would also reset the device only in .vhost_reset_status, not in .vhost_dev_start. .vhost_dev_start seems precisely like the place to run SUSPEND/RESUME.

I think the same. I just saw it's been proposed at [1].

> Another question I have (but this is basically what I wrote in my last email) is why we even call .vhost_reset_status here. If the device and/or all of the vrings are already stopped, why do we need to reset it? Naïvely, I had assumed we only really need to reset the device if the guest changes, so that a new guest driver sees a freshly initialized device.

I don't know why we didn't need to call it :). I'm assuming the previous vhost-user net did fine resetting vq indexes, using VHOST_USER_SET_VRING_BASE. But I don't know about more complex devices.

The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?

Thanks!

[1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/
On 05.05.23 16:26, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

> The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?

Honestly a good question because virtiofsd implements neither SET_STATUS nor RESET_DEVICE. I’ll have to investigate that.

I think when the guest resets the device, SET_VRING_BASE always comes along some way or another, so that’s how the vrings are reset. Maybe the internal state is reset only following more high-level FUSE commands like INIT.

Hanna
On 05.05.23 16:37, Hanna Czenczek wrote:
> On 05.05.23 16:26, Eugenio Perez Martin wrote:

[...]

>> The guest can reset the device, or write 0 to the PCI config status, at any time. How does virtiofs handle it, being stateful?
>
> Honestly a good question because virtiofsd implements neither SET_STATUS nor RESET_DEVICE. I’ll have to investigate that.
>
> I think when the guest resets the device, SET_VRING_BASE always comes along some way or another, so that’s how the vrings are reset. Maybe the internal state is reset only following more high-level FUSE commands like INIT.

So a meeting and one session of looking-into-the-code later:

We reset every virt queue on GET_VRING_BASE, which is wrong, but happens to serve the purpose. (German is currently on that.)

In our meeting, German said the reset would occur when the memory regions are changed, but I can’t see that in the code. I think it only happens implicitly through the SET_VRING_BASE call, which resets the internal avail/used pointers.

[This doesn’t seem different from libvhost-user, though, which implements neither SET_STATUS nor RESET_DEVICE, and which pretends to reset the device on RESET_OWNER, but really doesn’t (its vu_reset_device_exec() function just disables all vrings, doesn’t reset or even stop them).]

Consequently, the internal state is never reset. It would be cleared on a FUSE Destroy message, but if you just force-reset the system, the state remains into the next reboot. Not even FUSE Init clears it, which seems weird. It happens to work because it’s still the same filesystem, so the existing state fits, but it kind of seems dangerous to keep e.g. files open. I don’t think it’s really exploitable because everything still goes through the guest kernel, but, well. We should clear the state on Init, and probably also implement SET_STATUS and clear the state there.

Hanna
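A sketch of what "clear the state on Init, and also on SET_STATUS" could look like in a stateful back-end; nothing here exists in virtiofsd, and all names are made up for illustration:

    /*
     * Hypothetical state clearing for a stateful vhost-user-fs back-end,
     * on both SET_STATUS 0 and a fresh FUSE INIT.
     */
    #include <stdint.h>

    struct fs_backend {
        uint8_t device_status;
        /* ... inode table, open file handles, ... */
    };

    void close_all_open_handles(struct fs_backend *b); /* hypothetical */
    void reset_inode_table(struct fs_backend *b);      /* hypothetical */

    static void clear_internal_state(struct fs_backend *b)
    {
        /* Drop everything held on behalf of the previous guest driver */
        close_all_open_handles(b);
        reset_inode_table(b);
    }

    static void handle_set_status(struct fs_backend *b, uint8_t status)
    {
        if (status == 0) {
            /* Device reset: per VIRTIO, all internal state is lost */
            clear_internal_state(b);
        }
        b->device_status = status;
    }

    static void handle_fuse_init(struct fs_backend *b)
    {
        /* A new INIT means a new (or rebooted) guest driver; stale
         * state must not leak into it. */
        clear_internal_state(b);
        /* ... continue with normal INIT feature negotiation ... */
    }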
On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > On 05.05.23 16:37, Hanna Czenczek wrote: > > On 05.05.23 16:26, Eugenio Perez Martin wrote: > >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> > >> wrote: > >>> (By the way, thanks for the explanations :)) > >>> > >>> On 05.05.23 11:03, Hanna Czenczek wrote: > >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: > >>> [...] > >>> > >>>>> I think it's better to change QEMU's vhost code > >>>>> to leave stateful devices suspended (but not reset) across > >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > >>>>> this aspect? > >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s > >>>> meant by suspending instead of resetting when the VM is stopped. > >>> So, now looking at vhost_dev_stop(), one problem I can see is that > >>> depending on the back-end, different operations it does will do > >>> different things. > >>> > >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), > >>> which for vDPA will suspend the device, but for vhost-user will > >>> reset it > >>> (if F_STATUS is there). > >>> > >>> It disables all vrings, which doesn’t mean stopping, but may be > >>> necessary, too. (I haven’t yet really understood the use of disabled > >>> vrings, I heard that virtio-net would have a need for it.) > >>> > >>> It then also stops all vrings, though, so that’s OK. And because this > >>> will always do GET_VRING_BASE, this is actually always the same > >>> regardless of transport. > >>> > >>> Finally (for this purpose), it resets the device status via > >>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > >>> this is what resets the device there. > >>> > >>> > >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does > >>> so in .vhost_reset_status. It would seem better to me if vhost-user > >>> would also reset the device only in .vhost_reset_status, not in > >>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to > >>> run SUSPEND/RESUME. > >>> > >> I think the same. I just saw It's been proposed at [1]. > >> > >>> Another question I have (but this is basically what I wrote in my last > >>> email) is why we even call .vhost_reset_status here. If the device > >>> and/or all of the vrings are already stopped, why do we need to reset > >>> it? Naïvely, I had assumed we only really need to reset the device if > >>> the guest changes, so that a new guest driver sees a freshly > >>> initialized > >>> device. > >>> > >> I don't know why we didn't need to call it :). I'm assuming the > >> previous vhost-user net did fine resetting vq indexes, using > >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex > >> devices. > >> > >> The guest can reset the device, or write 0 to the PCI config status, > >> at any time. How does virtiofs handle it, being stateful? > > > > Honestly a good question because virtiofsd implements neither > > SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. > > > > I think when the guest resets the device, SET_VRING_BASE always comes > > along some way or another, so that’s how the vrings are reset. Maybe > > the internal state is reset only following more high-level FUSE > > commands like INIT. > > So a meeting and one session of looking-into-the-code later: > > We reset every virt queue on GET_VRING_BASE, which is wrong, but happens > to serve the purpose. 
(German is currently on that.) > > In our meeting, German said the reset would occur when the memory > regions are changed, but I can’t see that in the code. That would imply that the status is reset when the guest's memory is added or removed? > I think it only > happens implicitly through the SET_VRING_BASE call, which resets the > internal avail/used pointers. > > [This doesn’t seem different from libvhost-user, though, which > implements neither SET_STATUS nor RESET_DEVICE, and which pretends to > reset the device on RESET_OWNER, but really doesn’t (its > vu_reset_device_exec() function just disables all vrings, doesn’t reset > or even stop them).] > > Consequently, the internal state is never reset. It would be cleared on > a FUSE Destroy message, but if you just force-reset the system, the > state remains into the next reboot. Not even FUSE Init clears it, which > seems weird. It happens to work because it’s still the same filesystem, > so the existing state fits, but it kind of seems dangerous to keep e.g. > files open. I don’t think it’s really exploitable because everything > still goes through the guest kernel, but, well. We should clear the > state on Init, and probably also implement SET_STATUS and clear the > state there. > I see. That's in line with assuming GET_VRING_BASE is the last message received from qemu. Thanks!
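For context, the implicit “reset” via SET_VRING_BASE discussed here amounts to no more than rewinding the ring indices. A reduced sketch (illustrative field names, not libvhost-user’s actual structures):

    #include <stdint.h>

    struct vring_state {
        uint16_t last_avail_idx;  /* next descriptor index to fetch */
        uint16_t last_used_idx;   /* next used-ring entry to publish */
    };

    /* VHOST_USER_SET_VRING_BASE only rewinds the ring positions; any
     * device-internal state (open files, FUSE session state, ...) is
     * left untouched, which is why this is not a real device reset. */
    static void handle_set_vring_base(struct vring_state *vq, uint16_t base)
    {
        vq->last_avail_idx = base;
        vq->last_used_idx  = base;
    }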
On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > On 05.05.23 16:37, Hanna Czenczek wrote: > > > On 05.05.23 16:26, Eugenio Perez Martin wrote: > > >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> > > >> wrote: > > >>> (By the way, thanks for the explanations :)) > > >>> > > >>> On 05.05.23 11:03, Hanna Czenczek wrote: > > >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: > > >>> [...] > > >>> > > >>>>> I think it's better to change QEMU's vhost code > > >>>>> to leave stateful devices suspended (but not reset) across > > >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > >>>>> this aspect? > > >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > >>>> meant by suspending instead of resetting when the VM is stopped. > > >>> So, now looking at vhost_dev_stop(), one problem I can see is that > > >>> depending on the back-end, different operations it does will do > > >>> different things. > > >>> > > >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > >>> which for vDPA will suspend the device, but for vhost-user will > > >>> reset it > > >>> (if F_STATUS is there). > > >>> > > >>> It disables all vrings, which doesn’t mean stopping, but may be > > >>> necessary, too. (I haven’t yet really understood the use of disabled > > >>> vrings, I heard that virtio-net would have a need for it.) > > >>> > > >>> It then also stops all vrings, though, so that’s OK. And because this > > >>> will always do GET_VRING_BASE, this is actually always the same > > >>> regardless of transport. > > >>> > > >>> Finally (for this purpose), it resets the device status via > > >>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > >>> this is what resets the device there. > > >>> > > >>> > > >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > >>> so in .vhost_reset_status. It would seem better to me if vhost-user > > >>> would also reset the device only in .vhost_reset_status, not in > > >>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > >>> run SUSPEND/RESUME. > > >>> > > >> I think the same. I just saw It's been proposed at [1]. > > >> > > >>> Another question I have (but this is basically what I wrote in my last > > >>> email) is why we even call .vhost_reset_status here. If the device > > >>> and/or all of the vrings are already stopped, why do we need to reset > > >>> it? Naïvely, I had assumed we only really need to reset the device if > > >>> the guest changes, so that a new guest driver sees a freshly > > >>> initialized > > >>> device. > > >>> > > >> I don't know why we didn't need to call it :). I'm assuming the > > >> previous vhost-user net did fine resetting vq indexes, using > > >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex > > >> devices. > > >> > > >> The guest can reset the device, or write 0 to the PCI config status, > > >> at any time. How does virtiofs handle it, being stateful? > > > > > > Honestly a good question because virtiofsd implements neither > > > SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. > > > > > > I think when the guest resets the device, SET_VRING_BASE always comes > > > along some way or another, so that’s how the vrings are reset. 
Maybe > > > the internal state is reset only following more high-level FUSE > > > commands like INIT. > > > > So a meeting and one session of looking-into-the-code later: > > > > We reset every virt queue on GET_VRING_BASE, which is wrong, but happens > > to serve the purpose. (German is currently on that.) > > > > In our meeting, German said the reset would occur when the memory > > regions are changed, but I can’t see that in the code. > > That would imply that the status is reset when the guest's memory is > added or removed? > > > I think it only > > happens implicitly through the SET_VRING_BASE call, which resets the > > internal avail/used pointers. > > > > [This doesn’t seem different from libvhost-user, though, which > > implements neither SET_STATUS nor RESET_DEVICE, and which pretends to > > reset the device on RESET_OWNER, but really doesn’t (its > > vu_reset_device_exec() function just disables all vrings, doesn’t reset > > or even stop them).] > > > > Consequently, the internal state is never reset. It would be cleared on > > a FUSE Destroy message, but if you just force-reset the system, the > > state remains into the next reboot. Not even FUSE Init clears it, which > > seems weird. It happens to work because it’s still the same filesystem, > > so the existing state fits, but it kind of seems dangerous to keep e.g. > > files open. I don’t think it’s really exploitable because everything > > still goes through the guest kernel, but, well. We should clear the > > state on Init, and probably also implement SET_STATUS and clear the > > state there. > > > > I see. That's in the line of assuming GET_VRING_BASE is the last > message received from qemu. > Actually, does it prevent device recovery after a failure in migration? Is the same state set for the device? Thanks!
On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: > On 05.05.23 11:53, Eugenio Perez Martin wrote: > > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > [...] > > > > > All state is lost and the Device Initialization process > > > > must be followed to make the device operational again. > > > > > > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > > > > > It's messy and not your fault. I think QEMU should solve this by > > > > treating stateful devices differently from non-stateful devices. That > > > > way existing vhost-user backends continue to work and new stateful > > > > devices can also be supported. > > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > > > stateful devices. In a previous email, you wrote that these should > > > implement SUSPEND+RESUME so qemu can use those instead. But those are > > > separate things, so I assume we just use SET_STATUS 0 when stopping the > > > VM because this happens to also stop processing vrings as a side effect? > > > > > > I.e. I understand “treating stateful devices differently” to mean that > > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end > > > supports it, and stateful back-ends should support it. > > > > > Honestly I cannot think of any use case where the vhost-user backend > > did not ignore set_status(0) and had to retrieve vq states. So maybe > > we can totally remove that call from qemu? > > I don’t know so I can’t really say; but I don’t quite understand why qemu > would reset a device at any point but perhaps VM reset (and even then I’d > expect the post-reset guest to just reset the device on boot by itself, > too). DPDK stores the Device Status field value and uses it later: https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 While DPDK performs no immediate action upon SET_STATUS 0, omitting the message will change the behavior of other DPDK code like virtio_is_ready(). Changing the semantics of the vhost-user protocol in a way that's not backwards compatible is something we should avoid unless there is no other way. The fundamental problem is that QEMU's vhost code is designed to reset vhost devices because it assumes they are stateless. If an F_SUSPEND protocol feature bit is added, then it becomes possible to detect new backends and suspend/resume them rather than reset them. That's the solution that I favor because it's backwards compatible and the same model can be applied to stateful vDPA devices in the future. Stefan
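The DPDK pattern Stefan describes reduces to something like the following sketch (heavily simplified, with made-up names; the real logic lives in lib/vhost/vhost_user.c): the status byte is merely stored on SET_STATUS, but later readiness checks read it, so dropping the SET_STATUS 0 message is observable even though the message itself triggers no action.

    #include <stdbool.h>
    #include <stdint.h>

    #define STATUS_DRIVER_OK 0x04  /* VIRTIO_CONFIG_S_DRIVER_OK */

    struct virtio_net_dev {
        uint8_t status;        /* last value written via SET_STATUS */
        bool    rings_mapped;  /* all vrings set up and mapped */
    };

    /* SET_STATUS performs no immediate action... */
    static void handle_set_status(struct virtio_net_dev *dev, uint8_t status)
    {
        dev->status = status;
    }

    /* ...but other code paths consult the stored value, so omitting
     * SET_STATUS 0 changes what this returns after a VM stop. */
    static bool device_is_ready(const struct virtio_net_dev *dev)
    {
        return dev->rings_mapped && (dev->status & STATUS_DRIVER_OK);
    }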
On 08.05.23 23:10, Stefan Hajnoczi wrote: > On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: >> On 05.05.23 11:53, Eugenio Perez Martin wrote: >>> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: >>>>> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: >> [...] >> >>>>> All state is lost and the Device Initialization process >>>>> must be followed to make the device operational again. >>>>> >>>>> Existing vhost-user backends don't implement SET_STATUS 0 (it's new). >>>>> >>>>> It's messy and not your fault. I think QEMU should solve this by >>>>> treating stateful devices differently from non-stateful devices. That >>>>> way existing vhost-user backends continue to work and new stateful >>>>> devices can also be supported. >>>> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for >>>> stateful devices. In a previous email, you wrote that these should >>>> implement SUSPEND+RESUME so qemu can use those instead. But those are >>>> separate things, so I assume we just use SET_STATUS 0 when stopping the >>>> VM because this happens to also stop processing vrings as a side effect? >>>> >>>> I.e. I understand “treating stateful devices differently” to mean that >>>> qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end >>>> supports it, and stateful back-ends should support it. >>>> >>> Honestly I cannot think of any use case where the vhost-user backend >>> did not ignore set_status(0) and had to retrieve vq states. So maybe >>> we can totally remove that call from qemu? >> I don’t know so I can’t really say; but I don’t quite understand why qemu >> would reset a device at any point but perhaps VM reset (and even then I’d >> expect the post-reset guest to just reset the device on boot by itself, >> too). > DPDK stores the Device Status field value and uses it later: > https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 > > While DPDK performs no immediate action upon SET_STATUS 0, omitting the > message will change the behavior of other DPDK code like > virtio_is_ready(). > > Changing the semantics of the vhost-user protocol in a way that's not > backwards compatible is something we should avoid unless there is no > other way. Well, I have two opinions on this: First, that behavior in DPDK sounds wrong. vhost_dev_stop() is called mostly by devices that call it when set_status is called on them. But they don’t call it if status == 0, they call it if virtio_device_should_start() returns false, which is the case when the VM is stopped. So basically we set a status value on the back-end device that is not the status value that is set in qemu. If DPDK makes actual use of this status value that differs from that of the front-end in qemu, that sounds like it is probably wrong. Second, it’s entirely possible, and indeed probable, that DPDK doesn’t make “actual use of this status value”; the only use it probably has is to determine whether the device is supposed to be stopped, which is exactly what qemu has tried to confer by setting it to 0. So it’s basically two implementations that have agreed on abusing a value to emulate behavior that isn’t otherwise implemented (SUSPEND), and that works because all devices are stateless. Then, I agree, we can’t change this until it gets SUSPEND support. > The fundamental problem is that QEMU's vhost code is designed to reset > vhost devices because it assumes they are stateless.
If an F_SUSPEND > protocol feature bit is added, then it becomes possible to detect new > backends and suspend/resume them rather than reset them. > > That's the solution that I favor because it's backwards compatible and > the same model can be applied to stateful vDPA devices in the future. So basically the idea is the following: vhost_dev_stop() should just suspend the device, not reset it. For devices that don’t support SUSPEND, we still need to do something, and just calling GET_VRING_BASE on all vrings is deemed inadequate, so they are reset (SET_STATUS 0) as a work-around, assuming that stateful devices that care (i.e. implement SET_STATUS) will also implement SUSPEND to not have this “legacy reset” happen to them. Sounds good to me. (If I understood that right. :)) Hanna
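Spelled out as a sketch (hypothetical helper names, not QEMU’s actual vhost API), the stop path summarized above would be:

    #include <stdbool.h>

    struct vhost_dev;

    /* Assumed front-end helpers wrapping the corresponding vhost-user
     * messages; these are stand-ins, not QEMU's real functions. */
    bool backend_has_suspend(struct vhost_dev *dev);
    void backend_suspend(struct vhost_dev *dev);
    void backend_get_all_vring_bases(struct vhost_dev *dev); /* stops rings */
    void backend_set_status(struct vhost_dev *dev, unsigned char status);

    void vhost_dev_stop_sketch(struct vhost_dev *dev)
    {
        if (backend_has_suspend(dev)) {
            /* Stateful back-end: freeze it, keep all internal state. */
            backend_suspend(dev);
            backend_get_all_vring_bases(dev);
        } else {
            /* Legacy back-end: save the ring indices first, then apply
             * the reset work-around; stateful devices that care are
             * expected to implement SUSPEND and never take this path. */
            backend_get_all_vring_bases(dev);
            backend_set_status(dev, 0);
        }
    }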
On 08.05.23 21:31, Eugenio Perez Martin wrote: > On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote: >> On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote: >>> On 05.05.23 16:37, Hanna Czenczek wrote: >>>> On 05.05.23 16:26, Eugenio Perez Martin wrote: >>>>> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> >>>>> wrote: >>>>>> (By the way, thanks for the explanations :)) >>>>>> >>>>>> On 05.05.23 11:03, Hanna Czenczek wrote: >>>>>>> On 04.05.23 23:14, Stefan Hajnoczi wrote: >>>>>> [...] >>>>>> >>>>>>>> I think it's better to change QEMU's vhost code >>>>>>>> to leave stateful devices suspended (but not reset) across >>>>>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing >>>>>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about >>>>>>>> this aspect? >>>>>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s >>>>>>> meant by suspending instead of resetting when the VM is stopped. >>>>>> So, now looking at vhost_dev_stop(), one problem I can see is that >>>>>> depending on the back-end, different operations it does will do >>>>>> different things. >>>>>> >>>>>> It tries to stop the whole device via vhost_ops->vhost_dev_start(), >>>>>> which for vDPA will suspend the device, but for vhost-user will >>>>>> reset it >>>>>> (if F_STATUS is there). >>>>>> >>>>>> It disables all vrings, which doesn’t mean stopping, but may be >>>>>> necessary, too. (I haven’t yet really understood the use of disabled >>>>>> vrings, I heard that virtio-net would have a need for it.) >>>>>> >>>>>> It then also stops all vrings, though, so that’s OK. And because this >>>>>> will always do GET_VRING_BASE, this is actually always the same >>>>>> regardless of transport. >>>>>> >>>>>> Finally (for this purpose), it resets the device status via >>>>>> vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and >>>>>> this is what resets the device there. >>>>>> >>>>>> >>>>>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does >>>>>> so in .vhost_reset_status. It would seem better to me if vhost-user >>>>>> would also reset the device only in .vhost_reset_status, not in >>>>>> .vhost_dev_start. .vhost_dev_start seems precisely like the place to >>>>>> run SUSPEND/RESUME. >>>>>> >>>>> I think the same. I just saw It's been proposed at [1]. >>>>> >>>>>> Another question I have (but this is basically what I wrote in my last >>>>>> email) is why we even call .vhost_reset_status here. If the device >>>>>> and/or all of the vrings are already stopped, why do we need to reset >>>>>> it? Naïvely, I had assumed we only really need to reset the device if >>>>>> the guest changes, so that a new guest driver sees a freshly >>>>>> initialized >>>>>> device. >>>>>> >>>>> I don't know why we didn't need to call it :). I'm assuming the >>>>> previous vhost-user net did fine resetting vq indexes, using >>>>> VHOST_USER_SET_VRING_BASE. But I don't know about more complex >>>>> devices. >>>>> >>>>> The guest can reset the device, or write 0 to the PCI config status, >>>>> at any time. How does virtiofs handle it, being stateful? >>>> Honestly a good question because virtiofsd implements neither >>>> SET_STATUS nor RESET_DEVICE. I’ll have to investigate that. >>>> >>>> I think when the guest resets the device, SET_VRING_BASE always comes >>>> along some way or another, so that’s how the vrings are reset. 
Maybe >>>> the internal state is reset only following more high-level FUSE >>>> commands like INIT. >>> So a meeting and one session of looking-into-the-code later: >>> >>> We reset every virt queue on GET_VRING_BASE, which is wrong, but happens >>> to serve the purpose. (German is currently on that.) >>> >>> In our meeting, German said the reset would occur when the memory >>> regions are changed, but I can’t see that in the code. >> That would imply that the status is reset when the guest's memory is >> added or removed? No, but it would imply that whenever the memory containing a vring is changed, or whenever a vring’s address is changed, that vring is reset. >>> I think it only >>> happens implicitly through the SET_VRING_BASE call, which resets the >>> internal avail/used pointers. >>> >>> [This doesn’t seem different from libvhost-user, though, which >>> implements neither SET_STATUS nor RESET_DEVICE, and which pretends to >>> reset the device on RESET_OWNER, but really doesn’t (its >>> vu_reset_device_exec() function just disables all vrings, doesn’t reset >>> or even stop them).] >>> >>> Consequently, the internal state is never reset. It would be cleared on >>> a FUSE Destroy message, but if you just force-reset the system, the >>> state remains into the next reboot. Not even FUSE Init clears it, which >>> seems weird. It happens to work because it’s still the same filesystem, >>> so the existing state fits, but it kind of seems dangerous to keep e.g. >>> files open. I don’t think it’s really exploitable because everything >>> still goes through the guest kernel, but, well. We should clear the >>> state on Init, and probably also implement SET_STATUS and clear the >>> state there. >>> >> I see. That's in the line of assuming GET_VRING_BASE is the last >> message received from qemu. >> > Actually, does it prevent device recovery after a failure in > migration? Is the same state set for the device? In theory no, because GET_VRING_BASE will return the current index, so it’ll be restored by SET_VRING_BASE even if the vring is reset in between. In practice yes, because the current implementation has GET_VRING_BASE reset the vring before fetching the index, so the returned index is always 0, and it can’t be restored. But this also prevents restoring the vring state after a successful migration, not just after a failed one. German has sent a pull request to fix that: https://github.com/rust-vmm/vhost/pull/154 Hanna
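The ordering bug is easy to see in reduced form (an illustrative C sketch; the actual code being fixed is the Rust vhost crate linked above):

    #include <stdint.h>

    struct vring_state {
        uint16_t last_avail_idx;
    };

    static void vring_reset(struct vring_state *vq)
    {
        vq->last_avail_idx = 0;
    }

    /* Buggy: the ring is reset before the index is read, so the reply to
     * GET_VRING_BASE is always 0 and SET_VRING_BASE cannot restore it. */
    static uint16_t get_vring_base_buggy(struct vring_state *vq)
    {
        vring_reset(vq);
        return vq->last_avail_idx;
    }

    /* Fixed: read the current index first, then reset the ring. */
    static uint16_t get_vring_base_fixed(struct vring_state *vq)
    {
        uint16_t base = vq->last_avail_idx;
        vring_reset(vq);
        return base;
    }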
On Tue, May 09, 2023 at 10:53:35AM +0200, Hanna Czenczek wrote: > On 08.05.23 23:10, Stefan Hajnoczi wrote: > > On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote: > > > On 05.05.23 11:53, Eugenio Perez Martin wrote: > > > > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > > [...] > > > > > > > > > All state is lost and the Device Initialization process > > > > > > must be followed to make the device operational again. > > > > > > > > > > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > > > > > > > > > It's messy and not your fault. I think QEMU should solve this by > > > > > > treating stateful devices differently from non-stateful devices. That > > > > > > way existing vhost-user backends continue to work and new stateful > > > > > > devices can also be supported. > > > > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > > > > > stateful devices. In a previous email, you wrote that these should > > > > > implement SUSPEND+RESUME so qemu can use those instead. But those are > > > > > separate things, so I assume we just use SET_STATUS 0 when stopping the > > > > > VM because this happens to also stop processing vrings as a side effect? > > > > > > > > > > I.e. I understand “treating stateful devices differently” to mean that > > > > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end > > > > > supports it, and stateful back-ends should support it. > > > > > > > > > Honestly I cannot think of any use case where the vhost-user backend > > > > did not ignore set_status(0) and had to retrieve vq states. So maybe > > > > we can totally remove that call from qemu? > > > I don’t know so I can’t really say; but I don’t quite understand why qemu > > > would reset a device at any point but perhaps VM reset (and even then I’d > > > expect the post-reset guest to just reset the device on boot by itself, > > > too). > > DPDK stores the Device Status field value and uses it later: > > https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791 > > > > While DPDK performs no immediate action upon SET_STATUS 0, omitting the > > message will change the behavior of other DPDK code like > > virtio_is_ready(). > > > > Changing the semantics of the vhost-user protocol in a way that's not > > backwards compatible is something we should avoid unless there is no > > other way. > > Well, I have two opinions on this: > > First, that in DPDK sounds wrong. vhost_dev_stop() is called mostly by > devices that call it when set_status is called on them. But they don’t call > it if status == 0, they call it if virtio_device_should_start() returns > false, which is the case when the VM is stopped. So basically we set a > status value on the back-end device that is not the status value that is set > in qemu. If DPDK makes actual use of this status value that differs from > that of the front-end in qemu, that sounds like it probably actually wrong. > > Second, it’s entirely possible and probably probable that DPDK doesn’t make > “actual use of this status value”; the only use it probably has is to > determine whether the device is supposed to be stopped, which is exactly > what qemu has tried to confer by setting it to 0. 
So it’s basically two > implementations that have agreed on abusing a value to emulate behavior that > isn’t otherwise implement (SUSPEND), and that works because all devices are > stateless. Then, I agree, we can’t change this until it gets SUSPEND > support. > > > The fundamental problem is that QEMU's vhost code is designed to reset > > vhost devices because it assumes they are stateless. If an F_SUSPEND > > protocol feature bit is added, then it becomes possible to detect new > > backends and suspend/resume them rather than reset them. > > > > That's the solution that I favor because it's backwards compatible and > > the same model can be applied to stateful vDPA devices in the future. > > So basically the idea is the following: vhost_dev_stop() should just suspend > the device, not reset it. For devices that don’t support SUSPEND, we still > need to do something, and just calling GET_VRING_BASE on all vrings is > deemed inadequate, so they are reset (SET_STATUS 0) as a work-around, > assuming that stateful devices that care (i.e. implement SET_STATUS) will > also implement SUSPEND to not have this “legacy reset” happen to them. > > Sounds good to me. (If I understood that right. :)) Yes. Stefan
On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote: > On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > (By the way, thanks for the explanations :)) > > > > On 05.05.23 11:03, Hanna Czenczek wrote: > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > [...] > > > > >> I think it's better to change QEMU's vhost code > > >> to leave stateful devices suspended (but not reset) across > > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > >> this aspect? > > > > > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > > meant by suspending instead of resetting when the VM is stopped. > > > > So, now looking at vhost_dev_stop(), one problem I can see is that > > depending on the back-end, different operations it does will do > > different things. > > > > It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > which for vDPA will suspend the device, but for vhost-user will reset it > > (if F_STATUS is there). > > > > It disables all vrings, which doesn’t mean stopping, but may be > > necessary, too. (I haven’t yet really understood the use of disabled > > vrings, I heard that virtio-net would have a need for it.) > > > > It then also stops all vrings, though, so that’s OK. And because this > > will always do GET_VRING_BASE, this is actually always the same > > regardless of transport. > > > > Finally (for this purpose), it resets the device status via > > vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > this is what resets the device there. > > > > > > So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > so in .vhost_reset_status. It would seem better to me if vhost-user > > would also reset the device only in .vhost_reset_status, not in > > .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > run SUSPEND/RESUME. > > > > I think the same. I just saw It's been proposed at [1]. > > > Another question I have (but this is basically what I wrote in my last > > email) is why we even call .vhost_reset_status here. If the device > > and/or all of the vrings are already stopped, why do we need to reset > > it? Naïvely, I had assumed we only really need to reset the device if > > the guest changes, so that a new guest driver sees a freshly initialized > > device. > > > > I don't know why we didn't need to call it :). I'm assuming the > previous vhost-user net did fine resetting vq indexes, using > VHOST_USER_SET_VRING_BASE. But I don't know about more complex > devices. It was added so DPDK can batch rx virtqueue RSS updates: commit 923b8921d210763359e96246a58658ac0db6c645 Author: Yajun Wu <yajunw@nvidia.com> Date: Mon Oct 17 14:44:52 2022 +0800 vhost-user: Support vhost_dev_start The motivation of adding vhost-user vhost_dev_start support is to improve backend configuration speed and reduce live migration VM downtime. Today VQ configuration is issued one by one. For virtio net with multi-queue support, backend needs to update RSS (Receive side scaling) on every rx queue enable. Updating RSS is time-consuming (typical time like 7ms). Implement already defined vhost status and message in the vhost specification [1]. (a) VHOST_USER_PROTOCOL_F_STATUS (b) VHOST_USER_SET_STATUS (c) VHOST_USER_GET_STATUS Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for device start and reset(0) for device stop. 
On reception of the DRIVER_OK message, backend can apply the needed setting only once (instead of incremental) and also utilize parallelism on enabling queues. This improves QEMU's live migration downtime with vhost user backend implementation by great margin, specially for the large number of VQs of 64 from 800 msec to 250 msec. [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html Signed-off-by: Yajun Wu <yajunw@nvidia.com> Acked-by: Parav Pandit <parav@nvidia.com> Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > > Thanks! > > [1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/ >
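The batching this commit message describes can be pictured with a sketch like the following (simplified, hypothetical names; not DPDK’s actual code): per-queue enables no longer trigger the expensive RSS update, which instead runs once on SET_STATUS(DRIVER_OK).

    #include <stdbool.h>
    #include <stdint.h>

    #define STATUS_DRIVER_OK 0x04  /* VIRTIO_CONFIG_S_DRIVER_OK */

    struct net_backend {
        unsigned rx_queues_enabled;
        bool     rss_applied;
    };

    /* Reconfiguring RSS is the expensive step (~7 ms per the commit
     * message), so it should run once, not once per rx queue. */
    static void apply_rss(struct net_backend *b)
    {
        b->rss_applied = true;
    }

    /* Per-queue enable no longer triggers an RSS update by itself. */
    static void handle_vring_enable(struct net_backend *b)
    {
        b->rx_queues_enabled++;
    }

    /* One SET_STATUS(DRIVER_OK) applies the whole configuration in one go. */
    static void handle_set_status(struct net_backend *b, uint8_t status)
    {
        if (status & STATUS_DRIVER_OK) {
            apply_rss(b);
        }
    }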
On Fri, May 05, 2023 at 11:03:16AM +0200, Hanna Czenczek wrote: > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote: > > > On 11.04.23 17:05, Hanna Czenczek wrote: > > > > > > [...] > > > > > > > Hanna Czenczek (4): > > > > vhost: Re-enable vrings after setting features > > > > vhost-user: Interface for migration state transfer > > > > vhost: Add high-level state save/load functions > > > > vhost-user-fs: Implement internal migration > > > I’m trying to write v2, and my intention was to keep the code > > > conceptually largely the same, but include in the documentation change > > > thoughts and notes on how this interface is to be used in the future, > > > when e.g. vDPA “extensions” come over to vhost-user. My plan was to, > > > based on that documentation, discuss further. > > > > > > But now I’m struggling to even write that documentation because it’s not > > > clear to me what exactly the result of the discussion was, so I need to > > > stop even before that. > > > > > > So as far as I understand, we need/want SUSPEND/RESUME for two reasons: > > > 1. As a signal to the back-end when virt queues are no longer to be > > > processed, so that it is clear that it will not do that when asked for > > > migration state. > > > 2. Stateful devices that support SET_STATUS receive a status of 0 when > > > the VM is stopped, which supposedly resets the internal state. While > > > suspended, device state is frozen, so as far as I understand, SUSPEND > > > before SET_STATUS would have the status change be deferred until RESUME. > > I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the > > device would be reset right away and it would either remain suspended > > or be resumed as part of reset :). > > > > Unfortunately the concepts of SUSPEND/RESUME and the Device Status > > Field are orthogonal and there is no spec that explains how they > > interact. > > Ah, OK. So I guess it’s up to the implementation to decide whether the > virtio device status counts as part of the “configuration” that “[it] must > not change”. > > > > I don’t want to hang myself up on 2 because it doesn’t really seem > > > important to this series, but: Why does a status of 0 reset the internal > > > state? [Note: This is all virtio_reset() seems to do, set the status to > > > 0.] The vhost-user specification only points to the virtio > > > specification, which doesn’t say anything to that effect. Instead, an > > > explicit device reset is mentioned, which would be > > > VHOST_USER_RESET_DEVICE, i.e. something completely different. Because > > > RESET_DEVICE directly contradicts SUSPEND’s description, I would like to > > > think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid. > > The vhost-user protocol didn't have the concept of the VIRTIO Device > > Status Field until SET_STATUS was added. > > > > In order to participate in the VIRTIO device lifecycle to some extent, > > the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific > > messages like RESET_DEVICE. > > > > At the VIRTIO level, devices are reset by setting the Device Status > > Field to 0. > > (I didn’t find this in the virtio specification until today, turns out it’s > under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device > Status Field”, where I was looking.) > > > All state is lost and the Device Initialization process > > must be followed to make the device operational again. 
> > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new). > > > > It's messy and not your fault. I think QEMU should solve this by > > treating stateful devices differently from non-stateful devices. That > > way existing vhost-user backends continue to work and new stateful > > devices can also be supported. > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for > stateful devices. In a previous email, you wrote that these should > implement SUSPEND+RESUME so qemu can use those instead. But those are > separate things, so I assume we just use SET_STATUS 0 when stopping the VM > because this happens to also stop processing vrings as a side effect? SET_STATUS 0 doesn't do anything in most existing vhost-user backends and QEMU's vhost code doesn't rely on it doing anything. It was added as an optimization hint for DPDK's vhost-user-net implementation without noticing that it breaks stateful devices (see commit 923b8921d210). > > I.e. I understand “treating stateful devices differently” to mean that qemu > should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports > it, and stateful back-ends should support it. > > > > Is it that a status 0 won’t explicitly reset the internal state, but > > > because it does mean that the driver is unbound, the state should > > > implicitly be reset? > > I think the fundamental problem is that transports like virtio-pci put > > registers back in their initialization state upon reset, so internal > > state is lost. > > > > The VIRTIO spec does not go into detail on device state across reset > > though, so I don't think much more can be said about the semantics. > > > > > Anyway. 1 seems to be the relevant point for migration. As far as I > > > understand, currently, a vhost-user back-end has no way of knowing when > > > to stop processing virt queues. Basically, rings can only transition > > > from stopped to started, but not vice versa. The vhost-user > > > specification has this bit: “Once the source has finished migration, > > > rings will be stopped by the source. No further update must be done > > > before rings are restarted.” It just doesn’t say how the front-end lets > > > the back-end know that the rings are (to be) stopped. So this seems > > > like a pre-existing problem for stateless migration. Unless this is > > > communicated precisely by setting the device status to 0? > > No, my understanding is different. The vhost-user spec says the > > backend must "stop [the] ring upon receiving > > ``VHOST_USER_GET_VRING_BASE``". > > Yes, I missed that part! > > > The "Ring states" section goes into > > more detail and adds the concept of enabled/disabled too. > > > > SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only > > applies to a single virtqueue, whereas SUSPEND acts upon the entire > > device, including non-virtqueue aspects like Configuration Change > > Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG). > > > > You can approximate SUSPEND today by sending GET_VRING_BASE for all > > virtqueues. I think in practice this does fully stop the device even > > if the spec doesn't require it. > > > > If we want minimal changes to vhost-user, then we could rely on > > GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And > > SET_STATUS 0 must not be sent so that the device's state is not lost. > > So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because we > have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all > vrings? Yes. 
I prefer adding SUSPEND+RESUME to vhost-user, but if we were limited to today's vhost-user commands, then relying on GET_VRING_BASE and skipping SET_STATUS calls for vhost_dev_start/stop() would come close to achieving the behavior needed by stateful backends. Stefan
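Spelling out that approximation as a sketch (hypothetical helpers standing in for the vhost-user messages): issue GET_VRING_BASE for every virtqueue, which the spec requires to stop each ring, and never send SET_STATUS 0.

    #include <stdint.h>

    struct vhost_dev {
        unsigned nvqs;
    };

    /* Stand-in for sending VHOST_USER_GET_VRING_BASE for one virtqueue;
     * per the spec, the back-end must stop the ring before replying. */
    uint16_t backend_get_vring_base(struct vhost_dev *dev, unsigned vq);

    static void approximate_suspend(struct vhost_dev *dev, uint16_t *bases)
    {
        for (unsigned i = 0; i < dev->nvqs; i++) {
            bases[i] = backend_get_vring_base(dev, i);
        }
        /* Unlike a real SUSPEND, this does not stop non-virtqueue
         * activity such as configuration-change notifications. */
    }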
On Tue, May 9, 2023 at 5:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote: > > On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > (By the way, thanks for the explanations :)) > > > > > > On 05.05.23 11:03, Hanna Czenczek wrote: > > > > On 04.05.23 23:14, Stefan Hajnoczi wrote: > > > > > > [...] > > > > > > >> I think it's better to change QEMU's vhost code > > > >> to leave stateful devices suspended (but not reset) across > > > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing > > > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about > > > >> this aspect? > > > > > > > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s > > > > meant by suspending instead of resetting when the VM is stopped. > > > > > > So, now looking at vhost_dev_stop(), one problem I can see is that > > > depending on the back-end, different operations it does will do > > > different things. > > > > > > It tries to stop the whole device via vhost_ops->vhost_dev_start(), > > > which for vDPA will suspend the device, but for vhost-user will reset it > > > (if F_STATUS is there). > > > > > > It disables all vrings, which doesn’t mean stopping, but may be > > > necessary, too. (I haven’t yet really understood the use of disabled > > > vrings, I heard that virtio-net would have a need for it.) > > > > > > It then also stops all vrings, though, so that’s OK. And because this > > > will always do GET_VRING_BASE, this is actually always the same > > > regardless of transport. > > > > > > Finally (for this purpose), it resets the device status via > > > vhost_ops->vhost_reset_status(). This is only implemented on vDPA, and > > > this is what resets the device there. > > > > > > > > > So vhost-user resets the device in .vhost_dev_start, but vDPA only does > > > so in .vhost_reset_status. It would seem better to me if vhost-user > > > would also reset the device only in .vhost_reset_status, not in > > > .vhost_dev_start. .vhost_dev_start seems precisely like the place to > > > run SUSPEND/RESUME. > > > > > > > I think the same. I just saw It's been proposed at [1]. > > > > > Another question I have (but this is basically what I wrote in my last > > > email) is why we even call .vhost_reset_status here. If the device > > > and/or all of the vrings are already stopped, why do we need to reset > > > it? Naïvely, I had assumed we only really need to reset the device if > > > the guest changes, so that a new guest driver sees a freshly initialized > > > device. > > > > > > > I don't know why we didn't need to call it :). I'm assuming the > > previous vhost-user net did fine resetting vq indexes, using > > VHOST_USER_SET_VRING_BASE. But I don't know about more complex > > devices. > > It was added so DPDK can batch rx virtqueue RSS updates: > > commit 923b8921d210763359e96246a58658ac0db6c645 > Author: Yajun Wu <yajunw@nvidia.com> > Date: Mon Oct 17 14:44:52 2022 +0800 > > vhost-user: Support vhost_dev_start > > The motivation of adding vhost-user vhost_dev_start support is to > improve backend configuration speed and reduce live migration VM > downtime. > > Today VQ configuration is issued one by one. For virtio net with > multi-queue support, backend needs to update RSS (Receive side > scaling) on every rx queue enable. Updating RSS is time-consuming > (typical time like 7ms). > > Implement already defined vhost status and message in the vhost > specification [1]. 
> (a) VHOST_USER_PROTOCOL_F_STATUS > (b) VHOST_USER_SET_STATUS > (c) VHOST_USER_GET_STATUS > > Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for > device start and reset(0) for device stop. > > On reception of the DRIVER_OK message, backend can apply the needed setting > only once (instead of incremental) and also utilize parallelism on enabling > queues. > > This improves QEMU's live migration downtime with vhost user backend > implementation by great margin, specially for the large number of VQs of 64 > from 800 msec to 250 msec. > > [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html > > Signed-off-by: Yajun Wu <yajunw@nvidia.com> > Acked-by: Parav Pandit <parav@nvidia.com> > Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com> > Reviewed-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Sorry for the confusion; what I was wondering is why vhost-user devices do not need any signal to reset the device before VHOST_USER_SET_STATUS. And my guess is that it is enough to get/set the vq indexes. vhost_user_reset_device() is limited to SCSI, so it would not work for the rest of the devices. Thanks!
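Eugenio’s guess, made concrete as a sketch (hypothetical front-end helpers, not QEMU’s actual functions): for a stateless back-end the ring indices are the only state that matters, so a stop/reset/restart cycle can round-trip them through GET_VRING_BASE and SET_VRING_BASE.

    #include <stdint.h>

    struct vhost_dev;

    /* Stand-ins for the corresponding vhost-user messages. */
    uint16_t front_end_get_vring_base(struct vhost_dev *dev, unsigned vq);
    void front_end_set_status(struct vhost_dev *dev, uint8_t status);
    void front_end_set_vring_base(struct vhost_dev *dev, unsigned vq,
                                  uint16_t base);

    static void stop_reset_restart_vq(struct vhost_dev *dev, unsigned vq)
    {
        /* GET_VRING_BASE stops the ring and returns its index ... */
        uint16_t base = front_end_get_vring_base(dev, vq);

        /* ... so a stateless back-end loses nothing on reset ... */
        front_end_set_status(dev, 0);

        /* ... and the index is restored before the ring restarts. */
        front_end_set_vring_base(dev, vq, base);
    }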