Message ID: 163909282574.728533.7460416142511440919.stgit@omen (mailing list archive)
State: New, archived
Series: [RFC] vfio: Update/Clarify migration uAPI, add NDMA state
On Thu, Dec 09, 2021 at 04:34:29PM -0700, Alex Williamson wrote:
> A new NDMA state is being proposed to support a quiescent state for > contexts containing multiple devices with peer-to-peer DMA support. > Formally define it. > > Clarify various aspects of the migration region data fields and > protocol. Remove QEMU related terminology and flows from the uAPI; > these will be provided in Documentation/ so as not to confuse the > device_state bitfield with a finite state machine with restricted > state transitions. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- > 1 file changed, 214 insertions(+), 191 deletions(-)

I need other people to look this over, so these are just some quick remarks. Thanks for doing it, it is very good.

Given I'm almost on vacation till Jan I think we will shortly have to table this discussion to January.

But, if you are happy with this as all that is needed to do mlx5 we can possibly have the v6 updated in early January, after the next merge window.

Though let's try to quickly decide on what to do about the "change multiple bits" below, please.

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index ef33ea002b0b..1fdbc928f886 100644 > +++ b/include/uapi/linux/vfio.h > @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid { > #define VFIO_REGION_SUBTYPE_MIGRATION (1) > > /* > + * The structure vfio_device_migration_info is placed at the immediate start of > + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device > + * state and migration information for the device. Field accesses for this > + * structure are only supported using their native width and alignment, > + * accesses otherwise are undefined and the kernel migration driver should > + * return an error. > * > * device_state: (read/write) > + * The device_state field is a bitfield written by the user to transition > + * the associated device between valid migration states using these rules: > + * - The user may read or write the device state register at any time. > + * - The kernel migration driver must fully transition the device to the > + * new state value before the write(2) operation returns to the user. > + * - The user may change multiple bits of the bitfield in the same write > + * operation, so long as the resulting state is valid.

I would like to forbid this. It is really too complicated, and if there is not a strongly defined device behavior when this is done it is not inter-operable for userspace to do it.

> + * - The kernel migration driver must not generate asynchronous device > + * state transitions outside of manipulation by the user or the > + * VFIO_DEVICE_RESET ioctl as described below. > + * - In the event of a device state transition failure, the kernel > + * migration driver must return a write(2) error with appropriate errno > + * to the user. > + * - Upon such an error, re-reading the device_state field must indicate > + * the device migration state as either the same state as prior to the > + * failing write or, at the migration driver discretion, indicate the > + * device state as VFIO_DEVICE_STATE_ERROR.

It is because this is a complete nightmare. Let's say the user goes from 0 -> SAVING | RUNNING and SAVING fails after we succeed to do RUNNING. We have to also backtrack and undo RUNNING, but what if that fails too?
Oh and we have to keep track of all this backtracking while executing the new state and write a bunch of complicated never tested error handling code. Assuming we can even figure out what the precedence of multiple bits even means for interoperability. Backed into this is an assumption that any device transition is fully atomic - that just isn't what I see any of the HW has done. I thought we could tackled this when you first suggested it (eg copy the mlx5 logic and be OK), but now I'm very skeptical. The idea that every driver can do this right in all the corner cases doesn't seem reasonable given we've made so many errors here already just in mlx5. > + * - Bit 1 (SAVING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data I would use the word clear instead of initialize - the point of this is to throw away any data that may be left over in the window from any prior actions. > + * window and associated fields within vfio_device_migration_info for > + * capturing the migration data stream for the device. The migration > + * driver may perform actions such as enabling dirty logging of device > + * state with this bit. The SAVING bit is mutually exclusive with the > + * RESUMING bit defined below. > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > + * data window and indicates the completion or termination of the > + * migration data stream for the device. I don't know what "de-initialized" means as something a device should do? IMHO there is no need to talk about the migration window here, SAVING says initialize/clear - and data_offset/etc say their values are undefined outside SAVING/RUNNING. That is enough. > + * - Bit 2 (RESUMING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data > + * window and associated fields within vfio_device_migration_info for > + * restoring the device from a migration data stream captured from a > + * SAVING session with a compatible device. The migration driver may > + * perform internal device resets as necessary to reinitialize the > + * internal device state for the incoming migration data. > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > + * region data window and indicates the end of a resuming session for > + * the device. The kernel migration driver should complete the > + * incorporation of data written to the migration data window into the > + * device internal state and perform final validity and consistency > + * checking of the new device state. If the user provided data is > + * found to be incomplete, inconsistent, or otherwise invalid, the > + * migration driver must indicate a write(2) error and follow the > + * previously described protocol to return either the previous state > + * or an error state. Prefer this is just 'go to an error state' to avoid unnecessary implementation differences. > + * - Bit 3 (NDMA) [OPTIONAL]: > + * The NDMA or "No DMA" state is intended to be a quiescent state for > + * the device for the purposes of managing multiple devices within a > + * user context where peer-to-peer DMA between devices may be active. > + * Support for the NDMA bit is indicated through the presence of the > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > + * region. > + * - Setting this bit must prevent the device from initiating any > + * new DMA or interrupt transactions. The migration driver must I'm not sure about interrupts. 
> + * complete any such outstanding operations prior to completing > + * the transition to the NDMA state. The NDMA device_state

Reading this as you wrote it and I suddenly have a doubt about the PRI use case. Is it reasonable that the kernel driver will block on NDMA waiting for another userspace thread to resolve any outstanding PRIs?

Can that allow userspace to deadlock the kernel or device? Is there an alternative?

> + * All combinations for the above defined device_state bits are considered > + * valid with the following exceptions: > + * - RESUMING and SAVING are mutually exclusive, all combinations of > + * (RESUMING | SAVING) are invalid. Furthermore the specific combination > + * (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the > + * device error state VFIO_DEVICE_STATE_ERROR. This variant is > + * specifically chosen due to the !RUNNING state of the device as the > + * migration driver should do everything possible, including an internal > + * reset of the device, to ensure that the device is fully stopped in > + * this state.

Prefer we don't specify this. ERROR is undefined behavior and userspace should reset. Any path that leads along to ERROR already includes possibilities for wild DMAs and what not, so there is nothing to be gained by this other than causing a lot of driver complexity, IMHO.

> + * Migration drivers should attempt to support any transition between valid

should? must, I think.

The whole migration window definition seems quite straightforward now!

> + * a) The user reads pending_bytes. If the read value is zero, no data is > + * currently available for the device. If the device is !RUNNING and a > + * zero value is read, this indicates the end of the device migration > + * stream and the device must not generate any new migration data. If > + * the read value is non-zero, the user may proceed to collect device > + * migration data in step b). Repeated reads of pending_bytes is allowed > + * and must not compromise the migration data stream provided the user does > + * not proceed to the following step.

Add what to do in SAVING|RUNNING if pending bytes is 0?

> #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> - ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> - VFIO_DEVICE_STATE_RESUMING)
> + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR)

We should delete this macro. It only makes sense when used in a driver and does not belong in the uapi header.

Thanks,
Jason
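For reference, the save protocol quoted above (read pending_bytes, then data_offset/data_size, then the data window) maps onto roughly the following userspace loop. This is an illustrative sketch against the v1 migration uAPI only: the helper name is made up, error handling is abbreviated, and a real client may mmap parts of the data window rather than use pread().

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

/* One SAVING pass; mig_off is the migration region's offset in the device fd. */
static int save_device_data(int device_fd, off_t mig_off, int out_fd)
{
	uint64_t pending, data_offset, data_size;
	char buf[4096];

	for (;;) {
		/* a) read pending_bytes; zero means no data currently available */
		if (pread(device_fd, &pending, sizeof(pending), mig_off +
			  offsetof(struct vfio_device_migration_info,
				   pending_bytes)) != sizeof(pending))
			return -1;
		if (!pending)
			return 0;

		/* b) read data_offset and data_size for this chunk */
		if (pread(device_fd, &data_offset, sizeof(data_offset), mig_off +
			  offsetof(struct vfio_device_migration_info,
				   data_offset)) != sizeof(data_offset) ||
		    pread(device_fd, &data_size, sizeof(data_size), mig_off +
			  offsetof(struct vfio_device_migration_info,
				   data_size)) != sizeof(data_size))
			return -1;

		/* c) copy data_size bytes from the data window into the stream */
		for (uint64_t done = 0; done < data_size; ) {
			size_t len = data_size - done > sizeof(buf) ?
				     sizeof(buf) : data_size - done;
			ssize_t n = pread(device_fd, buf, len,
					  mig_off + data_offset + done);

			if (n <= 0 || write(out_fd, buf, n) != n)
				return -1;
			done += n;
		}
		/* d) loop back to a); re-reading pending_bytes ends this chunk */
	}
}

Whether the caller stops when pending_bytes reads zero or keeps polling depends on whether RUNNING is still set, which is exactly the ambiguity raised above.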
On Thu, 9 Dec 2021 21:25:29 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Dec 09, 2021 at 04:34:29PM -0700, Alex Williamson wrote: > > A new NDMA state is being proposed to support a quiescent state for > > contexts containing multiple devices with peer-to-peer DMA support. > > Formally define it. > > > > Clarify various aspects of the migration region data fields and > > protocol. Remove QEMU related terminology and flows from the uAPI; > > these will be provided in Documentation/ so as not to confuse the > > device_state bitfield with a finite state machine with restricted > > state transitions. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- > > 1 file changed, 214 insertions(+), 191 deletions(-) > > I need other peope to look this over, so these are just some quick > remarks. Thanks for doing it, it is very good. > > Given I'm almost on vacation till Jan I think we will shortly have to > table this discussion to January. > > But, if you are happy with this as all that is needed to do mlx5 we > can possibly have the v6 updated in early January, after the next > merge window. > > Though lets try to quickly decide on what to do about the "change > multiple bits" below, please. > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index ef33ea002b0b..1fdbc928f886 100644 > > +++ b/include/uapi/linux/vfio.h > > @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid { > > #define VFIO_REGION_SUBTYPE_MIGRATION (1) > > > > /* > > + * The structure vfio_device_migration_info is placed at the immediate start of > > + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device > > + * state and migration information for the device. Field accesses for this > > + * structure are only supported using their native width and alignment, > > + * accesses otherwise are undefined and the kernel migration driver should > > + * return an error. > > * > > * device_state: (read/write) > > + * The device_state field is a bitfield written by the user to transition > > + * the associated device between valid migration states using these rules: > > + * - The user may read or write the device state register at any time. > > + * - The kernel migration driver must fully transition the device to the > > + * new state value before the write(2) operation returns to the user. > > + * - The user may change multiple bits of the bitfield in the same write > > + * operation, so long as the resulting state is valid. > > I would like to forbid this. It is really too complicated, and if > there is not a strongly defined device behavior when this is done it > is not inter-operable for userspace to do it. > > > + * - The kernel migration driver must not generate asynchronous device > > + * state transitions outside of manipulation by the user or the > > + * VFIO_DEVICE_RESET ioctl as described below. > > + * - In the event of a device state transition failure, the kernel > > + * migration driver must return a write(2) error with appropriate errno > > + * to the user. > > + * - Upon such an error, re-reading the device_state field must indicate > > + * the device migration state as either the same state as prior to the > > + * failing write or, at the migration driver discretion, indicate the > > + * device state as VFIO_DEVICE_STATE_ERROR. > > It is because this is complete nightmare. 
Let's say the user goes from > 0 -> SAVING | RUNNING and SAVING fails after we succeed to do > RUNNING. We have to also backtrack and undo RUNNING, but what if that > fails too? Oh and we have to keep track of all this backtracking while > executing the new state and write a bunch of complicated never tested > error handling code.

We do specify that a migration driver has discretion in using the error state for failed transitions, so there are options to simplify error handling.

If we look at bit flips, we have:

 Initial state
 |   Resuming
 |   |  Saving
 |   |  |  Running
 |   |  |  |     Next states with multiple bit flips

 a)  0, 0, 0     (d)
 b)  0, 0, 1     (c) (e)
 c)  0, 1, 0     (b) (e)
 d)  0, 1, 1     (a) (e)
 e)  1, 0, 0     (b) (c) (d)
 f)  1, 0, 1     UNSUPPORTED
 g)  1, 1, 0     ERROR
 h)  1, 1, 1     INVALID

We specify that we cannot pass through any invalid states during transition, so if I consider each bit to be a discrete operation and map all these multi-bit changes to a sequence of single bit flips, the only requirements are:

1) RESUMING must be cleared before setting SAVING or RUNNING
2) SAVING or RUNNING must be cleared before setting RESUMING

I think the basis of your priority scheme comes from that. Ordering of the remaining items is more subtle though, for instance 0 -> SAVING | RUNNING can be broken down as:

0 -> SAVING
SAVING -> SAVING | RUNNING

vs

0 -> RUNNING
RUNNING -> SAVING | RUNNING

I'd give preference to enabling logging before running and I believe that holds for transition (e) -> (d) as well.

In the reverse case, SAVING | RUNNING -> 0

SAVING | RUNNING -> RUNNING
RUNNING -> 0

vs

SAVING | RUNNING -> SAVING
SAVING -> 0

This one is more arbitrary. I tend to favor clearing RUNNING to stop the device first, largely because it creates nice symmetry in the resulting algorithm and follows the general principle that you discovered as well, converge towards zero by addressing bit clearing before setting.

So a valid algorithm would include:

int set_device_state(u32 old, u32 new, bool is_unwind)
{
	if (old.RUNNING && !new.RUNNING) {
		curr.RUNNING = 0;
		if (ERROR) goto unwind;
	}
	if (old.SAVING && !new.SAVING) {
		curr.SAVING = 0;
		if (ERROR) goto unwind;
	}
	if (old.RESUMING && !new.RESUMING) {
		curr.RESUMING = 0;
		if (ERROR) goto unwind;
	}
	if (!old.RESUMING && new.RESUMING) {
		curr.RESUMING = 1;
		if (ERROR) goto unwind;
	}
	if (!old.SAVING && new.SAVING) {
		curr.SAVING = 1;
		if (ERROR) goto unwind;
	}
	if (!old.RUNNING && new.RUNNING) {
		curr.RUNNING = 1;
		if (ERROR) goto unwind;
	}

	return 0;

unwind:
	if (!is_unwind) {
		ret = set_device_state(curr, old, true);
		if (ret) {
			curr.raw = ERROR;
			return ret;
		}
	}

	return -EIO;
}

And I think that also addresses the claim that we're doomed to untested and complicated error code handling, we unwind by simply swapping the args to our set state function and enter the ERROR state should that recursive call fail.

I think it would be reasonable for documentation to present similar pseudo code as a recommendation, but ultimately the migration driver needs to come up with something that fits all the requirements. (Ignoring NDMA for the moment until we determine if it's even a synchronous operation)

If we put it in the user's hands and prescribe only single bit flips, they don't really have device knowledge to optimize further than this like a migration driver might be able to do.

> Assuming we can even figure out what the precedence of multiple bits > even means for interoperability. > > Backed into this is an assumption that any device transition is fully > atomic - that just isn't what I see any of the HW has done.
We only specify that the transition needs to be complete before the write(2) operation returns, there's no specified atomicity versus any other event. "Synchronous" is maybe your concern? > I thought we could tackled this when you first suggested it (eg copy > the mlx5 logic and be OK), but now I'm very skeptical. The idea that > every driver can do this right in all the corner cases doesn't seem > reasonable given we've made so many errors here already just in mlx5. > > > + * - Bit 1 (SAVING) [REQUIRED]: > > + * - Setting this bit enables and initializes the migration region data > > I would use the word clear instead of initialize - the point of this > is to throw away any data that may be left over in the window from any > prior actions. "Clear" to me suggests that there's some sort of internal shared buffer implementation that needs to be wiped between different modes. I chose "initialize" because I think it offers more independence to the implementation. > > + * window and associated fields within vfio_device_migration_info for > > + * capturing the migration data stream for the device. The migration > > + * driver may perform actions such as enabling dirty logging of device > > + * state with this bit. The SAVING bit is mutually exclusive with the > > + * RESUMING bit defined below. > > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > > + * data window and indicates the completion or termination of the > > + * migration data stream for the device. > > I don't know what "de-initialized" means as something a device should > do? IMHO there is no need to talk about the migration window here, > SAVING says initialize/clear - and data_offset/etc say their values > are undefined outside SAVING/RUNNING. That is enough. If "initializing" the migration data region puts in place handlers for pending_bytes and friends, "de-initializing" would undo that operation. Perhaps I should use "deactivates"? > > + * - Bit 2 (RESUMING) [REQUIRED]: > > + * - Setting this bit enables and initializes the migration region data > > + * window and associated fields within vfio_device_migration_info for > > + * restoring the device from a migration data stream captured from a > > + * SAVING session with a compatible device. The migration driver may > > + * perform internal device resets as necessary to reinitialize the > > + * internal device state for the incoming migration data. > > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > > + * region data window and indicates the end of a resuming session for > > + * the device. The kernel migration driver should complete the > > + * incorporation of data written to the migration data window into the > > + * device internal state and perform final validity and consistency > > + * checking of the new device state. If the user provided data is > > + * found to be incomplete, inconsistent, or otherwise invalid, the > > + * migration driver must indicate a write(2) error and follow the > > + * previously described protocol to return either the previous state > > + * or an error state. > > Prefer this is just 'go to an error state' to avoid unnecessary > implementation differences. Then it becomes a special case versus other device_state changes and we're forcing what you've described as an undefined state into the protocol. Use of the error state is at the driver's discretion, but the spec is written such a driver only needs to make use of it if it encounters some sort of irrecoverable internal error. 
> > + * - Bit 3 (NDMA) [OPTIONAL]: > > + * The NDMA or "No DMA" state is intended to be a quiescent state for > > + * the device for the purposes of managing multiple devices within a > > + * user context where peer-to-peer DMA between devices may be active. > > + * Support for the NDMA bit is indicated through the presence of the > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > > + * region. > > + * - Setting this bit must prevent the device from initiating any > > + * new DMA or interrupt transactions. The migration driver must > > I'm not sure about interrupts. In the common case an interrupt is a DMA write, so the name, if not intention of this state gets a bit shaky if interrupts are allowed. > > + * complete any such outstanding operations prior to completing > > + * the transition to the NDMA state. The NDMA device_state > > Reading this as you wrote it and I suddenly have a doubt about the PRI > use case. Is it reasonable that the kernel driver will block on NDMA > waiting for another userspace thread to resolve any outstanding PRIs? > > Can that allow userspace to deadlock the kernel or device? Is there an > alterative? I'd hope we could avoid deadlock in the kernel, but it seems trickier for userspace to be waiting on a write(2) operation to the device while also handling page request events for that same device. Is this something more like a pending transaction bit where userspace asks the device to go quiescent and polls for that to occur? > > + * All combinations for the above defined device_state bits are considered > > + * valid with the following exceptions: > > + * - RESUMING and SAVING are mutually exclusive, all combinations of > > + * (RESUMING | SAVING) are invalid. Furthermore the specific combination > > + * (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the > > + * device error state VFIO_DEVICE_STATE_ERROR. This variant is > > + * specifically chosen due to the !RUNNING state of the device as the > > + * migration driver should do everything possible, including an internal > > + * reset of the device, to ensure that the device is fully stopped in > > + * this state. > > Prefer we don't specify this. ERROR is undefined behavior and > userspace should reset. Any path that leads along to ERROR already > includes possiblities for wild DMAs and what not, so there is nothing > to be gained by this other than causing a lot of driver complexity, > IMHO. This seems contrary to your push for consistent, interoperable behavior. What's the benefit to actually leaving the state undefined or the drawback to preemptively resetting a device if the migration driver cannot determine if the device is quiesced, especially when the user would need to reset the device to enter a new state anyway? I added this because language in your doc suggested the error state was far more undefined that I understood it to be, ie. !RUNNING. > > + * Migration drivers should attempt to support any transition between valid > > should? must, I think. I think that "must" terminology is a bit contrary to the fact that we have a defined error state that can be used at the discretion of the migration driver. To me, "should" tells the migration drivers that they ought to make an attempt to support all transitions, but userspace needs to be be prepared that they might not work. 
If a driver fails to implement some transitions necessary for a given application, the application should fail gracefully, but migration features may not be available for the device. > The whole migration window definition seems quite straightforward now! Great! > > + * a) The user reads pending_bytes. If the read value is zero, no data is > > + * currently available for the device. If the device is !RUNNING and a > > + * zero value is read, this indicates the end of the device migration > > + * stream and the device must not generate any new migration data. If > > + * the read value is non-zero, the user may proceed to collect device > > + * migration data in step b). Repeated reads of pending_bytes is allowed > > + * and must not compromise the migration data stream provided the user does > > + * not proceed to the following step. > > Add what to do in SAVING|RUNNING if pending bytes is 0? Maybe it's too subtle, but that's why I phrased it as "no data is currently available" and went on to specify the implications in the !RUNNING state. "Currently", suggesting that in the RUNNING state the value is essentially volatile. > > #define VFIO_DEVICE_STATE_SET_ERROR(state) \ > > - ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \ > > - VFIO_DEVICE_STATE_RESUMING) > > + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR) > > We should delete this macro. It only makes sense used in a driver does > not belong in the uapi header. I may have gotten sloppy here, I thought I was incorporating what had been proposed in the mlx5 series. Thanks, Alex
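Since the recovery path for VFIO_DEVICE_STATE_ERROR keeps coming up, a sketch of the userspace side may help. The check below uses the existing bit definitions and the VFIO_DEVICE_RESET ioctl, ignores the proposed NDMA bit, and the helper itself is hypothetical rather than anything in the patch:

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

/* If the device reports the ERROR combination, the only way out is a reset. */
static int recover_if_error(int device_fd, off_t mig_off)
{
	uint32_t state;

	if (pread(device_fd, &state, sizeof(state), mig_off +
		  offsetof(struct vfio_device_migration_info,
			   device_state)) != sizeof(state))
		return -1;

	/* ERROR is the otherwise invalid SAVING | RESUMING combination */
	if ((state & VFIO_DEVICE_STATE_MASK) !=
	    (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING))
		return 0;

	/* a successful reset returns the device to its default RUNNING state */
	return ioctl(device_fd, VFIO_DEVICE_RESET);
}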
On Mon, Dec 13 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> We do specify that a migration driver has discretion in using the error
> state for failed transitions, so there are options to simplify error
> handling.
>
> If we look at bit flips, we have:
>
>  Initial state
>  |   Resuming
>  |   |  Saving
>  |   |  |  Running
>  |   |  |  |     Next states with multiple bit flips
>
>  a)  0, 0, 0     (d)
>  b)  0, 0, 1     (c) (e)
>  c)  0, 1, 0     (b) (e)
>  d)  0, 1, 1     (a) (e)
>  e)  1, 0, 0     (b) (c) (d)
>  f)  1, 0, 1     UNSUPPORTED
>  g)  1, 1, 0     ERROR
>  h)  1, 1, 1     INVALID
>
> We specify that we cannot pass through any invalid states during
> transition, so if I consider each bit to be a discrete operation and
> map all these multi-bit changes to a sequence of single bit flips, the
> only requirements are:
>
> 1) RESUMING must be cleared before setting SAVING or RUNNING
> 2) SAVING or RUNNING must be cleared before setting RESUMING
>
> I think the basis of your priority scheme comes from that. Ordering
> of the remaining items is more subtle though, for instance
> 0 -> SAVING | RUNNING can be broken down as:
>
> 0 -> SAVING
> SAVING -> SAVING | RUNNING
>
> vs
>
> 0 -> RUNNING
> RUNNING -> SAVING | RUNNING
>
> I'd give preference to enabling logging before running and I believe
> that holds for transition (e) -> (d) as well.

Agreed.

> In the reverse case, SAVING | RUNNING -> 0
>
> SAVING | RUNNING -> RUNNING
> RUNNING -> 0
>
> vs
>
> SAVING | RUNNING -> SAVING
> SAVING -> 0
>
> This one is more arbitrary. I tend to favor clearing RUNNING to stop
> the device first, largely because it creates nice symmetry in the
> resulting algorithm and follows the general principle that you
> discovered as well, converge towards zero by addressing bit clearing
> before setting.

That also makes sense to me.

> So a valid algorithm would include:
>
> int set_device_state(u32 old, u32 new, bool is_unwind)
> {
> 	if (old.RUNNING && !new.RUNNING) {
> 		curr.RUNNING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (old.SAVING && !new.SAVING) {
> 		curr.SAVING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (old.RESUMING && !new.RESUMING) {
> 		curr.RESUMING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.RESUMING && new.RESUMING) {
> 		curr.RESUMING = 1;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.SAVING && new.SAVING) {
> 		curr.SAVING = 1;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.RUNNING && new.RUNNING) {
> 		curr.RUNNING = 1;
> 		if (ERROR) goto unwind;
> 	}
>
> 	return 0;
>
> unwind:
> 	if (!is_unwind) {
> 		ret = set_device_state(curr, old, true);
> 		if (ret) {
> 			curr.raw = ERROR;
> 			return ret;
> 		}
> 	}
>
> 	return -EIO;
> }

I've stared at this and scribbled a bit on paper and I think this would work.

> And I think that also addresses the claim that we're doomed to untested
> and complicated error code handling, we unwind by simply swapping the
> args to our set state function and enter the ERROR state should that
> recursive call fail.

Nod. As we clear RUNNING as the first thing and would only set RUNNING again as the last step, any transition to ERROR should be safe.

> I think it would be reasonable for documentation to present similar
> pseudo code as a recommendation, but ultimately the migration driver
> needs to come up with something that fits all the requirements.

Yes, something like this should go into Documentation/.
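To make the ordering concrete, this is how the clear-before-set rules in the pseudo code quoted above would decompose a few of the multi-bit transitions from the table. This trace is an illustration only, not additional spec text:

/*
 * Decomposition produced by the quoted algorithm (clears in RUNNING,
 * SAVING, RESUMING order; sets in RESUMING, SAVING, RUNNING order):
 *
 * (a) -> (d):  0  ->  SAVING | RUNNING
 *     curr.SAVING  = 1;     start logging while the device is stopped
 *     curr.RUNNING = 1;     then let the device run
 *
 * (d) -> (a):  SAVING | RUNNING  ->  0
 *     curr.RUNNING = 0;     stop the device first
 *     curr.SAVING  = 0;     then end the saving session
 *
 * (d) -> (e):  SAVING | RUNNING  ->  RESUMING
 *     curr.RUNNING  = 0;
 *     curr.SAVING   = 0;    both cleared before RESUMING is set
 *     curr.RESUMING = 1;
 *
 * Any failure along the way re-runs the same sequence with old and new
 * swapped, falling back to ERROR only if that unwind also fails.
 */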
On Mon, Dec 13, 2021 at 01:40:38PM -0700, Alex Williamson wrote:

> We do specify that a migration driver has discretion in using the error > state for failed transitions, so there are options to simplify error > handling.

This could be OK if we can agree ERROR has undefined behavior.

> I think the basis of your priority scheme comes from that. Ordering > of the remaining items is more subtle though, for instance > 0 -> SAVING | RUNNING can be broken down as: > > 0 -> SAVING > SAVING -> SAVING | RUNNING > > vs > > 0 -> RUNNING > RUNNING -> SAVING | RUNNING > > I'd give preference to enabling logging before running and I believe > that holds for transition (e) -> (d) as well.

IMHO, any resolution to an arbitrary choice should pick an order that follows the reference flow because we know those are useful sequences to have today.

Generally we have no use for the sequence SAVING -> SAVING | RUNNING, while RUNNING -> SAVING | RUNNING is part of the standard flow.

It also raises the question that it seems not well defined what the sequence:

  SAVING -> SAVING | RUNNING

even does to the migration window? Nor can I imagine what mlx5 could do beyond fail this or corrupt the migration..

If we keep the "should implement transitions" language below then I expect mlx5 to fail this, and then we have a conflict where mlx5 cannot implement these precedence rules.

This is the kind of precedence resolution I was trying to avoid.

As far as I can see the requirements are broadly:
 - Do not transit through an invalid state
 - Do not lose NDMA during the transit, eg NDMA | SAVING | RUNNING -> 0 should not have a race where a DMA can leak out
 - Do not transit through things like SAVING -> SAVING | RUNNING, and I'm not confident I have a good list of these

> And I think that also addresses the claim that we're doomed to untested > and complicated error code handling, we unwind by simply swapping the > args to our set state function and enter the ERROR state should that > recursive call fail.

I had the same thought the day after I wrote this, it seems workable.

I remain concerned however that we still can't seem to reach a working precedence after all this time. This is a very bad sign.

Even if we work something out adding a new state someday is terrifying. What if we can't work out any precedence that is compatible with today's and supports the new state?

IMHO, we should be simplifying this before it becomes permanent API, not trying to force it to work.

> If we put it in the user's hands and prescribe only single bit flips, > they don't really have device knowledge to optimize further than this > like a migration driver might be able to do.

If so this argues we should go back to the enforced FSM that the v1 mlx5 posting had and forget about device_state as a bunch of bits.

Most of the things I brought up in this post are resolved by the forced FSM.

Yes, we give up some flexibility, but I think the quest for flexibility is a little misguided. If the drivers don't consistently implement the flexibility then it is just cruft we cannot make use of from userspace.

eg what practical use is SAVING -> SAVING | RUNNING if today's mlx5 implementation silently corrupts the migration stream? That instantly makes that a no-go for userspace from an interoperability perspective and we've accomplished nothing by allowing for it.

Please think about it, it looks like an easy resolution to all this discussion to simply specify a fixed FSM and be done with it.
> > I thought we could tackled this when you first suggested it (eg copy > > the mlx5 logic and be OK), but now I'm very skeptical. The idea that > > every driver can do this right in all the corner cases doesn't seem > > reasonable given we've made so many errors here already just in mlx5. > > > > > + * - Bit 1 (SAVING) [REQUIRED]: > > > + * - Setting this bit enables and initializes the migration region data > > > > I would use the word clear instead of initialize - the point of this > > is to throw away any data that may be left over in the window from any > > prior actions. > > "Clear" to me suggests that there's some sort of internal shared buffer > implementation that needs to be wiped between different modes. I chose > "initialize" because I think it offers more independence to the > implementation. The data window is expressed as a shared buffer in this API, there is only one data_offset/size and data window for everything. I think it is fine to rely on that for the description, and like all abstractions an implementation can do whatever so long as externally it looks like this shared buffer API. The requirement here is that anything that pre-existed in the data window from any prior operation is cleaned and the data window starts empty before any data related to this SAVING is transfered. > > > + * window and associated fields within vfio_device_migration_info for > > > + * capturing the migration data stream for the device. The migration > > > + * driver may perform actions such as enabling dirty logging of device > > > + * state with this bit. The SAVING bit is mutually exclusive with the > > > + * RESUMING bit defined below. > > > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > > > + * data window and indicates the completion or termination of the > > > + * migration data stream for the device. > > > > I don't know what "de-initialized" means as something a device should > > do? IMHO there is no need to talk about the migration window here, > > SAVING says initialize/clear - and data_offset/etc say their values > > are undefined outside SAVING/RUNNING. That is enough. > > If "initializing" the migration data region puts in place handlers for > pending_bytes and friends, "de-initializing" would undo that operation. > Perhaps I should use "deactivates"? And if you don't use "initializing" we don't need to talk about "de-initializating". Reading the data window outside SAVING is undefined behavior it seems, so nothing needs to be said. > > > + * - Bit 2 (RESUMING) [REQUIRED]: > > > + * - Setting this bit enables and initializes the migration region data > > > + * window and associated fields within vfio_device_migration_info for > > > + * restoring the device from a migration data stream captured from a > > > + * SAVING session with a compatible device. The migration driver may > > > + * perform internal device resets as necessary to reinitialize the > > > + * internal device state for the incoming migration data. > > > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > > > + * region data window and indicates the end of a resuming session for > > > + * the device. The kernel migration driver should complete the > > > + * incorporation of data written to the migration data window into the > > > + * device internal state and perform final validity and consistency > > > + * checking of the new device state. 
If the user provided data is > > > + * found to be incomplete, inconsistent, or otherwise invalid, the > > > + * migration driver must indicate a write(2) error and follow the > > > + * previously described protocol to return either the previous state > > > + * or an error state. > > > > Prefer this is just 'go to an error state' to avoid unnecessary > > implementation differences. > > Then it becomes a special case versus other device_state changes and > we're forcing what you've described as an undefined state into the > protocol.

Let's look at what recovery actions the VMM would need to take along the reference flow:

RUNNING -> SAVING | RUNNING
   If this fails and we are still in RUNNING and can continue

-> SAVING | RUNNING | NDMA
-> SAVING
   If these fail we need to go to RUNNING

-> RUNNING
   If this fails we need to RESET

-> 0
   Migration succeeded? Failure here should RESET

RUNNING -> RESUMING
   If this fails and we are still in RUNNING continue

-> NDMA | RUNNING
   If this fails RESET

-> RUNNING
   If this fails RESET, VM could be corrupted.

One view is that what the device does is irrelevant as qemu should simply unconditionally reset in these cases.

Another view is that staying in a useless state is also pointless and we may as well return ERROR anyhow. Eg if exiting RESUMING failed there is no other action to take besides RESET, so why didn't we return ERROR to tell this directly to userspace?

Both are reasonable views, which is why I wrote "prefer".

> > > + * - Bit 3 (NDMA) [OPTIONAL]: > > > + * The NDMA or "No DMA" state is intended to be a quiescent state for > > > + * the device for the purposes of managing multiple devices within a > > > + * user context where peer-to-peer DMA between devices may be active. > > > + * Support for the NDMA bit is indicated through the presence of the > > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > > > + * region. > > > + * - Setting this bit must prevent the device from initiating any > > > + * new DMA or interrupt transactions. The migration driver must > > > > I'm not sure about interrupts. > > In the common case an interrupt is a DMA write, so the name, if not > intention of this state gets a bit shaky if interrupts are allowed.

Interrupts have their own masking protocol. For instance a device like the huawei one is halting DMA by manipulating the queue registers, it may still generate interrupts. Yes, MSI is a MemWr, but I've never heard anyone call it a DMA - there is no memory access here since the TLP is routed to the interrupt controller. This is why I'm not sure.

A hostile VM certainly can corrupt the MSI, even today and thus turn it into a DMA. As we talked before this may be OK, but it is a security risk that it allows the guest to impact the hypervisor.

Overall it seems like this is more trouble for a device like huawei's if they want to implement NDMA using the trapping or something. Given your right concern that NDMA should be implemented as widely as possible, making it more difficult than strictly necessary is perhaps not good.

Other people should comment here.

> > > + * complete any such outstanding operations prior to completing > > > + * the transition to the NDMA state. The NDMA device_state > > > > Reading this as you wrote it and I suddenly have a doubt about the PRI > > use case. Is it reasonable that the kernel driver will block on NDMA > > waiting for another userspace thread to resolve any outstanding PRIs?
> > > > Can that allow userspace to deadlock the kernel or device? Is there an > > alterative? > > I'd hope we could avoid deadlock in the kernel, but it seems trickier > for userspace to be waiting on a write(2) operation to the device while > also handling page request events for that same device. Is this > something more like a pending transaction bit where userspace asks the > device to go quiescent and polls for that to occur? Hum. I'm still looking into this question, but some further thoughts. PRI doesn't do DMA, it just transfers a physical address into the PCI device's cache that can be later used with DMA. PRI also doesn't imply the vPRI Intel is talking about. For PRI controlled by the hypervisor, it is completely reasonable that NDMA returns synchronously after the PRI and the DMA that triggered it completes. The VMM would have to understand this and ensure it doesn't block the kernel's fault path while going to NDMA eg with userfaultfd or something else crazy. The other reasonable option is that NDMA cancels the DMA that triggered the PRI and simply doesn't care how the PRI is completed after NDMA returns. The later is interesting because it is a possible better path to solve the vPRI problem Intel brought up. Waiting for the VCPU is just asking for a DOS, if NDMA can cancel the DMAs we can then just directly fail the open PRI in the hypervisor and we don't need to care about the VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the resume'd device simply issues a new DMA with an empty ATS cache and does a new PRI. It is uncertain enough that qemu should not support vPRI with migration until we define protocol(s) and a cap flag to say the device supports it. > > > + * All combinations for the above defined device_state bits are considered > > > + * valid with the following exceptions: > > > + * - RESUMING and SAVING are mutually exclusive, all combinations of > > > + * (RESUMING | SAVING) are invalid. Furthermore the specific combination > > > + * (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the > > > + * device error state VFIO_DEVICE_STATE_ERROR. This variant is > > > + * specifically chosen due to the !RUNNING state of the device as the > > > + * migration driver should do everything possible, including an internal > > > + * reset of the device, to ensure that the device is fully stopped in > > > + * this state. > > > > Prefer we don't specify this. ERROR is undefined behavior and > > userspace should reset. Any path that leads along to ERROR already > > includes possiblities for wild DMAs and what not, so there is nothing > > to be gained by this other than causing a lot of driver complexity, > > IMHO. > > This seems contrary to your push for consistent, interoperable > behavior. Formal "undefined behavior" can be a useful part of a spec, especially if the spec is 'when you see ERROR you must do RESET', we don't really need to constrain the device further to continue to have interoperability. > What's the benefit to actually leaving the state undefined or the > drawback to preemptively resetting a device if the migration driver > cannot determine if the device is quiesced, RESET puts the device back to RUNNING, so RESET alone does not remedy the problem. RESET followed by !RUNNING can fail, meaning the best mlx5 can do is "SHOULD", in which case lets omit the RESET since userspace can't rely on it. 
Even if it did work reliably, the requirement is userspace must issue RESET to exit ERROR and if we say the driver has to issue reset to enter ERROR we are just doing a pointless double RESET.

> would need to reset the device to enter a new state anyway? I added > this because language in your doc suggested the error state was far > more undefined that I understood it to be, ie. !RUNNING.

Yes it was like that, because the implementation of this strict requirement is not nice.

Perhaps a middle ground can work:

  For device_state ERROR the device SHOULD be !RUNNING. If the ERROR
  arose due to a device_state change and if the new and old states have
  NDMA behavior then the device MUST maintain NDMA behavior while
  processing the device_state and continuing while in ERROR.

  Userspace MUST reset the device to recover from ERROR, therefore
  devices SHOULD NOT do a redundant internal reset.

> > > + * Migration drivers should attempt to support any transition between valid > > > > should? must, I think. > > I think that "must" terminology is a bit contrary to the fact that we > have a defined error state that can be used at the discretion of the > migration driver. To me, "should" tells the migration drivers that they > ought to make an attempt to support all transitions, but userspace > needs to be be prepared that they might not work.

IMHO this is not inter-operable. At a minimum we should expect that a driver implements a set of standard transitions, or it has to implement all of them. Otherwise what is the point?

If you go back to the mlx5 v1 version it did effectively this. It enforced a FSM and only allowed some transitions. That meets the language here with "should" but you didn't like it, and I agreed with you then. This is when the trouble started :)

The mlx5 v1 with the FSM didn't have a lot of these problems we are discussing. It didn't have precedence issues, it didn't have problems executing odd combinations it can't support, it worked and was simple to understand.

So, if we say should here, then I vote mlx5 goes back to enforcing its FSM and that becomes the LCD that userspace must implement to. In which case, why not formally specify the FSM now and avoid a driver pushing a de facto spec?

If we say MUST here then we need to figure out a precedence and say that some transitions are undefined behavior, like SAVING -> SAVING|RUNNING.

> Maybe it's too subtle, but that's why I phrased it as "no data is > currently available" and went on to specify the implications in the > !RUNNING state. "Currently", suggesting that in the RUNNING state the > value is essentially volatile.

It is subtle enough to clarify that polling may see more data in future in that case.

Thanks,
Jason
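For comparison, the "fixed FSM" argued for above might look something like the sketch below: a small set of named states plus an explicit table of arcs every driver must implement. All names and arcs here are hypothetical, purely to illustrate the shape of such a uAPI, not anything proposed in the patch.

#include <stdbool.h>

/* Hypothetical enforced-FSM alternative to the device_state bitfield. */
enum vfio_mig_fsm_state {
	MIG_FSM_ERROR,		/* only exit is VFIO_DEVICE_RESET */
	MIG_FSM_RUNNING,	/* default state */
	MIG_FSM_PRE_COPY,	/* equivalent of SAVING | RUNNING */
	MIG_FSM_STOP,		/* equivalent of !RUNNING, no session */
	MIG_FSM_STOP_COPY,	/* equivalent of SAVING, !RUNNING */
	MIG_FSM_RESUMING,
	MIG_FSM_NR_STATES,
};

/* arc[from][to]: the transitions every driver would have to implement */
static const bool arc[MIG_FSM_NR_STATES][MIG_FSM_NR_STATES] = {
	[MIG_FSM_RUNNING]   = { [MIG_FSM_PRE_COPY]  = true,
				[MIG_FSM_STOP]      = true,
				[MIG_FSM_RESUMING]  = true },
	[MIG_FSM_PRE_COPY]  = { [MIG_FSM_STOP_COPY] = true,
				[MIG_FSM_RUNNING]   = true },
	[MIG_FSM_STOP_COPY] = { [MIG_FSM_STOP]      = true },
	[MIG_FSM_STOP]      = { [MIG_FSM_RUNNING]   = true,
				[MIG_FSM_RESUMING]  = true },
	[MIG_FSM_RESUMING]  = { [MIG_FSM_STOP]      = true,
				[MIG_FSM_RUNNING]   = true },
};

Whether an arc such as STOP_COPY back to PRE_COPY needs to exist is exactly the SAVING -> SAVING | RUNNING question raised earlier in the thread.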
On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote: > A new NDMA state is being proposed to support a quiescent state for > contexts containing multiple devices with peer-to-peer DMA support. > Formally define it. [I'm wondering if we would want to use NDMA in other cases as well. Just thinking out loud below.] > > Clarify various aspects of the migration region data fields and > protocol. Remove QEMU related terminology and flows from the uAPI; > these will be provided in Documentation/ so as not to confuse the > device_state bitfield with a finite state machine with restricted > state transitions. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- > 1 file changed, 214 insertions(+), 191 deletions(-) > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index ef33ea002b0b..1fdbc928f886 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h (...) > + * The device_state field defines the following bitfield use: > + * > + * - Bit 0 (RUNNING) [REQUIRED]: > + * - Setting this bit indicates the device is fully operational, the > + * device may generate interrupts, DMA, respond to MMIO, all vfio > + * device regions are functional, and the device may advance its > + * internal state. The default device_state must indicate the device > + * in exclusively the RUNNING state, with no other bits in this field > + * set. > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the > + * device. The device must not generate interrupts, DMA, or advance > + * its internal state. The user should take steps to restrict access > + * to vfio device regions other than the migration region while the > + * device is !RUNNING or risk corruption of the device migration data > + * stream. The device and kernel migration driver must accept and > + * respond to interaction to support external subsystems in the > + * !RUNNING state, for example PCI MSI-X and PCI config space. > + * Failure by the user to restrict device access while !RUNNING must > + * not result in error conditions outside the user context (ex. > + * host system faults). If I consider ccw, this would mean that user space would need to stop writing to the regions that initiate start/halt/... when RUNNING is cleared (makes sense) and that the subchannel must be idle or even disabled (so that it does not become status pending). The question is, does it make sense to stop new requests and wait for the subchannel to become idle during the !RUNNING transition (or even forcefully kill outstanding I/O), or... > + * - Bit 1 (SAVING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data > + * window and associated fields within vfio_device_migration_info for > + * capturing the migration data stream for the device. The migration > + * driver may perform actions such as enabling dirty logging of device > + * state with this bit. The SAVING bit is mutually exclusive with the > + * RESUMING bit defined below. > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > + * data window and indicates the completion or termination of the > + * migration data stream for the device. 
> + * - Bit 2 (RESUMING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data > + * window and associated fields within vfio_device_migration_info for > + * restoring the device from a migration data stream captured from a > + * SAVING session with a compatible device. The migration driver may > + * perform internal device resets as necessary to reinitialize the > + * internal device state for the incoming migration data. > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > + * region data window and indicates the end of a resuming session for > + * the device. The kernel migration driver should complete the > + * incorporation of data written to the migration data window into the > + * device internal state and perform final validity and consistency > + * checking of the new device state. If the user provided data is > + * found to be incomplete, inconsistent, or otherwise invalid, the > + * migration driver must indicate a write(2) error and follow the > + * previously described protocol to return either the previous state > + * or an error state. > + * - Bit 3 (NDMA) [OPTIONAL]: > + * The NDMA or "No DMA" state is intended to be a quiescent state for > + * the device for the purposes of managing multiple devices within a > + * user context where peer-to-peer DMA between devices may be active. > + * Support for the NDMA bit is indicated through the presence of the > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > + * region. > + * - Setting this bit must prevent the device from initiating any > + * new DMA or interrupt transactions. The migration driver must > + * complete any such outstanding operations prior to completing > + * the transition to the NDMA state. The NDMA device_state > + * essentially represents a sub-set of the !RUNNING state for the > + * purpose of quiescing the device, therefore the NDMA device_state > + * bit is superfluous in combinations including !RUNNING. > + * - Clearing this bit (ie. !NDMA) negates the device operational > + * restrictions required by the NDMA state. ...should we use NDMA as the "stop new requests" state, but allow running channel programs to conclude? I'm not entirely sure whether that's in the spirit of NDMA (subchannels are independent of each other), but it would be kind of "quiescing" already. (We should probably clarify things like that in the Documentation/ file.)
On Tue, 14 Dec 2021 12:26:54 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Dec 13, 2021 at 01:40:38PM -0700, Alex Williamson wrote: > > > We do specify that a migration driver has discretion in using the error > > state for failed transitions, so there are options to simplify error > > handling. > > This could be OK if we can agree ERROR has undefined behavior. > > > I think the basis of your priority scheme comes from that. Ordering > > of the remaining items is more subtle though, for instance > > 0 -> SAVING | RUNNING can be broken down as: > > > > 0 -> SAVING > > SAVING -> SAVING | RUNNING > > > > vs > > > > 0 -> RUNNING > > RUNNING -> SAVING | RUNNING > > > > I'd give preference to enabling logging before running and I believe > > that holds for transition (e) -> (d) as well. > > IMHO, any resolution to an arbitary choice should pick an order that > follows the reference flow because we know those are useful sequences > to have today. > > Generally we have no use for the sequence SAVING -> SAVING | RUNNING, > while RUNNING -> SAVING | RUNNING is part of the standard flow. SAVING -> SAVING | RUNNING or SAVING -> RUNNING could both be used as recovery sequences should the target VM fail. The latter might be a full abort of the migration while the former might be a means to reset the downtime clock without fully restarting the migration. > It also raises the question that it seems not well defined what the > sequence: > > SAVING -> SAVING | RUNNING > > Even does to the migration window? > > Nor can I imagine what mlx5 could do beyond fail this or corrupt the > migration.. I think this comes down to the robustness of the migration driver. The migration data window and control of how userspace is to interact with it is essentially meant to allow the migration driver to implement its own transport protocol. In the case of mlx5 where it expects only to apply the received migration data on the RESUMING -> RUNNING transition, a "short" data segment might be reserved for providing sequencing information. Each time the device enters SAVING | !RUNNING the driver might begin by presenting a new sequence header. On the target, a new sequence header would cause any previously received migration data to be discarded. A similar header would also be suggested to validate the migration data stream is appropriate for the target device. Also, I hope it goes without saying, but I'll say it anyway, the migration data stream would make for an excellent exploit vector and each migration driver needs to be responsible to make sure that userspace cannot use it to break containment of the device. > If we keep the "should implement transitions" language below then I > expect mlx5 to fail this, and then we have a conflict where mlx5 > cannot implement these precedence rules. Per above, I think it can. > This is the kind of precedence resolution I was trying to avoid. > > As far as I can see the requirements are broadly: > - Do not transit through an invalid state > - Do not loose NDMA during the transit, eg NDMA | SAVING | RUNNING -> 0 > should not have a race where a DMA can leak out > - Do not do transit through things like SAVING -> SAVING | RUNNING, > and I'm not confident I have a good list of these "Things like", yeah, that's not really spec material. There's nothing special about that transition. 
The migration driver should take into account management of the migration data stream and support such states, or error the transition if it isn't sufficient ROI and expect device specific bug reports at some point in the future. For NDMA, that's a valid consideration. I think that means though that NDMA doesn't simply bookend the pseudo algorithm I provided. Perhaps it's nested between RUNNING and SAVING handlers on either end. > > And I think that also addresses the claim that we're doomed to untested > > and complicated error code handling, we unwind by simply swapping the > > args to our set state function and enter the ERROR state should that > > recursive call fail. > > I had the same thought the day after I wrote this, it seems workable. > > I remain concerned however that we still can't seem to reach to a > working precedence after all this time. This is a very bad sign. > > Even if we work something out adding a new state someday is > terrifying. What if we can't work out any precedence that is > compatible with todays and supports the new state? > > IMHO, we should be simplifing this before it becomes permanent API, > not trying to force it to work. I agree, this is our opportunity to simplify before we're committed, but I don't see how we can throw out perfectly valid transitions like SAVING -> SAVING | RUNNING just because the driver hasn't accounted for managing data in the data stream. > > If we put it in the user's hands and prescribe only single bit flips, > > they don't really have device knowledge to optimize further than this > > like a migration driver might be able to do. > > If so this argues we should go back to the enforced FSM that the v1 > mlx5 posting had and forget about device_state as a bunch of bits. > > Most of things I brought up in this post are resolved by the forced > FSM. Until userspace tries to do something different than exactly what it does today, and then what? > Yes, we give up some flexability, but I think the quest for > flexability is a little misguided. If the drivers don't consistently > implement the flexability then it is just cruft we cannot make use of > from userspace. > > eg what practical use is SAVING -> SAVING | RUNNING if today's mlx5 > implementation silently corrupts the migration stream? That instantly > makes that a no-go for userspace from an interoperability perspective > and we've accomplished nothing by allowing for it. Failure to support that transition is a deficiency of the driver and represented by a non-silent error in making that transition. Silently corrupting the migration stream is simply a driver bug. > Please think about it, it looks like an easy resolution to all this > discussion to simply specify a fixed FSM and be done with it. > > > > I thought we could tackled this when you first suggested it (eg copy > > > the mlx5 logic and be OK), but now I'm very skeptical. The idea that > > > every driver can do this right in all the corner cases doesn't seem > > > reasonable given we've made so many errors here already just in mlx5. > > > > > > > + * - Bit 1 (SAVING) [REQUIRED]: > > > > + * - Setting this bit enables and initializes the migration region data > > > > > > I would use the word clear instead of initialize - the point of this > > > is to throw away any data that may be left over in the window from any > > > prior actions. > > > > "Clear" to me suggests that there's some sort of internal shared buffer > > implementation that needs to be wiped between different modes. 
I chose > > "initialize" because I think it offers more independence to the > > implementation. > > The data window is expressed as a shared buffer in this API, there is > only one data_offset/size and data window for everything. Any access to the data window outside of that directed by the driver is undefined, it's up to the driver where and how to populate the data. A driver might make a portion of the data window available as an mmap that gets zapped and faulted to the correct device backing between operations. > I think it is fine to rely on that for the description, and like all > abstractions an implementation can do whatever so long as externally > it looks like this shared buffer API. > > The requirement here is that anything that pre-existed in the data > window from any prior operation is cleaned and the data window starts > empty before any data related to this SAVING is transfered. IOW, it's initialized. We're picking out colors for the bike shed at this point. > > > > + * window and associated fields within vfio_device_migration_info for > > > > + * capturing the migration data stream for the device. The migration > > > > + * driver may perform actions such as enabling dirty logging of device > > > > + * state with this bit. The SAVING bit is mutually exclusive with the > > > > + * RESUMING bit defined below. > > > > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > > > > + * data window and indicates the completion or termination of the > > > > + * migration data stream for the device. > > > > > > I don't know what "de-initialized" means as something a device should > > > do? IMHO there is no need to talk about the migration window here, > > > SAVING says initialize/clear - and data_offset/etc say their values > > > are undefined outside SAVING/RUNNING. That is enough. > > > > If "initializing" the migration data region puts in place handlers for > > pending_bytes and friends, "de-initializing" would undo that operation. > > Perhaps I should use "deactivates"? > > And if you don't use "initializing" we don't need to talk about > "de-initializating". > > Reading the data window outside SAVING is undefined behavior it seems, > so nothing needs to be said. Exactly why I thought simply describing it as the reciprocal of setting the bit would be sufficient. Taupe! > > > > + * - Bit 2 (RESUMING) [REQUIRED]: > > > > + * - Setting this bit enables and initializes the migration region data > > > > + * window and associated fields within vfio_device_migration_info for > > > > + * restoring the device from a migration data stream captured from a > > > > + * SAVING session with a compatible device. The migration driver may > > > > + * perform internal device resets as necessary to reinitialize the > > > > + * internal device state for the incoming migration data. > > > > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > > > > + * region data window and indicates the end of a resuming session for > > > > + * the device. The kernel migration driver should complete the > > > > + * incorporation of data written to the migration data window into the > > > > + * device internal state and perform final validity and consistency > > > > + * checking of the new device state. 
If the user provided data is > > > > + * found to be incomplete, inconsistent, or otherwise invalid, the > > > > + * migration driver must indicate a write(2) error and follow the > > > > + * previously described protocol to return either the previous state > > > > + * or an error state. > > > > > > Prefer this is just 'go to an error state' to avoid unnecessary > > > implementation differences. > > > > Then it becomes a special case versus other device_state changes and > > we're forcing what you've described as an undefined state into the > > protocol. > > Lets look at what recovery actions something the VMM would need to > take along the reference flow: > > RUNNING -> SAVING | RUNNING > If this fails and we are still in RUNNING and can continue > > -> SAVING | RUNNING | NDMA > -> SAVING > If these fail we need to go to RUNNING > -> RUNNING > If this fails we need to RESET I won't argue that there aren't transition failures where the next logical step is likely a reset, but I also don't see the point in defining special rules for certain cases. When in doubt, leave policy decisions to userspace? > > -> 0 > Migration succeeded? Failure here should RESET > > RUNNING -> RESUMING > If this fails and we are still in RUNNING continue > -> NDMA | RUNNING > If this fails RESET > -> RUNNING > If this fails RESET, VM could be corrupted. > > One view is that what the device does is irrelevant as qemu should > simply unconditionally reset in these case. > > Another view is that staying in a useless state is also pointless and > we may as well return ERROR anyhow. Eg if exiting RESUMING failed > there is no other action to take besides RESET, so why didn't we > return ERROR to tell this directly to userspace? And then the last packet arrives, gets written to the device that's still in RESUMING, and now can transition to RUNNING. > Both are reasonable views, which is why I wrote "prefer". There's no going back from the ERROR state, the only path the user has forward is to reset the device. Therefore the only case I'm willing to say it's the preferred next state is in the case of an irrecoverable internal fault. I'd also like to avoid another lengthy discussion trying to define which specific transitions should default to an ERROR state if they fail versus simply return -errno. Userspace is free to define a policy where an -errno is considered fatal for the device. I prefer consistent userspace handling, letting userspace define policy, and robust drivers that avoid forcing unnecessary user decisions. > > > > + * - Bit 3 (NDMA) [OPTIONAL]: > > > > + * The NDMA or "No DMA" state is intended to be a quiescent state for > > > > + * the device for the purposes of managing multiple devices within a > > > > + * user context where peer-to-peer DMA between devices may be active. > > > > + * Support for the NDMA bit is indicated through the presence of the > > > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > > > > + * region. > > > > + * - Setting this bit must prevent the device from initiating any > > > > + * new DMA or interrupt transactions. The migration driver must > > > > > > I'm not sure about interrupts. > > > > In the common case an interrupt is a DMA write, so the name, if not > > intention of this state gets a bit shaky if interrupts are allowed. > > Interrupts have their own masking protocol. 
For instance a device like > the huawei one is halting DMA by manipulating the queue registers, it > may still generate interrupts. > > Yes, MSI is a MemWr, but I've never heard anyone call it a DMA - there > is no memory access here since the TLP is routed to the interrupt > controller. That's a pretty subtle distinction. Can't that controller live in MMIO space and isn't it then just a peer-to-peer DMA? I know I'm not the first to consider MSI to be just another DMA, that seems to be the basis of it's handling on ARM SMMU. > This is why I'm not sure. A hostile VM certainly can corrupt the MSI, > even today and thus turn it into a DMA. As we talked before this may > be OK, but is security risky that it allows the guest to impact the > hypervisor. > > Overall it seems like this is more trouble for a device like huawei's > if they want to implement NDMA using the trapping or something. Given > your right concern that NDMA should be implemented as widely as > possible making it more difficult that stricly necessary is perhaps > not good. > > Other peope should comment here. Yeah, I'm not clear on what devices can and cannot do in the NDMA state. Ultimately the goal is that once all devices are in the NDMA state, we can safely transition them to the !RUNNING state without concern regarding access from another device. Specifically we want to avoid things like DeviceA moves to !RUNNING while DeviceB initiates a DMA access to DeviceA which now cannot respond without advancing internal state which violates the condition of !RUNNING. But I think we're really only after that p2p behavior and we've discussed that disabling p2p mappings in the VM would be a sufficient condition to support multiple devices without NDMA support. I think that means DMA to memory is fine and DMA related to MSI is fine... but how does a device know which DMA is memory and which DMA is another device? > > > > + * complete any such outstanding operations prior to completing > > > > + * the transition to the NDMA state. The NDMA device_state > > > > > > Reading this as you wrote it and I suddenly have a doubt about the PRI > > > use case. Is it reasonable that the kernel driver will block on NDMA > > > waiting for another userspace thread to resolve any outstanding PRIs? > > > > > > Can that allow userspace to deadlock the kernel or device? Is there an > > > alterative? > > > > I'd hope we could avoid deadlock in the kernel, but it seems trickier > > for userspace to be waiting on a write(2) operation to the device while > > also handling page request events for that same device. Is this > > something more like a pending transaction bit where userspace asks the > > device to go quiescent and polls for that to occur? > > Hum. I'm still looking into this question, but some further thoughts. > > PRI doesn't do DMA, it just transfers a physical address into the PCI > device's cache that can be later used with DMA. > > PRI also doesn't imply the vPRI Intel is talking about. > > For PRI controlled by the hypervisor, it is completely reasonable that > NDMA returns synchronously after the PRI and the DMA that triggered it > completes. The VMM would have to understand this and ensure it doesn't > block the kernel's fault path while going to NDMA eg with userfaultfd > or something else crazy. > > The other reasonable option is that NDMA cancels the DMA that > triggered the PRI and simply doesn't care how the PRI is completed > after NDMA returns. 
> > The later is interesting because it is a possible better path to solve > the vPRI problem Intel brought up. Waiting for the VCPU is just asking > for a DOS, if NDMA can cancel the DMAs we can then just directly fail > the open PRI in the hypervisor and we don't need to care about the > VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the > resume'd device simply issues a new DMA with an empty ATS cache and > does a new PRI. > > It is uncertain enough that qemu should not support vPRI with > migration until we define protocol(s) and a cap flag to say the device > supports it. > > > > > + * All combinations for the above defined device_state bits are considered > > > > + * valid with the following exceptions: > > > > + * - RESUMING and SAVING are mutually exclusive, all combinations of > > > > + * (RESUMING | SAVING) are invalid. Furthermore the specific combination > > > > + * (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the > > > > + * device error state VFIO_DEVICE_STATE_ERROR. This variant is > > > > + * specifically chosen due to the !RUNNING state of the device as the > > > > + * migration driver should do everything possible, including an internal > > > > + * reset of the device, to ensure that the device is fully stopped in > > > > + * this state. > > > > > > Prefer we don't specify this. ERROR is undefined behavior and > > > userspace should reset. Any path that leads along to ERROR already > > > includes possiblities for wild DMAs and what not, so there is nothing > > > to be gained by this other than causing a lot of driver complexity, > > > IMHO. > > > > This seems contrary to your push for consistent, interoperable > > behavior. > > Formal "undefined behavior" can be a useful part of a spec, especially > if the spec is 'when you see ERROR you must do RESET', we don't really > need to constrain the device further to continue to have > interoperability. > > > What's the benefit to actually leaving the state undefined or the > > drawback to preemptively resetting a device if the migration driver > > cannot determine if the device is quiesced, > > RESET puts the device back to RUNNING, so RESET alone does not remedy > the problem. > > RESET followed by !RUNNING can fail, meaning the best mlx5 can do is > "SHOULD", in which case lets omit the RESET since userspace can't rely > on it. > > Even if it did work reliably, the requirement is userspace must issue > RESET to exit ERROR and if we say the driver has to issue reset to > enter ERROR we are just doing a pointless double RESET. Please read what I wrote: This variant is specifically chosen due to the !RUNNING state of the device as the migration driver should do everything possible, including an internal reset of the device, to ensure that the device is fully stopped in this state. That does not say that a driver must issue a reset to enter the ERROR state. Perhaps it's wrong that I'm equating this so formally to the !RUNNING state when really we don't care about the internal state of the device, we just want it to not corrupt memory or generate spurious interrupts. I'm thinking the equivalent of clear bus-master for PCI devices. Would it be sufficient if I clarified !RUNNING relative to DMA and interrupt generation? > > would need to reset the device to enter a new state anyway? I added > > this because language in your doc suggested the error state was far > > more undefined that I understood it to be, ie. !RUNNING. 
> > Yes it was like that, because the implementation of this strict > requirement is not nice. > > Perhaps a middle ground can work: > > For device_state ERROR the device SHOULD have the device > !RUNNING. If the ERROR arose due to a device_state change and > if the new and old states have NDMA behavior then the device MUST > maintain NDMA behavior while processing the device_state and > continuing while in ERROR. Userspace MUST reset the device to > recover from ERROR, therefore devices SHOULD NOT do a redundant > internal reset I don't have a problem if the reset is redundant to one the user needs to do anyway; I'd rather see any externally visible operation of the device stopped ASAP. The NDMA-like properties of the new and old states are also irrelevant; if a device enters an ERROR state moving from RUNNING -> SAVING | RUNNING it shouldn't continue manipulating memory and generating interrupts in the background. What about: The !RUNNING variant is used here specifically to reflect that the device should immediately cease all external operations such as DMA and interrupts. The migration driver should do everything possible, up to and including an internal reset of the device, to ensure that the device is externally quiescent in this state. > > > > + * Migration drivers should attempt to support any > > > > transition between valid > > > > > > should? must, I think. > > I think that "must" terminology is a bit contrary to the fact that > > we have a defined error state that can be used at the > > discretion of the migration driver. To me, "should" tells the > > migration drivers that they ought to make an attempt to support > > all transitions, but userspace needs to be be prepared that > > they might not work. > > IMHO this is not inter-operable. At a minimum we should expect that a > driver implements a set of standard transitions, or it has to > implement all of them. > > Otherwise what is the point? > > If you go back to the mlx5 v1 version it did effectively this. It > enforced a FSM and only allowed some transitions. That meets the > language here with "should" but you didn't like it, and I agreed with > you then. > > This is when the trouble stated :) > > The mlx5 v1 with the FSM didn't have alot of these problems we are > discussing. It didn't have precedence issues, it didn't have problems > executing odd combinations it can't support, it worked and was simple > to understand. And an audit of that driver during review found that it grossly failed to meet the spirit of a "should" requirement. Using "should" terminology here is meant to give the driver some leeway; it's not an invitation for abuse. Even below there's still a notion that a given state transition is unsupportable by your device; what if that was actually true? > So, if we say should here, then I vote mlx5 goes back to enforcing > its FSM and that becomes the LCD that userspace must implement to. > > In which case, why not formally specify the FSM now and avoid a driver > pushing a defacto spec? It really only takes one driver implementing something like SAVING -> SAVING | RUNNING and QEMU taking advantage of it as a supported transition per the uAPI for mlx5 to be left out of the feature that it might provide. > If we say MUST here then we need to figure out a precedence and > say that some transitions are undefined behavior, like SAVING -> > SAVING|RUNNING. If we say "should" and don't do those things, then we're still not implementing to the spirit of the uAPI.
I'm hearing a lot of "may" in your interpretation of "should". And again, nothing wrong with that transition; manage the migration stream better. Thanks, Alex
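A minimal sketch of the unwind approach mentioned above: decompose a multi-bit device_state write into single-bit steps and, on failure, call the same helper with the arguments swapped to walk back, falling into the reserved ERROR combination only if the unwind also fails. The helpers, the single-bit hardware hook, and the precedence choice are all hypothetical:

  #include <errno.h>

  #define ST_RUNNING	(1u << 0)
  #define ST_SAVING	(1u << 1)
  #define ST_RESUMING	(1u << 2)
  #define ST_ERROR	(ST_SAVING | ST_RESUMING)	/* reserved combination */

  struct dev { unsigned int state; };

  /* hypothetical hardware hook: applies exactly one bit flip, 0 on success */
  static int hw_step(struct dev *d, unsigned int next)
  {
  	d->state = next;
  	return 0;
  }

  static int set_state(struct dev *d, unsigned int from, unsigned int to)
  {
  	unsigned int cur = from;

  	while (cur != to) {
  		unsigned int diff = cur ^ to;
  		/* flip the lowest differing bit; a real driver would order
  		 * these flips according to its precedence rules */
  		unsigned int next = cur ^ (diff & -diff);

  		if (hw_step(d, next)) {
  			/* unwind: the same helper with the arguments swapped */
  			if (set_state(d, cur, from))
  				d->state = ST_ERROR;
  			return -EIO;
  		}
  		cur = next;
  	}
  	return 0;
  }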
On Mon, 20 Dec 2021 18:38:26 +0100 Cornelia Huck <cohuck@redhat.com> wrote: > On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote: > > > A new NDMA state is being proposed to support a quiescent state for > > contexts containing multiple devices with peer-to-peer DMA support. > > Formally define it. > > [I'm wondering if we would want to use NDMA in other cases as well. Just > thinking out loud below.] > > > > > Clarify various aspects of the migration region data fields and > > protocol. Remove QEMU related terminology and flows from the uAPI; > > these will be provided in Documentation/ so as not to confuse the > > device_state bitfield with a finite state machine with restricted > > state transitions. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- > > 1 file changed, 214 insertions(+), 191 deletions(-) > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index ef33ea002b0b..1fdbc928f886 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > (...) > > > + * The device_state field defines the following bitfield use: > > + * > > + * - Bit 0 (RUNNING) [REQUIRED]: > > + * - Setting this bit indicates the device is fully operational, the > > + * device may generate interrupts, DMA, respond to MMIO, all vfio > > + * device regions are functional, and the device may advance its > > + * internal state. The default device_state must indicate the device > > + * in exclusively the RUNNING state, with no other bits in this field > > + * set. > > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the > > + * device. The device must not generate interrupts, DMA, or advance > > + * its internal state. The user should take steps to restrict access > > + * to vfio device regions other than the migration region while the > > + * device is !RUNNING or risk corruption of the device migration data > > + * stream. The device and kernel migration driver must accept and > > + * respond to interaction to support external subsystems in the > > + * !RUNNING state, for example PCI MSI-X and PCI config space. > > + * Failure by the user to restrict device access while !RUNNING must > > + * not result in error conditions outside the user context (ex. > > + * host system faults). > > If I consider ccw, this would mean that user space would need to stop > writing to the regions that initiate start/halt/... when RUNNING is > cleared (makes sense) and that the subchannel must be idle or even > disabled (so that it does not become status pending). The question is, > does it make sense to stop new requests and wait for the subchannel to > become idle during the !RUNNING transition (or even forcefully kill > outstanding I/O), or... > > > + * - Bit 3 (NDMA) [OPTIONAL]: > > + * The NDMA or "No DMA" state is intended to be a quiescent state for > > + * the device for the purposes of managing multiple devices within a > > + * user context where peer-to-peer DMA between devices may be active. > > + * Support for the NDMA bit is indicated through the presence of the > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > > + * region. > > + * - Setting this bit must prevent the device from initiating any > > + * new DMA or interrupt transactions. 
The migration driver must > > + * complete any such outstanding operations prior to completing > > + * the transition to the NDMA state. The NDMA device_state > > + * essentially represents a sub-set of the !RUNNING state for the > > + * purpose of quiescing the device, therefore the NDMA device_state > > + * bit is superfluous in combinations including !RUNNING. > > + * - Clearing this bit (ie. !NDMA) negates the device operational > > + * restrictions required by the NDMA state. > > ...should we use NDMA as the "stop new requests" state, but allow > running channel programs to conclude? I'm not entirely sure whether > that's in the spirit of NDMA (subchannels are independent of each > other), but it would be kind of "quiescing" already. > > (We should probably clarify things like that in the Documentation/ > file.) This bumps into the discussion in my other thread with Jason, we need to refine what NDMA means. Based on my reply there and our previous discussion that QEMU could exclude p2p mappings to support VMs with multiple devices that don't support NDMA, I think that NDMA is only quiescing p2p traffic (if so, maybe should be NOP2P). So this use of it seems out of scope to me. Userspace necessarily needs to stop vCPUs before stopping devices, which should mean that there are no new requests when a ccw device is transitioning to !RUNNING. Therefore I'd expect that the transition to any !RUNNING state would wait from completion of running channel programs. Thanks, Alex
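For illustration, the multi-device stop ordering described above might look like the following from userspace. write_device_state() is a hypothetical stand-in for writing the 32-bit device_state field at the start of each device's migration region, the state values mirror the bits proposed in this RFC, and error handling is omitted:

  #define ST_RUNNING	(1u << 0)
  #define ST_SAVING	(1u << 1)
  #define ST_NDMA	(1u << 3)	/* bit 3 as proposed in this RFC */

  /* hypothetical helper, standing in for a write of device_state */
  static int write_device_state(int device_fd, unsigned int state)
  {
  	(void)device_fd; (void)state;
  	return 0;
  }

  /* devices are assumed to already be in SAVING | RUNNING (pre-copy) and
   * the vCPUs already stopped by the VMM */
  static void stop_all_devices(int *fds, int n)
  {
  	int i;

  	/* first pass: quiesce outbound DMA on every device while every
  	 * device is still able to respond to its peers */
  	for (i = 0; i < n; i++)
  		write_device_state(fds[i], ST_NDMA | ST_SAVING | ST_RUNNING);

  	/* second pass: no device can now be the target of a peer's DMA
  	 * once it stops advancing its internal state */
  	for (i = 0; i < n; i++)
  		write_device_state(fds[i], ST_SAVING);
  }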
On Mon, Dec 20 2021, Alex Williamson <alex.williamson@redhat.com> wrote: > On Mon, 20 Dec 2021 18:38:26 +0100 > Cornelia Huck <cohuck@redhat.com> wrote: > >> On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote: >> >> > A new NDMA state is being proposed to support a quiescent state for >> > contexts containing multiple devices with peer-to-peer DMA support. >> > Formally define it. >> >> [I'm wondering if we would want to use NDMA in other cases as well. Just >> thinking out loud below.] >> >> > >> > Clarify various aspects of the migration region data fields and >> > protocol. Remove QEMU related terminology and flows from the uAPI; >> > these will be provided in Documentation/ so as not to confuse the >> > device_state bitfield with a finite state machine with restricted >> > state transitions. >> > >> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >> > --- >> > include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- >> > 1 file changed, 214 insertions(+), 191 deletions(-) >> > >> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h >> > index ef33ea002b0b..1fdbc928f886 100644 >> > --- a/include/uapi/linux/vfio.h >> > +++ b/include/uapi/linux/vfio.h >> >> (...) >> >> > + * The device_state field defines the following bitfield use: >> > + * >> > + * - Bit 0 (RUNNING) [REQUIRED]: >> > + * - Setting this bit indicates the device is fully operational, the >> > + * device may generate interrupts, DMA, respond to MMIO, all vfio >> > + * device regions are functional, and the device may advance its >> > + * internal state. The default device_state must indicate the device >> > + * in exclusively the RUNNING state, with no other bits in this field >> > + * set. >> > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the >> > + * device. The device must not generate interrupts, DMA, or advance >> > + * its internal state. The user should take steps to restrict access >> > + * to vfio device regions other than the migration region while the >> > + * device is !RUNNING or risk corruption of the device migration data >> > + * stream. The device and kernel migration driver must accept and >> > + * respond to interaction to support external subsystems in the >> > + * !RUNNING state, for example PCI MSI-X and PCI config space. >> > + * Failure by the user to restrict device access while !RUNNING must >> > + * not result in error conditions outside the user context (ex. >> > + * host system faults). >> >> If I consider ccw, this would mean that user space would need to stop >> writing to the regions that initiate start/halt/... when RUNNING is >> cleared (makes sense) and that the subchannel must be idle or even >> disabled (so that it does not become status pending). The question is, >> does it make sense to stop new requests and wait for the subchannel to >> become idle during the !RUNNING transition (or even forcefully kill >> outstanding I/O), or... >> > >> > + * - Bit 3 (NDMA) [OPTIONAL]: >> > + * The NDMA or "No DMA" state is intended to be a quiescent state for >> > + * the device for the purposes of managing multiple devices within a >> > + * user context where peer-to-peer DMA between devices may be active. >> > + * Support for the NDMA bit is indicated through the presence of the >> > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by >> > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration >> > + * region. 
>> > + * - Setting this bit must prevent the device from initiating any >> > + * new DMA or interrupt transactions. The migration driver must >> > + * complete any such outstanding operations prior to completing >> > + * the transition to the NDMA state. The NDMA device_state >> > + * essentially represents a sub-set of the !RUNNING state for the >> > + * purpose of quiescing the device, therefore the NDMA device_state >> > + * bit is superfluous in combinations including !RUNNING. >> > + * - Clearing this bit (ie. !NDMA) negates the device operational >> > + * restrictions required by the NDMA state. >> >> ...should we use NDMA as the "stop new requests" state, but allow >> running channel programs to conclude? I'm not entirely sure whether >> that's in the spirit of NDMA (subchannels are independent of each >> other), but it would be kind of "quiescing" already. >> >> (We should probably clarify things like that in the Documentation/ >> file.) > > This bumps into the discussion in my other thread with Jason, we need > to refine what NDMA means. Based on my reply there and our previous > discussion that QEMU could exclude p2p mappings to support VMs with > multiple devices that don't support NDMA, I think that NDMA is only > quiescing p2p traffic (if so, maybe should be NOP2P). So this use of > it seems out of scope to me. Ok, makes sense. If the scope of this flag is indeed to be supposed quite narrow, it might make sense to rename it. > > Userspace necessarily needs to stop vCPUs before stopping devices, > which should mean that there are no new requests when a ccw device is > transitioning to !RUNNING. Therefore I'd expect that the transition to > any !RUNNING state would wait from completion of running channel > programs. Indeed, it should not be any problem to do this for !RUNNING, I had just been wondering about possible alternative implementations.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, December 15, 2021 12:27 AM > > > > + * complete any such outstanding operations prior to completing > > > > + * the transition to the NDMA state. The NDMA device_state > > > > > > Reading this as you wrote it and I suddenly have a doubt about the PRI > > > use case. Is it reasonable that the kernel driver will block on NDMA > > > waiting for another userspace thread to resolve any outstanding PRIs? > > > > > > Can that allow userspace to deadlock the kernel or device? Is there an > > > alterative? > > > > I'd hope we could avoid deadlock in the kernel, but it seems trickier > > for userspace to be waiting on a write(2) operation to the device while > > also handling page request events for that same device. Is this > > something more like a pending transaction bit where userspace asks the > > device to go quiescent and polls for that to occur? > > Hum. I'm still looking into this question, but some further thoughts. > > PRI doesn't do DMA, it just transfers a physical address into the PCI > device's cache that can be later used with DMA. > > PRI also doesn't imply the vPRI Intel is talking about. This is correct. PRI can happen on either a kernel-managed page table or a user-managed page table. Only in the latter case does the PRI need to be forwarded to userspace for fixup. > > For PRI controlled by the hypervisor, it is completely reasonable that > NDMA returns synchronously after the PRI and the DMA that triggered it > completes. The VMM would have to understand this and ensure it doesn't > block the kernel's fault path while going to NDMA eg with userfaultfd > or something else crazy. I don't think there would be any problem with this usage. > > The other reasonable option is that NDMA cancels the DMA that > triggered the PRI and simply doesn't care how the PRI is completed > after NDMA returns. > > The later is interesting because it is a possible better path to solve > the vPRI problem Intel brought up. Waiting for the VCPU is just asking > for a DOS, if NDMA can cancel the DMAs we can then just directly fail cancel and save the context so the aborted transaction can be resumed on the target node. > the open PRI in the hypervisor and we don't need to care about the > VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the > resume'd device simply issues a new DMA with an empty ATS cache and > does a new PRI. > > It is uncertain enough that qemu should not support vPRI with > migration until we define protocol(s) and a cap flag to say the device > supports it. > However, this is too restrictive. It's an ideal option, but in reality it implies that the device can preempt and recover an in-flight request at any granularity (given PRI can occur at any time). I was clearly told by hardware folks how challenging it is to achieve this goal on various IPs, which is also the reason why the draining operation on most devices today is more or less a waiting flavor. btw can you elaborate on the DoS concern? The device is assigned to a user application, which has one thread (migration thread) blocked on another thread (vcpu thread) when transitioning the device to the NDMA state. What service outside of this application is denied here? Thanks Kevin
On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote: > btw can you elaborate the DOS concern? The device is assigned > to an user application, which has one thread (migration thread) > blocked on another thread (vcpu thread) when transiting the > device to NDMA state. What service outside of this application > is denied here? The problem is that the VM controls when the vPRI is responded to, and migration cannot proceed until this is done. So the basic DoS is for a hostile VM to trigger a vPRI and then never answer it. It is even trivially done from userspace with vSVA and userfaultfd, for instance. This will block the hypervisor from ever migrating the VM in a very poor way - it will just hang in the middle of a migration request. Jason
On Mon, Dec 20, 2021 at 03:26:23PM -0700, Alex Williamson wrote: > > It also raises the question that it seems not well defined what the > > sequence: > > > > SAVING -> SAVING | RUNNING > > > > Even does to the migration window? > > > > Nor can I imagine what mlx5 could do beyond fail this or corrupt the > > migration.. > > I think this comes down to the robustness of the migration driver. The > migration data window and control of how userspace is to interact with > it is essentially meant to allow the migration driver to implement its > own transport protocol. In the case of mlx5 where it expects only to > apply the received migration data on the RESUMING -> RUNNING > transition, a "short" data segment might be reserved for providing > sequencing information. Each time the device enters SAVING | !RUNNING > the driver might begin by presenting a new sequence header. On the > target, a new sequence header would cause any previously received > migration data to be discarded. A similar header would also be > suggested to validate the migration data stream is appropriate for the > target device. Honestly, I have no interest in implementing something so complicated for what should be a simple operation. We have no use case for this, no desire to test it, it is just pure kernel cruft and complexity to do this kind of extra work. I think it is very telling that we are, what, 10 weeks into this now, we have seen two HW drivers posted, and NOTHING implements like you are imagining here. I doubt the closed drivers do any better. Let's not make up busy work that can't be strongly justified! That is substantially what I see as wrong with this whole line of thinking that the device_state must be independent bits, not a constrained FSM. We are actively failing, again and again, to tell if drivers are implementing this mulit-bit vision correctly, or even specify properly how it should work in an implementable way. > > IMHO, we should be simplifing this before it becomes permanent API, > > not trying to force it to work. > > I agree, this is our opportunity to simplify before we're committed, > but I don't see how we can throw out perfectly valid transitions like > SAVING -> SAVING | RUNNING just because the driver hasn't accounted for > managing data in the data stream. Don't see? What do you mean? We showed how to do this exactly in the v1. We define the meaning of device_state to be actual FSM and write out the allowed FSM arcs and then properly describe them. This is what everyone else though this was already. AFAICT you are the only person to view this as a bunch of bits. It is why the original header file comment gave names to each of the states and roughtly sketches out legal state transition arcs with the odd ascii art. So I want to stop trying to make the bunch of bits idea work. Let's make it a FSM, let's define exactly the legal arcs, and then define the behaviors in each FSM state. It is easy to understand and we have a hope to get inter-operable implementations. All the driver postings so far are demonstrating they don't get oddball transition arcs correct, and we are clearly not able to find these things during review. And now you are asking for alot of extra work and complications in the driver to support arcs that will never be used - that is really too far, sorry. > > Most of things I brought up in this post are resolved by the forced > > FSM. > > Until userspace tries to do something different than exactly what it > does today, and then what? It can't. 
We define the API to be exactly the permited arcs and no others. That is what simplify means. If we need to add new FSM arcs and new states later then their support can be exposed via a cap flag. This is much better than trying to define all arcs as legal, only testing/using a small subset and hoping the somehow in the future this results in extensible interoperability. > > Another view is that staying in a useless state is also pointless and > > we may as well return ERROR anyhow. Eg if exiting RESUMING failed > > there is no other action to take besides RESET, so why didn't we > > return ERROR to tell this directly to userspace? > > And then the last packet arrives, gets written to the device that's > still in RESUMING, and now can transition to RUNNING. Huh? If the device indicated error during RESUMING userspace should probably stop shoving packets into it or it will possibly corrupt the migration stream. > But I think we're really only after that p2p behavior and we've > discussed that disabling p2p mappings in the VM would be a sufficient > condition to support multiple devices without NDMA support. I think > that means DMA to memory is fine and DMA related to MSI is fine... but > how does a device know which DMA is memory and which DMA is another > device? The device doesn't know if a particular DMA is P2P or not. This is why the device action is called 'NO DMA'. MSI is fine to be left unspecified because we currently virtualize all the MSI register writes and it is impossible for a hostile guest to corrupt them to point the address to anything but the interrupt controller. If a MSI triggers or not in NDMA doesn't practically matter. It only starts to matter someday if we get the world Thomas is thinking about where the guest can directly program the MSI registers. > > Even if it did work reliably, the requirement is userspace must issue > > RESET to exit ERROR and if we say the driver has to issue reset to > > enter ERROR we are just doing a pointless double RESET. > > Please read what I wrote: > > This variant is specifically chosen due to the !RUNNING state of > the device as the migration driver should do everything possible, > including an internal reset of the device, to ensure that the > device is fully stopped in this state. > > That does not say that a driver must issue a reset to enter the ERROR > state. Huh? "everything possible including an internal reset" sure sounds like a device must issue a reset in some cases. If we keep with your idea to rarely use ERROR then I think all the mlx5 cases that would hit it are already so messed up that RESET will be mandatory. > I don't have a problem if the reset is redundant to one the user needs > to do anyway, I'd rather see any externally visible operation of the > device stopped ASAP. Why? It was doing all those things before it had an error, why should it suddenly stop now? What is this extra work helping? Remember if we choose to return an error code instead of ERROR the device is still running as it was, I don't see an benifit to making ERROR different here. ERROR just means the device has to be reset, we don't need the device to stop doing what it was doing. > The new and old state NDMA-like properties is also irrelevant, if a > device enters an ERROR state moving from RUNNING -> SAVING | RUNNING > it shouldn't continue manipulating memory and generating interrupts > in the background. I prefer a model where the device is allowed to keep doing whatever it was doing before it hit the error. 
You are pushing for a model where upon error we must force the device to stop. For this view it is why the old state matters, if it was previously allowed to DMA then it continues to be allowed to do DMA, etc. > > The mlx5 v1 with the FSM didn't have alot of these problems we are > > discussing. It didn't have precedence issues, it didn't have problems > > executing odd combinations it can't support, it worked and was simple > > to understand. > > And an audit of that driver during review found that it grossly failed > to meet the spirit of a "should" requirement. That isn't how I see things.. The v1 driver implemented the uAPI we all thought existed, was the FSM based uAPI the original patch authors intended, and implemented only the FSM arcs discussed in the header file comment. Your idea that this is not a FSM seems to be unique here. I think we've explored it to a reasonable conclusion to find it isn't working out. Let's stop please. Yishai can prepare a version with the FSM design back in including NDMA and we can look at it. > > In which case, why not formally specify the FSM now and avoid a driver > > pushing a defacto spec? > > It really only takes one driver implementing something like SAVING -> > SAVING | RUNNING and QEMU taking advantage of it as a supported > transition per the uAPI for mlx5 to be left out of the feature that > might provide. The uAPI spec is irrelavent, qemu can't just suddenly start doing things that don't work on supported drivers. What we've seen many times in the kernel is that uapis that don't have driver interoperability don't succeed. Jason
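One possible shape for the fixed FSM being argued for above: give the device_state combinations names and whitelist the permitted arcs. The particular arc set below is only an illustration drawn from the arcs sketched in the original header comment plus the abort paths discussed in this thread, not an agreed definition:

  #include <stdbool.h>

  enum mig_state {
  	S_STOP,		/* 0 */
  	S_RUNNING,	/* RUNNING */
  	S_SAVING,	/* SAVING, !RUNNING */
  	S_PRECOPY,	/* SAVING | RUNNING */
  	S_RESUMING,	/* RESUMING */
  	S_NR,
  };

  static const bool arc_ok[S_NR][S_NR] = {
  	[S_RUNNING][S_PRECOPY]  = true,	/* start pre-copy */
  	[S_RUNNING][S_SAVING]   = true,	/* stop-and-copy without pre-copy */
  	[S_RUNNING][S_RESUMING] = true,	/* incoming migration */
  	[S_RUNNING][S_STOP]     = true,
  	[S_PRECOPY][S_SAVING]   = true,	/* stop-and-copy */
  	[S_PRECOPY][S_RUNNING]  = true,	/* abort pre-copy */
  	[S_SAVING][S_STOP]      = true,
  	[S_SAVING][S_RUNNING]   = true,	/* abort on the source */
  	[S_RESUMING][S_RUNNING] = true,	/* resume complete */
  };

  static bool transition_allowed(enum mig_state from, enum mig_state to)
  {
  	return arc_ok[from][to];
  }

New states or arcs added later would then come with a new cap flag rather than being implied by the bitfield.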
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, January 5, 2022 12:10 AM > > On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote: > > > btw can you elaborate the DOS concern? The device is assigned > > to an user application, which has one thread (migration thread) > > blocked on another thread (vcpu thread) when transiting the > > device to NDMA state. What service outside of this application > > is denied here? > > The problem is the VM controls when the vPRI is responded and > migration cannot proceed until this is done. > > So the basic DOS is for a hostile VM to trigger a vPRI and then never > answer it. Even trivially done from userspace with a vSVA and > userfaultfd, for instance. > > This will block the hypervisor from ever migrating the VM in a very > poor way - it will just hang in the middle of a migration request. It's poor, but a 'hang' won't happen. The PCI spec defines a completion timeout for ATS translation requests. On timeout the device will abort the in-flight request and report an error back to software. > > Regardless of the complaints of the IP designers, this is a very poor > direction. > > Progress in the hypervisor should never be contingent on a guest VM. > Whether the said DoS is a real concern and how severe it is are usage specific things. Why would we want to hardcode such a restriction in an uAPI? Just give the choice to the admin (as long as this restriction is clearly communicated to userspace)... IMHO encouraging IP designers to work out better hardware shouldn't block supporting current hardware, which has limitations but also has value in scenarios where those limitations are tolerable. Thanks Kevin
> From: Tian, Kevin > Sent: Wednesday, January 5, 2022 9:59 AM > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, January 5, 2022 12:10 AM > > > > On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote: > > > > > btw can you elaborate the DOS concern? The device is assigned > > > to an user application, which has one thread (migration thread) > > > blocked on another thread (vcpu thread) when transiting the > > > device to NDMA state. What service outside of this application > > > is denied here? > > > > The problem is the VM controls when the vPRI is responded and > > migration cannot proceed until this is done. > > > > So the basic DOS is for a hostile VM to trigger a vPRI and then never > > answer it. Even trivially done from userspace with a vSVA and > > userfaultfd, for instance. > > > > This will block the hypervisor from ever migrating the VM in a very > > poor way - it will just hang in the middle of a migration request. > > it's poor but 'hang' won't happen. PCI spec defines completion timeout > for ATS translation request. If timeout the device will abort the in-fly > request and report error back to software. > > > > > Regardless of the complaints of the IP designers, this is a very poor > > direction. > > > > Progress in the hypervisor should never be contingent on a guest VM. > > > > Whether the said DOS is a real concern and how severe it is are usage > specific things. Why would we want to hardcode such restriction on > an uAPI? Just give the choice to the admin (as long as this restriction is > clearly communicated to userspace clearly)... > > IMHO encouraging IP designers to work out better hardware shouldn't > block supporting current hardware which has limitations but also values > in scenarios where those limitations are tolerable. > btw although the uapi is named 'migration', it's really about device state management. Whether the managed device state is further migrated and whether failure to migrate is severe are really not the kernel's business. It's just simple that changing device state could fail. and vPRI here is just one failure reason due to no response from the user after certain timeout (for a user-managed page table). Then it's Qemu which should document the restriction and provide options for the admin to decide whether to expose vPRI vs. migration based on specific usage requirement. The choices could be vPRI-off/ migration-on, vPRI-on/migration-off, or enabling both (migration failure is tolerable or no 'hostile' VM in the setup)... Thanks Kevin
On Wed, Jan 05, 2022 at 01:59:31AM +0000, Tian, Kevin wrote: > > This will block the hypervisor from ever migrating the VM in a very > > poor way - it will just hang in the middle of a migration request. > > it's poor but 'hang' won't happen. PCI spec defines completion timeout > for ATS translation request. If timeout the device will abort the in-fly > request and report error back to software. The PRI timeouts have to be long enough to handle swapping back from disk, so the 'hang' will be a fair amount of time. > > Regardless of the complaints of the IP designers, this is a very poor > > direction. > > > > Progress in the hypervisor should never be contingent on a guest VM. > > > > Whether the said DOS is a real concern and how severe it is are usage > specific things. Why would we want to hardcode such restriction on > an uAPI? Just give the choice to the admin (as long as this restriction is > clearly communicated to userspace clearly)... IMHO it is not just DoS: PRI can become dependent on IO which requires DMA to complete. You could quickly get yourself into a deadlock situation where the hypervisor has disabled DMA activities of other devices and the vPRI simply cannot be completed. I just don't see how this scheme is generally workable without a lot of limitations. While I do agree we should support the HW that exists, we should recognize this is not a long-term workable design and treat it as such. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, January 5, 2022 8:46 PM > > On Wed, Jan 05, 2022 at 01:59:31AM +0000, Tian, Kevin wrote: > > > > This will block the hypervisor from ever migrating the VM in a very > > > poor way - it will just hang in the middle of a migration request. > > > > it's poor but 'hang' won't happen. PCI spec defines completion timeout > > for ATS translation request. If timeout the device will abort the in-fly > > request and report error back to software. > > The PRI time outs have to be long enough to handle swap back from > disk, so 'hang' will be a fair amount of time.. This reminds me of one interesting point. Putting PRI aside, the time to drain in-flight requests is undefined. It depends on how many pending requests must be waited for before the draining command completes on the device. This is IP specific (e.g. whether it supports preemption) and also guest specific (e.g. whether it's actively submitting work). So even without hostile attempts the draining time may exceed what a user tolerates in live migration. This suggests a software timeout mechanism might be necessary when transitioning to the NDMA state, with the timeout value optionally configurable by the user. On timeout, fail the state transition request. And once such a mechanism is in place, PRI is automatically covered, as it is just one implicit reason which may increase the draining time. > > > > Regardless of the complaints of the IP designers, this is a very poor > > > direction. > > > > > > Progress in the hypervisor should never be contingent on a guest VM. > > > > > > > Whether the said DOS is a real concern and how severe it is are usage > > specific things. Why would we want to hardcode such restriction on > > an uAPI? Just give the choice to the admin (as long as this restriction is > > clearly communicated to userspace clearly)... > > IMHO it is not just DOS, PRI can become dependent on IO which requires > DMA to complete. > > You could quickly get yourself into a deadlock situation where the > hypervisor has disabled DMA activities of other devices and the vPRI > simply cannot be completed. How is it related to PRI, which is only about address translation? Instead, the above is a general p2p problem for any draining operation. How to solve it needs to be defined clearly for this NDMA state (which I suppose is being discussed between you and Alex; I still need time to catch up). > > I just don't see how this scheme is generally workable without a lot > of limitations. > > While I do agree we should support the HW that exists, we should > recognize this is not a long term workable design and treat it as > such. > Definitely agree with this point. We software people should continue influencing IP designers toward a long-term software-friendly design, and also bear in mind that it takes time...
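A rough sketch of the timeout idea above, assuming hypothetical quiesce hooks (hw_request_quiesce()/hw_is_quiesced() are stand-ins, not an existing interface); on timeout the driver would fail the write(2) that requested the NDMA transition:

  #include <errno.h>
  #include <stdbool.h>
  #include <time.h>
  #include <unistd.h>

  /* hypothetical hardware hooks; trivial stubs so the sketch stands alone */
  static void hw_request_quiesce(void *priv) { (void)priv; }
  static bool hw_is_quiesced(void *priv) { (void)priv; return true; }

  static long elapsed_ms(const struct timespec *start)
  {
  	struct timespec now;

  	clock_gettime(CLOCK_MONOTONIC, &now);
  	return (now.tv_sec - start->tv_sec) * 1000 +
  	       (now.tv_nsec - start->tv_nsec) / 1000000;
  }

  /* fail with -ETIMEDOUT if draining in-flight requests exceeds a
   * (possibly user-configurable) budget */
  static int enter_ndma(void *priv, long timeout_ms)
  {
  	struct timespec start;

  	clock_gettime(CLOCK_MONOTONIC, &start);
  	hw_request_quiesce(priv);

  	while (!hw_is_quiesced(priv)) {
  		if (elapsed_ms(&start) >= timeout_ms)
  			return -ETIMEDOUT;	/* fail the state transition */
  		usleep(1000);
  	}
  	return 0;
  }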
On Thu, Jan 06, 2022 at 06:32:57AM +0000, Tian, Kevin wrote: > Putting PRI aside the time to drain in-fly requests is undefined. It depends > on how many pending requests to be waited for before completing the > draining command on the device. This is IP specific (e.g. whether supports > preemption) and also guest specific (e.g. whether it's actively submitting > workload). You are assuming a model where NDMA has to be implemented by pushing a command, but I would say that is very poor IP design. A device is fully in self-control of its own DMA and it should simply stop it quickly when doing NDMA. Devices that are poorly designed here will have very long migration downtime latencies and people simply won't want to use them. > > > Whether the said DOS is a real concern and how severe it is are usage > > > specific things. Why would we want to hardcode such restriction on > > > an uAPI? Just give the choice to the admin (as long as this restriction is > > > clearly communicated to userspace clearly)... > > > > IMHO it is not just DOS, PRI can become dependent on IO which requires > > DMA to complete. > > > > You could quickly get yourself into a deadlock situation where the > > hypervisor has disabled DMA activities of other devices and the vPRI > > simply cannot be completed. > > How is it related to PRI which is only about address translation? In something like SVA PRI can request a page which is not present and the OS has to do DMA to load the page back from storage to make it present and respond to the translation request. The DMA is not related to the device doing the PRI in the first place, but if the hypervisor has blocked the DMA already for some other reason (perhaps that device is also doing PRI) then it all will deadlock. > Instead, above is a general p2p problem for any draining operation. Nothing to do with p2p. Jason
On Tue, 4 Jan 2022 16:28:34 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Dec 20, 2021 at 03:26:23PM -0700, Alex Williamson wrote: > > > > It also raises the question that it seems not well defined what the > > > sequence: > > > > > > SAVING -> SAVING | RUNNING > > > > > > Even does to the migration window? > > > > > > Nor can I imagine what mlx5 could do beyond fail this or corrupt the > > > migration.. > > > > I think this comes down to the robustness of the migration driver. The > > migration data window and control of how userspace is to interact with > > it is essentially meant to allow the migration driver to implement its > > own transport protocol. In the case of mlx5 where it expects only to > > apply the received migration data on the RESUMING -> RUNNING > > transition, a "short" data segment might be reserved for providing > > sequencing information. Each time the device enters SAVING | !RUNNING > > the driver might begin by presenting a new sequence header. On the > > target, a new sequence header would cause any previously received > > migration data to be discarded. A similar header would also be > > suggested to validate the migration data stream is appropriate for the > > target device. > > Honestly, I have no interest in implementing something so complicated > for what should be a simple operation. We have no use case for this, > no desire to test it, it is just pure kernel cruft and complexity to > do this kind of extra work. Migration is not a simple operation, we have a device with a host kernel driver exporting and importing device state through userspace. It's a playground for potential exploit vectors. I assume this also means that you're not performing any sort of validation that the incoming data is from a compatible device or providing any versioning accommodations in the data stream, all features that would generally be considered best practices. > I think it is very telling that we are, what, 10 weeks into this now, > we have seen two HW drivers posted, and NOTHING implements like you > are imagining here. I doubt the closed drivers do any better. > > Let's not make up busy work that can't be strongly justified! I think I'm portraying the uAPI as it was designed and envisioned to work. Of course I also have doubts whether the closed drivers perform any sort of validation or consistency checking, and we can only guess what sort of attack vectors might exist as a result. > That is substantially what I see as wrong with this whole line of > thinking that the device_state must be independent bits, not a > constrained FSM. > > We are actively failing, again and again, to tell if drivers are > implementing this mulit-bit vision correctly, or even specify properly > how it should work in an implementable way. > > > > IMHO, we should be simplifing this before it becomes permanent API, > > > not trying to force it to work. > > > > I agree, this is our opportunity to simplify before we're committed, > > but I don't see how we can throw out perfectly valid transitions like > > SAVING -> SAVING | RUNNING just because the driver hasn't accounted for > > managing data in the data stream. > > Don't see? What do you mean? We showed how to do this exactly in the > v1. > > We define the meaning of device_state to be actual FSM and write out > the allowed FSM arcs and then properly describe them. > > This is what everyone else though this was already. > > AFAICT you are the only person to view this as a bunch of bits. 
It is > why the original header file comment gave names to each of the states > and roughtly sketches out legal state transition arcs with the odd > ascii art. > > So I want to stop trying to make the bunch of bits idea work. Let's > make it a FSM, let's define exactly the legal arcs, and then define > the behaviors in each FSM state. It is easy to understand and we have > a hope to get inter-operable implementations. > > All the driver postings so far are demonstrating they don't get > oddball transition arcs correct, and we are clearly not able to find > these things during review. > > And now you are asking for alot of extra work and complications in the > driver to support arcs that will never be used - that is really too > far, sorry. It's the uAPI as I understand it. If you want a new uAPI, propose one. I'm not willing to accept a driver that partially implements the uAPI with an addendum document vaguely hinting at the ways it might be limited, with the purpose of subverting the written uAPI by a de facto standard. > > > Most of things I brought up in this post are resolved by the forced > > > FSM. > > > > Until userspace tries to do something different than exactly what it > > does today, and then what? > > It can't. We define the API to be exactly the permited arcs and no > others. That is what simplify means. IOW, define the uAPI based on what happens to be the current QEMU implementation and limitations. That's exactly what we were trying to avoid in the uAPI design. > If we need to add new FSM arcs and new states later then their support > can be exposed via a cap flag. > > This is much better than trying to define all arcs as legal, only > testing/using a small subset and hoping the somehow in the future this > results in extensible interoperability. A proposal of which state transitions you want to keep would be useful here. Let's look at all the possibilities:

{} -> {RUNNING}
   -> {SAVING}
   -> {RESUMING}
   -> {RUNNING|SAVING}

{RUNNING} -> {} (a)*
          -> {SAVING} (a)
          -> {RESUMING} (a)
          -> {RUNNING|SAVING} (a)

{SAVING} -> {} (a)*
         -> {RUNNING} (b)
         -> {RESUMING}
         -> {RUNNING|SAVING}

{RESUMING} -> {}
           -> {RUNNING} (a)
           -> {SAVING}
           -> {RUNNING|SAVING}

{RUNNING|SAVING} -> {}
                 -> {RUNNING} (b)
                 -> {SAVING} (a)
                 -> {RESUMING}

We have 20 possible transitions. I've marked those available via the "odd ascii art" diagram as (a), that's 7 transitions. We could essentially remove the NULL state as unreachable as there seems little value in the 2 transitions marked (a)* if we look only at migration; that would bring us down to 5 of 12 possible transitions. We need to give userspace an abort path though, so we minimally need the 2 transitions marked (b) (7/12). So now we can discuss the remaining 5 transitions:

{SAVING} -> {RESUMING}
    If not supported, user can achieve this via:
        {SAVING}->{RUNNING}->{RESUMING}
        {SAVING}-RESET->{RUNNING}->{RESUMING}
    It would likely be dis-recommended to return a device to {RUNNING}
    for this use case, so the latter would be preferred.
    Potential use case: ping-pong migration

{SAVING} -> {RUNNING|SAVING}
    If not supported, user can achieve this via:
        {SAVING}->{RUNNING}->{RUNNING|SAVING}
    Potential use case: downtime exceeded, avoid full migration restart
    (likely not achieved with the alternative flow).
{RESUMING} -> {SAVING} If not supported: {RESUMING}->{RUNNING}->{SAVING} {RESUMING}-RESET->{RUNNING}->{SAVING} Potential use case: validate migration data in = data out (also cannot be achieved with alternative flow, as device passes through RUNNING) {RESUMING} -> {RUNNING|SAVING} If not supported: {RESUMING}->{RUNNING}->{RUNNING|SAVING} Potential use case: live ping-pong migration (alternative flow is likely sufficient) {RUNNING|SAVING} -> {RESUMING} If not supported: {RUNNING|SAVING}->{RUNNING}->{RESUMING} {RUNNING|SAVING}-RESET->{RUNNING}->{RESUMING} (again former is likely dis-recommended) Potential use case: ??? So what's the proposal? Do we ditch all of these? Some of these? If drivers follow the previously provided pseudo algorithm with the requirement that they cannot pass through an invalid state, we need to formally address whether the NULL state is invalid or just not reachable by the user. > > > Another view is that staying in a useless state is also pointless and > > > we may as well return ERROR anyhow. Eg if exiting RESUMING failed > > > there is no other action to take besides RESET, so why didn't we > > > return ERROR to tell this directly to userspace? > > > > And then the last packet arrives, gets written to the device that's > > still in RESUMING, and now can transition to RUNNING. > > Huh? If the device indicated error during RESUMING userspace should > probably stop shoving packets into it or it will possibly corrupt the > migration stream. If a {RESUMING}->{RUNNING} transition fails and the device remains in {RESUMING}, it should be valid for userspace to push data to it. If the driver wants to indicate the transition attempt failed AND it won't accept continuing data or a re-initialized data stream, it probably should put the device into {ERROR} instead. > > But I think we're really only after that p2p behavior and we've > > discussed that disabling p2p mappings in the VM would be a sufficient > > condition to support multiple devices without NDMA support. I think > > that means DMA to memory is fine and DMA related to MSI is fine... but > > how does a device know which DMA is memory and which DMA is another > > device? > > The device doesn't know if a particular DMA is P2P or not. This is why > the device action is called 'NO DMA'. > > MSI is fine to be left unspecified because we currently virtualize all > the MSI register writes and it is impossible for a hostile guest to > corrupt them to point the address to anything but the interrupt > controller. If a MSI triggers or not in NDMA doesn't practically > matter. The vector table is directly accessible via the region mmap. It previously was not, but that becomes a problem with 64k page sizes, and even some poorly designed devices on 4k systems when they don't honor the PCI spec recommended alignments. But I think that's beside the point, if the user has vectors pointed at memory or other devices, they've rather already broken their contract for using the device. But I think you've identified two classes of DMA, MSI and everything else. The device can assume that an MSI is special and not included in NDMA, but it can't know whether other arbitrary DMAs are p2p or memory. If we define that the minimum requirement for multi-device migration is to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is actually significantly more restrictive while it's enabled. > It only starts to matter someday if we get the world Thomas is > thinking about where the guest can directly program the MSI registers. 
> > > > Even if it did work reliably, the requirement is userspace must issue > > > RESET to exit ERROR and if we say the driver has to issue reset to > > > enter ERROR we are just doing a pointless double RESET. > > > > Please read what I wrote: > > > > This variant is specifically chosen due to the !RUNNING state of > > the device as the migration driver should do everything possible, > > including an internal reset of the device, to ensure that the > > device is fully stopped in this state. > > > > That does not say that a driver must issue a reset to enter the ERROR > > state. > > Huh? "everything possible including an internal reset" sure sounds > like a device must issue a reset in some cases. If we keep with your > idea to rarely use ERROR then I think all the mlx5 cases that would > hit it are already so messed up that RESET will be mandatory. > > > I don't have a problem if the reset is redundant to one the user needs > > to do anyway, I'd rather see any externally visible operation of the > > device stopped ASAP. > > Why? It was doing all those things before it had an error, why > should it suddenly stop now? What is this extra work helping? > > Remember if we choose to return an error code instead of ERROR the > device is still running as it was, I don't see an benifit to making > ERROR different here. > > ERROR just means the device has to be reset, we don't need the device > to stop doing what it was doing. Should a device in the ERROR state continue operation or be in a quiesced state? It seems obvious to me that since the ERROR state is essentially undefined, the device should cease operations and be quiesced by the driver. If the device is continuing to operate in the previous state, why would the driver place the device in the ERROR state? It should have returned an errno and left the device in the previous state. > > The new and old state NDMA-like properties is also irrelevant, if a > > device enters an ERROR state moving from RUNNING -> SAVING | RUNNING > > it shouldn't continue manipulating memory and generating interrupts > > in the background. > > I prefer a model where the device is allowed to keep doing whatever it > was doing before it hit the error. You are pushing for a model where > upon error we must force the device to stop. If the device continues operating in the previous mode then it shouldn't enter the ERROR state, it should return errno and re-reading device_state should indicate it's in the previous state. > For this view it is why the old state matters, if it was previously > allowed to DMA then it continues to be allowed to do DMA, etc. If it's still running normally, it shouldn't have been reported in the ERROR state. > > > The mlx5 v1 with the FSM didn't have alot of these problems we are > > > discussing. It didn't have precedence issues, it didn't have problems > > > executing odd combinations it can't support, it worked and was simple > > > to understand. > > > > And an audit of that driver during review found that it grossly failed > > to meet the spirit of a "should" requirement. > > That isn't how I see things.. The v1 driver implemented the uAPI we > all thought existed, was the FSM based uAPI the original patch authors > intended, and implemented only the FSM arcs discussed in the header > file comment. > > Your idea that this is not a FSM seems to be unique here. I think > we've explored it to a reasonable conclusion to find it isn't working > out. Let's stop please. 
> > Yishai can prepare a version with the FSM design back in including > NDMA and we can look at it. Sorry, but I was actually there and participating in original development of the uAPI. If you'd like to propose a different uAPI do so, but again, I'm not going to accept a driver specifically looking to compromise what I understand to be the intent of the current specification in order to create a de facto standard outside of that specification. Thanks, Alex
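For reference, the kind of per-session sequencing and compatibility header Alex sketches earlier in this mail (a "short" data segment presented each time the device enters SAVING | !RUNNING, plus a check that the stream matches the target device) could look something like the struct below. This is purely an illustrative assumption: none of the posted drivers define such a format, the field names are invented here, and the uAPI deliberately leaves the contents of the data window opaque to userspace.

#include <linux/types.h>

/*
 * Hypothetical, driver-private header a migration driver could emit at
 * the start of each SAVING sequence.  Illustrative only.
 */
struct example_mig_stream_header {
	__le32 magic;		/* identifies this driver's stream format */
	__le32 version;		/* device/firmware compatibility check on resume */
	__le64 sequence;	/* bumped on each SAVING | !RUNNING entry; a new
				 * value tells the target to discard previously
				 * received data for this device */
	__le64 payload_len;	/* expected length of the data that follows */
};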
On Thu, Jan 06, 2022 at 11:17:18AM -0700, Alex Williamson wrote: > > Honestly, I have no interest in implementing something so complicated > > for what should be a simple operation. We have no use case for this, > > no desire to test it, it is just pure kernel cruft and complexity to > > do this kind of extra work. > > Migration is not a simple operation, we have a device with a host > kernel driver exporting and importing device state through userspace. > It's a playground for potential exploit vectors. I assume this also > means that you're not performing any sort of validation that the > incoming data is from a compatible device or providing any versioning > accommodations in the data stream, all features that would generally be > considered best practices. Our architecture has FW code performing all validation and versioning. The driver is a dumb pipe and works on a simple snapshot basis. The snapshot basis is the same approach the huawei driver and the 3 other drivers we have in internal development use. The idea of forcing every snapshot style driver into some complicated streaming scheme is bad, IMHO. We should be having code sharing here, see below please. > It's the uAPI as I understand it. If you want a new uAPI, propose > one. Oh. I thought that was effectively what we are doing. So let's be clear about it. We are proposing a 'new API'. It will be binary and backwards compatible with current qemu. > > It can't. We define the API to be exactly the permited arcs and no > > others. That is what simplify means. > > IOW, define the uAPI based on what happens to be the current QEMU > implementation and limitations. That's exactly what we were trying to > avoid in the uAPI design. I don't like the characterization. Selecting only the arcs that are useful, and well defined just happens to overlap with most of the the arcs that qemu uses exactly because qemu is using the useful subset. The stuff beyond that is possibly veering into over-engineering. It was OK when it looked like those behaviors would fall out naturally, eg via precedence/etc, but now that we fully understand that implementing the unused part has a real implementation cost, it is not appealing. > > If we need to add new FSM arcs and new states later then their support > > can be exposed via a cap flag. > > > > This is much better than trying to define all arcs as legal, only > > testing/using a small subset and hoping the somehow in the future this > > results in extensible interoperability. > > A proposal of which states transitions you want to keep would be useful > here. Yishai is working it out currently in code, but you can go back to the v1 to see where we started from on this thinking. It looks like we end up with about 8 possible valid FSM states out of the 16 combinatins of bits and your analysis looks about about right on the arcs, but NDMA will be included. We were also thinking to retain STOP. SAVING -> STOP could possibly serve as the abort path to avoid a double action, and some of the use cases you ID'd below are achievable if STOP remains. > We have 20 possible transitions. I've marked those available via the > "odd ascii art" diagram as (a), that's 7 transitions. We could > essentially remove the NULL state as unreachable as there seems little > value in the 2 transitions marked (a)* if we look only at migration, > that would bring us down to 5 of 12 possible transitions. We need to > give userspace an abort path though, so we minimally need the 2 > transitions marked (b) (7/12). 
> So now we can discuss the remaining 5 transitions: > > {SAVING} -> {RESUMING} > If not supported, user can achieve this via: > {SAVING}->{RUNNING}->{RESUMING} > {SAVING}-RESET->{RUNNING}->{RESUMING} This can be: SAVING -> STOP -> RESUMING > It would likely be dis-recommended to return a device to > {RUNNING} for this use case, so the latter would be preferred. Yes, I would drop this, there is no advantage compared to manually going to STOP, or through RESET. > {SAVING} -> {RUNNING|SAVING} > If not supported, user can achieve this via: > {SAVING}->{RUNNING}->{RUNNING|SAVING} > > Potential use case: downtime exceeded, avoid full migration > restart (likely not achieved with the alternative flow). I'm sympathetic to your use case, but this is not natural, or useful for the snapshot/non-precopy style drivers to implement. Without support for PRECOPY there is no functional advantage to return to RUNNING|SAVING. It is much better if qemu signals the far side to abort the migration, move the local device to RUNNING and the far device through RESET -> RESUMING to start all over again. This becomes common code for all the snapshot drivers to rely on so we don't have to implement this recovery logic with tricks in each driver inside the migration stream. Shared code is better. If someone does think this usecase is important they can add a cap flag and implement this arc along with the qemu support/etc for that driver. To be meaningful it would have to be along with a device that implements PRECOPY with dirty tracking, and a streaming post copy. So I would discard this arc, at least from the mandatory set. > {RESUMING} -> {SAVING} > If not supported: > {RESUMING}->{RUNNING}->{SAVING} > {RESUMING}-RESET->{RUNNING}->{SAVING} The simplified version would be: RESUMING -> STOP -> SAVING > Potential use case: validate migration data in = data out (also > cannot be achieved with alternative flow, as device passes > through RUNNING) No sure mlx5 will guarantee idempotent migration data. It looks like the Huawei device will be able to do this, and some of the other simplish device types we are working on could possibly too. So I'd drop this, STOP is good enough. > {RESUMING} -> {RUNNING|SAVING} > If not supported: > {RESUMING}->{RUNNING}->{RUNNING|SAVING} > > Potential use case: live ping-pong migration (alternative flow > is likely sufficient) STOP can be used here too, I would drop this since it is redundant. > {RUNNING|SAVING} -> {RESUMING} > If not supported: > {RUNNING|SAVING}->{RUNNING}->{RESUMING} > {RUNNING|SAVING}-RESET->{RUNNING}->{RESUMING} > (again former is likely dis-recommended) > > Potential use case: ??? Yep, discard. > So what's the proposal? Do we ditch all of these? Some of these? Yes, all of them. Including the NDMA states we add 2 new FSM states and 7 new arcs, IIRC: SAVING | RUNNING -> SAVING | RUNNING | NDMA SAVING | RUNNING | NDMA -> SAVING SAVING | RUNNING | NDMA -> RUNNING SAVING | RUNNING | NDMA -> STOP and RESUMING -> RUNNING | NDMA RUNNING | NDMA -> RUNNING RUNNING | NDMA -> STOP > drivers follow the previously provided pseudo algorithm with the > requirement that they cannot pass through an invalid state, we need to > formally address whether the NULL state is invalid or just not > reachable by the user. What is a NULL state? 
We have defined (from memory, forgive me I don't have access to Yishai's latest code at the moment) 8 formal FSM states: RUNNING PRECOPY PRECOPY_NDMA STOP_COPY STOP RESUMING RESUMING_NDMA ERROR (perhaps MUST_RESET) Which matches the state labels already given in the header comment. Mapped onto the 4 device_state bits in an 'ABI compatible with current qemu' way as the current header comment describes. Then the list of allowed arcs between them close to what you have suggested. IMHO this substantially conforms to what is written down in the header comment. Consistently using the state labels in code and documentation then eliminating the named bits will conclude the change in specification from bits to a FSM. Yishai, we should also recast the cap discovery as some ioctl to query directly if the driver supports an arc. Eg we can discover if NDMA is supported by querying support for PRECOPY -> PRECOPY_NDMA, and if your timeout use case is supported by querying support for STOP_COPY -> PRECOPY. Any future new states/arcs will cleanly fit into this discovery scheme. > > Huh? If the device indicated error during RESUMING userspace should > > probably stop shoving packets into it or it will possibly corrupt the > > migration stream. > > If a {RESUMING}->{RUNNING} transition fails and the device remains in > {RESUMING}, it should be valid for userspace to push data to it. IMHO this is not useful. If the userspace did RESUMING -> RUNNING then that is 'end of stream' and continuning to push data is nonsensical. It is one of the cases where design wise it is much clearer to just say any exit from RESUMING either succeeds or userspace needs to reset the device, eg ERROR. Again, my perspective here is multi-device interoperability and I just don't think we can rely on a consistent device behavior in such a strange corner case. > The vector table is directly accessible via the region mmap. It > previously was not, but that becomes a problem with 64k page sizes, and > even some poorly designed devices on 4k systems when they don't honor > the PCI spec recommended alignments. Yes, I forgot about that. > But I think that's beside the point, if the user has vectors pointed > at memory or other devices, they've rather already broken their > contract for using the device. Yes, so long as it doesn't allow to compromise the hypervisor integrity. > But I think you've identified two classes of DMA, MSI and everything > else. The device can assume that an MSI is special and not included in > NDMA, but it can't know whether other arbitrary DMAs are p2p or memory. > If we define that the minimum requirement for multi-device migration is > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is > actually significantly more restrictive while it's enabled. You are right, but in any practical physical device NDMA will be implemented by halting all DMAs, not just p2p ones. I don't mind what we label this, so long as we understand that halting all DMA is a valid device implementation. Actually, having reflected on this now, one of the things on my list to fix in iommufd, is that mdevs can get access to P2P pages at all. This is currently buggy as-is because they cannot DMA map these things, touch them with the CPU and kmap, or do, really, anything with them. So we should be blocking mdev's from accessing P2P, and in that case a mdev driver can quite rightly say it doesn't support P2P at all and safely NOP NDMA if NDMA is defined to only impact P2P transactions. 
Perhaps that answers the question for the S390 drivers as well. > Should a device in the ERROR state continue operation or be in a > quiesced state? It seems obvious to me that since the ERROR state is > essentially undefined, the device should cease operations and be > quiesced by the driver. If the device is continuing to operate in the > previous state, why would the driver place the device in the ERROR > state? It should have returned an errno and left the device in the > previous state. What we found while implementing is the use of ERROR arises when the driver has been forced to do multiple actions and is unable to fully unwind the state. So the device_state is not fully the original state or fully the target state. Thus it is ERROR. The additional requirement that the driver do another action to quiesce the device only introduces the possiblity for triple failure. Since it doesn't bring any value to userspace, I prefer we not define things in this complicated way. > > I prefer a model where the device is allowed to keep doing whatever it > > was doing before it hit the error. You are pushing for a model where > > upon error we must force the device to stop. > > If the device continues operating in the previous mode then it > shouldn't enter the ERROR state, it should return errno and re-reading > device_state should indicate it's in the previous state. Continues operating in the new/previous state is an 'upper bound' on what it is allowed to do. For instance if we are going from RUNNING -> SAVING mlx5 must transit through 'RUNNING|NDMA' as part of its internal design. If it then fails it can't continue to pretend it is RUNNING when it is doing RUNNING|NDMA and a double failure means we can't restore RUNNING. Allowing ERROR to continue any behavior allowed by RUNNING allows the device to be left in RUNNING|NDMA and eliminates the possiblity of triple failure in a transition to 'STOP'. Indeed we can simplify the driver by removing failure recoveries for cases that have a double fault and simply go to ERROR. This is not so viable if ERROR itself requires an action to enter it as we get back to having to deal with double and triple faults. Yishai said he should have something to look at next week. We'll take a stab at rewording the docs you provided to reflect a FSM approach too. Jason
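To make the FSM framing concrete, below is a rough sketch of the kind of state/arc contract being argued for, using the labels from this mail and a partial arc set drawn from the transitions discussed above. Everything named example_* is invented for illustration; the real encoding onto the device_state bits and the final arc list are still being worked out in Yishai's code, and the resuming-side NDMA state is omitted because the thread is not yet consistent about how it is named.

#include <stdbool.h>

/*
 * Illustrative only: state labels floated in this thread, with the
 * device_state bit combination each one is meant to describe.
 */
enum example_mig_state {
	EX_STOP,		/* no bits set */
	EX_RUNNING,		/* RUNNING */
	EX_PRECOPY,		/* SAVING | RUNNING */
	EX_PRECOPY_NDMA,	/* SAVING | RUNNING | NDMA */
	EX_STOP_COPY,		/* SAVING */
	EX_RESUMING,		/* RESUMING */
	EX_ERROR,		/* only VFIO_DEVICE_RESET recovers the device */
	EX_NR_STATES,
};

/*
 * Partial arc table sketched from the transitions discussed above; not
 * the final proposal.
 */
static const bool example_arc_ok[EX_NR_STATES][EX_NR_STATES] = {
	[EX_RUNNING]      = { [EX_PRECOPY] = true, [EX_STOP_COPY] = true,
			      [EX_RESUMING] = true, [EX_STOP] = true },
	[EX_PRECOPY]      = { [EX_PRECOPY_NDMA] = true, [EX_STOP_COPY] = true,
			      [EX_RUNNING] = true },
	[EX_PRECOPY_NDMA] = { [EX_STOP_COPY] = true, [EX_RUNNING] = true,
			      [EX_STOP] = true },
	[EX_STOP_COPY]    = { [EX_STOP] = true, [EX_RUNNING] = true },
	[EX_STOP]         = { [EX_RESUMING] = true, [EX_STOP_COPY] = true,
			      [EX_RUNNING] = true },
	[EX_RESUMING]     = { [EX_STOP] = true, [EX_RUNNING] = true },
	/* [EX_ERROR] row intentionally empty: only reset recovers. */
};

static bool example_transition_allowed(enum example_mig_state cur,
				       enum example_mig_state next)
{
	return example_arc_ok[cur][next];
}

The point of the table-driven form is that the driver rejects any device_state write that is not a listed arc, instead of trying to give every bit combination a meaning.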
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, January 6, 2022 11:42 PM > > On Thu, Jan 06, 2022 at 06:32:57AM +0000, Tian, Kevin wrote: > > > Putting PRI aside the time to drain in-fly requests is undefined. It depends > > on how many pending requests to be waited for before completing the > > draining command on the device. This is IP specific (e.g. whether supports > > preemption) and also guest specific (e.g. whether it's actively submitting > > workload). > > You are assuming a model where NDMA has to be implemented by pushing a > command, but I would say that is very poor IP design. I was not assuming a single model. I just wanted to figure out how this model can be supported in this design, given I saw many examples of it. > > A device is fully in self-control of its own DMA and it should simply > stop it quickly when doing NDMA. simple on some classes, but definitely not so simple on others. > > Devices that are poorly designed here will have very long migration > downtime latencies and people simply won't want to use them. Different usages have different latency requirement. Do we just want people to decide whether to manage state for a device by measurement? There is always difference between an experimental environment and final production environment. A timeout mechanism is more robust as the last resort than breaking SLA in case of any surprise in the production environment. > > > > > Whether the said DOS is a real concern and how severe it is are usage > > > > specific things. Why would we want to hardcode such restriction on > > > > an uAPI? Just give the choice to the admin (as long as this restriction is > > > > clearly communicated to userspace clearly)... > > > > > > IMHO it is not just DOS, PRI can become dependent on IO which requires > > > DMA to complete. > > > > > > You could quickly get yourself into a deadlock situation where the > > > hypervisor has disabled DMA activities of other devices and the vPRI > > > simply cannot be completed. > > > > How is it related to PRI which is only about address translation? > > In something like SVA PRI can request a page which is not present and > the OS has to do DMA to load the page back from storage to make it > present and respond to the translation request. > > The DMA is not related to the device doing the PRI in the first place, > but if the hypervisor has blocked the DMA already for some other > reason (perhaps that device is also doing PRI) then it all will > deadlock. yes, but with timeout the NDMA path doesn't care about whether a PRI is not responded (due to hostile VM or such block-dma case). It simply fails the state transition request when timeout is triggered. Thanks Kevin
On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote: > > Devices that are poorly designed here will have very long migration > > downtime latencies and people simply won't want to use them. > > Different usages have different latency requirement. Do we just want > people to decide whether to manage state for a device by > measurement? It doesn't seem unreasonable to allow userspace to set a max timer for NDMA for SLA purposes on devices that have unbounded NDMA times. It would probably be some new optional ioctl for devices that can implement it. However, this basically gives up on the idea that a VM can be migrated as any migration can time out and fail under this philosophy. I think that is still very poor. Optional migration really can't be a sane path forward. Jason
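If an optional NDMA time bound like this were added, one shape it could take is sketched below. This is entirely hypothetical, no such ioctl or structure exists in the posted series; it only illustrates the SLA idea: userspace supplies a bound, and a device_state write requesting NDMA fails with an errno (leaving the previous state intact) if the device cannot quiesce in time.

#include <linux/types.h>

/*
 * Hypothetical payload for a driver-optional "NDMA timeout" control.
 * Not part of any posted uAPI; shown only to illustrate the discussion.
 */
struct example_vfio_ndma_timeout {
	__u32 argsz;
	__u32 flags;
	__u64 timeout_us;	/* 0 = driver default; otherwise an upper bound
				 * on the time spent quiescing DMA for NDMA */
};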
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, January 7, 2022 8:30 AM > > On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote: > > > Devices that are poorly designed here will have very long migration > > > downtime latencies and people simply won't want to use them. > > > > Different usages have different latency requirement. Do we just want > > people to decide whether to manage state for a device by > > measurement? > > It doesn't seem unreasonable to allow userspace to set max timer for > NDMA for SLA purposes on devices that have unbounded NDMA times. It > would probably be some new optional ioctl for devices that can > implement it. Yes, that's my point. > > However, this basically gives up on the idea that a VM can be migrated > as any migration can timeout and fail under this philosophy. I think > that is still very poor. > > Optional migration really can't be sane path forward. > How is it different from the scenario where the guest generates a very high dirty rate so the precopy phase can never converge to a pre-defined threshold then abort the migration after certain timeout? IMHO live migration is always a try-and-fail flavor. A previous migration failure doesn't prevent the orchestration stack to retry at a later point. In the meantime people do explore various optimizations to increase the success rate. Having the device to stop DMA quickly is one such optimization from the hardware side. Thanks Kevin
Hi, Alex, Thanks for cleaning up this part, which is very helpful! > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, December 10, 2021 7:34 AM > > + * > + * The device_state field defines the following bitfield use: > + * > + * - Bit 0 (RUNNING) [REQUIRED]: > + * - Setting this bit indicates the device is fully operational, the > + * device may generate interrupts, DMA, respond to MMIO, all vfio > + * device regions are functional, and the device may advance its > + * internal state. The default device_state must indicate the device > + * in exclusively the RUNNING state, with no other bits in this field > + * set. > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the > + * device. The device must not generate interrupts, DMA, or advance > + * its internal state. I'm curious about what it means for the mediated device. I suppose this 'must not' clause is from user p.o.v i.e. no event delivered to the user, no DMA to user memory and no user visible change on mdev state. Physically the device resource backing the mdev may still generate interrupt/DMA to the host according to the mediation policy. Is this understanding correct? > +* The user should take steps to restrict access > + * to vfio device regions other than the migration region while the > + * device is !RUNNING or risk corruption of the device migration data > + * stream. The device and kernel migration driver must accept and > + * respond to interaction to support external subsystems in the > + * !RUNNING state, for example PCI MSI-X and PCI config space. and also respond to mmio access if some state is saved via reading mmio? > + * Failure by the user to restrict device access while !RUNNING must > + * not result in error conditions outside the user context (ex. > + * host system faults). > + * - Bit 1 (SAVING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data > + * window and associated fields within vfio_device_migration_info for > + * capturing the migration data stream for the device. The migration > + * driver may perform actions such as enabling dirty logging of device > + * state with this bit. The SAVING bit is mutually exclusive with the > + * RESUMING bit defined below. > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > + * data window and indicates the completion or termination of the > + * migration data stream for the device. > + * - Bit 2 (RESUMING) [REQUIRED]: > + * - Setting this bit enables and initializes the migration region data > + * window and associated fields within vfio_device_migration_info for > + * restoring the device from a migration data stream captured from a > + * SAVING session with a compatible device. The migration driver may > + * perform internal device resets as necessary to reinitialize the > + * internal device state for the incoming migration data. > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > + * region data window and indicates the end of a resuming session for > + * the device. The kernel migration driver should complete the > + * incorporation of data written to the migration data window into the > + * device internal state and perform final validity and consistency > + * checking of the new device state. 
If the user provided data is > + * found to be incomplete, inconsistent, or otherwise invalid, the > + * migration driver must indicate a write(2) error and follow the > + * previously described protocol to return either the previous state > + * or an error state. > + * - Bit 3 (NDMA) [OPTIONAL]: > + * The NDMA or "No DMA" state is intended to be a quiescent state for > + * the device for the purposes of managing multiple devices within a > + * user context where peer-to-peer DMA between devices may be active. As discussed with Jason in another thread, this is also required for vPRI when stopping DMA involves completing (instead of preempting) in-fly requests then any vPRI for those requests must be completed when vcpu is running. This cannot be done in !RUNNING which is typically transitioned to after stopping vcpu. It is also useful when the time of stopping device DMA is unbound (even without vPRI). Having a failure path when vcpu is running avoids breaking SLA (if only capturing it after stopping vcpu). This further requires certain interface for the user to specify a timeout value for entering NDMA, though unclear to me what it will be now. > + * Support for the NDMA bit is indicated through the presence of the > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > + * region. > + * - Setting this bit must prevent the device from initiating any > + * new DMA or interrupt transactions. The migration driver must Why also disabling interrupt? vcpu is still running at this point thus interrupt could be triggered for many reasons other than DMA... > + * complete any such outstanding operations prior to completing > + * the transition to the NDMA state. The NDMA device_state > + * essentially represents a sub-set of the !RUNNING state for the > + * purpose of quiescing the device, therefore the NDMA device_state > + * bit is superfluous in combinations including !RUNNING. 'superfluous' means doing so will get a failure, or just not recommended? > + * - Clearing this bit (ie. !NDMA) negates the device operational > + * restrictions required by the NDMA state. > + * - Bits [31:4]: > + * Reserved for future use, users should use read-modify-write > + * operations to the device_state field for manipulation of the above > + * defined bits for optimal compatibility. > + * Thanks Kevin
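As a concrete illustration of the read-modify-write guidance quoted above, a userspace sketch for manipulating device_state through the migration region might look like the following. It assumes the v1 region-based layout as it stood when this series was posted (struct vfio_device_migration_info at the start of the migration region, located via VFIO_DEVICE_GET_REGION_INFO); the NDMA define is the bit 3 value proposed in this RFC, so its name here is an assumption.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Proposed as bit 3 in this RFC; the define name is assumed here. */
#ifndef VFIO_DEVICE_STATE_NDMA
#define VFIO_DEVICE_STATE_NDMA (1 << 3)
#endif

/*
 * Read-modify-write of device_state at the start of the migration
 * region.  'region_off' is the file offset of the migration region as
 * reported by VFIO_DEVICE_GET_REGION_INFO.
 */
static int set_device_state_bits(int device_fd, off_t region_off,
				 uint32_t set, uint32_t clear)
{
	off_t off = region_off +
		    offsetof(struct vfio_device_migration_info, device_state);
	uint32_t state;

	if (pread(device_fd, &state, sizeof(state), off) != sizeof(state))
		return -errno;

	state = (state & ~clear) | set;

	/* The driver must have fully reached the new state when this returns. */
	if (pwrite(device_fd, &state, sizeof(state), off) != sizeof(state))
		return -errno;

	return 0;
}

/* Example flow: enter pre-copy (RUNNING | SAVING), then add NDMA to quiesce. */
static int enter_precopy_then_ndma(int device_fd, off_t region_off)
{
	int ret = set_device_state_bits(device_fd, region_off,
					VFIO_DEVICE_STATE_SAVING, 0);

	if (ret)
		return ret;

	return set_device_state_bits(device_fd, region_off,
				     VFIO_DEVICE_STATE_NDMA, 0);
}

The same helper covers clearing bits, for example dropping NDMA again or clearing RUNNING to enter the stop-and-copy phase.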
On Fri, 7 Jan 2022 08:03:57 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > Hi, Alex, > > Thanks for cleaning up this part, which is very helpful! > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, December 10, 2021 7:34 AM > > > > + * > > + * The device_state field defines the following bitfield use: > > + * > > + * - Bit 0 (RUNNING) [REQUIRED]: > > + * - Setting this bit indicates the device is fully operational, the > > + * device may generate interrupts, DMA, respond to MMIO, all vfio > > + * device regions are functional, and the device may advance its > > + * internal state. The default device_state must indicate the device > > + * in exclusively the RUNNING state, with no other bits in this field > > + * set. > > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the > > + * device. The device must not generate interrupts, DMA, or advance > > + * its internal state. > > I'm curious about what it means for the mediated device. I suppose this > 'must not' clause is from user p.o.v i.e. no event delivered to the user, > no DMA to user memory and no user visible change on mdev state. Physically > the device resource backing the mdev may still generate interrupt/DMA > to the host according to the mediation policy. > > Is this understanding correct? Yes, one mediated device stopped running can't cause the backing device to halt, it must continue performing activities for other child devices as well as any host duties. The user owned device should effectively stop. > > +* The user should take steps to restrict access > > + * to vfio device regions other than the migration region while the > > + * device is !RUNNING or risk corruption of the device migration data > > + * stream. The device and kernel migration driver must accept and > > + * respond to interaction to support external subsystems in the > > + * !RUNNING state, for example PCI MSI-X and PCI config space. > > and also respond to mmio access if some state is saved via reading mmio? The device must not generate a host fault, ex. PCIe UR, but the idea here is that the device stops and preventing further access is the user's responsibility. Failure to stop those accesses may result in corrupting the migration data. > > + * Failure by the user to restrict device access while !RUNNING must > > + * not result in error conditions outside the user context (ex. > > + * host system faults). > > + * - Bit 1 (SAVING) [REQUIRED]: > > + * - Setting this bit enables and initializes the migration region data > > + * window and associated fields within vfio_device_migration_info for > > + * capturing the migration data stream for the device. The migration > > + * driver may perform actions such as enabling dirty logging of device > > + * state with this bit. The SAVING bit is mutually exclusive with the > > + * RESUMING bit defined below. > > + * - Clearing this bit (ie. !SAVING) de-initializes the migration region > > + * data window and indicates the completion or termination of the > > + * migration data stream for the device. > > + * - Bit 2 (RESUMING) [REQUIRED]: > > + * - Setting this bit enables and initializes the migration region data > > + * window and associated fields within vfio_device_migration_info for > > + * restoring the device from a migration data stream captured from a > > + * SAVING session with a compatible device. The migration driver may > > + * perform internal device resets as necessary to reinitialize the > > + * internal device state for the incoming migration data. 
> > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > > + * region data window and indicates the end of a resuming session for > > + * the device. The kernel migration driver should complete the > > + * incorporation of data written to the migration data window into the > > + * device internal state and perform final validity and consistency > > + * checking of the new device state. If the user provided data is > > + * found to be incomplete, inconsistent, or otherwise invalid, the > > + * migration driver must indicate a write(2) error and follow the > > + * previously described protocol to return either the previous state > > + * or an error state. > > + * - Bit 3 (NDMA) [OPTIONAL]: > > + * The NDMA or "No DMA" state is intended to be a quiescent state for > > + * the device for the purposes of managing multiple devices within a > > + * user context where peer-to-peer DMA between devices may be active. > > As discussed with Jason in another thread, this is also required for vPRI > when stopping DMA involves completing (instead of preempting) in-fly > requests then any vPRI for those requests must be completed when vcpu > is running. This cannot be done in !RUNNING which is typically transitioned > to after stopping vcpu. > > It is also useful when the time of stopping device DMA is unbound (even > without vPRI). Having a failure path when vcpu is running avoids breaking > SLA (if only capturing it after stopping vcpu). This further requires certain > interface for the user to specify a timeout value for entering NDMA, though > unclear to me what it will be now. > > > + * Support for the NDMA bit is indicated through the presence of the > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration > > + * region. > > + * - Setting this bit must prevent the device from initiating any > > + * new DMA or interrupt transactions. The migration driver must > > Why also disabling interrupt? vcpu is still running at this point thus interrupt > could be triggered for many reasons other than DMA... It's my understanding that the vCPU would be halted for the NDMA use case, we can't very well have vCPUs demanding requests to devices that are prevented from completing them. The NDMA phase is intended to support completion of outstanding requests without concurrently accepting new requests, AIUI. Further conversations in this thread allow for interrupts and deduce that the primary requirement of NDMA is to restrict P2P DMA, which can be approximated as all non-MSI DMA. > > + * complete any such outstanding operations prior to completing > > + * the transition to the NDMA state. The NDMA device_state > > + * essentially represents a sub-set of the !RUNNING state for the > > + * purpose of quiescing the device, therefore the NDMA device_state > > + * bit is superfluous in combinations including !RUNNING. > > 'superfluous' means doing so will get a failure, or just not recommended? Superfluous meaning redundant. It's allowed, but DMA is already restricted when !RUNNING, so setting NDMA when !RUNNING is meaningless. > > + * - Clearing this bit (ie. !NDMA) negates the device operational > > + * restrictions required by the NDMA state. > > + * - Bits [31:4]: > > + * Reserved for future use, users should use read-modify-write > > + * operations to the device_state field for manipulation of the above > > + * defined bits for optimal compatibility. 
> > + * FWIW, I'm expecting to see an alternative uAPI proposal using an FSM in the near future, so while this clarifies what I believe is the intention of the existing uAPI, it might be deprecated before we bother to commit such clarifications. Thanks, Alex
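On the discovery side, the text quoted in this exchange indicates NDMA support via a region-info capability. A hedged sketch of how userspace could probe for it with the existing VFIO_DEVICE_GET_REGION_INFO capability chain is below; the numeric ID for VFIO_REGION_INFO_CAP_MIG_NDMA is not visible in this excerpt, so the value used here is an assumption.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Added by this RFC; the numeric ID used here is an assumption. */
#ifndef VFIO_REGION_INFO_CAP_MIG_NDMA
#define VFIO_REGION_INFO_CAP_MIG_NDMA 6
#endif

/* Walk the migration region's capability chain looking for NDMA support. */
static bool migration_region_has_ndma(int device_fd, uint32_t region_index)
{
	struct vfio_region_info info = { .argsz = sizeof(info),
					 .index = region_index };
	struct vfio_region_info *pinfo;
	struct vfio_info_cap_header *hdr;
	bool found = false;

	/* First call reports the argsz needed to hold the capability chain. */
	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) ||
	    !(info.flags & VFIO_REGION_INFO_FLAG_CAPS) ||
	    info.argsz <= sizeof(info))
		return false;

	pinfo = calloc(1, info.argsz);
	if (!pinfo)
		return false;
	pinfo->argsz = info.argsz;
	pinfo->index = region_index;

	if (!ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, pinfo) &&
	    pinfo->cap_offset >= sizeof(*pinfo)) {
		uint32_t off = pinfo->cap_offset;

		while (off) {
			hdr = (struct vfio_info_cap_header *)((char *)pinfo + off);
			if (hdr->id == VFIO_REGION_INFO_CAP_MIG_NDMA) {
				found = true;
				break;
			}
			off = hdr->next;
		}
	}

	free(pinfo);
	return found;
}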
On Fri, Jan 07, 2022 at 02:01:55AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, January 7, 2022 8:30 AM > > > > On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote: > > > > Devices that are poorly designed here will have very long migration > > > > downtime latencies and people simply won't want to use them. > > > > > > Different usages have different latency requirement. Do we just want > > > people to decide whether to manage state for a device by > > > measurement? > > > > It doesn't seem unreasonable to allow userspace to set max timer for > > NDMA for SLA purposes on devices that have unbounded NDMA times. It > > would probably be some new optional ioctl for devices that can > > implement it. > > Yes, that's my point. > > > > > However, this basically gives up on the idea that a VM can be migrated > > as any migration can timeout and fail under this philosophy. I think > > that is still very poor. > > > > Optional migration really can't be sane path forward. > > > > How is it different from the scenario where the guest generates a very > high dirty rate so the precopy phase can never converge to a pre-defined > threshold then abort the migration after certain timeout? The hypervisor can halt the VCPU and put a stop to this and complete the migration. There is a difference between optional migration under a SLA and mandatory migration with no SLA - I think both must be supported to be sane. > IMHO live migration is always a try-and-fail flavor. A previous migration > failure doesn't prevent the orchestration stack to retry at a later point. An operator might need to emergency migrate a VM without the possibility for failure. For instance there is something wrong with the base HW. SLA ignored, migration must be done. IMHO it is completely wrong to view migration as optional, that is a terrible standard to design HW to. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Saturday, January 8, 2022 1:23 AM > > On Fri, Jan 07, 2022 at 02:01:55AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, January 7, 2022 8:30 AM > > > > > > On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote: > > > > > Devices that are poorly designed here will have very long migration > > > > > downtime latencies and people simply won't want to use them. > > > > > > > > Different usages have different latency requirement. Do we just want > > > > people to decide whether to manage state for a device by > > > > measurement? > > > > > > It doesn't seem unreasonable to allow userspace to set max timer for > > > NDMA for SLA purposes on devices that have unbounded NDMA times. It > > > would probably be some new optional ioctl for devices that can > > > implement it. > > > > Yes, that's my point. > > > > > > > > However, this basically gives up on the idea that a VM can be migrated > > > as any migration can timeout and fail under this philosophy. I think > > > that is still very poor. > > > > > > Optional migration really can't be sane path forward. > > > > > > > How is it different from the scenario where the guest generates a very > > high dirty rate so the precopy phase can never converge to a pre-defined > > threshold then abort the migration after certain timeout? > > The hypervisor can halt the VCPU and put a stop to this and complete > the migration. > > There is a difference between optional migration under a SLA and > mandatory migration with no SLA - I think both must be supported to be > sane. > > > IMHO live migration is always a try-and-fail flavor. A previous migration > > failure doesn't prevent the orchestration stack to retry at a later point. > > An operator might need to emergency migrate a VM without the > possibility for failure. For instance there is something wrong with > the base HW. SLA ignored, migration must be done. How is it done today when no assigned device supports migration? If any constraint is tolerable today, I don't see why supporting only optional migration cannot be accepted which removes some constraints while still bears a subset in those deployments. This can be seen as an intermediate step in the transition path toward the perfect world where both optional and mandatory migration are supported. > > IMHO it is completely wrong to view migration as optional, that is a > terrible standard to design HW to. > I don't want to argue 'wrong' or 'terrible' here since different person certainly has different view based their own usages. But based on the whole discussion I hope we are aligned on: - It's necessary to support existing HW though it may only supports optional migration due to unbounded time of stopping DMA; - We should influence IP designers to design HW to allow preempting in-fly requests and stop DMA quickly (also implying the capability of aborting/resuming in-fly PRI requests); - Specific to the device state management uAPI, it should not assume a specific usage and instead allow the user to set a timeout value so transitioning to NDMA is failed if the operation cannot be completed within the specified timeout value. If the user doesn't set it, the migration driver could conservatively use a default timeout value to gate any potentially unbounded operation. Thanks Kevin
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, January 8, 2022 12:36 AM > > On Fri, 7 Jan 2022 08:03:57 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > Hi, Alex, > > > > Thanks for cleaning up this part, which is very helpful! > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, December 10, 2021 7:34 AM > > > > > > + * > > > + * The device_state field defines the following bitfield use: > > > + * > > > + * - Bit 0 (RUNNING) [REQUIRED]: > > > + * - Setting this bit indicates the device is fully operational, the > > > + * device may generate interrupts, DMA, respond to MMIO, all vfio > > > + * device regions are functional, and the device may advance its > > > + * internal state. The default device_state must indicate the device > > > + * in exclusively the RUNNING state, with no other bits in this field > > > + * set. > > > + * - Clearing this bit (ie. !RUNNING) must stop the operation of the > > > + * device. The device must not generate interrupts, DMA, or > advance > > > + * its internal state. > > > > I'm curious about what it means for the mediated device. I suppose this > > 'must not' clause is from user p.o.v i.e. no event delivered to the user, > > no DMA to user memory and no user visible change on mdev state. > Physically > > the device resource backing the mdev may still generate interrupt/DMA > > to the host according to the mediation policy. > > > > Is this understanding correct? > > Yes, one mediated device stopped running can't cause the backing > device to halt, it must continue performing activities for other child > devices as well as any host duties. The user owned device should > effectively stop. > > > > +* The user should take steps to restrict access > > > + * to vfio device regions other than the migration region while the > > > + * device is !RUNNING or risk corruption of the device migration > data > > > + * stream. The device and kernel migration driver must accept and > > > + * respond to interaction to support external subsystems in the > > > + * !RUNNING state, for example PCI MSI-X and PCI config space. > > > > and also respond to mmio access if some state is saved via reading mmio? > > The device must not generate a host fault, ex. PCIe UR, but the idea > here is that the device stops and preventing further access is the > user's responsibility. Failure to stop those accesses may result in > corrupting the migration data. ok. With this background I can understand what the last sentence tries to say. It basically clarifies that while user access to the device is restricted (by user itself) the kernel access (e.g. pci core, or the mediation driver itself) is allowed as long as doing so doesn't advance the to-be-saved state. > > > > + * Failure by the user to restrict device access while !RUNNING must > > > + * not result in error conditions outside the user context (ex. > > > + * host system faults). > > > + * - Bit 1 (SAVING) [REQUIRED]: > > > + * - Setting this bit enables and initializes the migration region data > > > + * window and associated fields within vfio_device_migration_info > for > > > + * capturing the migration data stream for the device. The > migration > > > + * driver may perform actions such as enabling dirty logging of > device > > > + * state with this bit. The SAVING bit is mutually exclusive with the > > > + * RESUMING bit defined below. > > > + * - Clearing this bit (ie. 
!SAVING) de-initializes the migration region > > > + * data window and indicates the completion or termination of the > > > + * migration data stream for the device. > > > + * - Bit 2 (RESUMING) [REQUIRED]: > > > + * - Setting this bit enables and initializes the migration region data > > > + * window and associated fields within vfio_device_migration_info > for > > > + * restoring the device from a migration data stream captured from > a > > > + * SAVING session with a compatible device. The migration driver > may > > > + * perform internal device resets as necessary to reinitialize the > > > + * internal device state for the incoming migration data. > > > + * - Clearing this bit (ie. !RESUMING) de-initializes the migration > > > + * region data window and indicates the end of a resuming session > for > > > + * the device. The kernel migration driver should complete the > > > + * incorporation of data written to the migration data window into > the > > > + * device internal state and perform final validity and consistency > > > + * checking of the new device state. If the user provided data is > > > + * found to be incomplete, inconsistent, or otherwise invalid, the > > > + * migration driver must indicate a write(2) error and follow the > > > + * previously described protocol to return either the previous state > > > + * or an error state. > > > + * - Bit 3 (NDMA) [OPTIONAL]: > > > + * The NDMA or "No DMA" state is intended to be a quiescent state > for > > > + * the device for the purposes of managing multiple devices within a > > > + * user context where peer-to-peer DMA between devices may be > active. > > > > As discussed with Jason in another thread, this is also required for vPRI > > when stopping DMA involves completing (instead of preempting) in-fly > > requests then any vPRI for those requests must be completed when vcpu > > is running. This cannot be done in !RUNNING which is typically transitioned > > to after stopping vcpu. > > > > It is also useful when the time of stopping device DMA is unbound (even > > without vPRI). Having a failure path when vcpu is running avoids breaking > > SLA (if only capturing it after stopping vcpu). This further requires certain > > interface for the user to specify a timeout value for entering NDMA, though > > unclear to me what it will be now. > > > > > + * Support for the NDMA bit is indicated through the presence of the > > > + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by > > > + * VFIO_DEVICE_GET_REGION_INFO for the associated device > migration > > > + * region. > > > + * - Setting this bit must prevent the device from initiating any > > > + * new DMA or interrupt transactions. The migration driver must > > > > Why also disabling interrupt? vcpu is still running at this point thus > interrupt > > could be triggered for many reasons other than DMA... > > It's my understanding that the vCPU would be halted for the NDMA use > case, we can't very well have vCPUs demanding requests to devices that > are prevented from completing them. The NDMA phase is intended to > support completion of outstanding requests without concurrently > accepting new requests, AIUI. By current definition NDMA doesn't require the user to restrict access to vfio device regions as for !RUNNING. So it's probably not good to tie it with vcpu stopped. As explained above it's also required when vPRI is enabled. 
At least for current SVA-capable devices from Intel and Huawei they all require waiting for vPRI completion before transitioning to NDMA can be done while completing vPRI requires running vcpu. New requests between NDMA and !RUNNING need to be mediated/ queued by the migration driver and then re-submitted on the destination node. This further requires certain mechanism to allow dynamically changing the mediation vs. passthru policy on the submit portal. > > Further conversations in this thread allow for interrupts and deduce > that the primary requirement of NDMA is to restrict P2P DMA, which can > be approximated as all non-MSI DMA. > > > > + * complete any such outstanding operations prior to completing > > > + * the transition to the NDMA state. The NDMA device_state > > > + * essentially represents a sub-set of the !RUNNING state for the > > > + * purpose of quiescing the device, therefore the NDMA > device_state > > > + * bit is superfluous in combinations including !RUNNING. > > > > 'superfluous' means doing so will get a failure, or just not recommended? > > Superfluous meaning redundant. It's allowed, but DMA is already > restricted when !RUNNING, so setting NDMA when !RUNNING is meaningless. > > > > + * - Clearing this bit (ie. !NDMA) negates the device operational > > > + * restrictions required by the NDMA state. > > > + * - Bits [31:4]: > > > + * Reserved for future use, users should use read-modify-write > > > + * operations to the device_state field for manipulation of the above > > > + * defined bits for optimal compatibility. > > > + * > > FWIW, I'm expecting to see an alternative uAPI propose using a FSM > machine in the near future, so while this clarifies what I believe is > the intention of the existing uAPI, it might be deprecated before we > bother to commit such clarifications. Thanks, > got it. Thanks Kevin
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, January 7, 2022 5:21 AM > > We were also thinking to retain STOP. SAVING -> STOP could possibly > serve as the abort path to avoid a double action, and some of the use > cases you ID'd below are achievable if STOP remains. what is the exact difference between a null state {} (implying !RUNNING) and STOP in this context? If they are different, who (user or driver) should conduct and when do we expect transitioning the device into a null state? > > > We have 20 possible transitions. I've marked those available via the > > "odd ascii art" diagram as (a), that's 7 transitions. We could > > essentially remove the NULL state as unreachable as there seems little > > value in the 2 transitions marked (a)* if we look only at migration, > > that would bring us down to 5 of 12 possible transitions. We need to > > give userspace an abort path though, so we minimally need the 2 > > transitions marked (b) (7/12). > > > So now we can discuss the remaining 5 transitions: > > > > {SAVING} -> {RESUMING} > > If not supported, user can achieve this via: > > {SAVING}->{RUNNING}->{RESUMING} > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > This can be: > > SAVING -> STOP -> RESUMING From Alex's original description the default device state is RUNNING. This supposed to be the initial state on the dest machine for the device assigned to Qemu before Qemu resumes the device state. Then how do we eliminate the RUNNING state in above flow? Who makes STOP as the initial state on the dest node? > > drivers follow the previously provided pseudo algorithm with the > > requirement that they cannot pass through an invalid state, we need to > > formally address whether the NULL state is invalid or just not > > reachable by the user. > > What is a NULL state? Hah, seems I'm not the only one having this confusion.
On Mon, 10 Jan 2022 07:55:16 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, January 7, 2022 5:21 AM > > > > We were also thinking to retain STOP. SAVING -> STOP could possibly > > serve as the abort path to avoid a double action, and some of the use > > cases you ID'd below are achievable if STOP remains. > > what is the exact difference between a null state {} (implying !RUNNING) > and STOP in this context? > > If they are different, who (user or driver) should conduct and when do > we expect transitioning the device into a null state? Sorry if I added confusion here, the null, ie. {}, state fit my notation better. The null state is simply no bits set in device_state, it's equivalent to "STOP" without coming up with a new name for every set of bit combinations. > > > We have 20 possible transitions. I've marked those available via the > > > "odd ascii art" diagram as (a), that's 7 transitions. We could > > > essentially remove the NULL state as unreachable as there seems little > > > value in the 2 transitions marked (a)* if we look only at migration, > > > that would bring us down to 5 of 12 possible transitions. We need to > > > give userspace an abort path though, so we minimally need the 2 > > > transitions marked (b) (7/12). > > > > > So now we can discuss the remaining 5 transitions: > > > > > > {SAVING} -> {RESUMING} > > > If not supported, user can achieve this via: > > > {SAVING}->{RUNNING}->{RESUMING} > > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > > > This can be: > > > > SAVING -> STOP -> RESUMING > > From Alex's original description the default device state is RUNNING. > This supposed to be the initial state on the dest machine for the > device assigned to Qemu before Qemu resumes the device state. > Then how do we eliminate the RUNNING state in above flow? Who > makes STOP as the initial state on the dest node? The device must be RUNNING by default. This is a requirement that introduction of migration support for a device cannot break compatibility with existing userspace that may no support migration features. It would be QEMU's responsibility to transition a migration target device from the default state to a state to accept a migration. There's no discussion here of eliminating the {RUNNING}->{RESUMING} transition. > > > drivers follow the previously provided pseudo algorithm with the > > > requirement that they cannot pass through an invalid state, we need to > > > formally address whether the NULL state is invalid or just not > > > reachable by the user. > > > > What is a NULL state? > > Hah, seems I'm not the only one having this confusion.
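For readers following along in the header, the "null state" under discussion is simply a device_state value with none of the defined bits set, which the v1 defines (as they stood when this series was posted) already label as the stopped state; a tiny illustration:

#include <stdbool.h>
#include <stdint.h>
#include <linux/vfio.h>

/*
 * The "null"/{} state is device_state with no bits set: !RUNNING,
 * !SAVING, !RESUMING (and !NDMA, if that bit is added).
 */
static inline bool device_state_is_null(uint32_t device_state)
{
	return device_state == VFIO_DEVICE_STATE_STOP;	/* i.e. == 0 */
}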
On Mon, Jan 10, 2022 at 03:14:44AM +0000, Tian, Kevin wrote: > > An operator might need to emergency migrate a VM without the > > possibility for failure. For instance there is something wrong with > > the base HW. SLA ignored, migration must be done. > > How is it done today when no assigned device supports migration? That is different, the operator deliberately created a VM that is not migratable. Operators may simply prefer to never do this. You are talking about migration which is blockable by the guest - outside of operator controll this is a totally different thing. > - It's necessary to support existing HW though it may only supports > optional migration due to unbounded time of stopping DMA; At a minimum a device with optional migration needs to be reported to userspace and qemu should not blindly adopt it without some opt-in IMHO. > - We should influence IP designers to design HW to allow preempting > in-fly requests and stop DMA quickly (also implying the capability of > aborting/resuming in-fly PRI requests); Yes, I think we need a way to suspend the device then abort its PRIs with some error. The ATS cache is not something that is migrated so this seems reasonable. The only sketchy bit looks like how to resync the VM that still would have a PRI in its queue and would still want to answer it. That answer would have to be discarded.. > - Specific to the device state management uAPI, it should not assume > a specific usage and instead allow the user to set a timeout value so > transitioning to NDMA is failed if the operation cannot be completed > within the specified timeout value. If the user doesn't set it, the > migration driver could conservatively use a default timeout value to > gate any potentially unbounded operation. This would need to go along with the flag above, as only optional drivers should have something like this. Jason
On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote: > > > {SAVING} -> {RESUMING} > > > If not supported, user can achieve this via: > > > {SAVING}->{RUNNING}->{RESUMING} > > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > > > This can be: > > > > SAVING -> STOP -> RESUMING > > From Alex's original description the default device state is RUNNING. > This supposed to be the initial state on the dest machine for the > device assigned to Qemu before Qemu resumes the device state. > Then how do we eliminate the RUNNING state in above flow? Who > makes STOP as the initial state on the dest node? All of this notation should be read with the idea that the device_state is already somehow moved away from RESET. Ie the above notation is about what is possible once qemu has already moved the device to SAVING. > > This is currently buggy as-is because they cannot DMA map these > > things, touch them with the CPU and kmap, or do, really, anything with > > them. > > Can you elaborate why mdev cannot access p2p pages? It is just a failure of APIs in the kernel. A p2p page has no 'struct page' so it cannot be used in a scatter list, and thus cannot be used in dma_map_sg. It also cannot be kmap'd, or memcpy'd from. So, essentially, everything that current mdev drivers try to do will crash with a non-struct page pfn. In principle this could all be made to work someday, but it doesn't work now. What I want to do is make these APIs correctly use struct page and block all non-struct page memory from getting into them. > > Since it doesn't bring any value to userspace, I prefer we not define > things in this complicated way. > > So ERROR is really about unrecoverable failures. If recoverable suppose > errno should have been returned and the device is still in the original > state. Is this understanding correct? Yes > btw which errno indicates to the user that the device is back to the > original state or in the ERROR state? or want the user to always check > the device state upon any transition error? IMHO it is a failing of the API that this cannot be reported back. The fact that the system call became one-directional is a side effect of abusing the migration region like this. Jason
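[A short kernel-side sketch of the struct page limitation Jason describes: every step a vfio-pinning ("mdev-style") consumer takes today goes through APIs that want a struct page, so a bare PFN pointing into a peer device BAR has nowhere to enter the flow. Illustrative only; the function and names below are generic, not taken from any particular mdev driver.]

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/highmem.h>
#include <linux/errno.h>

/* Everything here needs a struct page; a raw p2p pfn cannot be used. */
static int map_one_page(struct device *dev, struct page *pg, unsigned int len)
{
	struct scatterlist sg;
	void *va;

	sg_init_table(&sg, 1);
	sg_set_page(&sg, pg, len, 0);	/* scatterlist entries are struct page based */

	if (!dma_map_sg(dev, &sg, 1, DMA_BIDIRECTIONAL))
		return -EIO;

	va = kmap(pg);			/* kmap also requires a struct page */
	/* ... memcpy() to or from va ... */
	kunmap(pg);

	dma_unmap_sg(dev, &sg, 1, DMA_BIDIRECTIONAL);
	return 0;
}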
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, January 11, 2022 1:34 AM > > On Mon, 10 Jan 2022 07:55:16 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, January 7, 2022 5:21 AM > > > > > > We were also thinking to retain STOP. SAVING -> STOP could possibly > > > serve as the abort path to avoid a double action, and some of the use > > > cases you ID'd below are achievable if STOP remains. > > > > what is the exact difference between a null state {} (implying !RUNNING) > > and STOP in this context? > > > > If they are different, who (user or driver) should conduct and when do > > we expect transitioning the device into a null state? > > Sorry if I added confusion here, the null, ie. {}, state fit my > notation better. The null state is simply no bits set in device_state, > it's equivalent to "STOP" without coming up with a new name for every > set of bit combinations. This matches my thought. Earlier reading this thread I was left with the impression that Jason meant 'STOP' as a new state. > > > > > We have 20 possible transitions. I've marked those available via the > > > > "odd ascii art" diagram as (a), that's 7 transitions. We could > > > > essentially remove the NULL state as unreachable as there seems little > > > > value in the 2 transitions marked (a)* if we look only at migration, > > > > that would bring us down to 5 of 12 possible transitions. We need to > > > > give userspace an abort path though, so we minimally need the 2 > > > > transitions marked (b) (7/12). > > > > > > > So now we can discuss the remaining 5 transitions: > > > > > > > > {SAVING} -> {RESUMING} > > > > If not supported, user can achieve this via: > > > > {SAVING}->{RUNNING}->{RESUMING} > > > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > > > > > This can be: > > > > > > SAVING -> STOP -> RESUMING > > > > From Alex's original description the default device state is RUNNING. > > This supposed to be the initial state on the dest machine for the > > device assigned to Qemu before Qemu resumes the device state. > > Then how do we eliminate the RUNNING state in above flow? Who > > makes STOP as the initial state on the dest node? > > The device must be RUNNING by default. This is a requirement that > introduction of migration support for a device cannot break > compatibility with existing userspace that may not support migration > features. It would be QEMU's responsibility to transition a migration > target device from the default state to a state to accept a migration. > There's no discussion here of eliminating the {RUNNING}->{RESUMING} > transition. Then having STOP in the flow is meaningless. It actually means: {SAVING}-RESET->{RUNNING}->{STOP}->{RESUMING} > > > > > drivers follow the previously provided pseudo algorithm with the > > > > requirement that they cannot pass through an invalid state, we need to > > > > formally address whether the NULL state is invalid or just not > > > > reachable by the user. > > > > > > What is a NULL state? > > > > Hah, seems I'm not the only one having this confusion.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, January 11, 2022 1:53 AM > > > - It's necessary to support existing HW though it may only supports > > optional migration due to unbounded time of stopping DMA; > > At a minimum a device with optional migration needs to be reported to > userspace and qemu should not blindly adopt it without some opt-in > IMHO. yes > > > - We should influence IP designers to design HW to allow preempting > > in-fly requests and stop DMA quickly (also implying the capability of > > aborting/resuming in-fly PRI requests); > > Yes, I think we need a way to suspend the device then abort its PRIs > with some error. The ATS cache is not something that is migrated so > this seems reasonable. Yes, any cache can be abandoned in the migration process, and such errors must be recoverable, otherwise they would break the guest functionality. The key is that PRI can happen at any time which implies that the device must be able to preempt and abort a transaction at very fine granularity. There might be IP-specific factors affecting the final preemption granularity (e.g. complex GPUs will certainly have more challenges), which is what we need to involve IP designers to fully understand. > > The only sketchy bit looks like how to resync the VM that still would > have a PRI in its queue and would still want to answer it. That answer > would have to be discarded.. Yes. This also suggests that iommu core needs to allow the driver to cancel all previously-queued PRI requests for its device so the guest answer to that device can be ignored by the iommu core. > > > - Specific to the device state management uAPI, it should not assume > > a specific usage and instead allow the user to set a timeout value so > > transitioning to NDMA is failed if the operation cannot be completed > > within the specified timeout value. If the user doesn't set it, the > > migration driver could conservatively use a default timeout value to > > gate any potentially unbounded operation. > > This would need to go along with the flag above, as only optional > drivers should have something like this. > Agree. Thanks Kevin
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, January 11, 2022 2:12 AM > > On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote: > > > > > {SAVING} -> {RESUMING} > > > > If not supported, user can achieve this via: > > > > {SAVING}->{RUNNING}->{RESUMING} > > > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > > > > > This can be: > > > > > > SAVING -> STOP -> RESUMING > > > > From Alex's original description the default device state is RUNNING. > > This supposed to be the initial state on the dest machine for the > > device assigned to Qemu before Qemu resumes the device state. > > Then how do we eliminate the RUNNING state in above flow? Who > > makes STOP as the initial state on the dest node? > > All of this notation should be read with the idea that the > device_state is already somehow moved away from RESET. Ie the above > notation is about what is possible once qemu has already moved the > device to SAVING. Qemu moves the device to SAVING on the src node. On the dest the device is in RUNNING (after reset) which can be directly transitioned to RESUMING. I didn't see the point of adding a STOP here. > > > > This is currently buggy as-is because they cannot DMA map these > > > things, touch them with the CPU and kmap, or do, really, anything with > > > them. > > > > Can you elaborate why mdev cannot access p2p pages? > > It is just a failure of APIs in the kernel. A p2p page has no 'struct > page' so it cannot be used in a scatter list, and thus cannot be used in > dma_map_sg. > > It also cannot be kmap'd, or memcpy'd from. > > So, essentially, everything that current mdev drivers try to do will > crash with a non-struct page pfn. > > In principle this could all be made to work someday, but it doesn't > work now. > > What I want to do is make these APIs correctly use struct page and > block all non-struct page memory from getting into them. > I see. It does make sense for the current sw mdev drivers. Later when supporting hw mdev (with pasid granular isolation in iommu), this restriction can be lifted as it doesn't use the dma api and is pretty much like a pdev with regard to ioas management. Thanks Kevin
On Tue, Jan 11, 2022 at 03:14:04AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Tuesday, January 11, 2022 2:12 AM > > > > On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote: > > > > > > > {SAVING} -> {RESUMING} > > > > > If not supported, user can achieve this via: > > > > > {SAVING}->{RUNNING}->{RESUMING} > > > > > {SAVING}-RESET->{RUNNING}->{RESUMING} > > > > > > > > This can be: > > > > > > > > SAVING -> STOP -> RESUMING > > > > > > From Alex's original description the default device state is RUNNING. > > > This supposed to be the initial state on the dest machine for the > > > device assigned to Qemu before Qemu resumes the device state. > > > Then how do we eliminate the RUNNING state in above flow? Who > > > makes STOP as the initial state on the dest node? > > > > All of this notation should be read with the idea that the > > device_state is already somehow moved away from RESET. Ie the above > > notation is about what is possible once qemu has already moved the > > device to SAVING. > > Qemu moves the device to SAVING on the src node. > > On the dest the device is in RUNNING (after reset) which can be directly > transitioned to RESUMING. I didn't see the point of adding a STOP here. Alex is talking about the same node case where qemu has put the device into SAVING and then, for whatever reason, decides it now wants the device to be in RESUMING. We are talking about the state space of commands the driver has to process here. If we can break down things like SAVING -> RESUMING into two commands: SAVING -> STOP and STOP -> RESUMING. Then the driver has to implement fewer arcs, and the arcs it does implement are much simpler. It also resolves the precedence question nicely: we have a core FSM built on the arcs the drivers implement, and that in turn gives a natural answer to the question of how to transition between any two states. Eg using the state names I gave earlier we can look at going from RESUMING -> PRE_COPY_NDMA and decomposing it into these four steps: RESUMING -> STOP -> RUNNING -> PRE_COPY -> PRE_COPY_P2P. In the end the driver needs to implement only about half of the total arcs and the ones it does need to implement are simpler and have a more obvious implementation. > Later when supporting hw mdev (with pasid granular isolation in > iommu), this restriction can be lifted as it doesn't use the dma api > and is pretty much like a pdev with regard to ioas management. When I say 'mdev' I really mean things that use the vfio pinning interface - which we don't quite have a proper name for yet (though emulated iommu perhaps is sticking) Jason
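[A toy model of the "core FSM over simple driver arcs" idea, using a reduced set of the state names from this discussion; they are not the current uAPI bit combinations. The core keeps a next-hop table and repeats single steps, so a request like RESUMING -> PRE_COPY_P2P is walked as RESUMING -> STOP -> RUNNING -> PRE_COPY -> PRE_COPY_P2P and the driver callback only ever sees adjacent, simple arcs. Sketch only; the table values are one plausible choice, not a spec.]

enum state { STOP, RUNNING, RESUMING, PRE_COPY, PRE_COPY_P2P, NR_STATES };

/* next_step[cur][target]: the adjacent state to move to first. */
static const enum state next_step[NR_STATES][NR_STATES] = {
	/* from STOP: everything is one hop away except the pre-copy states */
	[STOP]         = { STOP, RUNNING, RESUMING, RUNNING, RUNNING },
	/* from RUNNING */
	[RUNNING]      = { STOP, RUNNING, STOP, PRE_COPY, PRE_COPY },
	/* from RESUMING: always drop back to STOP first */
	[RESUMING]     = { STOP, STOP, RESUMING, STOP, STOP },
	/* from PRE_COPY: abort by returning through RUNNING */
	[PRE_COPY]     = { RUNNING, RUNNING, RUNNING, PRE_COPY, PRE_COPY_P2P },
	/* from PRE_COPY_P2P: step back to PRE_COPY for anything else */
	[PRE_COPY_P2P] = { PRE_COPY, PRE_COPY, PRE_COPY, PRE_COPY, PRE_COPY_P2P },
};

/* "step" stands in for a driver callback implementing one simple arc. */
static int walk_to(enum state *cur, enum state target,
		   int (*step)(enum state from, enum state to))
{
	while (*cur != target) {
		enum state next = next_step[*cur][target];
		int ret = step(*cur, next);

		if (ret)
			return ret;	/* the device stays in *cur on failure */
		*cur = next;
	}
	return 0;
}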
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index ef33ea002b0b..1fdbc928f886 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid { #define VFIO_REGION_SUBTYPE_MIGRATION (1) /* - * The structure vfio_device_migration_info is placed at the 0th offset of - * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related - * migration information. Field accesses from this structure are only supported - * at their native width and alignment. Otherwise, the result is undefined and - * vendor drivers should return an error. + * The structure vfio_device_migration_info is placed at the immediate start of + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device + * state and migration information for the device. Field accesses for this + * structure are only supported using their native width and alignment, + * accesses otherwise are undefined and the kernel migration driver should + * return an error. * * device_state: (read/write) - * - The user application writes to this field to inform the vendor driver - * about the device state to be transitioned to. - * - The vendor driver should take the necessary actions to change the - * device state. After successful transition to a given state, the - * vendor driver should return success on write(device_state, state) - * system call. If the device state transition fails, the vendor driver - * should return an appropriate -errno for the fault condition. - * - On the user application side, if the device state transition fails, - * that is, if write(device_state, state) returns an error, read - * device_state again to determine the current state of the device from - * the vendor driver. - * - The vendor driver should return previous state of the device unless - * the vendor driver has encountered an internal error, in which case - * the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR. - * - The user application must use the device reset ioctl to recover the - * device from VFIO_DEVICE_STATE_ERROR state. If the device is - * indicated to be in a valid device state by reading device_state, the - * user application may attempt to transition the device to any valid - * state reachable from the current state or terminate itself. - * - * device_state consists of 3 bits: - * - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear, - * it indicates the _STOP state. When the device state is changed to - * _STOP, driver should stop the device before write() returns. - * - If bit 1 is set, it indicates the _SAVING state, which means that the - * driver should start gathering device state information that will be - * provided to the VFIO user application to save the device's state. - * - If bit 2 is set, it indicates the _RESUMING state, which means that - * the driver should prepare to resume the device. Data provided through - * the migration region should be used to resume the device. - * Bits 3 - 31 are reserved for future use. To preserve them, the user - * application should perform a read-modify-write operation on this - * field when modifying the specified bits. 
- * - * +------- _RESUMING - * |+------ _SAVING - * ||+----- _RUNNING - * ||| - * 000b => Device Stopped, not saving or resuming - * 001b => Device running, which is the default state - * 010b => Stop the device & save the device state, stop-and-copy state - * 011b => Device running and save the device state, pre-copy state - * 100b => Device stopped and the device state is resuming - * 101b => Invalid state - * 110b => Error state - * 111b => Invalid state - * - * State transitions: - * - * _RESUMING _RUNNING Pre-copy Stop-and-copy _STOP - * (100b) (001b) (011b) (010b) (000b) - * 0. Running or default state - * | - * - * 1. Normal Shutdown (optional) - * |------------------------------------->| - * - * 2. Save the state or suspend - * |------------------------->|---------->| - * - * 3. Save the state during live migration - * |----------->|------------>|---------->| - * - * 4. Resuming - * |<---------| - * - * 5. Resumed - * |--------->| - * - * 0. Default state of VFIO device is _RUNNING when the user application starts. - * 1. During normal shutdown of the user application, the user application may - * optionally change the VFIO device state from _RUNNING to _STOP. This - * transition is optional. The vendor driver must support this transition but - * must not require it. - * 2. When the user application saves state or suspends the application, the - * device state transitions from _RUNNING to stop-and-copy and then to _STOP. - * On state transition from _RUNNING to stop-and-copy, driver must stop the - * device, save the device state and send it to the application through the - * migration region. The sequence to be followed for such transition is given - * below. - * 3. In live migration of user application, the state transitions from _RUNNING - * to pre-copy, to stop-and-copy, and to _STOP. - * On state transition from _RUNNING to pre-copy, the driver should start - * gathering the device state while the application is still running and send - * the device state data to application through the migration region. - * On state transition from pre-copy to stop-and-copy, the driver must stop - * the device, save the device state and send it to the user application - * through the migration region. - * Vendor drivers must support the pre-copy state even for implementations - * where no data is provided to the user before the stop-and-copy state. The - * user must not be required to consume all migration data before the device - * transitions to a new state, including the stop-and-copy state. - * The sequence to be followed for above two transitions is given below. - * 4. To start the resuming phase, the device state should be transitioned from - * the _RUNNING to the _RESUMING state. - * In the _RESUMING state, the driver should use the device state data - * received through the migration region to resume the device. - * 5. After providing saved device data to the driver, the application should - * change the state from _RESUMING to _RUNNING. + * The device_state field is a bitfield written by the user to transition + * the associated device between valid migration states using these rules: + * - The user may read or write the device state register at any time. + * - The kernel migration driver must fully transition the device to the + * new state value before the write(2) operation returns to the user. + * - The user may change multiple bits of the bitfield in the same write + * operation, so long as the resulting state is valid. 
+ * - The kernel migration driver must not generate asynchronous device + * state transitions outside of manipulation by the user or the + * VFIO_DEVICE_RESET ioctl as described below. + * - In the event of a device state transition failure, the kernel + * migration driver must return a write(2) error with appropriate errno + * to the user. + * - Upon such an error, re-reading the device_state field must indicate + * the device migration state as either the same state as prior to the + * failing write or, at the migration driver discretion, indicate the + * device state as VFIO_DEVICE_STATE_ERROR. + * - To continue using a device that has entered VFIO_DEVICE_STATE_ERROR, + * the user must issue a VFIO_DEVICE_RESET ioctl, which must transition + * the migration state to the default value (defined below). + * - Devices supporting migration via this specification must support the + * VFIO_DEVICE_RESET ioctl and any use of that ioctl must return the + * device migration state to the default value. + * + * The device_state field defines the following bitfield use: + * + * - Bit 0 (RUNNING) [REQUIRED]: + * - Setting this bit indicates the device is fully operational, the + * device may generate interrupts, DMA, respond to MMIO, all vfio + * device regions are functional, and the device may advance its + * internal state. The default device_state must indicate the device + * in exclusively the RUNNING state, with no other bits in this field + * set. + * - Clearing this bit (ie. !RUNNING) must stop the operation of the + * device. The device must not generate interrupts, DMA, or advance + * its internal state. The user should take steps to restrict access + * to vfio device regions other than the migration region while the + * device is !RUNNING or risk corruption of the device migration data + * stream. The device and kernel migration driver must accept and + * respond to interaction to support external subsystems in the + * !RUNNING state, for example PCI MSI-X and PCI config space. + * Failure by the user to restrict device access while !RUNNING must + * not result in error conditions outside the user context (ex. + * host system faults). + * - Bit 1 (SAVING) [REQUIRED]: + * - Setting this bit enables and initializes the migration region data + * window and associated fields within vfio_device_migration_info for + * capturing the migration data stream for the device. The migration + * driver may perform actions such as enabling dirty logging of device + * state with this bit. The SAVING bit is mutually exclusive with the + * RESUMING bit defined below. + * - Clearing this bit (ie. !SAVING) de-initializes the migration region + * data window and indicates the completion or termination of the + * migration data stream for the device. + * - Bit 2 (RESUMING) [REQUIRED]: + * - Setting this bit enables and initializes the migration region data + * window and associated fields within vfio_device_migration_info for + * restoring the device from a migration data stream captured from a + * SAVING session with a compatible device. The migration driver may + * perform internal device resets as necessary to reinitialize the + * internal device state for the incoming migration data. + * - Clearing this bit (ie. !RESUMING) de-initializes the migration + * region data window and indicates the end of a resuming session for + * the device. 
The kernel migration driver should complete the + * incorporation of data written to the migration data window into the + * device internal state and perform final validity and consistency + * checking of the new device state. If the user provided data is + * found to be incomplete, inconsistent, or otherwise invalid, the + * migration driver must indicate a write(2) error and follow the + * previously described protocol to return either the previous state + * or an error state. + * - Bit 3 (NDMA) [OPTIONAL]: + * The NDMA or "No DMA" state is intended to be a quiescent state for + * the device for the purposes of managing multiple devices within a + * user context where peer-to-peer DMA between devices may be active. + * Support for the NDMA bit is indicated through the presence of the + * VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by + * VFIO_DEVICE_GET_REGION_INFO for the associated device migration + * region. + * - Setting this bit must prevent the device from initiating any + * new DMA or interrupt transactions. The migration driver must + * complete any such outstanding operations prior to completing + * the transition to the NDMA state. The NDMA device_state + * essentially represents a sub-set of the !RUNNING state for the + * purpose of quiescing the device, therefore the NDMA device_state + * bit is superfluous in combinations including !RUNNING. + * - Clearing this bit (ie. !NDMA) negates the device operational + * restrictions required by the NDMA state. + * - Bits [31:4]: + * Reserved for future use, users should use read-modify-write + * operations to the device_state field for manipulation of the above + * defined bits for optimal compatibility. + * + * All combinations for the above defined device_state bits are considered + * valid with the following exceptions: + * - RESUMING and SAVING are mutually exclusive, all combinations of + * (RESUMING | SAVING) are invalid. Furthermore the specific combination + * (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the + * device error state VFIO_DEVICE_STATE_ERROR. This variant is + * specifically chosen due to the !RUNNING state of the device as the + * migration driver should do everything possible, including an internal + * reset of the device, to ensure that the device is fully stopped in + * this state. Other invalid combinations are reserved for future use + * and must not be reachable. + * - Combinations involving (RESUMING | RUNNING) are currently unsupported + * by this revision of the uAPI. + * + * Migration drivers should attempt to support any transition between valid + * states. For further discussion of device_state relative to expected usage + * scenarios, see: Documentation/driver-api/vfio.rst * * reserved: - * Reads on this field return zero and writes are ignored. + * Reads on this field return zero and writes are ignored. * * pending_bytes: (read only) - * The number of pending bytes still to be migrated from the vendor driver. + * The kernel migration driver uses this field to indicate an estimate of + * the remaining data size (in bytes) for the user to copy while SAVING is + * set in the device_state. The value should be considered volatile, + * especially while RUNNING is still set in the device_state. Userspace + * uses this field to test whether data is available to be read from the + * data section described below. Userspace should only consider whether + * the value read is zero or non-zero for the purposes of the protocol + * below. 
The user may only consider the migration data stream to be + * completed when pending_bytes reports a zero value while the device is + * !RUNNING. The kernel migration driver must not require the user to reach + * a zero value for this field to transition to a !RUNNING device_state. + * The value of this field is undefined when !SAVING. * * data_offset: (read only) - * The user application should read data_offset field from the migration - * region. The user application should read the device data from this - * offset within the migration region during the _SAVING state or write - * the device data during the _RESUMING state. See below for details of - * sequence to be followed. + * This field indicates the offset relative to the start of the device + * migration region for the user to collect (SAVING) or store (RESUMING) + * migration data for the device following the protocol described below. + * The migration driver may provide sparse mmap support for the migration + * region and use the data_offset field to direct user accesses as + * appropriate, but must not require mmap access when provided. The value + * of this field is undefined when device_state does not include either + * SAVING or RESUMING. * * data_size: (read/write) - * The user application should read data_size to get the size in bytes of - * the data copied in the migration region during the _SAVING state and - * write the size in bytes of the data copied in the migration region - * during the _RESUMING state. - * - * The format of the migration region is as follows: - * ------------------------------------------------------------------ - * |vfio_device_migration_info| data section | - * | | /////////////////////////////// | - * ------------------------------------------------------------------ - * ^ ^ - * offset 0-trapped part data_offset - * - * The structure vfio_device_migration_info is always followed by the data - * section in the region, so data_offset will always be nonzero. The offset - * from where the data is copied is decided by the kernel driver. The data - * section can be trapped, mmapped, or partitioned, depending on how the kernel - * driver defines the data section. The data section partition can be defined - * as mapped by the sparse mmap capability. If mmapped, data_offset must be - * page aligned, whereas initial section which contains the - * vfio_device_migration_info structure, might not end at the offset, which is - * page aligned. The user is not required to access through mmap regardless - * of the capabilities of the region mmap. - * The vendor driver should determine whether and how to partition the data - * section. The vendor driver should return data_offset accordingly. - * - * The sequence to be followed while in pre-copy state and stop-and-copy state - * is as follows: - * a. Read pending_bytes, indicating the start of a new iteration to get device - * data. Repeated read on pending_bytes at this stage should have no side - * effects. - * If pending_bytes == 0, the user application should not iterate to get data - * for that device. - * If pending_bytes > 0, perform the following steps. - * b. Read data_offset, indicating that the vendor driver should make data - * available through the data section. The vendor driver should return this - * read operation only after data is available from (region + data_offset) - * to (region + data_offset + data_size). - * c. Read data_size, which is the amount of data in bytes available through - * the migration region. 
- * Read on data_offset and data_size should return the offset and size of - * the current buffer if the user application reads data_offset and - * data_size more than once here. - * d. Read data_size bytes of data from (region + data_offset) from the - * migration region. - * e. Process the data. - * f. Read pending_bytes, which indicates that the data from the previous - * iteration has been read. If pending_bytes > 0, go to step b. - * - * The user application can transition from the _SAVING|_RUNNING - * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the - * number of pending bytes. The user application should iterate in _SAVING - * (stop-and-copy) until pending_bytes is 0. - * - * The sequence to be followed while _RESUMING device state is as follows: - * While data for this device is available, repeat the following steps: - * a. Read data_offset from where the user application should write data. - * b. Write migration data starting at the migration region + data_offset for - * the length determined by data_size from the migration source. - * c. Write data_size, which indicates to the vendor driver that data is - * written in the migration region. Vendor driver must return this write - * operations on consuming data. Vendor driver should apply the - * user-provided migration region data to the device resume state. - * - * If an error occurs during the above sequences, the vendor driver can return - * an error code for next read() or write() operation, which will terminate the - * loop. The user application should then take the next necessary action, for - * example, failing migration or terminating the user application. - * - * For the user application, data is opaque. The user application should write - * data in the same order as the data is received and the data should be of - * same transaction size at the source. + * This field indicates the length of the current data segment in bytes. + * While SAVING, the kernel migration driver uses this field to indicate to + * the user the length of the migration data stream available at data_offset. + * When RESUMING, the user writes this field with the length of the data + * segment written at the migration driver provided data_offset. The value + * of this field is undefined when device_state does not include either + * SAVING or RESUMING. + * + * The following protocol is used while the device is in the SAVING + * device_state: + * + * a) The user reads pending_bytes. If the read value is zero, no data is + * currently available for the device. If the device is !RUNNING and a + * zero value is read, this indicates the end of the device migration + * stream and the device must not generate any new migration data. If + * the read value is non-zero, the user may proceed to collect device + * migration data in step b). Repeated reads of pending_bytes is allowed + * and must not compromise the migration data stream provided the user does + * not proceed to the following step. + * b) The user reads data_offset, which indicates to the migration driver to + * make a segment of device migration data available to the user at the + * provided offset. This action commits the user to collect the data + * segment. + * c) The user reads data_size to determine the extent of the currently + * available migration data segment. + * d) The user collects the data_size segment of device migration data at the + * previously provided data_offset using access methods compatible to those + * for the migration region. 
The user must not be required to collect the + * data in a single operation. + * e) The user re-reads pending_bytes to indicate to the migration driver that + * the provided data has been collected. Provided the read pending_bytes + * value is non-zero, the user may proceed directly to step b) for another + * iteration. + * + * The following protocol is used while the device is in the RESUMING + * device_state: + * + * a) The user reads data_offset, which directs the user to the location + * within the migration region to store the migration data segment. + * b) The user writes the migration data segment starting at the provided + * data_offset. The user must preserve the data segment size as used when + * the segment was collected from the device when SAVING. + * c) The user writes the data_size field with the number of bytes written to + * the migration region in step b). The kernel migration driver may use + * this write to indicate the end of the current iteration. + * d) User proceeds to step a) so long as the migration data stream is not + * complete. + * + * The kernel migration driver may indicate an error condition by returning + * a fault on read(2) or write(2) for any operation most approximate to the + * detection of the error. Field accesses are provided within the protocol + * such that an opportunity exists to return a fault regardless of whether the + * data section is directly accessed via an mmap. + * + * The user must consider the migration data segments to be opaque and + * non-fungible. During RESUMING, the data segments must be written in the + * size and order as provided during SAVING, irrespective of other bits within + * the device_state bitfield (ex. a transition to !RUNNING). */ struct vfio_device_migration_info { @@ -609,21 +621,25 @@ struct vfio_device_migration_info { #define VFIO_DEVICE_STATE_RUNNING (1 << 0) #define VFIO_DEVICE_STATE_SAVING (1 << 1) #define VFIO_DEVICE_STATE_RESUMING (1 << 2) -#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ - VFIO_DEVICE_STATE_SAVING | \ +#define VFIO_DEVICE_STATE_NDMA (1 << 3) +#define VFIO_DEVICE_STATE_ERROR (VFIO_DEVICE_STATE_SAVING | \ VFIO_DEVICE_STATE_RESUMING) +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ + VFIO_DEVICE_STATE_SAVING | \ + VFIO_DEVICE_STATE_RESUMING | \ + VFIO_DEVICE_STATE_NDMA) #define VFIO_DEVICE_STATE_VALID(state) \ - (state & VFIO_DEVICE_STATE_RESUMING ? \ - (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1) + (!((state & VFIO_DEVICE_STATE_SAVING) && \ + (state & VFIO_DEVICE_STATE_RESUMING)) && \ + !((state & VFIO_DEVICE_STATE_RESUMING) && \ + (state & VFIO_DEVICE_STATE_RUNNING))) #define VFIO_DEVICE_STATE_IS_ERROR(state) \ - ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \ - VFIO_DEVICE_STATE_RESUMING)) + ((state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_ERROR) #define VFIO_DEVICE_STATE_SET_ERROR(state) \ - ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \ - VFIO_DEVICE_STATE_RESUMING) + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR) __u32 reserved; __u64 pending_bytes; @@ -631,6 +647,13 @@ struct vfio_device_migration_info { __u64 data_size; }; +/* + * The Migration NDMA capability is exposed on a device Migration region + * to indicate support for the VFIO_DEVICE_STATE_NDMA bit of + * vfio_device_migration_info.device_state. 
+ */ +#define VFIO_REGION_INFO_CAP_MIG_NDMA 6 + /* * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped * which allows direct access to non-MSIX registers which happened to be within
A new NDMA state is being proposed to support a quiescent state for contexts containing multiple devices with peer-to-peer DMA support. Formally define it. Clarify various aspects of the migration region data fields and protocol. Remove QEMU related terminology and flows from the uAPI; these will be provided in Documentation/ so as not to confuse the device_state bitfield with a finite state machine with restricted state transitions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- include/uapi/linux/vfio.h | 405 ++++++++++++++++++++++++--------------------- 1 file changed, 214 insertions(+), 191 deletions(-)
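[For reference, a minimal userspace sketch of the SAVING-side flow the patch text above describes, steps a) through e). The mig_region wrapper, the sink callback, and the use of pread() at region offset plus field offset are assumptions for the example; mmap access to the data window, where offered, would be equally valid. Error handling is collapsed to -1.]

#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

struct mig_region {
	int fd;		/* vfio device fd */
	off_t base;	/* offset of the migration region within the fd */
};

static int read_field(struct mig_region *mig, off_t off, void *buf, size_t len)
{
	return pread(mig->fd, buf, len, mig->base + off) == (ssize_t)len ? 0 : -1;
}

/* Field offsets per struct vfio_device_migration_info. */
#define OFF_PENDING_BYTES	8
#define OFF_DATA_OFFSET		16
#define OFF_DATA_SIZE		24

/* sink() consumes one collected segment and returns 0 on success. */
static int save_iterate(struct mig_region *mig,
			int (*sink)(const void *buf, uint64_t len))
{
	uint64_t pending, data_offset, data_size;

	/* a) read pending_bytes; zero means nothing to collect right now */
	if (read_field(mig, OFF_PENDING_BYTES, &pending, sizeof(pending)))
		return -1;

	while (pending) {
		void *buf;

		/* b) reading data_offset commits us to collect this segment */
		if (read_field(mig, OFF_DATA_OFFSET, &data_offset, sizeof(data_offset)) ||
		    /* c) data_size tells us how much is staged there */
		    read_field(mig, OFF_DATA_SIZE, &data_size, sizeof(data_size)))
			return -1;

		/* d) collect the segment (mmap access would also be valid) */
		buf = malloc(data_size);
		if (!buf || read_field(mig, data_offset, buf, data_size) ||
		    sink(buf, data_size)) {
			free(buf);
			return -1;
		}
		free(buf);

		/* e) re-reading pending_bytes acknowledges the segment */
		if (read_field(mig, OFF_PENDING_BYTES, &pending, sizeof(pending)))
			return -1;
	}
	return 0;
}

[Note the loop only decides "zero or non-zero" on pending_bytes, as the text requires; when the device is !RUNNING, a zero read here marks the end of the migration stream.]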