[v3] vfio: Documentation for the migration region

Message ID 0-v3-184b374ad0a8+24c-vfio_mig_doc_jgg@nvidia.com (mailing list archive)
State New, archived
Series: [v3] vfio: Documentation for the migration region

Commit Message

Jason Gunthorpe Dec. 7, 2021, 5:13 p.m. UTC
Provide some more complete documentation for the migration region's
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

To: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
 1 file changed, 300 insertions(+), 1 deletion(-)

v3:
 - s/migration_state/device_state
 - Redo how the migration data moves to better capture how pre-copy works
 - Require entry to RESUMING to always succeed, without prior reset
 - Move SAVING | RUNNING to after RUNNING in the precedence list
 - Reword the discussion of devices that have migration control registers
   in the same function
v2: https://lore.kernel.org/r/0-v2-45a95932a4c6+37-vfio_mig_doc_jgg@nvidia.com
 - RST fixups for sphinx rendering
 - Include the priority order for multi-bit changes
 - Add a small discussion on devices like hns with migration control inside
   the same function as is being migrated.
 - Language cleanups from v1; the diff shows almost every line was touched in some way
v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com


base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049

Comments

Alex Williamson Dec. 9, 2021, 11:34 p.m. UTC | #1
On Tue,  7 Dec 2021 13:13:00 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> Provide some more complete documentation for the migration region's
> behavior, specifically focusing on the device_state bits and the whole
> system view from a VMM.
> 
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
> Cc: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Yishai Hadas <yishaih@nvidia.com>
> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
>  1 file changed, 300 insertions(+), 1 deletion(-)

I'm sending a rewrite of the uAPI separately.  I hope this brings it
more in line with what you consider to be a viable specification and
perhaps makes some of the below new documentation unnecessary.  I took
a stab at including the new documentation here in that rewrite, but
frankly there are sections here that I don't know what you're trying to
show.  That makes it really, really hard to give the specific revision
advice you're looking for.

As noted in the changelog for my patch, I've removed the QEMU
terminology and any relation to use by a VMM.  That's the sort of thing
that makes sense to me to move here.

I'll attempt to provide more specifics regarding my stumbling blocks
for this document below...

> diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> index c663b6f978255b..2ff47823a889b4 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -242,7 +242,306 @@ group and can access them as follows::
>  VFIO User API
>  -------------------------------------------------------------------------------
>  
> -Please see include/linux/vfio.h for complete API documentation.
> +Please see include/uapi/linux/vfio.h for complete API documentation.
> +
> +-------------------------------------------------------------------------------
> +
> +VFIO migration driver API
> +-------------------------------------------------------------------------------
> +
> +VFIO drivers that support migration implement a migration control register
> +called device_state in the struct vfio_device_migration_info which is in its
> +VFIO_REGION_TYPE_MIGRATION region.
> +
> +The device_state controls both device action and continuous behavior.
> +Setting/clearing bit groups triggers device action, and each bit controls a
> +continuous device behavior.

This notion of device actions and continuous behavior seems to make
such a simple concept incredibly complicated.  We have "is the device
running or not" and a new modifier bit to that, and which mode is the
migration region, off, saving, or resuming.  Seems simple enough, but I
can't follow your bit groups below.

> +
> +Along with the device_state, the migration driver provides a data window
> +which allows streaming migration data into or out of the device. The entire
> +migration data, up to the end of stream, must be transported from the saving
> +to the resuming side.
> +
> +A lot of flexibility is provided to user-space in how it operates these
> +bits. What follows is a reference flow for saving device state in a live
> +migration, with all features, and an illustration of how the external
> +non-VFIO entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
> +
> +  RUNNING, VCPU_RUNNING
> +     Normal operating state
> +  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> +     Log DMAs
> +
> +     Stream all memory
> +  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> +     Log internal device changes (pre-copy)
> +
> +     Stream device state through the migration window
> +
> +     While in this state repeat as desired:
> +
> +	Atomic Read and Clear DMA Dirty log
> +
> +	Stream dirty memory
> +  SAVING | NDMA | RUNNING, VCPU_RUNNING
> +     vIOMMU grace state
> +
> +     Complete all in progress IO page faults, idle the vIOMMU
> +  SAVING | NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +
> +     Final snapshot of DMA dirty log (atomic not required)
> +  SAVING
> +     Stream final device state through the migration window
> +
> +     Copy final dirty data

So yes, let's move use of migration region in support of a VMM here,
but as I mentioned in the last round, these notes per state are all
over the map and some of them barely provide enough of a clue to know
what you're getting at.  Let's start simple and build.


> +  0
> +     Device is halted

We don't care what the device state goes to after we're done collecting
data from it.

> +
> +and the reference flow for resuming:
> +
> +  RUNNING
> +     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
> +  RESUMING
> +     Push in migration data.
> +  NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +  RUNNING, VCPU_RUNNING
> +     Normal operating state
> +
> +If the VMM has multiple VFIO devices undergoing migration then the grace
> +states act as cross device synchronization points. The VMM must bring all
> +devices to the grace state before advancing past it.

Why?  (rhetorical)  Describe that we can't stop all devices atomically,
therefore we need the running-but-not-initiating state to quiesce the
system to finish up saving, and the same because we can't atomically
release all devices on the restoring end.

> +
> +The above reference flows are built around specific requirements on the
> +migration driver for its implementation of the device_state input.

As noted in previous review, the above sentence is just words that
convey no meaning, at least not to me.

> +The device_state cannot change asynchronously; upon writing the
> +device_state the driver will either keep the current state and return
> +failure, return failure and go to ERROR, or succeed and go to the new state.

This is spelled out pretty clearly in the uAPI.

> +Event triggered actions happen when user-space requests a new device_state
> +that differs from the current device_state. Actions happen on a bit group
> +basis:
> +
> + SAVING

Does this mean the entire new device_state is (SAVING) or does this
mean that we set the SAVING bit independent of all other bits?

> +   The device clears the data window and prepares to stream migration data.
> +   The entire data from the start of SAVING to the end of stream is transferred
> +   to the other side to execute a resume.

"Clearing the data window" is an implementation, each iteration of the
migration protocol provides "something" in the data window.  The
migration driver could take no action when SAVING is set and simply
evaluate what the current device state is when pending_bytes is read.

> +
> + SAVING | RUNNING

If we're trying to model typical usage scenarios, it's confusing that
we started with SAVING and jumped back to (SAVING | RUNNING).

> +   The device begins streaming 'pre-copy' migration data through the window.
> +
> +   A device that does not support internal state logging should return a 0
> +   length stream.
> +
> +   The migration window may reach an end of stream; this can be a permanent or
> +   temporary condition.
> +
> +   User space can do SAVING | !RUNNING at any time; any in-progress transfer
> +   through the migration window is carried forward.

By these "bit groups", does a device get from SAVING | RUNNING to
SAVING | !RUNNING by a !RUNNING action?  I'm clearly still lost in the
terminology.

> +
> +   This allows the device to implement a dirty log for its internal state.
> +   During this state the data window should present the device state being
> +   logged and during SAVING | !RUNNING the data window should transfer the
> +   dirtied state and conclude the migration data.

As we discussed in the previous revision, invariant data could also
reasonably be included here.  We're again sort of pushing an
implementation agenda, but the more useful thing to include here would
be to say something about how drivers and devices should attempt to
support any bulk data in this pre-copy phase in order to allow
userspace to perform a migration with minimal actual time in the next
state.

> +
> +   The state is only concerned with internal device state. External DMAs are
> +   covered by the separate DIRTY_TRACKING function.
> +
> + SAVING | !RUNNING

And this means we set SAVING and cleared RUNNING, and only those bits
or independent of other bits?  Give your reader a chance to follow
along even if you do expect them to read it a few times for it all to
sink in.

> +   The device captures its internal state and streams it through the
> +   migration window.
> +
> +   When the migration window reaches an end of stream the saving is concluded
> +   and there is no further data. All of the migration data streamed from the
> +   time SAVING starts to this final end of stream is concatenated together
> +   and provided to RESUMING.
> +
> +   Devices that cannot log internal state changes stream all their device
> +   state here.
> +
> + RESUMING
> +   The data window is cleared, opened, and can receive the migration data
> +   stream. The device must always be able to enter RESUMING; the driver may
> +   reset the device to achieve this.
> +
> + !RESUMING
> +   All the data transferred into the data window is loaded into the device's
> +   internal state.
> +
> +   The internal state of a device is undefined while in RESUMING. To abort a
> +   RESUMING and return to a known state issue a VFIO_DEVICE_RESET.

I've clarified aspects of this in the uAPI, maybe we don't need to make
this recommendation.

> +
> +   If the migration data is invalid then the ERROR state must be set.

I don't know why we're specifying this; it's at the driver's discretion
to use the ERROR state, but we tend to suggest it for irrecoverable
errors.  Maybe any such error here could be considered irrecoverable,
or maybe the last data segment was missing and once it's added we can
continue.

> +
> +Continuous actions are in effect when device_state bit groups are active:
> +
> + RUNNING | NDMA
> +   The device is not allowed to issue new DMA operations.
> +
> +   Whenever the kernel returns with a device_state of NDMA there can be no
> +   in progress DMAs.

"kernel returns with" is rather strange here, the kernel can't
autonomously get to NDMA.

> +
> + !RUNNING
> +   The device should not change its internal state. Further implies the NDMA
> +   behavior above.
> +
> + SAVING | !RUNNING
> +   RESUMING | !RUNNING
> +   The device may assume there are no incoming MMIO operations.
> +
> +   Internal state logging can stop.
> +
> + RUNNING
> +   The device can alter its internal state and must respond to incoming MMIO.
> +
> + SAVING | RUNNING
> +   The device is logging changes to the internal state.

It all rather seems like a hodgepodge of random notes that are hard to
piece together.  Can we collect them better?

> +
> + ERROR
> +   The behavior of the device is largely undefined. The device must be
> +   recovered by issuing VFIO_DEVICE_RESET or closing the device file
> +   descriptor.
> +
> +   However, devices supporting NDMA must behave as though NDMA is asserted
> +   during ERROR to avoid corrupting other devices or a VM during a failed
> +   migration.

As clarified in the uAPI, we chose the invalid state that we did as the
error state specifically because of the !RUNNING value.  Migration
drivers should honor that, therefore NDMA in ERROR state is irrelevant.

> +
> +When multiple bits change in the device_state they may describe multiple event
> +triggered actions, and multiple changes to continuous actions.  The migration
> +driver must process the new device_state bits in a priority order:
> +
> + - NDMA
> + - !RUNNING
> + - SAVING | !RUNNING
> + - RESUMING
> + - !RESUMING
> + - RUNNING
> + - SAVING | RUNNING
> + - !NDMA

I'll hold my comments on this since you proposed another variant later.

> +
> +In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
> +device back to device_state RUNNING. When a migration driver executes this
> +ioctl it should discard the data window and set device_state to RUNNING as
> +part of resetting the device to a clean state. This must happen even if the
> +device_state has errored. A freshly opened device FD should always be in
> +the RUNNING state.

Pretty clear in the uAPI imo.

> +
> +The migration driver has limitations on what device state it can affect. Any
> +device state controlled by general kernel subsystems must not be changed
> +during RESUMING, and SAVING must tolerate mutation of this state. Change to
> +externally controlled device state can happen at any time, asynchronously to
> +the migration (i.e. interrupt rebalancing).
> +
> +Some examples of externally controlled state:
> + - MSI-X interrupt page
> + - MSI/legacy interrupt configuration
> + - Large parts of the PCI configuration space, i.e. common control bits
> + - PCI power management
> + - Changes via VFIO_DEVICE_SET_IRQS
> +
> +During !RUNNING, especially during SAVING and RESUMING, the device may have
> +limitations on what it can tolerate. An ideal device will discard/return all
> +ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the
> +external state above) in !RUNNING. However, devices are free to have undefined
> +behavior if they receive incoming operations. This includes
> +corrupting/aborting the migration, dirtying pages, and segfaulting user-space.
> +
> +However, a device may not compromise system integrity if it is subjected to an
> +MMIO. It cannot trigger an error TLP, it cannot trigger an x86 Machine Check
> +or similar, and it cannot compromise device isolation.

Yes, this is the sort of stuff that makes sense to me here.

> +There are several edge cases that user-space should keep in mind when
> +implementing migration:
> +
> +- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
> +  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
> +  the IOMMU.
> +
> +  As Peer to Peer DMA is an MMIO touch like any other, it is important that
> +  userspace suspend these accesses before entering any device_state where MMIO
> +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> +  state. Userspace may also choose to never install MMIO mappings into the
> +  IOMMU if devices do not support NDMA and rely on that to guarantee quiet
> +  MMIO.
> +
> +  The Peer to Peer Grace States exist so that all devices may reach RUNNING
> +  before any device is subjected to an MMIO access.
> +
> +  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
> +  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration
> +  on devices that cannot protect themselves.

Some of the stuff in the beginning would make a lot more sense
following this description.  This is definitely the type of thing that
belongs in this document.

> +
> +- IOMMU Page faults handled in user-space can occur at any time. A migration
> +  driver is not required to serialize in-progress page faults. It can assume
> +  that all page faults are completed before entering SAVING | !RUNNING. Since
> +  the guest VCPU is required to complete page faults the VMM can accomplish
> +  this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
> +  before clearing VCPU_RUNNING.
> +
> +  Devices that do not support NDMA cannot be configured to generate page faults
> +  that require the VCPU to complete.
> +
> +- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
> +  cannot support this, then NDMA could be used to synthesize it less
> +  efficiently.
> +
> +- NDMA is optional. If the device does not support this then the NDMA States
> +  are pushed down to the next step in the sequence and various behaviors that
> +  rely on NDMA cannot be used.
> +
> +  NDMA is made optional to support simple HW implementations that either just
> +  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
> +  necessary for special features like P2P and PRI, so devices can omit it in
> +  exchange for limitations on the guest.

Maybe we can emphasize this a little more as it's potentially pretty
significant.  Developers should not just think of their own device in
isolation, but their device in the context of devices that may have
performance, if not functional, restrictions with those limitations.

> +
> +- Devices that have their HW migration control MMIO registers inside the same
> +  iommu_group as the VFIO device have some special considerations. In this
> +  case a driver will be operating HW registers from kernel space that are also
> +  subjected to userspace controlled DMA due to the iommu_group.
> +
> +  This immediately raises a security concern as user-space can use Peer to
> +  Peer DMA to manipulate these migration control registers concurrently with
> +  any kernel actions.
> +
> +  A device driver operating such a device must ensure that kernel integrity
> +  cannot be broken by hostile user space operating the migration MMIO
> +  registers via peer to peer, at any point in the sequence. Further the kernel
> +  cannot use DMA to transfer any migration data.
> +
> +  However, as discussed above in the "Device Peer to Peer DMA" section, it can
> +  assume quiet MMIO as a condition to have a successful and uncorrupted
> +  migration.
> +
> +To elaborate details on the reference flows, they assume the following details
> +about the external behaviors:
> +
> + !VCPU_RUNNING
> +   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
> +   operations to devices.  For a VMM this would typically be controlled by
> +   KVM.
> +
> + DIRTY_TRACKING
> +   Clear the DMA log and start DMA logging
> +
> +   DMA logs should be readable with an "atomic test and clear" to allow
> +   continuous non-disruptive sampling of the log.
> +
> +   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
> +   fd.
> +
> + !DIRTY_TRACKING
> +   Freeze the DMA log, stop tracking and allow user-space to read it.
> +
> +   If user-space is going to have any use of the dirty log it must ensure that
> +   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
> +   NDMA or !RUNNING on all VFIO devices.

Minimally there should be reference markers to direct to these
definitions before they were thrown at the reader in the beginning, but
better yet would be to adjust the flow to make them unnecessary.

> +
> +
> +TBD - discoverable feature flag for NDMA

Updated in the uAPI spec.  Thanks,

Alex

> +TBD IMS xlation
> +TBD PASID xlation
>  
>  VFIO bus driver API
>  -------------------------------------------------------------------------------
> 
> base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
Jason Gunthorpe Dec. 10, 2021, 12:46 a.m. UTC | #2
On Thu, Dec 09, 2021 at 04:34:57PM -0700, Alex Williamson wrote:
> On Tue,  7 Dec 2021 13:13:00 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Provide some more complete documentation for the migration region's
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> > 
> > To: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Cc: kvm@vger.kernel.org
> > Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
> > Cc: Kirti Wankhede <kwankhede@nvidia.com>
> > Cc: Yishai Hadas <yishaih@nvidia.com>
> > Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
> >  1 file changed, 300 insertions(+), 1 deletion(-)
> 
> I'm sending a rewrite of the uAPI separately.  I hope this brings it
> more in line with what you consider to be a viable specification and
> perhaps makes some of the below new documentation unnecessary.

It is far better than what was there before, and sufficiently terse that
it is OK in a header file. Really, what you've got there is quite a great
job.

Honestly, I don't think I can write something at quite that level, if
that is your expectation of what we need to achieve here...

> > +-------------------------------------------------------------------------------
> > +
> > +VFIO migration driver API
> > +-------------------------------------------------------------------------------
> > +
> > +VFIO drivers that support migration implement a migration control register
> > +called device_state in the struct vfio_device_migration_info which is in its
> > +VFIO_REGION_TYPE_MIGRATION region.
> > +
> > +The device_state controls both device action and continuous behavior.
> > +Setting/clearing bit groups triggers device action, and each bit controls a
> > +continuous device behavior.
> 
> This notion of device actions and continuous behavior seems to make
> such a simple concept incredibly complicated.  We have "is the device
> running or not" and a new modifier bit to that, and which mode is the
> migration region, off, saving, or resuming.  Seems simple enough, but I
> can't follow your bit groups below.

It is an effort to bridge from the very simple view you wrote to a
fuller understanding what the driver should be implementing.

We must talk about SAVING|RUNNING / SAVING|!RUNNING together to be able
to explain everything that is going on.

But we probably don't want the introductory paragraphs at all. Let's
just refer to the header file and explain that the following discussion
elaborates on it.

> > +Along with the device_state, the migration driver provides a data window
> > +which allows streaming migration data into or out of the device. The entire
> > +migration data, up to the end of stream, must be transported from the saving
> > +to the resuming side.
> > +
> > +A lot of flexibility is provided to user-space in how it operates these
> > +bits. What follows is a reference flow for saving device state in a live
> > +migration, with all features, and an illustration of how the external
> > +non-VFIO entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
> > +
> > +  RUNNING, VCPU_RUNNING
> > +     Normal operating state
> > +  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > +     Log DMAs
> > +
> > +     Stream all memory
> > +  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > +     Log internal device changes (pre-copy)
> > +
> > +     Stream device state through the migration window
> > +
> > +     While in this state repeat as desired:
> > +
> > +	Atomic Read and Clear DMA Dirty log
> > +
> > +	Stream dirty memory
> > +  SAVING | NDMA | RUNNING, VCPU_RUNNING
> > +     vIOMMU grace state
> > +
> > +     Complete all in progress IO page faults, idle the vIOMMU
> > +  SAVING | NDMA | RUNNING
> > +     Peer to Peer DMA grace state
> > +
> > +     Final snapshot of DMA dirty log (atomic not required)
> > +  SAVING
> > +     Stream final device state through the migration window
> > +
> > +     Copy final dirty data
> 
> So yes, let's move use of migration region in support of a VMM here,
> but as I mentioned in the last round, these notes per state are all
> over the map and some of them barely provide enough of a clue to know
> what you're getting at.  Let's start simple and build.

I'm not sure what you are suggesting?

Combined with the new header file this is much better; it tersely
explains from a VMM point of view what each state is about

Do you think this section should be longer and the section below much
shorter? That might be a better document.

> > +  0
> > +     Device is halted
> 
> We don't care what the device state goes to after we're done collecting
> data from it.

The reference flow is just a reference, choosing to go to 0 is fine,
right?

> > +and the reference flow for resuming:
> > +
> > +  RUNNING
> > +     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
> > +  RESUMING
> > +     Push in migration data.
> > +  NDMA | RUNNING
> > +     Peer to Peer DMA grace state
> > +  RUNNING, VCPU_RUNNING
> > +     Normal operating state
> > +
> > +If the VMM has multiple VFIO devices undergoing migration then the grace
> > +states act as cross device synchronization points. The VMM must bring all
> > +devices to the grace state before advancing past it.
> 
> Why?  (rhetorical)  Describe that we can't stop all devices atomically,
> therefore we need the running-but-not-initiating state to quiesce the
> system to finish up saving, and the same because we can't atomically
> release all devices on the restoring end.

OK

> > +Event triggered actions happen when user-space requests a new device_state
> > +that differs from the current device_state. Actions happen on a bit group
> > +basis:
> > +
> > + SAVING
> 
> Does this mean the entire new device_state is (SAVING) or does this
> mean that we set the SAVING bit independent of all other bits?

It says "actions happen on a bit group basis", so independent of all
other bits as you say

But perhaps we don't need this at all anymore as the header file is
sufficient on its own

> > +   The device clears the data window and prepares to stream migration data.
> > +   The entire data from the start of SAVING to the end of stream is transferred
> > +   to the other side to execute a resume.
> 
> "Clearing the data window" is an implementation, each iteration of the
> migration protocol provides "something" in the data window.  The
> migration driver could take no action when SAVING is set and simply
> evaluate what the current device state is when pending_bytes is read.

It is the same as what you said: "initializes the migration region
data window" 

> > + SAVING | RUNNING
> 
> If we're trying to model typical usage scenarios, it's confusing that
> we started with SAVING and jumped back to (SAVING | RUNNING).

This section isn't about usage scenarios; this is talking about what
the driver must do in all the state combinations. SAVING is
"initializing the data window"

And then the two variations of RUNNING have their own special behaviors.

> > +   This allows the device to implement a dirty log for its internal state.
> > +   During this state the data window should present the device state being
> > +   logged and during SAVING | !RUNNING the data window should transfer the
> > +   dirtied state and conclude the migration data.
> 
> As we discussed in the previous revision, invariant data could also
> reasonably be included here.  We're again sort of pushing an
> implementation agenda, but the more useful thing to include here would
> be to say something about how drivers and devices should attempt to
> support any bulk data in this pre-copy phase in order to allow
> userspace to perform a migration with minimal actual time in the next
> state.

Invariant data is implicitly already "device state being logged" - the
log is always 'no change'

> > +   The state is only concerned with internal device state. External DMAs are
> > +   covered by the separate DIRTY_TRACKING function.
> > +
> > + SAVING | !RUNNING
> 
> And this means we set SAVING and cleared RUNNING, and only those bits
> or independent of other bits?  Give your reader a chance to follow
> along even if you do expect them to read it a few times for it all to
> sink in.

None of this is about set or cleared, where did you get that? The top
paragraph said: "requests a new device_state" - that means only the new
device_state value matters; the change to get there is irrelevant.

> > +   If the migration data is invalid then the ERROR state must be set.
> 
> I don't know why we're specifying this, it's at the driver discretion
> to use the ERROR state, but we tend to suggest it for irrecoverable
> errors.  Maybe any such error here could be considered irrecoverable,
> or maybe the last data segment was missing and once it's added we can
> continue.

This was an explicit statement that seems to contradict what you wrote
in the header. I prefer we are deterministic: if the RESUME fails then
go to ERROR, always. Devices do not have the choice to do something
else.

> > + ERROR
> > +   The behavior of the device is largely undefined. The device must be
> > +   recovered by issuing VFIO_DEVICE_RESET or closing the device file
> > +   descriptor.
> > +
> > +   However, devices supporting NDMA must behave as though NDMA is asserted
> > +   during ERROR to avoid corrupting other devices or a VM during a failed
> > +   migration.
> 
> As clarified in the uAPI, we chose the invalid state that we did as the
> error state specifically because of the !RUNNING value.  Migration
> drivers should honor that, therefore NDMA in ERROR state is irrelevant.

This is another explicit statement that you have contradicted in the
header. I'm not sure mlx5 can implement this. Certainly, it becomes
very hard if we continue to support precedence.

Unwinding an error during a multi-bit sequence and guaranteeing that
we can somehow make it back to !RUNNING is very complex. Several
error scenarios mean the driver has lost control of the device.

I'm not even sure we can do the !NDMA I wrote; in hindsight I don't
think we checked that enough. Yishai noticed all the error unwinding
was broken in mlx5 for precedence cases after I wrote this.

> > +  NDMA is made optional to support simple HW implementations that either just
> > +  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
> > +  necessary for special features like P2P and PRI, so devices can omit it in
> > +  exchange for limitations on the guest.
> 
> Maybe we can emphasize this a little more as it's potentially pretty
> significant.  Developers should not just think of their own device in
> isolation, but their device in the context of devices that may have
> performance, if not functional, restrictions with those limitations.

Ok

> > +
> > +- Devices that have their HW migration control MMIO registers inside the same
> > +  iommu_group as the VFIO device have some special considerations. In this
> > +  case a driver will be operating HW registers from kernel space that are also
> > +  subjected to userspace controlled DMA due to the iommu_group.
> > +
> > +  This immediately raises a security concern as user-space can use Peer to
> > +  Peer DMA to manipulate these migration control registers concurrently with
> > +  any kernel actions.
> > +
> > +  A device driver operating such a device must ensure that kernel integrity
> > +  cannot be broken by hostile user space operating the migration MMIO
> > +  registers via peer to peer, at any point in the sequence. Further the kernel
> > +  cannot use DMA to transfer any migration data.
> > +
> > +  However, as discussed above in the "Device Peer to Peer DMA" section, it can
> > +  assume quiet MMIO as a condition to have a successful and uncorrupted
> > +  migration.
> > +
> > +To elaborate details on the reference flows, they assume the following details
> > +about the external behaviors:
> > +
> > + !VCPU_RUNNING
> > +   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
> > +   operations to devices.  For a VMM this would typically be controlled by
> > +   KVM.
> > +
> > + DIRTY_TRACKING
> > +   Clear the DMA log and start DMA logging
> > +
> > +   DMA logs should be readable with an "atomic test and clear" to allow
> > +   continuous non-disruptive sampling of the log.
> > +
> > +   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
> > +   fd.
> > +
> > + !DIRTY_TRACKING
> > +   Freeze the DMA log, stop tracking and allow user-space to read it.
> > +
> > +   If user-space is going to have any use of the dirty log it must ensure that
> > +   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
> > +   NDMA or !RUNNING on all VFIO devices.
> 
> Minimally there should be reference markers to direct to these
> definitions before they were thrown at the reader in the beginning, but
> better yet would be to adjust the flow to make them unnecessary.

The first draft was ordered like this; Connie felt that was confusing,
so it was moved to the end :)

> > +TBD - discoverable feature flag for NDMA
> 
> Updated in the uAPI spec.  Thanks,

It matches what Yishai did

Jason
Alex Williamson Dec. 13, 2021, 10:16 p.m. UTC | #3
On Thu, 9 Dec 2021 20:46:59 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Dec 09, 2021 at 04:34:57PM -0700, Alex Williamson wrote:
> > On Tue,  7 Dec 2021 13:13:00 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > Provide some more complete documentation for the migration region's
> > > behavior, specifically focusing on the device_state bits and the whole
> > > system view from a VMM.
> > > 
> > > To: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Cornelia Huck <cohuck@redhat.com>
> > > Cc: kvm@vger.kernel.org
> > > Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
> > > Cc: Kirti Wankhede <kwankhede@nvidia.com>
> > > Cc: Yishai Hadas <yishaih@nvidia.com>
> > > Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > >  Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
> > >  1 file changed, 300 insertions(+), 1 deletion(-)  
> > 
> > I'm sending a rewrite of the uAPI separately.  I hope this brings it
> > more in line with what you consider to be a viable specification and
> > perhaps makes some of the below new documentation unnecessary.  
> 
> It is far better than what was there before, and sufficiently terse that
> it is OK in a header file. Really, what you've got there is quite a great
> job.
> 
> Honestly, I don't think I can write something at quite that level, if
> that is your expectation of what we need to achieve here...
> 
> > > +-------------------------------------------------------------------------------
> > > +
> > > +VFIO migration driver API
> > > +-------------------------------------------------------------------------------
> > > +
> > > +VFIO drivers that support migration implement a migration control register
> > > +called device_state in the struct vfio_device_migration_info which is in its
> > > +VFIO_REGION_TYPE_MIGRATION region.
> > > +
> > > +The device_state controls both device action and continuous behavior.
> > > +Setting/clearing bit groups triggers device action, and each bit controls a
> > > +continuous device behavior.  
> > 
> > This notion of device actions and continuous behavior seems to make
> > such a simple concept incredibly complicated.  We have "is the device
> > running or not" and a new modifier bit to that, and which mode is the
> > migration region, off, saving, or resuming.  Seems simple enough, but I
> > can't follow your bit groups below.  
> 
> It is an effort to bridge from the very simple view you wrote to a
> fuller understanding what the driver should be implementing.
> 
> We must talk about SAVING|RUNNING / SAVING|!RUNNING together to be able
> to explain everything that is going on.
> 
> But we probably don't want the introductory paragraphs at all. Let's
> just refer to the header file and explain that the following discussion
> elaborates on it.
> 
> > > +Along with the device_state, the migration driver provides a data window
> > > +which allows streaming migration data into or out of the device. The entire
> > > +migration data, up to the end of stream, must be transported from the saving
> > > +to the resuming side.
> > > +
> > > +A lot of flexibility is provided to user-space in how it operates these
> > > +bits. What follows is a reference flow for saving device state in a live
> > > +migration, with all features, and an illustration of how the external
> > > +non-VFIO entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
> > > +
> > > +  RUNNING, VCPU_RUNNING
> > > +     Normal operating state
> > > +  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > > +     Log DMAs
> > > +
> > > +     Stream all memory
> > > +  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > > +     Log internal device changes (pre-copy)
> > > +
> > > +     Stream device state through the migration window
> > > +
> > > +     While in this state repeat as desired:
> > > +
> > > +	Atomic Read and Clear DMA Dirty log
> > > +
> > > +	Stream dirty memory
> > > +  SAVING | NDMA | RUNNING, VCPU_RUNNING
> > > +     vIOMMU grace state
> > > +
> > > +     Complete all in progress IO page faults, idle the vIOMMU
> > > +  SAVING | NDMA | RUNNING
> > > +     Peer to Peer DMA grace state
> > > +
> > > +     Final snapshot of DMA dirty log (atomic not required)
> > > +  SAVING
> > > +     Stream final device state through the migration window
> > > +
> > > +     Copy final dirty data  
> > 
> > So yes, let's move use of migration region in support of a VMM here,
> > but as I mentioned in the last round, these notes per state are all
> > over the map and some of them barely provide enough of a clue to know
> > what you're getting at.  Let's start simple and build.  
> 
> I'm not sure what you are suggesting?

I'm suggesting that we need to start with the basic device-level view
of migration, i.e. the device starts RUNNING; the VMM may optionally
have a pre-copy stage where the device is both RUNNING and SAVING, where
the value of this state relative to both the device and VMM is briefly
discussed; the mandatory stop-and-copy phase; and the interaction of
data streams collected during those phases with a device in the RESUMING
state.

The next section might fold in how device dirtied pages fit into the
picture.

Another section would fit the idea of DMA grace periods to support
devices engaging in p2p.

I don't expect this to be an entirely self supporting document, the
reader should have some idea how migration of a VMM works, but at the
same time we can't just write a phrase with insufficient context and
expect that if someone reads it enough times they'll figure it out.

> Combined with the new header file this is much better; it tersely
> explains from a VMM point of view what each state is about
> 
> Do you think this section should be longer and the section below much
> shorter? That might be a better document.

My proposed uAPI update removes the mapping of device_state to VMM
migration terminology, so I'd like to see that moved here.  Your
discussion about externally controlled states relative to !RUNNING and
what userspace is allowed to touch without risking interfering with the
migration data stream, as well as the edge cases discussion are all
things that I would think this document update would focus on.

> > > +  0
> > > +     Device is halted  
> > 
> > We don't care what the device state goes to after we're done collecting
> > data from it.  
> 
> The reference flow is just a reference, choosing to go to 0 is fine,
> right?

It's fine that a user can do this; my question is more whether it's
relevant to the flow and if a migration driver author might read
"reference" in ways other than "example" and code their support to
expect such a terminating transition.
 
> > > +and the reference flow for resuming:
> > > +
> > > +  RUNNING
> > > +     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
> > > +  RESUMING
> > > +     Push in migration data.
> > > +  NDMA | RUNNING
> > > +     Peer to Peer DMA grace state
> > > +  RUNNING, VCPU_RUNNING
> > > +     Normal operating state
> > > +
> > > +If the VMM has multiple VFIO devices undergoing migration then the grace
> > > +states act as cross device synchronization points. The VMM must bring all
> > > +devices to the grace state before advancing past it.  
> > 
> > Why?  (rhetorical)  Describe that we can't stop all devices atomically,
> > therefore we need the running-but-not-initiating state to quiesce the
> > system to finish up saving, and the same because we can't atomically
> > release all devices on the restoring end.  
> 
> OK
> 
> > > +Event triggered actions happen when user-space requests a new device_state
> > > +that differs from the current device_state. Actions happen on a bit group
> > > +basis:
> > > +
> > > + SAVING  
> > 
> > Does this mean the entire new device_state is (SAVING) or does this
> > mean that we set the SAVING bit independent of all other bits?
> 
> It says "actions happen on a bit group basis", so independent of all
> other bits as you say
> 
> But perhaps we don't need this at all anymore as the header file is
> sufficient on its own

That would be ideal; otherwise we need to be really careful about
alignment with the header.  I really want the header to be the source
of truth and this document to supplement that with typical usage flows,
considerations how to handle multiple devices, clarifying what !RUNNING
means to the device internal state and resilience to generating host
fault, user access to devices while !RUNNING, etc.
 
> > > +   The device clears the data window and prepares to stream migration data.
> > > +   The entire data from the start of SAVING to the end of stream is transferred
> > > +   to the other side to execute a resume.  
> > 
> > "Clearing the data window" is an implementation, each iteration of the
> > migration protocol provides "something" in the data window.  The
> > migration driver could take no action when SAVING is set and simply
> > evaluate what the current device state is when pending_bytes is read.  
> 
> It is the same as what you said: "initializes the migration region
> data window" 

The connotation is different for me.  If I'm clearing the data window,
I infer that there's something backing the data window that can be
cleared.  The data window might just be a mapping of the internal
device registers.  Clearing those would probably not be a good idea.
OTOH, initializing the data window only suggests to me that the driver
does what is necessary to make the data window usable in this mode for
their implementation.

> > > + SAVING | RUNNING  
> > 
> > If we're trying to model typical usage scenarios, it's confusing that
> > we started with SAVING and jumped back to (SAVING | RUNNING).  
> 
> This section isn't about usage scenarios; this is talking about what
> the driver must do in all the state combinations. SAVING is
> "initializing the data window"
> 
> And then the two variations of RUNNING have their own special behaviors.
> 
> > > +   This allows the device to implement a dirty log for its internal state.
> > > +   During this state the data window should present the device state being
> > > +   logged and during SAVING | !RUNNING the data window should transfer the
> > > +   dirtied state and conclude the migration data.  
> > 
> > As we discussed in the previous revision, invariant data could also
> > reasonably be included here.  We're again sort of pushing an
> > implementation agenda, but the more useful thing to include here would
> > be to say something about how drivers and devices should attempt to
> > support any bulk data in this pre-copy phase in order to allow
> > userspace to perform a migration with minimal actual time in the next
> > state.  
> 
> Invariant data is implicitly already "device state being logged" - the
> log is always 'no change'

That's subtle.  In any case, I was confused last time, and I'm still
confused.  I'd suggest this falls into the category that we describe
how SAVING | RUNNING vs SAVING is consumed by userspace and possible
strategies for both invariant and state that supports dirty logging and
let the migration driver figure out an optimal implementation.
 
> > > +   The state is only concerned with internal device state. External DMAs are
> > > +   covered by the separate DIRTY_TRACKING function.
> > > +
> > > + SAVING | !RUNNING  
> > 
> > And this means we set SAVING and cleared RUNNING, and only those bits
> > or independent of other bits?  Give your reader a chance to follow
> > along even if you do expect them to read it a few times for it all to
> > sink in.  
> 
> None of this is about set or cleared, where did you get that? The top
> paragraph said: "requests a new device_state" - that means only the new
> device_state value matters; the change to get there is irrelevant.

All I can say is that I read it many times and was still not clear how
to process it.  My intention Thursday was to try to contribute some
rewrites to this as well, but I didn't understand it well enough for
that.
 
> > > +   If the migration data is invalid then the ERROR state must be set.  
> > 
> > I don't know why we're specifying this; it's at the driver's discretion
> > to use the ERROR state, but we tend to suggest it for irrecoverable
> > errors.  Maybe any such error here could be considered irrecoverable,
> > or maybe the last data segment was missing and once it's added we can
> > continue.  
> 
> This was an explicit statement that seems to contradict what you wrote
> in the header. I prefer we are deterministic: if the RESUME fails then
> go to ERROR, always. Devices do not have the choice to do something
> else.

The determinism is that if clearing RESUMING fails, the user gets an
errno.  But defining that if that transition fails then the device must
enter the ERROR state removes any opportunity that the failure could be
transient or recoverable.  For what?  It also makes this state
transition failure different than other state transitions and makes
the ERROR state a required state for the device, whereas it's really
meant as a catch-all, internal error recovery path.
 
> > > + ERROR
> > > +   The behavior of the device is largely undefined. The device must be
> > > +   recovered by issuing VFIO_DEVICE_RESET or closing the device file
> > > +   descriptor.
> > > +
> > > +   However, devices supporting NDMA must behave as though NDMA is asserted
> > > +   during ERROR to avoid corrupting other devices or a VM during a failed
> > > +   migration.  
> > 
> > As clarified in the uAPI, we chose the invalid state that we did as the
> > error state specifically because of the !RUNNING value.  Migration
> > drivers should honor that, therefore NDMA in ERROR state is irrelevant.  
> 
> This is another explicit statement that you have contradicted in the
> header. I'm not sure mlx5 can implement this. Certainly, it becomes
> very hard if we continue to support precedence.
> 
> Unwinding an error during a multi-bit sequence and guaranteeing that
> we can somehow make it back to !RUNNING is very complex. Several
> error scenarios mean the driver has lost control of the device.
> 
> I'm not even sure we can do the !NDMA I wrote; in hindsight I don't
> think we checked that enough. Yishai noticed all the error unwinding
> was broken in mlx5 for precedence cases after I wrote this.

I think I phrased it along the lines that the driver should make every
effort to make sure the device is equivalently !RUNNING in the ERROR
state, including device resets.  So there's room should that not be
possible, but I'd expect such a state to be the goal.  We require
device supporting migration to support the RESET ioctl, so the worst
case of unwinding state changes should be to internally reset the
device and report ERROR state.  If the device is still wedged/lost
beyond that, the RESET ioctl from the user should also errno.  At that
point the device remains in ERROR state and cannot be used.

I have concerns if that's not a model mlx5 can support.
 
> > > +  NDMA is made optional to support simple HW implementations that either just
> > > +  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
> > > +  necessary for special features like P2P and PRI, so devices can omit it in
> > > +  exchange for limitations on the guest.  
> > 
> > Maybe we can emphasize this a little more as it's potentially pretty
> > significant.  Developers should not just think of their own device in
> > isolation, but their device in the context of devices that may have
> > performance, if not functional, restrictions with those limitations.  
> 
> Ok
> 
> > > +
> > > +- Devices that have their HW migration control MMIO registers inside the same
> > > +  iommu_group as the VFIO device have some special considerations. In this
> > > +  case a driver will be operating HW registers from kernel space that are also
> > > +  subjected to userspace controlled DMA due to the iommu_group.
> > > +
> > > +  This immediately raises a security concern as user-space can use Peer to
> > > +  Peer DMA to manipulate these migration control registers concurrently with
> > > +  any kernel actions.
> > > +
> > > +  A device driver operating such a device must ensure that kernel integrity
> > > +  cannot be broken by hostile user space operating the migration MMIO
> > > +  registers via peer to peer, at any point in the sequence. Further the kernel
> > > +  cannot use DMA to transfer any migration data.
> > > +
> > > +  However, as discussed above in the "Device Peer to Peer DMA" section, it can
> > > +  assume quiet MMIO as a condition to have a successful and uncorrupted
> > > +  migration.
> > > +
> > > +To elaborate details on the reference flows, they assume the following details
> > > +about the external behaviors:
> > > +
> > > + !VCPU_RUNNING
> > > +   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
> > > +   operations to devices.  For a VMM this would typically be controlled by
> > > +   KVM.
> > > +
> > > + DIRTY_TRACKING
> > > +   Clear the DMA log and start DMA logging
> > > +
> > > +   DMA logs should be readable with an "atomic test and clear" to allow
> > > +   continuous non-disruptive sampling of the log.
> > > +
> > > +   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
> > > +   fd.
> > > +
> > > + !DIRTY_TRACKING
> > > +   Freeze the DMA log, stop tracking and allow user-space to read it.
> > > +
> > > +   If user-space is going to have any use of the dirty log it must ensure that
> > > +   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
> > > +   NDMA or !RUNNING on all VFIO devices.  
> > 
> > Minimally there should be reference markers to direct to these
> > definitions before they were thrown at the reader in the beginning, but
> > better yet would be to adjust the flow to make them unnecessary.  
> 
> The first draft was ordered like this; Connie felt that was confusing,
> so it was moved to the end :)

I only read the comments on the first draft since I was on PTO and v2
was out when I returned.
 
> > > +TBD - discoverable feature flag for NDMA
> > 
> > Updated in the uAPI spec.  Thanks,  
> 
> It matches what Yishai did

Cool.  Thanks,

Alex

Patch

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..2ff47823a889b4 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,306 @@  group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in the struct vfio_device_migration_info which is in its
+VFIO_REGION_TYPE_MIGRATION region.
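+
+For reference, a sketch of that structure as it appears in
+include/uapi/linux/vfio.h; the NDMA bit is assumed here as a proposed
+extension (see the TBD note at the end of this section)::
+
+  struct vfio_device_migration_info {
+          __u32 device_state;         /* VFIO device state */
+  #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+  #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+  #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+  #define VFIO_DEVICE_STATE_NDMA      (1 << 3)    /* proposed, TBD */
+          __u32 reserved;
+          __u64 pending_bytes;        /* data remaining in the current phase */
+          __u64 data_offset;          /* offset of the data window in the region */
+          __u64 data_size;            /* amount of valid data in the window */
+  };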
+
+The device_state controls both device action and continuous behavior.
+Setting/clearing bit groups triggers device action, and each bit controls a
+continuous device behavior.
+
+Along with the device_state, the migration driver provides a data window
+which allows streaming migration data into or out of the device. The entire
+migration data, up to the end of stream, must be transported from the saving
+to the resuming side.
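+
+On the saving side the data window is driven by pending_bytes. A minimal
+sketch of one saving-side iteration, where mig_read() is an assumed helper
+that pread()s the migration region, stream_send() feeds the migration
+stream, and error handling is elided::
+
+  __u64 pending, data_offset, data_size;
+
+  /* A non-zero pending_bytes begins an iteration */
+  mig_read(dev, &pending, offsetof(struct vfio_device_migration_info,
+                                   pending_bytes), sizeof(pending));
+  if (pending) {
+          mig_read(dev, &data_offset,
+                   offsetof(struct vfio_device_migration_info, data_offset),
+                   sizeof(data_offset));
+          mig_read(dev, &data_size,
+                   offsetof(struct vfio_device_migration_info, data_size),
+                   sizeof(data_size));
+          /* Copy the chunk out of the window into the migration stream */
+          mig_read(dev, buf, data_offset, data_size);
+          stream_send(buf, data_size);
+  }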
+
+A lot of flexibility is provided to user-space in how it operates these
+bits. What follows is a reference flow for saving device state in a live
+migration, with all features, and an illustration of how the external
+non-VFIO entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
+Sketches of the corresponding device_state writes follow each flow below.
+
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log DMAs
+
+     Stream all memory
+  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log internal device changes (pre-copy)
+
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+
+	Atomic Read and Clear DMA Dirty log
+
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU_RUNNING
+     vIOMMU grace state
+
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+
+     Copy final dirty data
+  0
+     Device is halted
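+
+Expressed as device_state writes, the saving flow above might be sketched as
+follows, where set_device_state() is an assumed helper that pwrite()s the
+device_state field and stop_vcpus() stands in for the VMM's KVM-side
+control::
+
+  set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
+                        VFIO_DEVICE_STATE_RUNNING);      /* pre-copy */
+  /* ... repeatedly drain the data window and the DMA dirty log ... */
+  set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
+                        VFIO_DEVICE_STATE_NDMA |
+                        VFIO_DEVICE_STATE_RUNNING);      /* grace states */
+  stop_vcpus();                                          /* !VCPU_RUNNING */
+  set_device_state(dev, VFIO_DEVICE_STATE_SAVING);       /* stop and copy */
+  /* ... drain the data window to the end of stream ... */
+  set_device_state(dev, 0);                              /* halt */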
+
+and the reference flow for resuming:
+
+  RUNNING
+     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
+  RESUMING
+     Push in migration data.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU_RUNNING
+     Normal operating state
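+
+A matching resume-side sketch, with the same assumed helpers plus
+mig_write()/stream_recv(), and with each chunk committed by writing
+data_size::
+
+  set_device_state(dev, VFIO_DEVICE_STATE_RESUMING);
+  mig_read(dev, &data_offset,
+           offsetof(struct vfio_device_migration_info, data_offset),
+           sizeof(data_offset));
+  while ((n = stream_recv(buf, sizeof(buf))) > 0) {
+          mig_write(dev, buf, data_offset, n);
+          mig_write(dev, &n, offsetof(struct vfio_device_migration_info,
+                                      data_size), sizeof(n));
+  }
+  set_device_state(dev, VFIO_DEVICE_STATE_NDMA |
+                        VFIO_DEVICE_STATE_RUNNING);      /* grace state */
+  set_device_state(dev, VFIO_DEVICE_STATE_RUNNING);
+  start_vcpus();                                         /* VCPU_RUNNING */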
+
+If the VMM has multiple VFIO devices undergoing migration then the grace
+states act as cross device synchronization points. The VMM must bring all
+devices to the grace state before advancing past it.
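+
+For example, on the resuming side the VMM might synchronize as follows (a
+sketch; dev[] is the set of migrating devices)::
+
+  for (i = 0; i < ndev; i++)
+          set_device_state(dev[i], VFIO_DEVICE_STATE_NDMA |
+                                   VFIO_DEVICE_STATE_RUNNING);
+  /* All devices are now in the grace state, so P2P MMIO cannot reach a
+   * device that is still RESUMING */
+  for (i = 0; i < ndev; i++)
+          set_device_state(dev[i], VFIO_DEVICE_STATE_RUNNING);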
+
+The above reference flows are built around specific requirements on the
+migration driver for its implementation of the device_state input.
+
+The device_state cannot change asynchronously; upon writing the
+device_state the driver will either keep the current state and return
+failure, return failure and go to ERROR, or succeed and go to the new state.
+
+Event triggered actions happen when user-space requests a new device_state
+that differs from the current device_state. Actions happen on a bit group
+basis:
+
+ SAVING
+   The device clears the data window and prepares to stream migration data.
+   The entire data from the start of SAVING to the end of stream is transferred
+   to the other side to execute a resume.
+
+ SAVING | RUNNING
+   The device begins streaming 'pre-copy' migration data through the window.
+
+   A device that does not support internal state logging should return a 0
+   length stream.
+
+   The migration window may reach an end of stream; this can be a permanent or
+   temporary condition.
+
+   User space can do SAVING | !RUNNING at any time; any in-progress transfer
+   through the migration window is carried forward.
+
+   This allows the device to implement a dirty log for its internal state.
+   During this state the data window should present the device state being
+   logged and during SAVING | !RUNNING the data window should transfer the
+   dirtied state and conclude the migration data.
+
+   The state is only concerned with internal device state. External DMAs are
+   covered by the separate DIRTY_TRACKING function.
+
+ SAVING | !RUNNING
+   The device captures its internal state and streams it through the
+   migration window.
+
+   When the migration window reaches an end of stream the saving is concluded
+   and there is no further data. All of the migration data streamed from the
+   time SAVING starts to this final end of stream is concatenated together
+   and provided to RESUMING.
+
+   Devices that cannot log internal state changes stream all their device
+   state here.
+
+ RESUMING
+   The data window is cleared, opened, and can receive the migration data
+   stream. The driver must always be able to enter RESUMING, and it may
+   reset the device to do so.
+
+ !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state.
+
+   The internal state of a device is undefined while in RESUMING. To abort
+   RESUMING and return to a known state, issue VFIO_DEVICE_RESET.
+
+   If the migration data is invalid then the ERROR state must be set.
+
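+As mentioned under SAVING | RUNNING above, one plausible way to drain the
+migration window is the loop below. It assumes the pending_bytes /
+data_offset / data_size protocol of struct vfio_device_migration_info, with
+pending_bytes == 0 read as end of stream (possibly temporary during
+pre-copy); RESUMING would push data in symmetrically with pwrite()::
+
+  #include <stddef.h>
+  #include <stdint.h>
+  #include <unistd.h>
+  #include <linux/vfio.h>
+
+  #define MIG_FIELD(field) \
+          offsetof(struct vfio_device_migration_info, field)
+
+  /* Returns bytes read, 0 at end of stream, -1 on error */
+  static ssize_t read_migration_chunk(int device_fd, off_t mig_off,
+                                      void *buf, size_t buflen)
+  {
+          uint64_t pending, data_offset, data_size;
+
+          if (pread(device_fd, &pending, sizeof(pending),
+                    mig_off + MIG_FIELD(pending_bytes)) != sizeof(pending))
+                  return -1;
+          if (!pending)
+                  return 0;            /* end of stream */
+
+          if (pread(device_fd, &data_offset, sizeof(data_offset),
+                    mig_off + MIG_FIELD(data_offset)) != sizeof(data_offset) ||
+              pread(device_fd, &data_size, sizeof(data_size),
+                    mig_off + MIG_FIELD(data_size)) != sizeof(data_size))
+                  return -1;
+
+          if (data_size > buflen)
+                  data_size = buflen;
+          return pread(device_fd, buf, data_size, mig_off + data_offset);
+  }
+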
+Continuous actions are in effect when device_state bit groups are active:
+
+ RUNNING | NDMA
+   The device is not allowed to issue new DMA operations.
+
+   Whenever the kernel returns with a device_state of NDMA there can be no
+   in-progress DMAs.
+
+ !RUNNING
+   The device should not change its internal state. This further implies the
+   NDMA behavior above.
+
+ SAVING | !RUNNING, RESUMING | !RUNNING
+   The device may assume there are no incoming MMIO operations.
+
+   Internal state logging can stop.
+
+ RUNNING
+   The device can alter its internal state and must respond to incoming MMIO.
+
+ SAVING | RUNNING
+   The device is logging changes to the internal state.
+
+ ERROR
+   The behavior of the device is largely undefined. The device must be
+   recovered by issuing VFIO_DEVICE_RESET or closing the device file
+   descriptor.
+
+   However, devices supporting NDMA must behave as though NDMA is asserted
+   during ERROR to avoid corrupting other devices or a VM during a failed
+   migration.
+
+When multiple bits change in the device_state they may describe multiple
+event-triggered actions and multiple changes to continuous actions. The
+migration driver must process the new device_state bits in the following
+priority order (illustrated in the sketch after this list):
+
+ - NDMA
+ - !RUNNING
+ - SAVING | !RUNNING
+ - RESUMING
+ - !RESUMING
+ - RUNNING
+ - SAVING | RUNNING
+ - !NDMA
+
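+A driver-side sketch of this ordering might compute the changed bits and
+walk them as below. This is illustrative pseudologic only: the short names
+stand for the corresponding VFIO_DEVICE_STATE_* bits, NDMA is the proposed
+bit, and the step_*() callbacks are hypothetical stand-ins for
+driver-specific actions::
+
+  #include <stdint.h>
+  #include <linux/vfio.h>
+
+  #define RUNNING  VFIO_DEVICE_STATE_RUNNING
+  #define SAVING   VFIO_DEVICE_STATE_SAVING
+  #define RESUMING VFIO_DEVICE_STATE_RESUMING
+  #define NDMA     (1 << 3)                    /* proposed */
+
+  /* Hypothetical driver callbacks, assumed implemented elsewhere */
+  void step_quiesce_dma(void);
+  void step_stop_device(void);
+  void step_capture_final_state(void);
+  void step_open_resume_window(void);
+  void step_load_resume_data(void);
+  void step_start_device(void);
+  void step_start_precopy(void);
+  void step_resume_dma(void);
+
+  static int apply_device_state(uint32_t cur, uint32_t new_state)
+  {
+          uint32_t set = new_state & ~cur;
+          uint32_t clr = cur & ~new_state;
+
+          if (set & NDMA)
+                  step_quiesce_dma();          /* NDMA */
+          if (clr & RUNNING)
+                  step_stop_device();          /* !RUNNING */
+          if ((clr & RUNNING) && (new_state & SAVING))
+                  step_capture_final_state();  /* SAVING | !RUNNING */
+          if (set & RESUMING)
+                  step_open_resume_window();   /* RESUMING */
+          if (clr & RESUMING)
+                  step_load_resume_data();     /* !RESUMING */
+          if (set & RUNNING)
+                  step_start_device();         /* RUNNING */
+          if ((set & RUNNING) && (new_state & SAVING))
+                  step_start_precopy();        /* SAVING | RUNNING */
+          if (clr & NDMA)
+                  step_resume_dma();           /* !NDMA */
+          return 0;
+  }
+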
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
+device back to device_state RUNNING. When a migration driver executes this
+ioctl it should discard the data window and set device_state to RUNNING as
+part of resetting the device to a clean state. This must happen even if the
+device_state is ERROR. A freshly opened device FD should always be in the
+RUNNING state.
+
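+A minimal recovery sketch, as a fragment inside the VMM's error path, where
+abort_and_reopen() is a hypothetical fallback::
+
+  #include <sys/ioctl.h>
+  #include <linux/vfio.h>
+
+  if (ioctl(device_fd, VFIO_DEVICE_RESET))
+          abort_and_reopen();  /* last resort: close and reopen the FD */
+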
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed
+during RESUMING, and SAVING must tolerate mutation of this state. Changes to
+externally controlled device state can happen at any time, asynchronously to
+the migration (e.g. interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, e.g. common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard writes and
+return all-ones for reads for all incoming MMIO, PIO, or equivalent
+operations (exclusive of the externally controlled state above) while in
+!RUNNING. However, devices are free to have undefined behavior if they
+receive such operations. This includes corrupting/aborting the migration,
+dirtying pages, and segfaulting user-space.
+
+However, a device may not compromise system integrity if it is subjected to
+an MMIO access. It cannot trigger an error TLP, it cannot trigger an x86
+Machine Check or similar, and it cannot compromise device isolation.
+
+There are several edge cases that user-space should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
+  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
+  the IOMMU.
+
+  As Peer to Peer DMA is an MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to never install MMIO mappings into the
+  IOMMU if devices do not support NDMA and rely on that to guarantee quiet
+  MMIO.
+
+  The Peer to Peer Grace States exist so that all devices may reach RUNNING
+  before any device is subjected to an MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration
+  on devices that cannot protect themselves.
+
+- IOMMU Page faults handled in user-space can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults, the VMM can accomplish
+  this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
+  before clearing VCPU_RUNNING.
+
+  Devices that do not support NDMA cannot be configured to generate page
+  faults that require the VCPU to complete.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional. If the device does not support this then the NDMA States
+  are pushed down to the next step in the sequence and various behaviors that
+  rely on NDMA cannot be used.
+
+  NDMA is made optional to support simple HW implementations that either just
+  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
+  necessary for special features like P2P and PRI, so devices can omit it in
+  exchange for limitations on the guest.
+
+- Devices that have their HW migration control MMIO registers inside the same
+  iommu_group as the VFIO device have some special considerations. In this
+  case a driver will be operating HW registers from kernel space that are
+  also subject to userspace-controlled DMA due to the iommu_group.
+
+  This immediately raises a security concern as user-space can use Peer to
+  Peer DMA to manipulate these migration control registers concurrently with
+  any kernel actions.
+
+  A device driver operating such a device must ensure that kernel integrity
+  cannot be broken by hostile user space operating the migration MMIO
+  registers via peer to peer DMA at any point in the sequence. Further, the
+  kernel cannot use DMA to transfer any migration data.
+
+  However, as discussed above in the "Device Peer to Peer DMA" section, it can
+  assume quiet MMIO as a condition to have a successful and uncorrupted
+  migration.
+
+To elaborate on the reference flows, they assume the following behaviors for
+the external controls:
+
+ !VCPU_RUNNING
+   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
+   operations to devices.  For a VMM this would typically be controlled by
+   KVM.
+
+ DIRTY_TRACKING
+   Clear the DMA log and start DMA logging
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
+   fd (see the sketch after this list).
+
+ !DIRTY_TRACKING
+   Freeze the DMA log, stop tracking and allow user-space to read it.
+
+   If user-space is going to have any use of the dirty log it must ensure that
+   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
+   NDMA or !RUNNING on all VFIO devices.
+
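+A minimal sketch of driving the tracker through the container fd, using the
+VFIO_IOMMU_DIRTY_PAGES ioctl (reading the log back uses
+VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP with struct
+vfio_iommu_type1_dirty_bitmap_get, not shown)::
+
+  #include <sys/ioctl.h>
+  #include <linux/vfio.h>
+
+  static int dirty_tracking(int container_fd, int start)
+  {
+          struct vfio_iommu_type1_dirty_bitmap db = {
+                  .argsz = sizeof(db),
+                  .flags = start ? VFIO_IOMMU_DIRTY_PAGES_FLAG_START :
+                                   VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP,
+          };
+
+          return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &db);
+  }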
+
+TBD - discoverable feature flag for NDMA
+TBD - IMS translation
+TBD - PASID translation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------