Message ID | 20220501123301.127279-1-yishaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] vfio: Introduce DMA logging uAPIs for VFIO device |
On Sun, 1 May 2022 15:33:00 +0300 Yishai Hadas <yishaih@nvidia.com> wrote: > DMA logging allows a device to internally record what DMAs the device is > initiating and report them back to userspace. > > It is part of the VFIO migration infrastructure that allows implementing > dirty page tracking during the pre-copy phase of live migration. > > Only DMA WRITEs are logged, and this API is not connected to > VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. > > This RFC patch shows the expected usage of the uAPIs involved in > DMA logging for a VFIO device-tracker. > > It uses the FEATURE ioctl with its GET/SET/PROBE options as below. > > It exposes a PROBE option to detect if the device supports DMA logging. > > It exposes a SET option to start device DMA logging in a given set of IOVA > ranges. > > It exposes a SET option to stop device DMA logging that was previously > started. > > It exposes a GET option to read back and clear the device DMA log. > > Extra details exist in vfio.h for each specific option in this RFC > patch. > > Note: > To have IOMMU hardware support for dirty pages, the RFC below [1], which > was sent by Joao Martins, can be referenced. > > [1] https://lore.kernel.org/all/2d369e58-8ac0-f263-7b94-fe73917782e1@linux.intel.com/T/ > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > include/uapi/linux/vfio.h | 80 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 80 insertions(+) > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index fea86061b44e..9d0b7e73e999 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -986,6 +986,86 @@ enum vfio_device_mig_state { > VFIO_DEVICE_STATE_RUNNING_P2P = 5, > }; > > +/* > + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging. > + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports > + * DMA logging.
> + * > + * DMA logging allows a device to internally record what DMAs the device is > + * initiating and report them back to userspace. It is part of the VFIO > + * migration infrastructure that allows implementing dirty page tracking > + * during the pre-copy phase of live migration. Only DMA WRITEs are logged, > + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. > + * > + * When DMA logging is started a range of IOVAs to monitor is provided and the > + * device can optimize its logging to cover only the IOVA range given. Each > + * DMA that the device initiates inside the range will be logged by the device > + * for later retrieval. > + * > + * page_size is an input that hints what tracking granularity the device > + * should try to achieve. If the device cannot do the hinted page size then it > + * should pick the next closest page size it supports. On output the device > + * will return the page size it selected. > + * > + * ranges is a pointer to an array of > + * struct vfio_device_feature_dma_logging_range. > + */ > +struct vfio_device_feature_dma_logging_control { > + __aligned_u64 page_size; > + __u32 num_ranges; > + __u32 __reserved; > + __aligned_u64 ranges; > +}; > + > +struct vfio_device_feature_dma_logging_range { > + __aligned_u64 iova; > + __aligned_u64 length; > +}; > + > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3 > + > + > +/* > + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started > + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START > + */ > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4 This seems difficult to use from a QEMU perspective, where a vfio device typically operates on a MemoryListener and we only have visibility to one range at a time. I don't see any indication that LOGGING_START is meant to be cumulative such that userspace could incrementally add ranges to be watched, nor does LOGGING_STOP appear to have any sort of IOVA range granularity.
Is userspace intended to pass the full vCPU physical address range here, and if so would a single min/max IOVA be sufficient? I'm not sure how else we could support memory hotplug while this was enabled. How does this work with IOMMU-based tracking? I assume that if devices share an IOAS we wouldn't be able to exclude devices supporting device-level tracking from the IOAS log. > + > +/* > + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log > + * > + * Query the device's DMA log for written pages within the given IOVA range. > + * During querying the log is cleared for the IOVA range. > + * > + * bitmap is a pointer to an array of u64s that will hold the output bitmap > + * with 1 bit reporting a page_size unit of IOVA. With bit number > + * n = (addr - iova)/page_size, the mapping of IOVA to bits is given by: > + * bitmap[n / 64] & (1ULL << (n % 64)) > + * > + * The input page_size can be any power of two value and does not have to > + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver > + * will format its internal logging to match the reporting page size, possibly > + * by replicating bits if the internal page size is lower than requested. Or setting multiple bits if the internal page size is larger than requested. Is there a bitmap size limit? We've minimally needed to impose limits to reflect limitations of the bitmap code internally in the past. Userspace needs a means to learn such limits. Thanks, Alex > + * > + * Bits will be updated in bitmap using atomic or to allow userspace to > + * combine bitmaps from multiple trackers together. Therefore userspace must > + * zero the bitmap before doing any reports. > + * > + * If any error is returned userspace should assume that the dirty log is > + * corrupted and restart. > + * > + * If DMA logging is not enabled, an error will be returned.
> + * > + */ > +struct vfio_device_feature_dma_logging_report { > + __aligned_u64 iova; > + __aligned_u64 length; > + __aligned_u64 page_size; > + __aligned_u64 bitmap; > +}; > + > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5 > + > /* -------- API for Type1 VFIO IOMMU -------- */ > > /**
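To make the proposed flow concrete, here is a minimal userspace sketch of assembling the LOGGING_START payload described above. The struct definitions are mirrored locally from this RFC (they are not in any released <linux/vfio.h>), the helper name `dma_logging_start_prepare` is purely illustrative, and the actual VFIO_DEVICE_FEATURE ioctl on the device fd is omitted since it needs real hardware:

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structs proposed in this RFC. */
struct vfio_device_feature_dma_logging_range {
	uint64_t iova;
	uint64_t length;
};

struct vfio_device_feature_dma_logging_control {
	uint64_t page_size;	/* in: hinted granularity, out: selected */
	uint32_t num_ranges;
	uint32_t __reserved;
	uint64_t ranges;	/* userspace pointer, carried as a u64 */
};

/* Fill a START control covering a single IOVA range, hinting 4K tracking
 * granularity.  The caller would embed this after the generic
 * vfio_device_feature header and issue VFIO_DEVICE_FEATURE with the
 * SET flag and the DMA_LOGGING_START feature. */
static void dma_logging_start_prepare(
		struct vfio_device_feature_dma_logging_control *ctrl,
		struct vfio_device_feature_dma_logging_range *range,
		uint64_t iova, uint64_t length)
{
	range->iova = iova;
	range->length = length;

	memset(ctrl, 0, sizeof(*ctrl));
	ctrl->page_size = 4096;
	ctrl->num_ranges = 1;
	ctrl->ranges = (uint64_t)(uintptr_t)range;
}
```

Note the explicit `__reserved` pad keeps the struct layout identical on 32- and 64-bit userspace, which is why the uAPI uses `__aligned_u64` members.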
On Mon, May 02, 2022 at 01:07:01PM -0600, Alex Williamson wrote: > > +/* > > + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started > > + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START > > + */ > > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4 > > This seems difficult to use from a QEMU perspective, where a vfio > device typically operates on a MemoryListener and we only have > visibility to one range at a time. I don't see any indication that > LOGGING_START is meant to be cumulative such that userspace could > incrementally add ranges to be watched, nor clearly does LOGGING_STOP > appear to have any sort of IOVA range granularity. Correct, at least mlx5 HW just cannot do a change tracking operation, so userspace must pre-select some kind of IOVA range to monitor based on the current VM configuration. > Is userspace intended to pass the full vCPU physical address range > here, and if so would a single min/max IOVA be sufficient? At least mlx5 doesn't have enough capacity for that. Some reasonable in-between of the current address space, and maybe a speculative extra for hot plug. Multi-range is needed to support some arch cases that have a big discontinuity in their IOVA space, like PPC for instance. > I'm not sure how else we could support memory hotplug while this was > enabled. Most likely memory hot plug events will have to take a 'everything got dirty' hit during pre-copy - not much else we can do with this HW. > How does this work with IOMMU based tracking, I assume that if devices > share an IOAS we wouldn't be able to exclude devices supporting > device-level tracking from the IOAS log. Exclusion is possible, the userspace would have to manually create iommu_domains and attach devices to them with the idea that only iommu_domains for devices it wants to track would have dma dirty tracking turned on. 
But I'm not sure it makes much sense, if the iommu will track dirties, and you have to use it for other devices, then it is unlikely there will be a performance win to inject a device tracker as well. In any case the two interfaces are orthogonal, I would not expect userspace to want to track with both, but if it does everything does work. > Is there a bitmap size limit? Joao's code doesn't have a limit, it pulls user pages incrementally as it goes. I'm expecting VFIO devices to use the same bitmap library as the IOMMU drivers so we have consistent reporting. Jason
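The bitmap semantics discussed here can be sketched as follows — assuming, as the uAPI comment suggests, one bit per page_size unit packed into an array of u64 words (bit index = page offset within the queried range), and atomic OR for combining the output of multiple trackers into one user bitmap. Illustrative userspace helpers, not kernel code:

```c
#include <stdint.h>
#include <stddef.h>

/* Test whether the page containing 'addr' is marked dirty in a REPORT
 * bitmap covering IOVA range starting at 'iova' with 'page_size'
 * granularity.  The bit index is the page offset within the range,
 * packed 64 bits per u64 word. */
static int dma_log_test_dirty(const uint64_t *bitmap, uint64_t iova,
			      uint64_t page_size, uint64_t addr)
{
	uint64_t n = (addr - iova) / page_size;

	return (bitmap[n / 64] >> (n % 64)) & 1;
}

/* Merge one tracker's bitmap into a combined one with atomic OR, the
 * same update rule the proposed uAPI text gives for drivers; userspace
 * must have zeroed 'dst' before the first report. */
static void dma_log_merge(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		__atomic_fetch_or(&dst[i], src[i], __ATOMIC_RELAXED);
}
```

`__atomic_fetch_or` is the GCC/Clang builtin; C11 `atomic_fetch_or` would be the portable equivalent.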
On Mon, 2 May 2022 16:25:41 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, May 02, 2022 at 01:07:01PM -0600, Alex Williamson wrote: > > > > +/* > > > + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started > > > + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START > > > + */ > > > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4 > > > > This seems difficult to use from a QEMU perspective, where a vfio > > device typically operates on a MemoryListener and we only have > > visibility to one range at a time. I don't see any indication that > > LOGGING_START is meant to be cumulative such that userspace could > > incrementally add ranges to be watched, nor clearly does LOGGING_STOP > > appear to have any sort of IOVA range granularity. > > Correct, at least mlx5 HW just cannot do a change tracking operation, > so userspace must pre-select some kind of IOVA range to monitor based > on the current VM configuration. > > > Is userspace intended to pass the full vCPU physical address range > > here, and if so would a single min/max IOVA be sufficient? > > At least mlx5 doesn't have enough capacity for that. Some reasonable > in-between of the current address space, and maybe a speculative extra > for hot plug. Ah great, implicit limitations not reported to the user that I hadn't even guessed! How does a user learn about any limitations in the number of ranges or size of each range? > Multi-range is needed to support some arch cases that have a big > discontinuity in their IOVA space, like PPC for instance. > > > I'm not sure how else we could support memory hotplug while this was > > enabled. > > Most likely memory hot plug events will have to take a 'everything got > dirty' hit during pre-copy - not much else we can do with this HW. > > > How does this work with IOMMU based tracking, I assume that if devices > > share an IOAS we wouldn't be able to exclude devices supporting > > device-level tracking from the IOAS log. 
> > Exclusion is possible, the userspace would have to manually create > iommu_domains and attach devices to them with the idea that only > iommu_domains for devices it wants to track would have dma dirty > tracking turned on. Well yeah, but that's the separate IOAS solution. > But I'm not sure it makes much sense, if the iommu will track dirties, > and you have to use it for another devices, then it is unlikely there > will be a performance win to inject a device tracker as well. > > In any case the two interfaces are orthogonal, I would not expect > userspace to want to track with both, but if it does everything does > work. > > > Is there a bitmap size limit? > > Joao's code doesn't have a limit, it pulls user pages incrementally as > it goes. > > I'm expecting VFIO devices to use the same bitmap library as the IOMMU > drivers so we have a consistent reporting. I haven't reviewed that series in any detail yet, but it seems to impose the same bitmap size and reporting to userspace features as type1 based on internal limits of bitmap_set(). Thanks, Alex
On Mon, May 02, 2022 at 01:58:37PM -0600, Alex Williamson wrote: > On Mon, 2 May 2022 16:25:41 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, May 02, 2022 at 01:07:01PM -0600, Alex Williamson wrote: > > > > > > +/* > > > > + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started > > > > + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START > > > > + */ > > > > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4 > > > > > > This seems difficult to use from a QEMU perspective, where a vfio > > > device typically operates on a MemoryListener and we only have > > > visibility to one range at a time. I don't see any indication that > > > LOGGING_START is meant to be cumulative such that userspace could > > > incrementally add ranges to be watched, nor clearly does LOGGING_STOP > > > appear to have any sort of IOVA range granularity. > > > > Correct, at least mlx5 HW just cannot do a change tracking operation, > > so userspace must pre-select some kind of IOVA range to monitor based > > on the current VM configuration. > > > > > Is userspace intended to pass the full vCPU physical address range > > > here, and if so would a single min/max IOVA be sufficient? > > > > At least mlx5 doesn't have enough capacity for that. Some reasonable > > in-between of the current address space, and maybe a speculative extra > > for hot plug. > > Ah great, implicit limitations not reported to the user that I hadn't > even guessed! How does a user learn about any limitations in the > number of ranges or size of each range? There is some limit of number of ranges and total aggregate address space, you think we should report rather than try-and-fail? I guess total address space and total number of ranges is easy to report, but I don't quite know what userspace will do with it? > > > How does this work with IOMMU based tracking, I assume that if devices > > > share an IOAS we wouldn't be able to exclude devices supporting > > > device-level tracking from the IOAS log. 
> > > > Exclusion is possible, the userspace would have to manually create > > iommu_domains and attach devices to them with the idea that only > > iommu_domains for devices it wants to track would have dma dirty > > tracking turned on. > > Well yeah, but that's the separate IOAS solution. Sure, you can't disable tracking done at the iommu_domain level without creating different iommu_domains. The IOAS can be shared, userspace just has to be aware of, or perhaps explicitly control, the assignment of iommu_domains to devices under the IOAS. > > I'm expecting VFIO devices to use the same bitmap library as the IOMMU > > drivers so we have a consistent reporting. > > I haven't reviewed that series in any detail yet, but it seems to > impose the same bitmap size and reporting to userspace features as > type1 based in internal limits of bitmap_set(). Thanks, It goes page by page, so the bitmap_set() can't see more than 4k of bitmap at a time, IIRC. Jason
On Mon, May 02, 2022 at 07:04:47PM -0300, Jason Gunthorpe wrote: > > Ah great, implicit limitations not reported to the user that I hadn't > > even guessed! How does a user learn about any limitations in the > > number of ranges or size of each range? > > There is some limit of number of ranges and total aggregate address > space, you think we should report rather than try-and-fail? > > I guess total address space and total number of ranges is easy to > report, but I don't quite know what userspace will do with it? Actually, I recall now, the idea was that the driver would sort this out. So, userspace would pass in whatever ranges it wanted and if the device could do only, say, 4, then the driver would merge adjacent ranges till there were only 4 - and limit each range to any device limit there may be and so on and so forth. If the result was unfittable then it fails and the only thing userspace can do is try with less. So, if userspace wants to do 'minimal + extra for hotplug' then on failure it should try 'minimal' and on failure of that give up completely. Then we don't have to try and catalog all the weird device choices we might need into some query function. Yishai, this concept should be called out in the documentation, and we probably need some core code to pull the ranges into an interval tree and validate they are non-overlapping and so forth Jason
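The "merge adjacent ranges till there were only 4" policy sketched here could look like the following — a hypothetical helper (no such code exists in this RFC) that repeatedly merges the two neighbours separated by the smallest gap until a sorted, non-overlapping list fits a device limit, at the cost of the device also tracking the swallowed gaps:

```c
#include <stdint.h>
#include <stddef.h>

struct iova_range {
	uint64_t iova;
	uint64_t length;
};

/* Collapse a sorted, non-overlapping range list down to at most 'max'
 * entries by repeatedly merging the pair of neighbours with the
 * smallest gap between them; the merged range covers the gap too, so
 * the tracker over-reports slightly rather than failing.  Returns the
 * resulting count. */
static size_t ranges_fit(struct iova_range *r, size_t n, size_t max)
{
	while (n > max) {
		size_t best = 0;
		uint64_t best_gap = UINT64_MAX;

		for (size_t i = 0; i + 1 < n; i++) {
			uint64_t gap = r[i + 1].iova -
				       (r[i].iova + r[i].length);
			if (gap < best_gap) {
				best_gap = gap;
				best = i;
			}
		}
		/* Merge 'best' with its right neighbour, covering the gap. */
		r[best].length = r[best + 1].iova + r[best + 1].length -
				 r[best].iova;
		for (size_t i = best + 1; i + 1 < n; i++)
			r[i] = r[i + 1];
		n--;
	}
	return n;
}
```

This also illustrates why try-and-fail is workable for userspace: if even the fully merged list exceeds a per-range device limit, the only remaining option is to retry with a smaller IOVA footprint, exactly the 'minimal + extra for hotplug' then 'minimal' fallback described above.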
On 5/2/22 23:04, Jason Gunthorpe wrote: > On Mon, May 02, 2022 at 01:58:37PM -0600, Alex Williamson wrote: >> On Mon, 2 May 2022 16:25:41 -0300 >> Jason Gunthorpe <jgg@nvidia.com> wrote: >>> On Mon, May 02, 2022 at 01:07:01PM -0600, Alex Williamson wrote: >>>>> +/* >>>>> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started >>>>> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START >>>>> + */ >>>>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4 >>>> >>>> This seems difficult to use from a QEMU perspective, where a vfio >>>> device typically operates on a MemoryListener and we only have >>>> visibility to one range at a time. I don't see any indication that >>>> LOGGING_START is meant to be cumulative such that userspace could >>>> incrementally add ranges to be watched, nor clearly does LOGGING_STOP >>>> appear to have any sort of IOVA range granularity. >>> >>> Correct, at least mlx5 HW just cannot do a change tracking operation, >>> so userspace must pre-select some kind of IOVA range to monitor based >>> on the current VM configuration. >>> >>>> Is userspace intended to pass the full vCPU physical address range >>>> here, and if so would a single min/max IOVA be sufficient? >>> >>> At least mlx5 doesn't have enough capacity for that. Some reasonable >>> in-between of the current address space, and maybe a speculative extra >>> for hot plug. >> Though one knows (in the interim abstractions) the underlying memory map at start of guest, e.g. that you can only have <this much> for hotpluggable memory. Albeit I can't remember (need to check) if the memory listener infra gets propagated with everything at start (including the non-present stuff like the hotplug MR). >>> I'm expecting VFIO devices to use the same bitmap library as the IOMMU >>> drivers so we have a consistent reporting.
>> >> I haven't reviewed that series in any detail yet, but it seems to >> impose the same bitmap size and reporting to userspace features as >> type1 based in internal limits of bitmap_set(). Thanks, > > It goes page by page, so the bitmap_set() can't see more than 4k of > bitmap at a time, IIRC. It allows a bitmap big enough to marshal 64G of IOVA space to it, albeit it only does a section at a time (128M), i.e. one 4K page, or in other words 32K bits, at a time. However the callers will only set bits in the bitmap for a given PTE page size, so we are always limited by the PTE page size being recorded at the bitmap page-size granularity. The limit is implicitly bound by the IOVA range being passed in and the bitmap page size, so in effect the maximum is the bitmap size necessary to record an IOVA range of length ULONG_MAX.
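The sizes in this exchange are easy to check: at one bit per page_size unit packed into u64 words, a 128M section at 4K granularity needs exactly one 4K bitmap page (32K bits), and 64G of IOVA needs 2M of bitmap. A small helper to make the arithmetic explicit (name and rounding convention are the editor's, assuming the bitmap is stored as whole u64 words per the proposed uAPI):

```c
#include <stdint.h>

/* Bytes of REPORT bitmap needed to cover 'length' bytes of IOVA at
 * 'page_size' granularity: one bit per page, rounded up to whole
 * u64 words since the uAPI stores the bitmap in an array of u64s. */
static uint64_t dma_log_bitmap_bytes(uint64_t length, uint64_t page_size)
{
	uint64_t pages = (length + page_size - 1) / page_size;

	return ((pages + 63) / 64) * 8;
}
```

For example, `dma_log_bitmap_bytes(128ULL << 20, 4096)` is 4096 bytes (one bitmap page per 128M section) and `dma_log_bitmap_bytes(64ULL << 30, 4096)` is 2M, matching the figures quoted above.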
On 5/3/22 00:02, Jason Gunthorpe wrote: > On Mon, May 02, 2022 at 07:04:47PM -0300, Jason Gunthorpe wrote: >>> Ah great, implicit limitations not reported to the user that I hadn't >>> even guessed! How does a user learn about any limitations in the >>> number of ranges or size of each range? >> >> There is some limit of number of ranges and total aggregate address >> space, you think we should report rather than try-and-fail? >> >> I guess total address space and total number of ranges is easy to >> report, but I don't quite know what userspace will do with it? > > Actually, I recall now, the idea was that the driver would sort this > out. > > So, userspace would pass in whatever ranges it wanted and if the > device could do only. say 4, then the driver would merge adjacent > ranges till there were only 4 - and limit each range to any device > limit there may be and so on and so forth. > > If the result was unfittable then it fails and the only thing > userspace can do is try with less. > > So, if userspace wants to do 'minimal + extra for hotplug' then on > failure it should try 'minimal' and on failure of that give up > completely. > Is it worth classifying the range as present or not, say with a ::flags in vfio_device_feature_dma_logging_range, to help choose what to aggregate or not, given the number of calls might be nondeterministic? > Then we don't have to try and catalog all the weird device choices we > might need into some query function. > > Yishahi, this concept should be called out in the documentation, and > we probably need some core code to pull the ranges into an interval > tree and validate they are non-overlapping and so forth > > Jason