diff mbox series

[V2,vfio,03/11] vfio: Introduce DMA logging uAPIs

Message ID 20220714081251.240584-4-yishaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Add device DMA logging support for mlx5 driver | expand

Commit Message

Yishai Hadas July 14, 2022, 8:12 a.m. UTC
DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

This patch introduces the DMA logging involved uAPIs.

It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.

It exposes a PROBE option to detect if the device supports DMA logging.
It exposes a SET option to start device DMA logging in given IOVAs
ranges.
It exposes a SET option to stop device DMA logging that was previously
started.
It exposes a GET option to read back and clear the device DMA log.

Extra details exist as part of vfio.h per a specific option.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

Comments

Alex Williamson July 18, 2022, 10:29 p.m. UTC | #1
On Thu, 14 Jul 2022 11:12:43 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> DMA logging allows a device to internally record what DMAs the device is
> initiating and report them back to userspace. It is part of the VFIO
> migration infrastructure that allows implementing dirty page tracking
> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> 
> This patch introduces the DMA logging involved uAPIs.
> 
> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> 
> It exposes a PROBE option to detect if the device supports DMA logging.
> It exposes a SET option to start device DMA logging in given IOVAs
> ranges.
> It exposes a SET option to stop device DMA logging that was previously
> started.
> It exposes a GET option to read back and clear the device DMA log.
> 
> Extra details exist as part of vfio.h per a specific option.


Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
potentially other devices that might make use of this or is everyone
else waiting for IOMMU based dirty tracking?

 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..81475c3e7c92 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
> + * DMA logging.
> + *
> + * DMA logging allows a device to internally record what DMAs the device is
> + * initiating and report them back to userspace. It is part of the VFIO
> + * migration infrastructure that allows implementing dirty page tracking
> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> + *
> + * When DMA logging is started a range of IOVAs to monitor is provided and the
> + * device can optimize its logging to cover only the IOVA range given. Each
> + * DMA that the device initiates inside the range will be logged by the device
> + * for later retrieval.
> + *
> + * page_size is an input that hints what tracking granularity the device
> + * should try to achieve. If the device cannot do the hinted page size then it
> + * should pick the next closest page size it supports. On output the device
> + * will return the page size it selected.
> + *
> + * ranges is a pointer to an array of
> + * struct vfio_device_feature_dma_logging_range.
> + */
> +struct vfio_device_feature_dma_logging_control {
> +	__aligned_u64 page_size;
> +	__u32 num_ranges;
> +	__u32 __reserved;
> +	__aligned_u64 ranges;
> +};
> +
> +struct vfio_device_feature_dma_logging_range {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
> + *
> + * Query the device's DMA log for written pages within the given IOVA range.
> + * During querying the log is cleared for the IOVA range.
> + *
> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> + * is given by:
> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> + *
> + * The input page_size can be any power of two value and does not have to
> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
> + * will format its internal logging to match the reporting page size, possibly
> + * by replicating bits if the internal page size is lower than requested.
> + *
> + * Bits will be updated in bitmap using atomic or to allow userspace to
> + * combine bitmaps from multiple trackers together. Therefore userspace must
> + * zero the bitmap before doing any reports.

Somewhat confusing, perhaps "between report sets"?

> + *
> + * If any error is returned userspace should assume that the dirty log is
> + * corrupted and restart.

Restart what?  The user can't just zero the bitmap and retry, dirty
information at the device has been lost.  Are we suggesting they stop
DMA logging and restart it, which sounds a lot like failing a migration
and starting over.  Or could the user gratuitously mark the bitmap
fully dirty and a subsequent logging report iteration might work?
Thanks,

Alex

> + *
> + * If DMA logging is not enabled, an error will be returned.
> + *
> + */
> +struct vfio_device_feature_dma_logging_report {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +	__aligned_u64 page_size;
> +	__aligned_u64 bitmap;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
Tian, Kevin July 19, 2022, 1:39 a.m. UTC | #2
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, July 19, 2022 6:30 AM
> 
> On Thu, 14 Jul 2022 11:12:43 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > DMA logging allows a device to internally record what DMAs the device is
> > initiating and report them back to userspace. It is part of the VFIO
> > migration infrastructure that allows implementing dirty page tracking
> > during the pre copy phase of live migration. Only DMA WRITEs are logged,
> > and this API is not connected to
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >
> > This patch introduces the DMA logging involved uAPIs.
> >
> > It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> >
> > It exposes a PROBE option to detect if the device supports DMA logging.
> > It exposes a SET option to start device DMA logging in given IOVAs
> > ranges.
> > It exposes a SET option to stop device DMA logging that was previously
> > started.
> > It exposes a GET option to read back and clear the device DMA log.
> >
> > Extra details exist as part of vfio.h per a specific option.
> 
> 
> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> potentially other devices that might make use of this or is everyone
> else waiting for IOMMU based dirty tracking?
> 

I plan to take a look later this week.

From Intel side I'm not aware of such device so far and IOMMU based
dirty tracking would be the standard way to go.
Kirti Wankhede July 19, 2022, 5:40 a.m. UTC | #3
On 7/19/2022 7:09 AM, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Tuesday, July 19, 2022 6:30 AM
>>
>> On Thu, 14 Jul 2022 11:12:43 +0300
>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>
>>> DMA logging allows a device to internally record what DMAs the device is
>>> initiating and report them back to userspace. It is part of the VFIO
>>> migration infrastructure that allows implementing dirty page tracking
>>> during the pre copy phase of live migration. Only DMA WRITEs are logged,
>>> and this API is not connected to
>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>>>
>>> This patch introduces the DMA logging involved uAPIs.
>>>
>>> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
>>>
>>> It exposes a PROBE option to detect if the device supports DMA logging.
>>> It exposes a SET option to start device DMA logging in given IOVAs
>>> ranges.
>>> It exposes a SET option to stop device DMA logging that was previously
>>> started.
>>> It exposes a GET option to read back and clear the device DMA log.
>>>
>>> Extra details exist as part of vfio.h per a specific option.
>>
>>
>> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
>> potentially other devices that might make use of this or is everyone
>> else waiting for IOMMU based dirty tracking?
>>
> 

I had briefly screened through it, I'm taking a closer look of it.

NVIDIA vGPU might use this API.

Thanks,
Kirti
Yishai Hadas July 19, 2022, 7:49 a.m. UTC | #4
On 19/07/2022 1:29, Alex Williamson wrote:
> On Thu, 14 Jul 2022 11:12:43 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> DMA logging allows a device to internally record what DMAs the device is
>> initiating and report them back to userspace. It is part of the VFIO
>> migration infrastructure that allows implementing dirty page tracking
>> during the pre copy phase of live migration. Only DMA WRITEs are logged,
>> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>>
>> This patch introduces the DMA logging involved uAPIs.
>>
>> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
>>
>> It exposes a PROBE option to detect if the device supports DMA logging.
>> It exposes a SET option to start device DMA logging in given IOVAs
>> ranges.
>> It exposes a SET option to stop device DMA logging that was previously
>> started.
>> It exposes a GET option to read back and clear the device DMA log.
>>
>> Extra details exist as part of vfio.h per a specific option.
>
> Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> potentially other devices that might make use of this or is everyone
> else waiting for IOMMU based dirty tracking?
>
>   
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>>   include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 79 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..81475c3e7c92 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>>   	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>   };
>>   
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
>> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
>> + * DMA logging.
>> + *
>> + * DMA logging allows a device to internally record what DMAs the device is
>> + * initiating and report them back to userspace. It is part of the VFIO
>> + * migration infrastructure that allows implementing dirty page tracking
>> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
>> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>> + *
>> + * When DMA logging is started a range of IOVAs to monitor is provided and the
>> + * device can optimize its logging to cover only the IOVA range given. Each
>> + * DMA that the device initiates inside the range will be logged by the device
>> + * for later retrieval.
>> + *
>> + * page_size is an input that hints what tracking granularity the device
>> + * should try to achieve. If the device cannot do the hinted page size then it
>> + * should pick the next closest page size it supports. On output the device
>> + * will return the page size it selected.
>> + *
>> + * ranges is a pointer to an array of
>> + * struct vfio_device_feature_dma_logging_range.
>> + */
>> +struct vfio_device_feature_dma_logging_control {
>> +	__aligned_u64 page_size;
>> +	__u32 num_ranges;
>> +	__u32 __reserved;
>> +	__aligned_u64 ranges;
>> +};
>> +
>> +struct vfio_device_feature_dma_logging_range {
>> +	__aligned_u64 iova;
>> +	__aligned_u64 length;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
>> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
>> + */
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
>> + *
>> + * Query the device's DMA log for written pages within the given IOVA range.
>> + * During querying the log is cleared for the IOVA range.
>> + *
>> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
>> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
>> + * is given by:
>> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
>> + *
>> + * The input page_size can be any power of two value and does not have to
>> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
>> + * will format its internal logging to match the reporting page size, possibly
>> + * by replicating bits if the internal page size is lower than requested.
>> + *
>> + * Bits will be updated in bitmap using atomic or to allow userspace to
>> + * combine bitmaps from multiple trackers together. Therefore userspace must
>> + * zero the bitmap before doing any reports.
> Somewhat confusing, perhaps "between report sets"?

The idea was that the driver just turns on its own dirty bits and 
doesn't touch others.

Do you suggest the below ?

"Therefore userspace must zero the bitmap between report sets".

>
>> + *
>> + * If any error is returned userspace should assume that the dirty log is
>> + * corrupted and restart.
> Restart what?  The user can't just zero the bitmap and retry, dirty
> information at the device has been lost.

Right

>   Are we suggesting they stop
> DMA logging and restart it, which sounds a lot like failing a migration
> and starting over.  Or could the user gratuitously mark the bitmap
> fully dirty and a subsequent logging report iteration might work?
> Thanks,
>
> Alex

An error at that step is not expected and might be fatal.

User space can consider marking all as dirty and continue with that 
approach for next iterations, maybe even without calling the driver.

Alternatively, user space can abort the migration and retry later on.

We can come with some rephrasing as of the above.

What do you think ?

Yishai

>> + *
>> + * If DMA logging is not enabled, an error will be returned.
>> + *
>> + */
>> +struct vfio_device_feature_dma_logging_report {
>> +	__aligned_u64 iova;
>> +	__aligned_u64 length;
>> +	__aligned_u64 page_size;
>> +	__aligned_u64 bitmap;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
>> +
>>   /* -------- API for Type1 VFIO IOMMU -------- */
>>   
>>   /**
Alex Williamson July 19, 2022, 7:57 p.m. UTC | #5
On Tue, 19 Jul 2022 10:49:42 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 19/07/2022 1:29, Alex Williamson wrote:
> > On Thu, 14 Jul 2022 11:12:43 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> DMA logging allows a device to internally record what DMAs the device is
> >> initiating and report them back to userspace. It is part of the VFIO
> >> migration infrastructure that allows implementing dirty page tracking
> >> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> >> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >>
> >> This patch introduces the DMA logging involved uAPIs.
> >>
> >> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> >>
> >> It exposes a PROBE option to detect if the device supports DMA logging.
> >> It exposes a SET option to start device DMA logging in given IOVAs
> >> ranges.
> >> It exposes a SET option to stop device DMA logging that was previously
> >> started.
> >> It exposes a GET option to read back and clear the device DMA log.
> >>
> >> Extra details exist as part of vfio.h per a specific option.  
> >
> > Kevin, Kirti, others, any comments on this uAPI proposal?  Are there
> > potentially other devices that might make use of this or is everyone
> > else waiting for IOMMU based dirty tracking?
> >
> >     
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >> ---
> >>   include/uapi/linux/vfio.h | 79 +++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 79 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..81475c3e7c92 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
> >>   	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>   };
> >>   
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> >> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
> >> + * DMA logging.
> >> + *
> >> + * DMA logging allows a device to internally record what DMAs the device is
> >> + * initiating and report them back to userspace. It is part of the VFIO
> >> + * migration infrastructure that allows implementing dirty page tracking
> >> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,
> >> + * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> >> + *
> >> + * When DMA logging is started a range of IOVAs to monitor is provided and the
> >> + * device can optimize its logging to cover only the IOVA range given. Each
> >> + * DMA that the device initiates inside the range will be logged by the device
> >> + * for later retrieval.
> >> + *
> >> + * page_size is an input that hints what tracking granularity the device
> >> + * should try to achieve. If the device cannot do the hinted page size then it
> >> + * should pick the next closest page size it supports. On output the device
> >> + * will return the page size it selected.
> >> + *
> >> + * ranges is a pointer to an array of
> >> + * struct vfio_device_feature_dma_logging_range.
> >> + */
> >> +struct vfio_device_feature_dma_logging_control {
> >> +	__aligned_u64 page_size;
> >> +	__u32 num_ranges;
> >> +	__u32 __reserved;
> >> +	__aligned_u64 ranges;
> >> +};
> >> +
> >> +struct vfio_device_feature_dma_logging_range {
> >> +	__aligned_u64 iova;
> >> +	__aligned_u64 length;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
> >> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
> >> + *
> >> + * Query the device's DMA log for written pages within the given IOVA range.
> >> + * During querying the log is cleared for the IOVA range.
> >> + *
> >> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> >> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> >> + * is given by:
> >> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> >> + *
> >> + * The input page_size can be any power of two value and does not have to
> >> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
> >> + * will format its internal logging to match the reporting page size, possibly
> >> + * by replicating bits if the internal page size is lower than requested.
> >> + *
> >> + * Bits will be updated in bitmap using atomic or to allow userspace to

s/or/OR/

> >> + * combine bitmaps from multiple trackers together. Therefore userspace must
> >> + * zero the bitmap before doing any reports.  
> > Somewhat confusing, perhaps "between report sets"?  
> 
> The idea was that the driver just turns on its own dirty bits and 
> doesn't touch others.

Right, we can aggregate dirty bits from multiple devices into a single
bitmap.

> Do you suggest the below ?
> 
> "Therefore userspace must zero the bitmap between report sets".

It may be best to simply drop this guidance, we don't need to presume
the user algorithm, we only need to make it apparent that
LOGGING_REPORT will only set bits in the bitmap and never clear or
preform any initialization of the user provided bitmap.

> >> + *
> >> + * If any error is returned userspace should assume that the dirty log is
> >> + * corrupted and restart.  
> > Restart what?  The user can't just zero the bitmap and retry, dirty
> > information at the device has been lost.  
> 
> Right
> 
> >   Are we suggesting they stop
> > DMA logging and restart it, which sounds a lot like failing a migration
> > and starting over.  Or could the user gratuitously mark the bitmap
> > fully dirty and a subsequent logging report iteration might work?
> > Thanks,
> >
> > Alex  
> 
> An error at that step is not expected and might be fatal.
> 
> User space can consider marking all as dirty and continue with that 
> approach for next iterations, maybe even without calling the driver.
> 
> Alternatively, user space can abort the migration and retry later on.
> 
> We can come with some rephrasing as of the above.
> 
> What do you think ?

If userspace needs to consider the bitmap undefined for any errno,
that's a pretty serious usage restriction that may negate the
practical utility of atomically OR'ing in dirty bits.  We can certainly
have EINVAL, ENOTTY, EFAULT, E2BIG, ENOMEM conditions that don't result
in a corrupted/undefined bitmap, right?  Maybe some of those result in
an incomplete bitmap, but how does the bitmap actually get corrupted?
It seems like such a condition should be pretty narrowly defined and
separate from errors resulting in an incomplete bitmap, maybe we'd
reserve -EIO for such a case.  The driver itself can also gratuitously
mark ranges dirty itself if it loses sync with the device, and can
probably do so at a much more accurate granularity than userspace.
Thanks,

Alex

> >> + *
> >> + * If DMA logging is not enabled, an error will be returned.
> >> + *
> >> + */
> >> +struct vfio_device_feature_dma_logging_report {
> >> +	__aligned_u64 iova;
> >> +	__aligned_u64 length;
> >> +	__aligned_u64 page_size;
> >> +	__aligned_u64 bitmap;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> >> +
> >>   /* -------- API for Type1 VFIO IOMMU -------- */
> >>   
> >>   /**  
> 
>
Jason Gunthorpe July 19, 2022, 8:18 p.m. UTC | #6
On Tue, Jul 19, 2022 at 01:57:14PM -0600, Alex Williamson wrote:

> If userspace needs to consider the bitmap undefined for any errno,
> that's a pretty serious usage restriction that may negate the
> practical utility of atomically OR'ing in dirty bits.  

If any report fails it means qemu has lost the ability to dirty track,
so it doesn't really matter if one chunk has gotten lost - we can't
move forward with incomplete dirty data.

Error recovery is to consider all memory dirty and try to restart the
dirty tracking, or to abort/restart the whole migration.

Worrying about the integrity of a single bitmap chunk doesn't seem
worthwhile given that outcome.

Yes, there are cases where things are more deterministic, but it is
not useful information that userspace can do anything with. It can't
just call report again and expect it will work the 2nd time, for
instance.

We do not expect failures here, lets not overdesign things.

> reserve -EIO for such a case.  The driver itself can also gratuitously
> mark ranges dirty itself if it loses sync with the device, and can
> probably do so at a much more accurate granularity than userspace.

A driver that implements such a handling shouldn't return an error
code in the first place.

Jason
Tian, Kevin July 21, 2022, 8:45 a.m. UTC | #7
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Thursday, July 14, 2022 4:13 PM
> 
> DMA logging allows a device to internally record what DMAs the device is
> initiating and report them back to userspace. It is part of the VFIO
> migration infrastructure that allows implementing dirty page tracking
> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
> 
> This patch introduces the DMA logging involved uAPIs.
> 
> It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
> 
> It exposes a PROBE option to detect if the device supports DMA logging.
> It exposes a SET option to start device DMA logging in given IOVAs
> ranges.
> It exposes a SET option to stop device DMA logging that was previously
> started.
> It exposes a GET option to read back and clear the device DMA log.
> 
> Extra details exist as part of vfio.h per a specific option.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 79
> +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..81475c3e7c92 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,85 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
> 
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.

both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET

> + * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device
> supports
> + * DMA logging.
> + *
> + * DMA logging allows a device to internally record what DMAs the device is
> + * initiating and report them back to userspace. It is part of the VFIO
> + * migration infrastructure that allows implementing dirty page tracking
> + * during the pre copy phase of live migration. Only DMA WRITEs are logged,

Then 'DMA dirty logging' might be a more accurate name throughput this
series.

> + * and this API is not connected to
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

didn't get the point of this explanation.

> + *
> + * When DMA logging is started a range of IOVAs to monitor is provided and
> the
> + * device can optimize its logging to cover only the IOVA range given. Each
> + * DMA that the device initiates inside the range will be logged by the device
> + * for later retrieval.
> + *
> + * page_size is an input that hints what tracking granularity the device
> + * should try to achieve. If the device cannot do the hinted page size then it
> + * should pick the next closest page size it supports. On output the device

next closest 'smaller' page size?

> + * will return the page size it selected.
> + *
> + * ranges is a pointer to an array of
> + * struct vfio_device_feature_dma_logging_range.
> + */
> +struct vfio_device_feature_dma_logging_control {
> +	__aligned_u64 page_size;
> +	__u32 num_ranges;
> +	__u32 __reserved;
> +	__aligned_u64 ranges;
> +};

should we move the definition of LOG_MAX_RANGES to be here
so the user can know the max limits of tracked ranges?

> +
> +struct vfio_device_feature_dma_logging_range {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3

Can the user update the range list by doing another START?

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was
> started
> + * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4

Is there value of allowing the user to stop tracking of a specific range?

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA
> log
> + *
> + * Query the device's DMA log for written pages within the given IOVA range.
> + * During querying the log is cleared for the IOVA range.
> + *
> + * bitmap is a pointer to an array of u64s that will hold the output bitmap
> + * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
> + * is given by:
> + *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
> + *
> + * The input page_size can be any power of two value and does not have to
> + * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START.
> The driver
> + * will format its internal logging to match the reporting page size, possibly
> + * by replicating bits if the internal page size is lower than requested.

what's the purpose of this? I didn't quite get why an user would want to
start tracking in one page size and then read back the dirty bitmap in
another page size...

> + *
> + * Bits will be updated in bitmap using atomic or to allow userspace to
> + * combine bitmaps from multiple trackers together. Therefore userspace
> must
> + * zero the bitmap before doing any reports.

I'm a bit lost here. Since we allow userspace to combine bitmaps from 
multiple trackers then it's perfectly sane for userspace to leave bitmap 
with some 1's from one tracker when doing a report from another tracker.

> + *
> + * If any error is returned userspace should assume that the dirty log is
> + * corrupted and restart.
> + *
> + * If DMA logging is not enabled, an error will be returned.
> + *
> + */
> +struct vfio_device_feature_dma_logging_report {
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +	__aligned_u64 page_size;
> +	__aligned_u64 bitmap;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.18.1
Jason Gunthorpe July 21, 2022, 12:05 p.m. UTC | #8
On Thu, Jul 21, 2022 at 08:45:10AM +0000, Tian, Kevin wrote:

> > + * will format its internal logging to match the reporting page size, possibly
> > + * by replicating bits if the internal page size is lower than requested.
> 
> what's the purpose of this? I didn't quite get why an user would want to
> start tracking in one page size and then read back the dirty bitmap in
> another page size...

There may be multiple kernel trackers that are working with different
underlying block sizes, so the concept is userspace decides what block
size it wants to work in and the kernel side transparently adapts. The
math is simple so putting it in the kernel is convenient.

Effectively the general vision is that qemu would allocate one
reporting buffer and then invoke these IOCTLs in parallel on all the
trackers then process the single bitmap. Forcing qemu to allocate
bitmaps per tracker page size is just inefficient.

Jason
Tian, Kevin July 25, 2022, 7:20 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 21, 2022 8:05 PM
> 
> On Thu, Jul 21, 2022 at 08:45:10AM +0000, Tian, Kevin wrote:
> 
> > > + * will format its internal logging to match the reporting page size,
> possibly
> > > + * by replicating bits if the internal page size is lower than requested.
> >
> > what's the purpose of this? I didn't quite get why an user would want to
> > start tracking in one page size and then read back the dirty bitmap in
> > another page size...
> 
> There may be multiple kernel trackers that are working with different
> underlying block sizes, so the concept is userspace decides what block
> size it wants to work in and the kernel side transparently adapts. The
> math is simple so putting it in the kernel is convenient.
> 
> Effectively the general vision is that qemu would allocate one
> reporting buffer and then invoke these IOCTLs in parallel on all the
> trackers then process the single bitmap. Forcing qemu to allocate
> bitmaps per tracker page size is just inefficient.
> 

I got that point. But my question is slightly different.

A practical flow would like below:

1) Qemu first requests to start dirty tracking in 4KB page size.
   Underlying trackers may start tracking in 4KB, 256KB, 2MB,
   etc. based on their own constraints.

2) Qemu then reads back dirty reports in a shared bitmap in
   4KB page size. All trackers must update dirty bitmap in 4KB
   granular regardless of the actual size each tracker selects.

Is there a real usage where Qemu would want to attempt
different page sizes between above two steps?

If not then I wonder whether a simpler design is to just have 
page size specified in the first step and then inherited by the 
2nd step...

Thanks
Kevin
Tian, Kevin July 25, 2022, 7:30 a.m. UTC | #10
<please use plain-text next time>

> From: Yishai Hadas <yishaih@nvidia.com> 
> Sent: Thursday, July 21, 2022 7:06 PM
> > > +/*
> > > + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
> >
> > both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET
>
> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
> Here it refers to the start option.

let's make it accurate here.

> > > + * page_size is an input that hints what tracking granularity the device
> > > + * should try to achieve. If the device cannot do the hinted page size then it
> > > + * should pick the next closest page size it supports. On output the device
> >
> > next closest 'smaller' page size?
>
> Not only, it depends on the device capabilities/support and should be a driver choice.

'should pick next closest" is a guideline to the driver. If user requests 
8KB while the device supports 4KB and 16KB, which one is closest?

It's probably safer to just say that it's a driver choice when the hinted page
size cannot be set?

> > > +struct vfio_device_feature_dma_logging_control {
> > > +	__aligned_u64 page_size;
> > > +	__u32 num_ranges;
> > > +	__u32 __reserved;
> > > +	__aligned_u64 ranges;
> > > +};
> >
> > should we move the definition of LOG_MAX_RANGES to be here
> > so the user can know the max limits of tracked ranges?
> This was raised as an option as part of this mail thread.
> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.

No matter how realistic an user might hit an limitation, it doesn't
sound good to not expose it if existing.

> > > +
> > > +struct vfio_device_feature_dma_logging_range {
> > > +	__aligned_u64 iova;
> > > +	__aligned_u64 length;
> > > +};
> > > +
> > > +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
> >
> > Can the user update the range list by doing another START?
>
> No, single start to ask the device what to track and a matching single stop should follow at the end.

let's document it then.

Thanks
Kevin
Jason Gunthorpe July 25, 2022, 2:33 p.m. UTC | #11
On Mon, Jul 25, 2022 at 07:20:16AM +0000, Tian, Kevin wrote:

> I got that point. But my question is slightly different.
> 
> A practical flow would like below:
> 
> 1) Qemu first requests to start dirty tracking in 4KB page size.
>    Underlying trackers may start tracking in 4KB, 256KB, 2MB,
>    etc. based on their own constraints.
> 
> 2) Qemu then reads back dirty reports in a shared bitmap in
>    4KB page size. All trackers must update dirty bitmap in 4KB
>    granular regardless of the actual size each tracker selects.
> 
> Is there a real usage where Qemu would want to attempt
> different page sizes between above two steps?

If you multi-thread the tracker reads it will be efficient to populate
a single bitmap and then copy that single bitmapt to the dirty
transfer. In this case you want the page size conversion.

If qemu is just going to read sequentially then perhaps it doesn't.

But forcing a fixed page size just denies userspace this choice, and
it doesn't make the kernel any simpler because the kernel always must
have this code to adapt different page sizes to support the real iommu
with huge pages/etc.

Jason
Tian, Kevin July 26, 2022, 7:07 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 25, 2022 10:33 PM
> 
> On Mon, Jul 25, 2022 at 07:20:16AM +0000, Tian, Kevin wrote:
> 
> > I got that point. But my question is slightly different.
> >
> > A practical flow would like below:
> >
> > 1) Qemu first requests to start dirty tracking in 4KB page size.
> >    Underlying trackers may start tracking in 4KB, 256KB, 2MB,
> >    etc. based on their own constraints.
> >
> > 2) Qemu then reads back dirty reports in a shared bitmap in
> >    4KB page size. All trackers must update dirty bitmap in 4KB
> >    granular regardless of the actual size each tracker selects.
> >
> > Is there a real usage where Qemu would want to attempt
> > different page sizes between above two steps?
> 
> If you multi-thread the tracker reads it will be efficient to populate
> a single bitmap and then copy that single bitmapt to the dirty
> transfer. In this case you want the page size conversion.
> 
> If qemu is just going to read sequentially then perhaps it doesn't.
> 
> But forcing a fixed page size just denies userspace this choice, and
> it doesn't make the kernel any simpler because the kernel always must
> have this code to adapt different page sizes to support the real iommu
> with huge pages/etc.
> 

OK, make sense.
Yishai Hadas July 26, 2022, 8:37 a.m. UTC | #13
On 25/07/2022 10:30, Tian, Kevin wrote:
> <please use plain-text next time>
>
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Thursday, July 21, 2022 7:06 PM
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
>>> both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET
>> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
>> Here it refers to the start option.
> let's make it accurate here.

OK

>
>>>> + * page_size is an input that hints what tracking granularity the device
>>>> + * should try to achieve. If the device cannot do the hinted page size then it
>>>> + * should pick the next closest page size it supports. On output the device
>>> next closest 'smaller' page size?
>> Not only, it depends on the device capabilities/support and should be a driver choice.
> 'should pick next closest" is a guideline to the driver. If user requests
> 8KB while the device supports 4KB and 16KB, which one is closest?
>
> It's probably safer to just say that it's a driver choice when the hinted page
> size cannot be set?

Yes, may rephrase in V3 accordingly.

>
>>>> +struct vfio_device_feature_dma_logging_control {
>>>> +	__aligned_u64 page_size;
>>>> +	__u32 num_ranges;
>>>> +	__u32 __reserved;
>>>> +	__aligned_u64 ranges;
>>>> +};
>>> should we move the definition of LOG_MAX_RANGES to be here
>>> so the user can know the max limits of tracked ranges?
>> This was raised as an option as part of this mail thread.
>> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.
> No matter how realistic an user might hit an limitation, it doesn't
> sound good to not expose it if existing.

As Jason replied at some point here, we need to see a clear use case for 
more than a few 10's of ranges before we complicate things.

For now we don't see one. If one does crop up someday it is easy to add 
a new query, or some other behavior.

Alex,

Can you please comment here so that we can converge and be ready for V3 ?

>>>> +
>>>> +struct vfio_device_feature_dma_logging_range {
>>>> +	__aligned_u64 iova;
>>>> +	__aligned_u64 length;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
>>> Can the user update the range list by doing another START?
>> No, single start to ask the device what to track and a matching single stop should follow at the end.
> let's document it then.

OK

>
> Thanks
> Kevin
>
Thanks,
Yishai
Alex Williamson July 26, 2022, 2:03 p.m. UTC | #14
On Tue, 26 Jul 2022 11:37:50 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 25/07/2022 10:30, Tian, Kevin wrote:
> > <please use plain-text next time>
> >  
> >> From: Yishai Hadas <yishaih@nvidia.com>
> >> Sent: Thursday, July 21, 2022 7:06 PM  
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.  
> >>> both 'start'/'stop' are via VFIO_DEVICE_FEATURE_SET  
> >> Right, we have a note for that near VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP.
> >> Here it refers to the start option.  
> > let's make it accurate here.  
> 
> OK
> 
> >  
> >>>> + * page_size is an input that hints what tracking granularity the device
> >>>> + * should try to achieve. If the device cannot do the hinted page size then it
> >>>> + * should pick the next closest page size it supports. On output the device  
> >>> next closest 'smaller' page size?  
> >> Not only, it depends on the device capabilities/support and should be a driver choice.  
> > 'should pick next closest" is a guideline to the driver. If user requests
> > 8KB while the device supports 4KB and 16KB, which one is closest?
> >
> > It's probably safer to just say that it's a driver choice when the hinted page
> > size cannot be set?  
> 
> Yes, may rephrase in V3 accordingly.
> 
> >  
> >>>> +struct vfio_device_feature_dma_logging_control {
> >>>> +	__aligned_u64 page_size;
> >>>> +	__u32 num_ranges;
> >>>> +	__u32 __reserved;
> >>>> +	__aligned_u64 ranges;
> >>>> +};  
> >>> should we move the definition of LOG_MAX_RANGES to be here
> >>> so the user can know the max limits of tracked ranges?  
> >> This was raised as an option as part of this mail thread.
> >> However, for now it seems redundant as we may not expect user space to hit this limit and it mainly comes to protect kernel from memory exploding by a malicious user.  
> > No matter how realistic an user might hit an limitation, it doesn't
> > sound good to not expose it if existing.  
> 
> As Jason replied at some point here, we need to see a clear use case for 
> more than a few 10's of ranges before we complicate things.
> 
> For now we don't see one. If one does crop up someday it is easy to add 
> a new query, or some other behavior.
> 
> Alex,
> 
> Can you please comment here so that we can converge and be ready for V3 ?

I raised the same concern myself, the reason for having a limit is
clear, but focusing on a single use case and creating an arbitrary
"good enough" limit that isn't exposed to userspace makes this an
implementation detail that can subtly break userspace.  For instance,
what if userspace comes to expect the limit is 1000 and we decide to be
even more strict?  If only a few 10s of entries are used, why isn't 100
more than sufficient?  We change it, we break userspace.  OTOH, if we
simply make use of that reserved field to expose the limit, now we have
a contract with userspace and we can change our implementation because
that detail of the implementation is visible to userspace.  Thanks,

Alex
Jason Gunthorpe July 26, 2022, 3:04 p.m. UTC | #15
On Tue, Jul 26, 2022 at 08:03:20AM -0600, Alex Williamson wrote:

> I raised the same concern myself, the reason for having a limit is
> clear, but focusing on a single use case and creating an arbitrary
> "good enough" limit that isn't exposed to userspace makes this an
> implementation detail that can subtly break userspace.  For instance,
> what if userspace comes to expect the limit is 1000 and we decide to be
> even more strict?  If only a few 10s of entries are used, why isn't 100
> more than sufficient?  

So lets use the number of elements that will fit in PAGE_SIZE as the
guideline. It means the kernel can memdup the userspace array into a
single kernel page of memory to process it, which seems reasonably
future proof in that we won't need to make it lower. Thus we can
promise we won't make it smaller.

However, remember, this isn't even the real device limit - this is
just the limit that the core kernel code will accept to marshal the
data to pass internally the driver.

I fully expect that the driver will still refuse ranges in certain
configurations even if they can be marshaled.

This is primarily why I don't think it make sense to expose some
internal limit that is not even the real "will the call succeed"
parameters.

The API is specifically designed as 'try and fail' to allow the
drivers flexibility it how they map the requested ranges to their
internal operations.

> We change it, we break userspace.  OTOH, if we simply make use of
> that reserved field to expose the limit, now we have a contract with
> userspace and we can change our implementation because that detail
> of the implementation is visible to userspace.  Thanks,

I think this is not correct, just because we made it discoverable does
not absolve the kernel of compatibility. If we change the limit, eg to
1, and a real userspace stops working then we still broke userspace.

Complaining that userspace does not check the discoverable limit
doesn't help matters - I seem to remember Linus has written about this
in recent times even.

So, it is ultimately not different from 'try and fail', unless we
implement some algorithm in qemu - an algorithm that would duplicate
the one we already have in the kernel :\

Jason
Tian, Kevin July 28, 2022, 4:05 a.m. UTC | #16
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, July 26, 2022 11:05 PM
> 
> On Tue, Jul 26, 2022 at 08:03:20AM -0600, Alex Williamson wrote:
> 
> > I raised the same concern myself, the reason for having a limit is
> > clear, but focusing on a single use case and creating an arbitrary
> > "good enough" limit that isn't exposed to userspace makes this an
> > implementation detail that can subtly break userspace.  For instance,
> > what if userspace comes to expect the limit is 1000 and we decide to be
> > even more strict?  If only a few 10s of entries are used, why isn't 100
> > more than sufficient?
> 
> So lets use the number of elements that will fit in PAGE_SIZE as the
> guideline. It means the kernel can memdup the userspace array into a
> single kernel page of memory to process it, which seems reasonably
> future proof in that we won't need to make it lower. Thus we can
> promise we won't make it smaller.
> 
> However, remember, this isn't even the real device limit - this is
> just the limit that the core kernel code will accept to marshal the
> data to pass internally the driver.
> 
> I fully expect that the driver will still refuse ranges in certain
> configurations even if they can be marshaled.
> 
> This is primarily why I don't think it make sense to expose some
> internal limit that is not even the real "will the call succeed"
> parameters.
> 
> The API is specifically designed as 'try and fail' to allow the
> drivers flexibility it how they map the requested ranges to their
> internal operations.
> 
> > We change it, we break userspace.  OTOH, if we simply make use of
> > that reserved field to expose the limit, now we have a contract with
> > userspace and we can change our implementation because that detail
> > of the implementation is visible to userspace.  Thanks,
> 
> I think this is not correct, just because we made it discoverable does
> not absolve the kernel of compatibility. If we change the limit, eg to
> 1, and a real userspace stops working then we still broke userspace.

iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
By using the reserved field of vfio_device_feature_dma_logging_control
to return the limit of the specified page_size from a given tracker, 
the user can quickly retry and adapt to that limit if workable.

Otherwise what would be an efficient policy for user to retry after
a failure? Say initially user requests 100 ranges with 4K page size
but the tracker can only support 10 ranges. w/o a hint returned
from the tracker then the user just blindly try 100, 90, 80, ... or 
using a bisect algorithm?

> 
> Complaining that userspace does not check the discoverable limit
> doesn't help matters - I seem to remember Linus has written about this
> in recent times even.
> 
> So, it is ultimately not different from 'try and fail', unless we
> implement some algorithm in qemu - an algorithm that would duplicate
> the one we already have in the kernel :\
> 
> Jason
Jason Gunthorpe July 28, 2022, 12:06 p.m. UTC | #17
On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:

> > I think this is not correct, just because we made it discoverable does
> > not absolve the kernel of compatibility. If we change the limit, eg to
> > 1, and a real userspace stops working then we still broke userspace.
> 
> iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> By using the reserved field of vfio_device_feature_dma_logging_control
> to return the limit of the specified page_size from a given tracker, 
> the user can quickly retry and adapt to that limit if workable.

Again, no it can't. The marshalling limit is not the device limit and
it will still potentially fail. Userspace does not gain much
additional API certainty by knowing this internal limit.

> Otherwise what would be an efficient policy for user to retry after
> a failure? Say initially user requests 100 ranges with 4K page size
> but the tracker can only support 10 ranges. w/o a hint returned
> from the tracker then the user just blindly try 100, 90, 80, ... or 
> using a bisect algorithm?

With what I just said the minimum is PAGE_SIZE, so if some userspace
is using a huge range list it should try the huge list first (assuming
the kernel has been updated because someone justified a use case
here), then try to narrow to PAGE_SIZE, then give up.

The main point is there is nothing for current qemu to do - we do not
want to duplicate the kernel narrowing algorithm into qemu - the idea
is to define the interface in a way that accomodates what qemu needs.

The only issue is the bounding the memory allocation.

Jason
Tian, Kevin July 29, 2022, 3:01 a.m. UTC | #18
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 28, 2022 8:06 PM
> 
> On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:
> 
> > > I think this is not correct, just because we made it discoverable does
> > > not absolve the kernel of compatibility. If we change the limit, eg to
> > > 1, and a real userspace stops working then we still broke userspace.
> >
> > iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> > By using the reserved field of vfio_device_feature_dma_logging_control
> > to return the limit of the specified page_size from a given tracker,
> > the user can quickly retry and adapt to that limit if workable.
> 
> Again, no it can't. The marshalling limit is not the device limit and
> it will still potentially fail. Userspace does not gain much
> additional API certainty by knowing this internal limit.

Why cannot the tracker return device limit here?

> 
> > Otherwise what would be an efficient policy for user to retry after
> > a failure? Say initially user requests 100 ranges with 4K page size
> > but the tracker can only support 10 ranges. w/o a hint returned
> > from the tracker then the user just blindly try 100, 90, 80, ... or
> > using a bisect algorithm?
> 
> With what I just said the minimum is PAGE_SIZE, so if some userspace
> is using a huge range list it should try the huge list first (assuming
> the kernel has been updated because someone justified a use case
> here), then try to narrow to PAGE_SIZE, then give up.

Then probably it's good to document above as a guidance to
userspace.

> 
> The main point is there is nothing for current qemu to do - we do not
> want to duplicate the kernel narrowing algorithm into qemu - the idea
> is to define the interface in a way that accomodates what qemu needs.
> 
> The only issue is the bounding the memory allocation.
> 
> Jason
Jason Gunthorpe July 29, 2022, 2:11 p.m. UTC | #19
On Fri, Jul 29, 2022 at 03:01:51AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, July 28, 2022 8:06 PM
> > 
> > On Thu, Jul 28, 2022 at 04:05:04AM +0000, Tian, Kevin wrote:
> > 
> > > > I think this is not correct, just because we made it discoverable does
> > > > not absolve the kernel of compatibility. If we change the limit, eg to
> > > > 1, and a real userspace stops working then we still broke userspace.
> > >
> > > iiuc Alex's suggestion doesn't conflict with the 'try and fail' model.
> > > By using the reserved field of vfio_device_feature_dma_logging_control
> > > to return the limit of the specified page_size from a given tracker,
> > > the user can quickly retry and adapt to that limit if workable.
> > 
> > Again, no it can't. The marshalling limit is not the device limit and
> > it will still potentially fail. Userspace does not gain much
> > additional API certainty by knowing this internal limit.
> 
> Why cannot the tracker return device limit here?

Because it is "complicated". mlx5 doesn't really have a limit in
simple terms of the number of ranges - it has other limits based on
total span and page size.

Jason
diff mbox series

Patch

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..81475c3e7c92 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,85 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET start device DMA logging.
+ * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
+ * DMA logging.
+ *
+ * DMA logging allows a device to internally record what DMAs the device is
+ * initiating and report them back to userspace. It is part of the VFIO
+ * migration infrastructure that allows implementing dirty page tracking
+ * during the pre copy phase of live migration. Only DMA WRITEs are logged,
+ * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
+ *
+ * When DMA logging is started a range of IOVAs to monitor is provided and the
+ * device can optimize its logging to cover only the IOVA range given. Each
+ * DMA that the device initiates inside the range will be logged by the device
+ * for later retrieval.
+ *
+ * page_size is an input that hints what tracking granularity the device
+ * should try to achieve. If the device cannot do the hinted page size then it
+ * should pick the next closest page size it supports. On output the device
+ * will return the page size it selected.
+ *
+ * ranges is a pointer to an array of
+ * struct vfio_device_feature_dma_logging_range.
+ */
+struct vfio_device_feature_dma_logging_control {
+	__aligned_u64 page_size;
+	__u32 num_ranges;
+	__u32 __reserved;
+	__aligned_u64 ranges;
+};
+
+struct vfio_device_feature_dma_logging_range {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
+ * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
+ */
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
+ *
+ * Query the device's DMA log for written pages within the given IOVA range.
+ * During querying the log is cleared for the IOVA range.
+ *
+ * bitmap is a pointer to an array of u64s that will hold the output bitmap
+ * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
+ * is given by:
+ *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
+ *
+ * The input page_size can be any power of two value and does not have to
+ * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
+ * will format its internal logging to match the reporting page size, possibly
+ * by replicating bits if the internal page size is lower than requested.
+ *
+ * Bits will be updated in bitmap using atomic or to allow userspace to
+ * combine bitmaps from multiple trackers together. Therefore userspace must
+ * zero the bitmap before doing any reports.
+ *
+ * If any error is returned userspace should assume that the dirty log is
+ * corrupted and restart.
+ *
+ * If DMA logging is not enabled, an error will be returned.
+ *
+ */
+struct vfio_device_feature_dma_logging_report {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 bitmap;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**