
[v4,09/18] iommufd: Add a flag to skip clearing of IOPTE dirty

Message ID 20231018202715.69734-10-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series IOMMUFD Dirty Tracking

Commit Message

Joao Martins Oct. 18, 2023, 8:27 p.m. UTC
VFIO has an operation where it unmaps an IOVA range while returning a
bitmap with the dirty data. In reality the operation doesn't quite query
the IO pagetables on whether each PTE was dirty or not. Instead it marks
everything that was mapped as dirty, and does so in one syscall.
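
For context, that existing one-syscall path is VFIO type1's
VFIO_IOMMU_UNMAP_DMA with VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP. A minimal
userspace sketch (bitmap sizing and error handling elided) looks roughly
like this:

/*
 * Sketch of the existing one-syscall VFIO path: unmap an IOVA range and
 * get the dirty bitmap back from the same ioctl. Everything that was
 * mapped in the range comes back as dirty.
 */
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_unmap_and_get_dirty(int container, __u64 iova, __u64 size,
				    __u64 pgsize, __u64 *bitmap_buf,
				    __u64 bitmap_bytes)
{
	struct vfio_iommu_type1_dma_unmap *unmap;
	struct vfio_bitmap *vb;
	int ret;

	unmap = calloc(1, sizeof(*unmap) + sizeof(*vb));
	if (!unmap)
		return -ENOMEM;

	unmap->argsz = sizeof(*unmap) + sizeof(*vb);
	unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
	unmap->iova = iova;
	unmap->size = size;

	vb = (struct vfio_bitmap *)unmap->data;
	vb->pgsize = pgsize;		/* granularity of one bitmap bit */
	vb->size = bitmap_bytes;	/* bytes available in bitmap_buf */
	vb->data = bitmap_buf;

	ret = ioctl(container, VFIO_IOMMU_UNMAP_DMA, unmap);
	free(unmap);
	return ret;
}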

In IOMMUFD the equivalent is done in two operations by querying with
GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
flushes given that after clearing dirty bits IOMMU implementations require
invalidating their IOTLB, plus another invalidation needed for the UNMAP.
To allow dirty bits to be queried faster, add a flag
(IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests that the dirty bits not be
cleared from the PTEs (just read), under the expectation that the next
operation is the unmap. An alternative is to unmap and just perpetually
mark everything as dirty, as that's the same behaviour as today. So
equivalent functionality can be provided with unmap alone, and if real
dirty info is required the cost is amortized while querying.
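
When real dirty information is wanted, the intended sequence is
GET_DIRTY_IOVA with the new flag immediately followed by the unmap, so
that only the unmap pays an IOTLB invalidation. A rough sketch, assuming
the struct iommu_hwpt_get_dirty_iova field layout this series adds
(size/hwpt_id/flags/iova/length/page_size/data) and the existing
IOMMU_IOAS_UNMAP:

/*
 * Sketch of the two-ioctl sequence this flag is meant for: read the
 * dirty bits without clearing them, then unmap the same range. The
 * iommu_hwpt_get_dirty_iova layout assumes the uapi in this series.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int get_dirty_then_unmap(int iommufd, __u32 hwpt_id, __u32 ioas_id,
				__u64 iova, __u64 length, __u64 page_size,
				__u64 *bitmap)
{
	struct iommu_hwpt_get_dirty_iova get_dirty = {
		.size = sizeof(get_dirty),
		.hwpt_id = hwpt_id,
		.flags = IOMMU_GET_DIRTY_IOVA_NO_CLEAR, /* no IOTLB flush here */
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = (uintptr_t)bitmap,
	};
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = iova,
		.length = length,
	};
	int ret;

	ret = ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty);
	if (ret)
		return ret;

	/* The only IOTLB invalidation happens as part of the unmap */
	return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}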

There's still a race against DMA where in theory the unmap of the IOVA
(when the guest invalidates the IOTLB via the emulated iommu) would race
against the VF performing DMA on the same IOVA. As discussed in [0], we
accept resolving this race by throwing away the DMA: it doesn't matter
whether it hit physical DRAM or not, since the VM can't tell whether we
threw it away because the DMA was blocked or because we failed to copy
the DRAM.

[0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c |  3 ++-
 drivers/iommu/iommufd/io_pagetable.c |  9 +++++++--
 include/uapi/linux/iommufd.h         | 12 ++++++++++++
 3 files changed, 21 insertions(+), 3 deletions(-)

Comments

Jason Gunthorpe Oct. 18, 2023, 10:54 p.m. UTC | #1
On Wed, Oct 18, 2023 at 09:27:06PM +0100, Joao Martins wrote:
> VFIO has an operation where it unmaps an IOVA range while returning a
> bitmap with the dirty data. In reality the operation doesn't quite query
> the IO pagetables on whether each PTE was dirty or not. Instead it marks
> everything that was mapped as dirty, and does so in one syscall.
> 
> In IOMMUFD the equivalent is done in two operations by querying with
> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
> flushes given that after clearing dirty bits IOMMU implementations require
> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
> To allow dirty bits to be queried faster, add a flag
> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests that the dirty bits not be
> cleared from the PTEs (just read), under the expectation that the next
> operation is the unmap. An alternative is to unmap and just perpetually
> mark everything as dirty, as that's the same behaviour as today. So
> equivalent functionality can be provided with unmap alone, and if real
> dirty info is required the cost is amortized while querying.

It seems fine, but I wonder: is it really worthwhile? Did you measure
this? I suppose it matters most during the critical outage window.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Joao Martins Oct. 18, 2023, 11:50 p.m. UTC | #2
On 18/10/2023 23:54, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:06PM +0100, Joao Martins wrote:
>> VFIO has an operation where it unmaps an IOVA range while returning a
>> bitmap with the dirty data. In reality the operation doesn't quite query
>> the IO pagetables on whether each PTE was dirty or not. Instead it marks
>> everything that was mapped as dirty, and does so in one syscall.
>>
>> In IOMMUFD the equivalent is done in two operations by querying with
>> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
>> flushes given that after clearing dirty bits IOMMU implementations require
>> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
>> To allow dirty bits to be queried faster, add a flag
>> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests that the dirty bits not be
>> cleared from the PTEs (just read), under the expectation that the next
>> operation is the unmap. An alternative is to unmap and just perpetually
>> mark everything as dirty, as that's the same behaviour as today. So
>> equivalent functionality can be provided with unmap alone, and if real
>> dirty info is required the cost is amortized while querying.
> 
> It seems fine, but I wonder: is it really worthwhile? Did you measure
> this? I suppose it matters most during the critical outage window.
>

Design-wise we avoid an extra IOTLB invalidation in the emulated-vIOMMU
case, with potentially many mappings being done ... which is where this is
usually more relevant. It bothers me a little that it may fall under
over-optimization, when unmap itself is already expensive. But I didn't
explicitly measure the cost of the IOTLB invalidation that we are saving.

> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason
Tian, Kevin Oct. 20, 2023, 6:52 a.m. UTC | #3
> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> VFIO has an operation where it unmaps an IOVA range while returning a
> bitmap with the dirty data. In reality the operation doesn't quite query
> the IO pagetables on whether each PTE was dirty or not. Instead it marks
> everything that was mapped as dirty, and does so in one syscall.
> 
> In IOMMUFD the equivalent is done in two operations by querying with
> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
> flushes given that after clearing dirty bits IOMMU implementations require
> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
> To allow dirty bits to be queried faster, add a flag
> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests that the dirty bits not be
> cleared from the PTEs (just read), under the expectation that the next
> operation is the unmap. An alternative is to unmap and just perpetually
> mark everything as dirty, as that's the same behaviour as today. So
> equivalent functionality can be provided with unmap alone, and if real
> dirty info is required the cost is amortized while querying.
> 
> There's still a race against DMA where in theory the unmap of the IOVA
> (when the guest invalidates the IOTLB via the emulated iommu) would race
> against the VF performing DMA on the same IOVA. As discussed in [0], we
> accept resolving this race by throwing away the DMA: it doesn't matter
> whether it hit physical DRAM or not, since the VM can't tell whether we
> threw it away because the DMA was blocked or because we failed to copy
> the DRAM.
> 
> [0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Patch

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index c954f91c3b7b..23a5e52b4755 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -252,7 +252,8 @@  int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
 	struct iommufd_ioas *ioas;
 	int rc = -EOPNOTSUPP;
 
-	if ((cmd->flags || cmd->__reserved))
+	if ((cmd->flags & ~(IOMMU_GET_DIRTY_IOVA_NO_CLEAR)) ||
+	    cmd->__reserved)
 		return -EOPNOTSUPP;
 
 	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 0c08b3df1b6f..835d54876b45 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -414,6 +414,7 @@  int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 }
 
 struct iova_bitmap_fn_arg {
+	unsigned long flags;
 	struct io_pagetable *iopt;
 	struct iommu_domain *domain;
 	struct iommu_dirty_bitmap *dirty;
@@ -430,6 +431,7 @@  static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
 	struct iommu_dirty_bitmap *dirty = arg->dirty;
 	const struct iommu_dirty_ops *ops = domain->dirty_ops;
 	unsigned long last_iova = iova + length - 1;
+	unsigned long flags = arg->flags;
 	int ret = -EINVAL;
 
 	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
@@ -437,7 +439,7 @@  static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
 
 		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
 						last - iter.cur_iova + 1,
-						0, dirty);
+						flags, dirty);
 		if (ret)
 			break;
 	}
@@ -470,12 +472,15 @@  static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
 
 	iommu_dirty_bitmap_init(&dirty, iter, &gather);
 
+	arg.flags = flags;
 	arg.iopt = iopt;
 	arg.domain = domain;
 	arg.dirty = &dirty;
 	iova_bitmap_for_each(iter, &arg, __iommu_read_and_clear_dirty);
 
-	iommu_iotlb_sync(domain, &gather);
+	if (!(flags & IOMMU_DIRTY_NO_CLEAR))
+		iommu_iotlb_sync(domain, &gather);
+
 	iova_bitmap_free(iter);
 
 	return ret;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 91de0043e73f..8b372b43ffc0 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -492,6 +492,18 @@  struct iommu_hwpt_set_dirty {
 };
 #define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
 
+/**
+ * enum iommufd_get_dirty_iova_flags - Flags for getting dirty bits
+ * @IOMMU_GET_DIRTY_IOVA_NO_CLEAR: Just read the PTEs without clearing any dirty
+ *                                 bits metadata. This flag can be passed in the
+ *                                 expectation where the next operation is
+ *                                 an unmap of the same IOVA range.
+ *
+ */
+enum iommufd_hwpt_get_dirty_iova_flags {
+	IOMMU_GET_DIRTY_IOVA_NO_CLEAR = 1,
+};
+
 /**
  * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
  * @size: sizeof(struct iommu_hwpt_get_dirty_iova)