
[v4,11/18] iommu/amd: Access/Dirty bit support in IOPTEs

Message ID 20231018202715.69734-12-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series IOMMUFD Dirty Tracking

Commit Message

Joao Martins Oct. 18, 2023, 8:27 p.m. UTC
The IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref [0]: "1.3.8 Enhanced Support for
Access and Dirty Bits".

To enable it, set the DTE flags in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every Memory Request or ATS translation request. It is up to the VMM to
steer whether dirty tracking is enabled or not, rather than wrongly
doing so in the IOMMU. Relevant AMD IOMMU SDM ref [0], "Table 7. Device
Table Entry (DTE) Field Definitions", particularly the entry "HAD".

Toggling it on and off is relatively simple: set the two bits in the
DTE and flush the device's DTE cache.
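
As a condensed sketch of what amd_iommu_set_dirty_tracking() in the
patch below does (locking and the follow-up IOTLB flush omitted), the
per-device toggle is:

	/* Flip the HAD bits (8:7) in each device's DTE, then flush it */
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		u64 pte_root = dev_table[dev_data->devid].data[0];

		pte_root = enable ? pte_root | DTE_FLAG_HAD
				  : pte_root & ~DTE_FLAG_HAD;
		dev_table[dev_data->devid].data[0] = pte_root;
		device_flush_dte(dev_data);
	}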

To get what's dirtied, use the existing AMD io-pgtable support, walking
the pagetables over each IOVA with fetch_pte().  The IOTLB flushing is
left to the caller (much like unmap), and iommu_dirty_bitmap_record() is
what adds the page ranges to invalidate. This allows the caller to batch
the flush over a big span of IOVA space, without the IOMMU driver having
to decide when to flush.
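
For reference, the IOMMUFD side of this series drives it roughly like
this (a simplified sketch; the iova_bitmap plumbing is left out):

	const struct iommu_dirty_ops *ops = domain->dirty_ops;
	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;
	int ret;

	iommu_dirty_bitmap_init(&dirty, bitmap, &gather);
	ret = ops->read_and_clear_dirty(domain, iova, size, flags, &dirty);
	/* One batched flush for the whole span, instead of per-range */
	iommu_iotlb_sync(domain, &gather);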

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and what
it expects from its subsequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page tables.
The IOMMU interlocked operations to update A and D bits must be 64-bit
operations and naturally aligned on a 64-bit boundary"

... and for the IOMMU's update sequence for the Dirty bit, it
essentially states (a rough C model follows the list):

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the
atomic lock with no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.
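
A rough C model of that hardware sequence (illustration only, not
driver code; needed_a_d_bits stands for whichever of A/D the request
requires):

	u64 old, new;

	do {
		old = READ_ONCE(*ptep);
		if (!(old & IOMMU_PTE_PR))	/* step 2: P=0 fails the access */
			return -EFAULT;
		new = old | needed_a_d_bits;	/* step 4: calculate new A & D */
		if (new == old)			/* no update required */
			break;
		/* 64-bit interlocked, naturally aligned update */
	} while (cmpxchg64(ptep, old, new) != old);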

Access/Dirty bit readout also needs to consider the non-default page
sizes (aka replicated PTEs, as the manual calls them), since AMD
supports all power-of-two page sizes (except 512G).
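
For example, a non-default 32K mapping is backed by 8 replicated
4K-level PTEs (PAGE_SIZE_PTE_COUNT(SZ_32K) == 8), so the readout has to
OR the Dirty bit across all replicas, roughly:

	count = PAGE_SIZE_PTE_COUNT(pgsize);
	for (i = 0; i < count; i++)
		dirty |= !!(READ_ONCE(ptep[i]) & IOMMU_PTE_HD);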

Select IOMMUFD_DRIVER only if IOMMUFD is enabled considering that IOMMU
dirty tracking requires IOMMUFD.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/Kconfig           |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 12 ++++
 drivers/iommu/amd/io_pgtable.c      | 69 +++++++++++++++++++++
 drivers/iommu/amd/iommu.c           | 96 +++++++++++++++++++++++++++++
 4 files changed, 178 insertions(+)

Comments

Jason Gunthorpe Oct. 18, 2023, 11:11 p.m. UTC | #1
On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
> +					 unsigned long iova, size_t size,
> +					 unsigned long flags,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> +	unsigned long end = iova + size - 1;
> +
> +	do {
> +		unsigned long pgsize = 0;
> +		u64 *ptep, pte;
> +
> +		ptep = fetch_pte(pgtable, iova, &pgsize);
> +		if (ptep)
> +			pte = READ_ONCE(*ptep);

It is fine for now, but this is so slow for something that is such a
fast path. We are optimizing away a TLB invalidation but leaving
this???

It is a radix tree, you walk trees by retaining your position at each
level as you go (eg in a function per-level call chain or something)
then ++ is cheap. Re-searching the entire tree every time is madness.

Jason
Joao Martins Oct. 19, 2023, 12:17 a.m. UTC | #2
On 19/10/2023 00:11, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>> +					 unsigned long iova, size_t size,
>> +					 unsigned long flags,
>> +					 struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>> +	unsigned long end = iova + size - 1;
>> +
>> +	do {
>> +		unsigned long pgsize = 0;
>> +		u64 *ptep, pte;
>> +
>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>> +		if (ptep)
>> +			pte = READ_ONCE(*ptep);
> 
> It is fine for now, but this is so slow for something that is such a
> fast path. We are optimizing away a TLB invalidation but leaving
> this???
> 

More obvious reason is that I'm still working towards the 'faster' page table
walker. Then map/unmap code needs to do similar lookups so thought of reusing
the same functions as map/unmap initially. And improve it afterwards or when
introducing the splitting.

> It is a radix tree, you walk trees by retaining your position at each
> level as you go (eg in a function per-level call chain or something)
> then ++ is cheap. Re-searching the entire tree every time is madness.

I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
still in the works), but in my experiments with huge IOVA ranges, the time to
walk the page tables ends up making not that much difference, compared to the
size it needs to walk. What matters, however, is that once we move up a level
(PMD), walking huge IOVA ranges is super-cheap (and invisible with PUDs). Which
makes the dynamic-splitting/page-demotion important.

Furthermore, this is not quite yet easy for other people to test and see numbers
for themselves; so more and more I need to work on something like
iommufd_log_perf tool under tools/testing that is similar to the gup_perf to
make all performance work obvious and 'standardized'.

------->8--------
[0] [hasn't been rebased into this version I sent]

commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Sat Oct 7 18:17:33 2023 -0400

    iommu/amd: Improve dirty read io-pgtable walker

    fetch_pte() based walking is a little inefficient for level-1 page-sizes.

    It walks all the levels to return a PTE, disregarding the potential
    batching that could be done at the previous level. Implement a
    page-table walker based on the freeing functions which recursively
    walks the next level.

    For each level it iterates over the non-default page sizes as the
    different mappings return, provided each level-7 PTE may account for
    the next power-of-2 per added PTE.

    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 29f5ab0ba14f..babb5fb5fd51 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
        return dirty;
 }

+static bool pte_is_large_or_base(u64 *ptep)
+{
+       return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
+}
+
+static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
+                          int level, unsigned long flags,
+                          struct iommu_dirty_bitmap *dirty)
+{
+       unsigned long addr, isize, end = iova + size;
+       unsigned long page_size;
+       int i, next_level;
+       u64 *p, *ptep;
+
+       next_level = level - 1;
+       isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
+
+       for (addr = iova; addr < end; addr += isize) {
+               i = PM_LEVEL_INDEX(next_level, addr);
+               ptep = &pt[i];
+
+               /* PTE present? */
+               if (!IOMMU_PTE_PRESENT(*ptep))
+                       continue;
+
+               if (level > 1 && !pte_is_large_or_base(ptep)) {
+                       p = IOMMU_PTE_PAGE(*ptep);
+                       isize = min(end - addr, page_size);
+                       walk_iova_range(p, addr, isize, next_level,
+                                       flags, dirty);
+               } else {
+                       isize = PM_PTE_LEVEL(*ptep) == 7 ?
+                                       PTE_PAGE_SIZE(*ptep) : page_size;
+
+                       /*
+                        * Mark the whole IOVA range as dirty even if only one
+                        * of the replicated PTEs were marked dirty.
+                        */
+                       if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
+                                       pte_test_dirty(ptep, isize)) ||
+                           pte_test_and_clear_dirty(ptep, isize))
+                               iommu_dirty_bitmap_record(dirty, addr, isize);
+               }
+       }
+
+       return 0;
+}
+
 static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
                                         unsigned long iova, size_t size,
                                         unsigned long flags,
                                         struct iommu_dirty_bitmap *dirty)
 {
        struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
-       unsigned long end = iova + size - 1;
-
-       do {
-               unsigned long pgsize = 0;
-               u64 *ptep, pte;
-
-               ptep = fetch_pte(pgtable, iova, &pgsize);
-               if (ptep)
-                       pte = READ_ONCE(*ptep);
-               if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
-                       pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
-                       iova += pgsize;
-                       continue;
-               }
-
-               /*
-                * Mark the whole IOVA range as dirty even if only one of
-                * the replicated PTEs were marked dirty.
-                */
-               if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
-                               pte_test_dirty(ptep, pgsize)) ||
-                   pte_test_and_clear_dirty(ptep, pgsize))
-                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
-               iova += pgsize;
-       } while (iova < end);

-       return 0;
+       return walk_iova_range(pgtable->root, iova, size,
+                              pgtable->mode, flags, dirty);
 }

 /*
Joao Martins Oct. 19, 2023, 11:58 a.m. UTC | #3
On 19/10/2023 01:17, Joao Martins wrote:
> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>> +					 unsigned long iova, size_t size,
>>> +					 unsigned long flags,
>>> +					 struct iommu_dirty_bitmap *dirty)
>>> +{
>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>> +	unsigned long end = iova + size - 1;
>>> +
>>> +	do {
>>> +		unsigned long pgsize = 0;
>>> +		u64 *ptep, pte;
>>> +
>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>> +		if (ptep)
>>> +			pte = READ_ONCE(*ptep);
>>
>> It is fine for now, but this is so slow for something that is such a
>> fast path. We are optimizing away a TLB invalidation but leaving
>> this???
>>
> 
> More obvious reason is that I'm still working towards the 'faster' page table
> walker. Then map/unmap code needs to do similar lookups so thought of reusing
> the same functions as map/unmap initially. And improve it afterwards or when
> introducing the splitting.
> 
>> It is a radix tree, you walk trees by retaining your position at each
>> level as you go (eg in a function per-level call chain or something)
>> then ++ is cheap. Re-searching the entire tree every time is madness.
> 
> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
> still in the works), 

Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
map/unmap/iova_to_phys) does something a little off when it finds a non-present
PTE. It allocates a page table for it, which is not OK in this specific case (I
would argue it isn't for iova_to_phys either, but maybe I misunderstand the
expectation of that API).

AMD has no such behaviour, though that driver per your earlier suggestion might
need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
don't need to wait for the last 3 series of AMD driver refactoring (?) to be
done, as that looks to be more SVA-related; unless there's something more
specific you are looking for prior to introducing AMD's domain_alloc_user().

Anyhow, let me fix this and post an update. Perhaps it's best I target this for
-rc1 and have improved page-table walkers all at once [the iommufd_log_perf
thingie below is unlikely to be part of this set right away]. I have been
playing with the AMD driver a lot more on bare metal, so I am getting confident
in the snippet below (even with big IOVA ranges). I'm also retrying to see
in-house if there's now a rev3.0 Intel machine that I can post results for by
-rc1 (last time, in v2, I didn't; but things could have changed).

> but in my experiments with huge IOVA ranges, the time to
> walk the page tables ends up making not that much difference, compared to the
> size it needs to walk. What matters, however, is that once we move up
> a level (PMD), walking huge IOVA ranges is super-cheap (and invisible with
> PUDs). Which makes the dynamic-splitting/page-demotion important.
> 
> Furthermore, this is not quite yet easy for other people to test and see numbers
> for themselves; so more and more I need to work on something like
> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
> performance work obvious and 'standardized'
> 
> ------->8--------
> [0] [hasn't been rebased into this version I sent]
> 
> commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
> Author: Joao Martins <joao.m.martins@oracle.com>
> Date:   Sat Oct 7 18:17:33 2023 -0400
> 
>     iommu/amd: Improve dirty read io-pgtable walker
> 
>     fetch_pte() based walking is a little inefficient for level-1 page-sizes.
> 
>     It walks all the levels to return a PTE, disregarding the potential
>     batching that could be done at the previous level. Implement a
>     page-table walker based on the freeing functions which recursively
>     walks the next level.
> 
>     For each level it iterates over the non-default page sizes as the
>     different mappings return, provided each level-7 PTE may account for
>     the next power-of-2 per added PTE.
> 
>     Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 29f5ab0ba14f..babb5fb5fd51 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
>         return dirty;
>  }
> 
> +static bool pte_is_large_or_base(u64 *ptep)
> +{
> +       return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
> +}
> +
> +static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
> +                          int level, unsigned long flags,
> +                          struct iommu_dirty_bitmap *dirty)
> +{
> +       unsigned long addr, isize, end = iova + size;
> +       unsigned long page_size;
> +       int i, next_level;
> +       u64 *p, *ptep;
> +
> +       next_level = level - 1;
> +       isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
> +
> +       for (addr = iova; addr < end; addr += isize) {
> +               i = PM_LEVEL_INDEX(next_level, addr);
> +               ptep = &pt[i];
> +
> +               /* PTE present? */
> +               if (!IOMMU_PTE_PRESENT(*ptep))
> +                       continue;
> +
> +               if (level > 1 && !pte_is_large_or_base(ptep)) {
> +                       p = IOMMU_PTE_PAGE(*ptep);
> +                       isize = min(end - addr, page_size);
> +                       walk_iova_range(p, addr, isize, next_level,
> +                                       flags, dirty);
> +               } else {
> +                       isize = PM_PTE_LEVEL(*ptep) == 7 ?
> +                                       PTE_PAGE_SIZE(*ptep) : page_size;
> +
> +                       /*
> +                        * Mark the whole IOVA range as dirty even if only one
> +                        * of the replicated PTEs were marked dirty.
> +                        */
> +                       if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +                                       pte_test_dirty(ptep, isize)) ||
> +                           pte_test_and_clear_dirty(ptep, isize))
> +                               iommu_dirty_bitmap_record(dirty, addr, isize);
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>                                          unsigned long iova, size_t size,
>                                          unsigned long flags,
>                                          struct iommu_dirty_bitmap *dirty)
>  {
>         struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> -       unsigned long end = iova + size - 1;
> -
> -       do {
> -               unsigned long pgsize = 0;
> -               u64 *ptep, pte;
> -
> -               ptep = fetch_pte(pgtable, iova, &pgsize);
> -               if (ptep)
> -                       pte = READ_ONCE(*ptep);
> -               if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
> -                       pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
> -                       iova += pgsize;
> -                       continue;
> -               }
> -
> -               /*
> -                * Mark the whole IOVA range as dirty even if only one of
> -                * the replicated PTEs were marked dirty.
> -                */
> -               if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> -                               pte_test_dirty(ptep, pgsize)) ||
> -                   pte_test_and_clear_dirty(ptep, pgsize))
> -                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
> -               iova += pgsize;
> -       } while (iova < end);
> 
> -       return 0;
> +       return walk_iova_range(pgtable->root, iova, size,
> +                              pgtable->mode, flags, dirty);
>  }
> 
>  /*
Jason Gunthorpe Oct. 19, 2023, 11:59 p.m. UTC | #4
On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:

> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> map/unmap/iova_to_phys) does something a little off when it finds a non-present
> PTE. It allocates a page table for it, which is not OK in this specific case (I
> would argue it isn't for iova_to_phys either, but maybe I misunderstand the
> expectation of that API).

Oh :(
 
> AMD has no such behaviour, though that driver per your earlier suggestion might
> need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
> don't need to wait for the last 3 series of AMD driver refactoring (?) to be
> done, as that looks to be more SVA-related; unless there's something more
> specific you are looking for prior to introducing AMD's domain_alloc_user().

I don't think we need to wait, it just needs to go on the cleaning list.
 
> Anyhow, let me fix this, and post an update. Perhaps it's best I target this for
> -rc1 and have improved page-table walkers all at once [the iommufd_log_perf
> thingie below unlikely to be part of this set right away]. I have been playing
> with the AMD driver a lot more on baremetal, so I am getting confident on the
> snippet below (even with big IOVA ranges). I'm also retrying to see in-house if
> there's now a rev3.0 Intel machine that I can post results for -rc1 (last time
> in v2 I didn't; but things could have changed).

I'd rather you keep it simple and send the walkers as followups to the
driver maintainers directly.

> > for themselves; so more and more I need to work on something like
> > iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
> > performance work obvious and 'standardized'

We have a mlx5 vfio driver in rdma-core and I have been thinking it
would be a nice basis for building an iommufd tester/benchmarker as it
has a wide set of "easily" triggered functionality.

Jason
Baolu Lu Oct. 20, 2023, 2:21 a.m. UTC | #5
On 10/19/23 7:58 PM, Joao Martins wrote:
> On 19/10/2023 01:17, Joao Martins wrote:
>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>> +					 unsigned long iova, size_t size,
>>>> +					 unsigned long flags,
>>>> +					 struct iommu_dirty_bitmap *dirty)
>>>> +{
>>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>> +	unsigned long end = iova + size - 1;
>>>> +
>>>> +	do {
>>>> +		unsigned long pgsize = 0;
>>>> +		u64 *ptep, pte;
>>>> +
>>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>>> +		if (ptep)
>>>> +			pte = READ_ONCE(*ptep);
>>> It is fine for now, but this is so slow for something that is such a
>>> fast path. We are optimizing away a TLB invalidation but leaving
>>> this???
>>>
>> More obvious reason is that I'm still working towards the 'faster' page table
>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>> the same functions as map/unmap initially. And improve it afterwards or when
>> introducing the splitting.
>>
>>> It is a radix tree, you walk trees by retaining your position at each
>>> level as you go (eg in a function per-level call chain or something)
>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>> still in the works),
> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> map/unmap/iova_to_phys) does something a little off when it finds a non-present
> PTE. It allocates a page table for it, which is not OK in this specific case (I
> would argue it isn't for iova_to_phys either, but maybe I misunderstand the
> expectation of that API).

pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
target_level parameter is set to 0. See below line 932.

  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
  914                           unsigned long pfn, int *target_level,
  915                           gfp_t gfp)
  916 {

[...]

  927         while (1) {
  928                 void *tmp_page;
  929
  930                 offset = pfn_level_offset(pfn, level);
  931                 pte = &parent[offset];
  932                 if (!*target_level && (dma_pte_superpage(pte) || !dma_pte_present(pte)))
  933                         break;

So both iova_to_phys() and read_and_clear_dirty() are doing things
right:

	struct dma_pte *pte;
	int level = 0;

	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
                              &level, GFP_KERNEL);
	if (pte && dma_pte_present(pte)) {
		/* The PTE is valid, check anything you want! */
		... ...
	}

Or, I am overlooking something else?

Best regards,
baolu
Tian, Kevin Oct. 20, 2023, 7:01 a.m. UTC | #6
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Friday, October 20, 2023 10:22 AM
> 
> On 10/19/23 7:58 PM, Joao Martins wrote:
> > On 19/10/2023 01:17, Joao Martins wrote:
> >> On 19/10/2023 00:11, Jason Gunthorpe wrote:
> >>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
> >>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
> >>>> +					 unsigned long iova, size_t size,
> >>>> +					 unsigned long flags,
> >>>> +					 struct iommu_dirty_bitmap *dirty)
> >>>> +{
> >>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> >>>> +	unsigned long end = iova + size - 1;
> >>>> +
> >>>> +	do {
> >>>> +		unsigned long pgsize = 0;
> >>>> +		u64 *ptep, pte;
> >>>> +
> >>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
> >>>> +		if (ptep)
> >>>> +			pte = READ_ONCE(*ptep);
> >>> It is fine for now, but this is so slow for something that is such a
> >>> fast path. We are optimizing away a TLB invalidation but leaving
> >>> this???
> >>>
> >> More obvious reason is that I'm still working towards the 'faster' page
> >> table walker. Then map/unmap code needs to do similar lookups so thought of
> >> reusing the same functions as map/unmap initially. And improve it afterwards
> >> or when introducing the splitting.
> >>
> >>> It is a radix tree, you walk trees by retaining your position at each
> >>> level as you go (eg in a function per-level call chain or something)
> >>> then ++ is cheap. Re-searching the entire tree every time is madness.
> >> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for
> >> Intel; still in the works),
> > Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> > map/unmap/iova_to_phys) does something a little off when it finds a
> > non-present PTE. It allocates a page table for it, which is not OK in this
> > specific case (I would argue it isn't for iova_to_phys either, but maybe I
> > misunderstand the expectation of that API).
> 
> pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
> target_level parameter is set to 0. See below line 932.
> 
>   913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>   914                           unsigned long pfn, int *target_level,
>   915                           gfp_t gfp)
>   916 {
> 
> [...]
> 
>   927         while (1) {
>   928                 void *tmp_page;
>   929
>   930                 offset = pfn_level_offset(pfn, level);
>   931                 pte = &parent[offset];
>   932                 if (!*target_level && (dma_pte_superpage(pte) || !dma_pte_present(pte)))
>   933                         break;
> 
> So both iova_to_phys() and read_and_clear_dirty() are doing things
> right:
> 
> 	struct dma_pte *pte;
> 	int level = 0;
> 
> 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>                               &level, GFP_KERNEL);
> 	if (pte && dma_pte_present(pte)) {
> 		/* The PTE is valid, check anything you want! */
> 		... ...
> 	}
> 
> Or, I am overlooking something else?
> 

This is correct.
Joao Martins Oct. 20, 2023, 9:34 a.m. UTC | #7
On 20/10/2023 03:21, Baolu Lu wrote:
> On 10/19/23 7:58 PM, Joao Martins wrote:
>> On 19/10/2023 01:17, Joao Martins wrote:
>>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>>> +                     unsigned long iova, size_t size,
>>>>> +                     unsigned long flags,
>>>>> +                     struct iommu_dirty_bitmap *dirty)
>>>>> +{
>>>>> +    struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>>> +    unsigned long end = iova + size - 1;
>>>>> +
>>>>> +    do {
>>>>> +        unsigned long pgsize = 0;
>>>>> +        u64 *ptep, pte;
>>>>> +
>>>>> +        ptep = fetch_pte(pgtable, iova, &pgsize);
>>>>> +        if (ptep)
>>>>> +            pte = READ_ONCE(*ptep);
>>>> It is fine for now, but this is so slow for something that is such a
>>>> fast path. We are optimizing away a TLB invalidation but leaving
>>>> this???
>>>>
>>> More obvious reason is that I'm still working towards the 'faster' page table
>>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>>> the same functions as map/unmap initially. And improve it afterwards or when
>>> introducing the splitting.
>>>
>>>> It is a radix tree, you walk trees by retaining your position at each
>>>> level as you go (eg in a function per-level call chain or something)
>>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>>> still in the works),
>> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
>> map/unmap/iova_to_phys) does something a little off when it finds a non-present
>> PTE. It allocates a page table for it, which is not OK in this specific case (I
>> would argue it isn't for iova_to_phys either, but maybe I misunderstand the
>> expectation of that API).
> 
> pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
> target_level parameter is set to 0. See below line 932.
> 
>  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>  914                           unsigned long pfn, int *target_level,
>  915                           gfp_t gfp)
>  916 {
> 
> [...]
> 
>  927         while (1) {
>  928                 void *tmp_page;
>  929
>  930                 offset = pfn_level_offset(pfn, level);
>  931                 pte = &parent[offset];
>  932                 if (!*target_level && (dma_pte_superpage(pte) ||
> !dma_pte_present(pte)))
>  933                         break;
> 
> So both iova_to_phys() and read_and_clear_dirty() are doing things
> right:
> 
>     struct dma_pte *pte;
>     int level = 0;
> 
>     pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>                              &level, GFP_KERNEL);
>     if (pte && dma_pte_present(pte)) {
>         /* The PTE is valid, check anything you want! */
>         ... ...
>     }
> 
> Or, I am overlooking something else?

You're right, thanks for keeping me straight -- I was already doing the
right thing. I'd forgotten about it in the midst of the other code -- probably
worth a comment in the caller to make it obvious.
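
Something like this, for illustration (hypothetical wording for the
comment; the call itself is the existing one):

	int lvl = 0;

	/* With *target_level == 0, pfn_to_dma_pte() never allocates */
	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
			     GFP_ATOMIC);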

	Joao
Joao Martins Oct. 20, 2023, 11:20 a.m. UTC | #8
On 20/10/2023 10:34, Joao Martins wrote:
> On 20/10/2023 03:21, Baolu Lu wrote:
>> On 10/19/23 7:58 PM, Joao Martins wrote:
>>> On 19/10/2023 01:17, Joao Martins wrote:
>>>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>>>> +                     unsigned long iova, size_t size,
>>>>>> +                     unsigned long flags,
>>>>>> +                     struct iommu_dirty_bitmap *dirty)
>>>>>> +{
>>>>>> +    struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>>>> +    unsigned long end = iova + size - 1;
>>>>>> +
>>>>>> +    do {
>>>>>> +        unsigned long pgsize = 0;
>>>>>> +        u64 *ptep, pte;
>>>>>> +
>>>>>> +        ptep = fetch_pte(pgtable, iova, &pgsize);
>>>>>> +        if (ptep)
>>>>>> +            pte = READ_ONCE(*ptep);
>>>>> It is fine for now, but this is so slow for something that is such a
>>>>> fast path. We are optimizing away a TLB invalidation but leaving
>>>>> this???
>>>>>
>>>> More obvious reason is that I'm still working towards the 'faster' page table
>>>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>>>> the same functions as map/unmap initially. And improve it afterwards or when
>>>> introducing the splitting.
>>>>
>>>>> It is a radix tree, you walk trees by retaining your position at each
>>>>> level as you go (eg in a function per-level call chain or something)
>>>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>>>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>>>> still in the works),
>>> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
>>> map/unmap/iova_to_phys) does something a little off when it finds a non-present
>>> PTE. It allocates a page table for it, which is not OK in this specific case (I
>>> would argue it isn't for iova_to_phys either, but maybe I misunderstand the
>>> expectation of that API).
>>
>> pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
>> target_level parameter is set to 0. See below line 932.
>>
>>  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>>  914                           unsigned long pfn, int *target_level,
>>  915                           gfp_t gfp)
>>  916 {
>>
>> [...]
>>
>>  927         while (1) {
>>  928                 void *tmp_page;
>>  929
>>  930                 offset = pfn_level_offset(pfn, level);
>>  931                 pte = &parent[offset];
>>  932                 if (!*target_level && (dma_pte_superpage(pte) || !dma_pte_present(pte)))
>>  933                         break;
>>
>> So both iova_to_phys() and read_and_clear_dirty() are doing things
>> right:
>>
>>     struct dma_pte *pte;
>>     int level = 0;
>>
>>     pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>>                              &level, GFP_KERNEL);
>>     if (pte && dma_pte_present(pte)) {
>>         /* The PTE is valid, check anything you want! */
>>         ... ...
>>     }
>>
>> Or, I am overlooking something else?
> 
> You're right, thanks for keeping me straight -- I was already doing the
> right thing. I'd forgotten about it in the midst of the other code -- probably
> worth a comment in the caller to make it obvious.

For what it's worth, this is the improved page-table walker I have in staging
(as a separate patch) alongside AMD's. It is quite similar, except the AMD IOMMU
has a bigger feature set in the page sizes its PTEs can represent, but the crux
of the walking is the same, modulo different coding styles between the IOMMU
drivers.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 97558b420e35..f6990962af2a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4889,14 +4889,52 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
        return ret;
 }

+static int walk_dirty_dma_pte_level(struct dmar_domain *domain, int level,
+                                   struct dma_pte *pte, unsigned long start_pfn,
+                                   unsigned long last_pfn, unsigned long flags,
+                                   struct iommu_dirty_bitmap *dirty)
+{
+       unsigned long pfn, page_size;
+
+       pfn = start_pfn;
+       pte = &pte[pfn_level_offset(pfn, level)];
+
+       do {
+               unsigned long level_pfn = pfn & level_mask(level);
+               unsigned long level_last;
+
+               if (!dma_pte_present(pte))
+                       goto next;
+
+               if (level > 1 && !dma_pte_superpage(pte)) {
+                       level_last = level_pfn + level_size(level) - 1;
+                       level_last = min(level_last, last_pfn);
+                       walk_dirty_dma_pte_level(domain, level - 1,
+                                                phys_to_virt(dma_pte_addr(pte)),
+                                                pfn, level_last,
+                                                flags, dirty);
+               } else {
+                       page_size = level_size(level) << VTD_PAGE_SHIFT;
+
+                       if (dma_sl_pte_test_and_clear_dirty(pte, flags))
+                               iommu_dirty_bitmap_record(dirty,
+                                                         pfn << VTD_PAGE_SHIFT,
+                                                         page_size);
+               }
+next:
+               pfn = level_pfn + level_size(level);
+       } while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+
+       return 0;
+}
+
 static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
                                            unsigned long iova, size_t size,
                                            unsigned long flags,
                                            struct iommu_dirty_bitmap *dirty)
 {
        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-       unsigned long end = iova + size - 1;
-       unsigned long pgsize;
+       unsigned long start_pfn, last_pfn;

        /*
         * IOMMUFD core calls into a dirty tracking disabled domain without an
@@ -4907,24 +4945,14 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
        if (!dmar_domain->dirty_tracking && dirty->bitmap)
                return -EINVAL;

-       do {
-               struct dma_pte *pte;
-               int lvl = 0;
-
-               pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
-                                    GFP_ATOMIC);
-               pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
-               if (!pte || !dma_pte_present(pte)) {
-                       iova += pgsize;
-                       continue;
-               }

-               if (dma_sl_pte_test_and_clear_dirty(pte, flags))
-                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
-               iova += pgsize;
-       } while (iova < end);
+       start_pfn = iova >> VTD_PAGE_SHIFT;
+       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;

-       return 0;
+       return walk_dirty_dma_pte_level(dmar_domain,
+                                       agaw_to_level(dmar_domain->agaw),
+                                       dmar_domain->pgd, start_pfn, last_pfn,
+                                       flags, dirty);
 }

 const struct iommu_dirty_ops intel_dirty_ops = {
Joao Martins Oct. 20, 2023, 2:43 p.m. UTC | #9
On 20/10/2023 00:59, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>> AMD has no such behaviour, though that driver per your earlier suggestion might
>> need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
>> don't need to wait for the last 3 series of AMD driver refactoring (?) to be
>> done, as that looks to be more SVA-related; unless there's something more
>> specific you are looking for prior to introducing AMD's domain_alloc_user().
> 
> I don't think we need to wait, it just needs to go on the cleaning list.
>

I am not sure I followed. This suggests post-merge cleanups, which goes in a
different direction from your original comment? But maybe I am just not parsing
it right (sorry, just confused)

>>> for themselves; so more and more I need to work on something like
>>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
>>> performance work obvious and 'standardized'
> 
> We have a mlx5 vfio driver in rdma-core and I have been thinking it
> would be a nice basis for building an iommufd tester/benchmarker as it
> has a wide set of "easilly" triggered functionality.

Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
support for mlx5-vfio was to do direct usage of the firmware interface, but it
appears to be for regular RDMA apps as well. I do use some RDMA to exercise
iommu dirty tracking; but it's more like a rudimentary test inside the guest,
not something self-contained.

I was thinking of something more basic (for starters), device-agnostic, to
exercise the iommu side -- I gave the gup_test example as that's exactly what I
have in mind. Though having in-device knowledge is the ultimate tool to exercise
this free of migration problems/overheads/algorithms. Initially I was thinking
of a DPDK app where you program the device itself, but that is a bit too
complex. But if rdma-core can still use mlx5-VFIO while retaining the same
semantics as a 'normal' RDMA app (e.g. one which registers some MRs and lets
you do rdma read/write), then this is definitely a good foundation.

	Joao
Joao Martins Oct. 20, 2023, 6:57 p.m. UTC | #10
Suravee,

On 18/10/2023 21:27, Joao Martins wrote:
> @@ -2379,6 +2407,69 @@ static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
>  	return false;
>  }
>  
> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					bool enable)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct dev_table_entry *dev_table;
> +	struct iommu_dev_data *dev_data;
> +	struct amd_iommu *iommu;
> +	unsigned long flags;
> +	u64 pte_root;
> +
> +	spin_lock_irqsave(&pdomain->lock, flags);
> +	if (!(pdomain->dirty_tracking ^ enable)) {
> +		spin_unlock_irqrestore(&pdomain->lock, flags);
> +		return 0;
> +	}
> +
> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> +		iommu = rlookup_amd_iommu(dev_data->dev);
> +		if (!iommu)
> +			continue;
> +
> +		dev_table = get_dev_table(iommu);
> +		pte_root = dev_table[dev_data->devid].data[0];
> +
> +		pte_root = (enable ?
> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
> +
> +		/* Flush device DTE */
> +		dev_table[dev_data->devid].data[0] = pte_root;
> +		device_flush_dte(dev_data);
> +	}
> +
> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
> +	amd_iommu_domain_flush_tlb_pde(pdomain);
> +	amd_iommu_domain_flush_complete(pdomain);
> +	pdomain->dirty_tracking = enable;
> +	spin_unlock_irqrestore(&pdomain->lock, flags);
> +
> +	return 0;
> +}
> +

I'm adding the snippet below considering some earlier discussion on the Intel
driver. It only skips the domain flush when the domain has no devices (or
rlookup didn't give an iommu). Technically this is code that was mistakenly
deleted from rfcv1->rfcv2 after your first review, so I'm still retaining your
Reviewed-by; let me know if that's wrong.

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 6b4768ff66e1..c5be76e019bf 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2413,6 +2413,7 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
        struct protection_domain *pdomain = to_pdomain(domain);
        struct dev_table_entry *dev_table;
        struct iommu_dev_data *dev_data;
+       bool domain_flush = false;
        struct amd_iommu *iommu;
        unsigned long flags;
        u64 pte_root;
@@ -2437,11 +2438,14 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
                /* Flush device DTE */
                dev_table[dev_data->devid].data[0] = pte_root;
                device_flush_dte(dev_data);
+               domain_flush = true;
        }

        /* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
-       amd_iommu_domain_flush_tlb_pde(pdomain);
-       amd_iommu_domain_flush_complete(pdomain);
+       if (domain_flush) {
+               amd_iommu_domain_flush_tlb_pde(pdomain);
+               amd_iommu_domain_flush_complete(pdomain);
+       }
        pdomain->dirty_tracking = enable;
        spin_unlock_irqrestore(&pdomain->lock, flags);
Joao Martins Oct. 20, 2023, 9:22 p.m. UTC | #11
On 20/10/2023 15:43, Joao Martins wrote:
> On 20/10/2023 00:59, Jason Gunthorpe wrote:
>> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>>> AMD has no such behaviour, though that driver per your earlier suggestion might
>>> need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
>>> don't need to wait for the last 3 series of AMD driver refactoring (?) to be
>>> done, as that looks to be more SVA-related; unless there's something more
>>> specific you are looking for prior to introducing AMD's domain_alloc_user().
>>
>> I don't think we need to wait, it just needs to go on the cleaning list.
>>
> 
> I am not sure I followed. This suggests post-merge cleanups, which goes in a
> different direction from your original comment? But maybe I am just not parsing
> it right (sorry, just confused)
> 
Oh, I think I now really understand what you originally meant. The wait applies
to other stuff that needs work unrelated to this, not specifically to these
patches, e.g. stuff like domain_alloc_paging as in your example, with respect
to domain_alloc_user being reused as well for different domain types.

I will follow-up with v5 shortly with both drivers.
Jason Gunthorpe Oct. 21, 2023, 4:14 p.m. UTC | #12
On Fri, Oct 20, 2023 at 03:43:57PM +0100, Joao Martins wrote:
> On 20/10/2023 00:59, Jason Gunthorpe wrote:
> > On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
> >> AMD has no such behaviour, though that driver per your earlier suggestion might
> >> need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
> >> don't need to wait for the last 3 series of AMD driver refactoring (?) to be
> >> done, as that looks to be more SVA-related; unless there's something more
> >> specific you are looking for prior to introducing AMD's domain_alloc_user().
> > 
> > I don't think we need to wait, it just needs to go on the cleaning list.
> >
> 
> I am not sure I followed. This suggests post-merge cleanups, which goes in a
> different direction from your original comment? But maybe I am just not parsing
> it right (sorry, just confused)

Yes post merge for the weirdo alloc flow

> >>> for themselves; so more and more I need to work on something like
> >>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
> >>> performance work obvious and 'standardized'
> > 
> > We have a mlx5 vfio driver in rdma-core and I have been thinking it
> > would be a nice basis for building an iommufd tester/benchmarker as it
> > has a wide set of "easily" triggered functionality.
>
> Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
> support for mlx5-vfio was to do direct usage of the firmware interface, but it
> appears to be for regular RDMA apps as well. I do use some RDMA to exercise
> iommu dirty tracking; but it's more like a rudimentary test inside the guest,
> not something self-contained.

I can't remember anymore how much is supported, but supporting more is
not hard work. With a simple QP/CQ you can do all sorts of interesting
DMA.

Yishai would remember if QP/CQ got fully wired up

Jason
Yishai Hadas Oct. 22, 2023, 7:07 a.m. UTC | #13
On 21/10/2023 19:14, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 03:43:57PM +0100, Joao Martins wrote:
>> On 20/10/2023 00:59, Jason Gunthorpe wrote:
>>> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>>>> AMD has no such behaviour, though that driver per your earlier suggestion might
>>>> need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
>>>> don't need to wait for the last 3 series of AMD driver refactoring (?) to be
>>>> done, as that looks to be more SVA-related; unless there's something more
>>>> specific you are looking for prior to introducing AMD's domain_alloc_user().
>>> I don't think we need to wait, it just needs to go on the cleaning list.
>>>
>> I am not sure I followed. This suggests post-merge cleanups, which goes in a
>> different direction from your original comment? But maybe I am just not parsing
>> it right (sorry, just confused)
> Yes post merge for the weirdo alloc flow
>
>>>>> for themselves; so more and more I need to work on something like
>>>>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
>>>>> performance work obvious and 'standardized'
>>> We have a mlx5 vfio driver in rdma-core and I have been thinking it
>>> would be a nice basis for building an iommufd tester/benchmarker as it
>>> has a wide set of "easily" triggered functionality.
>> Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
>> support for mlx5-vfio was to do direct usage of the firmware interface, but it
>> appears to be for regular RDMA apps as well. I do use some RDMA to exercise
>> iommu dirty tracking; but it's more like a rudimentary test inside the guest,
>> not something self-contained.
> I can't remember anymore how much is supported, but supporting more is
> not hard work. With a simple QP/CQ you can do all sorts of interesting
> DMA.
>
> Yishai would remember if QP/CQ got fully wired up

For now, QP/CQ are supported only over the DEVX API
(i.e. mlx5dv_devx_obj_create()) of the mlx5-vfio driver in rdma-core.

In that case, the data path for RDMA applications must be implemented by
the application itself, based on the mlx5 specification.

Yishai

>
> Jason

Patch

diff --git a/drivers/iommu/amd/Kconfig b/drivers/iommu/amd/Kconfig
index 9b5fc3356bf2..8bd4c3b183ec 100644
--- a/drivers/iommu/amd/Kconfig
+++ b/drivers/iommu/amd/Kconfig
@@ -10,6 +10,7 @@  config AMD_IOMMU
 	select IOMMU_API
 	select IOMMU_IOVA
 	select IOMMU_IO_PGTABLE
+	select IOMMUFD_DRIVER if IOMMUFD
 	depends on X86_64 && PCI && ACPI && HAVE_CMPXCHG_DOUBLE
 	help
 	  With this option you can enable support for AMD IOMMU hardware in
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 7dc30c2b56b3..dec4e5c2b66b 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -97,7 +97,9 @@ 
 #define FEATURE_GATS_MASK	(3ULL)
 #define FEATURE_GAM_VAPIC	BIT_ULL(21)
 #define FEATURE_GIOSUP		BIT_ULL(48)
+#define FEATURE_HASUP		BIT_ULL(49)
 #define FEATURE_EPHSUP		BIT_ULL(50)
+#define FEATURE_HDSUP		BIT_ULL(52)
 #define FEATURE_SNP		BIT_ULL(63)
 
 #define FEATURE_PASID_SHIFT	32
@@ -212,6 +214,7 @@ 
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID         0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD           0x07
 #define DEV_ENTRY_PPR           0x34
 #define DEV_ENTRY_IR            0x3d
 #define DEV_ENTRY_IW            0x3e
@@ -370,10 +373,16 @@ 
 #define PTE_LEVEL_PAGE_SIZE(level)			\
 	(1ULL << (12 + (9 * (level))))
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR	BIT_ULL(0)
+#define IOMMU_PTE_HD	BIT_ULL(IOMMU_PTE_HD_BIT)
 #define IOMMU_PTE_U	BIT_ULL(59)
 #define IOMMU_PTE_FC	BIT_ULL(60)
 #define IOMMU_PTE_IR	BIT_ULL(61)
@@ -384,6 +393,7 @@ 
  */
 #define DTE_FLAG_V	BIT_ULL(0)
 #define DTE_FLAG_TV	BIT_ULL(1)
+#define DTE_FLAG_HAD	(3ULL << 7)
 #define DTE_FLAG_GIOV	BIT_ULL(54)
 #define DTE_FLAG_GV	BIT_ULL(55)
 #define DTE_GLX_SHIFT	(56)
@@ -413,6 +423,7 @@ 
 
 #define IOMMU_PAGE_MASK (((1ULL << 52) - 1) & ~0xfffULL)
 #define IOMMU_PTE_PRESENT(pte) ((pte) & IOMMU_PTE_PR)
+#define IOMMU_PTE_DIRTY(pte) ((pte) & IOMMU_PTE_HD)
 #define IOMMU_PTE_PAGE(pte) (iommu_phys_to_virt((pte) & IOMMU_PAGE_MASK))
 #define IOMMU_PTE_MODE(pte) (((pte) >> 9) & 0x07)
 
@@ -563,6 +574,7 @@  struct protection_domain {
 	int nid;		/* Node ID */
 	u64 *gcr3_tbl;		/* Guest CR3 table */
 	unsigned long flags;	/* flags to find out type of domain */
+	bool dirty_tracking;	/* dirty tracking is enabled in the domain */
 	unsigned dev_cnt;	/* devices assigned to this domain */
 	unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
 };
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 2892aa1b4dc1..953f867b4943 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -486,6 +486,74 @@  static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned lo
 	return (__pte & ~offset_mask) | (iova & offset_mask);
 }
 
+static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size,
+				     unsigned long flags)
+{
+	bool test_only = flags & IOMMU_DIRTY_NO_CLEAR;
+	bool dirty = false;
+	int i, count;
+
+	/*
+	 * 2.2.3.2 Host Dirty Support
+	 * When a non-default page size is used, software must OR the
+	 * Dirty bits in all of the replicated host PTEs used to map
+	 * the page. The IOMMU does not guarantee the Dirty bits are
+	 * set in all of the replicated PTEs. Any portion of the page
+	 * may have been written even if the Dirty bit is set in only
+	 * one of the replicated PTEs.
+	 */
+	count = PAGE_SIZE_PTE_COUNT(size);
+	for (i = 0; i < count && test_only; i++) {
+		if (test_bit(IOMMU_PTE_HD_BIT,
+			     (unsigned long *) &ptep[i])) {
+			dirty = true;
+			break;
+		}
+	}
+
+	for (i = 0; i < count && !test_only; i++) {
+		if (test_and_clear_bit(IOMMU_PTE_HD_BIT,
+				       (unsigned long *) &ptep[i])) {
+			dirty = true;
+		}
+	}
+
+	return dirty;
+}
+
+static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 unsigned long flags,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
+	unsigned long end = iova + size - 1;
+
+	do {
+		unsigned long pgsize = 0;
+		u64 *ptep, pte;
+
+		ptep = fetch_pte(pgtable, iova, &pgsize);
+		if (ptep)
+			pte = READ_ONCE(*ptep);
+		if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
+			pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
+			iova += pgsize;
+			continue;
+		}
+
+		/*
+		 * Mark the whole IOVA range as dirty even if only one of
+		 * the replicated PTEs were marked dirty.
+		 */
+		if (pte_test_and_clear_dirty(ptep, pgsize, flags))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 /*
  * ----------------------------------------------------
  */
@@ -527,6 +595,7 @@  static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.map_pages    = iommu_v1_map_pages;
 	pgtable->iop.ops.unmap_pages  = iommu_v1_unmap_pages;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
+	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 292a09b2fbbf..e7e9982fdad6 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -66,6 +66,7 @@  LIST_HEAD(hpet_map);
 LIST_HEAD(acpihid_map);
 
 const struct iommu_ops amd_iommu_ops;
+const struct iommu_dirty_ops amd_dirty_ops;
 
 static ATOMIC_NOTIFIER_HEAD(ppr_notifier);
 int amd_iommu_max_glx_val = -1;
@@ -1611,6 +1612,9 @@  static void set_dte_entry(struct amd_iommu *iommu, u16 devid,
 			pte_root |= 1ULL << DEV_ENTRY_PPR;
 	}
 
+	if (domain->dirty_tracking)
+		pte_root |= DTE_FLAG_HAD;
+
 	if (domain->flags & PD_IOMMUV2_MASK) {
 		u64 gcr3 = iommu_virt_to_phys(domain->gcr3_tbl);
 		u64 glx  = domain->glx;
@@ -2156,10 +2160,16 @@  static inline u64 dma_max_address(void)
 	return ((1ULL << PM_LEVEL_SHIFT(amd_iommu_gpt_level)) - 1);
 }
 
+static bool amd_iommu_hd_support(struct amd_iommu *iommu)
+{
+	return iommu && (iommu->features & FEATURE_HDSUP);
+}
+
 static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 						  struct device *dev,
 						  u32 flags)
 {
+	bool enforce_dirty = flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY;
 	struct protection_domain *domain;
 	struct amd_iommu *iommu = NULL;
 
@@ -2176,6 +2186,9 @@  static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 	if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY))
 		return ERR_PTR(-EINVAL);
 
+	if (enforce_dirty && !amd_iommu_hd_support(iommu))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	domain = protection_domain_alloc(type);
 	if (!domain)
 		return ERR_PTR(-ENOMEM);
@@ -2190,6 +2203,9 @@  static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 			iommu->iommu.ops->pgsize_bitmap;
 		domain->domain.ops =
 			iommu->iommu.ops->default_domain_ops;
+
+		if (enforce_dirty)
+			domain->domain.dirty_ops = &amd_dirty_ops;
 	}
 
 	return &domain->domain;
@@ -2254,6 +2270,13 @@  static int amd_iommu_attach_device(struct iommu_domain *dom,
 
 	dev_data->defer_attach = false;
 
+	/*
+	 * Restrict to devices with compatible IOMMU hardware support
+	 * when enforcement of dirty tracking is enabled.
+	 */
+	if (dom->dirty_ops && !amd_iommu_hd_support(iommu))
+		return -EINVAL;
+
 	if (dev_data->domain)
 		detach_device(dev);
 
@@ -2372,6 +2395,11 @@  static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
 		return true;
 	case IOMMU_CAP_DEFERRED_FLUSH:
 		return true;
+	case IOMMU_CAP_DIRTY: {
+		struct amd_iommu *iommu = rlookup_amd_iommu(dev);
+
+		return amd_iommu_hd_support(iommu);
+	}
 	default:
 		break;
 	}
@@ -2379,6 +2407,69 @@  static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
 	return false;
 }
 
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					bool enable)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct dev_table_entry *dev_table;
+	struct iommu_dev_data *dev_data;
+	struct amd_iommu *iommu;
+	unsigned long flags;
+	u64 pte_root;
+
+	spin_lock_irqsave(&pdomain->lock, flags);
+	if (!(pdomain->dirty_tracking ^ enable)) {
+		spin_unlock_irqrestore(&pdomain->lock, flags);
+		return 0;
+	}
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		iommu = rlookup_amd_iommu(dev_data->dev);
+		if (!iommu)
+			continue;
+
+		dev_table = get_dev_table(iommu);
+		pte_root = dev_table[dev_data->devid].data[0];
+
+		pte_root = (enable ?
+			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
+
+		/* Flush device DTE */
+		dev_table[dev_data->devid].data[0] = pte_root;
+		device_flush_dte(dev_data);
+	}
+
+	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
+	amd_iommu_domain_flush_tlb_pde(pdomain);
+	amd_iommu_domain_flush_complete(pdomain);
+	pdomain->dirty_tracking = enable;
+	spin_unlock_irqrestore(&pdomain->lock, flags);
+
+	return 0;
+}
+
+static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  unsigned long flags,
+					  struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
+	unsigned long lflags;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	spin_lock_irqsave(&pdomain->lock, lflags);
+	if (!pdomain->dirty_tracking && dirty->bitmap) {
+		spin_unlock_irqrestore(&pdomain->lock, lflags);
+		return -EINVAL;
+	}
+	spin_unlock_irqrestore(&pdomain->lock, lflags);
+
+	return ops->read_and_clear_dirty(ops, iova, size, flags, dirty);
+}
+
 static void amd_iommu_get_resv_regions(struct device *dev,
 				       struct list_head *head)
 {
@@ -2501,6 +2592,11 @@  static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 	return true;
 }
 
+const struct iommu_dirty_ops amd_dirty_ops = {
+	.set_dirty_tracking = amd_iommu_set_dirty_tracking,
+	.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
+};
+
 const struct iommu_ops amd_iommu_ops = {
 	.capable = amd_iommu_capable,
 	.domain_alloc = amd_iommu_domain_alloc,