Message ID | 20230923012511.10379-8-joao.m.martins@oracle.com (mailing list archive)
---|---
State | New, archived
Series | IOMMUFD Dirty Tracking
On 23/09/2023 02:24, Joao Martins wrote:
> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> +				   struct iommu_domain *domain,
> +				   unsigned long flags,
> +				   struct iommufd_dirty_data *bitmap)
> +{
> +	unsigned long last_iova, iova = bitmap->iova;
> +	unsigned long length = bitmap->length;
> +	int ret = -EOPNOTSUPP;
> +
> +	if ((iova & (iopt->iova_alignment - 1)))
> +		return -EINVAL;
> +
> +	if (check_add_overflow(iova, length - 1, &last_iova))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
> +	up_read(&iopt->iova_rwsem);
> +	return ret;
> +}

I need to call out a mistake I made, noticed while submitting: I should be
walking over the iopt_areas here (or in iommu_read_and_clear_dirty()) to
check area::pages. So this is a comment I have to fix for the next version.
I did that for clear_dirty but not for this function, hence it is half-baked.
There are a lot of other changes nonetheless, so I didn't want to break what
I had there. Apologies for the distraction.

	Joao
On 23/09/2023 02:40, Joao Martins wrote:
> On 23/09/2023 02:24, Joao Martins wrote:
>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>> +				   struct iommu_domain *domain,
>> +				   unsigned long flags,
>> +				   struct iommufd_dirty_data *bitmap)
>> +{
>> +	unsigned long last_iova, iova = bitmap->iova;
>> +	unsigned long length = bitmap->length;
>> +	int ret = -EOPNOTSUPP;
>> +
>> +	if ((iova & (iopt->iova_alignment - 1)))
>> +		return -EINVAL;
>> +
>> +	if (check_add_overflow(iova, length - 1, &last_iova))
>> +		return -EOVERFLOW;
>> +
>> +	down_read(&iopt->iova_rwsem);
>> +	ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
>> +	up_read(&iopt->iova_rwsem);
>> +	return ret;
>> +}
>
> I need to call out a mistake I made, noticed while submitting: I should be
> walking over the iopt_areas here (or in iommu_read_and_clear_dirty()) to
> check area::pages. So this is a comment I have to fix for the next version.

Below is how I fixed it.

Essentially the thinking is that the user passes either a mapped IOVA area
*or* a subset of a mapped IOVA area. This should also allow the possibility
of having multiple threads read dirty bits from a huge IOVA area split into
different chunks (in case it gets split down to the lowest level).

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 5a35885aef04..991c57458725 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -473,7 +473,9 @@ int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
 {
 	unsigned long last_iova, iova = bitmap->iova;
 	unsigned long length = bitmap->length;
-	int ret = -EOPNOTSUPP;
+	struct iopt_area *area;
+	bool found = false;
+	int ret = -EINVAL;
 
 	if ((iova & (iopt->iova_alignment - 1)))
 		return -EINVAL;
@@ -482,7 +484,22 @@ int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
 		return -EOVERFLOW;
 
 	down_read(&iopt->iova_rwsem);
-	ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
+
+	/* Find the portion of IOVA space belonging to area */
+	while ((area = iopt_area_iter_first(iopt, iova, last_iova))) {
+		unsigned long area_last = iopt_area_last_iova(area);
+		unsigned long area_first = iopt_area_iova(area);
+
+		if (!area->pages)
+			continue;
+
+		found = (iova >= area_first && last_iova <= area_last);
+		if (found)
+			break;
+	}
+
+	if (found)
+		ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
 	up_read(&iopt->iova_rwsem);
 	return ret;
On Tue, Oct 17, 2023 at 01:06:12PM +0100, Joao Martins wrote:
> On 23/09/2023 02:40, Joao Martins wrote:
> > On 23/09/2023 02:24, Joao Martins wrote:
> >> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> >> [...]
> >> +	down_read(&iopt->iova_rwsem);
> >> +	ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
> >> +	up_read(&iopt->iova_rwsem);
> >> +	return ret;
> >> +}
> >
> > I need to call out a mistake I made, noticed while submitting: I should
> > be walking over the iopt_areas here (or in iommu_read_and_clear_dirty())
> > to check area::pages. So this is a comment I have to fix for the next
> > version.
>
> Below is how I fixed it.
>
> Essentially the thinking is that the user passes either a mapped IOVA area
> *or* a subset of a mapped IOVA area. This should also allow the possibility
> of having multiple threads read dirty bits from a huge IOVA area split into
> different chunks (in case it gets split down to the lowest level).

What happens if iommu_read_and_clear_dirty() is done on unmapped PTEs? Does
it fail?

Jason
On 17/10/2023 16:29, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 01:06:12PM +0100, Joao Martins wrote:
>> On 23/09/2023 02:40, Joao Martins wrote:
>>> On 23/09/2023 02:24, Joao Martins wrote:
>>>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>>>> [...]
>>>> +}
>>>
>>> I need to call out a mistake I made, noticed while submitting: I should
>>> be walking over the iopt_areas here (or in iommu_read_and_clear_dirty())
>>> to check area::pages. So this is a comment I have to fix for the next
>>> version.
>>
>> Below is how I fixed it.
>>
>> Essentially the thinking is that the user passes either a mapped IOVA area
>> *or* a subset of a mapped IOVA area. This should also allow the
>> possibility of having multiple threads read dirty bits from a huge IOVA
>> area split into different chunks (in case it gets split down to the
>> lowest level).
>
> What happens if iommu_read_and_clear_dirty() is done on unmapped PTEs? Does
> it fail?

If there's no IOPTE, or the IOPTE is non-present, it keeps walking to the
next base page (or level-0 IOVA range). That holds for both drivers in this
series.

	Joao
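[Editor's note: for illustration, a minimal sketch of that walking behaviour,
assuming a simplified single-level table. pt_lookup(), pte_present() and
pte_test_and_clear_dirty() are hypothetical stand-ins for the per-driver
accessors; iommu_dirty_bitmap_record() is the recording helper added
elsewhere in this series.]

	static void sketch_read_and_clear_dirty(struct pt_table *pt,
						unsigned long iova,
						unsigned long last_iova,
						struct iommu_dirty_bitmap *dirty)
	{
		while (iova <= last_iova) {
			/* Hypothetical lookup: NULL when there is no IOPTE */
			u64 *ptep = pt_lookup(pt, iova);

			/*
			 * No IOPTE, or IOPTE non-present: not an error, just
			 * walk on to the next base page.
			 */
			if (ptep && pte_present(*ptep) &&
			    pte_test_and_clear_dirty(ptep))
				iommu_dirty_bitmap_record(dirty, iova, PAGE_SIZE);

			iova += PAGE_SIZE;
		}
	}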
On Tue, Oct 17, 2023 at 04:51:25PM +0100, Joao Martins wrote:
> On 17/10/2023 16:29, Jason Gunthorpe wrote:
> > On Tue, Oct 17, 2023 at 01:06:12PM +0100, Joao Martins wrote:
> >> Essentially the thinking is that the user passes either a mapped IOVA
> >> area *or* a subset of a mapped IOVA area. This should also allow the
> >> possibility of having multiple threads read dirty bits from a huge IOVA
> >> area split into different chunks (in case it gets split down to the
> >> lowest level).
> >
> > What happens if iommu_read_and_clear_dirty() is done on unmapped PTEs?
> > Does it fail?
>
> If there's no IOPTE, or the IOPTE is non-present, it keeps walking to the
> next base page (or level-0 IOVA range). That holds for both drivers in this
> series.

Hum, so this check doesn't seem quite right then, as it is really input
validation that the IOVA is within the tree. It should be able to span
contiguous areas.

Write it with the intersection logic:

for (area = iopt_area_iter_first(iopt, iova, iova_last); area;
     area = iopt_area_iter_next(area, iova, iova_last)) {
	if (!area->pages)
		// fail

	if (cur_iova < area_first)
		// fail

	if (last_iova <= area_last)
		// success, do iommu_read_and_clear_dirty()

	cur_iova = area_last + 1;
}
// else fail if not success

Jason
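[Editor's note: for clarity, a filled-in version of the intersection logic
sketched above, assuming the existing iopt_area_iova()/iopt_area_last_iova()
helpers and a cur_iova cursor starting at the requested iova. This is an
illustration only; the thread below ends up replacing it with
iopt_for_each_contig_area().]

	struct iopt_area *area;
	unsigned long cur_iova = iova;
	int ret = -EINVAL;

	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
	     area = iopt_area_iter_next(area, iova, last_iova)) {
		unsigned long area_first = iopt_area_iova(area);
		unsigned long area_last = iopt_area_last_iova(area);

		/* A hole with no backing pages, or a gap before this area: fail */
		if (!area->pages || cur_iova < area_first)
			break;

		/* The request ends inside this area: the whole span is mapped */
		if (last_iova <= area_last) {
			ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
			break;
		}
		cur_iova = area_last + 1;
	}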
On 17/10/2023 17:01, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 04:51:25PM +0100, Joao Martins wrote:
>> If there's no IOPTE, or the IOPTE is non-present, it keeps walking to the
>> next base page (or level-0 IOVA range). That holds for both drivers in
>> this series.
>
> Hum, so this check doesn't seem quite right then, as it is really input
> validation that the IOVA is within the tree. It should be able to span
> contiguous areas.
>
> Write it with the intersection logic:
>
> for (area = iopt_area_iter_first(iopt, iova, iova_last); area;
>      area = iopt_area_iter_next(area, iova, iova_last)) {
> 	if (!area->pages)
> 		// fail
>
> 	if (cur_iova < area_first)
> 		// fail
>
> 	if (last_iova <= area_last)
> 		// success, do iommu_read_and_clear_dirty()
>
> 	cur_iova = area_last + 1;
> }
> // else fail if not success

Perhaps that could be rewritten as e.g.:

	ret = -EINVAL;
	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
		// do iommu_read_and_clear_dirty();
	}
	// else fail

Though OTOH, the places you wrote as failures are simply skipped instead.
On Tue, Oct 17, 2023 at 05:51:49PM +0100, Joao Martins wrote:
> Perhaps that could be rewritten as e.g.:
>
> 	ret = -EINVAL;
> 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
> 		// do iommu_read_and_clear_dirty();
> 	}
> 	// else fail
>
> Though OTOH, the places you wrote as failures are simply skipped instead.

Yeah, if consolidating the areas isn't important (it probably isn't) then
this is the better API.

It does check all the same things: iopt_area_contig_done() will fail.

Jason
On 17/10/2023 18:13, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 05:51:49PM +0100, Joao Martins wrote:
>> Perhaps that could be rewritten as e.g.:
>>
>> 	ret = -EINVAL;
>> 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
>> 		// do iommu_read_and_clear_dirty();
>> 	}
>> 	// else fail
>>
>> Though OTOH, the places you wrote as failures are simply skipped instead.
>
> Yeah, if consolidating the areas isn't important (it probably isn't) then
> this is the better API.

Doing it in a single iommu_read_and_clear_dirty() saves me from manipulating
the bitmap address in an atypical way. The first bit of each u8 corresponds
to the IOVA the bitmap was initialized with, so if I call it multiple times
over a single IOVA range (once per individual area, as I think you suggested)
then I need to align the IOVA/length down to the minimum granularity of the
bitmap, which is a u8 (32k).

	Joao
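[Editor's note: for concreteness, a worked version of that granularity
arithmetic, assuming 4KiB IOMMU pages. The macro names are illustrative
only, not from the patch.]

	#include <linux/bits.h>
	#include <linux/sizes.h>

	/*
	 * One bitmap bit tracks one IOMMU page, so the smallest unit the
	 * user bitmap can be addressed in, a single u8, covers 8 pages:
	 *
	 *	8 bits/byte * 4KiB/bit = 32KiB of IOVA per bitmap byte
	 *
	 * Two areas meeting inside the same 32KiB window would record their
	 * dirty state in the same bitmap byte, which is what forces per-area
	 * calls to be aligned down to that granularity.
	 */
	#define EXAMPLE_PAGE_SIZE	SZ_4K
	#define BITMAP_BYTE_SPAN	(BITS_PER_BYTE * EXAMPLE_PAGE_SIZE) /* 32KiB */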
On 17/10/2023 18:30, Joao Martins wrote:
> On 17/10/2023 18:13, Jason Gunthorpe wrote:
>> On Tue, Oct 17, 2023 at 05:51:49PM +0100, Joao Martins wrote:
>>> Perhaps that could be rewritten as e.g.:
>>> [...]
>>
>> Yeah, if consolidating the areas isn't important (it probably isn't) then
>> this is the better API.
>
> Doing it in a single iommu_read_and_clear_dirty() saves me from
> manipulating the bitmap address in an atypical way. The first bit of each
> u8 corresponds to the IOVA the bitmap was initialized with, so if I call it
> multiple times over a single IOVA range (once per individual area, as I
> think you suggested) then I need to align the IOVA/length down to the
> minimum granularity of the bitmap, which is a u8 (32k).

Or I can do the reverse: keep iterating the bitmap as it does right now, and
iterate over the individual iopt areas from within the iova-bitmap iterator,
which avoids this. That should be a lot cleaner:

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 991c57458725..179afc6b74f2 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -415,6 +415,7 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 
 struct iova_bitmap_fn_arg {
 	unsigned long flags;
+	struct io_pagetable *iopt;
 	struct iommu_domain *domain;
 	struct iommu_dirty_bitmap *dirty;
 };
@@ -423,16 +424,34 @@ static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
 					unsigned long iova, size_t length,
 					void *opaque)
 {
+	struct iopt_area *area;
+	struct iopt_area_contig_iter iter;
 	struct iova_bitmap_fn_arg *arg = opaque;
 	struct iommu_domain *domain = arg->domain;
 	struct iommu_dirty_bitmap *dirty = arg->dirty;
 	const struct iommu_dirty_ops *ops = domain->dirty_ops;
 	unsigned long flags = arg->flags;
+	unsigned long last_iova = iova + length - 1;
+	int ret = -EINVAL;
 
-	return ops->read_and_clear_dirty(domain, iova, length, flags, dirty);
+	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+
+		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
+						last - iter.cur_iova + 1,
+						flags, dirty);
+		if (ret)
+			break;
+	}
+
+	if (!iopt_area_contig_done(&iter))
+		ret = -EINVAL;
+
+	return ret;
 }
 
 static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+				      struct io_pagetable *iopt,
 				      unsigned long flags,
 				      struct iommu_hwpt_get_dirty_iova *bitmap)
 {
@@ -453,6 +472,7 @@ static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
 
 	iommu_dirty_bitmap_init(&dirty, iter, &gather);
 
+	arg.iopt = iopt;
 	arg.flags = flags;
 	arg.domain = domain;
 	arg.dirty = &dirty;
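[Editor's note: summarizing the resulting call flow of this version; all
names are taken from the hunks above.]

	/*
	 * iopt_read_and_clear_dirty_data()
	 *   iommu_read_and_clear_dirty()
	 *     iova_bitmap_for_each(iter, &arg, __iommu_read_and_clear_dirty)
	 *       __iommu_read_and_clear_dirty()     <- per pinned bitmap window
	 *         iopt_for_each_contig_area()      <- per mapped area chunk
	 *           ops->read_and_clear_dirty()    <- per-driver IOPTE walk
	 */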
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 3a598182b761..d70617447392 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -15,6 +15,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
+#include <uapi/linux/iommufd.h>
 
 #include "io_pagetable.h"
 #include "double_span.h"
@@ -412,6 +413,75 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 	return 0;
 }
 
+struct iova_bitmap_fn_arg {
+	struct iommu_domain *domain;
+	struct iommu_dirty_bitmap *dirty;
+};
+
+static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
+					unsigned long iova, size_t length,
+					void *opaque)
+{
+	struct iova_bitmap_fn_arg *arg = opaque;
+	struct iommu_domain *domain = arg->domain;
+	struct iommu_dirty_bitmap *dirty = arg->dirty;
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+
+	return ops->read_and_clear_dirty(domain, iova, length, 0, dirty);
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+				      unsigned long flags,
+				      struct iommufd_dirty_data *bitmap)
+{
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+	struct iommu_iotlb_gather gather;
+	struct iommu_dirty_bitmap dirty;
+	struct iova_bitmap_fn_arg arg;
+	struct iova_bitmap *iter;
+	int ret = 0;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	iter = iova_bitmap_alloc(bitmap->iova, bitmap->length,
+				 bitmap->page_size, bitmap->data);
+	if (IS_ERR(iter))
+		return -ENOMEM;
+
+	iommu_dirty_bitmap_init(&dirty, iter, &gather);
+
+	arg.domain = domain;
+	arg.dirty = &dirty;
+	iova_bitmap_for_each(iter, &arg, __iommu_read_and_clear_dirty);
+
+	iommu_iotlb_sync(domain, &gather);
+	iova_bitmap_free(iter);
+
+	return ret;
+}
+
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   unsigned long flags,
+				   struct iommufd_dirty_data *bitmap)
+{
+	unsigned long last_iova, iova = bitmap->iova;
+	unsigned long length = bitmap->length;
+	int ret = -EOPNOTSUPP;
+
+	if ((iova & (iopt->iova_alignment - 1)))
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	ret = iommu_read_and_clear_dirty(domain, flags, bitmap);
+	up_read(&iopt->iova_rwsem);
+	return ret;
+}
+
 int iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 		   unsigned long length, struct list_head *pages_list)
 {
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3064997a0181..84ec1df29074 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,8 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <linux/iommu.h>
+#include <linux/iova_bitmap.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -70,6 +72,18 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+struct iommufd_dirty_data {
+	unsigned long iova;
+	unsigned long length;
+	unsigned long page_size;
+	unsigned long long *data;
+};
+
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   unsigned long flags,
+				   struct iommufd_dirty_data *bitmap);
+
 void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
 				 unsigned long length);
 int iopt_table_add_domain(struct io_pagetable *iopt,
Add an IO pagetable API, iopt_read_and_clear_dirty_data(), that reads the
dirty IOPTEs for a given IOVA range and copies them back to a userspace
bitmap. Underneath it uses the IOMMU domain kernel API, which reads the
dirty bits as well as atomically clearing the IOPTE dirty bit and flushing
the IOTLB at the end. The IOVA bitmap usage takes care of iterating over
the bitmap's user pages efficiently and without copies.

This is in preparation for a set-dirty-tracking ioctl, which will clear the
dirty bits before enabling dirty tracking.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 70 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h | 14 +++++
 2 files changed, 84 insertions(+)
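[Editor's note: as a usage illustration, a hedged sketch of how an ioctl
handler might call the new API. This is hypothetical code, not part of the
patch; the cmd field layout is assumed to mirror struct iommufd_dirty_data,
and the hw_pagetable object is assumed to hold both the domain and the
io_pagetable.]

	static int example_get_dirty_iova(struct iommufd_hw_pagetable *hwpt,
					  struct iommu_hwpt_get_dirty_iova *cmd)
	{
		/* Wrap the userspace-supplied bitmap description */
		struct iommufd_dirty_data bitmap = {
			.iova = cmd->iova,
			.length = cmd->length,
			.page_size = cmd->page_size,
			.data = u64_to_user_ptr(cmd->data),
		};

		return iopt_read_and_clear_dirty_data(&hwpt->ioas->iopt,
						      hwpt->domain, cmd->flags,
						      &bitmap);
	}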