Message ID | 20220616235212.15185-7-nicolinc@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | Update vfio_pin/unpin_pages API
There is a bunch of code and comments in the iommu type1 code that
suggest we can pin memory that is not page backed.

>  int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> +		    int npage, int prot, struct page **phys_page)

I don't think phys_page makes much sense as an argument name.
I'd just call this pages.

> +			phys_page[i] = pfn_to_page(vpfn->pfn);

Please store the actual page pointer in the vfio_pfn structure.

>  		remote_vaddr = dma->vaddr + (iova - dma->iova);
> -		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
> +		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn,
>  					     do_accounting);

Please just return the actual page from vaddr_get_pfns through
vfio_pin_pages_remote and vfio_pin_page_external, maybe even as a prep
patch as that is a fair amount of churn.
On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:

> There is a bunch of code and comments in the iommu type1 code that
> suggest we can pin memory that is not page backed.

Would you mind explaining the use case for pinning memory that
isn't page backed? And do we have such a use case so far?

> >  int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> > +		    int npage, int prot, struct page **phys_page)
>
> I don't think phys_page makes much sense as an argument name.
> I'd just call this pages.

OK.

> > +			phys_page[i] = pfn_to_page(vpfn->pfn);
>
> Please store the actual page pointer in the vfio_pfn structure.

OK.

> >  		remote_vaddr = dma->vaddr + (iova - dma->iova);
> > -		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
> > +		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn,
> >  					     do_accounting);
>
> Please just return the actual page from vaddr_get_pfns through
> vfio_pin_pages_remote and vfio_pin_page_external, maybe even as a prep
> patch as that is a fair amount of churn.

I can do that. I tried once, but there were just too many changes
inside the type1 code that felt like a chain reaction. If we plan to
eventually replace it with an IOMMUFD implementation, I thought these
changes in type1 might not be necessary.
On Fri, Jun 17, 2022 at 03:06:25PM -0700, Nicolin Chen wrote:

> On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:
> > There is a bunch of code and comments in the iommu type1 code that
> > suggest we can pin memory that is not page backed.
>
> Would you mind explaining the use case for pinning memory that
> isn't page backed? And do we have such a use case so far?

Sorry, I should have deleted that sentence. I wrote it before spending
some more time digging through the code, and all the locked memory does
have page backing. There just seem to be a lot of leftover checks for
whether a pfn is page backed, mostly due to the pfn-based calling
conventions.

> I can do that. I tried once, but there were just too many changes
> inside the type1 code that felt like a chain reaction. If we plan to
> eventually replace it with an IOMMUFD implementation, I thought these
> changes in type1 might not be necessary.

To make sure we keep full compatibility, I suspect the final iommufd
implementation has to be gradually created from the existing code
anyway.
On Sat, Jun 18, 2022 at 11:18:17PM -0700, Christoph Hellwig wrote:

> > > There is a bunch of code and comments in the iommu type1 code that
> > > suggest we can pin memory that is not page backed.
> >
> > Would you mind explaining the use case for pinning memory that
> > isn't page backed? And do we have such a use case so far?
>
> Sorry, I should have deleted that sentence. I wrote it before spending
> some more time digging through the code, and all the locked memory does
> have page backing. There just seem to be a lot of leftover checks for
> whether a pfn is page backed, mostly due to the pfn-based calling
> conventions.

OK. We'd be safe to move on then. Thanks for the clarification.

> > I can do that. I tried once, but there were just too many changes
> > inside the type1 code that felt like a chain reaction. If we plan to
> > eventually replace it with an IOMMUFD implementation, I thought these
> > changes in type1 might not be necessary.
>
> To make sure we keep full compatibility, I suspect the final iommufd
> implementation has to be gradually created from the existing code
> anyway.

Hmm. I think Jason can give some insight. Meanwhile, I will try to add
a patch to the type1 code, in case we'd end up with what you suspected.
On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:

> There is a bunch of code and comments in the iommu type1 code that
> suggest we can pin memory that is not page backed.

AFAIK you can. The whole follow_pte() mechanism allows raw PFNs to be
loaded into the type1 maps, and the pin API will happily return them.
This happens in almost every qemu scenario because PCI MMIO BAR memory
ends up routed down this path.

Thanks,
Jason
On Mon, Jun 20, 2022 at 12:00:46AM -0300, Jason Gunthorpe wrote:

> On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:
> > There is a bunch of code and comments in the iommu type1 code that
> > suggest we can pin memory that is not page backed.
>
> AFAIK you can. The whole follow_pte() mechanism allows raw PFNs to be
> loaded into the type1 maps, and the pin API will happily return them.
> This happens in almost every qemu scenario because PCI MMIO BAR memory
> ends up routed down this path.

Indeed, my read wasn't deep enough. Which means that we can't change
the ->pin_pages interface to return a struct page array, as we don't
have one for those.
On Sun, Jun 19, 2022 at 10:51:47PM -0700, Christoph Hellwig wrote:

> On Mon, Jun 20, 2022 at 12:00:46AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:
> > > There is a bunch of code and comments in the iommu type1 code that
> > > suggest we can pin memory that is not page backed.
> >
> > AFAIK you can. The whole follow_pte() mechanism allows raw PFNs to be
> > loaded into the type1 maps, and the pin API will happily return them.
> > This happens in almost every qemu scenario because PCI MMIO BAR memory
> > ends up routed down this path.
>
> Indeed, my read wasn't deep enough. Which means that we can't change
> the ->pin_pages interface to return a struct page array, as we don't
> have one for those.

Actually, gvt requires a struct page, and both s390 drivers seem to
require normal non-I/O, non-remapped kernel pointers. So I think for
vfio_pin_pages we can assume that we only want page-backed memory and
remove the follow_fault_pfn case entirely. But we'll probably have to
keep it for the vfio_iommu_replay case, which is not tied to emulated
IOMMU drivers.
On Sun, Jun 19, 2022 at 11:37:47PM -0700, Christoph Hellwig wrote:

> On Sun, Jun 19, 2022 at 10:51:47PM -0700, Christoph Hellwig wrote:
> > On Mon, Jun 20, 2022 at 12:00:46AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:
> > > > There is a bunch of code and comments in the iommu type1 code that
> > > > suggest we can pin memory that is not page backed.
> > >
> > > AFAIK you can. The whole follow_pte() mechanism allows raw PFNs to be
> > > loaded into the type1 maps, and the pin API will happily return them.
> > > This happens in almost every qemu scenario because PCI MMIO BAR memory
> > > ends up routed down this path.
> >
> > Indeed, my read wasn't deep enough. Which means that we can't change
> > the ->pin_pages interface to return a struct page array, as we don't
> > have one for those.
>
> Actually, gvt requires a struct page, and both s390 drivers seem to
> require normal non-I/O, non-remapped kernel pointers. So I think for
> vfio_pin_pages we can assume that we only want page-backed memory and
> remove the follow_fault_pfn case entirely. But we'll probably have to
> keep it for the vfio_iommu_replay case, which is not tied to emulated
> IOMMU drivers.

Right, that is my thinking - since all drivers actually need a struct
page, we should have the API return a working struct page and have the
VFIO layer isolate the non-struct-page stuff so it never leaks out of
this API.

This nicely fixes the various problems in all the drivers if I/O memory
comes down this path.

It is also why doing too much surgery deeper into type1 probably isn't
too worthwhile, as it still needs raw pfns in its data structures for
iommu modes.

Thanks,
Jason
On Mon, Jun 20, 2022 at 12:36:28PM -0300, Jason Gunthorpe wrote:

> On Sun, Jun 19, 2022 at 11:37:47PM -0700, Christoph Hellwig wrote:
> > On Sun, Jun 19, 2022 at 10:51:47PM -0700, Christoph Hellwig wrote:
> > > On Mon, Jun 20, 2022 at 12:00:46AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Jun 17, 2022 at 01:54:05AM -0700, Christoph Hellwig wrote:
> > > > > There is a bunch of code and comments in the iommu type1 code that
> > > > > suggest we can pin memory that is not page backed.
> > > >
> > > > AFAIK you can. The whole follow_pte() mechanism allows raw PFNs to be
> > > > loaded into the type1 maps, and the pin API will happily return them.
> > > > This happens in almost every qemu scenario because PCI MMIO BAR memory
> > > > ends up routed down this path.
> > >
> > > Indeed, my read wasn't deep enough. Which means that we can't change
> > > the ->pin_pages interface to return a struct page array, as we don't
> > > have one for those.
> >
> > Actually, gvt requires a struct page, and both s390 drivers seem to
> > require normal non-I/O, non-remapped kernel pointers. So I think for
> > vfio_pin_pages we can assume that we only want page-backed memory and
> > remove the follow_fault_pfn case entirely. But we'll probably have to
> > keep it for the vfio_iommu_replay case, which is not tied to emulated
> > IOMMU drivers.
>
> Right, that is my thinking - since all drivers actually need a struct
> page, we should have the API return a working struct page and have the
> VFIO layer isolate the non-struct-page stuff so it never leaks out of
> this API.
>
> This nicely fixes the various problems in all the drivers if I/O memory
> comes down this path.
>
> It is also why doing too much surgery deeper into type1 probably isn't
> too worthwhile, as it still needs raw pfns in its data structures for
> iommu modes.

Christoph, do you agree with Jason's remark on not doing too much
surgery into the type1 code? Or do you still want this series to change
type1, e.g. by removing follow_fault_pfn() as you mentioned above?
diff --git a/Documentation/driver-api/vfio-mediated-device.rst b/Documentation/driver-api/vfio-mediated-device.rst
index d28f8bcbfbc6..070e51bb0bb6 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -263,7 +263,7 @@ The following APIs are provided for translating user pfn to host pfn in a
 VFIO driver::
 
 	int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
-			   int npage, int prot, unsigned long *phys_pfn);
+			   int npage, int prot, struct page **phys_page);
 
 	int vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova,
 			     int npage);
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index c9bdc3901f1e..669432999676 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -243,7 +243,7 @@ static void gvt_unpin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn,
 static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn,
 		unsigned long size, struct page **page)
 {
-	unsigned long base_pfn = 0;
+	struct page *base_page = NULL;
 	int total_pages;
 	int npage;
 	int ret;
@@ -255,26 +255,19 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn,
 	 */
 	for (npage = 0; npage < total_pages; npage++) {
 		unsigned long cur_iova = (gfn + npage) << PAGE_SHIFT;
-		unsigned long pfn;
+		struct page *cur_page;
 
 		ret = vfio_pin_pages(&vgpu->vfio_device, cur_iova, 1,
-				     IOMMU_READ | IOMMU_WRITE, &pfn);
+				     IOMMU_READ | IOMMU_WRITE, &cur_page);
 		if (ret != 1) {
 			gvt_vgpu_err("vfio_pin_pages failed for iova 0x%lx, ret %d\n",
 				     cur_iova, ret);
 			goto err;
 		}
 
-		if (!pfn_valid(pfn)) {
-			gvt_vgpu_err("pfn 0x%lx is not mem backed\n", pfn);
-			npage++;
-			ret = -EFAULT;
-			goto err;
-		}
-
 		if (npage == 0)
-			base_pfn = pfn;
-		else if (base_pfn + npage != pfn) {
+			base_page = cur_page;
+		else if (base_page + npage != cur_page) {
 			gvt_vgpu_err("The pages are not continuous\n");
 			ret = -EINVAL;
 			npage++;
@@ -282,7 +275,7 @@ static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn,
 		}
 	}
 
-	*page = pfn_to_page(base_pfn);
+	*page = base_page;
 	return 0;
 err:
 	gvt_unpin_guest_page(vgpu, gfn, npage * PAGE_SIZE);
diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index 12cbe66721af..92be288dff74 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -24,8 +24,8 @@ struct pfn_array {
 	unsigned long		pa_iova;
 	/* Array that stores PFNs of the pages need to pin. */
 	unsigned long		*pa_iova_pfn;
-	/* Array that receives PFNs of the pages pinned. */
-	unsigned long		*pa_pfn;
+	/* Array that receives the pinned pages. */
+	struct page		**pa_page;
 	/* Number of pages pinned from @pa_iova. */
 	int			pa_nr;
 };
@@ -73,19 +73,19 @@ static int pfn_array_alloc(struct pfn_array *pa, u64 iova, unsigned int len)
 
 	pa->pa_iova_pfn = kcalloc(pa->pa_nr,
 				  sizeof(*pa->pa_iova_pfn) +
-				  sizeof(*pa->pa_pfn),
+				  sizeof(*pa->pa_page),
 				  GFP_KERNEL);
 	if (unlikely(!pa->pa_iova_pfn)) {
 		pa->pa_nr = 0;
 		return -ENOMEM;
 	}
-	pa->pa_pfn = pa->pa_iova_pfn + pa->pa_nr;
+	pa->pa_page = (struct page **)pa->pa_iova_pfn + pa->pa_nr;
 
 	pa->pa_iova_pfn[0] = pa->pa_iova >> PAGE_SHIFT;
-	pa->pa_pfn[0] = -1ULL;
+	pa->pa_page[0] = NULL;
 	for (i = 1; i < pa->pa_nr; i++) {
 		pa->pa_iova_pfn[i] = pa->pa_iova_pfn[i - 1] + 1;
-		pa->pa_pfn[i] = -1ULL;
+		pa->pa_page[i] = NULL;
 	}
 
 	return 0;
@@ -147,7 +147,7 @@ static int pfn_array_pin(struct pfn_array *pa, struct vfio_device *vdev)
 
 		ret = vfio_pin_pages(vdev, *first << PAGE_SHIFT, npage,
 				     IOMMU_READ | IOMMU_WRITE,
-				     &pa->pa_pfn[pinned]);
+				     &pa->pa_page[pinned]);
 		if (ret < 0) {
 			goto err_out;
 		} else if (ret > 0 && ret != npage) {
@@ -200,7 +200,7 @@ static inline void pfn_array_idal_create_words(
 	 */
 
 	for (i = 0; i < pa->pa_nr; i++)
-		idaws[i] = pa->pa_pfn[i] << PAGE_SHIFT;
+		idaws[i] = page_to_phys(pa->pa_page[i]);
 
 	/* Adjust the first IDAW, since it may not start on a page boundary */
 	idaws[0] += pa->pa_iova & (PAGE_SIZE - 1);
@@ -251,8 +251,7 @@ static long copy_from_iova(struct vfio_device *vdev, void *to, u64 iova,
 		l = n;
 
 	for (i = 0; i < pa.pa_nr; i++) {
-		struct page *page = pfn_to_page(pa.pa_pfn[i]);
-		void *from = kmap_local_page(page);
+		void *from = kmap_local_page(pa.pa_page[i]);
 
 		m = PAGE_SIZE;
 		if (i == 0) {
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 8a2018ab3cf0..e73bdb57bc90 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -243,9 +243,10 @@ static struct ap_queue_status vfio_ap_irq_enable(struct vfio_ap_queue *q,
 	struct ap_qirq_ctrl aqic_gisa = {};
 	struct ap_queue_status status = {};
 	struct kvm_s390_gisa *gisa;
+	struct page *h_page;
 	int nisc;
 	struct kvm *kvm;
-	unsigned long g_pfn, h_pfn;
+	unsigned long g_pfn;
 	phys_addr_t h_nib;
 	int ret;
 
@@ -259,7 +260,7 @@ static struct ap_queue_status vfio_ap_irq_enable(struct vfio_ap_queue *q,
 	}
 
 	ret = vfio_pin_pages(&q->matrix_mdev->vdev, g_pfn << PAGE_SHIFT, 1,
-			     IOMMU_READ | IOMMU_WRITE, &h_pfn);
+			     IOMMU_READ | IOMMU_WRITE, &h_page);
 	switch (ret) {
 	case 1:
 		break;
@@ -275,7 +276,7 @@ static struct ap_queue_status vfio_ap_irq_enable(struct vfio_ap_queue *q,
 	kvm = q->matrix_mdev->kvm;
 	gisa = kvm->arch.gisa_int.origin;
 
-	h_nib = (h_pfn << PAGE_SHIFT) | (nib & ~PAGE_MASK);
+	h_nib = page_to_phys(h_page) | (nib & ~PAGE_MASK);
 	aqic_gisa.gisc = isc;
 
 	nisc = kvm_s390_gisc_register(kvm, isc);
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index e8dbb0122e20..7eee8048e231 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1917,18 +1917,18 @@ EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
 * @npage [in]   : count of pages to be pinned.  This count should not
 *		   be greater VFIO_PIN_PAGES_MAX_ENTRIES.
 * @prot [in]    : protection flags
- * @phys_pfn[out]: array of host PFNs
+ * @phys_page[out]: array of host pages
 * Return error or number of pages pinned.
 */
 int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
-		   int npage, int prot, unsigned long *phys_pfn)
+		   int npage, int prot, struct page **phys_page)
 {
 	struct vfio_container *container;
 	struct vfio_group *group = device->group;
 	struct vfio_iommu_driver *driver;
 	int ret;
 
-	if (!phys_pfn || !npage || !vfio_assert_device_open(device))
+	if (!phys_page || !npage || !vfio_assert_device_open(device))
 		return -EINVAL;
 
 	if (npage > VFIO_PIN_PAGES_MAX_ENTRIES)
@@ -1943,7 +1943,7 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 	if (likely(driver && driver->ops->pin_pages))
 		ret = driver->ops->pin_pages(container->iommu_data,
 					     group->iommu_group, iova,
-					     npage, prot, phys_pfn);
+					     npage, prot, phys_page);
 	else
 		ret = -ENOTTY;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 6bd5304ee0b7..758a0a91a066 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -52,7 +52,7 @@ struct vfio_iommu_driver_ops {
 			     struct iommu_group *group,
 			     dma_addr_t user_iova,
 			     int npage, int prot,
-			     unsigned long *phys_pfn);
+			     struct page **phys_page);
 	int		(*unpin_pages)(void *iommu_data,
 				       dma_addr_t user_iova, int npage);
 	int		(*register_notifier)(void *iommu_data,
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d027ed8441a9..841b1803e313 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -830,7 +830,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 				      struct iommu_group *iommu_group,
 				      dma_addr_t user_iova,
 				      int npage, int prot,
-				      unsigned long *phys_pfn)
+				      struct page **phys_page)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_iommu_group *group;
@@ -840,7 +840,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	bool do_accounting;
 	dma_addr_t iova;
 
-	if (!iommu || !phys_pfn)
+	if (!iommu || !phys_page)
 		return -EINVAL;
 
 	/* Supported for v2 version only */
@@ -879,6 +879,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	do_accounting = list_empty(&iommu->domain_list);
 
 	for (i = 0; i < npage; i++) {
+		unsigned long phys_pfn;
 		struct vfio_pfn *vpfn;
 
 		iova = user_iova + PAGE_SIZE * i;
@@ -895,23 +896,25 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 
 		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
 		if (vpfn) {
-			phys_pfn[i] = vpfn->pfn;
+			phys_page[i] = pfn_to_page(vpfn->pfn);
 			continue;
 		}
 
 		remote_vaddr = dma->vaddr + (iova - dma->iova);
-		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
+		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn,
 					     do_accounting);
 		if (ret)
 			goto pin_unwind;
 
-		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
+		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn);
 		if (ret) {
-			if (put_pfn(phys_pfn[i], dma->prot) && do_accounting)
+			if (put_pfn(phys_pfn, dma->prot) && do_accounting)
 				vfio_lock_acct(dma, -1, true);
 			goto pin_unwind;
 		}
 
+		phys_page[i] = pfn_to_page(phys_pfn);
+
 		if (iommu->dirty_page_tracking) {
 			unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
 
@@ -934,14 +937,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	goto pin_done;
 
 pin_unwind:
-	phys_pfn[i] = 0;
+	phys_page[i] = NULL;
 	for (j = 0; j < i; j++) {
 		dma_addr_t iova;
 
 		iova = user_iova + PAGE_SIZE * j;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
 		vfio_unpin_page_external(dma, iova, do_accounting);
-		phys_pfn[j] = 0;
+		phys_page[j] = NULL;
 	}
 pin_done:
 	mutex_unlock(&iommu->lock);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 99c3bf52c4da..7bc18802bf39 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -148,7 +148,7 @@ extern bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
 #define VFIO_PIN_PAGES_MAX_ENTRIES	(PAGE_SIZE/sizeof(unsigned long))
 
 extern int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
-			  int npage, int prot, unsigned long *phys_pfn);
+			  int npage, int prot, struct page **phys_page);
 extern int vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova,
 			    int npage);
 extern int vfio_dma_rw(struct vfio_device *device, dma_addr_t iova,
Most of the callers of vfio_pin_pages() want "struct page *", and the
low-level mm code that pins pages returns a list of "struct page *"
too, so there is no gain in converting to PFNs in between.

Replace the output parameter phys_pfn list with a phys_page list to
simplify callers. This also allows us to replace the vfio_iommu_type1
implementation with a more efficient one later.

For now, also update vfio_iommu_type1 to fit this new parameter.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 .../driver-api/vfio-mediated-device.rst |  2 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c        | 19 ++++++-------------
 drivers/s390/cio/vfio_ccw_cp.c          | 19 +++++++++----------
 drivers/s390/crypto/vfio_ap_ops.c       |  7 ++++---
 drivers/vfio/vfio.c                     |  8 ++++----
 drivers/vfio/vfio.h                     |  2 +-
 drivers/vfio/vfio_iommu_type1.c         | 19 +++++++++++--------
 include/linux/vfio.h                    |  2 +-
 8 files changed, 37 insertions(+), 41 deletions(-)