Message ID | 20250218111017.491719-13-aik@amd.com (mailing list archive)
---|---
State | New
Series | TSM: Secure VFIO, TDISP, SEV TIO
On Tue, Feb 18, 2025 at 10:09:59PM +1100, Alexey Kardashevskiy wrote:
> CoCo VMs get their private memory allocated from guest_memfd
> ("gmemfd") which is a KVM facility similar to memfd.
> At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
> not work on these as expected.
>
> Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
> fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
> use a simplified mapping path.
>
> The long term plan is to ditch this workaround and follow
> the usual memfd path.

How is that possible though?

> +static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
> +					 unsigned long *pfn, int *max_order)
> +{
> +	struct folio *folio;
> +	int ret = 0;
> +
> +	folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
> +
> +	if (IS_ERR(folio))
> +		return folio;
> +
> +	if (folio_test_hwpoison(folio)) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		return ERR_PTR(-EHWPOISON);
> +	}
> +
> +	*pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> +	if (!max_order)
> +		goto unlock_exit;
> +
> +	/* Refs for unpin_user_page_range_dirty_lock->gup_put_folio(FOLL_PIN) */
> +	ret = folio_add_pins(folio, 1);
> +	folio_put(folio); /* Drop ref from filemap_grab_folio */
> +
> +unlock_exit:
> +	folio_unlock(folio);
> +	if (ret)
> +		folio = ERR_PTR(ret);
> +
> +	return folio;
> +}

Connecting iommufd to guestmemfd through the FD is broadly the right
idea, but I'm not sure this matches the design of guestmemfd regarding
pinnability. IIRC they were adamant that the pages would not be
pinned..

folio_add_pins() just prevents the folio from being freed, it doesn't
prevent the guestmemfd code from messing with the filemap.

You should separate this from the rest of the series and discuss it
directly with the guestmemfd maintainers.

As I understood it the requirement here is to have some kind of
invalidation callback so that iommufd can drop mappings, but I don't
really know and AFAIK AMD is special in wanting private pages mapped
to the hypervisor iommu..

Jason
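As an illustration of the invalidation-callback idea above, here is a
minimal sketch assuming a hypothetical guest_memfd notifier interface.
gmemfd_invalidate_cb, gmemfd_register_invalidate_cb() and
iopt_unmap_file_range() are invented names for discussion, not existing
kernel APIs:

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical: guest_memfd exposes no such notifier today. */
struct gmemfd_invalidate_cb {
	/* Called before guest_memfd truncates/converts this file range */
	void (*invalidate)(void *priv, pgoff_t start, pgoff_t last);
	void *priv;
};

/* Assumed registration entry point, named here for discussion only. */
int gmemfd_register_invalidate_cb(struct file *gmemfd,
				  struct gmemfd_invalidate_cb *cb);

/*
 * iommufd side: drop the IOPTEs built from the affected file range so
 * the folios can be freed or converted without holding a long-term pin.
 */
static void iommufd_gmemfd_invalidate(void *priv, pgoff_t start, pgoff_t last)
{
	struct iopt_pages *pages = priv;

	iopt_unmap_file_range(pages, start, last);	/* invented helper */
}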
On 19/2/25 01:16, Jason Gunthorpe wrote:
> On Tue, Feb 18, 2025 at 10:09:59PM +1100, Alexey Kardashevskiy wrote:
>> CoCo VMs get their private memory allocated from guest_memfd
>> ("gmemfd") which is a KVM facility similar to memfd.
>> At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
>> not work on these as expected.
>>
>> Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
>> fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
>> use a simplified mapping path.
>>
>> The long term plan is to ditch this workaround and follow
>> the usual memfd path.
>
> How is that possible though?

dunno, things evolve over years and converge somehow :)

>> +static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
>> +					 unsigned long *pfn, int *max_order)
>> +{
[snip]
>> +	return folio;
>> +}
>
> Connecting iommufd to guestmemfd through the FD is broadly the right
> idea, but I'm not sure this matches the design of guestmemfd regarding
> pinnability. IIRC they were adamant that the pages would not be
> pinned..

uff, I thought it was about "not mapped" rather than "not pinned".

> folio_add_pins() just prevents the folio from being freed, it doesn't
> prevent the guestmemfd code from messing with the filemap.
>
> You should separate this from the rest of the series and discuss it
> directly with the guestmemfd maintainers.

Alright, thanks for the suggestion.

> As I understood it the requirement here is to have some kind of
> invalidation callback so that iommufd can drop mappings,

Since shared<->private conversion is an ioctl() (kvm/gmemfd), it would
be an ioctl() for iommufd too. Oh well.

> but I don't
> really know and AFAIK AMD is special in wanting private pages mapped
> to the hypervisor iommu..

With in-place conversion, we could map the entire guest once in the HV
IOMMU and control the Cbit via the guest's IOMMU table (when
available). Thanks,
On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> With in-place conversion, we could map the entire guest once in the HV IOMMU
> and control the Cbit via the guest's IOMMU table (when available). Thanks,

Isn't it more complicated than that? I understood you need to have an
IOPTE boundary in the hypervisor at any point where the guest Cbit
changes - so you can't just dump 1G hypervisor pages to cover the
whole VM, you have to actively resize ioptes?

This was the whole motivation for adding the page size override kernel
command line.

Jason
On 19/2/25 10:51, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
>
>> With in-place conversion, we could map the entire guest once in the HV IOMMU
>> and control the Cbit via the guest's IOMMU table (when available). Thanks,
>
> Isn't it more complicated than that? I understood you need to have an
> IOPTE boundary in the hypervisor at any point where the guest Cbit
> changes - so you can't just dump 1G hypervisor pages to cover the
> whole VM, you have to actively resize ioptes?

When the guest Cbit changes, only the AMD RMP table requires an update,
not necessarily the NPT or the IOPTEs.
(I may have misunderstood the question - what does "dump 1G pages" mean
here?)

> This was the whole motivation for adding the page size override kernel
> command line.
>
> Jason
On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
> On 19/2/25 10:51, Jason Gunthorpe wrote:
> > On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> >
> > > With in-place conversion, we could map the entire guest once in the HV IOMMU
> > > and control the Cbit via the guest's IOMMU table (when available). Thanks,
> >
> > Isn't it more complicated than that? I understood you need to have an
> > IOPTE boundary in the hypervisor at any point where the guest Cbit
> > changes - so you can't just dump 1G hypervisor pages to cover the
> > whole VM, you have to actively resize ioptes?
>
> When the guest Cbit changes, only the AMD RMP table requires an update,
> not necessarily the NPT or the IOPTEs.
> (I may have misunderstood the question - what does "dump 1G pages" mean
> here?)

AFAIK that is not true, if there are mismatches in page size, ie the
RMP is 2M and the IOPTE is 1G then things do not work properly. It is
why we had to do this:

> > This was the whole motivation for adding the page size override kernel
> > command line.

commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
Author: Joerg Roedel <jroedel@suse.de>
Date:   Thu Sep 5 09:22:40 2024 +0200

    iommu/amd: Add kernel parameters to limit V1 page-sizes

    Add two new kernel command line parameters to limit the page-sizes
    used for v1 page-tables:

        nohugepages     - Limits page-sizes to 4KiB

        v2_pgsizes_only - Limits page-sizes to 4KiB/2MiB/1GiB; the
                          same as the sizes used with v2 page-tables

    This is needed for multiple scenarios. When assigning devices to
    SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
    table, otherwise the device will not be able to access all shared
    memory.

    Also, some ATS devices do not work properly with arbitrary IO
    page-sizes as supported by AMD-Vi, so limiting the sizes used by the
    driver is a suitable workaround.

    All-in-all, these parameters are only workarounds until the IOMMU core
    and related APIs gather the ability to negotiate the page-sizes in a
    better way.

    Signed-off-by: Joerg Roedel <jroedel@suse.de>
    Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
    Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org

Jason
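For reference, a usage sketch of those workarounds on the host kernel
command line, assuming the option names above hang off the existing
amd_iommu= parameter (worth double-checking against
Documentation/admin-guide/kernel-parameters.txt):

    amd_iommu=nohugepages        # v1 tables: 4KiB IOPTEs only, matches 4K RMP entries
    amd_iommu=v2_pgsizes_only    # v1 tables: 4KiB/2MiB/1GiB only, no odd AMDv1 sizes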
On Wed, Feb 19, 2025 at 09:35:16AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
> > On 19/2/25 10:51, Jason Gunthorpe wrote:
> > > On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> > >
> > > > With in-place conversion, we could map the entire guest once in the HV IOMMU
> > > > and control the Cbit via the guest's IOMMU table (when available). Thanks,
> > >
> > > Isn't it more complicated than that? I understood you need to have an
> > > IOPTE boundary in the hypervisor at any point where the guest Cbit
> > > changes - so you can't just dump 1G hypervisor pages to cover the
> > > whole VM, you have to actively resize ioptes?
> >
> > When the guest Cbit changes, only the AMD RMP table requires an update,
> > not necessarily the NPT or the IOPTEs.
> > (I may have misunderstood the question - what does "dump 1G pages" mean
> > here?)
>
> AFAIK that is not true, if there are mismatches in page size, ie the
> RMP is 2M and the IOPTE is 1G then things do not work properly.

Just for clarity: at least for normal/nested page tables (but I'm
assuming the same applies to IOMMU mappings), 1G mappings are handled
similarly to 2MB mappings as far as RMP table checks are concerned:
each 2MB range is checked individually as if it were a separate 2MB
mapping. AMD Architecture Programmer's Manual Volume 2, 15.36.10,
"RMP and VMPL Access Checks":

  "Accesses to 1GB pages only install 2MB TLB entries when SEV-SNP is
  enabled, therefore this check treats 1GB accesses as 2MB accesses for
  purposes of this check."

So a 1GB mapping doesn't really impose more restrictions than a 2MB
mapping (unless there's something different about how RMP checks are
done for the IOMMU).

But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
mapping either requires private page RMP entries to be 2MB, or, in the
case of a 2MB mapping of shared pages, every page in the range must be
shared according to the corresponding RMP entries.

> It is why we had to do this:

I think, for the non-SEV-TIO use case, it had more to do with the
inability to unmap a 4K range once a particular 4K page has been
converted from shared to private if it was originally installed via a
2MB IOPTE, since the guest could actively be DMA'ing to other shared
pages in the 2MB range (but we can be assured it is not DMA'ing to the
particular 4K page it has converted to private), and the IOMMU doesn't
(AFAIK) have a way to atomically split an existing 2MB IOPTE to avoid
this. So forcing everything to 4K ends up being necessary, since we
don't know in advance which ranges might contain 4K pages that will get
converted to private in the future by the guest.

SEV-TIO might relax this restriction by making use of TMPM and the
PSMASH_IO command to split/"smash" RMP entries and IOMMU mappings to 4K
after the fact, but I'm not too familiar with the architecture/plans,
so Alexey can correct me on that.

-Mike

> > > This was the whole motivation for adding the page size override kernel
> > > command line.
>
> commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
> Author: Joerg Roedel <jroedel@suse.de>
> Date:   Thu Sep 5 09:22:40 2024 +0200
>
>     iommu/amd: Add kernel parameters to limit V1 page-sizes
[...]
>
> Jason
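To make the size rule Mike describes concrete, here is a purely
illustrative C model of the check, not AMD's actual hardware algorithm
(the struct layout is invented): mappings larger than 2MB decompose
into per-2MB checks, a private 2MB access wants a 2MB RMP entry, and a
shared 2MB access requires every 4K page in the range to be shared.

#include <stdbool.h>
#include <stddef.h>

#define SZ_4K	0x1000UL
#define SZ_2M	0x200000UL

/* Invented representation of one 4K-granular RMP entry. */
struct rmp_entry {
	bool assigned;	/* private (guest-owned) vs shared */
	bool large;	/* entry describes a 2MB page rather than 4K */
};

/* One 2MB check, following the APM text quoted above. */
static bool rmp_check_2m(const struct rmp_entry *rmp, size_t first_4k_idx,
			 bool access_is_private)
{
	if (access_is_private)
		/* Private 2MB access: needs a matching 2MB RMP entry. */
		return rmp[first_4k_idx].assigned && rmp[first_4k_idx].large;

	/* Shared 2MB access: every 4K page in the range must be shared. */
	for (size_t i = 0; i < SZ_2M / SZ_4K; i++)
		if (rmp[first_4k_idx + i].assigned)
			return false;
	return true;
}

/* Mappings >2MB (a 1G or AMDv1 4M/8M IOPTE) decompose into 2MB checks. */
static bool rmp_check_mapping(const struct rmp_entry *rmp, size_t first_4k_idx,
			      size_t bytes, bool access_is_private)
{
	for (size_t off = 0; off < bytes; off += SZ_2M)
		if (!rmp_check_2m(rmp, first_4k_idx + off / SZ_4K,
				  access_is_private))
			return false;
	return true;
}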
On Wed, Feb 19, 2025 at 02:23:24PM -0600, Michael Roth wrote:
> Just for clarity: at least for normal/nested page tables (but I'm
> assuming the same applies to IOMMU mappings), 1G mappings are handled
> similarly to 2MB mappings as far as RMP table checks are concerned:
> each 2MB range is checked individually as if it were a separate 2MB
> mapping.

Well, IIRC we are dealing with the AMDv1 IO page table here which
supports more sizes than 1G, and we likely start to see things like 4M
mappings and the like. So maybe there is some issue if the above
special case really only applies to 1G and only 1G.

> But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
> mapping either requires private page RMP entries to be 2MB, or, in the
> case of a 2MB mapping of shared pages, every page in the range must be
> shared according to the corresponding RMP entries.

Is 4k RMP what people are running?

> I think, for the non-SEV-TIO use case, it had more to do with the
> inability to unmap a 4K range once a particular 4K page has been
> converted

Yes, we don't support unmap or resize. The entire theory of operation
has the IOPTEs cover the guest memory and remain static at VM boot
time. The RMP alone controls access and handles the shared/private
split.

Assuming the host used 2M pages, the IOPTEs in an AMDv1 table will be
sized around 2M, 4M, 8M just based on random luck.

So it sounds like you can get to a situation with a >=2M mapping in
the IOPTE but the guest has split it into private/shared at lower
granularity and the HW cannot handle this?

> from shared to private if it was originally installed via a 2MB IOPTE,
> since the guest could actively be DMA'ing to other shared pages in
> the 2MB range (but we can be assured it is not DMA'ing to the
> particular 4K page it has converted to private), and the IOMMU doesn't
> (AFAIK) have a way to atomically split an existing 2MB IOPTE to avoid
> this.

The iommu can split it (with SW help), I'm working on that
infrastructure right now..

So you will get a notification that the guest has made a
private/public split and the iommu page table can be atomically
restructured to put an IOPTE boundary at the split.

Then the HW will not see IOPTEs that exceed the shared/private
granularity of the VM.

Jason
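A hand-wavy sketch of the flow Jason describes, with all names
hypothetical (the real infrastructure is the iommu_pt work linked later
in the thread):

/*
 * Hypothetical sketch only: on notification that the guest converted
 * [iova, iova + size) between private and shared, restructure the IO
 * page table so no single IOPTE straddles the new boundary. Everything
 * stays mapped throughout; only IOPTE granularity changes, and the RMP
 * keeps enforcing who may actually access what.
 */
static int on_guest_conversion(struct iommu_domain *domain,
			       unsigned long iova, size_t size)
{
	/*
	 * Invented helper: atomically split any huge IOPTE so that
	 * IOPTE boundaries fall exactly at iova and iova + size, with
	 * DMA to the rest of the range staying live the whole time.
	 */
	return iommu_split_iopte_range(domain, iova, size);
}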
On Wed, Feb 19, 2025 at 04:37:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 02:23:24PM -0600, Michael Roth wrote:
> > Just for clarity: at least for normal/nested page tables (but I'm
> > assuming the same applies to IOMMU mappings), 1G mappings are handled
> > similarly to 2MB mappings as far as RMP table checks are concerned:
> > each 2MB range is checked individually as if it were a separate 2MB
> > mapping.
>
> Well, IIRC we are dealing with the AMDv1 IO page table here which
> supports more sizes than 1G, and we likely start to see things like 4M
> mappings and the like. So maybe there is some issue if the above
> special case really only applies to 1G and only 1G.

I think the documentation only mentioned 1G specifically since that's
the next level up in host/nested page table mappings, and that more
generally anything mapped at a higher granularity than 2MB would be
broken down into individual checks on each 2MB range within. But it's
quite possible things are handled differently for the IOMMU, so it is
definitely worth confirming.

> > But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
> > mapping either requires private page RMP entries to be 2MB, or, in the
> > case of a 2MB mapping of shared pages, every page in the range must be
> > shared according to the corresponding RMP entries.
>
> Is 4k RMP what people are running?

Unfortunately yes, but that's mainly due to guest_memfd only handling
4K currently. Hopefully that will change soon, but in the meantime
there's only experimental support for larger private page sizes that
make use of 2MB RMP entries (via THP).

But regardless, we'll still end up dealing with 4K RMP entries since
we'll need to split 2MB RMP entries in response to private->shared
conversions that aren't 2MB aligned/sized.

> > I think, for the non-SEV-TIO use case, it had more to do with the
> > inability to unmap a 4K range once a particular 4K page has been
> > converted
>
> Yes, we don't support unmap or resize. The entire theory of operation
> has the IOPTEs cover the guest memory and remain static at VM boot
> time. The RMP alone controls access and handles the shared/private
> split.
>
> Assuming the host used 2M pages, the IOPTEs in an AMDv1 table will be
> sized around 2M, 4M, 8M just based on random luck.
>
> So it sounds like you can get to a situation with a >=2M mapping in
> the IOPTE but the guest has split it into private/shared at lower
> granularity and the HW cannot handle this?

Remembering more details: the situation is a bit more specific to
guest_memfd. In general, for non-SEV-TIO, everything in the IOMMU will
always be for shared pages, and because of that the RMP checks don't
impose any additional restrictions on mapping size (a shared page can
be mapped 2MB even if the RMP entry is 4K; the RMP page-size bit only
really applies to private pages).

The issue with guest_memfd is that it is only used for private pages
(at least until in-place conversion is supported), so when we "convert"
shared pages to private we are essentially discarding those pages and
re-allocating them via guest_memfd, so the mappings for those discarded
pages become stale and need to be removed. But since this can happen at
4K granularity, we need to map at 4K because we don't have a way to
split the mappings later on (at least, not currently...).

The other approach is to not discard these shared pages after
conversion and just not free them back, which ends up using more host
memory but allows for larger IOMMU mappings.

> > from shared to private if it was originally installed via a 2MB IOPTE,
> > since the guest could actively be DMA'ing to other shared pages in
> > the 2MB range (but we can be assured it is not DMA'ing to the
> > particular 4K page it has converted to private), and the IOMMU doesn't
> > (AFAIK) have a way to atomically split an existing 2MB IOPTE to avoid
> > this.
>
> The iommu can split it (with SW help), I'm working on that
> infrastructure right now..
>
> So you will get a notification that the guest has made a
> private/public split and the iommu page table can be atomically
> restructured to put an IOPTE boundary at the split.
>
> Then the HW will not see IOPTEs that exceed the shared/private
> granularity of the VM.

That sounds very interesting. It would allow us to use larger IOMMU
mappings even for guest_memfd as it exists today, while still
supporting shared memory discard and avoiding the additional host
memory usage mentioned above. Are there patches available publicly?

Thanks,

Mike

> Jason
On Wed, Feb 19, 2025 at 03:30:37PM -0600, Michael Roth wrote:
> I think the documentation only mentioned 1G specifically since that's
> the next level up in host/nested page table mappings, and that more
> generally anything mapped at a higher granularity than 2MB would be
> broken down into individual checks on each 2MB range within. But it's
> quite possible things are handled differently for the IOMMU, so it is
> definitely worth confirming.

Hmm, well, I'd very much like it if we are all on the same page as to
why the new kernel parameters were needed. Joerg was definitely seeing
testing failures without them.

IMHO we should not require parameters like that, I expect the kernel to
fix this stuff on its own.

> But regardless, we'll still end up dealing with 4K RMP entries since
> we'll need to split 2MB RMP entries in response to private->shared
> conversions that aren't 2MB aligned/sized.

:( What is the point of even allowing < 2MB private/shared conversion?

> > Then the HW will not see IOPTEs that exceed the shared/private
> > granularity of the VM.
>
> That sounds very interesting. It would allow us to use larger IOMMU
> mappings even for guest_memfd as it exists today, while still
> supporting shared memory discard and avoiding the additional host
> memory usage mentioned above. Are there patches available publicly?

https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com

I'm getting quite close to having something non-RFC that just does AMD
and the bare minimum. I will add you two to the CC.

Jason
On 20/2/25 00:35, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
>> On 19/2/25 10:51, Jason Gunthorpe wrote:
>>> On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
>>>
>>>> With in-place conversion, we could map the entire guest once in the HV IOMMU
>>>> and control the Cbit via the guest's IOMMU table (when available). Thanks,
>>>
>>> Isn't it more complicated than that? I understood you need to have an
>>> IOPTE boundary in the hypervisor at any point where the guest Cbit
>>> changes - so you can't just dump 1G hypervisor pages to cover the
>>> whole VM, you have to actively resize ioptes?
>>
>> When the guest Cbit changes, only the AMD RMP table requires an update,
>> not necessarily the NPT or the IOPTEs.
>> (I may have misunderstood the question - what does "dump 1G pages" mean
>> here?)
>
> AFAIK that is not true, if there are mismatches in page size, ie the
> RMP is 2M and the IOPTE is 1G then things do not work properly.

Right, so I misunderstood. When I first replied, I assumed the current
situation of 4K pages everywhere. IOPTEs larger than RMP entries are
likely to cause failed RMP checks (confirming now, surprises sometimes
happen). Thanks,

> It is why we had to do this:
>
>>> This was the whole motivation for adding the page size override kernel
>>> command line.
>
> commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
> Author: Joerg Roedel <jroedel@suse.de>
> Date:   Thu Sep 5 09:22:40 2024 +0200
>
>     iommu/amd: Add kernel parameters to limit V1 page-sizes
[...]
>
> Jason
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 3427749bc5ce..457d8eaacd2c 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -53,6 +53,7 @@
 #include <linux/overflow.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
+#include <linux/pagemap.h>
 
 #include "double_span.h"
 #include "io_pagetable.h"
@@ -850,6 +851,88 @@ static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
 	return npages_out;
 }
 
+static bool is_guest_memfd(struct file *file)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+
+	return mapping_inaccessible(mapping) && mapping_unevictable(mapping);
+}
+
+static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
+					 unsigned long *pfn, int *max_order)
+{
+	struct folio *folio;
+	int ret = 0;
+
+	folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
+
+	if (IS_ERR(folio))
+		return folio;
+
+	if (folio_test_hwpoison(folio)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-EHWPOISON);
+	}
+
+	*pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
+	if (!max_order)
+		goto unlock_exit;
+
+	/* Refs for unpin_user_page_range_dirty_lock->gup_put_folio(FOLL_PIN) */
+	ret = folio_add_pins(folio, 1);
+	folio_put(folio); /* Drop ref from filemap_grab_folio */
+
+unlock_exit:
+	folio_unlock(folio);
+	if (ret)
+		folio = ERR_PTR(ret);
+
+	return folio;
+}
+
+static long pin_guest_memfd_pages(struct pfn_reader_user *user, loff_t start,
+				  unsigned long npages, struct iopt_pages *pages)
+{
+	unsigned long offset = 0;
+	loff_t uptr = start;
+	long rc = 0;
+
+	for (unsigned long i = 0; i < npages; ++i, uptr += PAGE_SIZE) {
+		unsigned long gfn = 0, pfn = 0;
+		int max_order = 0;
+		struct folio *folio;
+
+		folio = guest_memfd_get_pfn(user->file, uptr >> PAGE_SHIFT,
+					    &pfn, &max_order);
+		if (IS_ERR(folio))
+			rc = PTR_ERR(folio);
+
+		if (rc == -EINVAL && i == 0) {
+			pr_err_once("Must be vfio mmio at gfn=%lx pfn=%lx, skipping\n",
+				    gfn, pfn);
+			return rc;
+		}
+
+		if (rc) {
+			pr_err("%s: %ld %ld %lx -> %lx\n", __func__,
+			       rc, i, (unsigned long) uptr, (unsigned long) pfn);
+			break;
+		}
+
+		if (i == 0)
+			offset = offset_in_folio(folio, start);
+
+		user->ufolios[i] = folio;
+	}
+
+	if (!rc) {
+		rc = npages;
+		user->ufolios_next = user->ufolios;
+		user->ufolios_offset = offset;
+	}
+
+	return rc;
+}
+
 static int pfn_reader_user_pin(struct pfn_reader_user *user,
 			       struct iopt_pages *pages,
 			       unsigned long start_index,
@@ -903,7 +986,10 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
 
 	if (user->file) {
 		start = pages->start + (start_index * PAGE_SIZE);
-		rc = pin_memfd_pages(user, start, npages);
+		if (is_guest_memfd(user->file))
+			rc = pin_guest_memfd_pages(user, start, npages, pages);
+		else
+			rc = pin_memfd_pages(user, start, npages);
 	} else if (!remote_mm) {
 		uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
 		rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
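One detail in guest_memfd_get_pfn() above worth spelling out: the pfn
math relies on filemap folios being naturally aligned, power-of-two
sized runs of file pages, so masking the file index with
folio_nr_pages() - 1 gives the page's offset inside the folio. A worked
example (the values are illustrative):

/* Assume a 2MB folio, i.e. folio_nr_pages() == 512, naturally aligned
 * in the file so its first page sits at an index that is a multiple
 * of 512. */
unsigned long index = 0x345;		/* page index into the file */
unsigned long nr    = 512;		/* folio_nr_pages()         */
unsigned long off   = index & (nr - 1);	/* 0x345 & 0x1ff = 0x145    */
/* pfn = folio_pfn(folio) + off: page 0x145 within the folio, since the
 * folio itself starts at file index 0x200. */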
CoCo VMs get their private memory allocated from guest_memfd
("gmemfd") which is a KVM facility similar to memfd.
At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
not work on these as expected.

Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
use a simplified mapping path.

The long term plan is to ditch this workaround and follow
the usual memfd path.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 drivers/iommu/iommufd/pages.c | 88 +++++++++++++++++++-
 1 file changed, 87 insertions(+), 1 deletion(-)
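For context on the uAPI being reused here, a minimal userspace sketch
of mapping a gmemfd range through IOMMU_IOAS_MAP_FILE; the struct and
flag names follow include/uapi/linux/iommufd.h as I understand it
(worth verifying against the header), and the iommufd/ioas_id/gmemfd
values are placeholders the caller is assumed to have set up already:

#include <linux/iommufd.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Map [offset, offset + len) of a guest_memfd into the IOAS at iova. */
static int map_gmemfd(int iommufd, uint32_t ioas_id, int gmemfd,
		      uint64_t offset, uint64_t len, uint64_t iova)
{
	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = gmemfd,
		.start = offset,
		.length = len,
		.iova = iova,
	};

	return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}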