
[RFC,v2,12/22] iommufd: Allow mapping from guest_memfd

Message ID: 20250218111017.491719-13-aik@amd.com
State: New
Series: TSM: Secure VFIO, TDISP, SEV TIO

Commit Message

Alexey Kardashevskiy Feb. 18, 2025, 11:09 a.m. UTC
CoCo VMs get their private memory allocated from guest_memfd
("gmemfd"), which is a KVM facility similar to memfd.
At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
not work on them as expected.

Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
use a simplified mapping path.

The long term plan is to ditch this workaround and follow
the usual memfd path.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 drivers/iommu/iommufd/pages.c | 88 +++++++++++++++++++-
 1 file changed, 87 insertions(+), 1 deletion(-)
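
For orientation, below is a minimal userspace sketch of the fd + offset path
this patch targets. It assumes the IOMMU_IOAS_MAP_FILE uapi and
KVM_CREATE_GUEST_MEMFD as found in recent kernels; the struct field and flag
names should be checked against include/uapi/linux/iommufd.h and linux/kvm.h,
and error handling is mostly omitted.

/* Sketch only: map a guest_memfd range into an IOAS via fd + offset.
 * Assumes the IOMMU_IOAS_MAP_FILE and KVM_CREATE_GUEST_MEMFD uapi;
 * verify names against the current kernel headers.
 */
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/kvm.h>

static int map_gmemfd_range(int iommufd, __u32 ioas_id, int vm_fd,
			    __u64 size, __u64 iova)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	int gmemfd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = gmemfd,
		.start = 0,			/* offset into the gmemfd */
		.length = size,
		.iova = iova,
	};

	if (gmemfd < 0)
		return gmemfd;
	/* With this patch, pfn_reader_user_pin() detects the gmemfd and takes
	 * the simplified pinning path instead of GUP. */
	return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}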

Comments

Jason Gunthorpe Feb. 18, 2025, 2:16 p.m. UTC | #1
On Tue, Feb 18, 2025 at 10:09:59PM +1100, Alexey Kardashevskiy wrote:
> CoCo VMs get their private memory allocated from guest_memfd
> ("gmemfd"), which is a KVM facility similar to memfd.
> At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
> not work on them as expected.
> 
> Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
> fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
> use a simplified mapping path.
> 
> The long term plan is to ditch this workaround and follow
> the usual memfd path.

How is that possible though?

> +static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
> +					 unsigned long *pfn, int *max_order)
> +{
> +	struct folio *folio;
> +	int ret = 0;
> +
> +	folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
> +
> +	if (IS_ERR(folio))
> +		return folio;
> +
> +	if (folio_test_hwpoison(folio)) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		return ERR_PTR(-EHWPOISON);
> +	}
> +
> +	*pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> +	if (!max_order)
> +		goto unlock_exit;
> +
> +	/* Refs for unpin_user_page_range_dirty_lock->gup_put_folio(FOLL_PIN) */
> +	ret = folio_add_pins(folio, 1);
> +	folio_put(folio); /* Drop ref from filemap_grab_folio */
> +
> +unlock_exit:
> +	folio_unlock(folio);
> +	if (ret)
> +		folio = ERR_PTR(ret);
> +
> +	return folio;
> +}

Connecting iommufd to guestmemfd through the FD is broadly the right
idea, but I'm not sure this matches the design of guestmemfd regarding
pinnability. IIRC they were adamant that the pages would not be
pinned..

folio_add_pins() just prevents the folio from being freed, it doesn't
prevent the guestmemfd code from messing with the filemap.

You should separate this from the rest of the series and discuss it
directly with the guestmemfd maintainers.
 
As I understood it the requirement here is to have some kind of
invalidation callback so that iommufd can drop mappings, but I don't
really know and AFAIK AMD is special in wanting private pages mapped
to the hypervisor iommu..

Jason
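
A purely hypothetical sketch of the kind of invalidation hook being asked for
here; guest_memfd has no such interface at this point, and every name below is
invented for illustration only.

/* Hypothetical only: an invalidation hook guest_memfd could offer so an
 * importer such as iommufd can drop its mappings before pages are
 * converted, truncated or migrated. None of these names exist in the kernel. */
struct gmem_invalidate_ops {
	/* called before guest_memfd changes the backing of [start, end) */
	void (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
};

struct gmem_importer {
	const struct gmem_invalidate_ops *ops;
	void *priv;			/* e.g. the iopt_pages of the mapping */
	struct list_head node;		/* on the gmemfd's importer list */
};

/* iommufd would implement ->invalidate() by unmapping the affected IOVA
 * range, after which guest_memfd is free to touch its filemap. */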
Alexey Kardashevskiy Feb. 18, 2025, 11:35 p.m. UTC | #2
On 19/2/25 01:16, Jason Gunthorpe wrote:
> On Tue, Feb 18, 2025 at 10:09:59PM +1100, Alexey Kardashevskiy wrote:
>> CoCo VMs get their private memory allocated from guest_memfd
>> ("gmemfd"), which is a KVM facility similar to memfd.
>> At the moment gmemfds cannot be mmap()ed, so the usual GUP API does
>> not work on them as expected.
>>
>> Use the existing IOMMU_IOAS_MAP_FILE API to allow mapping from
>> fd + offset. Detect the gmemfd case in pfn_reader_user_pin() and
>> use a simplified mapping path.
>>
>> The long term plan is to ditch this workaround and follow
>> the usual memfd path.
> 
> How is that possible though?

dunno, things evolve over years and converge somehow :)

>> +static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
>> +					 unsigned long *pfn, int *max_order)
>> +{
>> +	struct folio *folio;
>> +	int ret = 0;
>> +
>> +	folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
>> +
>> +	if (IS_ERR(folio))
>> +		return folio;
>> +
>> +	if (folio_test_hwpoison(folio)) {
>> +		folio_unlock(folio);
>> +		folio_put(folio);
>> +		return ERR_PTR(-EHWPOISON);
>> +	}
>> +
>> +	*pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
>> +	if (!max_order)
>> +		goto unlock_exit;
>> +
>> +	/* Refs for unpin_user_page_range_dirty_lock->gup_put_folio(FOLL_PIN) */
>> +	ret = folio_add_pins(folio, 1);
>> +	folio_put(folio); /* Drop ref from filemap_grab_folio */
>> +
>> +unlock_exit:
>> +	folio_unlock(folio);
>> +	if (ret)
>> +		folio = ERR_PTR(ret);
>> +
>> +	return folio;
>> +}
> 
> Connecting iommufd to guestmemfd through the FD is broadly the right
> idea, but I'm not sure this matches the design of guestmemfd regarding
> pinnability. IIRC they were adamant that the pages would not be
> pinned..

uff, I thought it was about "not mapped" rather than "not pinned".

> folio_add_pins() just prevents the folio from being freed, it doesn't
> prevent the guestmemfd code from messing with the filemap.
> 
> You should separate this from the rest of the series and discuss it
> directly with the guestmemfd maintainers.

Alright, thanks for the suggestion.

> As I understood it the requirement here is to have some kind of
> invalidation callback so that iommufd can drop mappings,

Since shared<->private conversion is an ioctl() (kvm/gmemfd), it becomes an
ioctl() for iommufd too. Oh well.

> but I don't
> really know and AFAIK AMD is special in wanting private pages mapped
> to the hypervisor iommu..

With in-place conversion, we could map the entire guest once in the HV 
IOMMU and control the Cbit via the guest's IOMMU table (when available). 
Thanks,
Jason Gunthorpe Feb. 18, 2025, 11:51 p.m. UTC | #3
On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:

> With in-place conversion, we could map the entire guest once in the HV IOMMU
> and control the Cbit via the guest's IOMMU table (when available). Thanks,

Isn't it more complicated than that? I understood you need to have a
IOPTE boundary in the hypervisor at any point where the guest Cbit
changes - so you can't just dump 1G hypervisor pages to cover the
whole VM, you have to actively resize ioptes?

This was the whole motivation to adding the page size override kernel
command line.

Jason
Alexey Kardashevskiy Feb. 19, 2025, 12:43 a.m. UTC | #4
On 19/2/25 10:51, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> 
>> With in-place conversion, we could map the entire guest once in the HV IOMMU
>> and control the Cbit via the guest's IOMMU table (when available). Thanks,
> 
> Isn't it more complicated than that? I understood you need to have a
> IOPTE boundary in the hypervisor at any point where the guest Cbit
> changes - so you can't just dump 1G hypervisor pages to cover the
> whole VM, you have to actively resize ioptes?

When the guest Cbit changes, only the AMD RMP table requires an update, not
necessarily the NPT or IOPTEs.
(I may have misunderstood the question, what meaning does "dump 1G 
pages" have?).


> This was the whole motivation to adding the page size override kernel
> command line.
> 
> Jason
Jason Gunthorpe Feb. 19, 2025, 1:35 p.m. UTC | #5
On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
> On 19/2/25 10:51, Jason Gunthorpe wrote:
> > On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> > 
> > > With in-place conversion, we could map the entire guest once in the HV IOMMU
> > > and control the Cbit via the guest's IOMMU table (when available). Thanks,
> > 
> > Isn't it more complicated than that? I understood you need to have a
> > IOPTE boundary in the hypervisor at any point where the guest Cbit
> > changes - so you can't just dump 1G hypervisor pages to cover the
> > whole VM, you have to actively resize ioptes?
> 
> When the guest Cbit changes, only the AMD RMP table requires an update, not
> necessarily the NPT or IOPTEs.
> (I may have misunderstood the question, what meaning does "dump 1G pages"
> have?).

AFAIK that is not true, if there are mismatches in page size, ie the
RMP is 2M and the IOPTE is 1G then things do not work properly.

It is why we had to do this:

> > This was the whole motivation to adding the page size override kernel
> > command line.

commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
Author: Joerg Roedel <jroedel@suse.de>
Date:   Thu Sep 5 09:22:40 2024 +0200

    iommu/amd: Add kernel parameters to limit V1 page-sizes
    
    Add two new kernel command line parameters to limit the page-sizes
    used for v1 page-tables:
    
            nohugepages     - Limits page-sizes to 4KiB
    
            v2_pgsizes_only - Limits page-sizes to 4Kib/2Mib/1GiB; The
                              same as the sizes used with v2 page-tables
    
    This is needed for multiple scenarios. When assigning devices to
    SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
    table, otherwise the device will not be able to access all shared
    memory.
    
    Also, some ATS devices do not work properly with arbitrary IO
    page-sizes as supported by AMD-Vi, so limiting the sizes used by the
    driver is a suitable workaround.
    
    All-in-all, these parameters are only workarounds until the IOMMU core
    and related APIs gather the ability to negotiate the page-sizes in a
    better way.
    
    Signed-off-by: Joerg Roedel <jroedel@suse.de>
    Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
    Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org

Jason
Michael Roth Feb. 19, 2025, 8:23 p.m. UTC | #6
On Wed, Feb 19, 2025 at 09:35:16AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
> > On 19/2/25 10:51, Jason Gunthorpe wrote:
> > > On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
> > > 
> > > > With in-place conversion, we could map the entire guest once in the HV IOMMU
> > > > and control the Cbit via the guest's IOMMU table (when available). Thanks,
> > > 
> > > Isn't it more complicated than that? I understood you need to have a
> > > IOPTE boundary in the hypervisor at any point where the guest Cbit
> > > changes - so you can't just dump 1G hypervisor pages to cover the
> > > whole VM, you have to actively resize ioptes?
> > 
> > When the guest Cbit changes, only the AMD RMP table requires an update, not
> > necessarily the NPT or IOPTEs.
> > (I may have misunderstood the question, what meaning does "dump 1G pages"
> > have?).
> 
> AFAIK that is not true, if there are mismatches in page size, ie the
> RMP is 2M and the IOPTE is 1G then things do not work properly.

Just for clarity: at least for normal/nested page table (but I'm
assuming the same applies to IOMMU mappings), 1G mappings are
handled similarly as 2MB mappings as far as RMP table checks are
concerned: each 2MB range is checked individually as if it were
a separate 2MB mapping:

AMD Architecture Programmer's Manual Volume 2, 15.36.10,
"RMP and VMPL Access Checks":

  "Accesses to 1GB pages only install 2MB TLB entries when SEV-SNP is
  enabled, therefore this check treats 1GB accesses as 2MB accesses for
  purposes of this check."

So a 1GB mapping doesn't really impose more restrictions than a 2MB
mapping (unless there's something different about how RMP checks are
done for IOMMU).

But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
mapping either requires private page RMP entries to be 2MB, or in the
case of 2MB mapping of shared pages, every page in the range must be
shared according to the corresponding RMP entries.
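
Illustrative pseudocode for the rule above (not the hardware algorithm; the
RMP helpers and entry layout are invented for this sketch): a large mapping is
only consistent if every 2MB chunk it covers passes its own check.

/* Sketch only: conceptual per-2MB RMP validation of a large (e.g. 1GB)
 * mapping. rmp_entry_of() and struct rmp_entry are made up here. */
static bool mapping_consistent_with_rmp(u64 pa, u64 len, bool private)
{
	for (u64 off = 0; off < len; off += SZ_2M) {
		if (private) {
			/* a private chunk must be covered by one 2MB RMP entry */
			struct rmp_entry *e = rmp_entry_of(pa + off);

			if (!e->assigned || e->pagesize != RMP_PG_SIZE_2M)
				return false;
		} else {
			/* a shared chunk is fine only if every 4K page in it
			 * is shared (unassigned) according to the RMP */
			for (u64 p = 0; p < SZ_2M; p += SZ_4K)
				if (rmp_entry_of(pa + off + p)->assigned)
					return false;
		}
	}
	return true;
}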

> 
> It is why we had to do this:

I think, for the non-SEV-TIO use-case, it had more to do with inability
to unmap a 4K range once a particular 4K page has been converted
from shared to private if it was originally installed via a 2MB IOPTE,
since the guest could actively be DMA'ing to other shared pages in the
2M range (but we can be assured it is not DMA'ing to a particular 4K
page it has converted to private), and the IOMMU doesn't (AFAIK) have
a way to atomically split an existing 2MB IOPTE to avoid this. So
forcing everything to 4K ends up being necessary since we don't know
in advance what ranges might contain 4K pages that will get converted
to private in the future by the guest.

SEV-TIO might relax this restriction by making use of TMPM and the
PSMASH_IO command to split/"smash" RMP entries and IOMMU mappings to 4K
after-the-fact, but I'm not too familiar with the architecture/plans so
Alexey can correct me on that.

-Mike

> 
> > > This was the whole motivation to adding the page size override kernel
> > > command line.
> 
> commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
> Author: Joerg Roedel <jroedel@suse.de>
> Date:   Thu Sep 5 09:22:40 2024 +0200
> 
>     iommu/amd: Add kernel parameters to limit V1 page-sizes
>     
>     Add two new kernel command line parameters to limit the page-sizes
>     used for v1 page-tables:
>     
>             nohugepages     - Limits page-sizes to 4KiB
>     
>             v2_pgsizes_only - Limits page-sizes to 4Kib/2Mib/1GiB; The
>                               same as the sizes used with v2 page-tables
>     
>     This is needed for multiple scenarios. When assigning devices to
>     SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
>     table, otherwise the device will not be able to access all shared
>     memory.
>     
>     Also, some ATS devices do not work properly with arbitrary IO
>     page-sizes as supported by AMD-Vi, so limiting the sizes used by the
>     driver is a suitable workaround.
>     
>     All-in-all, these parameters are only workarounds until the IOMMU core
>     and related APIs gather the ability to negotiate the page-sizes in a
>     better way.
>     
>     Signed-off-by: Joerg Roedel <jroedel@suse.de>
>     Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
>     Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org
> 
> Jason
Jason Gunthorpe Feb. 19, 2025, 8:37 p.m. UTC | #7
On Wed, Feb 19, 2025 at 02:23:24PM -0600, Michael Roth wrote:
> Just for clarity: at least for normal/nested page table (but I'm
> assuming the same applies to IOMMU mappings), 1G mappings are
> handled similarly as 2MB mappings as far as RMP table checks are
> concerned: each 2MB range is checked individually as if it were
> a separate 2MB mapping:

Well, IIRC we are dealing with the AMDv1 IO page table here which
supports more sizes than 1G and we likely start to see things like 4M
mappings and the like. So maybe there is some issue if the above
special case really only applies to 1G and only 1G.

> But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
> mapping either requires private page RMP entries to be 2MB, or in the
> case of 2MB mapping of shared pages, every page in the range must be
> shared according to the corresponding RMP entries.

 Is 4k RMP what people are running?

> I think, for the non-SEV-TIO use-case, it had more to do with inability
> to unmap a 4K range once a particular 4K page has been converted

Yes, we don't support unmap or resize. The entire theory of operation
has the IOPTEs cover the guest memory and remain static at VM boot
time. The RMP alone controls access and handles the static/private.

Assuming the host used 2M pages the IOPTEs in an AMDv1 table will be
sized around 2M,4M,8M just based around random luck.

So it sounds like you can get to a situation with a >=2M mapping in
the IOPTE but the guest has split it into private/shared at lower
granularity and the HW cannot handle this?

> from shared to private if it was originally installed via a 2MB IOPTE,
> since the guest could actively be DMA'ing to other shared pages in
> the 2M range (but we can be assured it is not DMA'ing to a particular 4K
> page it has converted to private), and the IOMMU doesn't (AFAIK) have
> a way to atomically split an existing 2MB IOPTE to avoid this. 

The iommu can split it (with SW help), I'm working on that
infrastructure right now..

So you will get a notification that the guest has made a
private/public split and the iommu page table can be atomically
restructured to put an IOPTE boundary at the split.

Then the HW will not see IOPTEs that exceed the shared/private
granularity of the VM.

Jason
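
Conceptually, that notification-driven flow could look like the sketch below;
the helper is an invented stand-in, not the actual split infrastructure
referred to above.

/* Sketch only: on a guest shared<->private conversion of [iova, iova + len),
 * make sure no IOPTE straddles the boundaries; the RMP then gates access
 * within. iopt_split_iopte_at() is invented for illustration. */
static int iopt_handle_conversion(struct io_pagetable *iopt,
				  unsigned long iova, size_t len)
{
	int rc = iopt_split_iopte_at(iopt, iova);	/* start boundary */

	if (rc)
		return rc;
	return iopt_split_iopte_at(iopt, iova + len);	/* end boundary */
}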
Michael Roth Feb. 19, 2025, 9:30 p.m. UTC | #8
On Wed, Feb 19, 2025 at 04:37:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 02:23:24PM -0600, Michael Roth wrote:
> > Just for clarity: at least for normal/nested page table (but I'm
> > assuming the same applies to IOMMU mappings), 1G mappings are
> > handled similarly as 2MB mappings as far as RMP table checks are
> > concerned: each 2MB range is checked individually as if it were
> > a separate 2MB mapping:
> 
> Well, IIRC we are dealing with the AMDv1 IO page table here which
> supports more sizes than 1G and we likely start to see things like 4M
> mappings and the like. So maybe there is some issue if the above
> special case really only applies to 1G and only 1G.

I think the documentation only mentioned 1G specifically since that's
the next level up in host/nested page table mappings, and that more
generally anything mapping at a higher granularity than 2MB would be
broken down into individual checks on each 2MB range within. But it's
quite possible things are handled differently for IOMMU so definitely
worth confirming.

> 
> > But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
> > mapping either requires private page RMP entries to be 2MB, or in the
> > case of 2MB mapping of shared pages, every page in the range must be
> > shared according to the corresponding RMP entries.
> 
>  Is 4k RMP what people are running?

Unfortunately yes, but that's mainly due to guest_memfd only handling
4K currently. Hopefully that will change soon, but in the meantime
there's only experimental support for larger private page sizes that
make use of 2MB RMP entries (via THP).

But regardless, we'll still end up dealing with 4K RMP entries since
we'll need to split 2MB RMP entries in response to private->conversions
that aren't 2MB aligned/sized.

> 
> > I think, for the non-SEV-TIO use-case, it had more to do with inability
> > to unmap a 4K range once a particular 4K page has been converted
> 
> Yes, we don't support unmap or resize. The entire theory of operation
> has the IOPTEs cover the guest memory and remain static at VM boot
> time. The RMP alone controls access and handles the static/private.
> 
> Assuming the host used 2M pages the IOPTEs in an AMDv1 table will be
> sized around 2M,4M,8M just based around random luck.
> 
> So it sounds like you can get to a situation with a >=2M mapping in
> the IOPTE but the guest has split it into private/shared at lower
> granularity and the HW cannot handle this?

Remembering more details: the situation is a bit more specific to
guest_memfd. In general, for non-SEV-TIO, everything in the IOMMU will
be always be for shared pages, and because of that the RMP checks don't
impose any additional restrictions on mapping size (a shared page can
be mapped 2MB even if the RMP entry is 4K (the RMP page-size bit only
really applies for private pages)).

The issue with guest_memfd is that it is only used for private pages
(at least until in-place conversion is supported), so when we "convert"
shared pages to private we are essentially discarding those pages and
re-allocating them via guest_memfd, so the mappings for those discarded
pages become stale and need to be removed. But since this can happen
at 4K granularities, we need to map as 4K because we don't have a way
to split them later on (at least, not currently...).

The other approach is to not discard these shared pages after conversion
and just not free them back, which ends up using more host memory, but
allows for larger IOMMU mappings.

> 
> > from shared to private if it was originally installed via a 2MB IOPTE,
> > since the guest could actively be DMA'ing to other shared pages in
> > the 2M range (but we can be assured it is not DMA'ing to a particular 4K
> > page it has converted to private), and the IOMMU doesn't (AFAIK) have
> > a way to atomically split an existing 2MB IOPTE to avoid this. 
> 
> The iommu can split it (with SW help), I'm working on that
> infrastructure right now..
> 
> So you will get a notification that the guest has made a
> private/public split and the iommu page table can be atomically
> restructured to put an IOPTE boundary at the split.
> 
> Then the HW will not see IOPTEs that exceed the shared/private
> granularity of the VM.

That sounds very interesting. It would allow us to use larger IOMMU
mappings even for guest_memfd as it exists today, while still supporting
shared memory discard and avoiding the additional host memory usage
mentioned above. Are there patches available publicly?

Thanks,

Mike

> 
> Jason
Jason Gunthorpe Feb. 20, 2025, 12:57 a.m. UTC | #9
On Wed, Feb 19, 2025 at 03:30:37PM -0600, Michael Roth wrote:
> I think the documentation only mentioned 1G specifically since that's
> the next level up in host/nested page table mappings, and that more
> generally anything mapping at a higher granularity than 2MB would be
> broken down into individual checks on each 2MB range within. But it's
> quite possible things are handled differently for IOMMU so definitely
> worth confirming.

Hmm, well, I'd very much like it if we are all on the same page as to
why the new kernel parameters were needed. Joerg was definitely seeing
testing failures without them.

IMHO we should not require parameters like that, I expect the kernel
to fix this stuff on its own.

> But regardless, we'll still end up dealing with 4K RMP entries since
> we'll need to split 2MB RMP entries in response to private->conversions
> that aren't 2MB aligned/sized.

:( What is the point of even allowing < 2MB private/shared conversion?

> > Then the HW will not see IOPTEs that exceed the shared/private
> > granularity of the VM.
> 
> That sounds very interesting. It would allow us to use larger IOMMU
> mappings even for guest_memfd as it exists today, while still supporting
> shared memory discard and avoiding the additional host memory usage
> mentioned above. Are there patches available publicly?

https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com

I'm getting quite close to having something non-RFC that just does AMD
and the bare minimum. I will add you two to the CC

Jason
Alexey Kardashevskiy Feb. 20, 2025, 2:29 a.m. UTC | #10
On 20/2/25 00:35, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 11:43:46AM +1100, Alexey Kardashevskiy wrote:
>> On 19/2/25 10:51, Jason Gunthorpe wrote:
>>> On Wed, Feb 19, 2025 at 10:35:28AM +1100, Alexey Kardashevskiy wrote:
>>>
>>>> With in-place conversion, we could map the entire guest once in the HV IOMMU
>>>> and control the Cbit via the guest's IOMMU table (when available). Thanks,
>>>
>>> Isn't it more complicated than that? I understood you need to have a
>>> IOPTE boundary in the hypervisor at any point where the guest Cbit
>>> changes - so you can't just dump 1G hypervisor pages to cover the
>>> whole VM, you have to actively resize ioptes?
>>
>> When the guest Cbit changes, only the AMD RMP table requires an update, not
>> necessarily the NPT or IOPTEs.
>> (I may have misunderstood the question, what meaning does "dump 1G pages"
>> have?).
> 
> AFAIK that is not true, if there are mismatches in page size, ie the
> RMP is 2M and the IOPTE is 1G then things do not work properly.


Right, so I misunderstood. When I first replied, I assumed the current 
situation of 4K pages everywhere. IOPTEs larger than RMP entries are 
likely to cause failed RMP checks (confirming now, surprises sometimes
happen). Thanks,


> It is why we had to do this:
> 
>>> This was the whole motivation to adding the page size override kernel
>>> command line.
> 
> commit f0295913c4b4f377c454e06f50c1a04f2f80d9df
> Author: Joerg Roedel <jroedel@suse.de>
> Date:   Thu Sep 5 09:22:40 2024 +0200
> 
>      iommu/amd: Add kernel parameters to limit V1 page-sizes
>      
>      Add two new kernel command line parameters to limit the page-sizes
>      used for v1 page-tables:
>      
>              nohugepages     - Limits page-sizes to 4KiB
>      
>              v2_pgsizes_only - Limits page-sizes to 4Kib/2Mib/1GiB; The
>                                same as the sizes used with v2 page-tables
>      
>      This is needed for multiple scenarios. When assigning devices to
>      SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
>      table, otherwise the device will not be able to access all shared
>      memory.
>      
>      Also, some ATS devices do not work properly with arbitrary IO
>      page-sizes as supported by AMD-Vi, so limiting the sizes used by the
>      driver is a suitable workaround.
>      
>      All-in-all, these parameters are only workarounds until the IOMMU core
>      and related APIs gather the ability to negotiate the page-sizes in a
>      better way.
>      
>      Signed-off-by: Joerg Roedel <jroedel@suse.de>
>      Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
>      Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org
> 
> Jason

Patch

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 3427749bc5ce..457d8eaacd2c 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -53,6 +53,7 @@ 
 #include <linux/overflow.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
+#include <linux/pagemap.h>
 
 #include "double_span.h"
 #include "io_pagetable.h"
@@ -850,6 +851,88 @@  static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
 	return npages_out;
 }
 
+static bool is_guest_memfd(struct file *file)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+
+	return mapping_inaccessible(mapping) && mapping_unevictable(mapping);
+}
+
+static struct folio *guest_memfd_get_pfn(struct file *file, unsigned long index,
+					 unsigned long *pfn, int *max_order)
+{
+	struct folio *folio;
+	int ret = 0;
+
+	folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
+
+	if (IS_ERR(folio))
+		return folio;
+
+	if (folio_test_hwpoison(folio)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-EHWPOISON);
+	}
+
+	*pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
+	if (!max_order)
+		goto unlock_exit;
+
+	/* Refs for unpin_user_page_range_dirty_lock->gup_put_folio(FOLL_PIN) */
+	ret = folio_add_pins(folio, 1);
+	folio_put(folio); /* Drop ref from filemap_grab_folio */
+
+unlock_exit:
+	folio_unlock(folio);
+	if (ret)
+		folio = ERR_PTR(ret);
+
+	return folio;
+}
+
+static long pin_guest_memfd_pages(struct pfn_reader_user *user, loff_t start, unsigned long npages,
+			       struct iopt_pages *pages)
+{
+	unsigned long offset = 0;
+	loff_t uptr = start;
+	long rc = 0;
+
+	for (unsigned long i = 0; i < npages; ++i, uptr += PAGE_SIZE) {
+		unsigned long gfn = 0, pfn = 0;
+		int max_order = 0;
+		struct folio *folio;
+
+		folio = guest_memfd_get_pfn(user->file, uptr >> PAGE_SHIFT, &pfn, &max_order);
+		if (IS_ERR(folio))
+			rc = PTR_ERR(folio);
+
+		if (rc == -EINVAL && i == 0) {
+			pr_err_once("Must be vfio mmio at gfn=%lx pfn=%lx, skipping\n", gfn, pfn);
+			return rc;
+		}
+
+		if (rc) {
+			pr_err("%s: %ld %ld %lx -> %lx\n", __func__,
+			       rc, i, (unsigned long) uptr, (unsigned long) pfn);
+			break;
+		}
+
+		if (i == 0)
+			offset = offset_in_folio(folio, start);
+
+		user->ufolios[i] = folio;
+	}
+
+	if (!rc) {
+		rc = npages;
+		user->ufolios_next = user->ufolios;
+		user->ufolios_offset = offset;
+	}
+
+	return rc;
+}
+
 static int pfn_reader_user_pin(struct pfn_reader_user *user,
 			       struct iopt_pages *pages,
 			       unsigned long start_index,
@@ -903,7 +986,10 @@  static int pfn_reader_user_pin(struct pfn_reader_user *user,
 
 	if (user->file) {
 		start = pages->start + (start_index * PAGE_SIZE);
-		rc = pin_memfd_pages(user, start, npages);
+		if (is_guest_memfd(user->file))
+			rc = pin_guest_memfd_pages(user, start, npages, pages);
+		else
+			rc = pin_memfd_pages(user, start, npages);
 	} else if (!remote_mm) {
 		uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
 		rc = pin_user_pages_fast(uptr, npages, user->gup_flags,