
[v2,0/1] KVM: arm64: Map GPU memory with no struct pages

Message ID 20241118131958.4609-1-ankita@nvidia.com (mailing list archive)

Ankit Agrawal Nov. 18, 2024, 1:19 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
CPU-accessible, cache-coherent GPU memory. The current KVM code
prevents such memory from being mapped Normal cacheable, and this patch
aims to solve that use case.

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory() and ignores the per-VMA flags that
indicate the memory attributes. This means there is no way for
a VM to get cacheable IO memory (such as from a CXL or pre-CXL device).
In both cases the memory will be forced to DEVICE_nGnRE and the
VM's memory attributes will be ignored.
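
As an illustrative sketch of that behaviour (standalone code, not the
actual kernel implementation; pfn_is_map_memory() is stubbed here with a
made-up boundary):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's pfn_is_map_memory(), which
 * reports whether a pfn is backed by memblock-added kernel memory.
 * The 4GiB boundary below is invented for illustration. */
static bool pfn_is_map_memory(unsigned long pfn)
{
	return pfn < 0x100000; /* pretend pfns below 4GiB are kernel memory */
}

enum s2_memtype { S2_NORMAL, S2_DEVICE_nGnRE };

/* Today's stage-2 attribute selection, as described above: anything
 * outside the kernel map is forced to DEVICE_nGnRE, and the per-VMA
 * memory attributes never enter the decision. */
static enum s2_memtype stage2_memtype_today(unsigned long pfn)
{
	return pfn_is_map_memory(pfn) ? S2_NORMAL : S2_DEVICE_nGnRE;
}
```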

pfn_is_map_memory() is thus restrictive, allowing only memory that has
been added to the kernel to be marked as cacheable. In most cases the
code needs to know whether there is a struct page, or whether the
memory is in the kernel map, and pfn_valid() is an appropriate API for
this. Extend the check to use pfn_valid() so that memory with no
struct pages is also considered for a cacheable stage 2 mapping. A
!pfn_valid() pfn implies that the memory is unsafe to map as cacheable.

Also take care of the following two cases that are unsafe to map
as cacheable:
1. The VMA pgprot may have VM_IO set along with MT_NORMAL or MT_NORMAL_TAGGED.
   Although such a configuration is unexpected and wrong, its presence
   cannot be ruled out.
2. Configurations where VM_MTE_ALLOWED is not set but KVM_CAP_ARM_MTE
   is enabled. Otherwise a malicious guest could enable MTE at stage 1
   without the hypervisor being able to tell. This could cause external
   aborts.
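
The two checks above can be sketched as follows (a standalone
illustration with a hypothetical condensed VMA structure; the real code
inspects vma->vm_flags and the pgprot memory type directly):

```c
#include <stdbool.h>

/* Hypothetical condensed view of the relevant VMA state. */
struct vma_state {
	bool vm_io;          /* VM_IO set in vm_flags */
	bool vm_mte_allowed; /* VM_MTE_ALLOWED set in vm_flags */
	bool pgprot_normal;  /* pgprot is MT_NORMAL or MT_NORMAL_TAGGED */
};

/* Sketch of the safety checks described above. */
static bool stage2_may_map_cacheable(const struct vma_state *vma,
				     bool vm_has_mte)
{
	if (!vma->pgprot_normal)
		return false;
	/* Case 1: VM_IO alongside a Normal memory type is unexpected
	 * and wrong; refuse to map it cacheable. */
	if (vma->vm_io)
		return false;
	/* Case 2: with KVM_CAP_ARM_MTE enabled, the VMA must allow MTE,
	 * or a guest could enable MTE at stage 1 unnoticed. */
	if (vm_has_mte && !vma->vm_mte_allowed)
		return false;
	return true;
}
```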

The GPU memory, such as on the Grace Hopper systems, is interchangeable
with DDR memory and retains its properties. Executable faults should
therefore be allowed on memory determined to be Normal cacheable.

Note that when FWB is not enabled, the kernel expects to do cache
management trivially by flushing the memory, linearly converting a
kvm_pte to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc().
This is only possible for struct-page-backed memory. Do not allow
non-struct-page memory to be cacheable without FWB.
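
A minimal sketch of that gating (standalone illustration; the
struct-page check stands in for pfn_valid() with an invented boundary,
and the real check lives in KVM's fault handling):

```c
#include <stdbool.h>

/* Hypothetical stand-in for pfn_valid(): whether the pfn is backed by
 * a struct page. The boundary is made up for illustration. */
static bool pfn_has_struct_page(unsigned long pfn)
{
	return pfn < 0x100000;
}

/* Without FEAT_FWB, KVM must perform cache maintenance itself by
 * turning the pfn into a kernel virtual address (see
 * kvm_flush_dcache_to_poc()), which only works for struct-page-backed
 * memory. So non-struct-page memory may only be mapped cacheable when
 * FWB is available. */
static bool may_map_cacheable(unsigned long pfn, bool has_fwb)
{
	if (!has_fwb && !pfn_has_struct_page(pfn))
		return false;
	return true;
}
```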

The changes are heavily influenced by the insightful discussions between
Catalin Marinas and Jason Gunthorpe [1] on v1. Many thanks for their
valuable suggestions.

Applied over next-20241117 and tested on the Grace Hopper and
Grace Blackwell platforms by booting up a VM and running several CUDA
workloads. This has not been tested on MTE-enabled hardware; if
someone can give it a try, it would be very helpful.

v1 -> v2
1. Removed kvm_is_device_pfn() as the determiner of device-type memory;
   use pfn_valid() instead.
2. Added handling for MTE.
3. Minor cleanup.

Link: https://lore.kernel.org/lkml/20230907181459.18145-2-ankita@nvidia.com [1]

Ankit Agrawal (1):
  KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

 arch/arm64/include/asm/kvm_pgtable.h |   8 +++
 arch/arm64/kvm/hyp/pgtable.c         |   2 +-
 arch/arm64/kvm/mmu.c                 | 101 +++++++++++++++++++++------
 3 files changed, 87 insertions(+), 24 deletions(-)

Comments

Donald Dutile Nov. 26, 2024, 5:10 p.m. UTC | #1
My email client says this patch: [PATCH v2 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
   is part of a thread for this titled patch.  Is it?

The description has similarities to the above description, but with some additions and some drops.

So, could you clean these two up into (a) a series, or (b) single, separate PATCHes?

Thanks.

- Don

On 11/18/24 8:19 AM, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> [full cover letter quoted; snipped]
Ankit Agrawal Dec. 2, 2024, 4:51 a.m. UTC | #2
Thanks Don.

> My email client says this patch: [PATCH v2 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
>    is part of a thread for this titled patch.  Is it?

Those are supposed to be 2 different patches (0/1 being the cover letter, 1/1 the primary patch).

> So, could you clean these two up into (a) a series, or (b) single, separate PATCHes?

Yeah, I'll send it as a patch with a cover letter in a series when I send the next version.
Will Deacon Dec. 10, 2024, 2:07 p.m. UTC | #3
On Mon, Nov 18, 2024 at 01:19:57PM +0000, ankita@nvidia.com wrote:
> The changes are heavily influenced by the insightful discussions between
> Catalin Marinas and Jason Gunthorpe [1] on v1. Many thanks for their
> valuable suggestions.
> 
> Link: https://lore.kernel.org/lkml/20230907181459.18145-2-ankita@nvidia.com [1]

That's a different series, no? It got merged at v9:

https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/

Will
Jason Gunthorpe Dec. 10, 2024, 2:18 p.m. UTC | #4
On Tue, Dec 10, 2024 at 02:07:40PM +0000, Will Deacon wrote:
> On Mon, Nov 18, 2024 at 01:19:57PM +0000, ankita@nvidia.com wrote:
> > The changes are heavily influenced by the insightful discussions between
> > Catalin Marinas and Jason Gunthorpe [1] on v1. Many thanks for their
> > valuable suggestions.
> > 
> > Link: https://lore.kernel.org/lkml/20230907181459.18145-2-ankita@nvidia.com [1]
> 
> That's a different series, no? It got merged at v9:

I was confused by this too. v1 of that series included this patch; as
that series went along, it became focused only on enabling WC
(Normal-NC) in a VM for device MMIO, and this patch for device
cacheable memory was dropped.

There are two related things:
 1) Device MMIO memory should be able to be Normal-NC in a VM. Already
    merged.
 2) Device cacheable memory (i.e. CXL and pre-CXL coherently attached
    memory) should be Normal cacheable in a VM, even if it doesn't have
    a struct page/etc. (this patch)

IIRC this part was dropped because of the MTE complexity that
Catalin raised.

Jason
Catalin Marinas Dec. 10, 2024, 2:45 p.m. UTC | #5
On Tue, Dec 10, 2024 at 10:18:06AM -0400, Jason Gunthorpe wrote:
[...]
> IIRC this part was dropped off because of the MTE complexity that
> Catalin raised.

Indeed, we only merged the Normal NC part at the time. I asked Ankit to
drop the cacheable attributes and only focus on getting the other part
merged.
Donald Dutile Dec. 10, 2024, 3:56 p.m. UTC | #6
On 12/10/24 9:18 AM, Jason Gunthorpe wrote:
[...]
Jason & Catalin: thanks for filling in the background on the split.

So, I'm not sure I see what is needed to resolve this patch.
I read Will's reply as asking to split it further along logical lines of functionality;
is there still an MTE complexity that has to be resolved/included in the series?

IOW, I'm looking for a complete, clear set of statements that Ankit can implement to get this (requested) series moving forward, sooner rather than later;
it's already been a year+ to get to this point, and I don't want further ambiguity/uncertainty to drag it out more than needed.

Thanks... Don
Catalin Marinas Dec. 10, 2024, 4:08 p.m. UTC | #7
On Tue, Dec 10, 2024 at 10:56:43AM -0500, Donald Dutile wrote:
> So, I'm not sure I read what is needed to resolve this patch. I read
> Will's reply to split it further and basically along what logical
> lines of functionality; is there still an MTE complexity that has to
> be resolved/included in the series?

Since MTE is still around, the complexity did not go away. But I need to
properly read the patch and Will's comment and page in the whole
discussion from last year.

There's now FEAT_MTE_PERM as well but we don't have those patches in
yet:

https://lore.kernel.org/r/20241028094014.2596619-1-aneesh.kumar@kernel.org/
Ankit Agrawal Dec. 11, 2024, 3:05 a.m. UTC | #8
> IOW, I'm looking for a complete, clear (set of) statement(s) that
> Ankit can implement to get this (requested) series moving forward,
> sooner than later; it's already been a year+ to get to this point, and
> I don't want further ambiguity/uncertainty to drag it out more than needed.

Yes, I'll be working on this series to bring it to a conclusion. I'll fix
the text and split the patch as suggested. Moreover, I'll see what Catalin
has to say on the MTE part, and based on that I'll resend the patch series.

Thanks
Ankit