
[v2,1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

Message ID 20241118131958.4609-2-ankita@nvidia.com (mailing list archive)
State New, archived
Series KVM: arm64: Map GPU memory with no struct pages

Commit Message

Ankit Agrawal Nov. 18, 2024, 1:19 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

Currently KVM determines if a VMA is pointing at IO memory by checking
pfn_is_map_memory(). However, the MM already gives us a way to tell what
kind of memory it is by inspecting the VMA.

This patch solves the problem where it is possible for the kernel to
have VMAs pointing at cacheable memory without causing
pfn_is_map_memory() to be true, e.g. DAX memremap cases and CXL/pre-CXL
devices. This memory is now properly marked as cacheable in KVM.

pfn_is_map_memory() is restrictive and allows only memory that has been
added to the kernel to be marked as cacheable. In most cases the code
needs to know if there is a struct page, or if the memory is in the
kernel map, and pfn_valid() is an appropriate API for this. Extend the
umbrella with pfn_valid() to include memory with no struct pages for
consideration to be mapped cacheable in stage 2. A !pfn_valid() pfn
implies that the memory is unsafe to be mapped as cacheable.

Moreover take account of the mapping type in the VMA to make a decision
on the mapping. The VMA's pgprot is tested to determine the memory type
with the following mapping:
 pgprot_noncached    MT_DEVICE_nGnRnE   device (or Normal_NC)
 pgprot_writecombine MT_NORMAL_NC       device (or Normal_NC)
 pgprot_device       MT_DEVICE_nGnRE    device (or Normal_NC)
 pgprot_tagged       MT_NORMAL_TAGGED   RAM / Normal
 -                   MT_NORMAL          RAM / Normal

Also take care of the following two cases that prevent the memory from
being safely mapped as cacheable:
1. The VMA pgprot has VM_IO set along with MT_NORMAL or
   MT_NORMAL_TAGGED. Although unexpected and wrong, such a
   configuration cannot be ruled out.
2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
   is enabled. Otherwise a malicious guest can enable MTE at stage 1
   without the hypervisor being able to tell. This could cause external
   aborts.

Introduce a new variable, noncacheable, to represent whether the memory
should not be mapped as cacheable. noncacheable being false implies
the memory is safe to be mapped cacheable. Use this to handle the
aforementioned potentially unsafe cases for cacheable mapping.

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a
kvm_pte to a phys_addr and then to a KVA, see kvm_flush_dcache_to_poc().
This is only possible for struct page backed memory. Do not allow
non-struct page memory to be cacheable without FWB.

Device memory, such as on the Grace Hopper systems, is interchangeable
with DDR memory and retains its properties. Allow executable faults
on memory determined to be Normal cacheable.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |   8 +++
 arch/arm64/kvm/hyp/pgtable.c         |   2 +-
 arch/arm64/kvm/mmu.c                 | 101 +++++++++++++++++++++------
 3 files changed, 87 insertions(+), 24 deletions(-)

Comments

Will Deacon Dec. 10, 2024, 2:13 p.m. UTC | #1
On Mon, Nov 18, 2024 at 01:19:58PM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> This patch solves the problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> The pfn_is_map_memory() is restrictive and allows only for the memory
> that is added to the kernel to be marked as cacheable. In most cases
> the code needs to know if there is a struct page, or if the memory is
> in the kernel map and pfn_valid() is an appropriate API for this.
> Extend the umbrella with pfn_valid() to include memory with no struct
> pages for consideration to be mapped cacheable in stage 2. A !pfn_valid()
> implies that the memory is unsafe to be mapped as cacheable.
> 
> Moreover take account of the mapping type in the VMA to make a decision
> on the mapping. The VMA's pgprot is tested to determine the memory type
> with the following mapping:
>  pgprot_noncached    MT_DEVICE_nGnRnE   device (or Normal_NC)
>  pgprot_writecombine MT_NORMAL_NC       device (or Normal_NC)
>  pgprot_device       MT_DEVICE_nGnRE    device (or Normal_NC)
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM / Normal
>  -                   MT_NORMAL          RAM / Normal
> 
> Also take care of the following two cases that prevents the memory to
> be safely mapped as cacheable:
> 1. The VMA pgprot have VM_IO set alongwith MT_NORMAL or
>    MT_NORMAL_TAGGED. Although unexpected and wrong, presence of such
>    configuration cannot be ruled out.
> 2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
>    is enabled. Otherwise a malicious guest can enable MTE at stage 1
>    without the hypervisor being able to tell. This could cause external
>    aborts.
> 
> Introduce a new variable noncacheable to represent whether the memory
> should not be mapped as cacheable. The noncacheable as false implies
> the memory is safe to be mapped cacheable. Use this to handle the
> aforementioned potentially unsafe cases for cacheable mapping.
> 
> Note when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory by linearly converting a
> kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
> only possibile for struct page backed memory. Do not allow non-struct
> page memory to be cachable without FWB.
> 
> The device memory such as on the Grace Hopper systems is interchangeable
> with DDR memory and retains its properties. Allow executable faults
> on the memory determined as Normal cacheable.

Sorry, but a change this subtle to the arch code is going to need a _much_
better explanation than the rambling text above.

I also suspect that you're trying to do too many things in one patch.
In particular, the changes to execute permission handling absolutely
need to be discussed as a single patch in the series.

Please can you split this up into a set of changes with useful commit
messages so that we can review them?

Jason knows how to do this so you could ask him for help if you're stuck.
Otherwise, there are plenty of examples of well-written commit messages
on the mailing list and in the git history.

Thanks,

Will
Ankit Agrawal Dec. 11, 2024, 2:58 a.m. UTC | #2
Thanks Will for taking a look.

>> The device memory such as on the Grace Hopper systems is interchangeable
>> with DDR memory and retains its properties. Allow executable faults
>> on the memory determined as Normal cacheable.
>
> Sorry, but a change this subtle to the arch code is going to need a _much_
> better explanation than the rambling text above.

Understood, I'll work on the text and try to make it coherent.


> I also suspect that you're trying to do too many things in one patch.
> In particular, the changes to execute permission handling absolutely
> need to be discussed as a single patch in the series.
>
> Please can you split this up in to a set of changes with useful commit
> messages so that we can review them?

Yes. I'll move the execute permission part out into a separate patch.


> Jason knows how to do this so you could ask him for help if you're stuck.
> Otherwise, there are plenty of examples of well-written commit messages
> on the mailing list and in the git history.

Ack.
Will Deacon Dec. 11, 2024, 9:49 p.m. UTC | #3
On Wed, Dec 11, 2024 at 02:58:38AM +0000, Ankit Agrawal wrote:
> Thanks Will for taking a look.
> 
> >> The device memory such as on the Grace Hopper systems is interchangeable
> >> with DDR memory and retains its properties. Allow executable faults
> >> on the memory determined as Normal cacheable.
> >
> > Sorry, but a change this subtle to the arch code is going to need a _much_
> > better explanation than the rambling text above.
> 
> Understood, I'll work on the text and try to make it coherent.

Heh, I thought it was the patch trying to make things coherent :p

Will
Catalin Marinas Dec. 11, 2024, 10:01 p.m. UTC | #4
Hi Ankit,

On Mon, Nov 18, 2024 at 01:19:58PM +0000, ankita@nvidia.com wrote:
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> This patch solves the problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> The pfn_is_map_memory() is restrictive and allows only for the memory
> that is added to the kernel to be marked as cacheable. In most cases
> the code needs to know if there is a struct page, or if the memory is
> in the kernel map and pfn_valid() is an appropriate API for this.
> Extend the umbrella with pfn_valid() to include memory with no struct
> pages for consideration to be mapped cacheable in stage 2. A !pfn_valid()
> implies that the memory is unsafe to be mapped as cacheable.

Please note that a pfn_valid() does not imply the memory is in the
linear map, only that there's a struct page. Some cache maintenance you
do later in the patch may fail. kvm_is_device_pfn() was changed by
commit 873ba463914c ("arm64: decouple check whether pfn is in linear map
from pfn_valid()"), see the log for some explanation.

> Moreover take account of the mapping type in the VMA to make a decision
> on the mapping. The VMA's pgprot is tested to determine the memory type
> with the following mapping:
>  pgprot_noncached    MT_DEVICE_nGnRnE   device (or Normal_NC)
>  pgprot_writecombine MT_NORMAL_NC       device (or Normal_NC)
>  pgprot_device       MT_DEVICE_nGnRE    device (or Normal_NC)
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM / Normal
>  -                   MT_NORMAL          RAM / Normal
> 
> Also take care of the following two cases that prevents the memory to
> be safely mapped as cacheable:
> 1. The VMA pgprot have VM_IO set alongwith MT_NORMAL or
>    MT_NORMAL_TAGGED. Although unexpected and wrong, presence of such
>    configuration cannot be ruled out.
> 2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
>    is enabled. Otherwise a malicious guest can enable MTE at stage 1
>    without the hypervisor being able to tell. This could cause external
>    aborts.

A first point I'd make - we can simplify this a bit and only allow such
a configuration if FWB is present. Do you have a platform without FWB
that needs such a feature?

Another reason for the above is my second point - I don't like relying
on the user mapping memory type for this (at some point we may have
device pass-through without a VMM mapping). Can we use something like a
new VM_FORCE_CACHED flag instead? There's precedent for this with
VM_ALLOW_ANY_UNCACHED.
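
Roughly something like the below (untested sketch only; the flag name, bit
number and the places touched are illustrative, not a proposal of actual code):

        /* include/linux/mm.h -- hypothetical, following the VM_ALLOW_ANY_UNCACHED precedent */
        #define VM_FORCE_CACHED_BIT     40      /* bit number purely illustrative */
        #define VM_FORCE_CACHED         BIT(VM_FORCE_CACHED_BIT)

        /* host driver opting in at mmap time, because it knows the range is
         * safe to map cacheable: */
        vm_flags_set(vma, VM_FORCE_CACHED);

        /* user_mem_abort() would then only consider a cacheable stage 2
         * mapping when the driver opted in: */
        force_cacheable = vma->vm_flags & VM_FORCE_CACHED;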

> Introduce a new variable noncacheable to represent whether the memory
> should not be mapped as cacheable. The noncacheable as false implies
> the memory is safe to be mapped cacheable. Use this to handle the
> aforementioned potentially unsafe cases for cacheable mapping.
> 
> Note when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory by linearly converting a
> kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
> only possibile for struct page backed memory. Do not allow non-struct
> page memory to be cachable without FWB.

I want to be sure we actually have a real case for this for the !FWB
case. One issue is that it introduces a mismatch between the VMM and the
guest mappings that I'd rather not have to deal with. Another is that
we can't guarantee it is mapped in the kernel linear map, pfn_valid()
does not imply this (I'll say this a few times through this patch).

> The device memory such as on the Grace Hopper systems is interchangeable
> with DDR memory and retains its properties. Allow executable faults
> on the memory determined as Normal cacheable.

As Will said, please introduce the exec handling separately, it will be
easier to follow the patches.

The exec fault would require cache maintenance in certain conditions
(depending on CTR_EL0.{DIC,IDC}). Since you introduce some conditions on
pfn_valid() w.r.t. D-cache maintenance, I assume we have similar
restrictions for I/D cache coherency.

> @@ -1430,6 +1425,23 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
> +/*
> + * Determine if the mapping type is normal cacheable.
> + */
> +static bool mapping_type_normal_cacheable(unsigned long mt)
> +{
> +	return (mt == MT_NORMAL || mt == MT_NORMAL_TAGGED);
> +}

Personally I'd not use this at all, maybe at most as a safety check and
warn but I'd rather have an opt-in from the host driver (which could
also ensure that the user mapping is cached).

> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_s2_trans *nested,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1438,8 +1450,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	int ret = 0;
>  	bool write_fault, writable, force_pte = false;
>  	bool exec_fault, mte_allowed;
> -	bool device = false, vfio_allow_any_uc = false;
> +	bool noncacheable = false, vfio_allow_any_uc = false;
>  	unsigned long mmu_seq;
> +	unsigned long mt;
>  	phys_addr_t ipa = fault_ipa;
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> @@ -1568,6 +1581,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>  
> +	mt = mapping_type(vma->vm_page_prot);
> +
> +	/*
> +	 * Check for potentially ineligible or unsafe conditions for
> +	 * cacheable mappings.
> +	 */
> +	if (vma->vm_flags & VM_IO)
> +		noncacheable = true;
> +	else if (!mte_allowed && kvm_has_mte(kvm))
> +		noncacheable = true;
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1591,19 +1615,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;
>  
> -	if (kvm_is_device_pfn(pfn)) {
> -		/*
> -		 * If the page was identified as device early by looking at
> -		 * the VMA flags, vma_pagesize is already representing the
> -		 * largest quantity we can map.  If instead it was mapped
> -		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> -		 * and must not be upgraded.
> -		 *
> -		 * In both cases, we don't let transparent_hugepage_adjust()
> -		 * change things at the last minute.
> -		 */
> -		device = true;
> -	} else if (logging_active && !write_fault) {
> +	/*
> +	 * pfn_valid() indicates to the code if there is a struct page, or
> +	 * if the memory is in the kernel map. Any memory region otherwise
> +	 * is unsafe to be cacheable.
> +	 */
> +	if (!pfn_valid(pfn))
> +		noncacheable = true;

The assumptions here are wrong. pfn_valid() does not imply the memory is
in the kernel map.

> +
> +	if (!noncacheable && logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write
>  		 * fault.
> @@ -1611,7 +1631,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		writable = false;
>  	}
>  
> -	if (exec_fault && device)
> +	/*
> +	 * Do not allow exec fault; unless the memory is determined safely
> +	 * to be Normal cacheable.
> +	 */
> +	if (exec_fault && (noncacheable || !mapping_type_normal_cacheable(mt)))
>  		return -ENOEXEC;
>  
>  	/*
> @@ -1641,10 +1665,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	}
>  
>  	/*
> +	 * If the page was identified as device early by looking at
> +	 * the VMA flags, vma_pagesize is already representing the
> +	 * largest quantity we can map.  If instead it was mapped
> +	 * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> +	 * and must not be upgraded.
> +	 *
> +	 * In both cases, we don't let transparent_hugepage_adjust()
> +	 * change things at the last minute.
> +	 *
>  	 * If we are not forced to use page mapping, check if we are
>  	 * backed by a THP and thus use block mapping if possible.
>  	 */
> -	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
> +	if (vma_pagesize == PAGE_SIZE && !(force_pte || noncacheable)) {
>  		if (fault_is_perm && fault_granule > PAGE_SIZE)
>  			vma_pagesize = fault_granule;
>  		else
> @@ -1658,7 +1691,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		}
>  	}
>  
> -	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
> +	if (!fault_is_perm && !noncacheable && kvm_has_mte(kvm)) {
>  		/* Check the VMM hasn't introduced a new disallowed VMA */
>  		if (mte_allowed) {
>  			sanitise_mte_tags(kvm, pfn, vma_pagesize);
> @@ -1674,7 +1707,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> -	if (device) {
> +	/*
> +	 * If any of the following pgprot modifiers are applied on the pgprot,
> +	 * consider as device memory and map in Stage 2 as device or
> +	 * Normal noncached:
> +	 * pgprot_noncached
> +	 * pgprot_writecombine
> +	 * pgprot_device
> +	 */
> +	if (!mapping_type_normal_cacheable(mt)) {
>  		if (vfio_allow_any_uc)
>  			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
>  		else
> @@ -1684,6 +1725,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		prot |= KVM_PGTABLE_PROT_X;
>  	}

I'd leave the device check in place, maybe rename it to something else
to distinguish from linear map memory (e.g. !pfn_is_map_memory()) and
only deal with the attributes for this device pfn which could be Device,
Normal-NC or Normal-WB depending on the presence of some VM_* flags.
Deny execution in the first patch, introduce it subsequently.

> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map
> +	 *  at the usual spot.
> +	 *
> +	 *  Validate that there is a struct page for the PFN which maps
> +	 *  to the KVA that the flushing code expects.
> +	 */
> +	if (!stage2_has_fwb(pgt) && !(pfn_valid(pfn))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

Cache maintenance relies on memory being mapped. That's what
memblock_is_map_memory() gives us. Hotplugged memory ends up in memblock
as arm64 selects ARCH_KEEP_MEMBLOCK.

> +
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax
>  	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
> -- 
> 2.34.1
>
David Hildenbrand Dec. 20, 2024, 3:42 p.m. UTC | #5
On 18.11.24 14:19, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.

Do you primarily care about VM_PFNMAP/VM_MIXEDMAP VMAs, or also other 
VMA types?

> 
> This patch solves the problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.

Does this only imply worse performance, or does this also affect
correctness? I suspect performance is the problem, correct?

> 
> The pfn_is_map_memory() is restrictive and allows only for the memory
> that is added to the kernel to be marked as cacheable. In most cases
> the code needs to know if there is a struct page, or if the memory is
> in the kernel map and pfn_valid() is an appropriate API for this.
> Extend the umbrella with pfn_valid() to include memory with no struct
> pages for consideration to be mapped cacheable in stage 2. A !pfn_valid()
> implies that the memory is unsafe to be mapped as cacheable.


I do wonder, are there ways we could have a !(VM_PFNMAP/VM_MIXEDMAP) 
where kvm_is_device_pfn() == true? Are these the "DAX memremap cases and 
CXL/pre-CXL" things you describe above, or are they VM_PFNMAP/VM_MIXEDMAP?


It's worth noting that COW VM_PFNMAP/VM_MIXEDMAP mappings are possible
right now, where we could have anon pages mixed with PFN mappings. Of
course, the VMA pgprot only partially applies to the anon pages (esp.
caching attributes).

Likely you assume to never end up with COW VM_PFNMAP -- I think it's
possible when doing a MAP_PRIVATE /dev/mem mapping on systems that allow
for mapping /dev/mem. Maybe one could just reject such cases (if the KVM
PFN lookup code doesn't already reject them, which might just be the case
IIRC).

> 
> Moreover take account of the mapping type in the VMA to make a decision
> on the mapping. The VMA's pgprot is tested to determine the memory type
> with the following mapping:
>   pgprot_noncached    MT_DEVICE_nGnRnE   device (or Normal_NC)
>   pgprot_writecombine MT_NORMAL_NC       device (or Normal_NC)
>   pgprot_device       MT_DEVICE_nGnRE    device (or Normal_NC)
>   pgprot_tagged       MT_NORMAL_TAGGED   RAM / Normal
>   -                   MT_NORMAL          RAM / Normal
> 
> Also take care of the following two cases that prevents the memory to
> be safely mapped as cacheable:
> 1. The VMA pgprot have VM_IO set alongwith MT_NORMAL or
>     MT_NORMAL_TAGGED. Although unexpected and wrong, presence of such
>     configuration cannot be ruled out.
> 2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
>     is enabled. Otherwise a malicious guest can enable MTE at stage 1
>     without the hypervisor being able to tell. This could cause external
>     aborts.
> 
> Introduce a new variable noncacheable to represent whether the memory
> should not be mapped as cacheable. The noncacheable as false implies
> the memory is safe to be mapped cacheable.

Why not use ... "cacheable" ? This sentence would then read as:

"Introduce a new variable "cachable" to represent whether the memory
should be mapped as cacheable."

and maybe even could be dropped completely. :)

But maybe there is a reason for that in the code.

> Use this to handle the
> aforementioned potentially unsafe cases for cacheable mapping.
> 
> Note when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory by linearly converting a
> kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
> only possibile for struct page backed memory. Do not allow non-struct
> page memory to be cachable without FWB.
> 
> The device memory such as on the Grace Hopper systems is interchangeable
> with DDR memory and retains its properties. Allow executable faults
> on the memory determined as Normal cacheable.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
Jason Gunthorpe Jan. 6, 2025, 4:51 p.m. UTC | #6
On Fri, Dec 20, 2024 at 04:42:35PM +0100, David Hildenbrand wrote:
> On 18.11.24 14:19, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > Currently KVM determines if a VMA is pointing at IO memory by checking
> > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > kind of memory it is by inspecting the VMA.
> 
> Do you primarily care about VM_PFNMAP/VM_MIXEDMAP VMAs, or also other VMA
> types?

I think this is exclusively about allowing cacheable memory inside a
VM_PFNMAP VMA (created by VFIO) to remain cacheable inside the guest VM.

> > This patch solves the problems where it is possible for the kernel to
> > have VMAs pointing at cachable memory without causing
> > pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> > devices. This memory is now properly marked as cachable in KVM.
> 
> Does this only imply in worse performance, or does this also affect
> correctness? I suspect performance is the problem, correct?

Correctness. Things like atomics don't work on non-cacheable mappings.

> Maybe one could just reject such cases (if KVM PFN lookup code not
> already rejects them, which might just be that case IIRC).

At least VFIO enforces SHARED or it won't create the VMA.

drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)

This is pretty normal/essential for drivers..

Are you suggesting the VMA flags should be inspected more?
VM_SHARED/PFNMAP before allowing this?

Jason
David Hildenbrand Jan. 8, 2025, 4:09 p.m. UTC | #7
On 06.01.25 17:51, Jason Gunthorpe wrote:
> On Fri, Dec 20, 2024 at 04:42:35PM +0100, David Hildenbrand wrote:
>> On 18.11.24 14:19, ankita@nvidia.com wrote:
>>> From: Ankit Agrawal <ankita@nvidia.com>
>>>
>>> Currently KVM determines if a VMA is pointing at IO memory by checking
>>> pfn_is_map_memory(). However, the MM already gives us a way to tell what
>>> kind of memory it is by inspecting the VMA.
>>
>> Do you primarily care about VM_PFNMAP/VM_MIXEDMAP VMAs, or also other VMA
>> types?
> 
> I think this is exclusively about allowing cachable memory inside a
> VM_PFNMAP VMA (created by VFIO) remain cachable inside the guest VM.

Thanks!

> 
>>> This patch solves the problems where it is possible for the kernel to
>>> have VMAs pointing at cachable memory without causing
>>> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
>>> devices. This memory is now properly marked as cachable in KVM.
>>
>> Does this only imply in worse performance, or does this also affect
>> correctness? I suspect performance is the problem, correct?
> 
> Correctness. Things like atomics don't work on non-cachable mappings.

Hah! This needs to be highlighted in the patch description. And maybe 
this even implies Fixes: etc?

> 
>> Maybe one could just reject such cases (if KVM PFN lookup code not
>> already rejects them, which might just be that case IIRC).
> 
> At least VFIO enforces SHARED or it won't create the VMA.
> 
> drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)

That makes a lot of sense for VFIO.

> 
> This is pretty normal/essential for drivers..
> 
> Are you suggesting the VMA flags should be inspected more?
> VM_SHARED/PFNMAP before allowing this?


I was wondering if we can safely assume that "device PFNs" can only 
exist in VM_PFNMAP mappings. Then we can avoid all that pfn_valid() / 
pfn_is_map_memory() stuff for "obviously not device" memory.

I tried to prototype, but had to give up for now; the code is too
complicated to make quick progress :)
Ankit Agrawal Jan. 10, 2025, 9:04 p.m. UTC | #8
Thanks Catalin for the review. Comments inline.

> Please note that a pfn_valid() does not imply the memory is in the
> linear map, only that there's a struct page. Some cache maintenance you
> do later in the patch may fail. kvm_is_device_pfn() was changed by
> commit 873ba463914c ("arm64: decouple check whether pfn is in linear map
> from pfn_valid()"), see the log for some explanation.

Thanks for the clarification. So, it appears that the original pfn_is_map_memory()
needs to be kept. I am getting the impression from the comments that we should
only add the change to extend the cacheability for the very specific case that we
are trying to address, i.e. when all of the following are true (rough sketch after
the list):
- type is VM_PFNMAP
- user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED)
- The suggested VM_FORCE_CACHED is set.
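
In other words, roughly (untested sketch only; VM_FORCE_CACHED is the
hypothetical flag suggested above and force_cacheable a new local in
user_mem_abort()):

        mt = mapping_type(vma->vm_page_prot);

        /* only the narrow, known-safe case is considered for a cacheable
         * stage 2 mapping */
        if ((vma->vm_flags & VM_PFNMAP) &&
            (vma->vm_flags & VM_FORCE_CACHED) &&        /* hypothetical flag */
            mapping_type_normal_cacheable(mt))
                force_cacheable = true;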

>> Also take care of the following two cases that prevents the memory to
>> be safely mapped as cacheable:
>> 1. The VMA pgprot have VM_IO set alongwith MT_NORMAL or
>>    MT_NORMAL_TAGGED. Although unexpected and wrong, presence of such
>>    configuration cannot be ruled out.
>> 2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
>>    is enabled. Otherwise a malicious guest can enable MTE at stage 1
>>    without the hypervisor being able to tell. This could cause external
>>    aborts.
>
> A first point I'd make - we can simplify this a bit and only allow such
> configuration if FWB is present. Do you have a platform without FWB that
> needs such feature?

No, we don't have a platform without FWB. So I'll check for FWB presence.

> Another reason for the above is my second point - I don't like relying
> on the user mapping memory type for this (at some point we may have
> device pass-through without a VMM mapping). Can we use something like a
> new VM_FORCE_CACHED flag instead? There's precedent for this with
> VM_ALLOW_ANY_UNCACHED.

Ack, this will help better control the affected configurations. I'll introduce
this flag in the next version.

>> Note when FWB is not enabled, the kernel expects to trivially do
>> cache management by flushing the memory by linearly converting a
>> kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
>> only possibile for struct page backed memory. Do not allow non-struct
>> page memory to be cachable without FWB.
>
> I want to be sure we actually have a real case for this for the !FWB
> case. One issue is that it introduces a mismatch between the VMM and the
> guest mappings I'd rather not have to have to deal with. Another is that
> we can't guarantee it is mapped in the kernel linear map, pfn_valid()
> does not imply this (I'll say this a few times through this patch).

I am not aware of such a case. I'll restrict the changes to FWB then.

>> The device memory such as on the Grace Hopper systems is interchangeable
>> with DDR memory and retains its properties. Allow executable faults
>> on the memory determined as Normal cacheable.
>
> As Will said, please introduce the exec handling separately, it will be
> easier to follow the patches.
>
> The exec fault would require cache maintenance in certain conditions
> (depending on CTR_EL0.{DIC,IDC}). Since you introduce some conditions on
> pfn_valid() w.r.t. D-cache maintenance, I assume we have similar
> restrictions for I/D cache coherency.

I suppose if we only do the change to extend to the aforementioned case
of the following being true, the exec fault check could safely stay as it is
in the patch (albeit moved to a separate patch).
- type is VM_PFNMAP
- user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED)
- The suggested VM_FORCE_CACHED is set.

>> +static bool mapping_type_normal_cacheable(unsigned long mt)
>> +{
>> +     return (mt == MT_NORMAL || mt == MT_NORMAL_TAGGED);
>> +}
>
> Personally I'd not use this at all, maybe at most as a safety check and
> warn but I'd rather have an opt-in from the host driver (which could
> also ensure that the user mapping is cached).

Understood, will make it part of the check as mentioned above.

>> +      * pfn_valid() indicates to the code if there is a struct page, or
>> +      * if the memory is in the kernel map. Any memory region otherwise
>> +      * is unsafe to be cacheable.
>> +      */
>> +     if (!pfn_valid(pfn))
>> +             noncacheable = true;
>
> The assumptions here are wrong. pfn_valid() does not imply the memory is
> in the kernel map.

Understood, thanks for the clarification.

>> +     if (!mapping_type_normal_cacheable(mt)) {
>>               if (vfio_allow_any_uc)
>>                       prot |= KVM_PGTABLE_PROT_NORMAL_NC;
>>               else
>> @@ -1684,6 +1725,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>               prot |= KVM_PGTABLE_PROT_X;
>>       }
>
> I'd leave the device check in place, maybe rename it to something else
> to distinguish from linear map memory (e.g. !pfn_is_map_memory()) and
> only deal with the attributes for this device pfn which could be Device,
> Normal-NC or Normal-WB depending on the presence of some VM_* flags.
> Deny execution in the first patch, introduce it subsequently.

Ack.

>> +     /*
>> +      *  When FWB is unsupported KVM needs to do cache flushes
>> +      *  (via dcache_clean_inval_poc()) of the underlying memory. This is
>> +      *  only possible if the memory is already mapped into the kernel map
>> +      *  at the usual spot.
>> +      *
>> +      *  Validate that there is a struct page for the PFN which maps
>> +      *  to the KVA that the flushing code expects.
>> +      */
>> +     if (!stage2_has_fwb(pgt) && !(pfn_valid(pfn))) {
>> +             ret = -EINVAL;
>> +             goto out_unlock;
>> +     }
>
> Cache maintenance relies on memory being mapped. That's what
> memblock_is_map_memory() gives us. Hotplugged memory ends up in memblock
> as arm64 selects ARCH_KEEP_MEMBLOCK.

Ok, so I will replace pfn_valid() with pfn_is_map_memory(). Did I get that right?
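i.e. the !FWB guard from the patch would then look something like:

        /*
         * Cache maintenance needs the pfn to be in the kernel linear map,
         * which is what pfn_is_map_memory()/memblock tells us; pfn_valid()
         * does not guarantee that.
         */
        if (!stage2_has_fwb(pgt) && !pfn_is_map_memory(pfn)) {
                ret = -EINVAL;
                goto out_unlock;
        }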

Apologies for being slow in getting back.

- Ankit Agrawal
Ankit Agrawal Jan. 10, 2025, 9:15 p.m. UTC | #9
>>>> This patch solves the problems where it is possible for the kernel to
>>>> have VMAs pointing at cachable memory without causing
>>>> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
>>>> devices. This memory is now properly marked as cachable in KVM.
>>>
>>> Does this only imply in worse performance, or does this also affect
>>> correctness? I suspect performance is the problem, correct?
>>
>> Correctness. Things like atomics don't work on non-cachable mappings.
>
> Hah! This needs to be highlighted in the patch description. And maybe
> this even implies Fixes: etc?

Understood. I'll put that in the patch description.

>>> Likely you assume to never end up with COW VM_PFNMAP -- I think it's
>>> possible when doing a MAP_PRIVATE /dev/mem mapping on systems that allow
>>> for mapping /dev/mem. Maybe one could just reject such cases (if KVM PFN
>>> lookup code not already rejects them, which might just be that case IIRC).
>>
>> At least VFIO enforces SHARED or it won't create the VMA.
>>
>> drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)
>
> That makes a lot of sense for VFIO.

So, I suppose we don't need to check this? Especially if we only extend the
changes to the following case:
- type is VM_PFNMAP &&
- user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED) &&
- The suggested VM_FORCE_CACHED is set.

- Ankit Agrawal
Jason Gunthorpe Jan. 13, 2025, 4:27 p.m. UTC | #10
On Fri, Jan 10, 2025 at 09:15:53PM +0000, Ankit Agrawal wrote:
> >>>> This patch solves the problems where it is possible for the kernel to
> >>>> have VMAs pointing at cachable memory without causing
> >>>> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> >>>> devices. This memory is now properly marked as cachable in KVM.
> >>>
> >>> Does this only imply in worse performance, or does this also affect
> >>> correctness? I suspect performance is the problem, correct?
> >>
> >> Correctness. Things like atomics don't work on non-cachable mappings.
> >
> > Hah! This needs to be highlighted in the patch description. And maybe
> > this even implies Fixes: etc?
> 
> Understood. I'll put that in the patch description.
> 
> >>> Likely you assume to never end up with COW VM_PFNMAP -- I think it's
> >>> possible when doing a MAP_PRIVATE /dev/mem mapping on systems that allow
> >>> for mapping /dev/mem. Maybe one could just reject such cases (if KVM PFN
> >>> lookup code not already rejects them, which might just be that case IIRC).
> >>
> >> At least VFIO enforces SHARED or it won't create the VMA.
> >>
> >> drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)
> >
> > That makes a lot of sense for VFIO.
> 
> So, I suppose we don't need to check this? Specially if we only extend the
> changes to the following case:
> - type is VM_PFNMAP &&
> - user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED) &&
> - The suggested VM_FORCE_CACHED is set.

Do we really want another weirdly defined VMA flag? I'd really like to
avoid this... How is VFIO going to know any better whether it should set
the flag when the questions seem to be around things like MTE that
have nothing to do with VFIO?

Jason
David Hildenbrand Jan. 14, 2025, 1:17 p.m. UTC | #11
On 13.01.25 17:27, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 09:15:53PM +0000, Ankit Agrawal wrote:
>>>>>> This patch solves the problems where it is possible for the kernel to
>>>>>> have VMAs pointing at cachable memory without causing
>>>>>> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
>>>>>> devices. This memory is now properly marked as cachable in KVM.
>>>>>
>>>>> Does this only imply in worse performance, or does this also affect
>>>>> correctness? I suspect performance is the problem, correct?
>>>>
>>>> Correctness. Things like atomics don't work on non-cachable mappings.
>>>
>>> Hah! This needs to be highlighted in the patch description. And maybe
>>> this even implies Fixes: etc?
>>
>> Understood. I'll put that in the patch description.
>>
>>>>> Likely you assume to never end up with COW VM_PFNMAP -- I think it's
>>>>> possible when doing a MAP_PRIVATE /dev/mem mapping on systems that allow
>>>>> for mapping /dev/mem. Maybe one could just reject such cases (if KVM PFN
>>>>> lookup code not already rejects them, which might just be that case IIRC).
>>>>
>>>> At least VFIO enforces SHARED or it won't create the VMA.
>>>>
>>>> drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)
>>>
>>> That makes a lot of sense for VFIO.
>>
>> So, I suppose we don't need to check this? Specially if we only extend the
>> changes to the following case:

I would check if it is a VM_PFNMAP, and outright refuse any page if 
is_cow_mapping(vma->vm_flags) is true.

>> - type is VM_PFNMAP &&
>> - user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED) &&
>> - The suggested VM_FORCE_CACHED is set.
> 
> Do we really want another weirdly defined VMA flag? I'd really like to
> avoid this..

Agreed.

Can't we do a "this is a weird VM_PFNMAP thing, let's consult the VMA 
prot + whatever PFN information to find out if it is weird-device and 
how we could safely map it?"

Ideally, we'd separate this logic from the "this is a normal VMA that 
doesn't need any such special casing", and even stop playing PFN games 
on these normal VMAs completely.


> How is the VFIO going to know any better if it should set
> the flag when the questions seem to be around things like MTE that
> have nothing to do with VFIO?

I assume MTE does not apply at all to VM_PFNMAP, at least 
arch_calc_vm_flag_bits() tells me that VM_MTE_ALLOWED should never get 
set there.

So for VFIO and friends with VM_PFNMAP mapping, we can completely ignore 
that maybe?
Jason Gunthorpe Jan. 14, 2025, 1:31 p.m. UTC | #12
On Tue, Jan 14, 2025 at 02:17:50PM +0100, David Hildenbrand wrote:

> I assume MTE does not apply at all to VM_PFNMAP, at least
> arch_calc_vm_flag_bits() tells me that VM_MTE_ALLOWED should never get set
> there.

As far as I know, it is completely platform specific what addresses MTE
will work on. For instance, I would expect an MTE capable platform with
CXL to want to make MTE work on the CXL memory too.

However, given we have no way of discovery, limiting MTE to boot time
system memory seems like the right thing to do for now - which we can
achieve by forbidding it from VM_PFNMAP.

Having VFIO add VM_MTE_ALLOWED someday might make sense if someone
solves the discoverability problem.

Jason
Ankit Agrawal Jan. 14, 2025, 11:13 p.m. UTC | #13
Thanks for the responses.

> Do we really want another weirdly defined VMA flag? I'd really like to
> avoid this.. 

I'd let Catalin chime in on this. My take on the reason for his suggestion is
that we want to reduce the affected configs to only the NVIDIA Grace based
systems. The nvgrace-gpu module would be setting the flag and the
new code path would only be applicable there. Or am I missing something here?


>>>>>> Likely you assume to never end up with COW VM_PFNMAP -- I think it's
>>>>>> possible when doing a MAP_PRIVATE /dev/mem mapping on systems that allow
>>>>>> for mapping /dev/mem. Maybe one could just reject such cases (if KVM PFN
>>>>>> lookup code not already rejects them, which might just be that case IIRC).
>>>>>
>>>>> At least VFIO enforces SHARED or it won't create the VMA.
>>>>>
>>>>> drivers/vfio/pci/vfio_pci_core.c:       if ((vma->vm_flags & VM_SHARED) == 0)
>>>>
>>>> That makes a lot of sense for VFIO.
>>>
>>> So, I suppose we don't need to check this? Specially if we only extend the
>>> changes to the following case:
>
> I would check if it is a VM_PFNMAP, and outright refuse any page if
> is_cow_mapping(vma->vm_flags) is true.

So IIUC, I'll add the check to return -EINVAL for
(vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)
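Roughly (sketch only; error path / mmap lock cleanup elided):

        /* refuse COW mappings of VM_PFNMAP ranges outright */
        if ((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags))
                return -EINVAL;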


>>> - type is VM_PFNMAP &&
>>> - user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED) &&
>>> - The suggested VM_FORCE_CACHED is set.
>>
>> Do we really want another weirdly defined VMA flag? I'd really like to
>> avoid this..
>
> Agreed.
> 
> Can't we do a "this is a weird VM_PFNMAP thing, let's consult the VMA
> prot + whatever PFN information to find out if it is weird-device and
> how we could safely map it?"

My understanding was that the new suggested flag VM_FORCE_CACHED
was essentially to represent "whatever PFN information to find out if it is
weird-device". Is there an alternate reliable check to figure this out?


> Ideally, we'd separate this logic from the "this is a normal VMA that
> doesn't need any such special casing", and even stop playing PFN games
> on these normal VMAs completely.

Sorry David, it isn't clear to me what normal VMA means here? I suppose you mean
the original, non-nvgrace related KVM code path for MT_NORMAL memory?


>> I assume MTE does not apply at all to VM_PFNMAP, at least
>> arch_calc_vm_flag_bits() tells me that VM_MTE_ALLOWED should never get set
>> there.
>
> As far as I know, it is completely platform specific what addresses MTE
> will work on. For instance, I would expect a MTE capable platform with
> CXL to want to make MTE work on the CXL memory too.
>
> However, given we have no way of discovery, limiting MTE to boot time
> system memory seems like the right thing to do for now - which we can
> achieve by forbidding it from VM_PFNMAP.
>
> Having VFIO add VM_MTE_ALLOWED someday might make sense if someone
> solves the discoverability problem.

Currently in the patch we check the following. So Jason, is the suggestion that
we simply return an error to forbid such a condition if VM_PFNMAP is set?

- Ankit Agrawal
Jason Gunthorpe Jan. 15, 2025, 2:32 p.m. UTC | #14
On Tue, Jan 14, 2025 at 11:13:48PM +0000, Ankit Agrawal wrote:
> > Do we really want another weirdly defined VMA flag? I'd really like to
> > avoid this.. 
> 
> I'd let Catalin chime in on this. My take of the reason for his suggestion is
> that we want to reduce the affected configs to only the NVIDIA grace based
> systems. The nvgrace-gpu module would be setting the flag and the
> new codepath will only be applicable there. Or am I missing something here?

We cannot add VMA flags that are not clearly defined. The rules for
when the VMA creator should set the flag need to be extremely clear
and well defined.

> > Can't we do a "this is a weird VM_PFNMAP thing, let's consult the VMA
> > prot + whatever PFN information to find out if it is weird-device and
> > how we could safely map it?"
> 
> My understanding was that the new suggested flag VM_FORCE_CACHED
> was essentially to represent "whatever PFN information to find out if it is
> weird-device". Is there an alternate reliable check to figure this out?

For instance, FORCE_CACHED makes no sense: how will the VMA creator
know it should set this flag?

> Currently in the patch we check the following. So Jason, is the suggestion that
> we simply return error to forbid such condition if VM_PFNMAP is set?
> +	else if (!mte_allowed && kvm_has_mte(kvm))

I really don't know enough about MTE, but I would take the position
that VM_PFNMAP does not support MTE, and maybe even that any VMA
without VM_MTE/_ALLOWED does not support MTE?

At least it makes a lot more sense for the VMA creator to indicate
positively that the underlying HW supports MTE.

Jason
Catalin Marinas Jan. 16, 2025, 10:28 p.m. UTC | #15
On Wed, Jan 15, 2025 at 10:32:13AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 14, 2025 at 11:13:48PM +0000, Ankit Agrawal wrote:
> > > Do we really want another weirdly defined VMA flag? I'd really like to
> > > avoid this.. 
> > 
> > I'd let Catalin chime in on this. My take of the reason for his suggestion is
> > that we want to reduce the affected configs to only the NVIDIA grace based
> > systems. The nvgrace-gpu module would be setting the flag and the
> > new codepath will only be applicable there. Or am I missing something here?
> 
> We cannot add VMA flags that are not clearly defined. The rules for
> when the VMA creater should set the flag need to be extermely clear
> and well defined.
> 
> > > Can't we do a "this is a weird VM_PFNMAP thing, let's consult the VMA
> > > prot + whatever PFN information to find out if it is weird-device and
> > > how we could safely map it?"
> > 
> > My understanding was that the new suggested flag VM_FORCE_CACHED
> > was essentially to represent "whatever PFN information to find out if it is
> > weird-device". Is there an alternate reliable check to figure this out?
> 
> For instance FORCE_CACHED makes no sense, how will the VMA creator
> know it should set this flag?
> 
> > Currently in the patch we check the following. So Jason, is the suggestion that
> > we simply return error to forbid such condition if VM_PFNMAP is set?
> > +	else if (!mte_allowed && kvm_has_mte(kvm))
> 
> I really don't know enought about mte, but I would take the position
> that VM_PFNMAP does not support MTE, and maybe it is even any VMA
> without VM_MTE/_ALLOWED does not support MTE?
> 
> At least it makes alost more sense for the VMA creator to indicate
> positively that the underlying HW supports MTE.

Sorry, I didn't get the chance to properly read this thread. I'll try
tomorrow and next week.

Basically I don't care whether MTE is supported on such a vma, I doubt
you'd want to enable MTE anyway. But the way MTE was designed in the Arm
architecture, prior to FEAT_MTE_PERM, it allows a guest to enable MTE at
Stage 1 when Stage 2 is Normal WB Cacheable. We have no idea what enabling
MTE at Stage 1 means if the memory range doesn't support it. It could be
external aborts, SError or simply accessing data (as tags) at random
physical addresses that don't belong to the guest. So if a vma does not
have VM_MTE_ALLOWED, we either disable Stage 2 cacheable or allow it
with FEAT_MTE_PERM (patches from Aneesh on the list). Or, a bigger
hammer, disable MTE in guests (well, not that big a deal, not many
platforms support MTE, especially in the enterprise space).

A second problem, similar to the relaxation to Normal NC we merged last
year: we can't tell what allowing Stage 2 cacheable means (SError etc). That's
why I thought this knowledge lies with the device, KVM doesn't have the
information. Checking vm_page_prot instead of a VM_* flag may work if
it's mapped in user space but this might not always be the case. I don't
see how VM_PFNMAP alone can tell us anything about the access properties
supported by a device address range. Either way, it's the driver setting
vm_page_prot or some VM_* flag. KVM has no clue, it's just a memory
slot.

A third aspect, more of a simplification when reasoning about this, was
to use FWB at Stage 2 to force cacheability and not care about cache
maintenance, especially when such a range might be mapped both in user
space and in the guest.
Jason Gunthorpe Jan. 17, 2025, 2 p.m. UTC | #16
On Thu, Jan 16, 2025 at 10:28:48PM +0000, Catalin Marinas wrote:

> Basically I don't care whether MTE is supported on such vma, I doubt
> you'd want to enable MTE anyway. But the way MTE was designed in the Arm
> architecture, prior to FEAT_MTE_PERM, it allows a guest to enable MTE at
> Stage 1 when Stage 2 is Normal WB Cacheable. We have no idea what enable
> MTE at Stage 1 means if the memory range doesn't support it.

I'm reading Aneesh's cover letter (Add support for NoTagAccess memory
attribute) and it seems like we already have exactly the behavior we
want. If MTE is enabled in the KVM then memory types, like from VFIO,
are not permitted - this looks like it happens during memslot
creation, not in the fault handler.

So I think at this point Ankit's series should rely on that. We never
have a fault on a PFNMAP VMA in an MTE enabled KVM in the first place
because you can't even create a memslot.

After Aneesh's series it would make the memory NoTagAccess (though I
don't understand from the cover letter how this works for MMIO) and
faults will be fully contained.

> with FEAT_MTE_PERM (patches from Aneesh on the list). Or, a bigger
> happen, disable MTE in guests (well, not that big, not many platforms
> supporting MTE, especially in the enterprise space).

As above, it seems we already effectively disable MTE in guests to use
VFIO.

> A second problem, similar to relaxing to Normal NC we merged last year,
> we can't tell what allowing Stage 2 cacheable means (SError etc).

That was a very different argument. In that series KVM was upgrading a
VMA with pgprot noncached to Normal NC; that upgrade was what triggered
the discussions about SError.

For this case the VMA is already pgprot cacheable. KVM is not changing
anything. The KVM S2 will have the same cacheable memory type as the
VMA has in the S1. Thus KVM has no additional responsibility for
safety here.

If using Normal Cacheable on this memory is unsafe then VFIO must not
create such a VMA in the first place.

Today the vfio-grace driver is the only place that creates cacheable
VMAs in VFIO and it is doing so with platform specific knowledge that
this memory is fully cacheable-safe.

> information. Checking vm_page_prot instead of a VM_* flag may work if
> it's mapped in user space but this might not always be the case. 

For this series it is only about mapping VMAs. Some future FD based
mapping for CC is going to also need similar metadata.. I have another
thread about that :)

> I don't see how VM_PFNMAP alone can tell us anything about the
> access properties supported by a device address range. Either way,
> it's the driver setting vm_page_prot or some VM_* flag. KVM has no
> clue, it's just a memory slot.

I think David's point about VM_PFNMAP was to avoid some of the
pfn_valid() logic. If we get VM_PFNMAP we just assume it is non-struct
page and follow the VMA's pgprot.

> A third aspect, more of a simplification when reasoning about this, was
> to use FWB at Stage 2 to force cacheability and not care about cache
> maintenance, especially when such range might be mapped both in user
> space and in the guest.

Yes, I thought we needed this anyhow as KVM can't cache invalidate
non-struct page memory..

Jason
David Hildenbrand Jan. 17, 2025, 4:58 p.m. UTC | #17
>> I don't see how VM_PFNMAP alone can tell us anything about the
>> access properties supported by a device address range. Either way,
>> it's the driver setting vm_page_prot or some VM_* flag. KVM has no
>> clue, it's just a memory slot.
> 
> I think David's point about VM_PFNMAP was to avoid some of the
> pfn_valid() logic. If we get VM_PFNMAP we just assume it is non-struct
> page and follow the VMA's pgprot.

Exactly what I meant.

If we only expect DEVICE stuff in VM_PFNMAP, then we can limit all the
weirdness (and figuring out what to do using pgprot etc) to exactly that.

VM_IO without VM_PFNMAP might be interesting; not sure if we even want 
to support that here.
Jason Gunthorpe Jan. 17, 2025, 5:10 p.m. UTC | #18
On Fri, Jan 17, 2025 at 05:58:29PM +0100, David Hildenbrand wrote:

> VM_IO without VM_PFNMAP might be interesting; not sure if we even want to
> support that here.

The VMA is filled with struct pages of MEMORY_DEVICE_PCI_P2PDMA and is
non-cacheable?

I don't have a use case for that with KVM at least.

Jason
Catalin Marinas Jan. 17, 2025, 6:52 p.m. UTC | #19
On Fri, Jan 17, 2025 at 10:00:50AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 16, 2025 at 10:28:48PM +0000, Catalin Marinas wrote:
> > with FEAT_MTE_PERM (patches from Aneesh on the list). Or, a bigger
> > happen, disable MTE in guests (well, not that big, not many platforms
> > supporting MTE, especially in the enterprise space).
> 
> As above, it seems we already effectively disable MTE in guests to use
> VFIO.

That's fine. Once the NoTagAccess feature gets in, we could allow MTE
here as well.

> > A second problem, similar to relaxing to Normal NC we merged last year,
> > we can't tell what allowing Stage 2 cacheable means (SError etc).
> 
> That was a very different argument. On that series KVM was upgrading a
> VM with pgprot noncached to Normal NC, that upgrade was what triggered
> the discussions about SError.
> 
> For this case the VMA is already pgprot cache. KVM is not changing
> anything. The KVM S2 will have the same Normal NC memory type as the
> VMA has in the S1.  Thus KVM has no additional responsibility for
> safety here.

I agree this is safe. My point was more generic about not allowing
cacheable mappings without some sanity check. Ankit's patch relies on
the pgprot used on the S1 mapping to make this decision. Presumably the
pgprot is set by the host driver.

> > information. Checking vm_page_prot instead of a VM_* flag may work if
> > it's mapped in user space but this might not always be the case. 
> 
> For this series it is only about mapping VMAs. Some future FD based
> mapping for CC is going to also need similar metadata.. I have another
> thread about that :)

How soon would you need that and if you come up with a different
mechanism, shouldn't we unify them early rather than having two methods?

> > I don't see how VM_PFNMAP alone can tell us anything about the
> > access properties supported by a device address range. Either way,
> > it's the driver setting vm_page_prot or some VM_* flag. KVM has no
> > clue, it's just a memory slot.
> 
> I think David's point about VM_PFNMAP was to avoid some of the
> pfn_valid() logic. If we get VM_PFNMAP we just assume it is non-struct
> page and follow the VMA's pgprot.

Ah, ok, thanks for the clarification.

> > A third aspect, more of a simplification when reasoning about this, was
> > to use FWB at Stage 2 to force cacheability and not care about cache
> > maintenance, especially when such range might be mapped both in user
> > space and in the guest.
> 
> Yes, I thought we needed this anyhow as KVM can't cache invalidate
> non-struct page memory..

Looks good. I think Ankit should post a new series factoring out the
exec handling into a separate patch, dropping some of the pfn_valid()
assumptions, and we can take it from there.

I also think some sanity check should be done early in
kvm_arch_prepare_memory_region() like rejecting the slot if it's
cacheable and we don't have FWB. But I'll leave this decision to the KVM
maintainers. We are trying to relax some stuff here as well (see the
NoTagAccess thread).
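
Something like the below perhaps (rough sketch only; how exactly to decide
"cacheable" at slot creation time is the open question):

        /* in kvm_arch_prepare_memory_region(), while walking the new slot's VMAs */
        if ((vma->vm_flags & VM_PFNMAP) &&
            mapping_type_normal_cacheable(mapping_type(vma->vm_page_prot)) &&
            !cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
                return -EINVAL;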
Jason Gunthorpe Jan. 17, 2025, 7:16 p.m. UTC | #20
On Fri, Jan 17, 2025 at 06:52:39PM +0000, Catalin Marinas wrote:
> On Fri, Jan 17, 2025 at 10:00:50AM -0400, Jason Gunthorpe wrote:
> > On Thu, Jan 16, 2025 at 10:28:48PM +0000, Catalin Marinas wrote:
> > > with FEAT_MTE_PERM (patches from Aneesh on the list). Or, a bigger
> > > happen, disable MTE in guests (well, not that big, not many platforms
> > > supporting MTE, especially in the enterprise space).
> > 
> > As above, it seems we already effectively disable MTE in guests to use
> > VFIO.
> 
> That's fine. Once the NoTagAccess feature gets in, we could allow MTE
> here as well.

Yes, we can eventually mark these pages as NoTagAccess and maybe
someday consume a VMA flag from VFIO indicating that MTE works and
NoTagAccess is not required.

> I agree this is safe. My point was more generic about not allowing
> cacheable mappings without some sanity check. Ankit's patch relies on
> the pgprot used on the S1 mapping to make this decision. Presumably the
> pgprot is set by the host driver.

Yes, pgprot is set only by vfio, and pgprot == Normal is the sanity
check for KVM.

> > For this series it is only about mapping VMAs. Some future FD based
> > mapping for CC is going to also need similar metadata.. I have another
> > thread about that :)
> 
> How soon would you need that and if you come up with a different
> mechanism, shouldn't we unify them early rather than having two methods?

Looks to me like a year and a half or more to negotiate this and
complete the required preparation patches. It is big and complex, and
consensus is not obviously converging.

If we had an FD flow now I'd prefer to use it rather than going through
the VMA :(

I have a vague hope that perhaps KVM could see the VMA when it creates
the memslot, then transparently pick up the FD under the VMA and switch
to a singular FD-based mode inside KVM.

Jason
Patch

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index aab04097b505..c41ba415c5e4 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -502,6 +502,14 @@  u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
  */
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
 
+/**
+ * stage2_has_fwb() - Determine whether FWB is supported
+ * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
+ *
+ * Return: True if FWB is supported.
+ */
+bool stage2_has_fwb(struct kvm_pgtable *pgt);
+
 /**
  * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
  * @vtcr:	Content of the VTCR register.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 40bd55966540..4417651e2b1c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -642,7 +642,7 @@  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	return vtcr;
 }
 
-static bool stage2_has_fwb(struct kvm_pgtable *pgt)
+bool stage2_has_fwb(struct kvm_pgtable *pgt)
 {
 	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
 		return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9d46ad57e52..9bf5c3e89d6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -180,11 +180,6 @@  int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
 	return 0;
 }
 
-static bool kvm_is_device_pfn(unsigned long pfn)
-{
-	return !pfn_is_map_memory(pfn);
-}
-
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_memory_cache *mc = arg;
@@ -1430,6 +1425,23 @@  static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from the VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+}
+
+/*
+ * Determine if the mapping type is normal cacheable.
+ */
+static bool mapping_type_normal_cacheable(unsigned long mt)
+{
+	return (mt == MT_NORMAL || mt == MT_NORMAL_TAGGED);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1438,8 +1450,9 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	int ret = 0;
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, mte_allowed;
-	bool device = false, vfio_allow_any_uc = false;
+	bool noncacheable = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
+	unsigned long mt;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
@@ -1568,6 +1581,17 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	mt = mapping_type(vma->vm_page_prot);
+
+	/*
+	 * Check for potentially ineligible or unsafe conditions for
+	 * cacheable mappings.
+	 */
+	if (vma->vm_flags & VM_IO)
+		noncacheable = true;
+	else if (!mte_allowed && kvm_has_mte(kvm))
+		noncacheable = true;
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1591,19 +1615,15 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
-	if (kvm_is_device_pfn(pfn)) {
-		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map.  If instead it was mapped
-		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
-		 */
-		device = true;
-	} else if (logging_active && !write_fault) {
+	/*
+	 * pfn_valid() tells us whether there is a struct page, or whether
+	 * the memory is in the kernel map. Any other memory region is
+	 * unsafe to map as cacheable.
+	 */
+	if (!pfn_valid(pfn))
+		noncacheable = true;
+
+	if (!noncacheable && logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
 		 * fault.
@@ -1611,7 +1631,11 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	if (exec_fault && device)
+	/*
+	 * Do not allow an exec fault unless the memory was determined to
+	 * be safely mappable as Normal cacheable.
+	 */
+	if (exec_fault && (noncacheable || !mapping_type_normal_cacheable(mt)))
 		return -ENOEXEC;
 
 	/*
@@ -1641,10 +1665,19 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 	/*
+	 * If the page was identified as device early by looking at
+	 * the VMA flags, vma_pagesize is already representing the
+	 * largest quantity we can map.  If instead it was mapped
+	 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+	 * and must not be upgraded.
+	 *
+	 * In both cases, we don't let transparent_hugepage_adjust()
+	 * change things at the last minute.
+	 *
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+	if (vma_pagesize == PAGE_SIZE && !(force_pte || noncacheable)) {
 		if (fault_is_perm && fault_granule > PAGE_SIZE)
 			vma_pagesize = fault_granule;
 		else
@@ -1658,7 +1691,7 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		}
 	}
 
-	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
+	if (!fault_is_perm && !noncacheable && kvm_has_mte(kvm)) {
 		/* Check the VMM hasn't introduced a new disallowed VMA */
 		if (mte_allowed) {
 			sanitise_mte_tags(kvm, pfn, vma_pagesize);
@@ -1674,7 +1707,15 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device) {
+	/*
+	 * If any of the following pgprot modifiers was applied to the VMA's
+	 * pgprot, treat the memory as device and map it at stage 2 as
+	 * Device or Normal-NC:
+	 * pgprot_noncached
+	 * pgprot_writecombine
+	 * pgprot_device
+	 */
+	if (!mapping_type_normal_cacheable(mt)) {
 		if (vfio_allow_any_uc)
 			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
 		else
@@ -1684,6 +1725,20 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		prot |= KVM_PGTABLE_PROT_X;
 	}
 
+	/*
+	 * When FWB is unsupported, KVM needs to do cache maintenance
+	 * (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 * only possible if the memory is already mapped into the kernel's
+	 * linear map.
+	 *
+	 * Validate that there is a struct page for the PFN which maps
+	 * to the KVA that the flushing code expects.
+	 */
+	if (!stage2_has_fwb(pgt) && !pfn_valid(pfn)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,