KVM: arm64: Drop mte_allowed check during memslot creation

Message ID 20250224093938.3934386-1-aneesh.kumar@kernel.org (mailing list archive)

Commit Message

Aneesh Kumar K.V Feb. 24, 2025, 9:39 a.m. UTC
Before commit d89585fbb308 ("KVM: arm64: unify the tests for VMAs in
memslots when MTE is enabled"), kvm_arch_prepare_memory_region() only
rejected a memory slot if VM_SHARED was set. This commit unified the
checking with user_mem_abort(), with slots being rejected if either
VM_MTE_ALLOWED is not set or VM_SHARED set. A subsequent commit
c911f0d46879 ("KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE
enabled") dropped the VM_SHARED check, so we ended up with memory slots
being rejected if VM_MTE_ALLOWED is not set. This wasn't the case before
the commit d89585fbb308. The rejection of the memory slot with VM_SHARED
set was done to avoid a race condition with the test/set of the
PG_mte_tagged flag. Before Commit d77e59a8fccd ("arm64: mte: Lock a page
for MTE tag initialization") the kernel avoided allowing MTE with shared
pages, thereby preventing two tasks sharing a page from setting up the
PG_mte_tagged flag racily.

Commit d77e59a8fccd ("arm64: mte: Lock a page for MTE tag
initialization") further updated the locking so that the kernel
allows VM_SHARED mapping with MTE. With this commit, we can enable
memslot creation with VM_SHARED VMA mapping.

This patch results in a minor tweak to the ABI. We now allow creating
memslots that don't have the VM_MTE_ALLOWED flag set. If the guest uses
such a memslot with Allocation Tags, the kernel will generate -EFAULT.
ie, instead of failing early, we now fail later during KVM_RUN.

This change is needed because, without it, users are not able to use MTE
with VFIO passthrough (currently the mapping is either Device or
NonCacheable for which tag access check is not applied.), as shown
below (kvmtool VMM).

[  617.921030] vfio-pci 0000:01:00.0: resetting
[  618.024719] vfio-pci 0000:01:00.0: reset done
  Error: 0000:01:00.0: failed to register region with KVM
  Warning: [0abc:aced] Error activating emulation for BAR 0
  Error: 0000:01:00.0: failed to configure regions
  Warning: Failed init: vfio__init

  Fatal: Initialisation failed

Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
 arch/arm64/kvm/mmu.c | 5 -----
 1 file changed, 5 deletions(-)

Comments

Suzuki K Poulose Feb. 24, 2025, 10:32 a.m. UTC | #1
On 24/02/2025 09:39, Aneesh Kumar K.V (Arm) wrote:
> Before commit d89585fbb308 ("KVM: arm64: unify the tests for VMAs in
> memslots when MTE is enabled"), kvm_arch_prepare_memory_region() only
> rejected a memory slot if VM_SHARED was set. This commit unified the
> checking with user_mem_abort(), with slots being rejected if either
> VM_MTE_ALLOWED is not set or VM_SHARED set. A subsequent commit
> c911f0d46879 ("KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE
> enabled") dropped the VM_SHARED check, so we ended up with memory slots
> being rejected if VM_MTE_ALLOWED is not set. This wasn't the case before
> the commit d89585fbb308. The rejection of the memory slot with VM_SHARED
> set was done to avoid a race condition with the test/set of the
> PG_mte_tagged flag. Before Commit d77e59a8fccd ("arm64: mte: Lock a page
> for MTE tag initialization") the kernel avoided allowing MTE with shared
> pages, thereby preventing two tasks sharing a page from setting up the
> PG_mte_tagged flag racily.
> 
> Commit d77e59a8fccd ("arm64: mte: Lock a page for MTE tag
> initialization") further updated the locking so that the kernel
> allows VM_SHARED mapping with MTE. With this commit, we can enable
> memslot creation with VM_SHARED VMA mapping.
> 
> This patch results in a minor tweak to the ABI. We now allow creating
> memslots that don't have the VM_MTE_ALLOWED flag set. If the guest uses
> such a memslot with Allocation Tags, the kernel will generate -EFAULT.
> ie, instead of failing early, we now fail later during KVM_RUN.
> 
> This change is needed because, without it, users are not able to use MTE
> with VFIO passthrough (currently the mapping is either Device or
> NonCacheable for which tag access check is not applied.), as shown
> below (kvmtool VMM).
> 
> [  617.921030] vfio-pci 0000:01:00.0: resetting
> [  618.024719] vfio-pci 0000:01:00.0: reset done
>    Error: 0000:01:00.0: failed to register region with KVM
>    Warning: [0abc:aced] Error activating emulation for BAR 0
>    Error: 0000:01:00.0: failed to configure regions
>    Warning: Failed init: vfio__init
> 
>    Fatal: Initialisation failed
> 
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Catalin Marinas Feb. 24, 2025, 11:05 a.m. UTC | #2
On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Before commit d89585fbb308 ("KVM: arm64: unify the tests for VMAs in
> memslots when MTE is enabled"), kvm_arch_prepare_memory_region() only
> rejected a memory slot if VM_SHARED was set. This commit unified the
> checking with user_mem_abort(), with slots being rejected if either
> VM_MTE_ALLOWED is not set or VM_SHARED set. A subsequent commit
> c911f0d46879 ("KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE
> enabled") dropped the VM_SHARED check, so we ended up with memory slots
> being rejected if VM_MTE_ALLOWED is not set. This wasn't the case before
> the commit d89585fbb308. The rejection of the memory slot with VM_SHARED
> set was done to avoid a race condition with the test/set of the
> PG_mte_tagged flag. Before Commit d77e59a8fccd ("arm64: mte: Lock a page
> for MTE tag initialization") the kernel avoided allowing MTE with shared
> pages, thereby preventing two tasks sharing a page from setting up the
> PG_mte_tagged flag racily.
> 
> Commit d77e59a8fccd ("arm64: mte: Lock a page for MTE tag
> initialization") further updated the locking so that the kernel
> allows VM_SHARED mapping with MTE. With this commit, we can enable
> memslot creation with VM_SHARED VMA mapping.
> 
> This patch results in a minor tweak to the ABI. We now allow creating
> memslots that don't have the VM_MTE_ALLOWED flag set.

As I commented here:

https://lore.kernel.org/r/Z4e04P1bQlFBDHo7@arm.com

I'm fine with the change, we basically go back to the original ABI prior
to relaxing this for VM_SHARED.

> If the guest uses
> such a memslot with Allocation Tags, the kernel will generate -EFAULT.
> ie, instead of failing early, we now fail later during KVM_RUN.

Nit: more like the kernel "will return -EFAULT" to the VMM rather than
"generate".

> This change is needed because, without it, users are not able to use MTE
> with VFIO passthrough (currently the mapping is either Device or
> NonCacheable for which tag access check is not applied.), as shown
> below (kvmtool VMM).

Another nit: "users are not able to use VFIO passthrough when MTE is
enabled". At a first read, the above sounded to me like one wants to
enable MTE for VFIO passthrough mappings.

> [  617.921030] vfio-pci 0000:01:00.0: resetting
> [  618.024719] vfio-pci 0000:01:00.0: reset done
>   Error: 0000:01:00.0: failed to register region with KVM
>   Warning: [0abc:aced] Error activating emulation for BAR 0
>   Error: 0000:01:00.0: failed to configure regions
>   Warning: Failed init: vfio__init
> 
>   Fatal: Initialisation failed
> 
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  arch/arm64/kvm/mmu.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index ef53af6df6de..1f1b5aa43d2d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -2178,11 +2178,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  		if (!vma)
>  			break;
>  
> -		if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) {
> -			ret = -EINVAL;
> -			break;
> -		}
> -
>  		if (vma->vm_flags & VM_PFNMAP) {
>  			/* IO region dirty page logging not allowed */
>  			if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier Feb. 24, 2025, 12:24 p.m. UTC | #3
On Mon, 24 Feb 2025 11:05:33 +0000,
Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
> > Before commit d89585fbb308 ("KVM: arm64: unify the tests for VMAs in
> > memslots when MTE is enabled"), kvm_arch_prepare_memory_region() only
> > rejected a memory slot if VM_SHARED was set. This commit unified the
> > checking with user_mem_abort(), with slots being rejected if either
> > VM_MTE_ALLOWED is not set or VM_SHARED set. A subsequent commit
> > c911f0d46879 ("KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE
> > enabled") dropped the VM_SHARED check, so we ended up with memory slots
> > being rejected if VM_MTE_ALLOWED is not set. This wasn't the case before
> > the commit d89585fbb308. The rejection of the memory slot with VM_SHARED
> > set was done to avoid a race condition with the test/set of the
> > PG_mte_tagged flag. Before Commit d77e59a8fccd ("arm64: mte: Lock a page
> > for MTE tag initialization") the kernel avoided allowing MTE with shared
> > pages, thereby preventing two tasks sharing a page from setting up the
> > PG_mte_tagged flag racily.
> > 
> > Commit d77e59a8fccd ("arm64: mte: Lock a page for MTE tag
> > initialization") further updated the locking so that the kernel
> > allows VM_SHARED mapping with MTE. With this commit, we can enable
> > memslot creation with VM_SHARED VMA mapping.
> > 
> > This patch results in a minor tweak to the ABI. We now allow creating
> > memslots that don't have the VM_MTE_ALLOWED flag set.
> 
> As I commented here:
> 
> https://lore.kernel.org/r/Z4e04P1bQlFBDHo7@arm.com
> 
> I'm fine with the change, we basically go back to the original ABI prior
> to relaxing this for VM_SHARED.
> 
> > If the guest uses
> > such a memslot with Allocation Tags, the kernel will generate -EFAULT.
> > ie, instead of failing early, we now fail later during KVM_RUN.
> 
> Nit: more like the kernel "will return -EFAULT" to the VMM rather than
> "generate".
> 
> > This change is needed because, without it, users are not able to use MTE
> > with VFIO passthrough (currently the mapping is either Device or
> > NonCacheable for which tag access check is not applied.), as shown
> > below (kvmtool VMM).
> 
> Another nit: "users are not able to use VFIO passthrough when MTE is
> enabled". At a first read, the above sounded to me like one wants to
> enable MTE for VFIO passthrough mappings.

What the commit message doesn't spell out is how MTE and VFIO are
interacting here. I also don't understand the reference to Device or
NC memory here.

Isn't the issue that DMA doesn't check/update tags, and therefore it
makes little sense to prevent non-tagged memory being associated with
a memslot?

My other concern is that this gives pretty poor consistency to the
guest, which cannot know what can be tagged and what cannot, and
breaks a guarantee that the guest should be able to rely on.

Thanks,

	M.
Catalin Marinas Feb. 24, 2025, 2:39 p.m. UTC | #4
On Mon, Feb 24, 2025 at 12:24:14PM +0000, Marc Zyngier wrote:
> On Mon, 24 Feb 2025 11:05:33 +0000,
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
> > > This change is needed because, without it, users are not able to use MTE
> > > with VFIO passthrough (currently the mapping is either Device or
> > > NonCacheable for which tag access check is not applied.), as shown
> > > below (kvmtool VMM).
> > 
> > Another nit: "users are not able to use VFIO passthrough when MTE is
> > enabled". At a first read, the above sounded to me like one wants to
> > enable MTE for VFIO passthrough mappings.
> 
> What the commit message doesn't spell out is how MTE and VFIO are
> interacting here. I also don't understand the reference to Device or
> NC memory here.

I guess it's saying that the guest cannot turn MTE on (Normal Tagged)
for these ranges anyway since Stage 2 is Device or Normal NC. So we
don't break any use-case specific to VFIO.

> Isn't the issue that DMA doesn't check/update tags, and therefore it
> makes little sense to prevent non-tagged memory being associated with
> a memslot?

The issue is that some MMIO memory range that does not support MTE
(well, all MMIO) could be mapped by the guest as Normal Tagged and we
have no clue what the hardware does as tag accesses, hence we currently
prevent it altogether. It's not about DMA.

This patch still prevents such MMIO+MTE mappings but moves the decision
to user_mem_abort() and it's slightly more relaxed - only rejecting it
if !VM_MTE_ALLOWED _and_ the Stage 2 is cacheable. The side-effect is
that it allows device assignment into the guest since Stage 2 is not
Normal Cacheable (at least for now, we have some patches from Ankit but they
handle the MTE case).

> My other concern is that this gives pretty poor consistency to the
> guest, which cannot know what can be tagged and what cannot, and
> breaks a guarantee that the guest should be able to rely on.

The guest should not set Normal Tagged on anything other than what it
gets as standard RAM. We are not changing this here. KVM then needs to
prevent a broken/malicious guest from setting MTE on other (physical)
ranges that don't support MTE. Currently it can only do this by forcing
Device or Normal NC (or disable MTE altogether). Later we'll add
FEAT_MTE_PERM to permit Stage 2 Cacheable but trap on tag accesses.

The ABI change is just for the VMM, the guest shouldn't be aware as
long as it sticks to the typical recommendations for MTE - only enable
on standard RAM.

Does any VMM rely on the memory slot being rejected on registration if
it does not support MTE? After this change, we'd get an exit to the VMM
on guest access with MTE turned on (even if it's not mapped as such at
Stage 1).
Marc Zyngier Feb. 24, 2025, 3:02 p.m. UTC | #5
On Mon, 24 Feb 2025 14:39:16 +0000,
Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Mon, Feb 24, 2025 at 12:24:14PM +0000, Marc Zyngier wrote:
> > On Mon, 24 Feb 2025 11:05:33 +0000,
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
> > > > This change is needed because, without it, users are not able to use MTE
> > > > with VFIO passthrough (currently the mapping is either Device or
> > > > NonCacheable for which tag access check is not applied.), as shown
> > > > below (kvmtool VMM).
> > > 
> > > Another nit: "users are not able to use VFIO passthrough when MTE is
> > > enabled". At a first read, the above sounded to me like one wants to
> > > enable MTE for VFIO passthrough mappings.
> > 
> > What the commit message doesn't spell out is how MTE and VFIO are
> > interacting here. I also don't understand the reference to Device or
> > NC memory here.
> 
> I guess it's saying that the guest cannot turn MTE on (Normal Tagged)
> for these ranges anyway since Stage 2 is Device or Normal NC. So we
> don't break any use-case specific to VFIO.
>
> > Isn't the issue that DMA doesn't check/update tags, and therefore it
> > makes little sense to prevent non-tagged memory being associated with
> > a memslot?
> 
> The issue is that some MMIO memory range that does not support MTE
> (well, all MMIO) could be mapped by the guest as Normal Tagged and we
> have no clue what the hardware does as tag accesses, hence we currently
> prevent it altogether. It's not about DMA.
> 
> This patch still prevents such MMIO+MTE mappings but moves the decision
> to user_mem_abort() and it's slightly more relaxed - only rejecting it
> if !VM_MTE_ALLOWED _and_ the Stage 2 is cacheable. The side-effect is
> that it allows device assignment into the guest since Stage 2 is not
> Normal Cacheable (at least for now, we have some patches from Ankit but they
> handle the MTE case).

The other side effect is that it also allows non-tagged cacheable
memory to be given to the MTE-enabled guest, and the guest has no way
to distinguish between what is tagged and what's not.

> 
> > My other concern is that this gives pretty poor consistency to the
> > guest, which cannot know what can be tagged and what cannot, and
> > breaks a guarantee that the guest should be able to rely on.
> 
> The guest should not set Normal Tagged on anything other than what it
> gets as standard RAM. We are not changing this here. KVM then needs to
> prevent a broken/malicious guest from setting MTE on other (physical)
> ranges that don't support MTE. Currently it can only do this by forcing
> Device or Normal NC (or disable MTE altogether). Later we'll add
> FEAT_MTE_PERM to permit Stage 2 Cacheable but trap on tag accesses.
> 
> The ABI change is just for the VMM, the guest shouldn't be aware as
> long as it sticks to the typical recommendations for MTE - only enable
> on standard RAM.

See above. You fall into the same trap with standard memory, since you
now allow userspace to mix things at will, and only realise something
has gone wrong on access (and -EFAULT is not very useful).

>
> Does any VMM rely on the memory slot being rejected on registration if
> it does not support MTE? After this change, we'd get an exit to the VMM
> on guest access with MTE turned on (even if it's not mapped as such at
> Stage 1).

I really don't know what userspace expects w.r.t. mixing tagged and
non-tagged memory. But I don't expect anything good to come out of it,
given that we provide zero information about the fault context.

Honestly, if we are going to change this, then let's make sure we give
enough information for userspace to go and fix the mess. Not just "it
all went wrong".

Thanks,

	M.
Aneesh Kumar K.V Feb. 24, 2025, 4:44 p.m. UTC | #6
Marc Zyngier <maz@kernel.org> writes:

> On Mon, 24 Feb 2025 14:39:16 +0000,
> Catalin Marinas <catalin.marinas@arm.com> wrote:
>>
>> On Mon, Feb 24, 2025 at 12:24:14PM +0000, Marc Zyngier wrote:
>> > On Mon, 24 Feb 2025 11:05:33 +0000,
>> > Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > > On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> > > > This change is needed because, without it, users are not able to use MTE
>> > > > with VFIO passthrough (currently the mapping is either Device or
>> > > > NonCacheable for which tag access check is not applied.), as shown
>> > > > below (kvmtool VMM).
>> > >
>> > > Another nit: "users are not able to use VFIO passthrough when MTE is
>> > > enabled". At a first read, the above sounded to me like one wants to
>> > > enable MTE for VFIO passthrough mappings.
>> >
>> > What the commit message doesn't spell out is how MTE and VFIO are
>> > interacting here. I also don't understand the reference to Device or
>> > NC memory here.
>>
>> I guess it's saying that the guest cannot turn MTE on (Normal Tagged)
>> for these ranges anyway since Stage 2 is Device or Normal NC. So we
>> don't break any use-case specific to VFIO.
>>
>> > Isn't the issue that DMA doesn't check/update tags, and therefore it
>> > makes little sense to prevent non-tagged memory being associated with
>> > a memslot?
>>
>> The issue is that some MMIO memory range that does not support MTE
>> (well, all MMIO) could be mapped by the guest as Normal Tagged and we
>> have no clue what the hardware does as tag accesses, hence we currently
>> prevent it altogether. It's not about DMA.
>>
>> This patch still prevents such MMIO+MTE mappings but moves the decision
>> to user_mem_abort() and it's slightly more relaxed - only rejecting it
>> if !VM_MTE_ALLOWED _and_ the Stage 2 is cacheable. The side-effect is
>> that it allows device assignment into the guest since Stage 2 is not
>> Normal Cacheable (at least for now, we have some patches from Ankit but they
>> handle the MTE case).
>
> The other side effect is that it also allows non-tagged cacheable
> memory to be given to the MTE-enabled guest, and the guest has no way
> to distinguish between what is tagged and what's not.
>
>>
>> > My other concern is that this gives pretty poor consistency to the
>> > guest, which cannot know what can be tagged and what cannot, and
>> > breaks a guarantee that the guest should be able to rely on.
>>
>> The guest should not set Normal Tagged on anything other than what it
>> gets as standard RAM. We are not changing this here. KVM then needs to
>> prevent a broken/malicious guest from setting MTE on other (physical)
>> ranges that don't support MTE. Currently it can only do this by forcing
>> Device or Normal NC (or disable MTE altogether). Later we'll add
>> FEAT_MTE_PERM to permit Stage 2 Cacheable but trap on tag accesses.
>>
>> The ABI change is just for the VMM, the guest shouldn't be aware as
>> long as it sticks to the typical recommendations for MTE - only enable
>> on standard RAM.
>
> See above. You fall into the same trap with standard memory, since you
> now allow userspace to mix things at will, and only realise something
> has gone wrong on access (and -EFAULT is not very useful).
>
>>
>> Does any VMM rely on the memory slot being rejected on registration if
>> it does not support MTE? After this change, we'd get an exit to the VMM
>> on guest access with MTE turned on (even if it's not mapped as such at
>> Stage 1).
>
> I really don't know what userspace expects w.r.t. mixing tagged and
> non-tagged memory. But I don't expect anything good to come out of it,
> given that we provide zero information about the fault context.
>
> Honestly, if we are going to change this, then let's make sure we give
> enough information for userspace to go and fix the mess. Not just "it
> all went wrong".
>

What if we trigger a memory fault exit with the TAGACCESS flag, allowing
the VMM to use the GPA to retrieve additional details and print extra
information to aid in analysis? BTW, we will do this on the first fault
in cacheable, non-tagged memory even if there is no tagaccess in that
region. This can be further improved using the NoTagAccess series I
posted earlier, which ensures the memory fault exit occurs only on
actual tag access.

Something like below?

modified   Documentation/virt/kvm/api.rst
@@ -7121,6 +7121,9 @@ describes properties of the faulting access that are likely pertinent:
  - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
    on a private memory access.  When clear, indicates the fault occurred on a
    shared access.
+ - KVM_MEMORY_EXIT_FLAG_TAGACCESS - When set, indicates the memory fault
+   occurred due to allocation tag access on a memory region that doesn't support
+   allocation tags.
 
 Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
 accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
modified   arch/arm64/kvm/mmu.c
@@ -1695,6 +1695,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		if (mte_allowed) {
 			sanitise_mte_tags(kvm, pfn, vma_pagesize);
 		} else {
+			kvm_prepare_tagaccess_exit(vcpu, gfn << PAGE_SHIFT, PAGE_SIZE);
 			ret = -EFAULT;
 			goto out_unlock;
 		}
modified   include/linux/kvm_host.h
@@ -2489,6 +2489,16 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
+static inline void kvm_prepare_tagaccess_exit(struct kvm_vcpu *vcpu,
+					      gpa_t gpa, gpa_t size)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.flags = KVM_MEMORY_EXIT_FLAG_TAGACCESS;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+}
+
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
modified   include/uapi/linux/kvm.h
@@ -442,6 +442,7 @@ struct kvm_run {
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+#define KVM_MEMORY_EXIT_FLAG_TAGACCESS	(1ULL << 4)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
Marc Zyngier Feb. 24, 2025, 5:23 p.m. UTC | #7
On Mon, 24 Feb 2025 16:44:06 +0000,
Aneesh Kumar K.V <aneesh.kumar@kernel.org> wrote:
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> > On Mon, 24 Feb 2025 14:39:16 +0000,
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>
> >> On Mon, Feb 24, 2025 at 12:24:14PM +0000, Marc Zyngier wrote:
> >> > On Mon, 24 Feb 2025 11:05:33 +0000,
> >> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > > On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
> >> > > > This change is needed because, without it, users are not able to use MTE
> >> > > > with VFIO passthrough (currently the mapping is either Device or
> >> > > > NonCacheable for which tag access check is not applied.), as shown
> >> > > > below (kvmtool VMM).
> >> > >
> >> > > Another nit: "users are not able to use VFIO passthrough when MTE is
> >> > > enabled". At a first read, the above sounded to me like one wants to
> >> > > enable MTE for VFIO passthrough mappings.
> >> >
> >> > What the commit message doesn't spell out is how MTE and VFIO are
> >> > interacting here. I also don't understand the reference to Device or
> >> > NC memory here.
> >>
> >> I guess it's saying that the guest cannot turn MTE on (Normal Tagged)
> >> for these ranges anyway since Stage 2 is Device or Normal NC. So we
> >> don't break any use-case specific to VFIO.
> >>
> >> > Isn't the issue that DMA doesn't check/update tags, and therefore it
> >> > makes little sense to prevent non-tagged memory being associated with
> >> > a memslot?
> >>
> >> The issue is that some MMIO memory range that does not support MTE
> >> (well, all MMIO) could be mapped by the guest as Normal Tagged and we
> >> have no clue what the hardware does as tag accesses, hence we currently
> >> prevent it altogether. It's not about DMA.
> >>
> >> This patch still prevents such MMIO+MTE mappings but moves the decision
> >> to user_mem_abort() and it's slightly more relaxed - only rejecting it
> >> if !VM_MTE_ALLOWED _and_ the Stage 2 is cacheable. The side-effect is
> >> that it allows device assignment into the guest since Stage 2 is not
> >> Normal Cacheable (at least for now, we have some patches from Ankit but they
> >> handle the MTE case).
> >
> > The other side effect is that it also allows non-tagged cacheable
> > memory to be given to the MTE-enabled guest, and the guest has no way
> > to distinguish between what is tagged and what's not.
> >
> >>
> >> > My other concern is that this gives pretty poor consistency to the
> >> > guest, which cannot know what can be tagged and what cannot, and
> >> > breaks a guarantee that the guest should be able to rely on.
> >>
> >> The guest should not set Normal Tagged on anything other than what it
> >> gets as standard RAM. We are not changing this here. KVM then needs to
> >> prevent a broken/malicious guest from setting MTE on other (physical)
> >> ranges that don't support MTE. Currently it can only do this by forcing
> >> Device or Normal NC (or disable MTE altogether). Later we'll add
> >> FEAT_MTE_PERM to permit Stage 2 Cacheable but trap on tag accesses.
> >>
> >> The ABI change is just for the VMM, the guest shouldn't be aware as
> >> long as it sticks to the typical recommendations for MTE - only enable
> >> on standard RAM.
> >
> > See above. You fall into the same trap with standard memory, since you
> > now allow userspace to mix things at will, and only realise something
> > has gone wrong on access (and -EFAULT is not very useful).
> >
> >>
> >> Does any VMM rely on the memory slot being rejected on registration if
> >> it does not support MTE? After this change, we'd get an exit to the VMM
> >> on guest access with MTE turned on (even if it's not mapped as such at
> >> Stage 1).
> >
> > I really don't know what userspace expects w.r.t. mixing tagged and
> > non-tagged memory. But I don't expect anything good to come out of it,
> > given that we provide zero information about the fault context.
> >
> > Honestly, if we are going to change this, then let's make sure we give
> > enough information for userspace to go and fix the mess. Not just "it
> > all went wrong".
> >
> 
> What if we trigger a memory fault exit with the TAGACCESS flag, allowing
> the VMM to use the GPA to retrieve additional details and print extra
> information to aid in analysis? BTW, we will do this on the first fault
> in cacheable, non-tagged memory even if there is no tagaccess in that
> region. This can be further improved using the NoTagAccess series I
> posted earlier, which ensures the memory fault exit occurs only on
> actual tag access.
> 
> Something like below?

Something like that, only with:

- a capability informing userspace of this behaviour

- a per-VM (or per-VMA) flag as a buy-in for that behaviour

- the relaxation is made conditional on the memslot not being memory
(i.e. really MMIO-only).

and keep the current behaviour otherwise.

Thanks,

	M.

Patch

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ef53af6df6de..1f1b5aa43d2d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2178,11 +2178,6 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		if (!vma)
 			break;
 
-		if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) {
-			ret = -EINVAL;
-			break;
-		}
-
 		if (vma->vm_flags & VM_PFNMAP) {
 			/* IO region dirty page logging not allowed */
 			if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {