diff mbox series

[v10,2/6] arm64: kvm: Introduce MTE VM feature

Message ID 20210312151902.17853-3-steven.price@arm.com (mailing list archive)
State New, archived
Headers show
Series MTE support for KVM guest | expand

Commit Message

Steven Price March 12, 2021, 3:18 p.m. UTC
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  3 +++
 arch/arm64/kvm/hyp/exception.c       |  3 ++-
 arch/arm64/kvm/mmu.c                 | 16 ++++++++++++++++
 arch/arm64/kvm/sys_regs.c            |  3 +++
 include/uapi/linux/kvm.h             |  1 +
 6 files changed, 28 insertions(+), 1 deletion(-)

Comments

Catalin Marinas March 27, 2021, 3:23 p.m. UTC | #1
On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..b31b7a821f90 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  							   &pfn, &fault_ipa);
> +
> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {

This pfn_valid() check may be problematic. Following commit eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
true for ZONE_DEVICE memory but such memory is allowed not to support
MTE.

I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
even without virtualisation.

> +		/*
> +		 * VM will be able to see the page's tags, so we must ensure
> +		 * they have been initialised. if PG_mte_tagged is set, tags
> +		 * have already been initialised.
> +		 */
> +		struct page *page = pfn_to_page(pfn);
> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> +
> +		for (i = 0; i < nr_pages; i++, page++) {
> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +				mte_clear_page_tags(page_address(page));
> +		}
> +	}
> +
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>
Catalin Marinas March 28, 2021, 12:21 p.m. UTC | #2
On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 77cb2d28f2a4..b31b7a821f90 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	if (vma_pagesize == PAGE_SIZE && !force_pte)
> >  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> >  							   &pfn, &fault_ipa);
> > +
> > +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
> > +		/*
> > +		 * VM will be able to see the page's tags, so we must ensure
> > +		 * they have been initialised. if PG_mte_tagged is set, tags
> > +		 * have already been initialised.
> > +		 */
> > +		struct page *page = pfn_to_page(pfn);
> > +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > +
> > +		for (i = 0; i < nr_pages; i++, page++) {
> > +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > +				mte_clear_page_tags(page_address(page));
> > +		}
> > +	}
> 
> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> true for ZONE_DEVICE memory but such memory is allowed not to support
> MTE.

Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
way to detect what gets mapped in the guest as Normal Cacheable memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable the
feature altogether.

> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
> even without virtualisation.

I haven't checked all the code paths but I don't think we can get a
MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
descriptor.
Steven Price March 29, 2021, 4:06 p.m. UTC | #3
On 28/03/2021 13:21, Catalin Marinas wrote:
> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>   	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>   		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>>   							   &pfn, &fault_ipa);
>>> +
>>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
>>> +		/*
>>> +		 * VM will be able to see the page's tags, so we must ensure
>>> +		 * they have been initialised. if PG_mte_tagged is set, tags
>>> +		 * have already been initialised.
>>> +		 */
>>> +		struct page *page = pfn_to_page(pfn);
>>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>> +
>>> +		for (i = 0; i < nr_pages; i++, page++) {
>>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>> +				mte_clear_page_tags(page_address(page));
>>> +		}
>>> +	}
>>
>> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
>> true for ZONE_DEVICE memory but such memory is allowed not to support
>> MTE.
> 
> Some more thinking, this should be safe as any ZONE_DEVICE would be
> mapped as untagged memory in the kernel linear map. It could be slightly
> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> untagged memory. Another overhead is pfn_valid() which will likely end
> up calling memblock_is_map_memory().
> 
> However, the bigger issue is that Stage 2 cannot disable tagging for
> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> way to detect what gets mapped in the guest as Normal Cacheable memory
> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> something else like on-chip memory)?  If we can't guarantee that all
> Cacheable memory given to a guest supports tags, we should disable the
> feature altogether.

In stage 2 I believe we only have two types of mapping - 'normal' or 
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter 
is a case of checking the 'device' variable, and makes sense to avoid 
the overhead you describe.

This should also guarantee that all stage-2 cacheable memory supports 
tags, as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() 
should only be true for memory that Linux considers "normal".

>> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
>> even without virtualisation.
> 
> I haven't checked all the code paths but I don't think we can get a
> MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
> descriptor.
> 

I certainly hope this is the case - it's the weird corner cases of 
device drivers that worry me. E.g. I know i915 has a "hidden" mmap 
behind an ioctl (see i915_gem_mmap_ioctl(), although this case is fine - 
it's MAP_SHARED). Mali's kbase did something similar in the past.

Steve
Catalin Marinas March 30, 2021, 10:30 a.m. UTC | #4
On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> On 28/03/2021 13:21, Catalin Marinas wrote:
> > On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
> > > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 77cb2d28f2a4..b31b7a821f90 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >   	if (vma_pagesize == PAGE_SIZE && !force_pte)
> > > >   		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> > > >   							   &pfn, &fault_ipa);
> > > > +
> > > > +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
> > > > +		/*
> > > > +		 * VM will be able to see the page's tags, so we must ensure
> > > > +		 * they have been initialised. if PG_mte_tagged is set, tags
> > > > +		 * have already been initialised.
> > > > +		 */
> > > > +		struct page *page = pfn_to_page(pfn);
> > > > +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > > > +
> > > > +		for (i = 0; i < nr_pages; i++, page++) {
> > > > +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > > > +				mte_clear_page_tags(page_address(page));
> > > > +		}
> > > > +	}
> > > 
> > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b
> > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> > > true for ZONE_DEVICE memory but such memory is allowed not to support
> > > MTE.
> > 
> > Some more thinking, this should be safe as any ZONE_DEVICE would be
> > mapped as untagged memory in the kernel linear map. It could be slightly
> > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> > untagged memory. Another overhead is pfn_valid() which will likely end
> > up calling memblock_is_map_memory().
> > 
> > However, the bigger issue is that Stage 2 cannot disable tagging for
> > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> > way to detect what gets mapped in the guest as Normal Cacheable memory
> > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> > something else like on-chip memory)?  If we can't guarantee that all
> > Cacheable memory given to a guest supports tags, we should disable the
> > feature altogether.
> 
> In stage 2 I believe we only have two types of mapping - 'normal' or
> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
> case of checking the 'device' variable, and makes sense to avoid the
> overhead you describe.
> 
> This should also guarantee that all stage-2 cacheable memory supports tags,
> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
> be true for memory that Linux considers "normal".

That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So kvm_is_device_pfn() is false for such memory and it may be mapped as
Normal but it is not guaranteed to support tagging.

For user MTE, we get away with this as the MAP_ANONYMOUS requirement
would filter it out while arch_add_memory() will ensure it's mapped as
untagged in the linear map. See another recent fix for hotplugged
memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal
Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged,
only hoplugged memory. Both handled via arch_add_memory() in the arch
code with ZONE_DEVICE starting at devm_memremap_pages().

> > > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
> > > even without virtualisation.
> > 
> > I haven't checked all the code paths but I don't think we can get a
> > MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
> > descriptor.
> 
> I certainly hope this is the case - it's the weird corner cases of device
> drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl
> (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED).
> Mali's kbase did something similar in the past.

I think this should be fine since it's not a MAP_ANONYMOUS (we do allow
MAP_SHARED to be tagged).
David Hildenbrand March 31, 2021, 7:34 a.m. UTC | #5
On 30.03.21 12:30, Catalin Marinas wrote:
> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>    	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>    		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>>>>    							   &pfn, &fault_ipa);
>>>>> +
>>>>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
>>>>> +		/*
>>>>> +		 * VM will be able to see the page's tags, so we must ensure
>>>>> +		 * they have been initialised. if PG_mte_tagged is set, tags
>>>>> +		 * have already been initialised.
>>>>> +		 */
>>>>> +		struct page *page = pfn_to_page(pfn);
>>>>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>> +
>>>>> +		for (i = 0; i < nr_pages; i++, page++) {
>>>>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>> +				mte_clear_page_tags(page_address(page));
>>>>> +		}
>>>>> +	}
>>>>
>>>> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
>>>> true for ZONE_DEVICE memory but such memory is allowed not to support
>>>> MTE.
>>>
>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>> mapped as untagged memory in the kernel linear map. It could be slightly
>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>> untagged memory. Another overhead is pfn_valid() which will likely end
>>> up calling memblock_is_map_memory().
>>>
>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
>>> way to detect what gets mapped in the guest as Normal Cacheable memory
>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
>>> something else like on-chip memory)?  If we can't guarantee that all
>>> Cacheable memory given to a guest supports tags, we should disable the
>>> feature altogether.
>>
>> In stage 2 I believe we only have two types of mapping - 'normal' or
>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
>> case of checking the 'device' variable, and makes sense to avoid the
>> overhead you describe.
>>
>> This should also guarantee that all stage-2 cacheable memory supports tags,
>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
>> be true for memory that Linux considers "normal".

If you think "normal" == "normal System RAM", that's wrong; see below.

> 
> That's the problem. With Anshuman's commit I mentioned above,
> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
> memory, not talking about some I/O mapping that requires Device_nGnRE).
> So kvm_is_device_pfn() is false for such memory and it may be mapped as
> Normal but it is not guaranteed to support tagging.

pfn_valid() means "there is a struct page"; if you do pfn_to_page() and 
touch the page, you won't fault. So Anshuman's commit is correct.

pfn_to_online_page() means, "there is a struct page and it's system RAM 
that's in use; the memmap has a sane content"

> 
> For user MTE, we get away with this as the MAP_ANONYMOUS requirement
> would filter it out while arch_add_memory() will ensure it's mapped as
> untagged in the linear map. See another recent fix for hotplugged
> memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal
> Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged,
> only hoplugged memory. Both handled via arch_add_memory() in the arch
> code with ZONE_DEVICE starting at devm_memremap_pages().
> 
>>>> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
>>>> even without virtualisation.
>>>
>>> I haven't checked all the code paths but I don't think we can get a
>>> MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
>>> descriptor.
>>
>> I certainly hope this is the case - it's the weird corner cases of device
>> drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl
>> (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED).
>> Mali's kbase did something similar in the past.
> 
> I think this should be fine since it's not a MAP_ANONYMOUS (we do allow
> MAP_SHARED to be tagged).
>
Catalin Marinas March 31, 2021, 9:21 a.m. UTC | #6
On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> On 30.03.21 12:30, Catalin Marinas wrote:
> > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
> > > > > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
> > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > > index 77cb2d28f2a4..b31b7a821f90 100644
> > > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > >    	if (vma_pagesize == PAGE_SIZE && !force_pte)
> > > > > >    		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> > > > > >    							   &pfn, &fault_ipa);
> > > > > > +
> > > > > > +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
> > > > > > +		/*
> > > > > > +		 * VM will be able to see the page's tags, so we must ensure
> > > > > > +		 * they have been initialised. if PG_mte_tagged is set, tags
> > > > > > +		 * have already been initialised.
> > > > > > +		 */
> > > > > > +		struct page *page = pfn_to_page(pfn);
> > > > > > +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > > > > > +
> > > > > > +		for (i = 0; i < nr_pages; i++, page++) {
> > > > > > +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > > > > > +				mte_clear_page_tags(page_address(page));
> > > > > > +		}
> > > > > > +	}
> > > > > 
> > > > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b
> > > > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> > > > > true for ZONE_DEVICE memory but such memory is allowed not to support
> > > > > MTE.
> > > > 
> > > > Some more thinking, this should be safe as any ZONE_DEVICE would be
> > > > mapped as untagged memory in the kernel linear map. It could be slightly
> > > > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> > > > untagged memory. Another overhead is pfn_valid() which will likely end
> > > > up calling memblock_is_map_memory().
> > > > 
> > > > However, the bigger issue is that Stage 2 cannot disable tagging for
> > > > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> > > > way to detect what gets mapped in the guest as Normal Cacheable memory
> > > > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> > > > something else like on-chip memory)?  If we can't guarantee that all
> > > > Cacheable memory given to a guest supports tags, we should disable the
> > > > feature altogether.
> > > 
> > > In stage 2 I believe we only have two types of mapping - 'normal' or
> > > DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
> > > case of checking the 'device' variable, and makes sense to avoid the
> > > overhead you describe.
> > > 
> > > This should also guarantee that all stage-2 cacheable memory supports tags,
> > > as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
> > > be true for memory that Linux considers "normal".
> 
> If you think "normal" == "normal System RAM", that's wrong; see below.

By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).

> > That's the problem. With Anshuman's commit I mentioned above,
> > pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
> > memory, not talking about some I/O mapping that requires Device_nGnRE).
> > So kvm_is_device_pfn() is false for such memory and it may be mapped as
> > Normal but it is not guaranteed to support tagging.
> 
> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
> touch the page, you won't fault. So Anshuman's commit is correct.

I agree.

> pfn_to_online_page() means, "there is a struct page and it's system RAM
> that's in use; the memmap has a sane content"

Does pfn_to_online_page() returns a valid struct page pointer for
ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
some definition of system RAM (I assume NVDIMM != system RAM). For
example, pmem_attach_disk() calls devm_memremap_pages() and this would
use the Normal Cacheable memory attribute without necessarily being
system RAM.

So if pfn_valid() is not equivalent to system RAM, we have a potential
issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
this issue and we may need a new term to describe MTE-safe memory. In
the kernel we assume MTE-safe all pages that can be mapped as
MAP_ANONYMOUS and I don't think these include ZONE_DEVICE pages.

Thanks.
David Hildenbrand March 31, 2021, 9:32 a.m. UTC | #7
On 31.03.21 11:21, Catalin Marinas wrote:
> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>> On 30.03.21 12:30, Catalin Marinas wrote:
>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>>     	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>     		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>>>>>>     							   &pfn, &fault_ipa);
>>>>>>> +
>>>>>>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
>>>>>>> +		/*
>>>>>>> +		 * VM will be able to see the page's tags, so we must ensure
>>>>>>> +		 * they have been initialised. if PG_mte_tagged is set, tags
>>>>>>> +		 * have already been initialised.
>>>>>>> +		 */
>>>>>>> +		struct page *page = pfn_to_page(pfn);
>>>>>>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>> +
>>>>>>> +		for (i = 0; i < nr_pages; i++, page++) {
>>>>>>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>> +				mte_clear_page_tags(page_address(page));
>>>>>>> +		}
>>>>>>> +	}
>>>>>>
>>>>>> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to support
>>>>>> MTE.
>>>>>
>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>> mapped as untagged memory in the kernel linear map. It could be slightly
>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>> untagged memory. Another overhead is pfn_valid() which will likely end
>>>>> up calling memblock_is_map_memory().
>>>>>
>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
>>>>> way to detect what gets mapped in the guest as Normal Cacheable memory
>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
>>>>> something else like on-chip memory)?  If we can't guarantee that all
>>>>> Cacheable memory given to a guest supports tags, we should disable the
>>>>> feature altogether.
>>>>
>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>> overhead you describe.
>>>>
>>>> This should also guarantee that all stage-2 cacheable memory supports tags,
>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
>>>> be true for memory that Linux considers "normal".
>>
>> If you think "normal" == "normal System RAM", that's wrong; see below.
> 
> By "normal" I think both Steven and I meant the Normal Cacheable memory
> attribute (another being the Device memory attribute).
> 
>>> That's the problem. With Anshuman's commit I mentioned above,
>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>> Normal but it is not guaranteed to support tagging.
>>
>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>> touch the page, you won't fault. So Anshuman's commit is correct.
> 
> I agree.
> 
>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>> that's in use; the memmap has a sane content"
> 
> Does pfn_to_online_page() returns a valid struct page pointer for
> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
> some definition of system RAM (I assume NVDIMM != system RAM). For
> example, pmem_attach_disk() calls devm_memremap_pages() and this would
> use the Normal Cacheable memory attribute without necessarily being
> system RAM.

No, not for ZONE_DEVICE.

However, if you expose PMEM via dax/kmem as System RAM to the system (-> 
add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or 
ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system 
thinks it's "ordinary system RAM" and the memory is managed by the buddy.

> 
> So if pfn_valid() is not equivalent to system RAM, we have a potential
> issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
> this issue and we may need a new term to describe MTE-safe memory. In
> the kernel we assume MTE-safe all pages that can be mapped as
> MAP_ANONYMOUS and I don't think these include ZONE_DEVICE pages.
> 
> Thanks.
>
Steven Price March 31, 2021, 10:41 a.m. UTC | #8
On 31/03/2021 10:32, David Hildenbrand wrote:
> On 31.03.21 11:21, Catalin Marinas wrote:
>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>>> On 30.03.21 12:30, Catalin Marinas wrote:
>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu 
>>>>>>>> *vcpu, phys_addr_t fault_ipa,
>>>>>>>>         if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>>             vma_pagesize = transparent_hugepage_adjust(memslot, 
>>>>>>>> hva,
>>>>>>>>                                    &pfn, &fault_ipa);
>>>>>>>> +
>>>>>>>> +    if (fault_status != FSC_PERM && kvm_has_mte(kvm) && 
>>>>>>>> pfn_valid(pfn)) {
>>>>>>>> +        /*
>>>>>>>> +         * VM will be able to see the page's tags, so we must 
>>>>>>>> ensure
>>>>>>>> +         * they have been initialised. if PG_mte_tagged is set, 
>>>>>>>> tags
>>>>>>>> +         * have already been initialised.
>>>>>>>> +         */
>>>>>>>> +        struct page *page = pfn_to_page(pfn);
>>>>>>>> +        unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>>> +
>>>>>>>> +        for (i = 0; i < nr_pages; i++, page++) {
>>>>>>>> +            if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>>> +                mte_clear_page_tags(page_address(page));
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>
>>>>>>> This pfn_valid() check may be problematic. Following commit 
>>>>>>> eeb0753ba27b
>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it 
>>>>>>> returns
>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to 
>>>>>>> support
>>>>>>> MTE.
>>>>>>
>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>>> mapped as untagged memory in the kernel linear map. It could be 
>>>>>> slightly
>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>>> untagged memory. Another overhead is pfn_valid() which will likely 
>>>>>> end
>>>>>> up calling memblock_is_map_memory().
>>>>>>
>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is 
>>>>>> there a
>>>>>> way to detect what gets mapped in the guest as Normal Cacheable 
>>>>>> memory
>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE 
>>>>>> (or
>>>>>> something else like on-chip memory)?  If we can't guarantee that all
>>>>>> Cacheable memory given to a guest supports tags, we should disable 
>>>>>> the
>>>>>> feature altogether.
>>>>>
>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the 
>>>>> latter is a
>>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>>> overhead you describe.
>>>>>
>>>>> This should also guarantee that all stage-2 cacheable memory 
>>>>> supports tags,
>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() 
>>>>> should only
>>>>> be true for memory that Linux considers "normal".
>>>
>>> If you think "normal" == "normal System RAM", that's wrong; see below.
>>
>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>> attribute (another being the Device memory attribute).

Sadly there's no good standardised terminology here. Aarch64 provides 
the "normal (cacheable)" definition. Memory which is mapped as "Normal 
Cacheable" is implicitly MTE capable when shared with a guest (because 
the stage 2 mappings don't allow restricting MTE other than mapping it 
as Device memory).

So MTE also forces us to have a definition of memory which is "bog 
standard memory"[1] separate from the mapping attributes. This is the 
main memory which fully supports MTE.

Separate from the "bog standard" we have the "special"[1] memory, e.g. 
ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but 
that memory may not support MTE tags. This memory can only be safely 
shared with a guest in the following situations:

  1. MTE is completely disabled for the guest

  2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)

  3. We have some guarantee that guest MTE access are in some way safe.

(1) is the situation today (without this patch series). But it prevents 
the guest from using MTE in any form.

(2) is pretty terrible for general memory, but is the get-out clause for 
mapping devices into the guest.

(3) isn't something we have any architectural way of discovering. We'd 
need to know what the device did with the MTE accesses (and any caches 
between the CPU and the device) to ensure there aren't any side-channels 
or h/w lockup issues. We'd also need some way of describing this memory 
to the guest.

So at least for the time being the approach is to avoid letting a guest 
with MTE enabled have access to this sort of memory.

[1] Neither "bog standard" nor "special" are real terms - like I said 
there's a lack of standardised terminology.

>>>> That's the problem. With Anshuman's commit I mentioned above,
>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>>> Normal but it is not guaranteed to support tagging.
>>>
>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>>> touch the page, you won't fault. So Anshuman's commit is correct.
>>
>> I agree.
>>
>>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>>> that's in use; the memmap has a sane content"
>>
>> Does pfn_to_online_page() returns a valid struct page pointer for
>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
>> some definition of system RAM (I assume NVDIMM != system RAM). For
>> example, pmem_attach_disk() calls devm_memremap_pages() and this would
>> use the Normal Cacheable memory attribute without necessarily being
>> system RAM.
> 
> No, not for ZONE_DEVICE.
> 
> However, if you expose PMEM via dax/kmem as System RAM to the system (-> 
> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or 
> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system 
> thinks it's "ordinary system RAM" and the memory is managed by the buddy.

So if I'm understanding this correctly for KVM we need to use 
pfn_to_online_pages() and reject if NULL is returned? In the case of 
dax/kmem there already needs to be validation that the memory supports 
MTE (otherwise we break user space) before it's allowed into the 
"ordinary system RAM" bucket.

Steve
David Hildenbrand March 31, 2021, 2:14 p.m. UTC | #9
On 31.03.21 12:41, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
>> On 31.03.21 11:21, Catalin Marinas wrote:
>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>>>> On 30.03.21 12:30, Catalin Marinas wrote:
>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu
>>>>>>>>> *vcpu, phys_addr_t fault_ipa,
>>>>>>>>>          if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>>>              vma_pagesize = transparent_hugepage_adjust(memslot,
>>>>>>>>> hva,
>>>>>>>>>                                     &pfn, &fault_ipa);
>>>>>>>>> +
>>>>>>>>> +    if (fault_status != FSC_PERM && kvm_has_mte(kvm) &&
>>>>>>>>> pfn_valid(pfn)) {
>>>>>>>>> +        /*
>>>>>>>>> +         * VM will be able to see the page's tags, so we must
>>>>>>>>> ensure
>>>>>>>>> +         * they have been initialised. if PG_mte_tagged is set,
>>>>>>>>> tags
>>>>>>>>> +         * have already been initialised.
>>>>>>>>> +         */
>>>>>>>>> +        struct page *page = pfn_to_page(pfn);
>>>>>>>>> +        unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>>>> +
>>>>>>>>> +        for (i = 0; i < nr_pages; i++, page++) {
>>>>>>>>> +            if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>>>> +                mte_clear_page_tags(page_address(page));
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> This pfn_valid() check may be problematic. Following commit
>>>>>>>> eeb0753ba27b
>>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it
>>>>>>>> returns
>>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to
>>>>>>>> support
>>>>>>>> MTE.
>>>>>>>
>>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>>>> mapped as untagged memory in the kernel linear map. It could be
>>>>>>> slightly
>>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>>>> untagged memory. Another overhead is pfn_valid() which will likely
>>>>>>> end
>>>>>>> up calling memblock_is_map_memory().
>>>>>>>
>>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is
>>>>>>> there a
>>>>>>> way to detect what gets mapped in the guest as Normal Cacheable
>>>>>>> memory
>>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE
>>>>>>> (or
>>>>>>> something else like on-chip memory)?  If we can't guarantee that all
>>>>>>> Cacheable memory given to a guest supports tags, we should disable
>>>>>>> the
>>>>>>> feature altogether.
>>>>>>
>>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the
>>>>>> latter is a
>>>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>>>> overhead you describe.
>>>>>>
>>>>>> This should also guarantee that all stage-2 cacheable memory
>>>>>> supports tags,
>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid()
>>>>>> should only
>>>>>> be true for memory that Linux considers "normal".
>>>>
>>>> If you think "normal" == "normal System RAM", that's wrong; see below.
>>>
>>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>>> attribute (another being the Device memory attribute).
> 
> Sadly there's no good standardised terminology here. Aarch64 provides
> the "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because
> the stage 2 mappings don't allow restricting MTE other than mapping it
> as Device memory).
> 
> So MTE also forces us to have a definition of memory which is "bog
> standard memory"[1] separate from the mapping attributes. This is the
> main memory which fully supports MTE.
> 
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
> that memory may not support MTE tags. This memory can only be safely
> shared with a guest in the following situations:
> 
>    1. MTE is completely disabled for the guest
> 
>    2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
> 
>    3. We have some guarantee that guest MTE access are in some way safe.
> 
> (1) is the situation today (without this patch series). But it prevents
> the guest from using MTE in any form.
> 
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
> 
> (3) isn't something we have any architectural way of discovering. We'd
> need to know what the device did with the MTE accesses (and any caches
> between the CPU and the device) to ensure there aren't any side-channels
> or h/w lockup issues. We'd also need some way of describing this memory
> to the guest.
> 
> So at least for the time being the approach is to avoid letting a guest
> with MTE enabled have access to this sort of memory.
> 
> [1] Neither "bog standard" nor "special" are real terms - like I said
> there's a lack of standardised terminology.
> 
>>>>> That's the problem. With Anshuman's commit I mentioned above,
>>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>>>> Normal but it is not guaranteed to support tagging.
>>>>
>>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>>>> touch the page, you won't fault. So Anshuman's commit is correct.
>>>
>>> I agree.
>>>
>>>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>>>> that's in use; the memmap has a sane content"
>>>
>>> Does pfn_to_online_page() returns a valid struct page pointer for
>>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
>>> some definition of system RAM (I assume NVDIMM != system RAM). For
>>> example, pmem_attach_disk() calls devm_memremap_pages() and this would
>>> use the Normal Cacheable memory attribute without necessarily being
>>> system RAM.
>>
>> No, not for ZONE_DEVICE.
>>
>> However, if you expose PMEM via dax/kmem as System RAM to the system (->
>> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or
>> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system
>> thinks it's "ordinary system RAM" and the memory is managed by the buddy.
> 
> So if I'm understanding this correctly for KVM we need to use
> pfn_to_online_pages() and reject if NULL is returned? In the case of
> dax/kmem there already needs to be validation that the memory supports
> MTE (otherwise we break user space) before it's allowed into the
> "ordinary system RAM" bucket.

That should work.

1. One alternative is

if (!pfn_valid(pfn))
	return false;
#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
	return false;
#endif
return true;


Note that when you are dealing with random PFNs, this approach is in 
general not safe; the memmap could be uninitialized and contain garbage. 
You can have false positives for ZONE_DEVICE.


2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is

pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);

if (pgmap)
	return false;
return true;



I know that /dev/mem mappings can be problematic ... because the memmap 
could be in any state and actually we shouldn't even touch/rely on any 
"struct pages" at all, as we have a pure PFN mapping ...
Catalin Marinas March 31, 2021, 6:43 p.m. UTC | #10
On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
> > On 31.03.21 11:21, Catalin Marinas wrote:
> > > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> > > > On 30.03.21 12:30, Catalin Marinas wrote:
> > > > > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > > > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > > > > However, the bigger issue is that Stage 2 cannot disable
> > > > > > > tagging for Stage 1 unless the memory is Non-cacheable or
> > > > > > > Device at S2. Is there a way to detect what gets mapped in
> > > > > > > the guest as Normal Cacheable memory and make sure it's
> > > > > > > only early memory or hotplug but no ZONE_DEVICE (or
> > > > > > > something else like on-chip memory)?  If we can't
> > > > > > > guarantee that all Cacheable memory given to a guest
> > > > > > > supports tags, we should disable the feature altogether.
> > > > > > 
> > > > > > In stage 2 I believe we only have two types of mapping -
> > > > > > 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()).
> > > > > > Filtering out the latter is a case of checking the 'device'
> > > > > > variable, and makes sense to avoid the overhead you
> > > > > > describe.
> > > > > > 
> > > > > > This should also guarantee that all stage-2 cacheable
> > > > > > memory supports tags,
> > > > > > as kvm_is_device_pfn() is simply !pfn_valid(), and
> > > > > > pfn_valid() should only
> > > > > > be true for memory that Linux considers "normal".
> > > > 
> > > > If you think "normal" == "normal System RAM", that's wrong; see
> > > > below.
> > > 
> > > By "normal" I think both Steven and I meant the Normal Cacheable memory
> > > attribute (another being the Device memory attribute).
> 
> Sadly there's no good standardised terminology here. Aarch64 provides the
> "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because the
> stage 2 mappings don't allow restricting MTE other than mapping it as Device
> memory).
> 
> So MTE also forces us to have a definition of memory which is "bog standard
> memory"[1] separate from the mapping attributes. This is the main memory
> which fully supports MTE.
> 
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that
> memory may not support MTE tags. This memory can only be safely shared with
> a guest in the following situations:
> 
>  1. MTE is completely disabled for the guest
> 
>  2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
> 
>  3. We have some guarantee that guest MTE access are in some way safe.
> 
> (1) is the situation today (without this patch series). But it prevents the
> guest from using MTE in any form.
> 
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
> 
> (3) isn't something we have any architectural way of discovering. We'd need
> to know what the device did with the MTE accesses (and any caches between
> the CPU and the device) to ensure there aren't any side-channels or h/w
> lockup issues. We'd also need some way of describing this memory to the
> guest.
> 
> So at least for the time being the approach is to avoid letting a guest with
> MTE enabled have access to this sort of memory.

When a slot is added by the VMM, if it asked MTE in guest (I guess
that's an opt-in by the VMM, haven't checked the other patches), can we
reject it if it's is going to be mapped as Normal Cacheable but it is a
ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
check for ZONE_DEVICE)? This way we don't need to do more expensive
checks in set_pte_at().

We could simplify the set_pte_at() further if we require that the VMM
has a PROT_MTE mapping. This does not mean it cannot have two mappings,
the other without PROT_MTE. But at least we get a set_pte_at() when
swapping in which has PROT_MTE.

We could add another PROT_TAGGED or something which means PG_mte_tagged
set but still mapped as Normal Untagged. It's just that we are short of
pte bits for another flag.

Can we somehow identify when the S2 pte is set and can we get access to
the prior swap pte? This way we could avoid changes to set_pte_at() for
S2 faults.
Steven Price April 7, 2021, 10:20 a.m. UTC | #11
On 31/03/2021 19:43, Catalin Marinas wrote:
> On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote:
>> On 31/03/2021 10:32, David Hildenbrand wrote:
>>> On 31.03.21 11:21, Catalin Marinas wrote:
>>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>>>>> On 30.03.21 12:30, Catalin Marinas wrote:
>>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>>>>> However, the bigger issue is that Stage 2 cannot disable
>>>>>>>> tagging for Stage 1 unless the memory is Non-cacheable or
>>>>>>>> Device at S2. Is there a way to detect what gets mapped in
>>>>>>>> the guest as Normal Cacheable memory and make sure it's
>>>>>>>> only early memory or hotplug but no ZONE_DEVICE (or
>>>>>>>> something else like on-chip memory)?� If we can't
>>>>>>>> guarantee that all Cacheable memory given to a guest
>>>>>>>> supports tags, we should disable the feature altogether.
>>>>>>>
>>>>>>> In stage 2 I believe we only have two types of mapping -
>>>>>>> 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()).
>>>>>>> Filtering out the latter is a case of checking the 'device'
>>>>>>> variable, and makes sense to avoid the overhead you
>>>>>>> describe.
>>>>>>>
>>>>>>> This should also guarantee that all stage-2 cacheable
>>>>>>> memory supports tags,
>>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and
>>>>>>> pfn_valid() should only
>>>>>>> be true for memory that Linux considers "normal".
>>>>>
>>>>> If you think "normal" == "normal System RAM", that's wrong; see
>>>>> below.
>>>>
>>>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>>>> attribute (another being the Device memory attribute).
>>
>> Sadly there's no good standardised terminology here. Aarch64 provides the
>> "normal (cacheable)" definition. Memory which is mapped as "Normal
>> Cacheable" is implicitly MTE capable when shared with a guest (because the
>> stage 2 mappings don't allow restricting MTE other than mapping it as Device
>> memory).
>>
>> So MTE also forces us to have a definition of memory which is "bog standard
>> memory"[1] separate from the mapping attributes. This is the main memory
>> which fully supports MTE.
>>
>> Separate from the "bog standard" we have the "special"[1] memory, e.g.
>> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that
>> memory may not support MTE tags. This memory can only be safely shared with
>> a guest in the following situations:
>>
>>   1. MTE is completely disabled for the guest
>>
>>   2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
>>
>>   3. We have some guarantee that guest MTE access are in some way safe.
>>
>> (1) is the situation today (without this patch series). But it prevents the
>> guest from using MTE in any form.
>>
>> (2) is pretty terrible for general memory, but is the get-out clause for
>> mapping devices into the guest.
>>
>> (3) isn't something we have any architectural way of discovering. We'd need
>> to know what the device did with the MTE accesses (and any caches between
>> the CPU and the device) to ensure there aren't any side-channels or h/w
>> lockup issues. We'd also need some way of describing this memory to the
>> guest.
>>
>> So at least for the time being the approach is to avoid letting a guest with
>> MTE enabled have access to this sort of memory.
> 
> When a slot is added by the VMM, if it asked MTE in guest (I guess
> that's an opt-in by the VMM, haven't checked the other patches), can we
> reject it if it's is going to be mapped as Normal Cacheable but it is a
> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> check for ZONE_DEVICE)? This way we don't need to do more expensive
> checks in set_pte_at().

The problem is that KVM allows the VMM to change the memory backing a 
slot while the guest is running. This is obviously useful for the likes 
of migration, but ultimately means that even if you were to do checks at 
the time of slot creation, you would need to repeat the checks at 
set_pte_at() time to ensure a mischievous VMM didn't swap the page for a 
problematic one.

> We could simplify the set_pte_at() further if we require that the VMM
> has a PROT_MTE mapping. This does not mean it cannot have two mappings,
> the other without PROT_MTE. But at least we get a set_pte_at() when
> swapping in which has PROT_MTE.

That is certainly an option - but from what I've seen of trying to 
implement a VMM to support MTE, the PROT_MTE mapping is not what you 
actually want in user space. Two mappings is possible but is likely to 
complicate the VMM.

> We could add another PROT_TAGGED or something which means PG_mte_tagged
> set but still mapped as Normal Untagged. It's just that we are short of
> pte bits for another flag.

That could help here - although it's slightly odd as you're asking the 
kernel to track the tags, but not allowing user space (direct) access to 
them. Like you say using us the precious bits for this seems like it 
might be short-sighted.

> Can we somehow identify when the S2 pte is set and can we get access to
> the prior swap pte? This way we could avoid changes to set_pte_at() for
> S2 faults.
> 

Unless I'm misunderstanding the code the swap information is (only) 
stored in the stage 1 user-space VMM PTE. When we get a stage 2 fault 
this triggers a corresponding access attempt to the VMM's address space. 
It's at this point when populating the VMM's page tables that the swap 
information is discovered.

The problem at the moment is a mismatch regarding whether the page needs 
tags or not. The VMM's mapping can (currently) be !PROT_MTE which means 
we wouldn't normally require restoring/zeroing the tags. However the 
stage 2 access requires that the tags be preserved. Requiring PROT_MTE 
(or PROT_TAGGED as above) would certainly simplify the handling in the 
kernel.

Of course I did propose the 'requiring PROT_MTE' approach before which 
led to a conversation[1] ending with a conclusion[2] that:

    I'd much rather the kernel just
    provided us with an API for what we want, which is (1) the guest
    RAM as just RAM with no tag checking and separately (2) some
    mechanism yet-to-be-designed which lets us bulk copy a page's
    worth of tags for migration.

Which is what I've implemented ;)

Do you think it's worth investigating the PROT_TAGGED approach as a 
middle ground? My gut feeling is that it's a waste of a VM flag, but I 
agree it would certainly make the code cleaner.

Steve

[1] 
https://lore.kernel.org/kvmarm/CAFEAcA85fiqA206FuFANKbV_3GkfY1F8Gv7MP58BgTT81bs9kA@mail.gmail.com/
[2] 
https://lore.kernel.org/kvmarm/CAFEAcA_K47jKSp46DFK-AKWv6MD1pkrEB6FNz=HNGdxmBDCSbw@mail.gmail.com/
Catalin Marinas April 7, 2021, 3:14 p.m. UTC | #12
On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> On 31/03/2021 19:43, Catalin Marinas wrote:
> > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > that's an opt-in by the VMM, haven't checked the other patches), can we
> > reject it if it's is going to be mapped as Normal Cacheable but it is a
> > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > checks in set_pte_at().
> 
> The problem is that KVM allows the VMM to change the memory backing a slot
> while the guest is running. This is obviously useful for the likes of
> migration, but ultimately means that even if you were to do checks at the
> time of slot creation, you would need to repeat the checks at set_pte_at()
> time to ensure a mischievous VMM didn't swap the page for a problematic one.

Does changing the slot require some KVM API call? Can we intercept it
and do the checks there?

Maybe a better alternative for the time being is to add a new
kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
true _and_ the VMM asked for MTE in guest. We can then only set
PG_mte_tagged if !device.

We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
if we manage to change the behaviour of the architecture.

> > We could add another PROT_TAGGED or something which means PG_mte_tagged
> > set but still mapped as Normal Untagged. It's just that we are short of
> > pte bits for another flag.
> 
> That could help here - although it's slightly odd as you're asking the
> kernel to track the tags, but not allowing user space (direct) access to
> them. Like you say using us the precious bits for this seems like it might
> be short-sighted.

Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
so we already know it's a page potentially containing tags. On
restoring from swap, we need to check for MTE metadata irrespective of
whether the user pte is tagged or not, as you already did in patch 1.
I'll get back to that and look at the potential races.

BTW, after a page is restored from swap, how long do we keep the
metadata around? I think we can delete it as soon as it was restored and
PG_mte_tagged was set. Currently it looks like we only do this when the
actual page was freed or swapoff. I haven't convinced myself that it's
safe to do this for swapoff unless it guarantees that all the ptes
sharing a page have been restored.
David Hildenbrand April 7, 2021, 3:30 p.m. UTC | #13
On 07.04.21 17:14, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>> When a slot is added by the VMM, if it asked for MTE in guest (I guess
>>> that's an opt-in by the VMM, haven't checked the other patches), can we
>>> reject it if it's is going to be mapped as Normal Cacheable but it is a
>>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
>>> check for ZONE_DEVICE)? This way we don't need to do more expensive
>>> checks in set_pte_at().
>>
>> The problem is that KVM allows the VMM to change the memory backing a slot
>> while the guest is running. This is obviously useful for the likes of
>> migration, but ultimately means that even if you were to do checks at the
>> time of slot creation, you would need to repeat the checks at set_pte_at()
>> time to ensure a mischievous VMM didn't swap the page for a problematic one.
> 
> Does changing the slot require some KVM API call? Can we intercept it
> and do the checks there?

User space can simply mmap(MAP_FIXED) the user space area registered in 
a KVM memory slot. You cannot really intercept that. You can only check 
in the KVM MMU code at fault time that the VMA which you hold in your 
hands is still in a proper state. The KVM MMU is synchronous, which 
means that updates to the VMA layout are reflected in the KVM MMU page 
tables -- e.g., via mmu notifier calls.

E.g., in s390x code we cannot handle VMAs with gigantic pages. We check 
that when faulting (creating the links in the page table) via __gmap_link().

You could either check the page itself (ZONE_DEVICE) or might even be 
able to check via the VMA/file.
Steven Price April 7, 2021, 3:52 p.m. UTC | #14
On 07/04/2021 16:14, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>> When a slot is added by the VMM, if it asked for MTE in guest (I guess
>>> that's an opt-in by the VMM, haven't checked the other patches), can we
>>> reject it if it's is going to be mapped as Normal Cacheable but it is a
>>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
>>> check for ZONE_DEVICE)? This way we don't need to do more expensive
>>> checks in set_pte_at().
>>
>> The problem is that KVM allows the VMM to change the memory backing a slot
>> while the guest is running. This is obviously useful for the likes of
>> migration, but ultimately means that even if you were to do checks at the
>> time of slot creation, you would need to repeat the checks at set_pte_at()
>> time to ensure a mischievous VMM didn't swap the page for a problematic one.
> 
> Does changing the slot require some KVM API call? Can we intercept it
> and do the checks there?

As David has already replied - KVM uses MMU notifiers, so there's not 
really a good place to intercept this before the fault.

> Maybe a better alternative for the time being is to add a new
> kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> true _and_ the VMM asked for MTE in guest. We can then only set
> PG_mte_tagged if !device.

KVM already has a kvm_is_device_pfn(), and yes I agree restricting the 
MTE checks to only !kvm_is_device_pfn() makes sense (I have the fix in 
my branch locally).

> We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
> memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
> if we manage to change the behaviour of the architecture.

Indeed, it'll be interesting to see whether people want to build MTE 
capable systems with significant quantities of non-MTE capable memory. 
But for a first stage let's stick with either all guest memory (except 
devices) is MTE or you disable MTE for the guest.

>>> We could add another PROT_TAGGED or something which means PG_mte_tagged
>>> set but still mapped as Normal Untagged. It's just that we are short of
>>> pte bits for another flag.
>>
>> That could help here - although it's slightly odd as you're asking the
>> kernel to track the tags, but not allowing user space (direct) access to
>> them. Like you say using us the precious bits for this seems like it might
>> be short-sighted.
> 
> Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
> so we already know it's a page potentially containing tags. On
> restoring from swap, we need to check for MTE metadata irrespective of
> whether the user pte is tagged or not, as you already did in patch 1.
> I'll get back to that and look at the potential races.
> 
> BTW, after a page is restored from swap, how long do we keep the
> metadata around? I think we can delete it as soon as it was restored and
> PG_mte_tagged was set. Currently it looks like we only do this when the
> actual page was freed or swapoff. I haven't convinced myself that it's
> safe to do this for swapoff unless it guarantees that all the ptes
> sharing a page have been restored.
> 

My initial thought was to free the metadata immediately. However it 
turns out that the following sequence can happen:

  1. Swap out a page
  2. Swap the page in *read only*
  3. Discard the page
  4. Swap the page in again

So there's no writing of the swap data again before (3). This works 
nicely with a swap device because after writing a page it stays there 
forever, so if you know it hasn't been modified it's pointless rewriting 
it. Sadly it's not quite so ideal with the MTE tags which are currently 
kept in RAM. Arguably it would make sense to modify the on-disk swap 
format to include the tags - but that would open a whole new can of worms!

swapoff needs to ensure that all the PTEs have been restored because 
after the swapoff has completed the PTEs will be pointing at a swap 
entry which is no longer valid (and could even have been reallocated to 
point to a new swap device). When you issue you a swapoff, Linux will 
scan the mmlist and the page tables of every process to search for swap 
entry PTEs relating to the swap which is being removed (see try_to_unuse()).

Steve
Catalin Marinas April 8, 2021, 2:18 p.m. UTC | #15
On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
> On 07/04/2021 16:14, Catalin Marinas wrote:
> > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> > > On 31/03/2021 19:43, Catalin Marinas wrote:
> > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > > > that's an opt-in by the VMM, haven't checked the other patches), can we
> > > > reject it if it's is going to be mapped as Normal Cacheable but it is a
> > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > > > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > > > checks in set_pte_at().
> > > 
> > > The problem is that KVM allows the VMM to change the memory backing a slot
> > > while the guest is running. This is obviously useful for the likes of
> > > migration, but ultimately means that even if you were to do checks at the
> > > time of slot creation, you would need to repeat the checks at set_pte_at()
> > > time to ensure a mischievous VMM didn't swap the page for a problematic one.
> > 
> > Does changing the slot require some KVM API call? Can we intercept it
> > and do the checks there?
> 
> As David has already replied - KVM uses MMU notifiers, so there's not really
> a good place to intercept this before the fault.
> 
> > Maybe a better alternative for the time being is to add a new
> > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> > true _and_ the VMM asked for MTE in guest. We can then only set
> > PG_mte_tagged if !device.
> 
> KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
> checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
> locally).

Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
I'd also mark a pfn as 'device' in user_mem_abort() if
pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
Stage 2. It's unlikely that we'll trip over this path but just in case.

(can we have a ZONE_DEVICE _online_ pfn or by definition they are
considered offline?)

> > BTW, after a page is restored from swap, how long do we keep the
> > metadata around? I think we can delete it as soon as it was restored and
> > PG_mte_tagged was set. Currently it looks like we only do this when the
> > actual page was freed or swapoff. I haven't convinced myself that it's
> > safe to do this for swapoff unless it guarantees that all the ptes
> > sharing a page have been restored.
> 
> My initial thought was to free the metadata immediately. However it turns
> out that the following sequence can happen:
> 
>  1. Swap out a page
>  2. Swap the page in *read only*
>  3. Discard the page
>  4. Swap the page in again
> 
> So there's no writing of the swap data again before (3). This works nicely
> with a swap device because after writing a page it stays there forever, so
> if you know it hasn't been modified it's pointless rewriting it. Sadly it's
> not quite so ideal with the MTE tags which are currently kept in RAM.

I missed this scenario. So we need to keep it around as long as the
corresponding swap storage is still valid.
David Hildenbrand April 8, 2021, 6:16 p.m. UTC | #16
On 08.04.21 16:18, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
>> On 07/04/2021 16:14, Catalin Marinas wrote:
>>> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>>>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>>>> When a slot is added by the VMM, if it asked for MTE in guest (I guess
>>>>> that's an opt-in by the VMM, haven't checked the other patches), can we
>>>>> reject it if it's is going to be mapped as Normal Cacheable but it is a
>>>>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
>>>>> check for ZONE_DEVICE)? This way we don't need to do more expensive
>>>>> checks in set_pte_at().
>>>>
>>>> The problem is that KVM allows the VMM to change the memory backing a slot
>>>> while the guest is running. This is obviously useful for the likes of
>>>> migration, but ultimately means that even if you were to do checks at the
>>>> time of slot creation, you would need to repeat the checks at set_pte_at()
>>>> time to ensure a mischievous VMM didn't swap the page for a problematic one.
>>>
>>> Does changing the slot require some KVM API call? Can we intercept it
>>> and do the checks there?
>>
>> As David has already replied - KVM uses MMU notifiers, so there's not really
>> a good place to intercept this before the fault.
>>
>>> Maybe a better alternative for the time being is to add a new
>>> kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
>>> true _and_ the VMM asked for MTE in guest. We can then only set
>>> PG_mte_tagged if !device.
>>
>> KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
>> checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
>> locally).
> 
> Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
> I'd also mark a pfn as 'device' in user_mem_abort() if
> pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
> Stage 2. It's unlikely that we'll trip over this path but just in case.
> 
> (can we have a ZONE_DEVICE _online_ pfn or by definition they are
> considered offline?)

By definition (and implementation) offline. When you get a page = 
pfn_to_online_page() with page != NULL, that one should never be 
ZONE_DEVICE (otherwise it would be a BUG).

As I said, things are different when exposing dax memory via dax/kmem to 
the buddy. But then, we are no longer talking about ZONE_DEVICE.
Catalin Marinas April 8, 2021, 6:21 p.m. UTC | #17
On Thu, Apr 08, 2021 at 08:16:17PM +0200, David Hildenbrand wrote:
> On 08.04.21 16:18, Catalin Marinas wrote:
> > On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
> > > On 07/04/2021 16:14, Catalin Marinas wrote:
> > > > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> > > > > On 31/03/2021 19:43, Catalin Marinas wrote:
> > > > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > > > > > that's an opt-in by the VMM, haven't checked the other patches), can we
> > > > > > reject it if it's is going to be mapped as Normal Cacheable but it is a
> > > > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > > > > > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > > > > > checks in set_pte_at().
> > > > > 
> > > > > The problem is that KVM allows the VMM to change the memory backing a slot
> > > > > while the guest is running. This is obviously useful for the likes of
> > > > > migration, but ultimately means that even if you were to do checks at the
> > > > > time of slot creation, you would need to repeat the checks at set_pte_at()
> > > > > time to ensure a mischievous VMM didn't swap the page for a problematic one.
> > > > 
> > > > Does changing the slot require some KVM API call? Can we intercept it
> > > > and do the checks there?
> > > 
> > > As David has already replied - KVM uses MMU notifiers, so there's not really
> > > a good place to intercept this before the fault.
> > > 
> > > > Maybe a better alternative for the time being is to add a new
> > > > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> > > > true _and_ the VMM asked for MTE in guest. We can then only set
> > > > PG_mte_tagged if !device.
> > > 
> > > KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
> > > checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
> > > locally).
> > 
> > Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
> > I'd also mark a pfn as 'device' in user_mem_abort() if
> > pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
> > Stage 2. It's unlikely that we'll trip over this path but just in case.
> > 
> > (can we have a ZONE_DEVICE _online_ pfn or by definition they are
> > considered offline?)
> 
> By definition (and implementation) offline. When you get a page =
> pfn_to_online_page() with page != NULL, that one should never be ZONE_DEVICE
> (otherwise it would be a BUG).
> 
> As I said, things are different when exposing dax memory via dax/kmem to the
> buddy. But then, we are no longer talking about ZONE_DEVICE.

Thanks David, it's clear now.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@  static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (kvm_has_mte(vcpu->kvm))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..1170ee137096 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@  struct kvm_arch {
 
 	u8 pfr0_csv2;
 	u8 pfr0_csv3;
+	/* Memory Tagging Extension enabled for the guest */
+	bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -767,6 +769,7 @@  bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
 	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@  static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
 	new |= (old & PSR_C_BIT);
 	new |= (old & PSR_V_BIT);
 
-	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+	if (kvm_has_mte(vcpu->kvm))
+		new |= PSR_TCO_BIT;
 
 	new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PAGE_SIZE && !force_pte)
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
+
+	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
+		/*
+		 * VM will be able to see the page's tags, so we must ensure
+		 * they have been initialised. if PG_mte_tagged is set, tags
+		 * have already been initialised.
+		 */
+		struct page *page = pfn_to_page(pfn);
+		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+		for (i = 0; i < nr_pages; i++, page++) {
+			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+				mte_clear_page_tags(page_address(page));
+		}
+	}
+
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4f2f1e3145de..18c87500a7a8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@  static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		break;
 	case SYS_ID_AA64PFR1_EL1:
 		val &= ~FEATURE(ID_AA64PFR1_MTE);
+		if (kvm_has_mte(vcpu->kvm))
+			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+					  ID_AA64PFR1_MTE);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..6dc16c09a2d1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_ARM_MTE 195
 
 #ifdef KVM_CAP_IRQ_ROUTING