
[v11,2/6] arm64: kvm: Introduce MTE VM feature

Message ID 20210416154309.22129-3-steven.price@arm.com (mailing list archive)
State New, archived
Series MTE support for KVM guest

Commit Message

Steven Price April 16, 2021, 3:43 p.m. UTC
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
mark memory pages touched by the VM as PG_mte_tagged (and clear their
tag storage) to ensure that the guest cannot see stale tags, and so
that the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  3 +++
 arch/arm64/kvm/hyp/exception.c       |  3 ++-
 arch/arm64/kvm/mmu.c                 | 20 ++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c            |  3 +++
 include/uapi/linux/kvm.h             |  1 +
 6 files changed, 32 insertions(+), 1 deletion(-)

Comments

Catalin Marinas April 28, 2021, 5:07 p.m. UTC | #1
On Fri, Apr 16, 2021 at 04:43:05PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..5f8e165ea053 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  							   &pfn, &fault_ipa);
> +
> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
> +	    pfn_valid(pfn)) {

In the current implementation, device == !pfn_valid(), so we could skip
the latter check.

> +		/*
> +		 * VM will be able to see the page's tags, so we must ensure
> +		 * they have been initialised. If PG_mte_tagged is set, tags
> +		 * have already been initialised.
> +		 */
> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> +		struct page *page = pfn_to_online_page(pfn);
> +
> +		if (!page)
> +			return -EFAULT;

I think that's fine, though maybe add a comment that otherwise it
would be mapped at stage 2 as Normal Cacheable and we cannot guarantee
that the memory supports MTE tags.

> +
> +		for (i = 0; i < nr_pages; i++, page++) {
> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +				mte_clear_page_tags(page_address(page));
> +		}
> +	}
> +
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
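
Putting both of the above together (dropping the redundant pfn_valid()
check and adding the comment), the revised hunk might look something
like this - just a sketch, the exact wording is up to Steven:

	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device) {
		/*
		 * VM will be able to see the page's tags, so we must ensure
		 * they have been initialised. If PG_mte_tagged is set, tags
		 * have already been initialised.
		 */
		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
		struct page *page = pfn_to_online_page(pfn);

		/*
		 * Without a struct page the memory would otherwise be mapped
		 * at stage 2 as Normal Cacheable and we cannot guarantee that
		 * it supports MTE tags.
		 */
		if (!page)
			return -EFAULT;

		for (i = 0; i < nr_pages; i++, page++) {
			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
				mte_clear_page_tags(page_address(page));
		}
	}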

I probably asked already but is the only way to map a standard RAM page
(not device) in stage 2 via the fault handler? One case I had in mind
was something like get_user_pages() but it looks like that one doesn't
call set_pte_at_notify(). There are a few other places where
set_pte_at_notify() is called and these may happen before we got a
chance to fault on stage 2, effectively populating the entry (IIUC). If
that's an issue, we could move the above loop and check closer to the
actual pte setting like kvm_pgtable_stage2_map().

While the set_pte_at() race on the page flags is somewhat clearer, we
may still have a race here with the VMM's set_pte_at() if the page is
mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
handling the VMM page tables (well, not always, see below).

gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
sure whether get_user_page_fast_only() does the same.

The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
KVM mmu notifier is invoked before set_pte_at() and racing with another
user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
set_pte_at() would see the PG_mte_tagged set either by the current CPU
or by the one it was racing with.

Steven Price April 29, 2021, 4:06 p.m. UTC | #2
On 28/04/2021 18:07, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 04:43:05PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..5f8e165ea053 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>   		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>   							   &pfn, &fault_ipa);
>> +
>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
>> +	    pfn_valid(pfn)) {
> 
> In the current implementation, device == !pfn_valid(), so we could skip
> the latter check.

Thanks, I'll drop that check.

>> +		/*
>> +		 * VM will be able to see the page's tags, so we must ensure
>> +		 * they have been initialised. If PG_mte_tagged is set, tags
>> +		 * have already been initialised.
>> +		 */
>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>> +		struct page *page = pfn_to_online_page(pfn);
>> +
>> +		if (!page)
>> +			return -EFAULT;
> 
> I think that's fine, though maybe adding a comment that otherwise it
> would be mapped at stage 2 as Normal Cacheable and we cannot guarantee
> that the memory supports MTE tags.

That's what I intended by "be able to see the page's tags", but I'll 
reword to be explicit about it being Normal Cacheable.

>> +
>> +		for (i = 0; i < nr_pages; i++, page++) {
>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +				mte_clear_page_tags(page_address(page));
>> +		}
>> +	}
>> +
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
> 
> I probably asked already but is the only way to map a standard RAM page
> (not device) in stage 2 via the fault handler? One case I had in mind
> was something like get_user_pages() but it looks like that one doesn't
> call set_pte_at_notify(). There are a few other places where
> set_pte_at_notify() is called and these may happen before we got a
> chance to fault on stage 2, effectively populating the entry (IIUC). If
> that's an issue, we could move the above loop and check closer to the
> actual pte setting like kvm_pgtable_stage2_map().

The only call sites of kvm_pgtable_stage2_map() are in mmu.c:

  * kvm_phys_addr_ioremap() - maps as device in stage 2

  * user_mem_abort() - handled above

  * kvm_set_spte_handler() - ultimately called from the .change_pte() 
callback of the MMU notifier

So the last one is potentially a problem. It's called via the MMU 
notifiers in the case of set_pte_at_notify(). The users of that are:

  * uprobe_write_opcode(): Allocates a new page and performs a 
copy_highpage() to copy the data to the new page (which with MTE 
includes the tags and will copy across the PG_mte_tagged flag).

  * write_protect_page() (KSM): Changes the permissions on the PTE but 
it's still the same page, so nothing to do regarding MTE.

  * replace_page() (KSM): If the page has MTE tags then the MTE version 
of memcmp_pages() will return false, so the only caller 
(try_to_merge_one_page()) will never call this on a page with tags.

  * wp_page_copy(): This one is more interesting - if we go down the 
cow_user_page() path with an old page then everything is safe (tags are 
copied over). The is_zero_pfn() case worries me a bit - a new page is 
allocated, but I can't instantly see anything to zero out the tags (and 
set PG_mte_tagged).

  * migrate_vma_insert_page(): I think migration should be safe as the 
tags should be copied.

So wp_page_copy() looks suspicious.

kvm_pgtable_stage2_map() looks like it could be a good place for the 
checks - it should work and is probably a more obvious place for them.
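
For instance, the check could be pulled out into a small helper callable
from both user_mem_abort() and the kvm_pgtable_stage2_map() path - rough
sketch only, the helper name here is made up:

	static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
				     unsigned long size)
	{
		unsigned long i, nr_pages = size >> PAGE_SHIFT;
		struct page *page;

		if (!kvm_has_mte(kvm))
			return 0;

		/* No struct page: can't guarantee the memory supports tags */
		page = pfn_to_online_page(pfn);
		if (!page)
			return -EFAULT;

		for (i = 0; i < nr_pages; i++, page++) {
			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
				mte_clear_page_tags(page_address(page));
		}

		return 0;
	}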

> While the set_pte_at() race on the page flags is somewhat clearer, we
> may still have a race here with the VMM's set_pte_at() if the page is
> mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> handling the VMM page tables (well, not always, see below).
> 
> gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> sure whether get_user_page_fast_only() does the same.
> 
> The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> KVM mmu notifier is invoked before set_pte_at() and racing with another
> user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> set_pte_at() would see the PG_mte_tagged set either by the current CPU
> or by the one it was racing with.
> 

Given the changes to set_pte_at() which means that tags are restored 
from swap even if !PROT_MTE, the only race I can see remaining is the 
creation of new PROT_MTE mappings. As you mention an attempt to change 
mappings in the VMM memory space should involve a mmu notifier call 
which I think serialises this. So the remaining issue is doing this in a 
separate address space.

So I guess the potential problem is:

  * allocate memory MAP_SHARED but !PROT_MTE
  * fork()
  * VM causes a fault in parent address space
  * child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of 
handling that.
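
For reference, a minimal userspace sketch of that sequence (illustrative
only: the parent write below just stands in for the stage 2 fault, and
PROT_MTE - 0x20 in the arm64 uapi headers - is defined as a fallback in
case the libc headers lack it):

	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef PROT_MTE
	#define PROT_MTE	0x20	/* arm64 mmap/mprotect flag */
	#endif

	int main(void)
	{
		long sz = sysconf(_SC_PAGESIZE);
		/* allocate memory MAP_SHARED but !PROT_MTE */
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		if (fork() == 0) {
			/* child makes the shared mapping tagged */
			mprotect(p, sz, PROT_READ | PROT_WRITE | PROT_MTE);
			_exit(0);
		}

		/*
		 * Parent touches the page, standing in for the VM faulting
		 * it into stage 2; this can race with the child's
		 * mprotect() on the page flags.
		 */
		p[0] = 1;
		return 0;
	}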

Steve

Catalin Marinas May 4, 2021, 5:40 p.m. UTC | #3
On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> On 28/04/2021 18:07, Catalin Marinas wrote:
> > I probably asked already but is the only way to map a standard RAM page
> > (not device) in stage 2 via the fault handler? One case I had in mind
> > was something like get_user_pages() but it looks like that one doesn't
> > call set_pte_at_notify(). There are a few other places where
> > set_pte_at_notify() is called and these may happen before we got a
> > chance to fault on stage 2, effectively populating the entry (IIUC). If
> > that's an issue, we could move the above loop and check closer to the
> > actual pte setting like kvm_pgtable_stage2_map().
> 
> The only call sites of kvm_pgtable_stage2_map() are in mmu.c:
> 
>  * kvm_phys_addr_ioremap() - maps as device in stage 2
> 
>  * user_mem_abort() - handled above
> 
>  * kvm_set_spte_handler() - ultimately called from the .change_pte()
> callback of the MMU notifier
> 
> So the last one is potentially a problem. It's called via the MMU notifiers
> in the case of set_pte_at_notify(). The users of that are:
> 
>  * uprobe_write_opcode(): Allocates a new page and performs a
> copy_highpage() to copy the data to the new page (which with MTE includes
> the tags and will copy across the PG_mte_tagged flag).
> 
>  * write_protect_page() (KSM): Changes the permissions on the PTE but it's
> still the same page, so nothing to do regarding MTE.

My concern here is that the VMM had a stage 1 pte but we haven't yet
faulted in at stage 2 via user_mem_abort(), so we don't have any stage 2
pte set. write_protect_page() comes in and sets the new stage 2 pte via
the callback. I couldn't find any check in kvm_pgtable_stage2_map() for
the old pte, so it will set the new stage 2 pte regardless. A subsequent
guest read would no longer fault at stage 2.

>  * replace_page() (KSM): If the page has MTE tags then the MTE version of
> memcmp_pages() will return false, so the only caller
> (try_to_merge_one_page()) will never call this on a page with tags.
> 
>  * wp_page_copy(): This one is more interesting - if we go down the
> cow_user_page() path with an old page then everything is safe (tags are
> copied over). The is_zero_pfn() case worries me a bit - a new page is
> allocated, but I can't instantly see anything to zero out the tags (and set
> PG_mte_tagged).

True, I think tag zeroing happens only if we map it as PROT_MTE in the
VMM.

>  * migrate_vma_insert_page(): I think migration should be safe as the tags
> should be copied.
> 
> So wp_page_copy() looks suspicious.
> 
> kvm_pgtable_stage2_map() looks like it could be a good place for the checks,
> it looks like it should work and is probably a more obvious place for the
> checks.

That would be my preference. It also matches the stage 1 set_pte_at().

> > While the set_pte_at() race on the page flags is somewhat clearer, we
> > may still have a race here with the VMM's set_pte_at() if the page is
> > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > handling the VMM page tables (well, not always, see below).
> > 
> > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > sure whether get_user_page_fast_only() does the same.
> > 
> > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > or by the one it was racing with.
> 
> Given the changes to set_pte_at() which means that tags are restored from
> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> new PROT_MTE mappings. As you mention an attempt to change mappings in the
> VMM memory space should involve a mmu notifier call which I think serialises
> this. So the remaining issue is doing this in a separate address space.
> 
> So I guess the potential problem is:
> 
>  * allocate memory MAP_SHARED but !PROT_MTE
>  * fork()
>  * VM causes a fault in parent address space
>  * child does a mprotect(PROT_MTE)
> 
> With the last two potentially racing. Sadly I can't see a good way of
> handling that.

Ah, the mmap lock doesn't help as they are different processes
(mprotect() acquires it as a writer).

I wonder whether this is racy even in the absence of KVM. If both parent
and child do an mprotect(PROT_MTE), one of them may be reading stale
tags for a brief period.

Maybe we should revisit whether shared MTE pages are of any use, though
it's an ABI change (not bad if no-one is relying on this). However...

Thinking about this, we have a similar problem with the PG_dcache_clean
and two processes doing mprotect(PROT_EXEC). One of them could see the
flag set and skip the I-cache maintenance while the other executes
stale instructions. change_pte_range() could acquire the page lock if
the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
solve the MTE/KVM case but we could at least take the page lock via
user_mem_abort().

Or maybe we just document this behaviour as racy both for PROT_EXEC and
PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
is the potential leaking of tags (it's harder to leak information
through the I-cache).

Steven Price May 6, 2021, 4:15 p.m. UTC | #4
On 04/05/2021 18:40, Catalin Marinas wrote:
> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>> On 28/04/2021 18:07, Catalin Marinas wrote:
>>> I probably asked already but is the only way to map a standard RAM page
>>> (not device) in stage 2 via the fault handler? One case I had in mind
>>> was something like get_user_pages() but it looks like that one doesn't
>>> call set_pte_at_notify(). There are a few other places where
>>> set_pte_at_notify() is called and these may happen before we got a
>>> chance to fault on stage 2, effectively populating the entry (IIUC). If
>>> that's an issue, we could move the above loop and check closer to the
>>> actual pte setting like kvm_pgtable_stage2_map().
>>
>> The only call sites of kvm_pgtable_stage2_map() are in mmu.c:
>>
>>   * kvm_phys_addr_ioremap() - maps as device in stage 2
>>
>>   * user_mem_abort() - handled above
>>
>>   * kvm_set_spte_handler() - ultimately called from the .change_pte()
>> callback of the MMU notifier
>>
>> So the last one is potentially a problem. It's called via the MMU notifiers
>> in the case of set_pte_at_notify(). The users of that are:
>>
>>   * uprobe_write_opcode(): Allocates a new page and performs a
>> copy_highpage() to copy the data to the new page (which with MTE includes
>> the tags and will copy across the PG_mte_tagged flag).
>>
>>   * write_protect_page() (KSM): Changes the permissions on the PTE but it's
>> still the same page, so nothing to do regarding MTE.
> 
> My concern here is that the VMM had a stage 1 pte but we haven't yet
> faulted in at stage 2 via user_mem_abort(), so we don't have any stage 2
> pte set. write_protect_page() comes in and sets the new stage 2 pte via
> the callback. I couldn't find any check in kvm_pgtable_stage2_map() for
> the old pte, so it will set the new stage 2 pte regardless. A subsequent
> guest read would no longer fault at stage 2.
> 
>>   * replace_page() (KSM): If the page has MTE tags then the MTE version of
>> memcmp_pages() will return false, so the only caller
>> (try_to_merge_one_page()) will never call this on a page with tags.
>>
>>   * wp_page_copy(): This one is more interesting - if we go down the
>> cow_user_page() path with an old page then everything is safe (tags are
>> copied over). The is_zero_pfn() case worries me a bit - a new page is
>> allocated, but I can't instantly see anything to zero out the tags (and set
>> PG_mte_tagged).
> 
> True, I think tag zeroing happens only if we map it as PROT_MTE in the
> VMM.
> 
>>   * migrate_vma_insert_page(): I think migration should be safe as the tags
>> should be copied.
>>
>> So wp_page_copy() looks suspicious.
>>
>> kvm_pgtable_stage2_map() looks like it could be a good place for the checks,
>> it looks like it should work and is probably a more obvious place for the
>> checks.
> 
> That would be my preference. It also matches the stage 1 set_pte_at().
> 
>>> While the set_pte_at() race on the page flags is somewhat clearer, we
>>> may still have a race here with the VMM's set_pte_at() if the page is
>>> mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
>>> handling the VMM page tables (well, not always, see below).
>>>
>>> gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
>>> path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
>>> would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
>>> sure whether get_user_page_fast_only() does the same.
>>>
>>> The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
>>> KVM mmu notifier is invoked before set_pte_at() and racing with another
>>> user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
>>> set_pte_at() would see the PG_mte_tagged set either by the current CPU
>>> or by the one it was racing with.
>>
>> Given the changes to set_pte_at() which means that tags are restored from
>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>> VMM memory space should involve a mmu notifier call which I think serialises
>> this. So the remaining issue is doing this in a separate address space.
>>
>> So I guess the potential problem is:
>>
>>   * allocate memory MAP_SHARED but !PROT_MTE
>>   * fork()
>>   * VM causes a fault in parent address space
>>   * child does a mprotect(PROT_MTE)
>>
>> With the last two potentially racing. Sadly I can't see a good way of
>> handling that.
> 
> Ah, the mmap lock doesn't help as they are different processes
> (mprotect() acquires it as a writer).
> 
> I wonder whether this is racy even in the absence of KVM. If both parent
> and child do an mprotect(PROT_MTE), one of them may be reading stale
> tags for a brief period.
> 
> Maybe we should revisit whether shared MTE pages are of any use, though
> it's an ABI change (not bad if no-one is relying on this). However...

Shared MTE pages are certainly hard to use correctly (e.g. see the 
discussions with the VMM accessing guest memory). But I guess that boat 
has sailed.

> Thinking about this, we have a similar problem with the PG_dcache_clean
> and two processes doing mprotect(PROT_EXEC). One of them could see the
> flag set and skip the I-cache maintenance while the other executes
> stale instructions. change_pte_range() could acquire the page lock if
> the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> solve the MTE/KVM case but we could at least take the page lock via
> user_mem_abort().

For PG_dcache_clean AFAICS the solution is actually simple: split the 
test and set parts, i.e.:

  if (!test_bit(PG_dcache_clean, &page->flags)) {
	sync_icache_aliases(page_address(page), page_size(page));
	set_bit(PG_dcache_clean, &page->flags);
  }

There isn't a problem with repeating the sync_icache_aliases() call in 
the case of a race. Or am I missing something?

> Or maybe we just document this behaviour as racy both for PROT_EXEC and
> PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> is the potential leaking of tags (it's harder to leak information
> through the I-cache).
> 

This is the real issue I see - the race in PROT_MTE case is either a 
data leak (syncing after setting the bit) or data loss (syncing before 
setting the bit).

But without serialising through a spinlock (in mte_sync_tags()) I 
haven't been able to come up with any way of closing the race. But with 
the change to set_pte_at() to call mte_sync_tags() even if the PTE isn't 
PROT_MTE that is likely to seriously hurt performance.

Steve

Catalin Marinas May 7, 2021, 6:25 p.m. UTC | #5
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > > 
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > > 
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > > 
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > > 
> > > So I guess the potential problem is:
> > > 
> > >   * allocate memory MAP_SHARED but !PROT_MTE
> > >   * fork()
> > >   * VM causes a fault in parent address space
> > >   * child does a mprotect(PROT_MTE)
> > > 
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> > 
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> > 
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> > 
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
> 
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.

Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.

> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
> 
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts, i.e.:
> 
>  if (!test_bit(PG_dcache_clean, &page->flags)) {
> 	sync_icache_aliases(page_address(page), page_size(page));
> 	set_bit(PG_dcache_clean, &page->flags);
>  }
> 
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?

No, the fix is simple as you said.

> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
> 
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).

For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on a memfd/tmpfs file
may lead to the same situation, but the mmap() itself won't directly
cause the page to be allocated. The subsequent fault via filemap_map_pages()
seems to take the page lock.

> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.

Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.

If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.

In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (pte_present(oldpte)) {
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
+			struct page *page = NULL;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct page *page;
-
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
 					continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			if (vma->vm_flags & VM_SHARED) {
+				page = vm_normal_page(vma, addr, oldpte);
+				lock_page(page);
+			}
 			ptent = pte_modify(oldpte, newprot);
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			if (page)
+				unlock_page(page);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);

Catalin Marinas May 10, 2021, 6:35 p.m. UTC | #6
On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> > On 04/05/2021 18:40, Catalin Marinas wrote:
> > > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > > Given the changes to set_pte_at() which means that tags are restored from
> > > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > > VMM memory space should involve a mmu notifier call which I think serialises
> > > > this. So the remaining issue is doing this in a separate address space.
> > > > 
> > > > So I guess the potential problem is:
> > > > 
> > > >   * allocate memory MAP_SHARED but !PROT_MTE
> > > >   * fork()
> > > >   * VM causes a fault in parent address space
> > > >   * child does a mprotect(PROT_MTE)
> > > > 
> > > > With the last two potentially racing. Sadly I can't see a good way of
> > > > handling that.
> > > 
> > > Ah, the mmap lock doesn't help as they are different processes
> > > (mprotect() acquires it as a writer).
> > > 
> > > I wonder whether this is racy even in the absence of KVM. If both parent
> > > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > > tags for a brief period.
> > > 
> > > Maybe we should revisit whether shared MTE pages are of any use, though
> > > it's an ABI change (not bad if no-one is relying on this). However...
[...]
> > > Thinking about this, we have a similar problem with the PG_dcache_clean
> > > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > > flag set and skip the I-cache maintenance while the other executes
> > > stale instructions. change_pte_range() could acquire the page lock if
> > > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > > solve the MTE/KVM case but we could at least take the page lock via
> > > user_mem_abort().
[...]
> > This is the real issue I see - the race in PROT_MTE case is either a data
> > leak (syncing after setting the bit) or data loss (syncing before setting
> > the bit).
[...]
> > But without serialising through a spinlock (in mte_sync_tags()) I haven't
> > been able to come up with any way of closing the race. But with the change
> > to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> > is likely to seriously hurt performance.
> 
> Yeah. We could add another page flag as a lock though I think it should
> be the core code that prevents the race.
> 
> If we are to do it in the arch code, maybe easier with a custom
> ptep_modify_prot_start/end() where we check if it's VM_SHARED and
> VM_MTE, take a (big) lock.

I think in the general case we don't even need VM_SHARED. For example,
we have two processes mapping a file, read-only. An mprotect() call in
both processes will race on the page->flags via the corresponding
set_pte_at(). I think an mprotect() with a page fault in different
processes can also race.

The PROT_EXEC case can be easily fixed, as you said already. The
PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
approach: test flag, clear tags, set flag. A subsequent write would
trigger a CoW, so different page anyway.
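
i.e. mirroring the PG_dcache_clean fix, a sketch of that ordering:

	if (!test_bit(PG_mte_tagged, &page->flags)) {
		mte_clear_page_tags(page_address(page));
		set_bit(PG_mte_tagged, &page->flags);
	}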

Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
probably makes the code even harder to read.

> In the core code, something like below (well, a partial hack, not tested
> and it doesn't handle huge pages but just to give an idea):
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94188df1ee55..6ba96ff141a6 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			if (vma->vm_flags & VM_SHARED) {
> +				page = vm_normal_page(vma, addr, oldpte);
> +				lock_page(page);
> +			}
>  			ptent = pte_modify(oldpte, newprot);
>  			if (preserve_write)
>  				ptent = pte_mk_savedwrite(ptent);
> @@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				ptent = pte_mkwrite(ptent);
>  			}
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			if (page)
> +				unlock_page(page);
>  			pages++;
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);

That's bogus: lock_page() might sleep but this whole code sequence is
under the ptl spinlock. There are some lock_page_* variants but that
would involve either a busy loop on this path or some bailing out,
waiting for a release.

Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
   and avoid reading stale tags. We document that mprotect() on
   MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
   arch code and return an error.

2. Figure out some other locking in the core code. However, if
   mprotect() in one process can race with a handle_pte_fault() in
   another, on the same shared mapping, it's not trivial.
   filemap_map_pages() would take the page lock before calling
   do_set_pte(), so mprotect() would need the same page lock.

3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
   set it around the other PG_arch_* bit setting).

I ran out of ideas.

Steven Price May 12, 2021, 3:46 p.m. UTC | #7
On 10/05/2021 19:35, Catalin Marinas wrote:
> On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
>>> On 04/05/2021 18:40, Catalin Marinas wrote:
>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>>>>> Given the changes to set_pte_at() which means that tags are restored from
>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>>>>> VMM memory space should involve a mmu notifier call which I think serialises
>>>>> this. So the remaining issue is doing this in a separate address space.
>>>>>
>>>>> So I guess the potential problem is:
>>>>>
>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
>>>>>    * fork()
>>>>>    * VM causes a fault in parent address space
>>>>>    * child does a mprotect(PROT_MTE)
>>>>>
>>>>> With the last two potentially racing. Sadly I can't see a good way of
>>>>> handling that.
>>>>
>>>> Ah, the mmap lock doesn't help as they are different processes
>>>> (mprotect() acquires it as a writer).
>>>>
>>>> I wonder whether this is racy even in the absence of KVM. If both parent
>>>> and child do an mprotect(PROT_MTE), one of them may be reading stale
>>>> tags for a brief period.
>>>>
>>>> Maybe we should revisit whether shared MTE pages are of any use, though
>>>> it's an ABI change (not bad if no-one is relying on this). However...
> [...]
>>>> Thinking about this, we have a similar problem with the PG_dcache_clean
>>>> and two processes doing mprotect(PROT_EXEC). One of them could see the
>>>> flag set and skip the I-cache maintenance while the other executes
>>>> stale instructions. change_pte_range() could acquire the page lock if
>>>> the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
>>>> solve the MTE/KVM case but we could at least take the page lock via
>>>> user_mem_abort().
> [...]
>>> This is the real issue I see - the race in PROT_MTE case is either a data
>>> leak (syncing after setting the bit) or data loss (syncing before setting
>>> the bit).
> [...]
>>> But without serialising through a spinlock (in mte_sync_tags()) I haven't
>>> been able to come up with any way of closing the race. But with the change
>>> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
>>> is likely to seriously hurt performance.
>>
>> Yeah. We could add another page flag as a lock though I think it should
>> be the core code that prevents the race.
>>
>> If we are to do it in the arch code, maybe easier with a custom
>> ptep_modify_prot_start/end() where we check if it's VM_SHARED and
>> VM_MTE, take a (big) lock.
> 
> I think in the general case we don't even need VM_SHARED. For example,
> we have two processes mapping a file, read-only. An mprotect() call in
> both processes will race on the page->flags via the corresponding
> set_pte_at(). I think an mprotect() with a page fault in different
> processes can also race.
> 
> The PROT_EXEC case can be easily fixed, as you said already. The
> PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
> approach: test flag, clear tags, set flag. A subsequent write would
> trigger a CoW, so different page anyway.
> 
> Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
> probably makes the code even harder to read.
> 
>> In the core code, something like below (well, a partial hack, not tested
>> and it doesn't handle huge pages but just to give an idea):
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 94188df1ee55..6ba96ff141a6 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>   			}
>>   
>>   			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			if (vma->vm_flags & VM_SHARED) {
>> +				page = vm_normal_page(vma, addr, oldpte);
>> +				lock_page(page);
>> +			}
>>   			ptent = pte_modify(oldpte, newprot);
>>   			if (preserve_write)
>>   				ptent = pte_mk_savedwrite(ptent);
>> @@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>   				ptent = pte_mkwrite(ptent);
>>   			}
>>   			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			if (page)
>> +				unlock_page(page);
>>   			pages++;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
> 
> That's bogus: lock_page() might sleep but this whole code sequence is
> under the ptl spinlock. There are some lock_page_* variants but that
> would involve either a busy loop on this path or some bailing out,
> waiting for a release.
> 
> Options:
> 
> 1. Change the mte_sync_tags() code path to set the flag after clearing
>     and avoid reading stale tags. We document that mprotect() on
>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
>     arch code and return an error.

This is the best option I've come up with so far - but it's not a good
one! We can replace the set_bit() with a test_and_set_bit() to catch the
race after it has occurred - but I'm not sure what we can do about it
then (we've already wiped the data). Returning an error doesn't seem
particularly useful at that point, a message in dmesg is about the best
I can come up with.
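
i.e. in mte_sync_page_tags(), something along these lines (sketch only -
it detects the race, but only after the tags have already been wiped):

	mte_clear_page_tags(page_address(page));

	if (test_and_set_bit(PG_mte_tagged, &page->flags))
		pr_warn_ratelimited("Possible MTE tag loss: racing tag initialisation\n");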

> 2. Figure out some other locking in the core code. However, if
>     mprotect() in one process can race with a handle_pte_fault() in
>     another, on the same shared mapping, it's not trivial.
>     filemap_map_pages() would take the page lock before calling
>     do_set_pte(), so mprotect() would need the same page lock.

I can't see how this is going to work without harming the performance of
non-MTE work. Ultimately we're trying to add some sort of locking for
two (mostly) unrelated processes doing page table operations, which will
hurt scalability.

> 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
>     set it around the other PG_arch_* bit setting).

This is certainly tempting, although sadly the existing
wait_on_page_bit() is sleeping - so this would either be a literal spin,
or we'd need to implement a new non-sleeping wait mechanism.

> I ran out of ideas.
> 

4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
the MTE case where we have to sync tags (see below). What the actual
performance impact of this is I've no idea. It could certainly be bad
if there are a lot of pages with MTE enabled, which sadly is exactly
the case if KVM is used with MTE :(

Steve

--->8----
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 0d320c060ebe..389ad40256f6 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,23 +25,33 @@
  u64 gcr_kernel_excl __ro_after_init;
  
  static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
  
  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
  			       bool pte_is_tagged)
  {
  	pte_t old_pte = READ_ONCE(*ptep);
  
+	if (!is_swap_pte(old_pte) && !pte_is_tagged)
+		return;
+
+	spin_lock_irqsave(&tag_sync_lock, flags);
+
+	/* Recheck with the lock held */
+	if (test_bit(PG_mte_tagged, &page->flags))
+		goto out;
+
  	if (check_swap && is_swap_pte(old_pte)) {
  		swp_entry_t entry = pte_to_swp_entry(old_pte);
  
  		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
  			set_bit(PG_mte_tagged, &page->flags);
-			return;
+			goto out;
  		}
  	}
  
  	if (!pte_is_tagged)
-		return;
+		goto out;
  
  	page_kasan_tag_reset(page);
  	/*
@@ -55,6 +65,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
  	mte_clear_page_tags(page_address(page));
  
  	set_bit(PG_mte_tagged, &page->flags);
+
+out:
+	spin_unlock_irqrestore(&tag_sync_lock, flags);
  }
  
  void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,6 +76,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
  	long i, nr_pages = compound_nr(page);
  	bool check_swap = nr_pages == 1;
  	bool pte_is_tagged = pte_tagged(pte);
+	unsigned long flags;
+
+	/* Early out if there's nothing to do */
+	if (!check_swap && !pte_is_tagged)
+		return;
  
  	/* if PG_mte_tagged is set, tags have already been initialised */
  	for (i = 0; i < nr_pages; i++, page++) {

Catalin Marinas May 12, 2021, 5:45 p.m. UTC | #8
On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> On 10/05/2021 19:35, Catalin Marinas wrote:
> > On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
> > > On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> > > > On 04/05/2021 18:40, Catalin Marinas wrote:
> > > > > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > > > > Given the changes to set_pte_at() which means that tags are restored from
> > > > > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > > > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > > > > VMM memory space should involve a mmu notifier call which I think serialises
> > > > > > this. So the remaining issue is doing this in a separate address space.
> > > > > > 
> > > > > > So I guess the potential problem is:
> > > > > > 
> > > > > >    * allocate memory MAP_SHARED but !PROT_MTE
> > > > > >    * fork()
> > > > > >    * VM causes a fault in parent address space
> > > > > >    * child does a mprotect(PROT_MTE)
> > > > > > 
> > > > > > With the last two potentially racing. Sadly I can't see a good way of
> > > > > > handling that.
[...]
> > Options:
> > 
> > 1. Change the mte_sync_tags() code path to set the flag after clearing
> >     and avoid reading stale tags. We document that mprotect() on
> >     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
> >     arch code and return an error.
> 
> This is the best option I've come up with so far - but it's not a good
> one! We can replace the set_bit() with a test_and_set_bit() to catch the
> race after it has occurred - but I'm not sure what we can do about it
> then (we've already wiped the data). Returning an error doesn't seem
> particularly useful at that point, a message in dmesg is about the best
> I can come up with.

What I meant by intercepting is using something like
arch_validate_flags() to prevent VM_SHARED and VM_MTE together, but only
for mprotect(), not mmap(). However, arch_validate_flags() is currently
called on both the mmap() and mprotect() paths.

We can't do much in set_pte_at() to prevent the race with only a single
bit.

> > 2. Figure out some other locking in the core code. However, if
> >     mprotect() in one process can race with a handle_pte_fault() in
> >     another, on the same shared mapping, it's not trivial.
> >     filemap_map_pages() would take the page lock before calling
> >     do_set_pte(), so mprotect() would need the same page lock.
> 
> I can't see how this is going to work without harming the performance of
> non-MTE work. Ultimately we're trying to add some sort of locking for
> two (mostly) unrelated processes doing page table operations, which will
> hurt scalability.

Another option is to have an arch callback to force re-faulting on the
pte. That means we don't populate it back after the invalidation in the
change_protection() path. We could do this only if the new pte is tagged
and the page doesn't have PG_mte_tagged. The faulting path takes the
page lock IIUC.

Well, at least for stage 1, I haven't thought much about stage 2.

> > 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
> >     set it around the other PG_arch_* bit setting).
> 
> This is certainly tempting, although sadly the existing
> wait_on_page_bit() is sleeping - so this would either be a literal spin,
> or we'd need to implement a new non-sleeping wait mechanism.

Yeah, it would have to be a custom spinning mechanism, something like:

	/* lock the page */
	while (test_and_set_bit(PG_arch_3, &page->flags))
		smp_cond_load_relaxed(&page->flags, !(VAL & PG_arch_3));
	...
	/* unlock the page */
	clear_bit(PG_arch_3, &page->flags);

> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
> the MTE case where we have to sync tags (see below). What the actual
> performance impact of this is I've no idea. It could certainly be bad
> if there are a lot of pages with MTE enabled, which sadly is exactly
> the case if KVM is used with MTE :(
> 
> --->8----
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 0d320c060ebe..389ad40256f6 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -25,23 +25,33 @@
>  u64 gcr_kernel_excl __ro_after_init;
>  static bool report_fault_once = true;
> +static spinlock_t tag_sync_lock;
>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>  			       bool pte_is_tagged)
>  {
>  	pte_t old_pte = READ_ONCE(*ptep);
> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
> +		return;
> +
> +	spin_lock_irqsave(&tag_sync_lock, flags);
> +
> +	/* Recheck with the lock held */
> +	if (test_bit(PG_mte_tagged, &page->flags))
> +		goto out;

Can we skip the lock if the page already has the PG_mte_tagged set?
That's assuming that we set the flag only after clearing the tags. The
normal case where mprotect() is called on a page already mapped with
PROT_MTE would not be affected.

Steven Price May 13, 2021, 10:57 a.m. UTC | #9
On 12/05/2021 18:45, Catalin Marinas wrote:
> On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
>> On 10/05/2021 19:35, Catalin Marinas wrote:
>>> On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
>>>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
>>>>> On 04/05/2021 18:40, Catalin Marinas wrote:
>>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>>>>>>> Given the changes to set_pte_at() which means that tags are restored from
>>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>>>>>>> VMM memory space should involve a mmu notifier call which I think serialises
>>>>>>> this. So the remaining issue is doing this in a separate address space.
>>>>>>>
>>>>>>> So I guess the potential problem is:
>>>>>>>
>>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
>>>>>>>    * fork()
>>>>>>>    * VM causes a fault in parent address space
>>>>>>>    * child does a mprotect(PROT_MTE)
>>>>>>>
>>>>>>> With the last two potentially racing. Sadly I can't see a good way of
>>>>>>> handling that.
> [...]
>>> Options:
>>>
>>> 1. Change the mte_sync_tags() code path to set the flag after clearing
>>>     and avoid reading stale tags. We document that mprotect() on
>>>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
>>>     arch code and return an error.
>>
>> This is the best option I've come up with so far - but it's not a good
>> one! We can replace the set_bit() with a test_and_set_bit() to catch the
>> race after it has occurred - but I'm not sure what we can do about it
>> then (we've already wiped the data). Returning an error doesn't seem
>> particularly useful at that point, a message in dmesg is about the best
>> I can come up with.
> 
> What I meant by intercepting is using something like
> arch_validate_flags() to prevent VM_SHARED and VM_MTE together, but only
> for mprotect(), not mmap(). However, arch_validate_flags() is currently
> called on both the mmap() and mprotect() paths.

I think even if we were to restrict mprotect() there would be corner
cases around swapping in. For example if a page mapped VM_SHARED|VM_MTE
is faulted simultaneously in both processes then we have the same situation:

 * with test_and_set_bit() one process could potentially see the tags
before they have been restored - i.e. a data leak.

 * with separated test and set then one process could write to the tags
before the second restore has completed causing a lost update.

Obviously completely banning VM_SHARED|VM_MTE might work, but I don't
think that's a good idea.

> We can't do much in set_pte_at() to prevent the race with only a single
> bit.
> 
>>> 2. Figure out some other locking in the core code. However, if
>>>     mprotect() in one process can race with a handle_pte_fault() in
>>>     another, on the same shared mapping, it's not trivial.
>>>     filemap_map_pages() would take the page lock before calling
>>>     do_set_pte(), so mprotect() would need the same page lock.
>>
>> I can't see how this is going to work without harming the performance of
>> non-MTE work. Ultimately we're trying to add some sort of locking for
>> two (mostly) unrelated processes doing page table operations, which will
>> hurt scalability.
> 
> Another option is to have an arch callback to force re-faulting on the
> pte. That means we don't populate it back after the invalidation in the
> change_protection() path. We could do this only if the new pte is tagged
> and the page doesn't have PG_mte_tagged. The faulting path takes the
> page lock IIUC.

As above - I don't think this race is just on the change_protection() path.

> Well, at least for stage 1, I haven't thought much about stage 2.
> 
>>> 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
>>>     set it around the other PG_arch_* bit setting).
>>
>> This is certainly tempting, although sadly the existing
>> wait_on_page_bit() is sleeping - so this would either be a literal spin,
>> or we'd need to implement a new non-sleeping wait mechanism.
> 
> Yeah, it would have to be a custom spinning mechanism, something like:
> 
> 	/* lock the page */
> 	while (test_and_set_bit(PG_arch_3, &page->flags))
> 		smp_cond_load_relaxed(&page->flags, !(VAL & PG_arch_3));
> 	...
> 	/* unlock the page */
> 	clear_bit(PG_arch_3, &page->flags);

Presumably we'd also need to disable interrupts so that we can't be
pre-empted in the middle and potentially deadlock. It's doable, but I'd
prefer not to invent a new lock type if possible.
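
For completeness, a sketch of what such a spin might look like with
interrupts disabled (reusing the hypothetical PG_arch_3 naming from
above):

	unsigned long flags;

	/* lock the page; interrupts off so the holder can't be preempted */
	local_irq_save(flags);
	while (test_and_set_bit(PG_arch_3, &page->flags))
		smp_cond_load_relaxed(&page->flags, !(VAL & PG_arch_3));
	...
	/* unlock the page */
	clear_bit(PG_arch_3, &page->flags);
	local_irq_restore(flags);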

>> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
>> the MTE case where we have to sync tags (see below). What the actual
>> performance impact of this is I've no idea. It could certainly be bad
>> if there are a lot of pages with MTE enabled, which sadly is exactly
>> the case if KVM is used with MTE :(
>>
>> --->8----
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 0d320c060ebe..389ad40256f6 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,23 +25,33 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  static bool report_fault_once = true;
>> +static spinlock_t tag_sync_lock;
>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>>  			       bool pte_is_tagged)
>>  {
>>  	pte_t old_pte = READ_ONCE(*ptep);
>> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
>> +		return;
>> +
>> +	spin_lock_irqsave(&tag_sync_lock, flags);
>> +
>> +	/* Recheck with the lock held */
>> +	if (test_bit(PG_mte_tagged, &page->flags))
>> +		goto out;
> 
> Can we skip the lock if the page already has the PG_mte_tagged set?
> That's assuming that we set the flag only after clearing the tags. The
> normal case where mprotect() is called on a page already mapped with
> PROT_MTE would not be affected.
> 

It was missing from the diff context (sorry, should have double checked
that), but I was keeping the check in mte_sync_tags():

  void mte_sync_tags(pte_t *ptep, pte_t pte)
  {
	struct page *page = pte_page(pte);
	long i, nr_pages = compound_nr(page);
	bool check_swap = nr_pages == 1;
	bool pte_is_tagged = pte_tagged(pte);
	unsigned long flags;

	/* Early out if there's nothing to do */
	if (!check_swap && !pte_is_tagged)
		return;

	/* if PG_mte_tagged is set, tags have already been initialised */
	for (i = 0; i < nr_pages; i++, page++) {
		if (!test_bit(PG_mte_tagged, &page->flags))
			mte_sync_page_tags(page, ptep, check_swap,
					   pte_is_tagged);
	}
  }

So the hit is only taken if !PG_mte_tagged - hence the "recheck" comment
in mte_sync_page_tags() once the lock is held. I guess if I'm going this
route it would make sense to refactor this to be a bit clearer.

Steve

Catalin Marinas May 13, 2021, 3:08 p.m. UTC | #10
On Thu, May 13, 2021 at 11:57:39AM +0100, Steven Price wrote:
> On 12/05/2021 18:45, Catalin Marinas wrote:
> > On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> >>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> >>>>>>> Given the changes to set_pte_at() which means that tags are restored from
> >>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> >>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
> >>>>>>> VMM memory space should involve a mmu notifier call which I think serialises
> >>>>>>> this. So the remaining issue is doing this in a separate address space.
> >>>>>>>
> >>>>>>> So I guess the potential problem is:
> >>>>>>>
> >>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
> >>>>>>>    * fork()
> >>>>>>>    * VM causes a fault in parent address space
> >>>>>>>    * child does a mprotect(PROT_MTE)
> >>>>>>>
> >>>>>>> With the last two potentially racing. Sadly I can't see a good way of
> >>>>>>> handling that.
[...]
> >> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
> >> the MTE case where we have to sync tags (see below). What the actual
> >> performance impact of this is I've no idea. It could certainly be bad
> >> if there are a lot of pages with MTE enabled, which sadly is exactly
> >> the case if KVM is used with MTE :(
> >>
> >> --->8----
> >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >> index 0d320c060ebe..389ad40256f6 100644
> >> --- a/arch/arm64/kernel/mte.c
> >> +++ b/arch/arm64/kernel/mte.c
> >> @@ -25,23 +25,33 @@
> >>  u64 gcr_kernel_excl __ro_after_init;
> >>  static bool report_fault_once = true;
> >> +static spinlock_t tag_sync_lock;
> >>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
> >>  			       bool pte_is_tagged)
> >>  {
> >>  	pte_t old_pte = READ_ONCE(*ptep);
> >> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
> >> +		return;
> >> +
> >> +	spin_lock_irqsave(&tag_sync_lock, flags);
> >> +
> >> +	/* Recheck with the lock held */
> >> +	if (test_bit(PG_mte_tagged, &page->flags))
> >> +		goto out;
> > 
> > Can we skip the lock if the page already has the PG_mte_tagged set?
> > That's assuming that we set the flag only after clearing the tags. The
> > normal case where mprotect() is called on a page already mapped with
> > PROT_MTE would not be affected.
> 
> It was missing from the diff context (sorry, should have double checked
> that), but I was keeping the check in mte_sync_tags():
> 
>   void mte_sync_tags(pte_t *ptep, pte_t pte)
>   {
> 	struct page *page = pte_page(pte);
> 	long i, nr_pages = compound_nr(page);
> 	bool check_swap = nr_pages == 1;
> 	bool pte_is_tagged = pte_tagged(pte);
> 	unsigned long flags;
> 
> 	/* Early out if there's nothing to do */
> 	if (!check_swap && !pte_is_tagged)
> 		return;
> 
> 	/* if PG_mte_tagged is set, tags have already been initialised */
> 	for (i = 0; i < nr_pages; i++, page++) {
> 		if (!test_bit(PG_mte_tagged, &page->flags))
> 			mte_sync_page_tags(page, ptep, check_swap,
> 					   pte_is_tagged);
> 	}
>   }
> 
> So the hit is only taken if !PG_mte_tagged - hence the "recheck" comment
> in mte_sync_page_tags() once the lock is held. I guess if I'm going this
> route it would make sense to refactor this to be a bit clearer.

I think we can go with this for now but we should still raise it with
the mm folk; maybe they have a better idea on how to avoid the race in
the core code. Other architectures are affected too - those that use
PG_arch_1.

As the kernel stands currently, we'd take the lock on any set_pte_at()
for a tagged page when first mapped. With Peter's patches to use DC
GZVA, the tag zeroing is done on allocation. Until those are merged, we
could do something similar in the arch code but without the DC GZVA
optimisation (useful if we need to cc this fix to stable):

----------8<--------------------------
From 9f445f794454cf139c0953391e6c30fa3f075dc1 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 13 May 2021 14:15:37 +0100
Subject: [PATCH] arm64: Handle MTE tags zeroing in
 __alloc_zeroed_user_highpage()

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/include/asm/page.h |  6 ++++--
 arch/arm64/mm/fault.c         | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/personality.h> /* for READ_IMPLIES_EXEC */
+#include <linux/types.h>
 #include <asm/pgtable-types.h>
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
 	debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr)
+{
+	struct page *page;
+	bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+	page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+			      vaddr);
+	if (tagged && page) {
+		mte_clear_page_tags(page_address(page));
+		page_kasan_tag_reset(page);
+		set_bit(PG_mte_tagged, &page->flags);
+	}
+
+	return page;
+}
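
For context, a condensed sketch (paraphrased, with unrelated steps elided) of
the generic anonymous-fault path that ends up in this hook, showing why
clearing the tags in __alloc_zeroed_user_highpage() means PG_mte_tagged is
already set by the time the page is mapped with set_pte_at():

static inline struct page *
alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
				   unsigned long vaddr)
{
	/*
	 * Resolves to the arch override above once
	 * __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE is defined.
	 */
	return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
}

static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct page *page;

	/* ... */
	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
	if (!page)
		return VM_FAULT_OOM;
	/*
	 * For VM_MTE vmas the tags are now cleared and PG_mte_tagged set,
	 * so the subsequent set_pte_at() takes the fast path and never
	 * touches the tag sync lock.
	 */
	/* ... set_pte_at() installs the pte ... */
	return 0;
}
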
Catalin Marinas May 13, 2021, 3:21 p.m. UTC | #11
On Thu, May 13, 2021 at 11:57:39AM +0100, Steven Price wrote:
> On 12/05/2021 18:45, Catalin Marinas wrote:
> > On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> >> On 10/05/2021 19:35, Catalin Marinas wrote:
> >>>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> >>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> >>>>>>> Given the changes to set_pte_at(), which mean that tags are restored from
> >>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> >>>>>>> new PROT_MTE mappings. As you mention, an attempt to change mappings in the
> >>>>>>> VMM memory space should involve an mmu notifier call, which I think
> >>>>>>> serialises this. So the remaining issue is doing this in a separate address space.
> >>>>>>>
> >>>>>>> So I guess the potential problem is:
> >>>>>>>
> >>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
> >>>>>>>    * fork()
> >>>>>>>    * VM causes a fault in parent address space
> >>>>>>>    * child does a mprotect(PROT_MTE)
> >>>>>>>
> >>>>>>> With the last two potentially racing. Sadly I can't see a good way of
> >>>>>>> handling that.
> > [...]
> >>> Options:
> >>>
> >>> 1. Change the mte_sync_tags() code path to set the flag after clearing
> >>>     and avoid reading stale tags. We document that mprotect() on
> >>>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
> >>>     arch code and return an error.
> >>
> >> This is the best option I've come up with so far - but it's not a good
> >> one! We can replace the set_bit() with a test_and_set_bit() to catch the
> >> race after it has occurred - but I'm not sure what we can do about it
> >> then (we've already wiped the data). Returning an error doesn't seem
> >> particularly useful at that point; a message in dmesg is about the best
> >> I can come up with.
> > 
> > What I meant by intercepting is doing it in something like
> > arch_validate_flags(), to prevent VM_SHARED and VM_MTE being used
> > together, but only for mprotect(), not mmap(). However,
> > arch_validate_flags() is currently called on both the mmap() and
> > mprotect() paths.
> 
> I think even if we were to restrict mprotect() there would be corner
> cases around swapping in. For example, if a page mapped VM_SHARED|VM_MTE
> is faulted simultaneously in both processes, then we have the same situation:
> 
>  * with test_and_set_bit(), one process could potentially see the tags
> before they have been restored - i.e. a data leak.
> 
>  * with a separate test and set, one process could write to the tags
> before the second restore has completed, causing a lost update.

I don't think this can happen. We have shmem_swapin_page(), which I think
would be called for any faulting pte that was sharing such a page. It
takes the page lock around arch_swap_restore().
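
A condensed sketch of that ordering (not the real mm/shmem.c code; only the
locking around the tag restore is shown, and the usual swap-cache behaviour
is assumed):

static void shmem_swapin_tag_restore_sketch(swp_entry_t swap,
					    struct page *page)
{
	/*
	 * The swap-in path holds the page lock while the arch hook runs, so
	 * two processes faulting the same shared page cannot interleave the
	 * restore: the second one finds the page in the swap cache and
	 * serialises on the same lock.
	 */
	lock_page(page);
	arch_swap_restore(swap, page);	/* arm64: restore the saved tags */
	/* ... install the page in the mapping ... */
	unlock_page(page);
}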

Patch

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@  static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (kvm_has_mte(vcpu->kvm))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..1170ee137096 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@  struct kvm_arch {
 
 	u8 pfr0_csv2;
 	u8 pfr0_csv3;
+	/* Memory Tagging Extension enabled for the guest */
+	bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -767,6 +769,7 @@  bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
 	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@  static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
 	new |= (old & PSR_C_BIT);
 	new |= (old & PSR_V_BIT);
 
-	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+	if (kvm_has_mte(vcpu->kvm))
+		new |= PSR_TCO_BIT;
 
 	new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..5f8e165ea053 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,26 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PAGE_SIZE && !force_pte)
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
+
+	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
+	    pfn_valid(pfn)) {
+		/*
+		 * VM will be able to see the page's tags, so we must ensure
+		 * they have been initialised. if PG_mte_tagged is set, tags
+		 * have already been initialised.
+		 */
+		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+		struct page *page = pfn_to_online_page(pfn);
+
+		if (!page)
+			return -EFAULT;
+
+		for (i = 0; i < nr_pages; i++, page++) {
+			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+				mte_clear_page_tags(page_address(page));
+		}
+	}
+
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4f2f1e3145de..18c87500a7a8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@  static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		break;
 	case SYS_ID_AA64PFR1_EL1:
 		val &= ~FEATURE(ID_AA64PFR1_MTE);
+		if (kvm_has_mte(vcpu->kvm))
+			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+					  ID_AA64PFR1_MTE);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..6dc16c09a2d1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_ARM_MTE 195
 
 #ifdef KVM_CAP_IRQ_ROUTING