[v10,1/6] arm64: mte: Sync tags for pages where PTE is untagged

Message ID 20210312151902.17853-2-steven.price@arm.com (mailing list archive)
State New, archived
Series MTE support for KVM guest

Commit Message

Steven Price March 12, 2021, 3:18 p.m. UTC
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages which are !pte_valid_user() as these will
not have been swapped out.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c          | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

Comments

Catalin Marinas March 26, 2021, 6:56 p.m. UTC | #1
Hi Steven,

On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote:
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages which are !pte_valid_user() as these will
> not have been swapped out.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  2 +-
>  arch/arm64/kernel/mte.c          | 16 ++++++++++++----
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e17b96d0e4b5..84166625c989 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  		__sync_icache_dcache(pte);
>  
>  	if (system_supports_mte() &&
> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> +	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
>  		mte_sync_tags(ptep, pte);

With the EPAN patches queued in for-next/epan, pte_valid_user()
disappeared as its semantics weren't very clear.

So this relies on the set_pte_at() being done on the VMM address space.
I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
something like pte_present() && addr <= user_addr_max().

BTW, ignoring virtualisation, can we ever bring a page in from swap on a
PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
metadata around for when the pte becomes accessible but I suspect we
remove it if the page is removed from swap.

Steven Price March 29, 2021, 3:55 p.m. UTC | #2
On 26/03/2021 18:56, Catalin Marinas wrote:
> Hi Steven,
> 
> On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote:
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages which are !pte_valid_user() as these will
>> not have been swapped out.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>   arch/arm64/include/asm/pgtable.h |  2 +-
>>   arch/arm64/kernel/mte.c          | 16 ++++++++++++----
>>   2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e17b96d0e4b5..84166625c989 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>   		__sync_icache_dcache(pte);
>>   
>>   	if (system_supports_mte() &&
>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
>>   		mte_sync_tags(ptep, pte);
> 
> With the EPAN patches queued in for-next/epan, pte_valid_user()
> disappeared as its semantics weren't very clear.

Thanks for pointing that out.

> So this relies on the set_pte_at() being done on the VMM address space.
> I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
> it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
> something like pte_present() && addr <= user_addr_max().

AFAIUI the stage 2 matches the VMM's address space (for the subset that 
has memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to 
be invalidated and a subsequent fault would exit to the VMM to sort out. 
This sort of thing is done for the lazy migration use case (i.e. pages 
are fetched as the VM tries to access them).

> BTW, ignoring virtualisation, can we ever bring a page in from swap on a
> PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
> metadata around for when the pte becomes accessible but I suspect we
> remove it if the page is removed from swap.

There are two stages of bringing data from swap. First is populating the 
swap cache by doing the physical read from swap. The second is actually 
restoring the page table entries.

Clearly the first part can happen even with PROT_NONE (the simple case 
is there's another mapping which is !PROT_NONE).

For the second I'm a little hazy on exactly what happens when you do a 
'swapoff' - that may cause a page to be re-inserted into a page table 
without a fault. If you follow the chain down from try_to_unuse() you 
end up at a call to set_pte_at(). So we need set_pte_at() to handle a 
PROT_NONE mapping. So I guess the test we really want here is just 
(pte_val() & PTE_USER).
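
For comparison, the three candidate conditions discussed so far can be put side by side in a small user-space model. This is only a sketch: the MODEL_* names, the model_pte_t type and the bit positions are assumptions made for illustration and are not guaranteed to match the kernel's page table layout, and the !pte_special() term is left out because it is common to all three variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not guaranteed to match the kernel. */
#define MODEL_PTE_VALID     (UINT64_C(1) << 0)
#define MODEL_PTE_TAGGED    (UINT64_C(1) << 2)  /* stand-in for the tagged memory attribute */
#define MODEL_PTE_USER      (UINT64_C(1) << 6)
#define MODEL_PTE_PROT_NONE (UINT64_C(1) << 58)

typedef uint64_t model_pte_t;

/* Present means either valid or marked PROT_NONE. */
static bool model_pte_present(model_pte_t pte)
{
	return (pte & (MODEL_PTE_VALID | MODEL_PTE_PROT_NONE)) != 0;
}

/* Both the valid bit and the user bit must be set. */
static bool model_pte_valid_user(model_pte_t pte)
{
	const model_pte_t mask = MODEL_PTE_VALID | MODEL_PTE_USER;

	return (pte & mask) == mask;
}

/* Condition before the patch: only tagged mappings are synced. */
static bool sync_cond_old(model_pte_t pte)
{
	return model_pte_present(pte) && (pte & MODEL_PTE_TAGGED) != 0;
}

/* Condition in this patch: any valid user mapping is synced. */
static bool sync_cond_patch(model_pte_t pte)
{
	return model_pte_present(pte) && model_pte_valid_user(pte);
}

/* Condition suggested above: test the user bit alone. */
static bool sync_cond_user_bit(model_pte_t pte)
{
	return model_pte_present(pte) && (pte & MODEL_PTE_USER) != 0;
}

/* A valid user mapping without the tagged attribute, e.g. guest memory
 * that the VMM has not mapped with PROT_MTE. */
static void model_check_examples(void)
{
	model_pte_t untagged_user = MODEL_PTE_VALID | MODEL_PTE_USER;

	assert(!sync_cond_old(untagged_user));
	assert(sync_cond_patch(untagged_user));
	assert(sync_cond_user_bit(untagged_user));
}
```

In this model the last two variants only differ for an entry that is present through the PROT_NONE bit while still carrying the user bit. Whether such an entry can actually occur is exactly the open question in this thread, so the model does not settle it.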

Steve

Catalin Marinas March 30, 2021, 10:13 a.m. UTC | #3
On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote:
> On 26/03/2021 18:56, Catalin Marinas wrote:
> > On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote:
> > > A KVM guest could store tags in a page even if the VMM hasn't mapped
> > > the page with PROT_MTE. So when restoring pages from swap we will
> > > need to check to see if there are any saved tags even if !pte_tagged().
> > > 
> > > However don't check pages which are !pte_valid_user() as these will
> > > not have been swapped out.
> > > 
> > > Signed-off-by: Steven Price <steven.price@arm.com>
> > > ---
> > >   arch/arm64/include/asm/pgtable.h |  2 +-
> > >   arch/arm64/kernel/mte.c          | 16 ++++++++++++----
> > >   2 files changed, 13 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index e17b96d0e4b5..84166625c989 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
> > >   		__sync_icache_dcache(pte);
> > >   	if (system_supports_mte() &&
> > > -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> > > +	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
> > >   		mte_sync_tags(ptep, pte);
> > 
> > With the EPAN patches queued in for-next/epan, pte_valid_user()
> > disappeared as its semantics weren't very clear.
> 
> Thanks for pointing that out.
> 
> > So this relies on the set_pte_at() being done on the VMM address space.
> > I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
> > it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
> > something like pte_present() && addr <= user_addr_max().
> 
> AFAIUI the stage 2 matches the VMM's address space (for the subset that has
> memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be
> invalidated and a subsequent fault would exit to the VMM to sort out. This
> sort of thing is done for the lazy migration use case (i.e. pages are
> fetched as the VM tries to access them).

There's also the protected KVM case which IIUC wouldn't provide any
mapping of the guest memory to the host (or maybe the host still thinks
it's there but cannot access it without a Stage 2 fault). At least in
this case it wouldn't swap pages out and it would be the responsibility
of the EL2 code to clear the tags when giving pages to the guest
(user_mem_abort() must not touch the page).

So basically we either have a valid, accessible mapping in the VMM and
we can handle the tags via set_pte_at() or we leave it to whatever is
running at EL2 in the pKVM case.

I don't remember whether we had a clear conclusion in the past: have we
ruled out requiring the VMM to map the guest memory with PROT_MTE
entirely? IIRC a potential problem was the VMM using MTE itself and
having to disable it when accessing the guest memory.

Another potential issue (I haven't got my head around it yet) is a race
in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until
after the tags had been restored. Can we have the same page mapped by
two ptes, each attempting to restore it from swap and one gets it first
and starts modifying it? Given that we set the actual pte after setting
PG_mte_tagged, it's probably alright but I think we miss some barriers.

Also, if a page is not a swap one, we currently clear the tags if mapped
as pte_tagged() (prior to this patch). We'd need something similar when
mapping it in the guest so that we don't leak tags but to avoid any page
ending up with PG_mte_tagged, I think you moved the tag clearing to
user_mem_abort() in the KVM code. I presume set_pte_at() in the VMM
would be called first and then set in Stage 2.

> > BTW, ignoring virtualisation, can we ever bring a page in from swap on a
> > PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
> > metadata around for when the pte becomes accessible but I suspect we
> > remove it if the page is removed from swap.
> 
> There are two stages of bringing data from swap. First is populating the
> swap cache by doing the physical read from swap. The second is actually
> restoring the page table entries.

When is the page metadata removed? I want to make sure we don't drop it
for some pte attributes.

Steven Price March 31, 2021, 10:09 a.m. UTC | #4
On 30/03/2021 11:13, Catalin Marinas wrote:
> On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote:
>> On 26/03/2021 18:56, Catalin Marinas wrote:
>>> On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote:
>>>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>>>> the page with PROT_MTE. So when restoring pages from swap we will
>>>> need to check to see if there are any saved tags even if !pte_tagged().
>>>>
>>>> However don't check pages which are !pte_valid_user() as these will
>>>> not have been swapped out.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>    arch/arm64/include/asm/pgtable.h |  2 +-
>>>>    arch/arm64/kernel/mte.c          | 16 ++++++++++++----
>>>>    2 files changed, 13 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index e17b96d0e4b5..84166625c989 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>>>    		__sync_icache_dcache(pte);
>>>>    	if (system_supports_mte() &&
>>>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>>>> +	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
>>>>    		mte_sync_tags(ptep, pte);
>>>
>>> With the EPAN patches queued in for-next/epan, pte_valid_user()
>>> disappeared as its semantics weren't very clear.
>>
>> Thanks for pointing that out.
>>
>>> So this relies on the set_pte_at() being done on the VMM address space.
>>> I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
>>> it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
>>> something like pte_present() && addr <= user_addr_max().
>>
>> AFAIUI the stage 2 matches the VMM's address space (for the subset that has
>> memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be
>> invalidated and a subsequent fault would exit to the VMM to sort out. This
>> sort of thing is done for the lazy migration use case (i.e. pages are
>> fetched as the VM tries to access them).
> 
> There's also the protected KVM case which IIUC wouldn't provide any
> mapping of the guest memory to the host (or maybe the host still thinks
> it's there but cannot access it without a Stage 2 fault). At least in
> this case it wouldn't swap pages out and it would be the responsibility
> of the EL2 code to clear the tags when giving pages to the guest
> (user_mem_abort() must not touch the page).
> 
> So basically we either have a valid, accessible mapping in the VMM and
> we can handle the tags via set_pte_at() or we leave it to whatever is
> running at EL2 in the pKVM case.

For the pKVM case it's up to the EL2 code to hand over suitably scrubbed 
pages to the guest, and the host doesn't have access to the pages so we 
(currently) don't have to worry about swap. If swap gets implemented it 
will presumably be up to the EL2 code to package up both the normal data 
and the MTE tags into an encrypted bundle for the host to stash somewhere.

> I don't remember whether we had a clear conclusion in the past: have we
> ruled out requiring the VMM to map the guest memory with PROT_MTE
> entirely? IIRC a potential problem was the VMM using MTE itself and
> having to disable it when accessing the guest memory.

Yes, there are some ugly corner cases if we require the VMM to map with 
PROT_MTE. Hence patch 5 - an ioctl to allow the VMM to access the tags 
without having to maintain a PROT_MTE mapping.

> Another potential issue (I haven't got my head around it yet) is a race
> in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until
> after the tags had been restored. Can we have the same page mapped by
> two ptes, each attempting to restore it from swap and one gets it first
> and starts modifying it? Given that we set the actual pte after setting
> PG_mte_tagged, it's probably alright but I think we miss some barriers.

I'm not sure if I've got my head round this one yet either, but you 
could be right that there's a race. This exists without these patches:

CPU 1                    | CPU 2
-------------------------+-----------------
set_pte_at()             |
--> mte_sync_tags()      |
--> test_and_set_bit()   |
--> mte_sync_page_tags() | set_pte_at()
    [stalls/sleeps]      | --> mte_sync_tags()
                         | --> test_and_set_bit()
                         |     [already set by CPU 1]
                         | set_pte()
                         | [sees stale tags]
    [eventually wakes up |
     and sets tags]      |

What I'm struggling to get my head around is whether there's always a 
sufficient lock held during the call to set_pte_at() to avoid the above. 
I suspect not because the two calls could be in completely separate 
processes.

We potentially could stick a lock_page()/unlock_page() sequence in 
mte_sync_tags(). I just ran a basic test and didn't hit problems with 
that. Any thoughts?
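
The idea behind the lock can be sketched in user space. This is a model under stated assumptions, not the kernel change: struct model_page, model_page_init() and model_sync_tags() are hypothetical names, a pthread mutex stands in for lock_page()/unlock_page(), and a plain bool stands in for PG_mte_tagged.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_TAG_BYTES 8

/* Hypothetical user-space stand-in for struct page. */
struct model_page {
	pthread_mutex_t lock;   /* plays the role of lock_page()/unlock_page() */
	bool mte_tagged;        /* plays the role of PG_mte_tagged */
	unsigned char tags[MODEL_TAG_BYTES];
	int restore_count;      /* number of times the tags were written */
};

static void model_page_init(struct model_page *page)
{
	assert(page != NULL);
	pthread_mutex_init(&page->lock, NULL);
	page->mte_tagged = false;
	memset(page->tags, 0, sizeof(page->tags));
	page->restore_count = 0;
}

/*
 * Check the flag, write the tags and set the flag inside one critical
 * section. A second caller either does the work itself or only gets past
 * the lock once the tags are complete, so it cannot observe the flag set
 * while the tags are still stale.
 */
static void model_sync_tags(struct model_page *page, const unsigned char *saved)
{
	assert(page != NULL && saved != NULL);
	pthread_mutex_lock(&page->lock);
	if (!page->mte_tagged) {
		memcpy(page->tags, saved, MODEL_TAG_BYTES);
		page->restore_count++;
		page->mte_tagged = true;
	}
	pthread_mutex_unlock(&page->lock);
}
```

Calling model_sync_tags() twice on the same page, from one thread or from two, writes the tags once, and the second caller returns only after they are in place. The model says nothing about whether taking the page lock is permitted in every context that reaches set_pte_at().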

> Also, if a page is not a swap one, we currently clear the tags if mapped
> as pte_tagged() (prior to this patch). We'd need something similar when
> mapping it in the guest so that we don't leak tags but to avoid any page
> ending up with PG_mte_tagged, I think you moved the tag clearing to
> user_mem_abort() in the KVM code. I presume set_pte_at() in the VMM
> would be called first and then set in Stage 2.

Yes - KVM will perform the equivalent of get_user_pages() before setting 
the entry in Stage 2, that should end up performing any set_pte_at() 
calls to populate the VMM's page tables. So the VMM 'sees' the memory 
before stage 2.

>>> BTW, ignoring virtualisation, can we ever bring a page in from swap on a
>>> PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
>>> metadata around for when the pte becomes accessible but I suspect we
>>> remove it if the page is removed from swap.
>>
>> There are two stages of bringing data from swap. First is populating the
>> swap cache by doing the physical read from swap. The second is actually
>> restoring the page table entries.
> 
> When is the page metadata removed? I want to make sure we don't drop it
> for some pte attributes.

The tag metadata for swapped pages lives for the same length of time as 
the swap metadata itself. The swap code already makes sure that the 
metadata hangs around as long as there are any swap PTEs in existence, 
so I think everything should be fine here. The 
arch_swap_invalidate_xxx() calls match up with the frontswap calls as it 
has the same lifetime requirements.
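
As a rough illustration of that lifetime rule, here is a user-space sketch of a tag store keyed by swap slot. The names (struct model_tag_store, model_save_tags() and so on), the fixed-size array and the tag buffer size are all assumptions made for the example; the kernel's own data structure and function signatures are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SLOTS     16
#define MODEL_TAG_BYTES 8

/* Hypothetical store: at most one saved tag buffer per swap slot. */
struct model_tag_store {
	bool used[MODEL_SLOTS];
	unsigned char tags[MODEL_SLOTS][MODEL_TAG_BYTES];
};

/* Swap-out: remember the page's tags under its swap slot. */
static bool model_save_tags(struct model_tag_store *s, size_t slot,
			    const unsigned char *tags)
{
	assert(s != NULL && tags != NULL);
	if (slot >= MODEL_SLOTS)
		return false;
	memcpy(s->tags[slot], tags, MODEL_TAG_BYTES);
	s->used[slot] = true;
	return true;
}

/*
 * Swap-in: copy the tags back if any were saved. The entry is kept, so
 * another swap PTE that refers to the same slot can still restore later.
 */
static bool model_restore_tags(const struct model_tag_store *s, size_t slot,
			       unsigned char *out)
{
	assert(s != NULL && out != NULL);
	if (slot >= MODEL_SLOTS || !s->used[slot])
		return false;
	memcpy(out, s->tags[slot], MODEL_TAG_BYTES);
	return true;
}

/* Slot freed: only now is the saved metadata dropped. */
static void model_invalidate_tags(struct model_tag_store *s, size_t slot)
{
	assert(s != NULL);
	if (slot < MODEL_SLOTS)
		s->used[slot] = false;
}
```

The point of the model is that restoring does not consume the entry: the saved tags disappear only when the slot itself is invalidated, which mirrors the "as long as there are any swap PTEs" rule described above.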

Steve

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..84166625c989 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@  static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		__sync_icache_dcache(pte);
 
 	if (system_supports_mte() &&
-	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
 		mte_sync_tags(ptep, pte);
 
 	__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..e016ab57ea36 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -26,17 +26,23 @@  u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+			       bool pte_is_tagged)
 {
 	pte_t old_pte = READ_ONCE(*ptep);
 
 	if (check_swap && is_swap_pte(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
+		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+			set_bit(PG_mte_tagged, &page->flags);
 			return;
+		}
 	}
 
+	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
+		return;
+
 	page_kasan_tag_reset(page);
 	/*
 	 * We need smp_wmb() in between setting the flags and clearing the
@@ -54,11 +60,13 @@  void mte_sync_tags(pte_t *ptep, pte_t pte)
 	struct page *page = pte_page(pte);
 	long i, nr_pages = compound_nr(page);
 	bool check_swap = nr_pages == 1;
+	bool pte_is_tagged = pte_tagged(pte);
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-			mte_sync_page_tags(page, ptep, check_swap);
+		if (!test_bit(PG_mte_tagged, &page->flags))
+			mte_sync_page_tags(page, ptep, check_swap,
+					   pte_is_tagged);
 	}
 }
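
The decision flow that the two hunks above give mte_sync_tags() and mte_sync_page_tags() can be summarised in a single-threaded user-space model. This is a sketch, not the kernel code: the flow_* names are hypothetical, has_saved_tags folds together the check_swap, is_swap_pte(), non_swap_entry() and mte_restore_tags() tests, and the atomic bit operations are replaced by a plain bool, so concurrency is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the model did to the page's tags. */
enum flow_action { FLOW_NOTHING, FLOW_RESTORED, FLOW_CLEARED };

/* Hypothetical stand-in for struct page; the bool models PG_mte_tagged. */
struct flow_page {
	bool mte_tagged;
};

static enum flow_action flow_sync_page_tags(struct flow_page *page,
					    bool has_saved_tags,
					    bool pte_is_tagged)
{
	assert(page != NULL);

	/* Caller side: pages already flagged are skipped entirely. */
	if (page->mte_tagged)
		return FLOW_NOTHING;

	/* Saved tags are restored even when the new PTE is untagged. */
	if (has_saved_tags) {
		page->mte_tagged = true;
		return FLOW_RESTORED;
	}

	/* No saved tags: an untagged PTE leaves the page unflagged. */
	if (!pte_is_tagged)
		return FLOW_NOTHING;

	/* Tagged PTE on a fresh page: flag it and clear the tags. */
	page->mte_tagged = true;
	return FLOW_CLEARED;
}
```

In this model an untagged mapping of a page with saved tags gets them restored and the page flagged, an untagged mapping without saved tags is left alone, and a tagged mapping without saved tags has its tags cleared.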