diff mbox series

KVM: arm64: permit MAP_SHARED mappings with MTE enabled

Message ID 20220623234944.141869-1-pcc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: arm64: permit MAP_SHARED mappings with MTE enabled | expand

Commit Message

Peter Collingbourne June 23, 2022, 11:49 p.m. UTC
Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
depend on being able to map guest memory as MAP_SHARED. The current
restriction on sharing MAP_SHARED pages with the guest is preventing
the use of those features with MTE. Therefore, remove this restriction.

To avoid races between multiple tasks attempting to clear tags on the
same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
atomically before beginning to clear tags on a page. If the flag was not
initially set, spin until the other task has finished clearing the tags.

Link: https://linux-review.googlesource.com/id/I05a57f546f2b4d9b008a08f5f03e9ba3c950fdcc
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 arch/arm64/include/asm/mte.h   |  5 +++++
 arch/arm64/kernel/mte.c        | 13 +++++++++++-
 arch/arm64/kvm/mmu.c           | 36 +++++++---------------------------
 include/linux/page-flags.h     |  1 +
 include/trace/events/mmflags.h |  7 ++++---
 5 files changed, 29 insertions(+), 33 deletions(-)

Comments

Catalin Marinas June 24, 2022, 5:05 p.m. UTC | #1
+ Steven as he added the KVM and swap support for MTE.

On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> depend on being able to map guest memory as MAP_SHARED. The current
> restriction on sharing MAP_SHARED pages with the guest is preventing
> the use of those features with MTE. Therefore, remove this restriction.

We already have some corner cases where the PG_mte_tagged logic fails
even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
KVM MAP_SHARED will potentially make things worse (or hard to reason
about; for example the VMM sets PROT_MTE as well). I'm more inclined to
get rid of PG_mte_tagged altogether, always zero (or restore) the tags
on user page allocation, copy them on write. For swap we can scan and if
all tags are 0 and just skip saving them.

Another aspect is a change in the KVM ABI with this patch. It's probably
not that bad since it's rather a relaxation but it has the potential to
confuse the VMM, especially as it doesn't know whether it's running on
older kernels or not (it would have to probe unless we expose this info
to the VMM in some other way).

> To avoid races between multiple tasks attempting to clear tags on the
> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
> atomically before beginning to clear tags on a page. If the flag was not
> initially set, spin until the other task has finished clearing the tags.

TBH, I can't mentally model all the corner cases, so maybe a formal
model would help (I can have a go with TLA+, though not sure when I find
a bit of time this summer). If we get rid of PG_mte_tagged altogether,
this would simplify things (hopefully).

As you noticed, the problem is that setting PG_mte_tagged and clearing
(or restoring) the tags is not an atomic operation. There are places
like mprotect() + CoW where one task can end up with stale tags. Another
is shared memfd mappings if more than one mapping sets PROT_MTE and
there's the swap restoring on top.

> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f6b00743c399..8f9655053a9f 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>  	 * the new page->flags are visible before the tags were updated.
>  	 */
>  	smp_wmb();
> -	mte_clear_page_tags(page_address(page));
> +	mte_ensure_page_tags_cleared(page);
> +}
> +
> +void mte_ensure_page_tags_cleared(struct page *page)
> +{
> +	if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
> +		while (!test_bit(PG_mte_tagged, &page->flags))
> +			;
> +	} else {
> +		mte_clear_page_tags(page_address(page));
> +		set_bit(PG_mte_tagged, &page->flags);
> +	}
>  }

mte_sync_tags() already sets PG_mte_tagged prior to clearing the page
tags. The reason was so that multiple concurrent set_pte_at() would not
all rush to clear (or restore) the tags. But we do have the risk of one
thread accessing the page with the stale tags (copy_user_highpage() is
worse as the tags would be wrong in the destination page). I'd rather be
consistent everywhere with how we set the flags.

However, I find it easier to reason about if we used the new flag as a
lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not
set, take the PG_mte_locked flag, check PG_mte_tagged again and
clear/restore the tags followed by PG_mte_tagged (and you can use
test_and_set_bit_lock() for the acquire semantics).

It would be interesting to benchmark the cost of always zeroing the tags
on allocation and copy when MTE is not in use:

diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index 0dea80bf6de4..d31708886bf9 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from)
 
 	copy_page(kto, kfrom);
 
-	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
+	if (system_supports_mte()) {
 		set_bit(PG_mte_tagged, &to->flags);
 		page_kasan_tag_reset(to);
 		/*
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c5e11768e5c1..b42cad9b9349 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
 {
 	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
 
-	/*
-	 * If the page is mapped with PROT_MTE, initialise the tags at the
-	 * point of allocation and page zeroing as this is usually faster than
-	 * separate DC ZVA and STGM.
-	 */
-	if (vma->vm_flags & VM_MTE)
+	if (system_supports_mte())
 		flags |= __GFP_ZEROTAGS;
 
 	return alloc_page_vma(flags, vma, vaddr);

If that's negligible, we can hopefully get rid of PG_mte_tagged. For
swap we could move the restoring to arch_do_swap_page() (but move the
call one line above set_pte_at() in do_swap_page()).
Peter Collingbourne June 24, 2022, 9:50 p.m. UTC | #2
On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> + Steven as he added the KVM and swap support for MTE.
>
> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > depend on being able to map guest memory as MAP_SHARED. The current
> > restriction on sharing MAP_SHARED pages with the guest is preventing
> > the use of those features with MTE. Therefore, remove this restriction.
>
> We already have some corner cases where the PG_mte_tagged logic fails
> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> KVM MAP_SHARED will potentially make things worse (or hard to reason
> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> on user page allocation, copy them on write. For swap we can scan and if
> all tags are 0 and just skip saving them.

A problem with this approach is that it would conflict with any
potential future changes that we might make that would require the
kernel to avoid modifying the tags for non-PROT_MTE pages.

Thinking about this some more, another idea that I had was to only
allow MAP_SHARED mappings in a guest with MTE enabled if the mapping
is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous
mappings I don't think it's possible to create a non-PROT_MTE alias in
another mm (since you can't turn off PROT_MTE with mprotect), and for
memfd maybe we could introduce a flag that requires PROT_MTE on all
mappings. That way, we are guaranteed that either the page has been
tagged prior to fault or we have exclusive access to it so it can be
tagged on demand without racing. Let me see what effect that has on
crosvm.

> Another aspect is a change in the KVM ABI with this patch. It's probably
> not that bad since it's rather a relaxation but it has the potential to
> confuse the VMM, especially as it doesn't know whether it's running on
> older kernels or not (it would have to probe unless we expose this info
> to the VMM in some other way).
>
> > To avoid races between multiple tasks attempting to clear tags on the
> > same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
> > atomically before beginning to clear tags on a page. If the flag was not
> > initially set, spin until the other task has finished clearing the tags.
>
> TBH, I can't mentally model all the corner cases, so maybe a formal
> model would help (I can have a go with TLA+, though not sure when I find
> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
> this would simplify things (hopefully).
>
> As you noticed, the problem is that setting PG_mte_tagged and clearing
> (or restoring) the tags is not an atomic operation. There are places
> like mprotect() + CoW where one task can end up with stale tags. Another
> is shared memfd mappings if more than one mapping sets PROT_MTE and
> there's the swap restoring on top.
>
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index f6b00743c399..8f9655053a9f 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
> >        * the new page->flags are visible before the tags were updated.
> >        */
> >       smp_wmb();
> > -     mte_clear_page_tags(page_address(page));
> > +     mte_ensure_page_tags_cleared(page);
> > +}
> > +
> > +void mte_ensure_page_tags_cleared(struct page *page)
> > +{
> > +     if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
> > +             while (!test_bit(PG_mte_tagged, &page->flags))
> > +                     ;
> > +     } else {
> > +             mte_clear_page_tags(page_address(page));
> > +             set_bit(PG_mte_tagged, &page->flags);
> > +     }
> >  }
>
> mte_sync_tags() already sets PG_mte_tagged prior to clearing the page
> tags. The reason was so that multiple concurrent set_pte_at() would not
> all rush to clear (or restore) the tags. But we do have the risk of one
> thread accessing the page with the stale tags (copy_user_highpage() is
> worse as the tags would be wrong in the destination page). I'd rather be
> consistent everywhere with how we set the flags.
>
> However, I find it easier to reason about if we used the new flag as a
> lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not
> set, take the PG_mte_locked flag, check PG_mte_tagged again and
> clear/restore the tags followed by PG_mte_tagged (and you can use
> test_and_set_bit_lock() for the acquire semantics).

Okay, I can look at doing it that way as well.

> It would be interesting to benchmark the cost of always zeroing the tags
> on allocation and copy when MTE is not in use:
>
> diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> index 0dea80bf6de4..d31708886bf9 100644
> --- a/arch/arm64/mm/copypage.c
> +++ b/arch/arm64/mm/copypage.c
> @@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from)
>
>         copy_page(kto, kfrom);
>
> -       if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
> +       if (system_supports_mte()) {
>                 set_bit(PG_mte_tagged, &to->flags);
>                 page_kasan_tag_reset(to);
>                 /*
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c5e11768e5c1..b42cad9b9349 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>  {
>         gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
>
> -       /*
> -        * If the page is mapped with PROT_MTE, initialise the tags at the
> -        * point of allocation and page zeroing as this is usually faster than
> -        * separate DC ZVA and STGM.
> -        */
> -       if (vma->vm_flags & VM_MTE)
> +       if (system_supports_mte())
>                 flags |= __GFP_ZEROTAGS;
>
>         return alloc_page_vma(flags, vma, vaddr);
>
> If that's negligible, we can hopefully get rid of PG_mte_tagged. For
> swap we could move the restoring to arch_do_swap_page() (but move the
> call one line above set_pte_at() in do_swap_page()).

I could experiment with this but I'm hesistant given the potential to
cut off potential future changes.

Peter
Steven Price June 25, 2022, 8:14 a.m. UTC | #3
On 24/06/2022 18:05, Catalin Marinas wrote:
> + Steven as he added the KVM and swap support for MTE.
> 
> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>> depend on being able to map guest memory as MAP_SHARED. The current
>> restriction on sharing MAP_SHARED pages with the guest is preventing
>> the use of those features with MTE. Therefore, remove this restriction.
> 
> We already have some corner cases where the PG_mte_tagged logic fails
> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> KVM MAP_SHARED will potentially make things worse (or hard to reason
> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> on user page allocation, copy them on write. For swap we can scan and if
> all tags are 0 and just skip saving them.
> 
> Another aspect is a change in the KVM ABI with this patch. It's probably
> not that bad since it's rather a relaxation but it has the potential to
> confuse the VMM, especially as it doesn't know whether it's running on
> older kernels or not (it would have to probe unless we expose this info
> to the VMM in some other way).
> 
>> To avoid races between multiple tasks attempting to clear tags on the
>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
>> atomically before beginning to clear tags on a page. If the flag was not
>> initially set, spin until the other task has finished clearing the tags.
> 
> TBH, I can't mentally model all the corner cases, so maybe a formal
> model would help (I can have a go with TLA+, though not sure when I find
> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
> this would simplify things (hopefully).
> 
> As you noticed, the problem is that setting PG_mte_tagged and clearing
> (or restoring) the tags is not an atomic operation. There are places
> like mprotect() + CoW where one task can end up with stale tags. Another
> is shared memfd mappings if more than one mapping sets PROT_MTE and
> there's the swap restoring on top.
> 
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index f6b00743c399..8f9655053a9f 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>  	 * the new page->flags are visible before the tags were updated.
>>  	 */
>>  	smp_wmb();
>> -	mte_clear_page_tags(page_address(page));
>> +	mte_ensure_page_tags_cleared(page);
>> +}
>> +
>> +void mte_ensure_page_tags_cleared(struct page *page)
>> +{
>> +	if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
>> +		while (!test_bit(PG_mte_tagged, &page->flags))
>> +			;
>> +	} else {
>> +		mte_clear_page_tags(page_address(page));
>> +		set_bit(PG_mte_tagged, &page->flags);
>> +	}

I'm pretty sure we need some form of barrier in here to ensure the tag
clearing is visible to the other CPU. Otherwise I can't immediately see
any problems with the approach of a second flag (it was something I had
considered). But I do also think we should seriously consider Catalin's
approach of simply zeroing tags unconditionally - it would certainly
simplify the code.

The main issue with unconditionally zeroing is if you then want to
(ab)use the tag memory carveout as extra memory for regions that are not
being used for tags. Personally it seems pretty crazy (a whole lot of
extra complexity for a small gain in ram), and I'm not convinced that
memory is sufficiently moveable for it to reliably work.

>>  }
> 
> mte_sync_tags() already sets PG_mte_tagged prior to clearing the page
> tags. The reason was so that multiple concurrent set_pte_at() would not
> all rush to clear (or restore) the tags. But we do have the risk of one
> thread accessing the page with the stale tags (copy_user_highpage() is
> worse as the tags would be wrong in the destination page). I'd rather be
> consistent everywhere with how we set the flags.
> 
> However, I find it easier to reason about if we used the new flag as a
> lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not
> set, take the PG_mte_locked flag, check PG_mte_tagged again and
> clear/restore the tags followed by PG_mte_tagged (and you can use
> test_and_set_bit_lock() for the acquire semantics).

I agree - when I considered this before it was using the second flag as
a lock.

Steve

> 
> It would be interesting to benchmark the cost of always zeroing the tags
> on allocation and copy when MTE is not in use:
> 
> diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> index 0dea80bf6de4..d31708886bf9 100644
> --- a/arch/arm64/mm/copypage.c
> +++ b/arch/arm64/mm/copypage.c
> @@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from)
>  
>  	copy_page(kto, kfrom);
>  
> -	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
> +	if (system_supports_mte()) {
>  		set_bit(PG_mte_tagged, &to->flags);
>  		page_kasan_tag_reset(to);
>  		/*
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c5e11768e5c1..b42cad9b9349 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>  {
>  	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
>  
> -	/*
> -	 * If the page is mapped with PROT_MTE, initialise the tags at the
> -	 * point of allocation and page zeroing as this is usually faster than
> -	 * separate DC ZVA and STGM.
> -	 */
> -	if (vma->vm_flags & VM_MTE)
> +	if (system_supports_mte())
>  		flags |= __GFP_ZEROTAGS;
>  
>  	return alloc_page_vma(flags, vma, vaddr);
> 
> If that's negligible, we can hopefully get rid of PG_mte_tagged. For
> swap we could move the restoring to arch_do_swap_page() (but move the
> call one line above set_pte_at() in do_swap_page()).
>
Catalin Marinas June 27, 2022, 11:43 a.m. UTC | #4
On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > + Steven as he added the KVM and swap support for MTE.
> >
> > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > depend on being able to map guest memory as MAP_SHARED. The current
> > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > the use of those features with MTE. Therefore, remove this restriction.
> >
> > We already have some corner cases where the PG_mte_tagged logic fails
> > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > on user page allocation, copy them on write. For swap we can scan and if
> > all tags are 0 and just skip saving them.
> 
> A problem with this approach is that it would conflict with any
> potential future changes that we might make that would require the
> kernel to avoid modifying the tags for non-PROT_MTE pages.

Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
vma available where it matters. We can keep PG_mte_tagged around but
always set it on page allocation (e.g. when zeroing or CoW) and check
VM_MTE_ALLOWED rather than VM_MTE.

I'm not sure how Linux can deal with pages that do not support MTE.
Currently we only handle this at the vm_flags level. Assuming that we
somehow manage to, we can still use PG_mte_tagged to mark the pages that
supported tags on allocation (and they have been zeroed or copied). I
guess if you want to move a page around, you'd need to go through
something like try_to_unmap() (or set all mappings to PROT_NONE like in
NUMA migration). Then you can either check whether the page is PROT_MTE
anywhere and maybe read the tags to see whether all are zero after
unmapping.

Deferring tag zeroing/restoring to set_pte_at() can be racy without a
lock (or your approach with another flag) but I'm not sure it's worth
the extra logic if zeroing or copying the tags doesn't have a
significant overhead for non-PROT_MTE pages.

Another issue with set_pte_at() is that it can write tags on mprotect()
even if the page is mapped read-only. So far I couldn't find a problem
with this but it adds to the complexity.

> Thinking about this some more, another idea that I had was to only
> allow MAP_SHARED mappings in a guest with MTE enabled if the mapping
> is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous
> mappings I don't think it's possible to create a non-PROT_MTE alias in
> another mm (since you can't turn off PROT_MTE with mprotect), and for
> memfd maybe we could introduce a flag that requires PROT_MTE on all
> mappings. That way, we are guaranteed that either the page has been
> tagged prior to fault or we have exclusive access to it so it can be
> tagged on demand without racing. Let me see what effect that has on
> crosvm.

You could still have all initial shared mappings as !PROT_MTE and some
mprotect() afterwards setting PG_mte_tagged and clearing the tags and
this can race. AFAICT, the easiest way to avoid the race is to set
PG_mte_tagged on allocation before it ends up in set_pte_at().
Cornelia Huck June 27, 2022, 3:55 p.m. UTC | #5
[I'm still in the process of trying to grok the issues surrounding
MTE+KVM, so apologies in advance if I'm muddying the waters]

On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:

> On 24/06/2022 18:05, Catalin Marinas wrote:
>> + Steven as he added the KVM and swap support for MTE.
>> 
>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>> depend on being able to map guest memory as MAP_SHARED. The current
>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>> the use of those features with MTE. Therefore, remove this restriction.
>> 
>> We already have some corner cases where the PG_mte_tagged logic fails
>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>> on user page allocation, copy them on write. For swap we can scan and if
>> all tags are 0 and just skip saving them.
>> 
>> Another aspect is a change in the KVM ABI with this patch. It's probably
>> not that bad since it's rather a relaxation but it has the potential to
>> confuse the VMM, especially as it doesn't know whether it's running on
>> older kernels or not (it would have to probe unless we expose this info
>> to the VMM in some other way).

Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)

>> 
>>> To avoid races between multiple tasks attempting to clear tags on the
>>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
>>> atomically before beginning to clear tags on a page. If the flag was not
>>> initially set, spin until the other task has finished clearing the tags.
>> 
>> TBH, I can't mentally model all the corner cases, so maybe a formal
>> model would help (I can have a go with TLA+, though not sure when I find
>> a bit of time this summer). If we get rid of PG_mte_tagged altogether,
>> this would simplify things (hopefully).
>> 
>> As you noticed, the problem is that setting PG_mte_tagged and clearing
>> (or restoring) the tags is not an atomic operation. There are places
>> like mprotect() + CoW where one task can end up with stale tags. Another
>> is shared memfd mappings if more than one mapping sets PROT_MTE and
>> there's the swap restoring on top.
>> 
>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>> index f6b00743c399..8f9655053a9f 100644
>>> --- a/arch/arm64/kernel/mte.c
>>> +++ b/arch/arm64/kernel/mte.c
>>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>>  	 * the new page->flags are visible before the tags were updated.
>>>  	 */
>>>  	smp_wmb();
>>> -	mte_clear_page_tags(page_address(page));
>>> +	mte_ensure_page_tags_cleared(page);
>>> +}
>>> +
>>> +void mte_ensure_page_tags_cleared(struct page *page)
>>> +{
>>> +	if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
>>> +		while (!test_bit(PG_mte_tagged, &page->flags))
>>> +			;
>>> +	} else {
>>> +		mte_clear_page_tags(page_address(page));
>>> +		set_bit(PG_mte_tagged, &page->flags);
>>> +	}
>
> I'm pretty sure we need some form of barrier in here to ensure the tag
> clearing is visible to the other CPU. Otherwise I can't immediately see
> any problems with the approach of a second flag (it was something I had
> considered). But I do also think we should seriously consider Catalin's
> approach of simply zeroing tags unconditionally - it would certainly
> simplify the code.

What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
up copying zeroes?

That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
will be called? I.e. when implementing migration, it should be ok to
call it while the vm is paused, but you probably won't get a consistent
state while the vm is running?

[Postcopy needs a different interface, I guess, so that the migration
target can atomically place a received page and its metadata. I see
https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
has there been any follow-up?]
Peter Collingbourne June 27, 2022, 6:16 p.m. UTC | #6
On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > + Steven as he added the KVM and swap support for MTE.
> > >
> > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > > depend on being able to map guest memory as MAP_SHARED. The current
> > > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > > the use of those features with MTE. Therefore, remove this restriction.
> > >
> > > We already have some corner cases where the PG_mte_tagged logic fails
> > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > > on user page allocation, copy them on write. For swap we can scan and if
> > > all tags are 0 and just skip saving them.
> >
> > A problem with this approach is that it would conflict with any
> > potential future changes that we might make that would require the
> > kernel to avoid modifying the tags for non-PROT_MTE pages.
>
> Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> vma available where it matters. We can keep PG_mte_tagged around but
> always set it on page allocation (e.g. when zeroing or CoW) and check
> VM_MTE_ALLOWED rather than VM_MTE.

Right, but for avoiding tagging we would like that to apply to as many
pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
those processes that are not using MTE would not be covered, which on
a mostly non-MTE system would be a majority of pages.

Over the weekend I thought of another policy, which would be similar
to your original one. We can always tag pages which are mapped as
MAP_SHARED. These pages are much less common than private pages, so
the impact would be less. So the if statement in
alloc_zeroed_user_highpage_movable would become:

if ((vma->vm_flags & VM_MTE) || (system_supports_mte() &&
(vma->vm_flags & VM_SHARED)))

That would allow us to put basically any shared mapping in the guest
address space without needing to deal with races in sanitise_mte_tags.

We may consider going further than this and require all pages mapped
into guests with MTE enabled to be PROT_MTE. I think it would allow
dropping sanitise_mte_tags entirely. This would not be a relaxation of
the ABI but perhaps we can get away with it if, as Cornelia mentioned,
QEMU does not currently support MTE, and since crosvm doesn't
currently support it either there's no userspace to break AFAIK. This
would also address a current weirdness in the API where it is possible
for the underlying pages of a MAP_SHARED file mapping to become tagged
via KVM, said tags are exposed to the guest and are discarded when the
underlying page is paged out. We can perhaps accomplish it by dropping
support for KVM_CAP_ARM_MTE in the kernel and introducing something
like a KVM_CAP_ARM_MTE_V2 with the new restriction.

> I'm not sure how Linux can deal with pages that do not support MTE.
> Currently we only handle this at the vm_flags level. Assuming that we
> somehow manage to, we can still use PG_mte_tagged to mark the pages that
> supported tags on allocation (and they have been zeroed or copied). I
> guess if you want to move a page around, you'd need to go through
> something like try_to_unmap() (or set all mappings to PROT_NONE like in
> NUMA migration). Then you can either check whether the page is PROT_MTE
> anywhere and maybe read the tags to see whether all are zero after
> unmapping.
>
> Deferring tag zeroing/restoring to set_pte_at() can be racy without a
> lock (or your approach with another flag) but I'm not sure it's worth
> the extra logic if zeroing or copying the tags doesn't have a
> significant overhead for non-PROT_MTE pages.
>
> Another issue with set_pte_at() is that it can write tags on mprotect()
> even if the page is mapped read-only. So far I couldn't find a problem
> with this but it adds to the complexity.
>
> > Thinking about this some more, another idea that I had was to only
> > allow MAP_SHARED mappings in a guest with MTE enabled if the mapping
> > is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous
> > mappings I don't think it's possible to create a non-PROT_MTE alias in
> > another mm (since you can't turn off PROT_MTE with mprotect), and for
> > memfd maybe we could introduce a flag that requires PROT_MTE on all
> > mappings. That way, we are guaranteed that either the page has been
> > tagged prior to fault or we have exclusive access to it so it can be
> > tagged on demand without racing. Let me see what effect that has on
> > crosvm.
>
> You could still have all initial shared mappings as !PROT_MTE and some
> mprotect() afterwards setting PG_mte_tagged and clearing the tags and
> this can race. AFAICT, the easiest way to avoid the race is to set
> PG_mte_tagged on allocation before it ends up in set_pte_at().

For shared pages we will be safe with my new proposal because the
pages will always be tagged. For private ones we might need to move
pages to an MTE supported region in response to the mprotect, as you
discussed above.

Peter
Catalin Marinas June 28, 2022, 5:57 p.m. UTC | #7
On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote:
> On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > > > + Steven as he added the KVM and swap support for MTE.
> > > >
> > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > > > depend on being able to map guest memory as MAP_SHARED. The current
> > > > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > > > the use of those features with MTE. Therefore, remove this restriction.
> > > >
> > > > We already have some corner cases where the PG_mte_tagged logic fails
> > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > > > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > > > on user page allocation, copy them on write. For swap we can scan and if
> > > > all tags are 0 and just skip saving them.
> > >
> > > A problem with this approach is that it would conflict with any
> > > potential future changes that we might make that would require the
> > > kernel to avoid modifying the tags for non-PROT_MTE pages.
> >
> > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> > vma available where it matters. We can keep PG_mte_tagged around but
> > always set it on page allocation (e.g. when zeroing or CoW) and check
> > VM_MTE_ALLOWED rather than VM_MTE.
> 
> Right, but for avoiding tagging we would like that to apply to as many
> pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
> those processes that are not using MTE would not be covered, which on
> a mostly non-MTE system would be a majority of pages.

By non-MTE system, I guess you mean a system that supports MTE but most
of the user apps don't use it. That's why it would be interesting to see
the effect of using DC GZVA instead of DC ZVA for page zeroing.

I suspect on Android you'd notice the fork() penalty a bit more with all
the copy-on-write having to copy tags. But we can't tell until we do
some benchmarks. If the penalty is indeed significant, we'll go back to
assessing the races here.

Another thing that won't happen for PG_mte_tagged currently is KSM page
merging. I had a patch to allow comparing the tags but eventually
dropped it (can dig it out).

> Over the weekend I thought of another policy, which would be similar
> to your original one. We can always tag pages which are mapped as
> MAP_SHARED. These pages are much less common than private pages, so
> the impact would be less. So the if statement in
> alloc_zeroed_user_highpage_movable would become:
> 
> if ((vma->vm_flags & VM_MTE) || (system_supports_mte() &&
>				   (vma->vm_flags & VM_SHARED)))
> 
> That would allow us to put basically any shared mapping in the guest
> address space without needing to deal with races in sanitise_mte_tags.

It's not just about VM_SHARED. A page can be effectively shared as a
result of a fork(). It is read-only in all processes but still shared
and one task may call mprotect(PROT_MTE).

Another case of sharing is between the VMM and the guest though I think
an mprotect() in the VMM would trigger the unmapping of the guest
address and pages mapped into guests already have PG_mte_tagged set.

We probably need to draw a state machine of all the cases. AFAICT, we
need to take into account a few of the below (it's probably incomplete;
I've been with Steven through most of them IIRC):

1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent
   mprotect(PROT_MTE). That's one of the things I dislike is that a late
   tag clearing via set_pte_at() can happen without breaking the CoW
   mapping. It's a bit counter-intuitive if you treat the tags as data
   (rather than some cache), you don't expect a read-only page to have
   some (tag) updated.

2. Shared mappings with concurrent mprotect(PROT_MTE).

3. Shared mapping restoring from swap.

4. Private mapping restoring from swap into CoW mapping.

5. KVM faults.

6. Concurrent ptrace accesses (or KVM tag copying)

What currently risks failing I think is breaking a CoW mapping with
concurrent mprotect(PROT_MTE) - we set PG_mte_tagged before zeroing the
tags. A concurrent copy may read stale tags. In sanitise_mte_tags() we
do this the other way around - clear tags first and then set the flag.

I think using another bit as a lock may solve most (all) of these but
another option is to treat the tags as data and make sure they are set
before mapping.

> We may consider going further than this and require all pages mapped
> into guests with MTE enabled to be PROT_MTE.

We discussed this when upstreaming KVM support and the idea got pushed
back. The main problem is that the VMM may use MTE for itself but can no
longer access the guest memory without the risk of taking a fault. We
don't have a match-all tag in user space and we can't teach the VMM to
use the PSTATE.TCO bit since driver emulation can be fairly generic.
And, of course, there's also the ABI change now.

> I think it would allow
> dropping sanitise_mte_tags entirely. This would not be a relaxation of
> the ABI but perhaps we can get away with it if, as Cornelia mentioned,
> QEMU does not currently support MTE, and since crosvm doesn't
> currently support it either there's no userspace to break AFAIK. This
> would also address a current weirdness in the API where it is possible
> for the underlying pages of a MAP_SHARED file mapping to become tagged
> via KVM, said tags are exposed to the guest and are discarded when the
> underlying page is paged out.

Ah, good point, shared file mappings is another reason we did not allow
MAP_SHARED and MTE for guest memory.

BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED
irrespective of whether we allow MAP_SHARED or not.

> We can perhaps accomplish it by dropping
> support for KVM_CAP_ARM_MTE in the kernel and introducing something
> like a KVM_CAP_ARM_MTE_V2 with the new restriction.

That's an option for the ABI upgrade but we still need to solve the
potential races.
Peter Collingbourne June 28, 2022, 6:54 p.m. UTC | #8
On Tue, Jun 28, 2022 at 10:58 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote:
> > On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
> > > > <catalin.marinas@arm.com> wrote:
> > > > > + Steven as he added the KVM and swap support for MTE.
> > > > >
> > > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > > > > depend on being able to map guest memory as MAP_SHARED. The current
> > > > > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > > > > the use of those features with MTE. Therefore, remove this restriction.
> > > > >
> > > > > We already have some corner cases where the PG_mte_tagged logic fails
> > > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > > > > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > > > > on user page allocation, copy them on write. For swap we can scan and if
> > > > > all tags are 0 and just skip saving them.
> > > >
> > > > A problem with this approach is that it would conflict with any
> > > > potential future changes that we might make that would require the
> > > > kernel to avoid modifying the tags for non-PROT_MTE pages.
> > >
> > > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> > > vma available where it matters. We can keep PG_mte_tagged around but
> > > always set it on page allocation (e.g. when zeroing or CoW) and check
> > > VM_MTE_ALLOWED rather than VM_MTE.
> >
> > Right, but for avoiding tagging we would like that to apply to as many
> > pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
> > those processes that are not using MTE would not be covered, which on
> > a mostly non-MTE system would be a majority of pages.
>
> By non-MTE system, I guess you mean a system that supports MTE but most
> of the user apps don't use it.

Yes, that's what I meant.

> That's why it would be interesting to see
> the effect of using DC GZVA instead of DC ZVA for page zeroing.
>
> I suspect on Android you'd notice the fork() penalty a bit more with all
> the copy-on-write having to copy tags. But we can't tell until we do
> some benchmarks. If the penalty is indeed significant, we'll go back to
> assessing the races here.

Okay, I can try to measure it. I do feel rather strongly though that
we should try to avoid tagging pages as much as possible even ignoring
the potential performance implications.

Here's one more idea: we can tag pages eagerly as you propose, but
introduce an opt-out. For example, we introduce a MAP_NOMTE flag,
which would prevent tag initialization as well as causing any future
attempt to mprotect(PROT_MTE) to fail. Allocators that know that the
memory will not be used for MTE in the future can set this flag. For
example, Scudo can start setting this flag once MTE has been disabled
as it has no mechanism for turning MTE back on once disabled. And that
way we will end up with no tags on the heap in the processes with MTE
disabled. Mappings with MAP_NOMTE would not be permitted in the guest
memory space of MTE enabled guests. For executables mapped by the
kernel we may consider adding a bit to the ELF program headers to
enable MAP_NOMTE.

> Another thing that won't happen for PG_mte_tagged currently is KSM page
> merging. I had a patch to allow comparing the tags but eventually
> dropped it (can dig it out).
>
> > Over the weekend I thought of another policy, which would be similar
> > to your original one. We can always tag pages which are mapped as
> > MAP_SHARED. These pages are much less common than private pages, so
> > the impact would be less. So the if statement in
> > alloc_zeroed_user_highpage_movable would become:
> >
> > if ((vma->vm_flags & VM_MTE) || (system_supports_mte() &&
> >                                  (vma->vm_flags & VM_SHARED)))
> >
> > That would allow us to put basically any shared mapping in the guest
> > address space without needing to deal with races in sanitise_mte_tags.
>
> It's not just about VM_SHARED. A page can be effectively shared as a
> result of a fork(). It is read-only in all processes but still shared
> and one task may call mprotect(PROT_MTE).

I see, so VM_SHARED isn't that useful for this then.

> Another case of sharing is between the VMM and the guest though I think
> an mprotect() in the VMM would trigger the unmapping of the guest
> address and pages mapped into guests already have PG_mte_tagged set.
>
> We probably need to draw a state machine of all the cases. AFAICT, we
> need to take into account a few of the below (it's probably incomplete;
> I've been with Steven through most of them IIRC):
>
> 1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent
>    mprotect(PROT_MTE). That's one of the things I dislike is that a late
>    tag clearing via set_pte_at() can happen without breaking the CoW
>    mapping. It's a bit counter-intuitive if you treat the tags as data
>    (rather than some cache), you don't expect a read-only page to have
>    some (tag) updated.
>
> 2. Shared mappings with concurrent mprotect(PROT_MTE).
>
> 3. Shared mapping restoring from swap.
>
> 4. Private mapping restoring from swap into CoW mapping.
>
> 5. KVM faults.
>
> 6. Concurrent ptrace accesses (or KVM tag copying)
>
> What currently risks failing I think is breaking a CoW mapping with
> concurrent mprotect(PROT_MTE) - we set PG_mte_tagged before zeroing the
> tags. A concurrent copy may read stale tags. In sanitise_mte_tags() we
> do this the other way around - clear tags first and then set the flag.
>
> I think using another bit as a lock may solve most (all) of these but
> another option is to treat the tags as data and make sure they are set
> before mapping.
>
> > We may consider going further than this and require all pages mapped
> > into guests with MTE enabled to be PROT_MTE.
>
> We discussed this when upstreaming KVM support and the idea got pushed
> back. The main problem is that the VMM may use MTE for itself but can no
> longer access the guest memory without the risk of taking a fault. We
> don't have a match-all tag in user space and we can't teach the VMM to
> use the PSTATE.TCO bit since driver emulation can be fairly generic.
> And, of course, there's also the ABI change now.

It sounds like MAP_NOMTE would solve this problem. Since tag
initialization becomes independent of the memory access attributes, we
don't have a risk of tag check faults in the VMM.

> > I think it would allow
> > dropping sanitise_mte_tags entirely. This would not be a relaxation of
> > the ABI but perhaps we can get away with it if, as Cornelia mentioned,
> > QEMU does not currently support MTE, and since crosvm doesn't
> > currently support it either there's no userspace to break AFAIK. This
> > would also address a current weirdness in the API where it is possible
> > for the underlying pages of a MAP_SHARED file mapping to become tagged
> > via KVM, said tags are exposed to the guest and are discarded when the
> > underlying page is paged out.
>
> Ah, good point, shared file mappings is another reason we did not allow
> MAP_SHARED and MTE for guest memory.
>
> BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED
> irrespective of whether we allow MAP_SHARED or not.

Yes, that sounds like a good idea. If we made MAP_NOMTE clear
VM_MTE_ALLOWED then I think that's the only check we would need.

> > We can perhaps accomplish it by dropping
> > support for KVM_CAP_ARM_MTE in the kernel and introducing something
> > like a KVM_CAP_ARM_MTE_V2 with the new restriction.
>
> That's an option for the ABI upgrade but we still need to solve the
> potential races.

With MAP_NOMTE I think we wouldn't need an ABI change at all.

Peter
Catalin Marinas June 29, 2022, 8:45 a.m. UTC | #9
On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
> [I'm still in the process of trying to grok the issues surrounding
> MTE+KVM, so apologies in advance if I'm muddying the waters]

No worries, we are not that far ahead either ;).

> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
> > On 24/06/2022 18:05, Catalin Marinas wrote:
> >> + Steven as he added the KVM and swap support for MTE.
> >> 
> >> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> >>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> >>> depend on being able to map guest memory as MAP_SHARED. The current
> >>> restriction on sharing MAP_SHARED pages with the guest is preventing
> >>> the use of those features with MTE. Therefore, remove this restriction.
> >> 
> >> We already have some corner cases where the PG_mte_tagged logic fails
> >> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> >> KVM MAP_SHARED will potentially make things worse (or hard to reason
> >> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> >> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> >> on user page allocation, copy them on write. For swap we can scan and if
> >> all tags are 0 and just skip saving them.
> >> 
> >> Another aspect is a change in the KVM ABI with this patch. It's probably
> >> not that bad since it's rather a relaxation but it has the potential to
> >> confuse the VMM, especially as it doesn't know whether it's running on
> >> older kernels or not (it would have to probe unless we expose this info
> >> to the VMM in some other way).
> 
> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)

Steven to confirm but I think he only played with kvmtool. Adding
Jean-Philippe who also had Qemu on his plans at some point.

> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
> up copying zeroes?

Yes. For migration, the VMM could ignore sending over tags that are all
zeros or maybe use some simple compression. We don't have a way to
disable MTE for guests, so all pages mapped into the guest address space
end up with PG_mte_tagged.

> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
> will be called? I.e. when implementing migration, it should be ok to
> call it while the vm is paused, but you probably won't get a consistent
> state while the vm is running?

Wouldn't this be the same as migrating data? The VMM would only copy it
after it was marked read-only. BTW, I think sanitise_mte_tags() needs a
barrier before setting the PG_mte_tagged() flag (unless we end up with
some lock for reading the tags).

> [Postcopy needs a different interface, I guess, so that the migration
> target can atomically place a received page and its metadata. I see
> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
> has there been any follow-up?]

I don't follow the qemu list, so I wasn't even aware of that thread. But
postcopy, the VMM needs to ensure that both the data and tags are up to
date before mapping such page into the guest address space.
Catalin Marinas June 29, 2022, 7:15 p.m. UTC | #10
On Tue, Jun 28, 2022 at 11:54:51AM -0700, Peter Collingbourne wrote:
> On Tue, Jun 28, 2022 at 10:58 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > That's why it would be interesting to see
> > the effect of using DC GZVA instead of DC ZVA for page zeroing.
> >
> > I suspect on Android you'd notice the fork() penalty a bit more with all
> > the copy-on-write having to copy tags. But we can't tell until we do
> > some benchmarks. If the penalty is indeed significant, we'll go back to
> > assessing the races here.
> 
> Okay, I can try to measure it. I do feel rather strongly though that
> we should try to avoid tagging pages as much as possible even ignoring
> the potential performance implications.
> 
> Here's one more idea: we can tag pages eagerly as you propose, but
> introduce an opt-out. For example, we introduce a MAP_NOMTE flag,
> which would prevent tag initialization as well as causing any future
> attempt to mprotect(PROT_MTE) to fail. Allocators that know that the
> memory will not be used for MTE in the future can set this flag. For
> example, Scudo can start setting this flag once MTE has been disabled
> as it has no mechanism for turning MTE back on once disabled. And that
> way we will end up with no tags on the heap in the processes with MTE
> disabled. Mappings with MAP_NOMTE would not be permitted in the guest
> memory space of MTE enabled guests. For executables mapped by the
> kernel we may consider adding a bit to the ELF program headers to
> enable MAP_NOMTE.

I don't like such negative flags and we should aim for minimal changes
to code that doesn't care about MTE. If there's a performance penalty
with zeroing the tags, we'll keep looking at the lazy tag
initialisation.

In the meantime, I'll think some more about the lazy stuff. We need at
least mte_sync_tags() fixed to set the PG_mte_tagged after the tags have
been updated (fixes the CoW + mprotect() race but probably breaks
concurrent MAP_SHARED mprotect()). We'd have to add some barriers (maybe
in a new function, set_page_tagged()). Some cases like restoring from
swap (both private and shared) have the page lock held. KVM doesn't seem
to take any page lock, so it can race with the VMM.

Anyway, I doubt we can get away with a single bit. We can't make use of
PG_locked either since set_pte_at() is called with the page either
locked or unlocked.
Catalin Marinas June 30, 2022, 5:24 p.m. UTC | #11
On Wed, Jun 29, 2022 at 08:15:07PM +0100, Catalin Marinas wrote:
> In the meantime, I'll think some more about the lazy stuff. We need at
> least mte_sync_tags() fixed to set the PG_mte_tagged after the tags have
> been updated (fixes the CoW + mprotect() race but probably breaks
> concurrent MAP_SHARED mprotect()). We'd have to add some barriers (maybe
> in a new function, set_page_tagged()). Some cases like restoring from
> swap (both private and shared) have the page lock held. KVM doesn't seem
> to take any page lock, so it can race with the VMM.
> 
> Anyway, I doubt we can get away with a single bit. We can't make use of
> PG_locked either since set_pte_at() is called with the page either
> locked or unlocked.

I dug out the one of the old threads on MAP_SHARED and KVM:

https://lore.kernel.org/r/20210609174117.GA18459@arm.com

Last year we did not consider another page flag for the lock, only a big
lock or an array of locks and neither looked appealing. An extra bit in
the page flags scales a lot better though we tend not go through these
bits that quickly.

Anyway, I think at least for the current setup, we need something like
below. It can break concurrent mprotect() by zeroing valid tags but
that's a lot less likely to happen than CoW. I'm still digging through
the KVM, I think it has the mprotect() race already if it's done in the
VMM.

-----8<-----------------------------
From 1f424a52a71df0ff0dafef8a00fe1a8363de63ee Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 30 Jun 2022 10:16:56 +0100
Subject: [PATCH] arm64: mte: Fix/clarify the PG_mte_tagged semantics

Currently the PG_mte_tagged page flag mostly means the page contains
valid tags and it should be set after the tags have been cleared or
restored. However, in mte_sync_tags() it is set before setting the tags
to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for
shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on
write in another thread can cause the new page to have stale tags.
Similarly, tag reading via ptrace() can read stale tags of the
PG_mte_tagged flag is set before actually clearing/restoring the tags.

Fix the PG_mte_tagged semantics so that it is only set after the tags
have been cleared or restored. This is safe for swap restoring into a
MAP_SHARED or CoW page since the core code takes the page lock. Add two
functions to test and set the PG_mte_tagged flag with acquire and
release semantics. The downside is that concurrent mprotect(PROT_MTE) on
a MAP_SHARED page may cause tag loss. This is already the case for KVM
guests if a VMM changes the page protection while the guest triggers a
user_mem_abort().

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
---
 arch/arm64/include/asm/mte.h     | 32 ++++++++++++++++++++++++++++++++
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/cpufeature.c   |  4 +++-
 arch/arm64/kernel/elfcore.c      |  2 +-
 arch/arm64/kernel/hibernate.c    |  2 +-
 arch/arm64/kernel/mte.c          |  8 ++++----
 arch/arm64/kvm/guest.c           |  4 ++--
 arch/arm64/kvm/mmu.c             |  4 ++--
 arch/arm64/mm/copypage.c         |  4 ++--
 arch/arm64/mm/fault.c            |  2 +-
 arch/arm64/mm/mteswap.c          |  2 +-
 11 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index aa523591a44e..dcf0e09112b9 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,6 +37,29 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged	PG_arch_2
 
+static inline void set_page_mte_tagged(struct page *page)
+{
+	/*
+	 * Ensure that the tags written prior to this function are visible
+	 * before the page flags update.
+	 */
+	smp_wmb();
+	set_bit(PG_mte_tagged, &page->flags);
+}
+
+static inline bool page_mte_tagged(struct page *page)
+{
+	bool ret = test_bit(PG_mte_tagged, &page->flags);
+
+	/*
+	 * If the page is tagged, ensure ordering with a likely subsequent
+	 * read of the tags.
+	 */
+	if (ret)
+		smp_rmb();
+	return ret;
+}
+
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
@@ -54,6 +77,15 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size);
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged	0
 
+static inline set_page_mte_tagged(struct page *page)
+{
+}
+
+static inline bool page_mte_tagged(struct page *page)
+{
+	return false;
+}
+
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b6632f18364..08823669db0a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1034,7 +1034,7 @@ static inline void arch_swap_invalidate_area(int type)
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
 	if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
-		set_bit(PG_mte_tagged, &folio->flags);
+		set_page_mte_tagged(&folio->page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 8d88433de81d..01eda797bd59 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1964,8 +1964,10 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
 	 * Clear the tags in the zero page. This needs to be done via the
 	 * linear map which has the Tagged attribute.
 	 */
-	if (!test_and_set_bit(PG_mte_tagged, &ZERO_PAGE(0)->flags))
+	if (page_mte_tagged(ZERO_PAGE(0))) {
 		mte_clear_page_tags(lm_alias(empty_zero_page));
+		set_page_mte_tagged(ZERO_PAGE(0));
+	}
 
 	kasan_init_hw_tags_cpu();
 }
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 98d67444a5b6..f91bb1572d22 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -47,7 +47,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 		 * Pages mapped in user space as !pte_access_permitted() (e.g.
 		 * PROT_EXEC only) may not have the PG_mte_tagged flag set.
 		 */
-		if (!test_bit(PG_mte_tagged, &page->flags)) {
+		if (!page_mte_tagged(page)) {
 			put_page(page);
 			dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
 			continue;
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 2e248342476e..018ae7a25737 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -271,7 +271,7 @@ static int swsusp_mte_save_tags(void)
 			if (!page)
 				continue;
 
-			if (!test_bit(PG_mte_tagged, &page->flags))
+			if (!page_mte_tagged(page))
 				continue;
 
 			ret = save_tags(page, pfn);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index f6b00743c399..37020353bb35 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -58,6 +58,7 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 	 */
 	smp_wmb();
 	mte_clear_page_tags(page_address(page));
+	set_page_mte_tagged(page);
 }
 
 void mte_sync_tags(pte_t old_pte, pte_t pte)
@@ -73,7 +74,7 @@ void mte_sync_tags(pte_t old_pte, pte_t pte)
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+		if (!page_mte_tagged(page))
 			mte_sync_page_tags(page, old_pte, check_swap,
 					   pte_is_tagged);
 	}
@@ -100,8 +101,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
 	 * pages is tagged, set_pte_at() may zero or change the tags of the
 	 * other page via mte_sync_tags().
 	 */
-	if (test_bit(PG_mte_tagged, &page1->flags) ||
-	    test_bit(PG_mte_tagged, &page2->flags))
+	if (page_mte_tagged(page1) || page_mte_tagged(page2))
 		return addr1 != addr2;
 
 	return ret;
@@ -407,7 +407,7 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
 			put_page(page);
 			break;
 		}
-		WARN_ON_ONCE(!test_bit(PG_mte_tagged, &page->flags));
+		WARN_ON_ONCE(!page_mte_tagged(page));
 
 		/* limit access to the end of the page */
 		offset = offset_in_page(addr);
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 8c607199cad1..3b04e69006b4 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1058,7 +1058,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
 		maddr = page_address(page);
 
 		if (!write) {
-			if (test_bit(PG_mte_tagged, &page->flags))
+			if (page_mte_tagged(page))
 				num_tags = mte_copy_tags_to_user(tags, maddr,
 							MTE_GRANULES_PER_PAGE);
 			else
@@ -1075,7 +1075,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
 			 * completed fully
 			 */
 			if (num_tags == MTE_GRANULES_PER_PAGE)
-				set_bit(PG_mte_tagged, &page->flags);
+				set_page_mte_tagged(page);
 
 			kvm_release_pfn_dirty(pfn);
 		}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5651a05b6a8..9cfa516452e1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1075,9 +1075,9 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 		return -EFAULT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_bit(PG_mte_tagged, &page->flags)) {
+		if (!page_mte_tagged(page)) {
 			mte_clear_page_tags(page_address(page));
-			set_bit(PG_mte_tagged, &page->flags);
+			set_page_mte_tagged(page);
 		}
 	}
 
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index 0dea80bf6de4..f36d796f1bce 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -21,8 +21,7 @@ void copy_highpage(struct page *to, struct page *from)
 
 	copy_page(kto, kfrom);
 
-	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
-		set_bit(PG_mte_tagged, &to->flags);
+	if (system_supports_mte() && page_mte_tagged(from)) {
 		page_kasan_tag_reset(to);
 		/*
 		 * We need smp_wmb() in between setting the flags and clearing the
@@ -33,6 +32,7 @@ void copy_highpage(struct page *to, struct page *from)
 		 */
 		smp_wmb();
 		mte_copy_page_tags(kto, kfrom);
+		set_page_mte_tagged(to);
 	}
 }
 EXPORT_SYMBOL(copy_highpage);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c5e11768e5c1..147fe28d3fbe 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -928,5 +928,5 @@ void tag_clear_highpage(struct page *page)
 {
 	mte_zero_clear_page_tags(page_address(page));
 	page_kasan_tag_reset(page);
-	set_bit(PG_mte_tagged, &page->flags);
+	set_page_mte_tagged(page);
 }
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a9e50e930484..9d3a8cf388fc 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -24,7 +24,7 @@ int mte_save_tags(struct page *page)
 {
 	void *tag_storage, *ret;
 
-	if (!test_bit(PG_mte_tagged, &page->flags))
+	if (!page_mte_tagged(page))
 		return 0;
 
 	tag_storage = mte_allocate_tag_storage();
Steven Price July 4, 2022, 9:52 a.m. UTC | #12
On 29/06/2022 09:45, Catalin Marinas wrote:
> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
>> [I'm still in the process of trying to grok the issues surrounding
>> MTE+KVM, so apologies in advance if I'm muddying the waters]
> 
> No worries, we are not that far ahead either ;).
> 
>> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
>>> On 24/06/2022 18:05, Catalin Marinas wrote:
>>>> + Steven as he added the KVM and swap support for MTE.
>>>>
>>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>>>> depend on being able to map guest memory as MAP_SHARED. The current
>>>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>>>> the use of those features with MTE. Therefore, remove this restriction.
>>>>
>>>> We already have some corner cases where the PG_mte_tagged logic fails
>>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>>>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>>>> on user page allocation, copy them on write. For swap we can scan and if
>>>> all tags are 0 and just skip saving them.
>>>>
>>>> Another aspect is a change in the KVM ABI with this patch. It's probably
>>>> not that bad since it's rather a relaxation but it has the potential to
>>>> confuse the VMM, especially as it doesn't know whether it's running on
>>>> older kernels or not (it would have to probe unless we expose this info
>>>> to the VMM in some other way).
>>
>> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
> 
> Steven to confirm but I think he only played with kvmtool. Adding
> Jean-Philippe who also had Qemu on his plans at some point.

Yes I've only played with kvmtool so far. 'basic support' at the moment
is little more than enabling the cap - that allows the guest to access
tags. However obviously aspects such as migration need to understand
what's going on to correctly save/restore tags - which is mostly only
relevant to Qemu.

>> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
>> up copying zeroes?
> 
> Yes. For migration, the VMM could ignore sending over tags that are all
> zeros or maybe use some simple compression. We don't have a way to
> disable MTE for guests, so all pages mapped into the guest address space
> end up with PG_mte_tagged.

Architecturally we don't (yet) have a way of describing memory without
tags, so indeed you will get all zeros if the guest hasn't populated the
tags yet.

>> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
>> will be called? I.e. when implementing migration, it should be ok to
>> call it while the vm is paused, but you probably won't get a consistent
>> state while the vm is running?
> 
> Wouldn't this be the same as migrating data? The VMM would only copy it
> after it was marked read-only. BTW, I think sanitise_mte_tags() needs a
> barrier before setting the PG_mte_tagged() flag (unless we end up with
> some lock for reading the tags).

As Catalin says, tags are no different from data so the VMM needs to
either pause the VM or mark the page read-only to protect it from guest
updates during the copy.

The whole test_bit/set_bit dance does seem to be leaving open race
conditions. I /think/ that Peter's extra flag as a lock with an added
memory barrier is sufficient to mitigate it, but at this stage I'm also
thinking some formal modelling would be wise as I don't trust my
intuition when it comes to memory barriers.

>> [Postcopy needs a different interface, I guess, so that the migration
>> target can atomically place a received page and its metadata. I see
>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
>> has there been any follow-up?]
> 
> I don't follow the qemu list, so I wasn't even aware of that thread. But
> postcopy, the VMM needs to ensure that both the data and tags are up to
> date before mapping such page into the guest address space.
> 

I'm not sure I see how atomically updating data+tags is different from
the existing issues around atomically updating the data. The VMM needs
to ensure that the guest doesn't see the page before all the data+all
the tags are written. It does mean lazy setting of the tags isn't
possible in the VMM, but I'm not sure that's a worthwhile thing anyway.
Perhaps I'm missing something?

Steve
Cornelia Huck July 4, 2022, 12:19 p.m. UTC | #13
On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:

> On 29/06/2022 09:45, Catalin Marinas wrote:
>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
>>> [I'm still in the process of trying to grok the issues surrounding
>>> MTE+KVM, so apologies in advance if I'm muddying the waters]
>> 
>> No worries, we are not that far ahead either ;).
>> 
>>> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
>>>> On 24/06/2022 18:05, Catalin Marinas wrote:
>>>>> + Steven as he added the KVM and swap support for MTE.
>>>>>
>>>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>>>>> depend on being able to map guest memory as MAP_SHARED. The current
>>>>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>>>>> the use of those features with MTE. Therefore, remove this restriction.
>>>>>
>>>>> We already have some corner cases where the PG_mte_tagged logic fails
>>>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>>>>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>>>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>>>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>>>>> on user page allocation, copy them on write. For swap we can scan and if
>>>>> all tags are 0 and just skip saving them.
>>>>>
>>>>> Another aspect is a change in the KVM ABI with this patch. It's probably
>>>>> not that bad since it's rather a relaxation but it has the potential to
>>>>> confuse the VMM, especially as it doesn't know whether it's running on
>>>>> older kernels or not (it would have to probe unless we expose this info
>>>>> to the VMM in some other way).
>>>
>>> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
>> 
>> Steven to confirm but I think he only played with kvmtool. Adding
>> Jean-Philippe who also had Qemu on his plans at some point.
>
> Yes I've only played with kvmtool so far. 'basic support' at the moment
> is little more than enabling the cap - that allows the guest to access
> tags. However obviously aspects such as migration need to understand
> what's going on to correctly save/restore tags - which is mostly only
> relevant to Qemu.

Yes, simply only enabling the cap seems to work fine in QEMU as well (as
in, 'mte selftests work fine'). Migration support is the
hard/interesting part.

>
>>> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
>>> up copying zeroes?
>> 
>> Yes. For migration, the VMM could ignore sending over tags that are all
>> zeros or maybe use some simple compression. We don't have a way to
>> disable MTE for guests, so all pages mapped into the guest address space
>> end up with PG_mte_tagged.
>
> Architecturally we don't (yet) have a way of describing memory without
> tags, so indeed you will get all zeros if the guest hasn't populated the
> tags yet.

Nod.

>
>>> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
>>> will be called? I.e. when implementing migration, it should be ok to
>>> call it while the vm is paused, but you probably won't get a consistent
>>> state while the vm is running?
>> 
>> Wouldn't this be the same as migrating data? The VMM would only copy it
>> after it was marked read-only. BTW, I think sanitise_mte_tags() needs a
>> barrier before setting the PG_mte_tagged() flag (unless we end up with
>> some lock for reading the tags).
>
> As Catalin says, tags are no different from data so the VMM needs to
> either pause the VM or mark the page read-only to protect it from guest
> updates during the copy.

Yes, that seems reasonable; not sure whether the documentation should
call that out explicitly.

>
> The whole test_bit/set_bit dance does seem to be leaving open race
> conditions. I /think/ that Peter's extra flag as a lock with an added
> memory barrier is sufficient to mitigate it, but at this stage I'm also
> thinking some formal modelling would be wise as I don't trust my
> intuition when it comes to memory barriers.
>
>>> [Postcopy needs a different interface, I guess, so that the migration
>>> target can atomically place a received page and its metadata. I see
>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
>>> has there been any follow-up?]
>> 
>> I don't follow the qemu list, so I wasn't even aware of that thread. But
>> postcopy, the VMM needs to ensure that both the data and tags are up to
>> date before mapping such page into the guest address space.
>> 
>
> I'm not sure I see how atomically updating data+tags is different from
> the existing issues around atomically updating the data. The VMM needs
> to ensure that the guest doesn't see the page before all the data+all
> the tags are written. It does mean lazy setting of the tags isn't
> possible in the VMM, but I'm not sure that's a worthwhile thing anyway.
> Perhaps I'm missing something?

For postcopy, we basically want to fault in any not-yet-migrated page
via uffd once the guest accesses it. We only get the page data that way,
though, not the tag. I'm wondering whether we'd need a 'page+metadata'
uffd mode; not sure if that makes sense. Otherwise, we'd need to stop
the guest while grabbing the tags for the page as well, and stopping is
the thing we want to avoid here.
Steven Price July 4, 2022, 3 p.m. UTC | #14
On 04/07/2022 13:19, Cornelia Huck wrote:
> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:
> 
>> On 29/06/2022 09:45, Catalin Marinas wrote:
>>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
>>>> [I'm still in the process of trying to grok the issues surrounding
>>>> MTE+KVM, so apologies in advance if I'm muddying the waters]
>>>
>>> No worries, we are not that far ahead either ;).
>>>
>>>> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote:
>>>>> On 24/06/2022 18:05, Catalin Marinas wrote:
>>>>>> + Steven as he added the KVM and swap support for MTE.
>>>>>>
>>>>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
>>>>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
>>>>>>> depend on being able to map guest memory as MAP_SHARED. The current
>>>>>>> restriction on sharing MAP_SHARED pages with the guest is preventing
>>>>>>> the use of those features with MTE. Therefore, remove this restriction.
>>>>>>
>>>>>> We already have some corner cases where the PG_mte_tagged logic fails
>>>>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
>>>>>> KVM MAP_SHARED will potentially make things worse (or hard to reason
>>>>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to
>>>>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags
>>>>>> on user page allocation, copy them on write. For swap we can scan and if
>>>>>> all tags are 0 and just skip saving them.
>>>>>>
>>>>>> Another aspect is a change in the KVM ABI with this patch. It's probably
>>>>>> not that bad since it's rather a relaxation but it has the potential to
>>>>>> confuse the VMM, especially as it doesn't know whether it's running on
>>>>>> older kernels or not (it would have to probe unless we expose this info
>>>>>> to the VMM in some other way).
>>>>
>>>> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.)
>>>
>>> Steven to confirm but I think he only played with kvmtool. Adding
>>> Jean-Philippe who also had Qemu on his plans at some point.
>>
>> Yes I've only played with kvmtool so far. 'basic support' at the moment
>> is little more than enabling the cap - that allows the guest to access
>> tags. However obviously aspects such as migration need to understand
>> what's going on to correctly save/restore tags - which is mostly only
>> relevant to Qemu.
> 
> Yes, simply only enabling the cap seems to work fine in QEMU as well (as
> in, 'mte selftests work fine'). Migration support is the
> hard/interesting part.
> 
>>
>>>> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end
>>>> up copying zeroes?
>>>
>>> Yes. For migration, the VMM could ignore sending over tags that are all
>>> zeros or maybe use some simple compression. We don't have a way to
>>> disable MTE for guests, so all pages mapped into the guest address space
>>> end up with PG_mte_tagged.
>>
>> Architecturally we don't (yet) have a way of describing memory without
>> tags, so indeed you will get all zeros if the guest hasn't populated the
>> tags yet.
> 
> Nod.
> 
>>
>>>> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS
>>>> will be called? I.e. when implementing migration, it should be ok to
>>>> call it while the vm is paused, but you probably won't get a consistent
>>>> state while the vm is running?
>>>
>>> Wouldn't this be the same as migrating data? The VMM would only copy it
>>> after it was marked read-only. BTW, I think sanitise_mte_tags() needs a
>>> barrier before setting the PG_mte_tagged() flag (unless we end up with
>>> some lock for reading the tags).
>>
>> As Catalin says, tags are no different from data so the VMM needs to
>> either pause the VM or mark the page read-only to protect it from guest
>> updates during the copy.
> 
> Yes, that seems reasonable; not sure whether the documentation should
> call that out explicitly.
> 
>>
>> The whole test_bit/set_bit dance does seem to be leaving open race
>> conditions. I /think/ that Peter's extra flag as a lock with an added
>> memory barrier is sufficient to mitigate it, but at this stage I'm also
>> thinking some formal modelling would be wise as I don't trust my
>> intuition when it comes to memory barriers.
>>
>>>> [Postcopy needs a different interface, I guess, so that the migration
>>>> target can atomically place a received page and its metadata. I see
>>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
>>>> has there been any follow-up?]
>>>
>>> I don't follow the qemu list, so I wasn't even aware of that thread. But
>>> postcopy, the VMM needs to ensure that both the data and tags are up to
>>> date before mapping such page into the guest address space.
>>>
>>
>> I'm not sure I see how atomically updating data+tags is different from
>> the existing issues around atomically updating the data. The VMM needs
>> to ensure that the guest doesn't see the page before all the data+all
>> the tags are written. It does mean lazy setting of the tags isn't
>> possible in the VMM, but I'm not sure that's a worthwhile thing anyway.
>> Perhaps I'm missing something?
> 
> For postcopy, we basically want to fault in any not-yet-migrated page
> via uffd once the guest accesses it. We only get the page data that way,
> though, not the tag. I'm wondering whether we'd need a 'page+metadata'
> uffd mode; not sure if that makes sense. Otherwise, we'd need to stop
> the guest while grabbing the tags for the page as well, and stopping is
> the thing we want to avoid here.

Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page
and ensures that no thread will see the partially populated page. But
there's currently no way of doing that with tags as well.

I'd not looked at the implementation of userfaultfd before and I'd
assumed it avoided the need for an 'atomic' operation like this. But
apparently not! AFAICT either a new ioctl would be needed (which can
take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the
alignment requirements of `src` and would copy the tags along with the data.

Steve
Cornelia Huck July 8, 2022, 1:03 p.m. UTC | #15
On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:

> On 04/07/2022 13:19, Cornelia Huck wrote:
>> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:
>> 
>>> On 29/06/2022 09:45, Catalin Marinas wrote:
>>>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
>>>
>>>>> [Postcopy needs a different interface, I guess, so that the migration
>>>>> target can atomically place a received page and its metadata. I see
>>>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
>>>>> has there been any follow-up?]
>>>>
>>>> I don't follow the qemu list, so I wasn't even aware of that thread. But
>>>> postcopy, the VMM needs to ensure that both the data and tags are up to
>>>> date before mapping such page into the guest address space.
>>>>
>>>
>>> I'm not sure I see how atomically updating data+tags is different from
>>> the existing issues around atomically updating the data. The VMM needs
>>> to ensure that the guest doesn't see the page before all the data+all
>>> the tags are written. It does mean lazy setting of the tags isn't
>>> possible in the VMM, but I'm not sure that's a worthwhile thing anyway.
>>> Perhaps I'm missing something?
>> 
>> For postcopy, we basically want to fault in any not-yet-migrated page
>> via uffd once the guest accesses it. We only get the page data that way,
>> though, not the tag. I'm wondering whether we'd need a 'page+metadata'
>> uffd mode; not sure if that makes sense. Otherwise, we'd need to stop
>> the guest while grabbing the tags for the page as well, and stopping is
>> the thing we want to avoid here.
>
> Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page
> and ensures that no thread will see the partially populated page. But
> there's currently no way of doing that with tags as well.

Nod.

>
> I'd not looked at the implementation of userfaultfd before and I'd
> assumed it avoided the need for an 'atomic' operation like this. But
> apparently not! AFAICT either a new ioctl would be needed (which can
> take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the
> alignment requirements of `src` and would copy the tags along with the data.

I was thinking about a new flag that implies "copy metadata"; not sure
how we would get the same atomicity with a separate ioctl. I've only
just started looking at userfaultfd, though, and I might be on a wrong
track... One thing I'd like to avoid is having something that is too
ARM-specific, I think there are other architecture features that might
have similar issues.

Maybe someone more familiar with uffd and/or postcopy can chime in?
Peter Xu July 8, 2022, 1:58 p.m. UTC | #16
On Fri, Jul 08, 2022 at 03:03:34PM +0200, Cornelia Huck wrote:
> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:
> 
> > On 04/07/2022 13:19, Cornelia Huck wrote:
> >> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote:
> >> 
> >>> On 29/06/2022 09:45, Catalin Marinas wrote:
> >>>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote:
> >>>
> >>>>> [Postcopy needs a different interface, I guess, so that the migration
> >>>>> target can atomically place a received page and its metadata. I see
> >>>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/;
> >>>>> has there been any follow-up?]
> >>>>
> >>>> I don't follow the qemu list, so I wasn't even aware of that thread. But
> >>>> postcopy, the VMM needs to ensure that both the data and tags are up to
> >>>> date before mapping such page into the guest address space.
> >>>>
> >>>
> >>> I'm not sure I see how atomically updating data+tags is different from
> >>> the existing issues around atomically updating the data. The VMM needs
> >>> to ensure that the guest doesn't see the page before all the data+all
> >>> the tags are written. It does mean lazy setting of the tags isn't
> >>> possible in the VMM, but I'm not sure that's a worthwhile thing anyway.
> >>> Perhaps I'm missing something?
> >> 
> >> For postcopy, we basically want to fault in any not-yet-migrated page
> >> via uffd once the guest accesses it. We only get the page data that way,
> >> though, not the tag. I'm wondering whether we'd need a 'page+metadata'
> >> uffd mode; not sure if that makes sense. Otherwise, we'd need to stop
> >> the guest while grabbing the tags for the page as well, and stopping is
> >> the thing we want to avoid here.
> >
> > Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page
> > and ensures that no thread will see the partially populated page. But
> > there's currently no way of doing that with tags as well.
> 
> Nod.
> 
> >
> > I'd not looked at the implementation of userfaultfd before and I'd
> > assumed it avoided the need for an 'atomic' operation like this. But
> > apparently not! AFAICT either a new ioctl would be needed (which can
> > take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the
> > alignment requirements of `src` and would copy the tags along with the data.
> 
> I was thinking about a new flag that implies "copy metadata"; not sure
> how we would get the same atomicity with a separate ioctl. I've only
> just started looking at userfaultfd, though, and I might be on a wrong
> track... One thing I'd like to avoid is having something that is too
> ARM-specific, I think there are other architecture features that might
> have similar issues.

Agreed, to propose such an interface we'd better make sure it'll be easily
applicable to other similar memory protection mechanisms elsewhere.

> 
> Maybe someone more familiar with uffd and/or postcopy can chime in?

Hanving UFFDIO_COPY provide a new flag sounds reasonable to me.  I'm
curious what's the maximum possible size of the tags and whether they can
be embeded already into struct uffdio_copy somehow.

Thanks,
Cornelia Huck July 14, 2022, 1:30 p.m. UTC | #17
On Fri, Jul 08 2022, Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jul 08, 2022 at 03:03:34PM +0200, Cornelia Huck wrote:

>> I was thinking about a new flag that implies "copy metadata"; not sure
>> how we would get the same atomicity with a separate ioctl. I've only
>> just started looking at userfaultfd, though, and I might be on a wrong
>> track... One thing I'd like to avoid is having something that is too
>> ARM-specific, I think there are other architecture features that might
>> have similar issues.
>
> Agreed, to propose such an interface we'd better make sure it'll be easily
> applicable to other similar memory protection mechanisms elsewhere.

There's storage keys on s390, although I believe they are considered
legacy by now. I dimly recall something in x86 land.

>
>> 
>> Maybe someone more familiar with uffd and/or postcopy can chime in?
>
> Hanving UFFDIO_COPY provide a new flag sounds reasonable to me.  I'm
> curious what's the maximum possible size of the tags and whether they can
> be embeded already into struct uffdio_copy somehow.

Each tag is four bits and covers 16 bytes (also see the defs in
arch/arm64/include/asm/mte-def.h).
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index aa523591a44e..f66f70194c76 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -19,6 +19,7 @@ 
 #include <asm/pgtable-types.h>
 
 void mte_clear_page_tags(void *addr);
+void mte_ensure_page_tags_cleared(struct page *page);
 unsigned long mte_copy_tags_from_user(void *to, const void __user *from,
 				      unsigned long n);
 unsigned long mte_copy_tags_to_user(void __user *to, void *from,
@@ -37,6 +38,9 @@  void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged	PG_arch_2
 
+/* the page is or has been in the process of having its tags cleared */
+#define PG_mte_tag_clearing	PG_arch_3
+
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
@@ -53,6 +57,7 @@  size_t mte_probe_user_range(const char __user *uaddr, size_t size);
 
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged	0
+#define PG_mte_tag_clearing	0
 
 static inline void mte_zero_clear_page_tags(void *addr)
 {
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index f6b00743c399..8f9655053a9f 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -57,7 +57,18 @@  static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 	 * the new page->flags are visible before the tags were updated.
 	 */
 	smp_wmb();
-	mte_clear_page_tags(page_address(page));
+	mte_ensure_page_tags_cleared(page);
+}
+
+void mte_ensure_page_tags_cleared(struct page *page)
+{
+	if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
+		while (!test_bit(PG_mte_tagged, &page->flags))
+			;
+	} else {
+		mte_clear_page_tags(page_address(page));
+		set_bit(PG_mte_tagged, &page->flags);
+	}
 }
 
 void mte_sync_tags(pte_t old_pte, pte_t pte)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5651a05b6a8..016d14c9d798 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1050,11 +1050,9 @@  static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
  * able to see the page's tags and therefore they must be initialised first. If
  * PG_mte_tagged is set, tags have already been initialised.
  *
- * The race in the test/set of the PG_mte_tagged flag is handled by:
- * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
- *   racing to santise the same page
- * - mmap_lock protects between a VM faulting a page in and the VMM performing
- *   an mprotect() to add VM_MTE
+ * The race in the test/set of the PG_mte_tagged flag is handled via the flag
+ * PG_mte_tag_clearing which is atomically test and set before beginning a tag
+ * initialization, and spinning on PG_mte_tagged if it was already set.
  */
 static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 			     unsigned long size)
@@ -1074,12 +1072,9 @@  static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 	if (!page)
 		return -EFAULT;
 
-	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_bit(PG_mte_tagged, &page->flags)) {
-			mte_clear_page_tags(page_address(page));
-			set_bit(PG_mte_tagged, &page->flags);
-		}
-	}
+	for (i = 0; i < nr_pages; i++, page++)
+		if (!test_bit(PG_mte_tagged, &page->flags))
+			mte_ensure_page_tags_cleared(page);
 
 	return 0;
 }
@@ -1092,7 +1087,6 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault;
 	bool device = false;
-	bool shared;
 	unsigned long mmu_seq;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
@@ -1142,8 +1136,6 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		vma_shift = get_vma_page_shift(vma, hva);
 	}
 
-	shared = (vma->vm_flags & VM_SHARED);
-
 	switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
 	case PUD_SHIFT:
@@ -1263,11 +1255,7 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
-		/* Check the VMM hasn't introduced a new VM_SHARED VMA */
-		if (!shared)
-			ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
-		else
-			ret = -EFAULT;
+		ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
 		if (ret)
 			goto out_unlock;
 	}
@@ -1705,16 +1693,6 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		if (!vma)
 			break;
 
-		/*
-		 * VM_SHARED mappings are not allowed with MTE to avoid races
-		 * when updating the PG_mte_tagged page flag, see
-		 * sanitise_mte_tags for more details.
-		 */
-		if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) {
-			ret = -EINVAL;
-			break;
-		}
-
 		if (vma->vm_flags & VM_PFNMAP) {
 			/* IO region dirty page logging not allowed */
 			if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..447cdd4b1311 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -134,6 +134,7 @@  enum pageflags {
 #endif
 #ifdef CONFIG_64BIT
 	PG_arch_2,
+	PG_arch_3,
 #endif
 #ifdef CONFIG_KASAN_HW_TAGS
 	PG_skip_kasan_poison,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index e87cb2b80ed3..ecf7ae0de3d8 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -92,9 +92,9 @@ 
 #endif
 
 #ifdef CONFIG_64BIT
-#define IF_HAVE_PG_ARCH_2(flag,string) ,{1UL << flag, string}
+#define IF_HAVE_PG_ARCH_2_3(flag,string) ,{1UL << flag, string}
 #else
-#define IF_HAVE_PG_ARCH_2(flag,string)
+#define IF_HAVE_PG_ARCH_2_3(flag,string)
 #endif
 
 #ifdef CONFIG_KASAN_HW_TAGS
@@ -130,7 +130,8 @@  IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
 IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
-IF_HAVE_PG_ARCH_2(PG_arch_2,		"arch_2"	)		\
+IF_HAVE_PG_ARCH_2_3(PG_arch_2,		"arch_2"	)		\
+IF_HAVE_PG_ARCH_2_3(PG_arch_3,		"arch_3"	)		\
 IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 
 #define show_page_flags(flags)						\