diff mbox series

[v14,1/8] arm64: mte: Handle race when synchronising tags

Message ID 20210607110816.25762-2-steven.price@arm.com (mailing list archive)
State New, archived
Headers show
Series MTE support for KVM guest | expand

Commit Message

Steven Price June 7, 2021, 11:08 a.m. UTC
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/kernel/mte.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Comments

Marc Zyngier June 9, 2021, 10:30 a.m. UTC | #1
On Mon, 07 Jun 2021 12:08:09 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> before restoring/zeroing the MTE tags. However if another thread were to
> race and attempt to sync the tags on the same page before the first
> thread had completed restoring/zeroing then it would see the flag is
> already set and continue without waiting. This would potentially expose
> the previous contents of the tags to user space, and cause any updates
> that user space makes before the restoring/zeroing has completed to
> potentially be lost.
> 
> Since this code is run from atomic contexts we can't just lock the page
> during the process. Instead implement a new (global) spinlock to protect
> the mte_sync_page_tags() function.
> 
> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/kernel/mte.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 125a10e413e9..a3583a7fd400 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -25,6 +25,7 @@
>  u64 gcr_kernel_excl __ro_after_init;
>  
>  static bool report_fault_once = true;
> +static DEFINE_SPINLOCK(tag_sync_lock);
>  
>  #ifdef CONFIG_KASAN_HW_TAGS
>  /* Whether the MTE asynchronous mode is enabled. */
> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>  
>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>  {
> +	unsigned long flags;
>  	pte_t old_pte = READ_ONCE(*ptep);
>  
> +	spin_lock_irqsave(&tag_sync_lock, flags);

having though a bit more about this after an offline discussion with
Catalin: why can't this lock be made per mm? We can't really share
tags across processes anyway, so this is limited to threads from the
same process.

I'd also like it to be documented that page sharing can only reliably
work with tagging if only one of the mappings is using tags.

Thanks,

	M.
Steven Price June 9, 2021, 10:51 a.m. UTC | #2
On 09/06/2021 11:30, Marc Zyngier wrote:
> On Mon, 07 Jun 2021 12:08:09 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
>> before restoring/zeroing the MTE tags. However if another thread were to
>> race and attempt to sync the tags on the same page before the first
>> thread had completed restoring/zeroing then it would see the flag is
>> already set and continue without waiting. This would potentially expose
>> the previous contents of the tags to user space, and cause any updates
>> that user space makes before the restoring/zeroing has completed to
>> potentially be lost.
>>
>> Since this code is run from atomic contexts we can't just lock the page
>> during the process. Instead implement a new (global) spinlock to protect
>> the mte_sync_page_tags() function.
>>
>> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/kernel/mte.c | 20 +++++++++++++++++---
>>  1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 125a10e413e9..a3583a7fd400 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,6 +25,7 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  
>>  static bool report_fault_once = true;
>> +static DEFINE_SPINLOCK(tag_sync_lock);
>>  
>>  #ifdef CONFIG_KASAN_HW_TAGS
>>  /* Whether the MTE asynchronous mode is enabled. */
>> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>>  
>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>>  {
>> +	unsigned long flags;
>>  	pte_t old_pte = READ_ONCE(*ptep);
>>  
>> +	spin_lock_irqsave(&tag_sync_lock, flags);
> 
> having though a bit more about this after an offline discussion with
> Catalin: why can't this lock be made per mm? We can't really share
> tags across processes anyway, so this is limited to threads from the
> same process.

Currently there's nothing stopping processes sharing tags (mmap(...,
PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
would have been nice if this had just been prevented from the beginning.

Given the above, clearly the lock can't be per mm and robust.

> I'd also like it to be documented that page sharing can only reliably
> work with tagging if only one of the mappings is using tags.

I'm not entirely clear whether you mean "can only reliably work" to be
"is practically impossible to coordinate tag values", or whether you are
proposing to (purposefully) introduce the race with a per-mm lock? (and
document it).

I guess we could have a per-mm lock and handle the race if user space
screws up with the outcome being lost tags (double clear).

But it feels to me like it could come back to bite in the future since
VM_SHARED|VM_MTE will almost always work and I fear someone will start
using it since it's permitted by the kernel.

Steve
Marc Zyngier June 9, 2021, 11:19 a.m. UTC | #3
On Wed, 09 Jun 2021 11:51:34 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 09/06/2021 11:30, Marc Zyngier wrote:
> > On Mon, 07 Jun 2021 12:08:09 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> >> before restoring/zeroing the MTE tags. However if another thread were to
> >> race and attempt to sync the tags on the same page before the first
> >> thread had completed restoring/zeroing then it would see the flag is
> >> already set and continue without waiting. This would potentially expose
> >> the previous contents of the tags to user space, and cause any updates
> >> that user space makes before the restoring/zeroing has completed to
> >> potentially be lost.
> >>
> >> Since this code is run from atomic contexts we can't just lock the page
> >> during the process. Instead implement a new (global) spinlock to protect
> >> the mte_sync_page_tags() function.
> >>
> >> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
> >> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/kernel/mte.c | 20 +++++++++++++++++---
> >>  1 file changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >> index 125a10e413e9..a3583a7fd400 100644
> >> --- a/arch/arm64/kernel/mte.c
> >> +++ b/arch/arm64/kernel/mte.c
> >> @@ -25,6 +25,7 @@
> >>  u64 gcr_kernel_excl __ro_after_init;
> >>  
> >>  static bool report_fault_once = true;
> >> +static DEFINE_SPINLOCK(tag_sync_lock);
> >>  
> >>  #ifdef CONFIG_KASAN_HW_TAGS
> >>  /* Whether the MTE asynchronous mode is enabled. */
> >> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
> >>  
> >>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
> >>  {
> >> +	unsigned long flags;
> >>  	pte_t old_pte = READ_ONCE(*ptep);
> >>  
> >> +	spin_lock_irqsave(&tag_sync_lock, flags);
> > 
> > having though a bit more about this after an offline discussion with
> > Catalin: why can't this lock be made per mm? We can't really share
> > tags across processes anyway, so this is limited to threads from the
> > same process.
> 
> Currently there's nothing stopping processes sharing tags (mmap(...,
> PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
> would have been nice if this had just been prevented from the
> beginning.

I don't think it should be prevented. I think it should be made clear
that it is unreliable and that it will result in tag corruption.

> Given the above, clearly the lock can't be per mm and robust.

I don't think we need to make it robust. The architecture actively
prevents sharing if the tags are also shared, just like we can't
really expect the VMM to share tags with the guest.

> > I'd also like it to be documented that page sharing can only reliably
> > work with tagging if only one of the mappings is using tags.
> 
> I'm not entirely clear whether you mean "can only reliably work" to be
> "is practically impossible to coordinate tag values", or whether you are
> proposing to (purposefully) introduce the race with a per-mm lock? (and
> document it).

The latter. You can obviously communicate your tags to another task,
but this should come with attached restrictions (mlock?).

> I guess we could have a per-mm lock and handle the race if user space
> screws up with the outcome being lost tags (double clear).
> 
> But it feels to me like it could come back to bite in the future since
> VM_SHARED|VM_MTE will almost always work and I fear someone will start
> using it since it's permitted by the kernel.

I'm really worried that performance is going to suck even on a small
system, and this global lock will be heavily contended, even without
considering KVM.

Catalin, any though?

	M.
Catalin Marinas June 9, 2021, 5:41 p.m. UTC | #4
On Wed, Jun 09, 2021 at 12:19:31PM +0100, Marc Zyngier wrote:
> On Wed, 09 Jun 2021 11:51:34 +0100,
> Steven Price <steven.price@arm.com> wrote:
> > On 09/06/2021 11:30, Marc Zyngier wrote:
> > > On Mon, 07 Jun 2021 12:08:09 +0100,
> > > Steven Price <steven.price@arm.com> wrote:
> > >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > >> index 125a10e413e9..a3583a7fd400 100644
> > >> --- a/arch/arm64/kernel/mte.c
> > >> +++ b/arch/arm64/kernel/mte.c
> > >> @@ -25,6 +25,7 @@
> > >>  u64 gcr_kernel_excl __ro_after_init;
> > >>  
> > >>  static bool report_fault_once = true;
> > >> +static DEFINE_SPINLOCK(tag_sync_lock);
> > >>  
> > >>  #ifdef CONFIG_KASAN_HW_TAGS
> > >>  /* Whether the MTE asynchronous mode is enabled. */
> > >> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
> > >>  
> > >>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
> > >>  {
> > >> +	unsigned long flags;
> > >>  	pte_t old_pte = READ_ONCE(*ptep);
> > >>  
> > >> +	spin_lock_irqsave(&tag_sync_lock, flags);
> > > 
> > > having though a bit more about this after an offline discussion with
> > > Catalin: why can't this lock be made per mm? We can't really share
> > > tags across processes anyway, so this is limited to threads from the
> > > same process.
> > 
> > Currently there's nothing stopping processes sharing tags (mmap(...,
> > PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
> > would have been nice if this had just been prevented from the
> > beginning.
> 
> I don't think it should be prevented. I think it should be made clear
> that it is unreliable and that it will result in tag corruption.
> 
> > Given the above, clearly the lock can't be per mm and robust.
> 
> I don't think we need to make it robust. The architecture actively
> prevents sharing if the tags are also shared, just like we can't
> really expect the VMM to share tags with the guest.

The architecture does not prevent MTE tag sharing (if that's what you
meant). The tags are just an additional metadata stored in physical
memory. It's not associated with the VA (as in the CHERI-style
capability tags), only checked against the logical tag in a pointer. If
the architecture prevented MAP_SHARED, we would have prevented PROT_MTE
on them (well, it's not too late to do this ;)).

I went with Steven a few times through this exercise, though I tend to
forget it quickly after. The use-case we had in mind when deciding to
allow MTE on shared mappings is something like:

	int fd = memfd_create("jitted-code", MFD_ALLOW_SEALING);
	ftruncate(fd, size);

	void* rw_mapping = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	void* rx_mapping = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);

	close(fd);

The above is within the same mm but you might as well have a fork and
the rx mapping in a child process. Any of the mappings may have
PROT_MTE from the start or set later with mprotect(), though it's
probably the rw one only.

The race we have is in set_pte_at() and the equivalent KVM setting for
stage 2 (in any combination of these). To detect a page that was not
previously tagged (first time mapped, remapped with new attributes), we
have a test like this via set_pte_at():

	if (!test_bit(PG_mte_tagged, &page->flags)) {
		mte_clear_page_tags(page);
		set_bit(PG_mte_tagged, &page->flags);
	}

Calling the above concurrently on a page may cause some tag loss in the
absence of any locking. Note that it only matters if one of the mappings
is writable (to write tags), so this excludes CoW (fork, KSM).

For stage 1, I think almost all cases that end up in set_pte_at() also
have the page->lock held and the ptl. The exception is mprotect() which
doesn't bother to look up each page and lock it, it just takes the ptl
lock. Within the same mm, mprotect() also takes the mmap_lock as a
writer, so it's all fine. The race is between two mms, one doing an
mprotect(PROT_MTE) with the page already mapped in its address space and
the other taking a fault and mapping the page via set_pte_at(). Two
faults in two mms again are fine because of the page lock.

For stage 2, the race between the VMM doing an mprotect() and the VM
going via user_mem_abort() is fine because the former calls
mmap_write_lock() while the latter mmap_read_lock(). So, as in stage 1,
the problem in stage 2 is for a MAP_SHARED region that another process
(maybe spawned by the VMM) calls mprotect(PROT_MTE).

There is another case of MAP_SHARED in the VMM that does not involve
mprotect(). The shared page is mapped on fault in VMM2, initially mapped
as PROT_MTE while VMM1 handles a user_mem_abort() -> hva_to_pfn(). If in
VMM1 the page was not mapped with PROT_MTE but the pte is accessible,
get_user_pages_fast() won't touch the VMM1 pte, so we have the race
between user_mem_abort() in VMM1 and set_pte_at() in VMM2.

So, AFAICT, MAP_SHARED between two different mms is the only problem
(both for stage 1 and stage 2), hence the big lock that Steven
introduced. I don't like the lock either but we couldn't come up with a
better solution.

I'm happy to document that MAP_SHARED may lose tags but we need to be
more precise than that as people may still want to use MTE with shared
memory as per my first example (otherwise we can block it upfront easily
in arch_validate_flags()). Also the exact behaviour in a MAP_SHARED case
may be quite fragile.

An alternative is to set PG_mte_tagged before we even end up in a
set_pte_at() or user_mem_abort() and that's what patch 2 does here for
stage 1. There are other options like finer grained locking via another
page flag (well, inventing a new page lock that doesn't sleep), maybe an
array of locks indexed by a hash of the pfn to mitigate the big lock
(multiple big locks ;)).

With patch 2 in this series (and an equivalent one in Peter's
optimisation series), PG_mte_tagged is set on page allocation in the
majority of the cases, so we'd very rarely get to the big lock path. We
could do a similar trick with a new vma flag which is set by kvm when a
slot is added. Subsequent __alloc_zeroed_user_highpage() just set
PG_mte_tagged. Note that for MAP_SHARED, we'd still end up on the big
lock path (both for stage 1 and stage 2) since they don't use
__alloc_zeroed_user_highpage().

Another big hammer approach is to scrap PG_mte_tagged altogether. That's
what I had in some early versions before Steven added swap support. Any
page is considered tagged, we always zero the tags in clear_page() and
copy tags in copy_page() (is DC GZVA as fast as DC ZVA? What's the
overhead of copying tags? I don't have access to hardware to benchmark).
The page comparison for KSM would also need to compare the tags (I have
such patch already).

Other suggestions are welcomed, including banning MAP_SHARED with
PROT_MTE.

> > > I'd also like it to be documented that page sharing can only reliably
> > > work with tagging if only one of the mappings is using tags.
> > 
> > I'm not entirely clear whether you mean "can only reliably work" to be
> > "is practically impossible to coordinate tag values", or whether you are
> > proposing to (purposefully) introduce the race with a per-mm lock? (and
> > document it).
> 
> The latter. You can obviously communicate your tags to another task,
> but this should come with attached restrictions (mlock?).

No, it wouldn't, you may confuse them with CHERI tags.

> > I guess we could have a per-mm lock and handle the race if user space
> > screws up with the outcome being lost tags (double clear).
> > 
> > But it feels to me like it could come back to bite in the future since
> > VM_SHARED|VM_MTE will almost always work and I fear someone will start
> > using it since it's permitted by the kernel.
> 
> I'm really worried that performance is going to suck even on a small
> system, and this global lock will be heavily contended, even without
> considering KVM.

I agree, as it currently stands, enabling MTE in the guest will always
serialise user_mem_abort() through a big lock shared by all VMs.
Steven Price June 10, 2021, 8:05 a.m. UTC | #5
On 09/06/2021 18:41, Catalin Marinas wrote:
> On Wed, Jun 09, 2021 at 12:19:31PM +0100, Marc Zyngier wrote:
>> On Wed, 09 Jun 2021 11:51:34 +0100,
>> Steven Price <steven.price@arm.com> wrote:
>>> On 09/06/2021 11:30, Marc Zyngier wrote:
>>>> On Mon, 07 Jun 2021 12:08:09 +0100,
>>>> Steven Price <steven.price@arm.com> wrote:
>>>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>>>> index 125a10e413e9..a3583a7fd400 100644
>>>>> --- a/arch/arm64/kernel/mte.c
>>>>> +++ b/arch/arm64/kernel/mte.c
>>>>> @@ -25,6 +25,7 @@
>>>>>  u64 gcr_kernel_excl __ro_after_init;
>>>>>  
>>>>>  static bool report_fault_once = true;
>>>>> +static DEFINE_SPINLOCK(tag_sync_lock);
>>>>>  
>>>>>  #ifdef CONFIG_KASAN_HW_TAGS
>>>>>  /* Whether the MTE asynchronous mode is enabled. */
>>>>> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>>>>>  
>>>>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>>>>>  {
>>>>> +	unsigned long flags;
>>>>>  	pte_t old_pte = READ_ONCE(*ptep);
>>>>>  
>>>>> +	spin_lock_irqsave(&tag_sync_lock, flags);
>>>>
>>>> having though a bit more about this after an offline discussion with
>>>> Catalin: why can't this lock be made per mm? We can't really share
>>>> tags across processes anyway, so this is limited to threads from the
>>>> same process.
>>>
>>> Currently there's nothing stopping processes sharing tags (mmap(...,
>>> PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
>>> would have been nice if this had just been prevented from the
>>> beginning.
>>
>> I don't think it should be prevented. I think it should be made clear
>> that it is unreliable and that it will result in tag corruption.
>>
>>> Given the above, clearly the lock can't be per mm and robust.
>>
>> I don't think we need to make it robust. The architecture actively
>> prevents sharing if the tags are also shared, just like we can't
>> really expect the VMM to share tags with the guest.
> 
> The architecture does not prevent MTE tag sharing (if that's what you
> meant). The tags are just an additional metadata stored in physical
> memory. It's not associated with the VA (as in the CHERI-style
> capability tags), only checked against the logical tag in a pointer. If
> the architecture prevented MAP_SHARED, we would have prevented PROT_MTE
> on them (well, it's not too late to do this ;)).
> 
> I went with Steven a few times through this exercise, though I tend to
> forget it quickly after. The use-case we had in mind when deciding to
> allow MTE on shared mappings is something like:
> 
> 	int fd = memfd_create("jitted-code", MFD_ALLOW_SEALING);
> 	ftruncate(fd, size);
> 
> 	void* rw_mapping = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 	void* rx_mapping = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
> 
> 	close(fd);
> 
> The above is within the same mm but you might as well have a fork and
> the rx mapping in a child process. Any of the mappings may have
> PROT_MTE from the start or set later with mprotect(), though it's
> probably the rw one only.
> 
> The race we have is in set_pte_at() and the equivalent KVM setting for
> stage 2 (in any combination of these). To detect a page that was not
> previously tagged (first time mapped, remapped with new attributes), we
> have a test like this via set_pte_at():
> 
> 	if (!test_bit(PG_mte_tagged, &page->flags)) {
> 		mte_clear_page_tags(page);
> 		set_bit(PG_mte_tagged, &page->flags);
> 	}
> 
> Calling the above concurrently on a page may cause some tag loss in the
> absence of any locking. Note that it only matters if one of the mappings
> is writable (to write tags), so this excludes CoW (fork, KSM).
> 
> For stage 1, I think almost all cases that end up in set_pte_at() also
> have the page->lock held and the ptl. The exception is mprotect() which
> doesn't bother to look up each page and lock it, it just takes the ptl
> lock. Within the same mm, mprotect() also takes the mmap_lock as a
> writer, so it's all fine. The race is between two mms, one doing an
> mprotect(PROT_MTE) with the page already mapped in its address space and
> the other taking a fault and mapping the page via set_pte_at(). Two
> faults in two mms again are fine because of the page lock.
> 
> For stage 2, the race between the VMM doing an mprotect() and the VM
> going via user_mem_abort() is fine because the former calls
> mmap_write_lock() while the latter mmap_read_lock(). So, as in stage 1,
> the problem in stage 2 is for a MAP_SHARED region that another process
> (maybe spawned by the VMM) calls mprotect(PROT_MTE).
> 
> There is another case of MAP_SHARED in the VMM that does not involve
> mprotect(). The shared page is mapped on fault in VMM2, initially mapped
> as PROT_MTE while VMM1 handles a user_mem_abort() -> hva_to_pfn(). If in
> VMM1 the page was not mapped with PROT_MTE but the pte is accessible,
> get_user_pages_fast() won't touch the VMM1 pte, so we have the race
> between user_mem_abort() in VMM1 and set_pte_at() in VMM2.
> 
> So, AFAICT, MAP_SHARED between two different mms is the only problem
> (both for stage 1 and stage 2), hence the big lock that Steven
> introduced. I don't like the lock either but we couldn't come up with a
> better solution.
> 
> I'm happy to document that MAP_SHARED may lose tags but we need to be
> more precise than that as people may still want to use MTE with shared
> memory as per my first example (otherwise we can block it upfront easily
> in arch_validate_flags()). Also the exact behaviour in a MAP_SHARED case
> may be quite fragile.
> 
> An alternative is to set PG_mte_tagged before we even end up in a
> set_pte_at() or user_mem_abort() and that's what patch 2 does here for
> stage 1. There are other options like finer grained locking via another
> page flag (well, inventing a new page lock that doesn't sleep), maybe an
> array of locks indexed by a hash of the pfn to mitigate the big lock
> (multiple big locks ;)).
> 
> With patch 2 in this series (and an equivalent one in Peter's
> optimisation series), PG_mte_tagged is set on page allocation in the
> majority of the cases, so we'd very rarely get to the big lock path. We
> could do a similar trick with a new vma flag which is set by kvm when a
> slot is added. Subsequent __alloc_zeroed_user_highpage() just set
> PG_mte_tagged. Note that for MAP_SHARED, we'd still end up on the big
> lock path (both for stage 1 and stage 2) since they don't use
> __alloc_zeroed_user_highpage().
> 
> Another big hammer approach is to scrap PG_mte_tagged altogether. That's
> what I had in some early versions before Steven added swap support. Any
> page is considered tagged, we always zero the tags in clear_page() and
> copy tags in copy_page() (is DC GZVA as fast as DC ZVA? What's the
> overhead of copying tags? I don't have access to hardware to benchmark).
> The page comparison for KSM would also need to compare the tags (I have
> such patch already).
> 
> Other suggestions are welcomed, including banning MAP_SHARED with
> PROT_MTE.

The finer grained locking could also be done using a table of spin_locks
and hashing the struct page pointer to choose a lock. This is what
page_waitqueue() does - but unlike the existing mechanism we need
spinlocks not wait queues.

>>>> I'd also like it to be documented that page sharing can only reliably
>>>> work with tagging if only one of the mappings is using tags.
>>>
>>> I'm not entirely clear whether you mean "can only reliably work" to be
>>> "is practically impossible to coordinate tag values", or whether you are
>>> proposing to (purposefully) introduce the race with a per-mm lock? (and
>>> document it).
>>
>> The latter. You can obviously communicate your tags to another task,
>> but this should come with attached restrictions (mlock?).
> 
> No, it wouldn't, you may confuse them with CHERI tags.
> 
>>> I guess we could have a per-mm lock and handle the race if user space
>>> screws up with the outcome being lost tags (double clear).
>>>
>>> But it feels to me like it could come back to bite in the future since
>>> VM_SHARED|VM_MTE will almost always work and I fear someone will start
>>> using it since it's permitted by the kernel.
>>
>> I'm really worried that performance is going to suck even on a small
>> system, and this global lock will be heavily contended, even without
>> considering KVM.
> 
> I agree, as it currently stands, enabling MTE in the guest will always
> serialise user_mem_abort() through a big lock shared by all VMs.
> 

We only serialise in the case where PG_mte_tagged isn't set. Admittedly
this is likely to be VM startup which I know tends to be an important
use case. One other option there would be to provide a mechanism to the
VMM to proactively clear the tags (e.g. a new VM_MTE_INIT flag which
causes the PG_mte_tagged bit to be set but doesn't affect user space's
PTEs).

The problem I've got is without real hardware to benchmark on it's
impossible to know how big of a performance problem this is and
therefore which of the available options (if any) is appropriate. Some
like the "big hammer" of scrapping PG_mte_tagged could hit the
performance of non-MTE paths - but equally might not have any affect and
would simplify the code.

My belief is that this series is now at the stage where it is
functionally correct and it doesn't impact the performance when MTE
isn't in use, so no impact to existing systems or MTE-enabled systems
where user space/VMs are not using MTE. We also have a number of ideas
to investigate if/when we see performance problems with the MTE use
cases. I'm not sure what else we can do until hardware for benchmarking
is readily available.

Steve
Catalin Marinas June 10, 2021, 2:07 p.m. UTC | #6
On Thu, Jun 10, 2021 at 09:05:18AM +0100, Steven Price wrote:
> On 09/06/2021 18:41, Catalin Marinas wrote:
> > On Wed, Jun 09, 2021 at 12:19:31PM +0100, Marc Zyngier wrote:
> >> On Wed, 09 Jun 2021 11:51:34 +0100,
> >> Steven Price <steven.price@arm.com> wrote:
> >>> On 09/06/2021 11:30, Marc Zyngier wrote:
> >>>> On Mon, 07 Jun 2021 12:08:09 +0100,
> >>>> Steven Price <steven.price@arm.com> wrote:
> >>>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >>>>> index 125a10e413e9..a3583a7fd400 100644
> >>>>> --- a/arch/arm64/kernel/mte.c
> >>>>> +++ b/arch/arm64/kernel/mte.c
> >>>>> @@ -25,6 +25,7 @@
> >>>>>  u64 gcr_kernel_excl __ro_after_init;
> >>>>>  
> >>>>>  static bool report_fault_once = true;
> >>>>> +static DEFINE_SPINLOCK(tag_sync_lock);
> >>>>>  
> >>>>>  #ifdef CONFIG_KASAN_HW_TAGS
> >>>>>  /* Whether the MTE asynchronous mode is enabled. */
> >>>>> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
> >>>>>  
> >>>>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
> >>>>>  {
> >>>>> +	unsigned long flags;
> >>>>>  	pte_t old_pte = READ_ONCE(*ptep);
> >>>>>  
> >>>>> +	spin_lock_irqsave(&tag_sync_lock, flags);
> >>>>
> >>>> having though a bit more about this after an offline discussion with
> >>>> Catalin: why can't this lock be made per mm? We can't really share
> >>>> tags across processes anyway, so this is limited to threads from the
> >>>> same process.
> >>>
> >>> Currently there's nothing stopping processes sharing tags (mmap(...,
> >>> PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
> >>> would have been nice if this had just been prevented from the
> >>> beginning.
> >>
> >> I don't think it should be prevented. I think it should be made clear
> >> that it is unreliable and that it will result in tag corruption.
> >>
> >>> Given the above, clearly the lock can't be per mm and robust.
> >>
> >> I don't think we need to make it robust. The architecture actively
> >> prevents sharing if the tags are also shared, just like we can't
> >> really expect the VMM to share tags with the guest.
[...]
> > So, AFAICT, MAP_SHARED between two different mms is the only problem
> > (both for stage 1 and stage 2), hence the big lock that Steven
> > introduced. I don't like the lock either but we couldn't come up with a
> > better solution.
[...]
> > Other suggestions are welcomed, including banning MAP_SHARED with
> > PROT_MTE.
> 
> The finer grained locking could also be done using a table of spin_locks
> and hashing the struct page pointer to choose a lock. This is what
> page_waitqueue() does - but unlike the existing mechanism we need
> spinlocks not wait queues.

That's an option indeed.

> >>> I guess we could have a per-mm lock and handle the race if user space
> >>> screws up with the outcome being lost tags (double clear).
> >>>
> >>> But it feels to me like it could come back to bite in the future since
> >>> VM_SHARED|VM_MTE will almost always work and I fear someone will start
> >>> using it since it's permitted by the kernel.
> >>
> >> I'm really worried that performance is going to suck even on a small
> >> system, and this global lock will be heavily contended, even without
> >> considering KVM.
> > 
> > I agree, as it currently stands, enabling MTE in the guest will always
> > serialise user_mem_abort() through a big lock shared by all VMs.
> 
> We only serialise in the case where PG_mte_tagged isn't set. Admittedly
> this is likely to be VM startup which I know tends to be an important
> use case. One other option there would be to provide a mechanism to the
> VMM to proactively clear the tags (e.g. a new VM_MTE_INIT flag which
> causes the PG_mte_tagged bit to be set but doesn't affect user space's
> PTEs).
> 
> The problem I've got is without real hardware to benchmark on it's
> impossible to know how big of a performance problem this is and
> therefore which of the available options (if any) is appropriate. Some
> like the "big hammer" of scrapping PG_mte_tagged could hit the
> performance of non-MTE paths - but equally might not have any affect and
> would simplify the code.

One reason to keep PG_mte_tagged around for normal user programs,
especially those not using MTE, is that it doesn't affect CoW. Page
copying in two separate loops (for data and tags) does affect
performance. Now, if there's no fork, you may not notice but in the
Android land with zygote, I think it will be quickly observed.

For KVM, I guess we could benchmark a dummy lock array without any MTE
hardware but I don't have a setup around where I can time multiple VMs
starting in parallel.

> My belief is that this series is now at the stage where it is
> functionally correct and it doesn't impact the performance when MTE
> isn't in use, so no impact to existing systems or MTE-enabled systems
> where user space/VMs are not using MTE.

I agree.

> We also have a number of ideas
> to investigate if/when we see performance problems with the MTE use
> cases. I'm not sure what else we can do until hardware for benchmarking
> is readily available.

My concern is that when we hit a bottleneck on that big lock (and I'm
pretty sure we will), do we have a good mitigation or we should rather
break the ABI now and disallow MAP_SHARED? The latter would need to be
done fairly soon as a fix, backported.

I think the possible workarounds are pretty good but maybe we should
also ask user-space people how important MAP_SHARED + PROT_MTE is. If
they don't use it, maybe we shouldn't bother with all this (well,
assuming there's no other race between different mms that we missed).
diff mbox series

Patch

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a3583a7fd400 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@ 
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@  EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+	unsigned long flags;
 	pte_t old_pte = READ_ONCE(*ptep);
 
+	spin_lock_irqsave(&tag_sync_lock, flags);
+
+	/* Recheck with the lock held */
+	if (test_bit(PG_mte_tagged, &page->flags))
+		goto out;
+
 	if (check_swap && is_swap_pte(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-			return;
+		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+			set_bit(PG_mte_tagged, &page->flags);
+			goto out;
+		}
 	}
 
 	page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 	 */
 	smp_wmb();
 	mte_clear_page_tags(page_address(page));
+	set_bit(PG_mte_tagged, &page->flags);
+
+out:
+	spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,7 +77,7 @@  void mte_sync_tags(pte_t *ptep, pte_t pte)
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+		if (!test_bit(PG_mte_tagged, &page->flags))
 			mte_sync_page_tags(page, ptep, check_swap);
 	}
 }