mbox series

[v2,0/4] kasan: Fix ordering between MTE tag colouring and page->flags

Message ID 20220610152141.2148929-1-catalin.marinas@arm.com (mailing list archive)
Headers show
Series kasan: Fix ordering between MTE tag colouring and page->flags | expand

Message

Catalin Marinas June 10, 2022, 3:21 p.m. UTC
Hi,

That's a second attempt on fixing the race race between setting the
allocation (in-memory) tags in a page and the corresponding logical tag
in page->flags. Initial version here:

https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com

This new series does not introduce any new GFP flags but instead always
skips unpoisoning of the user pages (we already skip the poisoning on
free). Any unpoisoned page will have the page->flags tag reset.

For the background:

On a system with MTE and KASAN_HW_TAGS enabled, when a page is allocated
kasan_unpoison_pages() sets a random tag and saves it in page->flags so
that page_to_virt() re-creates the correct tagged pointer. We need to
ensure that the in-memory tags are visible before setting the
page->flags:

P0 (__kasan_unpoison_range):    P1 (access via virt_to_page):
  Wtags=x                         Rflags=x
    |                               |
    | DMB                           | address dependency
    V                               V
  Wflags=x                        Rtags=x

The first patch changes the order of page unpoisoning with the tag
storing in page->flags. page_kasan_tag_set() has the right barriers
through try_cmpxchg().

If a page is mapped in user-space with PROT_MTE, the architecture code
will set the allocation tag to 0 and a subsequent page_to_virt()
dereference will fault. We currently try to fix this by resetting the
tag in page->flags so that it is 0xff (match-all, not faulting).
However, setting the tags and flags can race with another CPU reading
the flags (page_to_virt()) and barriers can't help, e.g.:

P0 (mte_sync_page_tags):        P1 (memcpy from virt_to_page):
                                  Rflags!=0xff
  Wflags=0xff
  DMB (doesn't help)
  Wtags=0
                                  Rtags=0   // fault

Since clearing the flags in the arch code doesn't work, to do this at
page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed.

Thanks.

Catalin Marinas (4):
  mm: kasan: Ensure the tags are visible before the tag in page->flags
  mm: kasan: Skip unpoisoning of user pages
  mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
  arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"

 arch/arm64/kernel/hibernate.c |  5 -----
 arch/arm64/kernel/mte.c       |  9 ---------
 arch/arm64/mm/copypage.c      |  9 ---------
 arch/arm64/mm/fault.c         |  1 -
 arch/arm64/mm/mteswap.c       |  9 ---------
 include/linux/gfp.h           |  2 +-
 mm/kasan/common.c             |  3 ++-
 mm/page_alloc.c               | 19 ++++++++++---------
 8 files changed, 13 insertions(+), 44 deletions(-)

Comments

Will Deacon July 7, 2022, 10:33 a.m. UTC | #1
On Fri, 10 Jun 2022 16:21:37 +0100, Catalin Marinas wrote:
> That's a second attempt on fixing the race race between setting the
> allocation (in-memory) tags in a page and the corresponding logical tag
> in page->flags. Initial version here:
> 
> https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com
> 
> This new series does not introduce any new GFP flags but instead always
> skips unpoisoning of the user pages (we already skip the poisoning on
> free). Any unpoisoned page will have the page->flags tag reset.
> 
> [...]

Applied to arm64 (for-next/mte), thanks!

[1/4] mm: kasan: Ensure the tags are visible before the tag in page->flags
      https://git.kernel.org/arm64/c/ed0a6d1d973e
[2/4] mm: kasan: Skip unpoisoning of user pages
      https://git.kernel.org/arm64/c/70c248aca9e7
[3/4] mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
      https://git.kernel.org/arm64/c/6d05141a3930
[4/4] arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"
      https://git.kernel.org/arm64/c/20794545c146

Cheers,
Will Deacon July 8, 2022, 1:31 p.m. UTC | #2
On Fri, Jun 10, 2022 at 04:21:37PM +0100, Catalin Marinas wrote:
> Hi,
> 
> That's a second attempt on fixing the race race between setting the
> allocation (in-memory) tags in a page and the corresponding logical tag
> in page->flags. Initial version here:
> 
> https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com
> 
> This new series does not introduce any new GFP flags but instead always
> skips unpoisoning of the user pages (we already skip the poisoning on
> free). Any unpoisoned page will have the page->flags tag reset.
> 
> For the background:
> 
> On a system with MTE and KASAN_HW_TAGS enabled, when a page is allocated
> kasan_unpoison_pages() sets a random tag and saves it in page->flags so
> that page_to_virt() re-creates the correct tagged pointer. We need to
> ensure that the in-memory tags are visible before setting the
> page->flags:
> 
> P0 (__kasan_unpoison_range):    P1 (access via virt_to_page):
>   Wtags=x                         Rflags=x
>     |                               |
>     | DMB                           | address dependency
>     V                               V
>   Wflags=x                        Rtags=x
> 
> The first patch changes the order of page unpoisoning with the tag
> storing in page->flags. page_kasan_tag_set() has the right barriers
> through try_cmpxchg().
> 
> If a page is mapped in user-space with PROT_MTE, the architecture code
> will set the allocation tag to 0 and a subsequent page_to_virt()
> dereference will fault. We currently try to fix this by resetting the
> tag in page->flags so that it is 0xff (match-all, not faulting).
> However, setting the tags and flags can race with another CPU reading
> the flags (page_to_virt()) and barriers can't help, e.g.:
> 
> P0 (mte_sync_page_tags):        P1 (memcpy from virt_to_page):
>                                   Rflags!=0xff
>   Wflags=0xff
>   DMB (doesn't help)
>   Wtags=0
>                                   Rtags=0   // fault
> 
> Since clearing the flags in the arch code doesn't work, to do this at
> page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed.

I've picked this up, thanks.

An alternative solution might be to use a seqlock (if you can find somewhere
to put it) so that virt_to_page() spins briefly while the tags and flags
are being updated.

Will
Kuan-Ying Lee (李冠穎) Feb. 2, 2023, 5:25 a.m. UTC | #3
On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote:
> Hi,
> 
> That's a second attempt on fixing the race race between setting the
> allocation (in-memory) tags in a page and the corresponding logical
> tag
> in page->flags. Initial version here:
> 
> 
https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com
> 
> This new series does not introduce any new GFP flags but instead
> always
> skips unpoisoning of the user pages (we already skip the poisoning on
> free). Any unpoisoned page will have the page->flags tag reset.
> 
> For the background:
> 
> On a system with MTE and KASAN_HW_TAGS enabled, when a page is
> allocated
> kasan_unpoison_pages() sets a random tag and saves it in page->flags
> so
> that page_to_virt() re-creates the correct tagged pointer. We need to
> ensure that the in-memory tags are visible before setting the
> page->flags:
> 
> P0 (__kasan_unpoison_range):    P1 (access via virt_to_page):
>   Wtags=x                         Rflags=x
>     |                               |
>     | DMB                           | address dependency
>     V                               V
>   Wflags=x                        Rtags=x
> 
> The first patch changes the order of page unpoisoning with the tag
> storing in page->flags. page_kasan_tag_set() has the right barriers
> through try_cmpxchg().
> 
> If a page is mapped in user-space with PROT_MTE, the architecture
> code
> will set the allocation tag to 0 and a subsequent page_to_virt()
> dereference will fault. We currently try to fix this by resetting the
> tag in page->flags so that it is 0xff (match-all, not faulting).
> However, setting the tags and flags can race with another CPU reading
> the flags (page_to_virt()) and barriers can't help, e.g.:
> 
> P0 (mte_sync_page_tags):        P1 (memcpy from virt_to_page):
>                                   Rflags!=0xff
>   Wflags=0xff
>   DMB (doesn't help)
>   Wtags=0
>                                   Rtags=0   // fault
> 
> Since clearing the flags in the arch code doesn't work, to do this at
> page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed.
> 
> Thanks.
> 
> Catalin Marinas (4):
>   mm: kasan: Ensure the tags are visible before the tag in page-
> >flags
>   mm: kasan: Skip unpoisoning of user pages
>   mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
>   arm64: kasan: Revert "arm64: mte: reset the page tag in page-
> >flags"
> 
>  arch/arm64/kernel/hibernate.c |  5 -----
>  arch/arm64/kernel/mte.c       |  9 ---------
>  arch/arm64/mm/copypage.c      |  9 ---------
>  arch/arm64/mm/fault.c         |  1 -
>  arch/arm64/mm/mteswap.c       |  9 ---------
>  include/linux/gfp.h           |  2 +-
>  mm/kasan/common.c             |  3 ++-
>  mm/page_alloc.c               | 19 ++++++++++---------
>  8 files changed, 13 insertions(+), 44 deletions(-)
> 

Hi kasan maintainers,

We hit the following issue on the android-6.1 devices with MTE and HW
tag kasan enabled.

I observe that the anon flag doesn't have skip_kasan_poison and
skip_kasan_unpoison flag and kasantag is weird.

AFAIK, kasantag of anon flag needs to be 0x0.

[   71.953938] [T1403598] FramePolicy:
[name:report&]=========================================================
=========
[   71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN:
invalid-access in copy_page+0x10/0xd0
[   71.956476] [T1403598] FramePolicy: [name:report&]Read at addr
f0ffff81332a8000 by task FramePolicy/3598
[   71.957673] [T1403598] FramePolicy: [name:report_hw_tags&]Pointer
tag: [f0], memory tag: [ff]
[   71.958746] [T1403598] FramePolicy: [name:report&]
[   71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm:
FramePolicy Tainted: G S      W  OE      6.1.0-mainline-android14-0-
ga8a53f83b9e4 #1
[   71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG) (DT)
[   71.961767] [T1403598] FramePolicy: Call trace:
[   71.962338] [T1403598] FramePolicy:  dump_backtrace+0x108/0x158
[   71.963097] [T1403598] FramePolicy:  show_stack+0x20/0x48
[   71.963782] [T1403598] FramePolicy:  dump_stack_lvl+0x6c/0x88
[   71.964512] [T1403598] FramePolicy:  print_report+0x2cc/0xa64
[   71.965263] [T1403598] FramePolicy:  kasan_report+0xb8/0x138
[   71.965986] [T1403598] FramePolicy:  __do_kernel_fault+0xd4/0x248
[   71.966782] [T1403598] FramePolicy:  do_bad_area+0x38/0xe8
[   71.967484] [T1403598] FramePolicy:  do_tag_check_fault+0x24/0x38
[   71.968261] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
[   71.968973] [T1403598] FramePolicy:  el1_abort+0x44/0x68
[   71.969646] [T1403598] FramePolicy:  el1h_64_sync_handler+0x68/0xb8
[   71.970440] [T1403598] FramePolicy:  el1h_64_sync+0x68/0x6c
[   71.971146] [T1403598] FramePolicy:  copy_page+0x10/0xd0
[   71.971824] [T1403598] FramePolicy:  copy_user_highpage+0x20/0x40
[   71.972603] [T1403598] FramePolicy:  wp_page_copy+0xd0/0x9f8
[   71.973344] [T1403598] FramePolicy:  do_wp_page+0x374/0x3b0
[   71.974056] [T1403598] FramePolicy:  handle_mm_fault+0x3ec/0x119c
[   71.974833] [T1403598] FramePolicy:  do_page_fault+0x344/0x4ac
[   71.975583] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
[   71.976294] [T1403598] FramePolicy:  el0_da+0x4c/0xe0
[   71.976934] [T1403598] FramePolicy:  el0t_64_sync_handler+0xd4/0xfc
[   71.977725] [T1403598] FramePolicy:  el0t_64_sync+0x1a0/0x1a4
[   71.978451] [T1403598] FramePolicy: [name:report&]
[   71.979057] [T1403598] FramePolicy: [name:report&]The buggy address
belongs to the physical page:
[   71.980173] [T1403598] FramePolicy:
[name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13
mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8
[   71.981849] [T1403598] FramePolicy:
[name:debug&]memcg:faffff80c0241000
[   71.982680] [T1403598] FramePolicy: [name:debug&]anon flags:
0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked|arch
_2|zone=1|kasantag=0xf)
[   71.984446] [T1403598] FramePolicy: raw: 43c000000048003e
fffffffe04b99648 fffffffe04cca308 f2ffff8103390831
[   71.985684] [T1403598] FramePolicy: raw: 0000000007884c74
0000000000000000 0000000e0000000c faffff80c0241000
[   71.986919] [T1403598] FramePolicy: [name:debug&]page dumped
because: kasan: bad access detected
[   71.988022] [T1403598] FramePolicy: [name:report&]
[   71.988624] [T1403598] FramePolicy: [name:report&]Memory state
around the buggy address:
[   71.989641] [T1403598] FramePolicy:  ffffff81332a7e00: fe fe fe fe
fe fe fe fe fe fe fe fe fe fe fe fe
[   71.990811] [T1403598] FramePolicy:  ffffff81332a7f00: fe fe fe fe
fe fe fe fe fe fe fe fe fe fe fe fe
[   71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff ff
f0 f0 fc fc fc fc fc fc fc f0 f0 f3
[   71.993149] [T1403598] FramePolicy:
[name:report&]                   ^
[   71.993972] [T1403598] FramePolicy:  ffffff81332a8100: f3 f3 f3 f3
f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0
[   71.995141] [T1403598] FramePolicy:  ffffff81332a8200: f0 fb fb fb
fb fb fb fb f0 f0 fe fe fe fe fe fe
[   71.996332] [T1403598] FramePolicy:
[name:report&]=========================================================
=========

Originally, I suspect that some userspace pages have been migrated so
the page->flags will be lost and page->flags is re-generated by
alloc_pages().

I try the following diff, but it didn't help.

diff --git a/mm/migrate.c b/mm/migrate.c
index dff333593a8a..ed2065908418 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
+#include <linux/kasan.h>
 
 #include <asm/tlbflush.h>
 
@@ -611,6 +612,14 @@ void folio_migrate_flags(struct folio *newfolio,
struct folio *folio)
 
 	if (!folio_test_hugetlb(folio))
 		mem_cgroup_migrate(folio, newfolio);
+
+#ifdef CONFIG_KASAN_HW_TAGS
+	if (kasan_hw_tags_enabled()) {
+		if (folio_test_skip_kasan_poison(folio))
+			folio_set_skip_kasan_poison(newfolio);
+		page_kasan_tag_set(&newfolio->page,
page_kasan_tag(&folio->page));
+	}
+#endif
 }
 EXPORT_SYMBOL(folio_migrate_flags);
 

After I revert this patchset (4 patches), this issue disappear.

>
Andrey Konovalov Feb. 2, 2023, 12:59 p.m. UTC | #4
On Thu, Feb 2, 2023 at 6:25 AM Kuan-Ying Lee (李冠穎)
<Kuan-Ying.Lee@mediatek.com> wrote:
>
> On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote:
> > Hi,
> >
> > That's a second attempt on fixing the race race between setting the
> > allocation (in-memory) tags in a page and the corresponding logical
> > tag
> > in page->flags. Initial version here:
> >
> >
> https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com
> >
> > This new series does not introduce any new GFP flags but instead
> > always
> > skips unpoisoning of the user pages (we already skip the poisoning on
> > free). Any unpoisoned page will have the page->flags tag reset.
> >
> > For the background:
> >
> > On a system with MTE and KASAN_HW_TAGS enabled, when a page is
> > allocated
> > kasan_unpoison_pages() sets a random tag and saves it in page->flags
> > so
> > that page_to_virt() re-creates the correct tagged pointer. We need to
> > ensure that the in-memory tags are visible before setting the
> > page->flags:
> >
> > P0 (__kasan_unpoison_range):    P1 (access via virt_to_page):
> >   Wtags=x                         Rflags=x
> >     |                               |
> >     | DMB                           | address dependency
> >     V                               V
> >   Wflags=x                        Rtags=x
> >
> > The first patch changes the order of page unpoisoning with the tag
> > storing in page->flags. page_kasan_tag_set() has the right barriers
> > through try_cmpxchg().
> >
> > If a page is mapped in user-space with PROT_MTE, the architecture
> > code
> > will set the allocation tag to 0 and a subsequent page_to_virt()
> > dereference will fault. We currently try to fix this by resetting the
> > tag in page->flags so that it is 0xff (match-all, not faulting).
> > However, setting the tags and flags can race with another CPU reading
> > the flags (page_to_virt()) and barriers can't help, e.g.:
> >
> > P0 (mte_sync_page_tags):        P1 (memcpy from virt_to_page):
> >                                   Rflags!=0xff
> >   Wflags=0xff
> >   DMB (doesn't help)
> >   Wtags=0
> >                                   Rtags=0   // fault
> >
> > Since clearing the flags in the arch code doesn't work, to do this at
> > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed.
> >
> > Thanks.
> >
> > Catalin Marinas (4):
> >   mm: kasan: Ensure the tags are visible before the tag in page-
> > >flags
> >   mm: kasan: Skip unpoisoning of user pages
> >   mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
> >   arm64: kasan: Revert "arm64: mte: reset the page tag in page-
> > >flags"
> >
> >  arch/arm64/kernel/hibernate.c |  5 -----
> >  arch/arm64/kernel/mte.c       |  9 ---------
> >  arch/arm64/mm/copypage.c      |  9 ---------
> >  arch/arm64/mm/fault.c         |  1 -
> >  arch/arm64/mm/mteswap.c       |  9 ---------
> >  include/linux/gfp.h           |  2 +-
> >  mm/kasan/common.c             |  3 ++-
> >  mm/page_alloc.c               | 19 ++++++++++---------
> >  8 files changed, 13 insertions(+), 44 deletions(-)
> >
>
> Hi kasan maintainers,
>
> We hit the following issue on the android-6.1 devices with MTE and HW
> tag kasan enabled.
>
> I observe that the anon flag doesn't have skip_kasan_poison and
> skip_kasan_unpoison flag and kasantag is weird.
>
> AFAIK, kasantag of anon flag needs to be 0x0.
>
> [   71.953938] [T1403598] FramePolicy:
> [name:report&]=========================================================
> =========
> [   71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN:
> invalid-access in copy_page+0x10/0xd0
> [   71.956476] [T1403598] FramePolicy: [name:report&]Read at addr
> f0ffff81332a8000 by task FramePolicy/3598
> [   71.957673] [T1403598] FramePolicy: [name:report_hw_tags&]Pointer
> tag: [f0], memory tag: [ff]
> [   71.958746] [T1403598] FramePolicy: [name:report&]
> [   71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm:
> FramePolicy Tainted: G S      W  OE      6.1.0-mainline-android14-0-
> ga8a53f83b9e4 #1
> [   71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG) (DT)
> [   71.961767] [T1403598] FramePolicy: Call trace:
> [   71.962338] [T1403598] FramePolicy:  dump_backtrace+0x108/0x158
> [   71.963097] [T1403598] FramePolicy:  show_stack+0x20/0x48
> [   71.963782] [T1403598] FramePolicy:  dump_stack_lvl+0x6c/0x88
> [   71.964512] [T1403598] FramePolicy:  print_report+0x2cc/0xa64
> [   71.965263] [T1403598] FramePolicy:  kasan_report+0xb8/0x138
> [   71.965986] [T1403598] FramePolicy:  __do_kernel_fault+0xd4/0x248
> [   71.966782] [T1403598] FramePolicy:  do_bad_area+0x38/0xe8
> [   71.967484] [T1403598] FramePolicy:  do_tag_check_fault+0x24/0x38
> [   71.968261] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
> [   71.968973] [T1403598] FramePolicy:  el1_abort+0x44/0x68
> [   71.969646] [T1403598] FramePolicy:  el1h_64_sync_handler+0x68/0xb8
> [   71.970440] [T1403598] FramePolicy:  el1h_64_sync+0x68/0x6c
> [   71.971146] [T1403598] FramePolicy:  copy_page+0x10/0xd0
> [   71.971824] [T1403598] FramePolicy:  copy_user_highpage+0x20/0x40
> [   71.972603] [T1403598] FramePolicy:  wp_page_copy+0xd0/0x9f8
> [   71.973344] [T1403598] FramePolicy:  do_wp_page+0x374/0x3b0
> [   71.974056] [T1403598] FramePolicy:  handle_mm_fault+0x3ec/0x119c
> [   71.974833] [T1403598] FramePolicy:  do_page_fault+0x344/0x4ac
> [   71.975583] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
> [   71.976294] [T1403598] FramePolicy:  el0_da+0x4c/0xe0
> [   71.976934] [T1403598] FramePolicy:  el0t_64_sync_handler+0xd4/0xfc
> [   71.977725] [T1403598] FramePolicy:  el0t_64_sync+0x1a0/0x1a4
> [   71.978451] [T1403598] FramePolicy: [name:report&]
> [   71.979057] [T1403598] FramePolicy: [name:report&]The buggy address
> belongs to the physical page:
> [   71.980173] [T1403598] FramePolicy:
> [name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13
> mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8
> [   71.981849] [T1403598] FramePolicy:
> [name:debug&]memcg:faffff80c0241000
> [   71.982680] [T1403598] FramePolicy: [name:debug&]anon flags:
> 0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked|arch
> _2|zone=1|kasantag=0xf)
> [   71.984446] [T1403598] FramePolicy: raw: 43c000000048003e
> fffffffe04b99648 fffffffe04cca308 f2ffff8103390831
> [   71.985684] [T1403598] FramePolicy: raw: 0000000007884c74
> 0000000000000000 0000000e0000000c faffff80c0241000
> [   71.986919] [T1403598] FramePolicy: [name:debug&]page dumped
> because: kasan: bad access detected
> [   71.988022] [T1403598] FramePolicy: [name:report&]
> [   71.988624] [T1403598] FramePolicy: [name:report&]Memory state
> around the buggy address:
> [   71.989641] [T1403598] FramePolicy:  ffffff81332a7e00: fe fe fe fe
> fe fe fe fe fe fe fe fe fe fe fe fe
> [   71.990811] [T1403598] FramePolicy:  ffffff81332a7f00: fe fe fe fe
> fe fe fe fe fe fe fe fe fe fe fe fe
> [   71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff ff
> f0 f0 fc fc fc fc fc fc fc f0 f0 f3
> [   71.993149] [T1403598] FramePolicy:
> [name:report&]                   ^
> [   71.993972] [T1403598] FramePolicy:  ffffff81332a8100: f3 f3 f3 f3
> f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0
> [   71.995141] [T1403598] FramePolicy:  ffffff81332a8200: f0 fb fb fb
> fb fb fb fb f0 f0 fe fe fe fe fe fe
> [   71.996332] [T1403598] FramePolicy:
> [name:report&]=========================================================
> =========
>
> Originally, I suspect that some userspace pages have been migrated so
> the page->flags will be lost and page->flags is re-generated by
> alloc_pages().

Hi Kuan-Ying,

There recently was a similar crash due to incorrectly implemented sampling.

Do you have the following patch in your tree?

https://android.googlesource.com/kernel/common/+/9f7f5a25f335e6e1484695da9180281a728db7e2

If not, please sync your 6.1 tree with the Android common kernel.
Hopefully this will fix the issue.

Thanks!
Kuan-Ying Lee (李冠穎) Feb. 3, 2023, 3:41 a.m. UTC | #5
On Thu, 2023-02-02 at 13:59 +0100, Andrey Konovalov wrote:
> On Thu, Feb 2, 2023 at 6:25 AM Kuan-Ying Lee (李冠穎)
> <Kuan-Ying.Lee@mediatek.com> wrote:
> > 
> > On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote:
> > > Hi,
> > > 
> > > That's a second attempt on fixing the race race between setting
> > > the
> > > allocation (in-memory) tags in a page and the corresponding
> > > logical
> > > tag
> > > in page->flags. Initial version here:
> > > 
> > > 
> > 
> > 
https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com
> > > 
> > > This new series does not introduce any new GFP flags but instead
> > > always
> > > skips unpoisoning of the user pages (we already skip the
> > > poisoning on
> > > free). Any unpoisoned page will have the page->flags tag reset.
> > > 
> > > For the background:
> > > 
> > > On a system with MTE and KASAN_HW_TAGS enabled, when a page is
> > > allocated
> > > kasan_unpoison_pages() sets a random tag and saves it in page-
> > > >flags
> > > so
> > > that page_to_virt() re-creates the correct tagged pointer. We
> > > need to
> > > ensure that the in-memory tags are visible before setting the
> > > page->flags:
> > > 
> > > P0 (__kasan_unpoison_range):    P1 (access via virt_to_page):
> > >   Wtags=x                         Rflags=x
> > >     |                               |
> > >     | DMB                           | address dependency
> > >     V                               V
> > >   Wflags=x                        Rtags=x
> > > 
> > > The first patch changes the order of page unpoisoning with the
> > > tag
> > > storing in page->flags. page_kasan_tag_set() has the right
> > > barriers
> > > through try_cmpxchg().
> > > 
> > > If a page is mapped in user-space with PROT_MTE, the architecture
> > > code
> > > will set the allocation tag to 0 and a subsequent page_to_virt()
> > > dereference will fault. We currently try to fix this by resetting
> > > the
> > > tag in page->flags so that it is 0xff (match-all, not faulting).
> > > However, setting the tags and flags can race with another CPU
> > > reading
> > > the flags (page_to_virt()) and barriers can't help, e.g.:
> > > 
> > > P0 (mte_sync_page_tags):        P1 (memcpy from virt_to_page):
> > >                                   Rflags!=0xff
> > >   Wflags=0xff
> > >   DMB (doesn't help)
> > >   Wtags=0
> > >                                   Rtags=0   // fault
> > > 
> > > Since clearing the flags in the arch code doesn't work, to do
> > > this at
> > > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed.
> > > 
> > > Thanks.
> > > 
> > > Catalin Marinas (4):
> > >   mm: kasan: Ensure the tags are visible before the tag in page-
> > > > flags
> > > 
> > >   mm: kasan: Skip unpoisoning of user pages
> > >   mm: kasan: Skip page unpoisoning only if
> > > __GFP_SKIP_KASAN_UNPOISON
> > >   arm64: kasan: Revert "arm64: mte: reset the page tag in page-
> > > > flags"
> > > 
> > >  arch/arm64/kernel/hibernate.c |  5 -----
> > >  arch/arm64/kernel/mte.c       |  9 ---------
> > >  arch/arm64/mm/copypage.c      |  9 ---------
> > >  arch/arm64/mm/fault.c         |  1 -
> > >  arch/arm64/mm/mteswap.c       |  9 ---------
> > >  include/linux/gfp.h           |  2 +-
> > >  mm/kasan/common.c             |  3 ++-
> > >  mm/page_alloc.c               | 19 ++++++++++---------
> > >  8 files changed, 13 insertions(+), 44 deletions(-)
> > > 
> > 
> > Hi kasan maintainers,
> > 
> > We hit the following issue on the android-6.1 devices with MTE and
> > HW
> > tag kasan enabled.
> > 
> > I observe that the anon flag doesn't have skip_kasan_poison and
> > skip_kasan_unpoison flag and kasantag is weird.
> > 
> > AFAIK, kasantag of anon flag needs to be 0x0.
> > 
> > [   71.953938] [T1403598] FramePolicy:
> > [name:report&]=====================================================
> > ====
> > =========
> > [   71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN:
> > invalid-access in copy_page+0x10/0xd0
> > [   71.956476] [T1403598] FramePolicy: [name:report&]Read at addr
> > f0ffff81332a8000 by task FramePolicy/3598
> > [   71.957673] [T1403598] FramePolicy:
> > [name:report_hw_tags&]Pointer
> > tag: [f0], memory tag: [ff]
> > [   71.958746] [T1403598] FramePolicy: [name:report&]
> > [   71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm:
> > FramePolicy Tainted: G S      W  OE      6.1.0-mainline-android14-
> > 0-
> > ga8a53f83b9e4 #1
> > [   71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG)
> > (DT)
> > [   71.961767] [T1403598] FramePolicy: Call trace:
> > [   71.962338] [T1403598] FramePolicy:  dump_backtrace+0x108/0x158
> > [   71.963097] [T1403598] FramePolicy:  show_stack+0x20/0x48
> > [   71.963782] [T1403598] FramePolicy:  dump_stack_lvl+0x6c/0x88
> > [   71.964512] [T1403598] FramePolicy:  print_report+0x2cc/0xa64
> > [   71.965263] [T1403598] FramePolicy:  kasan_report+0xb8/0x138
> > [   71.965986] [T1403598]
> > FramePolicy:  __do_kernel_fault+0xd4/0x248
> > [   71.966782] [T1403598] FramePolicy:  do_bad_area+0x38/0xe8
> > [   71.967484] [T1403598]
> > FramePolicy:  do_tag_check_fault+0x24/0x38
> > [   71.968261] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
> > [   71.968973] [T1403598] FramePolicy:  el1_abort+0x44/0x68
> > [   71.969646] [T1403598]
> > FramePolicy:  el1h_64_sync_handler+0x68/0xb8
> > [   71.970440] [T1403598] FramePolicy:  el1h_64_sync+0x68/0x6c
> > [   71.971146] [T1403598] FramePolicy:  copy_page+0x10/0xd0
> > [   71.971824] [T1403598]
> > FramePolicy:  copy_user_highpage+0x20/0x40
> > [   71.972603] [T1403598] FramePolicy:  wp_page_copy+0xd0/0x9f8
> > [   71.973344] [T1403598] FramePolicy:  do_wp_page+0x374/0x3b0
> > [   71.974056] [T1403598]
> > FramePolicy:  handle_mm_fault+0x3ec/0x119c
> > [   71.974833] [T1403598] FramePolicy:  do_page_fault+0x344/0x4ac
> > [   71.975583] [T1403598] FramePolicy:  do_mem_abort+0x48/0xb0
> > [   71.976294] [T1403598] FramePolicy:  el0_da+0x4c/0xe0
> > [   71.976934] [T1403598]
> > FramePolicy:  el0t_64_sync_handler+0xd4/0xfc
> > [   71.977725] [T1403598] FramePolicy:  el0t_64_sync+0x1a0/0x1a4
> > [   71.978451] [T1403598] FramePolicy: [name:report&]
> > [   71.979057] [T1403598] FramePolicy: [name:report&]The buggy
> > address
> > belongs to the physical page:
> > [   71.980173] [T1403598] FramePolicy:
> > [name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13
> > mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8
> > [   71.981849] [T1403598] FramePolicy:
> > [name:debug&]memcg:faffff80c0241000
> > [   71.982680] [T1403598] FramePolicy: [name:debug&]anon flags:
> > 0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked|
> > arch
> > _2|zone=1|kasantag=0xf)
> > [   71.984446] [T1403598] FramePolicy: raw: 43c000000048003e
> > fffffffe04b99648 fffffffe04cca308 f2ffff8103390831
> > [   71.985684] [T1403598] FramePolicy: raw: 0000000007884c74
> > 0000000000000000 0000000e0000000c faffff80c0241000
> > [   71.986919] [T1403598] FramePolicy: [name:debug&]page dumped
> > because: kasan: bad access detected
> > [   71.988022] [T1403598] FramePolicy: [name:report&]
> > [   71.988624] [T1403598] FramePolicy: [name:report&]Memory state
> > around the buggy address:
> > [   71.989641] [T1403598] FramePolicy:  ffffff81332a7e00: fe fe fe
> > fe
> > fe fe fe fe fe fe fe fe fe fe fe fe
> > [   71.990811] [T1403598] FramePolicy:  ffffff81332a7f00: fe fe fe
> > fe
> > fe fe fe fe fe fe fe fe fe fe fe fe
> > [   71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff
> > ff
> > f0 f0 fc fc fc fc fc fc fc f0 f0 f3
> > [   71.993149] [T1403598] FramePolicy:
> > [name:report&]                   ^
> > [   71.993972] [T1403598] FramePolicy:  ffffff81332a8100: f3 f3 f3
> > f3
> > f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0
> > [   71.995141] [T1403598] FramePolicy:  ffffff81332a8200: f0 fb fb
> > fb
> > fb fb fb fb f0 f0 fe fe fe fe fe fe
> > [   71.996332] [T1403598] FramePolicy:
> > [name:report&]=====================================================
> > ====
> > =========
> > 
> > Originally, I suspect that some userspace pages have been migrated
> > so
> > the page->flags will be lost and page->flags is re-generated by
> > alloc_pages().
> 
> Hi Kuan-Ying,
> 
> There recently was a similar crash due to incorrectly implemented
> sampling.
> 
> Do you have the following patch in your tree?
> 
> 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ 
>  
> 
> If not, please sync your 6.1 tree with the Android common kernel.
> Hopefully this will fix the issue.
> 
> Thanks!

Hi Andrey,

Thanks for your advice.

I saw this patch is to fix ("kasan: allow sampling page_alloc
allocations for HW_TAGS").

But our 6.1 tree doesn't have following two commits now.
("FROMGIT: kasan: allow sampling page_alloc allocations for HW_TAGS")
(FROMLIST: kasan: reset page tags properly with sampling)
Andrey Konovalov Feb. 3, 2023, 5:51 p.m. UTC | #6
On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎)
<Kuan-Ying.Lee@mediatek.com> wrote:
>
> > Hi Kuan-Ying,
> >
> > There recently was a similar crash due to incorrectly implemented
> > sampling.
> >
> > Do you have the following patch in your tree?
> >
> >
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$
> >
> >
> > If not, please sync your 6.1 tree with the Android common kernel.
> > Hopefully this will fix the issue.
> >
> > Thanks!
>
> Hi Andrey,
>
> Thanks for your advice.
>
> I saw this patch is to fix ("kasan: allow sampling page_alloc
> allocations for HW_TAGS").
>
> But our 6.1 tree doesn't have following two commits now.
> ("FROMGIT: kasan: allow sampling page_alloc allocations for HW_TAGS")
> (FROMLIST: kasan: reset page tags properly with sampling)

Hi Kuan-Ying,

Just to clarify: these two patches were applied twice: once here on Jan 13:

https://android.googlesource.com/kernel/common/+/a2a9e34d164e90fc08d35fd097a164b9101d72ef
https://android.googlesource.com/kernel/common/+/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b

but then reverted here on Jan 20:

https://android.googlesource.com/kernel/common/+/5503dbe454478fe54b9cac3fc52d4477f52efdc9
https://android.googlesource.com/kernel/common/+/4573a3cf7e18735a477845426238d46d96426bb6

And then once again via the link I sent before together with a fix on Jan 25.

It might be that you still have to former two patches in your tree if
you synced it before the revert.

However, if this is not the case:

Which 6.1 commit is your tree based on?
Do you have any private MTE-related changes in the kernel?
Do you have userspace MTE enabled?

Thanks!
Qun-Wei Lin Feb. 8, 2023, 5:41 a.m. UTC | #7
On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote:
> On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎)
> <Kuan-Ying.Lee@mediatek.com> wrote:
> > 
> > > Hi Kuan-Ying,
> > > 
> > > There recently was a similar crash due to incorrectly implemented
> > > sampling.
> > > 
> > > Do you have the following patch in your tree?
> > > 
> > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$
> > > 
> > > 
> > > If not, please sync your 6.1 tree with the Android common kernel.
> > > Hopefully this will fix the issue.
> > > 
> > > Thanks!
> > 
> > Hi Andrey,
> > 
> > Thanks for your advice.
> > 
> > I saw this patch is to fix ("kasan: allow sampling page_alloc
> > allocations for HW_TAGS").
> > 
> > But our 6.1 tree doesn't have following two commits now.
> > ("FROMGIT: kasan: allow sampling page_alloc allocations for
> > HW_TAGS")
> > (FROMLIST: kasan: reset page tags properly with sampling)
> 
> Hi Kuan-Ying,
> 

Hi Andrey,
I'll stand in for Kuan-Ying as he's out of office.
Thanks for your help!

> Just to clarify: these two patches were applied twice: once here on
> Jan 13:
> 
> 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$ 
>  
> 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$ 
>  
> 

Our codebase does not contain these two patches.

> but then reverted here on Jan 20:
> 
> 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$ 
>  
> 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$ 
>  
> 
> And then once again via the link I sent before together with a fix on
> Jan 25.
> 
> It might be that you still have to former two patches in your tree if
> you synced it before the revert.
> 
> However, if this is not the case:
> 
> Which 6.1 commit is your tree based on?


https://android.googlesource.com/kernel/common/+/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8
(53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in our
tree.

> Do you have any private MTE-related changes in the kernel?

No, all the MTE-related code is the same as Android Common Kernel.

> Do you have userspace MTE enabled?

Yes, we have enabled MTE for both EL1 and EL0.

> 
> Thanks!
Peter Collingbourne Feb. 10, 2023, 6:19 a.m. UTC | #8
On Wed, Feb 08, 2023 at 05:41:45AM +0000, Qun-wei Lin (林群崴) wrote:
> On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote:
> > On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎)
> > <Kuan-Ying.Lee@mediatek.com> wrote:
> > > 
> > > > Hi Kuan-Ying,
> > > > 
> > > > There recently was a similar crash due to incorrectly implemented
> > > > sampling.
> > > > 
> > > > Do you have the following patch in your tree?
> > > > 
> > > > 
> > > 
> > > 
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$
> > > > 
> > > > 
> > > > If not, please sync your 6.1 tree with the Android common kernel.
> > > > Hopefully this will fix the issue.
> > > > 
> > > > Thanks!
> > > 
> > > Hi Andrey,
> > > 
> > > Thanks for your advice.
> > > 
> > > I saw this patch is to fix ("kasan: allow sampling page_alloc
> > > allocations for HW_TAGS").
> > > 
> > > But our 6.1 tree doesn't have following two commits now.
> > > ("FROMGIT: kasan: allow sampling page_alloc allocations for
> > > HW_TAGS")
> > > (FROMLIST: kasan: reset page tags properly with sampling)
> > 
> > Hi Kuan-Ying,
> > 
> 
> Hi Andrey,
> I'll stand in for Kuan-Ying as he's out of office.
> Thanks for your help!
> 
> > Just to clarify: these two patches were applied twice: once here on
> > Jan 13:
> > 
> > 
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$ 
> >  
> > 
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$ 
> >  
> > 
> 
> Our codebase does not contain these two patches.
> 
> > but then reverted here on Jan 20:
> > 
> > 
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$ 
> >  
> > 
> https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$ 
> >  
> > 
> > And then once again via the link I sent before together with a fix on
> > Jan 25.
> > 
> > It might be that you still have to former two patches in your tree if
> > you synced it before the revert.
> > 
> > However, if this is not the case:
> > 
> > Which 6.1 commit is your tree based on?
> 
> 
> https://android.googlesource.com/kernel/common/+/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8
> (53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in our
> tree.
> 
> > Do you have any private MTE-related changes in the kernel?
> 
> No, all the MTE-related code is the same as Android Common Kernel.
> 
> > Do you have userspace MTE enabled?
> 
> Yes, we have enabled MTE for both EL1 and EL0.

Hi Qun-wei,

Thanks for the information. We encountered a similar issue internally
with the Android 5.15 common kernel. We tracked it down to an issue
with page migration, where the source page was a userspace page with
MTE tags, and the target page was allocated using KASAN (i.e. having
a non-zero KASAN tag). This caused tag check faults when the page was
subsequently accessed by the kernel as a result of the mismatching tags
from userspace. Given the number of different ways that page migration
target pages can be allocated, the simplest fix that we could think of
was to synchronize the KASAN tag in copy_highpage().

Can you try the patch below and let us know whether it fixes the issue?

diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index 24913271e898c..87ed38e9747bd 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from)
 
 	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
 		set_bit(PG_mte_tagged, &to->flags);
+		if (kasan_hw_tags_enabled())
+			page_kasan_tag_set(to, page_kasan_tag(from));
 		mte_copy_page_tags(kto, kfrom);
 	}
 }

Catalin, please let us know what you think of the patch above. It
effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert
"arm64: mte: reset the page tag in page->flags""), but this seems okay
to me because the mentioned race condition shouldn't affect "new" pages
such as those being used as migration targets. The smp_wmb() that was
there before doesn't seem necessary for the same reason.

If the patch is okay, we should apply it to the 6.1 stable kernel. The
problem appears to be "fixed" in the mainline kernel because of
a bad merge conflict resolution on my part; when I rebased commit
e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics")
past commit 20794545c146, it looks like I accidentally brought back the
page_kasan_tag_reset() line removed in the latter. But we should align
the mainline kernel with whatever we decide to do on 6.1.

Peter
Catalin Marinas Feb. 10, 2023, 6:28 p.m. UTC | #9
Hi Peter,

On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote:
> Thanks for the information. We encountered a similar issue internally
> with the Android 5.15 common kernel. We tracked it down to an issue
> with page migration, where the source page was a userspace page with
> MTE tags, and the target page was allocated using KASAN (i.e. having
> a non-zero KASAN tag). This caused tag check faults when the page was
> subsequently accessed by the kernel as a result of the mismatching tags
> from userspace. Given the number of different ways that page migration
> target pages can be allocated, the simplest fix that we could think of
> was to synchronize the KASAN tag in copy_highpage().
> 
> Can you try the patch below and let us know whether it fixes the issue?
> 
> diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> index 24913271e898c..87ed38e9747bd 100644
> --- a/arch/arm64/mm/copypage.c
> +++ b/arch/arm64/mm/copypage.c
> @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from)
>  
>  	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
>  		set_bit(PG_mte_tagged, &to->flags);
> +		if (kasan_hw_tags_enabled())
> +			page_kasan_tag_set(to, page_kasan_tag(from));
>  		mte_copy_page_tags(kto, kfrom);

Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the
'from' page, the tags are random anyway and page_kasan_tag(from) should
already be 0xff. It makes more sense to do the same for the 'to' page
rather than copying the tag from the 'from' page. IOW, we are copying
user-controlled tags into a page, the kernel should have a match-all tag
in page->flags.

> Catalin, please let us know what you think of the patch above. It
> effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert
> "arm64: mte: reset the page tag in page->flags""), but this seems okay
> to me because the mentioned race condition shouldn't affect "new" pages
> such as those being used as migration targets. The smp_wmb() that was
> there before doesn't seem necessary for the same reason.
> 
> If the patch is okay, we should apply it to the 6.1 stable kernel. The
> problem appears to be "fixed" in the mainline kernel because of
> a bad merge conflict resolution on my part; when I rebased commit
> e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics")
> past commit 20794545c146, it looks like I accidentally brought back the
> page_kasan_tag_reset() line removed in the latter. But we should align
> the mainline kernel with whatever we decide to do on 6.1.

Happy accident ;). When I reverted such calls in commit 20794545c146, my
assumption was that we always get a page that went through
post_alloc_hook() and the tags were reset. But it seems that's not
always the case (and probably wasteful anyway if we have to zero the
tags and data on a page we know we are going to override via
copy_highpage() anyway). The barrier doesn't help, so we shouldn't add
it back.

So, I'm fine with a stable fix but I wonder whether we should backport
the whole "Fix/clarify the PG_mte_tagged semantics" series instead.
Peter Collingbourne Feb. 10, 2023, 7:03 p.m. UTC | #10
Hi Catalin,

On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> Hi Peter,
>
> On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote:
> > Thanks for the information. We encountered a similar issue internally
> > with the Android 5.15 common kernel. We tracked it down to an issue
> > with page migration, where the source page was a userspace page with
> > MTE tags, and the target page was allocated using KASAN (i.e. having
> > a non-zero KASAN tag). This caused tag check faults when the page was
> > subsequently accessed by the kernel as a result of the mismatching tags
> > from userspace. Given the number of different ways that page migration
> > target pages can be allocated, the simplest fix that we could think of
> > was to synchronize the KASAN tag in copy_highpage().
> >
> > Can you try the patch below and let us know whether it fixes the issue?
> >
> > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> > index 24913271e898c..87ed38e9747bd 100644
> > --- a/arch/arm64/mm/copypage.c
> > +++ b/arch/arm64/mm/copypage.c
> > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from)
> >
> >       if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
> >               set_bit(PG_mte_tagged, &to->flags);
> > +             if (kasan_hw_tags_enabled())
> > +                     page_kasan_tag_set(to, page_kasan_tag(from));
> >               mte_copy_page_tags(kto, kfrom);
>
> Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the
> 'from' page, the tags are random anyway and page_kasan_tag(from) should
> already be 0xff. It makes more sense to do the same for the 'to' page
> rather than copying the tag from the 'from' page. IOW, we are copying
> user-controlled tags into a page, the kernel should have a match-all tag
> in page->flags.

That would also work, but I was thinking that if copy_highpage() were
being used to copy a KASAN page we should keep the original tag in
order to maintain tag checks for page accesses.

> > Catalin, please let us know what you think of the patch above. It
> > effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert
> > "arm64: mte: reset the page tag in page->flags""), but this seems okay
> > to me because the mentioned race condition shouldn't affect "new" pages
> > such as those being used as migration targets. The smp_wmb() that was
> > there before doesn't seem necessary for the same reason.
> >
> > If the patch is okay, we should apply it to the 6.1 stable kernel. The
> > problem appears to be "fixed" in the mainline kernel because of
> > a bad merge conflict resolution on my part; when I rebased commit
> > e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics")
> > past commit 20794545c146, it looks like I accidentally brought back the
> > page_kasan_tag_reset() line removed in the latter. But we should align
> > the mainline kernel with whatever we decide to do on 6.1.
>
> Happy accident ;). When I reverted such calls in commit 20794545c146, my
> assumption was that we always get a page that went through
> post_alloc_hook() and the tags were reset. But it seems that's not
> always the case (and probably wasteful anyway if we have to zero the
> tags and data on a page we know we are going to override via
> copy_highpage() anyway). The barrier doesn't help, so we shouldn't add
> it back.
>
> So, I'm fine with a stable fix but I wonder whether we should backport
> the whole "Fix/clarify the PG_mte_tagged semantics" series instead.

That seems fine to me (or as well as the above patch if we decide to
copy the tag).

Peter
Qun-Wei Lin Feb. 13, 2023, 1:56 a.m. UTC | #11
On Thu, 2023-02-09 at 22:19 -0800, Peter Collingbourne wrote:
> On Wed, Feb 08, 2023 at 05:41:45AM +0000, Qun-wei Lin (林群崴) wrote:
> > On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote:
> > > On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎)
> > > <Kuan-Ying.Lee@mediatek.com> wrote:
> > > > 
> > > > > Hi Kuan-Ying,
> > > > > 
> > > > > There recently was a similar crash due to incorrectly
> > > > > implemented
> > > > > sampling.
> > > > > 
> > > > > Do you have the following patch in your tree?
> > > > > 
> > > > > 
> > > > 
> > > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$
> > > > > 
> > > > > 
> > > > > If not, please sync your 6.1 tree with the Android common
> > > > > kernel.
> > > > > Hopefully this will fix the issue.
> > > > > 
> > > > > Thanks!
> > > > 
> > > > Hi Andrey,
> > > > 
> > > > Thanks for your advice.
> > > > 
> > > > I saw this patch is to fix ("kasan: allow sampling page_alloc
> > > > allocations for HW_TAGS").
> > > > 
> > > > But our 6.1 tree doesn't have following two commits now.
> > > > ("FROMGIT: kasan: allow sampling page_alloc allocations for
> > > > HW_TAGS")
> > > > (FROMLIST: kasan: reset page tags properly with sampling)
> > > 
> > > Hi Kuan-Ying,
> > > 
> > 
> > Hi Andrey,
> > I'll stand in for Kuan-Ying as he's out of office.
> > Thanks for your help!
> > 
> > > Just to clarify: these two patches were applied twice: once here
> > > on
> > > Jan 13:
> > > 
> > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$
> >  
> > >  
> > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$
> >  
> > >  
> > > 
> > 
> > Our codebase does not contain these two patches.
> > 
> > > but then reverted here on Jan 20:
> > > 
> > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$
> >  
> > >  
> > > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$
> >  
> > >  
> > > 
> > > And then once again via the link I sent before together with a
> > > fix on
> > > Jan 25.
> > > 
> > > It might be that you still have to former two patches in your
> > > tree if
> > > you synced it before the revert.
> > > 
> > > However, if this is not the case:
> > > 
> > > Which 6.1 commit is your tree based on?
> > 
> > 
> > 
https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8__;Kw!!CTRNKA9wMg0ARbw!iEzuh9LYXlwXkpcWaHjncfr6lNgTky7OEAEzQ7cIFjlTD__7lwXqAhPJwWJXEnD8THUS7jnBK7hjnHw$ 
> >  
> > (53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in
> > our
> > tree.
> > 
> > > Do you have any private MTE-related changes in the kernel?
> > 
> > No, all the MTE-related code is the same as Android Common Kernel.
> > 
> > > Do you have userspace MTE enabled?
> > 
> > Yes, we have enabled MTE for both EL1 and EL0.
> 
> Hi Qun-wei,
> 
> Thanks for the information. We encountered a similar issue internally
> with the Android 5.15 common kernel. We tracked it down to an issue
> with page migration, where the source page was a userspace page with
> MTE tags, and the target page was allocated using KASAN (i.e. having
> a non-zero KASAN tag). This caused tag check faults when the page was
> subsequently accessed by the kernel as a result of the mismatching
> tags
> from userspace. Given the number of different ways that page
> migration
> target pages can be allocated, the simplest fix that we could think
> of
> was to synchronize the KASAN tag in copy_highpage().
> 
> Can you try the patch below and let us know whether it fixes the
> issue?
> 
> diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> index 24913271e898c..87ed38e9747bd 100644
> --- a/arch/arm64/mm/copypage.c
> +++ b/arch/arm64/mm/copypage.c
> @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page
> *from)
>  
>  	if (system_supports_mte() && test_bit(PG_mte_tagged, &from-
> >flags)) {
>  		set_bit(PG_mte_tagged, &to->flags);
> +		if (kasan_hw_tags_enabled())
> +			page_kasan_tag_set(to, page_kasan_tag(from));
>  		mte_copy_page_tags(kto, kfrom);
>  	}
>  }
> 

Thank you so much, this patch has solved the problem.

> Catalin, please let us know what you think of the patch above. It
> effectively partially undoes commit 20794545c146 ("arm64: kasan:
> Revert
> "arm64: mte: reset the page tag in page->flags""), but this seems
> okay
> to me because the mentioned race condition shouldn't affect "new"
> pages
> such as those being used as migration targets. The smp_wmb() that was
> there before doesn't seem necessary for the same reason.
> 
> If the patch is okay, we should apply it to the 6.1 stable kernel.
> The
> problem appears to be "fixed" in the mainline kernel because of
> a bad merge conflict resolution on my part; when I rebased commit
> e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics")
> past commit 20794545c146, it looks like I accidentally brought back
> the
> page_kasan_tag_reset() line removed in the latter. But we should
> align
> the mainline kernel with whatever we decide to do on 6.1.
> 
> Peter
Catalin Marinas Feb. 13, 2023, 6:47 p.m. UTC | #12
On Fri, Feb 10, 2023 at 11:03:45AM -0800, Peter Collingbourne wrote:
> On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote:
> > > Thanks for the information. We encountered a similar issue internally
> > > with the Android 5.15 common kernel. We tracked it down to an issue
> > > with page migration, where the source page was a userspace page with
> > > MTE tags, and the target page was allocated using KASAN (i.e. having
> > > a non-zero KASAN tag). This caused tag check faults when the page was
> > > subsequently accessed by the kernel as a result of the mismatching tags
> > > from userspace. Given the number of different ways that page migration
> > > target pages can be allocated, the simplest fix that we could think of
> > > was to synchronize the KASAN tag in copy_highpage().
> > >
> > > Can you try the patch below and let us know whether it fixes the issue?
> > >
> > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> > > index 24913271e898c..87ed38e9747bd 100644
> > > --- a/arch/arm64/mm/copypage.c
> > > +++ b/arch/arm64/mm/copypage.c
> > > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from)
> > >
> > >       if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
> > >               set_bit(PG_mte_tagged, &to->flags);
> > > +             if (kasan_hw_tags_enabled())
> > > +                     page_kasan_tag_set(to, page_kasan_tag(from));
> > >               mte_copy_page_tags(kto, kfrom);
> >
> > Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the
> > 'from' page, the tags are random anyway and page_kasan_tag(from) should
> > already be 0xff. It makes more sense to do the same for the 'to' page
> > rather than copying the tag from the 'from' page. IOW, we are copying
> > user-controlled tags into a page, the kernel should have a match-all tag
> > in page->flags.
> 
> That would also work, but I was thinking that if copy_highpage() were
> being used to copy a KASAN page we should keep the original tag in
> order to maintain tag checks for page accesses.

If PG_mte_tagged is set on the source, it means that the tags are no
longer trusted and we should reset to match-all. Otherwise if
copy_highpage() is called on a page that was never mapped as PROT_MTE in
user space, PG_mte_tagged would not be set on the source and no tags
copied. In such case, we should keep the original KASAN tag in the
destination. Unless I misunderstood what you meant.
Peter Collingbourne Feb. 14, 2023, 1:56 a.m. UTC | #13
On Mon, Feb 13, 2023 at 10:47 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> On Fri, Feb 10, 2023 at 11:03:45AM -0800, Peter Collingbourne wrote:
> > On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote:
> > > > Thanks for the information. We encountered a similar issue internally
> > > > with the Android 5.15 common kernel. We tracked it down to an issue
> > > > with page migration, where the source page was a userspace page with
> > > > MTE tags, and the target page was allocated using KASAN (i.e. having
> > > > a non-zero KASAN tag). This caused tag check faults when the page was
> > > > subsequently accessed by the kernel as a result of the mismatching tags
> > > > from userspace. Given the number of different ways that page migration
> > > > target pages can be allocated, the simplest fix that we could think of
> > > > was to synchronize the KASAN tag in copy_highpage().
> > > >
> > > > Can you try the patch below and let us know whether it fixes the issue?
> > > >
> > > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
> > > > index 24913271e898c..87ed38e9747bd 100644
> > > > --- a/arch/arm64/mm/copypage.c
> > > > +++ b/arch/arm64/mm/copypage.c
> > > > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from)
> > > >
> > > >       if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
> > > >               set_bit(PG_mte_tagged, &to->flags);
> > > > +             if (kasan_hw_tags_enabled())
> > > > +                     page_kasan_tag_set(to, page_kasan_tag(from));
> > > >               mte_copy_page_tags(kto, kfrom);
> > >
> > > Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the
> > > 'from' page, the tags are random anyway and page_kasan_tag(from) should
> > > already be 0xff. It makes more sense to do the same for the 'to' page
> > > rather than copying the tag from the 'from' page. IOW, we are copying
> > > user-controlled tags into a page, the kernel should have a match-all tag
> > > in page->flags.
> >
> > That would also work, but I was thinking that if copy_highpage() were
> > being used to copy a KASAN page we should keep the original tag in
> > order to maintain tag checks for page accesses.
>
> If PG_mte_tagged is set on the source, it means that the tags are no
> longer trusted and we should reset to match-all. Otherwise if
> copy_highpage() is called on a page that was never mapped as PROT_MTE in
> user space, PG_mte_tagged would not be set on the source and no tags
> copied. In such case, we should keep the original KASAN tag in the
> destination. Unless I misunderstood what you meant.

I was thinking that it might be possible for PG_mte_tagged to be set
on a page with KASAN tags. But as far as I can tell, this isn't
actually possible (or not intended to be possible, at least). So I
agree, we can just call page_kasan_tag_reset() here.

Patch sent (with stable backport instructions) here:
https://lore.kernel.org/all/20230214015214.747873-1-pcc@google.com/

Peter