Message ID | 20220610152141.2148929-1-catalin.marinas@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | kasan: Fix ordering between MTE tag colouring and page->flags | expand |
On Fri, 10 Jun 2022 16:21:37 +0100, Catalin Marinas wrote: > That's a second attempt on fixing the race race between setting the > allocation (in-memory) tags in a page and the corresponding logical tag > in page->flags. Initial version here: > > https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com > > This new series does not introduce any new GFP flags but instead always > skips unpoisoning of the user pages (we already skip the poisoning on > free). Any unpoisoned page will have the page->flags tag reset. > > [...] Applied to arm64 (for-next/mte), thanks! [1/4] mm: kasan: Ensure the tags are visible before the tag in page->flags https://git.kernel.org/arm64/c/ed0a6d1d973e [2/4] mm: kasan: Skip unpoisoning of user pages https://git.kernel.org/arm64/c/70c248aca9e7 [3/4] mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON https://git.kernel.org/arm64/c/6d05141a3930 [4/4] arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags" https://git.kernel.org/arm64/c/20794545c146 Cheers,
On Fri, Jun 10, 2022 at 04:21:37PM +0100, Catalin Marinas wrote: > Hi, > > That's a second attempt on fixing the race race between setting the > allocation (in-memory) tags in a page and the corresponding logical tag > in page->flags. Initial version here: > > https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com > > This new series does not introduce any new GFP flags but instead always > skips unpoisoning of the user pages (we already skip the poisoning on > free). Any unpoisoned page will have the page->flags tag reset. > > For the background: > > On a system with MTE and KASAN_HW_TAGS enabled, when a page is allocated > kasan_unpoison_pages() sets a random tag and saves it in page->flags so > that page_to_virt() re-creates the correct tagged pointer. We need to > ensure that the in-memory tags are visible before setting the > page->flags: > > P0 (__kasan_unpoison_range): P1 (access via virt_to_page): > Wtags=x Rflags=x > | | > | DMB | address dependency > V V > Wflags=x Rtags=x > > The first patch changes the order of page unpoisoning with the tag > storing in page->flags. page_kasan_tag_set() has the right barriers > through try_cmpxchg(). > > If a page is mapped in user-space with PROT_MTE, the architecture code > will set the allocation tag to 0 and a subsequent page_to_virt() > dereference will fault. We currently try to fix this by resetting the > tag in page->flags so that it is 0xff (match-all, not faulting). > However, setting the tags and flags can race with another CPU reading > the flags (page_to_virt()) and barriers can't help, e.g.: > > P0 (mte_sync_page_tags): P1 (memcpy from virt_to_page): > Rflags!=0xff > Wflags=0xff > DMB (doesn't help) > Wtags=0 > Rtags=0 // fault > > Since clearing the flags in the arch code doesn't work, to do this at > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed. I've picked this up, thanks. An alternative solution might be to use a seqlock (if you can find somewhere to put it) so that virt_to_page() spins briefly while the tags and flags are being updated. Will
On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote: > Hi, > > That's a second attempt on fixing the race race between setting the > allocation (in-memory) tags in a page and the corresponding logical > tag > in page->flags. Initial version here: > > https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com > > This new series does not introduce any new GFP flags but instead > always > skips unpoisoning of the user pages (we already skip the poisoning on > free). Any unpoisoned page will have the page->flags tag reset. > > For the background: > > On a system with MTE and KASAN_HW_TAGS enabled, when a page is > allocated > kasan_unpoison_pages() sets a random tag and saves it in page->flags > so > that page_to_virt() re-creates the correct tagged pointer. We need to > ensure that the in-memory tags are visible before setting the > page->flags: > > P0 (__kasan_unpoison_range): P1 (access via virt_to_page): > Wtags=x Rflags=x > | | > | DMB | address dependency > V V > Wflags=x Rtags=x > > The first patch changes the order of page unpoisoning with the tag > storing in page->flags. page_kasan_tag_set() has the right barriers > through try_cmpxchg(). > > If a page is mapped in user-space with PROT_MTE, the architecture > code > will set the allocation tag to 0 and a subsequent page_to_virt() > dereference will fault. We currently try to fix this by resetting the > tag in page->flags so that it is 0xff (match-all, not faulting). > However, setting the tags and flags can race with another CPU reading > the flags (page_to_virt()) and barriers can't help, e.g.: > > P0 (mte_sync_page_tags): P1 (memcpy from virt_to_page): > Rflags!=0xff > Wflags=0xff > DMB (doesn't help) > Wtags=0 > Rtags=0 // fault > > Since clearing the flags in the arch code doesn't work, to do this at > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed. > > Thanks. > > Catalin Marinas (4): > mm: kasan: Ensure the tags are visible before the tag in page- > >flags > mm: kasan: Skip unpoisoning of user pages > mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON > arm64: kasan: Revert "arm64: mte: reset the page tag in page- > >flags" > > arch/arm64/kernel/hibernate.c | 5 ----- > arch/arm64/kernel/mte.c | 9 --------- > arch/arm64/mm/copypage.c | 9 --------- > arch/arm64/mm/fault.c | 1 - > arch/arm64/mm/mteswap.c | 9 --------- > include/linux/gfp.h | 2 +- > mm/kasan/common.c | 3 ++- > mm/page_alloc.c | 19 ++++++++++--------- > 8 files changed, 13 insertions(+), 44 deletions(-) > Hi kasan maintainers, We hit the following issue on the android-6.1 devices with MTE and HW tag kasan enabled. I observe that the anon flag doesn't have skip_kasan_poison and skip_kasan_unpoison flag and kasantag is weird. AFAIK, kasantag of anon flag needs to be 0x0. [ 71.953938] [T1403598] FramePolicy: [name:report&]========================================================= ========= [ 71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN: invalid-access in copy_page+0x10/0xd0 [ 71.956476] [T1403598] FramePolicy: [name:report&]Read at addr f0ffff81332a8000 by task FramePolicy/3598 [ 71.957673] [T1403598] FramePolicy: [name:report_hw_tags&]Pointer tag: [f0], memory tag: [ff] [ 71.958746] [T1403598] FramePolicy: [name:report&] [ 71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm: FramePolicy Tainted: G S W OE 6.1.0-mainline-android14-0- ga8a53f83b9e4 #1 [ 71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG) (DT) [ 71.961767] [T1403598] FramePolicy: Call trace: [ 71.962338] [T1403598] FramePolicy: dump_backtrace+0x108/0x158 [ 71.963097] [T1403598] FramePolicy: show_stack+0x20/0x48 [ 71.963782] [T1403598] FramePolicy: dump_stack_lvl+0x6c/0x88 [ 71.964512] [T1403598] FramePolicy: print_report+0x2cc/0xa64 [ 71.965263] [T1403598] FramePolicy: kasan_report+0xb8/0x138 [ 71.965986] [T1403598] FramePolicy: __do_kernel_fault+0xd4/0x248 [ 71.966782] [T1403598] FramePolicy: do_bad_area+0x38/0xe8 [ 71.967484] [T1403598] FramePolicy: do_tag_check_fault+0x24/0x38 [ 71.968261] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 [ 71.968973] [T1403598] FramePolicy: el1_abort+0x44/0x68 [ 71.969646] [T1403598] FramePolicy: el1h_64_sync_handler+0x68/0xb8 [ 71.970440] [T1403598] FramePolicy: el1h_64_sync+0x68/0x6c [ 71.971146] [T1403598] FramePolicy: copy_page+0x10/0xd0 [ 71.971824] [T1403598] FramePolicy: copy_user_highpage+0x20/0x40 [ 71.972603] [T1403598] FramePolicy: wp_page_copy+0xd0/0x9f8 [ 71.973344] [T1403598] FramePolicy: do_wp_page+0x374/0x3b0 [ 71.974056] [T1403598] FramePolicy: handle_mm_fault+0x3ec/0x119c [ 71.974833] [T1403598] FramePolicy: do_page_fault+0x344/0x4ac [ 71.975583] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 [ 71.976294] [T1403598] FramePolicy: el0_da+0x4c/0xe0 [ 71.976934] [T1403598] FramePolicy: el0t_64_sync_handler+0xd4/0xfc [ 71.977725] [T1403598] FramePolicy: el0t_64_sync+0x1a0/0x1a4 [ 71.978451] [T1403598] FramePolicy: [name:report&] [ 71.979057] [T1403598] FramePolicy: [name:report&]The buggy address belongs to the physical page: [ 71.980173] [T1403598] FramePolicy: [name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13 mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8 [ 71.981849] [T1403598] FramePolicy: [name:debug&]memcg:faffff80c0241000 [ 71.982680] [T1403598] FramePolicy: [name:debug&]anon flags: 0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked|arch _2|zone=1|kasantag=0xf) [ 71.984446] [T1403598] FramePolicy: raw: 43c000000048003e fffffffe04b99648 fffffffe04cca308 f2ffff8103390831 [ 71.985684] [T1403598] FramePolicy: raw: 0000000007884c74 0000000000000000 0000000e0000000c faffff80c0241000 [ 71.986919] [T1403598] FramePolicy: [name:debug&]page dumped because: kasan: bad access detected [ 71.988022] [T1403598] FramePolicy: [name:report&] [ 71.988624] [T1403598] FramePolicy: [name:report&]Memory state around the buggy address: [ 71.989641] [T1403598] FramePolicy: ffffff81332a7e00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe [ 71.990811] [T1403598] FramePolicy: ffffff81332a7f00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe [ 71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff ff f0 f0 fc fc fc fc fc fc fc f0 f0 f3 [ 71.993149] [T1403598] FramePolicy: [name:report&] ^ [ 71.993972] [T1403598] FramePolicy: ffffff81332a8100: f3 f3 f3 f3 f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0 [ 71.995141] [T1403598] FramePolicy: ffffff81332a8200: f0 fb fb fb fb fb fb fb f0 f0 fe fe fe fe fe fe [ 71.996332] [T1403598] FramePolicy: [name:report&]========================================================= ========= Originally, I suspect that some userspace pages have been migrated so the page->flags will be lost and page->flags is re-generated by alloc_pages(). I try the following diff, but it didn't help. diff --git a/mm/migrate.c b/mm/migrate.c index dff333593a8a..ed2065908418 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -51,6 +51,7 @@ #include <linux/random.h> #include <linux/sched/sysctl.h> #include <linux/memory-tiers.h> +#include <linux/kasan.h> #include <asm/tlbflush.h> @@ -611,6 +612,14 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) if (!folio_test_hugetlb(folio)) mem_cgroup_migrate(folio, newfolio); + +#ifdef CONFIG_KASAN_HW_TAGS + if (kasan_hw_tags_enabled()) { + if (folio_test_skip_kasan_poison(folio)) + folio_set_skip_kasan_poison(newfolio); + page_kasan_tag_set(&newfolio->page, page_kasan_tag(&folio->page)); + } +#endif } EXPORT_SYMBOL(folio_migrate_flags); After I revert this patchset (4 patches), this issue disappear. >
On Thu, Feb 2, 2023 at 6:25 AM Kuan-Ying Lee (李冠穎) <Kuan-Ying.Lee@mediatek.com> wrote: > > On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote: > > Hi, > > > > That's a second attempt on fixing the race race between setting the > > allocation (in-memory) tags in a page and the corresponding logical > > tag > > in page->flags. Initial version here: > > > > > https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com > > > > This new series does not introduce any new GFP flags but instead > > always > > skips unpoisoning of the user pages (we already skip the poisoning on > > free). Any unpoisoned page will have the page->flags tag reset. > > > > For the background: > > > > On a system with MTE and KASAN_HW_TAGS enabled, when a page is > > allocated > > kasan_unpoison_pages() sets a random tag and saves it in page->flags > > so > > that page_to_virt() re-creates the correct tagged pointer. We need to > > ensure that the in-memory tags are visible before setting the > > page->flags: > > > > P0 (__kasan_unpoison_range): P1 (access via virt_to_page): > > Wtags=x Rflags=x > > | | > > | DMB | address dependency > > V V > > Wflags=x Rtags=x > > > > The first patch changes the order of page unpoisoning with the tag > > storing in page->flags. page_kasan_tag_set() has the right barriers > > through try_cmpxchg(). > > > > If a page is mapped in user-space with PROT_MTE, the architecture > > code > > will set the allocation tag to 0 and a subsequent page_to_virt() > > dereference will fault. We currently try to fix this by resetting the > > tag in page->flags so that it is 0xff (match-all, not faulting). > > However, setting the tags and flags can race with another CPU reading > > the flags (page_to_virt()) and barriers can't help, e.g.: > > > > P0 (mte_sync_page_tags): P1 (memcpy from virt_to_page): > > Rflags!=0xff > > Wflags=0xff > > DMB (doesn't help) > > Wtags=0 > > Rtags=0 // fault > > > > Since clearing the flags in the arch code doesn't work, to do this at > > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed. > > > > Thanks. > > > > Catalin Marinas (4): > > mm: kasan: Ensure the tags are visible before the tag in page- > > >flags > > mm: kasan: Skip unpoisoning of user pages > > mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON > > arm64: kasan: Revert "arm64: mte: reset the page tag in page- > > >flags" > > > > arch/arm64/kernel/hibernate.c | 5 ----- > > arch/arm64/kernel/mte.c | 9 --------- > > arch/arm64/mm/copypage.c | 9 --------- > > arch/arm64/mm/fault.c | 1 - > > arch/arm64/mm/mteswap.c | 9 --------- > > include/linux/gfp.h | 2 +- > > mm/kasan/common.c | 3 ++- > > mm/page_alloc.c | 19 ++++++++++--------- > > 8 files changed, 13 insertions(+), 44 deletions(-) > > > > Hi kasan maintainers, > > We hit the following issue on the android-6.1 devices with MTE and HW > tag kasan enabled. > > I observe that the anon flag doesn't have skip_kasan_poison and > skip_kasan_unpoison flag and kasantag is weird. > > AFAIK, kasantag of anon flag needs to be 0x0. > > [ 71.953938] [T1403598] FramePolicy: > [name:report&]========================================================= > ========= > [ 71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN: > invalid-access in copy_page+0x10/0xd0 > [ 71.956476] [T1403598] FramePolicy: [name:report&]Read at addr > f0ffff81332a8000 by task FramePolicy/3598 > [ 71.957673] [T1403598] FramePolicy: [name:report_hw_tags&]Pointer > tag: [f0], memory tag: [ff] > [ 71.958746] [T1403598] FramePolicy: [name:report&] > [ 71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm: > FramePolicy Tainted: G S W OE 6.1.0-mainline-android14-0- > ga8a53f83b9e4 #1 > [ 71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG) (DT) > [ 71.961767] [T1403598] FramePolicy: Call trace: > [ 71.962338] [T1403598] FramePolicy: dump_backtrace+0x108/0x158 > [ 71.963097] [T1403598] FramePolicy: show_stack+0x20/0x48 > [ 71.963782] [T1403598] FramePolicy: dump_stack_lvl+0x6c/0x88 > [ 71.964512] [T1403598] FramePolicy: print_report+0x2cc/0xa64 > [ 71.965263] [T1403598] FramePolicy: kasan_report+0xb8/0x138 > [ 71.965986] [T1403598] FramePolicy: __do_kernel_fault+0xd4/0x248 > [ 71.966782] [T1403598] FramePolicy: do_bad_area+0x38/0xe8 > [ 71.967484] [T1403598] FramePolicy: do_tag_check_fault+0x24/0x38 > [ 71.968261] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 > [ 71.968973] [T1403598] FramePolicy: el1_abort+0x44/0x68 > [ 71.969646] [T1403598] FramePolicy: el1h_64_sync_handler+0x68/0xb8 > [ 71.970440] [T1403598] FramePolicy: el1h_64_sync+0x68/0x6c > [ 71.971146] [T1403598] FramePolicy: copy_page+0x10/0xd0 > [ 71.971824] [T1403598] FramePolicy: copy_user_highpage+0x20/0x40 > [ 71.972603] [T1403598] FramePolicy: wp_page_copy+0xd0/0x9f8 > [ 71.973344] [T1403598] FramePolicy: do_wp_page+0x374/0x3b0 > [ 71.974056] [T1403598] FramePolicy: handle_mm_fault+0x3ec/0x119c > [ 71.974833] [T1403598] FramePolicy: do_page_fault+0x344/0x4ac > [ 71.975583] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 > [ 71.976294] [T1403598] FramePolicy: el0_da+0x4c/0xe0 > [ 71.976934] [T1403598] FramePolicy: el0t_64_sync_handler+0xd4/0xfc > [ 71.977725] [T1403598] FramePolicy: el0t_64_sync+0x1a0/0x1a4 > [ 71.978451] [T1403598] FramePolicy: [name:report&] > [ 71.979057] [T1403598] FramePolicy: [name:report&]The buggy address > belongs to the physical page: > [ 71.980173] [T1403598] FramePolicy: > [name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13 > mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8 > [ 71.981849] [T1403598] FramePolicy: > [name:debug&]memcg:faffff80c0241000 > [ 71.982680] [T1403598] FramePolicy: [name:debug&]anon flags: > 0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked|arch > _2|zone=1|kasantag=0xf) > [ 71.984446] [T1403598] FramePolicy: raw: 43c000000048003e > fffffffe04b99648 fffffffe04cca308 f2ffff8103390831 > [ 71.985684] [T1403598] FramePolicy: raw: 0000000007884c74 > 0000000000000000 0000000e0000000c faffff80c0241000 > [ 71.986919] [T1403598] FramePolicy: [name:debug&]page dumped > because: kasan: bad access detected > [ 71.988022] [T1403598] FramePolicy: [name:report&] > [ 71.988624] [T1403598] FramePolicy: [name:report&]Memory state > around the buggy address: > [ 71.989641] [T1403598] FramePolicy: ffffff81332a7e00: fe fe fe fe > fe fe fe fe fe fe fe fe fe fe fe fe > [ 71.990811] [T1403598] FramePolicy: ffffff81332a7f00: fe fe fe fe > fe fe fe fe fe fe fe fe fe fe fe fe > [ 71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff ff > f0 f0 fc fc fc fc fc fc fc f0 f0 f3 > [ 71.993149] [T1403598] FramePolicy: > [name:report&] ^ > [ 71.993972] [T1403598] FramePolicy: ffffff81332a8100: f3 f3 f3 f3 > f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0 > [ 71.995141] [T1403598] FramePolicy: ffffff81332a8200: f0 fb fb fb > fb fb fb fb f0 f0 fe fe fe fe fe fe > [ 71.996332] [T1403598] FramePolicy: > [name:report&]========================================================= > ========= > > Originally, I suspect that some userspace pages have been migrated so > the page->flags will be lost and page->flags is re-generated by > alloc_pages(). Hi Kuan-Ying, There recently was a similar crash due to incorrectly implemented sampling. Do you have the following patch in your tree? https://android.googlesource.com/kernel/common/+/9f7f5a25f335e6e1484695da9180281a728db7e2 If not, please sync your 6.1 tree with the Android common kernel. Hopefully this will fix the issue. Thanks!
On Thu, 2023-02-02 at 13:59 +0100, Andrey Konovalov wrote: > On Thu, Feb 2, 2023 at 6:25 AM Kuan-Ying Lee (李冠穎) > <Kuan-Ying.Lee@mediatek.com> wrote: > > > > On Fri, 2022-06-10 at 16:21 +0100, Catalin Marinas wrote: > > > Hi, > > > > > > That's a second attempt on fixing the race race between setting > > > the > > > allocation (in-memory) tags in a page and the corresponding > > > logical > > > tag > > > in page->flags. Initial version here: > > > > > > > > > > https://lore.kernel.org/r/20220517180945.756303-1-catalin.marinas@arm.com > > > > > > This new series does not introduce any new GFP flags but instead > > > always > > > skips unpoisoning of the user pages (we already skip the > > > poisoning on > > > free). Any unpoisoned page will have the page->flags tag reset. > > > > > > For the background: > > > > > > On a system with MTE and KASAN_HW_TAGS enabled, when a page is > > > allocated > > > kasan_unpoison_pages() sets a random tag and saves it in page- > > > >flags > > > so > > > that page_to_virt() re-creates the correct tagged pointer. We > > > need to > > > ensure that the in-memory tags are visible before setting the > > > page->flags: > > > > > > P0 (__kasan_unpoison_range): P1 (access via virt_to_page): > > > Wtags=x Rflags=x > > > | | > > > | DMB | address dependency > > > V V > > > Wflags=x Rtags=x > > > > > > The first patch changes the order of page unpoisoning with the > > > tag > > > storing in page->flags. page_kasan_tag_set() has the right > > > barriers > > > through try_cmpxchg(). > > > > > > If a page is mapped in user-space with PROT_MTE, the architecture > > > code > > > will set the allocation tag to 0 and a subsequent page_to_virt() > > > dereference will fault. We currently try to fix this by resetting > > > the > > > tag in page->flags so that it is 0xff (match-all, not faulting). > > > However, setting the tags and flags can race with another CPU > > > reading > > > the flags (page_to_virt()) and barriers can't help, e.g.: > > > > > > P0 (mte_sync_page_tags): P1 (memcpy from virt_to_page): > > > Rflags!=0xff > > > Wflags=0xff > > > DMB (doesn't help) > > > Wtags=0 > > > Rtags=0 // fault > > > > > > Since clearing the flags in the arch code doesn't work, to do > > > this at > > > page allocation time when __GFP_SKIP_KASAN_UNPOISON is passed. > > > > > > Thanks. > > > > > > Catalin Marinas (4): > > > mm: kasan: Ensure the tags are visible before the tag in page- > > > > flags > > > > > > mm: kasan: Skip unpoisoning of user pages > > > mm: kasan: Skip page unpoisoning only if > > > __GFP_SKIP_KASAN_UNPOISON > > > arm64: kasan: Revert "arm64: mte: reset the page tag in page- > > > > flags" > > > > > > arch/arm64/kernel/hibernate.c | 5 ----- > > > arch/arm64/kernel/mte.c | 9 --------- > > > arch/arm64/mm/copypage.c | 9 --------- > > > arch/arm64/mm/fault.c | 1 - > > > arch/arm64/mm/mteswap.c | 9 --------- > > > include/linux/gfp.h | 2 +- > > > mm/kasan/common.c | 3 ++- > > > mm/page_alloc.c | 19 ++++++++++--------- > > > 8 files changed, 13 insertions(+), 44 deletions(-) > > > > > > > Hi kasan maintainers, > > > > We hit the following issue on the android-6.1 devices with MTE and > > HW > > tag kasan enabled. > > > > I observe that the anon flag doesn't have skip_kasan_poison and > > skip_kasan_unpoison flag and kasantag is weird. > > > > AFAIK, kasantag of anon flag needs to be 0x0. > > > > [ 71.953938] [T1403598] FramePolicy: > > [name:report&]===================================================== > > ==== > > ========= > > [ 71.955305] [T1403598] FramePolicy: [name:report&]BUG: KASAN: > > invalid-access in copy_page+0x10/0xd0 > > [ 71.956476] [T1403598] FramePolicy: [name:report&]Read at addr > > f0ffff81332a8000 by task FramePolicy/3598 > > [ 71.957673] [T1403598] FramePolicy: > > [name:report_hw_tags&]Pointer > > tag: [f0], memory tag: [ff] > > [ 71.958746] [T1403598] FramePolicy: [name:report&] > > [ 71.959354] [T1403598] FramePolicy: CPU: 4 PID: 3598 Comm: > > FramePolicy Tainted: G S W OE 6.1.0-mainline-android14- > > 0- > > ga8a53f83b9e4 #1 > > [ 71.960978] [T1403598] FramePolicy: Hardware name: MT6985(ENG) > > (DT) > > [ 71.961767] [T1403598] FramePolicy: Call trace: > > [ 71.962338] [T1403598] FramePolicy: dump_backtrace+0x108/0x158 > > [ 71.963097] [T1403598] FramePolicy: show_stack+0x20/0x48 > > [ 71.963782] [T1403598] FramePolicy: dump_stack_lvl+0x6c/0x88 > > [ 71.964512] [T1403598] FramePolicy: print_report+0x2cc/0xa64 > > [ 71.965263] [T1403598] FramePolicy: kasan_report+0xb8/0x138 > > [ 71.965986] [T1403598] > > FramePolicy: __do_kernel_fault+0xd4/0x248 > > [ 71.966782] [T1403598] FramePolicy: do_bad_area+0x38/0xe8 > > [ 71.967484] [T1403598] > > FramePolicy: do_tag_check_fault+0x24/0x38 > > [ 71.968261] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 > > [ 71.968973] [T1403598] FramePolicy: el1_abort+0x44/0x68 > > [ 71.969646] [T1403598] > > FramePolicy: el1h_64_sync_handler+0x68/0xb8 > > [ 71.970440] [T1403598] FramePolicy: el1h_64_sync+0x68/0x6c > > [ 71.971146] [T1403598] FramePolicy: copy_page+0x10/0xd0 > > [ 71.971824] [T1403598] > > FramePolicy: copy_user_highpage+0x20/0x40 > > [ 71.972603] [T1403598] FramePolicy: wp_page_copy+0xd0/0x9f8 > > [ 71.973344] [T1403598] FramePolicy: do_wp_page+0x374/0x3b0 > > [ 71.974056] [T1403598] > > FramePolicy: handle_mm_fault+0x3ec/0x119c > > [ 71.974833] [T1403598] FramePolicy: do_page_fault+0x344/0x4ac > > [ 71.975583] [T1403598] FramePolicy: do_mem_abort+0x48/0xb0 > > [ 71.976294] [T1403598] FramePolicy: el0_da+0x4c/0xe0 > > [ 71.976934] [T1403598] > > FramePolicy: el0t_64_sync_handler+0xd4/0xfc > > [ 71.977725] [T1403598] FramePolicy: el0t_64_sync+0x1a0/0x1a4 > > [ 71.978451] [T1403598] FramePolicy: [name:report&] > > [ 71.979057] [T1403598] FramePolicy: [name:report&]The buggy > > address > > belongs to the physical page: > > [ 71.980173] [T1403598] FramePolicy: > > [name:debug&]page:fffffffe04ccaa00 refcount:14 mapcount:13 > > mapping:0000000000000000 index:0x7884c74 pfn:0x1732a8 > > [ 71.981849] [T1403598] FramePolicy: > > [name:debug&]memcg:faffff80c0241000 > > [ 71.982680] [T1403598] FramePolicy: [name:debug&]anon flags: > > 0x43c000000048003e(referenced|uptodate|dirty|lru|active|swapbacked| > > arch > > _2|zone=1|kasantag=0xf) > > [ 71.984446] [T1403598] FramePolicy: raw: 43c000000048003e > > fffffffe04b99648 fffffffe04cca308 f2ffff8103390831 > > [ 71.985684] [T1403598] FramePolicy: raw: 0000000007884c74 > > 0000000000000000 0000000e0000000c faffff80c0241000 > > [ 71.986919] [T1403598] FramePolicy: [name:debug&]page dumped > > because: kasan: bad access detected > > [ 71.988022] [T1403598] FramePolicy: [name:report&] > > [ 71.988624] [T1403598] FramePolicy: [name:report&]Memory state > > around the buggy address: > > [ 71.989641] [T1403598] FramePolicy: ffffff81332a7e00: fe fe fe > > fe > > fe fe fe fe fe fe fe fe fe fe fe fe > > [ 71.990811] [T1403598] FramePolicy: ffffff81332a7f00: fe fe fe > > fe > > fe fe fe fe fe fe fe fe fe fe fe fe > > [ 71.991982] [T1403598] FramePolicy: >ffffff81332a8000: ff ff ff > > ff > > f0 f0 fc fc fc fc fc fc fc f0 f0 f3 > > [ 71.993149] [T1403598] FramePolicy: > > [name:report&] ^ > > [ 71.993972] [T1403598] FramePolicy: ffffff81332a8100: f3 f3 f3 > > f3 > > f3 f3 f0 f0 f8 f8 f8 f8 f8 f8 f8 f0 > > [ 71.995141] [T1403598] FramePolicy: ffffff81332a8200: f0 fb fb > > fb > > fb fb fb fb f0 f0 fe fe fe fe fe fe > > [ 71.996332] [T1403598] FramePolicy: > > [name:report&]===================================================== > > ==== > > ========= > > > > Originally, I suspect that some userspace pages have been migrated > > so > > the page->flags will be lost and page->flags is re-generated by > > alloc_pages(). > > Hi Kuan-Ying, > > There recently was a similar crash due to incorrectly implemented > sampling. > > Do you have the following patch in your tree? > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ > > > If not, please sync your 6.1 tree with the Android common kernel. > Hopefully this will fix the issue. > > Thanks! Hi Andrey, Thanks for your advice. I saw this patch is to fix ("kasan: allow sampling page_alloc allocations for HW_TAGS"). But our 6.1 tree doesn't have following two commits now. ("FROMGIT: kasan: allow sampling page_alloc allocations for HW_TAGS") (FROMLIST: kasan: reset page tags properly with sampling)
On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎) <Kuan-Ying.Lee@mediatek.com> wrote: > > > Hi Kuan-Ying, > > > > There recently was a similar crash due to incorrectly implemented > > sampling. > > > > Do you have the following patch in your tree? > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ > > > > > > If not, please sync your 6.1 tree with the Android common kernel. > > Hopefully this will fix the issue. > > > > Thanks! > > Hi Andrey, > > Thanks for your advice. > > I saw this patch is to fix ("kasan: allow sampling page_alloc > allocations for HW_TAGS"). > > But our 6.1 tree doesn't have following two commits now. > ("FROMGIT: kasan: allow sampling page_alloc allocations for HW_TAGS") > (FROMLIST: kasan: reset page tags properly with sampling) Hi Kuan-Ying, Just to clarify: these two patches were applied twice: once here on Jan 13: https://android.googlesource.com/kernel/common/+/a2a9e34d164e90fc08d35fd097a164b9101d72ef https://android.googlesource.com/kernel/common/+/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b but then reverted here on Jan 20: https://android.googlesource.com/kernel/common/+/5503dbe454478fe54b9cac3fc52d4477f52efdc9 https://android.googlesource.com/kernel/common/+/4573a3cf7e18735a477845426238d46d96426bb6 And then once again via the link I sent before together with a fix on Jan 25. It might be that you still have to former two patches in your tree if you synced it before the revert. However, if this is not the case: Which 6.1 commit is your tree based on? Do you have any private MTE-related changes in the kernel? Do you have userspace MTE enabled? Thanks!
On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote: > On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎) > <Kuan-Ying.Lee@mediatek.com> wrote: > > > > > Hi Kuan-Ying, > > > > > > There recently was a similar crash due to incorrectly implemented > > > sampling. > > > > > > Do you have the following patch in your tree? > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ > > > > > > > > > If not, please sync your 6.1 tree with the Android common kernel. > > > Hopefully this will fix the issue. > > > > > > Thanks! > > > > Hi Andrey, > > > > Thanks for your advice. > > > > I saw this patch is to fix ("kasan: allow sampling page_alloc > > allocations for HW_TAGS"). > > > > But our 6.1 tree doesn't have following two commits now. > > ("FROMGIT: kasan: allow sampling page_alloc allocations for > > HW_TAGS") > > (FROMLIST: kasan: reset page tags properly with sampling) > > Hi Kuan-Ying, > Hi Andrey, I'll stand in for Kuan-Ying as he's out of office. Thanks for your help! > Just to clarify: these two patches were applied twice: once here on > Jan 13: > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$ > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$ > > Our codebase does not contain these two patches. > but then reverted here on Jan 20: > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$ > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$ > > > And then once again via the link I sent before together with a fix on > Jan 25. > > It might be that you still have to former two patches in your tree if > you synced it before the revert. > > However, if this is not the case: > > Which 6.1 commit is your tree based on? https://android.googlesource.com/kernel/common/+/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8 (53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in our tree. > Do you have any private MTE-related changes in the kernel? No, all the MTE-related code is the same as Android Common Kernel. > Do you have userspace MTE enabled? Yes, we have enabled MTE for both EL1 and EL0. > > Thanks!
On Wed, Feb 08, 2023 at 05:41:45AM +0000, Qun-wei Lin (林群崴) wrote: > On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote: > > On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎) > > <Kuan-Ying.Lee@mediatek.com> wrote: > > > > > > > Hi Kuan-Ying, > > > > > > > > There recently was a similar crash due to incorrectly implemented > > > > sampling. > > > > > > > > Do you have the following patch in your tree? > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ > > > > > > > > > > > > If not, please sync your 6.1 tree with the Android common kernel. > > > > Hopefully this will fix the issue. > > > > > > > > Thanks! > > > > > > Hi Andrey, > > > > > > Thanks for your advice. > > > > > > I saw this patch is to fix ("kasan: allow sampling page_alloc > > > allocations for HW_TAGS"). > > > > > > But our 6.1 tree doesn't have following two commits now. > > > ("FROMGIT: kasan: allow sampling page_alloc allocations for > > > HW_TAGS") > > > (FROMLIST: kasan: reset page tags properly with sampling) > > > > Hi Kuan-Ying, > > > > Hi Andrey, > I'll stand in for Kuan-Ying as he's out of office. > Thanks for your help! > > > Just to clarify: these two patches were applied twice: once here on > > Jan 13: > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$ > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$ > > > > > > Our codebase does not contain these two patches. > > > but then reverted here on Jan 20: > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$ > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$ > > > > > > And then once again via the link I sent before together with a fix on > > Jan 25. > > > > It might be that you still have to former two patches in your tree if > > you synced it before the revert. > > > > However, if this is not the case: > > > > Which 6.1 commit is your tree based on? > > > https://android.googlesource.com/kernel/common/+/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8 > (53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in our > tree. > > > Do you have any private MTE-related changes in the kernel? > > No, all the MTE-related code is the same as Android Common Kernel. > > > Do you have userspace MTE enabled? > > Yes, we have enabled MTE for both EL1 and EL0. Hi Qun-wei, Thanks for the information. We encountered a similar issue internally with the Android 5.15 common kernel. We tracked it down to an issue with page migration, where the source page was a userspace page with MTE tags, and the target page was allocated using KASAN (i.e. having a non-zero KASAN tag). This caused tag check faults when the page was subsequently accessed by the kernel as a result of the mismatching tags from userspace. Given the number of different ways that page migration target pages can be allocated, the simplest fix that we could think of was to synchronize the KASAN tag in copy_highpage(). Can you try the patch below and let us know whether it fixes the issue? diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c index 24913271e898c..87ed38e9747bd 100644 --- a/arch/arm64/mm/copypage.c +++ b/arch/arm64/mm/copypage.c @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from) if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { set_bit(PG_mte_tagged, &to->flags); + if (kasan_hw_tags_enabled()) + page_kasan_tag_set(to, page_kasan_tag(from)); mte_copy_page_tags(kto, kfrom); } } Catalin, please let us know what you think of the patch above. It effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags""), but this seems okay to me because the mentioned race condition shouldn't affect "new" pages such as those being used as migration targets. The smp_wmb() that was there before doesn't seem necessary for the same reason. If the patch is okay, we should apply it to the 6.1 stable kernel. The problem appears to be "fixed" in the mainline kernel because of a bad merge conflict resolution on my part; when I rebased commit e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics") past commit 20794545c146, it looks like I accidentally brought back the page_kasan_tag_reset() line removed in the latter. But we should align the mainline kernel with whatever we decide to do on 6.1. Peter
Hi Peter, On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote: > Thanks for the information. We encountered a similar issue internally > with the Android 5.15 common kernel. We tracked it down to an issue > with page migration, where the source page was a userspace page with > MTE tags, and the target page was allocated using KASAN (i.e. having > a non-zero KASAN tag). This caused tag check faults when the page was > subsequently accessed by the kernel as a result of the mismatching tags > from userspace. Given the number of different ways that page migration > target pages can be allocated, the simplest fix that we could think of > was to synchronize the KASAN tag in copy_highpage(). > > Can you try the patch below and let us know whether it fixes the issue? > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > index 24913271e898c..87ed38e9747bd 100644 > --- a/arch/arm64/mm/copypage.c > +++ b/arch/arm64/mm/copypage.c > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from) > > if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > set_bit(PG_mte_tagged, &to->flags); > + if (kasan_hw_tags_enabled()) > + page_kasan_tag_set(to, page_kasan_tag(from)); > mte_copy_page_tags(kto, kfrom); Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the 'from' page, the tags are random anyway and page_kasan_tag(from) should already be 0xff. It makes more sense to do the same for the 'to' page rather than copying the tag from the 'from' page. IOW, we are copying user-controlled tags into a page, the kernel should have a match-all tag in page->flags. > Catalin, please let us know what you think of the patch above. It > effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert > "arm64: mte: reset the page tag in page->flags""), but this seems okay > to me because the mentioned race condition shouldn't affect "new" pages > such as those being used as migration targets. The smp_wmb() that was > there before doesn't seem necessary for the same reason. > > If the patch is okay, we should apply it to the 6.1 stable kernel. The > problem appears to be "fixed" in the mainline kernel because of > a bad merge conflict resolution on my part; when I rebased commit > e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics") > past commit 20794545c146, it looks like I accidentally brought back the > page_kasan_tag_reset() line removed in the latter. But we should align > the mainline kernel with whatever we decide to do on 6.1. Happy accident ;). When I reverted such calls in commit 20794545c146, my assumption was that we always get a page that went through post_alloc_hook() and the tags were reset. But it seems that's not always the case (and probably wasteful anyway if we have to zero the tags and data on a page we know we are going to override via copy_highpage() anyway). The barrier doesn't help, so we shouldn't add it back. So, I'm fine with a stable fix but I wonder whether we should backport the whole "Fix/clarify the PG_mte_tagged semantics" series instead.
Hi Catalin, On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Hi Peter, > > On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote: > > Thanks for the information. We encountered a similar issue internally > > with the Android 5.15 common kernel. We tracked it down to an issue > > with page migration, where the source page was a userspace page with > > MTE tags, and the target page was allocated using KASAN (i.e. having > > a non-zero KASAN tag). This caused tag check faults when the page was > > subsequently accessed by the kernel as a result of the mismatching tags > > from userspace. Given the number of different ways that page migration > > target pages can be allocated, the simplest fix that we could think of > > was to synchronize the KASAN tag in copy_highpage(). > > > > Can you try the patch below and let us know whether it fixes the issue? > > > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > > index 24913271e898c..87ed38e9747bd 100644 > > --- a/arch/arm64/mm/copypage.c > > +++ b/arch/arm64/mm/copypage.c > > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from) > > > > if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > > set_bit(PG_mte_tagged, &to->flags); > > + if (kasan_hw_tags_enabled()) > > + page_kasan_tag_set(to, page_kasan_tag(from)); > > mte_copy_page_tags(kto, kfrom); > > Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the > 'from' page, the tags are random anyway and page_kasan_tag(from) should > already be 0xff. It makes more sense to do the same for the 'to' page > rather than copying the tag from the 'from' page. IOW, we are copying > user-controlled tags into a page, the kernel should have a match-all tag > in page->flags. That would also work, but I was thinking that if copy_highpage() were being used to copy a KASAN page we should keep the original tag in order to maintain tag checks for page accesses. > > Catalin, please let us know what you think of the patch above. It > > effectively partially undoes commit 20794545c146 ("arm64: kasan: Revert > > "arm64: mte: reset the page tag in page->flags""), but this seems okay > > to me because the mentioned race condition shouldn't affect "new" pages > > such as those being used as migration targets. The smp_wmb() that was > > there before doesn't seem necessary for the same reason. > > > > If the patch is okay, we should apply it to the 6.1 stable kernel. The > > problem appears to be "fixed" in the mainline kernel because of > > a bad merge conflict resolution on my part; when I rebased commit > > e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics") > > past commit 20794545c146, it looks like I accidentally brought back the > > page_kasan_tag_reset() line removed in the latter. But we should align > > the mainline kernel with whatever we decide to do on 6.1. > > Happy accident ;). When I reverted such calls in commit 20794545c146, my > assumption was that we always get a page that went through > post_alloc_hook() and the tags were reset. But it seems that's not > always the case (and probably wasteful anyway if we have to zero the > tags and data on a page we know we are going to override via > copy_highpage() anyway). The barrier doesn't help, so we shouldn't add > it back. > > So, I'm fine with a stable fix but I wonder whether we should backport > the whole "Fix/clarify the PG_mte_tagged semantics" series instead. That seems fine to me (or as well as the above patch if we decide to copy the tag). Peter
On Thu, 2023-02-09 at 22:19 -0800, Peter Collingbourne wrote: > On Wed, Feb 08, 2023 at 05:41:45AM +0000, Qun-wei Lin (林群崴) wrote: > > On Fri, 2023-02-03 at 18:51 +0100, Andrey Konovalov wrote: > > > On Fri, Feb 3, 2023 at 4:41 AM Kuan-Ying Lee (李冠穎) > > > <Kuan-Ying.Lee@mediatek.com> wrote: > > > > > > > > > Hi Kuan-Ying, > > > > > > > > > > There recently was a similar crash due to incorrectly > > > > > implemented > > > > > sampling. > > > > > > > > > > Do you have the following patch in your tree? > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/9f7f5a25f335e6e1484695da9180281a728db7e2__;Kw!!CTRNKA9wMg0ARbw!hUjRlXirPMSusdIWe0RIPt0PNqIHYDCJyd7GSd4o-TgLMP0CKRUkjElH-jcvtaz42-sgE2U58964rCCbuNTJE5Jx$ > > > > > > > > > > > > > > > If not, please sync your 6.1 tree with the Android common > > > > > kernel. > > > > > Hopefully this will fix the issue. > > > > > > > > > > Thanks! > > > > > > > > Hi Andrey, > > > > > > > > Thanks for your advice. > > > > > > > > I saw this patch is to fix ("kasan: allow sampling page_alloc > > > > allocations for HW_TAGS"). > > > > > > > > But our 6.1 tree doesn't have following two commits now. > > > > ("FROMGIT: kasan: allow sampling page_alloc allocations for > > > > HW_TAGS") > > > > (FROMLIST: kasan: reset page tags properly with sampling) > > > > > > Hi Kuan-Ying, > > > > > > > Hi Andrey, > > I'll stand in for Kuan-Ying as he's out of office. > > Thanks for your help! > > > > > Just to clarify: these two patches were applied twice: once here > > > on > > > Jan 13: > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/a2a9e34d164e90fc08d35fd097a164b9101d72ef__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745_3oO-3k$ > > > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/435e2a6a6c8ba8d0eb55f9aaade53e7a3957322b__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745sDEOYWY$ > > > > > > > > > > > > Our codebase does not contain these two patches. > > > > > but then reverted here on Jan 20: > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/5503dbe454478fe54b9cac3fc52d4477f52efdc9__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745Bl77dFY$ > > > > > > > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/4573a3cf7e18735a477845426238d46d96426bb6__;Kw!!CTRNKA9wMg0ARbw!kE1XiSmunRcQb9rTpKGkFc1EFJA57qr1cj7v9EZAjUBzXcSzMl-ofCI2mdtEQsxn3J4n7Lkgxb0_G745K-J8O-w$ > > > > > > > > > > > And then once again via the link I sent before together with a > > > fix on > > > Jan 25. > > > > > > It might be that you still have to former two patches in your > > > tree if > > > you synced it before the revert. > > > > > > However, if this is not the case: > > > > > > Which 6.1 commit is your tree based on? > > > > > > https://urldefense.com/v3/__https://android.googlesource.com/kernel/common/*/53b3a7721b7aec74d8fa2ee55c2480044cc7c1b8__;Kw!!CTRNKA9wMg0ARbw!iEzuh9LYXlwXkpcWaHjncfr6lNgTky7OEAEzQ7cIFjlTD__7lwXqAhPJwWJXEnD8THUS7jnBK7hjnHw$ > > > > (53b3a77 Merge 6.1.1 into android14-6.1) is the latest commit in > > our > > tree. > > > > > Do you have any private MTE-related changes in the kernel? > > > > No, all the MTE-related code is the same as Android Common Kernel. > > > > > Do you have userspace MTE enabled? > > > > Yes, we have enabled MTE for both EL1 and EL0. > > Hi Qun-wei, > > Thanks for the information. We encountered a similar issue internally > with the Android 5.15 common kernel. We tracked it down to an issue > with page migration, where the source page was a userspace page with > MTE tags, and the target page was allocated using KASAN (i.e. having > a non-zero KASAN tag). This caused tag check faults when the page was > subsequently accessed by the kernel as a result of the mismatching > tags > from userspace. Given the number of different ways that page > migration > target pages can be allocated, the simplest fix that we could think > of > was to synchronize the KASAN tag in copy_highpage(). > > Can you try the patch below and let us know whether it fixes the > issue? > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > index 24913271e898c..87ed38e9747bd 100644 > --- a/arch/arm64/mm/copypage.c > +++ b/arch/arm64/mm/copypage.c > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page > *from) > > if (system_supports_mte() && test_bit(PG_mte_tagged, &from- > >flags)) { > set_bit(PG_mte_tagged, &to->flags); > + if (kasan_hw_tags_enabled()) > + page_kasan_tag_set(to, page_kasan_tag(from)); > mte_copy_page_tags(kto, kfrom); > } > } > Thank you so much, this patch has solved the problem. > Catalin, please let us know what you think of the patch above. It > effectively partially undoes commit 20794545c146 ("arm64: kasan: > Revert > "arm64: mte: reset the page tag in page->flags""), but this seems > okay > to me because the mentioned race condition shouldn't affect "new" > pages > such as those being used as migration targets. The smp_wmb() that was > there before doesn't seem necessary for the same reason. > > If the patch is okay, we should apply it to the 6.1 stable kernel. > The > problem appears to be "fixed" in the mainline kernel because of > a bad merge conflict resolution on my part; when I rebased commit > e059853d14ca ("arm64: mte: Fix/clarify the PG_mte_tagged semantics") > past commit 20794545c146, it looks like I accidentally brought back > the > page_kasan_tag_reset() line removed in the latter. But we should > align > the mainline kernel with whatever we decide to do on 6.1. > > Peter
On Fri, Feb 10, 2023 at 11:03:45AM -0800, Peter Collingbourne wrote: > On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote: > > > Thanks for the information. We encountered a similar issue internally > > > with the Android 5.15 common kernel. We tracked it down to an issue > > > with page migration, where the source page was a userspace page with > > > MTE tags, and the target page was allocated using KASAN (i.e. having > > > a non-zero KASAN tag). This caused tag check faults when the page was > > > subsequently accessed by the kernel as a result of the mismatching tags > > > from userspace. Given the number of different ways that page migration > > > target pages can be allocated, the simplest fix that we could think of > > > was to synchronize the KASAN tag in copy_highpage(). > > > > > > Can you try the patch below and let us know whether it fixes the issue? > > > > > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > > > index 24913271e898c..87ed38e9747bd 100644 > > > --- a/arch/arm64/mm/copypage.c > > > +++ b/arch/arm64/mm/copypage.c > > > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from) > > > > > > if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > > > set_bit(PG_mte_tagged, &to->flags); > > > + if (kasan_hw_tags_enabled()) > > > + page_kasan_tag_set(to, page_kasan_tag(from)); > > > mte_copy_page_tags(kto, kfrom); > > > > Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the > > 'from' page, the tags are random anyway and page_kasan_tag(from) should > > already be 0xff. It makes more sense to do the same for the 'to' page > > rather than copying the tag from the 'from' page. IOW, we are copying > > user-controlled tags into a page, the kernel should have a match-all tag > > in page->flags. > > That would also work, but I was thinking that if copy_highpage() were > being used to copy a KASAN page we should keep the original tag in > order to maintain tag checks for page accesses. If PG_mte_tagged is set on the source, it means that the tags are no longer trusted and we should reset to match-all. Otherwise if copy_highpage() is called on a page that was never mapped as PROT_MTE in user space, PG_mte_tagged would not be set on the source and no tags copied. In such case, we should keep the original KASAN tag in the destination. Unless I misunderstood what you meant.
On Mon, Feb 13, 2023 at 10:47 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Feb 10, 2023 at 11:03:45AM -0800, Peter Collingbourne wrote: > > On Fri, Feb 10, 2023 at 10:28 AM Catalin Marinas > > <catalin.marinas@arm.com> wrote: > > > On Thu, Feb 09, 2023 at 10:19:20PM -0800, Peter Collingbourne wrote: > > > > Thanks for the information. We encountered a similar issue internally > > > > with the Android 5.15 common kernel. We tracked it down to an issue > > > > with page migration, where the source page was a userspace page with > > > > MTE tags, and the target page was allocated using KASAN (i.e. having > > > > a non-zero KASAN tag). This caused tag check faults when the page was > > > > subsequently accessed by the kernel as a result of the mismatching tags > > > > from userspace. Given the number of different ways that page migration > > > > target pages can be allocated, the simplest fix that we could think of > > > > was to synchronize the KASAN tag in copy_highpage(). > > > > > > > > Can you try the patch below and let us know whether it fixes the issue? > > > > > > > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > > > > index 24913271e898c..87ed38e9747bd 100644 > > > > --- a/arch/arm64/mm/copypage.c > > > > +++ b/arch/arm64/mm/copypage.c > > > > @@ -23,6 +23,8 @@ void copy_highpage(struct page *to, struct page *from) > > > > > > > > if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > > > > set_bit(PG_mte_tagged, &to->flags); > > > > + if (kasan_hw_tags_enabled()) > > > > + page_kasan_tag_set(to, page_kasan_tag(from)); > > > > mte_copy_page_tags(kto, kfrom); > > > > > > Why not just page_kasan_tag_reset(to)? If PG_mte_tagged is set on the > > > 'from' page, the tags are random anyway and page_kasan_tag(from) should > > > already be 0xff. It makes more sense to do the same for the 'to' page > > > rather than copying the tag from the 'from' page. IOW, we are copying > > > user-controlled tags into a page, the kernel should have a match-all tag > > > in page->flags. > > > > That would also work, but I was thinking that if copy_highpage() were > > being used to copy a KASAN page we should keep the original tag in > > order to maintain tag checks for page accesses. > > If PG_mte_tagged is set on the source, it means that the tags are no > longer trusted and we should reset to match-all. Otherwise if > copy_highpage() is called on a page that was never mapped as PROT_MTE in > user space, PG_mte_tagged would not be set on the source and no tags > copied. In such case, we should keep the original KASAN tag in the > destination. Unless I misunderstood what you meant. I was thinking that it might be possible for PG_mte_tagged to be set on a page with KASAN tags. But as far as I can tell, this isn't actually possible (or not intended to be possible, at least). So I agree, we can just call page_kasan_tag_reset() here. Patch sent (with stable backport instructions) here: https://lore.kernel.org/all/20230214015214.747873-1-pcc@google.com/ Peter