[115/146] mm/rmap: fix potential batched TLB flush race

From: Huang Ying <ying.huang@intel.com>
Subject: mm/rmap: fix potential batched TLB flush race

In theory, the following race is possible for batched TLB flushing.

CPU0                               CPU1
----                               ----
shrink_page_list()
                                   unmap
                                     zap_pte_range()
                                       flush_tlb_batched_pending()
                                         flush_tlb_mm()
  try_to_unmap()
    set_tlb_ubc_flush_pending()
      mm->tlb_flush_batched = true
                                         mm->tlb_flush_batched = false

After the TLB is flushed on CPU1 via flush_tlb_mm() and before
mm->tlb_flush_batched is set to false, a PTE may be unmapped on CPU0
and its TLB flush deferred.  CPU1 then clears the flag, so the deferred
TLB flush is lost.  Although both set_tlb_ubc_flush_pending() and
flush_tlb_batched_pending() are called with the PTL held, the two
callers may hold different PTL instances.

Because the race window is very small, and the lost TLB flush causes
a problem only if a TLB entry is inserted before the unmapping within
the race window, the race is only theoretical.  But the fix is simple
and cheap too.

Syzbot has also reported this race, as follows:

==================================================================
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
 flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
 madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
 walk_pmd_range mm/pagewalk.c:128 [inline]
 walk_pud_range mm/pagewalk.c:205 [inline]
 walk_p4d_range mm/pagewalk.c:240 [inline]
 walk_pgd_range mm/pagewalk.c:277 [inline]
 __walk_page_range+0x981/0x1160 mm/pagewalk.c:379
 walk_page_range+0x131/0x300 mm/pagewalk.c:475
 madvise_free_single_vma mm/madvise.c:734 [inline]
 madvise_dontneed_free mm/madvise.c:822 [inline]
 madvise_vma mm/madvise.c:996 [inline]
 do_madvise+0xe4a/0x1140 mm/madvise.c:1202
 __do_sys_madvise mm/madvise.c:1228 [inline]
 __se_sys_madvise mm/madvise.c:1226 [inline]
 __x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
 set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
 try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
 rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
 try_to_unmap+0xec/0x110
 shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
 shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
 shrink_list mm/vmscan.c:2621 [inline]
 shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
 shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
 shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
 kswapd_shrink_node mm/vmscan.c:4022 [inline]
 balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
 kswapd+0x200/0x340 mm/vmscan.c:4473
 kthread+0x2c7/0x2e0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30

value changed: 0x01 -> 0x00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================

[akpm@linux-foundation.org: tweak comments]
Link: https://lkml.kernel.org/r/20211201021104.126469-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm_types.h |    2 -
 mm/rmap.c                |   43 ++++++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 8 deletions(-)

Message ID: 20220114220916.k8LSl3Sd6%akpm@linux-foundation.org
State: New
Headers:
  Return-Path: <owner-linux-mm@kvack.org>
  Date: Fri, 14 Jan 2022 14:09:16 -0800
  From: Andrew Morton <akpm@linux-foundation.org>
  To: aarcange@redhat.com, akpm@linux-foundation.org,
      dave.hansen@linux.intel.com, elver@google.com, linux-mm@kvack.org,
      luto@kernel.org, mgorman@techsingularity.net,
      mm-commits@vger.kernel.org, namit@vmware.com,
      torvalds@linux-foundation.org, will@kernel.org,
      ying.huang@intel.com, yuzhao@google.com
  Subject: [patch 115/146] mm/rmap: fix potential batched TLB flush race
  Message-ID: <20220114220916.k8LSl3Sd6%akpm@linux-foundation.org>
  In-Reply-To: <20220114140222.6b14f0061194d3200000c52d@linux-foundation.org>
  User-Agent: s-nail v14.8.16
  Sender: owner-linux-mm@kvack.org
  Precedence: bulk