diff mbox series

mm/rmap: fix potential batched TLB flush race

Message ID 20211123074344.1877731-1-ying.huang@intel.com (mailing list archive)
State New
Headers show
Series mm/rmap: fix potential batched TLB flush race | expand

Commit Message

Huang, Ying Nov. 23, 2021, 7:43 a.m. UTC
In theory, the following race is possible for batched TLB flushing.

CPU0                               CPU1
----                               ----
shrink_page_list()
                                   unmap
                                     zap_pte_range()
                                       flush_tlb_batched_pending()
                                         flush_tlb_mm()
  try_to_unmap()
    set_tlb_ubc_flush_pending()
      mm->tlb_flush_batched = true
                                         mm->tlb_flush_batched = false

After the TLB is flushed on CPU1 via flush_tlb_mm() and before
mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0
and the TLB flushing is pended.  Then the pended TLB flushing will be
lost.  Although both set_tlb_ubc_flush_pending() and
flush_tlb_batched_pending() are called with PTL locked, different PTL
instances may be used.

Because the race window is really small, and the lost TLB flushing
will cause problem only if a TLB entry is inserted before the
unmapping in the race window, the race is only theoretical.  But the
fix is simple and cheap too.

Syzbot has reported this too as follows,

==================================================================
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
 flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
 madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
 walk_pmd_range mm/pagewalk.c:128 [inline]
 walk_pud_range mm/pagewalk.c:205 [inline]
 walk_p4d_range mm/pagewalk.c:240 [inline]
 walk_pgd_range mm/pagewalk.c:277 [inline]
 __walk_page_range+0x981/0x1160 mm/pagewalk.c:379
 walk_page_range+0x131/0x300 mm/pagewalk.c:475
 madvise_free_single_vma mm/madvise.c:734 [inline]
 madvise_dontneed_free mm/madvise.c:822 [inline]
 madvise_vma mm/madvise.c:996 [inline]
 do_madvise+0xe4a/0x1140 mm/madvise.c:1202
 __do_sys_madvise mm/madvise.c:1228 [inline]
 __se_sys_madvise mm/madvise.c:1226 [inline]
 __x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
 set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
 try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
 rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
 try_to_unmap+0xec/0x110
 shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
 shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
 shrink_list mm/vmscan.c:2621 [inline]
 shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
 shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
 shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
 kswapd_shrink_node mm/vmscan.c:4022 [inline]
 balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
 kswapd+0x200/0x340 mm/vmscan.c:4473
 kthread+0x2c7/0x2e0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30

value changed: 0x01 -> 0x00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Marco Elver <elver@google.com>
---
 include/linux/mm_types.h |  2 +-
 mm/rmap.c                | 15 ++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

Comments

Marco Elver Nov. 23, 2021, 9:33 a.m. UTC | #1
On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
>
> In theory, the following race is possible for batched TLB flushing.
>
> CPU0                               CPU1
> ----                               ----
> shrink_page_list()
>                                    unmap
>                                      zap_pte_range()
>                                        flush_tlb_batched_pending()
>                                          flush_tlb_mm()
>   try_to_unmap()
>     set_tlb_ubc_flush_pending()
>       mm->tlb_flush_batched = true
>                                          mm->tlb_flush_batched = false
>
> After the TLB is flushed on CPU1 via flush_tlb_mm() and before
> mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0
> and the TLB flushing is pended.  Then the pended TLB flushing will be
> lost.  Although both set_tlb_ubc_flush_pending() and
> flush_tlb_batched_pending() are called with PTL locked, different PTL
> instances may be used.
>
> Because the race window is really small, and the lost TLB flushing
> will cause problem only if a TLB entry is inserted before the
> unmapping in the race window, the race is only theoretical.  But the
> fix is simple and cheap too.

Thanks for fixing this!

> Syzbot has reported this too as follows,
>
> ==================================================================
> BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
[...]
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c3a6e6209600..789778067db9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -632,7 +632,7 @@ struct mm_struct {
>                 atomic_t tlb_flush_pending;
>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>                 /* See flush_tlb_batched_pending() */
> -               bool tlb_flush_batched;
> +               atomic_t tlb_flush_batched;
>  #endif
>                 struct uprobes_state uprobes_state;
>  #ifdef CONFIG_PREEMPT_RT
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 163ac4e6bcee..60902c3cfb4a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>          * before the PTE is cleared.
>          */
>         barrier();
> -       mm->tlb_flush_batched = true;
> +       atomic_inc(&mm->tlb_flush_batched);

The use of barrier() and atomic needs some clarification. Is there a
requirement that the CPU also doesn't reorder anything after this
atomic_inc() (which is unordered)? I.e. should this be
atomic_inc_return_release() and remove barrier()?

>         /*
>          * If the PTE was dirty then it's best to assume it's writable. The
> @@ -680,15 +680,16 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>   */
>  void flush_tlb_batched_pending(struct mm_struct *mm)
>  {
> -       if (data_race(mm->tlb_flush_batched)) {
> -               flush_tlb_mm(mm);
> +       int batched = atomic_read(&mm->tlb_flush_batched);
>
> +       if (batched) {
> +               flush_tlb_mm(mm);
>                 /*
> -                * Do not allow the compiler to re-order the clearing of
> -                * tlb_flush_batched before the tlb is flushed.
> +                * If the new TLB flushing is pended during flushing,
> +                * leave mm->tlb_flush_batched as is, to avoid to lose
> +                * flushing.
>                  */
> -               barrier();
> -               mm->tlb_flush_batched = false;
> +               atomic_cmpxchg(&mm->tlb_flush_batched, batched, 0);
>         }
>  }
>  #else
> --
> 2.30.2
>
Nadav Amit Nov. 23, 2021, 3:28 p.m. UTC | #2
> On Nov 22, 2021, at 11:43 PM, Huang Ying <ying.huang@intel.com> wrote:
> 
> In theory, the following race is possible for batched TLB flushing.
> 
> CPU0                               CPU1
> ----                               ----
> shrink_page_list()
>                                   unmap
>                                     zap_pte_range()
>                                       flush_tlb_batched_pending()
>                                         flush_tlb_mm()
>  try_to_unmap()
>    set_tlb_ubc_flush_pending()
>      mm->tlb_flush_batched = true
>                                         mm->tlb_flush_batched = false
> 
> After the TLB is flushed on CPU1 via flush_tlb_mm() and before
> mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0
> and the TLB flushing is pended.  Then the pended TLB flushing will be
> lost.  Although both set_tlb_ubc_flush_pending() and
> flush_tlb_batched_pending() are called with PTL locked, different PTL
> instances may be used.
> 
> Because the race window is really small, and the lost TLB flushing
> will cause problem only if a TLB entry is inserted before the
> unmapping in the race window, the race is only theoretical.  But the
> fix is simple and cheap too.
> 
> Syzbot has reported this too as follows,
> 
> ==================================================================
> BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
> 
> write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
> flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
> madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
> walk_pmd_range mm/pagewalk.c:128 [inline]
> walk_pud_range mm/pagewalk.c:205 [inline]
> walk_p4d_range mm/pagewalk.c:240 [inline]
> walk_pgd_range mm/pagewalk.c:277 [inline]
> __walk_page_range+0x981/0x1160 mm/pagewalk.c:379
> walk_page_range+0x131/0x300 mm/pagewalk.c:475
> madvise_free_single_vma mm/madvise.c:734 [inline]
> madvise_dontneed_free mm/madvise.c:822 [inline]
> madvise_vma mm/madvise.c:996 [inline]
> do_madvise+0xe4a/0x1140 mm/madvise.c:1202
> __do_sys_madvise mm/madvise.c:1228 [inline]
> __se_sys_madvise mm/madvise.c:1226 [inline]
> __x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
> set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
> try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
> rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
> try_to_unmap+0xec/0x110
> shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
> shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
> shrink_list mm/vmscan.c:2621 [inline]
> shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
> shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
> shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
> kswapd_shrink_node mm/vmscan.c:4022 [inline]
> balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
> kswapd+0x200/0x340 mm/vmscan.c:4473
> kthread+0x2c7/0x2e0 kernel/kthread.c:327
> ret_from_fork+0x1f/0x30
> 
> value changed: 0x01 -> 0x00
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> ==================================================================
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Marco Elver <elver@google.com>
> ---
> include/linux/mm_types.h |  2 +-
> mm/rmap.c                | 15 ++++++++-------
> 2 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c3a6e6209600..789778067db9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -632,7 +632,7 @@ struct mm_struct {
> 		atomic_t tlb_flush_pending;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> 		/* See flush_tlb_batched_pending() */
> -		bool tlb_flush_batched;
> +		atomic_t tlb_flush_batched;
> #endif
> 		struct uprobes_state uprobes_state;
> #ifdef CONFIG_PREEMPT_RT
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 163ac4e6bcee..60902c3cfb4a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> 	 * before the PTE is cleared.
> 	 */
> 	barrier();
> -	mm->tlb_flush_batched = true;
> +	atomic_inc(&mm->tlb_flush_batched);
> 
> 	/*
> 	 * If the PTE was dirty then it's best to assume it's writable. The
> @@ -680,15 +680,16 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  */
> void flush_tlb_batched_pending(struct mm_struct *mm)
> {
> -	if (data_race(mm->tlb_flush_batched)) {
> -		flush_tlb_mm(mm);
> +	int batched = atomic_read(&mm->tlb_flush_batched);
> 
> +	if (batched) {
> +		flush_tlb_mm(mm);
> 		/*
> -		 * Do not allow the compiler to re-order the clearing of
> -		 * tlb_flush_batched before the tlb is flushed.
> +		 * If the new TLB flushing is pended during flushing,
> +		 * leave mm->tlb_flush_batched as is, to avoid to lose
> +		 * flushing.
> 		 */
> -		barrier();
> -		mm->tlb_flush_batched = false;
> +		atomic_cmpxchg(&mm->tlb_flush_batched, batched, 0);

This does not seem to prevent a race completely.

  CPU0		CPU1		CPU2

  set_tlb_ubc_flush_pending()
  [ tlb_flush_batched = 1 ]

				flush_tlb_batched_pending()
				[ batched = 1 ]
				 flush_tlb_mm()
				 ...
  flush_tlb_batched_pending()
  [ tlb_flush_batched = 0 ]
				

		ptep_get_and_clear()
		set_tlb_ubc_flush_pending()
		[ tlb_flush_batched = 1 ]
					
					
				 ...
				 atomic_cmpxchg()
				 [ succeeds ]

At the end of this flow tlb_flush_batched is 0 although
the TLB flush of CPU1’s newly added PTE was not done.

If you go with your approach you need to have two atomic
counters, one of the flushed “generation” and one of the
pending “generation”. 

Anyhow, I am just mentioning that I think a more fundamental
solution is appropriate to take into account other flushes
that might render flush_tlb_batched_pending() flush
unnecessary.
Huang, Ying Nov. 24, 2021, 1:27 a.m. UTC | #3
Nadav Amit <namit@vmware.com> writes:

>> On Nov 22, 2021, at 11:43 PM, Huang Ying <ying.huang@intel.com> wrote:
>> 
>> In theory, the following race is possible for batched TLB flushing.
>> 
>> CPU0                               CPU1
>> ----                               ----
>> shrink_page_list()
>>                                   unmap
>>                                     zap_pte_range()
>>                                       flush_tlb_batched_pending()
>>                                         flush_tlb_mm()
>>  try_to_unmap()
>>    set_tlb_ubc_flush_pending()
>>      mm->tlb_flush_batched = true
>>                                         mm->tlb_flush_batched = false
>> 
>> After the TLB is flushed on CPU1 via flush_tlb_mm() and before
>> mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0
>> and the TLB flushing is pended.  Then the pended TLB flushing will be
>> lost.  Although both set_tlb_ubc_flush_pending() and
>> flush_tlb_batched_pending() are called with PTL locked, different PTL
>> instances may be used.
>> 
>> Because the race window is really small, and the lost TLB flushing
>> will cause problem only if a TLB entry is inserted before the
>> unmapping in the race window, the race is only theoretical.  But the
>> fix is simple and cheap too.
>> 
>> Syzbot has reported this too as follows,
>> 
>> ==================================================================
>> BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
>> 
>> write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
>> flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
>> madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
>> walk_pmd_range mm/pagewalk.c:128 [inline]
>> walk_pud_range mm/pagewalk.c:205 [inline]
>> walk_p4d_range mm/pagewalk.c:240 [inline]
>> walk_pgd_range mm/pagewalk.c:277 [inline]
>> __walk_page_range+0x981/0x1160 mm/pagewalk.c:379
>> walk_page_range+0x131/0x300 mm/pagewalk.c:475
>> madvise_free_single_vma mm/madvise.c:734 [inline]
>> madvise_dontneed_free mm/madvise.c:822 [inline]
>> madvise_vma mm/madvise.c:996 [inline]
>> do_madvise+0xe4a/0x1140 mm/madvise.c:1202
>> __do_sys_madvise mm/madvise.c:1228 [inline]
>> __se_sys_madvise mm/madvise.c:1226 [inline]
>> __x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
>> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>> do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
>> entry_SYSCALL_64_after_hwframe+0x44/0xae
>> 
>> write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
>> set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
>> try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
>> rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
>> try_to_unmap+0xec/0x110
>> shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
>> shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
>> shrink_list mm/vmscan.c:2621 [inline]
>> shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
>> shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
>> shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
>> kswapd_shrink_node mm/vmscan.c:4022 [inline]
>> balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
>> kswapd+0x200/0x340 mm/vmscan.c:4473
>> kthread+0x2c7/0x2e0 kernel/kthread.c:327
>> ret_from_fork+0x1f/0x30
>> 
>> value changed: 0x01 -> 0x00
>> 
>> Reported by Kernel Concurrency Sanitizer on:
>> CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> ==================================================================
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
>> Cc: Nadav Amit <namit@vmware.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Marco Elver <elver@google.com>
>> ---
>> include/linux/mm_types.h |  2 +-
>> mm/rmap.c                | 15 ++++++++-------
>> 2 files changed, 9 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index c3a6e6209600..789778067db9 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -632,7 +632,7 @@ struct mm_struct {
>> 		atomic_t tlb_flush_pending;
>> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>> 		/* See flush_tlb_batched_pending() */
>> -		bool tlb_flush_batched;
>> +		atomic_t tlb_flush_batched;
>> #endif
>> 		struct uprobes_state uprobes_state;
>> #ifdef CONFIG_PREEMPT_RT
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 163ac4e6bcee..60902c3cfb4a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>> 	 * before the PTE is cleared.
>> 	 */
>> 	barrier();
>> -	mm->tlb_flush_batched = true;
>> +	atomic_inc(&mm->tlb_flush_batched);
>> 
>> 	/*
>> 	 * If the PTE was dirty then it's best to assume it's writable. The
>> @@ -680,15 +680,16 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>  */
>> void flush_tlb_batched_pending(struct mm_struct *mm)
>> {
>> -	if (data_race(mm->tlb_flush_batched)) {
>> -		flush_tlb_mm(mm);
>> +	int batched = atomic_read(&mm->tlb_flush_batched);
>> 
>> +	if (batched) {
>> +		flush_tlb_mm(mm);
>> 		/*
>> -		 * Do not allow the compiler to re-order the clearing of
>> -		 * tlb_flush_batched before the tlb is flushed.
>> +		 * If the new TLB flushing is pended during flushing,
>> +		 * leave mm->tlb_flush_batched as is, to avoid to lose
>> +		 * flushing.
>> 		 */
>> -		barrier();
>> -		mm->tlb_flush_batched = false;
>> +		atomic_cmpxchg(&mm->tlb_flush_batched, batched, 0);
>
> This does not seem to prevent a race completely.
>
>   CPU0		CPU1		CPU2
>
>   set_tlb_ubc_flush_pending()
>   [ tlb_flush_batched = 1 ]
>
> 				flush_tlb_batched_pending()
> 				[ batched = 1 ]
> 				 flush_tlb_mm()
> 				 ...
>   flush_tlb_batched_pending()
>   [ tlb_flush_batched = 0 ]
> 				
>
> 		ptep_get_and_clear()
> 		set_tlb_ubc_flush_pending()
> 		[ tlb_flush_batched = 1 ]
> 					
> 					
> 				 ...
> 				 atomic_cmpxchg()
> 				 [ succeeds ]
>
> At the end of this flow tlb_flush_batched is 0 although
> the TLB flush of CPU1’s newly added PTE was not done.
>
> If you go with your approach you need to have two atomic
> counters, one of the flushed “generation” and one of the
> pending “generation”.

Thanks for review.  You are right, the race cannot be eliminated with
this patch completely.  Maybe we can pack two counter into atomic_t to
implement the generation as above.

> Anyhow, I am just mentioning that I think a more fundamental
> solution is appropriate to take into account other flushes
> that might render flush_tlb_batched_pending() flush
> unnecessary.

Best Regards,
Huang, Ying
Huang, Ying Nov. 24, 2021, 1:43 a.m. UTC | #4
Marco Elver <elver@google.com> writes:

> On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
>>
>> In theory, the following race is possible for batched TLB flushing.
>>
>> CPU0                               CPU1
>> ----                               ----
>> shrink_page_list()
>>                                    unmap
>>                                      zap_pte_range()
>>                                        flush_tlb_batched_pending()
>>                                          flush_tlb_mm()
>>   try_to_unmap()
>>     set_tlb_ubc_flush_pending()
>>       mm->tlb_flush_batched = true
>>                                          mm->tlb_flush_batched = false
>>
>> After the TLB is flushed on CPU1 via flush_tlb_mm() and before
>> mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0
>> and the TLB flushing is pended.  Then the pended TLB flushing will be
>> lost.  Although both set_tlb_ubc_flush_pending() and
>> flush_tlb_batched_pending() are called with PTL locked, different PTL
>> instances may be used.
>>
>> Because the race window is really small, and the lost TLB flushing
>> will cause problem only if a TLB entry is inserted before the
>> unmapping in the race window, the race is only theoretical.  But the
>> fix is simple and cheap too.
>
> Thanks for fixing this!
>
>> Syzbot has reported this too as follows,
>>
>> ==================================================================
>> BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
> [...]
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index c3a6e6209600..789778067db9 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -632,7 +632,7 @@ struct mm_struct {
>>                 atomic_t tlb_flush_pending;
>>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>                 /* See flush_tlb_batched_pending() */
>> -               bool tlb_flush_batched;
>> +               atomic_t tlb_flush_batched;
>>  #endif
>>                 struct uprobes_state uprobes_state;
>>  #ifdef CONFIG_PREEMPT_RT
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 163ac4e6bcee..60902c3cfb4a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>>          * before the PTE is cleared.
>>          */
>>         barrier();
>> -       mm->tlb_flush_batched = true;
>> +       atomic_inc(&mm->tlb_flush_batched);
>
> The use of barrier() and atomic needs some clarification.

There are some comments above barrier() to describe why it is needed.
For atomic, because the type of mm->tlb_flush_batched is atomic_t, do we
need extra clarification?

> Is there a
> requirement that the CPU also doesn't reorder anything after this
> atomic_inc() (which is unordered)? I.e. should this be
> atomic_inc_return_release() and remove barrier()?

We don't have an atomic_xx_acquire() to pair with this.  So I guess we
don't need atomic_inc_return_release()?

Best Regards,
Huang, Ying

>>         /*
>>          * If the PTE was dirty then it's best to assume it's writable. The
>> @@ -680,15 +680,16 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>   */
>>  void flush_tlb_batched_pending(struct mm_struct *mm)
>>  {
>> -       if (data_race(mm->tlb_flush_batched)) {
>> -               flush_tlb_mm(mm);
>> +       int batched = atomic_read(&mm->tlb_flush_batched);
>>
>> +       if (batched) {
>> +               flush_tlb_mm(mm);
>>                 /*
>> -                * Do not allow the compiler to re-order the clearing of
>> -                * tlb_flush_batched before the tlb is flushed.
>> +                * If the new TLB flushing is pended during flushing,
>> +                * leave mm->tlb_flush_batched as is, to avoid to lose
>> +                * flushing.
>>                  */
>> -               barrier();
>> -               mm->tlb_flush_batched = false;
>> +               atomic_cmpxchg(&mm->tlb_flush_batched, batched, 0);
>>         }
>>  }
>>  #else
>> --
>> 2.30.2
>>
Marco Elver Nov. 24, 2021, 8:10 a.m. UTC | #5
On Wed, 24 Nov 2021 at 02:44, Huang, Ying <ying.huang@intel.com> wrote:
>
> Marco Elver <elver@google.com> writes:
>
> > On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
[...]
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> >>          * before the PTE is cleared.
> >>          */
> >>         barrier();
> >> -       mm->tlb_flush_batched = true;
> >> +       atomic_inc(&mm->tlb_flush_batched);
> >
> > The use of barrier() and atomic needs some clarification.
>
> There are some comments above barrier() to describe why it is needed.
> For atomic, because the type of mm->tlb_flush_batched is atomic_t, do we
> need extra clarification?

Apologies, maybe I wasn't clear enough: the existing comment tells me
the clearing of PTE should never happen after tlb_flush_batched is
set, but only the compiler is considered. However, I become suspicious
when I see barrier() paired with an atomic. barrier() is purely a
compiler-barrier and does not prevent the CPU from reordering things.
atomic_inc() does not return anything and is therefore unordered per
Documentation/atomic_t.txt.

> > Is there a
> > requirement that the CPU also doesn't reorder anything after this
> > atomic_inc() (which is unordered)? I.e. should this be
> > atomic_inc_return_release() and remove barrier()?
>
> We don't have an atomic_xx_acquire() to pair with this.  So I guess we
> don't need atomic_inc_return_release()?

You have 2 things stronger than unordered: atomic_read() which result
is used in a conditional branch, thus creating a control-dependency
ordering later dependent writes; and the atomic_cmpxchg() is fully
ordered.

But before all that, I'd still want to understand what ordering
requirements you have. The current comments say only the compiler
needs taming, but does that mean we're fine with the CPU wildly
reordering things?

Thanks,
-- Marco
Huang, Ying Nov. 24, 2021, 8:41 a.m. UTC | #6
Marco Elver <elver@google.com> writes:

> On Wed, 24 Nov 2021 at 02:44, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Marco Elver <elver@google.com> writes:
>>
>> > On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
> [...]
>> >> --- a/mm/rmap.c
>> >> +++ b/mm/rmap.c
>> >> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>> >>          * before the PTE is cleared.
>> >>          */
>> >>         barrier();
>> >> -       mm->tlb_flush_batched = true;
>> >> +       atomic_inc(&mm->tlb_flush_batched);
>> >
>> > The use of barrier() and atomic needs some clarification.
>>
>> There are some comments above barrier() to describe why it is needed.
>> For atomic, because the type of mm->tlb_flush_batched is atomic_t, do we
>> need extra clarification?
>
> Apologies, maybe I wasn't clear enough: the existing comment tells me
> the clearing of PTE should never happen after tlb_flush_batched is
> set, but only the compiler is considered. However, I become suspicious
> when I see barrier() paired with an atomic. barrier() is purely a
> compiler-barrier and does not prevent the CPU from reordering things.
> atomic_inc() does not return anything and is therefore unordered per
> Documentation/atomic_t.txt.
>
>> > Is there a
>> > requirement that the CPU also doesn't reorder anything after this
>> > atomic_inc() (which is unordered)? I.e. should this be
>> > atomic_inc_return_release() and remove barrier()?
>>
>> We don't have an atomic_xx_acquire() to pair with this.  So I guess we
>> don't need atomic_inc_return_release()?
>
> You have 2 things stronger than unordered: atomic_read() which result
> is used in a conditional branch, thus creating a control-dependency
> ordering later dependent writes; and the atomic_cmpxchg() is fully
> ordered.
>
> But before all that, I'd still want to understand what ordering
> requirements you have. The current comments say only the compiler
> needs taming, but does that mean we're fine with the CPU wildly
> reordering things?

Per my understanding, atomic_cmpxchg() is fully ordered, so we have
strong ordering in flush_tlb_batched_pending().  And we use xchg() in
ptep_get_and_clear() (at least for x86) which is called before
set_tlb_ubc_flush_pending().  So we have strong ordering there too.

So at least for x86, barrier() in set_tlb_ubc_flush_pending() appears
unnecessary.  Is it needed by other architectures?

Best Regards,
Huang, Ying
Marco Elver Nov. 24, 2021, 8:49 a.m. UTC | #7
On Wed, 24 Nov 2021 at 09:41, Huang, Ying <ying.huang@intel.com> wrote:
>
> Marco Elver <elver@google.com> writes:
>
> > On Wed, 24 Nov 2021 at 02:44, Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Marco Elver <elver@google.com> writes:
> >>
> >> > On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
> > [...]
> >> >> --- a/mm/rmap.c
> >> >> +++ b/mm/rmap.c
> >> >> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> >> >>          * before the PTE is cleared.
> >> >>          */
> >> >>         barrier();
> >> >> -       mm->tlb_flush_batched = true;
> >> >> +       atomic_inc(&mm->tlb_flush_batched);
> >> >
> >> > The use of barrier() and atomic needs some clarification.
> >>
> >> There are some comments above barrier() to describe why it is needed.
> >> For atomic, because the type of mm->tlb_flush_batched is atomic_t, do we
> >> need extra clarification?
> >
> > Apologies, maybe I wasn't clear enough: the existing comment tells me
> > the clearing of PTE should never happen after tlb_flush_batched is
> > set, but only the compiler is considered. However, I become suspicious
> > when I see barrier() paired with an atomic. barrier() is purely a
> > compiler-barrier and does not prevent the CPU from reordering things.
> > atomic_inc() does not return anything and is therefore unordered per
> > Documentation/atomic_t.txt.
> >
> >> > Is there a
> >> > requirement that the CPU also doesn't reorder anything after this
> >> > atomic_inc() (which is unordered)? I.e. should this be
> >> > atomic_inc_return_release() and remove barrier()?
> >>
> >> We don't have an atomic_xx_acquire() to pair with this.  So I guess we
> >> don't need atomic_inc_return_release()?
> >
> > You have 2 things stronger than unordered: atomic_read() which result
> > is used in a conditional branch, thus creating a control-dependency
> > ordering later dependent writes; and the atomic_cmpxchg() is fully
> > ordered.
> >
> > But before all that, I'd still want to understand what ordering
> > requirements you have. The current comments say only the compiler
> > needs taming, but does that mean we're fine with the CPU wildly
> > reordering things?
>
> Per my understanding, atomic_cmpxchg() is fully ordered, so we have
> strong ordering in flush_tlb_batched_pending().  And we use xchg() in
> ptep_get_and_clear() (at least for x86) which is called before
> set_tlb_ubc_flush_pending().  So we have strong ordering there too.
>
> So at least for x86, barrier() in set_tlb_ubc_flush_pending() appears
> unnecessary.  Is it needed by other architectures?

Hmm, this is not arch/ code -- this code needs to be portable.
atomic_t accessors provide arch-independent guarantees. But do the
other operations here provide any guarantees? If they don't, then I
think we have to assume unordered.
Huang, Ying Nov. 25, 2021, 6:36 a.m. UTC | #8
Marco Elver <elver@google.com> writes:

> On Wed, 24 Nov 2021 at 09:41, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Marco Elver <elver@google.com> writes:
>>
>> > On Wed, 24 Nov 2021 at 02:44, Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Marco Elver <elver@google.com> writes:
>> >>
>> >> > On Tue, 23 Nov 2021 at 08:44, Huang Ying <ying.huang@intel.com> wrote:
>> > [...]
>> >> >> --- a/mm/rmap.c
>> >> >> +++ b/mm/rmap.c
>> >> >> @@ -633,7 +633,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>> >> >>          * before the PTE is cleared.
>> >> >>          */
>> >> >>         barrier();
>> >> >> -       mm->tlb_flush_batched = true;
>> >> >> +       atomic_inc(&mm->tlb_flush_batched);
>> >> >
>> >> > The use of barrier() and atomic needs some clarification.
>> >>
>> >> There are some comments above barrier() to describe why it is needed.
>> >> For atomic, because the type of mm->tlb_flush_batched is atomic_t, do we
>> >> need extra clarification?
>> >
>> > Apologies, maybe I wasn't clear enough: the existing comment tells me
>> > the clearing of PTE should never happen after tlb_flush_batched is
>> > set, but only the compiler is considered. However, I become suspicious
>> > when I see barrier() paired with an atomic. barrier() is purely a
>> > compiler-barrier and does not prevent the CPU from reordering things.
>> > atomic_inc() does not return anything and is therefore unordered per
>> > Documentation/atomic_t.txt.
>> >
>> >> > Is there a
>> >> > requirement that the CPU also doesn't reorder anything after this
>> >> > atomic_inc() (which is unordered)? I.e. should this be
>> >> > atomic_inc_return_release() and remove barrier()?
>> >>
>> >> We don't have an atomic_xx_acquire() to pair with this.  So I guess we
>> >> don't need atomic_inc_return_release()?
>> >
>> > You have 2 things stronger than unordered: atomic_read() which result
>> > is used in a conditional branch, thus creating a control-dependency
>> > ordering later dependent writes; and the atomic_cmpxchg() is fully
>> > ordered.
>> >
>> > But before all that, I'd still want to understand what ordering
>> > requirements you have. The current comments say only the compiler
>> > needs taming, but does that mean we're fine with the CPU wildly
>> > reordering things?
>>
>> Per my understanding, atomic_cmpxchg() is fully ordered, so we have
>> strong ordering in flush_tlb_batched_pending().  And we use xchg() in
>> ptep_get_and_clear() (at least for x86) which is called before
>> set_tlb_ubc_flush_pending().  So we have strong ordering there too.
>>
>> So at least for x86, barrier() in set_tlb_ubc_flush_pending() appears
>> unnecessary.  Is it needed by other architectures?
>
> Hmm, this is not arch/ code -- this code needs to be portable.
> atomic_t accessors provide arch-independent guarantees. But do the
> other operations here provide any guarantees? If they don't, then I
> think we have to assume unordered.

Yes.  The analysis is for x86 only.  For other architectures, we need to
make sure the order of ptep_get_and_clear().

But anyway, that should be another patch.  This patch doesn't make the
original ordering weaker.

Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c3a6e6209600..789778067db9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -632,7 +632,7 @@  struct mm_struct {
 		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 		/* See flush_tlb_batched_pending() */
-		bool tlb_flush_batched;
+		atomic_t tlb_flush_batched;
 #endif
 		struct uprobes_state uprobes_state;
 #ifdef CONFIG_PREEMPT_RT
diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..60902c3cfb4a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -633,7 +633,7 @@  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 	 * before the PTE is cleared.
 	 */
 	barrier();
-	mm->tlb_flush_batched = true;
+	atomic_inc(&mm->tlb_flush_batched);
 
 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
@@ -680,15 +680,16 @@  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  */
 void flush_tlb_batched_pending(struct mm_struct *mm)
 {
-	if (data_race(mm->tlb_flush_batched)) {
-		flush_tlb_mm(mm);
+	int batched = atomic_read(&mm->tlb_flush_batched);
 
+	if (batched) {
+		flush_tlb_mm(mm);
 		/*
-		 * Do not allow the compiler to re-order the clearing of
-		 * tlb_flush_batched before the tlb is flushed.
+		 * If the new TLB flushing is pended during flushing,
+		 * leave mm->tlb_flush_batched as is, to avoid to lose
+		 * flushing.
 		 */
-		barrier();
-		mm->tlb_flush_batched = false;
+		atomic_cmpxchg(&mm->tlb_flush_batched, batched, 0);
 	}
 }
 #else