[-next] mm/page_counter: mark intentional data races

Message ID: 20200129042019.3632-1-cai@lca.pw
State: New, archived
Series: [-next] mm/page_counter: mark intentional data races

Commit Message

Qian Cai Jan. 29, 2020, 4:20 a.m. UTC
Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
memcg->memsw->failcnt and ->watermark can be accessed concurrently,
as reported by KCSAN:

 Reported by Kernel Concurrency Sanitizer on:
 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
  page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
  try_charge+0x131/0xd50
  __memcg_kmem_charge_memcg+0x58/0x140
  __memcg_kmem_charge+0xcc/0x280
  __alloc_pages_nodemask+0x1e1/0x450
  alloc_pages_current+0xa6/0x120
  pte_alloc_one+0x17/0xd0
  __pte_alloc+0x3a/0x1f0
  copy_p4d_range+0xc36/0x1990
  copy_page_range+0x21d/0x360
  dup_mmap+0x5f5/0x7a0
  dup_mm+0xa2/0x240
  copy_process+0x1b3f/0x3460
  _do_fork+0xaa/0xa20
  __x64_sys_clone+0x13b/0x170
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
  page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
  try_charge+0x131/0xd50
  mem_cgroup_try_charge+0x159/0x460
  mem_cgroup_try_charge_delay+0x3d/0xa0
  wp_page_copy+0x14d/0x930
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xce6/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

Since the failcnt and watermark are tolerant of some inaccuracy, a data
race will not be harmful, thus mark them as intentional data races with
the data_race() macro.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/page_counter.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Michal Hocko Jan. 29, 2020, 8:51 a.m. UTC | #1
On Tue 28-01-20 23:20:19, Qian Cai wrote:
> The commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> had memcg->memsw->failcnt and ->watermark could be accessed concurrently
> as reported by KCSAN,
> 
>  Reported by Kernel Concurrency Sanitizer on:
>  BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
> 
>  read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
>   page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
>   try_charge+0x131/0xd50
>   __memcg_kmem_charge_memcg+0x58/0x140
>   __memcg_kmem_charge+0xcc/0x280
>   __alloc_pages_nodemask+0x1e1/0x450
>   alloc_pages_current+0xa6/0x120
>   pte_alloc_one+0x17/0xd0
>   __pte_alloc+0x3a/0x1f0
>   copy_p4d_range+0xc36/0x1990
>   copy_page_range+0x21d/0x360
>   dup_mmap+0x5f5/0x7a0
>   dup_mm+0xa2/0x240
>   copy_process+0x1b3f/0x3460
>   _do_fork+0xaa/0xa20
>   __x64_sys_clone+0x13b/0x170
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
>   page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
>   try_charge+0x131/0xd50
>   mem_cgroup_try_charge+0x159/0x460
>   mem_cgroup_try_charge_delay+0x3d/0xa0
>   wp_page_copy+0x14d/0x930
>   do_wp_page+0x107/0x7b0
>   __handle_mm_fault+0xce6/0xd40
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
> 
> Since the failcnt and watermark are tolerant of some inaccuracy, a data
> race will not be harmful, thus mark them as intentional data races with
> the data_race() macro.

I am not familiar with KCSAN and git grep for data_race on the current
linux-next doesn't really show any users of this macro. Is there a
general consensus that data_race is going to be used to silence all
KCSAN false positives?

> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  mm/page_counter.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index de31470655f6..13934636eafd 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -82,8 +82,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  		 * This is indeed racy, but we can live with some
>  		 * inaccuracy in the watermark.
>  		 */
> -		if (new > c->watermark)
> -			c->watermark = new;
> +		if (data_race(new > c->watermark))
> +			data_race(c->watermark = new);
>  	}
>  }
>  
> @@ -126,7 +126,7 @@ bool page_counter_try_charge(struct page_counter *counter,
>  			 * This is racy, but we can live with some
>  			 * inaccuracy in the failcnt.
>  			 */
> -			c->failcnt++;
> +			data_race(c->failcnt++);
>  			*fail = c;
>  			goto failed;
>  		}
> @@ -135,8 +135,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>  		 * Just like with failcnt, we can live with some
>  		 * inaccuracy in the watermark.
>  		 */
> -		if (new > c->watermark)
> -			c->watermark = new;
> +		if (data_race(new > c->watermark))
> +			data_race(c->watermark = new);
>  	}
>  	return true;
>  
> -- 
> 2.21.0 (Apple Git-122.2)

Marco Elver Jan. 29, 2020, 9:06 a.m. UTC | #2
On Wed, 29 Jan 2020 at 09:51, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-01-20 23:20:19, Qian Cai wrote:
> > The commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> > had memcg->memsw->failcnt and ->watermark could be accessed concurrently
> > as reported by KCSAN,
> >
> >  Reported by Kernel Concurrency Sanitizer on:
> >  BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
> >
> >  read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
> >   page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
> >   try_charge+0x131/0xd50

Why are the line numbers for the remaining symbols missing?  Doesn't
scripts/decode_stacktrace.sh give you all line numbers?

[ As an aside: if you want to use what syzbot uses to put line numbers
on symbols, which is a bit faster:
https://github.com/google/syzkaller/tree/master/tools/syz-symbolize
https://github.com/google/syzkaller/blob/master/docs/linux/setup.md
then 'go build tools/syz-symbolize'. ]

> >   __memcg_kmem_charge_memcg+0x58/0x140
> >   __memcg_kmem_charge+0xcc/0x280
> >   __alloc_pages_nodemask+0x1e1/0x450
> >   alloc_pages_current+0xa6/0x120
> >   pte_alloc_one+0x17/0xd0
> >   __pte_alloc+0x3a/0x1f0
> >   copy_p4d_range+0xc36/0x1990
> >   copy_page_range+0x21d/0x360
> >   dup_mmap+0x5f5/0x7a0
> >   dup_mm+0xa2/0x240
> >   copy_process+0x1b3f/0x3460
> >   _do_fork+0xaa/0xa20
> >   __x64_sys_clone+0x13b/0x170
> >   do_syscall_64+0x91/0xb47
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> >  write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
> >   page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
> >   try_charge+0x131/0xd50
> >   mem_cgroup_try_charge+0x159/0x460
> >   mem_cgroup_try_charge_delay+0x3d/0xa0
> >   wp_page_copy+0x14d/0x930
> >   do_wp_page+0x107/0x7b0
> >   __handle_mm_fault+0xce6/0xd40
> >   handle_mm_fault+0xfc/0x2f0
> >   do_page_fault+0x263/0x6f9
> >   page_fault+0x34/0x40
> >
> > Since the failcnt and watermark are tolerant of some inaccuracy, a data
> > race will not be harmful, thus mark them as intentional data races with
> > the data_race() macro.
>
> I am not familiar with KCSAN and git grep for data_race on the current
> linux-next doesn't really show any users of this macro. Is there a
> general consensus that data_race is going to be used to silence all
> KCSAN false positives?

It was discussed here:
https://lore.kernel.org/linux-fsdevel/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com/

If they're intentional data races that should remain, data_race() is
one option. There are 4 options (other than addressing the data race)
to deal with 'false positives':
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/dev-tools/kcsan.rst#n101

That being said, every use of data_race() needs to be justified, and
not just applied without understanding the issue. See below.

> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  mm/page_counter.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index de31470655f6..13934636eafd 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -82,8 +82,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
> >                * This is indeed racy, but we can live with some
> >                * inaccuracy in the watermark.
> >                */
> > -             if (new > c->watermark)
> > -                     c->watermark = new;
> > +             if (data_race(new > c->watermark))
> > +                     data_race(c->watermark = new);

These should be using 'READ_ONCE' and 'WRITE_ONCE' for c->watermark.
Store or load tearing would change the logic here, since the
comparison might see garbage.

> >       }
> >  }
> >
> > @@ -126,7 +126,7 @@ bool page_counter_try_charge(struct page_counter *counter,
> >                        * This is racy, but we can live with some
> >                        * inaccuracy in the failcnt.
> >                        */
> > -                     c->failcnt++;
> > +                     data_race(c->failcnt++);

This is probably fine.

> >                       *fail = c;
> >                       goto failed;
> >               }
> > @@ -135,8 +135,8 @@ bool page_counter_try_charge(struct page_counter *counter,
> >                * Just like with failcnt, we can live with some
> >                * inaccuracy in the watermark.
> >                */
> > -             if (new > c->watermark)
> > -                     c->watermark = new;
> > +             if (data_race(new > c->watermark))
> > +                     data_race(c->watermark = new);

This should be READ_ONCE / WRITE_ONCE.

> >       }
> >       return true;
> >
> > --
> > 2.21.0 (Apple Git-122.2)
>
> --
> Michal Hocko
> SUSE Labs

Qian Cai Jan. 29, 2020, 9:33 a.m. UTC | #3
> On Jan 29, 2020, at 4:06 AM, Marco Elver <elver@google.com> wrote:
> 
> Why are the line numbers for the remaining symbols missing?  Doesn't
> scripts/decode_stacktrace.sh give you all line numbers?

I used scripts/faddr2line and have never used decode_stacktrace.sh before. I did not insert the other line numbers because they can easily be determined by looking at the functions themselves in a code browser.

> 
> [ As an aside: if you want to use what syzbot uses to put line numbers
> on symbols, which is a bit faster:
> https://github.com/google/syzkaller/tree/master/tools/syz-symbolize
> https://github.com/google/syzkaller/blob/master/docs/linux/setup.md
> then 'go build tools/syz-symbolize'. ]

That is good to know.

Qian Cai Jan. 29, 2020, 9:51 a.m. UTC | #4
> On Jan 29, 2020, at 4:06 AM, Marco Elver <elver@google.com> wrote:
> 
> These should be using 'READ_ONCE' and 'WRITE_ONCE' for c->watermark.
> Store or load tearing would change the logic here, since the
> comparison might see garbage.

I originally thought it probably did not matter, because the lockless access there is racy anyway: another thread could change the value at any time.

Now I agree that setting it to garbage because of a data race could be quite unpleasant there.
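
For illustration only, the READ_ONCE / WRITE_ONCE suggestion for c->watermark discussed above might look roughly like the userspace sketch below. This is not the posted patch or any later version of it. The macros are simplified volatile-cast stand-ins for the kernel ones, and struct counter_sketch, update_watermark() and watermark_demo() are hypothetical names made up for this example.

```c
#include <assert.h>

/*
 * Simplified userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE().
 * Assumption: they are only applied to naturally aligned, word-sized
 * scalars, where compilers in practice emit a single load or store for a
 * volatile access. __typeof__ is a GCC/Clang extension.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical cut-down counter; the real struct page_counter has more fields. */
struct counter_sketch {
	unsigned long watermark;
};

/*
 * High-water-mark update with marked accesses. The check-then-store is
 * still not atomic, so a concurrent updater can overwrite a higher value
 * with a lower one; that is the inaccuracy the code comments accept. The
 * marked accesses only aim to keep each individual load and store whole,
 * so the comparison never sees a torn value.
 */
static void update_watermark(struct counter_sketch *c, unsigned long new)
{
	if (new > READ_ONCE(c->watermark))
		WRITE_ONCE(c->watermark, new);
}

/* Single-threaded walk-through of the update logic. */
static unsigned long watermark_demo(void)
{
	struct counter_sketch c = { .watermark = 0 };

	update_watermark(&c, 10);
	update_watermark(&c, 5);	/* lower value: watermark stays at 10 */
	assert(READ_ONCE(c.watermark) == 10);
	update_watermark(&c, 42);
	return READ_ONCE(c.watermark);
}
```

Calling watermark_demo() from a test program exercises only the comparison logic; it does not demonstrate tearing, which depends on the compiler and architecture.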

Patch

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..13934636eafd 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -82,8 +82,8 @@  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (data_race(new > c->watermark))
+			data_race(c->watermark = new);
 	}
 }
 
@@ -126,7 +126,7 @@  bool page_counter_try_charge(struct page_counter *counter,
 			 * This is racy, but we can live with some
 			 * inaccuracy in the failcnt.
 			 */
-			c->failcnt++;
+			data_race(c->failcnt++);
 			*fail = c;
 			goto failed;
 		}
@@ -135,8 +135,8 @@  bool page_counter_try_charge(struct page_counter *counter,
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (data_race(new > c->watermark))
+			data_race(c->watermark = new);
 	}
 	return true;