mm/page_counter: fix various data races

Message ID 20200129105224.4016-1-cai@lca.pw
State New

Commit Message

Qian Cai Jan. 29, 2020, 10:52 a.m. UTC
Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") allows
memcg->memsw->watermark to be accessed concurrently, as reported by
KCSAN:

 Reported by Kernel Concurrency Sanitizer on:
 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
  page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x58/0x140
  __memcg_kmem_charge+0xcc/0x280
  __alloc_pages_nodemask+0x1e1/0x450
  alloc_pages_current+0xa6/0x120
  pte_alloc_one+0x17/0xd0
  __pte_alloc+0x3a/0x1f0
  copy_p4d_range+0xc36/0x1990
  copy_page_range+0x21d/0x360
  dup_mmap+0x5f5/0x7a0
  dup_mm+0xa2/0x240
  copy_process+0x1b3f/0x3460
  _do_fork+0xaa/0xa20
  __x64_sys_clone+0x13b/0x170
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
  page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  mem_cgroup_try_charge+0x159/0x460
  mem_cgroup_try_charge_delay+0x3d/0xa0
  wp_page_copy+0x14d/0x930
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xce6/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

Since the watermark could be compared against or set to a torn value due
to load or store tearing, which would change the code logic, fix it by
adding a pair of READ_ONCE() and WRITE_ONCE() in those places.

Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/page_counter.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

David Hildenbrand Jan. 29, 2020, 10:57 a.m. UTC | #1
On 29.01.20 11:52, Qian Cai wrote:
> [...]
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index de31470655f6..a17841150906 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -82,8 +82,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  		 * This is indeed racy, but we can live with some
>  		 * inaccuracy in the watermark.
>  		 */
> -		if (new > c->watermark)
> -			c->watermark = new;
> +		if (new > READ_ONCE(c->watermark))
> +			WRITE_ONCE(c->watermark, new);
>  	}
>  }
>  
> @@ -135,8 +135,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>  		 * Just like with failcnt, we can live with some
>  		 * inaccuracy in the watermark.
>  		 */
> -		if (new > c->watermark)
> -			c->watermark = new;
> +		if (new > READ_ONCE(c->watermark))
> +			WRITE_ONCE(c->watermark, new);

So, if this is racy, isn't it a problem that "new" could suddenly
be < c->watermark (concurrent writer)? So you would use the "higher"
watermark.
David Hildenbrand Jan. 29, 2020, 10:58 a.m. UTC | #2
On 29.01.20 11:57, David Hildenbrand wrote:
> On 29.01.20 11:52, Qian Cai wrote:
>> [...]
> 
> So, if this is racy, isn't it a problem that "new" could suddenly
> be < c->watermark (concurrent writer)? So you would use the "higher"
> watermark.

s/use/lose/ :)
Michal Hocko Jan. 29, 2020, 12:03 p.m. UTC | #3
On Wed 29-01-20 05:52:24, Qian Cai wrote:
> Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") allows
> memcg->memsw->watermark to be accessed concurrently, as reported by
> KCSAN:
> 
> [...]
> 
> Since the watermark could be compared against or set to a torn value due
> to load or store tearing, which would change the code logic, fix it by
> adding a pair of READ_ONCE() and WRITE_ONCE() in those places.

There is no actual problem and the report is a false positive, right?
While compilers are free to do all sorts of things, do we have any actual
proof that this particular path would ever be problematic?

I do not oppose the patch. {READ,WRITE}_ONCE actually makes some
sense to me, but the changelog should be clearer about how serious this
is.
 
> Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> Signed-off-by: Qian Cai <cai@lca.pw>

Acked-by: Michal Hocko <mhocko@suse.com>

Tetsuo Handa Jan. 29, 2020, 12:13 p.m. UTC | #4
On 2020/01/29 21:03, Michal Hocko wrote:
>> Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
>> Signed-off-by: Qian Cai <cai@lca.pw>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Please include

Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com

for https://syzkaller.appspot.com/bug?id=744097b8b91cecd8b035a6f746bb12e4efc7669f .

By the way, can READ_ONCE()/WRITE_ONCE() really resolve this warning?
The link above reports a read/write race on the same location
(mm/page_counter.c:129). I don't see how READ_ONCE()/WRITE_ONCE() can
resolve that race.

Marco Elver Jan. 29, 2020, 12:21 p.m. UTC | #5
On Wed, 29 Jan 2020 at 13:13, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/01/29 21:03, Michal Hocko wrote:
> >> Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
> >> Signed-off-by: Qian Cai <cai@lca.pw>
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
> Please include
>
> Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
>
> for https://syzkaller.appspot.com/bug?id=744097b8b91cecd8b035a6f746bb12e4efc7669f .
>
> By the way, can READ_ONCE()/WRITE_ONCE() really solve this warning?
> The link above says read/write on the same location ( mm/page_counter.c:129 ).
> I don't know how READ_ONCE()/WRITE_ONCE() can solve the race.

It avoids the *data* race, with *_ONCE telling the compiler not to
optimize the accesses in concurrency-unfriendly ways.  Since *_ONCE is
used, it conveys clear intent that the code here is meant to be
concurrent, and KCSAN stops complaining (and assumes that the *logic*
is correct).

The race itself is still there, but as per comment in the file,
apparently fine and not a logic bug.

Qian Cai Jan. 29, 2020, 12:25 p.m. UTC | #6
> On Jan 29, 2020, at 7:13 AM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> 
> By the way, can READ_ONCE()/WRITE_ONCE() really solve this warning?
> The link above says read/write on the same location ( mm/page_counter.c:129 ).
> I don't know how READ_ONCE()/WRITE_ONCE() can solve the race.

That looks like a different one where it complains about c->failcnt++. I’ll send a separate patch for that.
Tetsuo Handa Jan. 29, 2020, 1:09 p.m. UTC | #7
On 2020/01/29 21:25, Qian Cai wrote:
> 
> 
>> On Jan 29, 2020, at 7:13 AM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> By the way, can READ_ONCE()/WRITE_ONCE() really solve this warning?
>> The link above says read/write on the same location ( mm/page_counter.c:129 ).
>> I don't know how READ_ONCE()/WRITE_ONCE() can solve the race.
> 
> That looks like a different one where it complains about c->failcnt++. I’ll send a separate patch for that.
> 

Ah, then this patch is meant for the mm/page_counter.c:138 versus mm/page_counter.c:139 race, which was
closed as invalid at https://syzkaller.appspot.com/bug?id=871391ec080746185a2dd437c54d75fcd1ef0ef8 .
Tetsuo Handa Jan. 29, 2020, 1:47 p.m. UTC | #8
On 2020/01/29 21:21, Marco Elver wrote:
>> By the way, can READ_ONCE()/WRITE_ONCE() really solve this warning?
>> The link above says read/write on the same location ( mm/page_counter.c:129 ).
>> I don't know how READ_ONCE()/WRITE_ONCE() can solve the race.
> 
> It avoids the *data* race, with *_ONCE telling the compiler to not
> optimize the accesses in concurrency-unfriendly ways.  Since *_ONCE is
> used, it conveys clear intent that the code here is meant to be
> concurrent, and KCSAN stops complaining (and assumes that the *logic*
> is correct).

I see. Unlike c->failcnt++, which involves a read-modify-write, *_ONCE() can be used for the
simple-read (c->watermark) and simple-write (c->watermark = new) cases.
Qian Cai Feb. 11, 2020, 12:14 p.m. UTC | #9
> On Jan 29, 2020, at 7:03 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Andrew, could you pick this up as well if you have not done so?

Patch

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..a17841150906 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -82,8 +82,8 @@  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 }
 
@@ -135,8 +135,8 @@  bool page_counter_try_charge(struct page_counter *counter,
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 	return true;