mm/util: fix a data race in __vm_enough_memory()

Message ID 20200130025133.5232-1-cai@lca.pw (mailing list archive)
State New, archived
Series mm/util: fix a data race in __vm_enough_memory()

Commit Message

Qian Cai Jan. 30, 2020, 2:51 a.m. UTC
"vm_committed_as.count" could be accessed concurrently as reported by
KCSAN,

 read to 0xffffffff923164f8 of 8 bytes by task 1268 on cpu 38:
  __vm_enough_memory+0x43/0x280 mm/util.c:801
  mmap_region+0x1b2/0xb90 mm/mmap.c:1726
  do_mmap+0x45c/0x700
  vm_mmap_pgoff+0xc0/0x130
  vm_mmap+0x71/0x90
  elf_map+0xa1/0x1b0
  load_elf_binary+0x9de/0x2180
  search_binary_handler+0xd8/0x2b0
  __do_execve_file+0xb61/0x1080
  __x64_sys_execve+0x5f/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffffffff923164f8 of 8 bytes by task 1265 on cpu 41:
  percpu_counter_add_batch+0x83/0xd0 lib/percpu_counter.c:91
  exit_mmap+0x178/0x220 include/linux/mman.h:68
  mmput+0x10e/0x270
  flush_old_exec+0x572/0xfe0
  load_elf_binary+0x467/0x2180
  search_binary_handler+0xd8/0x2b0
  __do_execve_file+0xb61/0x1080
  __x64_sys_execve+0x5f/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is operating as lockless, fix it by using
READ_ONLY() for it to avoid any possible false warning due to load
tearing.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
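
For readers unfamiliar with batched per-CPU counters, the following is a minimal single-threaded userspace model of the idea behind percpu_counter_add_batch() and percpu_counter_read(). It is a sketch under stated assumptions, not the kernel implementation: the names model_counter, model_add_batch, model_read and model_sum and the NR_CPUS value are hypothetical, and the locking and per-CPU accessors the kernel uses are omitted.

```c
#include <stdint.h>

#define NR_CPUS 4		/* assumed CPU count for this model */

/* Simplified, single-threaded model of a batched per-CPU counter. */
struct model_counter {
	int64_t count;			/* shared total, what a cheap read sees */
	int32_t local[NR_CPUS];		/* per-CPU deltas not yet folded in */
};

/*
 * Add 'amount' on behalf of 'cpu'. The shared total is only written once
 * the local delta reaches the batch size in either direction.
 */
static void model_add_batch(struct model_counter *c, int cpu,
			    int64_t amount, int32_t batch)
{
	int64_t delta = c->local[cpu] + amount;

	if (delta >= batch || delta <= -batch) {
		c->count += delta;	/* the kernel holds a spinlock here */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = (int32_t)delta;
	}
}

/* Cheap read: the shared total only, ignoring unfolded deltas. */
static int64_t model_read(const struct model_counter *c)
{
	return c->count;
}

/* Exact value: shared total plus every per-CPU delta. */
static int64_t model_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

In this model each local delta stays strictly within (-batch, batch), so the cheap read can lag the exact sum by less than batch per CPU. That is why the check in __vm_enough_memory() compares against -vm_committed_as_batch * num_online_cpus() rather than zero.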

Comments

Matthew Wilcox Jan. 30, 2020, 4:20 a.m. UTC | #1
On Wed, Jan 29, 2020 at 09:51:33PM -0500, Qian Cai wrote:
> "vm_committed_as.count" could be accessed concurrently as reported by
> KCSAN,
> 
>  read to 0xffffffff923164f8 of 8 bytes by task 1268 on cpu 38:
>   __vm_enough_memory+0x43/0x280 mm/util.c:801
>   mmap_region+0x1b2/0xb90 mm/mmap.c:1726
>   do_mmap+0x45c/0x700
>   vm_mmap_pgoff+0xc0/0x130
>   vm_mmap+0x71/0x90
>   elf_map+0xa1/0x1b0
>   load_elf_binary+0x9de/0x2180
>   search_binary_handler+0xd8/0x2b0
>   __do_execve_file+0xb61/0x1080
>   __x64_sys_execve+0x5f/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  write to 0xffffffff923164f8 of 8 bytes by task 1265 on cpu 41:
>   percpu_counter_add_batch+0x83/0xd0 lib/percpu_counter.c:91
>   exit_mmap+0x178/0x220 include/linux/mman.h:68
>   mmput+0x10e/0x270
>   flush_old_exec+0x572/0xfe0
>   load_elf_binary+0x467/0x2180
>   search_binary_handler+0xd8/0x2b0
>   __do_execve_file+0xb61/0x1080
>   __x64_sys_execve+0x5f/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Since only the read is operating as lockless, fix it by using
> READ_ONLY() for it to avoid any possible false warning due to load

You mean READ_ONCE ...

>  {
>  	long allowed;
>  
> -	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> +	VM_WARN_ONCE(READ_ONCE(vm_committed_as.count) <
>  			-(s64)vm_committed_as_batch * num_online_cpus(),

I'm really not a fan of exposing the internals of a percpu_counter outside
the percpu_counter.h file.  Why shouldn't this be fixed by putting the
READ_ONCE() inside percpu_counter_read()?
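
As an illustration of the alternative raised here, marking the load inside the accessor might look roughly like the sketch below. This is a userspace model, not the kernel source: the READ_ONCE() macro is a simplified stand-in built on a volatile access, and struct percpu_counter is reduced to its shared total.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's READ_ONCE(): the volatile access
 * makes the compiler emit exactly one load of the whole object.
 */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Simplified model of struct percpu_counter: only the shared total. */
struct percpu_counter {
	int64_t count;
};

/*
 * Hypothetical variant of percpu_counter_read() with the marked load
 * inside the accessor, so callers never touch ->count directly.
 */
static inline int64_t percpu_counter_read(struct percpu_counter *fbc)
{
	return READ_ONCE(fbc->count);
}
```

With this shape the call site in __vm_enough_memory() would stay unchanged, at the cost of marking the load for every caller, including those that already hold a lock.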
Qian Cai Jan. 30, 2020, 11:50 a.m. UTC | #2
> On Jan 29, 2020, at 11:20 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> I'm really not a fan of exposing the internals of a percpu_counter outside
> the percpu_counter.h file.  Why shouldn't this be fixed by putting the
> READ_ONCE() inside percpu_counter_read()?

That is because not every caller suffers from a data race. For example, in __wb_update_bandwidth() the read is protected by a lock. I was a bit worried that blindly adding READ_ONCE() inside percpu_counter_read() might have unexpected side effects; for example, READ_ONCE() is unnecessary for a volatile variable. So I thought it best to keep the change minimal, at the cost of exposing a bit of the internal details, as you mentioned.

However, I have also copied the percpu maintainers to see whether they have any preference.
Marco Elver Jan. 30, 2020, 12:35 p.m. UTC | #3
On Thu, 30 Jan 2020 at 12:50, Qian Cai <cai@lca.pw> wrote:
>
> > On Jan 29, 2020, at 11:20 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > I'm really not a fan of exposing the internals of a percpu_counter outside
> > the percpu_counter.h file.  Why shouldn't this be fixed by putting the
> > READ_ONCE() inside percpu_counter_read()?
>
> That is because not every caller suffers from a data race. For example, in __wb_update_bandwidth() the read is protected by a lock. I was a bit worried that blindly adding READ_ONCE() inside percpu_counter_read() might have unexpected side effects; for example, READ_ONCE() is unnecessary for a volatile variable. So I thought it best to keep the change minimal, at the cost of exposing a bit of the internal details, as you mentioned.
>
> However, I have also copied the percpu maintainers to see whether they have any preference.

I would not add READ_ONCE to percpu_counter_read(), given the writes
(increments) are not atomic either, so not much is gained.

Notice that this is inside a WARN_ONCE, so you may argue that a data
race here doesn't matter to the correct behaviour of the system
(except if you have panic_on_warn on).

For the warning to trigger, vm_committed_as must decrease. Assume that
a data race (assuming bad compiler optimizations) can somehow
accomplish this, then the load or write must cause a transient value
to somehow be less than a stable value. My hypothesis is this is very
unlikely.

Given the fact this is a WARN_ONCE, and the fact that a transient
decrease in the value is unlikely, you may consider
'VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as)) <
...)'. That way you won't modify percpu_counter_read and still catch
unintended races elsewhere.

[ Note that the 'data_race()' macro is still only in -next, -tip, and -rcu. ]

Thanks,
-- Marco
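
A rough userspace sketch of the check with the data_race() annotation suggested above follows. The assumptions are significant: data_race() is stubbed as a plain evaluation (in the kernel it tells KCSAN that the race is intentional), struct percpu_counter is reduced to its shared total, and commit_underflow() is a hypothetical helper standing in for the condition passed to VM_WARN_ONCE().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in only: in the kernel, data_race() marks the enclosed access as
 * an intentional race for KCSAN. Here it just evaluates the expression.
 */
#define data_race(expr) (expr)

/* Simplified model of struct percpu_counter: only the shared total. */
struct percpu_counter {
	int64_t count;
};

/* Unmarked read, left untouched as in the suggestion above. */
static int64_t percpu_counter_read(const struct percpu_counter *fbc)
{
	return fbc->count;
}

/*
 * Hypothetical helper: true when the approximate total has dropped below
 * what unfolded per-CPU deltas alone could explain.
 */
static bool commit_underflow(const struct percpu_counter *committed,
			     int32_t batch, unsigned int online_cpus)
{
	return data_race(percpu_counter_read(committed)) <
	       -(int64_t)batch * (int64_t)online_cpus;
}
```

Only this one call site is annotated, so an unintended race on percpu_counter_read() elsewhere would still be reported.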
Andrew Morton Jan. 31, 2020, 2:18 a.m. UTC | #4
On Thu, 30 Jan 2020 13:35:18 +0100 Marco Elver <elver@google.com> wrote:

> On Thu, 30 Jan 2020 at 12:50, Qian Cai <cai@lca.pw> wrote:
> >
> > > On Jan 29, 2020, at 11:20 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > I'm really not a fan of exposing the internals of a percpu_counter outside
> > > the percpu_counter.h file.  Why shouldn't this be fixed by putting the
> > > READ_ONCE() inside percpu_counter_read()?
> >
> > That is because not every caller suffers from a data race. For example, in __wb_update_bandwidth() the read is protected by a lock. I was a bit worried that blindly adding READ_ONCE() inside percpu_counter_read() might have unexpected side effects; for example, READ_ONCE() is unnecessary for a volatile variable. So I thought it best to keep the change minimal, at the cost of exposing a bit of the internal details, as you mentioned.
> >
> > However, I have also copied the percpu maintainers to see whether they have any preference.
> 
> I would not add READ_ONCE to percpu_counter_read(), given the writes
> (increments) are not atomic either, so not much is gained.
> 
> Notice that this is inside a WARN_ONCE, so you may argue that a data
> race here doesn't matter to the correct behaviour of the system
> (except if you have panic_on_warn on).
> 
> For the warning to trigger, vm_committed_as must decrease. Assume that
> a data race (assuming bad compiler optimizations) can somehow
> accomplish this, then the load or write must cause a transient value
> to somehow be less than a stable value. My hypothesis is this is very
> unlikely.
> 
> Given the fact this is a WARN_ONCE, and the fact that a transient
> decrease in the value is unlikely, you may consider
> 'VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as)) <
> ...)'. That way you won't modify percpu_counter_read and still catch
> unintended races elsewhere.
> 

That, or add an alternative version of percpu_counter_read() to the
percpu API.  A very carefully commented version!
Qian Cai Jan. 31, 2020, 2:22 a.m. UTC | #5
> On Jan 30, 2020, at 9:18 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Thu, 30 Jan 2020 13:35:18 +0100 Marco Elver <elver@google.com> wrote:
> 
>> On Thu, 30 Jan 2020 at 12:50, Qian Cai <cai@lca.pw> wrote:
>>> 
>>>> On Jan 29, 2020, at 11:20 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> 
>>>> I'm really not a fan of exposing the internals of a percpu_counter outside
>>>> the percpu_counter.h file.  Why shouldn't this be fixed by putting the
>>>> READ_ONCE() inside percpu_counter_read()?
>>> 
>>> That is because not every caller suffers from a data race. For example, in __wb_update_bandwidth() the read is protected by a lock. I was a bit worried that blindly adding READ_ONCE() inside percpu_counter_read() might have unexpected side effects; for example, READ_ONCE() is unnecessary for a volatile variable. So I thought it best to keep the change minimal, at the cost of exposing a bit of the internal details, as you mentioned.
>>> 
>>> However, I have also copied the percpu maintainers to see whether they have any preference.
>> 
>> I would not add READ_ONCE to percpu_counter_read(), given the writes
>> (increments) are not atomic either, so not much is gained.
>> 
>> Notice that this is inside a WARN_ONCE, so you may argue that a data
>> race here doesn't matter to the correct behaviour of the system
>> (except if you have panic_on_warn on).
>> 
>> For the warning to trigger, vm_committed_as must decrease. Assume that
>> a data race (assuming bad compiler optimizations) can somehow
>> accomplish this, then the load or write must cause a transient value
>> to somehow be less than a stable value. My hypothesis is this is very
>> unlikely.
>> 
>> Given the fact this is a WARN_ONCE, and the fact that a transient
>> decrease in the value is unlikely, you may consider
>> 'VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as)) <
>> ...)'. That way you won't modify percpu_counter_read and still catch
>> unintended races elsewhere.
>> 
> 
> That, or add an alternative version of percpu_counter_read() to the
> percpu API.  A very carefully commented version!

I sent a patch that uses data_race(), which should be sufficient:

https://lore.kernel.org/linux-mm/20200130145649.1240-1-cai@lca.pw/

Patch

diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..58cd8f28651c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -798,7 +798,7 @@  int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long allowed;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+	VM_WARN_ONCE(READ_ONCE(vm_committed_as.count) <
 			-(s64)vm_committed_as_batch * num_online_cpus(),
 			"memory commitment underflow");