mm: memcontrol: fix data race in mem_cgroup_select_victim_node

Message ID 20191029005405.201986-1-shakeelb@google.com
State New, archived
Series mm: memcontrol: fix data race in mem_cgroup_select_victim_node

Commit Message

Shakeel Butt Oct. 29, 2019, 12:54 a.m. UTC
Syzbot reported the following bug:

BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node

write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
 mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
 mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

mem_cgroup_select_victim_node() can be called concurrently; it reads
and modifies memcg->last_scanned_node without any synchronization. So,
read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
to stop potential reordering.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Michal Hocko Oct. 29, 2019, 9:03 a.m. UTC | #1
On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> Syzbot reported the following bug:
> 
> BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> 
> write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
>  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> 
> read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
>  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> 
> mem_cgroup_select_victim_node() can be called concurrently; it reads
> and modifies memcg->last_scanned_node without any synchronization. So,
> read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> to stop potential reordering.

I am sorry but I do not understand the problem and the fix. Why does the
race happen and why does _ONCE fix it? There is still no
synchronization. Do you want to prevent memcg->last_scanned_node from
being reloaded?

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Cc: Greg Thelen <gthelen@google.com>
> Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4c555055a72..5a06739dd3e4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	int node;
>  
>  	mem_cgroup_may_update_nodemask(memcg);
> -	node = memcg->last_scanned_node;
> +	node = READ_ONCE(memcg->last_scanned_node);
>  
>  	node = next_node_in(node, memcg->scan_nodes);
>  	/*
> @@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	if (unlikely(node == MAX_NUMNODES))
>  		node = numa_node_id();
>  
> -	memcg->last_scanned_node = node;
> +	WRITE_ONCE(memcg->last_scanned_node, node);
>  	return node;
>  }
>  #else
> -- 
> 2.24.0.rc0.303.g954a862665-goog
Shakeel Butt Oct. 29, 2019, 6:09 p.m. UTC | #2
+Marco

On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > Syzbot reported the following bug:
> >
> > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> >
> > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> >
> > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> >
> > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > and modifies memcg->last_scanned_node without any synchronization. So,
> > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > to stop potential reordering.
>
> I am sorry but I do not understand the problem and the fix. Why does the
> race happen and why does _ONCE fix it? There is still no
> synchronization. Do you want to prevent memcg->last_scanned_node from
> being reloaded?
>

The problem is that memcg->last_scanned_node can be read and modified
concurrently. Though to me it seems like a tolerable race, not worth
adding an explicit lock. My aim was to make KCSAN happy here so it
looks elsewhere for concurrency bugs. However, I see that it might
complain next about memcg->scan_nodes.

Now taking a step back, I am questioning the whole motivation behind
mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
zonelist to the reclaimer, shrink_node() will be called for all
potential nodes. Also, we don't short-circuit the traversal on
nr_reclaimed, and we scan (size_on_node >> priority) pages on every
node, so I don't see the reason for a round-robin order of node
traversal.
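
Roughly, the traversal amounts to this (a simplified sketch, not the
literal mm/vmscan.c code; the function name is made up):

/*
 * With a ZONELIST_FALLBACK zonelist every eligible node is visited
 * and scanned, so the node the walk starts from does not change
 * which nodes get reclaimed or by how much.
 */
static void shrink_zones_sketch(struct zonelist *zonelist,
				struct scan_control *sc)
{
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					sc->reclaim_idx, sc->nodemask)
		shrink_node(zone->zone_pgdat, sc);	/* no early bail-out */
}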

I am thinking of removing the whole mem_cgroup_select_victim_node()
heuristic. Please let me know if there are any objections.

thanks,
Shakeel
Marco Elver Oct. 29, 2019, 6:28 p.m. UTC | #3
On Tue, 29 Oct 2019, Shakeel Butt wrote:

> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.

Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
optimizations, including store tearing, load tearing, etc. This does not
add memory barriers to constrain memory ordering.  (If this code needs
some memory ordering guarantees w.r.t. previous loads/stores then this
alone is not enough.)
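
To make that concrete, a purely hypothetical sketch (the names are made
up; READ_ONCE()/WRITE_ONCE() come from <linux/compiler.h>):

#include <linux/compiler.h>

static int last_scanned;	/* read and written concurrently */

/*
 * A plain access: the compiler is free to tear it into narrower
 * stores/loads, reload it several times, or use the location as
 * scratch space, so a concurrent reader may observe bogus values.
 */
static void plain_update(int node)
{
	last_scanned = node;
}

/*
 * Marked accesses: a single, untorn load/store is guaranteed, but no
 * memory barrier is emitted, so ordering against surrounding accesses
 * is still unconstrained.
 */
static void marked_update(int node)
{
	WRITE_ONCE(last_scanned, node);
}

static int marked_read(void)
{
	return READ_ONCE(last_scanned);
}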

> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fix it? There is still no
> > synchronization. Do you want to prevent memcg->last_scanned_node from
> > being reloaded?
> >
> 
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. Though to me it seems like a tolerable race, not worth
> adding an explicit lock. My aim was to make KCSAN happy here so it
> looks elsewhere for concurrency bugs. However, I see that it might
> complain next about memcg->scan_nodes.

The plain concurrent reads/writes are a data race, which may manifest as
various undefined behaviour due to compiler optimizations. The _ONCE
will prevent these (KCSAN only reports data races). Note that "data
race" does not necessarily imply "race condition"; some data races are
race conditions (usually the more interesting bugs) -- but not *all*
data races are race conditions. If there is no race condition here that
warrants heavier synchronization (locking etc.), then this patch is all
that should be needed.
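
A hypothetical contrast between the two (neither example is from this
patch):

/*
 * Data race without a race condition: the value is only a heuristic
 * hint and any recent value is acceptable, so marking the accesses
 * is all that is needed.
 */
static int scan_hint;

static int read_scan_hint(void)
{
	return READ_ONCE(scan_hint);
}

/*
 * Data race *and* race condition: a check-then-act sequence. _ONCE
 * cannot close the window between the check and the update; that
 * needs a lock or an atomic primitive such as atomic_inc_not_zero().
 */
static int refs;

static void broken_get(void)
{
	if (READ_ONCE(refs) > 0)	/* refs may drop to 0 right here */
		WRITE_ONCE(refs, refs + 1);
}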

I can't comment on the rest.

Thanks,
-- Marco
Johannes Weiner Oct. 29, 2019, 6:34 p.m. UTC | #4
On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fix it? There is still no
> > synchronization. Do you want to prevent memcg->last_scanned_node from
> > being reloaded?
> >
> 
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. Though to me it seems like a tolerable race, not worth
> adding an explicit lock. My aim was to make KCSAN happy here so it
> looks elsewhere for concurrency bugs. However, I see that it might
> complain next about memcg->scan_nodes.
> 
> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
> zonelist to the reclaimer, shrink_node() will be called for all
> potential nodes. Also, we don't short-circuit the traversal on
> nr_reclaimed, and we scan (size_on_node >> priority) pages on every
> node, so I don't see the reason for a round-robin order of node
> traversal.

It's actually only very recently that we don't bail out of the reclaim
loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
("mm: vmscan: do not share cgroup iteration between reclaimers") that
removed the last bailout condition on sc->nr_reclaimed.

> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

In the current state, I don't see any reason to keep it, either. We
can always just start the zonelist walk from the current node.

A nice cleanup, actually. Good catch!
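
For concreteness, a minimal sketch of that direction (the helper name
is made up; this is not the actual follow-up patch):

/*
 * Once the round-robin heuristic is gone, the reclaim entry point can
 * take the ZONELIST_FALLBACK zonelist of the local node directly, so
 * the zonelist walk simply starts at the current node.
 */
static struct zonelist *reclaim_zonelist_sketch(gfp_t gfp_mask)
{
	return node_zonelist(numa_node_id(), gfp_mask);
}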
Shakeel Butt Oct. 29, 2019, 6:46 p.m. UTC | #5
On Tue, Oct 29, 2019 at 11:28 AM Marco Elver <elver@google.com> wrote:
>
>
>
> On Tue, 29 Oct 2019, Shakeel Butt wrote:
>
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
>
> Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
> optimizations, including store tearing, load tearing, etc. This does not
> add memory barriers to constrain memory ordering.  (If this code needs
> some memory ordering guarantees w.r.t. previous loads/stores then this
> alone is not enough.)
>
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fix it? There is still no
> > > synchronization. Do you want to prevent memcg->last_scanned_node from
> > > being reloaded?
> > >
> >
> > The problem is that memcg->last_scanned_node can be read and modified
> > concurrently. Though to me it seems like a tolerable race, not worth
> > adding an explicit lock. My aim was to make KCSAN happy here so it
> > looks elsewhere for concurrency bugs. However, I see that it might
> > complain next about memcg->scan_nodes.
>
> The plain concurrent reads/writes are a data race, which may manifest in
> various undefined behaviour due to compiler optimizations. The _ONCE
> will prevent these (KCSAN only reports data races).  Note that, "data
> race" does not necessarily imply "race condition"; some data races are
> race conditions (usually the more interesting bugs) -- but not *all*
> data races are race conditions. If there is no race condition here that
> warrants heavier synchronization (locking etc.), then this patch is all
> that should be needed.
>
> I can't comment on the rest.
>

Thanks Marco for the explanation.
Shakeel Butt Oct. 29, 2019, 6:47 p.m. UTC | #6
On Tue, Oct 29, 2019 at 11:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> > +Marco
> >
> > On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > > Syzbot reported the following bug:
> > > >
> > > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > > >
> > > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > > >
> > > > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > > to stop potential reordering.
> > >
> > > I am sorry but I do not understand the problem and the fix. Why does the
> > > race happen and why does _ONCE fix it? There is still no
> > > synchronization. Do you want to prevent memcg->last_scanned_node from
> > > being reloaded?
> > >
> >
> > The problem is that memcg->last_scanned_node can be read and modified
> > concurrently. Though to me it seems like a tolerable race, not worth
> > adding an explicit lock. My aim was to make KCSAN happy here so it
> > looks elsewhere for concurrency bugs. However, I see that it might
> > complain next about memcg->scan_nodes.
> >
> > Now taking a step back, I am questioning the whole motivation behind
> > mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
> > zonelist to the reclaimer, shrink_node() will be called for all
> > potential nodes. Also, we don't short-circuit the traversal on
> > nr_reclaimed, and we scan (size_on_node >> priority) pages on every
> > node, so I don't see the reason for a round-robin order of node
> > traversal.
>
> It's actually only very recently that we don't bail out of the reclaim
> loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
> ("mm: vmscan: do not share cgroup iteration between reclaimers") that
> removed the last bailout condition on sc->nr_reclaimed.
>
> > I am thinking of removing the whole mem_cgroup_select_victim_node()
> > heuristic. Please let me know if there are any objections.
>
> In the current state, I don't see any reason to keep it, either. We
> can always just start the zonelist walk from the current node.
>
> A nice cleanup, actually. Good catch!

Thanks, I will follow up with the removal of this heuristic.
Michal Hocko Oct. 29, 2019, 6:47 p.m. UTC | #7
On Tue 29-10-19 11:09:29, Shakeel Butt wrote:
> +Marco
> 
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > >
> > > write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
> > >  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
> > >  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
> > >  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
> > >  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
> > >  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
> > >  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
> > >  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
> > >  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
> > >  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
> > >
> > > mem_cgroup_select_victim_node() can be called concurrently; it reads
> > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fix it? There is still no
> > synchronization. Do you want to prevent memcg->last_scanned_node from
> > being reloaded?
> >
> 
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. Though to me it seems like a tolerable race, not worth
> adding an explicit lock.

Agreed

> My aim was to make KCSAN happy here so it
> looks elsewhere for concurrency bugs. However, I see that it might
> complain next about memcg->scan_nodes.

I would really refrain from adding whatever measure to silence some
tool without a deeper understanding of why it is needed. $FOO_ONCE
will prevent the compiler from doing funky stuff. But this is an int and
I would be really surprised if $FOO_ONCE made any practical difference.

> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
> zonelist to the reclaimer, shrink_node() will be called for all
> potential nodes. Also, we don't short-circuit the traversal on
> nr_reclaimed, and we scan (size_on_node >> priority) pages on every
> node, so I don't see the reason for a round-robin order of node
> traversal.
> 
> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

I would have to think more about this, but it surely sounds preferable
to adding $FOO_ONCE just to silence the tool.

Thanks!

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4c555055a72..5a06739dd3e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	int node;
 
 	mem_cgroup_may_update_nodemask(memcg);
-	node = memcg->last_scanned_node;
+	node = READ_ONCE(memcg->last_scanned_node);
 
 	node = next_node_in(node, memcg->scan_nodes);
 	/*
@@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();
 
-	memcg->last_scanned_node = node;
+	WRITE_ONCE(memcg->last_scanned_node, node);
 	return node;
 }
 #else