[v2] mm/memcontrol: fix a data race in scan count

Message ID 1581096119-13593-1-git-send-email-cai@lca.pw (mailing list archive)
State New, archived
Series: [v2] mm/memcontrol: fix a data race in scan count

Commit Message

Qian Cai Feb. 7, 2020, 5:21 p.m. UTC
struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size

 write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
  mem_cgroup_update_lru_size+0x11c/0x1d0
  mem_cgroup_update_lru_size at mm/memcontrol.c:1266
  isolate_lru_pages+0x6a9/0xf30
  shrink_active_list+0x123/0xcc0
  shrink_lruvec+0x8fd/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
  lruvec_lru_size+0xbb/0x270
  mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
  (inlined by) lruvec_lru_size at mm/vmscan.c:326
  shrink_lruvec+0x1d0/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_current+0xa6/0x120
  alloc_slab_page+0x3b1/0x540
  allocate_slab+0x70/0x660
  new_slab+0x46/0x70
  ___slab_alloc+0x4ad/0x7d0
  __slab_alloc+0x43/0x70
  kmem_cache_alloc+0x2c3/0x420
  getname_flags+0x4c/0x230
  getname+0x22/0x30
  do_sys_openat2+0x205/0x3b0
  do_sys_open+0x9a/0xf0
  __x64_sys_openat+0x62/0x80
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The write is under lru_lock, but the read is lockless. The scan count
is used to determine how aggressively the anon and file LRU lists
should be scanned. Load tearing could make the heuristic inefficient,
so fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the
writes.

Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: also add WRITE_ONCE() in the writer, which is necessary.

 include/linux/memcontrol.h | 2 +-
 mm/memcontrol.c            | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
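
For background, here is a minimal C sketch of the tearing concern (an
illustration only, not part of the patch; the function names and the
standalone lru_size variable are made up, standing in for
mz->lru_zone_size[zone_idx][lru]): the writer updates the count under
lru_lock while the reader runs locklessly, and the marked accesses keep
the compiler from splitting, refetching, or caching either side.

	/* Hypothetical illustration of the racing pair. */
	static long lru_size;

	/* Writer: caller holds lru_lock, cf. mem_cgroup_update_lru_size(). */
	static void update_size(long nr_pages)
	{
		WRITE_ONCE(lru_size, lru_size + nr_pages);
	}

	/* Reader: lockless, cf. lruvec_lru_size(). */
	static long read_size(void)
	{
		return READ_ONCE(lru_size);
	}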

Comments

Marco Elver Feb. 7, 2020, 6:19 p.m. UTC | #1
On Fri, 7 Feb 2020 at 18:22, Qian Cai <cai@lca.pw> wrote:
>
> struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
> accessed concurrently as noticed by KCSAN,
>
>  BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
>
>  write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
>   mem_cgroup_update_lru_size+0x11c/0x1d0
>   mem_cgroup_update_lru_size at mm/memcontrol.c:1266
>   isolate_lru_pages+0x6a9/0xf30
>   shrink_active_list+0x123/0xcc0
>   shrink_lruvec+0x8fd/0x1380
>   shrink_node+0x317/0xd80
>   do_try_to_free_pages+0x1f7/0xa10
>   try_to_free_pages+0x26c/0x5e0
>   __alloc_pages_slowpath+0x458/0x1290
>   __alloc_pages_nodemask+0x3bb/0x450
>   alloc_pages_vma+0x8a/0x2c0
>   do_anonymous_page+0x170/0x700
>   __handle_mm_fault+0xc9f/0xd00
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
>
>  read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
>   lruvec_lru_size+0xbb/0x270
>   mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
>   (inlined by) lruvec_lru_size at mm/vmscan.c:326
>   shrink_lruvec+0x1d0/0x1380
>   shrink_node+0x317/0xd80
>   do_try_to_free_pages+0x1f7/0xa10
>   try_to_free_pages+0x26c/0x5e0
>   __alloc_pages_slowpath+0x458/0x1290
>   __alloc_pages_nodemask+0x3bb/0x450
>   alloc_pages_current+0xa6/0x120
>   alloc_slab_page+0x3b1/0x540
>   allocate_slab+0x70/0x660
>   new_slab+0x46/0x70
>   ___slab_alloc+0x4ad/0x7d0
>   __slab_alloc+0x43/0x70
>   kmem_cache_alloc+0x2c3/0x420
>   getname_flags+0x4c/0x230
>   getname+0x22/0x30
>   do_sys_openat2+0x205/0x3b0
>   do_sys_open+0x9a/0xf0
>   __x64_sys_openat+0x62/0x80
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
>  Reported by Kernel Concurrency Sanitizer on:
>  CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
>  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
>
> The write is under lru_lock, but the read is lockless. The scan count
> is used to determine how aggressively the anon and file LRU lists
> should be scanned. Load tearing could make the heuristic inefficient,
> so fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the
> writes.
>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>
> v2: also add WRITE_ONCE() in the writer, which is necessary.

Again, note that KCSAN will *not* complain if you omitted the
WRITE_ONCE and only had the READ_ONCE, as long as the write is
aligned and at most word-sized. Because we still don't have a nice
way to deal with read-modify-writes like 'var +=' and '++', I don't
know if we want to do the WRITE_ONCE right now.

I think the kernel might need a primitive that avoids the readability
issues of writing 'WRITE_ONCE(var, var + val)'. I don't have strong
opinions on this, so it's up to maintainers.
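
A hypothetical sketch of such a primitive (illustration only; neither
the helper nor its name exists in the tree) might be:

	/* Hypothetical: hide the 'WRITE_ONCE(var, var + val)' idiom
	 * behind a marked add.  The read half of the RMW stays a
	 * plain access, same as in the patch below.
	 */
	#define ADD_ONCE(var, val)	WRITE_ONCE(var, (var) + (val))

which would let the call sites read 'ADD_ONCE(*lru_size, nr_pages);'.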

Thanks,
-- Marco

>  include/linux/memcontrol.h | 2 +-
>  mm/memcontrol.c            | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a7a0a1a5c8d5..e8734dabbc61 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -533,7 +533,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
>         struct mem_cgroup_per_node *mz;
>
>         mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -       return mz->lru_zone_size[zone_idx][lru];
> +       return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
>  }
>
>  void mem_cgroup_handle_over_high(void);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6f6dc8712e39..daf375cc312c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1263,7 +1263,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>         lru_size = &mz->lru_zone_size[zid][lru];
>
>         if (nr_pages < 0)
> -               *lru_size += nr_pages;
> +               WRITE_ONCE(*lru_size, *lru_size + nr_pages);
>
>         size = *lru_size;
>         if (WARN_ONCE(size < 0,
> @@ -1274,7 +1274,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>         }
>
>         if (nr_pages > 0)
> -               *lru_size += nr_pages;
> +               WRITE_ONCE(*lru_size, *lru_size + nr_pages);
>  }
>
>  /**
> --
> 1.8.3.1
>
Qian Cai Feb. 7, 2020, 7:30 p.m. UTC | #2
On Fri, 2020-02-07 at 19:19 +0100, Marco Elver wrote:
> On Fri, 7 Feb 2020 at 18:22, Qian Cai <cai@lca.pw> wrote:
> > 
> > struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
> > accessed concurrently as noticed by KCSAN,
> > 
> >  BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
> > 
> >  write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
> >   mem_cgroup_update_lru_size+0x11c/0x1d0
> >   mem_cgroup_update_lru_size at mm/memcontrol.c:1266
> >   isolate_lru_pages+0x6a9/0xf30
> >   shrink_active_list+0x123/0xcc0
> >   shrink_lruvec+0x8fd/0x1380
> >   shrink_node+0x317/0xd80
> >   do_try_to_free_pages+0x1f7/0xa10
> >   try_to_free_pages+0x26c/0x5e0
> >   __alloc_pages_slowpath+0x458/0x1290
> >   __alloc_pages_nodemask+0x3bb/0x450
> >   alloc_pages_vma+0x8a/0x2c0
> >   do_anonymous_page+0x170/0x700
> >   __handle_mm_fault+0xc9f/0xd00
> >   handle_mm_fault+0xfc/0x2f0
> >   do_page_fault+0x263/0x6f9
> >   page_fault+0x34/0x40
> > 
> >  read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
> >   lruvec_lru_size+0xbb/0x270
> >   mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
> >   (inlined by) lruvec_lru_size at mm/vmscan.c:326
> >   shrink_lruvec+0x1d0/0x1380
> >   shrink_node+0x317/0xd80
> >   do_try_to_free_pages+0x1f7/0xa10
> >   try_to_free_pages+0x26c/0x5e0
> >   __alloc_pages_slowpath+0x458/0x1290
> >   __alloc_pages_nodemask+0x3bb/0x450
> >   alloc_pages_current+0xa6/0x120
> >   alloc_slab_page+0x3b1/0x540
> >   allocate_slab+0x70/0x660
> >   new_slab+0x46/0x70
> >   ___slab_alloc+0x4ad/0x7d0
> >   __slab_alloc+0x43/0x70
> >   kmem_cache_alloc+0x2c3/0x420
> >   getname_flags+0x4c/0x230
> >   getname+0x22/0x30
> >   do_sys_openat2+0x205/0x3b0
> >   do_sys_open+0x9a/0xf0
> >   __x64_sys_openat+0x62/0x80
> >   do_syscall_64+0x91/0xb47
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> >  Reported by Kernel Concurrency Sanitizer on:
> >  CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
> >  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > 
> > The write is under lru_lock, but the read is lockless. The scan count
> > is used to determine how aggressively the anon and file LRU lists
> > should be scanned. Load tearing could make the heuristic inefficient,
> > so fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the
> > writes.
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> > 
> > v2: also add WRITE_ONCE() in the writer, which is necessary.
> 
> Again, note that KCSAN will *not* complain if you omitted the
> WRITE_ONCE and only had the READ_ONCE, as long as the write is
> aligned and at most word-sized. Because we still don't have a nice
> way to deal with read-modify-writes like 'var +=' and '++', I don't
> know if we want to do the WRITE_ONCE right now.
> 
> I think the kernel might need a primitive that avoids the readability
> issues of writing 'WRITE_ONCE(var, var + val)'. I don't have strong
> opinions on this, so it's up to maintainers.

Those are good points. Andrew, feel free to pick v1 instead, which
seems like a reasonable trade-off.
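
For reference, going by the v2 note above, v1 presumably carried only
the read-side hunk, i.e.

	-	return mz->lru_zone_size[zone_idx][lru];
	+	return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);

with the plain '*lru_size += nr_pages' writes left as they were.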

> 
> Thanks,
> -- Marco
> 
> >  include/linux/memcontrol.h | 2 +-
> >  mm/memcontrol.c            | 4 ++--
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index a7a0a1a5c8d5..e8734dabbc61 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -533,7 +533,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
> >         struct mem_cgroup_per_node *mz;
> > 
> >         mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> > -       return mz->lru_zone_size[zone_idx][lru];
> > +       return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
> >  }
> > 
> >  void mem_cgroup_handle_over_high(void);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6f6dc8712e39..daf375cc312c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1263,7 +1263,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >         lru_size = &mz->lru_zone_size[zid][lru];
> > 
> >         if (nr_pages < 0)
> > -               *lru_size += nr_pages;
> > +               WRITE_ONCE(*lru_size, *lru_size + nr_pages);
> > 
> >         size = *lru_size;
> >         if (WARN_ONCE(size < 0,
> > @@ -1274,7 +1274,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >         }
> > 
> >         if (nr_pages > 0)
> > -               *lru_size += nr_pages;
> > +               WRITE_ONCE(*lru_size, *lru_size + nr_pages);
> >  }
> > 
> >  /**
> > --
> > 1.8.3.1
> >

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..e8734dabbc61 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -533,7 +533,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	struct mem_cgroup_per_node *mz;
 
 	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	return mz->lru_zone_size[zone_idx][lru];
+	return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
 }
 
 void mem_cgroup_handle_over_high(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f6dc8712e39..daf375cc312c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1263,7 +1263,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	lru_size = &mz->lru_zone_size[zid][lru];
 
 	if (nr_pages < 0)
-		*lru_size += nr_pages;
+		WRITE_ONCE(*lru_size, *lru_size + nr_pages);
 
 	size = *lru_size;
 	if (WARN_ONCE(size < 0,
@@ -1274,7 +1274,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	}
 
 	if (nr_pages > 0)
-		*lru_size += nr_pages;
+		WRITE_ONCE(*lru_size, *lru_size + nr_pages);
 }
 
 /**