diff mbox series

[v3] mm: list_lru: set shrinker map bit when child nr_items is not zero

Message ID 20201201212553.52164-1-shy828301@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] mm: list_lru: set shrinker map bit when child nr_items is not zero | expand

Commit Message

Yang Shi Dec. 1, 2020, 9:25 p.m. UTC
When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache.  The vmcore shows there are over 14M negative dentry objects on lru,
but tracing result shows they were even not scanned at all.  The further
investigation shows the memcg's vfs shrinker_map bit is not set.  So the
reclaimer or dropping cache just skip calling vfs shrinker.  So we have
to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in test environment, and the
problem can't be reproduced after rebooting.  But it seems there is race
between shrinker map bit clear and reparenting by code inspection.  The
hypothesis is elaborated as below.

The memcg hierarchy on our production environment looks like:
                root
               /    \
          system   user

The main workloads are running under user slice's children, and it creates
and removes memcg frequently.  So reparenting happens very often under user
slice, but no task is under user slice directly.

So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:

       CPU A                            CPU B
reparent
    dst->nr_items == 0
                                 shrinker:
                                     total_objects == 0
    add src->nr_items to dst
    set_bit
                                     retrun SHRINK_EMPTY
                                     clear_bit
child memcg offline
    replace child's kmemcg_id to
    parent's (in memcg_offline_kmem())
                                  list_lru_del() between shrinker runs
                                     see parent's kmemcg_id
                                     dec dst->nr_items
reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()

                                 The second run of shrinker:
                                     read nr_items without any
                                     synchronization, so it may
                                     see intermediate negative
                                     nr_items then total_objects
                                     may return 0 conincidently

                                     keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add scr->nr_item to dst

After this point dst->nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore.  And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either.  That bit is kept cleared forever.

How does list_lru_del() race with reparenting?  It is because
reparenting replaces childen's kmemcg_id to parent's without protecting
from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
actually deleting items from child's lru, but dec'ing parent's nr_items,
so the parent's nr_items may go negative as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.

Since it is impossible that dst->nr_items goes negative and
src->nr_items goes zero at the same time, so it seems we could set the
shrinker map bit iff src->nr_items != 0.  We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but it seems
checking src->nr_items in reparenting is the simplest and avoids lock
contention.

Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org> v4.19+
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
v3: * Revised commit log per Roman's suggestion
    * Added Roman's reviewed-by tag
v2: * Incorporated Roman's suggestion
    * Incorporated Kirill's suggestion
    * Changed the subject of patch to get align with the new fix
    * Added fixes tag

 mm/list_lru.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Kirill Tkhai Dec. 2, 2020, 8:15 a.m. UTC | #1
On 02.12.2020 00:25, Yang Shi wrote:
> When investigating a slab cache bloat problem, significant amount of
> negative dentry cache was seen, but confusingly they neither got shrunk
> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> but tracing result shows they were even not scanned at all.  The further
> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> to reboot the hosts to get the memory back.
> 
> I didn't manage to come up with a reproducer in test environment, and the
> problem can't be reproduced after rebooting.  But it seems there is race
> between shrinker map bit clear and reparenting by code inspection.  The
> hypothesis is elaborated as below.
> 
> The memcg hierarchy on our production environment looks like:
>                 root
>                /    \
>           system   user
> 
> The main workloads are running under user slice's children, and it creates
> and removes memcg frequently.  So reparenting happens very often under user
> slice, but no task is under user slice directly.
> 
> So with the frequent reparenting and tight memory pressure, the below
> hypothetical race condition may happen:
> 
>        CPU A                            CPU B
> reparent
>     dst->nr_items == 0
>                                  shrinker:
>                                      total_objects == 0
>     add src->nr_items to dst
>     set_bit
>                                      retrun SHRINK_EMPTY
>                                      clear_bit
> child memcg offline
>     replace child's kmemcg_id to
>     parent's (in memcg_offline_kmem())
>                                   list_lru_del() between shrinker runs
>                                      see parent's kmemcg_id
>                                      dec dst->nr_items
> reparent again
>     dst->nr_items may go negative
>     due to concurrent list_lru_del()
> 
>                                  The second run of shrinker:
>                                      read nr_items without any
>                                      synchronization, so it may
>                                      see intermediate negative
>                                      nr_items then total_objects
>                                      may return 0 conincidently
> 
>                                      keep the bit cleared
>     dst->nr_items != 0
>     skip set_bit
>     add scr->nr_item to dst
> 
> After this point dst->nr_item may never go zero, so reparenting will not
> set shrinker_map bit anymore.  And since there is no task under user
> slice directly, so no new object will be added to its lru to set the
> shrinker map bit either.  That bit is kept cleared forever.
> 
> How does list_lru_del() race with reparenting?  It is because
> reparenting replaces childen's kmemcg_id to parent's without protecting
> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> actually deleting items from child's lru, but dec'ing parent's nr_items,
> so the parent's nr_items may go negative as commit
> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> free kmemcg_id on css offline") says.
> 
> Since it is impossible that dst->nr_items goes negative and
> src->nr_items goes zero at the same time, so it seems we could set the
> shrinker map bit iff src->nr_items != 0.  We could synchronize
> list_lru_count_one() and reparenting with nlru->lock, but it seems
> checking src->nr_items in reparenting is the simplest and avoids lock
> contention.
> 
> Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
> Suggested-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: <stable@vger.kernel.org> v4.19+
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
> v3: * Revised commit log per Roman's suggestion
>     * Added Roman's reviewed-by tag
> v2: * Incorporated Roman's suggestion
>     * Incorporated Kirill's suggestion
>     * Changed the subject of patch to get align with the new fix
>     * Added fixes tag
> 
>  mm/list_lru.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5aa6e44bc2ae..fe230081690b 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>  	struct list_lru_node *nlru = &lru->node[nid];
>  	int dst_idx = dst_memcg->kmemcg_id;
>  	struct list_lru_one *src, *dst;
> -	bool set;
>  
>  	/*
>  	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> @@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>  	dst = list_lru_from_memcg_idx(nlru, dst_idx);
>  
>  	list_splice_init(&src->list, &dst->list);
> -	set = (!dst->nr_items && src->nr_items);
> -	dst->nr_items += src->nr_items;
> -	if (set)
> +
> +	if (src->nr_items) {
> +		dst->nr_items += src->nr_items;
>  		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> -	src->nr_items = 0;
> +		src->nr_items = 0;
> +	}
>  
>  	spin_unlock_irq(&nlru->lock);
>  }
>
Shakeel Butt Dec. 2, 2020, 3 p.m. UTC | #2
On Tue, Dec 1, 2020 at 1:25 PM Yang Shi <shy828301@gmail.com> wrote:
>
> When investigating a slab cache bloat problem, significant amount of
> negative dentry cache was seen, but confusingly they neither got shrunk
> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> but tracing result shows they were even not scanned at all.  The further
> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> to reboot the hosts to get the memory back.
>
> I didn't manage to come up with a reproducer in test environment, and the
> problem can't be reproduced after rebooting.  But it seems there is race
> between shrinker map bit clear and reparenting by code inspection.  The
> hypothesis is elaborated as below.
>
> The memcg hierarchy on our production environment looks like:
>                 root
>                /    \
>           system   user
>
> The main workloads are running under user slice's children, and it creates
> and removes memcg frequently.  So reparenting happens very often under user
> slice, but no task is under user slice directly.
>
> So with the frequent reparenting and tight memory pressure, the below
> hypothetical race condition may happen:
>
>        CPU A                            CPU B
> reparent
>     dst->nr_items == 0
>                                  shrinker:
>                                      total_objects == 0
>     add src->nr_items to dst
>     set_bit
>                                      retrun SHRINK_EMPTY

return

>                                      clear_bit
> child memcg offline
>     replace child's kmemcg_id to

with

>     parent's (in memcg_offline_kmem())
>                                   list_lru_del() between shrinker runs
>                                      see parent's kmemcg_id
>                                      dec dst->nr_items
> reparent again
>     dst->nr_items may go negative
>     due to concurrent list_lru_del()
>
>                                  The second run of shrinker:
>                                      read nr_items without any
>                                      synchronization, so it may
>                                      see intermediate negative
>                                      nr_items then total_objects
>                                      may return 0 conincidently

coincidently

>
>                                      keep the bit cleared
>     dst->nr_items != 0
>     skip set_bit
>     add scr->nr_item to dst
>
> After this point dst->nr_item may never go zero, so reparenting will not
> set shrinker_map bit anymore.  And since there is no task under user
> slice directly, so no new object will be added to its lru to set the
> shrinker map bit either.  That bit is kept cleared forever.
>
> How does list_lru_del() race with reparenting?  It is because
> reparenting replaces childen's kmemcg_id to parent's without protecting

children's

> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> actually deleting items from child's lru, but dec'ing parent's nr_items,
> so the parent's nr_items may go negative as commit
> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> free kmemcg_id on css offline") says.
>
> Since it is impossible that dst->nr_items goes negative and
> src->nr_items goes zero at the same time, so it seems we could set the
> shrinker map bit iff src->nr_items != 0.  We could synchronize
> list_lru_count_one() and reparenting with nlru->lock, but it seems
> checking src->nr_items in reparenting is the simplest and avoids lock
> contention.
>
> Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
> Suggested-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: <stable@vger.kernel.org> v4.19+
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Yang Shi Dec. 2, 2020, 4:52 p.m. UTC | #3
On Wed, Dec 2, 2020 at 7:01 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Dec 1, 2020 at 1:25 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > When investigating a slab cache bloat problem, significant amount of
> > negative dentry cache was seen, but confusingly they neither got shrunk
> > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > but tracing result shows they were even not scanned at all.  The further
> > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > to reboot the hosts to get the memory back.
> >
> > I didn't manage to come up with a reproducer in test environment, and the
> > problem can't be reproduced after rebooting.  But it seems there is race
> > between shrinker map bit clear and reparenting by code inspection.  The
> > hypothesis is elaborated as below.
> >
> > The memcg hierarchy on our production environment looks like:
> >                 root
> >                /    \
> >           system   user
> >
> > The main workloads are running under user slice's children, and it creates
> > and removes memcg frequently.  So reparenting happens very often under user
> > slice, but no task is under user slice directly.
> >
> > So with the frequent reparenting and tight memory pressure, the below
> > hypothetical race condition may happen:
> >
> >        CPU A                            CPU B
> > reparent
> >     dst->nr_items == 0
> >                                  shrinker:
> >                                      total_objects == 0
> >     add src->nr_items to dst
> >     set_bit
> >                                      retrun SHRINK_EMPTY
>
> return
>
> >                                      clear_bit
> > child memcg offline
> >     replace child's kmemcg_id to
>
> with
>
> >     parent's (in memcg_offline_kmem())
> >                                   list_lru_del() between shrinker runs
> >                                      see parent's kmemcg_id
> >                                      dec dst->nr_items
> > reparent again
> >     dst->nr_items may go negative
> >     due to concurrent list_lru_del()
> >
> >                                  The second run of shrinker:
> >                                      read nr_items without any
> >                                      synchronization, so it may
> >                                      see intermediate negative
> >                                      nr_items then total_objects
> >                                      may return 0 conincidently
>
> coincidently
>
> >
> >                                      keep the bit cleared
> >     dst->nr_items != 0
> >     skip set_bit
> >     add scr->nr_item to dst
> >
> > After this point dst->nr_item may never go zero, so reparenting will not
> > set shrinker_map bit anymore.  And since there is no task under user
> > slice directly, so no new object will be added to its lru to set the
> > shrinker map bit either.  That bit is kept cleared forever.
> >
> > How does list_lru_del() race with reparenting?  It is because
> > reparenting replaces childen's kmemcg_id to parent's without protecting
>
> children's
>
> > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > so the parent's nr_items may go negative as commit
> > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > free kmemcg_id on css offline") says.
> >
> > Since it is impossible that dst->nr_items goes negative and
> > src->nr_items goes zero at the same time, so it seems we could set the
> > shrinker map bit iff src->nr_items != 0.  We could synchronize
> > list_lru_count_one() and reparenting with nlru->lock, but it seems
> > checking src->nr_items in reparenting is the simplest and avoids lock
> > contention.
> >
> > Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
> > Suggested-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Roman Gushchin <guro@fb.com>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: <stable@vger.kernel.org> v4.19+
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks for finding those spelling errors. Will fix in v4.
diff mbox series

Patch

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..fe230081690b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -534,7 +534,6 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }