mm/vmscan: fix infinite loop in drop_slab_node

Message ID	20200908142456.89626-1-zangchunxin@bytedance.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=bTCv=CR=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5DE0222228 From: zangchunxin@bytedance.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chunxin Zang <zangchunxin@bytedance.com>, Muchun Song <songmuchun@bytedance.com> Subject: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node Date: Tue, 8 Sep 2020 22:24:56 +0800 Message-Id: <20200908142456.89626-1-zangchunxin@bytedance.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/vmscan: fix infinite loop in drop_slab_node \| expand mm/vmscan: fix infinite loop in drop_slab_node

Chunxin Zang Sept. 8, 2020, 2:24 p.m. UTC

From: Chunxin Zang <zangchunxin@bytedance.com>

On our server, there are about 10k memcg in one machine. They use memory
very frequently. When I tigger drop caches，the process will infinite loop
in drop_slab_node.
There are two reasons:
1.We have too many memcgs, even though one object freed in one memcg, the
  sum of object is bigger than 10.
2.We spend a lot of time in traverse memcg once. So, the memcg who
  traversed at the first have been freed many objects. Traverse memcg next
  time, the freed count bigger than 10 again.

We can get the following info through 'ps':

  root:~# ps -aux | grep drop
  root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
  root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
  root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
  root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
  root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
  root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
  root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
  root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches

Use bpftrace follow 'freed' value in drop_slab_node:

  root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
  Attaching 1 probe...
  ^B^C

  @ret:
  [64, 128)        1 |                                                    |
  [128, 256)      28 |                                                    |
  [256, 512)     107 |@                                                   |
  [512, 1K)      298 |@@@                                                 |
  [1K, 2K)       613 |@@@@@@@                                             |
  [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [4K, 8K)       442 |@@@@@                                               |
  [8K, 16K)      299 |@@@                                                 |
  [16K, 32K)     100 |@                                                   |
  [32K, 64K)     139 |@                                                   |
  [64K, 128K)     56 |                                                    |
  [128K, 256K)    26 |                                                    |
  [256K, 512K)     2 |                                                    |

In one drop caches action, only traverse memcg once maybe is better.
If user need more memory, they can do drop caches again.

Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/vmscan.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

Chris Down Sept. 8, 2020, 3:09 p.m. UTC | #1

drop_caches by its very nature can be extremely performance intensive -- if 
someone wants to abort after trying too long, they can just send a 
TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't 
reliably work when doing that, then _that's_ something to improve, but this 
looks premature to me until that's demonstrated not to work.

zangchunxin@bytedance.com writes:
>In one drop caches action, only traverse memcg once maybe is better.
>If user need more memory, they can do drop caches again.

Can you please provide some measurements of the difference in reclamation in 
practice?

Vlastimil Babka Sept. 8, 2020, 3:42 p.m. UTC | #2

On 9/8/20 5:09 PM, Chris Down wrote:
> drop_caches by its very nature can be extremely performance intensive -- if 
> someone wants to abort after trying too long, they can just send a 
> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't 
> reliably work when doing that, then _that's_ something to improve, but this 
> looks premature to me until that's demonstrated not to work.

Hm there might be existings scripts (even though I dislike those) running
drop_caches periodically, and they are currently not set up to be killed, so one
day it might surprise someone. Dropping should be a one-time event, not a
continual reclaim.

Maybe we could be a bit smarter and e.g. double the threshold currently
hardcoded as "10" with each iteration?

> zangchunxin@bytedance.com writes:
>>In one drop caches action, only traverse memcg once maybe is better.
>>If user need more memory, they can do drop caches again.
> 
> Can you please provide some measurements of the difference in reclamation in 
> practice?
>

Chris Down Sept. 8, 2020, 5:55 p.m. UTC | #3

Vlastimil Babka writes:
>On 9/8/20 5:09 PM, Chris Down wrote:
>> drop_caches by its very nature can be extremely performance intensive -- if
>> someone wants to abort after trying too long, they can just send a
>> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
>> reliably work when doing that, then _that's_ something to improve, but this
>> looks premature to me until that's demonstrated not to work.
>
>Hm there might be existings scripts (even though I dislike those) running
>drop_caches periodically, and they are currently not set up to be killed, so one
>day it might surprise someone. Dropping should be a one-time event, not a
>continual reclaim.

Sure, but these scripts can send it to their child themselves instead of using 
WCE, no? As far as I know, that's the already supported way to abort 
drop_caches-induced reclaim.

Muchun Song Sept. 9, 2020, 4:19 a.m. UTC | #4

Hi Chris,

On Tue, Sep 8, 2020 at 11:09 PM Chris Down <chris@chrisdown.name> wrote:
>
> drop_caches by its very nature can be extremely performance intensive -- if
> someone wants to abort after trying too long, they can just send a
> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
> reliably work when doing that, then _that's_ something to improve, but this
> looks premature to me until that's demonstrated not to work.

Sending a TASK_KILLABLE signal? It didn't work now. Because the the
current task has no chance to handle the signal. So I think we may
need to do any of the following things to avoid this case happening.

1. Double the threshold currently hard coded as "10" with each iteration
    suggested by Vlastimil. It is also a good idea.

2. In the while loop, we can check whether the TASK_KILLABLE
    signal is set, if so, we should break the loop. like the following code
    snippe. Thanks.

@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
  do {
  struct mem_cgroup *memcg = NULL;

+ if (fatal_signal_pending(current))
+ return;
+
  freed = 0;
  memcg = mem_cgroup_iter(NULL, NULL, NULL);
  do {

>
> zangchunxin@bytedance.com writes:
> >In one drop caches action, only traverse memcg once maybe is better.
> >If user need more memory, they can do drop caches again.
>
> Can you please provide some measurements of the difference in reclamation in
> practice?



--
Yours,
Muchun

Chris Down Sept. 9, 2020, 9:59 a.m. UTC | #5

Muchun Song writes:
>1. Double the threshold currently hard coded as "10" with each iteration
>    suggested by Vlastimil. It is also a good idea.

I think this sounds reasonable, although I'd like to see what the difference in 
reclaim looks like in practice.

>2. In the while loop, we can check whether the TASK_KILLABLE
>    signal is set, if so, we should break the loop. like the following code
>    snippe. Thanks.
>
>@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
>  do {
>  struct mem_cgroup *memcg = NULL;
>
>+ if (fatal_signal_pending(current))
>+ return;
>+
>  freed = 0;
>  memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  do {

Regardless of anything, I think this is probably a good idea. Could you send it 
as a patch? :-)

Thanks,

Chris

Muchun Song Sept. 9, 2020, 12:10 p.m. UTC | #6

On Wed, Sep 9, 2020 at 5:59 PM Chris Down <chris@chrisdown.name> wrote:
>
> Muchun Song writes:
> >1. Double the threshold currently hard coded as "10" with each iteration
> >    suggested by Vlastimil. It is also a good idea.
>
> I think this sounds reasonable, although I'd like to see what the difference in
> reclaim looks like in practice.
>
> >2. In the while loop, we can check whether the TASK_KILLABLE
> >    signal is set, if so, we should break the loop. like the following code
> >    snippe. Thanks.
> >
> >@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> >  do {
> >  struct mem_cgroup *memcg = NULL;
> >
> >+ if (fatal_signal_pending(current))
> >+ return;
> >+
> >  freed = 0;
> >  memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >  do {
>
> Regardless of anything, I think this is probably a good idea. Could you send it
> as a patch? :-)

OK, Will do that thanks.

>
> Thanks,
>
> Chris

Chunxin Zang Sept. 9, 2020, 3:26 p.m. UTC | #7

Chris Down <chris@chrisdown.name> writes：

> Muchun Song writes:
> >1. Double the threshold currently hard coded as "10" with each iteration
> >    suggested by Vlastimil. It is also a good idea.
>
> I think this sounds reasonable, although I'd like to see what the
> difference in
> reclaim looks like in practice.
>
> >2. In the while loop, we can check whether the TASK_KILLABLE
> >    signal is set, if so, we should break the loop. like the following
> code
> >    snippe. Thanks.
> >
> >@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> >  do {
> >  struct mem_cgroup *memcg = NULL;
> >
> >+ if (fatal_signal_pending(current))
> >+ return;
> >+
> >  freed = 0;
> >  memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >  do {
>
> Regardless of anything, I think this is probably a good idea. Could you
> send it
> as a patch? :-)
>
> Thanks,
>
> Chris


We have send the v2 version, please check.

Best Wishes
ChunXin

mm/vmscan: fix infinite loop in drop_slab_node

Commit Message

Comments

Patch