
[v2] mm/vmscan: fix infinite loop in drop_slab_node

Message ID 20200909152047.27905-1-zangchunxin@bytedance.com (mailing list archive)
State New, archived
Series [v2] mm/vmscan: fix infinite loop in drop_slab_node

Commit Message

Chunxin Zang Sept. 9, 2020, 3:20 p.m. UTC
From: Chunxin Zang <zangchunxin@bytedance.com>

On our server there are about 10k memcgs on a single machine, and they use
memory very frequently. When I trigger drop_caches, the writing process ends
up in an infinite loop in drop_slab_node.

There are two reasons:
1. We have too many memcgs, so even if only one object is freed in each
   memcg, the sum of freed objects is bigger than 10.

2. A single pass over all memcgs takes a long time, so by the time the pass
   finishes, the memcgs visited first have accumulated many more freeable
   objects. On the next pass the freed count is bigger than 10 again.

We can get the following info through 'ps':

  root:~# ps -aux | grep drop
  root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
  root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
  root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
  root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
  root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
  root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
  root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
  root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches

Using bpftrace to follow the 'freed' value in drop_slab_node:

  root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
  Attaching 1 probe...
  ^B^C

  @ret:
  [64, 128)        1 |                                                    |
  [128, 256)      28 |                                                    |
  [256, 512)     107 |@                                                   |
  [512, 1K)      298 |@@@                                                 |
  [1K, 2K)       613 |@@@@@@@                                             |
  [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [4K, 8K)       442 |@@@@@                                               |
  [8K, 16K)      299 |@@@                                                 |
  [16K, 32K)     100 |@                                                   |
  [32K, 64K)     139 |@                                                   |
  [64K, 128K)     56 |                                                    |
  [128K, 256K)    26 |                                                    |
  [256K, 512K)     2 |                                                    |
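
Note: the probe offset (+70) and the register read ("bp") above are specific
to this particular kernel build. A rough, build-independent alternative,
assuming shrink_slab() is visible in kallsyms and has not been inlined, is to
histogram what each shrink_slab() call returns (per-memcg counts rather than
the per-pass sum):

  root:~# bpftrace -e 'kretprobe:shrink_slab { @freed_per_call = hist(retval); }'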

In the while loop, check whether a fatal signal is pending (the same set of
signals that would wake a TASK_KILLABLE sleep) and, if so, bail out of the
loop.
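
For reference, the loop in question looks roughly like this (paraphrased from
mm/vmscan.c of this kernel version, trimmed to the parts that matter; treat it
as a sketch rather than the exact source):

  void drop_slab_node(int nid)
  {
  	unsigned long freed;

  	do {
  		struct mem_cgroup *memcg = NULL;

  		freed = 0;
  		memcg = mem_cgroup_iter(NULL, NULL, NULL);
  		do {
  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
  	} while (freed > 10);	/* the check that never becomes false here */
  }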

Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
	Changelog in v2:
	1) Bail out of the loop when a fatal (TASK_KILLABLE-style) signal is pending.

 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Chris Down Sept. 9, 2020, 4:09 p.m. UTC | #1
Thanks! This looks reasonable to me, and avoids hard-to-reason-about changes in 
the existing semantics.

zangchunxin@bytedance.com writes:
>Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
>Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Chris Down <chris@chrisdown.name>
Vlastimil Babka Sept. 9, 2020, 5:59 p.m. UTC | #2
On 9/9/20 5:20 PM, zangchunxin@bytedance.com wrote:
> From: Chunxin Zang <zangchunxin@bytedance.com>
> 
> On our server, there are about 10k memcg in one machine. They use memory
> very frequently. When I tigger drop caches,the process will infinite loop
> in drop_slab_node.
> 
> There are two reasons:
> 1.We have too many memcgs, even though one object freed in one memcg, the
>   sum of object is bigger than 10.
> 
> 2.We spend a lot of time in traverse memcg once. So, the memcg who
>   traversed at the first have been freed many objects. Traverse memcg next
>   time, the freed count bigger than 10 again.
> 
> We can get the following info through 'ps':
> 
>   root:~# ps -aux | grep drop
>   root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
>   root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
>   root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
>   root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
>   root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
>   root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
>   root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
>   root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
> 
> Use bpftrace follow 'freed' value in drop_slab_node:
> 
>   root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
>   Attaching 1 probe...
>   ^B^C
> 
>   @ret:
>   [64, 128)        1 |                                                    |
>   [128, 256)      28 |                                                    |
>   [256, 512)     107 |@                                                   |
>   [512, 1K)      298 |@@@                                                 |
>   [1K, 2K)       613 |@@@@@@@                                             |
>   [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [4K, 8K)       442 |@@@@@                                               |
>   [8K, 16K)      299 |@@@                                                 |
>   [16K, 32K)     100 |@                                                   |
>   [32K, 64K)     139 |@                                                   |
>   [64K, 128K)     56 |                                                    |
>   [128K, 256K)    26 |                                                    |
>   [256K, 512K)     2 |                                                    |
> 
> In the while loop, we can check whether the TASK_KILLABLE signal is set,
> if so, we should break the loop.

That's definitely a good change, thanks. I would just maybe consider:
- Test in the memcg iteration loop? If you have 10k memcgs as you mention, it
can still take a long time before the check is reached?
- Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
and think it's taking too long, I would prefer to kill it by ctrl-c and not just
kill -9. Dunno if the canonical way of testing for this is
if (signal_pending(current)) or something else.
- IMHO it's still worth bailing out in your scenario even without a signal, e.g.
by doubling the threshold. But that can be a separate patch (a rough sketch of
the idea follows below).
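
One possible shape of that threshold-doubling idea, purely as an illustration
(this is not an actual submitted patch; the starting value and the exact
placement of the doubling are assumptions):

  void drop_slab_node(int nid)
  {
  	unsigned long freed;
  	unsigned long threshold = 10;	/* assumed starting point */

  	do {
  		struct mem_cgroup *memcg = NULL;

  		if (fatal_signal_pending(current))
  			return;

  		freed = 0;
  		memcg = mem_cgroup_iter(NULL, NULL, NULL);
  		do {
  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

  		/*
  		 * Double the bail-out threshold on every pass so that a
  		 * workload which keeps generating freeable objects cannot
  		 * keep this loop running forever.
  		 */
  		threshold <<= 1;
  	} while (freed > threshold);
  }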

Thanks!

> Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> 	changelogs in v2: 
> 	1) Via check TASK_KILLABLE signal break loop.
> 
>  mm/vmscan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..c3ed8b45d264 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -704,6 +704,9 @@ void drop_slab_node(int nid)
>  	do {
>  		struct mem_cgroup *memcg = NULL;
>  
> +		if (fatal_signal_pending(current))
> +			return;
> +
>  		freed = 0;
>  		memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  		do {
>
Chris Down Sept. 9, 2020, 9:47 p.m. UTC | #3
Vlastimil Babka writes:
>- Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>and think it's too long, I would prefer to kill it by ctrl-c and not just kill

Oh dear, fatal_signal_pending() doesn't seem to consider cases where there are
no more userspace instructions to run because of SIG_DFL on TERM/INT etc; that
seems misleading :-( I had (naively) believed it internally checks the same set
as TASK_KILLABLE.

Chunxin, Muchun, can you please make this work using TASK_KILLABLE in a similar
way to how schedule_timeout_killable() and friends do it, so that other signals
will be caught as well?
Matthew Wilcox Sept. 9, 2020, 9:49 p.m. UTC | #4
On Wed, Sep 09, 2020 at 07:59:44PM +0200, Vlastimil Babka wrote:
> - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
> and think it's too long, I would prefer to kill it by ctrl-c and not just kill
> -9. Dunno if the canonical way of testing for this is if
> (signal_pending(current)) or differently.

fatal_signal_pending() is the canonical way to do it.  If your task has
installed a signal handler for ABRT or TERM, that's its prerogative,
but it's chosen not to get killed, and it's not allowed to see short
reads/writes, so we can't break out early.
Matthew Wilcox Sept. 9, 2020, 9:52 p.m. UTC | #5
On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
> Vlastimil Babka writes:
> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
> 
> Oh dear, fatal_signal_pending() doesn't consider cases with no more
> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
> :-( I had (naively) believed it internally checks the same set as
> TASK_KILLABLE.
> 
> Chuxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
> way to how schedule_timeout_killable and friends do it instead, so that
> other signals will be caught?

You're mistaken.

        if (sig_fatal(p, sig) &&
            !(signal->flags & SIGNAL_GROUP_EXIT) &&
            !sigismember(&t->real_blocked, sig) &&
            (sig == SIGKILL || !p->ptrace)) {
...
                                sigaddset(&t->pending.signal, SIGKILL);

static inline int __fatal_signal_pending(struct task_struct *p)
{
        return unlikely(sigismember(&p->pending.signal, SIGKILL));
}
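
(For context: the first hunk above is from complete_signal() in kernel/signal.c,
which promotes any signal whose effect is fatal to a pending SIGKILL, and the
second is __fatal_signal_pending() from include/linux/sched/signal.h, which is
what fatal_signal_pending() ends up checking.)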
Chris Down Sept. 9, 2020, 9:56 p.m. UTC | #6
Matthew Wilcox writes:
>On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
>> Vlastimil Babka writes:
>> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
>>
>> Oh dear, fatal_signal_pending() doesn't consider cases with no more
>> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
>> :-( I had (naively) believed it internally checks the same set as
>> TASK_KILLABLE.
>>
>> Chuxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
>> way to how schedule_timeout_killable and friends do it instead, so that
>> other signals will be caught?
>
>You're mistaken.
>
>        if (sig_fatal(p, sig) &&
>            !(signal->flags & SIGNAL_GROUP_EXIT) &&
>            !sigismember(&t->real_blocked, sig) &&
>            (sig == SIGKILL || !p->ptrace)) {
>...
>                                sigaddset(&t->pending.signal, SIGKILL);
>
>static inline int __fatal_signal_pending(struct task_struct *p)
>{
>        return unlikely(sigismember(&p->pending.signal, SIGKILL));
>}

Oh great, that makes things way easier. Thanks!
Vlastimil Babka Sept. 9, 2020, 10:13 p.m. UTC | #7
On 9/9/20 11:52 PM, Matthew Wilcox wrote:
> On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
>> Vlastimil Babka writes:
>> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
>> 
>> Oh dear, fatal_signal_pending() doesn't consider cases with no more
>> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
>> :-( I had (naively) believed it internally checks the same set as
>> TASK_KILLABLE.
>> 
>> Chuxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
>> way to how schedule_timeout_killable and friends do it instead, so that
>> other signals will be caught?
> 
> You're mistaken.

Ah actually it was me who thought fatal_signal_pending() was only for SIGKILL,
OOM and whatnot. Sorry for the noise.

>         if (sig_fatal(p, sig) &&
>             !(signal->flags & SIGNAL_GROUP_EXIT) &&
>             !sigismember(&t->real_blocked, sig) &&
>             (sig == SIGKILL || !p->ptrace)) {
> ...
>                                 sigaddset(&t->pending.signal, SIGKILL);
> 
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
>         return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }
>
Michal Hocko Sept. 14, 2020, 9:30 a.m. UTC | #8
The subject is misleading because this patch doesn't fix an infinite
loop, right? It just allows the userspace to interrupt the operation.

On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> From: Chunxin Zang <zangchunxin@bytedance.com>
> 
> On our server, there are about 10k memcg in one machine. They use memory
> very frequently. When I tigger drop caches,the process will infinite loop
> in drop_slab_node.

Is this really an infinite loop, or does it just take a lot of time to
process all the metadata in that setup? If this is really an infinite
loop then we should look at it. My current understanding is that the
operation would finish at some point, it just takes painfully long to get
there.

> There are two reasons:
> 1.We have too many memcgs, even though one object freed in one memcg, the
>   sum of object is bigger than 10.
> 
> 2.We spend a lot of time in traverse memcg once. So, the memcg who
>   traversed at the first have been freed many objects. Traverse memcg next
>   time, the freed count bigger than 10 again.
> 
> We can get the following info through 'ps':
> 
>   root:~# ps -aux | grep drop
>   root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
>   root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
>   root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
>   root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
>   root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
>   root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
>   root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
>   root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
> 
> Use bpftrace follow 'freed' value in drop_slab_node:
> 
>   root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
>   Attaching 1 probe...
>   ^B^C
> 
>   @ret:
>   [64, 128)        1 |                                                    |
>   [128, 256)      28 |                                                    |
>   [256, 512)     107 |@                                                   |
>   [512, 1K)      298 |@@@                                                 |
>   [1K, 2K)       613 |@@@@@@@                                             |
>   [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>   [4K, 8K)       442 |@@@@@                                               |
>   [8K, 16K)      299 |@@@                                                 |
>   [16K, 32K)     100 |@                                                   |
>   [32K, 64K)     139 |@                                                   |
>   [64K, 128K)     56 |                                                    |
>   [128K, 256K)    26 |                                                    |
>   [256K, 512K)     2 |                                                    |
> 
> In the while loop, we can check whether the TASK_KILLABLE signal is set,
> if so, we should break the loop.

I would make it explicit that this is not fixing the above scenario. It
just helps to cancel the operation, which is a good thing in general.
 
> Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

With updated changelog
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 	changelogs in v2: 
> 	1) Via check TASK_KILLABLE signal break loop.
> 
>  mm/vmscan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..c3ed8b45d264 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -704,6 +704,9 @@ void drop_slab_node(int nid)
>  	do {
>  		struct mem_cgroup *memcg = NULL;
>  
> +		if (fatal_signal_pending(current))
> +			return;
> +
>  		freed = 0;
>  		memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  		do {
> -- 
> 2.11.0
>
Chunxin Zang Sept. 14, 2020, 1:25 p.m. UTC | #9
On Mon, Sep 14, 2020 at 5:30 PM Michal Hocko <mhocko@suse.com> wrote:

> The subject is misleading because this patch doesn't fix an infinite
> loop, right? It just allows the userspace to interrupt the operation.
>
>
Yes, so we are preparing a separate patch following Vlastimil's recommendation:
use doubling of the threshold to end the loop.

On Thu, Sep 10, 2020 at 1:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > From: Chunxin Zang <zangchunxin@bytedance.com>
> >
> ...
> - IMHO it's still worth to bail out in your scenario even without a
> signal, e.g.
> by the doubling of threshold. But it can be a separate patch.
> Thanks!
> ...



On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> > From: Chunxin Zang <zangchunxin@bytedance.com>
> >
> > On our server, there are about 10k memcg in one machine. They use memory
> > very frequently. When I tigger drop caches,the process will infinite loop
> > in drop_slab_node.
>
> Is this really an infinite loop, or it just takes a lot of time to
> process all the metadata in that setup? If this is really an infinite
> loop then we should look at it. My current understanding is that the
> operation would finish at some time it just takes painfully long to get
> there.
>

Yes, it's really an infinite loop. Every pass takes a lot of time, and during
that time the memcgs allocate and free memory, so on the next pass the total
'freed' is always bigger than 10.


> > There are two reasons:
> > 1.We have too many memcgs, even though one object freed in one memcg, the
> >   sum of object is bigger than 10.
> >
> > 2.We spend a lot of time in traverse memcg once. So, the memcg who
> >   traversed at the first have been freed many objects. Traverse memcg
> next
> >   time, the freed count bigger than 10 again.
> >
> > We can get the following info through 'ps':
> >
> >   root:~# ps -aux | grep drop
> >   root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
> >   root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
> >   root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
> >   root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
> >   root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
> >   root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
> >   root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
> >   root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
> >
> > Use bpftrace follow 'freed' value in drop_slab_node:
> >
> >   root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
> >   Attaching 1 probe...
> >   ^B^C
> >
> >   @ret:
> >   [64, 128)        1 |                                                    |
> >   [128, 256)      28 |                                                    |
> >   [256, 512)     107 |@                                                   |
> >   [512, 1K)      298 |@@@                                                 |
> >   [1K, 2K)       613 |@@@@@@@                                             |
> >   [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >   [4K, 8K)       442 |@@@@@                                               |
> >   [8K, 16K)      299 |@@@                                                 |
> >   [16K, 32K)     100 |@                                                   |
> >   [32K, 64K)     139 |@                                                   |
> >   [64K, 128K)     56 |                                                    |
> >   [128K, 256K)    26 |                                                    |
> >   [256K, 512K)     2 |                                                    |
> >
> > In the while loop, we can check whether the TASK_KILLABLE signal is set,
> > if so, we should break the loop.
>
> I would make it explicit that this is not fixing the above scenario. It
> just helps to cancel to operation which is a good thing in general.
>
> > Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> With updated changelog
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> > ---
> >       changelogs in v2:
> >       1) Via check TASK_KILLABLE signal break loop.
> >
> >  mm/vmscan.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index b6d84326bdf2..c3ed8b45d264 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> >       do {
> >               struct mem_cgroup *memcg = NULL;
> >
> > +             if (fatal_signal_pending(current))
> > +                     return;
> > +
> >               freed = 0;
> >               memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >               do {
> > --
> > 2.11.0
> >
>
> --
> Best wishes
> Chunxin
>
Michal Hocko Sept. 14, 2020, 1:47 p.m. UTC | #10
On Mon 14-09-20 21:25:59, Chunxin Zang wrote:
> On Mon, Sep 14, 2020 at 5:30 PM Michal Hocko <mhocko@suse.com> wrote:
> 
> > The subject is misleading because this patch doesn't fix an infinite
> > loop, right? It just allows the userspace to interrupt the operation.
> >
> >
> Yes,  so we are making a separate patch follow Vlastimil's recommendations.
> Use double of threshold to end the loop.

That still means the changelog needs an update
 
> On Thu, Sep 10, 2020 at 1:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > >
> > ...
> > - IMHO it's still worth to bail out in your scenario even without a
> > signal, e.g.
> > by the doubling of threshold. But it can be a separate patch.
> > Thanks!
> > ...
> 
> 
> 
> On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > >
> > > On our server, there are about 10k memcg in one machine. They use memory
> > > very frequently. When I tigger drop caches,the process will infinite loop
> > > in drop_slab_node.
> >
> > Is this really an infinite loop, or it just takes a lot of time to
> > process all the metadata in that setup? If this is really an infinite
> > loop then we should look at it. My current understanding is that the
> > operation would finish at some time it just takes painfully long to get
> > there.
> >
> 
> Yes,  it's really an infinite loop.  Every loop spends a lot of time. In
> this time,
> memcg will alloc/free memory,  so the next loop, the total of  'freed'
> always bigger than 10.

I am still not sure I follow. Do you mean that there is somebody
constantly generating more objects to reclaim?
Maybe we are just not agreeing on the definition of an infinite loop, but
in my book that means that the final condition can never be met. While a
busy workload adding new objects might indeed cause drop_caches to loop for a
long time, this is to be expected from that interface: it is supposed to
drop all of the cache, and the cache can grow during the operation.
Chunxin Zang Sept. 14, 2020, 3:02 p.m. UTC | #11
On Mon, Sep 14, 2020 at 9:47 PM Michal Hocko <mhocko@suse.com> wrote:

> On Mon 14-09-20 21:25:59, Chunxin Zang wrote:
> > On Mon, Sep 14, 2020 at 5:30 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > > The subject is misleading because this patch doesn't fix an infinite
> > > loop, right? It just allows the userspace to interrupt the operation.
> > >
> > >
> > Yes,  so we are making a separate patch follow Vlastimil's recommendations.
> > Use double of threshold to end the loop.
>
> That still means the changelog needs an update
>

The patch has already been merged into the linux-next branch. Can I still
update the changelog now?
This is my first patch, please forgive me :)


>
> > On Thu, Sep 10, 2020 at 1:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > >
> > > ...
> > > - IMHO it's still worth to bail out in your scenario even without a
> > > signal, e.g.
> > > by the doubling of threshold. But it can be a separate patch.
> > > Thanks!
> > > ...
> >
> >
> >
> > On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > >
> > > > On our server, there are about 10k memcg in one machine. They use memory
> > > > very frequently. When I tigger drop caches,the process will infinite loop
> > > > in drop_slab_node.
> > >
> > > Is this really an infinite loop, or it just takes a lot of time to
> > > process all the metadata in that setup? If this is really an infinite
> > > loop then we should look at it. My current understanding is that the
> > > operation would finish at some time it just takes painfully long to get
> > > there.
> > >
> >
> > Yes,  it's really an infinite loop.  Every loop spends a lot of time. In
> > this time,
> > memcg will alloc/free memory,  so the next loop, the total of  'freed'
> > always bigger than 10.
>
> I am still not sure I follow. Do you mean that there is somebody
> constantly generating more objects to reclaim?
>

Yes, that is exactly what I mean. :)


> Maybe we are just not agreeing on the definition of an infinite loop but
> in my book that means that the final condition can never be met. While a
> busy adding new object might indeed cause drop caches to loop for a long
> time this is to be expected from that interface as it is supposed to
> drop all the cache and that can grow during the operation.
> --
>

Because I have 10k memcgs, all of them heavy users of memory, more than 10
reclaimable objects are always generated during each pass, so the exit
condition is never met and the drop_caches process has no chance to leave the
loop.
Although the purpose of the drop_caches interface is to release all caches, we
still need a way to terminate it, e.g. in this case the process has already
been running far too long:
  root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
Michal Hocko Sept. 14, 2020, 3:17 p.m. UTC | #12
On Mon 14-09-20 23:02:15, Chunxin Zang wrote:
> On Mon, Sep 14, 2020 at 9:47 PM Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Mon 14-09-20 21:25:59, Chunxin Zang wrote:
> > > On Mon, Sep 14, 2020 at 5:30 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > > The subject is misleading because this patch doesn't fix an infinite
> > > > loop, right? It just allows the userspace to interrupt the operation.
> > > >
> > > >
> > > Yes,  so we are making a separate patch follow Vlastimil's recommendations.
> > > Use double of threshold to end the loop.
> >
> > That still means the changelog needs an update
> >
> 
> The patch is already merged in Linux-next branch.  Can I update the
> changelog now?

Yes. Andrew will refresh it. He doesn't have a git tree, which would otherwise
prevent rewriting the patch.

> This is my first patch, please forgive me :)

No worries. The mm patch workflow is rather different from others.

> > > On Thu, Sep 10, 2020 at 1:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > > >
> > > > ...
> > > > - IMHO it's still worth to bail out in your scenario even without a
> > > > signal, e.g.
> > > > by the doubling of threshold. But it can be a separate patch.
> > > > Thanks!
> > > > ...
> > >
> > >
> > >
> > > On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> > > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > > >
> > > > > On our server, there are about 10k memcg in one machine. They use memory
> > > > > very frequently. When I tigger drop caches,the process will infinite loop
> > > > > in drop_slab_node.
> > > >
> > > > Is this really an infinite loop, or it just takes a lot of time to
> > > > process all the metadata in that setup? If this is really an infinite
> > > > loop then we should look at it. My current understanding is that the
> > > > operation would finish at some time it just takes painfully long to get
> > > > there.
> > > >
> > >
> > > Yes,  it's really an infinite loop.  Every loop spends a lot of time. In
> > > this time,
> > > memcg will alloc/free memory,  so the next loop, the total of  'freed'
> > > always bigger than 10.
> >
> > I am still not sure I follow. Do you mean that there is somebody
> > constantly generating more objects to reclaim?
> >
> 
> Yes, this is my meaning. :)
> 
> 
> > Maybe we are just not agreeing on the definition of an infinite loop but
> > in my book that means that the final condition can never be met. While a
> > busy adding new object might indeed cause drop caches to loop for a long
> > time this is to be expected from that interface as it is supposed to
> > drop all the cache and that can grow during the operation.
> > --
> >
> 
> Because I have 10k memcg , all of them are heavy users of memory.
> During each loop, there are always more than 10 reclaimable objects
> generating, so the
> condition is never met.

10k or any number of memcgs shouldn't really make much of a difference.
Except for the time the scan adds. Fundamentally we are talking about
freed objects and whether they are on the global or per memcg lists
should result in a similar behavior.

> The drop cache process has no chance to exit the
> loop.
> Although the purpose of the 'drop cache' interface is to release all
> caches, we still need a
> way to terminate it, e.g. in this case, the process took too long to run .

Yes, this is perfectly understandable. Having a bail-out on a fatal signal
is a completely reasonable thing to do. I am mostly confused by your
infinite loop claims and how this patch relates to them. I would
propose this wording instead:

"
We have observed that drop_caches can take a considerable amount of
time (<put data here>). Especially when there are many memcgs involved
because they are adding an additional overhead.

It is quite unfortunate that the operation cannot be interrupted by a
signal currently. Add a check for fatal signals into the main loop
so that userspace can control early bailout.
"

or something along those lines.

> 
>   root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
Chunxin Zang Sept. 15, 2020, 4:23 a.m. UTC | #13
On Mon, Sep 14, 2020 at 11:17 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 14-09-20 23:02:15, Chunxin Zang wrote:
> > On Mon, Sep 14, 2020 at 9:47 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > > On Mon 14-09-20 21:25:59, Chunxin Zang wrote:
> > > > On Mon, Sep 14, 2020 at 5:30 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > > The subject is misleading because this patch doesn't fix an infinite
> > > > > loop, right? It just allows the userspace to interrupt the operation.
> > > > >
> > > > >
> > > > Yes,  so we are making a separate patch follow Vlastimil's recommendations.
> > > > Use double of threshold to end the loop.
> > >
> > > That still means the changelog needs an update
> > >
> >
> > The patch is already merged in Linux-next branch.  Can I update the
> > changelog now?
>
> Yes. Andrew will refresh it. He doesn't have a git tree which would
> prevent rewriting the patch.
>
> > This is my first patch, please forgive me :)
>
> No worries. The mm patch workflow is rather different from others.
>
> > > > On Thu, Sep 10, 2020 at 1:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > > > >
> > > > > ...
> > > > > - IMHO it's still worth to bail out in your scenario even without a
> > > > > signal, e.g.
> > > > > by the doubling of threshold. But it can be a separate patch.
> > > > > Thanks!
> > > > > ...
> > > >
> > > >
> > > >
> > > > On Wed 09-09-20 23:20:47, zangchunxin@bytedance.com wrote:
> > > > > > From: Chunxin Zang <zangchunxin@bytedance.com>
> > > > > >
> > > > > > On our server, there are about 10k memcg in one machine. They use memory
> > > > > > very frequently. When I tigger drop caches,the process will infinite loop
> > > > > > in drop_slab_node.
> > > > >
> > > > > Is this really an infinite loop, or it just takes a lot of time to
> > > > > process all the metadata in that setup? If this is really an infinite
> > > > > loop then we should look at it. My current understanding is that the
> > > > > operation would finish at some time it just takes painfully long to get
> > > > > there.
> > > > >
> > > >
> > > > Yes,  it's really an infinite loop.  Every loop spends a lot of time. In
> > > > this time,
> > > > memcg will alloc/free memory,  so the next loop, the total of  'freed'
> > > > always bigger than 10.
> > >
> > > I am still not sure I follow. Do you mean that there is somebody
> > > constantly generating more objects to reclaim?
> > >
> >
> > Yes, this is my meaning. :)
> >
> >
> > > Maybe we are just not agreeing on the definition of an infinite loop but
> > > in my book that means that the final condition can never be met. While a
> > > busy adding new object might indeed cause drop caches to loop for a long
> > > time this is to be expected from that interface as it is supposed to
> > > drop all the cache and that can grow during the operation.
> > > --
> > >
> >
> > Because I have 10k memcg , all of them are heavy users of memory.
> > During each loop, there are always more than 10 reclaimable objects
> > generating, so the
> > condition is never met.
>
> 10k or any number of memcgs shouldn't really make much of a difference.
> Except for the time the scan adds. Fundamentally we are talking about
> freed objects and whether they are on the global or per memcg lists
> should result in a similar behavior.
>
> > The drop cache process has no chance to exit the
> > loop.
> > Although the purpose of the 'drop cache' interface is to release all
> > caches, we still need a
> > way to terminate it, e.g. in this case, the process took too long to run .
>
> Yes, this is perfectly understandable. Having a bail out on fatal signal
> is a completely reasonable thing to do. I am mostly confused by your
> infinite loop claims and what the relation of this patch to it. I would
> propose this wording instead
>
> "
> We have observed that drop_caches can take a considerable amount of
> time (<put data here>). Especially when there are many memcgs involved
> because they are adding an additional overhead.
>
> It is quite unfortunate that the operation cannot be interrupted by a
> signal currently. Add a check for fatal signals into the main loop
> so that userspace can control early bailout.
> "

I understand what you mean. Yes, you are right: the patch only provides a way
for users to terminate the process; it does not reduce the time consumed or
improve the looping behaviour itself.
I will update the description of the patch in v3.
Thanks so much :)

> or something along those lines.
>
> >
> >   root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
>
> --
> Michal Hocko
> SUSE Labs

Best wishes
Chunxin

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..c3ed8b45d264 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -704,6 +704,9 @@  void drop_slab_node(int nid)
 	do {
 		struct mem_cgroup *memcg = NULL;
 
+		if (fatal_signal_pending(current))
+			return;
+
 		freed = 0;
 		memcg = mem_cgroup_iter(NULL, NULL, NULL);
 		do {
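
A quick, purely illustrative way to exercise the new bail-out (the shell
job-control details below are an assumption, not something taken from this
thread):

  # start the drop in the background...
  sh -c 'echo 3 > /proc/sys/vm/drop_caches' &
  # ...then interrupt it with a default-fatal signal; as discussed above,
  # the kernel promotes it to a pending SIGKILL, so the check added by this
  # patch lets the write return early instead of looping on.
  kill -TERM %1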