
mm, memcg: do full scan initially in force_empty

Message ID 20200728074032.1555-1-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series mm, memcg: do full scan initially in force_empty

Commit Message

Yafang Shao July 28, 2020, 7:40 a.m. UTC
Sometimes we use memory.force_empty to drop pages in a memcg to work
around memory pressure issues. When we use force_empty, we want the
pages to be reclaimed ASAP. However, force_empty reclaims pages as a
regular reclaimer: it starts scanning the page cache LRUs at
DEF_PRIORITY and only drops to priority 0, a full scan, at the end.
That is a waste of time; we had better do the full scan from the start
in force_empty.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 16 ++++++++++------
 mm/vmscan.c          |  5 +++--
 3 files changed, 15 insertions(+), 9 deletions(-)
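
[Editor's note] For readers unfamiliar with the priority mechanics the
changelog refers to: the scan priority bounds how much of each LRU list
a single reclaim pass may scan. A minimal sketch of the relation,
paraphrasing get_scan_count() in mm/vmscan.c around v5.8 (the function
name scan_window below is invented for illustration, not kernel code):

        /*
         * Each pass scans roughly lruvec_size >> priority pages per
         * LRU list. DEF_PRIORITY is 12, so the first pass covers only
         * 1/4096 of each list; each retry lowers the priority and
         * doubles the window, until priority 0 covers the whole list,
         * which is the "full scan" the changelog wants to start from.
         */
        static unsigned long scan_window(unsigned long lruvec_size,
                                         int priority)
        {
                return lruvec_size >> priority;
        }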

Comments

Michal Hocko July 30, 2020, 11:26 a.m. UTC | #1
On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> Sometimes we use memory.force_empty to drop pages in a memcg to work
> around memory pressure issues. When we use force_empty, we want the
> pages to be reclaimed ASAP. However, force_empty reclaims pages as a
> regular reclaimer: it starts scanning the page cache LRUs at
> DEF_PRIORITY and only drops to priority 0, a full scan, at the end.
> That is a waste of time; we had better do the full scan from the start
> in force_empty.

Do you have any numbers please?

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/swap.h |  3 ++-
>  mm/memcontrol.c      | 16 ++++++++++------
>  mm/vmscan.c          |  5 +++--
>  3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5b3216ba39a9..d88430f1b964 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -364,7 +364,8 @@ extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  						  unsigned long nr_pages,
>  						  gfp_t gfp_mask,
> -						  bool may_swap);
> +						  bool may_swap,
> +						  int priority);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  						gfp_t gfp_mask, bool noswap,
>  						pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13f559af1ab6..c873a98f8c7e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2237,7 +2237,8 @@ static void reclaim_high(struct mem_cgroup *memcg,
>  		    READ_ONCE(memcg->memory.high))
>  			continue;
>  		memcg_memory_event(memcg, MEMCG_HIGH);
> -		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true,
> +					     DEF_PRIORITY);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
>  }
> @@ -2515,7 +2516,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	memcg_memory_event(mem_over_limit, MEMCG_MAX);
>  
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -						    gfp_mask, may_swap);
> +						    gfp_mask, may_swap,
> +						    DEF_PRIORITY);
>  
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		goto retry;
> @@ -3089,7 +3091,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>  		}
>  
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1,
> -					GFP_KERNEL, !memsw)) {
> +					GFP_KERNEL, !memsw,
> +					DEF_PRIORITY)) {
>  			ret = -EBUSY;
>  			break;
>  		}
> @@ -3222,7 +3225,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  			return -EINTR;
>  
>  		progress = try_to_free_mem_cgroup_pages(memcg, 1,
> -							GFP_KERNEL, true);
> +							GFP_KERNEL, true,
> +							0);
>  		if (!progress) {
>  			nr_retries--;
>  			/* maybe some writeback is necessary */
> @@ -6065,7 +6069,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  		}
>  
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -							 GFP_KERNEL, true);
> +							 GFP_KERNEL, true, DEF_PRIORITY);
>  
>  		if (!reclaimed && !nr_retries--)
>  			break;
> @@ -6113,7 +6117,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  
>  		if (nr_reclaims) {
>  			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -							  GFP_KERNEL, true))
> +							  GFP_KERNEL, true, DEF_PRIORITY))
>  				nr_reclaims--;
>  			continue;
>  		}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749d239c62b2..49298bb2892d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3315,7 +3315,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   bool may_swap)
> +					   bool may_swap,
> +					   int priority)
>  {
>  	unsigned long nr_reclaimed;
>  	unsigned long pflags;
> @@ -3326,7 +3327,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>  		.reclaim_idx = MAX_NR_ZONES - 1,
>  		.target_mem_cgroup = memcg,
> -		.priority = DEF_PRIORITY,
> +		.priority = priority,
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
>  		.may_swap = may_swap,
> -- 
> 2.18.1
Yafang Shao July 31, 2020, 1:50 a.m. UTC | #2
On Thu, Jul 30, 2020 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> > [...]
>
> Do you have any numbers please?
>

Unfortunately the numbers don't improve noticeably; the elapsed time is
directly proportional to the total number of pages to be scanned.
But then I noticed that force_empty will try to write dirty pages, which
is not what we expect, because this behavior may be dangerous in a
production environment.
What do you think about introducing a per-memcg drop_caches?
memory.drop_caches:
    1 - drop clean page cache
    2 - drop kmem, which would also give us a workaround to drop negative
dentries in a specific memcg.
    3 - drop all
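
[Editor's note] The per-memcg drop_caches interface proposed above was
never merged. As a purely illustrative sketch of what such a cgroup
write handler could look like (the handler and the two memcg_* helpers
it calls are hypothetical; only mem_cgroup_from_css(), of_css() and the
string parsing helpers are real kernel API):

        static ssize_t memcg_drop_caches_write(struct kernfs_open_file *of,
                                               char *buf, size_t nbytes,
                                               loff_t off)
        {
                struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
                int val;

                if (kstrtoint(strstrip(buf), 0, &val) || val < 1 || val > 3)
                        return -EINVAL;

                if (val & 1)    /* 1 or 3: drop clean page cache */
                        memcg_drop_page_cache(memcg);   /* hypothetical */
                if (val & 2)    /* 2 or 3: shrink slabs (dentries, ...) */
                        memcg_shrink_slabs(memcg);      /* hypothetical */

                return nbytes;
        }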

> > [...]
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 3, 2020, 10:12 a.m. UTC | #3
On Fri 31-07-20 09:50:04, Yafang Shao wrote:
> On Thu, Jul 30, 2020 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> > > [...]
> >
> > Do you have any numbers please?
> >
> 
> Unfortunately the numbers don't improve noticeably; the elapsed time is
> directly proportional to the total number of pages to be scanned.

Your changelog claims an optimization and that should be backed by some
numbers. It is true that reclaim at a higher priority behaves slightly
and subtly differently, but that calls for even more detail in the
changelog.

> But then I noticed that force_empty will try to write dirty pages, which
> is not what we expect, because this behavior may be dangerous in a
> production environment.

I do not understand your claim here. Direct reclaim doesn't write dirty
page cache pages directly. And it is even less clear why that would be
dangerous if it did.

> What do you think about introducing a per-memcg drop_caches?

I do not like the global drop_caches and a per-memcg one is not very much
different. This all shouldn't really be necessary because we do have
means to reclaim memory in a memcg.
Yafang Shao Aug. 3, 2020, 1:20 p.m. UTC | #4
On Mon, Aug 3, 2020 at 6:12 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 31-07-20 09:50:04, Yafang Shao wrote:
> > On Thu, Jul 30, 2020 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> > > > [...]
> > >
> > > Do you have any numbers please?
> > >
> >
> > Unfortunately the numbers don't improve noticeably; the elapsed time is
> > directly proportional to the total number of pages to be scanned.
>
> Your changelog claims an optimization and that should be backed by some
> numbers. It is true that reclaim at a higher priority behaves slightly
> and subtly differently, but that calls for even more detail in the
> changelog.
>

With the additional change below (nr_to_scan is also changed), the elapsed
time of force_empty can be reduced by 10%.

@@ -3208,6 +3211,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 {
        int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+       unsigned long size;

        /* we call try-to-free pages for make this cgroup empty */
        lru_add_drain_all();
@@ -3215,14 +3219,15 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
        drain_all_stock(memcg);
        /* try to free all pages in this cgroup */
-       while (nr_retries && page_counter_read(&memcg->memory)) {
+       while (nr_retries && (size = page_counter_read(&memcg->memory))) {
                int progress;

                if (signal_pending(current))
                        return -EINTR;
-               progress = try_to_free_mem_cgroup_pages(memcg, 1,
-                                                       GFP_KERNEL, true);
+               progress = try_to_free_mem_cgroup_pages(memcg, size,
+                                                       GFP_KERNEL, true,
+                                                       0);

Below are the numbers for a 16G memcg with full clean pagecache.
Without these changes,
$ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
real    0m2.247s
user    0m0.000s
sys     0m1.722s

With these changes,
$ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
real    0m2.053s
user    0m0.000s
sys     0m1.529s

But I'm not sure whether we should make this improvement, because
force_empty is not a critical path.


> > But then I noticed that force_empty will try to write dirty pages, which
> > is not what we expect, because this behavior may be dangerous in a
> > production environment.
>
> I do not understand your claim here. Direct reclaim doesn't write dirty
> page cache pages directly.

It will write dirty pages once the sc->priority drops to a very low number.
if (sc->priority < DEF_PRIORITY - 2)
    sc->may_writepage = 1;

>  And it is even less clear why that would be
> dangerous if it did.
>

It will generate many IOs, which may block others.

> > What do you think about introducing a per-memcg drop_caches?
>
> I do not like the global drop_caches and a per-memcg one is not very much
> different. This all shouldn't really be necessary because we do have
> means to reclaim memory in a memcg.
> --

We once hit an issue where there were many negative dentries in some memcgs.
These negative dentries were introduced by some specific workload in
those memcgs, and we wanted to drop them as soon as possible.
But unfortunately there is no good way to drop them except
force_empty or the global drop_caches.
force_empty will also drop the pagecache pages, which is not
what we want.
The global drop_caches can't work either because it will drop slabs in
other memcgs.
That is why I want to introduce a per-memcg drop_caches.
Michal Hocko Aug. 3, 2020, 1:56 p.m. UTC | #5
On Mon 03-08-20 21:20:44, Yafang Shao wrote:
> On Mon, Aug 3, 2020 at 6:12 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 31-07-20 09:50:04, Yafang Shao wrote:
> > > On Thu, Jul 30, 2020 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> > > > > [...]
> > > >
> > > > Do you have any numbers please?
> > > >
> > >
> > > Unfortunately the numbers don't improve noticeably; the elapsed time is
> > > directly proportional to the total number of pages to be scanned.
> >
> > Your changelog claims an optimization and that should be backed by some
> > numbers. It is true that reclaim at a higher priority behaves slightly
> > and subtly differently, but that calls for even more detail in the
> > changelog.
> >
> 
> With the additional change below (nr_to_scan is also changed), the elapsed
> time of force_empty can be reduced by 10%.
> 
> @@ -3208,6 +3211,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  {
>         int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +       unsigned long size;
> 
>         /* we call try-to-free pages for make this cgroup empty */
>         lru_add_drain_all();
> @@ -3215,14 +3219,15 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>         drain_all_stock(memcg);
>         /* try to free all pages in this cgroup */
> -       while (nr_retries && page_counter_read(&memcg->memory)) {
> +       while (nr_retries && (size = page_counter_read(&memcg->memory))) {
>                 int progress;
> 
>                 if (signal_pending(current))
>                         return -EINTR;
> -               progress = try_to_free_mem_cgroup_pages(memcg, 1,
> -                                                       GFP_KERNEL, true);
> +               progress = try_to_free_mem_cgroup_pages(memcg, size,
> +                                                       GFP_KERNEL, true,
> +                                                       0);

Have you tried this change without changing the reclaim priority?

> Below are the numbers for a 16G memcg with full clean pagecache.
> Without these changes,
> $ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
> real    0m2.247s
> user    0m0.000s
> sys     0m1.722s
> 
> With these changes,
> $ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
> real    0m2.053s
> user    0m0.000s
> sys     0m1.529s
> 
> But I'm not sure whether we should make this improvement, because
> force_empty is not a critical path.

Well, an isolated change to force_empty would be more acceptable but it
is worth noting that a very large reclaim target might affect the
userspace triggering this path because it will potentially increase
latency to process any signals. I do not expect this to be a huge
problem in practice because even reclaim for a smaller target can take
quite long if the memory is not really reclaimable and it has to take
the full world scan. Moreover, most userspace will simply do
echo 1 > $MEMCG_PAGE/force_empty
and only care about killing that if it takes too long.
 
> > > But then I noticed that force_empty will try to write dirty pages, which
> > > is not what we expect, because this behavior may be dangerous in a
> > > production environment.
> >
> > I do not understand your claim here. Direct reclaim doesn't write dirty
> > page cache pages directly.
> 
> It will write dirty pages once the sc->priority drops to a very low number.
> if (sc->priority < DEF_PRIORITY - 2)
>     sc->may_writepage = 1;

OK, I see what you mean now. Please have a look above that check:
                        /*
                         * Only kswapd can writeback filesystem pages
                         * to avoid risk of stack overflow. But avoid
                         * injecting inefficient single-page IO into
                         * flusher writeback as much as possible: only
                         * write pages when we've encountered many
                         * dirty pages, and when we've already scanned
                         * the rest of the LRU for clean pages and see
                         * the same dirty pages again (PageReclaim).
                         */
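
[Editor's note] The check being discussed sits in shrink_page_list();
a simplified paraphrase of the mm/vmscan.c logic around v5.8 follows
(the identifiers are real, but the block is condensed, not verbatim):

        if (PageDirty(page)) {
                /*
                 * Dirty file pages are never written back by direct
                 * reclaim; only kswapd may write them, and only after
                 * many dirty pages were encountered and this page
                 * comes around again with PageReclaim already set.
                 */
                if (page_is_file_lru(page) &&
                    (!current_is_kswapd() || !PageReclaim(page) ||
                     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
                        SetPageReclaim(page);
                        goto activate_locked;
                }
        }

So even with sc->may_writepage set, the force_empty path (direct
reclaim) does not submit file writeback itself; at most it can write
swap-backed pages out.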

> >  And it is even less clear why that would be
> > dangerous if it did.
> >
> 
> It will generate many IOs, which may block others.
> 
> > > What do you think about introducing a per-memcg drop_caches?
> >
> > I do not like the global drop_caches and a per-memcg one is not very much
> > different. This all shouldn't really be necessary because we do have
> > means to reclaim memory in a memcg.
> > --
> 
> We once hit an issue where there were many negative dentries in some memcgs.

Yes, negative dentries can build up, but memory reclaim should be
pretty effective at reclaiming them.

> These negative dentries were introduced by some specific workload in
> those memcgs, and we wanted to drop them as soon as possible.
> But unfortunately there is no good way to drop them except
> force_empty or the global drop_caches.

You can use memcg limits (e.g. memory high) to pro-actively reclaim
excess memory. Have you tried that?

> force_empty will also drop the pagecache pages, which is not
> what we want.

force_empty is intended to reclaim _all_ pages.

> The global drop_caches can't work either because it will drop slabs in
> other memcgs.
> That is why I want to introduce a per-memcg drop_caches.

Problems with negative dentries have already been discussed in the past.
I believe there has been no conclusion so far. Please try to dig into
the archives.
Yafang Shao Aug. 3, 2020, 2:18 p.m. UTC | #6
On Mon, Aug 3, 2020 at 9:56 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 03-08-20 21:20:44, Yafang Shao wrote:
> > On Mon, Aug 3, 2020 at 6:12 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 31-07-20 09:50:04, Yafang Shao wrote:
> > > > On Thu, Jul 30, 2020 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 28-07-20 03:40:32, Yafang Shao wrote:
> > > > > > [...]
> > > > >
> > > > > Do you have any numbers please?
> > > > >
> > > >
> > > > Unfortunately the numbers don't improve noticeably; the elapsed time is
> > > > directly proportional to the total number of pages to be scanned.
> > >
> > > Your changelog claims an optimization and that should be backed by some
> > > numbers. It is true that reclaim at a higher priority behaves slightly
> > > and subtly differently, but that calls for even more detail in the
> > > changelog.
> > >
> >
> > With the additional change below (nr_to_scan is also changed), the elapsed
> > time of force_empty can be reduced by 10%.
> >
> > @@ -3208,6 +3211,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> >  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> >  {
> >         int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > +       unsigned long size;
> >
> >         /* we call try-to-free pages for make this cgroup empty */
> >         lru_add_drain_all();
> > @@ -3215,14 +3219,15 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> >         drain_all_stock(memcg);
> >         /* try to free all pages in this cgroup */
> > -       while (nr_retries && page_counter_read(&memcg->memory)) {
> > +       while (nr_retries && (size = page_counter_read(&memcg->memory))) {
> >                 int progress;
> >
> >                 if (signal_pending(current))
> >                         return -EINTR;
> > -               progress = try_to_free_mem_cgroup_pages(memcg, 1,
> > -                                                       GFP_KERNEL, true);
> > +               progress = try_to_free_mem_cgroup_pages(memcg, size,
> > +                                                       GFP_KERNEL, true,
> > +                                                       0);
>
> Have you tried this change without changing the reclaim priority?
>

I tried it again. It seems the improvement is mostly due to the change
of nr_to_reclaim rather than the reclaim priority:

-               progress = try_to_free_mem_cgroup_pages(memcg, 1,
+               progress = try_to_free_mem_cgroup_pages(memcg, size,


> > Below are the numbers for a 16G memcg with full clean pagecache.
> > Without these changes,
> > $ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
> > real    0m2.247s
> > user    0m0.000s
> > sys     0m1.722s
> >
> > With these changes,
> > $ time echo 1 > /sys/fs/cgroup/memory/foo/memory.force_empty
> > real    0m2.053s
> > user    0m0.000s
> > sys     0m1.529s
> >
> > But I'm not sure whether we should make this improvement, because
> > force_empty is not a critical path.
>
> Well, an isolated change to force_empty would be more acceptable but it
> is worth noting that a very large reclaim target might affect the
> userspace triggering this path because it will potentially increase
> latency to process any signals. I do not expect this to be a huge
> problem in practice because even reclaim for a smaller target can take
> quite long if the memory is not really reclaimable and it has to take
> the full world scan. Moreover, most userspace will simply do
> echo 1 > $MEMCG_PAGE/force_empty
> and only care about killing that if it takes too long.
>

We may do it in a script to force empty many memcgs at the same time.
Of course we can measure the time it takes to force empty, but that
will be complicated.

> > > > But then I noticed that force_empty will try to write dirty pages, which
> > > > is not what we expect, because this behavior may be dangerous in a
> > > > production environment.
> > >
> > > I do not understand your claim here. Direct reclaim doesn't write dirty
> > > page cache pages directly.
> >
> > It will write dirty pages once the sc->priority drops to a very low number.
> > if (sc->priority < DEF_PRIORITY - 2)
> >     sc->may_writepage = 1;
>
> OK, I see what you mean now. Please have a look above that check:
>                         /*
>                          * Only kswapd can writeback filesystem pages
>                          * to avoid risk of stack overflow. But avoid
>                          * injecting inefficient single-page IO into
>                          * flusher writeback as much as possible: only
>                          * write pages when we've encountered many
>                          * dirty pages, and when we've already scanned
>                          * the rest of the LRU for clean pages and see
>                          * the same dirty pages again (PageReclaim).
>                          */
>
> > >  And it is even less clear why that would be
> > > dangerous if it did.
> > >
> >
> > It will generate many IOs, which may block others.
> >
> > > > What do you think about introducing a per-memcg drop_caches?
> > >
> > > I do not like the global drop_caches and a per-memcg one is not very much
> > > different. This all shouldn't really be necessary because we do have
> > > means to reclaim memory in a memcg.
> > > --
> >
> > We once hit an issue where there were many negative dentries in some memcgs.
>
> Yes, negative dentries can build up, but memory reclaim should be
> pretty effective at reclaiming them.
>
> > These negative dentries were introduced by some specific workload in
> > those memcgs, and we wanted to drop them as soon as possible.
> > But unfortunately there is no good way to drop them except
> > force_empty or the global drop_caches.
>
> You can use memcg limits (e.g. memory high) to pro-actively reclaim
> excess memory. Have you tried that?
>
> > force_empty will also drop the pagecache pages, which is not
> > what we want.
>
> force_empty is intended to reclaim _all_ pages.
>
> > The global drop_caches can't work either because it will drop slabs in
> > other memcgs.
> > That is why I want to introduce a per-memcg drop_caches.
>
> Problems with negative dentries have already been discussed in the past.
> I believe there has been no conclusion so far. Please try to dig into
> the archives.

I have read Waiman's proposal, but it seems there isn't a conclusion yet.
If the kernel can't fix this issue perfectly, then giving the user a
chance to work around it would be a possible solution; drop_caches is
that kind of workaround.

[ adding Waiman to CC ]
Yafang Shao Aug. 3, 2020, 2:26 p.m. UTC | #7
On Mon, Aug 3, 2020 at 10:18 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> [...]

Forgot to reply to your suggestion about using memcg limits. Adding it below,

> You can use memcg limits (e.g. memory high) to pro-actively reclaim
> excess memory. Have you tried that?

The memcg limit not only reclaims the slabs, but also reclaims the pagecaches.
Furthermore, there is no per-memcg vm.vfs_cache_pressure either.
Michal Hocko Aug. 3, 2020, 2:34 p.m. UTC | #8
On Mon 03-08-20 22:18:52, Yafang Shao wrote:
> On Mon, Aug 3, 2020 at 9:56 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [...]
> > Have you tried this change without changing the reclaim priority?
> >
> 
> I tried it again. It seems the improvement is mostly due to the change
> of nr_to_reclaim rather than the reclaim priority:
> 
> -               progress = try_to_free_mem_cgroup_pages(memcg, 1,
> +               progress = try_to_free_mem_cgroup_pages(memcg, size,

This is what I expected. The reclaim priority might have some side
effects as well, but that requires very specific conditions where the
reclaim really has to dive to large scan windows to make progress.
It would be interesting to find out where the improvement comes from
and how stable those numbers are, because normally it shouldn't matter
much whether you make N rounds of reclaim with a smaller target
or do the reclaim in a single round.
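
[Editor's note] One plausible source of the ~10% difference, assuming
v5.8-era constants: try_to_free_mem_cgroup_pages() sets up

        struct scan_control sc = {
                .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
                ...
        };

so with nr_pages == 1 each call still targets only SWAP_CLUSTER_MAX
(32) pages. Emptying a 16G memcg of clean 4K page cache (~4.2M pages)
then takes on the order of 4.2M / 32, i.e. roughly 131,000 calls, each
paying the fixed setup and priority-loop cost, whereas passing the full
page_counter size lets a single call reclaim everything.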
Michal Hocko Aug. 3, 2020, 2:37 p.m. UTC | #9
On Mon 03-08-20 22:26:10, Yafang Shao wrote:
> On Mon, Aug 3, 2020 at 10:18 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > You can use memcg limits (e.g. memory high) to pro-actively reclaim
> > excess memory. Have you tried that?
> 
> The memcg limit not only reclaims the slabs, but also reclaims the pagecaches.

True, but drop_caches doesn't distinguish different kinds of objects
either. The reclaim will simply try to drop unused cache. The
all-or-nothing behavior of drop_caches is exactly what I dislike about
the interface. Memory reclaim can at least consider age and reflect the
general usage pattern.

> Furthermore, there is no per-memcg vm.vfs_cache_pressure either.

Why does that matter?
Waiman Long Aug. 3, 2020, 3:26 p.m. UTC | #10
Yafang,

>>
>> > force_empty will also drop the pagecache pages, which is not
>> > what we want.
>>
>> force_empty is intended to reclaim _all_ pages.
>>
>> > The global drop_caches can't work either because it will drop slabs in
>> > other memcgs.
>> > That is why I want to introduce a per-memcg drop_caches.
>>
>> Problems with negative dentries have already been discussed in the past.
>> I believe there has been no conclusion so far. Please try to dig into
>> the archives.
>
>I have read Waiman's proposal, but it seems there isn't a conclusion yet.
>If the kernel can't fix this issue perfectly, then giving the user a
>chance to work around it would be a possible solution; drop_caches is
>that kind of workaround.

I am sorry for failing to follow up on the negative dentry patches. Is
your reason for doing this patch primarily the negative dentry
build-up? Uncontrolled negative dentry build-up is a problem that should be
addressed. I will work on an updated patch and post it ASAP.

Cheers,
Longman

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5b3216ba39a9..d88430f1b964 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -364,7 +364,8 @@  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  bool may_swap);
+						  bool may_swap,
+						  int priority);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13f559af1ab6..c873a98f8c7e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2237,7 +2237,8 @@  static void reclaim_high(struct mem_cgroup *memcg,
 		    READ_ONCE(memcg->memory.high))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true,
+					     DEF_PRIORITY);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
 }
@@ -2515,7 +2516,8 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	memcg_memory_event(mem_over_limit, MEMCG_MAX);
 
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, may_swap);
+						    gfp_mask, may_swap,
+						    DEF_PRIORITY);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		goto retry;
@@ -3089,7 +3091,8 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1,
-					GFP_KERNEL, !memsw)) {
+					GFP_KERNEL, !memsw,
+					DEF_PRIORITY)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3222,7 +3225,8 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
+							GFP_KERNEL, true,
+							0);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -6065,7 +6069,7 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL, true);
+							 GFP_KERNEL, true, DEF_PRIORITY);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6113,7 +6117,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-							  GFP_KERNEL, true))
+							  GFP_KERNEL, true, DEF_PRIORITY))
 				nr_reclaims--;
 			continue;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749d239c62b2..49298bb2892d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3315,7 +3315,8 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   bool may_swap)
+					   bool may_swap,
+					   int priority)
 {
 	unsigned long nr_reclaimed;
 	unsigned long pflags;
@@ -3326,7 +3327,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
-		.priority = DEF_PRIORITY,
+		.priority = priority,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,