vmscan: retry without cache trim mode if nothing scanned

Message ID 20210311004449.1170308-1-ying.huang@intel.com (mailing list archive)
State New, archived

Commit Message

Huang, Ying March 11, 2021, 12:44 a.m. UTC
From: Huang Ying <ying.huang@intel.com>

In shrink_node(), to determine whether to enable cache trim mode, the
LRU size is obtained via lruvec_page_state().  That reads the value from
a per-CPU counter (mem_cgroup_per_node->lruvec_stat[]).  The error of
the per-CPU counter, which comes from CPU-local counting and from the
descendant memory cgroups, may cause some issues.  We ran into this in
the 0-Day performance test.
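
For illustration only (not part of the patch): a minimal userspace sketch of how a batched per-CPU counter can keep reporting a stale size.  The names toy_counter, toy_mod(), toy_read_fast() and toy_read_exact() are hypothetical, and the batch threshold of 32 is an assumption chosen for the example; the kernel's lruvec_stat[] handling differs in detail.

```c
#include <stdlib.h>

#define TOY_NR_CPUS 4
#define TOY_BATCH   32	/* assumed threshold, for illustration only */

struct toy_counter {
	long global;			/* what the fast read reports */
	long local[TOY_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Update from one CPU: fold into the global value only past the batch. */
static void toy_mod(struct toy_counter *c, int cpu, long delta)
{
	long x = c->local[cpu] + delta;

	if (labs(x) > TOY_BATCH) {
		c->global += x;
		x = 0;
	}
	c->local[cpu] = x;
}

/* Cheap read: ignores whatever is still pending in the per-CPU slots. */
static long toy_read_fast(const struct toy_counter *c)
{
	return c->global;
}

/* Exact read: walks every CPU slot, which is the extra overhead. */
static long toy_read_exact(const struct toy_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

If 33 pages are added on one CPU and then removed in smaller steps on other CPUs, the removals stay below the batch and the fast read keeps returning 33 although the list is empty.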

0-Day uses a RAM file system as the root file system, so the number of
reclaimable file pages is very small.  In the swap testing, the
inactive file LRU list soon becomes almost empty.  But the size of
the inactive file LRU list read from the per-CPU counter may stay at a
much larger value (say, 33, 50, etc.).  This enables cache trim
mode, but in fact nothing can be scanned.  The following pattern
repeats for a long time in the test:

priority	inactive_file_size	cache_trim_mode
12		33			0
11		33			0
...
6		33			0
5		33			1
...
1		33			1

That is, cache_trim_mode is wrongly enabled once the scan priority
drops to 5, and the problem does not recover for a long time.
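
The switch at priority 5 is consistent with a shift-based threshold.  A minimal sketch, assuming the check has the form "inactive_file >> priority is nonzero"; the helper name is hypothetical, and the other conditions shrink_node() applies before enabling cache trim mode are left out:

```c
/*
 * Hypothetical helper: returns 1 when a stale inactive file size would
 * still pass a shift-based threshold at the given scan priority.
 */
static int toy_cache_trim_wanted(unsigned long inactive_file, int priority)
{
	return (inactive_file >> priority) != 0;
}
```

With a reported size of 33, the shift yields 0 for priorities 12 down to 6 and 1 for priorities 5 down to 1, which matches the table above.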

It's hard to get a more accurate size of the inactive file list
without much more overhead.  And it's hard to estimate the error of
the per-CPU counter too, because there may be many descendant memory
cgroups.  But if nothing can be scanned with cache trim mode enabled,
then enabling it was wrong.  So we can retry with cache trim mode
disabled.  This patch implements this policy.
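
The policy can be modelled in userspace as follows.  This is a simplified sketch, not the kernel code: toy_lru, toy_sc, toy_scan() and toy_shrink() are hypothetical names, and it assumes that cache trim mode restricts scanning to file pages.

```c
#include <stdbool.h>

struct toy_lru {
	unsigned long file_pages;	/* pages really on the file list */
	unsigned long anon_pages;	/* pages really on the anon list */
};

struct toy_sc {
	bool cache_trim_mode;
	unsigned long nr_scanned;
};

/* One scan pass: in cache trim mode only the file list is scanned. */
static void toy_scan(const struct toy_lru *lru, struct toy_sc *sc)
{
	sc->nr_scanned += lru->file_pages;
	if (!sc->cache_trim_mode)
		sc->nr_scanned += lru->anon_pages;
}

/* Retry once without cache trim mode if the first pass scanned nothing. */
static void toy_shrink(const struct toy_lru *lru, struct toy_sc *sc)
{
	unsigned long before = sc->nr_scanned;

	toy_scan(lru, sc);
	if (sc->cache_trim_mode && sc->nr_scanned == before) {
		sc->cache_trim_mode = false;
		toy_scan(lru, sc);
	}
}
```

Because the retry clears cache_trim_mode first, the condition cannot hold a second time, so the loop runs at most twice.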

The test results for pmbench with normal access address distribution
and 2 NVMe disks as swap on an Intel server for the base and patched
kernels are as follows.

      base     change    patched
----------     ------  ---------
 122100000      18.6%  144800000   pmbench.throughput.aps
    164078     -56.3%      71686   vmstat.swap.si
    166511     -20.8%     131957   vmstat.swap.so
      1992     -52.2%     953.17   proc-vmstat.kswapd_high_wmark_hit_quickly
    230.20     -63.8%      83.33   proc-vmstat.kswapd_low_wmark_hit_quickly
      2318     -41.4%       1358   proc-vmstat.pageoutrun

From the above table, with the patch, page reclaim runs more
smoothly and hot/cold page detection works better, so the swap
in/out throughput decreases considerably and the benchmark
throughput increases by 18.6%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmscan.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

Comments

Shakeel Butt March 11, 2021, 12:57 a.m. UTC | #1
On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> From: Huang Ying <ying.huang@intel.com>
>
> In shrink_node(), to determine whether to enable cache trim mode, the
> LRU size is gotten via lruvec_page_state().  That gets the value from
> a per-CPU counter (mem_cgroup_per_node->lruvec_stat[]).  The error of
> the per-CPU counter from CPU local counting and the descendant memory
> cgroups may cause some issues.  We run into this in 0-Day performance
> test.
>
> 0-Day uses the RAM file system as root file system, so the number of
> the reclaimable file pages is very small.  In the swap testing, the
> inactive file LRU list will become almost empty soon.  But the size of
> the inactive file LRU list gotten from the per-CPU counter may keep a
> much larger value (say, 33, 50, etc.).  This will enable cache trim
> mode, but nothing can be scanned in fact.  The following pattern
> repeats for long time in the test,
>
> priority        inactive_file_size      cache_trim_mode
> 12              33                      0
> 11              33                      0
> ...
> 6               33                      0
> 5               33                      1
> ...
> 1               33                      1
>
> That is, the cache_trim_mode will be enabled wrongly when the scan
> priority decreases to 5.  And the problem will not be recovered for
> long time.
>
> It's hard to get the more accurate size of the inactive file list
> without much more overhead.  And it's hard to estimate the error of
> the per-CPU counter too, because there may be many descendant memory
> cgroups.  But after the actual scanning, if nothing can be scanned
> with the cache trim mode, it should be wrong to enable the cache trim
> mode.  So we can retry with the cache trim mode disabled.  This patch
> implement this policy.

Instead of playing with the already complicated heuristics, we should
improve the accuracy of the lruvec stats. Johannes already fixed the
memcg stats using rstat infrastructure and Tejun has suggestions on
how to use rstat infrastructure efficiently for lruvec stats at
https://lore.kernel.org/linux-mm/YCFgr300eRiEZwpL@slm.duckdns.org/.
Huang, Ying March 11, 2021, 8:52 a.m. UTC | #2
Hi, Butt,

Shakeel Butt <shakeelb@google.com> writes:

> On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying <ying.huang@intel.com> wrote:
>> [...]
>
> Instead of playing with the already complicated heuristics, we should
> improve the accuracy of the lruvec stats. Johannes already fixed the
> memcg stats using rstat infrastructure and Tejun has suggestions on
> how to use rstat infrastructure efficiently for lruvec stats at
> https://lore.kernel.org/linux-mm/YCFgr300eRiEZwpL@slm.duckdns.org/.

Thanks for the information!  It would be better if we could improve the
accuracy of the lruvec stats without much overhead.  But that may not be
an easy task.

If my understanding is correct, what Tejun suggested is to add a fast
read interface to rstat for use in hot paths, with accuracy similar to
that of the traditional per-CPU counter.  But if we can regularly
update the lruvec rstat with something like vmstat_update(), that should
be OK for the issue described in this patch.

Best Regards,
Huang, Ying
Shakeel Butt March 14, 2021, 8:58 p.m. UTC | #3
On Thu, Mar 11, 2021 at 12:52 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Butt,
>
> Shakeel Butt <shakeelb@google.com> writes:
>
> > On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> [...]
> >
> > Instead of playing with the already complicated heuristics, we should
> > improve the accuracy of the lruvec stats. Johannes already fixed the
> > memcg stats using rstat infrastructure and Tejun has suggestions on
> > how to use rstat infrastructure efficiently for lruvec stats at
> > https://lore.kernel.org/linux-mm/YCFgr300eRiEZwpL@slm.duckdns.org/.
>
> Thanks for your information!  It should be better if we can improve the
> accuracy of lruvec stats without much overhead.  But that may be not a
> easy task.
>
> If my understanding were correct, what Tejun suggested is to add a fast
> read interface to rstat to be used in hot path.  And its accuracy is
> similar as that of traditional per-CPU counter.  But if we can regularly
> update the lruvec rstat with something like vmstat_update(), that should
> be OK for the issue described in this patch.
>

This is also my understanding. Tejun, please correct us if we misunderstood you.

BTW, Johannes was working on an rstat-based lruvec stats patch. Johannes,
are you planning to work on the optimization Tejun has suggested?
Tejun Heo March 14, 2021, 10:51 p.m. UTC | #4
Hello,

On Sun, Mar 14, 2021 at 01:58:33PM -0700, Shakeel Butt wrote:
> > If my understanding were correct, what Tejun suggested is to add a fast
> > read interface to rstat to be used in hot path.  And its accuracy is
> > similar as that of traditional per-CPU counter.  But if we can regularly
> > update the lruvec rstat with something like vmstat_update(), that should
> > be OK for the issue described in this patch.
> >
> 
> This is also my understanding. Tejun, please correct us if we misunderstood you.

Yeah, that was what I had in mind. Instead of always flushing on read, have
a variant where flushing is explicit and trigger that periodically (or in
whichever way is appropriate).

Thanks.
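
For illustration only: a userspace sketch of the two read variants discussed in this thread, one that flushes on every read and one where flushing is a separate, explicit step.  All names (stat_counter, stat_add(), stat_flush(), stat_read_fast(), stat_read_flush()) are hypothetical and do not correspond to the rstat API.

```c
#define STAT_NR_CPUS 4

struct stat_counter {
	long cached;			/* value seen by fast readers */
	long pending[STAT_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Hot-path update: touches only the given CPU's slot. */
static void stat_add(struct stat_counter *s, int cpu, long delta)
{
	s->pending[cpu] += delta;
}

/* Explicit flush: meant to be triggered periodically, not by readers. */
static void stat_flush(struct stat_counter *s)
{
	for (int cpu = 0; cpu < STAT_NR_CPUS; cpu++) {
		s->cached += s->pending[cpu];
		s->pending[cpu] = 0;
	}
}

/* Fast read: no flush, so the result may lag behind recent updates. */
static long stat_read_fast(const struct stat_counter *s)
{
	return s->cached;
}

/* Flush-on-read variant: always accurate, but pays for the flush. */
static long stat_read_flush(struct stat_counter *s)
{
	stat_flush(s);
	return s->cached;
}
```

In this model the fast read is only as stale as the interval between explicit flushes, which is the property the periodic-update idea relies on.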
Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fea6b43bc1f9..1304e3c98a14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2606,7 +2606,8 @@  static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	return inactive_lru_pages > pages_for_compaction;
 }
 
-static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
+static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc,
+			       bool skip_slab)
 {
 	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
 	struct mem_cgroup *memcg;
@@ -2652,8 +2653,9 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		shrink_lruvec(lruvec, sc);
 
-		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
-			    sc->priority);
+		if (!skip_slab)
+			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+				    sc->priority);
 
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
@@ -2669,6 +2671,7 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
+	bool skip_slab;
 	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -2767,7 +2770,15 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			anon >> sc->priority;
 	}
 
-	shrink_node_memcgs(pgdat, sc);
+	skip_slab = false;
+retry:
+	shrink_node_memcgs(pgdat, sc, skip_slab);
+	/* Nothing can be scanned with cache trim mode, retry without it */
+	if (sc->cache_trim_mode && sc->nr_scanned == nr_scanned) {
+		sc->cache_trim_mode = 0;
+		skip_slab = true;
+		goto retry;
+	}
 
 	if (reclaim_state) {
 		sc->nr_reclaimed += reclaim_state->reclaimed_slab;