[Regression,v5.0] mm: boosted kswapd reclaim b0rks system cache balance
diff mbox series

Message ID 20190807091858.2857-1-david@fromorbit.com
State New
Headers show
Series
  • [Regression,v5.0] mm: boosted kswapd reclaim b0rks system cache balance
Related show

Commit Message

Dave Chinner Aug. 7, 2019, 9:18 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small files
into the filesystem, I noticed that once memory fills up the cache
balance goes to hell.

The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.

It doesn't.

The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests taht the
inode cache shrinker is not being called when large batches of pages
are being reclaimed. In roughly the same time period that it takes
to fill memory with 50% pages and 50% slab caches, memory reclaim
reduces the page cache down to just dirty pages and slab caches fill
the entirity of memory.

At the point where the page cache is reduced to just the dirty
pages, there is a clear change in write IO patterns. Up to this
point it has been running at a steady 1500 write IOPS for ~200MB/s
of write throughtput (data, journal and metadata). Once the page
cache is trimmed to just dirty pages, the write IOPS immediately
start spiking to 5-10,000 IOPS and there is a noticable change in
IO sizes and completion times. The SSD is fast enough to soak these
up, so the measured performance is only slightly affected (numbers
below). It results in > ~50% throughput slowdowns on a spinning
disk with a NVRAM RAID cache in front of it, though. I didn't
capture the numbers at the time, and it takes far to long for me to
care to run it again and get them.

SSD perf degradation as the LRU empties to just dirty pages:

FSUse%        Count         Size    Files/sec     App Overhead
......
     0      4320000         4096      51349.6          1370049
     0      4480000         4096      48492.9          1362823
     0      4640000         4096      48473.8          1435623
     0      4800000         4096      46846.6          1370959
     0      4960000         4096      47086.6          1567401
     0      5120000         4096      46288.8          1368056
     0      5280000         4096      46003.2          1391604
     0      5440000         4096      46033.4          1458440
     0      5600000         4096      45035.1          1484132
     0      5760000         4096      43757.6          1492594
     0      5920000         4096      40739.4          1552104
     0      6080000         4096      37309.4          1627355
     0      6240000         4096      42003.3          1517357
.....
real    3m28.093s
user    0m57.852s
sys     14m28.193s

Average rate: 51724.6+/-2.4e+04 files/sec.

At first memory full point:

MemFree:          432676 kB
Active(file):      89820 kB
Inactive(file):  7330892 kB
Dirty:           1603576 kB
Writeback:          2908 kB
Slab:            6579384 kB
SReclaimable:    3727612 kB
SUnreclaim:      2851772 kB

A few seconds later at about half the page cache reclaimed:

MemFree:         1880588 kB
Active(file):      89948 kB
Inactive(file):  3021796 kB
Dirty:           1097072 kB
Writeback:          2600 kB
Slab:            8900912 kB
SReclaimable:    5060104 kB
SUnreclaim:      3840808 kB

And at about the 6080000 count point in the results above, right to
the end of the test:

MemFree:          574900 kB
Active(file):      89856 kB
Inactive(file):   483120 kB
Dirty:            372436 kB
Writeback:           324 kB
KReclaimable:    6506496 kB
Slab:           11898956 kB
SReclaimable:    6506496 kB
SUnreclaim:      5392460 kB


So the LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just dirty
pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would it up so high it
should trim the entire inode caches in a singel pass, yet it didn't
do anything. It was clear from tracing and live telemetry that the
shrinkers were pretty much not running except when there was
absolutely no memory free at all, and then they did the minimum
necessary to free memory to make progress.

So I went looking at the code, trying to find places where pages got
reclaimed and the shrinkers weren't called. There's only one -
kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
small amounts of memory when an external fragmentation event
occurs"). I'm not even using THP or allocating huge pages, so this
code should not be active or having any effect onmemory reclaim at
all, yet the majority of reclaim is being done with "boost" and so
it's not reclaiming slab caches at all. It will only free clean
pages from the LRU.

And so when we do run out memory, it switches to normal reclaim,
which hits dirty pages on the LRU and does some shrinker work, too,
but then appears to switch back to boosted reclaim one watermarks
are reached.

The patch below restores page cache vs inode cache balance for this
steady state workload. It balances out at about 40% page cache, 60%
slab cache, and sustained performance is 10-15% higher than without
this patch because the IO patterns remain in control of dirty
writeback and the filesystem, not kswapd.

Performance with boosted reclaim also running shrinkers over the
same steady state portion of the test as above.

FSUse%        Count         Size    Files/sec     App Overhead
......
     0      4320000         4096      51341.9          1409983
     0      4480000         4096      51157.5          1486421
     0      4640000         4096      52041.5          1421837
     0      4800000         4096      52124.2          1442169
     0      4960000         4096      56266.6          1388865
     0      5120000         4096      52892.2          1357650
     0      5280000         4096      51879.5          1326590
     0      5440000         4096      52178.7          1362889
     0      5600000         4096      53863.0          1345956
     0      5760000         4096      52538.7          1321422
     0      5920000         4096      53144.5          1336966
     0      6080000         4096      53647.7          1372146
     0      6240000         4096      52434.7          1362534

.....
real    3m11.543s
user    0m57.506s
sys     14m20.913s

Average rate: 57634.2+/-2.8e+04 files/sec.

So it completed ~10% faster (both wall time and create rate) and had
far better IO patterns so the differences would be even more
pronounced on slow storage.

This patch is not a fix, just a demonstration of the fact that the
heuristics this "boosted reclaim for compaction" are based on are
flawed, will have nasty side effects for users that don't use THP
and so needs revisiting.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Michal Hocko Aug. 7, 2019, 9:30 a.m. UTC | #1
[Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]

On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When running a simple, steady state 4kB file creation test to
> simulate extracting tarballs larger than memory full of small files
> into the filesystem, I noticed that once memory fills up the cache
> balance goes to hell.
> 
> The workload is creating one dirty cached inode for every dirty
> page, both of which should require a single IO each to clean and
> reclaim, and creation of inodes is throttled by the rate at which
> dirty writeback runs at (via balance dirty pages). Hence the ingest
> rate of new cached inodes and page cache pages is identical and
> steady. As a result, memory reclaim should quickly find a steady
> balance between page cache and inode caches.
> 
> It doesn't.
> 
> The moment memory fills, the page cache is reclaimed at a much
> faster rate than the inode cache, and evidence suggests taht the
> inode cache shrinker is not being called when large batches of pages
> are being reclaimed. In roughly the same time period that it takes
> to fill memory with 50% pages and 50% slab caches, memory reclaim
> reduces the page cache down to just dirty pages and slab caches fill
> the entirity of memory.
> 
> At the point where the page cache is reduced to just the dirty
> pages, there is a clear change in write IO patterns. Up to this
> point it has been running at a steady 1500 write IOPS for ~200MB/s
> of write throughtput (data, journal and metadata). Once the page
> cache is trimmed to just dirty pages, the write IOPS immediately
> start spiking to 5-10,000 IOPS and there is a noticable change in
> IO sizes and completion times. The SSD is fast enough to soak these
> up, so the measured performance is only slightly affected (numbers
> below). It results in > ~50% throughput slowdowns on a spinning
> disk with a NVRAM RAID cache in front of it, though. I didn't
> capture the numbers at the time, and it takes far to long for me to
> care to run it again and get them.
> 
> SSD perf degradation as the LRU empties to just dirty pages:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
> ......
>      0      4320000         4096      51349.6          1370049
>      0      4480000         4096      48492.9          1362823
>      0      4640000         4096      48473.8          1435623
>      0      4800000         4096      46846.6          1370959
>      0      4960000         4096      47086.6          1567401
>      0      5120000         4096      46288.8          1368056
>      0      5280000         4096      46003.2          1391604
>      0      5440000         4096      46033.4          1458440
>      0      5600000         4096      45035.1          1484132
>      0      5760000         4096      43757.6          1492594
>      0      5920000         4096      40739.4          1552104
>      0      6080000         4096      37309.4          1627355
>      0      6240000         4096      42003.3          1517357
> .....
> real    3m28.093s
> user    0m57.852s
> sys     14m28.193s
> 
> Average rate: 51724.6+/-2.4e+04 files/sec.
> 
> At first memory full point:
> 
> MemFree:          432676 kB
> Active(file):      89820 kB
> Inactive(file):  7330892 kB
> Dirty:           1603576 kB
> Writeback:          2908 kB
> Slab:            6579384 kB
> SReclaimable:    3727612 kB
> SUnreclaim:      2851772 kB
> 
> A few seconds later at about half the page cache reclaimed:
> 
> MemFree:         1880588 kB
> Active(file):      89948 kB
> Inactive(file):  3021796 kB
> Dirty:           1097072 kB
> Writeback:          2600 kB
> Slab:            8900912 kB
> SReclaimable:    5060104 kB
> SUnreclaim:      3840808 kB
> 
> And at about the 6080000 count point in the results above, right to
> the end of the test:
> 
> MemFree:          574900 kB
> Active(file):      89856 kB
> Inactive(file):   483120 kB
> Dirty:            372436 kB
> Writeback:           324 kB
> KReclaimable:    6506496 kB
> Slab:           11898956 kB
> SReclaimable:    6506496 kB
> SUnreclaim:      5392460 kB
> 
> 
> So the LRU is largely full of dirty pages, and we're getting spikes
> of random writeback from memory reclaim so it's all going to shit.
> Behaviour never recovers, the page cache remains pinned at just dirty
> pages, and nothing I could tune would make any difference.
> vfs_cache_pressure makes no difference - I would it up so high it
> should trim the entire inode caches in a singel pass, yet it didn't
> do anything. It was clear from tracing and live telemetry that the
> shrinkers were pretty much not running except when there was
> absolutely no memory free at all, and then they did the minimum
> necessary to free memory to make progress.
> 
> So I went looking at the code, trying to find places where pages got
> reclaimed and the shrinkers weren't called. There's only one -
> kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> small amounts of memory when an external fragmentation event
> occurs"). I'm not even using THP or allocating huge pages, so this
> code should not be active or having any effect onmemory reclaim at
> all, yet the majority of reclaim is being done with "boost" and so
> it's not reclaiming slab caches at all. It will only free clean
> pages from the LRU.
> 
> And so when we do run out memory, it switches to normal reclaim,
> which hits dirty pages on the LRU and does some shrinker work, too,
> but then appears to switch back to boosted reclaim one watermarks
> are reached.
> 
> The patch below restores page cache vs inode cache balance for this
> steady state workload. It balances out at about 40% page cache, 60%
> slab cache, and sustained performance is 10-15% higher than without
> this patch because the IO patterns remain in control of dirty
> writeback and the filesystem, not kswapd.
> 
> Performance with boosted reclaim also running shrinkers over the
> same steady state portion of the test as above.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
> ......
>      0      4320000         4096      51341.9          1409983
>      0      4480000         4096      51157.5          1486421
>      0      4640000         4096      52041.5          1421837
>      0      4800000         4096      52124.2          1442169
>      0      4960000         4096      56266.6          1388865
>      0      5120000         4096      52892.2          1357650
>      0      5280000         4096      51879.5          1326590
>      0      5440000         4096      52178.7          1362889
>      0      5600000         4096      53863.0          1345956
>      0      5760000         4096      52538.7          1321422
>      0      5920000         4096      53144.5          1336966
>      0      6080000         4096      53647.7          1372146
>      0      6240000         4096      52434.7          1362534
> 
> .....
> real    3m11.543s
> user    0m57.506s
> sys     14m20.913s
> 
> Average rate: 57634.2+/-2.8e+04 files/sec.
> 
> So it completed ~10% faster (both wall time and create rate) and had
> far better IO patterns so the differences would be even more
> pronounced on slow storage.
> 
> This patch is not a fix, just a demonstration of the fact that the
> heuristics this "boosted reclaim for compaction" are based on are
> flawed, will have nasty side effects for users that don't use THP
> and so needs revisiting.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9034570febd9..702e6523f8ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2748,10 +2748,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
>  			node_lru_pages += lru_pages;
>  
> -			if (sc->may_shrinkslab) {
> +			//if (sc->may_shrinkslab) {
>  				shrink_slab(sc->gfp_mask, pgdat->node_id,
>  				    memcg, sc->priority);
> -			}
> +			//}
>  
>  			/* Record the group's reclaim efficiency */
>  			vmpressure(sc->gfp_mask, memcg, false,
> -- 
> 2.23.0.rc1
Mel Gorman Aug. 7, 2019, 3:03 p.m. UTC | #2
On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
> 

More than likely.

> On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When running a simple, steady state 4kB file creation test to
> > simulate extracting tarballs larger than memory full of small files
> > into the filesystem, I noticed that once memory fills up the cache
> > balance goes to hell.
> > 

Ok, I'm assuming you are using fsmark with -k to keep files around,
and -S0 to leave cleaning to the background flush, a number of files per
iteration to get regular reporting and a total number of iterations to
fill memory to hit what you're seeing. I've created a configuration that
should do this but it'll take a long time to run on a local test machine.

I'm not 100% certain I guessed right as to get fsmark reports while memory
fills, it would have to be fewer files so each iteration would have to
preserve files. If the number of files per iteration is large enough to
fill memory then the drop in files/sec is not visible from the fs_mark
output (or we are using different versions). I guess you could just be
calculating average files/sec over the entire run based on elapsed time.

> > The workload is creating one dirty cached inode for every dirty
> > page, both of which should require a single IO each to clean and
> > reclaim, and creation of inodes is throttled by the rate at which
> > dirty writeback runs at (via balance dirty pages). Hence the ingest
> > rate of new cached inodes and page cache pages is identical and
> > steady. As a result, memory reclaim should quickly find a steady
> > balance between page cache and inode caches.
> > 
> > It doesn't.
> > 
> > The moment memory fills, the page cache is reclaimed at a much
> > faster rate than the inode cache, and evidence suggests taht the
> > inode cache shrinker is not being called when large batches of pages
> > are being reclaimed. In roughly the same time period that it takes
> > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > reduces the page cache down to just dirty pages and slab caches fill
> > the entirity of memory.
> > 
> > At the point where the page cache is reduced to just the dirty
> > pages, there is a clear change in write IO patterns. Up to this
> > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > of write throughtput (data, journal and metadata).

As observed by iostat -x or something else? Sum of r/s and w/s would
approximate iops but not the breakdown of whether it is data, journal
or metadata writes. The rest can be inferred from a blktrace but I would
prefer to replicate your setup as close as possible. If you're not using
fs_mark to report Files/sec, are you simply monitoring df -i over time?

> > <SNIP additional detail on fs_mark output>
> > <SNIP additional detail on monitoring meminfo over time>
> > <SNIP observations on dirty handling>

All understood.

> > So I went looking at the code, trying to find places where pages got
> > reclaimed and the shrinkers weren't called. There's only one -
> > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > small amounts of memory when an external fragmentation event
> > occurs"). I'm not even using THP or allocating huge pages, so this
> > code should not be active or having any effect onmemory reclaim at
> > all, yet the majority of reclaim is being done with "boost" and so
> > it's not reclaiming slab caches at all. It will only free clean
> > pages from the LRU.
> > 
> > And so when we do run out memory, it switches to normal reclaim,
> > which hits dirty pages on the LRU and does some shrinker work, too,
> > but then appears to switch back to boosted reclaim one watermarks
> > are reached.
> > 
> > The patch below restores page cache vs inode cache balance for this
> > steady state workload. It balances out at about 40% page cache, 60%
> > slab cache, and sustained performance is 10-15% higher than without
> > this patch because the IO patterns remain in control of dirty
> > writeback and the filesystem, not kswapd.
> > 
> > Performance with boosted reclaim also running shrinkers over the
> > same steady state portion of the test as above.
> > 

The boosting was not intended to target THP specifically -- it was meant
to help recover early from any fragmentation-related event for any user
that might need it. Hence, it's not tied to THP but even with THP
disabled, the boosting will still take effect.

One band-aid would be to disable watermark boosting entirely when THP is
disabled but that feels wrong. However, I would be interested in hearing
if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.

The intention behind avoiding slab reclaim from kswapd context is that the
boost is due to a fragmentation-causing event. Compaction cannot move slab
pages so reclaiming them in response to fragmentation does not directly
help. However, it can indirectly help by avoiding fragmentation-causing
events due to slab allocation like inodes and also mean that reclaim
behaviour is not special cased.

On that basis, it may justify ripping out the may_shrinkslab logic
everywhere. The downside is that some microbenchmarks will notice.
Specifically IO benchmarks that fill memory and reread (particularly
rereading the metadata via any inode operation) may show reduced
results. Such benchmarks can be strongly affected by whether the inode
information is still memory resident and watermark boosting reduces
the changes the data is still resident in memory. Technically still a
regression but a tunable one.

Hence the following "it builds" patch that has zero supporting data on
whether it's a good idea or not.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..6051a9007150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,9 +88,6 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
-	/* e.g. boosted watermark reclaim leaves slabs alone */
-	unsigned int may_shrinkslab:1;
-
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2714,10 +2711,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			if (sc->may_shrinkslab) {
-				shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
-			}
+			shrink_slab(sc->gfp_mask, pgdat->node_id,
+			    memcg, sc->priority);
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
@@ -3194,7 +3189,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3238,7 +3232,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
-		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3286,7 +3279,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
-		.may_shrinkslab = 1,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -3598,7 +3590,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
 		sc.may_swap = !nr_boost_reclaim;
-		sc.may_shrinkslab = !nr_boost_reclaim;
 
 		/*
 		 * Do some background aging of the anon list, to give
Mel Gorman Aug. 7, 2019, 8:56 p.m. UTC | #3
On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> <SNIP>
>
> On that basis, it may justify ripping out the may_shrinkslab logic
> everywhere. The downside is that some microbenchmarks will notice.
> Specifically IO benchmarks that fill memory and reread (particularly
> rereading the metadata via any inode operation) may show reduced
> results. Such benchmarks can be strongly affected by whether the inode
> information is still memory resident and watermark boosting reduces
> the changes the data is still resident in memory. Technically still a
> regression but a tunable one.
> 
> Hence the following "it builds" patch that has zero supporting data on
> whether it's a good idea or not.
> 

This is a more complete version of the same patch that summaries the
problem and includes data from my own testing

---8<---
mm, vmscan: Do not special-case slab reclaim when watermarks are boosted

Dave Chinner reported a problem pointing a finger at commit
1c30844d2dfe ("mm: reclaim small amounts of memory when an
external fragmentation event occurs"). The report is extensive (see
https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/)
and it's worth recording the most relevant parts (colorful language and
typos included).

	When running a simple, steady state 4kB file creation test to
	simulate extracting tarballs larger than memory full of small
	files into the filesystem, I noticed that once memory fills up
	the cache balance goes to hell.

	The workload is creating one dirty cached inode for every dirty
	page, both of which should require a single IO each to clean and
	reclaim, and creation of inodes is throttled by the rate at which
	dirty writeback runs at (via balance dirty pages). Hence the ingest
	rate of new cached inodes and page cache pages is identical and
	steady. As a result, memory reclaim should quickly find a steady
	balance between page cache and inode caches.

	The moment memory fills, the page cache is reclaimed at a much
	faster rate than the inode cache, and evidence suggests taht
	the inode cache shrinker is not being called when large batches
	of pages are being reclaimed. In roughly the same time period
	that it takes to fill memory with 50% pages and 50% slab caches,
	memory reclaim reduces the page cache down to just dirty pages
	and slab caches fill the entirity of memory.

	The LRU is largely full of dirty pages, and we're getting spikes
	of random writeback from memory reclaim so it's all going to shit.
	Behaviour never recovers, the page cache remains pinned at just
	dirty pages, and nothing I could tune would make any difference.
	vfs_cache_pressure makes no difference - I would it up so high
	it should trim the entire inode caches in a singel pass, yet it
	didn't do anything. It was clear from tracing and live telemetry
	that the shrinkers were pretty much not running except when
	there was absolutely no memory free at all, and then they did
	the minimum necessary to free memory to make progress.

	So I went looking at the code, trying to find places where pages
	got reclaimed and the shrinkers weren't called. There's only one
	- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
	reclaim small amounts of memory when an external fragmentation
	event occurs").

The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event". The boosting was not intended
to target THP specifically.  However, with Dave's perfectly reasonable
workload, fragmentation events can be very common given the ratio of slab
to page cache allocations so boosting remains active for long periods
of time.

As high-order allocations typically use compaction and compaction cannot
move slab pages the decision was made in the commit to special-case kswapd
when watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming
slab does not directly help compaction.

As Dave notes, this decision means that slab can be artificially protected
for long periods of time and messes up the balance with slab and page
caches.

Removing the special casing can still indirectly help fragmentation by
avoiding fragmentation-causing events due to slab allocation as pages
from a slab pageblock will have some slab objects freed.  Furthermore,
with the special casing, reclaim behaviour is unpredictable as kswapd
sometimes examines slab and sometimes does not in a manner that is tricky
to tune against and tricky to analyse.

This patch removes the special casing. The downside is that this is
not a universal performance win. Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when
slab reclaim is restored to its original behaviour. Similarly, some
benchmarks that only read-once or write-once may perform better when page
reclaim is too aggressive. The primary upside is that slab shrinker is
less surprising (arguably more sane but that's a matter of opinion) and
behaves consistently regardless of the fragmentation state of the system.

A fsmark benchmark configuration was constructed similar to
what Dave reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket machine
to avoid dealing with NUMA-related issues and the timing of reclaim. The
storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
the test data.

It is likely that the test configuration is not a proper match for Dave's
test as the results are different in terms of performance. However, my
configuration reports fsmark performance every 10% of memory worth of
files and I suspect Dave's configuration reported Files/sec when memory
was already full. THP was enabled for mine, disabled for Dave's and
probably a whole load of other methodology differences that rarely
get recorded properly.

fsmark
                                   5.3.0-rc3              5.3.0-rc3
                                     vanilla          shrinker-v1r1
Min       1-files/sec     5181.70 (   0.00%)     3204.20 ( -38.16%)
1st-qrtle 1-files/sec    14877.10 (   0.00%)     6596.90 ( -55.66%)
2nd-qrtle 1-files/sec     6521.30 (   0.00%)     5707.80 ( -12.47%)
3rd-qrtle 1-files/sec     5614.30 (   0.00%)     5363.80 (  -4.46%)
Max-1     1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
Max-5     1-files/sec    18028.40 (   0.00%)    17829.00 (  -1.11%)
Max-10    1-files/sec    17502.70 (   0.00%)    17080.90 (  -2.41%)
Max-90    1-files/sec     5438.80 (   0.00%)     5106.60 (  -6.11%)
Max-95    1-files/sec     5390.30 (   0.00%)     5020.40 (  -6.86%)
Max-99    1-files/sec     5271.20 (   0.00%)     3376.20 ( -35.95%)
Max       1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
Hmean     1-files/sec     7459.11 (   0.00%)     6249.49 ( -16.22%)
Stddev    1-files/sec     4733.16 (   0.00%)     4362.10 (   7.84%)
CoeffVar  1-files/sec       51.66 (   0.00%)       57.49 ( -11.29%)
BHmean-99 1-files/sec     7515.09 (   0.00%)     6351.81 ( -15.48%)
BHmean-95 1-files/sec     7625.39 (   0.00%)     6486.09 ( -14.94%)
BHmean-90 1-files/sec     7803.19 (   0.00%)     6588.61 ( -15.57%)
BHmean-75 1-files/sec     8518.74 (   0.00%)     6954.25 ( -18.37%)
BHmean-50 1-files/sec    10953.31 (   0.00%)     8017.89 ( -26.80%)
BHmean-25 1-files/sec    16732.38 (   0.00%)    11739.65 ( -29.84%)

                   5.3.0-rc3   5.3.0-rc3
                     vanillashrinker-v1r1
Duration User          77.29       89.09
Duration System      1097.13     1332.86
Duration Elapsed     2014.14     2596.39

This is showing that fsmark runs slower as a result of this patch but
there are other important observations that justify the patch.

1. With the vanilla kernel, the number of dirty pages in the system
   is very low for much of the test. With this patch, dirty pages
   is generally kept at 10% which matches vm.dirty_background_ratio
   which is normal expected historical behaviour.

2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
   0.95 for much of the test i.e. Slab is being left alone and dominating
   memory consumption. With the patch applied, the ratio varies between
   0.35 and 0.45 with the bulk of the measured ratios roughly half way
   between those values. This is a different balance to what Dave reported
   but it was at least consistent.

3. Slabs are scanned throughout the entire test with the patch applied.
   The vanille kernel has long periods with no scan activity and then
   relatively massive spikes.

4. Overall vmstats are closer to normal expectations

	                                5.3.0-rc3      5.3.0-rc3
	                                  vanilla  shrinker-v1r1
	Direct pages scanned             60308.00        5226.00
	Kswapd pages scanned          18316110.00    12295574.00
	Kswapd pages reclaimed        13121037.00     7280152.00
	Direct pages reclaimed           11817.00        5226.00
	Kswapd efficiency %                 71.64          59.21
	Kswapd velocity                   9093.76        4735.64
	Direct efficiency %                 19.59         100.00
	Direct velocity                     29.94           2.01
	Page reclaim immediate          247921.00           0.00
	Slabs scanned                 16602344.00    29369536.00
	Direct inode steals               1574.00         800.00
	Kswapd inode steals             130033.00     3968788.00
	Kswapd skipped wait                  0.00           0.00

	o Vanilla kernel is hitting direct reclaim more frequently,
	  not very much in absolute terms but the fact the patch
	  reduces it is interesting
	o "Page reclaim immediate" in the vanilla kernel indicates
	  dirty pages are being encountered at the tail of the LRU.
	  This is generally bad and means in this case that the LRU
	  is not long enough for dirty pages to be cleaned by the
	  background flush in time. This is eliminated by the
	  patch
	o With the patch, kswapd is reclaiming 30 times more slab
	  pages than with the vanilla kernel. This is indicative
	  of the watermark boosting over-protecting slab

A more complete set of tests is being run but the bottom line is that
special casing kswapd to avoid slab behaviour is unpredictable and can
lead to abnormal results for normal workloads. This patch restores the
expected behaviour that slab and page cache is balanced consistently for
a workload with a steady allocation ratio of slab/pagecache pages.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..6051a9007150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,9 +88,6 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
-	/* e.g. boosted watermark reclaim leaves slabs alone */
-	unsigned int may_shrinkslab:1;
-
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2714,10 +2711,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			if (sc->may_shrinkslab) {
-				shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
-			}
+			shrink_slab(sc->gfp_mask, pgdat->node_id,
+			    memcg, sc->priority);
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
@@ -3194,7 +3189,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3238,7 +3232,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
-		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3286,7 +3279,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
-		.may_shrinkslab = 1,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -3598,7 +3590,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
 		sc.may_swap = !nr_boost_reclaim;
-		sc.may_shrinkslab = !nr_boost_reclaim;
 
 		/*
 		 * Do some background aging of the anon list, to give
Dave Chinner Aug. 7, 2019, 10:08 p.m. UTC | #4
On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
> > 
> 
> More than likely.
> 
> > On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When running a simple, steady state 4kB file creation test to
> > > simulate extracting tarballs larger than memory full of small files
> > > into the filesystem, I noticed that once memory fills up the cache
> > > balance goes to hell.
> > > 
> 
> Ok, I'm assuming you are using fsmark with -k to keep files around,
> and -S0 to leave cleaning to the background flush, a number of files per
> iteration to get regular reporting and a total number of iterations to
> fill memory to hit what you're seeing. I've created a configuration that
> should do this but it'll take a long time to run on a local test machine.

./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  60 \
-d /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2 \
-d /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
-d /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8 \
-d /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
-d /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14 \
-d /mnt/scratch/15

This doesn't delete files at all, creates 160,000 files per
iteration in directories of 10,000 files at a time, and runs 60
iterations. It leaves all writeback (data and metadata) to
background kernel mechanisms.

> I'm not 100% certain I guessed right as to get fsmark reports while memory
> fills, it would have to be fewer files so each iteration would have to
> preserve files. If the number of files per iteration is large enough to
> fill memory then the drop in files/sec is not visible from the fs_mark
> output (or we are using different versions). I guess you could just be
> calculating average files/sec over the entire run based on elapsed time.

The files/s average is the average of the fsmark iteration output.
(i.e. the rate at which it creates each group of 160k files).

> 
> > > The workload is creating one dirty cached inode for every dirty
> > > page, both of which should require a single IO each to clean and
> > > reclaim, and creation of inodes is throttled by the rate at which
> > > dirty writeback runs at (via balance dirty pages). Hence the ingest
> > > rate of new cached inodes and page cache pages is identical and
> > > steady. As a result, memory reclaim should quickly find a steady
> > > balance between page cache and inode caches.
> > > 
> > > It doesn't.
> > > 
> > > The moment memory fills, the page cache is reclaimed at a much
> > > faster rate than the inode cache, and evidence suggests taht the
> > > inode cache shrinker is not being called when large batches of pages
> > > are being reclaimed. In roughly the same time period that it takes
> > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > reduces the page cache down to just dirty pages and slab caches fill
> > > the entirity of memory.
> > > 
> > > At the point where the page cache is reduced to just the dirty
> > > pages, there is a clear change in write IO patterns. Up to this
> > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > of write throughtput (data, journal and metadata).
> 
> As observed by iostat -x or something else? Sum of r/s and w/s would

PCP + live pmcharts. Same as I've done for 15+ years :)

I could look at iostat, but it's much easier to watch graphs run
and then be able to double click on any point and get the actual
value.

I've attached a screen shot of the test machine overview while the
vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
create/remove/lookup ops, context switch rate and memory usage) at a
1 second sample rate. You can see the IO patterns change, the
context switch rate go nuts and the CPU usage pattern change when
the page cache hits empty.

> approximate iops but not the breakdown of whether it is data, journal
> or metadata writes.

I have that in other charts - the log chart tells me how many log
IOs are being done (constant 30MB/s in ~150 IOs/sec). And the >1GB/s
IO spike every 30s is the metadata writeback being aggregated into
large IOs by metadata writeback. That doesn't change, either...

> The rest can be inferred from a blktrace but I would
> prefer to replicate your setup as close as possible. If you're not using
> fs_mark to report Files/sec, are you simply monitoring df -i over time?

The way I run fsmark is iterative - the count field tells you how
many inodes it has created...

> > > So I went looking at the code, trying to find places where pages got
> > > reclaimed and the shrinkers weren't called. There's only one -
> > > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > > small amounts of memory when an external fragmentation event
> > > occurs"). I'm not even using THP or allocating huge pages, so this
> > > code should not be active or having any effect onmemory reclaim at
> > > all, yet the majority of reclaim is being done with "boost" and so
> > > it's not reclaiming slab caches at all. It will only free clean
> > > pages from the LRU.
> > > 
> > > And so when we do run out memory, it switches to normal reclaim,
> > > which hits dirty pages on the LRU and does some shrinker work, too,
> > > but then appears to switch back to boosted reclaim one watermarks
> > > are reached.
> > > 
> > > The patch below restores page cache vs inode cache balance for this
> > > steady state workload. It balances out at about 40% page cache, 60%
> > > slab cache, and sustained performance is 10-15% higher than without
> > > this patch because the IO patterns remain in control of dirty
> > > writeback and the filesystem, not kswapd.
> > > 
> > > Performance with boosted reclaim also running shrinkers over the
> > > same steady state portion of the test as above.
> > > 
> 
> The boosting was not intended to target THP specifically -- it was meant
> to help recover early from any fragmentation-related event for any user
> that might need it. Hence, it's not tied to THP but even with THP
> disabled, the boosting will still take effect.
> 
> One band-aid would be to disable watermark boosting entirely when THP is
> disabled but that feels wrong. However, I would be interested in hearing
> if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.

<runs test>

Ok, it still runs it out of page cache, but it doesn't drive page
cache reclaim as hard once there's none left. The IO patterns are
less peaky, context switch rates are increased from ~3k/s to 15k/s
but remain pretty steady.

Test ran 5s faster and  file rate improved by ~2%. So it's better
once the page cache is largerly fully reclaimed, but it doesn't
prevent the page cache from being reclaimed instead of inodes....


> On that basis, it may justify ripping out the may_shrinkslab logic
> everywhere. The downside is that some microbenchmarks will notice.
> Specifically IO benchmarks that fill memory and reread (particularly
> rereading the metadata via any inode operation) may show reduced
> results.

Sure, but right now benchmarks that rely on page cache being
retained are being screwed over :)

> Such benchmarks can be strongly affected by whether the inode
> information is still memory resident and watermark boosting reduces
> the changes the data is still resident in memory. Technically still a
> regression but a tunable one.

/proc/sys/vm/vfs_cache_pressure is for tuning page cache/inode cache
balance. It should not occur as a side effect of watermark boosting.
Everyone knows about vfs_cache_pressure. Lots of people complain
when it doesn't work, and that's something watermark boosting to
change cache balance does.

Cheers,

Dave.
Dave Chinner Aug. 7, 2019, 10:32 p.m. UTC | #5
On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > <SNIP>
> >
> > On that basis, it may justify ripping out the may_shrinkslab logic
> > everywhere. The downside is that some microbenchmarks will notice.
> > Specifically IO benchmarks that fill memory and reread (particularly
> > rereading the metadata via any inode operation) may show reduced
> > results. Such benchmarks can be strongly affected by whether the inode
> > information is still memory resident and watermark boosting reduces
> > the changes the data is still resident in memory. Technically still a
> > regression but a tunable one.
> > 
> > Hence the following "it builds" patch that has zero supporting data on
> > whether it's a good idea or not.
> > 
> 
> This is a more complete version of the same patch that summaries the
> problem and includes data from my own testing
....
> A fsmark benchmark configuration was constructed similar to
> what Dave reported and is codified by the mmtest configuration
> config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket machine
> to avoid dealing with NUMA-related issues and the timing of reclaim. The
> storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> the test data.

Have you run fstrim on that drive recently? I'm running these tests
on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
ago I had all sorts of whacky performance problems and inconsistent
results. Turned out there were all sorts of random long IO latencies
occurring (in the hundreds of milliseconds) because the drive was
constantly running garbage collection to free up space. As a result
it was both blocking on GC and thermal throttling under these fsmark
workloads.

I made a new XFS filesystem on it (lazy man's rm -rf *), then ran
fstrim on it to tell the drive all the space is free. Drive temps
dropped 30C immediately, and all of the whacky performance anomolies
went away. I now fstrim the drive in my vm startup scripts before
each test run, and it's giving consistent results again.

> It is likely that the test configuration is not a proper match for Dave's
> test as the results are different in terms of performance. However, my
> configuration reports fsmark performance every 10% of memory worth of
> files and I suspect Dave's configuration reported Files/sec when memory
> was already full. THP was enabled for mine, disabled for Dave's and
> probably a whole load of other methodology differences that rarely
> get recorded properly.

Yup, like I forgot to mention that my test system is using a 4-node
fakenuma setup (i.e. 4 nodes, 4GB RAM and 4 CPUs per node, so
there are 4 separate kswapd's doing concurrent reclaim). That
changes reclaim patterns as well.


> fsmark
>                                    5.3.0-rc3              5.3.0-rc3
>                                      vanilla          shrinker-v1r1
> Min       1-files/sec     5181.70 (   0.00%)     3204.20 ( -38.16%)
> 1st-qrtle 1-files/sec    14877.10 (   0.00%)     6596.90 ( -55.66%)
> 2nd-qrtle 1-files/sec     6521.30 (   0.00%)     5707.80 ( -12.47%)
> 3rd-qrtle 1-files/sec     5614.30 (   0.00%)     5363.80 (  -4.46%)
> Max-1     1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> Max-5     1-files/sec    18028.40 (   0.00%)    17829.00 (  -1.11%)
> Max-10    1-files/sec    17502.70 (   0.00%)    17080.90 (  -2.41%)
> Max-90    1-files/sec     5438.80 (   0.00%)     5106.60 (  -6.11%)
> Max-95    1-files/sec     5390.30 (   0.00%)     5020.40 (  -6.86%)
> Max-99    1-files/sec     5271.20 (   0.00%)     3376.20 ( -35.95%)
> Max       1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> Hmean     1-files/sec     7459.11 (   0.00%)     6249.49 ( -16.22%)
> Stddev    1-files/sec     4733.16 (   0.00%)     4362.10 (   7.84%)
> CoeffVar  1-files/sec       51.66 (   0.00%)       57.49 ( -11.29%)
> BHmean-99 1-files/sec     7515.09 (   0.00%)     6351.81 ( -15.48%)
> BHmean-95 1-files/sec     7625.39 (   0.00%)     6486.09 ( -14.94%)
> BHmean-90 1-files/sec     7803.19 (   0.00%)     6588.61 ( -15.57%)
> BHmean-75 1-files/sec     8518.74 (   0.00%)     6954.25 ( -18.37%)
> BHmean-50 1-files/sec    10953.31 (   0.00%)     8017.89 ( -26.80%)
> BHmean-25 1-files/sec    16732.38 (   0.00%)    11739.65 ( -29.84%)
> 
>                    5.3.0-rc3   5.3.0-rc3
>                      vanillashrinker-v1r1
> Duration User          77.29       89.09
> Duration System      1097.13     1332.86
> Duration Elapsed     2014.14     2596.39

I'm not sure we are testing or measuring exactly the same things :)

> This is showing that fsmark runs slower as a result of this patch but
> there are other important observations that justify the patch.
> 
> 1. With the vanilla kernel, the number of dirty pages in the system
>    is very low for much of the test. With this patch, dirty pages
>    is generally kept at 10% which matches vm.dirty_background_ratio
>    which is normal expected historical behaviour.
> 
> 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
>    0.95 for much of the test i.e. Slab is being left alone and dominating
>    memory consumption. With the patch applied, the ratio varies between
>    0.35 and 0.45 with the bulk of the measured ratios roughly half way
>    between those values. This is a different balance to what Dave reported
>    but it was at least consistent.

Yeah, the balance is typically a bit different for different configs
and storage. The trick is getting the balance to be roughly
consistent across a range of different configs. The fakenuma setup
also has a significant impact on where the balance is found. And I
can't remember if the "fixed" memory usage numbers I quoted came
from a run with my "make XFS inode reclaim nonblocking" patchset or
not.

> 3. Slabs are scanned throughout the entire test with the patch applied.
>    The vanille kernel has long periods with no scan activity and then
>    relatively massive spikes.
> 
> 4. Overall vmstats are closer to normal expectations
> 
> 	                                5.3.0-rc3      5.3.0-rc3
> 	                                  vanilla  shrinker-v1r1
> 	Direct pages scanned             60308.00        5226.00
> 	Kswapd pages scanned          18316110.00    12295574.00
> 	Kswapd pages reclaimed        13121037.00     7280152.00
> 	Direct pages reclaimed           11817.00        5226.00
> 	Kswapd efficiency %                 71.64          59.21
> 	Kswapd velocity                   9093.76        4735.64
> 	Direct efficiency %                 19.59         100.00
> 	Direct velocity                     29.94           2.01
> 	Page reclaim immediate          247921.00           0.00
> 	Slabs scanned                 16602344.00    29369536.00
> 	Direct inode steals               1574.00         800.00
> 	Kswapd inode steals             130033.00     3968788.00
> 	Kswapd skipped wait                  0.00           0.00

That looks a lot better. Patch looks reasonable, though I'm
interested to know what impact it has on tests you ran in the
original commit for the boosting.

Cheers,

Dave.
Dave Chinner Aug. 7, 2019, 10:33 p.m. UTC | #6
On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > inode cache shrinker is not being called when large batches of pages
> > > > are being reclaimed. In roughly the same time period that it takes
> > > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > > reduces the page cache down to just dirty pages and slab caches fill
> > > > the entirity of memory.
> > > > 
> > > > At the point where the page cache is reduced to just the dirty
> > > > pages, there is a clear change in write IO patterns. Up to this
> > > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > > of write throughtput (data, journal and metadata).
> > 
> > As observed by iostat -x or something else? Sum of r/s and w/s would
> 
> PCP + live pmcharts. Same as I've done for 15+ years :)
> 
> I could look at iostat, but it's much easier to watch graphs run
> and then be able to double click on any point and get the actual
> value.
> 
> I've attached a screen shot of the test machine overview while the
> vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
> create/remove/lookup ops, context switch rate and memory usage) at a
> 1 second sample rate. You can see the IO patterns change, the
> context switch rate go nuts and the CPU usage pattern change when
> the page cache hits empty.

And now attached. :)

Cheers,

Dave.
Mel Gorman Aug. 7, 2019, 11:48 p.m. UTC | #7
On Thu, Aug 08, 2019 at 08:32:41AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > <SNIP>
> > >
> > > On that basis, it may justify ripping out the may_shrinkslab logic
> > > everywhere. The downside is that some microbenchmarks will notice.
> > > Specifically IO benchmarks that fill memory and reread (particularly
> > > rereading the metadata via any inode operation) may show reduced
> > > results. Such benchmarks can be strongly affected by whether the inode
> > > information is still memory resident and watermark boosting reduces
> > > the changes the data is still resident in memory. Technically still a
> > > regression but a tunable one.
> > > 
> > > Hence the following "it builds" patch that has zero supporting data on
> > > whether it's a good idea or not.
> > > 
> > 
> > This is a more complete version of the same patch that summaries the
> > problem and includes data from my own testing
> ....
> > A fsmark benchmark configuration was constructed similar to
> > what Dave reported and is codified by the mmtest configuration
> > config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket machine
> > to avoid dealing with NUMA-related issues and the timing of reclaim. The
> > storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> > the test data.
> 
> Have you run fstrim on that drive recently? I'm running these tests
> on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
> ago I had all sorts of whacky performance problems and inconsistent
> results. Turned out there were all sorts of random long IO latencies
> occurring (in the hundreds of milliseconds) because the drive was
> constantly running garbage collection to free up space. As a result
> it was both blocking on GC and thermal throttling under these fsmark
> workloads.
> 

No, I was under the impression that making a new filesystem typically
trimmed it as well. Maybe that's just some filesystems (e.g. ext4) or
just completely wrong.

> I made a new XFS filesystem on it (lazy man's rm -rf *),

Ah, all IO tests I do make a new filesystem. I know there is the whole
problem of filesystem aging but I've yet to come across a methodology
that two people can agree on that is a sensible, reproducible method.

> then ran
> fstrim on it to tell the drive all the space is free. Drive temps
> dropped 30C immediately, and all of the whacky performance anomolies
> went away. I now fstrim the drive in my vm startup scripts before
> each test run, and it's giving consistent results again.
> 

I'll replicate that if making a new filesystem is not guaranteed to
trim. It'll muck up historical data but that happens to me every so
often anyway.

> > It is likely that the test configuration is not a proper match for Dave's
> > test as the results are different in terms of performance. However, my
> > configuration reports fsmark performance every 10% of memory worth of
> > files and I suspect Dave's configuration reported Files/sec when memory
> > was already full. THP was enabled for mine, disabled for Dave's and
> > probably a whole load of other methodology differences that rarely
> > get recorded properly.
> 
> Yup, like I forgot to mention that my test system is using a 4-node
> fakenuma setup (i.e. 4 nodes, 4GB RAM and 4 CPUs per node, so
> there are 4 separate kswapd's doing concurrent reclaim). That
> changes reclaim patterns as well.
> 

Good to know. In this particular case, I don't think I need to exactly
replicate what you have given that the slam reclaim behaviour is
definitely more consistent and the ratios of slab/pagecache are
predictable.

> 
> > fsmark
> >                                    5.3.0-rc3              5.3.0-rc3
> >                                      vanilla          shrinker-v1r1
> > Min       1-files/sec     5181.70 (   0.00%)     3204.20 ( -38.16%)
> > 1st-qrtle 1-files/sec    14877.10 (   0.00%)     6596.90 ( -55.66%)
> > 2nd-qrtle 1-files/sec     6521.30 (   0.00%)     5707.80 ( -12.47%)
> > 3rd-qrtle 1-files/sec     5614.30 (   0.00%)     5363.80 (  -4.46%)
> > Max-1     1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> > Max-5     1-files/sec    18028.40 (   0.00%)    17829.00 (  -1.11%)
> > Max-10    1-files/sec    17502.70 (   0.00%)    17080.90 (  -2.41%)
> > Max-90    1-files/sec     5438.80 (   0.00%)     5106.60 (  -6.11%)
> > Max-95    1-files/sec     5390.30 (   0.00%)     5020.40 (  -6.86%)
> > Max-99    1-files/sec     5271.20 (   0.00%)     3376.20 ( -35.95%)
> > Max       1-files/sec    18463.00 (   0.00%)    18479.90 (   0.09%)
> > Hmean     1-files/sec     7459.11 (   0.00%)     6249.49 ( -16.22%)
> > Stddev    1-files/sec     4733.16 (   0.00%)     4362.10 (   7.84%)
> > CoeffVar  1-files/sec       51.66 (   0.00%)       57.49 ( -11.29%)
> > BHmean-99 1-files/sec     7515.09 (   0.00%)     6351.81 ( -15.48%)
> > BHmean-95 1-files/sec     7625.39 (   0.00%)     6486.09 ( -14.94%)
> > BHmean-90 1-files/sec     7803.19 (   0.00%)     6588.61 ( -15.57%)
> > BHmean-75 1-files/sec     8518.74 (   0.00%)     6954.25 ( -18.37%)
> > BHmean-50 1-files/sec    10953.31 (   0.00%)     8017.89 ( -26.80%)
> > BHmean-25 1-files/sec    16732.38 (   0.00%)    11739.65 ( -29.84%)
> > 
> >                    5.3.0-rc3   5.3.0-rc3
> >                      vanillashrinker-v1r1
> > Duration User          77.29       89.09
> > Duration System      1097.13     1332.86
> > Duration Elapsed     2014.14     2596.39
> 
> I'm not sure we are testing or measuring exactly the same things :)
> 

Probably not.

> > This is showing that fsmark runs slower as a result of this patch but
> > there are other important observations that justify the patch.
> > 
> > 1. With the vanilla kernel, the number of dirty pages in the system
> >    is very low for much of the test. With this patch, dirty pages
> >    is generally kept at 10% which matches vm.dirty_background_ratio
> >    which is normal expected historical behaviour.
> > 
> > 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
> >    0.95 for much of the test i.e. Slab is being left alone and dominating
> >    memory consumption. With the patch applied, the ratio varies between
> >    0.35 and 0.45 with the bulk of the measured ratios roughly half way
> >    between those values. This is a different balance to what Dave reported
> >    but it was at least consistent.
> 
> Yeah, the balance is typically a bit different for different configs
> and storage. The trick is getting the balance to be roughly
> consistent across a range of different configs. The fakenuma setup
> also has a significant impact on where the balance is found. And I
> can't remember if the "fixed" memory usage numbers I quoted came
> from a run with my "make XFS inode reclaim nonblocking" patchset or
> not.
> 

Again, I wouldn't sweat too much about it. The generated graphs
definitely showed more consistent behaviour even if the headline
performance was not improved.

> > 3. Slabs are scanned throughout the entire test with the patch applied.
> >    The vanille kernel has long periods with no scan activity and then
> >    relatively massive spikes.
> > 
> > 4. Overall vmstats are closer to normal expectations
> > 
> > 	                                5.3.0-rc3      5.3.0-rc3
> > 	                                  vanilla  shrinker-v1r1
> > 	Direct pages scanned             60308.00        5226.00
> > 	Kswapd pages scanned          18316110.00    12295574.00
> > 	Kswapd pages reclaimed        13121037.00     7280152.00
> > 	Direct pages reclaimed           11817.00        5226.00
> > 	Kswapd efficiency %                 71.64          59.21
> > 	Kswapd velocity                   9093.76        4735.64
> > 	Direct efficiency %                 19.59         100.00
> > 	Direct velocity                     29.94           2.01
> > 	Page reclaim immediate          247921.00           0.00
> > 	Slabs scanned                 16602344.00    29369536.00
> > 	Direct inode steals               1574.00         800.00
> > 	Kswapd inode steals             130033.00     3968788.00
> > 	Kswapd skipped wait                  0.00           0.00
> 
> That looks a lot better. Patch looks reasonable, though I'm
> interested to know what impact it has on tests you ran in the
> original commit for the boosting.
> 

I'll find out soon enough but I'm leaning on the side that kswapd reclaim
should be predictable and that even if there are some performance problems
as a result of it, there will be others that see a gain. It'll be a case
of "no matter what way you jump, someone shouts" but kswapd having spiky
unpredictable behaviour is a recipe for "sometimes my machine is crap
and I've no idea why".
Mel Gorman Aug. 7, 2019, 11:55 p.m. UTC | #8
On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
> > > 
> > 
> > More than likely.
> > 
> > > On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > When running a simple, steady state 4kB file creation test to
> > > > simulate extracting tarballs larger than memory full of small files
> > > > into the filesystem, I noticed that once memory fills up the cache
> > > > balance goes to hell.
> > > > 
> > 
> > Ok, I'm assuming you are using fsmark with -k to keep files around,
> > and -S0 to leave cleaning to the background flush, a number of files per
> > iteration to get regular reporting and a total number of iterations to
> > fill memory to hit what you're seeing. I've created a configuration that
> > should do this but it'll take a long time to run on a local test machine.
> 
> ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  60 \
> -d /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2 \
> -d /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> -d /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8 \
> -d /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> -d /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14 \
> -d /mnt/scratch/15
> 

Ok, I can replicate that. From a previous discussion my fs_mark setup is
able to replicate this sort of thing relatively easily. I'll tweak the
configuration. 

> This doesn't delete files at all, creates 160,000 files per
> iteration in directories of 10,000 files at a time, and runs 60
> iterations. It leaves all writeback (data and metadata) to
> background kernel mechanisms.
> 

Both our configurations left writeback to the background but I took a
different approach to get more reported results from fsmark.

> > I'm not 100% certain I guessed right as to get fsmark reports while memory
> > fills, it would have to be fewer files so each iteration would have to
> > preserve files. If the number of files per iteration is large enough to
> > fill memory then the drop in files/sec is not visible from the fs_mark
> > output (or we are using different versions). I guess you could just be
> > calculating average files/sec over the entire run based on elapsed time.
> 
> The files/s average is the average of the fsmark iteration output.
> (i.e. the rate at which it creates each group of 160k files).
> 

Good to confirm for sure.

> > 
> > > > The workload is creating one dirty cached inode for every dirty
> > > > page, both of which should require a single IO each to clean and
> > > > reclaim, and creation of inodes is throttled by the rate at which
> > > > dirty writeback runs at (via balance dirty pages). Hence the ingest
> > > > rate of new cached inodes and page cache pages is identical and
> > > > steady. As a result, memory reclaim should quickly find a steady
> > > > balance between page cache and inode caches.
> > > > 
> > > > It doesn't.
> > > > 
> > > > The moment memory fills, the page cache is reclaimed at a much
> > > > faster rate than the inode cache, and evidence suggests taht the
> > > > inode cache shrinker is not being called when large batches of pages
> > > > are being reclaimed. In roughly the same time period that it takes
> > > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > > reduces the page cache down to just dirty pages and slab caches fill
> > > > the entirity of memory.
> > > > 
> > > > At the point where the page cache is reduced to just the dirty
> > > > pages, there is a clear change in write IO patterns. Up to this
> > > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > > of write throughtput (data, journal and metadata).
> > 
> > As observed by iostat -x or something else? Sum of r/s and w/s would
> 
> PCP + live pmcharts. Same as I've done for 15+ years :)
> 

Yep. This is not the first time I thought I should wire PCP into
mmtests. There are historical reasons why it didn't happen but they are
getting less and less relevant.

> I could look at iostat, but it's much easier to watch graphs run
> and then be able to double click on any point and get the actual
> value.
> 

Can't argue with the logic.

> I've attached a screen shot of the test machine overview while the
> vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
> create/remove/lookup ops, context switch rate and memory usage) at a
> 1 second sample rate. You can see the IO patterns change, the
> context switch rate go nuts and the CPU usage pattern change when
> the page cache hits empty.
> 

Indeed.

> > approximate iops but not the breakdown of whether it is data, journal
> > or metadata writes.
> 
> I have that in other charts - the log chart tells me how many log
> IOs are being done (constant 30MB/s in ~150 IOs/sec). And the >1GB/s
> IO spike every 30s is the metadata writeback being aggregated into
> large IOs by metadata writeback. That doesn't change, either...
> 

Understood.

> > The rest can be inferred from a blktrace but I would
> > prefer to replicate your setup as close as possible. If you're not using
> > fs_mark to report Files/sec, are you simply monitoring df -i over time?
> 
> The way I run fsmark is iterative - the count field tells you how
> many inodes it has created...
> 

Understood, I'll update tests accordingly.

> > > > So I went looking at the code, trying to find places where pages got
> > > > reclaimed and the shrinkers weren't called. There's only one -
> > > > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > > > small amounts of memory when an external fragmentation event
> > > > occurs"). I'm not even using THP or allocating huge pages, so this
> > > > code should not be active or having any effect onmemory reclaim at
> > > > all, yet the majority of reclaim is being done with "boost" and so
> > > > it's not reclaiming slab caches at all. It will only free clean
> > > > pages from the LRU.
> > > > 
> > > > And so when we do run out memory, it switches to normal reclaim,
> > > > which hits dirty pages on the LRU and does some shrinker work, too,
> > > > but then appears to switch back to boosted reclaim one watermarks
> > > > are reached.
> > > > 
> > > > The patch below restores page cache vs inode cache balance for this
> > > > steady state workload. It balances out at about 40% page cache, 60%
> > > > slab cache, and sustained performance is 10-15% higher than without
> > > > this patch because the IO patterns remain in control of dirty
> > > > writeback and the filesystem, not kswapd.
> > > > 
> > > > Performance with boosted reclaim also running shrinkers over the
> > > > same steady state portion of the test as above.
> > > > 
> > 
> > The boosting was not intended to target THP specifically -- it was meant
> > to help recover early from any fragmentation-related event for any user
> > that might need it. Hence, it's not tied to THP but even with THP
> > disabled, the boosting will still take effect.
> > 
> > One band-aid would be to disable watermark boosting entirely when THP is
> > disabled but that feels wrong. However, I would be interested in hearing
> > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
> 
> <runs test>
> 
> Ok, it still runs it out of page cache, but it doesn't drive page
> cache reclaim as hard once there's none left. The IO patterns are
> less peaky, context switch rates are increased from ~3k/s to 15k/s
> but remain pretty steady.
> 
> Test ran 5s faster and  file rate improved by ~2%. So it's better
> once the page cache is largerly fully reclaimed, but it doesn't
> prevent the page cache from being reclaimed instead of inodes....
> 

Ok. Ideally you would also confirm the patch itself works as you want.
It *should* but an actual confirmation would be nice.

> 
> > On that basis, it may justify ripping out the may_shrinkslab logic
> > everywhere. The downside is that some microbenchmarks will notice.
> > Specifically IO benchmarks that fill memory and reread (particularly
> > rereading the metadata via any inode operation) may show reduced
> > results.
> 
> Sure, but right now benchmarks that rely on page cache being
> retained are being screwed over :)
> 

Yeah. This is a perfect example of a change that will help some workloads
and hurt others. However, as mentioned elsewhere, predictable behaviour
is far easier to work with and it's closer to the historical behaviour.

> > Such benchmarks can be strongly affected by whether the inode
> > information is still memory resident and watermark boosting reduces
> > the changes the data is still resident in memory. Technically still a
> > regression but a tunable one.
> 
> /proc/sys/vm/vfs_cache_pressure is for tuning page cache/inode cache
> balance. It should not occur as a side effect of watermark boosting.
> Everyone knows about vfs_cache_pressure. Lots of people complain
> when it doesn't work, and that's something watermark boosting to
> change cache balance does.
> 

And this -- the fact that there is actually a tuning knob to deal with
the tradeoffs is better than the current vanilla boost behaviour which
has side-effects that cannot be tuned against. I'll update the changelog
and resend the patch tomorrow assuming that it actually fixes your
problem.

Thanks Dave (for both the report and the analysis that made this a
no-brainer)
Dave Chinner Aug. 8, 2019, 12:26 a.m. UTC | #9
On Thu, Aug 08, 2019 at 12:48:15AM +0100, Mel Gorman wrote:
> On Thu, Aug 08, 2019 at 08:32:41AM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> > > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > <SNIP>
> > > >
> > > > On that basis, it may justify ripping out the may_shrinkslab logic
> > > > everywhere. The downside is that some microbenchmarks will notice.
> > > > Specifically IO benchmarks that fill memory and reread (particularly
> > > > rereading the metadata via any inode operation) may show reduced
> > > > results. Such benchmarks can be strongly affected by whether the inode
> > > > information is still memory resident and watermark boosting reduces
> > > > the changes the data is still resident in memory. Technically still a
> > > > regression but a tunable one.
> > > > 
> > > > Hence the following "it builds" patch that has zero supporting data on
> > > > whether it's a good idea or not.
> > > > 
> > > 
> > > This is a more complete version of the same patch that summaries the
> > > problem and includes data from my own testing
> > ....
> > > A fsmark benchmark configuration was constructed similar to
> > > what Dave reported and is codified by the mmtest configuration
> > > config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket machine
> > > to avoid dealing with NUMA-related issues and the timing of reclaim. The
> > > storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> > > the test data.
> > 
> > Have you run fstrim on that drive recently? I'm running these tests
> > on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
> > ago I had all sorts of whacky performance problems and inconsistent
> > results. Turned out there were all sorts of random long IO latencies
> > occurring (in the hundreds of milliseconds) because the drive was
> > constantly running garbage collection to free up space. As a result
> > it was both blocking on GC and thermal throttling under these fsmark
> > workloads.
> > 
> 
> No, I was under the impression that making a new filesystem typically
> trimmed it as well. Maybe that's just some filesystems (e.g. ext4) or
> just completely wrong.

Depends. IIRC, some have turned that off by default because of the
amount of poor implementations that take minutes to trim a whole
device. XFS discards by default, but that doesn't mean it actually
gets done. e.g. it might be on a block device that does not support
or pass down discard requests.

FWIW, I run these in a VM on a sparse filesystem image (500TB) held
in a file on the host XFS filesystem and:

$ cat /sys/block/vdc/queue/discard_max_bytes 
0

Discard requests don't pass down through the virtio block device
(nor do I really want them to). Hence I have to punch the image file
and fstrim on the host side before launching the VM that runs the
tests...

> > then ran
> > fstrim on it to tell the drive all the space is free. Drive temps
> > dropped 30C immediately, and all of the whacky performance anomolies
> > went away. I now fstrim the drive in my vm startup scripts before
> > each test run, and it's giving consistent results again.
> > 
> 
> I'll replicate that if making a new filesystem is not guaranteed to
> trim. It'll muck up historical data but that happens to me every so
> often anyway.

mkfs.xfs should be doing it if you're directly on top of the SSD.
Just wanted to check seeing as I've recently been bitten by this.

> > That looks a lot better. Patch looks reasonable, though I'm
> > interested to know what impact it has on tests you ran in the
> > original commit for the boosting.
> > 
> 
> I'll find out soon enough but I'm leaning on the side that kswapd reclaim
> should be predictable and that even if there are some performance problems
> as a result of it, there will be others that see a gain. It'll be a case
> of "no matter what way you jump, someone shouts" but kswapd having spiky
> unpredictable behaviour is a recipe for "sometimes my machine is crap
> and I've no idea why".

Yeah, and that's precisely the motiviation for getting XFS inode
reclaim to avoid blocking altogether and relying on memory reclaim
to back off when appropriate. I expect there will be other problems
I find with reclaim backoff and blance as a kick the tyres more...

Cheers,

Dave.
Dave Chinner Aug. 8, 2019, 12:30 a.m. UTC | #10
On Thu, Aug 08, 2019 at 12:55:34AM +0100, Mel Gorman wrote:
> On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > The boosting was not intended to target THP specifically -- it was meant
> > > to help recover early from any fragmentation-related event for any user
> > > that might need it. Hence, it's not tied to THP but even with THP
> > > disabled, the boosting will still take effect.
> > > 
> > > One band-aid would be to disable watermark boosting entirely when THP is
> > > disabled but that feels wrong. However, I would be interested in hearing
> > > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
> > 
> > <runs test>
> > 
> > Ok, it still runs it out of page cache, but it doesn't drive page
> > cache reclaim as hard once there's none left. The IO patterns are
> > less peaky, context switch rates are increased from ~3k/s to 15k/s
> > but remain pretty steady.
> > 
> > Test ran 5s faster and  file rate improved by ~2%. So it's better
> > once the page cache is largerly fully reclaimed, but it doesn't
> > prevent the page cache from being reclaimed instead of inodes....
> > 
> 
> Ok. Ideally you would also confirm the patch itself works as you want.
> It *should* but an actual confirmation would be nice.

Yup, I'll get to that later today.

Cheers,

Dave.
Dave Chinner Aug. 8, 2019, 5:51 a.m. UTC | #11
On Thu, Aug 08, 2019 at 10:30:25AM +1000, Dave Chinner wrote:
> On Thu, Aug 08, 2019 at 12:55:34AM +0100, Mel Gorman wrote:
> > On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > > The boosting was not intended to target THP specifically -- it was meant
> > > > to help recover early from any fragmentation-related event for any user
> > > > that might need it. Hence, it's not tied to THP but even with THP
> > > > disabled, the boosting will still take effect.
> > > > 
> > > > One band-aid would be to disable watermark boosting entirely when THP is
> > > > disabled but that feels wrong. However, I would be interested in hearing
> > > > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
> > > 
> > > <runs test>
> > > 
> > > Ok, it still runs it out of page cache, but it doesn't drive page
> > > cache reclaim as hard once there's none left. The IO patterns are
> > > less peaky, context switch rates are increased from ~3k/s to 15k/s
> > > but remain pretty steady.
> > > 
> > > Test ran 5s faster and  file rate improved by ~2%. So it's better
> > > once the page cache is largerly fully reclaimed, but it doesn't
> > > prevent the page cache from being reclaimed instead of inodes....
> > > 
> > 
> > Ok. Ideally you would also confirm the patch itself works as you want.
> > It *should* but an actual confirmation would be nice.
> 
> Yup, I'll get to that later today.

Looks good, does what it says on the tin.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Christoph Hellwig Aug. 8, 2019, 3:36 p.m. UTC | #12
> -			if (sc->may_shrinkslab) {
> -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> -				    memcg, sc->priority);
> -			}
> +			shrink_slab(sc->gfp_mask, pgdat->node_id,
> +			    memcg, sc->priority);

Not the most useful comment, but the indentation for the continuing line
is weird (already in the original code).  This should be something like:

			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
					sc->priority);
Mel Gorman Aug. 8, 2019, 5:04 p.m. UTC | #13
On Thu, Aug 08, 2019 at 08:36:58AM -0700, Christoph Hellwig wrote:
> > -			if (sc->may_shrinkslab) {
> > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -				    memcg, sc->priority);
> > -			}
> > +			shrink_slab(sc->gfp_mask, pgdat->node_id,
> > +			    memcg, sc->priority);
> 
> Not the most useful comment, but the indentation for the continuing line
> is weird (already in the original code).  This should be something like:
> 
> 			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> 					sc->priority);

If that's the worst you found then I take it as good news. I have not
sent a version with an updated changelog so I can fix it up.

Patch
diff mbox series

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9034570febd9..702e6523f8ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2748,10 +2748,10 @@  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			if (sc->may_shrinkslab) {
+			//if (sc->may_shrinkslab) {
 				shrink_slab(sc->gfp_mask, pgdat->node_id,
 				    memcg, sc->priority);
-			}
+			//}
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,