Message ID: 20190807091858.2857-1-david@fromorbit.com
Series: [Regression, v5.0] mm: boosted kswapd reclaim b0rks system cache balance
[Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]

On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When running a simple, steady state 4kB file creation test to
> simulate extracting tarballs larger than memory full of small files
> into the filesystem, I noticed that once memory fills up the cache
> balance goes to hell.
>
> The workload is creating one dirty cached inode for every dirty
> page, both of which should require a single IO each to clean and
> reclaim, and creation of inodes is throttled by the rate at which
> dirty writeback runs (via balance dirty pages). Hence the ingest
> rate of new cached inodes and page cache pages is identical and
> steady. As a result, memory reclaim should quickly find a steady
> balance between page cache and inode caches.
>
> It doesn't.
>
> The moment memory fills, the page cache is reclaimed at a much
> faster rate than the inode cache, and evidence suggests that the
> inode cache shrinker is not being called when large batches of pages
> are being reclaimed. In roughly the same time period that it takes
> to fill memory with 50% pages and 50% slab caches, memory reclaim
> reduces the page cache down to just dirty pages and slab caches fill
> the entirety of memory.
>
> At the point where the page cache is reduced to just the dirty
> pages, there is a clear change in write IO patterns. Up to this
> point it has been running at a steady 1500 write IOPS for ~200MB/s
> of write throughput (data, journal and metadata). Once the page
> cache is trimmed to just dirty pages, the write IOPS immediately
> start spiking to 5-10,000 IOPS and there is a noticeable change in
> IO sizes and completion times. The SSD is fast enough to soak these
> up, so the measured performance is only slightly affected (numbers
> below). It results in ~50% throughput slowdowns on a spinning
> disk with a NVRAM RAID cache in front of it, though. I didn't
> capture the numbers at the time, and it takes far too long for me to
> care to run it again and get them.
>
> SSD perf degradation as the LRU empties to just dirty pages:
>
> FSUse%        Count         Size    Files/sec     App Overhead
> ......
>      0      4320000         4096      51349.6          1370049
>      0      4480000         4096      48492.9          1362823
>      0      4640000         4096      48473.8          1435623
>      0      4800000         4096      46846.6          1370959
>      0      4960000         4096      47086.6          1567401
>      0      5120000         4096      46288.8          1368056
>      0      5280000         4096      46003.2          1391604
>      0      5440000         4096      46033.4          1458440
>      0      5600000         4096      45035.1          1484132
>      0      5760000         4096      43757.6          1492594
>      0      5920000         4096      40739.4          1552104
>      0      6080000         4096      37309.4          1627355
>      0      6240000         4096      42003.3          1517357
> .....
>
> real    3m28.093s
> user    0m57.852s
> sys     14m28.193s
>
> Average rate: 51724.6+/-2.4e+04 files/sec.
>
> At first memory full point:
>
> MemFree:          432676 kB
> Active(file):      89820 kB
> Inactive(file):  7330892 kB
> Dirty:           1603576 kB
> Writeback:          2908 kB
> Slab:            6579384 kB
> SReclaimable:    3727612 kB
> SUnreclaim:      2851772 kB
>
> A few seconds later at about half the page cache reclaimed:
>
> MemFree:         1880588 kB
> Active(file):      89948 kB
> Inactive(file):  3021796 kB
> Dirty:           1097072 kB
> Writeback:          2600 kB
> Slab:            8900912 kB
> SReclaimable:    5060104 kB
> SUnreclaim:      3840808 kB
>
> And at about the 6080000 count point in the results above, right to
> the end of the test:
>
> MemFree:          574900 kB
> Active(file):      89856 kB
> Inactive(file):   483120 kB
> Dirty:            372436 kB
> Writeback:           324 kB
> KReclaimable:    6506496 kB
> Slab:           11898956 kB
> SReclaimable:    6506496 kB
> SUnreclaim:      5392460 kB
>
> So the LRU is largely full of dirty pages, and we're getting spikes
> of random writeback from memory reclaim so it's all going to shit.
> Behaviour never recovers, the page cache remains pinned at just dirty
> pages, and nothing I could tune would make any difference.
> vfs_cache_pressure makes no difference - I wound it up so high it
> should trim the entire inode cache in a single pass, yet it didn't
> do anything. It was clear from tracing and live telemetry that the
> shrinkers were pretty much not running except when there was
> absolutely no memory free at all, and then they did the minimum
> necessary to free memory to make progress.
>
> So I went looking at the code, trying to find places where pages got
> reclaimed and the shrinkers weren't called. There's only one -
> kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> small amounts of memory when an external fragmentation event
> occurs"). I'm not even using THP or allocating huge pages, so this
> code should not be active or having any effect on memory reclaim at
> all, yet the majority of reclaim is being done with "boost" and so
> it's not reclaiming slab caches at all. It will only free clean
> pages from the LRU.
>
> And so when we do run out of memory, it switches to normal reclaim,
> which hits dirty pages on the LRU and does some shrinker work, too,
> but then appears to switch back to boosted reclaim once watermarks
> are reached.
>
> The patch below restores page cache vs inode cache balance for this
> steady state workload. It balances out at about 40% page cache, 60%
> slab cache, and sustained performance is 10-15% higher than without
> this patch because the IO patterns remain in control of dirty
> writeback and the filesystem, not kswapd.
>
> Performance with boosted reclaim also running shrinkers over the
> same steady state portion of the test as above.
>
> FSUse%        Count         Size    Files/sec     App Overhead
> ......
>      0      4320000         4096      51341.9          1409983
>      0      4480000         4096      51157.5          1486421
>      0      4640000         4096      52041.5          1421837
>      0      4800000         4096      52124.2          1442169
>      0      4960000         4096      56266.6          1388865
>      0      5120000         4096      52892.2          1357650
>      0      5280000         4096      51879.5          1326590
>      0      5440000         4096      52178.7          1362889
>      0      5600000         4096      53863.0          1345956
>      0      5760000         4096      52538.7          1321422
>      0      5920000         4096      53144.5          1336966
>      0      6080000         4096      53647.7          1372146
>      0      6240000         4096      52434.7          1362534
> .....
>
> real    3m11.543s
> user    0m57.506s
> sys     14m20.913s
>
> Average rate: 57634.2+/-2.8e+04 files/sec.
>
> So it completed ~10% faster (both wall time and create rate) and had
> far better IO patterns, so the differences would be even more
> pronounced on slow storage.
>
> This patch is not a fix, just a demonstration of the fact that the
> heuristics this "boosted reclaim for compaction" is based on are
> flawed, will have nasty side effects for users that don't use THP,
> and so need revisiting.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9034570febd9..702e6523f8ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2748,10 +2748,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
>  		node_lru_pages += lru_pages;
>
> -		if (sc->may_shrinkslab) {
> +		//if (sc->may_shrinkslab) {
>  			shrink_slab(sc->gfp_mask, pgdat->node_id,
>  				    memcg, sc->priority);
> -		}
> +		//}
>
>  		/* Record the group's reclaim efficiency */
>  		vmpressure(sc->gfp_mask, memcg, false,
> --
> 2.23.0.rc1
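For context, this is roughly the v5.0-era logic that the hack above
comments out, reconstructed from the mm/vmscan.c hunks quoted later in
this thread. It is a sketch of the relevant lines, not a buildable
excerpt:

	/* balance_pgdat(): kswapd sums the per-zone watermark boosts... */
	nr_boost_reclaim = 0;
	for (i = 0; i <= classzone_idx; i++) {
		zone = pgdat->node_zones + i;
		if (!managed_zone(zone))
			continue;
		nr_boost_reclaim += zone->watermark_boost;
	}

	/*
	 * ...and any non-zero boost switches off writeback, swap and
	 * slab reclaim for the reclaim pass:
	 */
	sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
	sc.may_swap = !nr_boost_reclaim;
	sc.may_shrinkslab = !nr_boost_reclaim;

	/*
	 * shrink_node(): with may_shrinkslab clear the shrinkers are
	 * never called, so boosted reclaim frees only LRU pages:
	 */
	if (sc->may_shrinkslab) {
		shrink_slab(sc->gfp_mask, pgdat->node_id,
			    memcg, sc->priority);
	}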
On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]

More than likely.

> On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When running a simple, steady state 4kB file creation test to
> > simulate extracting tarballs larger than memory full of small files
> > into the filesystem, I noticed that once memory fills up the cache
> > balance goes to hell.

Ok, I'm assuming you are using fsmark with -k to keep files around,
and -S0 to leave cleaning to the background flush, a number of files per
iteration to get regular reporting and a total number of iterations to
fill memory to hit what you're seeing. I've created a configuration that
should do this but it'll take a long time to run on a local test machine.

I'm not 100% certain I guessed right: to get fsmark reports while memory
fills, it would have to be fewer files per iteration and each iteration
would have to preserve files. If the number of files per iteration is
large enough to fill memory then the drop in files/sec is not visible
from the fs_mark output (or we are using different versions). I guess
you could just be calculating average files/sec over the entire run
based on elapsed time.

> > The workload is creating one dirty cached inode for every dirty
> > page, both of which should require a single IO each to clean and
> > reclaim, and creation of inodes is throttled by the rate at which
> > dirty writeback runs (via balance dirty pages). Hence the ingest
> > rate of new cached inodes and page cache pages is identical and
> > steady. As a result, memory reclaim should quickly find a steady
> > balance between page cache and inode caches.
> >
> > It doesn't.
> >
> > The moment memory fills, the page cache is reclaimed at a much
> > faster rate than the inode cache, and evidence suggests that the
> > inode cache shrinker is not being called when large batches of pages
> > are being reclaimed. In roughly the same time period that it takes
> > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > reduces the page cache down to just dirty pages and slab caches fill
> > the entirety of memory.
> >
> > At the point where the page cache is reduced to just the dirty
> > pages, there is a clear change in write IO patterns. Up to this
> > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > of write throughput (data, journal and metadata).

As observed by iostat -x or something else? Sum of r/s and w/s would
approximate iops but not the breakdown of whether it is data, journal
or metadata writes. The rest can be inferred from a blktrace but I would
prefer to replicate your setup as close as possible. If you're not using
fs_mark to report Files/sec, are you simply monitoring df -i over time?

> > <SNIP additional detail on fs_mark output>
> > <SNIP additional detail on monitoring meminfo over time>
> > <SNIP observations on dirty handling>

All understood.

> > So I went looking at the code, trying to find places where pages got
> > reclaimed and the shrinkers weren't called. There's only one -
> > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > small amounts of memory when an external fragmentation event
> > occurs"). I'm not even using THP or allocating huge pages, so this
> > code should not be active or having any effect on memory reclaim at
> > all, yet the majority of reclaim is being done with "boost" and so
> > it's not reclaiming slab caches at all.
> > It will only free clean
> > pages from the LRU.
> >
> > And so when we do run out of memory, it switches to normal reclaim,
> > which hits dirty pages on the LRU and does some shrinker work, too,
> > but then appears to switch back to boosted reclaim once watermarks
> > are reached.
> >
> > The patch below restores page cache vs inode cache balance for this
> > steady state workload. It balances out at about 40% page cache, 60%
> > slab cache, and sustained performance is 10-15% higher than without
> > this patch because the IO patterns remain in control of dirty
> > writeback and the filesystem, not kswapd.
> >
> > Performance with boosted reclaim also running shrinkers over the
> > same steady state portion of the test as above.

The boosting was not intended to target THP specifically -- it was meant
to help recover early from any fragmentation-related event for any user
that might need it. Hence, it's not tied to THP but even with THP
disabled, the boosting will still take effect.

One band-aid would be to disable watermark boosting entirely when THP is
disabled but that feels wrong. However, I would be interested in hearing
if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.

The intention behind avoiding slab reclaim from kswapd context is that
the boost is due to a fragmentation-causing event. Compaction cannot
move slab pages so reclaiming them in response to fragmentation does
not directly help. However, it can indirectly help by avoiding
fragmentation-causing events due to slab allocation like inodes, and it
also means that reclaim behaviour is not special-cased. On that basis,
it may justify ripping out the may_shrinkslab logic everywhere.

The downside is that some microbenchmarks will notice. Specifically IO
benchmarks that fill memory and reread (particularly rereading the
metadata via any inode operation) may show reduced results. Such
benchmarks can be strongly affected by whether the inode information is
still memory resident and watermark boosting reduces the chances the
data is still resident in memory. Technically still a regression but a
tunable one.

Hence the following "it builds" patch that has zero supporting data on
whether it's a good idea or not.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..6051a9007150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,9 +88,6 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
-	/* e.g. boosted watermark reclaim leaves slabs alone */
-	unsigned int may_shrinkslab:1;
-
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2714,10 +2711,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 		node_lru_pages += lru_pages;
 
-		if (sc->may_shrinkslab) {
-			shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
-		}
+		shrink_slab(sc->gfp_mask, pgdat->node_id,
+			    memcg, sc->priority);
 
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
@@ -3194,7 +3189,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3238,7 +3232,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
-		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3286,7 +3279,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
-		.may_shrinkslab = 1,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -3598,7 +3590,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	 */
 	sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
 	sc.may_swap = !nr_boost_reclaim;
-	sc.may_shrinkslab = !nr_boost_reclaim;
 
 	/*
 	 * Do some background aging of the anon list, to give
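For reference, a sketch from memory of where the boost Mel refers to
comes from: boost_watermark() as introduced by 1c30844d2dfe in
mm/page_alloc.c. Treat the details as approximate and check the actual
tree:

	/*
	 * On an external fragmentation event (a fallback allocation
	 * stealing pages from another migratetype), the zone watermark
	 * is boosted by a pageblock's worth of pages, capped at
	 * watermark_boost_factor (default 15000, i.e. 150%) of the
	 * high watermark. kswapd then reclaims until the boosted
	 * watermark is met; vm.watermark_boost_factor=0 disables it.
	 */
	static inline void boost_watermark(struct zone *zone)
	{
		unsigned long max_boost;

		if (!watermark_boost_factor)
			return;

		max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
				watermark_boost_factor, 10000);
		max_boost = max(pageblock_nr_pages, max_boost);

		zone->watermark_boost = min(zone->watermark_boost +
					pageblock_nr_pages, max_boost);
	}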
On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> <SNIP>
>
> On that basis, it may justify ripping out the may_shrinkslab logic
> everywhere. The downside is that some microbenchmarks will notice.
> Specifically IO benchmarks that fill memory and reread (particularly
> rereading the metadata via any inode operation) may show reduced
> results. Such benchmarks can be strongly affected by whether the inode
> information is still memory resident and watermark boosting reduces
> the chances the data is still resident in memory. Technically still a
> regression but a tunable one.
>
> Hence the following "it builds" patch that has zero supporting data on
> whether it's a good idea or not.

This is a more complete version of the same patch that summarises the
problem and includes data from my own testing

---8<---
mm, vmscan: Do not special-case slab reclaim when watermarks are boosted

Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation
event occurs"). The report is extensive (see
https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/)
and it's worth recording the most relevant parts (colorful language
included).

    When running a simple, steady state 4kB file creation test to
    simulate extracting tarballs larger than memory full of small
    files into the filesystem, I noticed that once memory fills up
    the cache balance goes to hell.

    The workload is creating one dirty cached inode for every dirty
    page, both of which should require a single IO each to clean and
    reclaim, and creation of inodes is throttled by the rate at which
    dirty writeback runs (via balance dirty pages). Hence the ingest
    rate of new cached inodes and page cache pages is identical and
    steady. As a result, memory reclaim should quickly find a steady
    balance between page cache and inode caches.

    The moment memory fills, the page cache is reclaimed at a much
    faster rate than the inode cache, and evidence suggests that the
    inode cache shrinker is not being called when large batches of
    pages are being reclaimed. In roughly the same time period that
    it takes to fill memory with 50% pages and 50% slab caches,
    memory reclaim reduces the page cache down to just dirty pages
    and slab caches fill the entirety of memory.

    The LRU is largely full of dirty pages, and we're getting spikes
    of random writeback from memory reclaim so it's all going to
    shit. Behaviour never recovers, the page cache remains pinned at
    just dirty pages, and nothing I could tune would make any
    difference. vfs_cache_pressure makes no difference - I wound it
    up so high it should trim the entire inode cache in a single
    pass, yet it didn't do anything. It was clear from tracing and
    live telemetry that the shrinkers were pretty much not running
    except when there was absolutely no memory free at all, and then
    they did the minimum necessary to free memory to make progress.

    So I went looking at the code, trying to find places where pages
    got reclaimed and the shrinkers weren't called. There's only one
    - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
    reclaim small amounts of memory when an external fragmentation
    event occurs").

The watermark boosting introduced by the commit is triggered in
response to an allocation "fragmentation event". The boosting was not
intended to target THP specifically.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations,
so boosting remains active for long periods of time. As high-order
allocations typically use compaction, and compaction cannot move slab
pages, the decision was made in the commit to special-case kswapd when
watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming
slab does not directly help compaction. As Dave notes, this decision
means that slab can be artificially protected for long periods of time
and messes up the balance between slab and page caches.

Removing the special casing can still indirectly help fragmentation by
avoiding fragmentation-causing events due to slab allocation, as pages
from a slab pageblock will have some slab objects freed. Furthermore,
with the special casing, reclaim behaviour is unpredictable as kswapd
sometimes examines slab and sometimes does not in a manner that is
tricky to tune against and tricky to analyse.

This patch removes the special casing. The downside is that this is not
a universal performance win. Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when
slab reclaim is restored to its original behaviour. Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive. The primary upside is that the slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion) and behaves consistently regardless of the fragmentation state
of the system.

A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtests configuration
config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim. The storage was an SSD Samsung Evo and a fresh XFS filesystem
was used for the test data.

It is likely that the test configuration is not a proper match for
Dave's test as the results are different in terms of performance.
However, my configuration reports fsmark performance every 10% of
memory worth of files and I suspect Dave's configuration reported
Files/sec when memory was already full. THP was enabled for mine,
disabled for Dave's, and probably a whole load of other methodology
differences that rarely get recorded properly.
fsmark
                                 5.3.0-rc3              5.3.0-rc3
                                   vanilla          shrinker-v1r1
Min       1-files/sec      5181.70 (   0.00%)     3204.20 ( -38.16%)
1st-qrtle 1-files/sec     14877.10 (   0.00%)     6596.90 ( -55.66%)
2nd-qrtle 1-files/sec      6521.30 (   0.00%)     5707.80 ( -12.47%)
3rd-qrtle 1-files/sec      5614.30 (   0.00%)     5363.80 (  -4.46%)
Max-1     1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
Max-5     1-files/sec     18028.40 (   0.00%)    17829.00 (  -1.11%)
Max-10    1-files/sec     17502.70 (   0.00%)    17080.90 (  -2.41%)
Max-90    1-files/sec      5438.80 (   0.00%)     5106.60 (  -6.11%)
Max-95    1-files/sec      5390.30 (   0.00%)     5020.40 (  -6.86%)
Max-99    1-files/sec      5271.20 (   0.00%)     3376.20 ( -35.95%)
Max       1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
Hmean     1-files/sec      7459.11 (   0.00%)     6249.49 ( -16.22%)
Stddev    1-files/sec      4733.16 (   0.00%)     4362.10 (   7.84%)
CoeffVar  1-files/sec        51.66 (   0.00%)       57.49 ( -11.29%)
BHmean-99 1-files/sec      7515.09 (   0.00%)     6351.81 ( -15.48%)
BHmean-95 1-files/sec      7625.39 (   0.00%)     6486.09 ( -14.94%)
BHmean-90 1-files/sec      7803.19 (   0.00%)     6588.61 ( -15.57%)
BHmean-75 1-files/sec      8518.74 (   0.00%)     6954.25 ( -18.37%)
BHmean-50 1-files/sec     10953.31 (   0.00%)     8017.89 ( -26.80%)
BHmean-25 1-files/sec     16732.38 (   0.00%)    11739.65 ( -29.84%)

                  5.3.0-rc3      5.3.0-rc3
                    vanilla  shrinker-v1r1
Duration User         77.29          89.09
Duration System     1097.13        1332.86
Duration Elapsed    2014.14        2596.39

This is showing that fsmark runs slower as a result of this patch but
there are other important observations that justify the patch.

1. With the vanilla kernel, the number of dirty pages in the system
   is very low for much of the test. With this patch, dirty pages
   are generally kept at 10% which matches vm.dirty_background_ratio
   which is normal expected historical behaviour.

2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
   0.95 for much of the test i.e. Slab is being left alone and
   dominating memory consumption. With the patch applied, the ratio
   varies between 0.35 and 0.45 with the bulk of the measured ratios
   roughly half way between those values. This is a different balance
   to what Dave reported but it was at least consistent.

3. Slabs are scanned throughout the entire test with the patch applied.
   The vanilla kernel has long periods with no scan activity and then
   relatively massive spikes.

4. Overall vmstats are closer to normal expectations

                                     5.3.0-rc3      5.3.0-rc3
                                       vanilla  shrinker-v1r1
   Direct pages scanned               60308.00        5226.00
   Kswapd pages scanned            18316110.00    12295574.00
   Kswapd pages reclaimed          13121037.00     7280152.00
   Direct pages reclaimed             11817.00        5226.00
   Kswapd efficiency %                   71.64          59.21
   Kswapd velocity                     9093.76        4735.64
   Direct efficiency %                   19.59         100.00
   Direct velocity                       29.94           2.01
   Page reclaim immediate            247921.00           0.00
   Slabs scanned                   16602344.00    29369536.00
   Direct inode steals                 1574.00         800.00
   Kswapd inode steals               130033.00     3968788.00
   Kswapd skipped wait                    0.00           0.00

o Vanilla kernel is hitting direct reclaim more frequently, not very
  much in absolute terms, but the fact the patch reduces it is
  interesting

o "Page reclaim immediate" in the vanilla kernel indicates dirty pages
  are being encountered at the tail of the LRU. This is generally bad
  and means in this case that the LRU is not long enough for dirty
  pages to be cleaned by the background flush in time. This is
  eliminated by the patch

o With the patch, kswapd is reclaiming 30 times more slab pages than
  with the vanilla kernel.
  This is indicative of the watermark boosting over-protecting slab.

A more complete set of tests is being run but the bottom line is that
special-casing kswapd to avoid reclaiming slab leads to unpredictable
behaviour and can produce abnormal results for normal workloads. This
patch restores the expected behaviour that slab and page cache is
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an
external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..6051a9007150 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,9 +88,6 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
-	/* e.g. boosted watermark reclaim leaves slabs alone */
-	unsigned int may_shrinkslab:1;
-
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2714,10 +2711,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 		node_lru_pages += lru_pages;
 
-		if (sc->may_shrinkslab) {
-			shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
-		}
+		shrink_slab(sc->gfp_mask, pgdat->node_id,
+			    memcg, sc->priority);
 
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
@@ -3194,7 +3189,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3238,7 +3232,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
-		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3286,7 +3279,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
-		.may_shrinkslab = 1,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -3598,7 +3590,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	 */
 	sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
 	sc.may_swap = !nr_boost_reclaim;
-	sc.may_shrinkslab = !nr_boost_reclaim;
 
 	/*
 	 * Do some background aging of the anon list, to give
On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
>
> More than likely.
>
> > On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > When running a simple, steady state 4kB file creation test to
> > > simulate extracting tarballs larger than memory full of small files
> > > into the filesystem, I noticed that once memory fills up the cache
> > > balance goes to hell.
>
> Ok, I'm assuming you are using fsmark with -k to keep files around,
> and -S0 to leave cleaning to the background flush, a number of files per
> iteration to get regular reporting and a total number of iterations to
> fill memory to hit what you're seeing. I've created a configuration that
> should do this but it'll take a long time to run on a local test machine.

./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 60 \
	-d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 \
	-d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 \
	-d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 \
	-d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 \
	-d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 \
	-d /mnt/scratch/15

This doesn't delete files at all, creates 160,000 files per iteration
(-n 10000 across the 16 directories) in directories of 10,000 files at
a time, and runs 60 iterations. It leaves all writeback (data and
metadata) to background kernel mechanisms.

> I'm not 100% certain I guessed right: to get fsmark reports while memory
> fills, it would have to be fewer files per iteration and each iteration
> would have to preserve files. If the number of files per iteration is
> large enough to fill memory then the drop in files/sec is not visible
> from the fs_mark output (or we are using different versions). I guess
> you could just be calculating average files/sec over the entire run
> based on elapsed time.

The files/s average is the average of the fsmark iteration output
(i.e. the rate at which it creates each group of 160k files).

> > > The workload is creating one dirty cached inode for every dirty
> > > page, both of which should require a single IO each to clean and
> > > reclaim, and creation of inodes is throttled by the rate at which
> > > dirty writeback runs (via balance dirty pages). Hence the ingest
> > > rate of new cached inodes and page cache pages is identical and
> > > steady. As a result, memory reclaim should quickly find a steady
> > > balance between page cache and inode caches.
> > >
> > > It doesn't.
> > >
> > > The moment memory fills, the page cache is reclaimed at a much
> > > faster rate than the inode cache, and evidence suggests that the
> > > inode cache shrinker is not being called when large batches of pages
> > > are being reclaimed. In roughly the same time period that it takes
> > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > reduces the page cache down to just dirty pages and slab caches fill
> > > the entirety of memory.
> > >
> > > At the point where the page cache is reduced to just the dirty
> > > pages, there is a clear change in write IO patterns. Up to this
> > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > of write throughput (data, journal and metadata).
>
> As observed by iostat -x or something else? Sum of r/s and w/s would

PCP + live pmcharts.
Same as I've done for 15+ years :)

I could look at iostat, but it's much easier to watch graphs run
and then be able to double click on any point and get the actual
value.

I've attached a screen shot of the test machine overview while the
vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
create/remove/lookup ops, context switch rate and memory usage) at a
1 second sample rate. You can see the IO patterns change, the
context switch rate go nuts and the CPU usage pattern change when
the page cache hits empty.

> approximate iops but not the breakdown of whether it is data, journal
> or metadata writes.

I have that in other charts - the log chart tells me how many log
IOs are being done (constant 30MB/s in ~150 IOs/sec). And the >1GB/s
IO spike every 30s is the metadata writeback being aggregated into
large IOs by metadata writeback. That doesn't change, either...

> The rest can be inferred from a blktrace but I would
> prefer to replicate your setup as close as possible. If you're not using
> fs_mark to report Files/sec, are you simply monitoring df -i over time?

The way I run fsmark is iterative - the count field tells you how
many inodes it has created...

> > > So I went looking at the code, trying to find places where pages got
> > > reclaimed and the shrinkers weren't called. There's only one -
> > > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > > small amounts of memory when an external fragmentation event
> > > occurs"). I'm not even using THP or allocating huge pages, so this
> > > code should not be active or having any effect on memory reclaim at
> > > all, yet the majority of reclaim is being done with "boost" and so
> > > it's not reclaiming slab caches at all. It will only free clean
> > > pages from the LRU.
> > >
> > > And so when we do run out of memory, it switches to normal reclaim,
> > > which hits dirty pages on the LRU and does some shrinker work, too,
> > > but then appears to switch back to boosted reclaim once watermarks
> > > are reached.
> > >
> > > The patch below restores page cache vs inode cache balance for this
> > > steady state workload. It balances out at about 40% page cache, 60%
> > > slab cache, and sustained performance is 10-15% higher than without
> > > this patch because the IO patterns remain in control of dirty
> > > writeback and the filesystem, not kswapd.
> > >
> > > Performance with boosted reclaim also running shrinkers over the
> > > same steady state portion of the test as above.
>
> The boosting was not intended to target THP specifically -- it was meant
> to help recover early from any fragmentation-related event for any user
> that might need it. Hence, it's not tied to THP but even with THP
> disabled, the boosting will still take effect.
>
> One band-aid would be to disable watermark boosting entirely when THP is
> disabled but that feels wrong. However, I would be interested in hearing
> if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.

<runs test>

Ok, it still runs it out of page cache, but it doesn't drive page
cache reclaim as hard once there's none left. The IO patterns are
less peaky, context switch rates are increased from ~3k/s to 15k/s
but remain pretty steady.

Test ran 5s faster and file rate improved by ~2%. So it's better
once the page cache is largely reclaimed, but it doesn't
prevent the page cache from being reclaimed instead of inodes....

> On that basis, it may justify ripping out the may_shrinkslab logic
> everywhere.
> The downside is that some microbenchmarks will notice.
> Specifically IO benchmarks that fill memory and reread (particularly
> rereading the metadata via any inode operation) may show reduced
> results.

Sure, but right now benchmarks that rely on page cache being
retained are being screwed over :)

> Such benchmarks can be strongly affected by whether the inode
> information is still memory resident and watermark boosting reduces
> the chances the data is still resident in memory. Technically still a
> regression but a tunable one.

/proc/sys/vm/vfs_cache_pressure is for tuning page cache/inode cache
balance. Changing that balance should not occur as a side effect of
watermark boosting.

Everyone knows about vfs_cache_pressure. Lots of people complain
when it doesn't work, and watermark boosting changing the cache
balance is exactly the sort of thing that makes it not work.

Cheers,
Dave.
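As a reference for what the vfs_cache_pressure knob Dave mentions
actually does, here is a sketch paraphrased from memory of the VFS code
of this era (the helper lives in include/linux/dcache.h; verify against
your tree):

	/*
	 * The dentry/inode object counts a superblock reports to the
	 * shrinker core are scaled by sysctl_vfs_cache_pressure/100:
	 * 100 is neutral, lower values protect the VFS caches, and
	 * winding it up makes the shrinkers see the caches as larger
	 * than they are and reclaim them harder.
	 */
	static inline unsigned long vfs_pressure_ratio(unsigned long val)
	{
		return mult_frac(val, sysctl_vfs_cache_pressure, 100);
	}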
On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > <SNIP>
> >
> > On that basis, it may justify ripping out the may_shrinkslab logic
> > everywhere. The downside is that some microbenchmarks will notice.
> > Specifically IO benchmarks that fill memory and reread (particularly
> > rereading the metadata via any inode operation) may show reduced
> > results. Such benchmarks can be strongly affected by whether the inode
> > information is still memory resident and watermark boosting reduces
> > the chances the data is still resident in memory. Technically still a
> > regression but a tunable one.
> >
> > Hence the following "it builds" patch that has zero supporting data on
> > whether it's a good idea or not.
>
> This is a more complete version of the same patch that summarises the
> problem and includes data from my own testing
....
> A fsmark benchmark configuration was constructed similar to
> what Dave reported and is codified by the mmtests configuration
> config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine
> to avoid dealing with NUMA-related issues and the timing of reclaim. The
> storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> the test data.

Have you run fstrim on that drive recently? I'm running these tests
on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
ago I had all sorts of whacky performance problems and inconsistent
results. Turned out there were all sorts of random long IO latencies
occurring (in the hundreds of milliseconds) because the drive was
constantly running garbage collection to free up space. As a result
it was both blocking on GC and thermal throttling under these fsmark
workloads.

I made a new XFS filesystem on it (lazy man's rm -rf *), then ran
fstrim on it to tell the drive all the space is free. Drive temps
dropped 30C immediately, and all of the whacky performance anomalies
went away. I now fstrim the drive in my vm startup scripts before
each test run, and it's giving consistent results again.

> It is likely that the test configuration is not a proper match for Dave's
> test as the results are different in terms of performance. However, my
> configuration reports fsmark performance every 10% of memory worth of
> files and I suspect Dave's configuration reported Files/sec when memory
> was already full. THP was enabled for mine, disabled for Dave's, and
> probably a whole load of other methodology differences that rarely
> get recorded properly.

Yup, like I forgot to mention that my test system is using a 4-node
fakenuma setup (i.e. 4 nodes, 4GB RAM and 4 CPUs per node, so
there are 4 separate kswapds doing concurrent reclaim). That
changes reclaim patterns as well.
> fsmark
>                                 5.3.0-rc3              5.3.0-rc3
>                                   vanilla          shrinker-v1r1
> Min       1-files/sec      5181.70 (   0.00%)     3204.20 ( -38.16%)
> 1st-qrtle 1-files/sec     14877.10 (   0.00%)     6596.90 ( -55.66%)
> 2nd-qrtle 1-files/sec      6521.30 (   0.00%)     5707.80 ( -12.47%)
> 3rd-qrtle 1-files/sec      5614.30 (   0.00%)     5363.80 (  -4.46%)
> Max-1     1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
> Max-5     1-files/sec     18028.40 (   0.00%)    17829.00 (  -1.11%)
> Max-10    1-files/sec     17502.70 (   0.00%)    17080.90 (  -2.41%)
> Max-90    1-files/sec      5438.80 (   0.00%)     5106.60 (  -6.11%)
> Max-95    1-files/sec      5390.30 (   0.00%)     5020.40 (  -6.86%)
> Max-99    1-files/sec      5271.20 (   0.00%)     3376.20 ( -35.95%)
> Max       1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
> Hmean     1-files/sec      7459.11 (   0.00%)     6249.49 ( -16.22%)
> Stddev    1-files/sec      4733.16 (   0.00%)     4362.10 (   7.84%)
> CoeffVar  1-files/sec        51.66 (   0.00%)       57.49 ( -11.29%)
> BHmean-99 1-files/sec      7515.09 (   0.00%)     6351.81 ( -15.48%)
> BHmean-95 1-files/sec      7625.39 (   0.00%)     6486.09 ( -14.94%)
> BHmean-90 1-files/sec      7803.19 (   0.00%)     6588.61 ( -15.57%)
> BHmean-75 1-files/sec      8518.74 (   0.00%)     6954.25 ( -18.37%)
> BHmean-50 1-files/sec     10953.31 (   0.00%)     8017.89 ( -26.80%)
> BHmean-25 1-files/sec     16732.38 (   0.00%)    11739.65 ( -29.84%)
>
>                   5.3.0-rc3      5.3.0-rc3
>                     vanilla  shrinker-v1r1
> Duration User         77.29          89.09
> Duration System     1097.13        1332.86
> Duration Elapsed    2014.14        2596.39

I'm not sure we are testing or measuring exactly the same things :)

> This is showing that fsmark runs slower as a result of this patch but
> there are other important observations that justify the patch.
>
> 1. With the vanilla kernel, the number of dirty pages in the system
>    is very low for much of the test. With this patch, dirty pages
>    are generally kept at 10% which matches vm.dirty_background_ratio
>    which is normal expected historical behaviour.
>
> 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
>    0.95 for much of the test i.e. Slab is being left alone and
>    dominating memory consumption. With the patch applied, the ratio
>    varies between 0.35 and 0.45 with the bulk of the measured ratios
>    roughly half way between those values. This is a different balance
>    to what Dave reported but it was at least consistent.

Yeah, the balance is typically a bit different for different configs
and storage. The trick is getting the balance to be roughly
consistent across a range of different configs. The fakenuma setup
also has a significant impact on where the balance is found. And I
can't remember if the "fixed" memory usage numbers I quoted came
from a run with my "make XFS inode reclaim nonblocking" patchset or
not.

> 3. Slabs are scanned throughout the entire test with the patch applied.
>    The vanilla kernel has long periods with no scan activity and then
>    relatively massive spikes.
>
> 4. Overall vmstats are closer to normal expectations
>
>                                      5.3.0-rc3      5.3.0-rc3
>                                        vanilla  shrinker-v1r1
> Direct pages scanned                  60308.00        5226.00
> Kswapd pages scanned               18316110.00    12295574.00
> Kswapd pages reclaimed             13121037.00     7280152.00
> Direct pages reclaimed                11817.00        5226.00
> Kswapd efficiency %                      71.64          59.21
> Kswapd velocity                        9093.76        4735.64
> Direct efficiency %                      19.59         100.00
> Direct velocity                          29.94           2.01
> Page reclaim immediate               247921.00           0.00
> Slabs scanned                      16602344.00    29369536.00
> Direct inode steals                    1574.00         800.00
> Kswapd inode steals                  130033.00     3968788.00
> Kswapd skipped wait                       0.00           0.00

That looks a lot better. Patch looks reasonable, though I'm
interested to know what impact it has on tests you ran in the
original commit for the boosting.
Cheers, Dave.
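(A note on the derived metrics above, assuming the summary computes
them the usual way: "efficiency" is pages reclaimed as a percentage of
pages scanned, and "velocity" is pages scanned per second of elapsed
time. The numbers are self-consistent: kswapd efficiency is
13121037 / 18316110 = 71.64% vanilla vs 7280152 / 12295574 = 59.21%
patched, and kswapd velocity for the vanilla run is
18316110 / 2014.14 = 9093.76 pages scanned per second.)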
On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > inode cache shrinker is not being called when large batches of pages
> > > > are being reclaimed. In roughly the same time period that it takes
> > > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > > reduces the page cache down to just dirty pages and slab caches fill
> > > > the entirety of memory.
> > > >
> > > > At the point where the page cache is reduced to just the dirty
> > > > pages, there is a clear change in write IO patterns. Up to this
> > > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > > of write throughput (data, journal and metadata).
> >
> > As observed by iostat -x or something else? Sum of r/s and w/s would
>
> PCP + live pmcharts. Same as I've done for 15+ years :)
>
> I could look at iostat, but it's much easier to watch graphs run
> and then be able to double click on any point and get the actual
> value.
>
> I've attached a screen shot of the test machine overview while the
> vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
> create/remove/lookup ops, context switch rate and memory usage) at a
> 1 second sample rate. You can see the IO patterns change, the
> context switch rate go nuts and the CPU usage pattern change when
> the page cache hits empty.

And now attached. :)

Cheers,
Dave.
On Thu, Aug 08, 2019 at 08:32:41AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > <SNIP>
> > >
> > > On that basis, it may justify ripping out the may_shrinkslab logic
> > > everywhere. The downside is that some microbenchmarks will notice.
> > > Specifically IO benchmarks that fill memory and reread (particularly
> > > rereading the metadata via any inode operation) may show reduced
> > > results. Such benchmarks can be strongly affected by whether the inode
> > > information is still memory resident and watermark boosting reduces
> > > the chances the data is still resident in memory. Technically still a
> > > regression but a tunable one.
> > >
> > > Hence the following "it builds" patch that has zero supporting data on
> > > whether it's a good idea or not.
> >
> > This is a more complete version of the same patch that summarises the
> > problem and includes data from my own testing
> ....
> > A fsmark benchmark configuration was constructed similar to
> > what Dave reported and is codified by the mmtests configuration
> > config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine
> > to avoid dealing with NUMA-related issues and the timing of reclaim. The
> > storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> > the test data.
>
> Have you run fstrim on that drive recently? I'm running these tests
> on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
> ago I had all sorts of whacky performance problems and inconsistent
> results. Turned out there were all sorts of random long IO latencies
> occurring (in the hundreds of milliseconds) because the drive was
> constantly running garbage collection to free up space. As a result
> it was both blocking on GC and thermal throttling under these fsmark
> workloads.

No, I was under the impression that making a new filesystem typically
trimmed it as well. Maybe that's just some filesystems (e.g. ext4) or
just completely wrong.

> I made a new XFS filesystem on it (lazy man's rm -rf *),

Ah, all IO tests I do make a new filesystem. I know there is the whole
problem of filesystem aging but I've yet to come across a methodology
that two people can agree on as a sensible, reproducible method.

> then ran
> fstrim on it to tell the drive all the space is free. Drive temps
> dropped 30C immediately, and all of the whacky performance anomalies
> went away. I now fstrim the drive in my vm startup scripts before
> each test run, and it's giving consistent results again.

I'll replicate that if making a new filesystem is not guaranteed to
trim. It'll muck up historical data but that happens to me every so
often anyway.

> > It is likely that the test configuration is not a proper match for Dave's
> > test as the results are different in terms of performance. However, my
> > configuration reports fsmark performance every 10% of memory worth of
> > files and I suspect Dave's configuration reported Files/sec when memory
> > was already full. THP was enabled for mine, disabled for Dave's, and
> > probably a whole load of other methodology differences that rarely
> > get recorded properly.
>
> Yup, like I forgot to mention that my test system is using a 4-node
> fakenuma setup (i.e. 4 nodes, 4GB RAM and 4 CPUs per node, so
> there are 4 separate kswapds doing concurrent reclaim). That
> changes reclaim patterns as well.

Good to know.
In this particular case, I don't think I need to exactly replicate what
you have, given that the slab reclaim behaviour is definitely more
consistent and the ratios of slab/pagecache are predictable.

> > fsmark
> >                                 5.3.0-rc3              5.3.0-rc3
> >                                   vanilla          shrinker-v1r1
> > Min       1-files/sec      5181.70 (   0.00%)     3204.20 ( -38.16%)
> > 1st-qrtle 1-files/sec     14877.10 (   0.00%)     6596.90 ( -55.66%)
> > 2nd-qrtle 1-files/sec      6521.30 (   0.00%)     5707.80 ( -12.47%)
> > 3rd-qrtle 1-files/sec      5614.30 (   0.00%)     5363.80 (  -4.46%)
> > Max-1     1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
> > Max-5     1-files/sec     18028.40 (   0.00%)    17829.00 (  -1.11%)
> > Max-10    1-files/sec     17502.70 (   0.00%)    17080.90 (  -2.41%)
> > Max-90    1-files/sec      5438.80 (   0.00%)     5106.60 (  -6.11%)
> > Max-95    1-files/sec      5390.30 (   0.00%)     5020.40 (  -6.86%)
> > Max-99    1-files/sec      5271.20 (   0.00%)     3376.20 ( -35.95%)
> > Max       1-files/sec     18463.00 (   0.00%)    18479.90 (   0.09%)
> > Hmean     1-files/sec      7459.11 (   0.00%)     6249.49 ( -16.22%)
> > Stddev    1-files/sec      4733.16 (   0.00%)     4362.10 (   7.84%)
> > CoeffVar  1-files/sec        51.66 (   0.00%)       57.49 ( -11.29%)
> > BHmean-99 1-files/sec      7515.09 (   0.00%)     6351.81 ( -15.48%)
> > BHmean-95 1-files/sec      7625.39 (   0.00%)     6486.09 ( -14.94%)
> > BHmean-90 1-files/sec      7803.19 (   0.00%)     6588.61 ( -15.57%)
> > BHmean-75 1-files/sec      8518.74 (   0.00%)     6954.25 ( -18.37%)
> > BHmean-50 1-files/sec     10953.31 (   0.00%)     8017.89 ( -26.80%)
> > BHmean-25 1-files/sec     16732.38 (   0.00%)    11739.65 ( -29.84%)
> >
> >                   5.3.0-rc3      5.3.0-rc3
> >                     vanilla  shrinker-v1r1
> > Duration User         77.29          89.09
> > Duration System     1097.13        1332.86
> > Duration Elapsed    2014.14        2596.39
>
> I'm not sure we are testing or measuring exactly the same things :)

Probably not.

> > This is showing that fsmark runs slower as a result of this patch but
> > there are other important observations that justify the patch.
> >
> > 1. With the vanilla kernel, the number of dirty pages in the system
> >    is very low for much of the test. With this patch, dirty pages
> >    are generally kept at 10% which matches vm.dirty_background_ratio
> >    which is normal expected historical behaviour.
> >
> > 2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
> >    0.95 for much of the test i.e. Slab is being left alone and
> >    dominating memory consumption. With the patch applied, the ratio
> >    varies between 0.35 and 0.45 with the bulk of the measured ratios
> >    roughly half way between those values. This is a different balance
> >    to what Dave reported but it was at least consistent.
>
> Yeah, the balance is typically a bit different for different configs
> and storage. The trick is getting the balance to be roughly
> consistent across a range of different configs. The fakenuma setup
> also has a significant impact on where the balance is found. And I
> can't remember if the "fixed" memory usage numbers I quoted came
> from a run with my "make XFS inode reclaim nonblocking" patchset or
> not.

Again, I wouldn't sweat too much about it. The generated graphs
definitely showed more consistent behaviour even if the headline
performance was not improved.

> > 3. Slabs are scanned throughout the entire test with the patch applied.
> >    The vanilla kernel has long periods with no scan activity and then
> >    relatively massive spikes.
> >
> > 4. Overall vmstats are closer to normal expectations
> >
> >                                      5.3.0-rc3      5.3.0-rc3
> >                                        vanilla  shrinker-v1r1
> > Direct pages scanned                  60308.00        5226.00
> > Kswapd pages scanned               18316110.00    12295574.00
> > Kswapd pages reclaimed             13121037.00     7280152.00
> > Direct pages reclaimed                11817.00        5226.00
> > Kswapd efficiency %                      71.64          59.21
> > Kswapd velocity                        9093.76        4735.64
> > Direct efficiency %                      19.59         100.00
> > Direct velocity                          29.94           2.01
> > Page reclaim immediate               247921.00           0.00
> > Slabs scanned                      16602344.00    29369536.00
> > Direct inode steals                    1574.00         800.00
> > Kswapd inode steals                  130033.00     3968788.00
> > Kswapd skipped wait                       0.00           0.00
>
> That looks a lot better. Patch looks reasonable, though I'm
> interested to know what impact it has on tests you ran in the
> original commit for the boosting.

I'll find out soon enough but I'm leaning on the side that kswapd
reclaim should be predictable and that even if there are some
performance problems as a result of it, there will be others that see
a gain. It'll be a case of "no matter what way you jump, someone
shouts", but kswapd having spiky unpredictable behaviour is a recipe
for "sometimes my machine is crap and I've no idea why".
On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > [Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
> >
> > More than likely.
> >
> > > On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > When running a simple, steady state 4kB file creation test to
> > > > simulate extracting tarballs larger than memory full of small files
> > > > into the filesystem, I noticed that once memory fills up the cache
> > > > balance goes to hell.
> >
> > Ok, I'm assuming you are using fsmark with -k to keep files around,
> > and -S0 to leave cleaning to the background flush, a number of files per
> > iteration to get regular reporting and a total number of iterations to
> > fill memory to hit what you're seeing. I've created a configuration that
> > should do this but it'll take a long time to run on a local test machine.
>
> ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 60 \
> 	-d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 \
> 	-d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 \
> 	-d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 \
> 	-d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 \
> 	-d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 \
> 	-d /mnt/scratch/15

Ok, I can replicate that. From a previous discussion my fs_mark setup
is able to replicate this sort of thing relatively easily. I'll tweak
the configuration.

> This doesn't delete files at all, creates 160,000 files per iteration
> (-n 10000 across the 16 directories) in directories of 10,000 files at
> a time, and runs 60 iterations. It leaves all writeback (data and
> metadata) to background kernel mechanisms.

Both our configurations left writeback to the background but I took a
different approach to get more reported results from fsmark.

> > I'm not 100% certain I guessed right: to get fsmark reports while memory
> > fills, it would have to be fewer files per iteration and each iteration
> > would have to preserve files. If the number of files per iteration is
> > large enough to fill memory then the drop in files/sec is not visible
> > from the fs_mark output (or we are using different versions). I guess
> > you could just be calculating average files/sec over the entire run
> > based on elapsed time.
>
> The files/s average is the average of the fsmark iteration output
> (i.e. the rate at which it creates each group of 160k files).

Good to confirm for sure.

> > > > The workload is creating one dirty cached inode for every dirty
> > > > page, both of which should require a single IO each to clean and
> > > > reclaim, and creation of inodes is throttled by the rate at which
> > > > dirty writeback runs (via balance dirty pages). Hence the ingest
> > > > rate of new cached inodes and page cache pages is identical and
> > > > steady. As a result, memory reclaim should quickly find a steady
> > > > balance between page cache and inode caches.
> > > >
> > > > It doesn't.
> > > >
> > > > The moment memory fills, the page cache is reclaimed at a much
> > > > faster rate than the inode cache, and evidence suggests that the
> > > > inode cache shrinker is not being called when large batches of pages
> > > > are being reclaimed.
> > > > In roughly the same time period that it takes
> > > > to fill memory with 50% pages and 50% slab caches, memory reclaim
> > > > reduces the page cache down to just dirty pages and slab caches fill
> > > > the entirety of memory.
> > > >
> > > > At the point where the page cache is reduced to just the dirty
> > > > pages, there is a clear change in write IO patterns. Up to this
> > > > point it has been running at a steady 1500 write IOPS for ~200MB/s
> > > > of write throughput (data, journal and metadata).
> >
> > As observed by iostat -x or something else? Sum of r/s and w/s would
>
> PCP + live pmcharts. Same as I've done for 15+ years :)

Yep. This is not the first time I thought I should wire PCP into
mmtests. There are historical reasons why it didn't happen but they
are getting less and less relevant.

> I could look at iostat, but it's much easier to watch graphs run
> and then be able to double click on any point and get the actual
> value.

Can't argue with the logic.

> I've attached a screen shot of the test machine overview while the
> vanilla kernel runs the fsmark test (cpu, iops, IO bandwidth, XFS
> create/remove/lookup ops, context switch rate and memory usage) at a
> 1 second sample rate. You can see the IO patterns change, the
> context switch rate go nuts and the CPU usage pattern change when
> the page cache hits empty.

Indeed.

> > approximate iops but not the breakdown of whether it is data, journal
> > or metadata writes.
>
> I have that in other charts - the log chart tells me how many log
> IOs are being done (constant 30MB/s in ~150 IOs/sec). And the >1GB/s
> IO spike every 30s is the metadata writeback being aggregated into
> large IOs by metadata writeback. That doesn't change, either...

Understood.

> > The rest can be inferred from a blktrace but I would
> > prefer to replicate your setup as close as possible. If you're not using
> > fs_mark to report Files/sec, are you simply monitoring df -i over time?
>
> The way I run fsmark is iterative - the count field tells you how
> many inodes it has created...

Understood, I'll update tests accordingly.

> > > > So I went looking at the code, trying to find places where pages got
> > > > reclaimed and the shrinkers weren't called. There's only one -
> > > > kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> > > > small amounts of memory when an external fragmentation event
> > > > occurs"). I'm not even using THP or allocating huge pages, so this
> > > > code should not be active or having any effect on memory reclaim at
> > > > all, yet the majority of reclaim is being done with "boost" and so
> > > > it's not reclaiming slab caches at all. It will only free clean
> > > > pages from the LRU.
> > > >
> > > > And so when we do run out of memory, it switches to normal reclaim,
> > > > which hits dirty pages on the LRU and does some shrinker work, too,
> > > > but then appears to switch back to boosted reclaim once watermarks
> > > > are reached.
> > > >
> > > > The patch below restores page cache vs inode cache balance for this
> > > > steady state workload. It balances out at about 40% page cache, 60%
> > > > slab cache, and sustained performance is 10-15% higher than without
> > > > this patch because the IO patterns remain in control of dirty
> > > > writeback and the filesystem, not kswapd.
> > > >
> > > > Performance with boosted reclaim also running shrinkers over the
> > > > same steady state portion of the test as above.
>
> > The boosting was not intended to target THP specifically -- it was meant
> > to help recover early from any fragmentation-related event for any user
> > that might need it. Hence, it's not tied to THP but even with THP
> > disabled, the boosting will still take effect.
> >
> > One band-aid would be to disable watermark boosting entirely when THP is
> > disabled but that feels wrong. However, I would be interested in hearing
> > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
>
> <runs test>
>
> Ok, it still runs it out of page cache, but it doesn't drive page
> cache reclaim as hard once there's none left. The IO patterns are
> less peaky, context switch rates are increased from ~3k/s to 15k/s
> but remain pretty steady.
>
> Test ran 5s faster and file rate improved by ~2%. So it's better
> once the page cache is largely reclaimed, but it doesn't
> prevent the page cache from being reclaimed instead of inodes....

Ok. Ideally you would also confirm the patch itself works as you want.
It *should* but an actual confirmation would be nice.

> > On that basis, it may justify ripping out the may_shrinkslab logic
> > everywhere. The downside is that some microbenchmarks will notice.
> > Specifically IO benchmarks that fill memory and reread (particularly
> > rereading the metadata via any inode operation) may show reduced
> > results.
>
> Sure, but right now benchmarks that rely on page cache being
> retained are being screwed over :)

Yeah. This is a perfect example of a change that will help some
workloads and hurt others. However, as mentioned elsewhere, predictable
behaviour is far easier to work with and it's closer to the historical
behaviour.

> > Such benchmarks can be strongly affected by whether the inode
> > information is still memory resident and watermark boosting reduces
> > the chances the data is still resident in memory. Technically still a
> > regression but a tunable one.
>
> /proc/sys/vm/vfs_cache_pressure is for tuning page cache/inode cache
> balance. Changing that balance should not occur as a side effect of
> watermark boosting.
>
> Everyone knows about vfs_cache_pressure. Lots of people complain
> when it doesn't work, and watermark boosting changing the cache
> balance is exactly the sort of thing that makes it not work.

And this -- the fact that there is actually a tuning knob to deal with
the tradeoffs -- is better than the current vanilla boost behaviour
which has side-effects that cannot be tuned against.

I'll update the changelog and resend the patch tomorrow assuming that
it actually fixes your problem.

Thanks Dave (for both the report and the analysis that made this a
no-brainer)
On Thu, Aug 08, 2019 at 12:48:15AM +0100, Mel Gorman wrote:
> On Thu, Aug 08, 2019 at 08:32:41AM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 09:56:15PM +0100, Mel Gorman wrote:
> > > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > <SNIP>
> > > >
> > > > On that basis, it may justify ripping out the may_shrinkslab logic
> > > > everywhere. The downside is that some microbenchmarks will notice.
> > > > Specifically IO benchmarks that fill memory and reread (particularly
> > > > rereading the metadata via any inode operation) may show reduced
> > > > results. Such benchmarks can be strongly affected by whether the inode
> > > > information is still memory resident and watermark boosting reduces
> > > > the chances the data is still resident in memory. Technically still a
> > > > regression but a tunable one.
> > > >
> > > > Hence the following "it builds" patch that has zero supporting data on
> > > > whether it's a good idea or not.
> > >
> > > This is a more complete version of the same patch that summarises the
> > > problem and includes data from my own testing
> > ....
> > > A fsmark benchmark configuration was constructed similar to
> > > what Dave reported and is codified by the mmtest configuration
> > > config-io-fsmark-small-file-stream. It was evaluated on a 1-socket machine
> > > to avoid dealing with NUMA-related issues and the timing of reclaim. The
> > > storage was an SSD Samsung Evo and a fresh XFS filesystem was used for
> > > the test data.
> >
> > Have you run fstrim on that drive recently? I'm running these tests
> > on a 960 EVO ssd, and when I started looking at shrinkers 3 weeks
> > ago I had all sorts of whacky performance problems and inconsistent
> > results. Turned out there were all sorts of random long IO latencies
> > occurring (in the hundreds of milliseconds) because the drive was
> > constantly running garbage collection to free up space. As a result
> > it was both blocking on GC and thermal throttling under these fsmark
> > workloads.
>
> No, I was under the impression that making a new filesystem typically
> trimmed it as well. Maybe that's just some filesystems (e.g. ext4) or
> just completely wrong.

Depends. IIRC, some have turned that off by default because of the
number of poor implementations that take minutes to trim a whole
device. XFS discards by default, but that doesn't mean it actually
gets done. e.g. it might be on a block device that does not support
or pass down discard requests.

FWIW, I run these in a VM on a sparse filesystem image (500TB) held
in a file on the host XFS filesystem and:

$ cat /sys/block/vdc/queue/discard_max_bytes
0

Discard requests don't pass down through the virtio block device (nor
do I really want them to). Hence I have to punch the image file and
fstrim on the host side before launching the VM that runs the tests...

> > then ran fstrim on it to tell the drive all the space is free. Drive
> > temps dropped 30C immediately, and all of the whacky performance
> > anomalies went away. I now fstrim the drive in my vm startup scripts
> > before each test run, and it's giving consistent results again.
>
> I'll replicate that if making a new filesystem is not guaranteed to
> trim. It'll muck up historical data but that happens to me every so
> often anyway.

mkfs.xfs should be doing it if you're directly on top of the SSD.
Just wanted to check seeing as I've recently been bitten by this.

> > That looks a lot better.
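As an aside for anyone replicating the host-side trim: fstrim is a
thin wrapper around the FITRIM ioctl, so the whole-filesystem trim
amounts to something like the sketch below (error handling
abbreviated; the mount point is hypothetical):

	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

	int main(void)
	{
		/* hypothetical host mount point holding the VM image */
		int fd = open("/mnt/images", O_RDONLY);
		struct fstrim_range range = {
			.start = 0,
			.len = ULLONG_MAX,	/* trim the whole filesystem */
			.minlen = 0,
		};

		if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
			perror("FITRIM");
			return 1;
		}
		/* on return, range.len holds the bytes actually trimmed */
		printf("trimmed %llu bytes\n", (unsigned long long)range.len);
		return 0;
	}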
> > Patch looks reasonable, though I'm interested to know what impact
> > it has on tests you ran in the original commit for the boosting.
>
> I'll find out soon enough but I'm leaning on the side that kswapd reclaim
> should be predictable and that even if there are some performance problems
> as a result of it, there will be others that see a gain. It'll be a case
> of "no matter what way you jump, someone shouts" but kswapd having spiky
> unpredictable behaviour is a recipe for "sometimes my machine is crap
> and I've no idea why".

Yeah, and that's precisely the motivation for getting XFS inode
reclaim to avoid blocking altogether and relying on memory reclaim to
back off when appropriate. I expect there will be other problems I
find with reclaim backoff and balance as I kick the tyres more...

Cheers,

Dave.
On Thu, Aug 08, 2019 at 12:55:34AM +0100, Mel Gorman wrote:
> On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > The boosting was not intended to target THP specifically -- it was meant
> > > to help recover early from any fragmentation-related event for any user
> > > that might need it. Hence, it's not tied to THP but even with THP
> > > disabled, the boosting will still take effect.
> > >
> > > One band-aid would be to disable watermark boosting entirely when THP is
> > > disabled but that feels wrong. However, I would be interested in hearing
> > > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
> >
> > <runs test>
> >
> > Ok, it still runs it out of page cache, but it doesn't drive page
> > cache reclaim as hard once there's none left. The IO patterns are
> > less peaky, context switch rates are increased from ~3k/s to 15k/s
> > but remain pretty steady.
> >
> > Test ran 5s faster and file rate improved by ~2%. So it's better
> > once the page cache is largely reclaimed, but it doesn't
> > prevent the page cache from being reclaimed instead of inodes....
>
> Ok. Ideally you would also confirm the patch itself works as you want.
> It *should* but an actual confirmation would be nice.

Yup, I'll get to that later today.

Cheers,

Dave.
On Thu, Aug 08, 2019 at 10:30:25AM +1000, Dave Chinner wrote:
> On Thu, Aug 08, 2019 at 12:55:34AM +0100, Mel Gorman wrote:
> > On Thu, Aug 08, 2019 at 08:08:17AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 07, 2019 at 04:03:16PM +0100, Mel Gorman wrote:
> > > > On Wed, Aug 07, 2019 at 11:30:56AM +0200, Michal Hocko wrote:
> > > > The boosting was not intended to target THP specifically -- it was meant
> > > > to help recover early from any fragmentation-related event for any user
> > > > that might need it. Hence, it's not tied to THP but even with THP
> > > > disabled, the boosting will still take effect.
> > > >
> > > > One band-aid would be to disable watermark boosting entirely when THP is
> > > > disabled but that feels wrong. However, I would be interested in hearing
> > > > if sysctl vm.watermark_boost_factor=0 has the same effect as your patch.
> > >
> > > <runs test>
> > >
> > > Ok, it still runs it out of page cache, but it doesn't drive page
> > > cache reclaim as hard once there's none left. The IO patterns are
> > > less peaky, context switch rates are increased from ~3k/s to 15k/s
> > > but remain pretty steady.
> > >
> > > Test ran 5s faster and file rate improved by ~2%. So it's better
> > > once the page cache is largely reclaimed, but it doesn't
> > > prevent the page cache from being reclaimed instead of inodes....
> >
> > Ok. Ideally you would also confirm the patch itself works as you want.
> > It *should* but an actual confirmation would be nice.
>
> Yup, I'll get to that later today.

Looks good, does what it says on the tin.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
> -		if (sc->may_shrinkslab) {
> -			shrink_slab(sc->gfp_mask, pgdat->node_id,
> -				    memcg, sc->priority);
> -		}
> +		shrink_slab(sc->gfp_mask, pgdat->node_id,
> +				memcg, sc->priority);

Not the most useful comment, but the indentation for the continuing
line is weird (already in the original code). This should be something
like:

		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
			    sc->priority);
On Thu, Aug 08, 2019 at 08:36:58AM -0700, Christoph Hellwig wrote:
> > -		if (sc->may_shrinkslab) {
> > -			shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -				    memcg, sc->priority);
> > -		}
> > +		shrink_slab(sc->gfp_mask, pgdat->node_id,
> > +				memcg, sc->priority);
>
> Not the most useful comment, but the indentation for the continuing line
> is weird (already in the original code). This should be something like:
>
> 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> 		    sc->priority);

If that's the worst you found then I take it as good news. I have not
sent a version with an updated changelog so I can fix it up.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9034570febd9..702e6523f8ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2748,10 +2748,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;

-			if (sc->may_shrinkslab) {
+			//if (sc->may_shrinkslab) {
 				shrink_slab(sc->gfp_mask, pgdat->node_id,
 					    memcg, sc->priority);
-			}
+			//}

 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
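The hunk above is the quick commented-out test hack. For comparison, a
merge-ready form of the same change would drop the conditional
outright; a sketch of the equivalent hunk, using the continuation-line
indentation Christoph suggested (and, consistent with Mel's suggestion
above to rip out the may_shrinkslab logic everywhere, the now-unused
field would also go from struct scan_control):

-			if (sc->may_shrinkslab) {
-				shrink_slab(sc->gfp_mask, pgdat->node_id,
-					    memcg, sc->priority);
-			}
+			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+				    sc->priority);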