Patchwork [3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path

Submitter Johannes Weiner
Date Jan. 23, 2017, 6:16 p.m.
Message ID <20170123181641.23938-4-hannes@cmpxchg.org>
Permalink /patch/9533309/
State New

Comments

Johannes Weiner - Jan. 23, 2017, 6:16 p.m.
Direct reclaim has been replaced by kswapd reclaim in pretty much all
common memory pressure situations, so this code most likely doesn't
accomplish the described effect anymore. The previous patch wakes up
flushers for all reclaimers when we encounter dirty pages at the tail
end of the LRU. Remove the crufty old direct reclaim invocation.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 17 -----------------
 1 file changed, 17 deletions(-)
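
For context, the "previous patch" referenced above adds its wakeup at the
bottom of shrink_inactive_list(). Roughly, as a sketch reconstructed from
the description (patch 2/5 in the series is the authoritative version):

	/*
	 * Sketch: if every page isolated off the LRU tail was dirty
	 * with no writeback queued, the flushers are not keeping up;
	 * nudge them. nr_pages == 0 means "write back everything".
	 */
	if (nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(0, WB_REASON_VMSCAN);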
Minchan Kim - Jan. 26, 2017, 1:38 a.m.
On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Mel Gorman - Jan. 26, 2017, 10:05 a.m.
On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

In general I like this. I worried at first that if kswapd is blocked
writing pages it won't reach the wakeup_flusher_threads but the
previous patch handles it.

Now though, it occurs to me with the last patch that we always writeout
the world when flushing threads. This may not be a great idea. Consider
for example if there is a heavy writer of short-lived tmp files. In such a
case, it is possible for the files to be truncated before they even hit the
disk. However, if there are multiple "writeout the world" calls, these may
now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
could all be requested to writeout the world and each request unplugs.

Is it possible to maintain the property of writing back pages relative
to the numbers of pages scanned or have you determined already that it's
not necessary?
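
Concretely, the proportional property Mel is asking about would amount to
sizing the wakeup to the reclaim batch rather than asking for the world,
e.g. at the previous patch's call site (a hypothetical sketch, not code
from this series):

	if (nr_unqueued_dirty == nr_taken)
		/* request roughly one reclaim batch of writeback
		 * instead of 0 == "write out the world" */
		wakeup_flusher_threads(nr_taken, WB_REASON_VMSCAN);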
Michal Hocko - Jan. 26, 2017, 1:21 p.m.
On Mon 23-01-17 13:16:39, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 17 -----------------
>  1 file changed, 17 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 56ea8d24041f..915fc658de41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2757,8 +2757,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  					  struct scan_control *sc)
>  {
>  	int initial_priority = sc->priority;
> -	unsigned long total_scanned = 0;
> -	unsigned long writeback_threshold;
>  retry:
>  	delayacct_freepages_start();
>  
> @@ -2771,7 +2769,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		sc->nr_scanned = 0;
>  		shrink_zones(zonelist, sc);
>  
> -		total_scanned += sc->nr_scanned;
>  		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>  			break;
>  
> @@ -2784,20 +2781,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 */
>  		if (sc->priority < DEF_PRIORITY - 2)
>  			sc->may_writepage = 1;
> -
> -		/*
> -		 * Try to write back as many pages as we just scanned.  This
> -		 * tends to cause slow streaming writers to write data to the
> -		 * disk smoothly, at the dirtying rate, which is nice.   But
> -		 * that's undesirable in laptop mode, where we *want* lumpy
> -		 * writeout.  So in laptop mode, write out the whole world.
> -		 */
> -		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> -		if (total_scanned > writeback_threshold) {
> -			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
> -						WB_REASON_VMSCAN);
> -			sc->may_writepage = 1;
> -		}
>  	} while (--sc->priority >= 0);
>  
>  	delayacct_freepages_end();
> -- 
> 2.11.0
> 
Johannes Weiner - Jan. 26, 2017, 6:50 p.m.
On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > common memory pressure situations, so this code most likely doesn't
> > accomplish the described effect anymore. The previous patch wakes up
> > flushers for all reclaimers when we encounter dirty pages at the tail
> > end of the LRU. Remove the crufty old direct reclaim invocation.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> In general I like this. I worried at first that if kswapd is blocked
> writing pages it won't reach the wakeup_flusher_threads but the
> previous patch handles it.
> 
> Now though, it occurs to me with the last patch that we always writeout
> the world when flushing threads. This may not be a great idea. Consider
> for example if there is a heavy writer of short-lived tmp files. In such a
> case, it is possible for the files to be truncated before they even hit the
> disk. However, if there are multiple "writeout the world" calls, these may
> now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> could all be requested to writeout the world and each request unplugs.
> 
> Is it possible to maintain the property of writing back pages relative
> to the numbers of pages scanned or have you determined already that it's
> not necessary?

That's what I started out with - waking the flushers for nr_taken. I
was using a silly test case that wrote < dirty background limit and
then allocated a burst of anon memory. When the dirty data is linear,
the bigger IO requests are beneficial. They don't exhaust struct
request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
only 32), and they require less frequent plugging.
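
(For reference, the batch size mentioned above is the LRU isolation
batch, capped in include/linux/swap.h:

	#define SWAP_CLUSTER_MAX 32UL

so reclaim-issued writeback maxes out at 32 x 4KiB = 128KiB of scattered
single-page requests per batch, while a single flusher-issued request
for linear dirty data can cover more than that on its own.)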

Force-flushing temporary files under memory pressure is a concern -
although the most recently dirtied files would get queued last, still
giving them some time to get truncated - but I'm wary about splitting
the flush requests too aggressively when we DO sustain throngs of
dirty pages hitting the reclaim scanners.

I didn't test this with the real workload that gave us problems yet,
though, because deploying enough machines to get a good sample size
takes 1-2 days and to run through the full load spectrum another 4-5.
So it's harder to fine-tune these patches.

But this is a legit concern. I'll try to find out what happens when we
reduce the wakeups to nr_taken.

Given the problem these patches address, though, would you be okay
with keeping this patch in -mm? We're too far into 4.10 to merge it
upstream now, and I should have data on more precise wakeups before
the next merge window.

Thanks
Mel Gorman - Jan. 26, 2017, 8:45 p.m.
On Thu, Jan 26, 2017 at 01:50:27PM -0500, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > common memory pressure situations, so this code most likely doesn't
> > > accomplish the described effect anymore. The previous patch wakes up
> > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > In general I like this. I worried at first that if kswapd is blocked
> > writing pages it won't reach the wakeup_flusher_threads but the
> > previous patch handles it.
> > 
> > Now though, it occurs to me with the last patch that we always writeout
> > the world when flushing threads. This may not be a great idea. Consider
> > for example if there is a heavy writer of short-lived tmp files. In such a
> > case, it is possible for the files to be truncated before they even hit the
> > disk. However, if there are multiple "writeout the world" calls, these may
> > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > could all be requested to writeout the world and each request unplugs.
> > 
> > Is it possible to maintain the property of writing back pages relative
> > to the numbers of pages scanned or have you determined already that it's
> > not necessary?
> 
> That's what I started out with - waking the flushers for nr_taken. I
> was using a silly test case that wrote < dirty background limit and
> then allocated a burst of anon memory. When the dirty data is linear,
> the bigger IO requests are beneficial. They don't exhaust struct
> request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> only 32), and they require less frequent plugging.
> 

Understood.

> Force-flushing temporary files under memory pressure is a concern -
> although the most recently dirtied files would get queued last, still
> giving them some time to get truncated - but I'm wary about splitting
> the flush requests too aggressively when we DO sustain throngs of
> dirty pages hitting the reclaim scanners.
> 

That's fair enough. It's rare to see a case where a tmp file being
written instead of truncated in RAM causes problems. The only one that
really springs to mind is dbench3 whose "performance" often relied on
whether the files were truncated before writeback.

> I didn't test this with the real workload that gave us problems yet,
> though, because deploying enough machines to get a good sample size
> takes 1-2 days and to run through the full load spectrum another 4-5.
> So it's harder to fine-tune these patches.
> 
> But this is a legit concern. I'll try to find out what happens when we
> reduce the wakeups to nr_taken.
> 
> Given the problem these patches address, though, would you be okay
> with keeping this patch in -mm? We're too far into 4.10 to merge it
> upstream now, and I should have data on more precise wakeups before
> the next merge window.
> 

Yeah, that's fine. My concern is mostly theoretical but it's something
to watch out for in future regression reports. It should be relatively
easy to spot -- a workload generates lots of short-lived tmp files for
whatever reason and reports that write IO is higher, causing the system
to stall other IO requests.

Thanks.
Michal Hocko - Jan. 27, 2017, 12:01 p.m.
On Thu 26-01-17 13:50:27, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > common memory pressure situations, so this code most likely doesn't
> > > accomplish the described effect anymore. The previous patch wakes up
> > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > In general I like this. I worried at first that if kswapd is blocked
> > writing pages it won't reach the wakeup_flusher_threads but the
> > previous patch handles it.
> > 
> > Now though, it occurs to me with the last patch that we always writeout
> > the world when flushing threads. This may not be a great idea. Consider
> > for example if there is a heavy writer of short-lived tmp files. In such a
> > case, it is possible for the files to be truncated before they even hit the
> > disk. However, if there are multiple "writeout the world" calls, these may
> > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > could all be requested to writeout the world and each request unplugs.
> > 
> > Is it possible to maintain the property of writing back pages relative
> > to the numbers of pages scanned or have you determined already that it's
> > not necessary?
> 
> That's what I started out with - waking the flushers for nr_taken. I
> was using a silly test case that wrote < dirty background limit and
> then allocated a burst of anon memory. When the dirty data is linear,
> the bigger IO requests are beneficial. They don't exhaust struct
> request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> only 32), and they require less frequent plugging.
> 
> Force-flushing temporary files under memory pressure is a concern -
> although the most recently dirtied files would get queued last, still
> giving them some time to get truncated - but I'm wary about splitting
> the flush requests too aggressively when we DO sustain throngs of
> dirty pages hitting the reclaim scanners.

I think the above would be helpful in the changelog for future
reference.
Mel Gorman - Jan. 27, 2017, 2:27 p.m.
On Fri, Jan 27, 2017 at 01:01:01PM +0100, Michal Hocko wrote:
> On Thu 26-01-17 13:50:27, Johannes Weiner wrote:
> > On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > > common memory pressure situations, so this code most likely doesn't
> > > > accomplish the described effect anymore. The previous patch wakes up
> > > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > > 
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > 
> > > In general I like this. I worried at first that if kswapd is blocked
> > > writing pages it won't reach the wakeup_flusher_threads but the
> > > previous patch handles it.
> > > 
> > > Now though, it occurs to me with the last patch that we always writeout
> > > the world when flushing threads. This may not be a great idea. Consider
> > > for example if there is a heavy writer of short-lived tmp files. In such a
> > > case, it is possible for the files to be truncated before they even hit the
> > > disk. However, if there are multiple "writeout the world" calls, these may
> > > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > > could all be requested to writeout the world and each request unplugs.
> > > 
> > > Is it possible to maintain the property of writing back pages relative
> > > to the numbers of pages scanned or have you determined already that it's
> > > not necessary?
> > 
> > That's what I started out with - waking the flushers for nr_taken. I
> > was using a silly test case that wrote < dirty background limit and
> > then allocated a burst of anon memory. When the dirty data is linear,
> > the bigger IO requests are beneficial. They don't exhaust struct
> > request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> > only 32), and they require less frequent plugging.
> > 
> > Force-flushing temporary files under memory pressure is a concern -
> > although the most recently dirtied files would get queued last, still
> > giving them some time to get truncated - but I'm wary about splitting
> > the flush requests too aggressively when we DO sustain throngs of
> > dirty pages hitting the reclaim scanners.
> 
> I think the above would be helpful in the changelog for future
> reference.
> 

Agreed. I backported the series to 4.10-rc5 with one minor conflict and
ran a couple of tests on it. A mixed random read/write workload didn't show
anything interesting. A write-only database didn't show much difference in
performance but there were slight reductions in IO -- probably in the noise.

simoop did show big differences although not as big as I expected. This
is Chris Mason's workload that simulates the VM activity of Hadoop. I
won't go through the full details but over the samples measured during
an hour it reported

                                         4.10.0-rc5            4.10.0-rc5
                                            vanilla         johannes-v1r1
Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)

The latencies are actually completely horrific in comparison to 4.4 (and
4.10-rc5 is worse than 4.9 according to historical data for reasons I
haven't analysed yet).

Still, 95% of write latency (p95-write) is halved by the series and
allocation latency is way down. Direct reclaim activity is one fifth of
what it was according to vmstats. Kswapd activity is higher but this is not
necessarily surprising. Kswapd efficiency is unchanged at 99% (99% of pages
scanned were reclaimed) but direct reclaim efficiency went from 77% to 99%.

In the vanilla kernel, 627MB of data was written back from reclaim
context. With the series, no data was written back. With or without the
patch, pages are being immediately reclaimed after writeback completes.
However, with the patch, only 1/8th of the pages are reclaimed like
this.

I expect you've done plenty of internal analysis but FWIW, I can
confirm for some basic tests that exercise this area, and on one
machine, that it's looking good and roughly matches my expectations.
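
Numbers like the 627MB of reclaim-context writeback above are typically
read from /proc/vmstat: nr_vmscan_write counts pages written from
reclaim context, and nr_vmscan_immediate_reclaim counts pages reclaimed
straight off the LRU rotation once writeback completes. A minimal
user-space reader (an editor's sketch, assuming 4KiB pages):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	unsigned long long pages;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &pages) == 2) {
		/* pages written from reclaim context, and pages
		 * reclaimed straight off the LRU after writeback */
		if (!strcmp(name, "nr_vmscan_write") ||
		    !strcmp(name, "nr_vmscan_immediate_reclaim"))
			printf("%-28s %llu pages (%llu MB)\n",
			       name, pages, pages * 4096 >> 20);
	}
	fclose(f);
	return 0;
}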

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56ea8d24041f..915fc658de41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2757,8 +2757,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					  struct scan_control *sc)
 {
 	int initial_priority = sc->priority;
-	unsigned long total_scanned = 0;
-	unsigned long writeback_threshold;
 retry:
 	delayacct_freepages_start();
 
@@ -2771,7 +2769,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		shrink_zones(zonelist, sc);
 
-		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			break;
 
@@ -2784,20 +2781,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (sc->priority < DEF_PRIORITY - 2)
 			sc->may_writepage = 1;
-
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
-		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_VMSCAN);
-			sc->may_writepage = 1;
-		}
 	} while (--sc->priority >= 0);
 
 	delayacct_freepages_end();