Patchwork [5/5] mm: vmscan: move dirty pages out of the way until they're flushed

Submitter Johannes Weiner
Date Jan. 23, 2017, 6:16 p.m.
Message ID <20170123181641.23938-6-hannes@cmpxchg.org>
Permalink /patch/9533311/
State New

Comments

Johannes Weiner - Jan. 23, 2017, 6:16 p.m.
We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans. Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts of thrash avoidance. Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect
and activate refaulting pages right away, distilling used-once pages
on the inactive list much more effectively. This is by design, and it
makes sense for clean cache. But for the most part our workload's
cache faults are refaults and its use-once cache is from streaming
writes. We end up with most of the inactive list dirty, and we don't
go after the active cache as long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off. Even if the refaults happen, reads are
faster than writes. Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that have been dirty or under
writeback for two inactive LRU cycles. We know at this point that
there are not enough clean inactive pages left to satisfy memory
demand in the system. The pages are marked for immediate reclaim,
meaning they'll get moved back to the inactive LRU tail as soon as
they're written back and become reclaimable. But in the meantime, by
reducing the inactive list to only immediately reclaimable pages, we
allow the scanner to deactivate and refill the inactive list with
clean cache from the active list tail to guarantee forward progress.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mm_inline.h | 7 +++++++
 mm/swap.c                 | 9 +++++----
 mm/vmscan.c               | 6 +++---
 3 files changed, 15 insertions(+), 7 deletions(-)
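
For orientation, the life cycle of a dirty or writeback page that the
scanner meets at the inactive tail becomes roughly the following. This is
a simplified paraphrase of the diff below, not literal kernel code:

	/* shrink_page_list(): page is dirty or still under writeback */
	SetPageReclaim(page);	/* rotate to the inactive tail once clean */
	goto activate_locked;	/* park it on the active list meanwhile */

	/*
	 * When the write completes, end_page_writeback() notices
	 * PG_reclaim and calls rotate_reclaimable_page(), which with
	 * this patch also clears PG_active and moves the page to the
	 * tail of the inactive list, where it is immediately
	 * reclaimable.
	 */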
Minchan Kim - Jan. 26, 2017, 1:47 a.m.
On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 
> This can be traced back to recent efforts of thrash avoidance. Where
> 3.10 would not detect refaulting pages and continuously supply clean
> cache to the inactive list, a thrashing workload on 4.0+ will detect
> and activate refaulting pages right away, distilling used-once pages
> on the inactive list much more effectively. This is by design, and it
> makes sense for clean cache. But for the most part our workload's
> cache faults are refaults and its use-once cache is from streaming
> writes. We end up with most of the inactive list dirty, and we don't
> go after the active cache as long as we have use-once pages around.
> 
> But waiting for writes to avoid reclaiming clean cache that *might*
> refault is a bad trade-off. Even if the refaults happen, reads are
> faster than writes. Before getting bogged down on writeback, reclaim
> should first look at *all* cache in the system, even active cache.
> 
> To accomplish this, activate pages that have been dirty or under
> writeback for two inactive LRU cycles. We know at this point that
> there are not enough clean inactive pages left to satisfy memory
> demand in the system. The pages are marked for immediate reclaim,
> meaning they'll get moved back to the inactive LRU tail as soon as
> they're written back and become reclaimable. But in the meantime, by
> reducing the inactive list to only immediately reclaimable pages, we
> allow the scanner to deactivate and refill the inactive list with
> clean cache from the active list tail to guarantee forward progress.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

Every patch looks reasonable to me.
Mel Gorman - Jan. 26, 2017, 10:19 a.m.
On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 

Note that some of this is also impacted by
bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
of dirty pages reaching the end of the LRU sooner if they are being
written. It's not impossible that hadoop is rewriting the same files,
hitting the end of the LRU due to no reads and then throwing reclaim
into a hole.

I've seen a few cases where random write only workloads regressed and it
was based on whether the random number generator was selecting the same
pages. With that commit, the LRU was effectively LIFO.

Similarly, I'd seen a case where a database whose working set was
larger than the shared memory area regressed because the spill-over from
the database buffer to RAM was not being preserved, because it was all
writes. That said, the same patch prevents the database being swapped,
so it's not all bad, but there have been consequences.

I don't have a problem with the patch although would prefer to have seen
more data for the series. However, I'm not entirely convinced that
thrash detection was the only problem. I think not activating pages on
write was a contributing factor although this patch looks better than
considering reverting bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.
Michal Hocko - Jan. 26, 2017, 1:52 p.m.
On Mon 23-01-17 13:16:41, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 
> This can be traced back to recent efforts of thrash avoidance. Where
> 3.10 would not detect refaulting pages and continuously supply clean
> cache to the inactive list, a thrashing workload on 4.0+ will detect
> and activate refaulting pages right away, distilling used-once pages
> on the inactive list much more effectively. This is by design, and it
> makes sense for clean cache. But for the most part our workload's
> cache faults are refaults and its use-once cache is from streaming
> writes. We end up with most of the inactive list dirty, and we don't
> go after the active cache as long as we have use-once pages around.
> 
> But waiting for writes to avoid reclaiming clean cache that *might*
> refault is a bad trade-off. Even if the refaults happen, reads are
> faster than writes. Before getting bogged down on writeback, reclaim
> should first look at *all* cache in the system, even active cache.
> 
> To accomplish this, activate pages that have been dirty or under
> writeback for two inactive LRU cycles. We know at this point that
> there are not enough clean inactive pages left to satisfy memory
> demand in the system. The pages are marked for immediate reclaim,
> meaning they'll get moved back to the inactive LRU tail as soon as
> they're written back and become reclaimable. But in the meantime, by
> reducing the inactive list to only immediately reclaimable pages, we
> allow the scanner to deactivate and refill the inactive list with
> clean cache from the active list tail to guarantee forward progress.

I was worried that the inactive list could shrink too low and that could
lead to a premature OOM declaration, but should_reclaim_retry should cope
with this because it considers NR_ZONE_WRITE_PENDING, which includes both
dirty and writeback pages.
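
For reference, the check being referred to looks roughly like this in
~v4.9 mm/page_alloc.c (a from-memory sketch; details may differ):

	/*
	 * should_reclaim_retry(): if reclaim made no progress but most
	 * of the remaining reclaimable memory is merely waiting for
	 * writeback, throttle on IO rather than declare OOM.
	 */
	reclaimable = zone_reclaimable_pages(zone);
	...
	if (!did_some_progress) {
		unsigned long write_pending;

		write_pending = zone_page_state_snapshot(zone,
						NR_ZONE_WRITE_PENDING);

		if (2 * write_pending > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;
		}
	}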

That being said, the patch makes sense to me.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm_inline.h | 7 +++++++
>  mm/swap.c                 | 9 +++++----
>  mm/vmscan.c               | 6 +++---
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 41d376e7116d..e030a68ead7e 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
>  	list_add(&page->lru, &lruvec->lists[lru]);
>  }
>  
> +static __always_inline void add_page_to_lru_list_tail(struct page *page,
> +				struct lruvec *lruvec, enum lru_list lru)
> +{
> +	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
> +	list_add_tail(&page->lru, &lruvec->lists[lru]);
> +}
> +
>  static __always_inline void del_page_from_lru_list(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> diff --git a/mm/swap.c b/mm/swap.c
> index aabf2e90fe32..c4910f14f957 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
>  {
>  	int *pgmoved = arg;
>  
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -		enum lru_list lru = page_lru_base_type(page);
> -		list_move_tail(&page->lru, &lruvec->lists[lru]);
> +	if (PageLRU(page) && !PageUnevictable(page)) {
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
> +		ClearPageActive(page);
> +		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
>  		(*pgmoved)++;
>  	}
>  }
> @@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
>   */
>  void rotate_reclaimable_page(struct page *page)
>  {
> -	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> +	if (!PageLocked(page) && !PageDirty(page) &&
>  	    !PageUnevictable(page) && PageLRU(page)) {
>  		struct pagevec *pvec;
>  		unsigned long flags;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df0fe0cc438e..947ab6f4db10 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			    PageReclaim(page) &&
>  			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>  				nr_immediate++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 2 above */
>  			} else if (sane_reclaim(sc) ||
> @@ -1081,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				 */
>  				SetPageReclaim(page);
>  				nr_writeback++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 3 above */
>  			} else {
> @@ -1174,7 +1174,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
>  				SetPageReclaim(page);
>  
> -				goto keep_locked;
> +				goto activate_locked;
>  			}
>  
>  			if (references == PAGEREF_RECLAIM_CLEAN)
> -- 
> 2.11.0
> 
Johannes Weiner - Jan. 26, 2017, 8:07 p.m.
On Thu, Jan 26, 2017 at 10:19:16AM +0000, Mel Gorman wrote:
> On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> > We noticed a performance regression when moving hadoop workloads from
> > 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> > activity initiated by kswapd as well as frequent bursts of allocation
> > stalls and direct reclaim scans. Even lowering the dirty ratios to the
> > equivalent of less than 1% of memory would not eliminate the issue,
> > suggesting that dirty pages concentrate where the scanner is looking.
> 
> Note that some of this is also impacted by
> bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
> of dirty pages reaching the end of the LRU sooner if they are being
> written. It's not impossible that hadoop is rewriting the same files,
> hitting the end of the LRU due to no reads and then throwing reclaim
> into a hole.
> 
> I've seen a few cases where random write only workloads regressed and it
> was based on whether the random number generator was selecting the same
> pages. With that commit, the LRU was effectively LIFO.
> 
> Similarly, I'd seen a case where a database whose working set was
> larger than the shared memory area regressed because the spill-over from
> the database buffer to RAM was not being preserved, because it was all
> writes. That said, the same patch prevents the database being swapped,
> so it's not all bad, but there have been consequences.
> 
> I don't have a problem with the patch although would prefer to have seen
> more data for the series. However, I'm not entirely convinced that
> thrash detection was the only problem. I think not activating pages on
> write was a contributing factor although this patch looks better than
> considering reverting bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.

We didn't backport this commit into our 4.6 kernel, so it couldn't
have been a factor in our particular testing. But I will fully agree
with you that this change probably exacerbates the problem.

Another example is the recent shrinking of the inactive list:
59dc76b0d4df ("mm: vmscan: reduce size of inactive file list"). That
one we did in fact backport, after which the problem we were already
debugging got worse. That was a good hint where the problem was:

Every time we got better at keeping the clean hot cache separated out
on the active list, we increased the concentration of dirty pages on
the inactive list. Whether this is workingset.c activating refaulting
pages, whether that's not activating writeback cache, or whether that
is shrinking the inactive list size, they all worked toward exposing
the same deficiency in the reclaim-writeback model: that waiting for
writes is worse than potentially causing reads. That flaw has always
been there - since we had wait_on_page_writeback() in the reclaim
scanner and the split between inactive and active cache. It was just
historically much harder to trigger problems like this in practice.

That's why this is a regression over a period of kernel development
and cannot really be pinpointed to a specific commit.

This patch, by straight-up putting dirty/writeback pages at the head
of the combined page cache double LRU regardless of access frequency,
is making an explicit update to the reclaim-writeback model to codify
the trade-off between writes and potential refaults. Any alternative
(implementation differences aside of course) would require regressing
use-once separation to previous levels in some form.

The lack of data is not great, agreed as well. The thing I can say is
that for the hadoop workloads - and this is a whole spectrum of jobs
running on hundreds of machines in a test group over several days -
this patch series restores average job completions, allocation stalls,
amount of kswapd-initiated IO, sys% and iowait% to 3.10 levels - with
a high confidence, and no obvious metric that could have regressed.

Is there something specific that you would like to see tested? Aside
from trying that load with more civilized flusher wakeups in kswapd?
Mel Gorman - Jan. 26, 2017, 8:58 p.m.
On Thu, Jan 26, 2017 at 03:07:45PM -0500, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:19:16AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> > > We noticed a performance regression when moving hadoop workloads from
> > > 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> > > activity initiated by kswapd as well as frequent bursts of allocation
> > > stalls and direct reclaim scans. Even lowering the dirty ratios to the
> > > equivalent of less than 1% of memory would not eliminate the issue,
> > > suggesting that dirty pages concentrate where the scanner is looking.
> > 
> > Note that some of this is also impacted by
> > bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
> > of dirty pages reaching the end of the LRU sooner if they are being
> > written. It's not impossible that hadoop is rewriting the same files,
> > hitting the end of the LRU due to no reads and then throwing reclaim
> > into a hole.
> > 
> > I've seen a few cases where random write only workloads regressed and it
> > was based on whether the random number generator was selecting the same
> > pages. With that commit, the LRU was effectively LIFO.
> > 
> > Similarly, I'd seen a case where a database whose working set was
> > larger than the shared memory area regressed because the spill-over from
> > the database buffer to RAM was not being preserved, because it was all
> > writes. That said, the same patch prevents the database being swapped,
> > so it's not all bad, but there have been consequences.
> > 
> > I don't have a problem with the patch although would prefer to have seen
> > more data for the series. However, I'm not entirely convinced that
> > thrash detection was the only problem. I think not activating pages on
> > write was a contributing factor although this patch looks better than
> > considering reverting bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.
> 
> We didn't backport this commit into our 4.6 kernel, so it couldn't
> have been a factor in our particular testing. But I will fully agree
> with you that this change probably exacerbates the problem.
> 

Ah, ok. I was not aware the patch couldn't have been part of what you
were seeing.

> Another example is the recent shrinking of the inactive list:
> 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list"). That
> one we did in fact backport, after which the problem we were already
> debugging got worse. That was a good hint where the problem was:
> 
> Every time we got better at keeping the clean hot cache separated out
> on the active list, we increased the concentration of dirty pages on
> the inactive list.

Somewhat ironic because the improved separation increases the
chances of kswapd writing out pages and direct reclaimers stalling on
wait_iff_congested.

> Whether this is workingset.c activating refaulting
> pages, whether that's not activating writeback cache, or whether that
> is shrinking the inactive list size, they all worked toward exposing
> the same deficiency in the reclaim-writeback model: that waiting for
> writes is worse than potentially causing reads. That flaw has always
> been there - since we had wait_on_page_writeback() in the reclaim
> scanner and the split between inactive and active cache. It was just
> historically much harder to trigger problems like this in practice.
> 
> That's why this is a regression over a period of kernel development
> and cannot really be pinpointed to a specific commit.
> 

Understood.

> This patch, by straight-up putting dirty/writeback pages at the head
> of the combined page cache double LRU regardless of access frequency,
> is making an explicit update to the reclaim-writeback model to codify
> the trade-off between writes and potential refaults. Any alternative
> (implementation differences aside of course) would require regressing
> use-once separation to previous levels in some form.
> 
> The lack of data is not great, agreed as well. The thing I can say is
> that for the hadoop workloads - and this is a whole spectrum of jobs
> running on hundreds of machines in a test group over several days -
> this patch series restores average job completions, allocation stalls,
> amount of kswapd-initiated IO, sys% and iowait% to 3.10 levels - with
> a high confidence, and no obvious metric that could have regressed.
> 

That's fair enough. It's rarely the case that a regression in a complex
workload has a single root cause. If it was, bisections would always work.

> Is there something specific that you would like to see tested? Aside
> from trying that load with more civilized flusher wakeups in kswapd?

Nothing specific that I'll force on you. At some point I'll shove Chris's
simoop workload through it, as it allegedly has similar properties to what
you're seeing. I only got around to examining it last week to see how it
behaved. It was very obvious that between 4.4 and 4.9 it started writing
heavily from reclaim context. However, it had also stopped swapping, which
pointed towards the grab_cache_page_write() commit. Kswapd scan rates had
also doubled. Detailed examination of the stall stats showed extremely long
stalls. I expect these patches to have an impact and would be surprised
if they didn't.

Similarly, any random read/write workload that is write intensive might
also be interesting although that might just hit the dirty balancing limits
if not tuned properly.

A write-only sysbench would also be interesting. That is also a workload
that between 4.4 and 4.9 had regressed severely. Partly this was dirty
pages getting to the tail of the LRU and the other part was the random
number generator reusing some pages that the activations preserved. I
think your patches would at least mitigate the first problem.

If you have the chance to do any of them, it would be nice, but the
patches make enough sense from plain review. If I thought they were
shakier, then I would make more of a fuss.

Patch

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 41d376e7116d..e030a68ead7e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -50,6 +50,13 @@  static __always_inline void add_page_to_lru_list(struct page *page,
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
+static __always_inline void add_page_to_lru_list_tail(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
+{
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	list_add_tail(&page->lru, &lruvec->lists[lru]);
+}
+
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
diff --git a/mm/swap.c b/mm/swap.c
index aabf2e90fe32..c4910f14f957 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,9 +209,10 @@  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
 {
 	int *pgmoved = arg;
 
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		ClearPageActive(page);
+		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
 		(*pgmoved)++;
 	}
 }
@@ -235,7 +236,7 @@  static void pagevec_move_tail(struct pagevec *pvec)
  */
 void rotate_reclaimable_page(struct page *page)
 {
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	if (!PageLocked(page) && !PageDirty(page) &&
 	    !PageUnevictable(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df0fe0cc438e..947ab6f4db10 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1063,7 +1063,7 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 			    PageReclaim(page) &&
 			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||
@@ -1081,7 +1081,7 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 3 above */
 			} else {
@@ -1174,7 +1174,7 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
-				goto keep_locked;
+				goto activate_locked;
 			}
 
 			if (references == PAGEREF_RECLAIM_CLEAN)