[0/3] Limit runaway reclaim due to watermark boosting

Message ID 20200225141534.5044-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman Feb. 25, 2020, 2:15 p.m. UTC
Ivan Babrou reported the following

	Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
	an external fragmentation event occurs") introduced undesired
	effects in our environment.

	  * NUMA with 2 x CPU
	  * 128GB of RAM
	  * THP disabled
	  * Upgraded from 4.19 to 5.4

	Before we saw free memory hover at around 1.4GB with no
	spikes. After the upgrade we saw some machines decide that they
	need a lot more than that, with frequent spikes above 10GB,
	often only on a single numa node.

There have been a few reports recently that might be watermark boost
related. Unfortunately, finding someone that can reproduce the problem
and test a patch has been problematic.  This series intends to limit
potential damage only.

Patch 1 disables boosting on small memory systems.

Patch 2 disables boosting by default if THP is disabled on the kernel
	command line on the basis that boosting primarily helps THP
	allocation latency. It is not touched at runtime to avoid
	overriding an explicit user request.

Patch 3 disables boosting if kswapd priority is elevated to avoid
	excessive reclaim (a rough sketch of the idea follows).
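
To make the intent of patch 3 concrete, here is an illustrative sketch of
the idea rather than the actual patch. The struct and helper names are
hypothetical; the point is only that once kswapd has had to elevate its
reclaim priority, the boosted watermark stops being treated as the target:

struct zone_marks {
	unsigned long high;	/* normal high watermark, in pages */
	unsigned long boost;	/* temporary boost after an external fragmentation event */
};

#define DEF_PRIORITY	12	/* kswapd starts here and lowers it under pressure */

/* Hypothetical helper: pick the watermark kswapd reclaims towards. */
static unsigned long reclaim_target(const struct zone_marks *zm, int priority)
{
	/*
	 * An elevated priority means reclaim is already struggling, so stop
	 * chasing the boost and fall back to the normal high watermark
	 * instead of reclaiming excessively.
	 */
	if (priority < DEF_PRIORITY - 2)
		return zm->high;

	return zm->high + zm->boost;
}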

 mm/huge_memory.c |  1 +
 mm/internal.h    |  6 +++++-
 mm/page_alloc.c  |  6 ++++--
 mm/vmscan.c      | 46 +++++++++++++++++++++++++++++++---------------
 4 files changed, 41 insertions(+), 18 deletions(-)

Comments

Andrew Morton Feb. 26, 2020, 2:51 a.m. UTC | #1
On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> Ivan Babrou reported the following

http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
is helpful.

> 	Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> 	an external fragmentation event occurs") introduced undesired
> 	effects in our environment.
> 
> 	  * NUMA with 2 x CPU
> 	  * 128GB of RAM
> 	  * THP disabled
> 	  * Upgraded from 4.19 to 5.4
> 
> 	Before we saw free memory hover at around 1.4GB with no
> 	spikes. After the upgrade we saw some machines decide that they
> 	need a lot more than that, with frequent spikes above 10GB,
> 	often only on a single numa node.
> 
> There have been a few reports recently that might be watermark boost
> related. Unfortunately, finding someone that can reproduce the problem
> and test a patch has been problematic.  This series intends to limit
> potential damage only.

It's problematic that we don't understand what's happening.  And these
palliatives can only reduce our ability to do that.

Rik seems to have the means to reproduce this (or something similar)
and it seems Ivan can test patches three weeks hence.  So how about a
debug patch which will help figure out what's going on in there?
Mel Gorman Feb. 26, 2020, 8:04 a.m. UTC | #2
On Tue, Feb 25, 2020 at 06:51:30PM -0800, Andrew Morton wrote:
> On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > Ivan Babrou reported the following
> 
> http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
> is helpful.
> 

Noted for future reference.

> > 	Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> > 	an external fragmentation event occurs") introduced undesired
> > 	effects in our environment.
> > 
> > 	  * NUMA with 2 x CPU
> > 	  * 128GB of RAM
> > 	  * THP disabled
> > 	  * Upgraded from 4.19 to 5.4
> > 
> > 	Before we saw free memory hover at around 1.4GB with no
> > 	spikes. After the upgrade we saw some machines decide that they
> > 	need a lot more than that, with frequent spikes above 10GB,
> > 	often only on a single numa node.
> > 
> > There have been a few reports recently that might be watermark boost
> > related. Unfortunately, finding someone that can reproduce the problem
> > and test a patch has been problematic.  This series intends to limit
> > potential damage only.
> 
> It's problematic that we don't understand what's happening.  And these
> palliatives can only reduce our ability to do that.
> 

Not for certain, no, but we do know that there are conditions whereby
node 0 can end up reclaiming excessively for extended periods of time.
The available evidence does match a pattern whereby a lower zone on node
0 is getting stuck in a boosted state.

> Rik seems to have the means to reproduce this (or something similar)
> and it seems Ivan can test patches three weeks hence. 

If Rik can reproduce it, great, but I have a strong feeling that Ivan may
never be able to test this if it requires a production machine, which is
why I did not wait the three weeks.

> So how about a
> debug patch which will help figure out what's going on in there?

A debug patch would not help much in this case given that we
have tracepoints. An ftrace capture containing mm_page_alloc_extfrag,
mm_vmscan_kswapd_wake, mm_vmscan_wakeup_kswapd and
mm_vmscan_node_reclaim_begin, taken for 30 seconds while the problem
is occurring, would be a big help. Ideally mm_vmscan_lru_shrink_inactive
would also be included to capture the reclaim priority, but the size of
the trace is what's going to be problematic.
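
For reference, a minimal sketch of one way to capture such a 30-second
trace is below. It assumes root, the usual tracefs mount at
/sys/kernel/tracing and the standard kmem/vmscan event groups; trace-cmd
or a few shell redirects would do the same job, so treat it as
illustrative rather than a required tool.

#include <stdio.h>
#include <unistd.h>

/* Write a single value to a tracefs control file. */
static void tracefs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	static const char *events[] = {
		"events/kmem/mm_page_alloc_extfrag/enable",
		"events/vmscan/mm_vmscan_kswapd_wake/enable",
		"events/vmscan/mm_vmscan_wakeup_kswapd/enable",
		"events/vmscan/mm_vmscan_node_reclaim_begin/enable",
		"events/vmscan/mm_vmscan_lru_shrink_inactive/enable",
	};
	char path[256];
	unsigned int i;

	/* Enable the tracepoints of interest. */
	for (i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
		snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", events[i]);
		tracefs_write(path, "1");
	}

	/* Record for 30 seconds while the problem is occurring. */
	tracefs_write("/sys/kernel/tracing/tracing_on", "1");
	sleep(30);
	tracefs_write("/sys/kernel/tracing/tracing_on", "0");

	/* The result can then be read from /sys/kernel/tracing/trace. */
	return 0;
}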

mm_page_alloc_extfrag would be correlated with the conditions that boost
the watermarks, and the others would track what kswapd is doing to see if
it's persistently reclaiming. If it is, mm_vmscan_lru_shrink_inactive
would tell whether it's persistently reclaiming at priority DEF_PRIORITY - 2,
which would prove the patch would at least mitigate the problem.

It would be preferable to have a description of a testcase that
reproduces the problem so that I can capture/analyse the trace myself.
It would also be something I could slot into a test grid to catch the
problem happening again in the future.