Message ID | 20200225141534.5044-1-mgorman@techsingularity.net (mailing list archive)
Series     | Limit runaway reclaim due to watermark boosting
On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> Ivan Babrou reported the following

http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
is helpful.

> Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> an external fragmentation event occurs") introduced undesired
> effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before we saw free memory hover at around 1.4GB with no
> spikes. After the upgrade we saw some machines decide that they
> need a lot more than that, with frequent spikes above 10GB,
> often only on a single numa node.
>
> There have been a few reports recently that might be watermark boost
> related. Unfortunately, finding someone that can reproduce the problem
> and test a patch has been problematic. This series intends to limit
> potential damage only.

It's problematic that we don't understand what's happening. And these
palliatives can only reduce our ability to do that.

Rik seems to have the means to reproduce this (or something similar)
and it seems Ivan can test patches three weeks hence. So how about a
debug patch which will help figure out what's going on in there?
On Tue, Feb 25, 2020 at 06:51:30PM -0800, Andrew Morton wrote:
> On Tue, 25 Feb 2020 14:15:31 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > Ivan Babrou reported the following
>
> http://lkml.kernel.org/r/CABWYdi1eOUD1DHORJxTsWPMT3BcZhz++xP1pXhT=x4SgxtgQZA@mail.gmail.com
> is helpful.
>

Noted for future reference.

> > Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when
> > an external fragmentation event occurs") introduced undesired
> > effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before we saw free memory hover at around 1.4GB with no
> > spikes. After the upgrade we saw some machines decide that they
> > need a lot more than that, with frequent spikes above 10GB,
> > often only on a single numa node.
> >
> > There have been a few reports recently that might be watermark boost
> > related. Unfortunately, finding someone that can reproduce the problem
> > and test a patch has been problematic. This series intends to limit
> > potential damage only.
>
> It's problematic that we don't understand what's happening. And these
> palliatives can only reduce our ability to do that.
>

Not for certain, no, but we do know that there are conditions whereby
node 0 can end up reclaiming excessively for extended periods of time.
The available evidence does match a pattern whereby a lower zone on
node 0 is getting stuck in a boosted state.

> Rik seems to have the means to reproduce this (or something similar)
> and it seems Ivan can test patches three weeks hence.

If Rik can reproduce it, great, but I have a strong feeling that Ivan
may never be able to test this if it requires a production machine,
which is why I did not wait the three weeks.

> So how about a
> debug patch which will help figure out what's going on in there?

A debug patch would not help much in this case given that we have
tracepoints. An ftrace capture containing mm_page_alloc_extfrag,
mm_vmscan_kswapd_wake, mm_vmscan_wakeup_kswapd and
mm_vmscan_node_reclaim_begin, taken for 30 seconds while the problem is
occurring, would work. Ideally mm_vmscan_lru_shrink_inactive would also
be included to capture the priority, but the size of the trace is
what's going to be problematic.

mm_page_alloc_extfrag would be correlated with the conditions that
boost the watermarks and the others would track what kswapd is doing to
see if it's persistently reclaiming. If it is,
mm_vmscan_lru_shrink_inactive would tell whether it's persistently
reclaiming at priority DEF_PRIORITY - 2, which would prove the patch
would at least mitigate the problem.

It would be preferable to have a description of a testcase that
reproduces the problem so I can capture/analyse the trace myself. It
would also be something I could slot into a test grid to catch the
problem happening again in the future.
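[Editor's note: a minimal sketch of how the requested capture could be taken
with trace-cmd, assuming trace-cmd is installed and the kernel exposes the
tracepoints named above; the raw tracefs interface under
/sys/kernel/debug/tracing would work equally well. The 30-second window
matches the request in the reply.]

    # Enable the requested events and record for 30 seconds while the
    # excessive reclaim is happening; this writes trace.dat in the
    # current directory.
    trace-cmd record \
        -e kmem:mm_page_alloc_extfrag \
        -e vmscan:mm_vmscan_kswapd_wake \
        -e vmscan:mm_vmscan_wakeup_kswapd \
        -e vmscan:mm_vmscan_node_reclaim_begin \
        sleep 30

    # Optionally add -e vmscan:mm_vmscan_lru_shrink_inactive to capture
    # the reclaim priority, at the cost of a much larger trace.

    # Produce a human-readable report from trace.dat for analysis.
    trace-cmd report > trace.txt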