diff mbox series

[01/11] mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

Message ID 20200114002904.h9bSo%akpm@linux-foundation.org (mailing list archive)
State New, archived
Headers show
Series [01/11] mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations | expand

Commit Message

Andrew Morton Jan. 14, 2020, 12:29 a.m. UTC
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which
should only compact existing free memory, followed by another attempt that
can allocate from any node using reclaim/compaction effort specified by
global defrag setting and madvise.

This patch makes the following changes to the scheme:

- before the patch, the first allocation relies on a check for pageblock
  order and __GFP_IO to prevent excessive reclaim.  This however affects
  also the second attempt, which is not limited to single node.  Instead
  of that, reuse the existing check for costly order __GFP_NORETRY
  allocations, and make sure the first THP attempt uses __GFP_NORETRY.  As
  a side-effect, all costly order __GFP_NORETRY allocations will bail out
  if compaction needs reclaim, while previously they only bailed out when
  compaction was deferred due to previous failures.  This should be still
  acceptable within the __GFP_NORETRY semantics.

- before the patch, the second allocation attempt (on all nodes) was
  passing __GFP_NORETRY.  This is redundant as the check for pageblock
  order (discussed above) was stronger.  It's also contrary to
  madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
  requested.  After this patch, the second attempt doesn't pass
  __GFP_THISNODE nor __GFP_NORETRY.

To sum up, THP page faults now try the following attempts:

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
   allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c  |   10 +++++++---
 mm/page_alloc.c |   24 +++++-------------------
 2 files changed, 12 insertions(+), 22 deletions(-)

Comments

Linus Torvalds Jan. 14, 2020, 2:16 a.m. UTC | #1
On Mon, Jan 13, 2020 at 4:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Vlastimil Babka <vbabka@suse.cz>
> Subject: mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

I absolutely _detest_ how we've had this pattern of trying to change
the THP logic in a late -rc game.

Considering how unsuccessful some of the earlier attempts have been,
late -rc kernels really are *not* the time to make changes like this.

I'm going to take it this time, but

 (a) I'm going to revert *immediately* if anybody even peeps about the
new behavior

 (b) I want this to stop. Andrew - send THP "tweaks" during the merge
window, and not like some late-rc "fixes", when there are absolutely
_zero_ performance numbers or anything else critical associated with
the patch.

IOW, there is a decided lack of "bugfix" here. And if there actually
_were_ performance numbers, they should have been in the explanation.

So stop randomly messing with the THP allocation code, since it's
_known_ to be fragile with subtle side effects.

                    Linus
Vlastimil Babka Jan. 14, 2020, 8:05 a.m. UTC | #2
On 1/14/20 3:16 AM, Linus Torvalds wrote:
> On Mon, Jan 13, 2020 at 4:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Subject: mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations
> 
> I absolutely _detest_ how we've had this pattern of trying to change
> the THP logic in a late -rc game.
> 
> Considering how unsuccessful some of the earlier attempts have been,
> late -rc kernels really are *not* the time to make changes like this.

Agreed, myself would have been much more comfortable in rc1.

> I'm going to take it this time, but

Thanks.

>  (a) I'm going to revert *immediately* if anybody even peeps about the
> new behavior
> 
>  (b) I want this to stop. Andrew - send THP "tweaks" during the merge
> window, and not like some late-rc "fixes", when there are absolutely
> _zero_ performance numbers or anything else critical associated with
> the patch.
> 
> IOW, there is a decided lack of "bugfix" here. And if there actually
> _were_ performance numbers, they should have been in the explanation.

FWIW Mel did some tests [1] which showed some improvement to THP success rate
and no significant extra swapping that was feared. We didn't get confirmation
from Andrea about his real life workloads though.

[1] https://lore.kernel.org/linux-mm/20191113112042.GG28938@suse.de/

> So stop randomly messing with the THP allocation code, since it's
> _known_ to be fragile with subtle side effects.
> 
>                     Linus
>
Michal Hocko Jan. 14, 2020, 8:46 a.m. UTC | #3
On Tue 14-01-20 09:05:57, Vlastimil Babka wrote:
> On 1/14/20 3:16 AM, Linus Torvalds wrote:
> > On Mon, Jan 13, 2020 at 4:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> From: Vlastimil Babka <vbabka@suse.cz>
> >> Subject: mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations
> > 
> > I absolutely _detest_ how we've had this pattern of trying to change
> > the THP logic in a late -rc game.
> > 
> > Considering how unsuccessful some of the earlier attempts have been,
> > late -rc kernels really are *not* the time to make changes like this.
> 
> Agreed, myself would have been much more comfortable in rc1.

Yeah I do agree. The patch is a result of a regression I have reported
back in early October (http://lkml.kernel.org/r/20191001083743.GC15624@dhcp22.suse.cz).
The workload is quite trivial albeit artificial which doesn't make it
super urgent. Still something to have addressed though.  It has been
mostly ignored for quite some time until Vlastimil came with the patch
which improved the situation while not causing any other obvious problems
(http://lkml.kernel.org/r/20191029151549.GO31513@dhcp22.suse.cz).
It has been sitting in the mmotm tree since Oct 29 and in linux-next
around a similar time without any reports. 5.5-rc1 (Dec 8) would have
been a good time to target but I understand that a more time in
linux-next wouldn't hurt and targeting 5.6-rc1 should be sufficient.
diff mbox series

Patch

--- a/mm/mempolicy.c~mm-thp-tweak-reclaim-compaction-effort-of-local-only-and-all-node-allocations
+++ a/mm/mempolicy.c
@@ -2148,18 +2148,22 @@  alloc_pages_vma(gfp_t gfp, int order, st
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * First, try to allocate THP only on local node, but
+			 * don't reclaim unnecessarily, just compact.
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
 
 			/*
 			 * If hugepage allocations are configured to always
 			 * synchronous compact or the vma has been madvised
 			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * memory with both reclaim and compact as well.
 			 */
 			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
 				page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_NORETRY, order);
+								gfp, order);
 
 			goto out;
 		}
--- a/mm/page_alloc.c~mm-thp-tweak-reclaim-compaction-effort-of-local-only-and-all-node-allocations
+++ a/mm/page_alloc.c
@@ -4476,8 +4476,11 @@  retry_cpuset:
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
-		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes some THP page fault allocations
+		 */
+		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
@@ -4498,23 +4501,6 @@  retry_cpuset:
 			if (compact_result == COMPACT_SKIPPED ||
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;
-		}
-
-		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes THP page fault allocations
-		 */
-		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
-			 */
-			if (compact_result == COMPACT_DEFERRED)
-				goto nopage;
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but