[2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always

Message ID 20180820032204.9591-3-aarcange@redhat.com (mailing list archive)
State New, archived
Series fix for "pathological THP behavior"

Commit Message

Andrea Arcangeli Aug. 20, 2018, 3:22 a.m. UTC
qemu uses MADV_HUGEPAGE, which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction combined with the NUMA
__GFP_THISNODE logic in mempolicy.c is telling reclaim to swap the
local node very hard, instead of failing the allocation if there's no
THP available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA THP
backed memory. That largely depends on the remote latency though; on
threadrippers, for example, the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because it will try very hard to
unsuccessfully swap out get_user_pages pinned pages as a result of
__GFP_THISNODE being set, instead of falling back to PAGE_SIZE
allocations and instead of trying to allocate THP on other nodes (it
would be even worse without vfio type1 GUP pins of course, except it'd
be swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.
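
A minimal reproducer sketch (hypothetical test program, not part of
this patch; the 18g size assumes a host with 16g per node):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* map more anonymous memory than fits in one NUMA node */
		size_t size = 18UL << 30;
		char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* what qemu does; defrag=always makes it implicit for all vmas */
		madvise(p, size, MADV_HUGEPAGE);
		/* touching the memory faults in THP and, with __GFP_THISNODE,
		 * triggers the local node swap storm */
		memset(p, 0xff, size);
		return 0;
	}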

1) This can be fixed by retaining the __GFP_THISNODE logic also for
   __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
   COMPACT_SKIPPED (i.e. compaction failing because there is not
   enough free memory in the zone) should be allowed to invoke reclaim.

2) An alternative is not to use __GFP_THISNODE if __GFP_DIRECT_RECLAIM
   has been set by the caller (i.e. MADV_HUGEPAGE or
   defrag="always"). That would keep the NUMA locality restriction
   only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
   will be provided from remote nodes if available before falling back
   to PAGE_SIZE units in the local node, but an app using defrag =
   always (or madvise with MADV_HUGEPAGE) supposedly prefers that (a
   sketch of this alternative follows below).
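
A sketch of what 2) would look like in alloc_pages_vma() (illustration
only; this patch implements 1) instead):

	/* drop the local node restriction when the caller allows
	 * direct reclaim (MADV_HUGEPAGE or defrag=always) */
	if (!(gfp & __GFP_DIRECT_RECLAIM))
		gfp |= __GFP_THISNODE;
	page = __alloc_pages_node(hpage_node, gfp, order);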

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because, by pinning all memory, it avoids the
swapping and reclaim only wastes CPU; a memhog based test would
create swapout storms and supposedly show a bigger stddev.

What is better between 1) and 2) depends on the hardware and on the
software. Virtualization with EPT/NPT gets a bigger boost from THP
than host applications do.

This commit implements 1).

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/gfp.h | 18 ++++++++++++++++++
 mm/mempolicy.c      | 12 +++++++++++-
 mm/page_alloc.c     |  4 ++++
 3 files changed, 33 insertions(+), 1 deletion(-)

Comments

Zi Yan Aug. 20, 2018, 12:35 p.m. UTC | #1
On 19 Aug 2018, at 23:22, Andrea Arcangeli wrote:

<snip>
>
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/gfp.h | 18 ++++++++++++++++++
>  mm/mempolicy.c      | 12 +++++++++++-
>  mm/page_alloc.c     |  4 ++++
>  3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index a6afcec53795..3c04d5d90e6d 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -44,6 +44,7 @@ struct vm_area_struct;
>  #else
>  #define ___GFP_NOLOCKDEP	0
>  #endif
> +#define ___GFP_ONLY_COMPACT	0x1000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
>  /*
> @@ -178,6 +179,21 @@ struct vm_area_struct;
>   *   definitely preferable to use the flag rather than opencode endless
>   *   loop around allocator.
>   *   Using this flag for costly allocations is _highly_ discouraged.
> + *
> + * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to make the
> + * allocation succeed by freeing memory. Never risk freeing any
> + * PAGE_SIZE memory unit even if compaction failed specifically
> + * because of not enough free pages in the zone. This only makes
> + * sense in combination with __GFP_THISNODE (enforced with a
> + * VM_WARN_ON), to restrict the THP allocation to the local node that
> + * triggered the page fault and to fall back to PAGE_SIZE allocations
> + * in the same node. We don't want to invoke reclaim because there
> + * may be plenty of free memory already in the local node. More
> + * importantly there may even be plenty of free THP available in
> + * remote nodes, so we should allocate those instead of reclaiming
> + * any memory in the local node. Implementation detail: set
> + * ___GFP_NORETRY too so that ___GFP_ONLY_COMPACT only needs to be
> + * checked in the slow path.
>   */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)
>  #define __GFP_FS	((__force gfp_t)___GFP_FS)
> @@ -187,6 +203,8 @@ struct vm_area_struct;
>  #define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
>  #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
>  #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
> +#define __GFP_ONLY_COMPACT	((__force gfp_t)(___GFP_NORETRY | \
> +						 ___GFP_ONLY_COMPACT))
>
>  /*
>   * Action modifiers
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d6512ef28cde..6bf839f20dcc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2047,8 +2047,18 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * We restricted the allocation to the
> +			 * hpage_node so we must use
> +			 * __GFP_ONLY_COMPACT to allow at most a
> +			 * compaction attempt and not ever get into
> +			 * reclaim or it'll swap heavily with
> +			 * transparent_hugepage/defrag = always (or
> +			 * madvise under MADV_HUGEPAGE).
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +						  gfp | __GFP_THISNODE |
> +						  __GFP_ONLY_COMPACT, order);
>  			goto out;
>  		}
>  	}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a790ef4be74e..01a5c2bd0860 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4144,6 +4144,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			 */
>  			if (compact_result == COMPACT_DEFERRED)
>  				goto nopage;
> +			if (gfp_mask & __GFP_ONLY_COMPACT) {
> +				VM_WARN_ON(!(gfp_mask & __GFP_THISNODE));
> +				goto nopage;
> +			}
>
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but


I think this can also be triggered in khugepaged. In collapse_huge_page(), khugepaged_alloc_page()
would also cause DIRECT_RECLAIM if defrag==always, since GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM.
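
For reference, the path I mean boils down to roughly this (sketched
from memory of the current sources, so details may vary by version):

	/* mm/khugepaged.c */
	static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
	{
		/* with khugepaged defrag enabled this returns
		 * GFP_TRANSHUGE, which includes __GFP_DIRECT_RECLAIM */
		return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
	}

	/* ... and in collapse_huge_page(): */
	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
	new_page = khugepaged_alloc_page(hpage, gfp, node);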

But is it an expected behavior of khugepaged?


—
Best Regards,
Yan Zi
Andrea Arcangeli Aug. 20, 2018, 3:32 p.m. UTC | #2
Hello,

On Mon, Aug 20, 2018 at 08:35:17AM -0400, Zi Yan wrote:
> I think this can also be triggered in khugepaged. In collapse_huge_page(), khugepaged_alloc_page()
> would also cause DIRECT_RECLAIM if defrag==always, since GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM.
> 
> But is it an expected behavior of khugepaged?

That's a good point, and the answer is not obvious. It's not an
apples-to-apples comparison because khugepaged is not increasing the
overall memory footprint of the app. The pages that get compacted get
freed later. However, not all memory of the node may be possible to
compact 100%, so it's not perfectly a 1:1 exchange and some swap could
be required for compaction to succeed.

So we may also want to look at khugepaged later, but it's not obvious
it needs fixing too. Adding __GFP_ONLY_COMPACT to khugepaged would
weaken it a bit, especially if compaction returns COMPACT_SKIPPED.

By contrast, in the main allocation path (where the memory footprint
definitely increases) I even tried to still allow reclaim just for
COMPACT_SKIPPED, and it still caused swapout storms, because new THP
kept being added to the local node as the old memory was swapped out
to make more free memory to compact more 4k pages. Otherwise I could
have gotten away with using __GFP_NORETRY instead of
__GFP_ONLY_COMPACT, but it wasn't nearly enough.

Similarly to khugepaged, NUMA balancing also uses __GFP_THISNODE, but
if it decided to migrate a page to such a node, supposedly there's a
good reason to call reclaim to allocate the page there if needed. That
also frees memory on one node and adds memory to another node, and it
doesn't increase the memory footprint overall (unlike khugepaged,
though, it increases the footprint cross-node).

Thanks,
Andrea
Michal Hocko Aug. 21, 2018, 11:50 a.m. UTC | #3
On Sun 19-08-18 23:22:04, Andrea Arcangeli wrote:
> qemu uses MADV_HUGEPAGE, which allows direct compaction (i.e.
> __GFP_DIRECT_RECLAIM is set).
> 
> The problem is that direct compaction combined with the NUMA
> __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap the
> local node very hard, instead of failing the allocation if there's no
> THP available in the local node.
> 
> Such logic was ok until __GFP_THISNODE was added to the THP allocation
> path even with MPOL_DEFAULT.
> 
> The idea behind the __GFP_THISNODE addition is that it is better to
> provide local memory in PAGE_SIZE units than to use remote NUMA THP
> backed memory. That largely depends on the remote latency though; on
> threadrippers, for example, the overhead is relatively low in my
> experience.
> 
> The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> extremely slow qemu startup with vfio, if the VM is larger than the
> size of one host NUMA node. This is because it will try very hard to
> unsuccessfully swap out get_user_pages pinned pages as a result of
> __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> allocations and instead of trying to allocate THP on other nodes (it
> would be even worse without vfio type1 GUP pins of course, except it'd
> be swapping heavily instead).
> 
> It's very easy to reproduce this by setting
> transparent_hugepage/defrag to "always", even with a simple memhog.
> 
> 1) This can be fixed by retaining the __GFP_THISNODE logic also for
>    __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
>    COMPACT_SKIPPED (i.e. compaction failing because there is not
>    enough free memory in the zone) should be allowed to invoke reclaim.
> 
> 2) An alternative is not to use __GFP_THISNODE if __GFP_DIRECT_RECLAIM
>    has been set by the caller (i.e. MADV_HUGEPAGE or
>    defrag="always"). That would keep the NUMA locality restriction
>    only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
>    will be provided from remote nodes if available before falling back
>    to PAGE_SIZE units in the local node, but an app using defrag =
>    always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

So does reverting 5265047ac301 ("mm, thp: really limit transparent
hugepage allocation to local node") help?

I really detest a new gfp flag for a one-time semantic that is muddy
as hell.

> + * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to make the
> + * allocation succeed by freeing memory. Never risk freeing any
> + * PAGE_SIZE memory unit even if compaction failed specifically
> + * because of not enough free pages in the zone. This only makes
> + * sense in combination with __GFP_THISNODE (enforced with a
> + * VM_WARN_ON), to restrict the THP allocation to the local node that
> + * triggered the page fault and to fall back to PAGE_SIZE allocations
> + * in the same node. We don't want to invoke reclaim because there
> + * may be plenty of free memory already in the local node. More
> + * importantly there may even be plenty of free THP available in
> + * remote nodes, so we should allocate those instead of reclaiming
> + * any memory in the local node. Implementation detail: set
> + * ___GFP_NORETRY too so that ___GFP_ONLY_COMPACT only needs to be
> + * checked in the slow path.

This is simply incomprehensible. How can anybody who is not deeply
familiar with the allocator/reclaim internals know when to use it?

If this is really a regression then we should start by pinpointing the
real culprit and go from there. If this is really 5265047ac301 then just
start by reverting it. I strongly suspect there is some mismatch in
expectations here. What others consider acceptable seems to be a problem
for others. I believe that was one of the reasons why we have changed
the default THP direct compaction behavior, no?
Andrea Arcangeli Aug. 21, 2018, 9:40 p.m. UTC | #4
On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> So does reverting 5265047ac301 ("mm, thp: really limit transparent
> hugepage allocation to local node") help?

That won't revert cleanly; to be sure you'd need to bisect around it,
which I haven't tried because I don't think there was doubt around it
(and only the part in mempolicy.c is relevant to it).

It's not so important to focus on that commit because the code changed
again later; I'm focusing on the code as a whole.

It's not just that commit that is the problem here: the problem starts
in the previous commits that add the NUMA locality logic, and the
addition of __GFP_THISNODE is the icing on the cake that makes things
fall apart.

> I really detest a new gfp flag for a one-time semantic that is muddy
> as hell.

Well there's no way to fix this other than to prevent reclaim from
running, if you still want to give page faults a chance to obtain THP
under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
catch up with it.

> This is simply incomprehensible. How can anybody who is not deeply
> familiar with the allocator/reclaim internals know when to use it?

Nobody should use this in drivers; it's a __GFP flag.

Note:

	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {

#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

Only THP is ever affected by the bug, so nothing else will ever need to
call __GFP_ONLY_COMPACT. It is a VM-internal flag; I wish there was a
way to make the build fail if a driver used it, but there isn't right
now.

All other order 9 or 10 allocations from drivers won't call
alloc_hugepage_vma. Only mm/huge_memory.c ever calls that, which is why
the regression only happens with MADV_HUGEPAGE (i.e. qemu is being
bitten badly on NUMA hosts).

> If this is really a regression then we should start by pinpointing the

You can check yourself: create a 2 node vNUMA guest or pick any host
with more than one node. Set defrag=always and run "memhog -r11111111
18g" if the host has 16g per node. Add some swap and notice the swap
storm while all RAM is left free in the other node.

> real culprit and go from there. If this is really 5265047ac301 then just

In my view there's no single culprit, but it was easy to identify the
last drop that made the MADV_HUGEPAGE glass overflow, and it's that
commit that adds __GFP_THISNODE. The combination of the previous code
that prioritized NUMA over THP and then the MADV_HUGEPAGE logic that
still uses compaction (and in turn reclaim if compaction fails with
COMPACT_SKIPPED because there's no 4k page in the local node) just
falls apart with __GFP_THISNODE set as well on top of it and it
doesn't do the expected thing either without it (i.e. THP gets
priority over NUMA locality without such flag).

__GFP_THISNODE and the logic there only work ok when
__GFP_DIRECT_RECLAIM is not set, i.e. when MADV_HUGEPAGE is not set.

We don't want to wait hours for khugepaged to catch up in qemu to get
THP. Compaction is certainly worth running if userland explicitly
gives the kernel the hint that the allocations are long lived.

> start by reverting it. I strongly suspect there is some mismatch in
> expectations here. What some consider acceptable seems to be a problem
> for others. I believe that was one of the reasons why we have changed
> the default THP direct compaction behavior, no?

This is not about the "default" though; I'm not changing the default
either. This is about the MADV_HUGEPAGE behavior and what happens when
you change the defrag "default" value to "always" (which is equivalent
to having all vmas set with MADV_HUGEPAGE).

QEMU is being optimal in setting MADV_HUGEPAGE, and rightfully so, but
it's getting punished badly because of this kernel bug.

Thanks,
Andrea
Michal Hocko Aug. 22, 2018, 9:02 a.m. UTC | #5
On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
[...]
> > I really detest a new gfp flag for a one-time semantic that is muddy
> > as hell.
> 
> Well there's no way to fix this other than to prevent reclaim from
> running, if you still want to give page faults a chance to obtain THP
> under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
> catch up with it.

I do not get that part. Why should the caller even care about reclaim
vs. compaction? How can you even make an educated guess what makes more
sense? This should be fully controlled by the allocator path. The caller
should only care about how hard to try. It's been some time since I've
looked but we used to have gfp flags to tell that for THP allocations
as well.

> > This is simply incomprehensible. How can anybody who is not deeply
> > familiar with the allocator/reclaim internals know when to use it?
> 
> Nobody should use this in drivers; it's a __GFP flag.

Like other __GFP flags (e.g. __GFP_NOWARN, __GFP_COMP, __GFP_ZERO and
many others)?

> Note:
> 
> 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> 
> #define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
> 	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
> 
> Only THP is ever affected by the bug, so nothing else will ever need to
> call __GFP_ONLY_COMPACT. It is a VM-internal flag; I wish there was a
> way to make the build fail if a driver used it, but there isn't right
> now.

My experience tells me that there is nothing like an internal gfp flag
and abusing them is quite common and really hard to get rid of. Having a
THP specific gfp flag has also turned out to be a bad idea (e.g. GFP_THISNODE).

> > If this is really a regression then we should start by pinpointing the
> 
> You can check yourself: create a 2 node vNUMA guest or pick any host
> with more than one node. Set defrag=always and run "memhog -r11111111
> 18g" if the host has 16g per node. Add some swap and notice the swap
> storm while all RAM is left free in the other node.

I am not disputing the bug itself. How hard defrag=always should really
try is a good question and I would say different people would have
different ideas, but a swapping storm sounds like genuinely unwanted
behavior. I would expect that to be handled in the reclaim/compaction.
GFP_TRANSHUGE doesn't have ___GFP_RETRY_MAYFAIL so it shouldn't really
try too hard to reclaim.
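
For reference, the relevant definitions in include/linux/gfp.h (quoted
from memory, double check the tree):

	#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
	#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)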

> > real culprit and go from there. If this is really 5265047ac301 then just
> 
> In my view there's no single culprit, but it was easy to identify the
> last drop that made the MADV_HUGEPAGE glass overflow, and it's that
> commit that adds __GFP_THISNODE. The combination of the previous code
> that prioritized NUMA over THP and then the MADV_HUGEPAGE logic that
> still uses compaction (and in turn reclaim if compaction fails with
> COMPACT_SKIPPED because there's no 4k page in the local node) just
> falls apart with __GFP_THISNODE set as well on top of it and it
> doesn't do the expected thing either without it (i.e. THP gets
> priority over NUMA locality without such flag).
> 
> __GFP_THISNODE and the logic there only work ok when
> __GFP_DIRECT_RECLAIM is not set, i.e. when MADV_HUGEPAGE is not set.

I still have to digest the __GFP_THISNODE thing but I _think_ that the
alloc_pages_vma code is just trying to be overly clever and
__GFP_THISNODE is not a good fit for it.
Michal Hocko Aug. 22, 2018, 11:07 a.m. UTC | #6
On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> [...]
> > > I really detest a new gfp flag for a one-time semantic that is muddy
> > > as hell.
> > 
> > Well there's no way to fix this other than to prevent reclaim from
> > running, if you still want to give page faults a chance to obtain THP
> > under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
> > catch up with it.
> 
> I do not get that part. Why should the caller even care about reclaim
> vs. compaction? How can you even make an educated guess what makes more
> sense? This should be fully controlled by the allocator path. The caller
> should only care about how hard to try. It's been some time since I've
> looked but we used to have gfp flags to tell that for THP allocations
> as well.

In other words, why do we even try to swap out when allocating a costly
high-order page for requests which do not insist on trying really hard?

I mean why don't we do something like this?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..41005d3d4c2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3071,6 +3071,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
 		return 1;
 
+	/*
+	 * If we are allocating a costly order and do not insist on trying really
+	 * hard then we should keep the reclaim impact at minimum. So only
+	 * focus on easily reclaimable memory.
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+		sc.may_swap = sc.may_unmap = 0;
+
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
 				sc.gfp_mask,
Andrea Arcangeli Aug. 22, 2018, 2:24 p.m. UTC | #7
On Wed, Aug 22, 2018 at 01:07:37PM +0200, Michal Hocko wrote:
> On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> > On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> > [...]
> > > > I really detest a new gfp flag for a one-time semantic that is muddy
> > > > as hell.
> > > 
> > > Well there's no way to fix this other than to prevent reclaim from
> > > running, if you still want to give page faults a chance to obtain THP
> > > under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
> > > catch up with it.
> > 
> > I do not get that part. Why should the caller even care about reclaim
> > vs. compaction? How can you even make an educated guess what makes more
> > sense? This should be fully controlled by the allocator path. The caller
> > should only care about how hard to try. It's been some time since I've
> > looked but we used to have gfp flags to tell that for THP allocations
> > as well.
> 
> In other words, why do we even try to swap out when allocating a costly
> high-order page for requests which do not insist on trying really hard?

Note that the testcase with vfio swaps nothing and writes nothing to
disk. No memory at all is being swapped or freed because 100% of the
node is pinned with GUP pins, so I'm dubious this could possibly move
the needle for the reproducer that I used for the benchmark.

The swap storm is what I suggested to you as a reproducer, because it's
another way the bug sees the light of day and it's easier to reproduce
without requiring device assignment; but the badness is the fact that
reclaim is called when it shouldn't be, and whatever the fix is, it
must cover vfio too. I can't imagine how the below could possibly have
an effect on vfio, and even for the swap storm case you're converting
a swap storm into a CPU waste; it'll still run just extremely slow
allocations like with vfio.

The effect of the below should be evaluated regardless of the issue
we've been discussing in this thread, and it's a new corner case for
order > PAGE_ALLOC_COSTLY_ORDER. I don't like order >
PAGE_ALLOC_COSTLY_ORDER checks very much; those are arbitrary numbers,
and the more such checks are needed in various places, the more it's a
sign the VM is bad and arbitrary, with one more corner case required
to hide some badness. But again, this will have effects unrelated to
what we're discussing here; it will just convert I/O into CPU waste
and have no effect on vfio.

> 
> I mean why don't we do something like this?
> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 03822f86f288..41005d3d4c2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3071,6 +3071,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
>  		return 1;
>  
> +	/*
> +	 * If we are allocating a costly order and do not insist on trying really
> +	 * hard then we should keep the reclaim impact at minimum. So only
> +	 * focus on easily reclaimable memory.
> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> +		sc.may_swap = sc.may_unmap = 0;
> +
>  	trace_mm_vmscan_direct_reclaim_begin(order,
>  				sc.may_writepage,
>  				sc.gfp_mask,
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko Aug. 22, 2018, 2:45 p.m. UTC | #8
On Wed 22-08-18 10:24:46, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 01:07:37PM +0200, Michal Hocko wrote:
> > On Wed 22-08-18 11:02:14, Michal Hocko wrote:
> > > On Tue 21-08-18 17:40:49, Andrea Arcangeli wrote:
> > > > On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I really detest a new gfp flag for a one-time semantic that is muddy
> > > > > as hell.
> > > > 
> > > > Well there's no way to fix this other than to prevent reclaim from
> > > > running, if you still want to give page faults a chance to obtain THP
> > > > under MADV_HUGEPAGE without waiting minutes or hours for khugepaged to
> > > > catch up with it.
> > > 
> > > I do not get that part. Why should the caller even care about reclaim
> > > vs. compaction? How can you even make an educated guess what makes more
> > > sense? This should be fully controlled by the allocator path. The caller
> > > should only care about how hard to try. It's been some time since I've
> > > looked but we used to have gfp flags to tell that for THP allocations
> > > as well.
> > 
> > In other words, why do we even try to swap out when allocating a costly
> > high-order page for requests which do not insist on trying really hard?
> 
> Note that the testcase with vfio swaps nothing and writes nothing to
> disk. No memory at all is being swapped or freed because 100% of the
> node is pinned with GUP pins, so I'm dubious this could possibly move
> the needle for the reproducer that I used for the benchmark.

Now I am confused. How can compaction help at all then? I mean if the
node is full of GUP pins then you can hardly do anything but fall back
to another node. Or how come your new GFP flag makes any difference?

> The swap storm is what I suggested to you as a reproducer, because it's
> another way the bug sees the light of day and it's easier to reproduce
> without requiring device assignment; but the badness is the fact that
> reclaim is called when it shouldn't be, and whatever the fix is, it
> must cover vfio too. I can't imagine how the below could possibly have
> an effect on vfio, and even for the swap storm case you're converting
> a swap storm into a CPU waste; it'll still run just extremely slow
> allocations like with vfio.

It would still try to reclaim easy targets as compaction requires. If
you do not reclaim at all you can make the current implementation of
compaction a noop due to its own watermark checks IIRC.

> The effect of the below should be evaluated regardless of the issue
> we've been discussing in this thread, and it's a new corner case for
> order > PAGE_ALLOC_COSTLY_ORDER. I don't like order >
> PAGE_ALLOC_COSTLY_ORDER checks very much; those are arbitrary numbers,
> and the more such checks are needed in various places, the more it's a
> sign the VM is bad and arbitrary, with one more corner case required
> to hide some badness. But again, this will have effects unrelated to
> what we're discussing here; it will just convert I/O into CPU waste
> and have no effect on vfio.

yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
a different behavior. But we already do handle those specially so it
kind of makes sense to me to expand on that.
Andrea Arcangeli Aug. 22, 2018, 3:24 p.m. UTC | #9
On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote:
> Now I am confused. How can compaction help at all then? I mean if the
> node is full of GUP pins then you can hardly do anything but fall back
> to another node. Or how come your new GFP flag makes any difference?

It helps until the node is full.

If you don't call compaction you will get zero THP even when you've
plenty of free memory.

So the free memory goes down and down as more and more THP are
generated by compaction, until compaction fails with COMPACT_SKIPPED:
there's not enough free memory to relocate an "order 9" amount of
physically contiguous PAGE_SIZEd fragments.

At that point the code calls reclaim to make space for a new
compaction run. Then if that fails again it's not because there's not
enough free memory.

Problem is, if you ever call reclaim when compaction fails, what
happens is you free an "order 9" and then it gets eaten up by the app,
so the next compaction call hits COMPACT_SKIPPED again.

This is how compaction has worked since day zero, when it was
introduced in kernel 2.6.x something; if the inner workings of
compaction aren't crystal clear to you, you'll have a hard time
reviewing this. So I hope the above sheds some light on how this
plays out.
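
In pseudo-code, the dynamic I'm describing is roughly this (a sketch,
not the literal allocator code; compact_and_alloc() and reclaim() are
made-up helpers):

	for (;;) {
		page = compact_and_alloc(local_node);
		if (page)
			break;
		/* COMPACT_SKIPPED: no 2m worth of free 4k pages left */
		reclaim(local_node);	/* swaps out ~2m of the local node */
		/* ...which the faulting app immediately eats again, so
		 * the next compaction run hits COMPACT_SKIPPED too */
	}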

So in general calling reclaim is ok, because in such a case compaction
more often than not fails because it can't compact memory, not because
there aren't at least 2m free in any node. However, when you use
__GFP_THISNODE combined with reclaim, that changes the whole angle and
behavior of compaction if reclaim is still active.

Not calling compaction in MADV_HUGEPAGE means you can drop
MADV_HUGEPAGE as a whole. There's no point to ever set it unless we
call compaction. And if you don't call compaction at least once, you
have near zero chances to get gigabytes of THP, even if it's all
compactable memory and there are gigabytes of free memory in the node,
after some runtime that shakes the fragments in the buddy.

To make it even more clear why compaction has to run at least once
when MADV_HUGEPAGE is set, just check the second last column of your
/proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches;
echo >/proc/sys/vm/compact_memory". Try to allocate memory without
MADV_HUGEPAGE and without running the "echo 3; echo" and see how much
THP you'll get. I have plenty of workloads, not just qemu, that use
MADV_HUGEPAGE and totally benefit immediately from THP, and there's no
point to ever defer compaction to khugepaged when userland says "this
is a long lived allocation".

Compaction is only potentially wasteful for short lived allocations, so
MADV_HUGEPAGE has to call compaction.

> It would still try to reclaim easy targets as compaction requires. If
> you do not reclaim at all you can make the current implementation of
> compaction a noop due to its own watermark checks IIRC.

That's the feature: if you don't make it a noop when the watermark
checks trigger, it'll end up wasting CPU and breaking vfio.

The point is that we want compaction to run when there's free memory
and compaction keeps succeeding.

So when compaction fails, if it's because we finished all free memory
in the node, we should just remove __GFP_THISNODE and allocate without
it (i.e. the optimization). If compaction fails because the memory is
fragmented but there's still free memory, we should fail the allocation
and trigger the THP fallback to a PAGE_SIZE fault.
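
(The fallback I refer to is roughly this, from
do_huge_pmd_anonymous_page() in mm/huge_memory.c, quoted from memory:)

	gfp = alloc_hugepage_direct_gfpmask(vma);
	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
	if (unlikely(!page)) {
		count_vm_event(THP_FAULT_FALLBACK);
		/* the fault is retried with 4k pages */
		return VM_FAULT_FALLBACK;
	}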

Overall, removing __GFP_THISNODE unconditionally would simply
prioritize THP over NUMA locality, and NUMA locality is the point of
this special logic for THP. I can't blame the logic, because it
certainly helps NUMA balancing a lot in letting the memory be in the
right place from the start. This is why __GFP_ONLY_COMPACT makes sense:
it retains the logic while still preventing the corner case where such
a __GFP_THISNODE breaks the VM with MADV_HUGEPAGE.

> yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
> a different behavior. But we already do handle those specially so it
> kind of makes sense to me to expand on that.

It's still a sign of one more place that needs magic for whatever
reason. So unless it can be justified by some runtime tests I wouldn't
make such a change by just thinking about it. Reclaim is called if
there's no free memory left anywhere for compaction to run (i.e. if
__GFP_THISNODE is not set; if __GFP_THISNODE is set then the caller
better use __GFP_ONLY_COMPACT).

Now we could also get away without __GFP_ONLY_COMPACT: we could check
__GFP_THISNODE and make it behave exactly like __GFP_ONLY_COMPACT
whenever __GFP_DIRECT_RECLAIM was also set in addition to
__GFP_THISNODE, but then you couldn't use __GFP_THISNODE as an mbind
anymore and it would have more obscure semantics than a new flag, I
think.
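
Something like this in the slowpath, after the single compaction
attempt (sketch only):

	/* hypothetical: treat __GFP_THISNODE plus __GFP_DIRECT_RECLAIM
	 * like __GFP_ONLY_COMPACT, at the cost of the mbind semantics */
	if ((gfp_mask & (__GFP_THISNODE | __GFP_DIRECT_RECLAIM)) ==
	    (__GFP_THISNODE | __GFP_DIRECT_RECLAIM))
		goto nopage;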

Thanks,
Andrea
Andrea Arcangeli Aug. 22, 2018, 3:52 p.m. UTC | #10
On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> I am not disputing the bug itself. How hard defrag=always should really
> try is a good question and I would say different people would have
> different ideas, but a swapping storm sounds like genuinely unwanted
> behavior. I would expect that to be handled in the reclaim/compaction.
> GFP_TRANSHUGE doesn't have ___GFP_RETRY_MAYFAIL so it shouldn't really
> try too hard to reclaim.

Everything was ok as long as the THP allocation was not bound to the
local node with __GFP_THISNODE. Calling reclaim to free memory is part
of how compaction works if all free memory has been extinguished from
all nodes. At that point it's much more likely compaction fails not
because there isn't at least 2m free but because all memory is
fragmented. So it's true that MADV_HUGEPAGE may run better on non-NUMA
by not setting __GFP_ONLY_COMPACT though (i.e. like right now,
__GFP_THISNODE would be a noop there).

As for how hard defrag=always should try: I think it should at least
call compaction once, so at least in the case there's plenty of free
memory in the local node it'll have a chance. It sounds like a sure
win that way.

Calling compaction with __GFP_THISNODE will at least defrag all free
memory; it'll give MADV_HUGEPAGE a chance.

> I still have to digest the __GFP_THISNODE thing but I _think_ that the
> alloc_pages_vma code is just trying to be overly clever and
> __GFP_THISNODE is not a good fit for it. 

My option 2 did just that: it removed __GFP_THISNODE, but only for
MADV_HUGEPAGE and in general whenever reclaim was activated by
__GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
THP, so then it's less bad to prefer THP over NUMA locality.

For the default, which is tuned for short lived allocations, preferring
local memory is most certainly a better win for short lived allocations
where THP can't help much; this is why I didn't remove __GFP_THISNODE
from the default defrag policy.

Thanks,
Andrea
Michal Hocko Aug. 23, 2018, 10:50 a.m. UTC | #11
On Wed 22-08-18 11:24:02, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 04:45:17PM +0200, Michal Hocko wrote:
> > Now I am confused. How can compaction help at all then? I mean  if the
> > node is full of GUP pins then you can hardly do anything but fallback to
> > other node. Or how come your new GFP flag makes any difference?
> 
> It helps until the node is full.
> 
> If you don't call compaction you will get zero THP even when you've
> plenty of free memory.
> 
> So the free memory goes down and down as more and more THP are
> generated by compaction, until compaction fails with COMPACT_SKIPPED:
> there's not enough free memory to relocate an "order 9" amount of
> physically contiguous PAGE_SIZEd fragments.
> 
> At that point the code calls reclaim to make space for a new
> compaction run. Then if that fails again it's not because there's not
> enough free memory.
> 
> Problem is, if you ever call reclaim when compaction fails, what
> happens is you free an "order 9" and then it gets eaten up by the app,
> so the next compaction call hits COMPACT_SKIPPED again.
> 
> This is how compaction has worked since day zero, when it was
> introduced in kernel 2.6.x something; if the inner workings of
> compaction aren't crystal clear to you, you'll have a hard time
> reviewing this. So I hope the above sheds some light on how this
> plays out.
> 
> So in general calling reclaim is ok, because in such a case compaction
> more often than not fails because it can't compact memory, not because
> there aren't at least 2m free in any node. However, when you use
> __GFP_THISNODE combined with reclaim, that changes the whole angle and
> behavior of compaction if reclaim is still active.
> 
> Not calling compaction in MADV_HUGEPAGE means you can drop
> MADV_HUGEPAGE as a whole. There's no point to ever set it unless we
> call compaction. And if you don't call compaction at least once, you
> have near zero chances to get gigabytes of THP, even if it's all
> compactable memory and there are gigabytes of free memory in the node,
> after some runtime that shakes the fragments in the buddy.
> 
> To make it even more clear why compaction has to run at least once
> when MADV_HUGEPAGE is set, just check the second last column of your
> /proc/buddyinfo before and after "echo 3 >/proc/sys/vm/drop_caches;
> echo >/proc/sys/vm/compact_memory". Try to allocate memory without
> MADV_HUGEPAGE and without running the "echo 3; echo" and see how much
> THP you'll get. I have plenty of workloads, not just qemu, that use
> MADV_HUGEPAGE and totally benefit immediately from THP, and there's no
> point to ever defer compaction to khugepaged when userland says "this
> is a long lived allocation".
> 
> Compaction is only potentially wasteful for short lived allocations, so
> MADV_HUGEPAGE has to call compaction.

I guess you have missed my point. I was not suggesting compaction is
pointless. I meant to say: how can compaction be useful in the
scenario you were suggesting, when the node is full of pinned pages?

> > It would still try to reclaim easy targets as compaction requires. If
> > you do not reclaim at all you can make the current implementation of
> > compaction a noop due to its own watermark checks IIRC.
> 
> That's the feature: if you don't make it a noop when the watermark
> checks trigger, it'll end up wasting CPU and breaking vfio.
> 
> The point is that we want compaction to run when there's free memory
> and compaction keeps succeeding.
> 
> So when compaction fails, if it's because we finished all free memory
> in the node, we should just remove __GFP_THISNODE and allocate without
> it (i.e. the optimization). If compaction fails because the memory is
> fragmented but there's still free memory, we should fail the allocation
> and trigger the THP fallback to a PAGE_SIZE fault.
> 
> Overall, removing __GFP_THISNODE unconditionally would simply
> prioritize THP over NUMA locality, and NUMA locality is the point of
> this special logic for THP. I can't blame the logic, because it
> certainly helps NUMA balancing a lot in letting the memory be in the
> right place from the start. This is why __GFP_ONLY_COMPACT makes sense:
> it retains the logic while still preventing the corner case where such
> a __GFP_THISNODE breaks the VM with MADV_HUGEPAGE.

But __GFP_ONLY_COMPACT is a layering violation, because compaction
does depend on reclaim right now.
 
> > yeah, I agree about PAGE_ALLOC_COSTLY_ORDER being an arbitrary limit for
> > a different behavior. But we already do handle those specially so it
> > kind of makes sense to me to expand on that.
> 
> It's still a sign of one more place that needs magic for whatever
> reason. So unless it can be justified by some runtime tests I wouldn't
> make such a change by just thinking about it. Reclaim is called if
> there's no free memory left anywhere for compaction to run (i.e. if
> __GFP_THISNODE is not set; if __GFP_THISNODE is set then the caller
> better use __GFP_ONLY_COMPACT).

I am not insisting on the hack I have proposed; it was mostly for the
sake of discussion. But I _strongly_ believe that __GFP_ONLY_COMPACT
is the wrong way around the issue. We are revolving around
__GFP_THISNODE having negative side effects, and that is exactly an
example of a gfp flag abused for internal MM stuff which just happens
to have been a complete PITA for a long time.
 
> Now we could also get away without __GFP_ONLY_COMPACT: we could check
> __GFP_THISNODE and make it behave exactly like __GFP_ONLY_COMPACT
> whenever __GFP_DIRECT_RECLAIM was also set in addition to
> __GFP_THISNODE, but then you couldn't use __GFP_THISNODE as an mbind
> anymore and it would have more obscure semantics than a new flag, I
> think.

Or simply do not play tricks with __GFP_THISNODE.
Michal Hocko Aug. 23, 2018, 10:52 a.m. UTC | #12
On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
[...]
> > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > alloc_pages_vma code is just trying to be overly clever and
> > __GFP_THISNODE is not a good fit for it. 
> 
> My option 2 did just that: it removed __GFP_THISNODE, but only for
> MADV_HUGEPAGE and in general whenever reclaim was activated by
> __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
> THP, so then it's less bad to prefer THP over NUMA locality.
> 
> For the default, which is tuned for short lived allocations, preferring
> local memory is most certainly a better win for short lived allocations
> where THP can't help much; this is why I didn't remove __GFP_THISNODE
> from the default defrag policy.

Yes I agree.
Michal Hocko Aug. 28, 2018, 7:53 a.m. UTC | #13
On Thu 23-08-18 12:52:53, Michal Hocko wrote:
> On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> > On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> [...]
> > > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > > alloc_pages_vma code is just trying to be overly clever and
> > > __GFP_THISNODE is not a good fit for it. 
> > 
> > My option 2 did just that: it removed __GFP_THISNODE, but only for
> > MADV_HUGEPAGE and in general whenever reclaim was activated by
> > __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
> > THP, so then it's less bad to prefer THP over NUMA locality.
> > 
> > For the default, which is tuned for short lived allocations, preferring
> > local memory is most certainly a better win for short lived allocations
> > where THP can't help much; this is why I didn't remove __GFP_THISNODE
> > from the default defrag policy.
> 
> Yes I agree.

I finally got back to this again. I have checked your patch and I am
really wondering whether alloc_pages_vma is really the proper place to
play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
and it should be the proper place to handle this special casing. So what
do you think about the following? It should be essentially the same
thing, i.e. use __GFP_THISNODE only when we are doing an optimistic THP
allocation. Madvise signals that you know what you are doing and THP has
the top priority. If you care enough about the numa placement then you
should use a mempolicy.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..3cdb62f6aea7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -634,16 +634,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     __GFP_THISNODE);
+	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..9f0800885613 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
Michal Hocko Aug. 28, 2018, 8:18 a.m. UTC | #14
[CC Stefan Priebe who has reported the same/similar issue on openSUSE
 mailing list recently - the thread starts http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com]

On Tue 28-08-18 09:53:21, Michal Hocko wrote:
> On Thu 23-08-18 12:52:53, Michal Hocko wrote:
> > On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
> > > On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
> > [...]
> > > > I still have to digest the __GFP_THISNODE thing but I _think_ that the
> > > > alloc_pages_vma code is just trying to be overly clever and
> > > > __GFP_THISNODE is not a good fit for it. 
> > > 
> > > My option 2 did just that: it removed __GFP_THISNODE, but only for
> > > MADV_HUGEPAGE and in general whenever reclaim was activated by
> > > __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
> > > THP, so then it's less bad to prefer THP over NUMA locality.
> > > 
> > > For the default, which is tuned for short lived allocations, preferring
> > > local memory is most certainly a better win for short lived allocations
> > > where THP can't help much; this is why I didn't remove __GFP_THISNODE
> > > from the default defrag policy.
> > 
> > Yes I agree.
> 
> I finally got back to this again. I have checked your patch and I am
> really wondering whether alloc_pages_vma is really the proper place to
> play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
> and it should be the proper place to handle this special casing. So what
> do you think about the following? It should be essentially the same
> thing, i.e. use __GFP_THISNODE only when we are doing an optimistic THP
> allocation. Madvise signals that you know what you are doing and THP has
> the top priority. If you care enough about the numa placement then you
> should use a mempolicy.

Now the patch is still untested but it compiles at least.
---
From 88e0ca4c9c403c6046f1c47d5ee17548f9dc841a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 28 Aug 2018 09:59:19 +0200
Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on a NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on a 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-phSLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap the
: local node very hard, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though; on
: threadrippers, for example, the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swap out get_user_pages pinned pages as a result of
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
where we juggle gfp flags for different allocation modes. The rationale
is that __GFP_THISNODE is helpful in relaxed defrag modes because
falling back to a different node might be more harmful than the benefit
of a large page. If the user really requires THP (e.g. via
MADV_HUGEPAGE) then the THP has a higher priority than local NUMA
placement. The latter might be controlled via NUMA policies to be more
fine grained.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/huge_memory.c | 10 +++++-----
 mm/mempolicy.c   | 26 --------------------------
 2 files changed, 5 insertions(+), 31 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..a703c23f8bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -634,16 +634,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     __GFP_THISNODE);
+	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..9f0800885613 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
Stefan Priebe - Profihost AG Aug. 28, 2018, 8:54 a.m. UTC | #15
On 28.08.2018 at 10:18, Michal Hocko wrote:
> [CC Stefan Priebe who has reported the same/similar issue on openSUSE
>  mailing list recently - the thread starts http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com]
> 
> On Tue 28-08-18 09:53:21, Michal Hocko wrote:
>> On Thu 23-08-18 12:52:53, Michal Hocko wrote:
>>> On Wed 22-08-18 11:52:50, Andrea Arcangeli wrote:
>>>> On Wed, Aug 22, 2018 at 11:02:14AM +0200, Michal Hocko wrote:
>>> [...]
>>>>> I still have to digest the __GFP_THISNODE thing but I _think_ that the
>>>>> alloc_pages_vma code is just trying to be overly clever and
>>>>> __GFP_THISNODE is not a good fit for it. 
>>>>
>>>> My option 2 did just that, it removed __GFP_THISNODE but only for
>>>> MADV_HUGEPAGE and in general whenever reclaim was activated by
>>>> __GFP_DIRECT_RECLAIM. That is also a signal that the user really wants
>>>> THP so then it's less bad to prefer THP over NUMA locality.
>>>>
>>>> For the default which is tuned for short lived allocation, preferring
>>>> local memory is most certainly a better win for short lived allocation
>>>> where THP can't help much, this is why I didn't remove __GFP_THISNODE
>>>> from the default defrag policy.
>>>
>>> Yes I agree.
>>
>> I finally got back to this again. I have checked your patch and I am
>> really wondering whether alloc_pages_vma is really the proper place to
>> play these tricks. We already have that mind blowing alloc_hugepage_direct_gfpmask
>> and it should be the proper place to handle this special casing. So what
>> do you think about the following? It should be essentially the same
>> thing. Aka use __GFP_THISNODE only when we are doing an optimistic THP
>> allocation. Madvise signals that you know what you are doing and THP has
>> the top priority. If you care enough about the NUMA placement then you
>> had better use a mempolicy.
> 
> Now the patch is still untested but it compiles at least.

Great - I recompiled the SLES15 kernel with that one applied and will
test if it helps.

Stefan

> [...]
Stefan Priebe - Profihost AG Aug. 29, 2018, 11:11 a.m. UTC | #16
On 28.08.2018 at 10:54, Stefan Priebe - Profihost AG wrote:
> 
> On 28.08.2018 at 10:18, Michal Hocko wrote:
>> [...]
>> Now the patch is still untested but it compiles at least.
> 
> Great - I recompiled the SLES15 kernel with that one applied and will
> test if it helps.

It seems to work fine. At least I was not able to reproduce the issue.

Greets,
Stefan

>> [...]
Michal Hocko Aug. 29, 2018, 2:28 p.m. UTC | #17
On Wed 29-08-18 09:28:21, Zi Yan wrote:
[...]
> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> Should this WARN_ON_ONCE be removed?
> 
> 
> /*
> * __GFP_THISNODE shouldn't even be used with the bind policy
> * because we might easily break the expectation to stay on the
> * requested node and not break the policy.
> */
> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));

This is really interesting. It seems to be me who added this warning, but
I simply cannot make any sense of it. Let me try to dig some more.
Michal Hocko Aug. 29, 2018, 2:35 p.m. UTC | #18
On Wed 29-08-18 16:28:16, Michal Hocko wrote:
> On Wed 29-08-18 09:28:21, Zi Yan wrote:
> [...]
> > This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> > Should this WARN_ON_ONCE be removed?
> > 
> > 
> > /*
> > * __GFP_THISNODE shouldn't even be used with the bind policy
> > * because we might easily break the expectation to stay on the
> > * requested node and not break the policy.
> > */
> > WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> 
> This is really interesting. It seems to be me who added this warning, but
> I simply cannot make any sense of it. Let me try to dig some more.

OK, I get it now. The warning seems to be incomplete. It is right to
complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
not what we check here. Does this heal the warning?
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..7bb9354b1e4c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
 		 * because we might easily break the expectation to stay on the
 		 * requested node and not break the policy.
 		 */
-		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
+		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
+			nodemask_t *nmask = policy_nodemask(gfp, policy);
+			WARN_ON_ONCE(!node_isset(nd, *nmask));
+		}
 	}
 
 	return nd;
Zi Yan Aug. 29, 2018, 3:22 p.m. UTC | #19
On 29 Aug 2018, at 10:35, Michal Hocko wrote:

> On Wed 29-08-18 16:28:16, Michal Hocko wrote:
>> On Wed 29-08-18 09:28:21, Zi Yan wrote:
>> [...]
>>> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
>>> Should this WARN_ON_ONCE be removed?
>>>
>>>
>>> /*
>>> * __GFP_THISNODE shouldn't even be used with the bind policy
>>> * because we might easily break the expectation to stay on the
>>> * requested node and not break the policy.
>>> */
>>> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>
>> This is really interesting. It seems to be me who added this warning, but
>> I simply cannot make any sense of it. Let me try to dig some more.
>
> OK, I get it now. The warning seems to be incomplete. It is right to
> complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
> not what we check here. Does this heal the warning?
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..7bb9354b1e4c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
>  		 * because we might easily break the expectation to stay on the
>  		 * requested node and not break the policy.
>  		 */
> -		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> +		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
> +			nodemask_t *nmask = policy_nodemask(gfp, policy);
> +			WARN_ON_ONCE(!node_isset(nd, *nmask));
> +		}
>  	}
>
>  	return nd;

Unfortunately no. I simply ran “memhog -r3 1g membind 1” to test and the warning still showed up.

The reason is that nd is just a hint about which node to prefer for allocation and
can be ignored if it does not conform to the mempolicy.
Taking my test as an example: if an application's memory is bound to node 1 but it
can run on the CPUs of any node and it launches on node 0, alloc_pages_vma() will
see 0 as its node parameter and pass 0 to policy_node()’s nd parameter. This should
be OK, but your patches would give a warning, because nd=0 is not set in nmask=1.
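
For illustration, a minimal userspace sketch of what that memhog run
does (hypothetical, not the actual memhog source; it assumes a two-node
machine, transparent_hugepage/enabled set to "always", libnuma's
<numaif.h> wrapper for set_mempolicy(2) with -lnuma, and being launched
on a node 0 CPU, e.g. via numactl --cpunodebind=0):

#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* allow node 1 only */
	size_t len = 1UL << 30;			/* 1g, as in the memhog run */
	char *p;

	/* bind memory to node 1 while running on a node 0 CPU */
	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* fault in; alloc_pages_vma() sees node 0 while the nodemask is {1} */
	for (size_t i = 0; i < len; i += 4096)
		p[i] = 1;
	return 0;
}

The THP faults then reach policy_node() with __GFP_THISNODE set (the vma
is not madvised, so every defrag mode adds it with your patch) while
pol->mode == MPOL_BIND, which is exactly the combination the warning
complains about.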

Now I get your comment “__GFP_THISNODE shouldn't even be used with the bind policy”,
since the two are indeed incompatible: __GFP_THISNODE insists on a given node,
which the MPOL_BIND policy is free to ignore.

IMHO, we could get rid of __GFP_THISNODE when MPOL_BIND is set, like

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0d2be5786b0c..a0fcb998d277 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1722,14 +1722,6 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
 {
        if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
                nd = policy->v.preferred_node;
-       else {
-               /*
-                * __GFP_THISNODE shouldn't even be used with the bind policy
-                * because we might easily break the expectation to stay on the
-                * requested node and not break the policy.
-                */
-               WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
-       }

        return nd;
 }
@@ -2026,6 +2018,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
                goto out;
        }

+       /*
+        * __GFP_THISNODE shouldn't even be used with the bind policy
+        * because we might easily break the expectation to stay on the
+        * requested node and not break the policy.
+        */
+       if (pol->mode == MPOL_BIND)
+               gfp &= ~__GFP_THISNODE;

        nmask = policy_nodemask(gfp, pol);
        preferred_nid = policy_node(gfp, pol, node);

What do you think?

—
Best Regards,
Yan Zi
Michal Hocko Aug. 29, 2018, 3:47 p.m. UTC | #20
On Wed 29-08-18 11:22:35, Zi Yan wrote:
> On 29 Aug 2018, at 10:35, Michal Hocko wrote:
> 
> > On Wed 29-08-18 16:28:16, Michal Hocko wrote:
> >> On Wed 29-08-18 09:28:21, Zi Yan wrote:
> >> [...]
> >>> This patch triggers WARN_ON_ONCE() in policy_node() when MPOL_BIND is used and THP is on.
> >>> Should this WARN_ON_ONCE be removed?
> >>>
> >>>
> >>> /*
> >>> * __GFP_THISNODE shouldn't even be used with the bind policy
> >>> * because we might easily break the expectation to stay on the
> >>> * requested node and not break the policy.
> >>> */
> >>> WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> >>
> >> This is really interesting. It seems to be me who added this warning but
> >> I cannot simply make any sense of it. Let me try to dig some more.
> >
> > OK, I get it now. The warning seems to be incomplete. It is right to
> > complain when __GFP_THISNODE disagrees with MPOL_BIND policy but that is
> > not what we check here. Does this heal the warning?
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..7bb9354b1e4c 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1728,7 +1728,10 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
> >  		 * because we might easily break the expectation to stay on the
> >  		 * requested node and not break the policy.
> >  		 */
> > -		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > +		if (policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)) {
> > +			nodemask_t *nmask = policy_nodemask(gfp, policy);
> > +			WARN_ON_ONCE(!node_isset(nd, *nmask));
> > +		}
> >  	}
> >
> >  	return nd;
> 
> Unfortunately no. I simply ran “memhog -r3 1g membind 1” to test and the warning still showed up.
> 
> The reason is that nd is just a hint about which node to prefer for allocation and
> can be ignored if it does not conform to the mempolicy.
>
> Taking my test as an example: if an application's memory is bound to node 1 but it
> can run on the CPUs of any node and it launches on node 0, alloc_pages_vma() will
> see 0 as its node parameter and pass 0 to policy_node()’s nd parameter. This should
> be OK, but your patches would give a warning, because nd=0 is not set in nmask=1.
> 
> Now I get your comment “__GFP_THISNODE shouldn't even be used with the bind policy”,
> since the two are indeed incompatible: __GFP_THISNODE insists on a given node,
> which the MPOL_BIND policy is free to ignore.

Well, the assumption was that you do not run on a CPU remote to your
memory policy's nodes. But that seems to be a wrong assumption.

> IMHO, we could get rid of __GFP_THISNODE when MPOL_BIND is set, like
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0d2be5786b0c..a0fcb998d277 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1722,14 +1722,6 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
>  {
>         if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
>                 nd = policy->v.preferred_node;
> -       else {
> -               /*
> -                * __GFP_THISNODE shouldn't even be used with the bind policy
> -                * because we might easily break the expectation to stay on the
> -                * requested node and not break the policy.
> -                */
> -               WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> -       }
> 
>         return nd;
>  }
> @@ -2026,6 +2018,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>                 goto out;
>         }
> 
> +       /*
> +        * __GFP_THISNODE shouldn't even be used with the bind policy
> +        * because we might easily break the expectation to stay on the
> +        * requested node and not break the policy.
> +        */
> +       if (pol->mode == MPOL_BIND)
> +               gfp &= ~__GFP_THISNODE;
> 
>         nmask = policy_nodemask(gfp, pol);
>         preferred_nid = policy_node(gfp, pol, node);
> 
> What do you think?

I do not like overwriting gfp flags like that. It is just ugly and
error-prone. A more proper way would be to handle that at the layer
where we play with __GFP_THISNODE. The resulting diff is larger though.

If there is a general consensus that this is growing too complicated
then Andrea's patch (the second variant, overwriting the gfp mask) is much
simpler of course, but I really detest the subtle gfp rewriting. I still
believe that all the nasty details should be covered in a single
place.


diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a703c23f8bab..94472bf9a31b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | __GFP_THISNODE);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | __GFP_THISNODE);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_THISNODE);
-	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9f0800885613..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
Zi Yan Aug. 29, 2018, 4:06 p.m. UTC | #21
<snip>
>
> I do not like overwriting gfp flags like that. It is just ugly and
> error-prone. A more proper way would be to handle that at the layer
> where we play with __GFP_THISNODE. The resulting diff is larger though.

This makes sense to me.

>
> If there is a general consensus that this is growing too complicated
> then Andrea's patch (the second variant, overwriting the gfp mask) is much
> simpler of course, but I really detest the subtle gfp rewriting. I still
> believe that all the nasty details should be covered in a single
> place.
>
>
> [...]

The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
which could be merged with this one).

Thanks.

—
Best Regards,
Yan Zi
Michal Hocko Aug. 29, 2018, 4:25 p.m. UTC | #22
On Wed 29-08-18 12:06:48, Zi Yan wrote:
> The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
> which could be merged with this one).

I will respin the patch, update the changelog and repost. Tomorrow I
hope.

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a6afcec53795..3c04d5d90e6d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,7 @@  struct vm_area_struct;
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
+#define ___GFP_ONLY_COMPACT	0x1000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -178,6 +179,21 @@  struct vm_area_struct;
  *   definitely preferable to use the flag rather than opencode endless
  *   loop around allocator.
  *   Using this flag for costly allocations is _highly_ discouraged.
+ *
+ * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to succeed
+ * the allocation by freeing memory. Never risk freeing any
+ * "PAGE_SIZE" memory unit even if compaction failed specifically
+ * because of not enough free pages in the zone. This makes sense
+ * only in combination with __GFP_THISNODE (enforced with a
+ * VM_WARN_ON), to restrict the THP allocation to the local node that
+ * triggered the page fault and to fall back to PAGE_SIZE allocations
+ * in the same node. We don't want to invoke reclaim because there may
+ * be plenty of free memory already in the local node. More importantly
+ * there may even be plenty of free THP available in remote nodes, so
+ * we should allocate those, if anything, instead of reclaiming any
+ * memory in the local node. Implementation detail: set ___GFP_NORETRY
+ * too so that ___GFP_ONLY_COMPACT only needs to be checked in the slow
+ * path.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
@@ -187,6 +203,8 @@  struct vm_area_struct;
 #define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
+#define __GFP_ONLY_COMPACT	((__force gfp_t)(___GFP_NORETRY | \
+						 ___GFP_ONLY_COMPACT))
 
 /*
  * Action modifiers
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..6bf839f20dcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,18 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * We restricted the allocation to the
+			 * hpage_node so we must use
+			 * __GFP_ONLY_COMPACT to allow at most a
+			 * compaction attempt and not ever get into
+			 * reclaim or it'll swap heavily with
+			 * transparent_hugepage/defrag = always (or
+			 * madvise under MADV_HUGEPAGE).
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+						  gfp | __GFP_THISNODE |
+						  __GFP_ONLY_COMPACT, order);
 			goto out;
 		}
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..01a5c2bd0860 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4144,6 +4144,10 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 */
 			if (compact_result == COMPACT_DEFERRED)
 				goto nopage;
+			if (gfp_mask & __GFP_ONLY_COMPACT) {
+				VM_WARN_ON(!(gfp_mask & __GFP_THISNODE));
+				goto nopage;
+			}
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but