mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
diff mbox series

Message ID 20180829192451.GG10223@dhcp22.suse.cz
State New
Headers show
Series
  • mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Related show

Commit Message

Michal Hocko Aug. 29, 2018, 7:24 p.m. UTC
On Wed 29-08-18 18:25:28, Michal Hocko wrote:
> On Wed 29-08-18 12:06:48, Zi Yan wrote:
> > The warning goes away with this change. I am OK with this patch (plus the original one you sent out,
> > which could be merged with this one).
> 
> I will respin the patch, update the changelog and repost. Tomorrow I
> hope.

Here is the v2

From 4dc2f772756e6f91b9e64d1a3e2df4dca3475f5b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 28 Aug 2018 09:59:19 +0200
Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 25 +++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 20 insertions(+), 35 deletions(-)

Comments

Zi Yan Aug. 29, 2018, 10:54 p.m. UTC | #1
Hi Michal,

<snip>
>
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> -- 
> 2.18.0
>

Thanks for your patch.

I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
each node and got the results below. I expect this test should fill one node, then fall back to the other.

1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
being allocated in the fallback node.
4. no madvise, THP is on by default + defrag = madvise: no swap, base pages are allocated in the fallback node.

The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.

The reason, as Andrea mentioned in his email, is that the combination of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM from this experiment). __THIS_NODE uses ZONELIST_NOFALLBACK, which removes the fallback possibility
and __GFP_*_RECLAIM triggers page reclaim in the first page allocation node when fallback nodes are removed by
ZONELIST_NOFALLBACK.

IMHO, __THIS_NODE should not be used for user memory allocation at all, since it fights against most of memory policies.
But kernel memory allocation would need it as a kernel MPOL_BIND memory policy.

Comments?

—
Best Regards,
Yan Zi
Michal Hocko Aug. 30, 2018, 6:47 a.m. UTC | #2
I forgot mpol_cond_put...

From 55295cfc4ed322acc4979036345a7afbefe17478 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 28 Aug 2018 09:59:19 +0200
Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on 4.12 based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
Michal Hocko Aug. 30, 2018, 7 a.m. UTC | #3
On Wed 29-08-18 18:54:23, Zi Yan wrote:
[...]
> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> disk instead of being allocated in the fallback node.
> 3. no madvise, THP is on by default + defrag = {always, defer,
> defer+madvise}: pages got swapped to the disk instead of being
> allocated in the fallback node.
> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> pages are allocated in the fallback node.
> 
> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> 
> The reason, as Andrea mentioned in his email, is that the combination
> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> from this experiment).

But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
see kswapd do the reclaim to balance the node. If the node is full of
anonymous pages then there is no other way than swap out.

> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> reclaim in the first page allocation node when fallback nodes are
> removed by ZONELIST_NOFALLBACK.

Yes but the point is that the allocations which use __GFP_THISNODE are
optimistic so they shouldn't fallback to remote NUMA nodes.

> IMHO, __THIS_NODE should not be used for user memory allocation at
> all, since it fights against most of memory policies.  But kernel
> memory allocation would need it as a kernel MPOL_BIND memory policy.

__GFP_THISNODE is indeed an ugliness. I would really love to get rid of
it here. But the problem is that optimistic THP allocations should
prefer a local node because a remote node might easily offset the
advantage of the THP. I do not have a great idea how to achieve that
without __GFP_THISNODE though.
Zi Yan Aug. 30, 2018, 1:22 p.m. UTC | #4
On 30 Aug 2018, at 3:00, Michal Hocko wrote:

> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> [...]
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>> no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>> disk instead of being allocated in the fallback node.
>> 3. no madvise, THP is on by default + defrag = {always, defer,
>> defer+madvise}: pages got swapped to the disk instead of being
>> allocated in the fallback node.
>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>> pages are allocated in the fallback node.
>>
>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>
>> The reason, as Andrea mentioned in his email, is that the combination
>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>> from this experiment).
>
> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> see kswapd do the reclaim to balance the node. If the node is full of
> anonymous pages then there is no other way than swap out.

GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
+ defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
can be triggered.

The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
which should allow page allocation fallback to other nodes, but as shown in
result 3, swapping is triggered instead of page allocation fallback.

>
>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>> reclaim in the first page allocation node when fallback nodes are
>> removed by ZONELIST_NOFALLBACK.
>
> Yes but the point is that the allocations which use __GFP_THISNODE are
> optimistic so they shouldn't fallback to remote NUMA nodes.

This can be achieved by using MPOL_BIND memory policy which restricts
nodemask in struct alloc_context for user space memory allocations.

>
>> IMHO, __THIS_NODE should not be used for user memory allocation at
>> all, since it fights against most of memory policies.  But kernel
>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>
> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> it here. But the problem is that optimistic THP allocations should
> prefer a local node because a remote node might easily offset the
> advantage of the THP. I do not have a great idea how to achieve that
> without __GFP_THISNODE though.

MPOL_PREFERRED memory policy can be used to achieve this optimistic THP allocation
for user space. Even with the default memory policy, local memory node will be used
first until it is full. It seems to me that __GFP_THISNODE is not necessary
if a proper memory policy is used.

Let me know if I miss anything. Thanks.


—
Best Regards,
Yan Zi
Michal Hocko Aug. 30, 2018, 1:45 p.m. UTC | #5
On Thu 30-08-18 09:22:21, Zi Yan wrote:
> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> 
> > On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > [...]
> >> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> >> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>
> >> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >> no swap, THPs are allocated in the fallback node.
> >> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >> disk instead of being allocated in the fallback node.
> >> 3. no madvise, THP is on by default + defrag = {always, defer,
> >> defer+madvise}: pages got swapped to the disk instead of being
> >> allocated in the fallback node.
> >> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >> pages are allocated in the fallback node.
> >>
> >> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> >>
> >> The reason, as Andrea mentioned in his email, is that the combination
> >> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >> from this experiment).
> >
> > But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> > We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> > see kswapd do the reclaim to balance the node. If the node is full of
> > anonymous pages then there is no other way than swap out.
> 
> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
> can be triggered.

Yes, but the setup tells that you are willing to pay price to get a THP.
defered=always uses that special __GFP_NORETRY (unless it is madvised
mapping) that should back off if the compaction failed recently. How
much that reduces the reclaim is not really clear to me right now to be
honest.

> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
> which should allow page allocation fallback to other nodes, but as shown in
> result 3, swapping is triggered instead of page allocation fallback.

Well, I guess this really depends. Fallback to a different node might be
seen as a bad thing and worse than the reclaim on the local node.

> >> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> >> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >> reclaim in the first page allocation node when fallback nodes are
> >> removed by ZONELIST_NOFALLBACK.
> >
> > Yes but the point is that the allocations which use __GFP_THISNODE are
> > optimistic so they shouldn't fallback to remote NUMA nodes.
> 
> This can be achieved by using MPOL_BIND memory policy which restricts
> nodemask in struct alloc_context for user space memory allocations.

Yes, but that requires and explicit NUMA handling. And we are trying to
handle those cases which do not really give a damn and just want to use
THP if it is available or try harder when they ask by using madvise.

> >> IMHO, __THIS_NODE should not be used for user memory allocation at
> >> all, since it fights against most of memory policies.  But kernel
> >> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >
> > __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> > it here. But the problem is that optimistic THP allocations should
> > prefer a local node because a remote node might easily offset the
> > advantage of the THP. I do not have a great idea how to achieve that
> > without __GFP_THISNODE though.
> 
> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> THP allocation for user space. Even with the default memory policy,
> local memory node will be used first until it is full. It seems to
> me that __GFP_THISNODE is not necessary if a proper memory policy is
> used.
> 
> Let me know if I miss anything. Thanks.

You are missing that we are trying to define a sensible model for those
who do not really care about mempolicies. THP shouldn't cause more harm
than good for those.

I wish we could come up with a remotely sane and comprehensible model.
That means that you know how hard the allocator tries to get a THP for
you depending on the defrag configuration, your memory policy and your
madvise setting. The easiest one I can think of is to 
- always follow mempolicy when specified because you asked for it
  explicitly
- stay node local and low latency for the light THP defrag mode (defrag,
  madvise without hint and none) because THP is a nice to have
- if the defrag mode is always then you are willing to pay the latency
  price but off-node might be still a no-no.
- allow fallback for madvised mappings because you really want THP. If
  you care about specific numa placement then combine with the
  mempolicy.

As you can see I do not really mention anything about the direct reclaim
because that is just an implementation detail of the page allocator and
compaction interaction.

Maybe you can formulate a saner matrix with all the available modes that
we have.

Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
is clearly wrong and we should address that first. Either Andrea's
option 2) patch or mine which does the similar thing except at the
proper layer (I believe). We can continue discussing other odd cases on
top I guess. Unless somebody has much brighter idea, of course.
Zi Yan Aug. 30, 2018, 2:02 p.m. UTC | #6
On 30 Aug 2018, at 9:45, Michal Hocko wrote:

> On Thu 30-08-18 09:22:21, Zi Yan wrote:
>> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
>>
>>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
>>> [...]
>>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>>>
>>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
>>>> no swap, THPs are allocated in the fallback node.
>>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
>>>> disk instead of being allocated in the fallback node.
>>>> 3. no madvise, THP is on by default + defrag = {always, defer,
>>>> defer+madvise}: pages got swapped to the disk instead of being
>>>> allocated in the fallback node.
>>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
>>>> pages are allocated in the fallback node.
>>>>
>>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
>>>>
>>>> The reason, as Andrea mentioned in his email, is that the combination
>>>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
>>>> from this experiment).
>>>
>>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
>>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
>>> see kswapd do the reclaim to balance the node. If the node is full of
>>> anonymous pages then there is no other way than swap out.
>>
>> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
>> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
>> can be triggered.
>
> Yes, but the setup tells that you are willing to pay price to get a THP.
> defered=always uses that special __GFP_NORETRY (unless it is madvised
> mapping) that should back off if the compaction failed recently. How
> much that reduces the reclaim is not really clear to me right now to be
> honest.
>
>> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
>> which should allow page allocation fallback to other nodes, but as shown in
>> result 3, swapping is triggered instead of page allocation fallback.
>
> Well, I guess this really depends. Fallback to a different node might be
> seen as a bad thing and worse than the reclaim on the local node.
>
>>>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
>>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
>>>> reclaim in the first page allocation node when fallback nodes are
>>>> removed by ZONELIST_NOFALLBACK.
>>>
>>> Yes but the point is that the allocations which use __GFP_THISNODE are
>>> optimistic so they shouldn't fallback to remote NUMA nodes.
>>
>> This can be achieved by using MPOL_BIND memory policy which restricts
>> nodemask in struct alloc_context for user space memory allocations.
>
> Yes, but that requires and explicit NUMA handling. And we are trying to
> handle those cases which do not really give a damn and just want to use
> THP if it is available or try harder when they ask by using madvise.
>
>>>> IMHO, __THIS_NODE should not be used for user memory allocation at
>>>> all, since it fights against most of memory policies.  But kernel
>>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
>>>
>>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
>>> it here. But the problem is that optimistic THP allocations should
>>> prefer a local node because a remote node might easily offset the
>>> advantage of the THP. I do not have a great idea how to achieve that
>>> without __GFP_THISNODE though.
>>
>> MPOL_PREFERRED memory policy can be used to achieve this optimistic
>> THP allocation for user space. Even with the default memory policy,
>> local memory node will be used first until it is full. It seems to
>> me that __GFP_THISNODE is not necessary if a proper memory policy is
>> used.
>>
>> Let me know if I miss anything. Thanks.
>
> You are missing that we are trying to define a sensible model for those
> who do not really care about mempolicies. THP shouldn't cause more harm
> than good for those.
>
> I wish we could come up with a remotely sane and comprehensible model.
> That means that you know how hard the allocator tries to get a THP for
> you depending on the defrag configuration, your memory policy and your
> madvise setting. The easiest one I can think of is to
> - always follow mempolicy when specified because you asked for it
>   explicitly
> - stay node local and low latency for the light THP defrag mode (defrag,
>   madvise without hint and none) because THP is a nice to have
> - if the defrag mode is always then you are willing to pay the latency
>   price but off-node might be still a no-no.
> - allow fallback for madvised mappings because you really want THP. If
>   you care about specific numa placement then combine with the
>   mempolicy.
>
> As you can see I do not really mention anything about the direct reclaim
> because that is just an implementation detail of the page allocator and
> compaction interaction.
>
> Maybe you can formulate a saner matrix with all the available modes that
> we have.
>
> Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> is clearly wrong and we should address that first. Either Andrea's
> option 2) patch or mine which does the similar thing except at the
> proper layer (I believe). We can continue discussing other odd cases on
> top I guess. Unless somebody has much brighter idea, of course.

Thanks for your explanation. It makes sense to me. I am fine with your patch.
You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
my test result 1 shows that the problem mentioned in your changelog is solved.

—
Best Regards,
Yan Zi
Stefan Priebe - Profihost AG Aug. 30, 2018, 4:19 p.m. UTC | #7
Please add also:
Tested-by: Stefan Priebe <s.priebe@profihost.ag>

Stefan

Excuse my typo sent from my mobile phone.

> Am 30.08.2018 um 16:02 schrieb Zi Yan <zi.yan@cs.rutgers.edu>:
> 
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Please add also:<div><span style="background-color: rgba(255, 255, 255, 0);">Tested-by: Stefan Priebe &lt;<a href="mailto:s.priebe@profihost.ag">s.priebe@profihost.ag</a>&gt;</span></div><div><br><div id="AppleMailSignature">Stefan<div><br></div><div>Excuse my typo s<span style="font-size: 13pt;">ent from my mobile phone.</span></div></div><div><br>Am 30.08.2018 um 16:02 schrieb Zi Yan &lt;<a href="mailto:zi.yan@cs.rutgers.edu">zi.yan@cs.rutgers.edu</a>&gt;:<br><br></div><blockquote type="cite"><div>Tested-by: Zi Yan &lt;<span><a href="mailto:zi.yan@cs.rutgers.edu">zi.yan@cs.rutgers.edu</a></span>&gt;</div></blockquote></div></body></html>
Michal Hocko Aug. 30, 2018, 4:40 p.m. UTC | #8
On Thu 30-08-18 10:02:23, Zi Yan wrote:
> On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> 
> > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> >>
> >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> >>> [...]
> >>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> >>>>
> >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> >>>> no swap, THPs are allocated in the fallback node.
> >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> >>>> disk instead of being allocated in the fallback node.
> >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> >>>> defer+madvise}: pages got swapped to the disk instead of being
> >>>> allocated in the fallback node.
> >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> >>>> pages are allocated in the fallback node.
> >>>>
> >>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.
> >>>>
> >>>> The reason, as Andrea mentioned in his email, is that the combination
> >>>> of __THIS_NODE and __GFP_DIRECT_RECLAIM (plus __GFP_KSWAPD_RECLAIM
> >>>> from this experiment).
> >>>
> >>> But we do not set __GFP_THISNODE along with __GFP_DIRECT_RECLAIM AFAICS.
> >>> We do for __GFP_KSWAPD_RECLAIM though and I guess that it is expected to
> >>> see kswapd do the reclaim to balance the node. If the node is full of
> >>> anonymous pages then there is no other way than swap out.
> >>
> >> GFP_TRANSHUGE implies __GFP_DIRECT_RECLAIM. When no madvise is given, THP is on
> >> + defrag=always, gfp_mask has __GFP_THISNODE and __GFP_DIRECT_RECLAIM, so swapping
> >> can be triggered.
> >
> > Yes, but the setup tells that you are willing to pay price to get a THP.
> > defered=always uses that special __GFP_NORETRY (unless it is madvised
> > mapping) that should back off if the compaction failed recently. How
> > much that reduces the reclaim is not really clear to me right now to be
> > honest.
> >
> >> The key issue here is that “memhog -r3 130g” uses the default memory policy (MPOL_DEFAULT),
> >> which should allow page allocation fallback to other nodes, but as shown in
> >> result 3, swapping is triggered instead of page allocation fallback.
> >
> > Well, I guess this really depends. Fallback to a different node might be
> > seen as a bad thing and worse than the reclaim on the local node.
> >
> >>>> __THIS_NODE uses ZONELIST_NOFALLBACK, which
> >>>> removes the fallback possibility and __GFP_*_RECLAIM triggers page
> >>>> reclaim in the first page allocation node when fallback nodes are
> >>>> removed by ZONELIST_NOFALLBACK.
> >>>
> >>> Yes but the point is that the allocations which use __GFP_THISNODE are
> >>> optimistic so they shouldn't fallback to remote NUMA nodes.
> >>
> >> This can be achieved by using MPOL_BIND memory policy which restricts
> >> nodemask in struct alloc_context for user space memory allocations.
> >
> > Yes, but that requires and explicit NUMA handling. And we are trying to
> > handle those cases which do not really give a damn and just want to use
> > THP if it is available or try harder when they ask by using madvise.
> >
> >>>> IMHO, __THIS_NODE should not be used for user memory allocation at
> >>>> all, since it fights against most of memory policies.  But kernel
> >>>> memory allocation would need it as a kernel MPOL_BIND memory policy.
> >>>
> >>> __GFP_THISNODE is indeed an ugliness. I would really love to get rid of
> >>> it here. But the problem is that optimistic THP allocations should
> >>> prefer a local node because a remote node might easily offset the
> >>> advantage of the THP. I do not have a great idea how to achieve that
> >>> without __GFP_THISNODE though.
> >>
> >> MPOL_PREFERRED memory policy can be used to achieve this optimistic
> >> THP allocation for user space. Even with the default memory policy,
> >> local memory node will be used first until it is full. It seems to
> >> me that __GFP_THISNODE is not necessary if a proper memory policy is
> >> used.
> >>
> >> Let me know if I miss anything. Thanks.
> >
> > You are missing that we are trying to define a sensible model for those
> > who do not really care about mempolicies. THP shouldn't cause more harm
> > than good for those.
> >
> > I wish we could come up with a remotely sane and comprehensible model.
> > That means that you know how hard the allocator tries to get a THP for
> > you depending on the defrag configuration, your memory policy and your
> > madvise setting. The easiest one I can think of is to
> > - always follow mempolicy when specified because you asked for it
> >   explicitly
> > - stay node local and low latency for the light THP defrag mode (defrag,
> >   madvise without hint and none) because THP is a nice to have
> > - if the defrag mode is always then you are willing to pay the latency
> >   price but off-node might be still a no-no.
> > - allow fallback for madvised mappings because you really want THP. If
> >   you care about specific numa placement then combine with the
> >   mempolicy.
> >
> > As you can see I do not really mention anything about the direct reclaim
> > because that is just an implementation detail of the page allocator and
> > compaction interaction.
> >
> > Maybe you can formulate a saner matrix with all the available modes that
> > we have.
> >
> > Anyway, I guess we can agree that (almost) unconditional __GFP_THISNODE
> > is clearly wrong and we should address that first. Either Andrea's
> > option 2) patch or mine which does the similar thing except at the
> > proper layer (I believe). We can continue discussing other odd cases on
> > top I guess. Unless somebody has much brighter idea, of course.
> 
> Thanks for your explanation. It makes sense to me. I am fine with your patch.
> You can add my Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>, since
> my test result 1 shows that the problem mentioned in your changelog is solved.

Thanks for your and Stefan's testing. I will wait for some more
feedback. I will be offline next few days and if there are no major
objections I will repost with both tested-bys early next week.
Andrea Arcangeli Sept. 5, 2018, 3:44 a.m. UTC | #9
On Thu, Aug 30, 2018 at 06:40:57PM +0200, Michal Hocko wrote:
> On Thu 30-08-18 10:02:23, Zi Yan wrote:
> > On 30 Aug 2018, at 9:45, Michal Hocko wrote:
> > 
> > > On Thu 30-08-18 09:22:21, Zi Yan wrote:
> > >> On 30 Aug 2018, at 3:00, Michal Hocko wrote:
> > >>
> > >>> On Wed 29-08-18 18:54:23, Zi Yan wrote:
> > >>> [...]
> > >>>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> > >>>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> > >>>>
> > >>>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}:
> > >>>> no swap, THPs are allocated in the fallback node.

no swap

> > >>>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the
> > >>>> disk instead of being allocated in the fallback node.

swap

> > >>>> 3. no madvise, THP is on by default + defrag = {always, defer,
> > >>>> defer+madvise}: pages got swapped to the disk instead of being
> > >>>> allocated in the fallback node.

swap

> > >>>> 4. no madvise, THP is on by default + defrag = madvise: no swap, base
> > >>>> pages are allocated in the fallback node.

no swap

> > >>>> The result 2 and 3 seems unexpected, since pages should be allocated in the fallback node.

I agree it's not great for 2 and 3.

I don't see how the above can be considered a 100% "pass" to the test,
at best it's a 50% pass.

Let me clarify the setup to be sure:

1) There was no hard bind at all

2) Let's also ignore NUMA balancing which is all but restrictive at
   the start and it's meant to converge over time if current
   conditions don't allow immediate convergence. For simplicity let's
   assume NUMA balancing off.

So what the test exercised is the plain normal allocation of RAM with
THP main knob enabled to "always" on a NUMA system.

No matter the madvise used or not used, 2 cases over 4 decided to
swapout instead of allocating totally free THP or PAGE_SIZEd pages.

As opposed there would have been absolutely zero swapouts in the exact
same test if the main THP knob would have been disabled with:

     echo never >/sys/kernel/mm/transparent_hugepage/enabled

There is no way that enabling THP (no matter what other defrag
settings were and no matter if MADV_HUGEPAGE was used or not) should
cause heavy swap storms during page faults allocating memory, when
disabling THP doesn't swap even a single 4k page. That can't possibly
be right.

This is because there is no way the overhead of swapping can be
compensated by the THP improvement.

And with swapping I really mean "reclaim", just testing with the
swapout testcase is simpler and doesn't require an iommu pinning all
memory. So setting may_swap and may_unmap to zero won't move the
needle because my test showed just massive CPU consumption in trying
so hard to generate THP from the local node, but nothing got swapped
out because of the iommu pins.

That kind of swapping may only pay off in the very long long term,
which is what khugepaged is for. khugepaged already takes care of the
long term, so we could later argue and think if khugepaged should
swapout or not in such condition, but I don't think there's much to
argue about the page fault.

> Thanks for your and Stefan's testing. I will wait for some more
> feedback. I will be offline next few days and if there are no major
> objections I will repost with both tested-bys early next week.

I'm not so positive about 2 of the above tests if I understood the
test correctly.

Those results are totally fine if you used the non default memory
policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
I'm afraid it'll be even be harder to reproduce when things will go
wrong again in those two cases.

Thanks,
Andrea
Michal Hocko Sept. 5, 2018, 7:08 a.m. UTC | #10
On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
[...]
> That kind of swapping may only pay off in the very long long term,
> which is what khugepaged is for. khugepaged already takes care of the
> long term, so we could later argue and think if khugepaged should
> swapout or not in such condition, but I don't think there's much to
> argue about the page fault.

I agree that defrag==always doing a reclaim is not really good and
benefits are questionable. If you remember this was the primary reason
why the default has been changed.

> > Thanks for your and Stefan's testing. I will wait for some more
> > feedback. I will be offline next few days and if there are no major
> > objections I will repost with both tested-bys early next week.
> 
> I'm not so positive about 2 of the above tests if I understood the
> test correctly.
> 
> Those results are totally fine if you used the non default memory
> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
> I'm afraid it'll be even be harder to reproduce when things will go
> wrong again in those two cases.

We can and should think about this much more but I would like to have
this regression closed. So can we address GFP_THISNODE part first and
build more complex solution on top?

Is there any objection to my patch which does the similar thing to your
patch v2 in a different location?
Vlastimil Babka Sept. 6, 2018, 10:59 a.m. UTC | #11
On 08/30/2018 12:54 AM, Zi Yan wrote:
> 
> Thanks for your patch.
> 
> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
> each node and got the results below. I expect this test should fill one node, then fall back to the other.
> 
> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.

Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
No direct reclaim, so it would have to be kswapd causing the swapping? I
wouldn't expect it to be significant and over-reclaiming. What exactly
is your definition of "pages got swapped"?

> 3. no madvise, THP is on by default + defrag = {always, defer, defer+madvise}: pages got swapped to the disk instead of
> being allocated in the fallback node.

So this should be the most common case (no madvise, THP on). If it's
causing too much reclaim, it's not good IMHO.

depending on defrag:
defer (the default) = same as above, so it would have to be kswapd
always = GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM | __GFP_NORETRY |
__GFP_THISNODE
  - so direct reclaim also overreclaims despite __GFP_NORETRY?
defer+madvise = same as defer

Vlastimil
Vlastimil Babka Sept. 6, 2018, 11:10 a.m. UTC | #12
On 09/05/2018 09:08 AM, Michal Hocko wrote:
> On Tue 04-09-18 23:44:03, Andrea Arcangeli wrote:
> [...]
>> That kind of swapping may only pay off in the very long long term,
>> which is what khugepaged is for. khugepaged already takes care of the
>> long term, so we could later argue and think if khugepaged should
>> swapout or not in such condition, but I don't think there's much to
>> argue about the page fault.
> 
> I agree that defrag==always doing a reclaim is not really good and
> benefits are questionable. If you remember this was the primary reason
> why the default has been changed.
> 
>>> Thanks for your and Stefan's testing. I will wait for some more
>>> feedback. I will be offline next few days and if there are no major
>>> objections I will repost with both tested-bys early next week.
>>
>> I'm not so positive about 2 of the above tests if I understood the
>> test correctly.
>>
>> Those results are totally fine if you used the non default memory
>> policy, but with MPOL_DEFAULT and in turn no hard bind of the memory,
>> I'm afraid it'll be even be harder to reproduce when things will go
>> wrong again in those two cases.
> 
> We can and should think about this much more but I would like to have
> this regression closed. So can we address GFP_THISNODE part first and
> build more complex solution on top?
> 
> Is there any objection to my patch which does the similar thing to your
> patch v2 in a different location?

Similar but not the same. It fixes the madvise case, but I wonder about
the no-madvise defrag=defer case, where Zi Yan reports it still causes
swapping.
Vlastimil Babka Sept. 6, 2018, 11:16 a.m. UTC | #13
On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>> We can and should think about this much more but I would like to have
>> this regression closed. So can we address GFP_THISNODE part first and
>> build more complex solution on top?
>>
>> Is there any objection to my patch which does the similar thing to your
>> patch v2 in a different location?
> 
> Similar but not the same. It fixes the madvise case, but I wonder about
> the no-madvise defrag=defer case, where Zi Yan reports it still causes
> swapping.

Ah, but that should be the same with Andrea's variant 2) patch. There
should only be difference with defrag=always, which is direct reclaim
with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
to confirm?

[1] https://marc.info/?l=linux-mm&m=153476267026951
Zi Yan Sept. 6, 2018, 11:17 a.m. UTC | #14
On 6 Sep 2018, at 6:59, Vlastimil Babka wrote:

> On 08/30/2018 12:54 AM, Zi Yan wrote:
>>
>> Thanks for your patch.
>>
>> I tested it against Linus’s tree with “memhog -r3 130g” in a two-socket machine with 128GB memory on
>> each node and got the results below. I expect this test should fill one node, then fall back to the other.
>>
>> 1. madvise(MADV_HUGEPAGE) + defrag = {always, madvise, defer+madvise}: no swap, THPs are allocated in the fallback node.
>> 2. madvise(MADV_HUGEPAGE) + defrag = defer: pages got swapped to the disk instead of being allocated in the fallback node.
>
> Hmm this is GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | __GFP_THISNODE.
> No direct reclaim, so it would have to be kswapd causing the swapping? I
> wouldn't expect it to be significant and over-reclaiming. What exactly
> is your definition of "pages got swapped"?

About 4GB pages are swapped to the disk (my swap disk size is 4.7GB).
My machine has 128GB memory in each node and memhog uses 130GB memory.
When one node is filled up, the oldest pages are swapped into disk
until memhog finishes touching all 130GB memory.

—
Best Regards,
Yan Zi
Vlastimil Babka Sept. 6, 2018, 11:18 a.m. UTC | #15
On 08/30/2018 08:47 AM, Michal Hocko wrote:
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);

The code is better without the hack in alloc_pages_vma() but I'm not
thrilled about getting vma policy here and then immediately again in
alloc_pages_vma(). But if it can't be helped...
Michal Hocko Sept. 6, 2018, 11:25 a.m. UTC | #16
On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
> >> We can and should think about this much more but I would like to have
> >> this regression closed. So can we address GFP_THISNODE part first and
> >> build more complex solution on top?
> >>
> >> Is there any objection to my patch which does the similar thing to your
> >> patch v2 in a different location?
> > 
> > Similar but not the same. It fixes the madvise case, but I wonder about
> > the no-madvise defrag=defer case, where Zi Yan reports it still causes
> > swapping.
> 
> Ah, but that should be the same with Andrea's variant 2) patch. There
> should only be difference with defrag=always, which is direct reclaim
> with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
> not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
> to confirm?

Yes, that is the only difference and that is why I've said those patches
are mostly similar. I do not want to touch defrag=always case because
this one has always been stall prone and we have replaced it as a
default just because of that. We should discuss what should be done with
that case separately IMHO.
Michal Hocko Sept. 6, 2018, 11:27 a.m. UTC | #17
On Thu 06-09-18 13:18:52, Vlastimil Babka wrote:
> On 08/30/2018 08:47 AM, Michal Hocko wrote:
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +	mpol_cond_put(pol);
> 
> The code is better without the hack in alloc_pages_vma() but I'm not
> thrilled about getting vma policy here and then immediately again in
> alloc_pages_vma(). But if it can't be helped...

The whole function is an act of beauty isn't it. I wanted to get the
policy from the caller but that would be even more messy so I've tried
to keep it in the ugly corner and have it hidden there. You should ask
your friends to read alloc_hugepage_direct_gfpmask unless they have done
something terribly wrong.
Zi Yan Sept. 6, 2018, 12:35 p.m. UTC | #18
On 6 Sep 2018, at 7:25, Michal Hocko wrote:

> On Thu 06-09-18 13:16:00, Vlastimil Babka wrote:
>> On 09/06/2018 01:10 PM, Vlastimil Babka wrote:
>>>> We can and should think about this much more but I would like to have
>>>> this regression closed. So can we address GFP_THISNODE part first and
>>>> build more complex solution on top?
>>>>
>>>> Is there any objection to my patch which does the similar thing to your
>>>> patch v2 in a different location?
>>>
>>> Similar but not the same. It fixes the madvise case, but I wonder about
>>> the no-madvise defrag=defer case, where Zi Yan reports it still causes
>>> swapping.
>>
>> Ah, but that should be the same with Andrea's variant 2) patch. There
>> should only be difference with defrag=always, which is direct reclaim
>> with __GFP_NORETRY, Andrea's patch would drop __GFP_THISNODE and your
>> not. Maybe Zi Yan can do the same kind of tests with Andrea's patch [1]
>> to confirm?
>
> Yes, that is the only difference and that is why I've said those patches
> are mostly similar. I do not want to touch defrag=always case because
> this one has always been stall prone and we have replaced it as a
> default just because of that. We should discuss what should be done with
> that case separately IMHO.

Vlastimil, my test using Andrea’s patch confirms your statement.
My test result of Andrea’s patch shows that it gives the same outcomes as
Michal’s patch except that when no madvise is used, THP is on by default
+ defrag = {always}, instead of swapping pages to disk, Adndrea’s patch
causes no swapping and THPs are allocated in the fallback node.

As I said before, the fundamental issue that causes swapping pages to disk
when allocating THPs in a filled node is __GFP_THISNODE removes all fallback
zone/node options, thus,  __GFP_KSWAPD_RECLAIM or __GFP_DIRECT_RECLAIM
can only swap pages out to satisfy the THP allocation request.

__GFP_THISNODE can be seen as a kernel-version MPOL_BIND policy, which
overwrites any user space memory policy and should be removed or limited
to kernel-only page allocations. But, as Michal said, we could discuss
this further but do not make this discussion on the critical path of merging
the patch.

—
Best Regards,
Yan Zi
Mel Gorman Sept. 12, 2018, 5:29 p.m. UTC | #19
On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> From 4dc2f772756e6f91b9e64d1a3e2df4dca3475f5b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 28 Aug 2018 09:59:19 +0200
> Subject: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on 4.12 based
> SLES kernel which suggests the same issue:

Note that this behaviour is unhelpful but it is not against the defined
semantic of the "madvise" defrag option.

> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE handling from alloc_pages_vma where
> it doesn't belong and move it to alloc_hugepage_direct_gfpmask where we
> juggle gfp flags for different allocation modes.

For the short term, I think you might be better off simply avoiding the
combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM and declaring
that the fix. That would be easier for -stable and the layering can be
dealt with as a cleanup.

I recognise that this fix means that users that expect zone_reclaim_mode==1
type behaviour may get burned but the users that benefit from that should
also be users that benefit from sizing their workload to a node. They should
be able to replicate that with mempolicies or at least use prepation scripts
to clear memory on a target node (e.g. membind a memhog to the desired size,
exit and then start the target workload).

I think this is a more appropriate solution than prematurely introducing a
GFP flag as it's not guaranteed that a user is willing to pay a compaction
penalty until it fails. That should be decided separately when the immediate
problem is resolved.

That said, I do think that sorting out where GFP flags are set for THP
should be done in the context of THP code and not alloc_pages_vma. The
current layering is a bit odd.

> The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 25 +++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..94472bf9a31b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +#endif
>  

Where is the mpol_cond_put? Historically it might not have mattered
because THP could not be used with a shared possibility but it probably
matters now that tmpfs can be backed by THP.

The comment needs more expansion as well. Arguably it only makes sense in
the event we are explicitly bound to one node because if we are bound to
two nodes without interleaving then why not fall back? The answer to that
is outside the scope of the patch but the comment as-is will cause head
scratches in a years time.

>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -

The hugepage flag passed into this function is now redundant and that
means that callers of alloc_hugepage_vma need to move back to using
alloc_pages_vma() directly and remove the API entirely. This block of
code is about both GFP flag settings and node selection but at a glance I
cannot see the point of it because it's very similar to the base page code.
The whole point may be to get around the warning in policy_node and that
could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
as you do already in this patch. There should be no reason why THP has a
different policy than a base page within a single VMA.
Michal Hocko Sept. 17, 2018, 6:11 a.m. UTC | #20
[sorry I've missed your reply]

On Wed 12-09-18 18:29:25, Mel Gorman wrote:
> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
[...]
> I recognise that this fix means that users that expect zone_reclaim_mode==1
> type behaviour may get burned but the users that benefit from that should
> also be users that benefit from sizing their workload to a node. They should
> be able to replicate that with mempolicies or at least use prepation scripts
> to clear memory on a target node (e.g. membind a memhog to the desired size,
> exit and then start the target workload).

As I've said in other email. We probably want to add a new mempolicy
which has zone_reclaim_mode-like semantic.

[...]

> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index 5228c62af416..bac395f1d00a 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> >  struct mempolicy *get_task_policy(struct task_struct *p);
> >  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >  		unsigned long addr);
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +						unsigned long addr);
> >  bool vma_policy_mof(struct vm_area_struct *vma);
> >  
> >  extern void numa_default_policy(void);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3bc7e9c9a2a..94472bf9a31b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >   *	    available
> >   * never: never stall for any thp allocation
> >   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +	struct mempolicy *pol;
> > +
> > +#ifdef CONFIG_NUMA
> > +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +#endif
> >  
> 
> Where is the mpol_cond_put? Historically it might not have mattered
> because THP could not be used with a shared possibility but it probably
> matters now that tmpfs can be backed by THP.

http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz

> The comment needs more expansion as well. Arguably it only makes sense in
> the event we are explicitly bound to one node because if we are bound to
> two nodes without interleaving then why not fall back? The answer to that
> is outside the scope of the patch but the comment as-is will cause head
> scratches in a years time.

Do you have any specific wording in mind? I have a bit hard time to come
up with something more precise and do not go into details too much.
 
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> > +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> > -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> > +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     __GFP_KSWAPD_RECLAIM);
> > +							     __GFP_KSWAPD_RECLAIM | this_node);
> >  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> >  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> > -							     0);
> > -	return GFP_TRANSHUGE_LIGHT;
> > +							     this_node);
> > +	return GFP_TRANSHUGE_LIGHT | this_node;
> >  }
> >  
> >  /* Caller must hold page table lock. */
> > @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >  			pte_free(vma->vm_mm, pgtable);
> >  		return ret;
> >  	}
> > -	gfp = alloc_hugepage_direct_gfpmask(vma);
> > +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	if (unlikely(!page)) {
> >  		count_vm_event(THP_FAULT_FALLBACK);
> > @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> >  alloc:
> >  	if (transparent_hugepage_enabled(vma) &&
> >  	    !transparent_hugepage_debug_cow()) {
> > -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> > +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
> >  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
> >  	} else
> >  		new_page = NULL;
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..75bbfc3d6233 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >   * freeing by another task.  It is the caller's responsibility to free the
> >   * extra reference for shared policies.
> >   */
> > -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  						unsigned long addr)
> >  {
> >  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> > @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		goto out;
> >  	}
> >  
> > -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> > -		int hpage_node = node;
> > -
> > -		/*
> > -		 * For hugepage allocation and non-interleave policy which
> > -		 * allows the current node (or other explicitly preferred
> > -		 * node) we only try to allocate from the current/preferred
> > -		 * node and don't fall back to other nodes, as the cost of
> > -		 * remote accesses would likely offset THP benefits.
> > -		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > -		 */
> > -		if (pol->mode == MPOL_PREFERRED &&
> > -						!(pol->flags & MPOL_F_LOCAL))
> > -			hpage_node = pol->v.preferred_node;
> > -
> > -		nmask = policy_nodemask(gfp, pol);
> > -		if (!nmask || node_isset(hpage_node, *nmask)) {
> > -			mpol_cond_put(pol);
> > -			page = __alloc_pages_node(hpage_node,
> > -						gfp | __GFP_THISNODE, order);
> > -			goto out;
> > -		}
> > -	}
> > -
> 
> The hugepage flag passed into this function is now redundant and that
> means that callers of alloc_hugepage_vma need to move back to using
> alloc_pages_vma() directly and remove the API entirely. This block of
> code is about both GFP flag settings and node selection but at a glance I
> cannot see the point of it because it's very similar to the base page code.
> The whole point may be to get around the warning in policy_node and that
> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
> as you do already in this patch. There should be no reason why THP has a
> different policy than a base page within a single VMA.

OK, I can follow up with a cleanup patch once we settle down with this
approach to fix the issue.

Thanks!
Stefan Priebe - Profihost AG Sept. 17, 2018, 7:04 a.m. UTC | #21
Hi,

i had multiple memory stalls this weekend again. All kvm processes where
spinning trying to get > 100% CPU and i was not able to even login to
ssh. After 5-10 minutes i was able to login.

There were about 150GB free mem on the host.

Relevant settings (no local storage involved):
        vm.dirty_background_ratio:
            3
        vm.dirty_ratio:
            10
        vm.min_free_kbytes:
            10567004

# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer [defer+madvise] madvise never

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

After that i had the following traces on the host node:
https://pastebin.com/raw/0VhyQmAv

Thanks!

Greets,
Stefan


Am 17.09.2018 um 08:11 schrieb Michal Hocko:
> [sorry I've missed your reply]
> 
> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
> [...]
>> I recognise that this fix means that users that expect zone_reclaim_mode==1
>> type behaviour may get burned but the users that benefit from that should
>> also be users that benefit from sizing their workload to a node. They should
>> be able to replicate that with mempolicies or at least use prepation scripts
>> to clear memory on a target node (e.g. membind a memhog to the desired size,
>> exit and then start the target workload).
> 
> As I've said in other email. We probably want to add a new mempolicy
> which has zone_reclaim_mode-like semantic.
> 
> [...]
> 
>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>> index 5228c62af416..bac395f1d00a 100644
>>> --- a/include/linux/mempolicy.h
>>> +++ b/include/linux/mempolicy.h
>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>  		unsigned long addr);
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +						unsigned long addr);
>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>  
>>>  extern void numa_default_policy(void);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *	    available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +	struct mempolicy *pol;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +#endif
>>>  
>>
>> Where is the mpol_cond_put? Historically it might not have mattered
>> because THP could not be used with a shared possibility but it probably
>> matters now that tmpfs can be backed by THP.
> 
> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
> 
>> The comment needs more expansion as well. Arguably it only makes sense in
>> the event we are explicitly bound to one node because if we are bound to
>> two nodes without interleaving then why not fall back? The answer to that
>> is outside the scope of the patch but the comment as-is will cause head
>> scratches in a years time.
> 
> Do you have any specific wording in mind? I have a bit hard time to come
> up with something more precise and do not go into details too much.
>  
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     __GFP_KSWAPD_RECLAIM);
>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>> -							     0);
>>> -	return GFP_TRANSHUGE_LIGHT;
>>> +							     this_node);
>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>  }
>>>  
>>>  /* Caller must hold page table lock. */
>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>  			pte_free(vma->vm_mm, pgtable);
>>>  		return ret;
>>>  	}
>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	if (unlikely(!page)) {
>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>  alloc:
>>>  	if (transparent_hugepage_enabled(vma) &&
>>>  	    !transparent_hugepage_debug_cow()) {
>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>  	} else
>>>  		new_page = NULL;
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index da858f794eb6..75bbfc3d6233 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>   * extra reference for shared policies.
>>>   */
>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>  						unsigned long addr)
>>>  {
>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>  		goto out;
>>>  	}
>>>  
>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>> -		int hpage_node = node;
>>> -
>>> -		/*
>>> -		 * For hugepage allocation and non-interleave policy which
>>> -		 * allows the current node (or other explicitly preferred
>>> -		 * node) we only try to allocate from the current/preferred
>>> -		 * node and don't fall back to other nodes, as the cost of
>>> -		 * remote accesses would likely offset THP benefits.
>>> -		 *
>>> -		 * If the policy is interleave, or does not allow the current
>>> -		 * node in its nodemask, we allocate the standard way.
>>> -		 */
>>> -		if (pol->mode == MPOL_PREFERRED &&
>>> -						!(pol->flags & MPOL_F_LOCAL))
>>> -			hpage_node = pol->v.preferred_node;
>>> -
>>> -		nmask = policy_nodemask(gfp, pol);
>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>> -			mpol_cond_put(pol);
>>> -			page = __alloc_pages_node(hpage_node,
>>> -						gfp | __GFP_THISNODE, order);
>>> -			goto out;
>>> -		}
>>> -	}
>>> -
>>
>> The hugepage flag passed into this function is now redundant and that
>> means that callers of alloc_hugepage_vma need to move back to using
>> alloc_pages_vma() directly and remove the API entirely. This block of
>> code is about both GFP flag settings and node selection but at a glance I
>> cannot see the point of it because it's very similar to the base page code.
>> The whole point may be to get around the warning in policy_node and that
>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>> as you do already in this patch. There should be no reason why THP has a
>> different policy than a base page within a single VMA.
> 
> OK, I can follow up with a cleanup patch once we settle down with this
> approach to fix the issue.
> 
> Thanks!
>
Stefan Priebe - Profihost AG Sept. 17, 2018, 9:32 a.m. UTC | #22
May be amissing piece:
vm.overcommit_memory=0

Greets,
Stefan

Am 17.09.2018 um 09:04 schrieb Stefan Priebe - Profihost AG:
> Hi,
> 
> i had multiple memory stalls this weekend again. All kvm processes where
> spinning trying to get > 100% CPU and i was not able to even login to
> ssh. After 5-10 minutes i was able to login.
> 
> There were about 150GB free mem on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that i had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv
> 
> Thanks!
> 
> Greets,
> Stefan
> 
> 
> Am 17.09.2018 um 08:11 schrieb Michal Hocko:
>> [sorry I've missed your reply]
>>
>> On Wed 12-09-18 18:29:25, Mel Gorman wrote:
>>> On Wed, Aug 29, 2018 at 09:24:51PM +0200, Michal Hocko wrote:
>> [...]
>>> I recognise that this fix means that users that expect zone_reclaim_mode==1
>>> type behaviour may get burned but the users that benefit from that should
>>> also be users that benefit from sizing their workload to a node. They should
>>> be able to replicate that with mempolicies or at least use prepation scripts
>>> to clear memory on a target node (e.g. membind a memhog to the desired size,
>>> exit and then start the target workload).
>>
>> As I've said in other email. We probably want to add a new mempolicy
>> which has zone_reclaim_mode-like semantic.
>>
>> [...]
>>
>>>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>>>> index 5228c62af416..bac395f1d00a 100644
>>>> --- a/include/linux/mempolicy.h
>>>> +++ b/include/linux/mempolicy.h
>>>> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>>>>  struct mempolicy *get_task_policy(struct task_struct *p);
>>>>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>  		unsigned long addr);
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +						unsigned long addr);
>>>>  bool vma_policy_mof(struct vm_area_struct *vma);
>>>>  
>>>>  extern void numa_default_policy(void);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c3bc7e9c9a2a..94472bf9a31b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -629,21 +629,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>>   *	    available
>>>>   * never: never stall for any thp allocation
>>>>   */
>>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>>  {
>>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>>> +	gfp_t this_node = 0;
>>>> +	struct mempolicy *pol;
>>>> +
>>>> +#ifdef CONFIG_NUMA
>>>> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
>>>> +	pol = get_vma_policy(vma, addr);
>>>> +	if (pol->mode != MPOL_BIND)
>>>> +		this_node = __GFP_THISNODE;
>>>> +#endif
>>>>  
>>>
>>> Where is the mpol_cond_put? Historically it might not have mattered
>>> because THP could not be used with a shared possibility but it probably
>>> matters now that tmpfs can be backed by THP.
>>
>> http://lkml.kernel.org/r/20180830064732.GA2656@dhcp22.suse.cz
>>
>>> The comment needs more expansion as well. Arguably it only makes sense in
>>> the event we are explicitly bound to one node because if we are bound to
>>> two nodes without interleaving then why not fall back? The answer to that
>>> is outside the scope of the patch but the comment as-is will cause head
>>> scratches in a years time.
>>
>> Do you have any specific wording in mind? I have a bit hard time to come
>> up with something more precise and do not go into details too much.
>>  
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>>>> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>>> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>>> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     __GFP_KSWAPD_RECLAIM);
>>>> +							     __GFP_KSWAPD_RECLAIM | this_node);
>>>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>>>>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
>>>> -							     0);
>>>> -	return GFP_TRANSHUGE_LIGHT;
>>>> +							     this_node);
>>>> +	return GFP_TRANSHUGE_LIGHT | this_node;
>>>>  }
>>>>  
>>>>  /* Caller must hold page table lock. */
>>>> @@ -715,7 +724,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>>  			pte_free(vma->vm_mm, pgtable);
>>>>  		return ret;
>>>>  	}
>>>> -	gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	if (unlikely(!page)) {
>>>>  		count_vm_event(THP_FAULT_FALLBACK);
>>>> @@ -1290,7 +1299,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>>>  alloc:
>>>>  	if (transparent_hugepage_enabled(vma) &&
>>>>  	    !transparent_hugepage_debug_cow()) {
>>>> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>>>> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>>>>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>>>>  	} else
>>>>  		new_page = NULL;
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index da858f794eb6..75bbfc3d6233 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>>>>   * freeing by another task.  It is the caller's responsibility to free the
>>>>   * extra reference for shared policies.
>>>>   */
>>>> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>>>  						unsigned long addr)
>>>>  {
>>>>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
>>>> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>>  		goto out;
>>>>  	}
>>>>  
>>>> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
>>>> -		int hpage_node = node;
>>>> -
>>>> -		/*
>>>> -		 * For hugepage allocation and non-interleave policy which
>>>> -		 * allows the current node (or other explicitly preferred
>>>> -		 * node) we only try to allocate from the current/preferred
>>>> -		 * node and don't fall back to other nodes, as the cost of
>>>> -		 * remote accesses would likely offset THP benefits.
>>>> -		 *
>>>> -		 * If the policy is interleave, or does not allow the current
>>>> -		 * node in its nodemask, we allocate the standard way.
>>>> -		 */
>>>> -		if (pol->mode == MPOL_PREFERRED &&
>>>> -						!(pol->flags & MPOL_F_LOCAL))
>>>> -			hpage_node = pol->v.preferred_node;
>>>> -
>>>> -		nmask = policy_nodemask(gfp, pol);
>>>> -		if (!nmask || node_isset(hpage_node, *nmask)) {
>>>> -			mpol_cond_put(pol);
>>>> -			page = __alloc_pages_node(hpage_node,
>>>> -						gfp | __GFP_THISNODE, order);
>>>> -			goto out;
>>>> -		}
>>>> -	}
>>>> -
>>>
>>> The hugepage flag passed into this function is now redundant and that
>>> means that callers of alloc_hugepage_vma need to move back to using
>>> alloc_pages_vma() directly and remove the API entirely. This block of
>>> code is about both GFP flag settings and node selection but at a glance I
>>> cannot see the point of it because it's very similar to the base page code.
>>> The whole point may be to get around the warning in policy_node and that
>>> could just as easily be side-stepped in alloc_hugepage_direct_gfpmask
>>> as you do already in this patch. There should be no reason why THP has a
>>> different policy than a base page within a single VMA.
>>
>> OK, I can follow up with a cleanup patch once we settle down with this
>> approach to fix the issue.
>>
>> Thanks!
>>
Michal Hocko Sept. 17, 2018, 11:27 a.m. UTC | #23
On Mon 17-09-18 09:04:02, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> i had multiple memory stalls this weekend again. All kvm processes where
> spinning trying to get > 100% CPU and i was not able to even login to
> ssh. After 5-10 minutes i was able to login.
> 
> There were about 150GB free mem on the host.
> 
> Relevant settings (no local storage involved):
>         vm.dirty_background_ratio:
>             3
>         vm.dirty_ratio:
>             10
>         vm.min_free_kbytes:
>             10567004
> 
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> After that i had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv

I would suggest reporting this in a new email thread. I would also
recommend to CC kvm guys (see MAINTAINERS file in the kernel source
tree) and trace qemu/kvm processes to see what they are doing at the
time when you see the stall.

Patch
diff mbox series

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..94472bf9a31b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,30 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +724,7 @@  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1299,7 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);