
[2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask

Message ID 20180925120326.24392-3-mhocko@kernel.org (mailing list archive)
State New, archived
Series: thp nodereclaim fixes

Commit Message

Michal Hocko Sept. 25, 2018, 12:03 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

THP allocation mode is quite complex and it depends on the defrag
mode. Much of this complexity is currently hidden in
alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
__GFP_THISNODE) is however independent and currently lives in
alloc_pages_vma. This adds an unnecessary branch to all vma based page
allocation requests and makes the code needlessly complex. Not to
mention that e.g. shmem THP used to do the node reclaim unconditionally
regardless of the defrag mode until recently. That was not only
unexpected but also hardly a good default behavior, and I strongly
suspect it was a side effect of the code sharing rather than a
deliberate decision, which suggests that such a layering is wrong.

Get rid of the thp special casing in alloc_pages_vma and move the logic
to alloc_hugepage_direct_gfpmask. To preserve the current logic,
__GFP_THISNODE is applied to the resulting gfp mask only when direct
reclaim is not requested and there is no explicit numa binding.

This allows for removing alloc_hugepage_vma as well.
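
For reference, the resulting gfp mask per defrag mode after this patch is
(condensed from the mm/huge_memory.c hunk below; this_node stands for
__GFP_THISNODE unless the vma policy is MPOL_BIND, in which case it is 0):

	always:		GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY)
	defer:		GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node
	defer+madvise:	GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM
						: __GFP_KSWAPD_RECLAIM | this_node)
	madvise:	GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM
						: this_node)
	never:		GFP_TRANSHUGE_LIGHT | this_node

The "always" case is unchanged because GFP_TRANSHUGE already implies
__GFP_DIRECT_RECLAIM, so __GFP_THISNODE is never added there.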

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h       | 12 +++-----
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 38 +++++++++++++++++------
 mm/mempolicy.c            | 63 +++------------------------------------
 mm/shmem.c                |  2 +-
 5 files changed, 40 insertions(+), 77 deletions(-)

Comments

Kirill A. Shutemov Sept. 26, 2018, 1:30 p.m. UTC | #1
On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..c0bcede31930 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *pol;
> +	/*
> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> +	 * specified, to express a general desire to stay on the current
> +	 * node for optimistic allocation attempts. If the defrag mode
> +	 * and/or madvise hint requires the direct reclaim then we prefer
> +	 * to fallback to other node rather than node reclaim because that
> +	 * can lead to excessive reclaim even though there is free memory
> +	 * on other nodes. We expect that NUMA preferences are specified
> +	 * by memory policies.
> +	 */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);
> +#endif

I'm not very good with NUMA policies. Could you explain in more details how
the code above is equivalent to the code below?

...

> @@ -2026,60 +2025,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			/*
> -			 * We cannot invoke reclaim if __GFP_THISNODE
> -			 * is set. Invoking reclaim with
> -			 * __GFP_THISNODE set, would cause THP
> -			 * allocations to trigger heavy swapping
> -			 * despite there may be tons of free memory
> -			 * (including potentially plenty of THP
> -			 * already available in the buddy) on all the
> -			 * other NUMA nodes.
> -			 *
> -			 * At most we could invoke compaction when
> -			 * __GFP_THISNODE is set (but we would need to
> -			 * refrain from invoking reclaim even if
> -			 * compaction returned COMPACT_SKIPPED because
> -			 * there wasn't not enough memory to succeed
> -			 * compaction). For now just avoid
> -			 * __GFP_THISNODE instead of limiting the
> -			 * allocation path to a strict and single
> -			 * compaction invocation.
> -			 *
> -			 * Supposedly if direct reclaim was enabled by
> -			 * the caller, the app prefers THP regardless
> -			 * of the node it comes from so this would be
> -			 * more desiderable behavior than only
> -			 * providing THP originated from the local
> -			 * node in such case.
> -			 */
> -			if (!(gfp & __GFP_DIRECT_RECLAIM))
> -				gfp |= __GFP_THISNODE;
> -			page = __alloc_pages_node(hpage_node, gfp, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
Michal Hocko Sept. 26, 2018, 2:17 p.m. UTC | #2
On Wed 26-09-18 16:30:39, Kirill A. Shutemov wrote:
> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3bc7e9c9a2a..c0bcede31930 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >   *	    available
> >   * never: never stall for any thp allocation
> >   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +
> > +#ifdef CONFIG_NUMA
> > +	struct mempolicy *pol;
> > +	/*
> > +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> > +	 * specified, to express a general desire to stay on the current
> > +	 * node for optimistic allocation attempts. If the defrag mode
> > +	 * and/or madvise hint requires the direct reclaim then we prefer
> > +	 * to fallback to other node rather than node reclaim because that
> > +	 * can lead to excessive reclaim even though there is free memory
> > +	 * on other nodes. We expect that NUMA preferences are specified
> > +	 * by memory policies.
> > +	 */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +	mpol_cond_put(pol);
> > +#endif
> 
> I'm not very good with NUMA policies. Could you explain in more details how
> the code above is equivalent to the code below?

MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
__GFP_THISNODE is applied only when we are not using
__GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
now.
Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
late check would remove __GFP_THISNODE for it as well. So in the end we
are doing the same thing unless I miss something
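
For reference, the node selection part of policy_node() currently looks
roughly like this (paraphrased from mm/mempolicy.c, not verbatim):

static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
		nd = policy->v.preferred_node;
	else
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));

	return nd;
}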
 
> > @@ -2026,60 +2025,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		goto out;
> >  	}
> >  
> > -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> > -		int hpage_node = node;
> > -
> > -		/*
> > -		 * For hugepage allocation and non-interleave policy which
> > -		 * allows the current node (or other explicitly preferred
> > -		 * node) we only try to allocate from the current/preferred
> > -		 * node and don't fall back to other nodes, as the cost of
> > -		 * remote accesses would likely offset THP benefits.
> > -		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > -		 */
> > -		if (pol->mode == MPOL_PREFERRED &&
> > -						!(pol->flags & MPOL_F_LOCAL))
> > -			hpage_node = pol->v.preferred_node;
> > -
> > -		nmask = policy_nodemask(gfp, pol);
> > -		if (!nmask || node_isset(hpage_node, *nmask)) {
> > -			mpol_cond_put(pol);
> > -			/*
> > -			 * We cannot invoke reclaim if __GFP_THISNODE
> > -			 * is set. Invoking reclaim with
> > -			 * __GFP_THISNODE set, would cause THP
> > -			 * allocations to trigger heavy swapping
> > -			 * despite there may be tons of free memory
> > -			 * (including potentially plenty of THP
> > -			 * already available in the buddy) on all the
> > -			 * other NUMA nodes.
> > -			 *
> > -			 * At most we could invoke compaction when
> > -			 * __GFP_THISNODE is set (but we would need to
> > -			 * refrain from invoking reclaim even if
> > -			 * compaction returned COMPACT_SKIPPED because
> > -			 * there wasn't not enough memory to succeed
> > -			 * compaction). For now just avoid
> > -			 * __GFP_THISNODE instead of limiting the
> > -			 * allocation path to a strict and single
> > -			 * compaction invocation.
> > -			 *
> > -			 * Supposedly if direct reclaim was enabled by
> > -			 * the caller, the app prefers THP regardless
> > -			 * of the node it comes from so this would be
> > -			 * more desiderable behavior than only
> > -			 * providing THP originated from the local
> > -			 * node in such case.
> > -			 */
> > -			if (!(gfp & __GFP_DIRECT_RECLAIM))
> > -				gfp |= __GFP_THISNODE;
> > -			page = __alloc_pages_node(hpage_node, gfp, order);
> > -			goto out;
> > -		}
> > -	}
> > -
> >  	nmask = policy_nodemask(gfp, pol);
> >  	preferred_nid = policy_node(gfp, pol, node);
> >  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> 
> -- 
>  Kirill A. Shutemov
Michal Hocko Sept. 26, 2018, 2:22 p.m. UTC | #3
On Wed 26-09-18 16:17:08, Michal Hocko wrote:
> On Wed 26-09-18 16:30:39, Kirill A. Shutemov wrote:
> > On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index c3bc7e9c9a2a..c0bcede31930 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> > >   *	    available
> > >   * never: never stall for any thp allocation
> > >   */
> > > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> > >  {
> > >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > > +	gfp_t this_node = 0;
> > > +
> > > +#ifdef CONFIG_NUMA
> > > +	struct mempolicy *pol;
> > > +	/*
> > > +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> > > +	 * specified, to express a general desire to stay on the current
> > > +	 * node for optimistic allocation attempts. If the defrag mode
> > > +	 * and/or madvise hint requires the direct reclaim then we prefer
> > > +	 * to fallback to other node rather than node reclaim because that
> > > +	 * can lead to excessive reclaim even though there is free memory
> > > +	 * on other nodes. We expect that NUMA preferences are specified
> > > +	 * by memory policies.
> > > +	 */
> > > +	pol = get_vma_policy(vma, addr);
> > > +	if (pol->mode != MPOL_BIND)
> > > +		this_node = __GFP_THISNODE;
> > > +	mpol_cond_put(pol);
> > > +#endif
> > 
> > I'm not very good with NUMA policies. Could you explain in more details how
> > the code above is equivalent to the code below?
> 
> MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
> __GFP_THISNODE is applied only when we are not using
> __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
> now.
> Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
> late check would remove __GFP_THISNODE for it as well. So in the end we
> are doing the same thing unless I miss something

Forgot to add. One notable exception would be that the previous code
would allow to hit
	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
in policy_node if the requested node (e.g. cpu local one) was outside of
the mbind nodemask. This is not possible now. We haven't heard about any
such warning yet so it is unlikely that it happens though.
David Rientjes Oct. 4, 2018, 8:17 p.m. UTC | #4
On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:

> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c3bc7e9c9a2a..c0bcede31930 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >   *	    available
> >   * never: never stall for any thp allocation
> >   */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > +	gfp_t this_node = 0;
> > +
> > +#ifdef CONFIG_NUMA
> > +	struct mempolicy *pol;
> > +	/*
> > +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> > +	 * specified, to express a general desire to stay on the current
> > +	 * node for optimistic allocation attempts. If the defrag mode
> > +	 * and/or madvise hint requires the direct reclaim then we prefer
> > +	 * to fallback to other node rather than node reclaim because that
> > +	 * can lead to excessive reclaim even though there is free memory
> > +	 * on other nodes. We expect that NUMA preferences are specified
> > +	 * by memory policies.
> > +	 */
> > +	pol = get_vma_policy(vma, addr);
> > +	if (pol->mode != MPOL_BIND)
> > +		this_node = __GFP_THISNODE;
> > +	mpol_cond_put(pol);
> > +#endif
> 
> I'm not very good with NUMA policies. Could you explain in more details how
> the code above is equivalent to the code below?
> 

It breaks mbind() because new_page() is now using numa_node_id() to 
allocate migration targets instead of using the mempolicy.  I'm not 
sure that this patch was tested for mbind().
Zi Yan Oct. 4, 2018, 9:49 p.m. UTC | #5
On 4 Oct 2018, at 16:17, David Rientjes wrote:

> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
>
>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   *	    available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	struct mempolicy *pol;
>>> +	/*
>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>> +	 * specified, to express a general desire to stay on the current
>>> +	 * node for optimistic allocation attempts. If the defrag mode
>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
>>> +	 * to fallback to other node rather than node reclaim because that
>>> +	 * can lead to excessive reclaim even though there is free memory
>>> +	 * on other nodes. We expect that NUMA preferences are specified
>>> +	 * by memory policies.
>>> +	 */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +	mpol_cond_put(pol);
>>> +#endif
>>
>> I'm not very good with NUMA policies. Could you explain in more details how
>> the code above is equivalent to the code below?
>>
>
> It breaks mbind() because new_page() is now using numa_node_id() to
> allocate migration targets for instead of using the mempolicy.  I'm not
> sure that this patch was tested for mbind().

I do not see mbind() being broken. With both patches applied, I ran
"numactl -N 0 memhog -r1 4096m membind 1" and saw that all pages were
allocated on Node 1, not Node 0 (the node returned by numa_node_id()).

From the source code, in alloc_pages_vma(), the nodemask is generated
from the memory policy (i.e. mbind in the case above), which only has
the nodes specified by mbind(). Then, __alloc_pages_nodemask() only uses
the zones from the nodemask. The numa_node_id() return value will be
ignored in the actual page allocation process if mbind policy is applied.
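
A minimal userspace sketch of the same check (illustrative only and with
error handling omitted; it assumes a two-node machine with memory on node 1
and libnuma headers, built with "gcc demo.c -lnuma" and run pinned to node 0,
e.g. "numactl -N 0 ./demo"):

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 4UL << 20;			/* 4MB, enough for one THP */
	unsigned long mask = 1UL << 1;		/* nodemask with only node 1 */
	int node = -1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(p, len, MADV_HUGEPAGE);
	mbind(p, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);

	memset(p, 1, len);			/* fault the range in */

	/* MPOL_F_NODE | MPOL_F_ADDR asks for the node backing this address */
	get_mempolicy(&node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR);
	printf("memory at %p ended up on node %d\n", (void *)p, node);
	return 0;
}

With the bind policy in place the reported node is 1 even though the task
runs on node 0.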

Let me know if I miss anything.


--
Best Regards
Yan Zi
Michal Hocko Oct. 9, 2018, 12:36 p.m. UTC | #6
On Thu 04-10-18 13:17:52, David Rientjes wrote:
> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
> 
> > On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index c3bc7e9c9a2a..c0bcede31930 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> > >   *	    available
> > >   * never: never stall for any thp allocation
> > >   */
> > > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> > > +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> > >  {
> > >  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> > > +	gfp_t this_node = 0;
> > > +
> > > +#ifdef CONFIG_NUMA
> > > +	struct mempolicy *pol;
> > > +	/*
> > > +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> > > +	 * specified, to express a general desire to stay on the current
> > > +	 * node for optimistic allocation attempts. If the defrag mode
> > > +	 * and/or madvise hint requires the direct reclaim then we prefer
> > > +	 * to fallback to other node rather than node reclaim because that
> > > +	 * can lead to excessive reclaim even though there is free memory
> > > +	 * on other nodes. We expect that NUMA preferences are specified
> > > +	 * by memory policies.
> > > +	 */
> > > +	pol = get_vma_policy(vma, addr);
> > > +	if (pol->mode != MPOL_BIND)
> > > +		this_node = __GFP_THISNODE;
> > > +	mpol_cond_put(pol);
> > > +#endif
> > 
> > I'm not very good with NUMA policies. Could you explain in more details how
> > the code above is equivalent to the code below?
> > 
> 
> It breaks mbind() because new_page() is now using numa_node_id() to 
> allocate migration targets for instead of using the mempolicy.  I'm not 
> sure that this patch was tested for mbind().

I am sorry but I do not follow, could you be more specific please?
MPOL_BIND should never get __GFP_THISNODE. What am I missing?
Andrew Morton Oct. 19, 2018, 2:11 a.m. UTC | #7
On Wed, 26 Sep 2018 16:22:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
> > __GFP_THISNODE is applied only when we are not using
> > __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
> > now.
> > Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
> > late check would remove __GFP_THISNODE for it as well. So in the end we
> > are doing the same thing unless I miss something
> 
> Forgot to add. One notable exception would be that the previous code
> would allow to hit
> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> in policy_node if the requested node (e.g. cpu local one) was outside of
> the mbind nodemask. This is not possible now. We haven't heard about any
> such warning yet so it is unlikely that it happens though.

Perhaps a changelog addition is needed to cover the above?

I assume that David's mbind() concern has gone away.

No acks or reviewed-by's yet?
Michal Hocko Oct. 19, 2018, 8:06 a.m. UTC | #8
On Thu 18-10-18 19:11:47, Andrew Morton wrote:
> On Wed, 26 Sep 2018 16:22:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
> > > __GFP_THISNODE is applied only when we are not using
> > > __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
> > > now.
> > > Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
> > > late check would remove __GFP_THISNODE for it as well. So in the end we
> > > are doing the same thing unless I miss something
> > 
> > Forgot to add. One notable exception would be that the previous code
> > would allow to hit
> > 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > in policy_node if the requested node (e.g. cpu local one) was outside of
> > the mbind nodemask. This is not possible now. We haven't heard about any
> > such warning yet so it is unlikely that it happens though.
> 
> Perhaps a changelog addition is needed to cover the above?

: THP allocation mode is quite complex and it depends on the defrag
: mode. This complexity is hidden in alloc_hugepage_direct_gfpmask from a
: large part currently. The NUMA special casing (namely __GFP_THISNODE) is
: however independent and placed in alloc_pages_vma currently. This both
: adds an unnecessary branch to all vma based page allocation requests and
: it makes the code more complex unnecessarily as well. Not to mention
: that e.g. shmem THP used to do the node reclaiming unconditionally
: regardless of the defrag mode until recently. This was not only
: unexpected behavior but it was also hardly a good default behavior and I
: strongly suspect it was just a side effect of the code sharing more than
: a deliberate decision which suggests that such a layering is wrong.
: 
: Moreover the original code allowed to trigger
: 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
: in policy_node if the requested node (e.g. cpu local one) was outside of
: the mbind nodemask. This is not possible now. We haven't heard about any
: such warning yet so it is unlikely that it happens but still a signal of
: a wrong code layering.
: 
: Get rid of the thp special casing from alloc_pages_vma and move the logic
: to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to
: the resulting gfp mask only when the direct reclaim is not requested and
: when there is no explicit numa binding to preserve the current logic.
: 
: This allows for removing alloc_hugepage_vma as well.

Better?
 
> I assume that David's mbind() concern has gone away.

Either I've misunderstood it or it was not really a real issue.
Vlastimil Babka Oct. 22, 2018, 1:15 p.m. UTC | #9
On 9/26/18 4:22 PM, Michal Hocko wrote:
> On Wed 26-09-18 16:17:08, Michal Hocko wrote:
>> On Wed 26-09-18 16:30:39, Kirill A. Shutemov wrote:
>>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>>   *	    available
>>>>   * never: never stall for any thp allocation
>>>>   */
>>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>>  {
>>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>>> +	gfp_t this_node = 0;
>>>> +
>>>> +#ifdef CONFIG_NUMA
>>>> +	struct mempolicy *pol;
>>>> +	/*
>>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>>> +	 * specified, to express a general desire to stay on the current
>>>> +	 * node for optimistic allocation attempts. If the defrag mode
>>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
>>>> +	 * to fallback to other node rather than node reclaim because that
>>>> +	 * can lead to excessive reclaim even though there is free memory
>>>> +	 * on other nodes. We expect that NUMA preferences are specified
>>>> +	 * by memory policies.
>>>> +	 */
>>>> +	pol = get_vma_policy(vma, addr);
>>>> +	if (pol->mode != MPOL_BIND)
>>>> +		this_node = __GFP_THISNODE;
>>>> +	mpol_cond_put(pol);
>>>> +#endif
>>>
>>> I'm not very good with NUMA policies. Could you explain in more details how
>>> the code above is equivalent to the code below?
>>
>> MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
>> __GFP_THISNODE is applied only when we are not using
>> __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
>> now.
>> Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
>> late check would remove __GFP_THISNODE for it as well. So in the end we
>> are doing the same thing unless I miss something
> 
> Forgot to add. One notable exception would be that the previous code
> would allow to hit
> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> in policy_node if the requested node (e.g. cpu local one) was outside of
> the mbind nodemask. This is not possible now. We haven't heard about any
> such warning yet so it is unlikely that it happens though.

I don't think the previous code could hit the warning, as the hugepage
path that would add __GFP_THISNODE didn't call policy_node() (containing
the warning) at all. IIRC an early version of your patch did hit the warning
though, which is why you added the MPOL_BIND policy check.
Vlastimil Babka Oct. 22, 2018, 1:27 p.m. UTC | #10
On 10/19/18 10:06 AM, Michal Hocko wrote:
> On Thu 18-10-18 19:11:47, Andrew Morton wrote:
>> On Wed, 26 Sep 2018 16:22:27 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>>> MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
>>>> __GFP_THISNODE is applied only when we are not using
>>>> __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
>>>> now.
>>>> Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
>>>> late check would remove __GFP_THISNODE for it as well. So in the end we
>>>> are doing the same thing unless I miss something
>>>
>>> Forgot to add. One notable exception would be that the previous code
>>> would allow to hit
>>> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>> in policy_node if the requested node (e.g. cpu local one) was outside of
>>> the mbind nodemask. This is not possible now. We haven't heard about any
>>> such warning yet so it is unlikely that it happens though.
>>
>> Perhaps a changelog addition is needed to cover the above?
> 
> : THP allocation mode is quite complex and it depends on the defrag
> : mode. This complexity is hidden in alloc_hugepage_direct_gfpmask from a
> : large part currently. The NUMA special casing (namely __GFP_THISNODE) is
> : however independent and placed in alloc_pages_vma currently. This both
> : adds an unnecessary branch to all vma based page allocation requests and
> : it makes the code more complex unnecessarily as well. Not to mention
> : that e.g. shmem THP used to do the node reclaiming unconditionally
> : regardless of the defrag mode until recently. This was not only
> : unexpected behavior but it was also hardly a good default behavior and I
> : strongly suspect it was just a side effect of the code sharing more than
> : a deliberate decision which suggests that such a layering is wrong.
> : 
> : Moreover the oriinal code allowed to trigger
> : 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> : in policy_node if the requested node (e.g. cpu local one) was outside of
> : the mbind nodemask. This is not possible now. We haven't heard about any
> : such warning yet so it is unlikely that it happens but still a signal of
> : a wrong code layering.

Ah, as I said in the other mail, I think it's inaccurate, the warning
was not possible to hit.

There's also a slight difference wrt MPOL_BIND. The previous code would
avoid using __GFP_THISNODE if the local node was outside of
policy_nodemask(). After your patch __GFP_THISNODE is avoided for all
MPOL_BIND policies. So there's a difference that if local node is
actually allowed by the bind policy's nodemask, previously
__GFP_THISNODE would be added, but now it won't be. I don't think it
matters that much though, but maybe the changelog could say that
(instead of the inaccurate note about warning). Note the other policy
where nodemask is relevant is MPOL_INTERLEAVE, and that's unchanged by
this patch.
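
Condensing the two hunks, the MPOL_BIND-relevant part of the decision
changes roughly like this:

	/* before, in alloc_pages_vma() on the hugepage path */
	nmask = policy_nodemask(gfp, pol);
	if (!nmask || node_isset(hpage_node, *nmask))
		if (!(gfp & __GFP_DIRECT_RECLAIM))
			gfp |= __GFP_THISNODE;

	/* after, in alloc_hugepage_direct_gfpmask() */
	if (pol->mode != MPOL_BIND)
		this_node = __GFP_THISNODE;
	/* ... ORed into the mask only in the branches that do not use
	   __GFP_DIRECT_RECLAIM */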

When that's addressed, you can add

Acked-by: Vlastimil Babka <vbabka@suse.cz>

(Note I also agree with patch 1/2 but didn't think it was useful to
formally ack it on top of Mel's ack supported by actual measurements, as
we're all from the same company).

> : Get rid of the thp special casing from alloc_pages_vma and move the logic
> : to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to
> : the resulting gfp mask only when the direct reclaim is not requested and
> : when there is no explicit numa binding to preserve the current logic.
> : 
> : This allows for removing alloc_hugepage_vma as well.
> 
> Better?
>  
>> I assume that David's mbind() concern has gone away.
> 
> Either I've misunderstood it or it was not really a real issue.
>
Michal Hocko Oct. 22, 2018, 1:30 p.m. UTC | #11
On Mon 22-10-18 15:15:38, Vlastimil Babka wrote:
> On 9/26/18 4:22 PM, Michal Hocko wrote:
> > On Wed 26-09-18 16:17:08, Michal Hocko wrote:
> >> On Wed 26-09-18 16:30:39, Kirill A. Shutemov wrote:
> >>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index c3bc7e9c9a2a..c0bcede31930 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >>>>   *	    available
> >>>>   * never: never stall for any thp allocation
> >>>>   */
> >>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> >>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >>>>  {
> >>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> >>>> +	gfp_t this_node = 0;
> >>>> +
> >>>> +#ifdef CONFIG_NUMA
> >>>> +	struct mempolicy *pol;
> >>>> +	/*
> >>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> >>>> +	 * specified, to express a general desire to stay on the current
> >>>> +	 * node for optimistic allocation attempts. If the defrag mode
> >>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
> >>>> +	 * to fallback to other node rather than node reclaim because that
> >>>> +	 * can lead to excessive reclaim even though there is free memory
> >>>> +	 * on other nodes. We expect that NUMA preferences are specified
> >>>> +	 * by memory policies.
> >>>> +	 */
> >>>> +	pol = get_vma_policy(vma, addr);
> >>>> +	if (pol->mode != MPOL_BIND)
> >>>> +		this_node = __GFP_THISNODE;
> >>>> +	mpol_cond_put(pol);
> >>>> +#endif
> >>>
> >>> I'm not very good with NUMA policies. Could you explain in more details how
> >>> the code above is equivalent to the code below?
> >>
> >> MPOL_PREFERRED is handled by policy_node() before we call __alloc_pages_nodemask.
> >> __GFP_THISNODE is applied only when we are not using
> >> __GFP_DIRECT_RECLAIM which is handled in alloc_hugepage_direct_gfpmask
> >> now.
> >> Lastly MPOL_BIND wasn't handled explicitly but in the end the removed
> >> late check would remove __GFP_THISNODE for it as well. So in the end we
> >> are doing the same thing unless I miss something
> > 
> > Forgot to add. One notable exception would be that the previous code
> > would allow to hit
> > 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > in policy_node if the requested node (e.g. cpu local one) was outside of
> > the mbind nodemask. This is not possible now. We haven't heard about any
> > such warning yet so it is unlikely that it happens though.
> 
> I don't think the previous code could hit the warning, as the hugepage
> path that would add __GFP_THISNODE didn't call policy_node() (containing
> the warning) at all. IIRC early of your patch did hit the warning
> though, which is why you added the MPOL_BIND policy check.

Are you sure? What prevents node_isset(node, policy_nodemask()) == F and
fallback to the !huge allocation path? alloc_pages_vma is usually called
with the local node and processes shouldn't run off their bound numa
mask but is that guaranteed? Moreover do_huge_pmd_wp_page_fallback uses
the former numa binding and that might be outside of the policy mask.

In any case, as I've said this is highly unlikely to hit which is
underlined by the lack of reports.
Vlastimil Babka Oct. 22, 2018, 1:35 p.m. UTC | #12
On 10/22/18 3:30 PM, Michal Hocko wrote:
> On Mon 22-10-18 15:15:38, Vlastimil Babka wrote:
>>> Forgot to add. One notable exception would be that the previous code
>>> would allow to hit
>>> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>> in policy_node if the requested node (e.g. cpu local one) was outside of
>>> the mbind nodemask. This is not possible now. We haven't heard about any
>>> such warning yet so it is unlikely that it happens though.
>>
>> I don't think the previous code could hit the warning, as the hugepage
>> path that would add __GFP_THISNODE didn't call policy_node() (containing
>> the warning) at all. IIRC early of your patch did hit the warning
>> though, which is why you added the MPOL_BIND policy check.
> 
> Are you sure? What prevents node_isset(node, policy_nodemask()) == F and
> fallback to the !huge allocation path?

That can indeed happen, but then the code also skipped the "gfp |=
__GFP_THISNODE" part, right? So the warning wouldn't trigger.

> alloc_pages_vma is usually called
> with the local node and processes shouldn't run off their bounded num
> mask but is that guaranteed? Moreover do_huge_pmd_wp_page_fallback uses
> the former numa binding and that might be outside of the policy mask.
> 
> In any case, as I've said this is highly unlikely to hit which is
> underlined by the lack of reports.
>
Michal Hocko Oct. 22, 2018, 1:46 p.m. UTC | #13
On Mon 22-10-18 15:35:24, Vlastimil Babka wrote:
> On 10/22/18 3:30 PM, Michal Hocko wrote:
> > On Mon 22-10-18 15:15:38, Vlastimil Babka wrote:
> >>> Forgot to add. One notable exception would be that the previous code
> >>> would allow to hit
> >>> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> >>> in policy_node if the requested node (e.g. cpu local one) was outside of
> >>> the mbind nodemask. This is not possible now. We haven't heard about any
> >>> such warning yet so it is unlikely that it happens though.
> >>
> >> I don't think the previous code could hit the warning, as the hugepage
> >> path that would add __GFP_THISNODE didn't call policy_node() (containing
> >> the warning) at all. IIRC early of your patch did hit the warning
> >> though, which is why you added the MPOL_BIND policy check.
> > 
> > Are you sure? What prevents node_isset(node, policy_nodemask()) == F and
> > fallback to the !huge allocation path?
> 
> That can indeed happen, but then the code also skipped the "gfp |=
> __GFP_THISNODE" part, right? So the warning wouldn't trigger.

I thought I had crawled all the code paths back then but maybe there
were some phantom ones... If you are sure about it then we can stick with
the original changelog.
Vlastimil Babka Oct. 22, 2018, 1:53 p.m. UTC | #14
On 10/22/18 3:46 PM, Michal Hocko wrote:
> On Mon 22-10-18 15:35:24, Vlastimil Babka wrote:
>> On 10/22/18 3:30 PM, Michal Hocko wrote:
>>> On Mon 22-10-18 15:15:38, Vlastimil Babka wrote:
>>>>> Forgot to add. One notable exception would be that the previous code
>>>>> would allow to hit
>>>>> 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>>>> in policy_node if the requested node (e.g. cpu local one) was outside of
>>>>> the mbind nodemask. This is not possible now. We haven't heard about any
>>>>> such warning yet so it is unlikely that it happens though.
>>>>
>>>> I don't think the previous code could hit the warning, as the hugepage
>>>> path that would add __GFP_THISNODE didn't call policy_node() (containing
>>>> the warning) at all. IIRC early of your patch did hit the warning
>>>> though, which is why you added the MPOL_BIND policy check.
>>>
>>> Are you sure? What prevents node_isset(node, policy_nodemask()) == F and
>>> fallback to the !huge allocation path?
>>
>> That can indeed happen, but then the code also skipped the "gfp |=
>> __GFP_THISNODE" part, right? So the warning wouldn't trigger.
> 
> I thought I have crawled all the code paths back then but maybe there
> were some phantom ones... If you are sure about then we can stick with
> the original changelog.

__GFP_THISNODE would have to already be set in the 'gfp' parameter
of alloc_pages_vma(), and alloc_hugepage_direct_gfpmask() could not add
it. So in the context of alloc_hugepage_direct_gfpmask() users I believe
the patch neither removes nor adds the possibility of triggering the
warning.
Andrew Morton Oct. 24, 2018, 11:17 p.m. UTC | #15
On Mon, 22 Oct 2018 15:27:54 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> > : Moreover the oriinal code allowed to trigger
> > : 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > : in policy_node if the requested node (e.g. cpu local one) was outside of
> > : the mbind nodemask. This is not possible now. We haven't heard about any
> > : such warning yet so it is unlikely that it happens but still a signal of
> > : a wrong code layering.
> 
> Ah, as I said in the other mail, I think it's inaccurate, the warning
> was not possible to hit.
> 
> There's also a slight difference wrt MPOL_BIND. The previous code would
> avoid using __GFP_THISNODE if the local node was outside of
> policy_nodemask(). After your patch __GFP_THISNODE is avoided for all
> MPOL_BIND policies. So there's a difference that if local node is
> actually allowed by the bind policy's nodemask, previously
> __GFP_THISNODE would be added, but now it won't be. I don't think it
> matters that much though, but maybe the changelog could say that
> (instead of the inaccurate note about warning). Note the other policy
> where nodemask is relevant is MPOL_INTERLEAVE, and that's unchanged by
> this patch.

So the above could go into the changelog, yes?

> When that's addressed, you can add

What is it that you'd like to see addressed?  Purely changelog updates?

> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.
Vlastimil Babka Oct. 25, 2018, 4:56 a.m. UTC | #16
On 10/25/18 1:17 AM, Andrew Morton wrote:
> On Mon, 22 Oct 2018 15:27:54 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>>> : Moreover the oriinal code allowed to trigger
>>> : 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>>> : in policy_node if the requested node (e.g. cpu local one) was outside of
>>> : the mbind nodemask. This is not possible now. We haven't heard about any
>>> : such warning yet so it is unlikely that it happens but still a signal of
>>> : a wrong code layering.
>>
>> Ah, as I said in the other mail, I think it's inaccurate, the warning
>> was not possible to hit.
>>
>> There's also a slight difference wrt MPOL_BIND. The previous code would
>> avoid using __GFP_THISNODE if the local node was outside of
>> policy_nodemask(). After your patch __GFP_THISNODE is avoided for all
>> MPOL_BIND policies. So there's a difference that if local node is
>> actually allowed by the bind policy's nodemask, previously
>> __GFP_THISNODE would be added, but now it won't be. I don't think it
>> matters that much though, but maybe the changelog could say that
>> (instead of the inaccurate note about warning). Note the other policy
>> where nodemask is relevant is MPOL_INTERLEAVE, and that's unchanged by
>> this patch.
> 
> So the above could go into the changelog, yes?

Yeah.

>> When that's addressed, you can add
> 
> What is it that you'd like to see addressed?  Purely changelog updates?

Right.

>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks.
>
Michal Hocko Oct. 25, 2018, 4:14 p.m. UTC | #17
On Thu 25-10-18 06:56:37, Vlastimil Babka wrote:
> On 10/25/18 1:17 AM, Andrew Morton wrote:
> > On Mon, 22 Oct 2018 15:27:54 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> > 
> >>> : Moreover the oriinal code allowed to trigger
> >>> : 	WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> >>> : in policy_node if the requested node (e.g. cpu local one) was outside of
> >>> : the mbind nodemask. This is not possible now. We haven't heard about any
> >>> : such warning yet so it is unlikely that it happens but still a signal of
> >>> : a wrong code layering.
> >>
> >> Ah, as I said in the other mail, I think it's inaccurate, the warning
> >> was not possible to hit.
> >>
> >> There's also a slight difference wrt MPOL_BIND. The previous code would
> >> avoid using __GFP_THISNODE if the local node was outside of
> >> policy_nodemask(). After your patch __GFP_THISNODE is avoided for all
> >> MPOL_BIND policies. So there's a difference that if local node is
> >> actually allowed by the bind policy's nodemask, previously
> >> __GFP_THISNODE would be added, but now it won't be. I don't think it
> >> matters that much though, but maybe the changelog could say that
> >> (instead of the inaccurate note about warning). Note the other policy
> >> where nodemask is relevant is MPOL_INTERLEAVE, and that's unchanged by
> >> this patch.
> > 
> > So the above could go into the changelog, yes?
> 
> Yeah.

Andrew, do you want me to repost the patch or do you plan to update the
changelog yourself?
Andrew Morton Oct. 25, 2018, 4:18 p.m. UTC | #18
On October 25, 2018 9:14:10 AM PDT, Michal Hocko <mhocko@kernel.org> wrote:

>Andrew. Do you want me to repost the patch or you plan to update the
>changelog yourself?

Please send a replacement changelog and I'll paste it in?
Michal Hocko Oct. 25, 2018, 4:45 p.m. UTC | #19
On Thu 25-10-18 09:18:05, Andrew Morton wrote:
> 
> 
> On October 25, 2018 9:14:10 AM PDT, Michal Hocko <mhocko@kernel.org> wrote:
> 
> >Andrew. Do you want me to repost the patch or you plan to update the
> >changelog yourself?
> 
> Please send a replacement changelog and I'll paste it in?

THP allocation mode is quite complex and it depends on the defrag
mode. Much of this complexity is currently hidden in
alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
__GFP_THISNODE) is however independent and currently lives in
alloc_pages_vma. This adds an unnecessary branch to all vma based page
allocation requests and makes the code needlessly complex. Not to
mention that e.g. shmem THP used to do the node reclaim unconditionally
regardless of the defrag mode until recently. That was not only
unexpected behavior but also hardly a good default behavior, and I
strongly suspect it was a side effect of the code sharing rather than a
deliberate decision, which suggests that such a layering is wrong.

Get rid of the thp special casing in alloc_pages_vma and move the logic
to alloc_hugepage_direct_gfpmask. To preserve the current logic,
__GFP_THISNODE is applied to the resulting gfp mask only when direct
reclaim is not requested and there is no explicit numa binding.

Please note that there is also a slight difference wrt MPOL_BIND now. The
previous code would avoid using __GFP_THISNODE if the local node was
outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
for all MPOL_BIND policies. In other words, if the local node is actually
allowed by the bind policy's nodemask, __GFP_THISNODE would previously
have been added, but now it won't be. From the behavior POV this is
still correct because the policy nodemask is used.

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 24bcc5eec6b4..76f8db0b0e71 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,22 +510,18 @@  alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
-	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
+			int node);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
-	alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..c0bcede31930 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,40 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+
+#ifdef CONFIG_NUMA
+	struct mempolicy *pol;
+	/*
+	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
+	 * specified, to express a general desire to stay on the current
+	 * node for optimistic allocation attempts. If the defrag mode
+	 * and/or madvise hint requires the direct reclaim then we prefer
+	 * to fallback to other node rather than node reclaim because that
+	 * can lead to excessive reclaim even though there is free memory
+	 * on other nodes. We expect that NUMA preferences are specified
+	 * by memory policies.
+	 */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,8 +734,8 @@  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
-	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
+	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1290,8 +1309,9 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
-		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
+		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
+				haddr, numa_node_id());
 	} else
 		new_page = NULL;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 149b6f4cf023..b7b564751165 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1102,8 +1102,8 @@  static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
-					 HPAGE_PMD_ORDER);
+		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
+				address, numa_node_id());
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
@@ -1648,7 +1648,7 @@  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -1997,7 +1997,6 @@  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * 	@vma:  Pointer to VMA or NULL if not available.
  *	@addr: Virtual Address of the allocation. Must be inside the VMA.
  *	@node: Which node to prefer for allocation (modulo policy).
- *	@hugepage: for hugepages try only the preferred node if possible
  *
  * 	This function allocates a page from the kernel page pool and applies
  *	a NUMA policy associated with the VMA or the current process.
@@ -2008,7 +2007,7 @@  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node, bool hugepage)
+		unsigned long addr, int node)
 {
 	struct mempolicy *pol;
 	struct page *page;
@@ -2026,60 +2025,6 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			/*
-			 * We cannot invoke reclaim if __GFP_THISNODE
-			 * is set. Invoking reclaim with
-			 * __GFP_THISNODE set, would cause THP
-			 * allocations to trigger heavy swapping
-			 * despite there may be tons of free memory
-			 * (including potentially plenty of THP
-			 * already available in the buddy) on all the
-			 * other NUMA nodes.
-			 *
-			 * At most we could invoke compaction when
-			 * __GFP_THISNODE is set (but we would need to
-			 * refrain from invoking reclaim even if
-			 * compaction returned COMPACT_SKIPPED because
-			 * there wasn't not enough memory to succeed
-			 * compaction). For now just avoid
-			 * __GFP_THISNODE instead of limiting the
-			 * allocation path to a strict and single
-			 * compaction invocation.
-			 *
-			 * Supposedly if direct reclaim was enabled by
-			 * the caller, the app prefers THP regardless
-			 * of the node it comes from so this would be
-			 * more desiderable behavior than only
-			 * providing THP originated from the local
-			 * node in such case.
-			 */
-			if (!(gfp & __GFP_DIRECT_RECLAIM))
-				gfp |= __GFP_THISNODE;
-			page = __alloc_pages_node(hpage_node, gfp, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
diff --git a/mm/shmem.c b/mm/shmem.c
index 0376c124b043..371e291f5d4b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1473,7 +1473,7 @@  static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);