
[v4,08/13] mm/mempolicy: Create a page allocator for policy

Message ID 1615952410-36895-9-git-send-email-feng.tang@intel.com (mailing list archive)
State New, archived
Series Introduced multi-preference mempolicy

Commit Message

Feng Tang March 17, 2021, 3:40 a.m. UTC
From: Ben Widawsky <ben.widawsky@intel.com>

Add a helper function which takes care of handling multiple preferred
nodes. It will be called by future patches that need to handle this,
specifically VMA based page allocation, and task based page allocation.
Huge pages don't quite fit the same pattern because they use different
underlying page allocation functions. This consumes the previous
interleave policy specific allocation function to make a one stop shop
for policy based allocation.

With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED
in that it will first try the preferred node/nodes, and fall back to all
other nodes when the first try fails. Thanks to Michal Hocko for
suggestions on this.

For now, only the interleave policy will use it, so there should be no
functional change yet. However, if bisection points to issues in the
next few commits, this patch is the likely culprit.

Similar functionality is offered via policy_node() and
policy_nodemask(). By themselves, however, neither can achieve this
style of fallback across sets of nodes.
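
Concretely, the fallback amounts to two attempts: a first pass restricted
to the preferred node(s) that is allowed to fail, and a second pass over
all nodes. A minimal sketch of that flow (simplified; the function name
here is only illustrative, the real code is in the diff below):

  static struct page *preferred_many_alloc_sketch(struct mempolicy *pol,
                                                  gfp_t gfp, unsigned int order)
  {
          struct page *page;
          /* First pass: preferred nodes only, fail quietly, skip direct reclaim */
          gfp_t first_gfp = (gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;

          page = __alloc_pages_nodemask(first_gfp, order,
                                        policy_node(gfp, pol, numa_node_id()),
                                        policy_nodemask(gfp, pol));
          if (!page)
                  /* Second pass: original gfp, local node, no nodemask */
                  page = __alloc_pages_nodemask(gfp, order, numa_node_id(), NULL);

          return page;
  }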

[ Feng: for the first try, add the __GFP_NOWARN flag and skip direct
  reclaim to speed up allocation in some cases ]

Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 52 insertions(+), 13 deletions(-)

Comments

Michal Hocko April 14, 2021, 1:08 p.m. UTC | #1
On Wed 17-03-21 11:40:05, Feng Tang wrote:
> From: Ben Widawsky <ben.widawsky@intel.com>
> 
> Add a helper function which takes care of handling multiple preferred
> nodes. It will be called by future patches that need to handle this,
> specifically VMA based page allocation, and task based page allocation.
> Huge pages don't quite fit the same pattern because they use different
> underlying page allocation functions. This consumes the previous
> interleave policy specific allocation function to make a one stop shop
> for policy based allocation.
> 
> With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED
> in that it will first try the preferred node/nodes, and fall back to all
> other nodes when the first try fails. Thanks to Michal Hocko for
> suggestions on this.
> 
> For now, only the interleave policy will use it, so there should be no
> functional change yet. However, if bisection points to issues in the
> next few commits, this patch is the likely culprit.

I am not sure this is helping much. Let's see in later patches but I
would keep them separate and rather create a dedicated function for the
new policy allocation mode.
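
One possible shape of that suggestion, sketched only for illustration (the
dedicated helper's name is hypothetical and not part of this series); its
body would be the same two-pass attempt described in the commit message,
while interleave keeps its own path:

  /* Dedicated helper just for the new policy (hypothetical name) */
  static struct page *alloc_pages_preferred_many(struct mempolicy *pol,
                                                 gfp_t gfp, unsigned int order);

  /* Call sites then dispatch per policy instead of funnelling everything
   * through one combined function, e.g. in alloc_pages_current(): */
  if (pol->mode == MPOL_INTERLEAVE)
          page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
  else if (pol->mode == MPOL_PREFERRED_MANY)
          page = alloc_pages_preferred_many(pol, gfp, order);
  else
          page = __alloc_pages_nodemask(gfp, order,
                                        policy_node(gfp, pol, numa_node_id()),
                                        policy_nodemask(gfp, pol));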

> Similar functionality is offered via policy_node() and
> policy_nodemask(). By themselves, however, neither can achieve this
> style of fallback across sets of nodes.
> 
> [ Feng: for the first try, add the __GFP_NOWARN flag and skip direct
>   reclaim to speed up allocation in some cases ]
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/mempolicy.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d945f29..d21105b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2187,22 +2187,60 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  	return ret;
>  }
>  
> -/* Allocate a page in interleaved policy.
> -   Own path because it needs to do special accounting. */
> -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> -					unsigned nid)
> +/* Handle page allocation for all but interleaved policies */
> +static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
> +				       unsigned int order, int preferred_nid)
>  {
>  	struct page *page;
> +	gfp_t gfp_mask = gfp;
>  
> -	page = __alloc_pages(gfp, order, nid);
> -	/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> -	if (!static_branch_likely(&vm_numa_stat_key))
> +	if (pol->mode == MPOL_INTERLEAVE) {
> +		page = __alloc_pages(gfp, order, preferred_nid);
> +		/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> +		if (!static_branch_likely(&vm_numa_stat_key))
> +			return page;
> +		if (page && page_to_nid(page) == preferred_nid) {
> +			preempt_disable();
> +			__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> +			preempt_enable();
> +		}
>  		return page;
> -	if (page && page_to_nid(page) == nid) {
> -		preempt_disable();
> -		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> -		preempt_enable();
>  	}
> +
> +	VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
> +
> +	preferred_nid = numa_node_id();
> +
> +	/*
> +	 * There is a two pass approach implemented here for
> +	 * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
> +	 * but allow the allocation to fail. The below table explains how
> +	 * this is achieved.
> +	 *
> +	 * | Policy                        | preferred nid | nodemask   |
> +	 * |-------------------------------|---------------|------------|
> +	 * | MPOL_DEFAULT                  | local         | NULL       |
> +	 * | MPOL_PREFERRED                | best          | NULL       |
> +	 * | MPOL_INTERLEAVE               | ERR           | ERR        |
> +	 * | MPOL_BIND                     | local         | pol->nodes |
> +	 * | MPOL_PREFERRED_MANY           | best          | pol->nodes |
> +	 * | MPOL_PREFERRED_MANY (round 2) | local         | NULL       |
> +	 * +-------------------------------+---------------+------------+
> +	 */
> +	if (pol->mode == MPOL_PREFERRED_MANY) {
> +		gfp_mask |=  __GFP_NOWARN;
> +
> +		/* Skip direct reclaim, as there will be a second try */
> +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	}
> +
> +	page = __alloc_pages_nodemask(gfp_mask, order,
> +				      policy_node(gfp, pol, preferred_nid),
> +				      policy_nodemask(gfp, pol));
> +
> +	if (unlikely(!page && pol->mode == MPOL_PREFERRED_MANY))
> +		page = __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
> +
>  	return page;
>  }
>  
> @@ -2244,8 +2282,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
> +		page = alloc_pages_policy(pol, gfp, order, nid);
>  		mpol_cond_put(pol);
> -		page = alloc_page_interleave(gfp, order, nid);
>  		goto out;
>  	}
>  
> @@ -2329,7 +2367,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  	 * nor system default_policy
>  	 */
>  	if (pol->mode == MPOL_INTERLEAVE)
> -		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> +		page = alloc_pages_policy(pol, gfp, order,
> +					  interleave_nodes(pol));
>  	else
>  		page = __alloc_pages_nodemask(gfp, order,
>  				policy_node(gfp, pol, numa_node_id()),
> -- 
> 2.7.4
Feng Tang April 15, 2021, 8:17 a.m. UTC | #2
On Wed, Apr 14, 2021 at 03:08:19PM +0200, Michal Hocko wrote:
> On Wed 17-03-21 11:40:05, Feng Tang wrote:
> > From: Ben Widawsky <ben.widawsky@intel.com>
> > 
> > Add a helper function which takes care of handling multiple preferred
> > nodes. It will be called by future patches that need to handle this,
> > specifically VMA based page allocation, and task based page allocation.
> > Huge pages don't quite fit the same pattern because they use different
> > underlying page allocation functions. This consumes the previous
> > interleave policy specific allocation function to make a one stop shop
> > for policy based allocation.
> > 
> > With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED
> > in that it will first try the preferred node/nodes, and fall back to all
> > other nodes when the first try fails. Thanks to Michal Hocko for
> > suggestions on this.
> > 
> > For now, only the interleave policy will use it, so there should be no
> > functional change yet. However, if bisection points to issues in the
> > next few commits, this patch is the likely culprit.
> 
> I am not sure this is helping much. Let's see in later patches but I
> would keep them separate and rather create a dedicated function for the
> new policy allocation mode.
 
Thanks for the suggestion; we will rethink the implementation.

- Feng


Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d945f29..d21105b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2187,22 +2187,60 @@  bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 	return ret;
 }
 
-/* Allocate a page in interleaved policy.
-   Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-					unsigned nid)
+/* Handle page allocation for all but interleaved policies */
+static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
+				       unsigned int order, int preferred_nid)
 {
 	struct page *page;
+	gfp_t gfp_mask = gfp;
 
-	page = __alloc_pages(gfp, order, nid);
-	/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
-	if (!static_branch_likely(&vm_numa_stat_key))
+	if (pol->mode == MPOL_INTERLEAVE) {
+		page = __alloc_pages(gfp, order, preferred_nid);
+		/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
+		if (!static_branch_likely(&vm_numa_stat_key))
+			return page;
+		if (page && page_to_nid(page) == preferred_nid) {
+			preempt_disable();
+			__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+			preempt_enable();
+		}
 		return page;
-	if (page && page_to_nid(page) == nid) {
-		preempt_disable();
-		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
-		preempt_enable();
 	}
+
+	VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
+
+	preferred_nid = numa_node_id();
+
+	/*
+	 * There is a two pass approach implemented here for
+	 * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
+	 * but allow the allocation to fail. The below table explains how
+	 * this is achieved.
+	 *
+	 * | Policy                        | preferred nid | nodemask   |
+	 * |-------------------------------|---------------|------------|
+	 * | MPOL_DEFAULT                  | local         | NULL       |
+	 * | MPOL_PREFERRED                | best          | NULL       |
+	 * | MPOL_INTERLEAVE               | ERR           | ERR        |
+	 * | MPOL_BIND                     | local         | pol->nodes |
+	 * | MPOL_PREFERRED_MANY           | best          | pol->nodes |
+	 * | MPOL_PREFERRED_MANY (round 2) | local         | NULL       |
+	 * +-------------------------------+---------------+------------+
+	 */
+	if (pol->mode == MPOL_PREFERRED_MANY) {
+		gfp_mask |=  __GFP_NOWARN;
+
+		/* Skip direct reclaim, as there will be a second try */
+		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	}
+
+	page = __alloc_pages_nodemask(gfp_mask, order,
+				      policy_node(gfp, pol, preferred_nid),
+				      policy_nodemask(gfp, pol));
+
+	if (unlikely(!page && pol->mode == MPOL_PREFERRED_MANY))
+		page = __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
+
 	return page;
 }
 
@@ -2244,8 +2282,8 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+		page = alloc_pages_policy(pol, gfp, order, nid);
 		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
 		goto out;
 	}
 
@@ -2329,7 +2367,8 @@  struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	 * nor system default_policy
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+		page = alloc_pages_policy(pol, gfp, order,
+					  interleave_nodes(pol));
 	else
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_node(gfp, pol, numa_node_id()),