
mm: mempolicy: N:M interleave policy for tiered memory nodes

Message ID 20220607171949.85796-1-hannes@cmpxchg.org (mailing list archive)
State New
Series mm: mempolicy: N:M interleave policy for tiered memory nodes

Commit Message

Johannes Weiner June 7, 2022, 5:19 p.m. UTC
From: Hasan Al Maruf <hasanalmaruf@fb.com>

Existing interleave policy spreads out pages evenly across a set of
specified nodes, i.e. 1:1 interleave. Upcoming tiered memory systems
have CPU-less memory nodes with different peak bandwidth and
latency-bandwidth characteristics. In such systems, we will want to
use the additional bandwidth provided by lowtier memory for
bandwidth-intensive applications. However, the default 1:1 interleave
can lead to suboptimal bandwidth distribution.

Introduce an N:M interleave policy, where N pages allocated to the
top-tier nodes are followed by M pages allocated to lowtier nodes.
This provides the capability to steer the fraction of memory traffic
that goes to toptier vs. lowtier nodes. For example, 4:1 interleave
leads to an 80%/20% traffic breakdown between toptier and lowtier.

The ratios are configured through a new sysctl:

	vm.numa_tier_interleave = toptier lowtier
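
As an illustration (not part of the patch), a 4:1 toptier:lowtier split
can be requested at runtime through the procfs file this sysctl creates;
the small C sketch below does that, and is equivalent to setting
vm.numa_tier_interleave with sysctl(8):

	#include <stdio.h>

	int main(void)
	{
		/* toptier share first, then lowtier: 4:1 gives ~80%/20% */
		FILE *f = fopen("/proc/sys/vm/numa_tier_interleave", "w");

		if (!f) {
			perror("numa_tier_interleave");
			return 1;
		}
		fprintf(f, "4 1\n");
		return fclose(f) != 0;
	}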

We have run experiments on bandwidth-intensive production services on
CXL-based tiered memory systems, where lowtier CXL memory has, when
compared to the toptier memory directly connected to the CPU:

	- ~half of the peak bandwidth
	- ~80ns higher idle latency
	- steeper latency vs. bandwidth curve

Results show that regular interleaving leads to a ~40% performance
regression over baseline; 5:1 interleaving shows an ~8% improvement
over baseline. We have found the optimal distribution changes based on
hardware characteristics: slower CXL memory will shift the optimal
breakdown from 5:1 to (e.g.) 8:1.

The sysctl only applies to processes and vmas with an "interleave"
policy and has no bearing on contexts using prefer or bind policies.

It defaults to a setting of "1 1", which represents even interleaving,
and so is backward compatible with existing setups.

Signed-off-by: Hasan Al Maruf <hasanalmaruf@fb.com>
Signed-off-by: Hao Wang <haowang3@fb.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 16 ++++++
 include/linux/mempolicy.h               |  2 +
 include/linux/sched.h                   |  1 +
 kernel/sysctl.c                         | 10 ++++
 mm/mempolicy.c                          | 67 +++++++++++++++++++++++--
 5 files changed, 93 insertions(+), 3 deletions(-)

Comments

Huang, Ying June 8, 2022, 4:19 a.m. UTC | #1
On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> From: Hasan Al Maruf <hasanalmaruf@fb.com>
> 
> Existing interleave policy spreads out pages evenly across a set of
> specified nodes, i.e. 1:1 interleave. Upcoming tiered memory systems
> have CPU-less memory nodes with different peak bandwidth and
> latency-bandwidth characteristics. In such systems, we will want to
> use the additional bandwidth provided by lowtier memory for
> bandwidth-intensive applications. However, the default 1:1 interleave
> can lead to suboptimal bandwidth distribution.
> 
> Introduce an N:M interleave policy, where N pages allocated to the
> top-tier nodes are followed by M pages allocated to lowtier nodes.
> This provides the capability to steer the fraction of memory traffic
> that goes to toptier vs. lowtier nodes. For example, 4:1 interleave
> leads to an 80%/20% traffic breakdown between toptier and lowtier.
> 
> The ratios are configured through a new sysctl:
> 
> 	vm.numa_tier_interleave = toptier lowtier
> 
> We have run experiments on bandwidth-intensive production services on
> CXL-based tiered memory systems, where lowtier CXL memory has, when
> compared to the toptier memory directly connected to the CPU:
> 
> 	- ~half of the peak bandwidth
> 	- ~80ns higher idle latency
> 	- steeper latency vs. bandwidth curve
> 
> Results show that regular interleaving leads to a ~40% performance
> regression over baseline; 5:1 interleaving shows an ~8% improvement
> over baseline. We have found the optimal distribution changes based on
> hardware characteristics: slower CXL memory will shift the optimal
> breakdown from 5:1 to (e.g.) 8:1.
> 
> The sysctl only applies to processes and vmas with an "interleave"
> policy and has no bearing on contexts using prefer or bind policies.
> 
> It defaults to a setting of "1 1", which represents even interleaving,
> and so is backward compatible with existing setups.
> 
> Signed-off-by: Hasan Al Maruf <hasanalmaruf@fb.com>
> Signed-off-by: Hao Wang <haowang3@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

In general, I think the use case is valid.  But we are changing memory
tiering now, including

- make memory tiering explicit

- support more than 2 tiers

- expose memory tiering via sysfs

Details can be found in the following threads:

https://lore.kernel.org/lkml/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com/
https://lore.kernel.org/lkml/20220603134237.131362-1-aneesh.kumar@linux.ibm.com/

With these changes, we may need to revise your implementation.  For
example, put interleave knobs in memory tier sysfs interface, support
more than 2 tiers, etc.

Best Regards,
Huang, Ying


[snip]
Johannes Weiner June 8, 2022, 2:16 p.m. UTC | #2
On Wed, Jun 08, 2022 at 12:19:52PM +0800, Ying Huang wrote:
> In general, I think the use case is valid.

Excellent!

> But we are changing memory tiering now, including
> 
> - make memory tiering explicit
> 
> - support more than 2 tiers
> 
> - expose memory tiering via sysfs
> 
> Details can be found in the following threads:
> 
> https://lore.kernel.org/lkml/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com/
> https://lore.kernel.org/lkml/20220603134237.131362-1-aneesh.kumar@linux.ibm.com/
> 
> With these changes, we may need to revise your implementation.  For
> example, put interleave knobs in memory tier sysfs interface, support
> more than 2 tiers, etc.

Yeah, I was expecting the interface to be the main sticking point ;)
I'll rebase this patch as the mentioned discussions find consensus.

Thanks!
Tim Chen June 8, 2022, 6:15 p.m. UTC | #3
On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> 
>  /* Do dynamic interleaving for a process */
>  static unsigned interleave_nodes(struct mempolicy *policy)
>  {
>  	unsigned next;
>  	struct task_struct *me = current;
>  
> -	next = next_node_in(me->il_prev, policy->nodes);
> +	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {

When we have three memory tiers, do we expect an N:M:K policy?
Like interleaving between DDR5, DDR4 and PMEM memory.
Or do we still expect an N:M policy, interleaving between two specific tiers?

The other question is whether we will need multiple interleave policies depending
on cgroup?
One policy could be interleave between tier1, tier2, tier3.
Another could be interleave between tier1 and tier2.

In the current implementation we have one global interleave knob
defined by numa_tier_interleave[].

Tim
Johannes Weiner June 8, 2022, 7:14 p.m. UTC | #4
Hi Tim,

On Wed, Jun 08, 2022 at 11:15:27AM -0700, Tim Chen wrote:
> On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> > 
> >  /* Do dynamic interleaving for a process */
> >  static unsigned interleave_nodes(struct mempolicy *policy)
> >  {
> >  	unsigned next;
> >  	struct task_struct *me = current;
> >  
> > -	next = next_node_in(me->il_prev, policy->nodes);
> > +	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
> 
> When we have three memory tiers, do we expect an N:M:K policy?
> Like interleaving between DDR5, DDR4 and PMEM memory.
> > Or do we still expect an N:M policy, interleaving between two specific tiers?

In the context of the proposed 'explicit tiers' interface, I think it
would make sense to have a per-tier 'interleave_ratio' knob. Because
the ratio is configured based on hardware properties, it can be
configured meaningfully for the entire tier hierarchy, even if
individual tasks or vmas interleave over only a subset of nodes.

> The other question is whether we will need multiple interleave policies depending
> on cgroup?
> One policy could be interleave between tier1, tier2, tier3.
> Another could be interleave between tier1 and tier2.

This is a good question.

One thing that has defined cgroup development in recent years is the
concept of "work conservation". Moving away from fixed limits and hard
partitioning, cgroups are increasingly configured with weights,
priorities, and guarantees (cpu.weight, io.latency/io.cost.qos,
memory.low). These weights and priorities are enforced when cgroups
are directly competing over a resource; but if there is no contention,
any active cgroup, regardless of priority, has full access to the
surplus (which could be the entire host if the main load is idle).

With that background, yes, we likely want some way of prioritizing
tier access when multiple cgroups are competing. But we ALSO want the
ability to say that if resources are NOT contended, a cgroup should
interleave memory over all tiers according to optimal bandwidth.

That means that regardless of what the competitive cgroup rules for
tier access end up looking like, it makes sense to have global
interleaving weights based on hardware properties as proposed here.

The effective cgroup IL ratio for each tier could then be something
like cgroup.tier_weight[tier] * tier/interleave_weight.
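
To make that concrete with made-up numbers (neither knob exists yet):
with per-tier interleave weights of 4 (toptier) and 1 (lowtier), a
cgroup with tier weights of 100 and 50 would get effective weights of
4*100 : 1*50, i.e. an 8:1 split, while a cgroup weighted equally across
tiers keeps the hardware 4:1 split.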
Tim Chen June 8, 2022, 11:40 p.m. UTC | #5
On Wed, 2022-06-08 at 15:14 -0400, Johannes Weiner wrote:
> Hi Tim,
> 
> On Wed, Jun 08, 2022 at 11:15:27AM -0700, Tim Chen wrote:
> > On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> > >  /* Do dynamic interleaving for a process */
> > >  static unsigned interleave_nodes(struct mempolicy *policy)
> > >  {
> > >  	unsigned next;
> > >  	struct task_struct *me = current;
> > >  
> > > -	next = next_node_in(me->il_prev, policy->nodes);
> > > +	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
> > 
> > When we have three memory tiers, do we expect an N:M:K policy?
> > Like interleaving between DDR5, DDR4 and PMEM memory.
> > Or do we still expect an N:M policy, interleaving between two specific tiers?
> 
> In the context of the proposed 'explicit tiers' interface, I think it
> would make sense to have a per-tier 'interleave_ratio' knob. Because
> the ratio is configured based on hardware properties, it can be
> configured meaningfully for the entire tier hierarchy, even if
> individual tasks or vmas interleave over only a subset of nodes.

I think that makes sense.  So if we have 3 tiers of memory whose bandwidth ratios are
4:2:1, then it makes sense to interleave according to this ratio, even if we choose
to interleave over a subset of nodes.  Say between tier 1 and tier 3, the
interleave ratio will be 4:1, as I can read 4 lines of data from tier 1 in the
time I get 1 line of data from tier 3.

> 
> > The other question is whether we will need multiple interleave policies depending
> > on cgroup?
> > One policy could be interleave between tier1, tier2, tier3.
> > Another could be interleave between tier1 and tier2.
> 
> This is a good question.
> 
> One thing that has defined cgroup development in recent years is the
> concept of "work conservation". Moving away from fixed limits and hard
> partitioning, cgroups are increasingly configured with weights,
> priorities, and guarantees (cpu.weight, io.latency/io.cost.qos,
> memory.low). These weights and priorities are enforced when cgroups
> are directly competing over a resource; but if there is no contention,
> any active cgroup, regardless of priority, has full access to the
> surplus (which could be the entire host if the main load is idle).
> 
> With that background, yes, we likely want some way of prioritizing
> tier access when multiple cgroups are competing. But we ALSO want the
> ability to say that if resources are NOT contended, a cgroup should
> interleave memory over all tiers according to optimal bandwidth.
> 
> That means that regardless of what the competitive cgroup rules for
> tier access end up looking like, it makes sense to have global
> interleaving weights based on hardware properties as proposed here.
> 
> The effective cgroup IL ratio for each tier could then be something
> like cgroup.tier_weight[tier] * tier/interleave_weight.

Thanks. I agree that an interleave ratio that's proportional to the hardware
properties of each tier will suffice.

Tim
kernel test robot June 8, 2022, 11:44 p.m. UTC | #6
Hi Johannes,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-mempolicy-N-M-interleave-policy-for-tiered-memory-nodes/20220608-012652
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: x86_64-randconfig-a003 (https://download.01.org/0day-ci/archive/20220609/202206090708.jaGUnz8e-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project b92436efcb7813fc481b30f2593a4907568d917a)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/876d8daa0642d35f71ff504eeb3be4b950339a45
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Johannes-Weiner/mm-mempolicy-N-M-interleave-policy-for-tiered-memory-nodes/20220608-012652
        git checkout 876d8daa0642d35f71ff504eeb3be4b950339a45
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/mempolicy.c:1890:11: warning: variable 'next' is used uninitialized whenever function 'next_node_tier' is called [-Wsometimes-uninitialized]
           unsigned next, start = nid;
           ~~~~~~~~~^~~~
   mm/mempolicy.c:1893:23: note: uninitialized use occurs here
                   next = next_node_in(next, policy->nodes);
                                       ^~~~
   include/linux/nodemask.h:278:46: note: expanded from macro 'next_node_in'
   #define next_node_in(n, src) __next_node_in((n), &(src))
                                                ^
   mm/mempolicy.c:1890:15: note: initialize the variable 'next' to silence this warning
           unsigned next, start = nid;
                        ^
                         = 0
   1 warning generated.


vim +1890 mm/mempolicy.c

  1887	
  1888	static unsigned next_node_tier(int nid, struct mempolicy *policy, bool toptier)
  1889	{
> 1890		unsigned next, start = nid;
  1891	
  1892		do {
  1893			next = next_node_in(next, policy->nodes);
  1894			if (next == MAX_NUMNODES)
  1895				break;
  1896			if (toptier == node_is_toptier(next))
  1897				break;
  1898		} while (next != start);
  1899		return next;
  1900	}
  1901
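
If the intent is for the scan to begin at the node following @nid, as a
plain next_node_in() call would, one possible fix (a sketch only, pending
confirmation from the author) is to seed the iterator with the incoming nid:

 static unsigned next_node_tier(int nid, struct mempolicy *policy, bool toptier)
 {
-	unsigned next, start = nid;
+	unsigned next = nid, start = nid;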

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 747e325ebcd0..0247a828ec50 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -55,6 +55,7 @@  files can be found in mm/swap.c.
 - nr_hugepages_mempolicy
 - nr_overcommit_hugepages
 - nr_trim_pages         (only if CONFIG_MMU=n)
+- numa_tier_interleave
 - numa_zonelist_order
 - oom_dump_tasks
 - oom_kill_allocating_task
@@ -597,6 +598,21 @@  The default value is 1.
 See Documentation/admin-guide/mm/nommu-mmap.rst for more information.
 
 
+numa_tier_interleave
+====================
+
+This sysctl is for tiered NUMA systems. It's a tuple that configures
+an N:M distribution between toptier and lowtier nodes for interleaving
+memory allocation policies.
+
+The first value configures the share of pages allocated on toptier
+nodes. The second value configures the share of lowtier placements.
+
+Allowed values range from 1 up to (and including) 100.
+
+The default value is 1 1, meaning even distribution.
+
+
 numa_zonelist_order
 ===================
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 668389b4b53d..4bd0f2a67052 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,8 @@  struct mempolicy {
 	} w;
 };
 
+extern int numa_tier_interleave[2];
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc42f7213dd9..7351cf31579b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1236,6 +1236,7 @@  struct task_struct {
 	/* Protected by alloc_lock: */
 	struct mempolicy		*mempolicy;
 	short				il_prev;
+	short				il_count;
 	short				pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50870a1db114..cfb238c6e0da 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -21,6 +21,7 @@ 
 
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sysctl.h>
@@ -2139,6 +2140,15 @@  static struct ctl_table vm_table[] = {
 		.extra1			= SYSCTL_ZERO,
 		.extra2			= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_tier_interleave",
+		.data		= &numa_tier_interleave,
+		.maxlen		= sizeof(numa_tier_interleave),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
 #endif
 	 {
 		.procname	= "hugetlb_shm_group",
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e4a409b8ac0b..3b532536cd44 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -120,6 +120,9 @@  static struct kmem_cache *sn_cache;
    policied. */
 enum zone_type policy_zone = 0;
 
+/* Toptier:lowtier interleaving ratio */
+int numa_tier_interleave[2] = { 1, 1 };
+
 /*
  * run-time system-wide default policy => local allocation
  */
@@ -871,8 +874,10 @@  static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	task_lock(current);
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && new->mode == MPOL_INTERLEAVE) {
 		current->il_prev = MAX_NUMNODES-1;
+		current->il_count = 0;
+	}
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -1881,15 +1886,47 @@  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 	return nd;
 }
 
+static unsigned next_node_tier(int nid, struct mempolicy *policy, bool toptier)
+{
+	unsigned next, start = nid;
+
+	do {
+		next = next_node_in(next, policy->nodes);
+		if (next == MAX_NUMNODES)
+			break;
+		if (toptier == node_is_toptier(next))
+			break;
+	} while (next != start);
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
 	unsigned next;
 	struct task_struct *me = current;
 
-	next = next_node_in(me->il_prev, policy->nodes);
+	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
+		/*
+		 * When N:M interleaving is configured, allocate N
+		 * pages over toptier nodes first, then the remainder
+		 * on lowtier ones.
+		 */
+		if (me->il_count < numa_tier_interleave[0])
+			next = next_node_tier(me->il_prev, policy, true);
+		else
+			next = next_node_tier(me->il_prev, policy, false);
+		me->il_count++;
+		if (me->il_count >=
+		    numa_tier_interleave[0] + numa_tier_interleave[1])
+			me->il_count = 0;
+	} else {
+		next = next_node_in(me->il_prev, policy->nodes);
+	}
+
 	if (next < MAX_NUMNODES)
 		me->il_prev = next;
+
 	return next;
 }
 
@@ -1963,7 +2000,31 @@  static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
 		return numa_node_id();
-	target = (unsigned int)n % nnodes;
+
+	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
+		unsigned vnnodes = 0;
+		int vtarget;
+
+		/*
+		 * When N:M interleaving is configured, calculate a
+		 * virtual target for @n in an N:M-scaled nodelist...
+		 */
+		for_each_node_mask(nid, nodemask)
+			vnnodes += numa_tier_interleave[!node_is_toptier(nid)];
+		vtarget = (int)((unsigned int)n % vnnodes);
+
+		/* ...then map it back to the physical nodelist */
+		target = 0;
+		for_each_node_mask(nid, nodemask) {
+			vtarget -= numa_tier_interleave[!node_is_toptier(nid)];
+			if (vtarget < 0)
+				break;
+			target++;
+		}
+	} else {
+		target = (unsigned int)n % nnodes;
+	}
+
 	nid = first_node(nodemask);
 	for (i = 0; i < target; i++)
 		nid = next_node(nid, nodemask);
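
As a side note, the virtual-target mapping above can be sanity-checked in
userspace. The sketch below (not part of the patch) mirrors the
offset_il_node() logic for a hypothetical policy over four nodes, where
nodes 0 and 1 are toptier, nodes 2 and 3 are lowtier, and the sysctl is
set to 4:1; the arrays stand in for policy->nodes, node_is_toptier() and
numa_tier_interleave[]:

	#include <stdio.h>

	#define NNODES 4

	static const int toptier[NNODES] = { 1, 1, 0, 0 };	/* node_is_toptier() stand-in */
	static const int ratio[2] = { 4, 1 };			/* numa_tier_interleave[] stand-in */

	/* Map interleave offset @n to a node index, as offset_il_node() does */
	static int il_target(unsigned long n)
	{
		int vnnodes = 0, vtarget, target = 0, i;

		/* size of the virtual, N:M-scaled nodelist */
		for (i = 0; i < NNODES; i++)
			vnnodes += ratio[!toptier[i]];
		vtarget = (int)(n % vnnodes);

		/* map the virtual slot back to a physical node index */
		for (i = 0; i < NNODES; i++) {
			vtarget -= ratio[!toptier[i]];
			if (vtarget < 0)
				break;
			target++;
		}
		return target;
	}

	int main(void)
	{
		unsigned long n;

		/* 2 toptier + 2 lowtier nodes at 4:1 -> virtual cycle of 4+4+1+1 = 10 slots */
		for (n = 0; n < 10; n++)
			printf("offset %lu -> node %d\n", n, il_target(n));
		return 0;
	}

With these inputs, offsets 0-3 map to node 0, 4-7 to node 1, 8 to node 2
and 9 to node 3, i.e. 8 out of every 10 placements land on toptier,
matching the 80%/20% split described in the changelog for a 4:1 setting.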