[v3,0/4] mm/mempolicy: weighted interleave mempolicy and sysfs extension

Message ID 20240125184345.47074-1-gregory.price@memverge.com (mailing list archive)

Gregory Price Jan. 25, 2024, 6:43 p.m. UTC
Hi Andrew - this version added a fix for a stale weight
issue that can occur on cgroup migration, and some fixups
recommended by Ying Huang.  There is an additional patch
about the stale weight due to the introduction of atomics
that may warrant some explicit scrutiny before pulling.

v3: stale value fix, bulk allocator fixups, sysfs simplification

---

Weighted interleave is a new interleave policy intended to make
use of heterogeneous memory environments appearing with CXL.

The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave distributes memory across nodes according
to a provided weight. (Weight = # of page allocations per round)
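
To make the "weight = page allocations per round" rule concrete, here is a
minimal userspace sketch in C. It is an illustration only, not the kernel
code from this series: the names wil_state, wil_next and wil_distribute are
hypothetical, and treating a zero weight as 1 is an assumption modelled on
the default behaviour described later in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a weighted interleave iterator. */
struct wil_state {
        size_t node;        /* node currently being allocated from */
        unsigned int left;  /* allocations left on that node this round */
};

/* Return the node for the next page and advance the iterator. */
static size_t wil_next(struct wil_state *st, const unsigned char *weights,
                       size_t nr_nodes)
{
        assert(nr_nodes > 0);
        if (st->left == 0) {
                /* Weight exhausted: move to the next node and reload. */
                st->node = (st->node + 1) % nr_nodes;
                /* Assumption: a weight of 0 is treated as 1. */
                st->left = weights[st->node] ? weights[st->node] : 1;
        }
        st->left--;
        return st->node;
}

/* Count how many of nr_pages land on each node, starting at node 0. */
static void wil_distribute(const unsigned char *weights, size_t nr_nodes,
                           size_t nr_pages, size_t *counts)
{
        assert(nr_nodes > 0);
        /* Start "before" node 0 so the first page goes to node 0. */
        struct wil_state st = { .node = nr_nodes - 1, .left = 0 };

        for (size_t i = 0; i < nr_nodes; i++)
                counts[i] = 0;
        for (size_t i = 0; i < nr_pages; i++)
                counts[wil_next(&st, weights, nr_nodes)]++;
}
```

With weights {3, 1} and 8 pages, this model places 6 pages on the first node
and 2 on the second. With all weights set to 1 it degenerates to plain
round-robin, which is the existing interleave behaviour.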

Weighted interleave is intended to reduce average latency when
bandwidth is pressured - therefore increasing total throughput.

In other words: It allows greater use of the total available
bandwidth in a heterogeneous hardware environment (different
hardware provides different bandwidth capacity).

As bandwidth is pressured, latency increases - first linearly
and then exponentially. By keeping bandwidth usage distributed
according to available bandwidth, we therefore can reduce the
average latency of a cacheline fetch.

A good explanation of the bandwidth vs latency response curve:
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/

From the article:
```
Constant region:
    The latency response is fairly constant for the first 40%
    of the sustained bandwidth.
Linear region:
    In between 40% to 80% of the sustained bandwidth, the
    latency response increases almost linearly with the bandwidth
    demand of the system due to contention overhead by numerous
    memory requests.
Exponential region:
    Between 80% to 100% of the sustained bandwidth, the memory
    latency is dominated by the contention latency which can be
    as much as twice the idle latency or more.
Maximum sustained bandwidth :
    Is 65% to 75% of the theoretical maximum bandwidth.
```

As a general rule of thumb:
* If bandwidth usage is low, latency does not increase. It is
  optimal to place data in the nearest (lowest latency) device.
* If bandwidth usage is high, latency increases. It is optimal
  to place data such that bandwidth use is optimized per-device.

This is the top-line goal: provide the user a mechanism to target the
"maximum sustained bandwidth" of each hardware component in a
heterogeneous memory system.


For example, the stream benchmark demonstrates that 1:1 (default)
interleave is actively harmful, while weighted interleave can be
beneficial. Default interleave distributes data such that too much
pressure is placed on devices with lower available bandwidth.

Stream Benchmark (High level results, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
Targeted weights   : +2.5% to +4% (consistently better than DRAM)

Global means the task-policy was set (set_mempolicy), while targeted
means VMA policies were set (mbind2). We see weighted interleave
is not always beneficial when applied globally, but is always
beneficial when applied to bandwidth-driving memory regions.

We implement sysfs entries for "system global" weights which can be
set by a daemon or administrator.


There are 3 core patches in this set, plus the stale-weight fix described above:
1) Implement system-global interleave weights as sysfs extension
   in mm/mempolicy.c.  These weights are RCU protected, and a
   default weight set is provided (all weights are 1 by default).

   In future work, we intend to expose an interface for HMAT/CDAT
   information to be used during boot to set reasonable system
   default values based on the memory configuration of the system
   discovered at boot or during device hotplug.

2) A mild refactor of some interleave-logic for re-use in the
   new weighted interleave logic.

3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind


Included below are some performance and LTP test information,
and a sample numactl branch which can be used for testing.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass

= Version history =
v3:
- MAJOR: Changes cur_weight to be an atomic and to carry the
         current interleave node+weight, rather than just weight
         this fixes a stale-weight issue when a rebind occurs.
- minor doc updates
- sysfs: remove module_exit path, not needed
- sysfs: allocate node_attrs rather than static MAX_NUMNODES array
- interleave_nodes: handle cur_weight=0 conditions explicitly
- bulk allocator: prev_node should be initialized to me->il_prev
- bulk allocator: weight collection logic style fixes
- bulk allocator: bulk allocator for loop style fixes
- bulk allocator: corner case fixes for resume_node/weight

v2:
- MAJOR: Torture tested bulk allocator, fixed edge conditions
         tracking the next me->il_node.  Added documentation.
         Prior version was stable, but the resulting me->il_node
         could be wrong under certain circumstances.
- naming: iw_table_mtx -> iw_table_lock
- RCU: use synchronize+kfree and simplify the weight structure
- default: remove default table, since it's static for now
- sysfs setup: simplify setup, if table==NULL presume 1's
- node_store: only allocate (sizeof(u8) * nr_node_ids)
- allocators: update to deal with NULL table pointer
- read_once: __builtin_memcpy -> memcpy
- formatting

v1:
- RCU: This version protects the weight array with RCU.
- ktest fix: proper include (types.h) in uapi header
- doc: make mpol_params in docs reflect definition
- doc: mempolicy.c comments in MPOL_WEIGHTED_INTERLEAVE patch

- Dropped task-local weights and syscalls from the proposal
  until affirmative use cases for task-local weights appear.
Link: https://lore.kernel.org/linux-mm/20240103224209.2541-1-gregory.price@memverge.com/

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           3.64x (+264%)

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           3.26x (+226%)

=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>

Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind2 weights     : +2.5% to +4% (consistently better than DRAM)

dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
Copy:        0->0            200923.2     0.032662     0.031853     0.033301
Scale:       0->0            202123.0     0.032526     0.031664     0.032970
Add:         0->0            208873.2     0.047322     0.045961     0.047884
Triad:       0->0            208523.8     0.047262     0.046038     0.048414

CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             22209.7     0.288661     0.288162     0.289342
Scale:       0->0             22288.2     0.287549     0.287147     0.288291
Add:         0->0             24419.1     0.393372     0.393135     0.393735
Triad:       0->0             24484.6     0.392337     0.392083     0.394331

Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
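
The ~9:1 figure follows from the ratio of the measured DRAM-only and CXL-only
rates (for Copy, 200923.2 / 22209.7 is roughly 9.05). As a hedged sketch, the
hypothetical helper below turns a set of measured bandwidths into small
integer weights by dividing by the slowest device and rounding. The rounding
heuristic and the function name bw_to_weights are assumptions made for
illustration; the thread itself only states the resulting ratio.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: derive interleave weights from measured
 * bandwidths by normalising against the slowest device.
 * Weights are clamped to fit a u8-style 1..255 range.
 */
static void bw_to_weights(const double *bw, size_t n, unsigned char *weights)
{
        assert(n > 0);

        double min = bw[0];
        for (size_t i = 1; i < n; i++)
                if (bw[i] < min)
                        min = bw[i];
        assert(min > 0.0);

        for (size_t i = 0; i < n; i++) {
                /* Round to nearest integer; every ratio is >= 1. */
                double r = bw[i] / min + 0.5;
                weights[i] = r > 255.0 ? 255 : (unsigned char)r;
        }
}
```

Feeding in the Copy rates above (200923.2 for DRAM, 22209.7 for CXL) yields
weights 9 and 1, matching the values written to sysfs.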

default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             44666.2     0.143671     0.143285     0.144174
Scale:       0->0             44781.6     0.143256     0.142916     0.143713
Add:         0->0             48600.7     0.197719     0.197528     0.197858
Triad:       0->0             48727.5     0.197204     0.197014     0.197439

global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0            190085.9     0.034289     0.033669     0.034645
Scale:       0->0            207677.4     0.031909     0.030817     0.033061
Add:         0->0            202036.8     0.048737     0.047516     0.053409
Triad:       0->0            217671.5     0.045819     0.044103     0.046755

targeted regions w/ global weights (modified stream to mbind2 malloc'd regions)
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy:        0->0            205827.0     0.031445     0.031094     0.031984
Scale:       0->0            208171.8     0.031320     0.030744     0.032505
Add:         0->0            217352.0     0.045087     0.044168     0.046515
Triad:       0->0            216884.8     0.045062     0.044263     0.046982

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads:     56
Lookups:     170,000,000

Summary: +19% over DRAM. +47% over default interleave.

1. dram only
$ numactl -m 0 ./XSBench -s XL -p 5000000
Runtime:     36.235 seconds
Lookups/s:   4,691,618

2. default interleave
$ numactl -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     55.243 seconds
Lookups/s:   3,077,293

3. weighted interleave
$ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     29.262 seconds
Lookups/s:   5,809,513

=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

= Existing tests
set_mempolicy, get_mempolicy, mbind

MPOL_WEIGHTED_INTERLEAVE was added manually to test basic functionality,
but the tests were not adjusted for weighting.  Basically the weights were
set to 1, which is the default, so it should behave like standard
MPOL_INTERLEAVE if the logic is correct.

== set_mempolicy01 : passed   18, failed   0
== set_mempolicy02 : passed   10, failed   0
== set_mempolicy03 : passed   64, failed   0
== set_mempolicy04 : passed   32, failed   0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed   10, failed   0
   this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed   32, failed   0
   set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed   12, failed   0
   change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed   2, failed   0
== mbind01 : passed   15, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed   4, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed   16, failed   0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed   48, failed   0
   added MPOL_WEIGHTED_INTERLEAVE

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct
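
The per-node page counts above can be cross-checked with a little arithmetic:
each round places sum(weights) pages, so the count per node is the number of
full rounds times that node's weight, plus its share of the final partial
round. The sketch below is a hypothetical userspace check (wil_expected is
not a kernel function) and assumes allocation starts at the beginning of a
round on node 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical check: expected pages per node for a weighted
 * interleave that starts a fresh round at node 0.
 */
static void wil_expected(const unsigned int *weights, size_t nr_nodes,
                         size_t nr_pages, size_t *counts)
{
        size_t round = 0;

        for (size_t i = 0; i < nr_nodes; i++)
                round += weights[i];
        assert(round > 0);

        size_t full = nr_pages / round;  /* complete rounds */
        size_t rem = nr_pages % round;   /* pages in the last partial round */

        for (size_t i = 0; i < nr_nodes; i++) {
                /* The partial round fills nodes in order, up to their weight. */
                size_t extra = rem < weights[i] ? rem : weights[i];

                counts[i] = full * weights[i] + extra;
                rem -= extra;
        }
}
```

For the 65793-page heap mapping this gives 54828/10965 with weights 5:1 and
10966/54827 with weights 1:5, which agrees with the heap lines reported above.
The other mappings agree to within one page under the same assumption.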

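#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOTAL_BYTES (1024*1024*256)
#define PAGE_BYTES  4096

int main (void)
{
        /* One large allocation, touched so every page is faulted in. */
        char* mem = malloc(TOTAL_BYTES);
        if (!mem)
                return 1;
        memset(mem, 1, TOTAL_BYTES);
        /* Many small heap allocations; each is touched once and
         * deliberately never freed so the pages stay resident. */
        for (int i = 0; i < (TOTAL_BYTES/PAGE_BYTES); i++)
        {
                mem = malloc(PAGE_BYTES);
                if (!mem)
                        return 1;
                mem[0] = 1;
        }
        printf("done\n");
        /* Block here so the process's numa_maps can be inspected. */
        getchar();
        return 0;
}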

=====================================================================

Suggested-by: Gregory Price <gregory.price@memverge.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hasan Al Maruf <hasanalmaruf@fb.com>
Suggested-by: Hao Wang <haowang3@fb.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: John Groves <john@jagalactic.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@micron.com>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>

Gregory Price (3):
  mm/mempolicy: refactor a read-once mechanism into a function for
    re-use
  mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
    interleaving
  mm/mempolicy: change cur_il_weight to atomic and carry the node with
    it

Rakie Kim (1):
  mm/mempolicy: implement the sysfs-based weighted_interleave interface

 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  25 +
 .../admin-guide/mm/numa_memory_policy.rst     |   9 +
 include/linux/mempolicy.h                     |   3 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 551 +++++++++++++++++-
 6 files changed, 579 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

Comments

Huang, Ying Jan. 26, 2024, 7:40 a.m. UTC | #1
Gregory Price <gourry.memverge@gmail.com> writes:

> In the prior patch, we carry only the current weight for a weighted
> interleave round with us across calls through the allocator path.
>
> node = next_node_in(current->il_prev, pol->nodemask)
> pol->cur_il_weight <--- this weight applies to the above node
>
> This separation of data can cause a race condition.
>
> If a cgroup-initiated task migration or mems_allowed change occurs
> from outside the context of the task, this can cause the weight to
> become stale, meaning we may end up using that weight to allocate
> memory on the wrong node.
>
> Example:
>   1) task A sets (cur_il_weight = 8) and (current->il_prev) to
>      node0. node1 is the next set bit in pol->nodemask
>   2) rebind event occurs, removing node1 from the nodemask.
>      node2 is now the next set bit in pol->nodemask
>      cur_il_weight is now stale.
>   3) allocation occurs, next_node_in(il_prev, nodes) returns
>      node2. cur_il_weight is now applied to the wrong node.
>
> The upper level allocator logic must still enforce mems_allowed,
> so this isn't dangerous, but it is inaccurate.
>
> Just clearing the weight is insufficient, as it creates two more
> race conditions.  The root of the issue is the separation of weight
> and node data between nodemask and cur_il_weight.
>
> To solve this, update cur_il_weight to be an atomic_t, and place the
> node that the weight applies to in the upper bits of the field:
>
> atomic_t cur_il_weight
> 	node bits 31:8
> 	weight bits 7:0
>
> Now retrieving or clearing the active interleave node and weight
> is a single atomic operation, and we are not dependent on the
> potentially changing state of (pol->nodemask) to determine what
> node the weight applies to.
>
> Two special observations:
> - if the weight is non-zero, cur_il_weight must *always* have a
>   valid node number, e.g. it cannot be NUMA_NO_NODE (-1).

IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.

>   This is because we steal the top byte for the weight.
>
> - MAX_NUMNODES is presently limited to 1024 or less on every
>   architecture. This would permanently limit MAX_NUMNODES to
>   an absolute maximum of (1 << 24) to avoid overflows.
>
> Per some reading and discussion, it appears that max nodes is
> limited to 1024 so that zone type still fits in page flags, so
> this method seemed preferable compared to the alternatives of
> trying to make all or part of mempolicy RCU protected (which
> may not be possible, since it is often referenced during code
> chunks which call operations that may sleep).
>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> ---
>  include/linux/mempolicy.h |  2 +-
>  mm/mempolicy.c            | 93 +++++++++++++++++++++++++--------------
>  2 files changed, 61 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index c644d7bbd396..8108fc6e96ca 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -56,7 +56,7 @@ struct mempolicy {
>  	} w;
>  
>  	/* Weighted interleave settings */
> -	u8 cur_il_weight;
> +	atomic_t cur_il_weight;

If we use this field for node and weight, why not change the field name?
For example, cur_wil_node_weight.

>  };
>  
>  /*
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5a517511658e..41b5fef0a6f5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -321,7 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
>  	policy->mode = mode;
>  	policy->flags = flags;
>  	policy->home_node = NUMA_NO_NODE;
> -	policy->cur_il_weight = 0;
> +	atomic_set(&policy->cur_il_weight, 0);
>  
>  	return policy;
>  }
> @@ -356,6 +356,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>  		tmp = *nodes;
>  
>  	pol->nodes = tmp;
> +	atomic_set(&pol->cur_il_weight, 0);
>  }
>  
>  static void mpol_rebind_preferred(struct mempolicy *pol,
> @@ -973,8 +974,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  			*policy = next_node_in(current->il_prev, pol->nodes);
>  		} else if (pol == current->mempolicy &&
>  				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> -			if (pol->cur_il_weight)
> -				*policy = current->il_prev;
> +			int cweight = atomic_read(&pol->cur_il_weight);
> +
> +			if (cweight & 0xFF)
> +				*policy = cweight >> 8;

Please define some helper functions or macros instead of operating on bits
directly.

>  			else
>  				*policy = next_node_in(current->il_prev,
>  						       pol->nodes);

If we record current node in pol->cur_il_weight, why do we still need
current->il_prev?  Can we only use pol->cur_il_weight?  And if so, we can
even make current->il_prev a union.

--
Best Regards,
Huang, Ying

[snip]
Gregory Price Jan. 26, 2024, 4:38 p.m. UTC | #2
On Fri, Jan 26, 2024 at 03:40:27PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > Two special observations:
> > - if the weight is non-zero, cur_il_weight must *always* have a
> >   valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
> 
> IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.
> 

Correct, I just thought it pertinent to call this out explicitly since
I'm stealing the top byte, but the node value has traditionally been a
full integer.

This may be relevant should anyone try to carry a random node value
into this field. For example, if someone tried to copy policy->home_node
into cur_il_weight for whatever reason.

It's worth breaking out a function to defend against this - plus to hide
the bit operations directly as you recommend below.
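
A sketch of what such helpers could look like, in userspace C. This is an
assumption-laden illustration and not the code that was eventually posted:
the names wil_pack, wil_node and wil_weight are hypothetical, the layout
(node in bits 31:8, weight in bits 7:0) follows the `cweight & 0xFF` and
`cweight >> 8` operations in the patch quoted above, and an unsigned 32-bit
value stands in for the atomic_t payload.

```c
#include <assert.h>
#include <stdint.h>

#define WIL_WEIGHT_BITS 8
#define WIL_WEIGHT_MASK 0xFFu
/* Largest node number that still fits in the remaining 24 bits. */
#define WIL_MAX_NODE    ((1u << 24) - 1)

/* Pack a node number and a weight into one 32-bit value. */
static inline uint32_t wil_pack(unsigned int node, uint8_t weight)
{
        /*
         * Defensive check: a full-width node value such as
         * NUMA_NO_NODE (-1) converts to a huge unsigned number
         * and is rejected here instead of corrupting the weight.
         */
        assert(node <= WIL_MAX_NODE);
        return ((uint32_t)node << WIL_WEIGHT_BITS) | weight;
}

/* Extract the node number (bits 31:8). */
static inline unsigned int wil_node(uint32_t v)
{
        return v >> WIL_WEIGHT_BITS;
}

/* Extract the weight (bits 7:0). */
static inline uint8_t wil_weight(uint32_t v)
{
        return (uint8_t)(v & WIL_WEIGHT_MASK);
}
```

Callers then never touch the shift or mask directly, and the one place that
knows the layout is also the place that enforces the node-range restriction.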

> >  	/* Weighted interleave settings */
> > -	u8 cur_il_weight;
> > +	atomic_t cur_il_weight;
> 
> If we use this field for node and weight, why not change the field name?
> For example, cur_wil_node_weight.
> 

ack.

> > +			if (cweight & 0xFF)
> > +				*policy = cweight >> 8;
> 
> Please define some helper functions or macros instead of operating on bits
> directly.
> 

ack.

> >  			else
> >  				*policy = next_node_in(current->il_prev,
> >  						       pol->nodes);
> 
> If we record current node in pol->cur_il_weight, why do we still need
> current->il_prev?  Can we only use pol->cur_il_weight?  And if so, we can
> even make current->il_prev a union.
> 

I just realized that there's a problem here for shared memory policies.

from weighted_interleave_nodes, I do this:

cur_weight = atomic_read(&policy->cur_il_weight);
...
weight--;
...
atomic_set(&policy->cur_il_weight, cur_weight);

On a shared memory policy, this is a race condition.
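
To spell out why: each access is atomic, but the read, decrement and write
are three separate steps, so two threads sharing one policy can both read the
same value and both store the same decremented result, losing a decrement.
One conventional way to close such a window is a compare-exchange loop. The
userspace sketch below uses C11 <stdatomic.h> purely as an illustration;
dec_racy and dec_cas are hypothetical names, and this is not the fix
pursued in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Racy pattern: the load and the store are each atomic, but another
 * thread may update *w between them, and its update is overwritten.
 */
static void dec_racy(atomic_uint *w)
{
        unsigned int cur = atomic_load(w);

        if (cur)
                atomic_store(w, cur - 1);
}

/*
 * Compare-exchange loop: the store only succeeds if *w still holds
 * the value that was read. Returns true if one unit was consumed.
 */
static bool dec_cas(atomic_uint *w)
{
        unsigned int cur = atomic_load(w);

        while (cur != 0) {
                if (atomic_compare_exchange_weak(w, &cur, cur - 1))
                        return true;
                /* On failure, cur now holds the fresh value; retry. */
        }
        return false;
}
```

Single-threaded, both functions behave the same; the difference only shows
up when several threads decrement the same counter concurrently.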


I don't think we can combine il_prev and cur_wil_node_weight because
the task policy may be different than the current policy.

i.e. it's totally valid to do the following:

1) set_mempolicy(MPOL_INTERLEAVE)
2) mbind(..., MPOL_WEIGHTED_INTERLEAVE)

Using current->il_prev between these two policies, is just plain incorrect,
so I will need to rethink this, and the existing code will need to be
updated such that weighted_interleave does not use current->il_prev.

~Gregory
Huang, Ying Jan. 29, 2024, 8:17 a.m. UTC | #3
Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Jan 26, 2024 at 03:40:27PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > Two special observations:
>> > - if the weight is non-zero, cur_il_weight must *always* have a
>> >   valid node number, e.g. it cannot be NUMA_NO_NODE (-1).
>> 
>> IIUC, we don't need that, "MAX_NUMNODES-1" is used instead.
>> 
>
> Correct, I just thought it pertinent to call this out explicitly since
> I'm stealing the top byte, but the node value has traditionally been a
> full integer.
>
> This may be relevant should anyone try to carry, a random node value
> into this field. For example, if someone tried to copy policy->home_node
> into cur_il_weight for whatever reason.
>
> It's worth breaking out a function to defend against this - plus to hide
> the bit operations directly as you recommend below.
>
>> >  	/* Weighted interleave settings */
>> > -	u8 cur_il_weight;
>> > +	atomic_t cur_il_weight;
>> 
>> If we use this field for node and weight, why not change the field name?
>> For example, cur_wil_node_weight.
>> 
>
> ack.
>
>> > +			if (cweight & 0xFF)
>> > +				*policy = cweight >> 8;
>> 
>> Please define some helper functions or macros instead of operating on bits
>> directly.
>> 
>
> ack.
>
>> >  			else
>> >  				*policy = next_node_in(current->il_prev,
>> >  						       pol->nodes);
>> 
>> If we record current node in pol->cur_il_weight, why do we still need
>> current->il_prev?  Can we only use pol->cur_il_weight?  And if so, we can
>> even make current->il_prev a union.
>> 
>
> I just realized that there's a problem here for shared memory policies.
>
> from weighted_interleave_nodes, I do this:
>
> cur_weight = atomic_read(&policy->cur_il_weight);
> ...
> weight--;
> ...
> atomic_set(&policy->cur_il_weight, cur_weight);
>
> On a shared memory policy, this is a race condition.
>
>
> I don't think we can combine il_prev and cur_wil_node_weight because
> the task policy may be different than the current policy.
>
> i.e. it's totally valid to do the following:
>
> 1) set_mempolicy(MPOL_INTERLEAVE)
> 2) mbind(..., MPOL_WEIGHTED_INTERLEAVE)
>
> Using current->il_prev between these two policies, is just plain incorrect,
> so I will need to rethink this, and the existing code will need to be
> updated such that weighted_interleave does not use current->il_prev.

IIUC, weighted_interleave_nodes() is only used for mempolicy of tasks
(set_mempolicy()), as in the following code.

+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);

But, in contrast, it's bad to put task-local "current weight" in
mempolicy.  So, I think that it's better to move cur_il_weight to
task_struct.  And maybe combine it with current->il_prev.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 29, 2024, 3:48 p.m. UTC | #4
On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > Using current->il_prev between these two policies, is just plain incorrect,
> > so I will need to rethink this, and the existing code will need to be
> > updated such that weighted_interleave does not use current->il_prev.
> 
> IIUC, weighted_interleave_nodes() is only used for mempolicy of tasks
> (set_mempolicy()), as in the following code.
> 
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
>

Was digging through this the past couple of days.  It does look like
this is true - because if (pol) comes from a vma, ilx will not be
NO_INTERLEAVE_INDEX.  If this changes in the future, however,
weighted_interleave_nodes may begin to miscount under heavy contention.

It may be worth documenting this explicitly, because this is incredibly
non-obvious.  I will add a comment to this chunk here.

> But, in contrast, it's bad to put task-local "current weight" in
> mempolicy.  So, I think that it's better to move cur_il_weight to
> task_struct.  And maybe combine it with current->il_prev.
> 

Given all of this, I think this is reasonable. That is effectively what is
happening anyway for anyone that just uses `numactl -w --interleave=...`

Style question: is it preferable to add an anonymous union into task_struct:

union {
    short il_prev;
    atomic_t wil_node_weight;
};

Or should I break out that union explicitly in mempolicy.h?

The latter involves additional code updates in mempolicy.c for the union
name (current->___.il_prev) but it lets us add documentation to mempolicy.h

~Gregory
Gregory Price Jan. 29, 2024, 6:11 p.m. UTC | #5
On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> > Gregory Price <gregory.price@memverge.com> writes:
> > 
> > But, in contrast, it's bad to put task-local "current weight" in
> > mempolicy.  So, I think that it's better to move cur_il_weight to
> > task_struct.  And maybe combine it with current->il_prev.
> > 
> Style question: is it preferable to add an anonymous union into task_struct:
> 
> union {
>     short il_prev;
>     atomic_t wil_node_weight;
> };
> 
> Or should I break out that union explicitly in mempolicy.h?
> 

Having attempted this, it looks like including mempolicy.h into sched.h
is a non-starter.  There are build issues, likely associated with the
nested include of uapi/linux/mempolicy.h.

So I went ahead and did the following.  Style-wise, if it's better to just
integrate this as an anonymous union in task_struct, let me know, but it
seemed better to add some documentation here.

I also added static get/set functions to mempolicy.c to touch these
values accordingly.

As suggested, I changed things to allow 0-weight in il_prev.node_weight
and adjusted the logic accordingly. Will be testing this for a day or so
before sending out new patches.

~Gregory



diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..f0d2af3bbc69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -745,6 +745,29 @@ struct kmap_ctrl {
 #endif
 };

+
+/*
+ * Describes task_struct interleave settings
+ *
+ * Interleave uses mpol_interleave.node
+ *   last node allocated from
+ *   intended for use in next_node_in() on the next allocation
+ *
+ * Weighted interleave uses mpol_interleave.node_weight
+ *   node is the value of the current node to allocate from
+ *   weight is the number of allocations left on that node
+ *   when weight is 0, next_node_in(node) will be invoked
+ */
+union mpol_interleave {
+       struct {
+               short node;
+               short resv;
+       };
+       /* structure: short node; u8 resv; u8 weight; */
+       atomic_t node_weight;
+};
+
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
        /*
@@ -1258,7 +1281,7 @@ struct task_struct {
 #ifdef CONFIG_NUMA
        /* Protected by alloc_lock: */
        struct mempolicy                *mempolicy;
-       short                           il_prev;
+       union mpol_interleave           il_prev;
        short                           pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING



diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 92740b8f0eb5..48e365b507b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,6 +149,66 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);

+static u8 get_il_weight(int node)
+{
+       u8 __rcu *table;
+       u8 weight;
+
+       rcu_read_lock();
+       table = rcu_dereference(iw_table);
+       /* if no iw_table, use system default */
+       weight = table ? table[node] : 1;
+       /* if value in iw_table is 0, use system default */
+       weight = weight ? weight : 1;
+       rcu_read_unlock();
+       return weight;
+}
+
+/* Clear any interleave values from task->il_prev */
+static void clear_il_prev(void)
+{
+       int node_weight;
+
+       node_weight = MAKE_WIL_PREV(MAX_NUMNODES - 1, 0);
+       atomic_set(&current->il_prev.node_weight, node_weight);
+}
+
+/* get the next value for weighted interleave */
+static void get_wil_prev(int *node, u8 *weight)
+{
+       int node_weight;
+
+       node_weight = atomic_read(&current->il_prev.node_weight);
+       *node = WIL_NODE(node_weight);
+       *weight = WIL_WEIGHT(node_weight);
+}
+
+/* set the next value for weighted interleave */
+static void set_wil_prev(int node, u8 weight)
+{
+       int node_weight;
+
+       if (node == MAX_NUMNODES)
+               node -= 1;
+       node_weight = MAKE_WIL_PREV(node, weight);
+       atomic_set(&current->il_prev.node_weight, node_weight);
+}
+
+/* get the previous interleave node */
+static short get_il_prev(void)
+{
+       return current->il_prev.node;
+}
+
+/* set the previous interleave node */
+static void set_il_prev(int node)
+{
+       if (unlikely(node >= MAX_NUMNODES))
+               node = MAX_NUMNODES - 1;
+
+       current->il_prev.node = node;
+}
+
Huang, Ying Jan. 30, 2024, 3:15 a.m. UTC | #6
Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
>> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
>> > Gregory Price <gregory.price@memverge.com> writes:
>> > 
>> > But, in contrast, it's bad to put task-local "current weight" in
>> > mempolicy.  So, I think that it's better to move cur_il_weight to
>> > task_struct.  And maybe combine it with current->il_prev.
>> > 
>> Style question: is it preferable to add an anonymous union into task_struct:
>> 
>> union {
>>     short il_prev;
>>     atomic_t wil_node_weight;
>> };
>> 
>> Or should I break out that union explicitly in mempolicy.h?
>> 
>
> Having attempted this, it looks like including mempolicy.h into sched.h
> is a non-starter.  There are build issues, likely associated with the
> nested include of uapi/linux/mempolicy.h
>
> So I went ahead and did the following.  Style-wise, if it's better to just
> integrate this as an anonymous union in task_struct, let me know, but it
> seemed better to add some documentation here.
>
> I also added static get/set functions to mempolicy.c to touch these
> values accordingly.
>
> As suggested, I changed things to allow 0-weight in il_prev.node_weight
> and adjusted the logic accordingly. Will be testing this for a day or so
> before sending out new patches.
>

Thanks for this again.  It seems that we don't need to touch
task->il_prev and task->il_weight during rebinding for weighted
interleave either.

For weighted interleaving, il_prev is the node used for previous
allocation, il_weight is the weight after previous allocation.  So
weighted_interleave_nodes() could be as follows,

unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
        unsigned int nid;
        struct task_struct *me = current;

        nid = me->il_prev;
        if (!me->il_weight || !node_isset(nid, policy->nodes)) {
                nid = next_node_in(...);
                me->il_prev = nid;
                me->il_weight = weights[nid];
        }
        me->il_weight--;

        return nid;
}

If this works, we can just add il_weight into task_struct.
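
To make the expected allocation order concrete, here is a minimal userspace
sketch of the function above. It is only an illustration under stated
assumptions: struct task_sim, the sim_* helpers, the MAX_NODES value and the
plain unsigned int bitmask standing in for a nodemask are all hypothetical,
and the wrap-around search is my reading of what next_node_in() is meant to
do here, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8     /* hypothetical small stand-in for MAX_NUMNODES */

/* Hypothetical stand-in for the two task_struct fields discussed above. */
struct task_sim {
        unsigned int il_prev;   /* node used by the previous allocation */
        unsigned int il_weight; /* allocations still owed to il_prev */
};

static bool sim_node_isset(unsigned int nid, unsigned int mask)
{
        return nid < MAX_NODES && (mask & (1u << nid));
}

/* Next set bit after nid, wrapping around; MAX_NODES if the mask is empty. */
static unsigned int sim_next_node_in(unsigned int nid, unsigned int mask)
{
        assert(nid < MAX_NODES);
        for (unsigned int i = 1; i <= MAX_NODES; i++) {
                unsigned int cand = (nid + i) % MAX_NODES;

                if (mask & (1u << cand))
                        return cand;
        }
        return MAX_NODES;
}

static unsigned int sim_weighted_interleave(struct task_sim *me,
                                            unsigned int mask,
                                            const unsigned char *weights)
{
        unsigned int nid = me->il_prev;

        /* Move on when the weight is used up or the node left the mask. */
        if (!me->il_weight || !sim_node_isset(nid, mask)) {
                nid = sim_next_node_in(nid, mask);
                if (nid == MAX_NODES)
                        return nid;     /* empty mask: nothing to pick */
                me->il_prev = nid;
                /* a weight of 0 falls back to 1, as in get_il_weight() */
                me->il_weight = weights[nid] ? weights[nid] : 1;
        }
        me->il_weight--;
        return nid;
}
```

Starting from il_prev = MAX_NODES - 1 and il_weight = 0, with weights {3, 1}
on nodes 0 and 1, repeated calls should hand out node 0 three times, then
node 1 once, and repeat.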

--
Best Regards,
Huang, Ying
Gregory Price Jan. 30, 2024, 3:33 a.m. UTC | #7
On Tue, Jan 30, 2024 at 11:15:35AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
> >> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
> >> > Gregory Price <gregory.price@memverge.com> writes:
> >> > 
> >> > But, in contrast, it's bad to put task-local "current weight" in
> >> > mempolicy.  So, I think that it's better to move cur_il_weight to
> >> > task_struct.  And maybe combine it with current->il_prev.
> >> > 
> >> Style question: is it preferable to add an anonymous union into task_struct:
> >> 
> >> union {
> >>     short il_prev;
> >>     atomic_t wil_node_weight;
> >> };
> >> 
> >> Or should I break out that union explicitly in mempolicy.h?
> >> 
> >
> > Having attempted this, it looks like including mempolicy.h into sched.h
> > is a non-starter.  There are build issues, likely associated with the
> > nested include of uapi/linux/mempolicy.h
> >
> > So I went ahead and did the following.  Style-wise, if it's better to just
> > integrate this as an anonymous union in task_struct, let me know, but it
> > seemed better to add some documentation here.
> >
> > I also added static get/set functions to mempolicy.c to touch these
> > values accordingly.
> >
> > As suggested, I changed things to allow 0-weight in il_prev.node_weight
> > and adjusted the logic accordingly. Will be testing this for a day or so
> > before sending out new patches.
> >
> 
> Thanks for this again.  It seems that we don't need to touch
> task->il_prev and task->il_weight during rebinding for weighted
> interleave either.
> 

It's not clear to me this is the case.  cpusets takes the task_lock to
change mems_allowed and rebind task->mempolicy, but I do not see the
task lock access blocking allocations.

Comments from cpusets suggest allocations can happen in parallel.

/*
 * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
 * @tsk: the task to change
 * @newmems: new nodes that the task will be set
 *
 * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
 * and rebind an eventual tasks' mempolicy. If the task is allocating in
 * parallel, it might temporarily see an empty intersection, which results in
 * a seqlock check and retry before OOM or allocation failure.
 */


For normal interleave, this isn't an issue because it always proceeds to
the next node. The same is not true of weighted interleave, which may
have a hanging weight in task->il_weight.

That is why I looked to combine the two, so at least node/weight were
carried together.
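
As a rough illustration of what "carried together" means, the sketch below
packs a node and a weight into one atomic integer so that a single load or
store always sees a matching pair. The bit layout, the SIM_* macros and
struct il_prev_sim are hypothetical (the real MAKE_WIL_PREV/WIL_NODE/
WIL_WEIGHT definitions are not reproduced here), and it uses C11
<stdatomic.h> rather than the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits hold the weight, bits above hold the node. */
#define SIM_WEIGHT_BITS 8
#define SIM_MAKE_PREV(node, weight) \
        ((int)(((unsigned int)(node) << SIM_WEIGHT_BITS) | (uint8_t)(weight)))
#define SIM_NODE(v)     ((int)((unsigned int)(v) >> SIM_WEIGHT_BITS))
#define SIM_WEIGHT(v)   ((uint8_t)((unsigned int)(v) & 0xff))

struct il_prev_sim {
        atomic_int node_weight; /* node and weight in one word */
};

/* Publish node and weight with one store, so they cannot be seen half-updated. */
static void sim_set_prev(struct il_prev_sim *p, int node, uint8_t weight)
{
        assert(node >= 0 && node < (1 << 16));
        atomic_store(&p->node_weight, SIM_MAKE_PREV(node, weight));
}

/* Read both fields from one load. */
static void sim_get_prev(struct il_prev_sim *p, int *node, uint8_t *weight)
{
        int v = atomic_load(&p->node_weight);

        *node = SIM_NODE(v);
        *weight = SIM_WEIGHT(v);
}
```

Note that this only keeps each individual read or write consistent; a
get/modify/set sequence is still not atomic as a whole.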

> unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> {
>         unsigned int nid;
>         struct task_struct *me = current;
> 
>         nid = me->il_prev;
>         if (!me->il_weight || !node_isset(nid, policy->nodes)) {
>                 nid = next_node_in(...);
>                 me->il_prev = nid;
>                 me->il_weight = weights[nid];
>         }
>         me->il_weight--;
> 
>         return nid;
> }

I ended up with this:

static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
{
       unsigned int node;
       u8 weight;

       get_wil_prev(&node, &weight);
       /* If nodemask was rebound, just fetch the next node */
       if (!weight) {
               node = next_node_in(node, policy->nodes);
               /* can only happen if nodemask has become invalid */
               if (node == MAX_NUMNODES)
                       return node;
               weight = get_il_weight(node);
       }
       weight--;
       set_wil_prev(node, weight);
       return node;
}

~Gregory
Huang, Ying Jan. 30, 2024, 5:18 a.m. UTC | #8
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 30, 2024 at 11:15:35AM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Mon, Jan 29, 2024 at 10:48:47AM -0500, Gregory Price wrote:
>> >> On Mon, Jan 29, 2024 at 04:17:46PM +0800, Huang, Ying wrote:
>> >> > Gregory Price <gregory.price@memverge.com> writes:
>> >> > 
>> >> > But, in contrast, it's bad to put task-local "current weight" in
>> >> > mempolicy.  So, I think that it's better to move cur_il_weight to
>> >> > task_struct.  And maybe combine it with current->il_prev.
>> >> > 
>> >> Style question: is it preferable to add an anonymous union into task_struct:
>> >> 
>> >> union {
>> >>     short il_prev;
>> >>     atomic_t wil_node_weight;
>> >> };
>> >> 
>> >> Or should I break out that union explicitly in mempolicy.h?
>> >> 
>> >
>> > Having attempted this, it looks like including mempolicy.h into sched.h
>> > is a non-starter.  There are build issues, likely associated with the
>> > nested include of uapi/linux/mempolicy.h
>> >
>> > So I went ahead and did the following.  Style-wise, if it's better to just
>> > integrate this as an anonymous union in task_struct, let me know, but it
>> > seemed better to add some documentation here.
>> >
>> > I also added static get/set functions to mempolicy.c to touch these
>> > values accordingly.
>> >
>> > As suggested, I changed things to allow 0-weight in il_prev.node_weight
>> > and adjusted the logic accordingly. Will be testing this for a day or so
>> > before sending out new patches.
>> >
>> 
>> Thanks for this again.  It seems that we don't need to touch
>> task->il_prev and task->il_weight during rebinding for weighted
>> interleave either.
>> 
>
> It's not clear to me this is the case.  cpusets takes the task_lock to
> change mems_allowed and rebind task->mempolicy, but I do not see the
> task lock access blocking allocations.
>
> Comments from cpusets suggest allocations can happen in parallel.
>
> /*
>  * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
>  * @tsk: the task to change
>  * @newmems: new nodes that the task will be set
>  *
>  * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
>  * and rebind an eventual tasks' mempolicy. If the task is allocating in
>  * parallel, it might temporarily see an empty intersection, which results in
>  * a seqlock check and retry before OOM or allocation failure.
>  */
>
>
> For normal interleave, this isn't an issue because it always proceeds to
> the next node. The same is not true of weighted interleave, which may
> have a hanging weight in task->il_weight.

So, I added a check as follows,

node_isset(current->il_prev, policy->nodes)

If prev node is removed from nodemask, allocation will proceed to the
next node.  Otherwise, it's safe to use current->il_weight.  
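
A small userspace sketch of the rebind case may help show why the check
matters. Everything here is hypothetical scaffolding (struct wil_state,
sim_next(), sim_pick(), a 4-node bitmask instead of a real nodemask); the
check_mask flag exists only so the behaviour with and without the
node_isset()-style test can be compared.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NODES 4     /* hypothetical node count for the sketch */

struct wil_state {
        unsigned int prev;      /* stand-in for current->il_prev */
        unsigned int weight;    /* stand-in for current->il_weight */
};

/* Next set bit after nid, wrapping around; caller guarantees a non-empty mask. */
static unsigned int sim_next(unsigned int nid, unsigned int mask)
{
        for (unsigned int i = 1; i <= SIM_NODES; i++) {
                unsigned int cand = (nid + i) % SIM_NODES;

                if (mask & (1u << cand))
                        return cand;
        }
        return nid;     /* not reached when the mask is non-empty */
}

static unsigned int sim_pick(struct wil_state *s, unsigned int mask,
                             const unsigned char *w, bool check_mask)
{
        unsigned int nid = s->prev;
        bool stale;

        assert(nid < SIM_NODES);
        assert((mask & ((1u << SIM_NODES) - 1)) != 0);

        /* prev node dropped out of the mask: its leftover weight is stale */
        stale = check_mask && !(mask & (1u << nid));
        if (!s->weight || stale) {
                nid = sim_next(nid, mask);
                s->prev = nid;
                s->weight = w[nid] ? w[nid] : 1;
        }
        s->weight--;
        return nid;
}
```

With weights {4, 2, 2, 1}, two allocations under mask {0,1} leave a weight
of 2 hanging on node 0. If the mask is then rebound to {1,2}, the checked
variant moves to node 1 and reloads its weight, while the unchecked variant
keeps returning node 0 even though it is no longer allowed.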

--
Best Regards,
Huang, Ying

Gregory Price Jan. 30, 2024, 4:01 p.m. UTC | #9
On Tue, Jan 30, 2024 at 01:18:30PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > For normal interleave, this isn't an issue because it always proceeds to
> > the next node. The same is not true of weighted interleave, which may
> > have a hanging weight in task->il_weight.
> 
> So, I added a check as follows,
> 
> node_isset(current->il_prev, policy->nodes)
> 
> If prev node is removed from nodemask, allocation will proceed to the
> next node.  Otherwise, it's safe to use current->il_weight.  
> 

Funnily enough, I had this on one of my branches and dropped it, but after
digging through everything, this should be sufficient.

I'll just add il_weight next to il_prev and have a new set of patches
out today. Code is already there, just needs one last cleanup pass.

~Gregory