Message ID: 20231031003810.4532-1-gregory.price@memverge.com
Series: Node Weights and Weighted Interleave
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > > This hopefully also explains why it's a global setting. The usecase is > > different from conventional NUMA interleaving, which is used as a > > locality measure: spread shared data evenly between compute > > nodes. This one isn't about locality - the CXL tier doesn't have local > > compute. Instead, the optimal spread is based on hardware parameters, > > which is a global property rather than a per-workload one. > > Well, I am not convinced about that TBH. Sure it is probably a good fit > for this specific CXL usecase but it just doesn't fit into many others I > can think of - e.g. proportional use of those tiers based on the > workload - you get what you pay for. > > Is there any specific reason for not having a new interleave interface > which defines weights for the nodemask? Is this because the policy > itself is very dynamic or is this more driven by simplicity of use? > I had originally implemented it this way while experimenting with new mempolicies. https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/ The downside of doing it in mempolicy is... 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a non-trivial task. It is very "current-task" centric. 2) Barring a change to mempolicy to be sysfs friendly, the options for implementing weights in the mempolicy are either a) new flag and setting every weight individually in many syscalls, or b) a new syscall (set_mempolicy2), which is what I demonstrated in the RFC. 3) mempolicy is also subject to cgroup nodemasks, and as a result you end up with a rats nest of interactions between mempolicy nodemasks changing as a result of cgroup migrations, nodes potentially coming and going (hotplug under CXL), and others I'm probably forgetting. Basically: If a node leaves the nodemask, should you retain the weight, or should you reset it? If a new node comes into the node mask... what weight should you set? I did not have answers to these questions. It was recommended to explore placing it in tiers instead, so I took a crack at it here: https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/ This had similar issue with the idea of hotplug nodes: if you give a tier a weight, and one or more of the nodes goes away/comes back... what should you do with the weight? Split it up among the remaining nodes? Rebalance? Etc. The result of this discussion lead us to simply say "What if we place the weights directly in the node". And that lead us to this RFC. I am not against implementing it in mempolicy (as proof: my first RFC). I am simply searching for the acceptable way to implement it. One of the benefits of having it set as a global setting is that weights can be automatically generated from HMAT/HMEM information (ACPI tables) and programs already using MPOL_INTERLEAVE will have a direct benefit. I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added along side this patch so that MPOL_INTERLEAVE is left entirely alone. Happy to discuss more, ~Gregory
On Tue, Oct 31, 2023 at 12:22:16PM -0400, Johannes Weiner wrote: > On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > > > > Well, I am not convinced about that TBH. Sure it is probably a good fit > > for this specific CXL usecase but it just doesn't fit into many others I > > can think of - e.g. proportional use of those tiers based on the > > workload - you get what you pay for. > > > > Is there any specific reason for not having a new interleave interface > > which defines weights for the nodemask? Is this because the policy > > itself is very dynamic or is this more driven by simplicity of use? > > A downside of *requiring* weights to be paired with the mempolicy is > that it's then the application that would have to figure out the > weights dynamically, instead of having a static host configuration. A > policy of "I want to be spread for optimal bus bandwidth" translates > between different hardware configurations, but optimal weights will > vary depending on the type of machine a job runs on. > > That doesn't mean there couldn't be usecases for having weights as > policy as well in other scenarios, like you allude to above. It's just > so far such usecases haven't really materialized or spelled out > concretely. Maybe we just want both - a global default, and the > ability to override it locally. Could you elaborate on the 'get what > you pay for' usecase you mentioned? I've been considering "por qué no los dos" for some time. Already have the code for both, just need to clean up the original RFC.
On Mon 30-10-23 20:38:06, Gregory Price wrote: > This patchset implements weighted interleave and adds a new sysfs > entry: /sys/devices/system/node/nodeN/accessM/il_weight. > > The il_weight of a node is used by mempolicy to implement weighted > interleave when `numactl --interleave=...` is invoked. By default > il_weight for a node is always 1, which preserves the default round > robin interleave behavior. > > Interleave weights may be set from 0-100, and denote the number of > pages that should be allocated from the node when interleaving > occurs. > > For example, if a node's interleave weight is set to 5, 5 pages > will be allocated from that node before the next node is scheduled > for allocations. I find this semantic rather weird TBH. First of all why do you think it makes sense to have those weights global for all users? What if different applications have different views on how to spread their interleaved memory? I do get that you might have different tiers with largely different runtime characteristics but why would you want to interleave them into a single mapping and have hard to predict runtime behavior? [...] > In this way it becomes possible to set an interleaving strategy > that fits the available bandwidth for the devices available on > the system. An example system: > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket) > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex > > In this setup, the effective weights for nodes 0-3 for a task > running on Node 0 may be [60, 20, 10, 10]. > > This spreads memory out across devices which all have different > latency and bandwidth attributes in a way that can maximize the > available resources. OK, so why is this any better than not using any memory policy and relying on demotion to push out cold memory down the tier hierarchy? What is the actual real-life usecase, and what kind of benefits can you present?
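For illustration, here is a minimal, self-contained userspace sketch of the weighted round-robin semantics described in the cover letter quoted above, using the [60, 20, 10, 10] example; the names and structure are illustrative only and are not taken from the patchset:

/*
 * Illustrative model only -- not the patchset's kernel implementation.
 * A node's weight is the number of consecutive page allocations it
 * receives before the interleave rotor advances to the next node.
 */
#include <stdio.h>

#define NR_NODES 4

static const unsigned int il_weight[NR_NODES] = { 60, 20, 10, 10 };

int main(void)
{
	unsigned long pages[NR_NODES] = { 0 };
	unsigned int node = 0;
	unsigned int remaining = il_weight[0];
	int i, n;

	/* distribute 1000 interleaved page allocations */
	for (i = 0; i < 1000; i++) {
		pages[node]++;
		if (--remaining == 0) {
			node = (node + 1) % NR_NODES;
			remaining = il_weight[node];
		}
	}

	/* prints 600/200/100/100 pages, i.e. a 60:20:10:10 split */
	for (n = 0; n < NR_NODES; n++)
		printf("node %d: %lu pages\n", n, pages[n]);

	return 0;
}

The sketch assumes nonzero weights; a weight of 0 (which the 0-100 range in the cover letter permits) would presumably mean skipping the node and would need explicit handling.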
On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: > On Mon 30-10-23 20:38:06, Gregory Price wrote: > > This patchset implements weighted interleave and adds a new sysfs > > entry: /sys/devices/system/node/nodeN/accessM/il_weight. > > > > The il_weight of a node is used by mempolicy to implement weighted > > interleave when `numactl --interleave=...` is invoked. By default > > il_weight for a node is always 1, which preserves the default round > > robin interleave behavior. > > > > Interleave weights may be set from 0-100, and denote the number of > > pages that should be allocated from the node when interleaving > > occurs. > > > > For example, if a node's interleave weight is set to 5, 5 pages > > will be allocated from that node before the next node is scheduled > > for allocations. > > I find this semantic rather weird TBH. First of all why do you think it > makes sense to have those weights global for all users? What if > different applications have different view on how to spred their > interleaved memory? > > I do get that you might have a different tiers with largerly different > runtime characteristics but why would you want to interleave them into a > single mapping and have hard to predict runtime behavior? > > [...] > > In this way it becomes possible to set an interleaving strategy > > that fits the available bandwidth for the devices available on > > the system. An example system: > > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket) > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex > > > > In this setup, the effective weights for nodes 0-3 for a task > > running on Node 0 may be [60, 20, 10, 10]. > > > > This spreads memory out across devices which all have different > > latency and bandwidth attributes at a way that can maximize the > > available resources. > > OK, so why is this any better than not using any memory policy rely > on demotion to push out cold memory down the tier hierarchy? > > What is the actual real life usecase and what kind of benefits you can > present? There are two things CXL gives you: additional capacity and additional bus bandwidth. The promotion/demotion mechanism is good for the capacity usecase, where you have a nice hot/cold gradient in the workingset and want placement accordingly across faster and slower memory. The interleaving is useful when you have a flatter workingset distribution and poorer access locality. In that case, the CPU caches are less effective and the workload can be bus-bound. The workload might fit entirely into DRAM, but concentrating it there is suboptimal. Fanning it out in proportion to the relative performance of each memory tier gives better resuls. We experimented with datacenter workloads on such machines last year and found significant performance benefits: https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/ This hopefully also explains why it's a global setting. The usecase is different from conventional NUMA interleaving, which is used as a locality measure: spread shared data evenly between compute nodes. This one isn't about locality - the CXL tier doesn't have local compute. Instead, the optimal spread is based on hardware parameters, which is a global property rather than a per-workload one.
On Tue 31-10-23 11:21:42, Johannes Weiner wrote: > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: > > On Mon 30-10-23 20:38:06, Gregory Price wrote: > > > This patchset implements weighted interleave and adds a new sysfs > > > entry: /sys/devices/system/node/nodeN/accessM/il_weight. > > > > > > The il_weight of a node is used by mempolicy to implement weighted > > > interleave when `numactl --interleave=...` is invoked. By default > > > il_weight for a node is always 1, which preserves the default round > > > robin interleave behavior. > > > > > > Interleave weights may be set from 0-100, and denote the number of > > > pages that should be allocated from the node when interleaving > > > occurs. > > > > > > For example, if a node's interleave weight is set to 5, 5 pages > > > will be allocated from that node before the next node is scheduled > > > for allocations. > > > > I find this semantic rather weird TBH. First of all why do you think it > > makes sense to have those weights global for all users? What if > > different applications have different view on how to spred their > > interleaved memory? > > > > I do get that you might have a different tiers with largerly different > > runtime characteristics but why would you want to interleave them into a > > single mapping and have hard to predict runtime behavior? > > > > [...] > > > In this way it becomes possible to set an interleaving strategy > > > that fits the available bandwidth for the devices available on > > > the system. An example system: > > > > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) > > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket) > > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex > > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex > > > > > > In this setup, the effective weights for nodes 0-3 for a task > > > running on Node 0 may be [60, 20, 10, 10]. > > > > > > This spreads memory out across devices which all have different > > > latency and bandwidth attributes at a way that can maximize the > > > available resources. > > > > OK, so why is this any better than not using any memory policy rely > > on demotion to push out cold memory down the tier hierarchy? > > > > What is the actual real life usecase and what kind of benefits you can > > present? > > There are two things CXL gives you: additional capacity and additional > bus bandwidth. > > The promotion/demotion mechanism is good for the capacity usecase, > where you have a nice hot/cold gradient in the workingset and want > placement accordingly across faster and slower memory. > > The interleaving is useful when you have a flatter workingset > distribution and poorer access locality. In that case, the CPU caches > are less effective and the workload can be bus-bound. The workload > might fit entirely into DRAM, but concentrating it there is > suboptimal. Fanning it out in proportion to the relative performance > of each memory tier gives better resuls. > > We experimented with datacenter workloads on such machines last year > and found significant performance benefits: > > https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/ Thanks, this is a useful insight. > This hopefully also explains why it's a global setting. The usecase is > different from conventional NUMA interleaving, which is used as a > locality measure: spread shared data evenly between compute > nodes. This one isn't about locality - the CXL tier doesn't have local > compute. 
Instead, the optimal spread is based on hardware parameters, > which is a global property rather than a per-workload one. Well, I am not convinced about that TBH. Sure it is probably a good fit for this specific CXL usecase but it just doesn't fit into many others I can think of - e.g. proportional use of those tiers based on the workload - you get what you pay for. Is there any specific reason for not having a new interleave interface which defines weights for the nodemask? Is this because the policy itself is very dynamic or is this more driven by simplicity of use?
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > On Tue 31-10-23 11:21:42, Johannes Weiner wrote: > > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: > > > On Mon 30-10-23 20:38:06, Gregory Price wrote: > > > > This patchset implements weighted interleave and adds a new sysfs > > > > entry: /sys/devices/system/node/nodeN/accessM/il_weight. > > > > > > > > The il_weight of a node is used by mempolicy to implement weighted > > > > interleave when `numactl --interleave=...` is invoked. By default > > > > il_weight for a node is always 1, which preserves the default round > > > > robin interleave behavior. > > > > > > > > Interleave weights may be set from 0-100, and denote the number of > > > > pages that should be allocated from the node when interleaving > > > > occurs. > > > > > > > > For example, if a node's interleave weight is set to 5, 5 pages > > > > will be allocated from that node before the next node is scheduled > > > > for allocations. > > > > > > I find this semantic rather weird TBH. First of all why do you think it > > > makes sense to have those weights global for all users? What if > > > different applications have different view on how to spred their > > > interleaved memory? > > > > > > I do get that you might have a different tiers with largerly different > > > runtime characteristics but why would you want to interleave them into a > > > single mapping and have hard to predict runtime behavior? > > > > > > [...] > > > > In this way it becomes possible to set an interleaving strategy > > > > that fits the available bandwidth for the devices available on > > > > the system. An example system: > > > > > > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) > > > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket) > > > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex > > > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex > > > > > > > > In this setup, the effective weights for nodes 0-3 for a task > > > > running on Node 0 may be [60, 20, 10, 10]. > > > > > > > > This spreads memory out across devices which all have different > > > > latency and bandwidth attributes at a way that can maximize the > > > > available resources. > > > > > > OK, so why is this any better than not using any memory policy rely > > > on demotion to push out cold memory down the tier hierarchy? > > > > > > What is the actual real life usecase and what kind of benefits you can > > > present? > > > > There are two things CXL gives you: additional capacity and additional > > bus bandwidth. > > > > The promotion/demotion mechanism is good for the capacity usecase, > > where you have a nice hot/cold gradient in the workingset and want > > placement accordingly across faster and slower memory. > > > > The interleaving is useful when you have a flatter workingset > > distribution and poorer access locality. In that case, the CPU caches > > are less effective and the workload can be bus-bound. The workload > > might fit entirely into DRAM, but concentrating it there is > > suboptimal. Fanning it out in proportion to the relative performance > > of each memory tier gives better resuls. > > > > We experimented with datacenter workloads on such machines last year > > and found significant performance benefits: > > > > https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/ > > Thanks, this is a useful insight. > > > This hopefully also explains why it's a global setting. 
The usecase is > > different from conventional NUMA interleaving, which is used as a > > locality measure: spread shared data evenly between compute > > nodes. This one isn't about locality - the CXL tier doesn't have local > > compute. Instead, the optimal spread is based on hardware parameters, > > which is a global property rather than a per-workload one. > > Well, I am not convinced about that TBH. Sure it is probably a good fit > for this specific CXL usecase but it just doesn't fit into many others I > can think of - e.g. proportional use of those tiers based on the > workload - you get what you pay for. > > Is there any specific reason for not having a new interleave interface > which defines weights for the nodemask? Is this because the policy > itself is very dynamic or is this more driven by simplicity of use? A downside of *requiring* weights to be paired with the mempolicy is that it's then the application that would have to figure out the weights dynamically, instead of having a static host configuration. A policy of "I want to be spread for optimal bus bandwidth" translates between different hardware configurations, but optimal weights will vary depending on the type of machine a job runs on. That doesn't mean there couldn't be usecases for having weights as policy as well in other scenarios, like you allude to above. It's just so far such usecases haven't really materialized or spelled out concretely. Maybe we just want both - a global default, and the ability to override it locally. Could you elaborate on the 'get what you pay for' usecase you mentioned?
Michal Hocko <mhocko@suse.com> writes: > On Tue 31-10-23 11:21:42, Johannes Weiner wrote: >> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: >> > On Mon 30-10-23 20:38:06, Gregory Price wrote: >> > > This patchset implements weighted interleave and adds a new sysfs >> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight. >> > > >> > > The il_weight of a node is used by mempolicy to implement weighted >> > > interleave when `numactl --interleave=...` is invoked. By default >> > > il_weight for a node is always 1, which preserves the default round >> > > robin interleave behavior. >> > > >> > > Interleave weights may be set from 0-100, and denote the number of >> > > pages that should be allocated from the node when interleaving >> > > occurs. >> > > >> > > For example, if a node's interleave weight is set to 5, 5 pages >> > > will be allocated from that node before the next node is scheduled >> > > for allocations. >> > >> > I find this semantic rather weird TBH. First of all why do you think it >> > makes sense to have those weights global for all users? What if >> > different applications have different view on how to spred their >> > interleaved memory? >> > >> > I do get that you might have a different tiers with largerly different >> > runtime characteristics but why would you want to interleave them into a >> > single mapping and have hard to predict runtime behavior? >> > >> > [...] >> > > In this way it becomes possible to set an interleaving strategy >> > > that fits the available bandwidth for the devices available on >> > > the system. An example system: >> > > >> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) >> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket) >> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex >> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex >> > > >> > > In this setup, the effective weights for nodes 0-3 for a task >> > > running on Node 0 may be [60, 20, 10, 10]. >> > > >> > > This spreads memory out across devices which all have different >> > > latency and bandwidth attributes at a way that can maximize the >> > > available resources. >> > >> > OK, so why is this any better than not using any memory policy rely >> > on demotion to push out cold memory down the tier hierarchy? >> > >> > What is the actual real life usecase and what kind of benefits you can >> > present? >> >> There are two things CXL gives you: additional capacity and additional >> bus bandwidth. >> >> The promotion/demotion mechanism is good for the capacity usecase, >> where you have a nice hot/cold gradient in the workingset and want >> placement accordingly across faster and slower memory. >> >> The interleaving is useful when you have a flatter workingset >> distribution and poorer access locality. In that case, the CPU caches >> are less effective and the workload can be bus-bound. The workload >> might fit entirely into DRAM, but concentrating it there is >> suboptimal. Fanning it out in proportion to the relative performance >> of each memory tier gives better resuls. >> >> We experimented with datacenter workloads on such machines last year >> and found significant performance benefits: >> >> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/ > > Thanks, this is a useful insight. > >> This hopefully also explains why it's a global setting. The usecase is >> different from conventional NUMA interleaving, which is used as a >> locality measure: spread shared data evenly between compute >> nodes. 
This one isn't about locality - the CXL tier doesn't have local >> compute. Instead, the optimal spread is based on hardware parameters, >> which is a global property rather than a per-workload one. > > Well, I am not convinced about that TBH. Sure it is probably a good fit > for this specific CXL usecase but it just doesn't fit into many others I > can think of - e.g. proportional use of those tiers based on the > workload - you get what you pay for. For "pay", per my understanding, we need some cgroup based per-memory-tier (or per-node) usage limit. The following patchset is the first step for that. https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ -- Best Regards, Huang, Ying
Johannes Weiner <hannes@cmpxchg.org> writes: > On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: >> On Tue 31-10-23 11:21:42, Johannes Weiner wrote: >> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: >> > > On Mon 30-10-23 20:38:06, Gregory Price wrote: [snip] >> >> > This hopefully also explains why it's a global setting. The usecase is >> > different from conventional NUMA interleaving, which is used as a >> > locality measure: spread shared data evenly between compute >> > nodes. This one isn't about locality - the CXL tier doesn't have local >> > compute. Instead, the optimal spread is based on hardware parameters, >> > which is a global property rather than a per-workload one. >> >> Well, I am not convinced about that TBH. Sure it is probably a good fit >> for this specific CXL usecase but it just doesn't fit into many others I >> can think of - e.g. proportional use of those tiers based on the >> workload - you get what you pay for. >> >> Is there any specific reason for not having a new interleave interface >> which defines weights for the nodemask? Is this because the policy >> itself is very dynamic or is this more driven by simplicity of use? > > A downside of *requiring* weights to be paired with the mempolicy is > that it's then the application that would have to figure out the > weights dynamically, instead of having a static host configuration. A > policy of "I want to be spread for optimal bus bandwidth" translates > between different hardware configurations, but optimal weights will > vary depending on the type of machine a job runs on. > > That doesn't mean there couldn't be usecases for having weights as > policy as well in other scenarios, like you allude to above. It's just > so far such usecases haven't really materialized or spelled out > concretely. Maybe we just want both - a global default, and the > ability to override it locally. I think that this is a good idea. The system-wise configuration with reasonable default makes applications life much easier. If more control is needed, some kind of workload specific configuration can be added. And, instead of adding another memory policy, a cgroup-wise configuration may be easier to be used. The per-workload weight may need to be adjusted when we deploying different combination of workloads in the system. Another question is that should the weight be per-memory-tier or per-node? In this patchset, the weight is per-source-target-node combination. That is, the weight becomes a matrix instead of a vector. IIUC, this is used to control cross-socket memory access in addition to per-memory-type memory access. Do you think the added complexity is necessary? > Could you elaborate on the 'get what you pay for' usecase you > mentioned? -- Best Regards, Huang, Ying
>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: >>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote: >>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: >>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote: > >[snip] > >>> >>> > This hopefully also explains why it's a global setting. The usecase is >>> > different from conventional NUMA interleaving, which is used as a >>> > locality measure: spread shared data evenly between compute >>> > nodes. This one isn't about locality - the CXL tier doesn't have local >>> > compute. Instead, the optimal spread is based on hardware parameters, >>> > which is a global property rather than a per-workload one. >>> >>> Well, I am not convinced about that TBH. Sure it is probably a good fit >>> for this specific CXL usecase but it just doesn't fit into many others I >>> can think of - e.g. proportional use of those tiers based on the >>> workload - you get what you pay for. >>> >>> Is there any specific reason for not having a new interleave interface >>> which defines weights for the nodemask? Is this because the policy >>> itself is very dynamic or is this more driven by simplicity of use? >> >> A downside of *requiring* weights to be paired with the mempolicy is >> that it's then the application that would have to figure out the >> weights dynamically, instead of having a static host configuration. A >> policy of "I want to be spread for optimal bus bandwidth" translates >> between different hardware configurations, but optimal weights will >> vary depending on the type of machine a job runs on. >> >> That doesn't mean there couldn't be usecases for having weights as >> policy as well in other scenarios, like you allude to above. It's just >> so far such usecases haven't really materialized or spelled out >> concretely. Maybe we just want both - a global default, and the >> ability to override it locally. > >I think that this is a good idea. The system-wise configuration with >reasonable default makes applications life much easier. If more control >is needed, some kind of workload specific configuration can be added. Glad that we are in agreement here. For bandwidth expansion use cases that this interleave patchset is trying to cater to, most applications would have to follow the "reasanable defaults" for weights. The necessity for applications to choose different weights while interleaving would probably be to do capacity expansion which the default memory tiering implementation would anyway support and provide better latency. >And, instead of adding another memory policy, a cgroup-wise >configuration may be easier to be used. The per-workload weight may >need to be adjusted when we deploying different combination of workloads >in the system. > >Another question is that should the weight be per-memory-tier or >per-node? In this patchset, the weight is per-source-target-node >combination. That is, the weight becomes a matrix instead of a vector. >IIUC, this is used to control cross-socket memory access in addition to >per-memory-type memory access. Do you think the added complexity is >necessary? Pros and Cons of Node based interleave: Pros: 1. Weights can be defined for devices with different bandwidth and latency characteristics individually irrespective of which tier they fall into. 2. Defining the weight per-source-target-node would be necessary for multi socket systems where few devices may be closer to one socket rather than other. Cons: 1. 
Weights need to be programmed for all the nodes which can be tedious for systems with lot of NUMA nodes. Pros and Cons of Memory Tier based interleave: Pros: 1. Programming weight per initiator would apply for all the nodes in the tier. 2. Weights can be calculated considering the cumulative bandwidth of all the nodes in the tier and need to be programmed once for all the nodes in a given tier. 3. It may be useful in cases where numa nodes with similar latency and bandwidth characteristics increase, possibly with pooling use cases. Cons: 1. If nodes with different bandwidth and latency characteristics are placed in same tier as seen in the current mainline kernel, it will be difficult to apply a correct interleave weight policy. 2. There will be a need for functionality to move nodes between different tiers or create new tiers to place such nodes for programming correct interleave weights. We are working on a patch to support it currently. 3. For systems where each numa node is having different characteristics, a single node might end up existing in different memory tier, which would be equivalent to node based interleaving. On newer systems where all CXL memory from different devices under a port are combined to form single numa node, this scenario might be applicable. 4. Users may need to keep track of different memory tiers and what nodes are present in each tier for invoking interleave policy. > >> Could you elaborate on the 'get what you pay for' usecase you >> mentioned? > >-- >Best Regards, >Huang, Ying -- Best Regards, Ravi Jonnalagadda
On Tue 31-10-23 00:27:04, Gregory Price wrote: > On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > > > > This hopefully also explains why it's a global setting. The usecase is > > > different from conventional NUMA interleaving, which is used as a > > > locality measure: spread shared data evenly between compute > > > nodes. This one isn't about locality - the CXL tier doesn't have local > > > compute. Instead, the optimal spread is based on hardware parameters, > > > which is a global property rather than a per-workload one. > > > > Well, I am not convinced about that TBH. Sure it is probably a good fit > > for this specific CXL usecase but it just doesn't fit into many others I > > can think of - e.g. proportional use of those tiers based on the > > workload - you get what you pay for. > > > > Is there any specific reason for not having a new interleave interface > > which defines weights for the nodemask? Is this because the policy > > itself is very dynamic or is this more driven by simplicity of use? > > > > I had originally implemented it this way while experimenting with new > mempolicies. > > https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/ > > The downside of doing it in mempolicy is... > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a > non-trivial task. It is very "current-task" centric. True. Cpusets is the way to make it less process centric but that comes with its own constains (namely which NUMA policies are supported). > 2) Barring a change to mempolicy to be sysfs friendly, the options for > implementing weights in the mempolicy are either a) new flag and > setting every weight individually in many syscalls, or b) a new > syscall (set_mempolicy2), which is what I demonstrated in the RFC. Yes, that would likely require a new syscall. > 3) mempolicy is also subject to cgroup nodemasks, and as a result you > end up with a rats nest of interactions between mempolicy nodemasks > changing as a result of cgroup migrations, nodes potentially coming > and going (hotplug under CXL), and others I'm probably forgetting. Is this really any different from what you are proposing though? > Basically: If a node leaves the nodemask, should you retain the > weight, or should you reset it? If a new node comes into the node > mask... what weight should you set? I did not have answers to these > questions. I am not really sure I follow you. Are you talking about cpuset nodemask changes or memory hotplug here. > It was recommended to explore placing it in tiers instead, so I took a > crack at it here: > > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/ > > This had similar issue with the idea of hotplug nodes: if you give a > tier a weight, and one or more of the nodes goes away/comes back... what > should you do with the weight? Split it up among the remaining nodes? > Rebalance? Etc. How is this any different from node becoming depleted? You cannot really expect that you get memory you are asking for and you can easily end up getting memory from a different node instead. > The result of this discussion lead us to simply say "What if we place > the weights directly in the node". And that lead us to this RFC. Maybe I am missing something really crucial here but I do not see how this fundamentally changes anything. 
Memory hotremove (or mere node memory depletion) is not really a thing because interleaving is a best-effort operation, so you have to live with memory not being strictly distributed per your preferences. Memory hotadd will be easier to manage because you just update a single place after a node is hotadded rather than gazillions of partial policies. But that requires that the interleave policy nodemask assumes future nodes going online and puts them in the mask. > I am not against implementing it in mempolicy (as proof: my first RFC). > I am simply searching for the acceptable way to implement it. > > One of the benefits of having it set as a global setting is that weights > can be automatically generated from HMAT/HMEM information (ACPI tables) > and programs already using MPOL_INTERLEAVE will have a direct benefit. Right. This is understood. My main concern is whether this outweighs the limitations of having a _global_ policy _only_. Historically a single global policy usually led to finding ways to make it more scoped (usually through cgroups). > I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added > alongside this patch so that MPOL_INTERLEAVE is left entirely alone. > > Happy to discuss more, > ~Gregory
On Tue 31-10-23 12:22:16, Johannes Weiner wrote: > On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: [...] > > Is there any specific reason for not having a new interleave interface > > which defines weights for the nodemask? Is this because the policy > > itself is very dynamic or is this more driven by simplicity of use? > > A downside of *requiring* weights to be paired with the mempolicy is > that it's then the application that would have to figure out the > weights dynamically, instead of having a static host configuration. A > policy of "I want to be spread for optimal bus bandwidth" translates > between different hardware configurations, but optimal weights will > vary depending on the type of machine a job runs on. I can imagine this could be achieved by numactl(8) so that the process management tool could set this up for the process on the start up. Sure it wouldn't be very dynamic after then and that is why I was asking about how dynamic the situation might be in practice. > That doesn't mean there couldn't be usecases for having weights as > policy as well in other scenarios, like you allude to above. It's just > so far such usecases haven't really materialized or spelled out > concretely. Maybe we just want both - a global default, and the > ability to override it locally. Could you elaborate on the 'get what > you pay for' usecase you mentioned? This is more or less just an idea that came first to my mind when hearing about bus bandwidth optimizations. I suspect that sooner or later we just learn about usecases where the optimization function maximizes not only bandwidth but also cost for that bandwidth. Consider a hosting system serving different workloads each paying different QoS. Do I know about anybody requiring that now? No! But we should really test the proposed interface for potential future extensions. If such an extension is not reasonable and/or we can achieve that by different means then great.
On Wed 01-11-23 10:21:47, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: [...] > > Well, I am not convinced about that TBH. Sure it is probably a good fit > > for this specific CXL usecase but it just doesn't fit into many others I > > can think of - e.g. proportional use of those tiers based on the > > workload - you get what you pay for. > > For "pay", per my understanding, we need some cgroup based > per-memory-tier (or per-node) usage limit. The following patchset is > the first step for that. > > https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ Why do we need a sysfs interface if there are plans for cgroup API?
On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote: > On Tue 31-10-23 00:27:04, Gregory Price wrote: [... snip ...] > > > > The downside of doing it in mempolicy is... > > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a > > non-trivial task. It is very "current-task" centric. > > True. Cpusets is the way to make it less process centric but that comes > with its own constains (namely which NUMA policies are supported). > > > 2) Barring a change to mempolicy to be sysfs friendly, the options for > > implementing weights in the mempolicy are either a) new flag and > > setting every weight individually in many syscalls, or b) a new > > syscall (set_mempolicy2), which is what I demonstrated in the RFC. > > Yes, that would likely require a new syscall. > > > 3) mempolicy is also subject to cgroup nodemasks, and as a result you > > end up with a rats nest of interactions between mempolicy nodemasks > > changing as a result of cgroup migrations, nodes potentially coming > > and going (hotplug under CXL), and others I'm probably forgetting. > > Is this really any different from what you are proposing though? > In only one manner: An external user can set the weight of a node that is added later on. If it is implemented in mempolicy, then this is not possible. Basically consider: `numactl --interleave=all ...` If `--weights=...`: when a node hotplug event occurs, there is no recourse for adding a weight for the new node (it will default to 1). Maybe the answer is "Best effort, sorry" and we don't handle that situation. That doesn't seem entirely unreasonable. At least with weights in node (or cgroup, or memtier, whatever) it provides the ability to set that weight outside the mempolicy context. > > weight, or should you reset it? If a new node comes into the node > > mask... what weight should you set? I did not have answers to these > > questions. > > I am not really sure I follow you. Are you talking about cpuset > nodemask changes or memory hotplug here. > Actually both - slightly different context. If the weights are implemented in mempolicy, if the cpuset nodemask changes then the mempolicy nodemask changes with it. If the node is removed from the system, I believe (need to validate this, but IIRC) the node will be removed from any registered cpusets. As a result, that falls down to mempolicy, and the node is removed. Not entirely sure what happens if a node is added. The only case where I think that is relevant is when cpuset is empty ("all") and mempolicy is set to something like `--interleave=all`. In this case, it's possible that the new node will simply have a default weight (1), and if weights are implemented in mempolicy only there is no recourse for changing it. > > It was recommended to explore placing it in tiers instead, so I took a > > crack at it here: > > > > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/ > > > > This had similar issue with the idea of hotplug nodes: if you give a > > tier a weight, and one or more of the nodes goes away/comes back... what > > should you do with the weight? Split it up among the remaining nodes? > > Rebalance? Etc. > > How is this any different from node becoming depleted? You cannot > really expect that you get memory you are asking for and you can easily > end up getting memory from a different node instead. > ... snip ... > Maybe I am missing something really crucial here but I do not see how > this fundamentally changes anything. > > Memory hotremove ... snip ... 
> Memory hotadd ... snip ... > But, that requires that interleave policy nodemask is assuming future > nodes going online and put them to the mask. > The difference is the nodemask changes in mempolicy and cpuset. If a node is removed entirely from the nodemask, and then it comes back (through cpuset or something), then "what do you do with it"? If memory is depleted but opens up later, the interleave policy starts working as intended again. If a node disappears and comes back... that bit of plumbing is a bit more complex. So yes, the "assuming future nodes going online and put them into the mask" is the concern I have. A node being added/removed from the nodemask poses specifically different plumbing issues than just depletion. If that's really not a concern and we're happy to just let it be OBO until an actual use case for handling node hotplug for weighting comes along, then mempolicy-based weighting alone seems more than sufficient. > > I am not against implementing it in mempolicy (as proof: my first RFC). > > I am simply searching for the acceptable way to implement it. > > > > One of the benefits of having it set as a global setting is that weights > > can be automatically generated from HMAT/HMEM information (ACPI tables) > > and programs already using MPOL_INTERLEAVE will have a direct benefit. > > Right. This is understood. My main concern is whether this outweighs > the limitations of having a _global_ policy _only_. Historically a single > global policy usually led to finding ways to make that more scoped > (usually through cgroups). >

Maybe the answer here is to put it in cgroups + mempolicy, and don't handle hotplug? This is an easy shift: move this patch to cgroups, and then pull my syscall patch forward to add weights directly to mempolicy. I think the interleave code stays pretty much the same; the only difference would be where the task gets the weight from:

if (policy->mode == WEIGHTED_INTERLEAVE)
        weight = pol->weight[target_node]
else
        cgroups.get_weight(from_node, target_node)

~Gregory
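To make the split Gregory sketches above concrete, here is a small self-contained model of the two possible weight sources; the structure and function names are hypothetical, not proposed kernel interfaces. A task-local policy either carries its own weight table, or the lookup falls through to an externally managed source-by-target table (cgroup- or sysfs-provided):

#include <stdio.h>

#define NR_NODES 4

/* stand-in for a per-task weighted-interleave policy */
struct policy {
	int has_local_weights;              /* MPOL_WEIGHTED_INTERLEAVE-like */
	unsigned int weights[NR_NODES];     /* per-task weight vector */
};

/* stand-in for a cgroup- or sysfs-managed source x target weight matrix */
static const unsigned int global_weights[NR_NODES][NR_NODES] = {
	[0] = { 60, 20, 10, 10 },
	[1] = { 20, 60, 10, 10 },
};

static unsigned int il_weight_for(const struct policy *pol,
				  int from_node, int target_node)
{
	if (pol->has_local_weights)
		return pol->weights[target_node];
	return global_weights[from_node][target_node];
}

int main(void)
{
	struct policy pol = { .has_local_weights = 0 };

	/* a task running on node 0 asking for the weight of node 2 */
	printf("weight(0 -> 2) = %u\n", il_weight_for(&pol, 0, 2));
	return 0;
}

As Gregory notes, the allocation path itself would be unchanged either way; only where the weight is looked up differs.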
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > >> > This hopefully also explains why it's a global setting. The usecase is >> > different from conventional NUMA interleaving, which is used as a >> > locality measure: spread shared data evenly between compute >> > nodes. This one isn't about locality - the CXL tier doesn't have local >> > compute. Instead, the optimal spread is based on hardware parameters, >> > which is a global property rather than a per-workload one. >> >> Well, I am not convinced about that TBH. Sure it is probably a good fit >> for this specific CXL usecase but it just doesn't fit into many others I >> can think of - e.g. proportional use of those tiers based on the >> workload - you get what you pay for. >> >> Is there any specific reason for not having a new interleave interface >> which defines weights for the nodemask? Is this because the policy >> itself is very dynamic or is this more driven by simplicity of use? >> > > I had originally implemented it this way while experimenting with new > mempolicies. > > https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/ > > The downside of doing it in mempolicy is... > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a > non-trivial task. It is very "current-task" centric. > > 2) Barring a change to mempolicy to be sysfs friendly, the options for > implementing weights in the mempolicy are either a) new flag and > setting every weight individually in many syscalls, or b) a new > syscall (set_mempolicy2), which is what I demonstrated in the RFC. > > 3) mempolicy is also subject to cgroup nodemasks, and as a result you > end up with a rats nest of interactions between mempolicy nodemasks > changing as a result of cgroup migrations, nodes potentially coming > and going (hotplug under CXL), and others I'm probably forgetting. > > Basically: If a node leaves the nodemask, should you retain the > weight, or should you reset it? If a new node comes into the node > mask... what weight should you set? I did not have answers to these > questions. > > > It was recommended to explore placing it in tiers instead, so I took a > crack at it here: > > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/ > > This had similar issue with the idea of hotplug nodes: if you give a > tier a weight, and one or more of the nodes goes away/comes back... what > should you do with the weight? Split it up among the remaining nodes? > Rebalance? Etc. The weight of a tier can be defined as the weight of one node of the tier instead of the weight of all nodes of the tier. That is, for a system as follows, tier 0: node 0, node 1; weight=4 tier 1: node 2, node 3; weight=1 If you run workload with `numactl --weighted-interleave -n 0,2,3`, the proportion will be: "4:0:1:1" on each node. While for `numactl --weighted-interleave -n 0,2`, it will be: "4:0:1:0". -- Best Regards, Huang, Ying > The result of this discussion lead us to simply say "What if we place > the weights directly in the node". And that lead us to this RFC. > > > I am not against implementing it in mempolicy (as proof: my first RFC). > I am simply searching for the acceptable way to implement it. > > One of the benefits of having it set as a global setting is that weights > can be automatically generated from HMAT/HMEM information (ACPI tables) > and programs already using MPOL_INTERLEAVE will have a direct benefit. 
> > I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added > along side this patch so that MPOL_INTERLEAVE is left entirely alone. > > Happy to discuss more, > ~Gregory
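A small worked model of the per-tier semantics Ying describes in the message above (illustrative names only, not the memory-tiers code): the tier weight applies to each node of the tier individually, and nodes outside the allowed mask contribute nothing, so `-n 0,2,3` yields 4:0:1:1 and `-n 0,2` yields 4:0:1:0.

#include <stdio.h>

#define NR_NODES 4

static const int node_tier[NR_NODES] = { 0, 0, 1, 1 };  /* tier 0: DRAM, tier 1: CXL */
static const int tier_weight[2]      = { 4, 1 };

static void print_effective(const int allowed[NR_NODES])
{
	for (int n = 0; n < NR_NODES; n++)
		printf("%d%s", allowed[n] ? tier_weight[node_tier[n]] : 0,
		       n == NR_NODES - 1 ? "\n" : ":");
}

int main(void)
{
	const int mask_023[NR_NODES] = { 1, 0, 1, 1 };  /* -n 0,2,3 */
	const int mask_02[NR_NODES]  = { 1, 0, 1, 0 };  /* -n 0,2   */

	print_effective(mask_023);   /* prints 4:0:1:1 */
	print_effective(mask_02);    /* prints 4:0:1:0 */
	return 0;
}

Note that under this definition removing a node from the mask leaves the surviving nodes' weights untouched, which appears to be the property this per-node reading of the tier weight is meant to buy.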
On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: > On Wed 01-11-23 12:58:55, Gregory Price wrote: > > Basically consider: `numactl --interleave=all ...` > > > > If `--weights=...`: when a node hotplug event occurs, there is no > > recourse for adding a weight for the new node (it will default to 1). > > Correct and this is what I was asking about in an earlier email. How > much do we really need to consider this setup. Is this something nice to > have or does the nature of the technology require it to be fully dynamic > and expect new nodes coming up at any moment? >

Dynamic Capacity is expected to cause a numa node to change size (in number of memory blocks) rather than cause numa nodes to come and go, so maybe handling the full node hotplug is a bit of an overreach. Good call, I'll stop considering this problem for now.

> > If the node is removed from the system, I believe (need to validate > > this, but IIRC) the node will be removed from any registered cpusets. > > As a result, that falls down to mempolicy, and the node is removed. > > I do not think we do anything like that. Userspace might decide to > change the numa mask when a node is offlined but I do not think we do > anything like that automagically. >

mpol_rebind_policy called by update_tasks_nodemask
https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016

falls down from cpuset_hotplug_workfn:
https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771

/*
 * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
 * Call this routine anytime after node_states[N_MEMORY] changes.
 * See cpuset_update_active_cpus() for CPU hotplug handling.
 */
static int cpuset_track_online_nodes(struct notifier_block *self,
				     unsigned long action, void *arg)
{
	schedule_work(&cpuset_hotplug_work);
	return NOTIFY_OK;
}

void __init cpuset_init_smp(void)
{
	...
	hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
}

This causes 1 of 3 situations:
MPOL_F_STATIC_NODES: overwrite with (old & new)
MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
Default: either does a remap or replaces old with new.

My assumption based on this is that a hot-unplugged node would be completely removed. It doesn't look like hot-add is handled at all, so I can just drop that entirely for now (except adding a default weight of 1 in case it is ever added in the future).

I've been pushing against the weights being in memory-tiers.c for this reason, as a weight set per-tier is meaningless if a node disappears. Example: a tier has 2 nodes with some weight N split between them, such that interleave gives each node N/2 pages. If 1 node is removed, the remaining node gets N pages, which is twice the allocation. Presumably a node is an abstraction of 1 or more devices, therefore if the node is removed, the weight should change.

You could handle hotplug in tiers, but if a node being hotplugged forcibly removes the node from cpusets and mempolicy nodemasks, then it's irrelevant since the node can never get selected for allocation anyway.

It's looking more like cgroups is the right place to put this.

> > Moving the global policy to cgroups would make the main concern of > different workloads looking for different policy less problematic. > I didn't have much time to think that through but the main question is > how to sanely define hierarchical properties of those weights?
This is > more of a resource distribution than enforcement so maybe a simple > inherit or overwrite (if you have a more specific needs) semantic makes > sense and it is sufficient. > As a user I would assume it would operate much the same way as other nested cgroups, which is inherit by default (with subsets) or an explicit overwrite that can't exceed the higher level settings. Weights could arguably allow different settings than capacity controls, but that could be an extension. > This is not as much about the code as it is about the proper interface > because that will get cast in stone once introduced. It would be really > bad to realize that we have a global policy that doesn't fit well and > have hard time to work it around without breaking anybody. o7 I concur now. I'll take some time to rework this into a cgroups+mempolicy proposal based on my earlier RFCs. ~Gregory
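As a toy model of the "inherit by default, overwrite explicitly" semantic discussed above (purely hypothetical; no such cgroup interface exists), a group without its own weight table could simply use the nearest ancestor that has one, falling back to a system default, e.g. one generated from HMAT:

#include <stddef.h>
#include <stdio.h>

#define NR_NODES 4

/* hypothetical cgroup-like group carrying an optional weight table */
struct wgroup {
	struct wgroup *parent;
	const unsigned int *weights;   /* NULL means "inherit from parent" */
};

static const unsigned int *effective_weights(const struct wgroup *g,
					     const unsigned int *root_default)
{
	/* walk up until a group provides its own table */
	for (; g; g = g->parent)
		if (g->weights)
			return g->weights;
	return root_default;           /* e.g. HMAT/sysfs-derived default */
}

int main(void)
{
	static const unsigned int root_w[NR_NODES]  = { 60, 20, 10, 10 };
	static const unsigned int child_w[NR_NODES] = { 50, 30, 10, 10 };

	struct wgroup root  = { .parent = NULL,  .weights = NULL };
	struct wgroup a     = { .parent = &root, .weights = child_w };
	struct wgroup a_sub = { .parent = &a,    .weights = NULL };

	/* a_sub overrides nothing, so it inherits a's table: prints 50 */
	printf("a_sub node0 weight: %u\n", effective_weights(&a_sub, root_w)[0]);
	return 0;
}

Whether an override should be allowed to exceed the parent's values, as Gregory raises above in analogy with other nested cgroup controls, is left open by the discussion.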
Michal Hocko <mhocko@suse.com> writes: > On Wed 01-11-23 10:21:47, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: > [...] >> > Well, I am not convinced about that TBH. Sure it is probably a good fit >> > for this specific CXL usecase but it just doesn't fit into many others I >> > can think of - e.g. proportional use of those tiers based on the >> > workload - you get what you pay for. >> >> For "pay", per my understanding, we need some cgroup based >> per-memory-tier (or per-node) usage limit. The following patchset is >> the first step for that. >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ > > Why do we need a sysfs interface if there are plans for cgroup API? They are for different target. The cgroup API proposed here is to constrain the DRAM usage in a system with DRAM and CXL memory. The less you pay, the less DRAM and more CXL memory you use. -- Best Regards, Huang, Ying
Michal Hocko <mhocko@suse.com> writes: > On Tue 31-10-23 12:22:16, Johannes Weiner wrote: >> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > [...] >> > Is there any specific reason for not having a new interleave interface >> > which defines weights for the nodemask? Is this because the policy >> > itself is very dynamic or is this more driven by simplicity of use? >> >> A downside of *requiring* weights to be paired with the mempolicy is >> that it's then the application that would have to figure out the >> weights dynamically, instead of having a static host configuration. A >> policy of "I want to be spread for optimal bus bandwidth" translates >> between different hardware configurations, but optimal weights will >> vary depending on the type of machine a job runs on. > > I can imagine this could be achieved by numactl(8) so that the process > management tool could set this up for the process on the start up. Sure > it wouldn't be very dynamic after then and that is why I was asking > about how dynamic the situation might be in practice. > >> That doesn't mean there couldn't be usecases for having weights as >> policy as well in other scenarios, like you allude to above. It's just >> so far such usecases haven't really materialized or spelled out >> concretely. Maybe we just want both - a global default, and the >> ability to override it locally. Could you elaborate on the 'get what >> you pay for' usecase you mentioned? > > This is more or less just an idea that came first to my mind when > hearing about bus bandwidth optimizations. I suspect that sooner or > later we just learn about usecases where the optimization function > maximizes not only bandwidth but also cost for that bandwidth. Consider > a hosting system serving different workloads each paying different > QoS. I don't think pure software solution can enforce the memory bandwidth allocation. For that, we will need something like MBA (Memory Bandwidth Allocation) as in the following URL, https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html At lease, something like MBM (Memory Bandwidth Monitoring) as in the following URL will be needed. https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html The interleave solution helps the cooperative workloads only. > Do I know about anybody requiring that now? No! But we should really > test the proposed interface for potential future extensions. If such an > extension is not reasonable and/or we can achieve that by different > means then great. -- Best Regards, Huang, Ying
Ravi Jonnalagadda <ravis.opensrc@micron.com> writes: >>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: >>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote: >>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote: >>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote: >> >>[snip] >> >>>> >>>> > This hopefully also explains why it's a global setting. The usecase is >>>> > different from conventional NUMA interleaving, which is used as a >>>> > locality measure: spread shared data evenly between compute >>>> > nodes. This one isn't about locality - the CXL tier doesn't have local >>>> > compute. Instead, the optimal spread is based on hardware parameters, >>>> > which is a global property rather than a per-workload one. >>>> >>>> Well, I am not convinced about that TBH. Sure it is probably a good fit >>>> for this specific CXL usecase but it just doesn't fit into many others I >>>> can think of - e.g. proportional use of those tiers based on the >>>> workload - you get what you pay for. >>>> >>>> Is there any specific reason for not having a new interleave interface >>>> which defines weights for the nodemask? Is this because the policy >>>> itself is very dynamic or is this more driven by simplicity of use? >>> >>> A downside of *requiring* weights to be paired with the mempolicy is >>> that it's then the application that would have to figure out the >>> weights dynamically, instead of having a static host configuration. A >>> policy of "I want to be spread for optimal bus bandwidth" translates >>> between different hardware configurations, but optimal weights will >>> vary depending on the type of machine a job runs on. >>> >>> That doesn't mean there couldn't be usecases for having weights as >>> policy as well in other scenarios, like you allude to above. It's just >>> so far such usecases haven't really materialized or spelled out >>> concretely. Maybe we just want both - a global default, and the >>> ability to override it locally. >> >>I think that this is a good idea. The system-wise configuration with >>reasonable default makes applications life much easier. If more control >>is needed, some kind of workload specific configuration can be added. > > Glad that we are in agreement here. For bandwidth expansion use cases > that this interleave patchset is trying to cater to, most applications > would have to follow the "reasanable defaults" for weights. > The necessity for applications to choose different weights while > interleaving would probably be to do capacity expansion which the > default memory tiering implementation would anyway support and provide > better latency. > >>And, instead of adding another memory policy, a cgroup-wise >>configuration may be easier to be used. The per-workload weight may >>need to be adjusted when we deploying different combination of workloads >>in the system. >> >>Another question is that should the weight be per-memory-tier or >>per-node? In this patchset, the weight is per-source-target-node >>combination. That is, the weight becomes a matrix instead of a vector. >>IIUC, this is used to control cross-socket memory access in addition to >>per-memory-type memory access. Do you think the added complexity is >>necessary? > > Pros and Cons of Node based interleave: > Pros: > 1. Weights can be defined for devices with different bandwidth and latency > characteristics individually irrespective of which tier they fall into. > 2. 
Defining the weight per-source-target-node would be necessary for multi > socket systems where few devices may be closer to one socket rather than other. > Cons: > 1. Weights need to be programmed for all the nodes which can be tedious for > systems with lot of NUMA nodes. 2. More complex, so need justification, for example, practical use case. > Pros and Cons of Memory Tier based interleave: > Pros: > 1. Programming weight per initiator would apply for all the nodes in the tier. > 2. Weights can be calculated considering the cumulative bandwidth of all > the nodes in the tier and need to be programmed once for all the nodes in a > given tier. > 3. It may be useful in cases where numa nodes with similar latency and bandwidth > characteristics increase, possibly with pooling use cases. 4. simpler. > Cons: > 1. If nodes with different bandwidth and latency characteristics are placed > in same tier as seen in the current mainline kernel, it will be difficult to > apply a correct interleave weight policy. > 2. There will be a need for functionality to move nodes between different tiers > or create new tiers to place such nodes for programming correct interleave weights. > We are working on a patch to support it currently. Thanks! If we have such system, we will need this. > 3. For systems where each numa node is having different characteristics, > a single node might end up existing in different memory tier, which would be > equivalent to node based interleaving. No. A node can only exist in one memory tier. > On newer systems where all CXL memory from different devices under a > port are combined to form single numa node, this scenario might be > applicable. You mean the different memory ranges of a NUMA node may have different performance? I don't think that we can deal with this. > 4. Users may need to keep track of different memory tiers and what nodes are present > in each tier for invoking interleave policy. I don't think this is a con. With node based solution, you need to know your system too. >> >>> Could you elaborate on the 'get what you pay for' usecase you >>> mentioned? >> -- Best Regards, Huang, Ying
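(Purely as an illustration of the "weights calculated considering the cumulative bandwidth" idea discussed above: a minimal, self-contained sketch, not kernel code. The node count, the bandwidth figures and the helper names are all made up; real defaults would presumably be derived from HMAT/CDAT data.)

#include <stdio.h>

#define NR_NODES 3

/* Hypothetical per-node bandwidth in GB/s, e.g. as reported via HMAT. */
static unsigned int node_bw[NR_NODES] = { 256, 256, 64 };   /* 2x DRAM, 1x CXL */

static unsigned int gcd(unsigned int a, unsigned int b)
{
        while (b) {
                unsigned int t = a % b;

                a = b;
                b = t;
        }
        return a;
}

/* Reduce raw bandwidths to small interleave weights (256:256:64 -> 4:4:1). */
static void derive_weights(unsigned int *weight)
{
        unsigned int g = node_bw[0];
        int i;

        for (i = 1; i < NR_NODES; i++)
                g = gcd(g, node_bw[i]);
        for (i = 0; i < NR_NODES; i++)
                weight[i] = node_bw[i] / g;
}

int main(void)
{
        unsigned int weight[NR_NODES];
        int i;

        derive_weights(weight);
        for (i = 0; i < NR_NODES; i++)
                printf("node %d: weight %u\n", i, weight[i]);
        return 0;
}

With the 256:256:64 GB/s split assumed above this reduces to 4:4:1, i.e. interleave would hand out four pages on each DRAM node for every page placed on the CXL node, tracking the bandwidth ratio.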
On Thu 02-11-23 14:11:09, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Wed 01-11-23 10:21:47, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > > [...] > >> > Well, I am not convinced about that TBH. Sure it is probably a good fit > >> > for this specific CXL usecase but it just doesn't fit into many others I > >> > can think of - e.g. proportional use of those tiers based on the > >> > workload - you get what you pay for. > >> > >> For "pay", per my understanding, we need some cgroup based > >> per-memory-tier (or per-node) usage limit. The following patchset is > >> the first step for that. > >> > >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ > > > > Why do we need a sysfs interface if there are plans for cgroup API? > > They are for different target. The cgroup API proposed here is to > constrain the DRAM usage in a system with DRAM and CXL memory. The less > you pay, the less DRAM and more CXL memory you use. Right, but why the usage distribution requires its own interface and cannot be combined with the access control part of it?
On Thu 02-11-23 14:21:49, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Tue 31-10-23 12:22:16, Johannes Weiner wrote: > >> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote: > > [...] > >> > Is there any specific reason for not having a new interleave interface > >> > which defines weights for the nodemask? Is this because the policy > >> > itself is very dynamic or is this more driven by simplicity of use? > >> > >> A downside of *requiring* weights to be paired with the mempolicy is > >> that it's then the application that would have to figure out the > >> weights dynamically, instead of having a static host configuration. A > >> policy of "I want to be spread for optimal bus bandwidth" translates > >> between different hardware configurations, but optimal weights will > >> vary depending on the type of machine a job runs on. > > > > I can imagine this could be achieved by numactl(8) so that the process > > management tool could set this up for the process on the start up. Sure > > it wouldn't be very dynamic after then and that is why I was asking > > about how dynamic the situation might be in practice. > > > >> That doesn't mean there couldn't be usecases for having weights as > >> policy as well in other scenarios, like you allude to above. It's just > >> so far such usecases haven't really materialized or spelled out > >> concretely. Maybe we just want both - a global default, and the > >> ability to override it locally. Could you elaborate on the 'get what > >> you pay for' usecase you mentioned? > > > > This is more or less just an idea that came first to my mind when > > hearing about bus bandwidth optimizations. I suspect that sooner or > > later we just learn about usecases where the optimization function > > maximizes not only bandwidth but also cost for that bandwidth. Consider > > a hosting system serving different workloads each paying different > > QoS. > > I don't think pure software solution can enforce the memory bandwidth > allocation. For that, we will need something like MBA (Memory Bandwidth > Allocation) as in the following URL, > > https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html > > At lease, something like MBM (Memory Bandwidth Monitoring) as in the > following URL will be needed. > > https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html > > The interleave solution helps the cooperative workloads only. Enforcement is an orthogonal thing IMO. We are talking about a best effort interface.
Whether the node based interleave solution should be considered complex or not would probably depend on the number of NUMA nodes present in the system and whether we are able to set up the default weights correctly to obtain optimum bandwidth expansion. > >> Pros and Cons of Memory Tier based interleave: >> Pros: >> 1. Programming weight per initiator would apply for all the nodes in the tier. >> 2. Weights can be calculated considering the cumulative bandwidth of all >> the nodes in the tier and need to be programmed once for all the nodes in a >> given tier. >> 3. It may be useful in cases where numa nodes with similar latency and bandwidth >> characteristics increase, possibly with pooling use cases. > >4. simpler. > >> Cons: >> 1. If nodes with different bandwidth and latency characteristics are placed >> in same tier as seen in the current mainline kernel, it will be difficult to >> apply a correct interleave weight policy. >> 2. There will be a need for functionality to move nodes between different tiers >> or create new tiers to place such nodes for programming correct interleave weights. >> We are working on a patch to support it currently. > >Thanks! If we have such system, we will need this. > >> 3. For systems where each numa node is having different characteristics, >> a single node might end up existing in different memory tier, which would be >> equivalent to node based interleaving. > >No. A node can only exist in one memory tier. > > Sorry for the confusion, what I meant was, if each node is having different > characteristics, to program the memory tier weights correctly we need to place > each node in a separate tier of its own. So each memory tier will contain > only a single node and the solution would resemble node based interleaving. > >> On newer systems where all CXL memory from different devices under a >> port are combined to form single numa node, this scenario might be >> applicable. > >You mean the different memory ranges of a NUMA node may have different >performance? I don't think that we can deal with this. > > Example Configuration: On a server that we are using now, four different > CXL cards are combined to form a single NUMA node and two other cards are > exposed as two individual NUMA nodes. > So if we have the ability to combine multiple CXL memory ranges to a > single NUMA node, the number of NUMA nodes in the system would potentially > decrease even if we can't combine the entire range to form a single node. > >> 4. Users may need to keep track of different memory tiers and what nodes are present >> in each tier for invoking interleave policy. > >I don't think this is a con. With node based solution, you need to know >your system too. > >>> >>>> Could you elaborate on the 'get what you pay for' usecase you >>>> mentioned? >>> > >-- >Best Regards, >Huang, Ying -- Best Regards, Ravi Jonnalagadda
On Wed 01-11-23 12:58:55, Gregory Price wrote: > On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote: > > On Tue 31-10-23 00:27:04, Gregory Price wrote: > [... snip ...] > > > > > > The downside of doing it in mempolicy is... > > > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a > > > non-trivial task. It is very "current-task" centric. > > > > True. Cpusets is the way to make it less process centric but that comes > > with its own constains (namely which NUMA policies are supported). > > > > > 2) Barring a change to mempolicy to be sysfs friendly, the options for > > > implementing weights in the mempolicy are either a) new flag and > > > setting every weight individually in many syscalls, or b) a new > > > syscall (set_mempolicy2), which is what I demonstrated in the RFC. > > > > Yes, that would likely require a new syscall. > > > > > 3) mempolicy is also subject to cgroup nodemasks, and as a result you > > > end up with a rats nest of interactions between mempolicy nodemasks > > > changing as a result of cgroup migrations, nodes potentially coming > > > and going (hotplug under CXL), and others I'm probably forgetting. > > > > Is this really any different from what you are proposing though? > > > > In only one manner: An external user can set the weight of a node that > is added later on. If it is implemented in mempolicy, then this is not > possible. > > Basically consider: `numactl --interleave=all ...` > > If `--weights=...`: when a node hotplug event occurs, there is no > recourse for adding a weight for the new node (it will default to 1). Correct and this is what I was asking about in an earlier email. How much do we really need to consider this setup. Is this something nice to have or does the nature of the technology requires to be fully dynamic and expect new nodes coming up at any moment? > Maybe the answer is "Best effort, sorry" and we don't handle that > situation. That doesn't seem entirely unreasonable. > > At least with weights in node (or cgroup, or memtier, whatever) it > provides the ability to set that weight outside the mempolicy context. > > > > weight, or should you reset it? If a new node comes into the node > > > mask... what weight should you set? I did not have answers to these > > > questions. > > > > I am not really sure I follow you. Are you talking about cpuset > > nodemask changes or memory hotplug here. > > > > Actually both - slightly different context. > > If the weights are implemented in mempolicy, if the cpuset nodemask > changes then the mempolicy nodemask changes with it. > > If the node is removed from the system, I believe (need to validate > this, but IIRC) the node will be removed from any registered cpusets. > As a result, that falls down to mempolicy, and the node is removed. I do not think we do anything like that. Userspace might decide to change the numa mask when a node is offlined but I do not think we do anything like that automagically. > Not entirely sure what happens if a node is added. The only case where > I think that is relevant is when cpuset is empty ("all") and mempolicy > is set to something like `--interleave=all`. In this case, it's > possible that the new node will simply have a default weight (1), and if > weights are implemented in mempolicy only there is no recourse for changing > it. That is what I would expect. [...] > > Right. This is understood. My main concern is whether this is outweights > > the limitations of having a _global_ policy _only_. 
Historically a single > > global policy usually led to finding ways how to make that more scoped > > (usually through cgroups). > > Maybe the answer here is to put it in cgroups + mempolicy, and don't handle > hotplug? This is an easy shift of this patch to cgroups, and then > pulling my syscall patch forward to add weights directly to mempolicy. Moving the global policy to cgroups would make the main concern of different workloads looking for different policy less problematic. I didn't have much time to think that through but the main question is how to sanely define hierarchical properties of those weights? This is more of a resource distribution than enforcement so maybe a simple inherit or overwrite (if you have more specific needs) semantic makes sense and it is sufficient. > I think the interleave code stays pretty much the same, the only > difference would be where the task gets the weight from: > > if (policy->mode == WEIGHTED_INTERLEAVE) > weight = pol->weight[target_node] > else > cgroups.get_weight(from_node, target_node) > > ~Gregory This is not as much about the code as it is about the proper interface because that will get cast in stone once introduced. It would be really bad to realize that we have a global policy that doesn't fit well and have a hard time working around it without breaking anybody.
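(To make the weight-lookup snippet above a little more concrete: a self-contained sketch of how a weighted round-robin could consume weights from either source. The struct layout, cgroup_get_weight() and the fixed node count are invented for illustration, not proposed interfaces.)

#include <stdbool.h>

#define NR_NODES 4

struct weighted_policy {
        bool use_policy_weights;          /* weights carried by the mempolicy itself    */
        unsigned int weight[NR_NODES];    /* per-node weights, 0 = node not in the mask */
        unsigned int cur_node;            /* interleave cursor                          */
        unsigned int cur_left;            /* allocations left on cur_node               */
};

/* Stand-in for a per-cgroup default weight table. */
static unsigned int cgroup_get_weight(unsigned int node)
{
        static const unsigned int cg_weight[NR_NODES] = { 4, 4, 1, 0 };

        return cg_weight[node];
}

static unsigned int effective_weight(const struct weighted_policy *pol,
                                     unsigned int node)
{
        if (pol->use_policy_weights)
                return pol->weight[node];
        return cgroup_get_weight(node);
}

/*
 * Pick the node for the next allocation: stay on a node until its weight
 * is used up, then move on.  Assumes at least one node has a non-zero
 * effective weight.
 */
static unsigned int weighted_interleave_next(struct weighted_policy *pol)
{
        while (pol->cur_left == 0) {
                pol->cur_node = (pol->cur_node + 1) % NR_NODES;
                pol->cur_left = effective_weight(pol, pol->cur_node);
        }
        pol->cur_left--;
        return pol->cur_node;
}

With the 4/4/1/0 cgroup weights assumed above, this hands out four allocations on each of the first two nodes for every one on the third, and skips the zero-weight node entirely.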
> >> On newer systems where all CXL memory from different devices under a port are combined to form single numa node, this scenario might be applicable. > > >You mean the different memory ranges of a NUMA node may have different > >performance? I don't think that we can deal with this. > > Example Configuration: On a server that we are using now, four different > CXL cards are combined to form a single NUMA node and two other cards are > exposed as two individual numa nodes. > So if we have the ability to combine multiple CXL memory ranges to a > single NUMA node the number of NUMA nodes in the system would potentially > decrease even if we can't combine the entire range to form a single node. > If it's in control of the kernel: today, for CXL, NUMA nodes are defined by CXL Fixed Memory Windows rather than the individual characteristics of devices that might be accessed from those windows. That's a useful simplification to get things going and it's not clear how the QoS aspects of CFMWS will be used. So will we always have enough windows with fine enough granularity coming from the _DSM QTG magic that they don't end up with different performance devices (or topologies) within each one? No idea. It's a bunch of trade-offs of where the complexity lies and how much memory is being provided over CXL vs physical address space exhaustion. Long term, my guess is we'll need to support something more sophisticated with dynamic 'creation' of NUMA nodes (or something that looks like that anyway) so we can always have a separate one for each significantly different set of memory access characteristics. If they are coming from ACPI that's already required by the specification. This space is going to continue getting more complex. Upshot is that I wouldn't focus too much on the possibility of a NUMA node having devices with very different memory access characteristics in it. That's a quirk of today's world that we can and should look to fix. If your BIOS is setting this up for you and presenting them in SRAT / HMAT etc then it's not complying with the ACPI spec. Jonathan
On Fri, Nov 03, 2023 at 10:56:01AM +0100, Michal Hocko wrote: > On Wed 01-11-23 23:18:59, Gregory Price wrote: > > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: > > > On Wed 01-11-23 12:58:55, Gregory Price wrote: > > > > Basically consider: `numactl --interleave=all ...` > > > > > > > > If `--weights=...`: when a node hotplug event occurs, there is no > > > > recourse for adding a weight for the new node (it will default to 1). > > > > > > Correct and this is what I was asking about in an earlier email. How > > > much do we really need to consider this setup. Is this something nice to > > > have or does the nature of the technology requires to be fully dynamic > > > and expect new nodes coming up at any moment? > > > > > > > Dynamic Capacity is expected to cause a numa node to change size (in > > number of memory blocks) rather than cause numa nodes to come and go, so > > maybe handling the full node hotplug is a bit of an overreach. > > > > Good call, I'll stop considering this problem for now. > > > > > > If the node is removed from the system, I believe (need to validate > > > > this, but IIRC) the node will be removed from any registered cpusets. > > > > As a result, that falls down to mempolicy, and the node is removed. > > > > > > I do not think we do anything like that. Userspace might decide to > > > change the numa mask when a node is offlined but I do not think we do > > > anything like that automagically. > > > > > > > mpol_rebind_policy called by update_tasks_nodemask > > https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319 > > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016 > > > > falls down from cpuset_hotplug_workfn: > > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771 > > Ohh, have missed that. Thanks for the reference. Quite honestly I am not > sure this code is really a) necessary and b) ever exercised. For the > former I would argue that offline node could be treated as completely > depleted one. From the correctness POV it shouldn't make any difference > and I am rather skeptical it would have performance improvements. The only thing I'm not sure of is what happens if mempolicy is allowed to select a node that doesn't exist. I could hack up a contrived test, but I don't think the state is reachable at the moment. More importantly, the rebind code is needed for task migration and for allowing the cpusets to be changeable. From the perspective of mempolicy, a node being hotplugged and the nodemask being changed due to cgroup cpuset changing looks very similar and comes with the same question: What do I do about weights when a change to the effective nodemask is made? This is why I'm falling toward "cgroups seem about right", because we can make mempolicy ask cgroups for the weight, and also allow mempolicy to carry its own explicit weight array - which allows for flexibility. I think this may end up generalizing to a cgroup-wide mempolicy interface a la cgroup/mempolicy/[policy, nodemask, weights, ...] but one thing at a time :] > for the latter, full node offlines are really rare from experience. I > would be interested about actual real life usecases which do that Yeah, I'm just going to drop this from my requirement list and go OBO; for areas where I see it may cause an issue (potential for 0-weights) I will do something simple (initialize weights to 1), but otherwise I think it's too much to expect from the kernel.
> > > > As a user I would assume it would operate much the same way as other > > nested cgroups, which is inherit by default (with subsets) or an > > explicit overwrite that can't exceed the higher level settings. > > This would make it rather impractical because a default (everything set > to 1) would be cast in stone. As mentioned above this is not an > enforcement limit. So I _think_ that a simple hierarchical rule like > cgroup_interleaving_mask(cgroup) > interleaving_mask = (cgroup->interleaving_mask) ? : cgroup_interleaving_mask(parent_cgroup(cgroup)) > > So child cgroups could overwrite parent as they wish. If there is any > enforcement (like a cpuset) that would filter usable nodes and the > allocation policy would simply apply weights on those. > Sorry yes, this is what I intended, I'm just bad at words. ~Gregory
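(A minimal sketch of the "retain what survives, default new nodes to 1" handling discussed above, using a flat weight array and a plain bitmask as a stand-in for the effective nodemask; both are made up for the example, not an existing interface.)

#define NR_NODES 8

/*
 * Rebind the weight array after the effective nodemask changed (cpuset
 * update or hotplug): nodes that stay keep their weight, nodes that join
 * get a default weight of 1, nodes that left are cleared.
 */
static void rebind_weights(unsigned int *weight,
                           unsigned long old_mask, unsigned long new_mask)
{
        unsigned int node;

        for (node = 0; node < NR_NODES; node++) {
                unsigned long bit = 1UL << node;

                if (!(new_mask & bit))
                        weight[node] = 0;       /* node no longer in the mask   */
                else if (!(old_mask & bit) || weight[node] == 0)
                        weight[node] = 1;       /* new (or never weighted) node */
                /* else: node stays in the mask, keep its configured weight */
        }
}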
Ravi Jonnalagadda <ravis.opensrc@micron.com> writes: > Should Node based interleave solution be considered complex or not would probably > depend on number of numa nodes that would be present in the system and whether > we are able to setup the default weights correctly to obtain optimum bandwidth > expansion. Node based interleave is more complex than tier based interleave. Because you have less tiers than nodes in general. >> >>> Pros and Cons of Memory Tier based interleave: >>> Pros: >>> 1. Programming weight per initiator would apply for all the nodes in the tier. >>> 2. Weights can be calculated considering the cumulative bandwidth of all >>> the nodes in the tier and need to be programmed once for all the nodes in a >>> given tier. >>> 3. It may be useful in cases where numa nodes with similar latency and bandwidth >>> characteristics increase, possibly with pooling use cases. >> >>4. simpler. >> >>> Cons: >>> 1. If nodes with different bandwidth and latency characteristics are placed >>> in same tier as seen in the current mainline kernel, it will be difficult to >>> apply a correct interleave weight policy. >>> 2. There will be a need for functionality to move nodes between different tiers >>> or create new tiers to place such nodes for programming correct interleave weights. >>> We are working on a patch to support it currently. >> >>Thanks! If we have such system, we will need this. >> >>> 3. For systems where each numa node is having different characteristics, >>> a single node might end up existing in different memory tier, which would be >>> equivalent to node based interleaving. >> >>No. A node can only exist in one memory tier. > > Sorry for the confusion what i meant was, if each node is having different > characteristics, to program the memory tier weights correctly we need to place > each node in a separate tier of it's own. So each memory tier will contain > only a single node and the solution would resemble node based interleaving. > >> >>> On newer systems where all CXL memory from different devices under a >>> port are combined to form single numa node, this scenario might be >>> applicable. >> >>You mean the different memory ranges of a NUMA node may have different >>performance? I don't think that we can deal with this. > > Example Configuration: On a server that we are using now, four different > CXL cards are combined to form a single NUMA node and two other cards are > exposed as two individual numa nodes. > So if we have the ability to combine multiple CXL memory ranges to a > single NUMA node the number of NUMA nodes in the system would potentially > decrease even if we can't combine the entire range to form a single node. Sorry, I misunderstand your words. Yes, it's possible that there one tier for each node in some systems. But I guess we will have less tiers than nodes in general. -- Best Regards, Huang, Ying >> >>> 4. Users may need to keep track of different memory tiers and what nodes are present >>> in each tier for invoking interleave policy. >> >>I don't think this is a con. With node based solution, you need to know >>your system too. >> >>>> >>>>> Could you elaborate on the 'get what you pay for' usecase you >>>>> mentioned? >>>> >> >>-- >>Best Regards, >>Huang, Ying > -- > Best Regards, > Ravi Jonnalagadda
Michal Hocko <mhocko@suse.com> writes: > On Thu 02-11-23 14:11:09, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Wed 01-11-23 10:21:47, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> > [...] >> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit >> >> > for this specific CXL usecase but it just doesn't fit into many others I >> >> > can think of - e.g. proportional use of those tiers based on the >> >> > workload - you get what you pay for. >> >> >> >> For "pay", per my understanding, we need some cgroup based >> >> per-memory-tier (or per-node) usage limit. The following patchset is >> >> the first step for that. >> >> >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ >> > >> > Why do we need a sysfs interface if there are plans for cgroup API? >> >> They are for different target. The cgroup API proposed here is to >> constrain the DRAM usage in a system with DRAM and CXL memory. The less >> you pay, the less DRAM and more CXL memory you use. > > Right, but why the usage distribution requires its own interface and > cannot be combined with the access control part of it? Per my understanding, they are orthogonal. Weighted-interleave is a memory allocation policy, other memory allocation policies include local first, etc. Usage limit is to constrain the usage of specific memory types (e.g. DRAM) for a cgroup. It can be used together with local first policy and some other memory allocation policy. -- Best Regards, Huang, Ying
Gregory Price <gregory.price@memverge.com> writes: > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: >> On Wed 01-11-23 12:58:55, Gregory Price wrote: >> > Basically consider: `numactl --interleave=all ...` >> > >> > If `--weights=...`: when a node hotplug event occurs, there is no >> > recourse for adding a weight for the new node (it will default to 1). >> >> Correct and this is what I was asking about in an earlier email. How >> much do we really need to consider this setup. Is this something nice to >> have or does the nature of the technology requires to be fully dynamic >> and expect new nodes coming up at any moment? >> > > Dynamic Capacity is expected to cause a numa node to change size (in > number of memory blocks) rather than cause numa nodes to come and go, so > maybe handling the full node hotplug is a bit of an overreach. Will node max bandwidth change with the number of memory blocks? > Good call, I'll stop considering this problem for now. > >> > If the node is removed from the system, I believe (need to validate >> > this, but IIRC) the node will be removed from any registered cpusets. >> > As a result, that falls down to mempolicy, and the node is removed. >> >> I do not think we do anything like that. Userspace might decide to >> change the numa mask when a node is offlined but I do not think we do >> anything like that automagically. >> > > mpol_rebind_policy called by update_tasks_nodemask > https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319 > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016 > > falls down from cpuset_hotplug_workfn: > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771 > > /* > * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY]. > * Call this routine anytime after node_states[N_MEMORY] changes. > * See cpuset_update_active_cpus() for CPU hotplug handling. > */ > static int cpuset_track_online_nodes(struct notifier_block *self, > unsigned long action, void *arg) > { > schedule_work(&cpuset_hotplug_work); > return NOTIFY_OK; > } > > void __init cpuset_init_smp(void) > { > ... > hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI); > } > > > Causes 1 of 3 situations: > MPOL_F_STATIC_NODES: overwrite with (old & new) > MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?) > Default: either does a remap or replaces old with new. > > My assumption based on this is that a hot-unplugged node would completely > be removed. Doesn't look like hot-add is handled at all, so I can just > drop that entirely for now (except add default weight of 1 incase it is > ever added in the future). > > I've been pushing agianst the weights being in memory-tiers.c for this > reason, as a weight set per-tier is meaningless if a node disappears. > > Example: Tier has 2 nodes with some weight N split between them, such > that interleave gives each node N/2 pages. If 1 node is removed, the > remaining node gets N pages, which is twice the allocation. Presumably > a node is an abstraction of 1 or more devices, therefore if the node is > removed, the weight should change. The per-tier weight can be defined as interleave weight of each node of the tier. Tier just groups NUMA nodes with similar performance. The performance (including bandwidth) is still per-node in the context of tier. If we have multiple nodes in one tier, this makes weight definition easier. 
> You could handle hotplug in tiers, but if a node being hotplugged forcibly > removes the node from cpusets and mempolicy nodemasks, then it's > irrelevant since the node can never get selected for allocation anyway. > > It's looking more like cgroups is the right place to put this. Having a cgroup/task level interface doesn't prevent us from having a system level interface to provide defaults for cgroups/tasks, where performance information (e.g., from HMAT) can help define a reasonable default automatically. >> >> Moving the global policy to cgroups would make the main concern of >> different workloads looking for different policy less problematic. >> I didn't have much time to think that through but the main question is >> how to sanely define hierarchical properties of those weights? This is >> more of a resource distribution than enforcement so maybe a simple >> inherit or overwrite (if you have more specific needs) semantic makes >> sense and it is sufficient. >> > > As a user I would assume it would operate much the same way as other > nested cgroups, which is inherit by default (with subsets) or an > explicit overwrite that can't exceed the higher level settings. > > Weights could arguably allow different settings than capacity controls, > but that could be an extension. > >> This is not as much about the code as it is about the proper interface >> because that will get cast in stone once introduced. It would be really >> bad to realize that we have a global policy that doesn't fit well and >> have a hard time working around it without breaking anybody. > > o7 I concur now. I'll take some time to rework this into a > cgroups+mempolicy proposal based on my earlier RFCs. -- Best Regards, Huang, Ying
On Fri 03-11-23 15:10:37, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 02-11-23 14:11:09, Huang, Ying wrote: > >> Michal Hocko <mhocko@suse.com> writes: > >> > >> > On Wed 01-11-23 10:21:47, Huang, Ying wrote: > >> >> Michal Hocko <mhocko@suse.com> writes: > >> > [...] > >> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit > >> >> > for this specific CXL usecase but it just doesn't fit into many others I > >> >> > can think of - e.g. proportional use of those tiers based on the > >> >> > workload - you get what you pay for. > >> >> > >> >> For "pay", per my understanding, we need some cgroup based > >> >> per-memory-tier (or per-node) usage limit. The following patchset is > >> >> the first step for that. > >> >> > >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ > >> > > >> > Why do we need a sysfs interface if there are plans for cgroup API? > >> > >> They are for different target. The cgroup API proposed here is to > >> constrain the DRAM usage in a system with DRAM and CXL memory. The less > >> you pay, the less DRAM and more CXL memory you use. > > > > Right, but why the usage distribution requires its own interface and > > cannot be combined with the access control part of it? > > Per my understanding, they are orthogonal. > > Weighted-interleave is a memory allocation policy, other memory > allocation policies include local first, etc. > > Usage limit is to constrain the usage of specific memory types > (e.g. DRAM) for a cgroup. It can be used together with local first > policy and some other memory allocation policy. Bad wording from me. Sorry for the confusion. Sure those are two orthogonal things and I didn't mean to suggest a single API to cover both. But if cgroup semantic can be reasonably defined for the usage enforcement can we put the interleaving behavior API under the same cgroup controller as well?
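(Purely to illustrate what folding the interleave behaviour into the same cgroup controller might look like from userspace: the memory.interleave_weights file name and the node:weight line format below are hypothetical, no such interface exists today.)

#include <stdio.h>

/* Hypothetical cgroup v2 file; name and format are made up for illustration. */
#define WEIGHT_FILE "/sys/fs/cgroup/workload/memory.interleave_weights"

int main(void)
{
        char line[64];
        FILE *f;

        /* Read back the current per-node weights, e.g. "0:4", "1:4", "2:1". */
        f = fopen(WEIGHT_FILE, "r");
        if (f) {
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
        }

        /* Override the weight of node 2 for this cgroup only. */
        f = fopen(WEIGHT_FILE, "w");
        if (f) {
                fprintf(f, "2:2\n");
                fclose(f);
        }
        return 0;
}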
On Wed 01-11-23 23:18:59, Gregory Price wrote: > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: > > On Wed 01-11-23 12:58:55, Gregory Price wrote: > > > Basically consider: `numactl --interleave=all ...` > > > > > > If `--weights=...`: when a node hotplug event occurs, there is no > > > recourse for adding a weight for the new node (it will default to 1). > > > > Correct and this is what I was asking about in an earlier email. How > > much do we really need to consider this setup. Is this something nice to > > have or does the nature of the technology requires to be fully dynamic > > and expect new nodes coming up at any moment? > > > > Dynamic Capacity is expected to cause a numa node to change size (in > number of memory blocks) rather than cause numa nodes to come and go, so > maybe handling the full node hotplug is a bit of an overreach. > > Good call, I'll stop considering this problem for now. > > > > If the node is removed from the system, I believe (need to validate > > > this, but IIRC) the node will be removed from any registered cpusets. > > > As a result, that falls down to mempolicy, and the node is removed. > > > > I do not think we do anything like that. Userspace might decide to > > change the numa mask when a node is offlined but I do not think we do > > anything like that automagically. > > > > mpol_rebind_policy called by update_tasks_nodemask > https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319 > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016 > > falls down from cpuset_hotplug_workfn: > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771 Ohh, have missed that. Thanks for the reference. Quite honestly I am not sure this code is really a) necessary and b) ever exercised. For the former I would argue that offline node could be treated as completely depleted one. From the correctness POV it shouldn't make any difference and I am rather skeptical it would have performance improvements. And for the latter, full node offlines are really rare from experience. I would be interested about actual real life usecases which do that regularly. I do remember a certain HW vendor working on a hotpluggable system (both CPUs and memory) to reduce downtimes caused by misbehaving CPUs/memory. This has turned out very impractical because of movable memory requirements and also some HW limitations (like most HW attached to Node0 which has turned out to be a single point of failure anyway). [...] [...] > > Moving the global policy to cgroups would make the main concern of > > different workloads looking for different policy less problematic. > > I didn't have much time to think that through but the main question is > > how to sanely define hierarchical properties of those weights? This is > > more of a resource distribution than enforcement so maybe a simple > > inherit or overwrite (if you have more specific needs) semantic makes > > sense and it is sufficient. > > > > As a user I would assume it would operate much the same way as other > nested cgroups, which is inherit by default (with subsets) or an > explicit overwrite that can't exceed the higher level settings. This would make it rather impractical because a default (everything set to 1) would be cast in stone. As mentioned above, this is not an enforcement limit. So I _think_ that a simple hierarchical rule like cgroup_interleaving_mask(cgroup) interleaving_mask = (cgroup->interleaving_mask) ?
: cgroup_interleaving_mask(parent_cgroup(cgroup)) So child cgroups could overwrite the parent as they wish. If there is any enforcement (like a cpuset) that would filter usable nodes and the allocation policy would simply apply weights on those.
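(Expanding the one-line rule above into a self-contained sketch; the struct below is a toy stand-in for a cgroup, not the kernel's. A child with no weights configured inherits the nearest ancestor's table, while a child that sets its own overrides the parent entirely.)

#include <stddef.h>

struct cg {                                      /* toy stand-in for a cgroup      */
        struct cg *parent;
        const unsigned int *interleave_weights;  /* NULL = nothing configured here */
};

/*
 * The rule sketched above: use the cgroup's own weight table if it has one,
 * otherwise fall back to the closest ancestor that does.
 */
static const unsigned int *cgroup_interleaving_weights(const struct cg *cg)
{
        for (; cg; cg = cg->parent)
                if (cg->interleave_weights)
                        return cg->interleave_weights;

        return NULL;    /* nothing set anywhere: caller applies the global default */
}

Any cpuset-style enforcement would then only filter which of the returned weights are actually consulted, as described in the paragraph above.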
On Fri, 03 Nov 2023 15:45:13 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: > >> On Wed 01-11-23 12:58:55, Gregory Price wrote: > >> > Basically consider: `numactl --interleave=all ...` > >> > > >> > If `--weights=...`: when a node hotplug event occurs, there is no > >> > recourse for adding a weight for the new node (it will default to 1). > >> > >> Correct and this is what I was asking about in an earlier email. How > >> much do we really need to consider this setup. Is this something nice to > >> have or does the nature of the technology requires to be fully dynamic > >> and expect new nodes coming up at any moment? > >> > > > > Dynamic Capacity is expected to cause a numa node to change size (in > > number of memory blocks) rather than cause numa nodes to come and go, so > > maybe handling the full node hotplug is a bit of an overreach. > > Will node max bandwidth change with the number of memory blocks? Typically no as even a single memory extent would probably be interleaved across all the actual memory devices (think DIMMS for simplicity) within a CXL device. I guess a device 'could' do some scaling based on capacity provided to a particular host but feels like they should be separate controls. I don't recall there being anything in the specification to suggest the need to recheck the CDAT info for updates when DC add / remove events happen. Mind you, who knows in future :) We'll point out in relevant forums that doing so would be very hard to handle cleanly in Linux. Jonathan
On Thu 02-11-23 14:21:14, Gregory Price wrote: [...] > Only thing I'm not sure of is what happens if mempolicy is allowed to > select a node that doesn't exist. I could hack up a contrived test, but > I don't think the state is reachable at the moment. There are two different kinds of "doesn't exist". One is an offline node and the other is one with a number higher than the config option allows. Although we do have a concept of possible nodes (N_POSSIBLE), I do not think we enforce that in any user interface; we only reject nodes outside of MAX_NODE. The possible nodes concept is more about optimizing for real HW so that we do not overshoot when the config allows a huge number of nodes while only a handful of them are actually used (which is the majority of cases).
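(A toy illustration of the two kinds of "doesn't exist" distinguished above; MAX_NODES and the bitmask are stand-ins for the kernel's MAX_NUMNODES and node_possible_map. Per the message above, today's interfaces only apply the first check; the second is shown as the stricter variant being discussed.)

#include <errno.h>

#define MAX_NODES 64                      /* stand-in for MAX_NUMNODES             */

static unsigned long possible_mask = 0x3; /* stand-in: only nodes 0 and 1 possible */

/*
 * Reject node numbers the config cannot represent; optionally also reject
 * nodes that are not even possible on this machine.  A merely offline node
 * would be accepted and simply behave like a depleted one.
 */
static int validate_weight_node(int node)
{
        if (node < 0 || node >= MAX_NODES)
                return -EINVAL;           /* beyond the configured maximum        */
        if (!(possible_mask & (1UL << node)))
                return -EINVAL;           /* not a possible node (stricter check) */
        return 0;
}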
Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes: > On Fri, 03 Nov 2023 15:45:13 +0800 > "Huang, Ying" <ying.huang@intel.com> wrote: > >> Gregory Price <gregory.price@memverge.com> writes: >> >> > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote: >> >> On Wed 01-11-23 12:58:55, Gregory Price wrote: >> >> > Basically consider: `numactl --interleave=all ...` >> >> > >> >> > If `--weights=...`: when a node hotplug event occurs, there is no >> >> > recourse for adding a weight for the new node (it will default to 1). >> >> >> >> Correct and this is what I was asking about in an earlier email. How >> >> much do we really need to consider this setup. Is this something nice to >> >> have or does the nature of the technology requires to be fully dynamic >> >> and expect new nodes coming up at any moment? >> >> >> > >> > Dynamic Capacity is expected to cause a numa node to change size (in >> > number of memory blocks) rather than cause numa nodes to come and go, so >> > maybe handling the full node hotplug is a bit of an overreach. >> >> Will node max bandwidth change with the number of memory blocks? > > Typically no as even a single memory extent would probably be interleaved > across all the actual memory devices (think DIMMS for simplicity) within > a CXL device. I guess a device 'could' do some scaling based on capacity > provided to a particular host but feels like they should be separate controls. > I don't recall there being anything in the specification to suggest the > need to recheck the CDAT info for updates when DC add / remove events happen. Sounds good! Thank you for detailed explanation. > Mind you, who knows in future :) We'll point out in relevant forums that > doing so would be very hard to handle cleanly in Linux. Thanks! -- Best Regards, Huang, Ying
Michal Hocko <mhocko@suse.com> writes: > On Fri 03-11-23 15:10:37, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Thu 02-11-23 14:11:09, Huang, Ying wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> >> >> >> > On Wed 01-11-23 10:21:47, Huang, Ying wrote: >> >> >> Michal Hocko <mhocko@suse.com> writes: >> >> > [...] >> >> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit >> >> >> > for this specific CXL usecase but it just doesn't fit into many others I >> >> >> > can think of - e.g. proportional use of those tiers based on the >> >> >> > workload - you get what you pay for. >> >> >> >> >> >> For "pay", per my understanding, we need some cgroup based >> >> >> per-memory-tier (or per-node) usage limit. The following patchset is >> >> >> the first step for that. >> >> >> >> >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/ >> >> > >> >> > Why do we need a sysfs interface if there are plans for cgroup API? >> >> >> >> They are for different target. The cgroup API proposed here is to >> >> constrain the DRAM usage in a system with DRAM and CXL memory. The less >> >> you pay, the less DRAM and more CXL memory you use. >> > >> > Right, but why the usage distribution requires its own interface and >> > cannot be combined with the access control part of it? >> >> Per my understanding, they are orthogonal. >> >> Weighted-interleave is a memory allocation policy, other memory >> allocation policies include local first, etc. >> >> Usage limit is to constrain the usage of specific memory types >> (e.g. DRAM) for a cgroup. It can be used together with local first >> policy and some other memory allocation policy. > > Bad wording from me. Sorry for the confusion. Never mind. > Sure those are two orthogonal things and I didn't mean to suggest a > single API to cover both. But if cgroup semantic can be reasonably > defined for the usage enforcement can we put the interleaving behavior > API under the same cgroup controller as well? I haven't thought about it thoroughly. But I think it should be the direction. -- Best Regards, Huang, Ying