Message ID: 20231109002517.106829-1-gregory.price@memverge.com
Series: memcg weighted interleave mempolicy control
On Wed 08-11-23 19:25:14, Gregory Price wrote: > This patchset implements weighted interleave and adds a new cgroup > sysfs entry: cgroup/memory.interleave_weights (excluded from root). Why have you chosen the memory controller rather than the cpuset controller? TBH I do not think memcg is the best fit because traditionally memcg accounts consumption rather than memory placement. This means that the memory is already allocated when it is charged to a memcg. On the other hand the cpuset controller is the one that controls allocation placement, so it would seem a better fit.
On Thu, Nov 09, 2023 at 11:02:23AM +0100, Michal Hocko wrote: > On Wed 08-11-23 19:25:14, Gregory Price wrote: > > This patchset implements weighted interleave and adds a new cgroup > > sysfs entry: cgroup/memory.interleave_weights (excluded from root). > > Why have you chosen memory controler rather than cpuset controller? > TBH I do not think memcg is the best fit because traditionally memcg > accounts consumption rather than memory placement. This means that the > memory is already allocated when it is charged for a memcg. On the other > hand cpuset controller is the one to control the allocation placement so > it would seem a better fit. > -- > Michal Hocko > SUSE Labs Wasn't sure between the two, so I tossed it in memcg. Easy relocation, and the code should remain basically the same, so I will wait a bit before throwing out an update. Assuming we've found the right place for it, I'll probably drop RFC at that point. ~Gregory
On Thu, Nov 09, 2023 at 11:02:23AM +0100, Michal Hocko wrote: > On Wed 08-11-23 19:25:14, Gregory Price wrote: > > This patchset implements weighted interleave and adds a new cgroup > > sysfs entry: cgroup/memory.interleave_weights (excluded from root). > > Why have you chosen memory controler rather than cpuset controller? > TBH I do not think memcg is the best fit because traditionally memcg > accounts consumption rather than memory placement. This means that the > memory is already allocated when it is charged for a memcg. On the other > hand cpuset controller is the one to control the allocation placement so > it would seem a better fit. > -- > Michal Hocko > SUSE Labs Actually, I'm going to walk back my last email: memcg feels more correct than cpuset, if only because of what the admin-guide says: """ The "memory" controller regulates distribution of memory. [... snip ...] While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. """ 'And controlled to a reasonable extent' seems to fit the description of this mechanism better than the cpuset description: """ The "cpuset" controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task's current cgroup. """ This is not a constraining interface... it's "more of a suggestion". In particular, anything not using interleave doesn't even care about these weights at all. The distribution is only enforced at allocation; it does not cause migrations... though that would be a neat idea. This is explicitly why the interface does not allow a weight of 0 (the node should be omitted from the policy nodemask or cpuset instead). Even if this were designed to enforce a particular distribution of memory, I'm not certain that would belong in cpusets either - but I suppose that is a separate discussion. It's possible this array of weights could be used to do both, but it seems (at least on the surface) that making this a hard control is an excellent way to induce OOMs where you may not want them. Anyway, summarizing: After a bit of reading, this does seem to map better to the "accounting consumption" subsystem than the "constrain" subsystem. However, if you think it's better suited for cpuset, I'm happy to push in that direction. ~Gregory
On 23/11/08 07:25PM, Gregory Price wrote: > This patchset implements weighted interleave and adds a new cgroup > sysfs entry: cgroup/memory.interleave_weights (excluded from root). <snip> We at Micron think this capability is really important, and moving it into cgroups introduces important flexibility that was missing from prior versions of the patchset. To me this mainly represents the antidote to having the system BIOS program the memory map to be heterogeneously interleaved - which I expect will be an option, and some customers seem to want it. But that approach will affect all apps (and also the kernel), and will hide at least the interleaved NUMA nodes from Linux' view of the topology. There is a lot not to like about that approach... This approach checks all the important boxes: it only applies to apps where it's enabled, the weighting can vary from one app to another, the kernel is not affected, and the NUMA topology is not buried. I see Michal's comment about cpuset vs. memory, and have no opinion there. Thumbs up! John Groves at Micron
Gregory Price <gourry.memverge@gmail.com> writes: > This patchset implements weighted interleave and adds a new cgroup > sysfs entry: cgroup/memory.interleave_weights (excluded from root). > > The il_weight of a node is used by mempolicy to implement weighted > interleave when `numactl --interleave=...` is invoked. By default > il_weight for a node is always 1, which preserves the default round > robin interleave behavior. IIUC, this makes it almost impossible to set the default weight of a node from the node memory bandwidth information. This will make the life of users a little harder. If so, how about using a new memory policy mode, for example MPOL_WEIGHTED_INTERLEAVE, etc.? > Interleave weights denote the number of pages that should be > allocated from the node when interleaving occurs and have a range > of 1-255. The weight of a node can never be 0, and instead the > preferred way to prevent allocation is to remove the node from the > cpuset or mempolicy altogether. > > For example, if a node's interleave weight is set to 5, 5 pages > will be allocated from that node before the next node is scheduled > for allocations. > > # Set node weight for node 0 to 5 > echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights > > # Set node weight for node 1 to 3 > echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights > > # View the currently set weights > cat /sys/fs/cgroup/user.slice/memory.interleave_weights > 0:5,1:3 > > Weights will only be displayed for possible nodes. > > With this it becomes possible to set an interleaving strategy > that fits the available bandwidth for the devices available on > the system. An example system: > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket) > Node 1 - CXL Memory. 64GB/s BW, on Node 0 root complex > > In this setup, the effective weights for a node set of [0,1] > may be [86, 14] (86% of memory on Node 0, 14% on node 1) > or some smaller fraction thereof to encourage quicker rounds > for better overall distribution. > > This spreads memory out across devices which all have different > latency and bandwidth attributes in a way that can maximize the > available resources. > -- Best Regards, Huang, Ying
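For illustration, here is a minimal userspace model of the weighted round-robin behavior described in the cover letter above (allocate `weight` pages from a node before moving to the next). This is not the patchset's kernel code; struct wi_policy and wi_next_node() are made-up names used only to show the intended behavior for the 0:5,1:3 example.

#include <stdio.h>

struct wi_policy {
	const unsigned char *weights;	/* per-node weight, range 1-255 */
	int nr_nodes;			/* number of nodes in the interleave set */
	int cur_node;			/* node currently being filled */
	int cur_count;			/* pages already taken from cur_node */
};

/* Take `weight` pages from the current node before advancing round-robin. */
static int wi_next_node(struct wi_policy *p)
{
	if (p->cur_count >= p->weights[p->cur_node]) {
		p->cur_node = (p->cur_node + 1) % p->nr_nodes;
		p->cur_count = 0;
	}
	p->cur_count++;
	return p->cur_node;
}

int main(void)
{
	const unsigned char weights[] = { 5, 3 };	/* the 0:5,1:3 example */
	struct wi_policy p = { weights, 2, 0, 0 };

	for (int i = 0; i < 16; i++)
		printf("page %2d -> node %d\n", i, wi_next_node(&p));
	return 0;
}

Running it prints five pages on node 0 followed by three on node 1, repeating, which is the distribution the cover letter describes.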
On Thu 09-11-23 11:34:01, Gregory Price wrote: [...] > Anyway, summarizing: After a bit of reading, this does seem to map > better to the "accounting consumption" subsystem than the "constrain" > subsystem. However, if you think it's better suited for cpuset, I'm > happy to push in that direction. Maybe others see it differently but I stick with my previous position. Memcg is not a great fit for reasons already mentioned - most notably that the controller doesn't control the allocation but accounts for what has already been allocated. Cpusets, on the other hand, constrain the allocations, and that is exactly what you want to achieve.
On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote: > Gregory Price <gourry.memverge@gmail.com> writes: > > > This patchset implements weighted interleave and adds a new cgroup > > sysfs entry: cgroup/memory.interleave_weights (excluded from root). > > > > The il_weight of a node is used by mempolicy to implement weighted > > interleave when `numactl --interleave=...` is invoked. By default > > il_weight for a node is always 1, which preserves the default round > > robin interleave behavior. > > IIUC, this makes it almost impossible to set the default weight of a > node from the node memory bandwidth information. This will make the > life of users a little harder. > > If so, how about use a new memory policy mode, for example > MPOL_WEIGHTED_INTERLEAVE, etc. > Weights are also inherited from parent cgroups, so if you set them in parent slices you can effectively update system-wide settings. By default the parent slice weights will always be 1 until set otherwise. Once they're set, children inherit naturally. Maybe there's an argument here for including interleave_weights in the root cgroup. ~Gregory
On Fri, Nov 10, 2023 at 10:05:57AM +0100, Michal Hocko wrote: > On Thu 09-11-23 11:34:01, Gregory Price wrote: > [...] > > Anyway, summarizing: After a bit of reading, this does seem to map > > better to the "accounting consumption" subsystem than the "constrain" > > subsystem. However, if you think it's better suited for cpuset, I'm > > happy to push in that direction. > > Maybe others see it differently but I stick with my previous position. > Memcg is not a great fit for reasons already mentioned - most notably > that the controller doesn't control the allocation but accounting what > has been already allocated. Cpusets on the other hand constrains the > allocations and that is exactly what you want to achieve. > -- > Michal Hocko > SUSE Labs Digging in a bit, placing it in cpusets has locking requirements that concern me. Maybe I'm being a bit over-cautious, so if none of this matters, then I'll go ahead and swap the code over to cpusets. Otherwise, just more food for thought in cpusets vs memcg.

In cpuset.c it states that when acquiring read-only access, we have to acquire the (global) callback lock:

https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L391

 * There are two global locks guarding cpuset structures - cpuset_mutex and
 * callback_lock. We also require taking task_lock() when dereferencing a
 * task's cpuset pointer. See "The task_lock() exception", at the end of this
 * comment.

Examples:

cpuset_node_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4780

	spin_lock_irqsave(&callback_lock, flags);
	rcu_read_lock();
	cs = nearest_hardwall_ancestor(task_cs(current));  <-- walks parents
	allowed = node_isset(node, cs->mems_allowed);
	rcu_read_unlock();
	spin_unlock_irqrestore(&callback_lock, flags);

cpuset_mems_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4679

	spin_lock_irqsave(&callback_lock, flags);
	rcu_read_lock();
	guarantee_online_mems(task_cs(tsk), &mask);  <-- walks parents
	rcu_read_unlock();
	spin_unlock_irqrestore(&callback_lock, flags);

Seems apparent that any form of parent walk in cpusets will require the acquisition of &callback_lock. This does not appear true of memcg. Implementing a similar inheritance structure as described in this patch set would therefore cause the acquisition of the callback lock during node selection. So if we want this in cpuset, we're going to eat that lock acquisition, despite not really needing it. I was not intending to do checks against cpusets.mems_allowed when acquiring weights, as this is already enforced between cpusets and mempolicy on hotplug and mask changes, as well as in the allocators via read_mems_allowed_begin/retry. This is why I said this was *not* a constraining feature. Additionally, if the node selected by mpol is exhausted, the allocator will simply acquire memory from another (allowed) node, disregarding the weights entirely (which is the correct / expected behavior). Another example of "this is more of a suggestion" rather than a constraint. So I'm contending here that putting it in cpusets is overkill. But if it likewise doesn't fit in memcg, is it insane to suggest that maybe we should consider adding cgroup.mpol, and maybe consider migrating features from mempolicy.c into cgroups (while keeping mpol the way it is)? ~Gregory
Hello, On Thu, Nov 09, 2023 at 10:48:56PM +0000, John Groves wrote: > This approach checks all the important boxes: it only applies to apps where > it's enabled, the weighting can vary from one app to another, the > kernel is not affected, and the numa topology is not buried. Can't it be a mempol property which is inherited by child processes? Then all you'll need is e.g. adding systemd support to configure this at the service unit level. I'm having a bit of a hard time seeing why this needs to be a cgroup feature when it doesn't involve dynamic resource accounting / enforcement at all. Thanks.
On Fri, Nov 10, 2023 at 12:05:59PM -1000, tj@kernel.org wrote: > Hello, > > On Thu, Nov 09, 2023 at 10:48:56PM +0000, John Groves wrote: > > This approach checks all the important boxes: it only applies to apps where > > it's enabled, the weighting can vary from one app to another, the > > kernel is not affected, and the numa topology is not buried. > > Can't it be a mempol property which is inherited by child processes? Then > all you'll need is e.g. adding systemd support to configure this at service > unit level. I'm having a bit of hard time seeing why this needs to be a > cgroup feature when it doesn't involve dynamic resource accounting / > enforcement at all. > > Thanks. > > -- > tejun I did originally implement it this way, but note that it will either require some creative extension of set_mempolicy or even set_mempolicy2 as proposed here: https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/ One of the problems to consider is task migration. If a task is migrated from one socket to another, for example by being moved to a new cgroup with a different cpuset - the weights might be completely nonsensical for the new allowed topology. Unfortunately mpol has no way of being changed from outside the task itself once it's applied, other than changing its nodemasks via cpusets. So one concrete use case: kubernetes might like to change cpusets or move tasks from one cgroup to another, or a VM might be migrated from one set of nodes to another (technically not mutually exclusive here). Some memory policy settings (like weights) may no longer apply when this happens, so it would be preferable to have a way to change them. ~Gregory
Hello, Gregory. On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote: > I did originally implement it this way, but note that it will either > require some creative extension of set_mempolicy or even set_mempolicy2 > as proposed here: > > https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/ > > One of the problems to consider is task migration. If a task is > migrated from one socket to another, for example by being moved to a new > cgroup with a different cpuset - the weights might be completely nonsensical > for the new allowed topology. > > Unfortunately mpol has no way of being changed from outside the task > itself once it's applied, other than changing its nodemasks via cpusets. Maybe it's time to add one? > So one concrete use case: kubernetes might like change cpusets or move > tasks from one cgroup to another, or a vm might be migrated from one set > of nodes to enother (technically not mutually exclusive here). Some > memory policy settings (like weights) may no longer apply when this > happens, so it would be preferable to have a way to change them. Neither covers all use cases. As you noted in your mempolicy message, if the application wants finer-grained control, the cgroup interface isn't great. In general, any changes which are dynamically initiated by the application itself aren't a great fit for cgroup. I'm generally pretty wary of adding non-resource group configuration interfaces, especially when they don't have a counterpart in the regular per-process/thread API, for a few reasons:

1. The reason why people try to add those through cgroup sometimes is because it seems easier to add those new features through cgroup, which may be true to some degree, but shortcuts often aren't very conducive to long-term maintainability.

2. As noted above, just having cgroup often excludes a significant portion of use cases. Not all systems enable cgroups, and programmatic accesses from target processes / threads are coarse-grained and can be really awkward.

3. Cgroup can be convenient when a group config change is necessary. However, we really don't want to keep adding kernel interfaces just for changing configs for a group of threads. For config changes which aren't high frequency, userspace iterating the member processes and applying the changes if possible is usually good enough, which usually involves looping until no new process is found. If the looping is problematic, the cgroup freezer can be used to atomically stop all member threads to provide atomicity too.

Thanks.
On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote: > Hello, Gregory. > > On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote: > > Unfortunately mpol has no way of being changed from outside the task > > itself once it's applied, other than changing its nodemasks via cpusets. > > Maybe it's time to add one? > I've been considering this as well, but there's more context here being lost. It's not just about being able to toggle the policy of a single task, or related tasks, but actually in support of a more global data interleaving strategy that makes use of bandwidth more effectively as memory expansion and bandwidth expansion begin to occur on the PCIe/CXL bus. If the memory landscape of a system changes, for example due to a hotplug event, you actually want to change the behavior of *every* task that is using interleaving. The fundamental bandwidth distribution of the entire system changed, so the behavior of every task using that memory should change with it. We've explored adding weights to: mempolicy, memory tiers, nodes, memcg, and now additionally cpusets. In the last email, I'd asked whether it might actually be worth adding a new mpol component of cgroups to aggregate these issues, rather than jam them into either component. I would love your thoughts on that. > > So one concrete use case: kubernetes might like change cpusets or move > > tasks from one cgroup to another, or a vm might be migrated from one set > > of nodes to enother (technically not mutually exclusive here). Some > > memory policy settings (like weights) may no longer apply when this > > happens, so it would be preferable to have a way to change them. > > Neither covers all use cases. As you noted in your mempolicy message, if the > application wants finer grained control, cgroup interface isn't great. In > general, any changes which are dynamically initiated by the application > itself isn't a great fit for cgroup. > It is certainly simple enough to add weights to mempolicy, but there are limitations. In particular, mempolicy is extremely `current task` focused, and significant refactor work would need to be done to allow external tasks the ability to toggle a target task's mempolicy. In particular I worry about the potential concurrency issues since mempolicy can be in the hot allocation path. (Additionally, as you note below, you would have to hit every child thread separately to make effective changes, since it is per-task.) I'm not opposed to this, but it was suggested to me that maybe there is a better place to place these weights. Maybe it can be managed mostly through RCU, though, so maybe the concern is overblown. Anyway... It's certainly my intent to add weights to mempolicy, as that's where I started. If that is the preferred starting point from the perspective of the mm community, I will revert to proposing set_mempolicy2 and/or full-on converting mempolicy into a sys/procfs-friendly mechanism. The goal here is to enable mempolicy, or something like it, to acquire additional flexibility in a heterogeneous memory world, considering how threads may be migrated, checkpointed/restored, and how use cases like bandwidth expansion may be insufficiently serviced by something as fine-grained as per-task mempolicies. > I'm generally pretty awry of adding non-resource group configuration > interface especially when they don't have counter part in the regular > per-process/thread API for a few reasons: > > 1.
The reason why people try to add those through cgroup somtimes is because > it seems easier to add those new features through cgroup, which may be > true to some degree, but shortcuts often aren't very conducive to long > term maintainability. > Concur. That's why I originally proposed the mempolicy extension, since I wasn't convinced by global settings, but I've been brought around by the fact that migrations and hotplug events may want to affect mass changes across a large number of unrelated tasks. > 2. As noted above, just having cgroup often excludes a signficant portion of > use cases. Not all systems enable cgroups and programatic accesses from > target processes / threads are coarse-grained and can be really awakward. > > 3. Cgroup can be convenient when group config change is necessary. However, > we really don't want to keep adding kernel interface just for changing > configs for a group of threads. For config changes which aren't high > frequency, userspace iterating the member processes and applying the > changes if possible is usually good enough which usually involves looping > until no new process is found. If the looping is problematic, cgroup > freezer can be used to atomically stop all member threads to provide > atomicity too. > If I can ask, do you think it would be out of line to propose a major refactor to mempolicy to enable external tasks the ability to change a running task's mempolicy *as well as* a cgroup-wide mempolicy component? As you've alluded to here, I don't think either mechanism on its own is sufficient to handle all use cases, but the two combined do seem sufficient. I do appreciate the feedback here, thank you. I think we are getting to the bottom of how/where such new mempolicy mechanisms should be implemented. ~Gregory
Hello, On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote: > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote: ... > I've been considering this as well, but there's more context here being > lost. It's not just about being able to toggle the policy of a single > task, or related tasks, but actually in support of a more global data > interleaving strategy that makes use of bandwidth more effectively as > we begin to memory expansion and bandwidth expansion occur on the > PCIE/CXL bus. > > If the memory landscape of a system changes, for example due to a > hotplug event, you actually want to change the behavior of *every* task > that is using interleaving. The fundamental bandwidth distribution of > the entire system changed, so the behavior of every task using that > memory should change with it. > > We've explored adding weights to: mempolicy, memory tiers, nodes, memcg, > and now additionally cpusets. In the last email, I'd asked whether it > might actually be worth adding a new mpol component of cgroups to > aggregate these issues, rather than jam them into either component. > I would love your thoughts on that. As for CXL and the changing memory landscape, I think some caution is necessary as with any expected "future" technology changes. The recent example with non-volatile memory isn't too far from CXL either. Note that this is not to say that we shouldn't change anything until the hardware is wildly popular but more that we need to be cognizant of the speculative nature and the possibility of overbuilding for it. I don't have a golden answer but here are general suggestions: Build something which is small and/or useful even outside the context of the expected hardware landscape changes. Enable the core feature which is absolutely required in a minimal manner. Avoid being maximalist in feature and convenience coverage. Here, even if CXL actually becomes popular, how many are going to use memory hotplug and need to dynamically rebalance memory in actively running workloads? What's the scenario? Are there going to be an army of data center technicians going around plugging and unplugging CXL devices depending on system memory usage? Maybe there are some cases this is actually useful but for those niche use cases, isn't per-task interface with iteration enough? How often are these hotplug events going to be? > > > So one concrete use case: kubernetes might like change cpusets or move > > > tasks from one cgroup to another, or a vm might be migrated from one set > > > of nodes to enother (technically not mutually exclusive here). Some > > > memory policy settings (like weights) may no longer apply when this > > > happens, so it would be preferable to have a way to change them. > > > > Neither covers all use cases. As you noted in your mempolicy message, if the > > application wants finer grained control, cgroup interface isn't great. In > > general, any changes which are dynamically initiated by the application > > itself isn't a great fit for cgroup. > > It is certainly simple enough to add weights to mempolicy, but there > are limitations. In particular, mempolicy is extremely `current task` > focused, and significant refactor work would need to be done to allow > external tasks the ability to toggle a target task's mempolicy. > > In particular I worry about the potential concurrency issues since > mempolicy can be in the hot allocation path. 
Changing mpol from outside the task is a feature which is inherently useful regardless of CXL and I don't quite understand why hot path concurrency issues would be different whether the configuration is coming from mempol or cgroup but that could easily be me not being familiar with the involved code. ... > > 3. Cgroup can be convenient when group config change is necessary. However, > > we really don't want to keep adding kernel interface just for changing > > configs for a group of threads. For config changes which aren't high > > frequency, userspace iterating the member processes and applying the > > changes if possible is usually good enough which usually involves looping > > until no new process is found. If the looping is problematic, cgroup > > freezer can be used to atomically stop all member threads to provide > > atomicity too. > > > > If I can ask, do you think it would be out of line to propose a major > refactor to mempolicy to enable external task's the ability to change a > running task's mempolicy *as well as* a cgroup-wide mempolicy component? I don't think these group configurations fit the cgroup filesystem interface very well. As these aren't resource allocations, it's unclear what the hierarchical relationship means. Besides, it feels awkward to keep adding duplicate interfaces where the modality changes completely based on the operation scope. There are ample examples where other subsystems use cgroup membership information, and while we haven't expanded that to syscalls yet, I don't see why that'd be all that different. So, maybe it'd make sense to have the new mempolicy syscall take a cgroup ID as a target identifier too? i.e. so that the scope of the operation (e.g. task, process, cgroup) and the content of the policy can stay orthogonal? Thanks.
tj@kernel.org wrote: > Hello, > > On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote: > > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote: > ... > > I've been considering this as well, but there's more context here being > > lost. It's not just about being able to toggle the policy of a single > > task, or related tasks, but actually in support of a more global data > > interleaving strategy that makes use of bandwidth more effectively as > > we begin to memory expansion and bandwidth expansion occur on the > > PCIE/CXL bus. > > > > If the memory landscape of a system changes, for example due to a > > hotplug event, you actually want to change the behavior of *every* task > > that is using interleaving. The fundamental bandwidth distribution of > > the entire system changed, so the behavior of every task using that > > memory should change with it. > > > > We've explored adding weights to: mempolicy, memory tiers, nodes, memcg, > > and now additionally cpusets. In the last email, I'd asked whether it > > might actually be worth adding a new mpol component of cgroups to > > aggregate these issues, rather than jam them into either component. > > I would love your thoughts on that. > > As for CXL and the changing memory landscape, I think some caution is > necessary as with any expected "future" technology changes. The recent > example with non-volatile memory isn't too far from CXL either. Note that > this is not to say that we shouldn't change anything until the hardware is > wildly popular but more that we need to be cognizant of the speculative > nature and the possibility of overbuilding for it. > > I don't have a golden answer but here are general suggestions: Build > something which is small and/or useful even outside the context of the > expected hardware landscape changes. Enable the core feature which is > absolutely required in a minimal manner. Avoid being maximalist in feature > and convenience coverage. If I had to state the golden rule of kernel enabling, this paragraph comes close to being it. > Here, even if CXL actually becomes popular, how many are going to use memory > hotplug and need to dynamically rebalance memory in actively running > workloads? What's the scenario? Are there going to be an army of data center > technicians going around plugging and unplugging CXL devices depending on > system memory usage? While I have personal skepticism that all of the infrastructure in the CXL specification is going to become popular, one mechanism that seems poised to cross that threshold is "dynamic capacity". So it is not the case that techs are running around hot-adjusting physical memory. A host will have a cable hop to a shared memory pool in the rack where it can be dynamically provisioned across hosts. However, even then the bounds of what is dynamic is going to be constrained to a fixed address space with likely predictable performance characteristics for that address range. That potentially allows for a system wide memory interleave policy to be viable. That might be the place to start and mirrors, at a coarser granularity, what hardware interleaving can do. [..]
Gregory Price <gregory.price@memverge.com> writes: > On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote: >> Gregory Price <gourry.memverge@gmail.com> writes: >> >> > This patchset implements weighted interleave and adds a new cgroup >> > sysfs entry: cgroup/memory.interleave_weights (excluded from root). >> > >> > The il_weight of a node is used by mempolicy to implement weighted >> > interleave when `numactl --interleave=...` is invoked. By default >> > il_weight for a node is always 1, which preserves the default round >> > robin interleave behavior. >> >> IIUC, this makes it almost impossible to set the default weight of a >> node from the node memory bandwidth information. This will make the >> life of users a little harder. >> >> If so, how about use a new memory policy mode, for example >> MPOL_WEIGHTED_INTERLEAVE, etc. >> > > weights are also inherited from parent cgroups, so if you set them in > parent slices you can automatically set update system settings. > > by default the parent slice weights will always be 1 until set > otherwise. Once they're set, children inherit naturally. > > Maybe there's an argument here for including interleave_weights in the > root cgroup. Even if interleave_weights is introduced in the root cgroup, the initial default weight needs to be 1 to be backward-compatible with the original MPOL_INTERLEAVE. If we don't reuse MPOL_INTERLEAVE, but use a new memory policy mode (say MPOL_WEIGHTED_INTERLEAVE), the default values of the interleave weight in the root cgroup needn't be 1. So, we can provide a more helpful default interleave weight based on the node memory bandwidth information (e.g., from HMAT, CDAT, etc). That will make users' lives much easier. Do you agree? -- Best Regards, Huang, Ying
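As a sketch of the default-weight idea discussed here, per-node bandwidth figures (such as those HMAT/CDAT could supply) might be reduced to small integer weights roughly like the following. The normalization scheme and the function name are assumptions for illustration, not anything proposed in the thread.

#include <stdio.h>

/* Reduce per-node bandwidth (GB/s) to integer weights in the 1-255 range
 * by normalizing against the slowest node.  Purely illustrative. */
static void bandwidth_to_weights(const unsigned bw[], unsigned char w[], int n)
{
	unsigned min_bw = 0;

	for (int i = 0; i < n; i++)
		if (bw[i] && (!min_bw || bw[i] < min_bw))
			min_bw = bw[i];

	for (int i = 0; i < n; i++) {
		unsigned v = min_bw ? (bw[i] + min_bw / 2) / min_bw : 1;

		w[i] = v < 1 ? 1 : (v > 255 ? 255 : v);	/* clamp to 1-255 */
	}
}

int main(void)
{
	/* Node 0: 400 GB/s DRAM, node 1: 64 GB/s CXL (cover-letter example). */
	unsigned bw[] = { 400, 64 };
	unsigned char w[2];

	bandwidth_to_weights(bw, w, 2);
	/* Prints 6 and 1: a 6:1 split, i.e. roughly the 86%/14% example. */
	printf("node0:%u node1:%u\n", (unsigned)w[0], (unsigned)w[1]);
	return 0;
}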
On Sat, Nov 11, 2023 at 03:54:55PM -0800, Dan Williams wrote: > tj@kernel.org wrote: > > Hello, > > > > On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote: > > > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote: > > > Here, even if CXL actually becomes popular, how many are going to use memory > > hotplug and need to dynamically rebalance memory in actively running > > workloads? What's the scenario? Are there going to be an army of data center > > technicians going around plugging and unplugging CXL devices depending on > > system memory usage? > > While I have personal skepticism that all of the infrastructure in the > CXL specification is going to become popular, one mechanism that seems > poised to cross that threshold is "dynamic capacity". So it is not the > case that techs are running around hot-adjusting physical memory. A host > will have a cable hop to a shared memory pool in the rack where it can > be dynamically provisioned across hosts. > > However, even then the bounds of what is dynamic is going to be > constrained to a fixed address space with likely predictable performance > characteristics for that address range. That potentially allows for a > system wide memory interleave policy to be viable. That might be the > place to start and mirrors, at a coarser granularity, what hardware > interleaving can do. > > [..] Funny enough, this is exactly why I skipped cgroups and went directly to implementing the weights as an attribute of NUMA nodes. It cuts out a middle-man and lets you apply weights globally. BUT the policy is still ultimately opt-in, so you don't really get a global effect, just a global control. Just given that lesson, yeah it's better to reduce the scope to mempolicy first. Getting to global interleave weights from there... more complicated. The simplest way I can think of to test system-wide weighted interleave is to have the init task create a default mempolicy and have all tasks inherit it. That feels like a big, dumb hammer - but it might work. Comparatively, implementing a mempolicy in the root cgroup and having tasks use that directly "feels" better, though a lesson from this patch is that iterating cgroup parent trees on allocations feels not great. Barring that, if a cgroup.mempolicy and a default mempolicy for init aren't realistic, I don't see a good path to fruition for a global interleave approach that doesn't require nastier allocator changes. In the meantime, unless there are other pro-cgroups voices, I'm going to pivot back to my initial approach of doing it in mempolicy, though I may explore extending mempolicy into procfs at the same time. ~Gregory
On Mon, Nov 13, 2023 at 09:31:07AM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote: > >> Gregory Price <gourry.memverge@gmail.com> writes: > >> > > > > weights are also inherited from parent cgroups, so if you set them in > > parent slices you can automatically set update system settings. > > > > by default the parent slice weights will always be 1 until set > > otherwise. Once they're set, children inherit naturally. > > > > Maybe there's an argument here for including interleave_weights in the > > root cgroup. > > Even if the interleave_weights is introduced in root cgroup, the initial > default weight need to be 1 to be back-compatible with the original > MPOL_INTERLEAVE. > Sorry, I am maybe not explaining correctly. Right now, the weights are not *replicated* when a child cgroup is created. Instead, when weights are requested (during allocation) the requestor searches for the first cgroup in its family that has weights. while (!memcg->weights && !memcg_is_root(memcg)) memcg = parent(memcg) We only create new weights on each child if the child explicitly has their weights set. We manage everything else via RCU to keep it all consistent. This would allow a set of weights in the root cgroup to be set and then immediately inherited by the entire system, though it does add the overhead of searching the cgroup family lineage on allocations (which could be non-trivial, we are still testing it). > If we don't reuse MPOL_INTERLEAVE, but use a new memory policy mode (say > MPOL_WEIGHTED_INTERLEAVE). The default values of the interleave weight > in root cgroup needn't to be 1. I agree, and I already have patches that do just this. Though based on other feedback, it's looking like I'll be reverting back to implementing all of this in mempolicy, and maybe trying to pull mempolicy forward into the procfs world. ~Gregory
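A minimal userspace model of the parent-walk inheritance described above might look like this; struct cg and effective_weights() are illustrative names, and the RCU protection and cgroup plumbing of the real patch are deliberately omitted.

#include <stdio.h>
#include <stddef.h>

struct cg {
	const char *name;
	struct cg *parent;		/* NULL for the root */
	const unsigned char *weights;	/* NULL until explicitly set */
};

/* A cgroup without weights of its own uses the nearest ancestor's. */
static const unsigned char *effective_weights(struct cg *cg)
{
	while (!cg->weights && cg->parent)
		cg = cg->parent;
	return cg->weights;		/* may still be NULL at the root */
}

int main(void)
{
	static const unsigned char root_w[] = { 5, 3 };
	struct cg root  = { "root",       NULL,   root_w };
	struct cg slice = { "user.slice", &root,  NULL };
	struct cg child = { "app.scope",  &slice, NULL };
	const unsigned char *w = effective_weights(&child);

	printf("%s inherits 0:%u,1:%u from an ancestor\n",
	       child.name, (unsigned)w[0], (unsigned)w[1]);
	return 0;
}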
On Fri 10-11-23 22:42:39, Gregory Price wrote: [...] > If I can ask, do you think it would be out of line to propose a major > refactor to mempolicy to enable external task's the ability to change a > running task's mempolicy *as well as* a cgroup-wide mempolicy component? No, I actually think this is a reasonable idea. pidfd_setmempolicy is a generally useful extension. The mempolicy code is heavily current-task based and there might be some challenges, but I believe this will a) improve the code base and b) allow more use cases. That being said, I still believe that a cgroup based interface is a much better choice than a global one. Cpusets seem to be a good fit as the controller does control memory placement wrt NUMA interfaces.
On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote: > On Fri 10-11-23 22:42:39, Gregory Price wrote: > [...] > > If I can ask, do you think it would be out of line to propose a major > > refactor to mempolicy to enable external task's the ability to change a > > running task's mempolicy *as well as* a cgroup-wide mempolicy component? > > No, I actually think this is a reasonable idea. pidfd_setmempolicy is a > generally useful extension. The mempolicy code is heavily current task > based and there might be some challenges but I believe this will a) > improve the code base and b) allow more usecases. Just read up on the pidfd_set_mempolicy lore, and yes I'm seeing all the same problems (I know there was discussion of vma policies, but I think that can be a topic for later). Have some thoughts on this, but will take some time to work through a few refactoring tickets first. > > That being said, I still believe that a cgroup based interface is a much > better choice over a global one. Cpusets seem to be a good fit as the > controller does control memory placement wrt NUMA interfaces. I think cpusets is a non-starter due to the global spinlock required when reading information from it: https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391 Unless the proposal is to place the weights as a global cgroups value, in which case I think it would be better placed in default_mempolicy :]
On Tue 14-11-23 10:50:51, Gregory Price wrote: > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote: [...] > > That being said, I still believe that a cgroup based interface is a much > > better choice over a global one. Cpusets seem to be a good fit as the > > controller does control memory placement wrt NUMA interfaces. > > I think cpusets is a non-starter due to the global spinlock required when > reading informaiton from it: > > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391 Right, our current cpuset implementation indeed requires the callback lock from the page allocator. But that is an implementation detail. I do not remember bug reports about the lock being a bottleneck though. If anything, cpusets lock optimizations would be a win also for users who do not want to use the weighted interleave interface.
On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote: > On Tue 14-11-23 10:50:51, Gregory Price wrote: > > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote: > [...] > > > That being said, I still believe that a cgroup based interface is a much > > > better choice over a global one. Cpusets seem to be a good fit as the > > > controller does control memory placement wrt NUMA interfaces. > > > > I think cpusets is a non-starter due to the global spinlock required when > > reading informaiton from it: > > > > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391 > > Right, our current cpuset implementation indeed requires callback lock > from the page allocator. But that is an implementation detail. I do not > remember bug reports about the lock being a bottle neck though. If > anything cpusets lock optimizations would be win also for users who do > not want to use weighted interleave interface. Definitely agree, but that's a rather large increase of scope :[ We could consider a push-model similar to how cpuset nodemasks are pushed down to mempolicies, rather than a pull-model of having mempolicy read directly from cpusets, at least until cpusets lock optimization is undertaken. This pattern looks like a wart to me, which is why I avoided it, but the locking implications on the pull-model make me sad. Would like to point out that Tejun pushed back on implementing weights in cgroups (regardless of subcomponent), so I think we need to come to a consensus on where this data should live in a "more global" context (cpusets, memcg, nodes, etc) before I go mucking around further.

So far we have:

* mempolicy: updating weights is a very complicated undertaking, and no (good) way to do this from outside the task. It would be better to have coarser-grained control. A new syscall is likely needed to add/set weights in the per-task mempolicy, or bite the bullet on set_mempolicy2 and make the syscall extensible for the future.

* memtiers: tier=node when devices are already interleaved or when all devices are different, so why add yet another layer of complexity if other constructs already exist. Additionally, you lose task-placement-relative weighting (or it becomes very complex to implement).

* cgroups: "this doesn't involve dynamic resource accounting / enforcement at all" and "these aren't resource allocations, it's unclear what the hierarchical relationship mean".

* node: too global; explore smaller scope first, then expand.

For now I think there is consensus that mempolicy should have weights per-task regardless of how the more-global mechanism is defined, so I'll go ahead and put up another RFC for some options on that in the next week or so. The limitations on the first pass will be that only the task is capable of re-weighting should cpusets.mems or the nodemask change. ~Gregory
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote: >> On Tue 14-11-23 10:50:51, Gregory Price wrote: >> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote: >> [...] >> > > That being said, I still believe that a cgroup based interface is a much >> > > better choice over a global one. Cpusets seem to be a good fit as the >> > > controller does control memory placement wrt NUMA interfaces. >> > >> > I think cpusets is a non-starter due to the global spinlock required when >> > reading informaiton from it: >> > >> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391 >> >> Right, our current cpuset implementation indeed requires callback lock >> from the page allocator. But that is an implementation detail. I do not >> remember bug reports about the lock being a bottle neck though. If >> anything cpusets lock optimizations would be win also for users who do >> not want to use weighted interleave interface. > > Definitely agree, but that's a rather large increase of scope :[ > > We could consider a push-model similar to how cpuset nodemasks are > pushed down to mempolicies, rather than a pull-model of having > mempolicy read directly from cpusets, at least until cpusets lock > optimization is undertaken. > > This pattern looks like a wart to me, which is why I avoided it, but the > locking implications on the pull-model make me sad. > > Would like to point out that Tejun pushed back on implementing weights > in cgroups (regardless of subcomponent), so I think we need to come > to a consensus on where this data should live in a "more global" > context (cpusets, memcg, nodes, etc) before I go mucking around > further. > > So far we have: > * mempolicy: updating weights is a very complicated undertaking, > and no (good) way to do this from outside the task. > would be better to have a coarser grained control. > > New syscall is likely needed to add/set weights in the > per-task mempolicy, or bite the bullet on set_mempolicy2 > and make the syscall extensible for the future. > > * memtiers: tier=node when devices are already interleaved or when all > devices are different, so why add yet another layer of > complexity if other constructs already exist. Additionally, > you lose task-placement relative weighting (or it becomes > very complex to implement. Because we usually have multiple nodes in one mem-tier, I still think mem-tier-based interface is simpler than node-based. But, it seems more complex to introduce mem-tier into mempolicy. Especially if we have per-task weights. So, I am fine to go with node-based interface. > * cgroups: "this doesn't involve dynamic resource accounting / > enforcement at all" and "these aren't resource > allocations, it's unclear what the hierarchical > relationship mean". > > * node: too global, explore smaller scope first then expand. Why is it too global? I understand that it doesn't cover all possible use cases (although I don't know whether these use cases are practical or not). But it can provide a reasonable default per-node weight based on available node performance information (such as, HMAT, CDAT, etc.). And, quite some workloads can just use it. I think this is an useful feature. > For now I think there is consensus that mempolicy should have weights > per-task regardless of how the more-global mechanism is defined, so i'll > go ahead and put up another RFC for some options on that in the next > week or so. 
> > The limitations on the first pass will be that only the task is capable > of re-weighting should cpusets.mems or the nodemask change. -- Best Regards, Huang, Ying
On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > Because we usually have multiple nodes in one mem-tier, I still think > mem-tier-based interface is simpler than node-based. But, it seems more > complex to introduce mem-tier into mempolicy. Especially if we have > per-task weights. So, I am fine to go with node-based interface. > > > * cgroups: "this doesn't involve dynamic resource accounting / > > enforcement at all" and "these aren't resource > > allocations, it's unclear what the hierarchical > > relationship mean". > > > > * node: too global, explore smaller scope first then expand. > > Why is it too global? I understand that it doesn't cover all possible > use cases (although I don't know whether these use cases are practical > or not). But it can provide a reasonable default per-node weight based > on available node performance information (such as, HMAT, CDAT, etc.). > And, quite some workloads can just use it. I think this is an useful > feature. > Have been sharing notes with more folks. Michal thinks a global set of weights is unintuitive and not useful, and would prefer to see the per-task weights first. Though this may have been in response to adding it as an attribute of nodes directly. Another proposal here suggested adding a new sysfs setting https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a

$ tree /sys/kernel/mm/interleave_weight/
/sys/kernel/mm/interleave_weight/
├── enabled [1]
├── possible [2]
└── node
    ├── node0
    │   └── interleave_weight [3]
    └── node1
        └── interleave_weight [3]

(this could be changed to /sys/kernel/mm/mempolicy/...) I think the internal representation of this can be simplified greatly, over what the patch provides now, but maybe this solves the "it doesn't belong in these other components" issue. Answer: Simply leave it as a static global kobject in mempolicy, which also deals with many of the issues regarding race conditions. If a user provides weights, use those. If they do not, use globals. On a cpuset rebind event (container migration, mems_allowed changes), manually set weights would have to remain, so in a bad case, the weights would be very out of line with the real distribution of memory. Example: if your nodemask is (0,1,2) and a migration changes it to (3,4,5), then unfortunately your weights will likely revert to [1,1,1] If set with global weights, they could automatically adjust. It would not be perfect, but it would be better than the potential worst case above. If that same migration occurs, the next allocation would simply use whatever the target node weights are in the global config. So if globally you have weights [3,2,1,1,2,3], and you move from nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to [1,2,3]. If the structure is built as a matrix of (cpu_node,mem_nodes), then you can also optimize based on the node the task is running on. That feels very intuitive, deals with many race condition issues, and the global setting can actually be implemented without the need for set_mempolicy2 at all - which is certainly a bonus. Would love more thoughts here. Will have a new RFC with set_mempolicy2, mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above. Regards ~Gregory
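To make the rebind example concrete, a sketch of the "global table plus current nodemask" behavior could look like the following; the table values are the [3,2,1,1,2,3] example above and the names are illustrative only.

#include <stdio.h>

#define NR_NODES 6

/* The [3,2,1,1,2,3] global table from the example above. */
static const unsigned char global_weight[NR_NODES] = { 3, 2, 1, 1, 2, 3 };

/* Effective weights are simply the global entries for the allowed nodes,
 * so a cpuset rebind to a new nodemask picks up new weights automatically. */
static void show_effective(unsigned long allowed_mask)
{
	for (int node = 0; node < NR_NODES; node++)
		if (allowed_mask & (1UL << node))
			printf(" node%d:%u", node, (unsigned)global_weight[node]);
	printf("\n");
}

int main(void)
{
	printf("before rebind, nodemask (0,1,2):");
	show_effective(0x07);	/* -> weights 3,2,1 */
	printf("after  rebind, nodemask (3,4,5):");
	show_effective(0x38);	/* -> weights 1,2,3 */
	return 0;
}

The point of the sketch is only that no per-task state needs to be rewritten on migration; the next allocation simply reads the global entries for the new nodemask.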
Gregory Price <gregory.price@memverge.com> writes: > On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote: >> Gregory Price <gregory.price@memverge.com> writes: >> >> Because we usually have multiple nodes in one mem-tier, I still think >> mem-tier-based interface is simpler than node-based. But, it seems more >> complex to introduce mem-tier into mempolicy. Especially if we have >> per-task weights. So, I am fine to go with node-based interface. >> >> > * cgroups: "this doesn't involve dynamic resource accounting / >> > enforcement at all" and "these aren't resource >> > allocations, it's unclear what the hierarchical >> > relationship mean". >> > >> > * node: too global, explore smaller scope first then expand. >> >> Why is it too global? I understand that it doesn't cover all possible >> use cases (although I don't know whether these use cases are practical >> or not). But it can provide a reasonable default per-node weight based >> on available node performance information (such as, HMAT, CDAT, etc.). >> And, quite some workloads can just use it. I think this is an useful >> feature. >> > > Have been sharing notes with more folks. Michal thinks a global set of > weights is unintuitive and not useful, and would prefer to see the > per-task weights first. > > Though this may have been in response to adding it as an attribute of > nodes directly. > > Another proposal here suggested adding a new sysfs setting > https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a > > $ tree /sys/kernel/mm/interleave_weight/ > /sys/kernel/mm/interleave_weight/ > ├── enabled [1] > ├── possible [2] > └── node > ├── node0 > │ └── interleave_weight [3] > └── node1 > └── interleave_weight [3] > > (this could be changed to /sys/kernel/mm/mempolicy/...) > > I think the internal representation of this can be simplified greatly, > over what the patch provides now, but maybe this solves the "it doesn't > belong in these other components" issue. > > Answer: Simply leave it as a static global kobject in mempolicy, which > also deals with many of the issues regarding race conditions. Although personally I prefer to add interleave weight as an attribute of nodes. I understand that some people think it's not appropriate to place anything node-specific there. So, some place under /sys/kernel/mm sounds reasonable too. > If a user provides weights, use those. If they do not, use globals. Yes. That is the target use case. > On a cpuset rebind event (container migration, mems_allowed changes), > manually set weights would have to remain, so in a bad case, the > weights would be very out of line with the real distribution of memory. > > Example: if your nodemask is (0,1,2) and a migration changes it to > (3,4,5), then unfortunately your weights will likely revert to [1,1,1] > > If set with global weights, they could automatically adjust. It > would not be perfect, but it would be better than the potential worst > case above. If that same migration occurs, the next allocation would > simply use whatever the target node weights are in the global config. > > So if globally you have weights [3,2,1,1,2,3], and you move from > nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to > [1,2,3]. That is nice. And I prefer to emphasize the simple use case. Users don't need to specify interleave weight always. Just use MPOL_WEIGHTED_INTERLEAVE policy, and system will provide reasonable default weight. 
> If the structure is built as a matrix of (cpu_node,mem_nodes), > the you can also optimize based on the node the task is running on. The matrix stuff makes the situation complex. If people do need something like that, they can just use set_mempolicy2() with user-specified weights. I still believe in "make simple stuff simple, and complex stuff possible". > That feels very intuitive, deals with many race condition issues, and > the global setting can actually be implemented without the need for > set_mempolicy2 at all - which is certainly a bonus. > > Would love more thoughts here. Will have a new RFC with set_mempolicy2, > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above. Thanks for doing all these! -- Best Regards, Huang, Ying
On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > If the structure is built as a matrix of (cpu_node,mem_nodes), > > the you can also optimize based on the node the task is running on. > > The matrix stuff makes the situation complex. If people do need > something like that, they can just use set_memorypolicy2() with user > specified weights. I still believe that "make simple stuff simple, and > complex stuff possible". > I don't think it's particularly complex, since we already have a distance matrix for NUMA nodes:

available: 2 nodes (0-1)
... snip ...
node distances:
node   0   1
  0:  10  21
  1:  21  10

This would follow the same thing, just adjustable for bandwidth. I personally find the (src,dst) matrix very important for flexibility. But if there is particular pushback against it, having a one-dimensional array is better than not having it, so I will take what I can get. > > That feels very intuitive, deals with many race condition issues, and > > the global setting can actually be implemented without the need for > > set_mempolicy2 at all - which is certainly a bonus. > > > > Would love more thoughts here. Will have a new RFC with set_mempolicy2, > > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above. > > Thanks for doing all these! > Someone's got to :] ~Gregory
Gregory Price <gregory.price@memverge.com> writes: > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote: >> Gregory Price <gregory.price@memverge.com> writes: >> >> > If the structure is built as a matrix of (cpu_node,mem_nodes), >> > the you can also optimize based on the node the task is running on. >> >> The matrix stuff makes the situation complex. If people do need >> something like that, they can just use set_memorypolicy2() with user >> specified weights. I still believe that "make simple stuff simple, and >> complex stuff possible". >> > > I don't think it's particularly complex, since we already have a > distance matrix for numa nodes: > > available: 2 nodes (0-1) > ... snip ... > node distances: > node 0 1 > 0: 10 21 > 1: 21 10 > > This would follow the same thing, just adjustable for bandwidth. We add complexity based on requirements, not because something similar already exists. > I personally find the (src,dst) matrix very important for flexibility. With set_mempolicy2(), I think we have the needed flexibility for users who need the complexity. > But if there is particular pushback against it, having a one dimensional > array is better than not having it, so I will take what I can get. TBH, I don't think that we really need that. Especially given we will have set_mempolicy2(). >> > That feels very intuitive, deals with many race condition issues, and >> > the global setting can actually be implemented without the need for >> > set_mempolicy2 at all - which is certainly a bonus. >> > >> > Would love more thoughts here. Will have a new RFC with set_mempolicy2, >> > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above. >> >> Thanks for doing all these! >> > > Someone's got to :] > -- Best Regards, Huang, Ying
On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote: > >> Gregory Price <gregory.price@memverge.com> writes: > >> > >> > If the structure is built as a matrix of (cpu_node,mem_nodes), > >> > the you can also optimize based on the node the task is running on. > >> > >> The matrix stuff makes the situation complex. If people do need > >> something like that, they can just use set_memorypolicy2() with user > >> specified weights. I still believe that "make simple stuff simple, and > >> complex stuff possible". > >> > > > > I don't think it's particularly complex, since we already have a > > distance matrix for numa nodes: > > > > available: 2 nodes (0-1) > > ... snip ... > > node distances: > > node 0 1 > > 0: 10 21 > > 1: 21 10 > > > > This would follow the same thing, just adjustable for bandwidth. > > We add complexity to meet a real requirement, not just because something > similar already exists. > > > I personally find the (src,dst) matrix very important for flexibility. > > With set_memorypolicy2(), I think we have the needed flexibility for > users who need the complexity. > > > But if there is particular pushback against it, having a one dimensional > > array is better than not having it, so I will take what I can get. > > TBH, I don't think that we really need that. Especially given we will > have set_memorypolicy2(). > From a complexity standpoint, it is exactly as complex as the hardware configuration itself: each socket has a different view of the memory topology. If you have a non-homogeneous memory configuration (e.g. a different number of CXL expanders on one socket than the other), a flat array of weights has no way of capturing this hardware configuration. That makes the feature significantly less useful. In fact, it makes the feature equivalent to set_mempolicy2 - except that weights could be changed at runtime from outside a process. A matrix resolves one very specific use case: task migration. set_mempolicy2 is not sufficient to solve this. There is presently no way for an external task to change the mempolicy of an existing task. That means a task must become "migration aware" to use weighting in the context of containers where migrations are likely. Two things to consider: A task... a) has no way of knowing a migration occurred b) may not have visibility of numa nodes outside its cpusets prior to a migration - making it unlikely or even impossible for it to set weights correctly in the event a migration occurs. If a server with 2 sockets is set up non-homogeneously (a different number of CXL memory expanders on each socket), then the effective bandwidth distribution between sockets will be different. If a container is migrated between sockets in this situation, then tasks with manually set weights, or tasks relying on global weights stored as a single array, will have a poor memory distribution relative to the new view of the system. Requiring the global settings to be an array basically requires global weights to be sub-optimal for any use case that is not explicitly a single workload that consumes all the cores on the system. If the system provides a matrix, then the global settings can be optimal and re-weighting in response to migration happens cleanly and transparently. ~Gregory
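To illustrate the "migration aware" burden described above, here is a rough userspace sketch, not part of the patchset, of the best a task could do today: poll its current node and react after the fact. The reapply_weights() helper is hypothetical and merely stands in for re-deriving weights and calling the proposed set_mempolicy2(); nothing here removes the fundamental problems that the task only notices the migration after it happens and may not even have seen the new nodes in its cpuset beforehand.

/* Userspace sketch (assumes libnuma; build with -lnuma). Not from the
 * patchset: it only shows the polling dance a "migration aware" task
 * would be forced into without externally managed weights. */
#define _GNU_SOURCE
#include <numa.h>	/* numa_available(), numa_node_of_cpu() */
#include <sched.h>	/* sched_getcpu() */
#include <unistd.h>

static void reapply_weights(int node)
{
	/* Hypothetical: recompute task-local weights for 'node' and pass
	 * them to the proposed set_mempolicy2(). No such syscall exists
	 * at the time of this discussion. */
	(void)node;
}

int main(void)
{
	int last_node;

	if (numa_available() < 0)
		return 1;

	last_node = numa_node_of_cpu(sched_getcpu());
	for (;;) {
		int node = numa_node_of_cpu(sched_getcpu());

		if (node != last_node) {	/* migrated since last check */
			reapply_weights(node);
			last_node = node;
		}
		sleep(1);			/* polling is all we have */
	}
}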
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote: >> Gregory Price <gregory.price@memverge.com> writes: >> >> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote: >> >> Gregory Price <gregory.price@memverge.com> writes: >> >> >> >> > If the structure is built as a matrix of (cpu_node,mem_nodes), >> >> > the you can also optimize based on the node the task is running on. >> >> >> >> The matrix stuff makes the situation complex. If people do need >> >> something like that, they can just use set_memorypolicy2() with user >> >> specified weights. I still believe that "make simple stuff simple, and >> >> complex stuff possible". >> >> >> > >> > I don't think it's particularly complex, since we already have a >> > distance matrix for numa nodes: >> > >> > available: 2 nodes (0-1) >> > ... snip ... >> > node distances: >> > node 0 1 >> > 0: 10 21 >> > 1: 21 10 >> > >> > This would follow the same thing, just adjustable for bandwidth. >> >> We add complexity to meet a real requirement, not just because something >> similar already exists. >> >> > I personally find the (src,dst) matrix very important for flexibility. >> >> With set_memorypolicy2(), I think we have the needed flexibility for >> users who need the complexity. >> >> > But if there is particular pushback against it, having a one dimensional >> > array is better than not having it, so I will take what I can get. >> >> TBH, I don't think that we really need that. Especially given we will >> have set_memorypolicy2(). >> > > From a complexity standpoint, it is exactly as complex as the hardware > configuration itself: each socket has a different view of the memory > topology. If you have a non-homogeneous memory configuration (e.g. a > different number of CXL expanders on one socket than the other), a flat > array of weights has no way of capturing this hardware configuration. One important task of the software is to hide the complexity of the hardware from the users, or at least provide the option to do so. We should only add complexity based on real requirements. > That makes the feature significantly less useful. In fact, it makes the > feature equivalent to set_mempolicy2 - except that weights could be > changed at runtime from outside a process. > > > A matrix resolves one very specific use case: task migration. > > > set_mempolicy2 is not sufficient to solve this. There is presently no > way for an external task to change the mempolicy of an existing task. > That means a task must become "migration aware" to use weighting in the > context of containers where migrations are likely. > > Two things to consider: A task... > a) has no way of knowing a migration occurred > b) may not have visibility of numa nodes outside its cpusets prior to > a migration - making it unlikely or even impossible for it to set > weights correctly in the event a migration occurs. > > If a server with 2 sockets is set up non-homogeneously (a different number > of CXL memory expanders on each socket), then the effective bandwidth > distribution between sockets will be different. > > If a container is migrated between sockets in this situation, then tasks > with manually set weights, or tasks relying on global weights stored as a > single array, will have a poor memory distribution relative to the new view > of the system. > > Requiring the global settings to be an array basically requires global > weights to be sub-optimal for any use case that is not explicitly a > single workload that consumes all the cores on the system.
> > If the system provides a matrix, then the global settings can be optimal > and re-weighting in response to migration happens cleanly and transparently. For these complex requirements, we will have process_set_mempolicy2(). I think that it's even more flexible than the global matrix. -- Best Regards, Huang, Ying
On Wed, Dec 06, 2023 at 08:50:23AM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > > From a complexity standpoint, it is exactly as complex as the hardware > > configuration itself: each socket has a different view of the memory > > topology. If you have a non-homogeneous memory configuration (e.g. a > > different number of CXL expanders on one socket than the other), a flat > > array of weights has no way of capturing this hardware configuration. > > One important task of the software is to hide the complexity of the hardware > from the users, or at least provide the option to do so. We should only add > complexity based on real requirements. > The global weights are intended to help administrators hide that complexity from actual end-users. The administrator of a system should already be aware of the hardware configuration; however, to hide this complexity even further, a system service can be made which auto-configures these weights at system bring-up and on memory-device hotplug. A system service can use ACPI HMAT (ACPI Heterogeneous Memory Attribute Table) information to automatically set the global weight information at boot time and/or on hotplug. Such extensions have already been proposed in prior RFCs and on the cxl mailing list. To break this down a little more explicitly into 6 example use-cases, let's consider the potential ways in which weighted interleave may be set via set_mempolicy() or set_mempolicy2(). 1. Actual end-user software calls it directly (or through libnuma) a) they can call set_mempolicy() without task-weights and accept the administrator-configured global weights b) they can call set_mempolicy2() with task-weights and use task-local weighting 2. Actual end-user uses `numactl -w[weights] --interleave ...` a) if weights are not defined, use global weights b) if weights are defined, use task-local weights 3. Administrator / Orchestrator opts user-software into weighted interleave by wrapping their software into `numactl -w --interleave` a) if weights are not defined, use global weights b) if weights are defined, use task-local weights The most common use case is likely to be (3a) - an administrator opting a user-workload into weighted-interleave via `numactl -w --interleave`, or an orchestrator such as Kubernetes doing something similar on pod/container dispatching. In all cases where the user does not define weights, they are trusting the administrator (or system daemon) to set weights that provide the optimal distribution, removing the complexity of understanding the hardware environment from the end-user. In all cases where the user does define weights, they are accepting the complexity of understanding the hardware environment. On the topic of the ACTUAL complexity of system hardware that is being hidden, we must consider a non-homogeneous bandwidth environment. The simplest form is an off-the-shelf Intel 2-socket server with CXL memory expanders. Let's consider a 2-socket system with the following configuration: DRAM on Socket0: 300GB/s local DRAM bandwidth (node 0) DRAM on Socket1: 300GB/s local DRAM bandwidth (node 1) CXL on socket0: 128GB/s bandwidth (node 2) CXL on socket1: 128GB/s bandwidth (node 3) A single linear array of weights is not sufficient to capture the complexities of bandwidth distributions on this system, because of the presence of a UPI link between socket0 and socket1, which changes the bandwidth distribution depending on where a task runs.
For example, 3 links of UPI provide 62.4GB/s full-duplex. From the perspective of socket 0, the following is true: Bandwidth to Socket0 DRAM: 300GB/s (node 0) Bandwidth to Socket0 CXL: 100GB/s (node 2) Aggregate bandwidth to nodes (1,3): 62.4GB/s From the perspective of socket 1, this changes to: Bandwidth to Socket1 DRAM: 300GB/s (node 1) Bandwidth to Socket1 CXL: 100GB/s (node 3) Aggregate bandwidth to nodes (0,2): 62.4GB/s With a single linear array of weights that applies to the entire system, you cannot represent this configuration. And in fact, a single configuration of weights will always provide a sub-optimal distribution. The pursuit of simplicity here defeats the entire goal of weighted interleave in a heterogeneous environment. > > For these complex requirements, we will have process_set_mempolicy2(). > I think that it's even more flexible than the global matrix. > process_set_mempolicy2() has a *very* long road to exist. The problem of mempolicy reference counting is non-trivial, and the plumbing requires changes to no fewer than 4 subsystems. Beyond that, the complexity of actually using process_set_mempolicy2() is the same as in any situation where set_mempolicy2() is used with task-local weights: the absolute highest. The global weighting matrix actually hides this complexity entirely. > -- > Best Regards, > Huang, Ying
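As a rough worked example of why the weight rows differ per source node, here is a small sketch using the figures quoted above; it is not from the patchset. It assumes weights proportional to bandwidth, with the 62.4GB/s UPI aggregate split evenly between the two remote nodes and everything normalized against the slowest node.

/* Rough illustration using the bandwidth figures above; the even UPI
 * split and the rounding scheme are assumptions, not patchset policy. */
#include <stdio.h>

int main(void)
{
	/* GB/s from socket 0 to nodes 0..3: local DRAM, remote DRAM,
	 * local CXL, remote CXL (remote nodes share the 62.4GB/s link). */
	double bw[4] = { 300.0, 31.2, 100.0, 31.2 };
	double slowest = 31.2;

	for (int node = 0; node < 4; node++) {
		/* normalize against the slowest node, round to nearest */
		int weight = (int)(bw[node] / slowest + 0.5);
		printf("node %d: weight %d\n", node, weight);
	}
	return 0;
}

This yields roughly {10, 1, 3, 1} for a task on socket 0 and, by symmetry, roughly {1, 10, 1, 3} for a task on socket 1. A single flat array can hold only one of those rows, which is exactly the limitation being argued here.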