Message ID: cover.1617642417.git.tim.c.chen@linux.intel.com
Series: Manage the top tier memory in a tiered memory
On Mon 05-04-21 10:08:24, Tim Chen wrote:
[...]
> To make fine grain cgroup based management of the precious top tier
> DRAM memory possible, this patchset adds a few new features:
> 1. Provides memory monitors on the amount of top tier memory used per cgroup
>    and by the system as a whole.
> 2. Applies soft limits on the top tier memory each cgroup uses
> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>    tier memory usages.

Could you be more specific on how this interface is supposed to be used?

> This allows us to provision different amount of top tier memory to each
> cgroup according to the cgroup's latency need.
>
> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> interface is the limit on the cgroup is a soft limit, so a cgroup can
> exceed the limit quite a bit before reclaim via page demotion reins
> it in.

I have to say that I dislike abusing soft limit reclaim for this. In the
past we have learned that the existing implementation is unfixable and
changing the existing semantic impossible due to backward compatibility.
So I would really prefer the soft limit just find its rest rather than
see new potential usecases.

I haven't really looked into details of this patchset but from a cursory
look it seems like you are actually introducing NUMA aware limits into
memcg that would control consumption from some nodes differently than
other nodes. This would be a rather alien concept to the existing memcg
infrastructure IMO. It looks like it is fusing borders between the memcg
and cpuset controllers.

You also seem to be basing the interface on a very specific usecase.
Can we expect that there will be many different tiers requiring their
own balancing?
On 4/6/21 2:08 AM, Michal Hocko wrote:
> On Mon 05-04-21 10:08:24, Tim Chen wrote:
> [...]
>> To make fine grain cgroup based management of the precious top tier
>> DRAM memory possible, this patchset adds a few new features:
>> 1. Provides memory monitors on the amount of top tier memory used per cgroup
>>    and by the system as a whole.
>> 2. Applies soft limits on the top tier memory each cgroup uses
>> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>>    tier memory usages.

Michal,

Thanks for giving your feedback. Much appreciated.

> Could you be more specific on how this interface is supposed to be used?

We created a README section on the cgroup control part of this patchset:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
to illustrate how this interface should be used.

The top tier memory used is reported in

	memory.toptier_usage_in_bytes

The amount of top tier memory usable by each cgroup without triggering
page reclaim is controlled by the

	memory.toptier_soft_limit_in_bytes

knob for each cgroup. We anticipate that for cgroup v2, we will have

	memory_toptier.max  (max allowed top tier memory)
	memory_toptier.high (aggressive page demotion from top tier memory)
	memory_toptier.min  (no page demotion from top tier memory below this threshold)

analogous to the existing memory.max, memory.high and memory.min
controllers.

>> This allows us to provision different amount of top tier memory to each
>> cgroup according to the cgroup's latency need.
>>
>> The patchset is based on cgroup v1 interface. One shortcoming of the v1
>> interface is the limit on the cgroup is a soft limit, so a cgroup can
>> exceed the limit quite a bit before reclaim via page demotion reins
>> it in.
>
> I have to say that I dislike abusing soft limit reclaim for this. In the
> past we have learned that the existing implementation is unfixable and
> changing the existing semantic impossible due to backward compatibility.
> So I would really prefer the soft limit just find its rest rather than
> see new potential usecases.

Do you think we can reuse some of the existing soft reclaim machinery
for the v2 interface?

More particularly, can we treat memory_toptier.high in cgroup v2 as a
soft limit? We sort how much each mem cgroup exceeds
memory_toptier.high and go after the cgroups that have the largest
excess first for page demotion. Will appreciate it if you can shed some
insights on what could go wrong with such an approach.

> I haven't really looked into details of this patchset but from a cursory
> look it seems like you are actually introducing NUMA aware limits into
> memcg that would control consumption from some nodes differently than
> other nodes. This would be a rather alien concept to the existing memcg
> infrastructure IMO. It looks like it is fusing borders between the memcg
> and cpuset controllers.

Want to make sure I understand what you mean by NUMA aware limits.
Yes, in the patch set, it does treat the NUMA nodes differently.
We are putting a constraint on the "top tier" RAM nodes vs the lower
tier PMEM nodes. Is this what you mean?

I can see it does have some flavor of the cpuset controller. In this
case, it doesn't explicitly set a node as allowed or forbidden as in
cpuset, but puts some constraints on the usage of a group of nodes.

Do you have suggestions on an alternative controller for allocating
tiered memory resources?

> You also seem to be basing the interface on a very specific usecase.
> Can we expect that there will be many different tiers requiring their
> own balancing?

You mean more than two tiers of memory? We did think a bit about
systems that have stuff like high bandwidth memory that's faster than
DRAM. Our thought is usage and freeing of that memory will require
explicit assignment (not used by default), so it will be outside the
realm of auto balancing. So at this point, we think two tiers will be
good.

Tim
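For the record, a minimal sketch of how the v1 knobs described in the reply above would be exercised. This assumes the patched kernel from the tiering branch and a v1 memory cgroup mount; the cgroup name and the 2G value are invented for illustration and none of this exists in mainline.

```shell
# Requires the patched kernel and root; cgroup name and limit are made up.
cd /sys/fs/cgroup/memory
mkdir high_prio_job

# Allow this cgroup 2 GiB of top tier (DRAM) memory before kswapd
# starts demoting its pages to the lower tier.
echo 2G > high_prio_job/memory.toptier_soft_limit_in_bytes

# Monitor how much top tier memory the cgroup currently uses.
cat high_prio_job/memory.toptier_usage_in_bytes
```

Both files are per-cgroup, so each job's DRAM budget can be provisioned independently.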
On Wed 07-04-21 15:33:26, Tim Chen wrote:
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
>
> Michal,
>
> Thanks for giving your feedback. Much appreciated.
>
> > Could you be more specific on how this interface is supposed to be used?
>
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
>
> 	memory.toptier_usage_in_bytes
>
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
>
> 	memory.toptier_soft_limit_in_bytes

Are you trying to say that the soft limit acts as some sort of
guarantee? Does that mean that if the memcg is under memory pressure
top tier memory is opted out from any reclaim if the usage is not in
excess?

From your previous email it sounds more like the limit is evaluated on
the global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote a toptier numa node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. go after the biggest
excessed memcg (with priority 0 - aka potential full LRU scan) and then
continue with a normal reclaim. This can be really disruptive to the
top user.

So you can likely define a more sane semantic. E.g. push back memcgs
proportional to their excess, but then we have two different soft limit
behaviors, which is bad as well. I am not really sure there is a
sensible way out by (ab)using the soft limit here.

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectively
discriminate those memcgs. There has been a similar problem with the
soft limit we have in general. Is this really what you are looking for?
What would be a typical usecase?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim via page demotion reins
> >> it in.
> >
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantic impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
>
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
>
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?

No, you should follow the existing limits semantics. The high limit
acts as an allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroups that have the largest excess first for page demotion.
> Will appreciate it if you can shed some insights on what could go wrong
> with such an approach.

This cannot work as a throttling interface.

> > I haven't really looked into details of this patchset but from a cursory
> > look it seems like you are actually introducing NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be a rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing borders between the memcg
> > and cpuset controllers.
>
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting a constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes. Is this what you mean?

What I am trying to say (and I have brought that up when demotion has
been discussed at LSFMM) is that the implementation shouldn't be PMEM
aware. The specific technology shouldn't be imprinted into the
interface. Fundamentally you are trying to balance memory among NUMA
nodes as we do not have another abstraction to use. So rather than
talking about top, secondary, nth tier we have different NUMA nodes
with different characteristics and you want to express your
"priorities" for them.

> I can see it does have some flavor of the cpuset controller. In this
> case, it doesn't explicitly set a node as allowed or forbidden as in
> cpuset, but puts some constraints on the usage of a group of nodes.
>
> Do you have suggestions on an alternative controller for allocating
> tiered memory resources?

I am not really sure what would be the best interface to be honest.
Maybe we want to carve this into memcg in some form of node priorities
for the reclaim. None of the existing limits is NUMA aware so far.
Maybe we want to say hammer this node more than others if there is
memory pressure. Not sure that would help your particular usecase
though.

> > You also seem to be basing the interface on a very specific usecase.
> > Can we expect that there will be many different tiers requiring their
> > own balancing?
>
> You mean more than two tiers of memory? We did think a bit about
> systems that have stuff like high bandwidth memory that's faster than
> DRAM. Our thought is usage and freeing of that memory will require
> explicit assignment (not used by default), so it will be outside the
> realm of auto balancing. So at this point, we think two tiers will be
> good.

Please keep in mind that once there is an interface it will be
impossible to change in the future. So do not bind yourself to the
2 tier setups that you have in hands right now.
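To make the "largest excess first" ordering under discussion concrete, here is a toy shell model of it. It operates on mock usage/soft-limit pairs in a temp file rather than real cgroup files, and the cgroup names and byte values are invented; it only illustrates the selection policy Tim proposes, not the kernel implementation.

```shell
#!/bin/sh
# Mock per-cgroup data: name, toptier usage (bytes), toptier soft limit.
demo=$(mktemp -d)
printf '%s\n' "jobA 8000000 4000000" \
              "jobB 5000000 4500000" \
              "jobC 9000000 9000000" > "$demo/cgroups"

# Compute each cgroup's excess over its soft limit and sort descending,
# so the biggest offender would be targeted for page demotion first.
winner=$(awk '{ excess = $2 - $3; if (excess > 0) print excess, $1 }' \
             "$demo/cgroups" | sort -rn | head -n 1)
echo "$winner"
```

Here jobA exceeds its limit by 4000000 bytes versus jobB's 500000, so jobA would be demoted first; jobC, at its limit, is left alone.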
Hi Tim,

On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> others NUMA wise, but a byte of media has about the same cost whether it
> is close or far. But, with new memory tiers such as Persistent Memory
> (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> PMEM.
>
> The fast/expensive memory lives in the top tier of the memory hierarchy.
>
> Previously, the patchset
>   [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>   https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> provides a mechanism to demote cold pages from DRAM node into PMEM.
>
> And the patchset
>   [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
>   https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> provides a mechanism to promote hot pages in PMEM to the DRAM node
> leveraging autonuma.
>
> The two patchsets together keep the hot pages in DRAM and colder pages
> in PMEM.

Thanks for working on this as this is becoming more and more important
particularly in the data centers where memory is a big portion of the
cost.

I see you have responded to Michal and I will add my more specific
response there. Here I wanted to give my high level concern regarding
using v1's soft limit like semantics for top tier memory.

This patch series aims to distribute/partition top tier memory between
jobs of different priorities. We want high priority jobs to have
preferential access to the top tier memory and we don't want low
priority jobs to hog the top tier memory.

Using v1's soft limit like behavior can potentially cause high priority
jobs to stall to make enough space on top tier memory on their
allocation path, and I think this patchset is aiming to reduce that
impact by making kswapd do that work. However I think the more
concerning issue is the low priority job hogging the top tier memory.

The possible ways the low priority job can hog the top tier memory are
by allocating non-movable memory or by mlocking the memory. (Oh there
is also pinning the memory but I don't know if there is a user api to
pin memory?) For the mlocked memory, you need to either modify the
reclaim code or use a different mechanism for demoting cold memory.

Basically I am saying we should put the upfront control (limit) on the
usage of top tier memory by the jobs.
On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Tim,
>
> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> [...]
>
> Thanks for working on this as this is becoming more and more important
> particularly in the data centers where memory is a big portion of the
> cost.
>
[...]
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.

Do you mean a long term pin? RDMA should be able to simply pin the
memory for weeks. A lot of transient pins come from Direct I/O. They
should be less of a concern.

The low priority jobs should be able to be restricted by cpuset, for
example, just keep them on second tier memory nodes. Then all the
above problems are gone.

> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.

This sounds similar to what I talked about in LSFMM 2019
(https://lwn.net/Articles/787418/). We used to have some potential
usecases which divide the DRAM:PMEM ratio for different jobs or memcgs
when I was with Alibaba.

In the first place I thought about a per NUMA node limit, but it was
very hard to configure correctly for users unless you know exactly
your memory usage and hot/cold memory distribution.

I'm wondering, just off the top of my head, if we could extend the
semantics of the low and min limits. For example, just redefine low and
min to "the limit on top tier memory". Then we could have low priority
jobs have a 0 low/min limit.
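Yang's cpuset suggestion above can be sketched with the existing v1 cpuset controller. The node numbers are hypothetical (here node 0 is a DRAM node and nodes 2-3 are PMEM nodes); the actual topology depends on the machine.

```shell
# Confine a low priority job to PMEM-only nodes with cpuset (v1 shown).
# Node numbers are made up: assume node 0 is DRAM, nodes 2-3 are PMEM.
cd /sys/fs/cgroup/cpuset
mkdir low_prio

echo 2-3 > low_prio/cpuset.mems   # allocate only from the PMEM nodes
echo 0-7 > low_prio/cpuset.cpus   # cpuset requires cpus to be set too
echo $$  > low_prio/tasks         # move the current shell into the cpuset
```

This is the "extreme" isolation Shakeel refers to next: the job can never touch top tier memory at all, rather than being capped in how much it uses.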
On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> [...]
> > The possible ways the low priority job can hog the top tier memory are
> > by allocating non-movable memory or by mlocking the memory. (Oh there
> > is also pinning the memory but I don't know if there is a user api to
> > pin memory?) For the mlocked memory, you need to either modify the
> > reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

Yes that's an extreme way to overcome the issue but we can do less
extreme by just (hard) limiting the top tier usage of low priority
jobs.

> > Basically I am saying we should put the upfront control (limit) on the
> > usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

The low and min limits have semantics similar to the v1's soft limit
for this situation, i.e. letting the low priority job occupy top tier
memory and depending on reclaim to take back the excess top tier
memory use of such jobs.

I have some thoughts on NUMA node limits which I will share in the
other thread.
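The "hard limiting" Shakeel advocates could look like the v2 knobs Tim anticipated earlier in the thread. To be clear, `memory_toptier.max` is hypothetical: it exists in neither mainline nor the posted patchset (which is v1 only), so this is purely a sketch of the proposed direction, with invented values.

```shell
# Hypothetical v2-style hard cap on a low priority job's top tier usage.
# memory_toptier.max is a proposed knob from this thread, not a real file.
cd /sys/fs/cgroup
mkdir low_prio_job

echo 512M > low_prio_job/memory_toptier.max   # hard cap on DRAM-tier bytes
echo max  > low_prio_job/memory.max           # total memory stays uncapped
```

Unlike the soft limit, hitting such a cap would have to fail or throttle top tier allocations upfront (falling back to PMEM), rather than relying on reclaim to demote the excess later.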
On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> [...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
>
> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation, i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.

I don't get why low priority jobs can *not* use top tier memory? I can
see it may incur latency overhead for high priority jobs. If it is not
allowed at all, it could be restricted by cpuset without introducing
any new interfaces.

I suppose memory utilization could be maximized by allowing all jobs
to allocate memory from all applicable nodes, then letting the
reclaimer (or something new if needed) do the job of migrating the
memory to the proper nodes over time. We could achieve some kind of
balance between memory utilization and resource isolation.

> I have some thoughts on NUMA node limits which I will share in the other thread.

Looking forward to reading it.
Yang Shi <shy828301@gmail.com> writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> [...]
>> The possible ways the low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh there
>> is also pinning the memory but I don't know if there is a user api to
>> pin memory?) For the mlocked memory, you need to either modify the
>> reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM. But
the memory access pattern changes over time, so we need to migrate
pages between DRAM and PMEM to adapt to the change.

To avoid the hot pages being pinned in PMEM forever, one way is to
online the PMEM as movable zones. If so, and if the low priority jobs
are restricted by cpuset to allocate from PMEM only, we may fail to
run quite some workloads, as discussed in the following thread,

https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/

> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

Per my understanding, memory.low/min are for memory protection instead
of memory limiting. memory.high is for memory limiting.

Best Regards,
Huang, Ying
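The v2 semantics Huang Ying distinguishes here can be summarized with the real cgroup v2 knobs; the cgroup name and the sizes below are invented for illustration.

```shell
# cgroup v2 memory semantics: min/low protect, high/max limit.
# Values and the cgroup name are made up; requires a v2 mount and root.
cd /sys/fs/cgroup
mkdir high_prio

echo 4G > high_prio/memory.min   # hard protection: never reclaimed below this
echo 6G > high_prio/memory.low   # best-effort protection under outside pressure
echo 8G > high_prio/memory.high  # throttle and reclaim allocations above this
```

This is why redefining low/min as a top tier *limit* would invert their meaning: today a 0 low/min means "no protection", not "no top tier memory allowed".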
On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
[...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.

Yes, if the aim is to isolate some users from certain numa nodes then
cpuset is a good fit, but as Shakeel says this is very likely not what
this work is aiming for.

> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.

A per numa node high/hard limit would help with a more fine grained
control. The configuration would be tricky though. All low priority
memcgs would have to be carefully configured to leave enough for your
important processes. That includes also memory which is not accounted
to any memcg. The behavior of those limits would be quite tricky for
OOM situations as well due to the lack of a NUMA aware oom killer.
On Thu, Apr 8, 2021 at 7:58 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> >>
> >> Hi Tim,
> >>
> >> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >
[...]
> >>
> >> Thanks for working on this as this is becoming more and more important
> >> particularly in the data centers where memory is a big portion of the
> >> cost.
> >>
> >> I see you have responded to Michal and I will add my more specific
> >> response there. Here I wanted to give my high level concern regarding
> >> using v1's soft limit like semantics for top tier memory.
> >>
> >> This patch series aims to distribute/partition top tier memory between
> >> jobs of different priorities.
We want high priority jobs to have > >> preferential access to the top tier memory and we don't want low > >> priority jobs to hog the top tier memory. > >> > >> Using v1's soft limit like behavior can potentially cause high > >> priority jobs to stall to make enough space on top tier memory on > >> their allocation path and I think this patchset is aiming to reduce > >> that impact by making kswapd do that work. However I think the more > >> concerning issue is the low priority job hogging the top tier memory. > >> > >> The possible ways the low priority job can hog the top tier memory are > >> by allocating non-movable memory or by mlocking the memory. (Oh there > >> is also pinning the memory but I don't know if there is a user api to > >> pin memory?) For the mlocked memory, you need to either modify the > >> reclaim code or use a different mechanism for demoting cold memory. > > > > Do you mean long term pin? RDMA should be able to simply pin the > > memory for weeks. A lot of transient pins come from Direct I/O. They > > should be less concerned. > > > > The low priority jobs should be able to be restricted by cpuset, for > > example, just keep them on second tier memory nodes. Then all the > > above problems are gone. > > To optimize the page placement of a process between DRAM and PMEM, we > want to place the hot pages in DRAM and the cold pages in PMEM. But the > memory accessing pattern changes overtime, so we need to migrate pages > between DRAM and PMEM to adapt to the changing. > > To avoid the hot pages be pinned in PMEM always, one way is to online > the PMEM as movable zones. If so, and if the low priority jobs are > restricted by cpuset to allocate from PMEM only, we may fail to run > quite some workloads as being discussed in the following threads, > > https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/ Thanks for sharing the thread. 
It seems the configuration of movable zone + node bind is not supported
very well, or needs to evolve to support new use cases.

>
> >>
> >> Basically I am saying we should put the upfront control (limit) on the
> >> usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > use case which divides the DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about a per NUMA node limit, but it was
> > very hard to configure correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantics of the low and min limits. For example, just redefine low and
> > min to "the limit on top tier memory". Then we could have low priority
> > jobs have a 0 low/min limit.
>
> Per my understanding, memory.low/min are for memory protection
> rather than memory limiting; memory.high is for memory limiting.

Yes, it is not a limit. I just misused the term; I actually do mean
protection but typed "limit". Sorry for the confusion.

>
> Best Regards,
> Huang, Ying
On 4/8/21 4:52 AM, Michal Hocko wrote:

>> The top tier memory used is reported in
>>
>> memory.toptier_usage_in_bytes
>>
>> The amount of top tier memory usable by each cgroup without
>> triggering page reclaim is controlled by the
>>
>> memory.toptier_soft_limit_in_bytes
>

Michal,

Thanks for your comments. I would like to take a step back and
look at the eventual goal we envision: a mechanism to partition the
tiered memory between the cgroups.

A typical use case may be a system with two sets of tasks.
One set of tasks is very latency sensitive and we desire instantaneous
response from them. Another set of tasks will be running batch jobs
where latency and performance are not critical. In this case,
we want to carve out enough top tier memory such that the working set
of the latency sensitive tasks can fit entirely in the top tier memory.
The rest of the top tier memory can be assigned to the background tasks.

To achieve such cgroup based tiered memory management, we probably want
something like the following.

For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
where tier t_0 sits at the top and demotes to the lower tier.
We envision for this top tier memory t0 the following knobs and counters
in the cgroup memory controller

memory_t0.current	Current usage of tier 0 memory by the cgroup.

memory_t0.min		If tier 0 memory used by the cgroup falls below this low
			boundary, the memory will not be subjected to demotion
			to lower tiers to free up memory at tier 0.

memory_t0.low		Above this boundary, the tier 0 memory will be subjected
			to demotion. The demotion pressure will be proportional
			to the overage.

memory_t0.high		If tier 0 memory used by the cgroup exceeds this high
			boundary, allocation of tier 0 memory by the cgroup will
			be throttled. The tier 0 memory used by this cgroup
			will also be subjected to heavy demotion.

memory_t0.max		This will be a hard usage limit of tier 0 memory on the cgroup.
If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
This follows closely with the design of the general memory controller interface.

Will such an interface look sane and acceptable to everyone?

The patch set I posted is meant to be a straw man cgroup v1 implementation
and I readily admit that it falls short of the eventual functionality
we want to achieve. It is meant to solicit feedback from everyone on how
the tiered memory management should work.

> Are you trying to say that soft limit acts as some sort of guarantee?

No, the soft limit does not offer a guarantee. It only serves to keep
the usage of the top tier memory in the vicinity of the soft limit.

> Does that mean that if the memcg is under memory pressure top tier
> memory is opted out from any reclaim if the usage is not in excess?

In the prototype implementation, regular memory reclaim is still in effect
if we are under heavy memory pressure.

> From your previous email it sounds more like the limit is evaluated on
> the global memory pressure to balance specific memcgs which are in
> excess when trying to reclaim/demote a toptier numa node.

On a top tier node, if the free memory on the node falls below a
percentage, then we will start to reclaim/demote from the node.

> Soft limit reclaim has several problems. Those are historical and
> therefore the behavior cannot be changed. E.g. go after the biggest
> excessed memcg (with priority 0 - aka potential full LRU scan) and then
> continue with a normal reclaim. This can be really disruptive to the top
> user.

Thanks for pointing out these problems with the soft limit explicitly.

> So you can likely define a more sane semantic. E.g. push back memcgs
> proportional to their excess but then we have two different soft limit
> behaviors which is bad as well. I am not really sure there is a sensible
> way out by (ab)using the soft limit here.
>
> Also I am not really sure how this is going to be used in practice.
> There is no soft limit by default. So opting in would effectively
> discriminate those memcgs. There has been a similar problem with the
> soft limit we have in general. Is this really what you are looking for?
> What would be a typical use case?

>> Want to make sure I understand what you mean by NUMA aware limits.
>> Yes, in the patch set, it does treat the NUMA nodes differently.
>> We are putting constraints on the "top tier" RAM nodes vs the lower
>> tier PMEM nodes. Is this what you mean?
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.

With node priorities, how would the system reserve enough
high performance memory for the performance critical task cgroups?

By priority, do you mean the order of allocation of nodes for a cgroup?
Or do you mean that all similarly performing memory nodes will be
grouped in the same priority?

Tim
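As an illustration of the knob semantics Tim describes, here is a small
model in Python of how tier 0 demotion could be decided for one cgroup.
The memory_t0.* knobs are this thread's proposal, not an existing kernel
interface, and the rules below are only one possible reading of
"demotion pressure proportional to the overage":

```python
def tier0_demotion(usage, t0_min, t0_low, t0_high, system_pressure=False):
    """Return (demotable_bytes, throttle) for one cgroup's tier 0 usage.

    - below low: left alone, unless the whole node is under pressure,
      in which case anything above the protected minimum is fair game
    - above low: demotion proportional to the overage above low
    - above high: heavy demotion plus allocation throttling
    (t0_max would be enforced separately as a hard cap at allocation time)
    """
    if system_pressure:
        demotable = max(0, usage - t0_min)   # min is a hard floor
    else:
        demotable = max(0, usage - t0_low)   # overage above low
    return demotable, usage >= t0_high

# A cgroup at 160 units with low=100 and high=150 would see 60 units
# of demotion pressure and have its tier 0 allocations throttled.
pressure, throttle = tier0_demotion(160, t0_min=10, t0_low=100, t0_high=150)
```

This is only a model of the text above; how per-cgroup soft limits
should interact with node-level memory pressure is exactly what the rest
of the thread debates.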
On Thu, Apr 8, 2021 at 1:50 PM Yang Shi <shy828301@gmail.com> wrote:
>
[...]
>
> > The low and min limits have semantics similar to the v1's soft limit
> > for this situation i.e. letting the low priority job occupy top tier
> > memory and depending on reclaim to take back the excess top tier
> > memory use of such jobs.
>
> I don't get why low priority jobs can *not* use top tier memory?

I am saying low priority jobs can use top tier memory. The only
difference is whether to limit them upfront (using limits) or to
reclaim from them later (using min/low/soft-limit).

> I can think it may incur latency overhead for high priority jobs. If it
> is not allowed, it could be restricted by cpuset without introducing
> any new interfaces.
>
> I suppose the memory utilization could be maximized by allowing all
> jobs to allocate memory from all applicable nodes, then let the
> reclaimer (or something new if needed)

Most probably something new, as we do want to consider unevictable
memory as well.

> do the job of migrating the memory to the proper
> nodes over time. We could achieve some kind of balance between memory
> utilization and resource isolation.
>

The tradeoff between utilization and isolation should be decided by the
user/admin.
On Thu, Apr 8, 2021 at 4:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.
>

I am also inclined towards a NUMA-based approach. It makes the solution
more general, and even existing systems with multiple NUMA nodes and
DRAM can take advantage of this approach (if it makes sense).
On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
[...]
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
[...]
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.
>
> Will such an interface look sane and acceptable to everyone?
>

I have a couple of questions. Let's suppose we have a two socket
system: Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)

2) If yes to (1), is that assumption future proof? Will future
systems with DRAM over CXL support have the same characteristics?

3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
might be a third tier, and similarly for jobs running on node_1,
node_2 might be a third tier.

The reason I am asking these questions is that statically partitioning
memory nodes into tiers will inherently add platform-specific
assumptions to the user API.

Assumptions like:
1) Access within a tier is always cheaper than access across tiers.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards having numa centric control is
that we don't have to make these assumptions. Though the usability
will be more difficult. Greg (CCed) has some ideas on making it better
and we will share our proposal after polishing it a bit more.
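Shakeel's two assumptions can be stated mechanically. Here is a sketch
that checks them against a NUMA distance matrix; the matrix below is a
made-up but plausible two socket DRAM+PMEM layout, not measurements from
real hardware:

```python
def check_tier_assumptions(dist, tiers):
    """dist[i][j] is the access cost from node i to node j; tiers is a
    list of node sets, tiers[0] being the top tier."""
    within = [dist[a][b] for t in tiers for a in t for b in t if a != b]
    across = [dist[a][b]
              for i, t in enumerate(tiers)
              for lower in tiers[i + 1:]
              for a in t for b in lower]
    # Assumption 1: every within-tier access is cheaper than every
    # cross-tier access.  Assumption 2: cross-tier cost is uniform.
    a1 = max(within, default=0) < min(across, default=float("inf"))
    a2 = len(set(across)) <= 1
    return a1, a2

# Nodes 0/1: DRAM on sockets 0/1; nodes 2/3: PMEM behind sockets 0/1.
# Local PMEM (17) is assumed cheaper than remote DRAM (21) here.
dist = [[10, 21, 17, 28],
        [21, 10, 28, 17],
        [17, 28, 10, 28],
        [28, 17, 28, 10]]
a1, a2 = check_tier_assumptions(dist, [{0, 1}, {2, 3}])
# Both come out False: cross-socket DRAM access costs more than local
# PMEM access, and cross-tier costs are not uniform.
```

With numbers like these, both assumptions break on a two socket box,
which is the point Jonathan makes below about large systems.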
Tim Chen <tim.c.chen@linux.intel.com> writes:

> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
[...]
>
> memory_t0.high	If tier 0 memory used by the cgroup exceeds this high
>		boundary, allocation of tier 0 memory by the cgroup will
>		be throttled. The tier 0 memory used by this cgroup
>		will also be subjected to heavy demotion.
I think we don't really need to throttle here, because we can fall back
to allocating memory from t1. That will not cause something like I/O
device bandwidth saturation.

Best Regards,
Huang, Ying
On Fri 09-04-21 16:26:53, Tim Chen wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.

OK, this is a good mission statement to start with. I would expect a
follow up to say what kind of granularity of control you want to
achieve here. Presumably you want more than all or nothing, because
that is what cpusets can be used for.

> A typical use case may be a system with two sets of tasks.
> One set of tasks is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.

While from a very high level this makes sense, I would be interested in
more details. Your highly latency sensitive applications very likely
want to be bound to a high performance node, right? Can they tolerate
memory reclaim? Can they consume more memory than the node size? What do
you expect to happen then?

> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.

How is each tier defined? Is this an admin-defined set of NUMA nodes, or
is it platform specific?

[...]

> Will such an interface look sane and acceptable to everyone?
Let's talk more about use cases first before we even start talking about
the interface or which controller is the best fit for implementing it.

> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admit that it falls short of the eventual functionality
> we want to achieve. It is meant to solicit feedback from everyone on how
> the tiered memory management should work.

OK, fair enough. Let me then just state that I strongly believe that the
soft limit based approach is a dead end, and it would be better to focus
on the actual use cases and try to understand what you want to achieve
first.

[...]

> > What I am trying to say (and I have brought that up when demotion has been
> > discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> > The specific technology shouldn't be imprinted into the interface.
> > Fundamentally you are trying to balance memory among NUMA nodes as we do
> > not have other abstraction to use. So rather than talking about top,
> > secondary, nth tier we have different NUMA nodes with different
> > characteristics and you want to express your "priorities" for them.
>
> With node priorities, how would the system reserve enough
> high performance memory for the performance critical task cgroups?
>
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or do you mean that all similarly performing memory nodes will be
> grouped in the same priority?

I have to say I do not yet have a clear idea of what those priorities
would look like. I just wanted to outline that the use cases you are
interested in likely want to implement some form of (application
transparent) control over memory distribution across several nodes.
There is a long way to go to land on something more specific, I am
afraid.
On Mon, 12 Apr 2021 12:20:22 -0700
Shakeel Butt <shakeelb@google.com> wrote:

> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
[...]
>
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
>
> My questions are:
>
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access across tiers? (node_0 <-> node_1 vs
> node_0 <-> node_2)

No, not in large systems, even if we can make this assumption in
two-socket ones.

> 2) If yes to (1), is that assumption future proof? Will future
> systems with DRAM over CXL support have the same characteristics?
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be a third tier, and similarly for jobs running on node_1,
> node_2 might be a third tier.
>
> The reason I am asking these questions is that statically partitioning
> memory nodes into tiers will inherently add platform-specific
> assumptions to the user API.

Absolutely agree.

>
> Assumptions like:
> 1) Access within a tier is always cheaper than access across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
>

Sounds good, will look out for that.

Jonathan
On 4/8/21 1:29 PM, Shakeel Butt wrote:

> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.
>
> I have some thoughts on NUMA node limits which I will share in the other thread.
>

Shakeel,

I look forward to the proposal on NUMA node limits. In which thread are
you going to post it? I want to make sure I don't miss it.

Tim
On 4/12/21 12:20 PM, Shakeel Butt wrote: >> >> memory_t0.current Current usage of tier 0 memory by the cgroup. >> >> memory_t0.min If tier 0 memory used by the cgroup falls below this low >> boundary, the memory will not be subjected to demotion >> to lower tiers to free up memory at tier 0. >> >> memory_t0.low Above this boundary, the tier 0 memory will be subjected >> to demotion. The demotion pressure will be proportional >> to the overage. >> >> memory_t0.high If tier 0 memory used by the cgroup exceeds this high >> boundary, allocation of tier 0 memory by the cgroup will >> be throttled. The tier 0 memory used by this cgroup >> will also be subjected to heavy demotion. >> >> memory_t0.max This will be a hard usage limit of tier 0 memory on the cgroup. >> >> If needed, memory_t[12...].current/min/low/high for additional tiers can be added. >> This follows closely with the design of the general memory controller interface. >> >> Will such an interface looks sane and acceptable with everyone? >> > > I have a couple of questions. Let's suppose we have a two socket > system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket > 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1). > Based on the tier definition of this patch series, tier_0: {node_0, > node_1} and tier_1: {node_2, node_3}. > > My questions are: > > 1) Can we assume that the cost of access within a tier will always be > less than the cost of access from the tier? (node_0 <-> node_1 vs > node_0 <-> node_2) I do assume that higher tier memory offers better performance (or less access latency) than a lower tier memory. Otherwise, this defeats the whole purpose of promoting hot memory from lower tier to a higher tier, and demoting cold memory to a lower tier. Tiers assumption is embedded once we define this promotion/demotion relationship between the numa nodes. So if node_m ----demotes----> node_n <---promotes---- then node_m is one tier higher tier than node_n. 
This promotion/demotion relationship between the nodes is the
underpinning of Dave and Ying's demotion and promotion patch sets.

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?

I think if you configure a promotion/demotion relationship between
CXL-attached DRAM and local socket-attached DRAM, you could divide them
into separate tiers. Or, if you don't care about the difference, you can
configure them without a promotion/demotion relationship and they will
be in the same tier. Balancing within the same tier will be handled by
the autonuma mechanism.

> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.

The tier definition is an admin's choice of where the admin thinks the
hot memory should reside, after looking at the memory performance. It
falls out of how the admin constructs the promotion/demotion
relationship between the nodes; the OS does not infer the tier
relationship from memory performance directly.

> The reason I am asking these questions is that statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions in the user API.
>
> Assumptions like:
> 1) Access within a tier is always cheaper than across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having NUMA centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.

I am still trying to understand how a NUMA centric control would
actually work. Putting limits on every NUMA node for each cgroup seems
to make the system configuration quite complicated.
Looking forward to your proposal so I can better understand that
perspective.

Tim
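To make the discussion above more concrete, here is a sketch of how an admin might drive the proposed memory_t0.* knobs. To be clear, none of these files exist in any current kernel; the file names, paths, and values are purely illustrative of the interface proposed earlier in this thread.

```shell
# Hypothetical cgroup v2 layout for a latency-sensitive job; the
# memory_t0.* files below are the *proposed* interface, not an
# existing kernel ABI.
cd /sys/fs/cgroup/high-pri-job

# Guarantee 8G of tier 0 (DRAM) that will never be demoted away.
echo 8G > memory_t0.min

# Above 16G of tier 0 usage, demotion pressure kicks in,
# proportional to the overage.
echo 16G > memory_t0.low

# Above 24G, tier 0 allocations are throttled and heavily demoted.
echo 24G > memory_t0.high

# Hard cap: the cgroup can never hold more than 32G of tier 0 memory.
echo 32G > memory_t0.max

# Current tier 0 usage, for monitoring and capacity planning.
cat memory_t0.current
```

The point of the sketch is that the knobs compose the same way as the existing memory.min/low/high/max interface, just scoped to a single tier.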
On 4/8/21 10:18 AM, Shakeel Butt wrote:

> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path, and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is a low priority job hogging the top tier memory.
>
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (There is
> also pinning the memory, but I don't know if there is a user API to
> pin memory.) For mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.
>
> Basically I am saying we should put an upfront control (limit) on the
> usage of top tier memory by the jobs.

Circling back to your comment here: I agree that the soft limit is
deficient in the scenario you have pointed out. Eventually I was
shooting for a hard limit on a memory tier for a cgroup, similar to the
v2 memory controller interface (see mail in the other thread). That
interface should satisfy the hard constraint you want to place on the
low priority jobs.

Tim
On 4/9/21 12:24 AM, Michal Hocko wrote:

> On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
>> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> [...]
>>> The low priority jobs should be able to be restricted by cpuset, for
>>> example, just keep them on second tier memory nodes. Then all the
>>> above problems are gone.
>
> Yes, if the aim is to isolate some users from certain numa nodes then
> cpuset is a good fit, but as Shakeel says this is very likely not what
> this work is aiming for.
>
>> Yes that's an extreme way to overcome the issue but we can do less
>> extreme by just (hard) limiting the top tier usage of low priority
>> jobs.
>
> Per numa node high/hard limits would help with more fine grained
> control. The configuration would be tricky though. All low priority
> memcgs would have to be carefully configured to leave enough for your
> important processes. That includes also memory which is not accounted
> to any memcg. The behavior of those limits would be quite tricky in
> OOM situations as well, due to the lack of a NUMA aware OOM killer.

Another downside of putting limits on individual NUMA nodes is that it
would limit flexibility. For example, two memory nodes may be similar
enough in performance that you really only care about a cgroup not
using more than a threshold of the combined capacity of the two nodes.
But when you put a hard limit on each NUMA node, you are tied to a
fixed allocation partition per node. Perhaps some kernel resources are
pre-allocated primarily from one node; a cgroup may then bump into the
limit on that node and fail the allocation, even when it has a lot of
slack on the other node. This makes getting the configuration right
trickier.

There are some differences of opinion currently on whether grouping
memory nodes into tiers, and putting limits on their use by cgroups, is
desirable.
Many people want the management constraint placed on individual NUMA
nodes for each cgroup, instead of at the tier level. I would appreciate
feedback from folks who have insights on how such a NUMA based control
interface would work, so we can at least agree here in order to move
forward.

Tim
On Thu 15-04-21 15:31:46, Tim Chen wrote:
> On 4/9/21 12:24 AM, Michal Hocko wrote:
> > On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> >> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> > [...]
> >>> The low priority jobs should be able to be restricted by cpuset, for
> >>> example, just keep them on second tier memory nodes. Then all the
> >>> above problems are gone.
> >
> > Yes, if the aim is to isolate some users from certain numa node then
> > cpuset is a good fit but as Shakeel says this is very likely not what
> > this work is aiming for.
> >
> >> Yes that's an extreme way to overcome the issue but we can do less
> >> extreme by just (hard) limiting the top tier usage of low priority
> >> jobs.
> >
> > Per numa node high/hard limit would help with a more fine grained control.
> > The configuration would be tricky though. All low priority memcgs would
> > have to be carefully configured to leave enough for your important
> > processes. That includes also memory which is not accounted to any
> > memcg.
> > The behavior of those limits would be quite tricky for OOM situations
> > as well due to a lack of NUMA aware oom killer.
>
> Another downside of putting limits on individual NUMA
> node is it would limit flexibility.

Let me just clarify one thing. I haven't been proposing per-NUMA limits.
As I've said above, they would be quite tricky to use and their behavior
would be tricky as well.

All I am saying is that we do not want an interface that is tightly
bound to a specific HW setup (fast DRAM as the top tier and PMEM as a
fallback) like the one you have proposed here. We want a generic NUMA
based abstraction. What that abstraction should look like is an open
question, and it really depends on the usecases we expect to see.