Message ID: cover.1617642417.git.tim.c.chen@linux.intel.com
Series: Manage the top tier memory in a tiered memory
On Mon 05-04-21 10:08:24, Tim Chen wrote:
[...]
> To make fine grain cgroup based management of the precious top tier
> DRAM memory possible, this patchset adds a few new features:
> 1. Provides memory monitors on the amount of top tier memory used per cgroup
>    and by the system as a whole.
> 2. Applies soft limits on the top tier memory each cgroup uses
> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>    tier memory usages.

Could you be more specific on how this interface is supposed to be used?

> This allows us to provision different amount of top tier memory to each
> cgroup according to the cgroup's latency need.
>
> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> interface is the limit on the cgroup is a soft limit, so a cgroup can
> exceed the limit quite a bit before reclaim via page demotion reins
> it in.

I have to say that I dislike abusing soft limit reclaim for this. In the
past we have learned that the existing implementation is unfixable and
changing the existing semantic impossible due to backward compatibility.
So I would really prefer the soft limit just find its rest rather than
see new potential usecases.

I haven't really looked into details of this patchset but from a cursory
look it seems like you are actually introducing NUMA aware limits into
memcg that would control consumption from some nodes differently than
other nodes. This would be a rather alien concept to the existing memcg
infrastructure IMO. It looks like it is fusing borders between the memcg
and cpuset controllers.

You also seem to be basing the interface on a very specific usecase.
Can we expect that there will be many different tiers requiring their
own balancing?
On 4/6/21 2:08 AM, Michal Hocko wrote:
> On Mon 05-04-21 10:08:24, Tim Chen wrote:
> [...]
>> To make fine grain cgroup based management of the precious top tier
>> DRAM memory possible, this patchset adds a few new features:
>> 1. Provides memory monitors on the amount of top tier memory used per cgroup
>>    and by the system as a whole.
>> 2. Applies soft limits on the top tier memory each cgroup uses
>> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>>    tier memory usages.

Michal,

Thanks for giving your feedback. Much appreciated.

> Could you be more specific on how this interface is supposed to be used?

We created a README section on the cgroup control part of this patchset:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
to illustrate how this interface should be used.

The top tier memory used is reported in

	memory.toptier_usage_in_bytes

The amount of top tier memory usable by each cgroup without triggering
page reclaim is controlled by the

	memory.toptier_soft_limit_in_bytes

knob for each cgroup. We anticipate that for cgroup v2, we will have

	memory_toptier.max  (max allowed top tier memory)
	memory_toptier.high (aggressive page demotion from top tier memory)
	memory_toptier.min  (no page demotion from top tier memory below this threshold)

analogous to the existing memory.max, memory.high and memory.min
controllers.

>> This allows us to provision different amount of top tier memory to each
>> cgroup according to the cgroup's latency need.
>>
>> The patchset is based on cgroup v1 interface. One shortcoming of the v1
>> interface is the limit on the cgroup is a soft limit, so a cgroup can
>> exceed the limit quite a bit before reclaim via page demotion reins
>> it in.
>
> I have to say that I dislike abusing soft limit reclaim for this. In the
> past we have learned that the existing implementation is unfixable and
> changing the existing semantic impossible due to backward compatibility.
> So I would really prefer the soft limit just find its rest rather than
> see new potential usecases.

Do you think we can reuse some of the existing soft reclaim machinery
for the v2 interface?

More particularly, can we treat memory_toptier.high in cgroup v2 as a
soft limit? We sort how much each mem cgroup exceeds
memory_toptier.high and go after the cgroups that have the largest
excess first for page demotion. Will appreciate it if you can shed some
insights on what could go wrong with such an approach.

> I haven't really looked into details of this patchset but from a cursory
> look it seems like you are actually introducing NUMA aware limits into
> memcg that would control consumption from some nodes differently than
> other nodes. This would be a rather alien concept to the existing memcg
> infrastructure IMO. It looks like it is fusing borders between the memcg
> and cpuset controllers.

Want to make sure I understand what you mean by NUMA aware limits.
Yes, in the patch set, it does treat the NUMA nodes differently.
We are putting a constraint on the "top tier" RAM nodes vs the lower
tier PMEM nodes. Is this what you mean?

I can see it does have some flavor of the cpuset controller. In this
case, it doesn't explicitly set a node as allowed or forbidden as in
cpuset, but puts some constraints on the usage of a group of nodes.

Do you have suggestions on an alternative controller for allocating
tiered memory resources?

> You also seem to be basing the interface on a very specific usecase.
> Can we expect that there will be many different tiers requiring their
> own balancing?

You mean more than two tiers of memory? We did think a bit about
systems that have stuff like high bandwidth memory that's faster than
DRAM. Our thought is usage and freeing of that memory will require
explicit assignment (not used by default), so it will be outside the
realm of auto balancing. So at this point, we think two tiers will be
good.

Tim
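For the record, a minimal sketch of how the v1 knobs described in the reply above would be exercised. This assumes the patched kernel from the tiering branch and a v1 memory cgroup mount; the cgroup name and the 2G value are invented for illustration and none of this exists in mainline.

```shell
# Requires the patched kernel and root; cgroup name and limit are made up.
cd /sys/fs/cgroup/memory
mkdir high_prio_job

# Allow this cgroup 2 GiB of top tier (DRAM) memory before kswapd
# starts demoting its pages to the lower tier.
echo 2G > high_prio_job/memory.toptier_soft_limit_in_bytes

# Monitor how much top tier memory the cgroup currently uses.
cat high_prio_job/memory.toptier_usage_in_bytes
```

Both files are per-cgroup, so each job's DRAM budget can be provisioned independently.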
On Wed 07-04-21 15:33:26, Tim Chen wrote:
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
>
> Michal,
>
> Thanks for giving your feedback. Much appreciated.
>
> > Could you be more specific on how this interface is supposed to be used?
>
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
>
> 	memory.toptier_usage_in_bytes
>
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
>
> 	memory.toptier_soft_limit_in_bytes

Are you trying to say that the soft limit acts as some sort of
guarantee? Does that mean that if the memcg is under memory pressure
top tier memory is opted out from any reclaim if the usage is not in
excess?

From your previous email it sounds more like the limit is evaluated on
the global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote a toptier numa node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. go after the biggest
excessed memcg (with priority 0 - aka potential full LRU scan) and then
continue with a normal reclaim. This can be really disruptive to the
top user.

So you can likely define a more sane semantic. E.g. push back memcgs
proportional to their excess, but then we have two different soft limit
behaviors, which is bad as well. I am not really sure there is a
sensible way out by (ab)using the soft limit here.

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectively
discriminate those memcgs. There has been a similar problem with the
soft limit we have in general. Is this really what you are looking for?
What would be a typical usecase?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim via page demotion reins
> >> it in.
> >
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantic impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
>
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
>
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?

No, you should follow the existing limits semantics. The high limit
acts as an allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroups that have the largest excess first for page demotion.
> Will appreciate it if you can shed some insights on what could go wrong
> with such an approach.

This cannot work as a throttling interface.

> > I haven't really looked into details of this patchset but from a cursory
> > look it seems like you are actually introducing NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be a rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing borders between the memcg
> > and cpuset controllers.
>
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting a constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes. Is this what you mean?

What I am trying to say (and I have brought that up when demotion has
been discussed at LSFMM) is that the implementation shouldn't be PMEM
aware. The specific technology shouldn't be imprinted into the
interface. Fundamentally you are trying to balance memory among NUMA
nodes as we do not have another abstraction to use. So rather than
talking about top, secondary, nth tier we have different NUMA nodes
with different characteristics and you want to express your
"priorities" for them.

> I can see it does have some flavor of the cpuset controller. In this
> case, it doesn't explicitly set a node as allowed or forbidden as in
> cpuset, but puts some constraints on the usage of a group of nodes.
>
> Do you have suggestions on an alternative controller for allocating
> tiered memory resources?

I am not really sure what would be the best interface to be honest.
Maybe we want to carve this into memcg in some form of node priorities
for the reclaim. None of the existing limits is NUMA aware so far.
Maybe we want to say hammer this node more than others if there is
memory pressure. Not sure that would help your particular usecase
though.

> > You also seem to be basing the interface on a very specific usecase.
> > Can we expect that there will be many different tiers requiring their
> > own balancing?
>
> You mean more than two tiers of memory? We did think a bit about
> systems that have stuff like high bandwidth memory that's faster than
> DRAM. Our thought is usage and freeing of that memory will require
> explicit assignment (not used by default), so it will be outside the
> realm of auto balancing. So at this point, we think two tiers will be
> good.

Please keep in mind that once there is an interface it will be
impossible to change in the future. So do not bind yourself to the
2 tier setups that you have in hands right now.
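To make the "largest excess first" ordering under discussion concrete, here is a toy shell model of it. It operates on mock usage/soft-limit pairs in a temp file rather than real cgroup files, and the cgroup names and byte values are invented; it only illustrates the selection policy Tim proposes, not the kernel implementation.

```shell
#!/bin/sh
# Mock per-cgroup data: name, toptier usage (bytes), toptier soft limit.
demo=$(mktemp -d)
printf '%s\n' "jobA 8000000 4000000" \
              "jobB 5000000 4500000" \
              "jobC 9000000 9000000" > "$demo/cgroups"

# Compute each cgroup's excess over its soft limit and sort descending,
# so the biggest offender would be targeted for page demotion first.
winner=$(awk '{ excess = $2 - $3; if (excess > 0) print excess, $1 }' \
             "$demo/cgroups" | sort -rn | head -n 1)
echo "$winner"
```

Here jobA exceeds its limit by 4000000 bytes versus jobB's 500000, so jobA would be demoted first; jobC, at its limit, is left alone.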
Hi Tim,

On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> others NUMA wise, but a byte of media has about the same cost whether it
> is close or far. But, with new memory tiers such as Persistent Memory
> (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> PMEM.
>
> The fast/expensive memory lives in the top tier of the memory hierarchy.
>
> Previously, the patchset
>   [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>   https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
> provides a mechanism to demote cold pages from DRAM node into PMEM.
>
> And the patchset
>   [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
>   https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
> provides a mechanism to promote hot pages in PMEM to the DRAM node
> leveraging autonuma.
>
> The two patchsets together keep the hot pages in DRAM and colder pages
> in PMEM.

Thanks for working on this as this is becoming more and more important
particularly in the data centers where memory is a big portion of the
cost.

I see you have responded to Michal and I will add my more specific
response there. Here I wanted to give my high level concern regarding
using v1's soft limit like semantics for top tier memory.

This patch series aims to distribute/partition top tier memory between
jobs of different priorities. We want high priority jobs to have
preferential access to the top tier memory and we don't want low
priority jobs to hog the top tier memory.

Using v1's soft limit like behavior can potentially cause high priority
jobs to stall to make enough space on top tier memory on their
allocation path, and I think this patchset is aiming to reduce that
impact by making kswapd do that work. However I think the more
concerning issue is the low priority job hogging the top tier memory.

The possible ways the low priority job can hog the top tier memory are
by allocating non-movable memory or by mlocking the memory. (Oh there
is also pinning the memory but I don't know if there is a user api to
pin memory?) For the mlocked memory, you need to either modify the
reclaim code or use a different mechanism for demoting cold memory.

Basically I am saying we should put the upfront control (limit) on the
usage of top tier memory by the jobs.
On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Tim,
>
> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> [...]
>
> Thanks for working on this as this is becoming more and more important
> particularly in the data centers where memory is a big portion of the
> cost.
>
[...]
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.

Do you mean a long term pin? RDMA should be able to simply pin the
memory for weeks. A lot of transient pins come from Direct I/O. They
should be less of a concern.

The low priority jobs should be able to be restricted by cpuset, for
example, just keep them on second tier memory nodes. Then all the
above problems are gone.

> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.

This sounds similar to what I talked about in LSFMM 2019
(https://lwn.net/Articles/787418/). We used to have some potential
usecases which divide the DRAM:PMEM ratio for different jobs or memcgs
when I was with Alibaba.

In the first place I thought about a per NUMA node limit, but it was
very hard to configure correctly for users unless you know exactly
your memory usage and hot/cold memory distribution.

I'm wondering, just off the top of my head, if we could extend the
semantics of the low and min limits. For example, just redefine low and
min to "the limit on top tier memory". Then we could have low priority
jobs have a 0 low/min limit.
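Yang's cpuset suggestion above can be sketched with the existing v1 cpuset controller. The node numbers are hypothetical (here node 0 is a DRAM node and nodes 2-3 are PMEM nodes); the actual topology depends on the machine.

```shell
# Confine a low priority job to PMEM-only nodes with cpuset (v1 shown).
# Node numbers are made up: assume node 0 is DRAM, nodes 2-3 are PMEM.
cd /sys/fs/cgroup/cpuset
mkdir low_prio

echo 2-3 > low_prio/cpuset.mems   # allocate only from the PMEM nodes
echo 0-7 > low_prio/cpuset.cpus   # cpuset requires cpus to be set too
echo $$  > low_prio/tasks         # move the current shell into the cpuset
```

This is the "extreme" isolation Shakeel refers to next: the job can never touch top tier memory at all, rather than being capped in how much it uses.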
On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> [...]
> > The possible ways the low priority job can hog the top tier memory are
> > by allocating non-movable memory or by mlocking the memory. (Oh there
> > is also pinning the memory but I don't know if there is a user api to
> > pin memory?) For the mlocked memory, you need to either modify the
> > reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

Yes that's an extreme way to overcome the issue but we can do less
extreme by just (hard) limiting the top tier usage of low priority
jobs.

> > Basically I am saying we should put the upfront control (limit) on the
> > usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

The low and min limits have semantics similar to the v1's soft limit
for this situation, i.e. letting the low priority job occupy top tier
memory and depending on reclaim to take back the excess top tier
memory use of such jobs.

I have some thoughts on NUMA node limits which I will share in the
other thread.
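The "hard limiting" Shakeel advocates could look like the v2 knobs Tim anticipated earlier in the thread. To be clear, `memory_toptier.max` is hypothetical: it exists in neither mainline nor the posted patchset (which is v1 only), so this is purely a sketch of the proposed direction, with invented values.

```shell
# Hypothetical v2-style hard cap on a low priority job's top tier usage.
# memory_toptier.max is a proposed knob from this thread, not a real file.
cd /sys/fs/cgroup
mkdir low_prio_job

echo 512M > low_prio_job/memory_toptier.max   # hard cap on DRAM-tier bytes
echo max  > low_prio_job/memory.max           # total memory stays uncapped
```

Unlike the soft limit, hitting such a cap would have to fail or throttle top tier allocations upfront (falling back to PMEM), rather than relying on reclaim to demote the excess later.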
On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> [...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
>
> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation, i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.

I don't get why low priority jobs can *not* use top tier memory? I can
see it may incur latency overhead for high priority jobs. If it is not
allowed at all, it could be restricted by cpuset without introducing
any new interfaces.

I suppose memory utilization could be maximized by allowing all jobs
to allocate memory from all applicable nodes, then letting the
reclaimer (or something new if needed) do the job of migrating the
memory to the proper nodes over time. We could achieve some kind of
balance between memory utilization and resource isolation.

> I have some thoughts on NUMA node limits which I will share in the other thread.

Looking forward to reading it.
Yang Shi <shy828301@gmail.com> writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> [...]
>> The possible ways the low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh there
>> is also pinning the memory but I don't know if there is a user api to
>> pin memory?) For the mlocked memory, you need to either modify the
>> reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM. But
the memory access pattern changes over time, so we need to migrate
pages between DRAM and PMEM to adapt to the change.

To avoid the hot pages being pinned in PMEM forever, one way is to
online the PMEM as movable zones. If so, and if the low priority jobs
are restricted by cpuset to allocate from PMEM only, we may fail to
run quite some workloads, as discussed in the following thread,

https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/

> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

Per my understanding, memory.low/min are for memory protection instead
of memory limiting. memory.high is for memory limiting.

Best Regards,
Huang, Ying
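The v2 semantics Huang Ying distinguishes here can be summarized with the real cgroup v2 knobs; the cgroup name and the sizes below are invented for illustration.

```shell
# cgroup v2 memory semantics: min/low protect, high/max limit.
# Values and the cgroup name are made up; requires a v2 mount and root.
cd /sys/fs/cgroup
mkdir high_prio

echo 4G > high_prio/memory.min   # hard protection: never reclaimed below this
echo 6G > high_prio/memory.low   # best-effort protection under outside pressure
echo 8G > high_prio/memory.high  # throttle and reclaim allocations above this
```

This is why redefining low/min as a top tier *limit* would invert their meaning: today a 0 low/min means "no protection", not "no top tier memory allowed".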
On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
[...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.

Yes, if the aim is to isolate some users from certain numa nodes then
cpuset is a good fit, but as Shakeel says this is very likely not what
this work is aiming for.

> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.

A per numa node high/hard limit would help with a more fine grained
control. The configuration would be tricky though. All low priority
memcgs would have to be carefully configured to leave enough for your
important processes. That includes also memory which is not accounted
to any memcg. The behavior of those limits would be quite tricky for
OOM situations as well due to the lack of a NUMA aware oom killer.
On Thu, Apr 8, 2021 at 7:58 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <shakeelb@google.com> wrote:
> >>
> >> Hi Tim,
> >>
> >> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >
[...]
> >>
> >> Thanks for working on this as this is becoming more and more important
> >> particularly in the data centers where memory is a big portion of the
> >> cost.
> >>
> >> I see you have responded to Michal and I will add my more specific
> >> response there. Here I wanted to give my high level concern regarding
> >> using v1's soft limit like semantics for top tier memory.
> >>
> >> This patch series aims to distribute/partition top tier memory between
> >> jobs of different priorities.
We want high priority jobs to have > >> preferential access to the top tier memory and we don't want low > >> priority jobs to hog the top tier memory. > >> > >> Using v1's soft limit like behavior can potentially cause high > >> priority jobs to stall to make enough space on top tier memory on > >> their allocation path and I think this patchset is aiming to reduce > >> that impact by making kswapd do that work. However I think the more > >> concerning issue is the low priority job hogging the top tier memory. > >> > >> The possible ways the low priority job can hog the top tier memory are > >> by allocating non-movable memory or by mlocking the memory. (Oh there > >> is also pinning the memory but I don't know if there is a user api to > >> pin memory?) For the mlocked memory, you need to either modify the > >> reclaim code or use a different mechanism for demoting cold memory. > > > > Do you mean long term pin? RDMA should be able to simply pin the > > memory for weeks. A lot of transient pins come from Direct I/O. They > > should be less concerned. > > > > The low priority jobs should be able to be restricted by cpuset, for > > example, just keep them on second tier memory nodes. Then all the > > above problems are gone. > > To optimize the page placement of a process between DRAM and PMEM, we > want to place the hot pages in DRAM and the cold pages in PMEM. But the > memory accessing pattern changes overtime, so we need to migrate pages > between DRAM and PMEM to adapt to the changing. > > To avoid the hot pages be pinned in PMEM always, one way is to online > the PMEM as movable zones. If so, and if the low priority jobs are > restricted by cpuset to allocate from PMEM only, we may fail to run > quite some workloads as being discussed in the following threads, > > https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/ Thanks for sharing the thread. 
It seems the configuration of movable zone + node bind is not supported
very well, or needs to evolve to support new use cases.

>
> >>
> >> Basically I am saying we should put the upfront control (limit) on the
> >> usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > use case which divides the DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about a per NUMA node limit, but it was
> > very hard to configure correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantics of the low and min limits. For example, just redefine low and
> > min to "the limit on top tier memory". Then we could have low priority
> > jobs have a 0 low/min limit.
>
> Per my understanding, memory.low/min are for memory protection
> rather than memory limiting; memory.high is for memory limiting.

Yes, it is not a limit. I just misused the term; I actually do mean
protection but typed "limit". Sorry for the confusion.

>
> Best Regards,
> Huang, Ying
On 4/8/21 4:52 AM, Michal Hocko wrote:

>> The top tier memory used is reported in
>>
>> memory.toptier_usage_in_bytes
>>
>> The amount of top tier memory usable by each cgroup without
>> triggering page reclaim is controlled by the
>>
>> memory.toptier_soft_limit_in_bytes
>

Michal,

Thanks for your comments. I would like to take a step back and
look at the eventual goal we envision: a mechanism to partition the
tiered memory between the cgroups.

A typical use case may be a system with two sets of tasks.
One set of tasks is very latency sensitive and we desire instantaneous
response from them. Another set of tasks will be running batch jobs
where latency and performance are not critical. In this case,
we want to carve out enough top tier memory such that the working set
of the latency sensitive tasks can fit entirely in the top tier memory.
The rest of the top tier memory can be assigned to the background tasks.

To achieve such cgroup based tiered memory management, we probably want
something like the following.

For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
where tier t_0 sits at the top and demotes to the lower tier.
We envision for this top tier memory t0 the following knobs and counters
in the cgroup memory controller

memory_t0.current	Current usage of tier 0 memory by the cgroup.

memory_t0.min		If tier 0 memory used by the cgroup falls below this low
			boundary, the memory will not be subjected to demotion
			to lower tiers to free up memory at tier 0.

memory_t0.low		Above this boundary, the tier 0 memory will be subjected
			to demotion. The demotion pressure will be proportional
			to the overage.

memory_t0.high		If tier 0 memory used by the cgroup exceeds this high
			boundary, allocation of tier 0 memory by the cgroup will
			be throttled. The tier 0 memory used by this cgroup
			will also be subjected to heavy demotion.

memory_t0.max		This will be a hard usage limit of tier 0 memory on the cgroup.
If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
This follows closely with the design of the general memory controller interface.

Will such an interface look sane and acceptable to everyone?

The patch set I posted is meant to be a straw man cgroup v1 implementation
and I readily admit that it falls short of the eventual functionality
we want to achieve. It is meant to solicit feedback from everyone on how
the tiered memory management should work.

> Are you trying to say that soft limit acts as some sort of guarantee?

No, the soft limit does not offer a guarantee. It only serves to keep
the usage of the top tier memory in the vicinity of the soft limit.

> Does that mean that if the memcg is under memory pressure top tier
> memory is opted out from any reclaim if the usage is not in excess?

In the prototype implementation, regular memory reclaim is still in effect
if we are under heavy memory pressure.

> From your previous email it sounds more like the limit is evaluated on
> the global memory pressure to balance specific memcgs which are in
> excess when trying to reclaim/demote a toptier numa node.

On a top tier node, if the free memory on the node falls below a
percentage, then we will start to reclaim/demote from the node.

> Soft limit reclaim has several problems. Those are historical and
> therefore the behavior cannot be changed. E.g. go after the biggest
> excessed memcg (with priority 0 - aka potential full LRU scan) and then
> continue with a normal reclaim. This can be really disruptive to the top
> user.

Thanks for pointing out these problems with the soft limit explicitly.

> So you can likely define a more sane semantic. E.g. push back memcgs
> proportional to their excess but then we have two different soft limit
> behaviors which is bad as well. I am not really sure there is a sensible
> way out by (ab)using the soft limit here.
>
> Also I am not really sure how this is going to be used in practice.
> There is no soft limit by default. So opting in would effectively
> discriminate those memcgs. There has been a similar problem with the
> soft limit we have in general. Is this really what you are looking for?
> What would be a typical use case?

>> Want to make sure I understand what you mean by NUMA aware limits.
>> Yes, in the patch set, it does treat the NUMA nodes differently.
>> We are putting constraints on the "top tier" RAM nodes vs the lower
>> tier PMEM nodes. Is this what you mean?
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.

With node priorities, how would the system reserve enough
high performance memory for the performance critical task cgroups?

By priority, do you mean the order of allocation of nodes for a cgroup?
Or do you mean that all similarly performing memory nodes will be
grouped in the same priority?

Tim
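As an illustration of the knob semantics Tim describes, here is a small
model in Python of how tier 0 demotion could be decided for one cgroup.
The memory_t0.* knobs are this thread's proposal, not an existing kernel
interface, and the rules below are only one possible reading of
"demotion pressure proportional to the overage":

```python
def tier0_demotion(usage, t0_min, t0_low, t0_high, system_pressure=False):
    """Return (demotable_bytes, throttle) for one cgroup's tier 0 usage.

    - below low: left alone, unless the whole node is under pressure,
      in which case anything above the protected minimum is fair game
    - above low: demotion proportional to the overage above low
    - above high: heavy demotion plus allocation throttling
    (t0_max would be enforced separately as a hard cap at allocation time)
    """
    if system_pressure:
        demotable = max(0, usage - t0_min)   # min is a hard floor
    else:
        demotable = max(0, usage - t0_low)   # overage above low
    return demotable, usage >= t0_high

# A cgroup at 160 units with low=100 and high=150 would see 60 units
# of demotion pressure and have its tier 0 allocations throttled.
pressure, throttle = tier0_demotion(160, t0_min=10, t0_low=100, t0_high=150)
```

This is only a model of the text above; how per-cgroup soft limits
should interact with node-level memory pressure is exactly what the rest
of the thread debates.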
On Thu, Apr 8, 2021 at 1:50 PM Yang Shi <shy828301@gmail.com> wrote:
>
[...]
>
> > The low and min limits have semantics similar to the v1's soft limit
> > for this situation i.e. letting the low priority job occupy top tier
> > memory and depending on reclaim to take back the excess top tier
> > memory use of such jobs.
>
> I don't get why low priority jobs can *not* use top tier memory?

I am saying low priority jobs can use top tier memory. The only
difference is whether to limit them upfront (using limits) or to
reclaim from them later (using min/low/soft-limit).

> I can think it may incur latency overhead for high priority jobs. If it
> is not allowed, it could be restricted by cpuset without introducing
> any new interfaces.
>
> I suppose the memory utilization could be maximized by allowing all
> jobs to allocate memory from all applicable nodes, then let the
> reclaimer (or something new if needed)

Most probably something new, as we do want to consider unevictable
memory as well.

> do the job of migrating the memory to the proper
> nodes over time. We could achieve some kind of balance between memory
> utilization and resource isolation.
>

The tradeoff between utilization and isolation should be decided by the
user/admin.
On Thu, Apr 8, 2021 at 4:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.
>

I am also inclined towards a NUMA-based approach. It makes the solution
more general, and even existing systems with multiple NUMA nodes and
DRAM can take advantage of this approach (if it makes sense).
On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
[...]
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
[...]
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.
>
> Will such an interface look sane and acceptable to everyone?
>

I have a couple of questions. Let's suppose we have a two socket
system: Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)

2) If yes to (1), is that assumption future proof? Will future
systems with DRAM over CXL support have the same characteristics?

3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
might be a third tier, and similarly for jobs running on node_1,
node_2 might be a third tier.

The reason I am asking these questions is that statically partitioning
memory nodes into tiers will inherently add platform-specific
assumptions to the user API.

Assumptions like:
1) Access within a tier is always cheaper than access across tiers.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards having numa centric control is
that we don't have to make these assumptions. Though the usability
will be more difficult. Greg (CCed) has some ideas on making it better
and we will share our proposal after polishing it a bit more.
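Shakeel's two assumptions can be stated mechanically. Here is a sketch
that checks them against a NUMA distance matrix; the matrix below is a
made-up but plausible two socket DRAM+PMEM layout, not measurements from
real hardware:

```python
def check_tier_assumptions(dist, tiers):
    """dist[i][j] is the access cost from node i to node j; tiers is a
    list of node sets, tiers[0] being the top tier."""
    within = [dist[a][b] for t in tiers for a in t for b in t if a != b]
    across = [dist[a][b]
              for i, t in enumerate(tiers)
              for lower in tiers[i + 1:]
              for a in t for b in lower]
    # Assumption 1: every within-tier access is cheaper than every
    # cross-tier access.  Assumption 2: cross-tier cost is uniform.
    a1 = max(within, default=0) < min(across, default=float("inf"))
    a2 = len(set(across)) <= 1
    return a1, a2

# Nodes 0/1: DRAM on sockets 0/1; nodes 2/3: PMEM behind sockets 0/1.
# Local PMEM (17) is assumed cheaper than remote DRAM (21) here.
dist = [[10, 21, 17, 28],
        [21, 10, 28, 17],
        [17, 28, 10, 28],
        [28, 17, 28, 10]]
a1, a2 = check_tier_assumptions(dist, [{0, 1}, {2, 3}])
# Both come out False: cross-socket DRAM access costs more than local
# PMEM access, and cross-tier costs are not uniform.
```

With numbers like these, both assumptions break on a two socket box,
which is the point Jonathan makes below about large systems.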
Tim Chen <tim.c.chen@linux.intel.com> writes:

> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
[...]
>
> memory_t0.high	If tier 0 memory used by the cgroup exceeds this high
>		boundary, allocation of tier 0 memory by the cgroup will
>		be throttled. The tier 0 memory used by this cgroup
>		will also be subjected to heavy demotion.
I think we don't really need to throttle here, because we can fall back
to allocating memory from t1. That will not cause something like I/O
device bandwidth saturation.

Best Regards,
Huang, Ying
On Fri 09-04-21 16:26:53, Tim Chen wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.

OK, this is a good mission statement to start with. I would expect a
follow up to say what kind of granularity of control you want to
achieve here. Presumably you want more than all or nothing, because
that is what cpusets can be used for.

> A typical use case may be a system with two sets of tasks.
> One set of tasks is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.

While from a very high level this makes sense, I would be interested in
more details. Your highly latency sensitive applications very likely
want to be bound to a high performance node, right? Can they tolerate
memory reclaim? Can they consume more memory than the node size? What do
you expect to happen then?

> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.

How is each tier defined? Is this an admin-defined set of NUMA nodes, or
is it platform specific?

[...]

> Will such an interface look sane and acceptable to everyone?
Let's talk more about use cases first before we even start talking about
the interface or which controller is the best fit for implementing it.

> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admit that it falls short of the eventual functionality
> we want to achieve. It is meant to solicit feedback from everyone on how
> the tiered memory management should work.

OK, fair enough. Let me then just state that I strongly believe that the
soft limit based approach is a dead end, and it would be better to focus
on the actual use cases and try to understand what you want to achieve
first.

[...]

> > What I am trying to say (and I have brought that up when demotion has been
> > discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> > The specific technology shouldn't be imprinted into the interface.
> > Fundamentally you are trying to balance memory among NUMA nodes as we do
> > not have other abstraction to use. So rather than talking about top,
> > secondary, nth tier we have different NUMA nodes with different
> > characteristics and you want to express your "priorities" for them.
>
> With node priorities, how would the system reserve enough
> high performance memory for the performance critical task cgroups?
>
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or do you mean that all similarly performing memory nodes will be
> grouped in the same priority?

I have to say I do not yet have a clear idea of what those priorities
would look like. I just wanted to outline that the use cases you are
interested in likely want to implement some form of (application
transparent) control over memory distribution across several nodes.
There is a long way to go to land on something more specific, I am
afraid.
On Mon, 12 Apr 2021 12:20:22 -0700
Shakeel Butt <shakeelb@google.com> wrote:

> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
[...]
>
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
>
> My questions are:
>
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access across tiers? (node_0 <-> node_1 vs
> node_0 <-> node_2)

No, not in large systems, even if we can make this assumption in
two-socket ones.

> 2) If yes to (1), is that assumption future proof? Will future
> systems with DRAM over CXL support have the same characteristics?
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be a third tier, and similarly for jobs running on node_1,
> node_2 might be a third tier.
>
> The reason I am asking these questions is that statically partitioning
> memory nodes into tiers will inherently add platform-specific
> assumptions to the user API.

Absolutely agree.

>
> Assumptions like:
> 1) Access within a tier is always cheaper than access across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
>

Sounds good, will look out for that.

Jonathan
On 4/8/21 1:29 PM, Shakeel Butt wrote:

> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.
>
> I have some thoughts on NUMA node limits which I will share in the other thread.
>

Shakeel,

I look forward to the proposal on NUMA node limits. In which thread are
you going to post it? I want to make sure I don't miss it.

Tim
On 4/12/21 12:20 PM, Shakeel Butt wrote: >> >> memory_t0.current Current usage of tier 0 memory by the cgroup. >> >> memory_t0.min If tier 0 memory used by the cgroup falls below this low >> boundary, the memory will not be subjected to demotion >> to lower tiers to free up memory at tier 0. >> >> memory_t0.low Above this boundary, the tier 0 memory will be subjected >> to demotion. The demotion pressure will be proportional >> to the overage. >> >> memory_t0.high If tier 0 memory used by the cgroup exceeds this high >> boundary, allocation of tier 0 memory by the cgroup will >> be throttled. The tier 0 memory used by this cgroup >> will also be subjected to heavy demotion. >> >> memory_t0.max This will be a hard usage limit of tier 0 memory on the cgroup. >> >> If needed, memory_t[12...].current/min/low/high for additional tiers can be added. >> This follows closely with the design of the general memory controller interface. >> >> Will such an interface looks sane and acceptable with everyone? >> > > I have a couple of questions. Let's suppose we have a two socket > system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket > 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1). > Based on the tier definition of this patch series, tier_0: {node_0, > node_1} and tier_1: {node_2, node_3}. > > My questions are: > > 1) Can we assume that the cost of access within a tier will always be > less than the cost of access from the tier? (node_0 <-> node_1 vs > node_0 <-> node_2) I do assume that higher tier memory offers better performance (or less access latency) than a lower tier memory. Otherwise, this defeats the whole purpose of promoting hot memory from lower tier to a higher tier, and demoting cold memory to a lower tier. Tiers assumption is embedded once we define this promotion/demotion relationship between the numa nodes. So if node_m ----demotes----> node_n <---promotes---- then node_m is one tier higher tier than node_n. 
This promotion/demotion relationship between the nodes is the
underpinning of Dave and Ying's demotion and promotion patch sets.

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?

I think if you configure a promotion/demotion relationship between
CXL-attached DRAM and local socket-attached DRAM, you could divide them
into separate tiers. Or, if you don't care about the difference, you can
configure them without a promotion/demotion relationship and they will
be in the same tier. Balancing within the same tier will be handled by
the autonuma mechanism.

> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.

The tier definition is an admin's choice of where the admin thinks the
hot memory should reside, after looking at the memory performance. It
falls out of how the admin constructs the promotion/demotion
relationship between the nodes; the OS does not infer the tier
relationship from memory performance directly.

> The reason I am asking these questions is that statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions in the user API.
>
> Assumptions like:
> 1) Access within a tier is always cheaper than across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having NUMA centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.

I am still trying to understand how a NUMA centric control would
actually work. Putting limits on every NUMA node for each cgroup seems
to make the system configuration quite complicated.
Looking forward to your proposal so I can better understand that
perspective.

Tim
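To make the discussion above more concrete, here is a sketch of how an admin might drive the proposed memory_t0.* knobs. To be clear, none of these files exist in any current kernel; the file names, paths, and values are purely illustrative of the interface proposed earlier in this thread.

```shell
# Hypothetical cgroup v2 layout for a latency-sensitive job; the
# memory_t0.* files below are the *proposed* interface, not an
# existing kernel ABI.
cd /sys/fs/cgroup/high-pri-job

# Guarantee 8G of tier 0 (DRAM) that will never be demoted away.
echo 8G > memory_t0.min

# Above 16G of tier 0 usage, demotion pressure kicks in,
# proportional to the overage.
echo 16G > memory_t0.low

# Above 24G, tier 0 allocations are throttled and heavily demoted.
echo 24G > memory_t0.high

# Hard cap: the cgroup can never hold more than 32G of tier 0 memory.
echo 32G > memory_t0.max

# Current tier 0 usage, for monitoring and capacity planning.
cat memory_t0.current
```

The point of the sketch is that the knobs compose the same way as the existing memory.min/low/high/max interface, just scoped to a single tier.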
On 4/8/21 10:18 AM, Shakeel Butt wrote:

> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path, and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is a low priority job hogging the top tier memory.
>
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (There is
> also pinning the memory, but I don't know if there is a user API to
> pin memory.) For mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.
>
> Basically I am saying we should put an upfront control (limit) on the
> usage of top tier memory by the jobs.

Circling back to your comment here: I agree that the soft limit is
deficient in the scenario you have pointed out. Eventually I was
shooting for a hard limit on a memory tier for a cgroup, similar to the
v2 memory controller interface (see mail in the other thread). That
interface should satisfy the hard constraint you want to place on the
low priority jobs.

Tim
On 4/9/21 12:24 AM, Michal Hocko wrote:

> On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
>> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> [...]
>>> The low priority jobs should be able to be restricted by cpuset, for
>>> example, just keep them on second tier memory nodes. Then all the
>>> above problems are gone.
>
> Yes, if the aim is to isolate some users from certain numa nodes then
> cpuset is a good fit, but as Shakeel says this is very likely not what
> this work is aiming for.
>
>> Yes that's an extreme way to overcome the issue but we can do less
>> extreme by just (hard) limiting the top tier usage of low priority
>> jobs.
>
> Per numa node high/hard limits would help with more fine grained
> control. The configuration would be tricky though. All low priority
> memcgs would have to be carefully configured to leave enough for your
> important processes. That includes also memory which is not accounted
> to any memcg. The behavior of those limits would be quite tricky in
> OOM situations as well, due to the lack of a NUMA aware OOM killer.

Another downside of putting limits on individual NUMA nodes is that it
would limit flexibility. For example, two memory nodes may be similar
enough in performance that you really only care about a cgroup not
using more than a threshold of the combined capacity of the two nodes.
But when you put a hard limit on each NUMA node, you are tied to a
fixed allocation partition per node. Perhaps some kernel resources are
pre-allocated primarily from one node; a cgroup may then bump into the
limit on that node and fail the allocation, even when it has a lot of
slack on the other node. This makes getting the configuration right
trickier.

There are some differences of opinion currently on whether grouping
memory nodes into tiers, and putting limits on their use by cgroups, is
desirable.
Many people want the management constraint placed on individual NUMA
nodes for each cgroup, instead of at the tier level. I would appreciate
feedback from folks who have insights on how such a NUMA based control
interface would work, so we can at least agree here in order to move
forward.

Tim
On Thu 15-04-21 15:31:46, Tim Chen wrote:
> On 4/9/21 12:24 AM, Michal Hocko wrote:
> > On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> >> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> > [...]
> >>> The low priority jobs should be able to be restricted by cpuset, for
> >>> example, just keep them on second tier memory nodes. Then all the
> >>> above problems are gone.
> >
> > Yes, if the aim is to isolate some users from certain numa node then
> > cpuset is a good fit but as Shakeel says this is very likely not what
> > this work is aiming for.
> >
> >> Yes that's an extreme way to overcome the issue but we can do less
> >> extreme by just (hard) limiting the top tier usage of low priority
> >> jobs.
> >
> > Per numa node high/hard limit would help with a more fine grained control.
> > The configuration would be tricky though. All low priority memcgs would
> > have to be carefully configured to leave enough for your important
> > processes. That includes also memory which is not accounted to any
> > memcg.
> > The behavior of those limits would be quite tricky for OOM situations
> > as well due to a lack of NUMA aware oom killer.
>
> Another downside of putting limits on individual NUMA
> node is it would limit flexibility.

Let me just clarify one thing. I haven't been proposing per-NUMA limits.
As I've said above, they would be quite tricky to use and their behavior
would be tricky as well.

All I am saying is that we do not want an interface that is tightly
bound to a specific HW setup (fast DRAM as the top tier and PMEM as a
fallback) like the one you have proposed here. We want a generic NUMA
based abstraction. What that abstraction should look like is an open
question, and it really depends on the usecases we expect to see.