[v2,RFC,0/9] Another Approach to Use PMEM as NUMA Node

Message ID 1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)

Message

Yang Shi April 11, 2019, 3:56 a.m. UTC
With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can now be hot-plugged as a NUMA node.  But how to use PMEM as a NUMA node
effectively and efficiently is still an open question.

There have been a couple of proposals posted on the mailing list [1] [2] [3].


Changelog
=========
v1 --> v2:
* Dropped the default allocation node mask.  The memory placement restriction
  could be achieved by mempolicy or cpuset.
* Dropped the new mempolicy since its semantics are not yet clear.
* Dropped PG_Promote flag.
* Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
* Extended page_check_references() to implement "twice access" check for
  anonymous page in NUMA balancing path.
* Reworked the memory demotion code.

v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/


Design
======
Basically, the approach aims to spread data from DRAM (closest to the local
CPU) further down to PMEM and disk according to hotness (the lower tier is
typically assumed to be slower, larger and cheaper than the upper tier).  The
patchset tries to achieve this goal by doing memory promotion/demotion via
NUMA balancing and memory reclaim, as the diagram below shows:

    DRAM <--> PMEM <--> Disk
      ^                   ^
      |-------------------|
               swap

When DRAM is under memory pressure, pages are demoted to PMEM via the page
reclaim path.  NUMA balancing then promotes a page back to DRAM once it is
referenced again.  Memory pressure on a PMEM node pushes the inactive pages of
that node to disk via swap.

The promotion/demotion happens only between "primary" nodes (nodes that have
both CPU and memory) and PMEM nodes.  There is no promotion/demotion between
PMEM nodes, no promotion from DRAM to PMEM, and no demotion from PMEM to DRAM.

The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
that has differentiated performance from the conventional memory pool, or
differentiated performance for a specific initiator, per Dan Williams.  So,
assuming PMEM nodes are cpuless nodes sounds reasonable.

However, a cpuless node is not necessarily a PMEM node.  But memory
promotion/demotion doesn't actually care what kind of memory the target node
has; it could be DRAM, PMEM or something else, as long as it is second tier
memory (slower, larger and cheaper than regular DRAM).  Otherwise such demotion
would be pointless.

Defined the "N_CPU_MEM" nodemask for the nodes which have both CPU and memory,
in order to distinguish them from cpuless nodes (memory only, e.g. PMEM nodes)
and memoryless nodes (some architectures, e.g. Power, may have memoryless
nodes).  Memory allocation typically happens on N_CPU_MEM nodes by default
unless cpuless nodes are specified explicitly, with cpuless nodes serving only
as fallback nodes, so N_CPU_MEM nodes are also called "primary" nodes in this
patchset.  For a two tier memory system (e.g. DRAM + PMEM), this is good enough
to demonstrate the promotion/demotion approach for now, and it looks more
architecture-independent.  But it may be better to construct such a node mask
by reading hardware information (e.g. HMAT), particularly for more complex
memory hierarchies.
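
A minimal sketch of the node classification described above, written as a toy
Python model rather than kernel code.  It assumes the mask is simply the
intersection of the nodes that have CPUs and the nodes that have memory; the
function name and the four-node layout are hypothetical.

```python
# Toy model of node states; the kernel itself uses nodemask bitmaps.
# Hypothetical topology: nodes 0 and 1 are DRAM nodes with CPUs,
# nodes 2 and 3 are cpuless PMEM nodes.

def classify_nodes(nodes_with_cpu, nodes_with_memory):
    """Split nodes into (primary, cpuless, memoryless) sets."""
    primary = nodes_with_cpu & nodes_with_memory     # "N_CPU_MEM" in the cover letter
    cpuless = nodes_with_memory - nodes_with_cpu     # memory only, e.g. PMEM
    memoryless = nodes_with_cpu - nodes_with_memory  # CPUs only
    return primary, cpuless, memoryless

primary, cpuless, memoryless = classify_nodes({0, 1}, {0, 1, 2, 3})
print("primary:", primary, "cpuless:", cpuless, "memoryless:", memoryless)
```

In this model the cpuless set is what the patchset treats as demotion targets,
and the primary set is where promotion lands.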

To reduce memory thrashing and PMEM bandwidth pressure, NUMA balancing only
promotes a page that has faulted twice.  The "twice access" check is
implemented by extending page_check_references() for anonymous pages.
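
The effect of the "twice access" check can be illustrated with a toy model.
This is only a sketch under stated assumptions: the per-page "seen once" set
stands in for the referenced state that the kernel tracks per page, and the
class and method names are hypothetical.

```python
# Toy model: a page on a PMEM node is promoted only on its second
# NUMA hinting fault.  Not the kernel implementation.

class PromotionFilter:
    def __init__(self):
        self.seen_once = set()  # pages that have faulted exactly once
        self.promoted = 0       # number of promotions performed

    def on_hint_fault(self, page):
        """Return True if this fault should trigger promotion to DRAM."""
        if page in self.seen_once:
            self.seen_once.discard(page)
            self.promoted += 1
            return True
        self.seen_once.add(page)  # first touch: remember it, do not migrate
        return False

flt = PromotionFilter()
one_off = [flt.on_hint_fault(p) for p in range(1000)]  # streaming, one touch each
hot = [flt.on_hint_fault(5000) for _ in range(2)]      # same page touched twice
print(any(one_off), hot, flt.promoted)
```

A streaming access pattern that touches each page once causes no promotion in
this model, while a page touched twice is promoted on the second fault.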

When doing demotion, demote to the less-contended local PMEM node.  If the
local PMEM node is contended (i.e. migrate_pages() returns -ENOMEM), just
swap instead of demoting.  To keep things simple, demotion to a remote PMEM
node is not allowed for now if the local PMEM node is online.  If the local
PMEM node is not online, demote to the remote one.  If no PMEM node is online,
just do normal swap.
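
The target selection rules above can be sketched as a small decision function.
This is a toy model, not the patch's code: node ids, the "contended" set and
the function name are hypothetical, and falling back to swap when the remote
node is contended is an assumption the cover letter does not spell out.

```python
def pick_demotion_action(local_pmem, remote_pmem, online, contended):
    """Return ("demote", node) or ("swap", None) for a page on a DRAM node."""
    if local_pmem in online:
        # Remote PMEM is deliberately not tried while the local node is online.
        if local_pmem in contended:
            return ("swap", None)
        return ("demote", local_pmem)
    # Local PMEM is offline: fall back to the remote node.
    # Assumption: a contended remote node also means swap.
    if remote_pmem in online and remote_pmem not in contended:
        return ("demote", remote_pmem)
    return ("swap", None)

print(pick_demotion_action(2, 3, online={2, 3}, contended=set()))
print(pick_demotion_action(2, 3, online={2, 3}, contended={2}))
print(pick_demotion_action(2, 3, online={3}, contended=set()))
```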

Only anonymous pages are handled for the time being, since NUMA balancing
can't promote unmapped page cache.

Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
numa_pages_promoted.

There are definitely still some details that need to be sorted out, for
example whether the demotion path should respect mempolicy.

Any comment is welcome.


Test
====
The stress test was done with mmtests plus application workloads (e.g.
sysbench, grep, etc).

Memory pressure was generated by running mmtests' usemem-stress-numa-compact,
then other applications were run as workload to stress the promotion and
demotion paths.  The machine was still alive after the stress test had been
running for ~30 hours.  /proc/vmstat also shows:

...
pgdemote_kswapd 3316563
pgdemote_direct 1930721
...
numa_pages_promoted 81838
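
For anyone reproducing the test, a short sketch of how these counters could be
pulled out of /proc/vmstat.  The counter names come from this patchset and are
assumed to exist only on a kernel carrying it; the helper name is hypothetical
and the sample text reuses the numbers shown above.

```python
WANTED = ("pgdemote_kswapd", "pgdemote_direct", "numa_pages_promoted")

def read_tiering_counters(vmstat_text):
    """Parse 'name value' lines and keep only the tiering counters."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in WANTED:
            counters[parts[0]] = int(parts[1])
    return counters

sample = """pgdemote_kswapd 3316563
pgdemote_direct 1930721
numa_pages_promoted 81838
"""
counters = read_tiering_counters(sample)
demoted = counters["pgdemote_kswapd"] + counters["pgdemote_direct"]
print("demoted:", demoted, "promoted:", counters["numa_pages_promoted"])
```

On a live system the same function can be fed the contents of /proc/vmstat;
counters that the running kernel does not provide are simply absent from the
returned dictionary.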


TODO
====
1. Promote page cache. There are a couple of ways to handle this in the kernel,
   e.g. promote via the active LRU in the reclaim path on the PMEM node, or
   promote in mark_page_accessed().

2. Promote/demote HugeTLB. HugeTLB is currently not on the LRU and NUMA
   balancing just skips it.

3. May place kernel pages (e.g. page tables, slabs, etc) on DRAM only.


[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
[3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@dhcp22.suse.cz/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d


Yang Shi (9):
      mm: define N_CPU_MEM node states
      mm: page_alloc: make find_next_best_node find return cpuless node
      mm: numa: promote pages to DRAM when it gets accessed twice
      mm: migrate: make migrate_pages() return nr_succeeded
      mm: vmscan: demote anon DRAM pages to PMEM node
      mm: vmscan: don't demote for memcg reclaim
      mm: vmscan: check if the demote target node is contended or not
      mm: vmscan: add page demotion counter
      mm: numa: add page promotion counter

 drivers/base/node.c            |   2 +
 include/linux/gfp.h            |  12 +++
 include/linux/migrate.h        |   6 +-
 include/linux/mmzone.h         |   3 +
 include/linux/nodemask.h       |   3 +-
 include/linux/vm_event_item.h  |   3 +
 include/linux/vmstat.h         |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/compaction.c                |   3 +-
 mm/debug.c                     |   1 +
 mm/gup.c                       |   4 +-
 mm/huge_memory.c               |  15 ++++
 mm/internal.h                  | 105 +++++++++++++++++++++++++
 mm/memory-failure.c            |   7 +-
 mm/memory.c                    |  25 ++++++
 mm/memory_hotplug.c            |  10 ++-
 mm/mempolicy.c                 |   7 +-
 mm/migrate.c                   |  33 +++++---
 mm/page_alloc.c                |  19 +++--
 mm/vmscan.c                    | 262 +++++++++++++++++++++++++++++++++++++++++----------------------
 mm/vmstat.c                    |  14 +++-
 21 files changed, 418 insertions(+), 120 deletions(-)

Comments

Dave Hansen April 11, 2019, 2:28 p.m. UTC | #1
This isn't so much another approach, as it is some tweaks on top of
what's there.  Right?

This set seems to present a bunch of ideas, like "promote if accessed
twice".  Seems like a good idea, but I'm a lot more interested in seeing
data about it being a good idea.  What workloads is it good for?  Bad for?

These look like fun to play with, but I'd be really curious what you
think needs to be done before we start merging these ideas.
Michal Hocko April 12, 2019, 8:47 a.m. UTC | #2
On Thu 11-04-19 11:56:50, Yang Shi wrote:
[...]
> Design
> ======
> Basically, the approach is aimed to spread data from DRAM (closest to local
> CPU) down further to PMEM and disk (typically assume the lower tier storage
> is slower, larger and cheaper than the upper tier) by their hotness.  The
> patchset tries to achieve this goal by doing memory promotion/demotion via
> NUMA balancing and memory reclaim as what the below diagram shows:
> 
>     DRAM <--> PMEM <--> Disk
>       ^                   ^
>       |-------------------|
>                swap
> 
> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
> again.  The memory pressure on PMEM node would push the inactive pages of PMEM 
> to disk via swap.
> 
> The promotion/demotion happens only between "primary" nodes (the nodes have
> both CPU and memory) and PMEM nodes.  No promotion/demotion between PMEM nodes
> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
> 
> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
> that has differentiated performance from the conventional memory pool, or
> differentiated performance for a specific initiator, per Dan Williams.  So,
> assuming PMEM nodes are cpuless nodes sounds reasonable.
> 
> However, cpuless nodes might be not PMEM nodes.  But, actually, memory
> promotion/demotion doesn't care what kind of memory will be the target nodes,
> it could be DRAM, PMEM or something else, as long as they are the second tier
> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
> pointless to do such demotion.
> 
> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
> Typically, memory allocation would happen on such nodes by default unless
> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
> nodes, so they are also as known as "primary" nodes in this patchset.  With
> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
> demonstrate the promotion/demotion approach for now, and this looks more
> architecture-independent.  But it may be better to construct such node mask
> by reading hardware information (i.e. HMAT), particularly for more complex
> memory hierarchy.

I still believe you are overcomplicating this without a strong reason.
Why cannot we start simple and build from there? In other words I do not
think we really need anything like N_CPU_MEM at all.

I would expect that the very first attempt wouldn't do much more than
migrate to-be-reclaimed pages (without an explicit binding) with a
very optimistic allocation strategy (effectively GFP_NOWAIT) and if
that fails then simply give up. All that hooked essentially to the
node_reclaim path with a new node_reclaim mode so that the behavior
would be opt-in. This should be the most simplistic way to start AFAICS
and something people can play with without risking regressions.
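
The opt-in flow suggested above can be sketched as a toy model.  The first
three mode bits mirror the documented vm.zone_reclaim_mode values (1, 2, 4);
the fourth bit, RECLAIM_DEMOTE, is purely hypothetical and not an existing
kernel flag, as are the function names.  The allocator callback stands in for
a GFP_NOWAIT-style allocation that either succeeds immediately or fails.

```python
RECLAIM_ZONE = 1 << 0    # node reclaim enabled
RECLAIM_WRITE = 1 << 1   # write out dirty pages during reclaim
RECLAIM_UNMAP = 1 << 2   # unmap/swap pages during reclaim
RECLAIM_DEMOTE = 1 << 3  # HYPOTHETICAL opt-in bit for migrate-on-reclaim

def demotion_enabled(mode):
    """True if the (hypothetical) demotion bit is set in the reclaim mode."""
    return bool(mode & RECLAIM_DEMOTE)

def reclaim_page(mode, alloc_nowait):
    """Try one opportunistic migration; give up on the first failure.

    alloc_nowait models a non-blocking allocation on another node:
    it returns a target node id, or None if nothing is free right now.
    """
    if demotion_enabled(mode):
        target = alloc_nowait()
        if target is not None:
            return "migrated to node %d" % target
    return "reclaimed"  # fall back to the normal reclaim path

print(reclaim_page(RECLAIM_ZONE | RECLAIM_DEMOTE, lambda: 2))
print(reclaim_page(RECLAIM_ZONE | RECLAIM_DEMOTE, lambda: None))
print(reclaim_page(RECLAIM_ZONE, lambda: 2))
```

With the bit clear the behavior is unchanged, which is the point of making
the feature opt-in.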

Once we see how that behaves in the real world and what kind of corner
cases users are able to trigger then we can build on top. E.g. do we want
to migrate from cpuless nodes as well? I am not really sure TBH. On one
hand why not if other nodes are free to hold that memory? Swap out is
more expensive. Anyway this is the kind of decision which should be
shaped by existing experience rather than an ad-hoc decision right now.

I would also not touch the numa balancing logic at this stage and rather
see how the current implementation behaves.
Yang Shi April 16, 2019, 12:09 a.m. UTC | #3
On 4/12/19 1:47 AM, Michal Hocko wrote:
> On Thu 11-04-19 11:56:50, Yang Shi wrote:
> [...]
>> Design
>> ======
>> Basically, the approach is aimed to spread data from DRAM (closest to local
>> CPU) down further to PMEM and disk (typically assume the lower tier storage
>> is slower, larger and cheaper than the upper tier) by their hotness.  The
>> patchset tries to achieve this goal by doing memory promotion/demotion via
>> NUMA balancing and memory reclaim as what the below diagram shows:
>>
>>      DRAM <--> PMEM <--> Disk
>>        ^                   ^
>>        |-------------------|
>>                 swap
>>
>> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
>> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
>> again.  The memory pressure on PMEM node would push the inactive pages of PMEM
>> to disk via swap.
>>
>> The promotion/demotion happens only between "primary" nodes (the nodes have
>> both CPU and memory) and PMEM nodes.  No promotion/demotion between PMEM nodes
>> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
>>
>> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
>> that has differentiated performance from the conventional memory pool, or
>> differentiated performance for a specific initiator, per Dan Williams.  So,
>> assuming PMEM nodes are cpuless nodes sounds reasonable.
>>
>> However, cpuless nodes might be not PMEM nodes.  But, actually, memory
>> promotion/demotion doesn't care what kind of memory will be the target nodes,
>> it could be DRAM, PMEM or something else, as long as they are the second tier
>> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
>> pointless to do such demotion.
>>
>> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
>> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
>> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
>> Typically, memory allocation would happen on such nodes by default unless
>> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
>> nodes, so they are also as known as "primary" nodes in this patchset.  With
>> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
>> demonstrate the promotion/demotion approach for now, and this looks more
>> architecture-independent.  But it may be better to construct such node mask
>> by reading hardware information (i.e. HMAT), particularly for more complex
>> memory hierarchy.
> I still believe you are overcomplicating this without a strong reason.
> Why cannot we start simple and build from there? In other words I do not
> think we really need anything like N_CPU_MEM at all.

In this patchset N_CPU_MEM is used to tell us which nodes are cpuless
nodes. They would be the preferred demotion target.  Of course, we could
rely on firmware to just demote to the next best node, but that may be a
"preferred" node; if so, I don't see much benefit from demotion. Am I
missing anything?

>
> I would expect that the very first attempt wouldn't do much more than
> migrate to-be-reclaimed pages (without an explicit binding) with a

Do you mean respecting mempolicy or cpuset when doing demotion? I was
wondering about this, but I didn't do so in the current implementation since
it may need to walk the rmap to retrieve the mempolicy in the reclaim path.
Is there any easier way to do so?

> very optimistic allocation strategy (effectively GFP_NOWAIT) and if

Yes, this has been done in this patchset.

> that fails then simply give up. All that hooked essentially to the
> node_reclaim path with a new node_reclaim mode so that the behavior
> would be opt-in. This should be the most simplistic way to start AFAICS
> and something people can play with without risking regressions.

I agree it is safer to start with node reclaim. Once it is stable enough 
and we are confident enough, it can be extended to global reclaim.

>
> Once we see how that behaves in the real world and what kind of corner
> case user are able to trigger then we can build on top. E.g. do we want
> to migrate from cpuless nodes as well? I am not really sure TBH. On one
> hand why not if other nodes are free to hold that memory? Swap out is
> more expensive. Anyway this is kind of decision which would rather be
> shaped on an existing experience rather than ad-hoc decision right now.

I do agree.

>
> I would also not touch the numa balancing logic at this stage and rather
> see how the current implementation behaves.

I agree we would prefer to start from something simpler and see how it works.

The "twice access" optimization aims to reduce the PMEM bandwidth burden,
since PMEM bandwidth is a scarce resource. I did compare "twice access" to
"no twice access", and it does save a lot of bandwidth for some one-off
access patterns. For example, when running the stress test with mmtests'
usemem-stress-numa-compact, the kernel would promote ~600,000 pages with
"twice access" in 4 hours, but ~80,000,000 pages without "twice access".
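
To put those two page counts in perspective, a back-of-the-envelope
calculation.  It assumes 4 KiB base pages (not stated above) and treats the
approximate counts as exact, so the results are rough estimates only.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB base pages
WINDOW_S = 4 * 3600   # the 4-hour run quoted above

def promotion_traffic(pages):
    """Return (bytes copied by promotion, average pages per second)."""
    return pages * PAGE_SIZE, pages / WINDOW_S

with_check = 600_000
without_check = 80_000_000

print("reduction factor: %.0fx" % (without_check / with_check))
for label, pages in (("twice access", with_check), ("no check", without_check)):
    nbytes, rate = promotion_traffic(pages)
    print("%s: %.1f GiB copied, %.0f pages/s" % (label, nbytes / 2**30, rate))
```

Under these assumptions the check cuts promotion traffic by roughly two
orders of magnitude, from a few hundred GiB to a few GiB over the run.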

Thanks,
Yang
Michal Hocko April 16, 2019, 7:47 a.m. UTC | #4
On Mon 15-04-19 17:09:07, Yang Shi wrote:
> 
> 
> On 4/12/19 1:47 AM, Michal Hocko wrote:
> > On Thu 11-04-19 11:56:50, Yang Shi wrote:
> > [...]
> > > Design
> > > ======
> > > Basically, the approach is aimed to spread data from DRAM (closest to local
> > > CPU) down further to PMEM and disk (typically assume the lower tier storage
> > > is slower, larger and cheaper than the upper tier) by their hotness.  The
> > > patchset tries to achieve this goal by doing memory promotion/demotion via
> > > NUMA balancing and memory reclaim as what the below diagram shows:
> > > 
> > >      DRAM <--> PMEM <--> Disk
> > >        ^                   ^
> > >        |-------------------|
> > >                 swap
> > > 
> > > When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
> > > Then NUMA balancing will promote pages to DRAM as long as the page is referenced
> > > again.  The memory pressure on PMEM node would push the inactive pages of PMEM
> > > to disk via swap.
> > > 
> > > The promotion/demotion happens only between "primary" nodes (the nodes have
> > > both CPU and memory) and PMEM nodes.  No promotion/demotion between PMEM nodes
> > > and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
> > > 
> > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
> > > that has differentiated performance from the conventional memory pool, or
> > > differentiated performance for a specific initiator, per Dan Williams.  So,
> > > assuming PMEM nodes are cpuless nodes sounds reasonable.
> > > 
> > > However, cpuless nodes might be not PMEM nodes.  But, actually, memory
> > > promotion/demotion doesn't care what kind of memory will be the target nodes,
> > > it could be DRAM, PMEM or something else, as long as they are the second tier
> > > memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
> > > pointless to do such demotion.
> > > 
> > > Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
> > > order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
> > > memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
> > > Typically, memory allocation would happen on such nodes by default unless
> > > cpuless nodes are specified explicitly, cpuless nodes would be just fallback
> > > nodes, so they are also as known as "primary" nodes in this patchset.  With
> > > two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
> > > demonstrate the promotion/demotion approach for now, and this looks more
> > > architecture-independent.  But it may be better to construct such node mask
> > > by reading hardware information (i.e. HMAT), particularly for more complex
> > > memory hierarchy.
> > I still believe you are overcomplicating this without a strong reason.
> > Why cannot we start simple and build from there? In other words I do not
> > think we really need anything like N_CPU_MEM at all.
> 
> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes.
> They would be the preferred demotion target.  Of course, we could rely on
> firmware to just demote to the next best node, but it may be a "preferred"
> node, if so I don't see too much benefit achieved by demotion. Am I missing
> anything?

Why cannot we simply demote in the proximity order? Why do you make
cpuless nodes so special? If other close nodes are vacant then just use
them.
 
> > I would expect that the very first attempt wouldn't do much more than
> > migrate to-be-reclaimed pages (without an explicit binding) with a
> 
> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
> this, but I didn't do so in the current implementation since it may need
> walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
> easier way to do so?

You definitely have to follow policy. You cannot demote to a node which
is outside of the cpuset/mempolicy because you are breaking contract
expected by the userspace. That implies doing a rmap walk.

> > I would also not touch the numa balancing logic at this stage and rather
> > see how the current implementation behaves.
> 
> I agree we would prefer start from something simpler and see how it works.
> 
> The "twice access" optimization is aimed to reduce the PMEM bandwidth burden
> since the bandwidth of PMEM is scarce resource. I did compare "twice access"
> to "no twice access", it does save a lot bandwidth for some once-off access
> pattern. For example, when running stress test with mmtest's
> usemem-stress-numa-compact. The kernel would promote ~600,000 pages with
> "twice access" in 4 hours, but it would promote ~80,000,000 pages without
> "twice access".

I presume this is a result of a synthetic workload, right? Or do you
have any numbers for a real life usecase?
Dave Hansen April 16, 2019, 2:30 p.m. UTC | #5
On 4/16/19 12:47 AM, Michal Hocko wrote:
> You definitely have to follow policy. You cannot demote to a node which
> is outside of the cpuset/mempolicy because you are breaking contract
> expected by the userspace. That implies doing a rmap walk.

What *is* the contract with userspace, anyway? :)

Obviously, the preferred policy doesn't have any strict contract.

The strict binding has a bit more of a contract, but it doesn't prevent
swapping.  Strict binding also doesn't keep another app from moving the
memory.

We have a reasonable argument that demotion is better than swapping.
So, we could say that even if a VMA has a strict NUMA policy, demoting
pages mapped there still beats swapping them or tossing the page
cache.  It's doing them a favor to demote them.

Or, maybe we just need a swap hybrid where demotion moves the page but
keeps it unmapped and in the swap cache.  That way an access gets a
fault and we can promote the page back to where it should be.  That
would be faster than I/O-based swap for sure.

Anyway, I agree that the kernel probably shouldn't be moving pages
around willy-nilly with no consideration for memory policies, but users
might give us some wiggle room too.
Michal Hocko April 16, 2019, 2:39 p.m. UTC | #6
On Tue 16-04-19 07:30:20, Dave Hansen wrote:
> On 4/16/19 12:47 AM, Michal Hocko wrote:
> > You definitely have to follow policy. You cannot demote to a node which
> > is outside of the cpuset/mempolicy because you are breaking contract
> > expected by the userspace. That implies doing a rmap walk.
> 
> What *is* the contract with userspace, anyway? :)
> 
> Obviously, the preferred policy doesn't have any strict contract.
> 
> The strict binding has a bit more of a contract, but it doesn't prevent
> swapping.

Yes, but swapping is not a problem for using binding for memory
partitioning.

> Strict binding also doesn't keep another app from moving the
> memory.

I would consider that a bug.
Zi Yan April 16, 2019, 3:33 p.m. UTC | #7
On 16 Apr 2019, at 10:30, Dave Hansen wrote:

> On 4/16/19 12:47 AM, Michal Hocko wrote:
>> You definitely have to follow policy. You cannot demote to a node which
>> is outside of the cpuset/mempolicy because you are breaking contract
>> expected by the userspace. That implies doing a rmap walk.
>
> What *is* the contract with userspace, anyway? :)
>
> Obviously, the preferred policy doesn't have any strict contract.
>
> The strict binding has a bit more of a contract, but it doesn't prevent
> swapping.  Strict binding also doesn't keep another app from moving the
> memory.
>
> We have a reasonable argument that demotion is better than swapping.
> So, we could say that even if a VMA has a strict NUMA policy, demoting
> pages mapped there pages still beats swapping them or tossing the page
> cache.  It's doing them a favor to demote them.

I just wonder whether page migration is always better than swapping,
since SSD write throughput keeps improving but page migration throughput
is still low. For example, my machine has an SSD with 2GB/s write throughput,
but the throughput of 4KB page migration is less than 1GB/s. Why would we
want to use page migration for demotion instead of swapping?


--
Best Regards,
Yan Zi
Dave Hansen April 16, 2019, 3:46 p.m. UTC | #8
On 4/16/19 7:39 AM, Michal Hocko wrote:
>> Strict binding also doesn't keep another app from moving the
>> memory.
> I would consider that a bug.

A bug where, though?  Certainly not in the kernel.

I'm just saying that if an app has an assumption that strict binding
means that its memory can *NEVER* move, then that assumption is simply
wrong.  It's not the guarantee that we provide.  In fact, we provide
APIs (migrate_pages() at least) that explicitly and intentionally break
that guarantee.

All that our NUMA APIs provide (even the strict ones) is a promise about
where newly-allocated pages will be allocated.
Dave Hansen April 16, 2019, 3:55 p.m. UTC | #9
On 4/16/19 8:33 AM, Zi Yan wrote:
>> We have a reasonable argument that demotion is better than
>> swapping. So, we could say that even if a VMA has a strict NUMA
>> policy, demoting pages mapped there pages still beats swapping
>> them or tossing the page cache.  It's doing them a favor to
>> demote them.
> I just wonder whether page migration is always better than
> swapping, since SSD write throughput keeps improving but page
> migration throughput is still low. For example, my machine has a
> SSD with 2GB/s writing throughput but the throughput of 4KB page
> migration is less than 1GB/s, why do we want to use page migration
> for demotion instead of swapping?

Just because we observe that page migration apparently has lower
throughput today doesn't mean that we should consider it a dead end.
Zi Yan April 16, 2019, 4:12 p.m. UTC | #10
On 16 Apr 2019, at 11:55, Dave Hansen wrote:

> On 4/16/19 8:33 AM, Zi Yan wrote:
>>> We have a reasonable argument that demotion is better than
>>> swapping. So, we could say that even if a VMA has a strict NUMA
>>> policy, demoting pages mapped there pages still beats swapping
>>> them or tossing the page cache.  It's doing them a favor to
>>> demote them.
>> I just wonder whether page migration is always better than
>> swapping, since SSD write throughput keeps improving but page
>> migration throughput is still low. For example, my machine has a
>> SSD with 2GB/s writing throughput but the throughput of 4KB page
>> migration is less than 1GB/s, why do we want to use page migration
>> for demotion instead of swapping?
>
> Just because we observe that page migration apparently has lower
> throughput today doesn't mean that we should consider it a dead end.

I definitely agree. I also want to make the point that we might
want to improve page migration as well, to show that demotion via
page migration will work. Since most of the proposed demotion approaches
use the same page replacement policy as swapping, if we do not have
high-throughput page migration, we might draw the false conclusion that
demotion is no better than swapping when demotion can actually do
much better. :)

--
Best Regards,
Yan Zi
Michal Hocko April 16, 2019, 6:34 p.m. UTC | #11
On Tue 16-04-19 08:46:56, Dave Hansen wrote:
> On 4/16/19 7:39 AM, Michal Hocko wrote:
> >> Strict binding also doesn't keep another app from moving the
> >> memory.
> > I would consider that a bug.
> 
> A bug where, though?  Certainly not in the kernel.

The kernel should refrain from moving explicitly bound memory willy-nilly. I
certainly agree that there are corner cases, e.g. memory hotplug. We do
break CPU affinity for CPU offline as well. So this is something users
should expect. But the kernel shouldn't move explicitly bound pages to a
different node implicitly. I am not sure whether we even do that during
compaction; if we do then I would consider _this_ to be a bug. And NUMA
rebalancing under memory pressure falls into the same category IMO.
Yang Shi April 16, 2019, 7:19 p.m. UTC | #12
On 4/16/19 12:47 AM, Michal Hocko wrote:
> On Mon 15-04-19 17:09:07, Yang Shi wrote:
>>
>> On 4/12/19 1:47 AM, Michal Hocko wrote:
>>> On Thu 11-04-19 11:56:50, Yang Shi wrote:
>>> [...]
>>>> Design
>>>> ======
>>>> Basically, the approach is aimed to spread data from DRAM (closest to local
>>>> CPU) down further to PMEM and disk (typically assume the lower tier storage
>>>> is slower, larger and cheaper than the upper tier) by their hotness.  The
>>>> patchset tries to achieve this goal by doing memory promotion/demotion via
>>>> NUMA balancing and memory reclaim as what the below diagram shows:
>>>>
>>>>       DRAM <--> PMEM <--> Disk
>>>>         ^                   ^
>>>>         |-------------------|
>>>>                  swap
>>>>
>>>> When DRAM has memory pressure, demote pages to PMEM via page reclaim path.
>>>> Then NUMA balancing will promote pages to DRAM as long as the page is referenced
>>>> again.  The memory pressure on PMEM node would push the inactive pages of PMEM
>>>> to disk via swap.
>>>>
>>>> The promotion/demotion happens only between "primary" nodes (the nodes have
>>>> both CPU and memory) and PMEM nodes.  No promotion/demotion between PMEM nodes
>>>> and promotion from DRAM to PMEM and demotion from PMEM to DRAM.
>>>>
>>>> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
>>>> that has differentiated performance from the conventional memory pool, or
>>>> differentiated performance for a specific initiator, per Dan Williams.  So,
>>>> assuming PMEM nodes are cpuless nodes sounds reasonable.
>>>>
>>>> However, cpuless nodes might be not PMEM nodes.  But, actually, memory
>>>> promotion/demotion doesn't care what kind of memory will be the target nodes,
>>>> it could be DRAM, PMEM or something else, as long as they are the second tier
>>>> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds
>>>> pointless to do such demotion.
>>>>
>>>> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in
>>>> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and
>>>> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes).
>>>> Typically, memory allocation would happen on such nodes by default unless
>>>> cpuless nodes are specified explicitly, cpuless nodes would be just fallback
>>>> nodes, so they are also as known as "primary" nodes in this patchset.  With
>>>> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to
>>>> demonstrate the promotion/demotion approach for now, and this looks more
>>>> architecture-independent.  But it may be better to construct such node mask
>>>> by reading hardware information (i.e. HMAT), particularly for more complex
>>>> memory hierarchy.
>>> I still believe you are overcomplicating this without a strong reason.
>>> Why cannot we start simple and build from there? In other words I do not
>>> think we really need anything like N_CPU_MEM at all.
>> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes.
>> They would be the preferred demotion target.  Of course, we could rely on
>> firmware to just demote to the next best node, but it may be a "preferred"
>> node, if so I don't see too much benefit achieved by demotion. Am I missing
>> anything?
> Why cannot we simply demote in the proximity order? Why do you make
> cpuless nodes so special? If other close nodes are vacant then just use
> them.

We could. But, this raises another question, would we prefer to just 
demote to the next fallback node (just try once), if it is contended, 
then just swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try 
all the nodes in the fallback order to find the first less contended one 
(i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?


|-------|     |-------|                       |-------|     |-------|
| PMEM0 | --- | DRAM0 | --- CPU0 --- CPU1 --- | DRAM1 | --- | PMEM1 |
|-------|     |-------|                       |-------|     |-------|

The first one sounds simpler, and the current implementation does so.
It needs to find the closest PMEM node by recognizing cpuless nodes.

If we prefer to go with the second option, it is definitely unnecessary
to specialize any node.
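
A minimal sketch of the two options, as a toy user-space model rather
than kernel code. The node names and fallback order come from the
example topology above; the function names and the "contended" set are
hypothetical and only illustrate the difference in behaviour.

```python
# Toy model of the two demotion policies discussed above.
# FALLBACK_ORDER mirrors the example topology; everything else is hypothetical.
FALLBACK_ORDER = {
    "DRAM0": ["PMEM0", "DRAM1", "PMEM1"],
    "DRAM1": ["PMEM1", "DRAM0", "PMEM0"],
}


def demote_once(src, contended):
    """Option 1: try only the next fallback node, otherwise swap."""
    target = FALLBACK_ORDER[src][0]
    return "Swap" if target in contended else target


def demote_walk(src, contended):
    """Option 2: walk the fallback order, swap only if every node is contended."""
    for target in FALLBACK_ORDER[src]:
        if target not in contended:
            return target
    return "Swap"
```

With PMEM0 contended, demote_once("DRAM0", {"PMEM0"}) goes straight to
swap, while demote_walk("DRAM0", {"PMEM0"}) picks DRAM1.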

>   
>>> I would expect that the very first attempt wouldn't do much more than
>>> migrate to-be-reclaimed pages (without an explicit binding) with a
>> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
>> this, but I didn't do so in the current implementation since it may need
>> walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
>> easier way to do so?
> You definitely have to follow policy. You cannot demote to a node which
> is outside of the cpuset/mempolicy because you are breaking contract
> expected by the userspace. That implies doing a rmap walk.

OK, however, this may prevent demoting unmapped page cache since
there is no way to find those pages' policy.

And, we have to think about what we should do when the demotion target
conflicts with the mempolicy. The easiest way is to just skip those
conflicting pages in demotion. Or we may have to do the demotion one
page at a time instead of migrating a list of pages.

>
>>> I would also not touch the numa balancing logic at this stage and rather
>>> see how the current implementation behaves.
>> I agree we would prefer start from something simpler and see how it works.
>>
>> The "twice access" optimization is aimed to reduce the PMEM bandwidth burden
>> since the bandwidth of PMEM is scarce resource. I did compare "twice access"
>> to "no twice access", it does save a lot bandwidth for some once-off access
>> pattern. For example, when running stress test with mmtest's
>> usemem-stress-numa-compact. The kernel would promote ~600,000 pages with
>> "twice access" in 4 hours, but it would promote ~80,000,000 pages without
>> "twice access".
> I pressume this is a result of a synthetic workload, right? Or do you
> have any numbers for a real life usecase?

The test just uses usemem.
Dave Hansen April 16, 2019, 9:22 p.m. UTC | #13
On 4/16/19 12:19 PM, Yang Shi wrote:
> would we prefer to try all the nodes in the fallback order to find the
> first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?

Once a page went to DRAM1, how would we tell that it originated in DRAM0
and is following the DRAM0 path rather than the DRAM1 path?

Memory on DRAM0's path would be:

	DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap

Memory on DRAM1's path would be:

	DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap

Keith Busch had a set of patches to let you specify the demotion order
via sysfs for fun.  The rules we came up with were:
1. Pages keep no history of where they have been
2. Each node can only demote to one other node
3. The demotion path can not have cycles

That ensures that we *can't* follow the paths you described above, if we
follow those rules...
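
A small sketch of rules 2 and 3, assuming the demotion order is a plain
mapping from each node to its single demotion target (the mapping and
the function name are hypothetical, not taken from Keith's patches).
Merging the two paths above into one such mapping produces a cycle,
which the check rejects.

```python
def has_cycle(demotion_target):
    """Return True if following single demotion targets ever revisits a node.

    demotion_target maps a node name to the one node it may demote to
    (rule 2). Nodes without an entry demote to swap and end the path.
    """
    for start in demotion_target:
        seen = {start}
        node = demotion_target.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = demotion_target.get(node)
    return False


# Local pairs only: DRAM demotes to its own PMEM, PMEM falls through to swap.
local_pairs = {"DRAM0": "PMEM0", "DRAM1": "PMEM1"}

# Both full fallback paths merged into one mapping: PMEM0 -> DRAM1 and
# PMEM1 -> DRAM0 close the loop.
merged_paths = {
    "DRAM0": "PMEM0",
    "PMEM0": "DRAM1",
    "DRAM1": "PMEM1",
    "PMEM1": "DRAM0",
}
```

Here has_cycle(local_pairs) is False and has_cycle(merged_paths) is True.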
Yang Shi April 16, 2019, 9:59 p.m. UTC | #14
On 4/16/19 2:22 PM, Dave Hansen wrote:
> On 4/16/19 12:19 PM, Yang Shi wrote:
>> would we prefer to try all the nodes in the fallback order to find the
>> first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
> Once a page went to DRAM1, how would we tell that it originated in DRAM0
> and is following the DRAM0 path rather than the DRAM1 path?
>
> Memory on DRAM0's path would be:
>
> 	DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap
>
> Memory on DRAM1's path would be:
>
> 	DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap
>
> Keith Busch had a set of patches to let you specify the demotion order
> via sysfs for fun.  The rules we came up with were:
> 1. Pages keep no history of where they have been
> 2. Each node can only demote to one other node

Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM 
might be ok?

> 3. The demotion path can not have cycles

I agree with these rules; actually, my implementation implies a
similar rule. I tried to understand what Michal means. My current
implementation expects demotion to happen from the initiator to the
target in the same local pair. But Michal may expect to be able to
demote to a remote initiator or target if the local target is contended.

IMHO, demotion in the local pair makes things much simpler.

>
> That ensures that we *can't* follow the paths you described above, if we
> follow those rules...

Yes, it might create a cycle.
Dave Hansen April 16, 2019, 11:04 p.m. UTC | #15
On 4/16/19 2:59 PM, Yang Shi wrote:
> On 4/16/19 2:22 PM, Dave Hansen wrote:
>> Keith Busch had a set of patches to let you specify the demotion order
>> via sysfs for fun.  The rules we came up with were:
>> 1. Pages keep no history of where they have been
>> 2. Each node can only demote to one other node
> 
> Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
> might be ok?

In Keith's code, I don't think we differentiated.  We let any node
demote to any other node you want, as long as it follows the cycle rule.
Yang Shi April 16, 2019, 11:17 p.m. UTC | #16
On 4/16/19 4:04 PM, Dave Hansen wrote:
> On 4/16/19 2:59 PM, Yang Shi wrote:
>> On 4/16/19 2:22 PM, Dave Hansen wrote:
>>> Keith Busch had a set of patches to let you specify the demotion order
>>> via sysfs for fun.  The rules we came up with were:
>>> 1. Pages keep no history of where they have been
>>> 2. Each node can only demote to one other node
>> Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
>> might be ok?
> In Keith's code, I don't think we differentiated.  We let any node
> demote to any other node you want, as long as it follows the cycle rule.

I recall Keith's code let userspace define the target node. Anyway,
we may need to add one rule: no migrate-on-reclaim from a PMEM node.
Demoting from PMEM to DRAM sounds pointless.
Yang Shi April 16, 2019, 11:18 p.m. UTC | #17
>>>> Why cannot we start simple and build from there? In other words I 
>>>> do not
>>>> think we really need anything like N_CPU_MEM at all.
>>> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless 
>>> nodes.
>>> They would be the preferred demotion target.  Of course, we could 
>>> rely on
>>> firmware to just demote to the next best node, but it may be a 
>>> "preferred"
>>> node, if so I don't see too much benefit achieved by demotion. Am I 
>>> missing
>>> anything?
>> Why cannot we simply demote in the proximity order? Why do you make
>> cpuless nodes so special? If other close nodes are vacant then just use
>> them.

And, I suppose we agree to *not* migrate from a PMEM node (cpuless
node) to any other node on the reclaim path, right? If so, we need to
know whether the current node is a DRAM node or a PMEM node. If it is a
DRAM node, do demotion; if it is a PMEM node, do swap. So, N_CPU_MEM is
used to tell us whether the current node is a DRAM node or not.
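
A rough sketch of that decision, as a toy model rather than the patch
itself. The node table and the helper names are hypothetical; only the
idea that N_CPU_MEM is the set of nodes with both CPU and memory comes
from the patchset description.

```python
# node id -> (has_cpu, has_memory); a hypothetical two-socket DRAM + PMEM box.
NODES = {
    0: (True, True),    # DRAM0
    1: (True, True),    # DRAM1
    2: (False, True),   # PMEM0 (cpuless)
    3: (False, True),   # PMEM1 (cpuless)
}


def n_cpu_mem(nodes):
    """Nodes that have both CPU and memory, i.e. the "primary" nodes."""
    return {nid for nid, (cpu, mem) in nodes.items() if cpu and mem}


def reclaim_action(nid, nodes):
    """Demote from primary nodes; swap from cpuless (PMEM) nodes."""
    return "demote" if nid in n_cpu_mem(nodes) else "swap"
```

With this table, reclaim on node 0 demotes and reclaim on node 2 swaps.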

> We could. But, this raises another question, would we prefer to just 
> demote to the next fallback node (just try once), if it is contended, 
> then just swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to 
> try all the nodes in the fallback order to find the first less 
> contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
>
>
> |-------|     |-------|                       |-------|     |-------|
> | PMEM0 | --- | DRAM0 | --- CPU0 --- CPU1 --- | DRAM1 | --- | PMEM1 |
> |-------|     |-------|                       |-------|     |-------|
>
> The first one sounds simpler, and the current implementation does so 
> and this needs find out the closest PMEM node by recognizing cpuless 
> node.
>
> If we prefer go with the second option, it is definitely unnecessary 
> to specialize any node.
>
Michal Hocko April 17, 2019, 9:17 a.m. UTC | #18
On Tue 16-04-19 12:19:21, Yang Shi wrote:
> 
> 
> On 4/16/19 12:47 AM, Michal Hocko wrote:
[...]
> > Why cannot we simply demote in the proximity order? Why do you make
> > cpuless nodes so special? If other close nodes are vacant then just use
> > them.
> 
> We could. But, this raises another question, would we prefer to just demote
> to the next fallback node (just try once), if it is contended, then just
> swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes
> in the fallback order to find the first less contended one (i.e. DRAM0 ->
> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?

I would go with the latter. Why? Because it is more natural. That is
the natural allocation path, so I do not see why it shouldn't be the
natural demotion path.

> 
> |-------|     |-------|                       |-------|     |-------|
> | PMEM0 | --- | DRAM0 | --- CPU0 --- CPU1 --- | DRAM1 | --- | PMEM1 |
> |-------|     |-------|                       |-------|     |-------|
> 
> The first one sounds simpler, and the current implementation does so and
> this needs find out the closest PMEM node by recognizing cpuless node.

Unless you are specifying an explicit nodemask then the allocator will
do the allocation fallback for the migration target for you.

> If we prefer go with the second option, it is definitely unnecessary to
> specialize any node.
> 
> > > > I would expect that the very first attempt wouldn't do much more than
> > > > migrate to-be-reclaimed pages (without an explicit binding) with a
> > > Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
> > > this, but I didn't do so in the current implementation since it may need
> > > walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
> > > easier way to do so?
> > You definitely have to follow policy. You cannot demote to a node which
> > is outside of the cpuset/mempolicy because you are breaking contract
> > expected by the userspace. That implies doing a rmap walk.
> 
> OK, however, this may prevent from demoting unmapped page cache since there
> is no way to find those pages' policy.

I do not really expect that hard numa binding for the page cache is a
usecase we really have to lose sleep over for now.

> And, we have to think about what we should do when the demotion target has
> conflict with the mempolicy.

Simply skip it.

> The easiest way is to just skip those conflict
> pages in demotion. Or we may have to do the demotion one page by one page
> instead of migrating a list of pages.

Yes, one page at a time sounds reasonable to me. This is how we do
reclaim anyway.
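
A minimal sketch of per-page demotion that skips pages whose policy
excludes the target, under the assumption that each page's allowed
nodes have already been looked up (in the kernel that would imply the
rmap walk mentioned above). The data layout and the function name are
hypothetical.

```python
def demote_pages(pages, target):
    """Demote one page at a time, skipping policy conflicts.

    pages is a list of (page_id, allowed_nodes) tuples, where allowed_nodes
    is a set of node names or None when no mempolicy/cpuset restricts the page.
    Returns (demoted, skipped) lists of page ids.
    """
    demoted, skipped = [], []
    for page_id, allowed in pages:
        if allowed is not None and target not in allowed:
            # The target is outside the page's policy: leave the page alone.
            skipped.append(page_id)
        else:
            demoted.append(page_id)
    return demoted, skipped
```

For example, demoting [(1, None), (2, {"DRAM0"}), (3, {"DRAM0", "PMEM0"})]
to "PMEM0" moves pages 1 and 3 and skips page 2.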
Michal Hocko April 17, 2019, 9:23 a.m. UTC | #19
On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> On 4/16/19 12:19 PM, Yang Shi wrote:
> > would we prefer to try all the nodes in the fallback order to find the
> > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
> 
> Once a page went to DRAM1, how would we tell that it originated in DRAM0
> and is following the DRAM0 path rather than the DRAM1 path?
> 
> Memory on DRAM0's path would be:
> 
> 	DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap
> 
> Memory on DRAM1's path would be:
> 
> 	DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap
> 
> Keith Busch had a set of patches to let you specify the demotion order
> via sysfs for fun.  The rules we came up with were:

I am not a fan of any sysfs "fun"

> 1. Pages keep no history of where they have been

makes sense

> 2. Each node can only demote to one other node

Not really, see my other email. I do not really see any strong reason
not to use the full zonelist to demote to.

> 3. The demotion path can not have cycles

Yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
the migration target. That should prevent loops or artificial node
exhaustion quite naturally AFAICS. Maybe we will need some tricks to
raise the watermark but I am not convinced something like that is really
necessary.
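
A toy model of the opportunistic idea, assuming a single watermark per
node and a hypothetical try_demote() helper. It only captures "take a
free page if one is available above the watermark, otherwise give up";
a real GFP_NOWAIT allocation can still wake kswapd, which is not
modelled here.

```python
def try_demote(zonelist, free_pages, watermark):
    """Pick the first node in zonelist with free pages above its watermark.

    Never reclaims or waits: if no node qualifies, return None so the
    caller can fall back to regular reclaim/swap for this page.
    """
    for node in zonelist:
        if free_pages[node] > watermark[node]:
            free_pages[node] -= 1   # account for the migrated page
            return node
    return None
```

A node sitting at its watermark is simply passed over, so a full node
does not accept pages it would have to push out again immediately.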
Keith Busch April 17, 2019, 3:13 p.m. UTC | #20
On Tue, Apr 16, 2019 at 04:17:44PM -0700, Yang Shi wrote:
> On 4/16/19 4:04 PM, Dave Hansen wrote:
> > On 4/16/19 2:59 PM, Yang Shi wrote:
> > > On 4/16/19 2:22 PM, Dave Hansen wrote:
> > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > via sysfs for fun.  The rules we came up with were:
> > > > 1. Pages keep no history of where they have been
> > > > 2. Each node can only demote to one other node
> > > Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM
> > > might be ok?
> > In Keith's code, I don't think we differentiated.  We let any node
> > demote to any other node you want, as long as it follows the cycle rule.
> 
> I recall Keith's code let the userspace define the target node.

Right, you have to opt-in in my original proposal since it may be a
bit presumptuous of the kernel to decide how a node's memory is going
to be used. User applications have other intentions for it.

It wouldn't be too difficult to have HMAT create a reasonable initial
migration graph too, and that could also be an opt-in user choice.

> Anyway, we may need add one rule: not migrate-on-reclaim from PMEM
> node.  Demoting from  PMEM to DRAM sounds pointless.

I really don't think we should be making such hard rules on PMEM. It
makes more sense to consider performance and locality for migration
rules than on a persistence attribute.
Keith Busch April 17, 2019, 3:23 p.m. UTC | #21
On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > Keith Busch had a set of patches to let you specify the demotion order
> > via sysfs for fun.  The rules we came up with were:
> 
> I am not a fan of any sysfs "fun"

I'm hung up on the user facing interface, but there should be some way a
user decides if a memory node is or is not a migrate target, right?
Keith Busch April 17, 2019, 3:37 p.m. UTC | #22
On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > via sysfs for fun.  The rules we came up with were:
> > > 
> > > I am not a fan of any sysfs "fun"
> > 
> > I'm hung up on the user facing interface, but there should be some way a
> > user decides if a memory node is or is not a migrate target, right?
> 
> Why? Or to put it differently, why do we have to start with a user
> interface at this stage when we actually barely have any real usecases
> out there?

The use case is an alternative to swap, right? The user has to decide
which storage is the swap target, so operating in the same spirit.
Michal Hocko April 17, 2019, 3:39 p.m. UTC | #23
On Wed 17-04-19 09:23:46, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > Keith Busch had a set of patches to let you specify the demotion order
> > > via sysfs for fun.  The rules we came up with were:
> > 
> > I am not a fan of any sysfs "fun"
> 
> I'm hung up on the user facing interface, but there should be some way a
> user decides if a memory node is or is not a migrate target, right?

Why? Or to put it differently, why do we have to start with a user
interface at this stage when we actually barely have any real usecases
out there?
Michal Hocko April 17, 2019, 4:39 p.m. UTC | #24
On Wed 17-04-19 09:37:39, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > via sysfs for fun.  The rules we came up with were:
> > > > 
> > > > I am not a fan of any sysfs "fun"
> > > 
> > > I'm hung up on the user facing interface, but there should be some way a
> > > user decides if a memory node is or is not a migrate target, right?
> > 
> > Why? Or to put it differently, why do we have to start with a user
> > interface at this stage when we actually barely have any real usecases
> > out there?
> 
> The use case is an alternative to swap, right? The user has to decide
> which storage is the swap target, so operating in the same spirit.

I do not follow. If you use rebalancing you can still deplete the memory
and end up in a swap storage. If you want to reclaim/swap rather than
rebalance then you do not enable rebalancing (by node_reclaim or similar
mechanism).
Dave Hansen April 17, 2019, 5:13 p.m. UTC | #25
On 4/17/19 2:23 AM, Michal Hocko wrote:
>> 3. The demotion path can not have cycles
> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> the migration target. That should prevent from loops or artificial nodes
> exhausting quite naturaly AFAICS. Maybe we will need some tricks to
> raise the watermark but I am not convinced something like that is really
> necessary.

I don't think GFP_NOWAIT alone is good enough.

Let's say we have a system full of clean page cache and only two nodes:
0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
Each kswapd will be migrating pages to the *other* node since each is in
the other's fallback path.

I think what you're saying is that, eventually, the kswapds will see
allocation failures and stop migrating, providing hysteresis.  This is
probably true.

But, I'm more concerned about that window where the kswapds are throwing
pages at each other because they're effectively just wasting resources
in this window.  I guess we should figure out how large this window is
and how fast (or if) the dampening occurs in practice.
Yang Shi April 17, 2019, 5:26 p.m. UTC | #26
On 4/17/19 9:39 AM, Michal Hocko wrote:
> On Wed 17-04-19 09:37:39, Keith Busch wrote:
>> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
>>> On Wed 17-04-19 09:23:46, Keith Busch wrote:
>>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
>>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
>>>>>> Keith Busch had a set of patches to let you specify the demotion order
>>>>>> via sysfs for fun.  The rules we came up with were:
>>>>> I am not a fan of any sysfs "fun"
>>>> I'm hung up on the user facing interface, but there should be some way a
>>>> user decides if a memory node is or is not a migrate target, right?
>>> Why? Or to put it differently, why do we have to start with a user
>>> interface at this stage when we actually barely have any real usecases
>>> out there?
>> The use case is an alternative to swap, right? The user has to decide
>> which storage is the swap target, so operating in the same spirit.
> I do not follow. If you use rebalancing you can still deplete the memory
> and end up in a swap storage. If you want to reclaim/swap rather than
> rebalance then you do not enable rebalancing (by node_reclaim or similar
> mechanism).

I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
rebalancing mode? If rebalancing is on, then node_reclaim just moves the
pages around nodes, then kswapd or direct reclaim would take care of swap?

If so, node reclaim on a PMEM node may rebalance the pages to a DRAM
node? Should this be allowed?

I think both Keith and I intended to treat PMEM as a tier in the
reclaim hierarchy. Reclaim should push inactive pages down to PMEM,
then to swap. So, PMEM is kind of a "terminal" node. That is why he
introduced a sysfs-defined target node and I introduced N_CPU_MEM.

>
Keith Busch April 17, 2019, 5:29 p.m. UTC | #27
On Wed, Apr 17, 2019 at 10:26:05AM -0700, Yang Shi wrote:
> On 4/17/19 9:39 AM, Michal Hocko wrote:
> > On Wed 17-04-19 09:37:39, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > > > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > > > via sysfs for fun.  The rules we came up with were:
> > > > > > I am not a fan of any sysfs "fun"
> > > > > I'm hung up on the user facing interface, but there should be some way a
> > > > > user decides if a memory node is or is not a migrate target, right?
> > > > Why? Or to put it differently, why do we have to start with a user
> > > > interface at this stage when we actually barely have any real usecases
> > > > out there?
> > > The use case is an alternative to swap, right? The user has to decide
> > > which storage is the swap target, so operating in the same spirit.
> > I do not follow. If you use rebalancing you can still deplete the memory
> > and end up in a swap storage. If you want to reclaim/swap rather than
> > rebalance then you do not enable rebalancing (by node_reclaim or similar
> > mechanism).
> 
> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
> rebalancing mode? If rebalancing is on, then node_reclaim just move the
> pages around nodes, then kswapd or direct reclaim would take care of swap?
> 
> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
> Should this be allowed?
> 
> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim
> hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined
> target node, I introduced N_CPU_MEM.

Yeah, I think Yang and I view "demotion" as a separate feature from
numa rebalancing.
Michal Hocko April 17, 2019, 5:51 p.m. UTC | #28
On Wed 17-04-19 10:26:05, Yang Shi wrote:
> 
> 
> On 4/17/19 9:39 AM, Michal Hocko wrote:
> > On Wed 17-04-19 09:37:39, Keith Busch wrote:
> > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
> > > > On Wed 17-04-19 09:23:46, Keith Busch wrote:
> > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
> > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote:
> > > > > > > Keith Busch had a set of patches to let you specify the demotion order
> > > > > > > via sysfs for fun.  The rules we came up with were:
> > > > > > I am not a fan of any sysfs "fun"
> > > > > I'm hung up on the user facing interface, but there should be some way a
> > > > > user decides if a memory node is or is not a migrate target, right?
> > > > Why? Or to put it differently, why do we have to start with a user
> > > > interface at this stage when we actually barely have any real usecases
> > > > out there?
> > > The use case is an alternative to swap, right? The user has to decide
> > > which storage is the swap target, so operating in the same spirit.
> > I do not follow. If you use rebalancing you can still deplete the memory
> > and end up in a swap storage. If you want to reclaim/swap rather than
> > rebalance then you do not enable rebalancing (by node_reclaim or similar
> > mechanism).
> 
> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
> rebalancing mode? If rebalancing is on, then node_reclaim just move the
> pages around nodes, then kswapd or direct reclaim would take care of swap?

Yes, that was the idea I wanted to get through. Sorry if that was not
really clear.

> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
> Should this be allowed?

Why shouldn't it? If there are other vacant nodes to absorb that memory
then why not use them?

> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim
> hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined
> target node, I introduced N_CPU_MEM.

I understand that. And I am trying to figure out whether we really have
to treat PMEM specially here. Why is it any better than generic NUMA
rebalancing code that could be used for many other usecases which are
not PMEM specific? If you present PMEM as regular memory then also use
it as normal memory.
Michal Hocko April 17, 2019, 5:57 p.m. UTC | #29
On Wed 17-04-19 10:13:44, Dave Hansen wrote:
> On 4/17/19 2:23 AM, Michal Hocko wrote:
> >> 3. The demotion path can not have cycles
> > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> > the migration target. That should prevent from loops or artificial nodes
> > exhausting quite naturaly AFAICS. Maybe we will need some tricks to
> > raise the watermark but I am not convinced something like that is really
> > necessary.
> 
> I don't think GFP_NOWAIT alone is good enough.
> 
> Let's say we have a system full of clean page cache and only two nodes:
> 0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
> Each kswapd will be migrating pages to the *other* node since each is in
> the other's fallback path.

I was thinking along the lines of node_reclaim-like migration. You are
right that a parallel kswapd might reclaim enough to cause the ping pong
and we might need to play some watermark tricks, but as you say below
this is to be seen and a playground to explore. All I am saying is to
try the most simplistic approach first without all the bells and
whistles to see how this plays out with real workloads and build on top
of that.

We already have a model - node_reclaim - which turned out to suck a lot
because the reclaim was just too aggressive wrt. refault. Maybe
migration will turn out much more feasible. And maybe I am completely
wrong and we need a much more complex solution.

> I think what you're saying is that, eventually, the kswapds will see
> allocation failures and stop migrating, providing hysteresis.  This is
> probably true.
> 
> But, I'm more concerned about that window where the kswapds are throwing
> pages at each other because they're effectively just wasting resources
> in this window.  I guess we should figure our how large this window is
> and how fast (or if) the dampening occurs in practice.
Yang Shi April 17, 2019, 8:43 p.m. UTC | #30
>>
>>>> I would also not touch the numa balancing logic at this stage and 
>>>> rather
>>>> see how the current implementation behaves.
>>> I agree we would prefer start from something simpler and see how it 
>>> works.
>>>
>>> The "twice access" optimization is aimed to reduce the PMEM 
>>> bandwidth burden
>>> since the bandwidth of PMEM is scarce resource. I did compare "twice 
>>> access"
>>> to "no twice access", it does save a lot bandwidth for some once-off 
>>> access
>>> pattern. For example, when running stress test with mmtest's
>>> usemem-stress-numa-compact. The kernel would promote ~600,000 pages 
>>> with
>>> "twice access" in 4 hours, but it would promote ~80,000,000 pages 
>>> without
>>> "twice access".
>> I pressume this is a result of a synthetic workload, right? Or do you
>> have any numbers for a real life usecase?
>
> The test just uses usemem.

I tried to run some more realistic use cases. The result below is from
mmtest's db-sysbench-mariadb-oltp-rw-medium test, which is a typical
database workload, with and w/o the "twice access" optimization.

               w/         w/o
promotion      32771      312250

We can see the kernel did roughly 10x as many promotions w/o the "twice
access" optimization.
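
For reference, the ratios implied by the numbers in this thread can be
checked with a few lines. The promotion counts are the ones reported
above; the variable names are only illustrative.

```python
# Promotion counts reported in this thread.
mariadb_with, mariadb_without = 32771, 312250        # sysbench MariaDB OLTP
usemem_with, usemem_without = 600_000, 80_000_000    # usemem stress, ~4 hours

# How many times more promotions happen without the "twice access" check.
mariadb_ratio = mariadb_without / mariadb_with
usemem_ratio = usemem_without / usemem_with

print(f"mariadb: {mariadb_ratio:.1f}x, usemem: {usemem_ratio:.0f}x")
```

So the database workload sees about 9.5x the promotions without the
check, and the once-off usemem pattern about 133x.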

I also tried the kernel-devel and redis tests in mmtest, but they can't
generate enough memory pressure, so I had to run the usemem test
alongside them to generate memory pressure. However, this brought in
huge noise, particularly for the w/o "twice access" case. But the
MariaDB test above should be able to demonstrate the improvement
achieved by this optimization.

And, I'm wondering whether this optimization is also suitable for general
NUMA balancing or not.
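
To illustrate why once-off access patterns benefit so much, here is a
toy user-space model of the "twice access" idea. It is not the
page_check_references() change from the patchset; the function name and
the sets used to track state are hypothetical.

```python
def count_promotions(accesses, twice_access):
    """Count pages promoted to DRAM for a stream of page accesses.

    With twice_access=True a page is promoted only on its second observed
    access; with twice_access=False it is promoted on the first one.
    Accesses to pages that are already in DRAM are ignored.
    """
    seen_once, in_dram = set(), set()
    for page in accesses:
        if page in in_dram:
            continue
        if not twice_access or page in seen_once:
            seen_once.discard(page)
            in_dram.add(page)
        else:
            seen_once.add(page)
    return len(in_dram)
```

A stream that touches 1000 distinct pages once each promotes 1000 pages
without the check and none with it, while a page that is touched
repeatedly is promoted exactly once either way.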
Michal Hocko April 18, 2019, 9:02 a.m. UTC | #31
On Wed 17-04-19 13:43:44, Yang Shi wrote:
[...]
> And, I'm wondering whether this optimization is also suitable to general
> NUMA balancing or not.

If there are convincing numbers then this should be a preferable way to
deal with it. Please note that the number of promotions is not the only
metric to watch. The overall performance/access latency would be another one.
Yang Shi April 18, 2019, 4:24 p.m. UTC | #32
On 4/17/19 10:51 AM, Michal Hocko wrote:
> On Wed 17-04-19 10:26:05, Yang Shi wrote:
>> On 4/17/19 9:39 AM, Michal Hocko wrote:
>>> On Wed 17-04-19 09:37:39, Keith Busch wrote:
>>>> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
>>>>> On Wed 17-04-19 09:23:46, Keith Busch wrote:
>>>>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
>>>>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote:
>>>>>>>> Keith Busch had a set of patches to let you specify the demotion order
>>>>>>>> via sysfs for fun.  The rules we came up with were:
>>>>>>> I am not a fan of any sysfs "fun"
>>>>>> I'm hung up on the user facing interface, but there should be some way a
>>>>>> user decides if a memory node is or is not a migrate target, right?
>>>>> Why? Or to put it differently, why do we have to start with a user
>>>>> interface at this stage when we actually barely have any real usecases
>>>>> out there?
>>>> The use case is an alternative to swap, right? The user has to decide
>>>> which storage is the swap target, so operating in the same spirit.
>>> I do not follow. If you use rebalancing you can still deplete the memory
>>> and end up in a swap storage. If you want to reclaim/swap rather than
>>> rebalance then you do not enable rebalancing (by node_reclaim or similar
>>> mechanism).
>> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
>> rebalancing mode? If rebalancing is on, then node_reclaim just move the
>> pages around nodes, then kswapd or direct reclaim would take care of swap?
> Yes, that was the idea I wanted to get through. Sorry if that was not
> really clear.
>
>> If so the node reclaim on PMEM node may rebalance the pages to DRAM node?
>> Should this be allowed?
> Why it shouldn't? If there are other vacant Nodes to absorb that memory
> then why not use it?
>
>> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim
>> hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
>> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined
>> target node, I introduced N_CPU_MEM.
> I understand that. And I am trying to figure out whether we really have
> to tream PMEM specially here. Why is it any better than a generic NUMA
> rebalancing code that could be used for many other usecases which are
> not PMEM specific. If you present PMEM as a regular memory then also use
> it as a normal memory.

This also makes some sense. We are just looking at PMEM from different
points of view. In this patchset, taking the performance disparity into
account may outweigh treating it as normal memory.

A perhaps ridiculous idea: could we have two modes? One for
"rebalancing", the other for "demotion"?
Keith Busch April 18, 2019, 6:16 p.m. UTC | #33
On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
> On 4/17/19 2:23 AM, Michal Hocko wrote:
> > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
> > the migration target. That should prevent from loops or artificial nodes
> > exhausting quite naturaly AFAICS. Maybe we will need some tricks to
> > raise the watermark but I am not convinced something like that is really
> > necessary.
> 
> I don't think GFP_NOWAIT alone is good enough.
> 
> Let's say we have a system full of clean page cache and only two nodes:
> 0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
> Each kswapd will be migrating pages to the *other* node since each is in
> the other's fallback path.
> 
> I think what you're saying is that, eventually, the kswapds will see
> allocation failures and stop migrating, providing hysteresis.  This is
> probably true.
> 
> But, I'm more concerned about that window where the kswapds are throwing
> pages at each other because they're effectively just wasting resources
> in this window.  I guess we should figure out how large this window is
> and how fast (or if) the dampening occurs in practice.

I'm still refining tests to help answer this and have some preliminary
data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
fast swap device. The test has an application strict mbind more than
the total memory to node 0, and forever writes random cachelines from
per-cpu threads.

I'm testing two memory pressure policies:

  Node 0 can migrate to Node 1, no cycles
  Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)

After the initial ramp up time, the second policy is ~7-10% slower than
no cycles. There doesn't appear to be a temporary window dealing with
bouncing pages: it's just a slower overall steady state. Looks like when
migration fails and falls back to swap, the newly freed pages occasionally
get sniped by the other node, keeping the pressure up.
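
The two policies can be written down as per-node migration target maps. The following is a minimal sketch, not the test harness or kernel code; `has_cycle`, `no_cycles`, and `with_cycles` are names made up for illustration.

```python
def has_cycle(targets):
    """targets maps a node to the node it migrates to under memory pressure.

    A node missing from the map has no migration target (it falls back to swap).
    """
    for start in targets:
        seen = {start}
        node = targets.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = targets.get(node)
    return False


no_cycles = {0: 1}          # Node 0 can migrate to Node 1, no cycles
with_cycles = {0: 1, 1: 0}  # Node 0 and Node 1 migrate with each other
```

Under the second map a page pushed out of Node 0 can be pushed straight back by Node 1, which is the 0 -> 1 -> 0 cycle being measured.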
Yang Shi April 18, 2019, 7:23 p.m. UTC | #34
On 4/18/19 11:16 AM, Keith Busch wrote:
> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
>> On 4/17/19 2:23 AM, Michal Hocko wrote:
>>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
>>> the migration target. That should prevent from loops or artificial nodes
>>> exhausting quite naturally AFAICS. Maybe we will need some tricks to
>>> raise the watermark but I am not convinced something like that is really
>>> necessary.
>> I don't think GFP_NOWAIT alone is good enough.
>>
>> Let's say we have a system full of clean page cache and only two nodes:
>> 0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
>> Each kswapd will be migrating pages to the *other* node since each is in
>> the other's fallback path.
>>
>> I think what you're saying is that, eventually, the kswapds will see
>> allocation failures and stop migrating, providing hysteresis.  This is
>> probably true.
>>
>> But, I'm more concerned about that window where the kswapds are throwing
>> pages at each other because they're effectively just wasting resources
>> in this window.  I guess we should figure out how large this window is
>> and how fast (or if) the dampening occurs in practice.
> I'm still refining tests to help answer this and have some preliminary
> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
> fast swap device. The test has an application strict mbind more than
> the total memory to node 0, and forever writes random cachelines from
> per-cpu threads.

Thanks for the test. A follow-up question: how large is each node? Is
node 1 bigger than node 0? PMEM typically has larger capacity, so I'm
wondering whether the capacity difference would change the results.

> I'm testing two memory pressure policies:
>
>    Node 0 can migrate to Node 1, no cycles
>    Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)
>
> After the initial ramp up time, the second policy is ~7-10% slower than
> no cycles. There doesn't appear to be a temporary window dealing with
> bouncing pages: it's just a slower overall steady state. Looks like when
> migration fails and falls back to swap, the newly freed pages occasionally
> get sniped by the other node, keeping the pressure up.
Zi Yan April 18, 2019, 9:07 p.m. UTC | #35
On 18 Apr 2019, at 15:23, Yang Shi wrote:

> On 4/18/19 11:16 AM, Keith Busch wrote:
>> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
>>> On 4/17/19 2:23 AM, Michal Hocko wrote:
>>>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
>>>> the migration target. That should prevent from loops or artificial nodes
>>>> exhausting quite naturally AFAICS. Maybe we will need some tricks to
>>>> raise the watermark but I am not convinced something like that is really
>>>> necessary.
>>> I don't think GFP_NOWAIT alone is good enough.
>>>
>>> Let's say we have a system full of clean page cache and only two nodes:
>>> 0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
>>> Each kswapd will be migrating pages to the *other* node since each is in
>>> the other's fallback path.
>>>
>>> I think what you're saying is that, eventually, the kswapds will see
>>> allocation failures and stop migrating, providing hysteresis.  This is
>>> probably true.
>>>
>>> But, I'm more concerned about that window where the kswapds are throwing
>>> pages at each other because they're effectively just wasting resources
>>> in this window.  I guess we should figure out how large this window is
>>> and how fast (or if) the dampening occurs in practice.
>> I'm still refining tests to help answer this and have some preliminary
>> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
>> fast swap device. The test has an application strict mbind more than
>> the total memory to node 0, and forever writes random cachelines from
>> per-cpu threads.
>
> Thanks for the test. A follow-up question: how large is each node? Is node 1 bigger than node 0? PMEM typically has larger capacity, so I'm wondering whether the capacity difference would change the results.
>
>> I'm testing two memory pressure policies:
>>
>>    Node 0 can migrate to Node 1, no cycles
>>    Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)
>>
>> After the initial ramp up time, the second policy is ~7-10% slower than
>> no cycles. There doesn't appear to be a temporary window dealing with
>> bouncing pages: it's just a slower overall steady state. Looks like when
>> migration fails and falls back to swap, the newly freed pages occasionally
>> get sniped by the other node, keeping the pressure up.


In addition to these two policies, I am curious how MPOL_PREFERRED set to
Node 0 performs. I just wonder how badly static page allocation does.

--
Best Regards,
Yan Zi
Fengguang Wu May 1, 2019, 5:20 a.m. UTC | #36
On Thu, Apr 18, 2019 at 11:02:27AM +0200, Michal Hocko wrote:
>On Wed 17-04-19 13:43:44, Yang Shi wrote:
>[...]
>> And, I'm wondering whether this optimization is also suitable to general
>> NUMA balancing or not.
>
>If there are convincing numbers then this should be a preferable way to
>deal with it. Please note that the number of promotions is not the only
>metric to watch. The overall performance/access latency would be another one.

Good question. Shi and I synced up today. We also talked with Mel (sorry,
I must have missed some points because of my poor English listening). It
becomes clear that

1) PMEM/DRAM page promotion/demotion is a hard problem to attack.
There will and should be multiple approaches for open discussion
before settling down. The criteria might be a balance of complexity,
overhead, performance, etc.

2) We need a lot more data to lay a solid foundation for effective
discussions. Testing will be a rather time-consuming part for
contributors. We'll need to work together to create a number of
benchmarks that can well exercise the kernel promotion/demotion paths
and gather the necessary numbers. By collaborating on a common set of
tests, we can not only amortize efforts, but also compare different
approaches or compare v1/v2/... of the same approach conveniently.

Ying has already created several LKP test cases for that purpose.
Shi and I plan to join the efforts, too.

Thanks,
Fengguang
Fengguang Wu May 1, 2019, 6:43 a.m. UTC | #37
On Wed, Apr 17, 2019 at 11:17:48AM +0200, Michal Hocko wrote:
>On Tue 16-04-19 12:19:21, Yang Shi wrote:
>>
>>
>> On 4/16/19 12:47 AM, Michal Hocko wrote:
>[...]
>> > Why cannot we simply demote in the proximity order? Why do you make
>> > cpuless nodes so special? If other close nodes are vacant then just use
>> > them.
>>
>> We could. But, this raises another question, would we prefer to just demote
>> to the next fallback node (just try once), if it is contended, then just
>> swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes
>> in the fallback order to find the first less contended one (i.e. DRAM0 ->
>> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?
>
>I would go with the latter. Why, because it is more natural. Because that
>is the natural allocation path so I do not see why this shouldn't be the
>natural demotion path.

"Demotion" should be more performance wise by "demoting to the
next-level (cheaper/slower) memory". Otherwise something like this
may happen.

DRAM0 pressured => demote cold pages to DRAM1 
DRAM1 pressured => demote cold pages to DRAM0

In effect, DRAM0 and DRAM1 just exchange a fraction of their demoted cold
pages, which does not look helpful for overall system performance.

Over time, it's even possible that some cold pages get "demoted" along
the path DRAM0=>DRAM1=>DRAM0=>DRAM1=>...
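
The difference can be illustrated with a toy target picker. This is a sketch under stated assumptions, not kernel code: the `TIER` and `FALLBACK` tables, `pick_target`, and the `lower_tier_only` switch are all hypothetical, and DRAM1's fallback order is assumed to mirror DRAM0's.

```python
# Hypothetical topology: tier 0 is DRAM, tier 1 is PMEM.
TIER = {"DRAM0": 0, "PMEM0": 1, "DRAM1": 0, "PMEM1": 1}

# Assumed fallback order per DRAM node (DRAM0's follows the order quoted above).
FALLBACK = {
    "DRAM0": ["PMEM0", "DRAM1", "PMEM1"],
    "DRAM1": ["PMEM1", "DRAM0", "PMEM0"],
}


def pick_target(src, contended, lower_tier_only):
    """Walk the fallback order and return the first usable node, or None for swap."""
    for node in FALLBACK[src]:
        if node in contended:
            continue  # node is under pressure itself, try the next one
        if lower_tier_only and TIER[node] <= TIER[src]:
            continue  # never "demote" to memory of the same or a faster tier
        return node
    return None  # nothing suitable: fall back to swap


busy = {"PMEM0", "PMEM1"}  # both PMEM nodes are contended
```

With both PMEM nodes contended, plain fallback-order demotion sends DRAM0's cold pages to DRAM1 and DRAM1's to DRAM0, while the lower-tier-only variant returns `None` for both and leaves the pages to swap.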

Thanks,
Fengguang