Message ID | 1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
---|---
Series | Another Approach to Use PMEM as NUMA Node
This isn't so much another approach as it is some tweaks on top of what's there, right? This set seems to present a bunch of ideas, like "promote if accessed twice". Seems like a good idea, but I'm a lot more interested in seeing data about it being a good idea. What workloads is it good for? Bad for? These look like fun to play with, but I'd be really curious what you think needs to be done before we start merging these ideas.
On Thu 11-04-19 11:56:50, Yang Shi wrote: [...] > Design > ====== > Basically, the approach is aimed to spread data from DRAM (closest to local > CPU) down further to PMEM and disk (typically assume the lower tier storage > is slower, larger and cheaper than the upper tier) by their hotness. The > patchset tries to achieve this goal by doing memory promotion/demotion via > NUMA balancing and memory reclaim as what the below diagram shows: > > DRAM <--> PMEM <--> Disk > ^ ^ > |-------------------| > swap > > When DRAM has memory pressure, demote pages to PMEM via page reclaim path. > Then NUMA balancing will promote pages to DRAM as long as the page is referenced > again. The memory pressure on PMEM node would push the inactive pages of PMEM > to disk via swap. > > The promotion/demotion happens only between "primary" nodes (the nodes have > both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes > and promotion from DRAM to PMEM and demotion from PMEM to DRAM. > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory range > that has differentiated performance from the conventional memory pool, or > differentiated performance for a specific initiator, per Dan Williams. So, > assuming PMEM nodes are cpuless nodes sounds reasonable. > > However, cpuless nodes might be not PMEM nodes. But, actually, memory > promotion/demotion doesn't care what kind of memory will be the target nodes, > it could be DRAM, PMEM or something else, as long as they are the second tier > memory (slower, larger and cheaper than regular DRAM), otherwise it sounds > pointless to do such demotion. > > Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in > order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and > memoryless nodes (some architectures, i.e. Power, may have memoryless nodes). > Typically, memory allocation would happen on such nodes by default unless > cpuless nodes are specified explicitly, cpuless nodes would be just fallback > nodes, so they are also as known as "primary" nodes in this patchset. With > two tier memory system (i.e. DRAM + PMEM), this sounds good enough to > demonstrate the promotion/demotion approach for now, and this looks more > architecture-independent. But it may be better to construct such node mask > by reading hardware information (i.e. HMAT), particularly for more complex > memory hierarchy.

I still believe you are overcomplicating this without a strong reason. Why cannot we start simple and build from there? In other words I do not think we really need anything like N_CPU_MEM at all. I would expect that the very first attempt wouldn't do much more than migrate to-be-reclaimed pages (without an explicit binding) with a very optimistic allocation strategy (effectively GFP_NOWAIT) and if that fails then simply give up. All that hooked essentially to the node_reclaim path with a new node_reclaim mode so that the behavior would be opt-in. This should be the most simplistic way to start AFAICS and something people can play with without risking regressions.

Once we see how that behaves in the real world and what kind of corner cases users are able to trigger then we can build on top. E.g. do we want to migrate from cpuless nodes as well? I am not really sure TBH. On one hand why not if other nodes are free to hold that memory? Swap out is more expensive. Anyway, this is the kind of decision which should rather be shaped by existing experience than decided ad hoc right now.
I would also not touch the numa balancing logic at this stage and rather see how the current implementation behaves.
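[Editor's note: to make the suggestion above concrete, here is a minimal sketch of what an opt-in, reclaim-hooked demotion could look like, assuming a new node_reclaim_mode bit. The RECLAIM_MIGRATE bit and the alloc_demote_page()/demote_page_list() names are hypothetical; only migrate_pages(), alloc_pages_node(), node_reclaim_mode, and the GFP flags are existing kernel interfaces, and this is not the patchset's actual code.]

```c
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/swap.h>

#define RECLAIM_MIGRATE	(1 << 3)	/* hypothetical opt-in node_reclaim_mode bit */

static struct page *alloc_demote_page(struct page *page, unsigned long private)
{
	int target_nid = (int)private;

	/* Optimistic, non-sleeping attempt on the target node only. */
	return alloc_pages_node(target_nid,
				GFP_NOWAIT | __GFP_THISNODE | __GFP_NOWARN, 0);
}

/* Called from the reclaim path on a list of to-be-reclaimed pages. */
static int demote_page_list(struct list_head *demote_pages, int target_nid)
{
	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return 0;

	/*
	 * Pages whose GFP_NOWAIT target allocation fails stay on the list
	 * and are reclaimed as before - i.e. simply "give up" on failure.
	 * The migrate reason is arbitrary for this sketch.
	 */
	return migrate_pages(demote_pages, alloc_demote_page, NULL,
			     target_nid, MIGRATE_ASYNC, MR_NUMA_MISPLACED);
}
```

Because the migration attempt never sleeps or reclaims on the target node, a failed demotion costs little and falls straight back to the existing reclaim behavior.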
On 4/12/19 1:47 AM, Michal Hocko wrote: > On Thu 11-04-19 11:56:50, Yang Shi wrote: > [...] >> Design >> ====== >> Basically, the approach is aimed to spread data from DRAM (closest to local >> CPU) down further to PMEM and disk (typically assume the lower tier storage >> is slower, larger and cheaper than the upper tier) by their hotness. The >> patchset tries to achieve this goal by doing memory promotion/demotion via >> NUMA balancing and memory reclaim as what the below diagram shows: >> >> DRAM <--> PMEM <--> Disk >> ^ ^ >> |-------------------| >> swap >> >> When DRAM has memory pressure, demote pages to PMEM via page reclaim path. >> Then NUMA balancing will promote pages to DRAM as long as the page is referenced >> again. The memory pressure on PMEM node would push the inactive pages of PMEM >> to disk via swap. >> >> The promotion/demotion happens only between "primary" nodes (the nodes have >> both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes >> and promotion from DRAM to PMEM and demotion from PMEM to DRAM. >> >> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range >> that has differentiated performance from the conventional memory pool, or >> differentiated performance for a specific initiator, per Dan Williams. So, >> assuming PMEM nodes are cpuless nodes sounds reasonable. >> >> However, cpuless nodes might be not PMEM nodes. But, actually, memory >> promotion/demotion doesn't care what kind of memory will be the target nodes, >> it could be DRAM, PMEM or something else, as long as they are the second tier >> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds >> pointless to do such demotion. >> >> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in >> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and >> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes). >> Typically, memory allocation would happen on such nodes by default unless >> cpuless nodes are specified explicitly, cpuless nodes would be just fallback >> nodes, so they are also as known as "primary" nodes in this patchset. With >> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to >> demonstrate the promotion/demotion approach for now, and this looks more >> architecture-independent. But it may be better to construct such node mask >> by reading hardware information (i.e. HMAT), particularly for more complex >> memory hierarchy. > I still believe you are overcomplicating this without a strong reason. > Why cannot we start simple and build from there? In other words I do not > think we really need anything like N_CPU_MEM at all. In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes. They would be the preferred demotion target. Of course, we could rely on firmware to just demote to the next best node, but it may be a "preferred" node, if so I don't see too much benefit achieved by demotion. Am I missing anything? > > I would expect that the very first attempt wouldn't do much more than > migrate to-be-reclaimed pages (without an explicit binding) with a Do you mean respect mempolicy or cpuset when doing demotion? I was wondering this, but I didn't do so in the current implementation since it may need walk the rmap to retrieve the mempolicy in the reclaim path. Is there any easier way to do so? > very optimistic allocation strategy (effectivelly GFP_NOWAIT) and if Yes, this has been done in this patchset. 
> that fails then simply give up. All that hooked essentially to the > node_reclaim path with a new node_reclaim mode so that the behavior > would be opt-in. This should be the most simplistic way to start AFAICS > and something people can play with without risking regressions. I agree it is safer to start with node reclaim. Once it is stable enough and we are confident enough, it can be extended to global reclaim. > > Once we see how that behaves in the real world and what kind of corner > case user are able to trigger then we can build on top. E.g. do we want > to migrate from cpuless nodes as well? I am not really sure TBH. On one > hand why not if other nodes are free to hold that memory? Swap out is > more expensive. Anyway this is kind of decision which would rather be > shaped on an existing experience rather than ad-hoc decistion right now. I do agree. > > I would also not touch the numa balancing logic at this stage and rather > see how the current implementation behaves. I agree we would prefer to start from something simpler and see how it works. The "twice access" optimization is aimed at reducing the PMEM bandwidth burden, since PMEM bandwidth is a scarce resource. I did compare "twice access" to "no twice access"; it does save a lot of bandwidth for some once-off access patterns. For example, when running a stress test with mmtest's usemem-stress-numa-compact, the kernel would promote ~600,000 pages with "twice access" in 4 hours, but it would promote ~80,000,000 pages without it. Thanks, Yang
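[Editor's note: as a rough illustration of the "twice access" idea described above (promote only on the second NUMA hinting fault so one-off accesses do not burn PMEM bandwidth), the decision could look like the sketch below. page_seen_once()/set_page_seen_once() are hypothetical stand-ins for whatever per-page state the series actually keeps; this is not the patchset's code.]

```c
/* Sketch only: called from the NUMA hinting fault path for a page that
 * currently sits on a slow (PMEM) node. */
static bool should_promote(struct page *page)
{
	/* First hinting fault: just remember that the page was touched. */
	if (!page_seen_once(page)) {
		set_page_seen_once(page);
		return false;
	}

	/* Second fault: warm enough to be worth the migration bandwidth. */
	return true;
}
```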
On Mon 15-04-19 17:09:07, Yang Shi wrote: > > > On 4/12/19 1:47 AM, Michal Hocko wrote: > > On Thu 11-04-19 11:56:50, Yang Shi wrote: > > [...] > > > Design > > > ====== > > > Basically, the approach is aimed to spread data from DRAM (closest to local > > > CPU) down further to PMEM and disk (typically assume the lower tier storage > > > is slower, larger and cheaper than the upper tier) by their hotness. The > > > patchset tries to achieve this goal by doing memory promotion/demotion via > > > NUMA balancing and memory reclaim as what the below diagram shows: > > > > > > DRAM <--> PMEM <--> Disk > > > ^ ^ > > > |-------------------| > > > swap > > > > > > When DRAM has memory pressure, demote pages to PMEM via page reclaim path. > > > Then NUMA balancing will promote pages to DRAM as long as the page is referenced > > > again. The memory pressure on PMEM node would push the inactive pages of PMEM > > > to disk via swap. > > > > > > The promotion/demotion happens only between "primary" nodes (the nodes have > > > both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes > > > and promotion from DRAM to PMEM and demotion from PMEM to DRAM. > > > > > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory range > > > that has differentiated performance from the conventional memory pool, or > > > differentiated performance for a specific initiator, per Dan Williams. So, > > > assuming PMEM nodes are cpuless nodes sounds reasonable. > > > > > > However, cpuless nodes might be not PMEM nodes. But, actually, memory > > > promotion/demotion doesn't care what kind of memory will be the target nodes, > > > it could be DRAM, PMEM or something else, as long as they are the second tier > > > memory (slower, larger and cheaper than regular DRAM), otherwise it sounds > > > pointless to do such demotion. > > > > > > Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in > > > order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and > > > memoryless nodes (some architectures, i.e. Power, may have memoryless nodes). > > > Typically, memory allocation would happen on such nodes by default unless > > > cpuless nodes are specified explicitly, cpuless nodes would be just fallback > > > nodes, so they are also as known as "primary" nodes in this patchset. With > > > two tier memory system (i.e. DRAM + PMEM), this sounds good enough to > > > demonstrate the promotion/demotion approach for now, and this looks more > > > architecture-independent. But it may be better to construct such node mask > > > by reading hardware information (i.e. HMAT), particularly for more complex > > > memory hierarchy. > > I still believe you are overcomplicating this without a strong reason. > > Why cannot we start simple and build from there? In other words I do not > > think we really need anything like N_CPU_MEM at all. > > In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes. > They would be the preferred demotion target. Of course, we could rely on > firmware to just demote to the next best node, but it may be a "preferred" > node, if so I don't see too much benefit achieved by demotion. Am I missing > anything? Why cannot we simply demote in the proximity order? Why do you make cpuless nodes so special? If other close nodes are vacant then just use them. 
> > I would expect that the very first attempt wouldn't do much more than > > migrate to-be-reclaimed pages (without an explicit binding) with a > > Do you mean respect mempolicy or cpuset when doing demotion? I was wondering > this, but I didn't do so in the current implementation since it may need > walk the rmap to retrieve the mempolicy in the reclaim path. Is there any > easier way to do so? You definitely have to follow policy. You cannot demote to a node which is outside of the cpuset/mempolicy because you are breaking the contract expected by userspace. That implies doing an rmap walk. > > I would also not touch the numa balancing logic at this stage and rather > see how the current implementation behaves. > > I agree we would prefer start from something simpler and see how it works. > > The "twice access" optimization is aimed to reduce the PMEM bandwidth burden > since the bandwidth of PMEM is scarce resource. I did compare "twice access" > to "no twice access", it does save a lot bandwidth for some once-off access > pattern. For example, when running stress test with mmtest's > usemem-stress-numa-compact. The kernel would promote ~600,000 pages with > "twice access" in 4 hours, but it would promote ~80,000,000 pages without > "twice access". I presume this is a result of a synthetic workload, right? Or do you have any numbers for a real-life use case?
On 4/16/19 12:47 AM, Michal Hocko wrote: > You definitely have to follow policy. You cannot demote to a node which > is outside of the cpuset/mempolicy because you are breaking contract > expected by the userspace. That implies doing a rmap walk. What *is* the contract with userspace, anyway? :) Obviously, the preferred policy doesn't have any strict contract. The strict binding has a bit more of a contract, but it doesn't prevent swapping. Strict binding also doesn't keep another app from moving the memory. We have a reasonable argument that demotion is better than swapping. So, we could say that even if a VMA has a strict NUMA policy, demoting pages mapped there still beats swapping them or tossing the page cache. It's doing them a favor to demote them. Or, maybe we just need a swap hybrid where demotion moves the page but keeps it unmapped and in the swap cache. That way an access gets a fault and we can promote the page back to where it should be. That would be faster than I/O-based swap for sure. Anyway, I agree that the kernel probably shouldn't be moving pages around willy-nilly with no consideration for memory policies, but users might give us some wiggle room too.
On Tue 16-04-19 07:30:20, Dave Hansen wrote: > On 4/16/19 12:47 AM, Michal Hocko wrote: > > You definitely have to follow policy. You cannot demote to a node which > > is outside of the cpuset/mempolicy because you are breaking contract > > expected by the userspace. That implies doing a rmap walk. > > What *is* the contract with userspace, anyway? :) > > Obviously, the preferred policy doesn't have any strict contract. > > The strict binding has a bit more of a contract, but it doesn't prevent > swapping. Yes, but swapping is not a problem for using binding for memory partitioning. > Strict binding also doesn't keep another app from moving the > memory. I would consider that a bug.
On 16 Apr 2019, at 10:30, Dave Hansen wrote: > On 4/16/19 12:47 AM, Michal Hocko wrote: >> You definitely have to follow policy. You cannot demote to a node which >> is outside of the cpuset/mempolicy because you are breaking contract >> expected by the userspace. That implies doing a rmap walk. > > What *is* the contract with userspace, anyway? :) > > Obviously, the preferred policy doesn't have any strict contract. > > The strict binding has a bit more of a contract, but it doesn't prevent > swapping. Strict binding also doesn't keep another app from moving the > memory. > > We have a reasonable argument that demotion is better than swapping. > So, we could say that even if a VMA has a strict NUMA policy, demoting > pages mapped there pages still beats swapping them or tossing the page > cache. It's doing them a favor to demote them. I just wonder whether page migration is always better than swapping, since SSD write throughput keeps improving but page migration throughput is still low. For example, my machine has a SSD with 2GB/s writing throughput but the throughput of 4KB page migration is less than 1GB/s, why do we want to use page migration for demotion instead of swapping? -- Best Regards, Yan Zi
On 4/16/19 7:39 AM, Michal Hocko wrote: >> Strict binding also doesn't keep another app from moving the >> memory. > I would consider that a bug. A bug where, though? Certainly not in the kernel. I'm just saying that if an app has an assumption that strict binding means that its memory can *NEVER* move, then that assumption is simply wrong. It's not the guarantee that we provide. In fact, we provide APIs (migrate_pages() at least) that explicitly and intentionally break that guarantee. All that our NUMA APIs provide (even the strict ones) is a promise about where newly-allocated pages will be allocated.
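[Editor's note: the point above is easy to demonstrate from userspace. The sketch below (an assumption-laden example, not from the thread: it assumes at least two online NUMA nodes, an arbitrary 64MB buffer, and is built with `gcc demo.c -lnuma`; error handling is trimmed) strict-binds a region to node 0, faults it in, and then moves the task's pages to node 1 with migrate_pages(2) anyway. A different process with CAP_SYS_NICE could do the same to this pid.]

```c
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 20;			/* arbitrary 64MB test region */
	unsigned long node0 = 1UL << 0, node1 = 1UL << 1;
	unsigned long maxnode = sizeof(node0) * 8;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Strict binding: new faults in [buf, buf+len) must come from node 0. */
	mbind(buf, len, MPOL_BIND, &node0, maxnode, 0);
	memset(buf, 1, len);				/* fault the pages in on node 0 */

	/* ...yet nothing stops a later migrate_pages() from moving them. */
	if (migrate_pages(getpid(), maxnode, &node0, &node1) < 0)
		perror("migrate_pages");

	return 0;
}
```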
On 4/16/19 8:33 AM, Zi Yan wrote: >> We have a reasonable argument that demotion is better than >> swapping. So, we could say that even if a VMA has a strict NUMA >> policy, demoting pages mapped there pages still beats swapping >> them or tossing the page cache. It's doing them a favor to >> demote them. > I just wonder whether page migration is always better than > swapping, since SSD write throughput keeps improving but page > migration throughput is still low. For example, my machine has a > SSD with 2GB/s writing throughput but the throughput of 4KB page > migration is less than 1GB/s, why do we want to use page migration > for demotion instead of swapping? Just because we observe that page migration apparently has lower throughput today doesn't mean that we should consider it a dead end.
On 16 Apr 2019, at 11:55, Dave Hansen wrote: > On 4/16/19 8:33 AM, Zi Yan wrote: >>> We have a reasonable argument that demotion is better than >>> swapping. So, we could say that even if a VMA has a strict NUMA >>> policy, demoting pages mapped there pages still beats swapping >>> them or tossing the page cache. It's doing them a favor to >>> demote them. >> I just wonder whether page migration is always better than >> swapping, since SSD write throughput keeps improving but page >> migration throughput is still low. For example, my machine has a >> SSD with 2GB/s writing throughput but the throughput of 4KB page >> migration is less than 1GB/s, why do we want to use page migration >> for demotion instead of swapping? > > Just because we observe that page migration apparently has lower > throughput today doesn't mean that we should consider it a dead end. I definitely agree. I also want to make the point that we might want to improve page migration as well to show that demotion via page migration will work. Since most of proposed demotion approaches use the same page replacement policy as swapping, if we do not have high-throughput page migration, we might draw false conclusions that demotion is no better than swapping but demotion can actually do much better. :) -- Best Regards, Yan Zi
On Tue 16-04-19 08:46:56, Dave Hansen wrote: > On 4/16/19 7:39 AM, Michal Hocko wrote: > >> Strict binding also doesn't keep another app from moving the > >> memory. > > I would consider that a bug. > > A bug where, though? Certainly not in the kernel. The kernel should refrain from moving explicitly bound memory willy-nilly. I certainly agree that there are corner cases. E.g. memory hotplug. We do break CPU affinity for CPU offline as well. So this is something users should expect. But the kernel shouldn't move explicitly bound pages to a different node implicitly. I am not sure whether we even do that during compaction; if we do, then I would consider _this_ to be a bug. And NUMA rebalancing under memory pressure falls into the same category IMO.
On 4/16/19 12:47 AM, Michal Hocko wrote: > On Mon 15-04-19 17:09:07, Yang Shi wrote: >> >> On 4/12/19 1:47 AM, Michal Hocko wrote: >>> On Thu 11-04-19 11:56:50, Yang Shi wrote: >>> [...] >>>> Design >>>> ====== >>>> Basically, the approach is aimed to spread data from DRAM (closest to local >>>> CPU) down further to PMEM and disk (typically assume the lower tier storage >>>> is slower, larger and cheaper than the upper tier) by their hotness. The >>>> patchset tries to achieve this goal by doing memory promotion/demotion via >>>> NUMA balancing and memory reclaim as what the below diagram shows: >>>> >>>> DRAM <--> PMEM <--> Disk >>>> ^ ^ >>>> |-------------------| >>>> swap >>>> >>>> When DRAM has memory pressure, demote pages to PMEM via page reclaim path. >>>> Then NUMA balancing will promote pages to DRAM as long as the page is referenced >>>> again. The memory pressure on PMEM node would push the inactive pages of PMEM >>>> to disk via swap. >>>> >>>> The promotion/demotion happens only between "primary" nodes (the nodes have >>>> both CPU and memory) and PMEM nodes. No promotion/demotion between PMEM nodes >>>> and promotion from DRAM to PMEM and demotion from PMEM to DRAM. >>>> >>>> The HMAT is effectively going to enforce "cpu-less" nodes for any memory range >>>> that has differentiated performance from the conventional memory pool, or >>>> differentiated performance for a specific initiator, per Dan Williams. So, >>>> assuming PMEM nodes are cpuless nodes sounds reasonable. >>>> >>>> However, cpuless nodes might be not PMEM nodes. But, actually, memory >>>> promotion/demotion doesn't care what kind of memory will be the target nodes, >>>> it could be DRAM, PMEM or something else, as long as they are the second tier >>>> memory (slower, larger and cheaper than regular DRAM), otherwise it sounds >>>> pointless to do such demotion. >>>> >>>> Defined "N_CPU_MEM" nodemask for the nodes which have both CPU and memory in >>>> order to distinguish with cpuless nodes (memory only, i.e. PMEM nodes) and >>>> memoryless nodes (some architectures, i.e. Power, may have memoryless nodes). >>>> Typically, memory allocation would happen on such nodes by default unless >>>> cpuless nodes are specified explicitly, cpuless nodes would be just fallback >>>> nodes, so they are also as known as "primary" nodes in this patchset. With >>>> two tier memory system (i.e. DRAM + PMEM), this sounds good enough to >>>> demonstrate the promotion/demotion approach for now, and this looks more >>>> architecture-independent. But it may be better to construct such node mask >>>> by reading hardware information (i.e. HMAT), particularly for more complex >>>> memory hierarchy. >>> I still believe you are overcomplicating this without a strong reason. >>> Why cannot we start simple and build from there? In other words I do not >>> think we really need anything like N_CPU_MEM at all. >> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes. >> They would be the preferred demotion target. Of course, we could rely on >> firmware to just demote to the next best node, but it may be a "preferred" >> node, if so I don't see too much benefit achieved by demotion. Am I missing >> anything? > Why cannot we simply demote in the proximity order? Why do you make > cpuless nodes so special? If other close nodes are vacant then just use > them. We could. But, this raises another question, would we prefer to just demote to the next fallback node (just try once), if it is contended, then just swap (i.e. 
DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes in the fallback order to find the first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)?

|------|    |------|                        |------|     |------|
|PMEM0| --- |DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1|
|------|    |------|                        |------|     |------|

The first one sounds simpler, and the current implementation does so, and this needs to find out the closest PMEM node by recognizing cpuless nodes. If we prefer to go with the second option, it is definitely unnecessary to specialize any node.

> >>> I would expect that the very first attempt wouldn't do much more than >>> migrate to-be-reclaimed pages (without an explicit binding) with a >> Do you mean respect mempolicy or cpuset when doing demotion? I was wondering >> this, but I didn't do so in the current implementation since it may need >> walk the rmap to retrieve the mempolicy in the reclaim path. Is there any >> easier way to do so? > You definitely have to follow policy. You cannot demote to a node which > is outside of the cpuset/mempolicy because you are breaking contract > expected by the userspace. That implies doing a rmap walk. OK, however, this may prevent demoting unmapped page cache since there is no way to find those pages' policy. And, we have to think about what we should do when the demotion target conflicts with the mempolicy. The easiest way is to just skip those conflicting pages in demotion. Or we may have to do the demotion one page at a time instead of migrating a list of pages.

> >>> I would also not touch the numa balancing logic at this stage and rather >>> see how the current implementation behaves. >> I agree we would prefer start from something simpler and see how it works. >> >> The "twice access" optimization is aimed to reduce the PMEM bandwidth burden >> since the bandwidth of PMEM is scarce resource. I did compare "twice access" >> to "no twice access", it does save a lot bandwidth for some once-off access >> pattern. For example, when running stress test with mmtest's >> usemem-stress-numa-compact. The kernel would promote ~600,000 pages with >> "twice access" in 4 hours, but it would promote ~80,000,000 pages without >> "twice access". > I pressume this is a result of a synthetic workload, right? Or do you > have any numbers for a real life usecase? The test just uses usemem.
On 4/16/19 12:19 PM, Yang Shi wrote: > would we prefer to try all the nodes in the fallback order to find the > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? Once a page went to DRAM1, how would we tell that it originated in DRAM0 and is following the DRAM0 path rather than the DRAM1 path?

Memory on DRAM0's path would be:

    DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap

Memory on DRAM1's path would be:

    DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap

Keith Busch had a set of patches to let you specify the demotion order via sysfs for fun. The rules we came up with were:

1. Pages keep no history of where they have been
2. Each node can only demote to one other node
3. The demotion path can not have cycles

That ensures that we *can't* follow the paths you described above, if we follow those rules...
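[Editor's note: a standalone toy (plain C, not taken from Keith's patches; the node numbering and example targets are invented) makes rules 2 and 3 concrete: the demotion order is just one next-node per node, and a configuration is valid as long as following it from any node never loops.]

```c
#include <stdio.h>

#define NR_NODES 4

/* Example: DRAM0(0)->PMEM0(2), DRAM1(1)->PMEM1(3); PMEM nodes are terminal. */
static const int demotion_target[NR_NODES] = { 2, 3, -1, -1 };

/* Returns 1 if following the per-node targets from 'start' never loops. */
static int path_is_acyclic(int start)
{
	int hops = 0, nid = start;

	while (nid != -1) {
		nid = demotion_target[nid];
		if (++hops > NR_NODES)	/* more hops than nodes => cycle */
			return 0;
	}
	return 1;
}

int main(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d: %s\n", nid,
		       path_is_acyclic(nid) ? "acyclic" : "cycle!");
	return 0;
}
```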
On 4/16/19 2:22 PM, Dave Hansen wrote: > On 4/16/19 12:19 PM, Yang Shi wrote: >> would we prefer to try all the nodes in the fallback order to find the >> first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? > Once a page went to DRAM1, how would we tell that it originated in DRAM0 > and is following the DRAM0 path rather than the DRAM1 path? > > Memory on DRAM0's path would be: > > DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap > > Memory on DRAM1's path would be: > > DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap > > Keith Busch had a set of patches to let you specify the demotion order > via sysfs for fun. The rules we came up with were: > 1. Pages keep no history of where they have been > 2. Each node can only demote to one other node Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM might be ok? > 3. The demotion path can not have cycles I agree with these rules; actually, my implementation implies a similar rule. I tried to understand what Michal means. My current implementation expects demotion to happen from the initiator to the target in the same local pair. But Michal may expect to be able to demote to a remote initiator or target if the local target is contended. IMHO, demotion in the local pair makes things much simpler. > > That ensures that we *can't* follow the paths you described above, if we > follow those rules... Yes, it might create a cycle.
On 4/16/19 2:59 PM, Yang Shi wrote: > On 4/16/19 2:22 PM, Dave Hansen wrote: >> Keith Busch had a set of patches to let you specify the demotion order >> via sysfs for fun. The rules we came up with were: >> 1. Pages keep no history of where they have been >> 2. Each node can only demote to one other node > > Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM > might be ok? In Keith's code, I don't think we differentiated. We let any node demote to any other node you want, as long as it follows the cycle rule.
On 4/16/19 4:04 PM, Dave Hansen wrote: > On 4/16/19 2:59 PM, Yang Shi wrote: >> On 4/16/19 2:22 PM, Dave Hansen wrote: >>> Keith Busch had a set of patches to let you specify the demotion order >>> via sysfs for fun. The rules we came up with were: >>> 1. Pages keep no history of where they have been >>> 2. Each node can only demote to one other node >> Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM >> might be ok? > In Keith's code, I don't think we differentiated. We let any node > demote to any other node you want, as long as it follows the cycle rule. I recall Keith's code lets userspace define the target node. Anyway, we may need to add one rule: no migrate-on-reclaim from PMEM nodes. Demoting from PMEM to DRAM sounds pointless.
>>>> Why cannot we start simple and build from there? In other words I >>>> do not >>>> think we really need anything like N_CPU_MEM at all. >>> In this patchset N_CPU_MEM is used to tell us what nodes are cpuless >>> nodes. >>> They would be the preferred demotion target. Of course, we could >>> rely on >>> firmware to just demote to the next best node, but it may be a >>> "preferred" >>> node, if so I don't see too much benefit achieved by demotion. Am I >>> missing >>> anything? >> Why cannot we simply demote in the proximity order? Why do you make >> cpuless nodes so special? If other close nodes are vacant then just use >> them. And, I suppose we agree to *not* migrate from a PMEM node (cpuless node) to any other node on the reclaim path, right? If so, we need to know whether the current node is a DRAM node or a PMEM node: if it is a DRAM node, do demotion; if it is a PMEM node, do swap. So, N_CPU_MEM is used to tell us whether the current node is a DRAM node or not. > We could. But, this raises another question, would we prefer to just > demote to the next fallback node (just try once), if it is contended, > then just swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to > try all the nodes in the fallback order to find the first less > contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? > > > |------| |------| |------| |------| > |PMEM0|---|DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1| > |------| |------| |------| |------| > > The first one sounds simpler, and the current implementation does so > and this needs find out the closest PMEM node by recognizing cpuless > node. > > If we prefer go with the second option, it is definitely unnecessary > to specialize any node.
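[Editor's note: the distinction asked for above can also be expressed with node state masks that already exist. N_MEMORY and N_CPU are real node states; the helper names below are hypothetical and this is only a sketch of the check, not the patchset's N_CPU_MEM definition.]

```c
#include <linux/nodemask.h>

/* "Primary" node in this thread's terminology: has both CPUs and memory. */
static inline bool is_primary_node(int nid)
{
	return node_state(nid, N_MEMORY) && node_state(nid, N_CPU);
}

/* A PMEM-like node here is simply a memory node without any CPUs. */
static inline bool is_cpuless_memory_node(int nid)
{
	return node_state(nid, N_MEMORY) && !node_state(nid, N_CPU);
}
```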
On Tue 16-04-19 12:19:21, Yang Shi wrote: > > > On 4/16/19 12:47 AM, Michal Hocko wrote: [...] > > Why cannot we simply demote in the proximity order? Why do you make > > cpuless nodes so special? If other close nodes are vacant then just use > > them. > > We could. But, this raises another question, would we prefer to just demote > to the next fallback node (just try once), if it is contended, then just > swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes > in the fallback order to find the first less contended one (i.e. DRAM0 -> > PMEM0 -> DRAM1 -> PMEM1 -> Swap)? I would go with the latter. Why? Because it is more natural. That is the natural allocation path, so I do not see why this shouldn't be the natural demotion path. > > |------| |------| |------| |------| > |PMEM0|---|DRAM0| --- CPU0 --- CPU1 --- |DRAM1| --- |PMEM1| > |------| |------| |------| |------| > > The first one sounds simpler, and the current implementation does so and > this needs find out the closest PMEM node by recognizing cpuless node. Unless you are specifying an explicit nodemask, the allocator will do the allocation fallback for the migration target for you. > If we prefer go with the second option, it is definitely unnecessary to > specialize any node. > > > > > I would expect that the very first attempt wouldn't do much more than > > > > migrate to-be-reclaimed pages (without an explicit binding) with a > > > Do you mean respect mempolicy or cpuset when doing demotion? I was wondering > > > this, but I didn't do so in the current implementation since it may need > > > walk the rmap to retrieve the mempolicy in the reclaim path. Is there any > > > easier way to do so? > > You definitely have to follow policy. You cannot demote to a node which > > is outside of the cpuset/mempolicy because you are breaking contract > > expected by the userspace. That implies doing a rmap walk. > > OK, however, this may prevent from demoting unmapped page cache since there > is no way to find those pages' policy. I do not really expect that hard numa binding for the page cache is a usecase we really have to lose sleep over for now. > And, we have to think about what we should do when the demotion target has > conflict with the mempolicy. Simply skip it. > The easiest way is to just skip those conflict > pages in demotion. Or we may have to do the demotion one page by one page > instead of migrating a list of pages. Yes, one page at a time sounds reasonable to me. This is how we do reclaim anyway.
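[Editor's note: the "let the allocator do the fallback" point can be shown by contrast with the __GFP_THISNODE variant sketched earlier in this thread. Without __GFP_THISNODE and without an explicit nodemask, alloc_pages_node() walks the preferred node's zonelist in proximity order, which is the DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 order being discussed. The callback name is again hypothetical, not the patchset's code.]

```c
static struct page *alloc_demote_target(struct page *page, unsigned long private)
{
	int preferred_nid = (int)private;	/* the next node in proximity order */

	/*
	 * No __GFP_THISNODE and no nodemask: the allocation starts at
	 * preferred_nid and the zonelist falls back to the next-closest
	 * vacant node on its own.  GFP_NOWAIT keeps the attempt
	 * opportunistic so we never reclaim one node just to demote into it.
	 */
	return alloc_pages_node(preferred_nid, GFP_NOWAIT | __GFP_NOWARN, 0);
}
```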
On Tue 16-04-19 14:22:33, Dave Hansen wrote: > On 4/16/19 12:19 PM, Yang Shi wrote: > > would we prefer to try all the nodes in the fallback order to find the > > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? > > Once a page went to DRAM1, how would we tell that it originated in DRAM0 > and is following the DRAM0 path rather than the DRAM1 path? > > Memory on DRAM0's path would be: > > DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap > > Memory on DRAM1's path would be: > > DRAM1 -> PMEM1 -> DRAM0 -> PMEM0 -> Swap > > Keith Busch had a set of patches to let you specify the demotion order > via sysfs for fun. The rules we came up with were: I am not a fan of any sysfs "fun" > 1. Pages keep no history of where they have been makes sense > 2. Each node can only demote to one other node Not really, see my other email. I do not really see any strong reason not to use the full zonelist to demote to. > 3. The demotion path can not have cycles Yes. This could be achieved by GFP_NOWAIT opportunistic allocation for the migration target. That should prevent loops or artificial node exhaustion quite naturally AFAICS. Maybe we will need some tricks to raise the watermark but I am not convinced something like that is really necessary.
On Tue, Apr 16, 2019 at 04:17:44PM -0700, Yang Shi wrote: > On 4/16/19 4:04 PM, Dave Hansen wrote: > > On 4/16/19 2:59 PM, Yang Shi wrote: > > > On 4/16/19 2:22 PM, Dave Hansen wrote: > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > via sysfs for fun. The rules we came up with were: > > > > 1. Pages keep no history of where they have been > > > > 2. Each node can only demote to one other node > > > Does this mean any remote node? Or just DRAM to PMEM, but remote PMEM > > > might be ok? > > In Keith's code, I don't think we differentiated. We let any node > > demote to any other node you want, as long as it follows the cycle rule. > > I recall Keith's code let the userspace define the target node. Right, you have to opt in in my original proposal, since it may be a bit presumptuous of the kernel to decide how a node's memory is going to be used. User applications have other intentions for it. It wouldn't be too difficult to have HMAT create a reasonable initial migration graph too, and that could also be an opt-in user choice. > Anyway, we may need add one rule: not migrate-on-reclaim from PMEM > node. Demoting from PMEM to DRAM sounds pointless. I really don't think we should be making such hard rules on PMEM. It makes more sense to base migration rules on performance and locality than on a persistence attribute.
On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > Keith Busch had a set of patches to let you specify the demotion order > > via sysfs for fun. The rules we came up with were: > > I am not a fan of any sysfs "fun" I'm hung up on the user facing interface, but there should be some way a user decides if a memory node is or is not a migrate target, right?
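[Editor's note: a user-facing knob of the kind being debated here could be as small as a per-node sysfs attribute. The sketch below is hypothetical throughout (the attribute name, the demotion_target[] array, and the path are invented, and Keith's actual patches may look quite different); it only shows the general shape such an opt-in interface could take.]

```c
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/nodemask.h>

/* Hypothetical per-node setting; NUMA_NO_NODE means "do not demote from here". */
static int demotion_target[MAX_NUMNODES];

static ssize_t demotion_target_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", demotion_target[dev->id]);
}

static ssize_t demotion_target_store(struct device *dev,
				     struct device_attribute *attr,
				     const char *buf, size_t count)
{
	int nid;

	if (kstrtoint(buf, 0, &nid))
		return -EINVAL;
	if (nid != NUMA_NO_NODE && !node_online(nid))
		return -EINVAL;
	/* Rejecting cycles (rule 3 above) would also belong here. */
	demotion_target[dev->id] = nid;
	return count;
}
static DEVICE_ATTR_RW(demotion_target);
/* would be registered under /sys/devices/system/node/nodeN/ (hypothetical path) */
```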
On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > via sysfs for fun. The rules we came up with were: > > > > > > I am not a fan of any sysfs "fun" > > > > I'm hung up on the user facing interface, but there should be some way a > > user decides if a memory node is or is not a migrate target, right? > > Why? Or to put it differently, why do we have to start with a user > interface at this stage when we actually barely have any real usecases > out there? The use case is an alternative to swap, right? The user has to decide which storage is the swap target, so operating in the same spirit.
On Wed 17-04-19 09:23:46, Keith Busch wrote: > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > Keith Busch had a set of patches to let you specify the demotion order > > > via sysfs for fun. The rules we came up with were: > > > > I am not a fan of any sysfs "fun" > > I'm hung up on the user facing interface, but there should be some way a > user decides if a memory node is or is not a migrate target, right? Why? Or to put it differently, why do we have to start with a user interface at this stage when we actually barely have any real usecases out there?
On Wed 17-04-19 09:37:39, Keith Busch wrote: > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > > via sysfs for fun. The rules we came up with were: > > > > > > > > I am not a fan of any sysfs "fun" > > > > > > I'm hung up on the user facing interface, but there should be some way a > > > user decides if a memory node is or is not a migrate target, right? > > > > Why? Or to put it differently, why do we have to start with a user > > interface at this stage when we actually barely have any real usecases > > out there? > > The use case is an alternative to swap, right? The user has to decide > which storage is the swap target, so operating in the same spirit. I do not follow. If you use rebalancing you can still deplete the memory and end up in a swap storage. If you want to reclaim/swap rather than rebalance then you do not enable rebalancing (by node_reclaim or similar mechanism).
On 4/17/19 2:23 AM, Michal Hocko wrote: >> 3. The demotion path can not have cycles > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > the migration target. That should prevent from loops or artificial nodes > exhausting quite naturaly AFAICS. Maybe we will need some tricks to > raise the watermark but I am not convinced something like that is really > necessary. I don't think GFP_NOWAIT alone is good enough. Let's say we have a system full of clean page cache and only two nodes: 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes. Each kswapd will be migrating pages to the *other* node since each is in the other's fallback path. I think what you're saying is that, eventually, the kswapds will see allocation failures and stop migrating, providing hysteresis. This is probably true. But, I'm more concerned about that window where the kswapds are throwing pages at each other because they're effectively just wasting resources in this window. I guess we should figure out how large this window is and how fast (or if) the dampening occurs in practice.
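[Editor's note: the scenario above is easy to play with in a toy model before touching the kernel. The standalone program below is only a caricature of kswapd and the GFP_NOWAIT target allocation, with arbitrary constants and an invented partial-refill rule standing in for sustained allocation pressure; it simply counts how many pages bounce to the peer node versus being reclaimed.]

```c
#include <stdio.h>

#define CAP	1000	/* pages per node */
#define BATCH	32	/* pages each kswapd pass wants to free */
#define PASSES	100

static long used[2] = { CAP, CAP };	/* both nodes start full of clean cache */

int main(void)
{
	long migrated = 0, reclaimed = 0;

	for (int pass = 0; pass < PASSES; pass++) {
		for (int node = 0; node < 2; node++) {
			int other = !node;

			for (int i = 0; i < BATCH && used[node] > 0; i++) {
				if (used[other] < CAP) {	/* "GFP_NOWAIT" succeeds */
					used[other]++;
					migrated++;
				} else {			/* fall back to reclaim */
					reclaimed++;
				}
				used[node]--;
			}
			/* new local allocations partially refill the node,
			 * keeping it under pressure (arbitrary assumption) */
			used[node] += BATCH / 2;
			if (used[node] > CAP)
				used[node] = CAP;
		}
	}
	printf("migrated %ld, reclaimed %ld (bounced pages do no useful work)\n",
	       migrated, reclaimed);
	return 0;
}
```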
On 4/17/19 9:39 AM, Michal Hocko wrote: > On Wed 17-04-19 09:37:39, Keith Busch wrote: >> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: >>> On Wed 17-04-19 09:23:46, Keith Busch wrote: >>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: >>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote: >>>>>> Keith Busch had a set of patches to let you specify the demotion order >>>>>> via sysfs for fun. The rules we came up with were: >>>>> I am not a fan of any sysfs "fun" >>>> I'm hung up on the user facing interface, but there should be some way a >>>> user decides if a memory node is or is not a migrate target, right? >>> Why? Or to put it differently, why do we have to start with a user >>> interface at this stage when we actually barely have any real usecases >>> out there? >> The use case is an alternative to swap, right? The user has to decide >> which storage is the swap target, so operating in the same spirit. > I do not follow. If you use rebalancing you can still deplete the memory > and end up in a swap storage. If you want to reclaim/swap rather than > rebalance then you do not enable rebalancing (by node_reclaim or similar > mechanism). I'm a little bit confused. Do you mean just do *not* do reclaim/swap in rebalancing mode? If rebalancing is on, then node_reclaim just move the pages around nodes, then kswapd or direct reclaim would take care of swap? If so the node reclaim on PMEM node may rebalance the pages to DRAM node? Should this be allowed? I think both I and Keith was supposed to treat PMEM as a tier in the reclaim hierarchy. The reclaim should push inactive pages down to PMEM, then swap. So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined target node, I introduced N_CPU_MEM. >
On Wed, Apr 17, 2019 at 10:26:05AM -0700, Yang Shi wrote: > On 4/17/19 9:39 AM, Michal Hocko wrote: > > On Wed 17-04-19 09:37:39, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > > > > via sysfs for fun. The rules we came up with were: > > > > > > I am not a fan of any sysfs "fun" > > > > > I'm hung up on the user facing interface, but there should be some way a > > > > > user decides if a memory node is or is not a migrate target, right? > > > > Why? Or to put it differently, why do we have to start with a user > > > > interface at this stage when we actually barely have any real usecases > > > > out there? > > > The use case is an alternative to swap, right? The user has to decide > > > which storage is the swap target, so operating in the same spirit. > > I do not follow. If you use rebalancing you can still deplete the memory > > and end up in a swap storage. If you want to reclaim/swap rather than > > rebalance then you do not enable rebalancing (by node_reclaim or similar > > mechanism). > > I'm a little bit confused. Do you mean just do *not* do reclaim/swap in > rebalancing mode? If rebalancing is on, then node_reclaim just move the > pages around nodes, then kswapd or direct reclaim would take care of swap? > > If so the node reclaim on PMEM node may rebalance the pages to DRAM node? > Should this be allowed? > > I think both I and Keith was supposed to treat PMEM as a tier in the reclaim > hierarchy. The reclaim should push inactive pages down to PMEM, then swap. > So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined > target node, I introduced N_CPU_MEM. Yeah, I think Yang and I view "demotion" as a separate feature from numa rebalancing.
On Wed 17-04-19 10:26:05, Yang Shi wrote: > > > On 4/17/19 9:39 AM, Michal Hocko wrote: > > On Wed 17-04-19 09:37:39, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > > > > via sysfs for fun. The rules we came up with were: > > > > > > I am not a fan of any sysfs "fun" > > > > > I'm hung up on the user facing interface, but there should be some way a > > > > > user decides if a memory node is or is not a migrate target, right? > > > > Why? Or to put it differently, why do we have to start with a user > > > > interface at this stage when we actually barely have any real usecases > > > > out there? > > > The use case is an alternative to swap, right? The user has to decide > > > which storage is the swap target, so operating in the same spirit. > > I do not follow. If you use rebalancing you can still deplete the memory > > and end up in a swap storage. If you want to reclaim/swap rather than > > rebalance then you do not enable rebalancing (by node_reclaim or similar > > mechanism). > > I'm a little bit confused. Do you mean just do *not* do reclaim/swap in > rebalancing mode? If rebalancing is on, then node_reclaim just move the > pages around nodes, then kswapd or direct reclaim would take care of swap? Yes, that was the idea I wanted to get through. Sorry if that was not really clear. > If so the node reclaim on PMEM node may rebalance the pages to DRAM node? > Should this be allowed? Why it shouldn't? If there are other vacant Nodes to absorb that memory then why not use it? > I think both I and Keith was supposed to treat PMEM as a tier in the reclaim > hierarchy. The reclaim should push inactive pages down to PMEM, then swap. > So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined > target node, I introduced N_CPU_MEM. I understand that. And I am trying to figure out whether we really have to tream PMEM specially here. Why is it any better than a generic NUMA rebalancing code that could be used for many other usecases which are not PMEM specific. If you present PMEM as a regular memory then also use it as a normal memory.
On Wed 17-04-19 10:13:44, Dave Hansen wrote: > On 4/17/19 2:23 AM, Michal Hocko wrote: > >> 3. The demotion path can not have cycles > > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > > the migration target. That should prevent from loops or artificial nodes > > exhausting quite naturaly AFAICS. Maybe we will need some tricks to > > raise the watermark but I am not convinced something like that is really > > necessary. > > I don't think GFP_NOWAIT alone is good enough. > > Let's say we have a system full of clean page cache and only two nodes: > 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes. > Each kswapd will be migrating pages to the *other* node since each is in > the other's fallback path. I was thinking along the lines of node_reclaim-based migration. You are right that a parallel kswapd might reclaim enough to cause the ping pong and we might need to play some watermark tricks, but as you say below this is to be seen and a playground to explore. All I am saying is to try the most simplistic approach first without all the bells and whistles to see how this plays out with real workloads and build on top of that. We already do have a model - node_reclaim - which turned out to suck a lot because the reclaim was just too aggressive wrt. refault. Maybe migration will turn out to be much more feasible. And maybe I am completely wrong and we need a much more complex solution. > I think what you're saying is that, eventually, the kswapds will see > allocation failures and stop migrating, providing hysteresis. This is > probably true. > > But, I'm more concerned about that window where the kswapds are throwing > pages at each other because they're effectively just wasting resources > in this window. I guess we should figure our how large this window is > and how fast (or if) the dampening occurs in practice.
>> >>>> I would also not touch the numa balancing logic at this stage and >>>> rather >>>> see how the current implementation behaves. >>> I agree we would prefer start from something simpler and see how it >>> works. >>> >>> The "twice access" optimization is aimed to reduce the PMEM >>> bandwidth burden >>> since the bandwidth of PMEM is scarce resource. I did compare "twice >>> access" >>> to "no twice access", it does save a lot bandwidth for some once-off >>> access >>> pattern. For example, when running stress test with mmtest's >>> usemem-stress-numa-compact. The kernel would promote ~600,000 pages >>> with >>> "twice access" in 4 hours, but it would promote ~80,000,000 pages >>> without >>> "twice access". >> I pressume this is a result of a synthetic workload, right? Or do you >> have any numbers for a real life usecase? > > The test just uses usemem. I tried to run some more real-life-like use cases. The below shows the result of running mmtest's db-sysbench-mariadb-oltp-rw-medium test, which is a typical database workload, with and w/o the "twice access" optimization:

               w/       w/o
  promotion    32771    312250

We can see the kernel did ~10x more promotions w/o the "twice access" optimization.

I also tried kernel-devel and redis tests in mmtest, but they can't generate enough memory pressure, so I had to run the usemem test to generate memory pressure. However, this brought in huge noise, particularly for the w/o "twice access" case. But the mysql test should be able to demonstrate the improvement achieved by this optimization.

And, I'm wondering whether this optimization is also suitable for general NUMA balancing or not.
On Wed 17-04-19 13:43:44, Yang Shi wrote: [...] > And, I'm wondering whether this optimization is also suitable to general > NUMA balancing or not. If there are convincing numbers then this should be a preferable way to deal with it. Please note that the number of promotions is not the only metric to watch. The overall performance/access latency would be another one.
On 4/17/19 10:51 AM, Michal Hocko wrote: > On Wed 17-04-19 10:26:05, Yang Shi wrote: >> On 4/17/19 9:39 AM, Michal Hocko wrote: >>> On Wed 17-04-19 09:37:39, Keith Busch wrote: >>>> On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: >>>>> On Wed 17-04-19 09:23:46, Keith Busch wrote: >>>>>> On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: >>>>>>> On Tue 16-04-19 14:22:33, Dave Hansen wrote: >>>>>>>> Keith Busch had a set of patches to let you specify the demotion order >>>>>>>> via sysfs for fun. The rules we came up with were: >>>>>>> I am not a fan of any sysfs "fun" >>>>>> I'm hung up on the user facing interface, but there should be some way a >>>>>> user decides if a memory node is or is not a migrate target, right? >>>>> Why? Or to put it differently, why do we have to start with a user >>>>> interface at this stage when we actually barely have any real usecases >>>>> out there? >>>> The use case is an alternative to swap, right? The user has to decide >>>> which storage is the swap target, so operating in the same spirit. >>> I do not follow. If you use rebalancing you can still deplete the memory >>> and end up in a swap storage. If you want to reclaim/swap rather than >>> rebalance then you do not enable rebalancing (by node_reclaim or similar >>> mechanism). >> I'm a little bit confused. Do you mean just do *not* do reclaim/swap in >> rebalancing mode? If rebalancing is on, then node_reclaim just move the >> pages around nodes, then kswapd or direct reclaim would take care of swap? > Yes, that was the idea I wanted to get through. Sorry if that was not > really clear. > >> If so the node reclaim on PMEM node may rebalance the pages to DRAM node? >> Should this be allowed? > Why it shouldn't? If there are other vacant Nodes to absorb that memory > then why not use it? > >> I think both I and Keith was supposed to treat PMEM as a tier in the reclaim >> hierarchy. The reclaim should push inactive pages down to PMEM, then swap. >> So, PMEM is kind of a "terminal" node. So, he introduced sysfs defined >> target node, I introduced N_CPU_MEM. > I understand that. And I am trying to figure out whether we really have > to tream PMEM specially here. Why is it any better than a generic NUMA > rebalancing code that could be used for many other usecases which are > not PMEM specific. If you present PMEM as a regular memory then also use > it as a normal memory. This also makes some sense. We just look at PMEM from different point of view. Taking into account the performance disparity may outweigh treating it as a normal memory in this patchset. A ridiculous idea, may we have two modes? One for "rebalancing", the other for "demotion"?
On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: > On 4/17/19 2:23 AM, Michal Hocko wrote: > > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > > the migration target. That should prevent from loops or artificial nodes > > exhausting quite naturaly AFAICS. Maybe we will need some tricks to > > raise the watermark but I am not convinced something like that is really > > necessary. > > I don't think GFP_NOWAIT alone is good enough. > > Let's say we have a system full of clean page cache and only two nodes: > 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes. > Each kswapd will be migrating pages to the *other* node since each is in > the other's fallback path. > > I think what you're saying is that, eventually, the kswapds will see > allocation failures and stop migrating, providing hysteresis. This is > probably true. > > But, I'm more concerned about that window where the kswapds are throwing > pages at each other because they're effectively just wasting resources > in this window. I guess we should figure our how large this window is > and how fast (or if) the dampening occurs in practice. I'm still refining tests to help answer this and have some preliminary data. My test rig has CPU + memory Node 0, memory-only Node 1, and a fast swap device. The test has an application that strict-mbinds more memory than node 0's total to node 0 and forever writes random cachelines from per-CPU threads. I'm testing two memory pressure policies:

  Node 0 can migrate to Node 1, no cycles
  Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)

After the initial ramp-up time, the second policy is ~7-10% slower than no cycles. There doesn't appear to be a temporary window dealing with bouncing pages: it's just a slower overall steady state. Looks like when migration fails and falls back to swap, the newly freed pages occasionally get sniped by the other node, keeping the pressure up.
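[Editor's note: for anyone wanting to reproduce something like this rig, a rough userspace reconstruction of the described workload could look as follows. The buffer size, thread count, and PRNG are assumptions, not Keith's actual test; size BUF_BYTES above node 0's capacity on your machine and build with `gcc -O2 -pthread stress.c -lnuma`.]

```c
#include <numaif.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>

#define NR_THREADS	8
#define BUF_BYTES	(96UL << 30)	/* assumption: larger than node 0's memory */

static char *buf;

static void *writer(void *arg)
{
	uint64_t x = (uint64_t)(uintptr_t)arg | 1;

	for (;;) {
		/* xorshift PRNG picks a random cacheline to dirty, forever */
		x ^= x << 13; x ^= x >> 7; x ^= x << 17;
		buf[(x % BUF_BYTES) & ~63UL] = (char)x;
	}
	return NULL;
}

int main(void)
{
	unsigned long node0 = 1UL << 0;
	pthread_t tid[NR_THREADS];

	buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Strict bind to node 0: since the buffer exceeds the node, the
	 * writers keep node 0 under pressure, forcing demotion or swap
	 * depending on the policy under test. */
	mbind(buf, BUF_BYTES, MPOL_BIND, &node0, sizeof(node0) * 8, 0);

	for (long i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, writer, (void *)(i + 1));
	pthread_join(tid[0], NULL);	/* writers never exit */
	return 0;
}
```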
On 4/18/19 11:16 AM, Keith Busch wrote: > On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: >> On 4/17/19 2:23 AM, Michal Hocko wrote: >>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for >>> the migration target. That should prevent from loops or artificial nodes >>> exhausting quite naturaly AFAICS. Maybe we will need some tricks to >>> raise the watermark but I am not convinced something like that is really >>> necessary. >> I don't think GFP_NOWAIT alone is good enough. >> >> Let's say we have a system full of clean page cache and only two nodes: >> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes. >> Each kswapd will be migrating pages to the *other* node since each is in >> the other's fallback path. >> >> I think what you're saying is that, eventually, the kswapds will see >> allocation failures and stop migrating, providing hysteresis. This is >> probably true. >> >> But, I'm more concerned about that window where the kswapds are throwing >> pages at each other because they're effectively just wasting resources >> in this window. I guess we should figure our how large this window is >> and how fast (or if) the dampening occurs in practice. > I'm still refining tests to help answer this and have some preliminary > data. My test rig has CPU + memory Node 0, memory-only Node 1, and a > fast swap device. The test has an application strict mbind more than > the total memory to node 0, and forever writes random cachelines from > per-cpu threads. Thanks for the test. A follow-up question, how about the size for each node? Is node 1 bigger than node 0? Since PMEM typically has larger capacity, so I'm wondering whether the capacity may make things different or not. > I'm testing two memory pressure policies: > > Node 0 can migrate to Node 1, no cycles > Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles) > > After the initial ramp up time, the second policy is ~7-10% slower than > no cycles. There doesn't appear to be a temporary window dealing with > bouncing pages: it's just a slower overall steady state. Looks like when > migration fails and falls back to swap, the newly freed pages occasionaly > get sniped by the other node, keeping the pressure up.
On 18 Apr 2019, at 15:23, Yang Shi wrote: > On 4/18/19 11:16 AM, Keith Busch wrote: >> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: >>> On 4/17/19 2:23 AM, Michal Hocko wrote: >>>> yes. This could be achieved by GFP_NOWAIT opportunistic allocation for >>>> the migration target. That should prevent from loops or artificial nodes >>>> exhausting quite naturaly AFAICS. Maybe we will need some tricks to >>>> raise the watermark but I am not convinced something like that is really >>>> necessary. >>> I don't think GFP_NOWAIT alone is good enough. >>> >>> Let's say we have a system full of clean page cache and only two nodes: >>> 0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes. >>> Each kswapd will be migrating pages to the *other* node since each is in >>> the other's fallback path. >>> >>> I think what you're saying is that, eventually, the kswapds will see >>> allocation failures and stop migrating, providing hysteresis. This is >>> probably true. >>> >>> But, I'm more concerned about that window where the kswapds are throwing >>> pages at each other because they're effectively just wasting resources >>> in this window. I guess we should figure our how large this window is >>> and how fast (or if) the dampening occurs in practice. >> I'm still refining tests to help answer this and have some preliminary >> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a >> fast swap device. The test has an application strict mbind more than >> the total memory to node 0, and forever writes random cachelines from >> per-cpu threads. > > Thanks for the test. A follow-up question, how about the size for each node? Is node 1 bigger than node 0? Since PMEM typically has larger capacity, so I'm wondering whether the capacity may make things different or not. > >> I'm testing two memory pressure policies: >> >> Node 0 can migrate to Node 1, no cycles >> Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles) >> >> After the initial ramp up time, the second policy is ~7-10% slower than >> no cycles. There doesn't appear to be a temporary window dealing with >> bouncing pages: it's just a slower overall steady state. Looks like when >> migration fails and falls back to swap, the newly freed pages occasionaly >> get sniped by the other node, keeping the pressure up. In addition to these two policies, I am curious about how MPOL_PREFERRED to Node 0 performs. I just wonder how bad static page allocation does. -- Best Regards, Yan Zi
On Thu, Apr 18, 2019 at 11:02:27AM +0200, Michal Hocko wrote: >On Wed 17-04-19 13:43:44, Yang Shi wrote: >[...] >> And, I'm wondering whether this optimization is also suitable to general >> NUMA balancing or not. > >If there are convincing numbers then this should be a preferable way to >deal with it. Please note that the number of promotions is not the only >metric to watch. The overal performance/access latency would be another one. Good question. Shi and I aligned today. I also talked with Mel (but sorry, I must have missed some points due to my English listening). It becomes clear that

1) PMEM/DRAM page promotion/demotion is a hard problem to attack. There will and should be multiple approaches for open discussion before settling down. The criteria might be balanced complexity, overheads, performance, etc.

2) We need a lot more data to lay a solid foundation for effective discussions. Testing will be a rather time-consuming part for contributors. We'll need to work together to create a number of benchmarks that can well exercise the kernel promotion/demotion paths and gather the necessary numbers. By collaborating on a common set of tests, we can not only amortize efforts, but also compare different approaches or compare v1/v2/... of the same approach conveniently. Ying has already created several LKP test cases for that purpose. Shi and I plan to join the efforts, too.

Thanks, Fengguang
On Wed, Apr 17, 2019 at 11:17:48AM +0200, Michal Hocko wrote: >On Tue 16-04-19 12:19:21, Yang Shi wrote: >> >> >> On 4/16/19 12:47 AM, Michal Hocko wrote: >[...] >> > Why cannot we simply demote in the proximity order? Why do you make >> > cpuless nodes so special? If other close nodes are vacant then just use >> > them. >> >> We could. But, this raises another question, would we prefer to just demote >> to the next fallback node (just try once), if it is contended, then just >> swap (i.e. DRAM0 -> PMEM0 -> Swap); or would we prefer to try all the nodes >> in the fallback order to find the first less contended one (i.e. DRAM0 -> >> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? > >I would go with the later. Why, because it is more natural. Because that >is the natural allocation path so I do not see why this shouldn't be the >natural demotion path. "Demotion" makes more sense performance-wise as "demoting to the next-level (cheaper/slower) memory". Otherwise something like this may happen:

    DRAM0 pressured => demote cold pages to DRAM1
    DRAM1 pressured => demote cold pages to DRAM0

In effect, DRAM0/DRAM1 exchange a fraction of the demoted cold pages, which does not look helpful for overall system performance. Over time, it's even possible that some cold pages get "demoted" along the path DRAM0=>DRAM1=>DRAM0=>DRAM1=>...

Thanks, Fengguang