Message ID: 20190321200157.29678-1-keith.busch@intel.com (mailing list archive)
Series: Page demotion for memory reclaim
On 21 Mar 2019, at 13:01, Keith Busch wrote:

> The kernel has recently added support for using persistent memory as
> normal RAM:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> The persistent memory is hot added to nodes separate from other memory
> types, which makes it convenient to make node based memory policies.
>
> When persistent memory provides a larger and cheaper address space, but
> with slower access characteristics than system RAM, we'd like the kernel
> to make use of these memory-only nodes as a migration tier for pages
> that would normally be discarded during memory reclaim. This is faster
> than doing IO for swap or page cache, and makes better utilization of
> available physical address space.
>
> The feature is not enabled by default. The user must opt-in to kernel
> managed page migration by defining the demotion path. In the future,
> we may want to have the kernel automatically create this based on
> heterogeneous memory attributes and CPU locality.

Cc more people here.

Thank you for the patchset. This is definitely useful when we have larger
PMEM backing existing DRAM. I have several questions:

1. The name of “page demotion” seems confusing to me, since I thought it
was about demoting large pages to small pages, as opposed to promoting
small pages to THPs. Am I the only one here?

2. For the demotion path, a common case would be from high-performance
memory, like HBM or Multi-Channel DRAM, to DRAM, then to PMEM, and
finally to disks, right? A more general demotion path would be derived
from the memory performance description in HMAT [1], right? Do you have
any algorithm to form such a path from HMAT?

3. Do you have a plan for promoting pages from lower-level memory to
higher-level memory, like from PMEM to DRAM? Will this one-way demotion
make all pages sink to PMEM and disk?

4. In your patch 3, you created a new method migrate_demote_mapping() to
migrate pages to another memory node; is there any problem with reusing
the existing migrate_pages() interface?

5. In addition, you only migrate base pages; is there any performance
concern with migrating THPs? Is it too costly to migrate THPs?

Thanks.

[1] https://lwn.net/Articles/724562/

--
Best Regards,
Yan Zi
On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
> 1. The name of “page demotion” seems confusing to me, since I thought it
> was about demoting large pages to small pages, as opposed to promoting
> small pages to THPs. Am I the only one here?

If you have a THP, we'll skip the page migration and fall through to
split_huge_page_to_list(), then the smaller pages can be considered,
migrated and reclaimed individually. Not that we couldn't try to migrate
a THP directly. It was just a simpler implementation for this first
attempt.

> 2. For the demotion path, a common case would be from high-performance
> memory, like HBM or Multi-Channel DRAM, to DRAM, then to PMEM, and
> finally to disks, right? A more general demotion path would be derived
> from the memory performance description in HMAT [1], right? Do you have
> any algorithm to form such a path from HMAT?

Yes, I have a PoC for the kernel setting up a demotion path based on
HMAT properties here:

  https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb

The above is just from an experimental branch.

> 3. Do you have a plan for promoting pages from lower-level memory to
> higher-level memory, like from PMEM to DRAM? Will this one-way demotion
> make all pages sink to PMEM and disk?

Promoting previously demoted pages would require the application to do
something to make that happen if you turn demotion on with this series.
Kernel auto-promotion is still being investigated, and it's a little
trickier than reclaim.

If it sinks to disk, though, the next access behavior is the same as
before, without this series.

> 4. In your patch 3, you created a new method migrate_demote_mapping() to
> migrate pages to another memory node; is there any problem with reusing
> the existing migrate_pages() interface?

Yes, we may not want to migrate everything in the shrink_page_list()
pages. We might want to keep a page, so we have to do those checks first.
At the point we know we want to attempt migration, the page is already
locked and not in a list, so it is just easier to directly invoke the
new __unmap_and_move_locked() that migrate_pages() eventually also calls.

> 5. In addition, you only migrate base pages; is there any performance
> concern with migrating THPs? Is it too costly to migrate THPs?

It was just easier to consider single pages first, so we let a THP split
if possible. I'm not sure of the cost of migrating THPs directly.
On Thu, Mar 21, 2019 at 3:36 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
> > 1. The name of “page demotion” seems confusing to me, since I thought
> > it was about demoting large pages to small pages, as opposed to
> > promoting small pages to THPs. Am I the only one here?
>
> If you have a THP, we'll skip the page migration and fall through to
> split_huge_page_to_list(), then the smaller pages can be considered,
> migrated and reclaimed individually. Not that we couldn't try to migrate
> a THP directly. It was just a simpler implementation for this first
> attempt.
>
> > 2. For the demotion path, a common case would be from high-performance
> > memory, like HBM or Multi-Channel DRAM, to DRAM, then to PMEM, and
> > finally to disks, right? A more general demotion path would be derived
> > from the memory performance description in HMAT [1], right? Do you
> > have any algorithm to form such a path from HMAT?
>
> Yes, I have a PoC for the kernel setting up a demotion path based on
> HMAT properties here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>
> The above is just from an experimental branch.
>
> > 3. Do you have a plan for promoting pages from lower-level memory to
> > higher-level memory, like from PMEM to DRAM? Will this one-way
> > demotion make all pages sink to PMEM and disk?
>
> Promoting previously demoted pages would require the application to do
> something to make that happen if you turn demotion on with this series.
> Kernel auto-promotion is still being investigated, and it's a little
> trickier than reclaim.

Just FYI. I'm currently working on a patchset which tries to promote
pages from second-tier memory (i.e. PMEM) to DRAM via NUMA balancing.
But NUMA balancing can't deal with unmapped page cache; those have to
be promoted via a different path, i.e. mark_page_accessed().

And, I do agree with Keith, promotion is definitely trickier than
reclaim since the kernel can't recognize "hot" pages accurately. NUMA
balancing is still coarse-grained and inaccurate, but it is simple. If
we would like to implement a more sophisticated algorithm, an in-kernel
implementation might not be a good idea.

Thanks,
Yang

> If it sinks to disk, though, the next access behavior is the same as
> before, without this series.
>
> > 4. In your patch 3, you created a new method migrate_demote_mapping()
> > to migrate pages to another memory node; is there any problem with
> > reusing the existing migrate_pages() interface?
>
> Yes, we may not want to migrate everything in the shrink_page_list()
> pages. We might want to keep a page, so we have to do those checks
> first. At the point we know we want to attempt migration, the page is
> already locked and not in a list, so it is just easier to directly
> invoke the new __unmap_and_move_locked() that migrate_pages()
> eventually also calls.
>
> > 5. In addition, you only migrate base pages; is there any performance
> > concern with migrating THPs? Is it too costly to migrate THPs?
>
> It was just easier to consider single pages first, so we let a THP
> split if possible. I'm not sure of the cost of migrating THPs directly.
<snip>

>> 2. For the demotion path, a common case would be from high-performance
>> memory, like HBM or Multi-Channel DRAM, to DRAM, then to PMEM, and
>> finally to disks, right? A more general demotion path would be derived
>> from the memory performance description in HMAT [1], right? Do you have
>> any algorithm to form such a path from HMAT?
>
> Yes, I have a PoC for the kernel setting up a demotion path based on
> HMAT properties here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>
> The above is just from an experimental branch.

Got it. Thanks.

>> 3. Do you have a plan for promoting pages from lower-level memory to
>> higher-level memory, like from PMEM to DRAM? Will this one-way demotion
>> make all pages sink to PMEM and disk?
>
> Promoting previously demoted pages would require the application to do
> something to make that happen if you turn demotion on with this series.
> Kernel auto-promotion is still being investigated, and it's a little
> trickier than reclaim.
>
> If it sinks to disk, though, the next access behavior is the same as
> before, without this series.

This means that, when demotion is on, the path for a page would be
DRAM -> PMEM -> Disk -> DRAM -> PMEM -> ... . This could be a starting
point. I actually did something similar for a two-level heterogeneous
memory structure here:
https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/memory_manage.c#L401.
What I did, basically, was call shrink_page_list() periodically, so pages
are separated into active and inactive lists. Then, pages in the
_inactive_ list of fast memory (like DRAM) are migrated to slow memory
(like PMEM), and pages in the _active_ list of slow memory are migrated
to fast memory. It is kind of abusing the existing page lists. :)

My conclusion from those experiments is that you need high-throughput
page migration mechanisms, like multi-threaded page migration, migrating
a bunch of pages in a batch
(https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/copy_page.c),
and a new mechanism called exchange pages
(https://github.com/ysarch-lab/nimble_page_management_asplos_2019/blob/nimble_page_management_4_14_78/mm/exchange.c),
so that using page migration to manage multi-level memory systems
becomes useful. Otherwise, the overheads (TLB shootdowns and other
kernel activities in the page migration process) may kill the benefit.
Because the performance gap between DRAM and PMEM is supposed to be
smaller than the one between DRAM and disk, the benefit of putting data
in DRAM might not compensate for the cost of migrating cold pages from
DRAM to PMEM. Namely, directly putting data in PMEM after DRAM is full
might be better.

>> 4. In your patch 3, you created a new method migrate_demote_mapping()
>> to migrate pages to another memory node; is there any problem with
>> reusing the existing migrate_pages() interface?
>
> Yes, we may not want to migrate everything in the shrink_page_list()
> pages. We might want to keep a page, so we have to do those checks
> first. At the point we know we want to attempt migration, the page is
> already locked and not in a list, so it is just easier to directly
> invoke the new __unmap_and_move_locked() that migrate_pages()
> eventually also calls.

Right, I understand that you want to only migrate small pages to begin
with. My question is why not use the existing migrate_pages() in your
patch 3. Like:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b35ab8e..0a0753af357f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1261,6 +1261,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			;	/* try to reclaim the page below */
 		}

+		if (!PageCompound(page)) {
+			int next_nid = next_migration_node(page);
+			int err;
+
+			if (next_nid != TERMINAL_NODE) {
+				LIST_HEAD(migrate_list);
+				list_add(&migrate_list, &page->lru);
+				err = migrate_pages(&migrate_list, alloc_new_node_page, NULL,
+						next_nid, MIGRATE_ASYNC, MR_DEMOTION);
+				if (err)
+					putback_movable_pages(&migrate_list);
+			}
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.

Because your new migrate_demote_mapping() basically does the same thing
as the code above. If you are not OK with the gfp flags in
alloc_new_node_page(), you can just write your own
alloc_new_node_page(). :)

>> 5. In addition, you only migrate base pages; is there any performance
>> concern with migrating THPs? Is it too costly to migrate THPs?
>
> It was just easier to consider single pages first, so we let a THP
> split if possible. I'm not sure of the cost of migrating THPs directly.

AFAICT, when migrating the same amount of 2MB data, migrating a THP is
much quicker than migrating 512 4KB pages, because you save 511 TLB
shootdowns in THP migration, and copying 2MB of contiguous data achieves
higher throughput than copying individual 4KB pages. But it highly
depends on whether any subpage in a THP is hotter than the others, so
migrating a THP as a whole might hurt performance sometimes. This is
just from my observations in my own experiments.

--
Best Regards,
Yan Zi
On 21 Mar 2019, at 16:02, Yang Shi wrote:

> On Thu, Mar 21, 2019 at 3:36 PM Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Thu, Mar 21, 2019 at 02:20:51PM -0700, Zi Yan wrote:
>>> 1. The name of “page demotion” seems confusing to me, since I thought
>>> it was about demoting large pages to small pages, as opposed to
>>> promoting small pages to THPs. Am I the only one here?
>>
>> If you have a THP, we'll skip the page migration and fall through to
>> split_huge_page_to_list(), then the smaller pages can be considered,
>> migrated and reclaimed individually. Not that we couldn't try to
>> migrate a THP directly. It was just a simpler implementation for this
>> first attempt.
>>
>>> 2. For the demotion path, a common case would be from high-performance
>>> memory, like HBM or Multi-Channel DRAM, to DRAM, then to PMEM, and
>>> finally to disks, right? A more general demotion path would be derived
>>> from the memory performance description in HMAT [1], right? Do you
>>> have any algorithm to form such a path from HMAT?
>>
>> Yes, I have a PoC for the kernel setting up a demotion path based on
>> HMAT properties here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/commit/?h=mm-migrate&id=4d007659e1dd1b0dad49514348be4441fbe7cadb
>>
>> The above is just from an experimental branch.
>>
>>> 3. Do you have a plan for promoting pages from lower-level memory to
>>> higher-level memory, like from PMEM to DRAM? Will this one-way
>>> demotion make all pages sink to PMEM and disk?
>>
>> Promoting previously demoted pages would require the application to do
>> something to make that happen if you turn demotion on with this series.
>> Kernel auto-promotion is still being investigated, and it's a little
>> trickier than reclaim.
>
> Just FYI. I'm currently working on a patchset which tries to promote
> pages from second-tier memory (i.e. PMEM) to DRAM via NUMA balancing.
> But NUMA balancing can't deal with unmapped page cache; those have to
> be promoted via a different path, i.e. mark_page_accessed().

Got it. Another concern is that NUMA balancing marks pages inaccessible
to obtain access information. It might add more overhead on top of the
page migration overheads. Considering the benefit of migrating pages
from PMEM to DRAM is not as large as that of bringing data from disk to
DRAM, the overheads might offset the benefit, meaning you might see
performance degradation.

> And, I do agree with Keith, promotion is definitely trickier than
> reclaim since the kernel can't recognize "hot" pages accurately. NUMA
> balancing is still coarse-grained and inaccurate, but it is simple. If
> we would like to implement a more sophisticated algorithm, an in-kernel
> implementation might not be a good idea.

I agree. Or a hardware vendor, like Intel, could provide more
information on page hotness, like multi-bit access bits or the
page-modification log Intel provides for virtualization.

--
Best Regards,
Yan Zi
On Thu, Mar 21, 2019 at 05:12:33PM -0700, Zi Yan wrote:
> > Yes, we may not want to migrate everything in the shrink_page_list()
> > pages. We might want to keep a page, so we have to do those checks
> > first. At the point we know we want to attempt migration, the page is
> > already locked and not in a list, so it is just easier to directly
> > invoke the new __unmap_and_move_locked() that migrate_pages()
> > eventually also calls.
>
> Right, I understand that you want to only migrate small pages to begin
> with. My question is why not use the existing migrate_pages() in your
> patch 3. Like:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5ad0b35ab8e..0a0753af357f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1261,6 +1261,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			;	/* try to reclaim the page below */
>  		}
>
> +		if (!PageCompound(page)) {
> +			int next_nid = next_migration_node(page);
> +			int err;
> +
> +			if (next_nid != TERMINAL_NODE) {
> +				LIST_HEAD(migrate_list);
> +				list_add(&migrate_list, &page->lru);
> +				err = migrate_pages(&migrate_list, alloc_new_node_page, NULL,
> +						next_nid, MIGRATE_ASYNC, MR_DEMOTION);
> +				if (err)
> +					putback_movable_pages(&migrate_list);
> +			}
> +		}
> +
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
>
> Because your new migrate_demote_mapping() basically does the same thing
> as the code above. If you are not OK with the gfp flags in
> alloc_new_node_page(), you can just write your own
> alloc_new_node_page(). :)

The page is already locked; you can't call migrate_pages() with locked
pages. You'd have to surround migrate_pages() with
unlock_page()/try_lock_page(), and I thought that looked odd. Further,
it changes the flow if the subsequent try lock fails, and I'm trying to
be careful about not introducing different behavior if migration fails.
Patch 2/5 is included here so we can reuse the necessary code from a
locked-page context.