Message ID: 20250103172419.4148674-1-ziy@nvidia.com
Series: Accelerate page migration with batching and multi threads
On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purposes and should not be considered.
>

This is well timed, as I've been testing a batch-migration variant of
migrate_misplaced_folio for my pagecache promotion work (attached).

I will add this to my pagecache branch and give it a test at some point.

Quick question: is the multi-threaded movement supported in the context
of task_work? i.e. in which contexts is the multi-threaded path
safe/unsafe? (inline in a syscall, async only, etc.)

~Gregory

---

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9438cc7c2aeb..17baf63964c0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 			   int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist,
+				  struct vm_area_struct *vma,
+				  int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -157,6 +160,12 @@ static inline int migrate_misplaced_folio(struct folio *folio,
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
+						struct vm_area_struct *vma,
+						int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 459f396f7bc1..454fd93c4cc7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2608,5 +2608,27 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+int migrate_misplaced_folio_batch(struct list_head *folio_list,
+				  struct vm_area_struct *vma,
+				  int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	BUG_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
On 3 Jan 2025, at 14:17, Gregory Price wrote:

> On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads and is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purposes and should not be considered.
>>
>
> This is well timed as I've been testing a batch-migration variant of
> migrate_misplaced_folio for my pagecache promotion work (attached).
>
> I will add this to my pagecache branch and give it a test at some point.

Great. Thanks.

> Quick question: is the multi-threaded movement supported in the context
> of task_work? i.e. in which contexts is the multi-threaded path
> safe/unsafe? (inline in a syscall, async only, etc.)

It should work in any context, like syscall, memory compaction, and so on,
since it just distributes memcpy to different CPUs using a workqueue.

> [attached migrate_misplaced_folio_batch diff quoted in full above; snipped]

Best Regards,
Yan, Zi
On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purposes and should not be considered.
>
> The motivations are:
>
> 1. Batching folio copy increases copy throughput. Especially for base page
> migrations, folio copy throughput is low since kernel activities like
> moving folio metadata and updating page table entries sit between two folio
> copies. And base page sizes are relatively small: 4KB on x86_64 and ARM64,
> or 64KB on ARM64.
>
> 2. A single CPU thread has limited copy throughput. Using multiple threads is
> a natural extension to speed up folio copy when a DMA engine is NOT
> available in a system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to avoid the folio copy operation inside
> migrate_folio_move() and perform it in one shot afterwards. A
> copy_page_lists_mt() function is added to use multiple threads to copy
> folios from the src list to the dst list.
>
> Changes compared to Shivank's patchset (mainly rewrote the batching folio
> copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during
> batched folio copies. src->private is used to store the old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two-socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
> 2MB mTHP page migration are measured.
>
> The tables below show move_pages() throughput with different
> configurations and different numbers of copied pages. The x-axis is the
> configuration, from the vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
> threads with this patchset applied. The unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x that of single-threaded
> serial folio copy. Batching folio copy benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
>       vanilla  mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
> 32       5.43  4.90   5.65   7.31   7.60   8.61   6.43
> 256      6.95  6.89   9.28  14.67  22.41  23.39  23.93
> 512      7.88  7.26  10.15  17.53  27.82  27.88  33.93
> 768      7.65  7.42  10.46  18.59  28.65  29.67  30.76
> 1024     7.46  8.01  10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
>       vanilla  mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
> 1        5.94  2.90   6.90   8.56  11.16   8.76   6.41
> 2        7.67  5.57   7.11  12.48  17.37  15.68  14.10
> 4        8.01  6.04  10.25  20.14  22.52  27.79  25.28
> 8        8.42  7.00  11.41  24.73  33.96  32.62  39.55
> 16       9.41  6.91  12.23  27.51  43.95  49.15  51.38
> 32      10.23  7.15  13.03  29.52  49.49  69.98  71.51
> 64       9.40  7.37  13.88  30.38  52.00  76.89  79.41
> 128      8.59  7.23  14.20  28.39  49.98  78.27  90.18
> 256      8.43  7.16  14.59  28.14  48.78  76.88  92.28
> 512      8.31  7.78  14.40  26.20  43.31  63.91  75.21
> 768      8.30  7.86  14.83  27.41  46.25  69.85  81.31
> 1024     8.31  7.90  14.96  27.62  46.75  71.76  83.84

Is this done on an idle system or a busy system? For real production
workloads, all the CPUs are likely busy. It would be great to have the
performance data collected from a busy system too.

> TODOs
> ===
> 1. The multi-threaded folio copy routine needs to look at the CPU scheduler
> and only use idle CPUs to avoid interfering with userspace workloads. Of
> course more complicated policies can be used based on the migration-issuing
> thread's priority.

The other potential problem is that it is hard to attribute the CPU time
consumed by the migration worker threads to CPU cgroups. In a multi-tenant
environment this may result in unfair CPU time accounting. However, it is a
chronic problem to properly account CPU time for kernel threads. I'm not
sure whether it has been solved or not.

> 2. Eliminate memory allocation during the multi-threaded folio copy routine
> if possible.
>
> 3. A runtime check to decide when to use multi-threaded folio copy.
> Something like the cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.

AFAICT, arm64 already uses non-temporal instructions for copy page.

> 5. Explicitly make multi-threaded folio copy available only to
> !HIGHMEM, since kmap_local_page() would be needed in each kernel
> folio copy work thread and is expensive.
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
>
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2