[RFC,0/5] Accelerate page migration with batching and multi threads

Message ID 20250103172419.4148674-1-ziy@nvidia.com (mailing list archive)

Message

Zi Yan Jan. 3, 2025, 5:24 p.m. UTC
Hi all,

This patchset accelerates page migration by batching folio copy operations and
using multiple CPU threads. It is based on Shivank's "Enhancements to Page
Migration with Batch Offloading via DMA" patchset [1] and my original page
migration acceleration patchset [2], and applies on top of
mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
and should not be considered for merging.

The motivations are:

1. Batching folio copies increases copy throughput, especially for base page
migrations: folio copy throughput is low because kernel activities, such as
moving folio metadata and updating page table entries, sit between consecutive
folio copies, and base page sizes are relatively small (4KB on x86_64 and
ARM64, or 64KB on ARM64).

2. A single CPU thread has limited copy throughput. Using multiple threads is
a natural extension to speed up folio copy when a DMA engine is not available
in the system.


Design
===

It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY (renamed
to MIGRATE_NO_COPY) to skip the folio copy operation inside
migrate_folio_move() and perform all the copies in one shot afterwards. A
copy_page_lists_mt() function is added to copy folios from the src list to
the dst list using multiple threads.
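
The sketch below shows the shape of this idea. It is illustrative only, not
the patch code: the copy_work layout, the chunking policy, and the helper
names are assumptions. Each work item owns a share of the src/dst folio
lists, copies it with folio_copy(), and the caller flushes all work items.

#include <linux/workqueue.h>
#include <linux/mm.h>

struct copy_work {
	struct work_struct work;
	struct list_head src_folios;	/* this worker's share of src folios */
	struct list_head dst_folios;	/* the matching dst folios */
};

static void folio_copy_work_fn(struct work_struct *work)
{
	struct copy_work *cw = container_of(work, struct copy_work, work);
	struct folio *src, *dst;

	/* walk the two lists in lockstep and copy the page contents */
	dst = list_first_entry(&cw->dst_folios, struct folio, lru);
	list_for_each_entry(src, &cw->src_folios, lru) {
		folio_copy(dst, src);
		dst = list_next_entry(dst, lru);
	}
}

/*
 * Hypothetical caller shape for copy_page_lists_mt(): the folios are
 * assumed to be already distributed into works[0..nr_threads).
 */
static void copy_page_lists_mt_sketch(struct copy_work *works, int nr_threads)
{
	int i;

	for (i = 0; i < nr_threads; i++) {
		INIT_WORK(&works[i].work, folio_copy_work_fn);
		queue_work(system_unbound_wq, &works[i].work);
	}
	for (i = 0; i < nr_threads; i++)
		flush_work(&works[i].work);
}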

Changes compared to Shivank's patchset (mainly a rewrite of the batching
folio copy code)
===

1. mig_info is removed, so no memory allocation is needed during batched
folio copies. src->private is used to store the old page state and the
anon_vma after the folio metadata is copied from src to dst (see the sketch
after this list).

2. move_to_new_folio() and migrate_folio_move() are refactored to remove
redundant code in migrate_folios_batch_move().

3. folio_mc_copy() is used for the single-threaded copy path to keep the
original kernel behavior.
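
To illustrate point 1: anon_vma pointers have free low alignment bits, so the
old page state flags can share src->private with the pointer. The helpers
below are hypothetical and the exact encoding in the patchset may differ;
PAGE_WAS_MAPPED and PAGE_WAS_MLOCKED follow the existing flags in
mm/migrate.c.

static void folio_stash_migrate_state(struct folio *src,
				      struct anon_vma *anon_vma,
				      unsigned long old_page_state)
{
	/* old_page_state only uses the pointer's low alignment bits */
	src->private = (void *)((unsigned long)anon_vma | old_page_state);
}

static struct anon_vma *folio_stashed_anon_vma(struct folio *src)
{
	return (struct anon_vma *)((unsigned long)src->private &
				   ~(PAGE_WAS_MAPPED | PAGE_WAS_MLOCKED));
}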


Performance
===

I benchmarked move_pages() throughput on a two-socket NUMA system with two
NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
2MB mTHP page migration are measured.

The tables below show move_pages() throughput for different configurations
and numbers of copied pages. Each column is a configuration, from the vanilla
Linux kernel to this patchset with 1, 2, 4, 8, 16, or 32 threads; each row is
a number of copied pages. The unit is GB/s.

The 32-thread copy throughput can be up to 10x that of the single-threaded
serial folio copy. Batching folio copies benefits not only huge pages but
also base pages.

64KB (GB/s):

nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
32        5.43     4.90   5.65   7.31   7.60   8.61   6.43
256       6.95     6.89   9.28   14.67  22.41  23.39  23.93
512       7.88     7.26   10.15  17.53  27.82  27.88  33.93
768       7.65     7.42   10.46  18.59  28.65  29.67  30.76
1024      7.46     8.01   10.90  17.77  27.04  32.18  38.80

2MB mTHP (GB/s):

nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
1         5.94     2.90   6.90   8.56   11.16  8.76   6.41
2         7.67     5.57   7.11   12.48  17.37  15.68  14.10
4         8.01     6.04   10.25  20.14  22.52  27.79  25.28
8         8.42     7.00   11.41  24.73  33.96  32.62  39.55
16        9.41     6.91   12.23  27.51  43.95  49.15  51.38
32        10.23    7.15   13.03  29.52  49.49  69.98  71.51
64        9.40     7.37   13.88  30.38  52.00  76.89  79.41
128       8.59     7.23   14.20  28.39  49.98  78.27  90.18
256       8.43     7.16   14.59  28.14  48.78  76.88  92.28
512       8.31     7.78   14.40  26.20  43.31  63.91  75.21
768       8.30     7.86   14.83  27.41  46.25  69.85  81.31
1024      8.31     7.90   14.96  27.62  46.75  71.76  83.84


TODOs
===
1. The multi-threaded folio copy routine needs to consult the CPU scheduler
and use only idle CPUs to avoid interfering with userspace workloads. Of
course, more sophisticated policies can be built on top, based on the
priority of the thread issuing the migration.

2. Eliminate memory allocation in the multi-threaded folio copy routine if
possible.

3. A runtime check to decide when to use multi-threaded folio copy, e.g.
based on the cache hotness issue mentioned by Matthew [3].

4. Use non-temporal CPU instructions to avoid cache pollution issues.

5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
configurations, since kmap_local_page() would be needed in each folio copy
worker thread and is expensive (see the sketch after this list).

6. A better interface than copy_page_lists_mt(), so that DMA data copy can
be used as well.
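
Regarding TODO 5, a minimal sketch of the per-page cost on HIGHMEM (not the
patch code, and essentially what the existing copy_highpage() helper already
does): every worker must map both pages around each copy.

#include <linux/highmem.h>
#include <linux/string.h>

static void copy_one_page(struct page *dst, struct page *src)
{
	void *vfrom = kmap_local_page(src);
	void *vto = kmap_local_page(dst);

	memcpy(vto, vfrom, PAGE_SIZE);
	kunmap_local(vto);	/* unmap in reverse order of mapping */
	kunmap_local(vfrom);
}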

Let me know your thoughts. Thanks.


[1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
[2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
[3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/

Byungchul Park (1):
  mm: separate move/undo doing on folio list from migrate_pages_batch()

Zi Yan (4):
  mm/migrate: factor out code in move_to_new_folio() and
    migrate_folio_move()
  mm/migrate: add migrate_folios_batch_move to batch the folio move
    operations
  mm/migrate: introduce multi-threaded page copy routine
  test: add sysctl for folio copy tests and adjust
    NR_MAX_BATCHED_MIGRATION

 include/linux/migrate.h      |   3 +
 include/linux/migrate_mode.h |   2 +
 include/linux/mm.h           |   4 +
 include/linux/sysctl.h       |   1 +
 kernel/sysctl.c              |  29 ++-
 mm/Makefile                  |   2 +-
 mm/copy_pages.c              | 190 +++++++++++++++
 mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
 8 files changed, 577 insertions(+), 97 deletions(-)
 create mode 100644 mm/copy_pages.c

Comments

Gregory Price Jan. 3, 2025, 7:17 p.m. UTC | #1
On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
> Hi all,
> 
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads. It is based on Shivank's "Enhancements to Page
> Migration with Batch Offloading via DMA" patchset [1] and my original page
> migration acceleration patchset [2], and applies on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
> and should not be considered for merging.
> 

This is well timed as I've been testing a batch-migration variant of
migrate_misplaced_folio for my pagecache promotion work (attached).

I will add this to my pagecache branch and give it a test at some point.

Quick question: is the multi-threaded movement supported in the context
of task_work?  i.e. in which context is the multi-threaded path
safe/unsafe? (inline in a syscall, async only, etc).

~Gregory

---

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9438cc7c2aeb..17baf63964c0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
                struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
                           int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist,
+                                 struct vm_area_struct *vma,
+                                 int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
                struct vm_area_struct *vma, int node)
@@ -157,6 +160,12 @@ static inline int migrate_misplaced_folio(struct folio *folio,
 {
        return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
+                                 struct vm_area_struct *vma,
+                                 int node)
+{
+       return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 459f396f7bc1..454fd93c4cc7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2608,5 +2608,27 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
        BUG_ON(!list_empty(&migratepages));
        return nr_remaining ? -EAGAIN : 0;
 }
+
+int migrate_misplaced_folio_batch(struct list_head *folio_list,
+                                 struct vm_area_struct *vma,
+                                 int node)
+{
+       pg_data_t *pgdat = NODE_DATA(node);
+       unsigned int nr_succeeded;
+       int nr_remaining;
+
+       nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+                                    NULL, node, MIGRATE_ASYNC,
+                                    MR_NUMA_MISPLACED, &nr_succeeded);
+       if (nr_remaining)
+               putback_movable_pages(folio_list);
+
+       if (nr_succeeded) {
+               count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+               mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+       }
+       BUG_ON(!list_empty(folio_list));
+       return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
Zi Yan Jan. 3, 2025, 7:32 p.m. UTC | #2
On 3 Jan 2025, at 14:17, Gregory Price wrote:

> On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads. It is based on Shivank's "Enhancements to Page
>> Migration with Batch Offloading via DMA" patchset [1] and my original page
>> migration acceleration patchset [2], and applies on top of
>> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
>> and should not be considered for merging.
>>
>
> This is well timed as I've been testing a batch-migration variant of
> migrate_misplaced_folio for my pagecache promotion work (attached).
>
> I will add this to my pagecache branch and give it a test at some point.

Great. Thanks.

>
> Quick question: is the multi-threaded movement supported in the context
> of task_work?  i.e. in which context is the multi-threaded path
> safe/unsafe? (inline in a syscall, async only, etc).

It should work in any context, like a syscall, memory compaction, and so on,
since it just distributes the memcpy work to different CPUs via a workqueue.

>
> ~Gregory
>
> ---
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9438cc7c2aeb..17baf63964c0 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -146,6 +146,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>                 struct vm_area_struct *vma, int node);
>  int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>                            int node);
> +int migrate_misplaced_folio_batch(struct list_head *foliolist,
> +                                 struct vm_area_struct *vma,
> +                                 int node);
>  #else
>  static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>                 struct vm_area_struct *vma, int node)
> @@ -157,6 +160,12 @@ static inline int migrate_misplaced_folio(struct folio *folio,
>  {
>         return -EAGAIN; /* can't migrate now */
>  }
> +static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
> +                                 struct vm_area_struct *vma,
> +                                 int node)
> +{
> +       return -EAGAIN; /* can't migrate now */
> +}
>  #endif /* CONFIG_NUMA_BALANCING */
>
>  #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 459f396f7bc1..454fd93c4cc7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2608,5 +2608,27 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>         BUG_ON(!list_empty(&migratepages));
>         return nr_remaining ? -EAGAIN : 0;
>  }
> +
> +int migrate_misplaced_folio_batch(struct list_head *folio_list,
> +                                 struct vm_area_struct *vma,
> +                                 int node)
> +{
> +       pg_data_t *pgdat = NODE_DATA(node);
> +       unsigned int nr_succeeded;
> +       int nr_remaining;
> +
> +       nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> +                                    NULL, node, MIGRATE_ASYNC,
> +                                    MR_NUMA_MISPLACED, &nr_succeeded);
> +       if (nr_remaining)
> +               putback_movable_pages(folio_list);
> +
> +       if (nr_succeeded) {
> +               count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> +               mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
> +       }
> +       BUG_ON(!list_empty(folio_list));
> +       return nr_remaining ? -EAGAIN : 0;
> +}
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */


Best Regards,
Yan, Zi
Yang Shi Jan. 3, 2025, 10:09 p.m. UTC | #3
On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads. It is based on Shivank's "Enhancements to Page
> Migration with Batch Offloading via DMA" patchset [1] and my original page
> migration acceleration patchset [2], and applies on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
> and should not be considered for merging.
>
> The motivations are:
>
> 1. Batching folio copies increases copy throughput, especially for base page
> migrations: folio copy throughput is low because kernel activities, such as
> moving folio metadata and updating page table entries, sit between consecutive
> folio copies, and base page sizes are relatively small (4KB on x86_64 and
> ARM64, or 64KB on ARM64).
>
> 2. A single CPU thread has limited copy throughput. Using multiple threads is
> a natural extension to speed up folio copy when a DMA engine is not available
> in the system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY (renamed
> to MIGRATE_NO_COPY) to skip the folio copy operation inside
> migrate_folio_move() and perform all the copies in one shot afterwards. A
> copy_page_lists_mt() function is added to copy folios from the src list to
> the dst list using multiple threads.
>
> Changes compared to Shivank's patchset (mainly a rewrite of the batching
> folio copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during batched
> folio copies. src->private is used to store the old page state and the
> anon_vma after the folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single-threaded copy path to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two-socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
> 2MB mTHP page migration are measured.
>
> The tables below show move_pages() throughput for different configurations
> and numbers of copied pages. Each column is a configuration, from the vanilla
> Linux kernel to this patchset with 1, 2, 4, 8, 16, or 32 threads; each row is
> a number of copied pages. The unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x that of the single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
> 32        5.43     4.90   5.65   7.31   7.60   8.61   6.43
> 256       6.95     6.89   9.28   14.67  22.41  23.39  23.93
> 512       7.88     7.26   10.15  17.53  27.82  27.88  33.93
> 768       7.65     7.42   10.46  18.59  28.65  29.67  30.76
> 1024      7.46     8.01   10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
> 1         5.94     2.90   6.90   8.56   11.16  8.76   6.41
> 2         7.67     5.57   7.11   12.48  17.37  15.68  14.10
> 4         8.01     6.04   10.25  20.14  22.52  27.79  25.28
> 8         8.42     7.00   11.41  24.73  33.96  32.62  39.55
> 16        9.41     6.91   12.23  27.51  43.95  49.15  51.38
> 32        10.23    7.15   13.03  29.52  49.49  69.98  71.51
> 64        9.40     7.37   13.88  30.38  52.00  76.89  79.41
> 128       8.59     7.23   14.20  28.39  49.98  78.27  90.18
> 256       8.43     7.16   14.59  28.14  48.78  76.88  92.28
> 512       8.31     7.78   14.40  26.20  43.31  63.91  75.21
> 768       8.30     7.86   14.83  27.41  46.25  69.85  81.31
> 1024      8.31     7.90   14.96  27.62  46.75  71.76  83.84

Is this done on an idle system or a busy system? For real production
workloads, all the CPUs are likely busy. It would be great to have
performance data collected from a busy system too.

>
>
> TODOs
> ===
> 1. The multi-threaded folio copy routine needs to consult the CPU scheduler
> and use only idle CPUs to avoid interfering with userspace workloads. Of
> course, more sophisticated policies can be built on top, based on the
> priority of the thread issuing the migration.

The other potential problem is that it is hard to attribute the CPU time
consumed by the migration worker threads to CPU cgroups. In a multi-tenant
environment this may result in unfair CPU time accounting. However, properly
accounting CPU time for kernel threads is a chronic problem, and I'm not
sure whether it has been solved.

>
> 2. Eliminate memory allocation in the multi-threaded folio copy routine if
> possible.
>
> 3. A runtime check to decide when to use multi-threaded folio copy, e.g.
> based on the cache hotness issue mentioned by Matthew [3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.

AFAICT, arm64 already uses non-temporal instructions for copying pages.

>
> 5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
> configurations, since kmap_local_page() would be needed in each folio copy
> worker thread and is expensive.
>
> 6. A better interface than copy_page_lists_mt(), so that DMA data copy can
> be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
>
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>
Zi Yan Jan. 6, 2025, 2:33 a.m. UTC | #4
On 3 Jan 2025, at 17:09, Yang Shi wrote:

> On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads. It is based on Shivank's "Enhancements to Page
>> Migration with Batch Offloading via DMA" patchset [1] and my original page
>> migration acceleration patchset [2], and applies on top of
>> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
>> and should not be considered for merging.
>>
>> The motivations are:
>>
>> 1. Batching folio copies increases copy throughput, especially for base page
>> migrations: folio copy throughput is low because kernel activities, such as
>> moving folio metadata and updating page table entries, sit between consecutive
>> folio copies, and base page sizes are relatively small (4KB on x86_64 and
>> ARM64, or 64KB on ARM64).
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple threads is
>> a natural extension to speed up folio copy when a DMA engine is not available
>> in the system.
>>
>>
>> Design
>> ===
>>
>> It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY (renamed
>> to MIGRATE_NO_COPY) to skip the folio copy operation inside
>> migrate_folio_move() and perform all the copies in one shot afterwards. A
>> copy_page_lists_mt() function is added to copy folios from the src list to
>> the dst list using multiple threads.
>>
>> Changes compared to Shivank's patchset (mainly a rewrite of the batching
>> folio copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during batched
>> folio copies. src->private is used to store the old page state and the
>> anon_vma after the folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single-threaded copy path to keep the
>> original kernel behavior.
>>
>>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two-socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
>> 2MB mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput for different configurations
>> and numbers of copied pages. Each column is a configuration, from the vanilla
>> Linux kernel to this patchset with 1, 2, 4, 8, 16, or 32 threads; each row is
>> a number of copied pages. The unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of the single-threaded
>> serial folio copy. Batching folio copies benefits not only huge pages but
>> also base pages.
>>
>> 64KB (GB/s):
>>
>> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
>> 32        5.43     4.90   5.65   7.31   7.60   8.61   6.43
>> 256       6.95     6.89   9.28   14.67  22.41  23.39  23.93
>> 512       7.88     7.26   10.15  17.53  27.82  27.88  33.93
>> 768       7.65     7.42   10.46  18.59  28.65  29.67  30.76
>> 1024      7.46     8.01   10.90  17.77  27.04  32.18  38.80
>>
>> 2MB mTHP (GB/s):
>>
>> nr_pages  vanilla  mt_1   mt_2   mt_4   mt_8   mt_16  mt_32
>> 1         5.94     2.90   6.90   8.56   11.16  8.76   6.41
>> 2         7.67     5.57   7.11   12.48  17.37  15.68  14.10
>> 4         8.01     6.04   10.25  20.14  22.52  27.79  25.28
>> 8         8.42     7.00   11.41  24.73  33.96  32.62  39.55
>> 16        9.41     6.91   12.23  27.51  43.95  49.15  51.38
>> 32        10.23    7.15   13.03  29.52  49.49  69.98  71.51
>> 64        9.40     7.37   13.88  30.38  52.00  76.89  79.41
>> 128       8.59     7.23   14.20  28.39  49.98  78.27  90.18
>> 256       8.43     7.16   14.59  28.14  48.78  76.88  92.28
>> 512       8.31     7.78   14.40  26.20  43.31  63.91  75.21
>> 768       8.30     7.86   14.83  27.41  46.25  69.85  81.31
>> 1024      8.31     7.90   14.96  27.62  46.75  71.76  83.84
>
> Is this done on an idle system or a busy system? For real production
> workloads, all the CPUs are likely busy. It would be great to have
> performance data collected from a busy system too.

Yes, it was done on an idle system.

I redid the experiments on a busy system by running stress on all CPU
cores, and the results are not as good, since all CPUs are occupied. Then I
switched to system_highpri_wq, and the throughput got better, almost on par
with the results on an idle machine. The numbers are below.

It becomes a trade-off between page migration throughput and user application
performance on _a busy system_. If a page migration is badly needed,
system_highpri_wq can be used to retain high copy throughput. Otherwise,
multiple threads should not be used.
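
In a workqueue-based implementation that switch is a one-line change at the
point where the copy work items are queued (illustrative; cw is a
hypothetical work item, not the patch code):

	queue_work(system_unbound_wq, &cw->work);  /* default: unbound workers */
	queue_work(system_highpri_wq, &cw->work);  /* high-priority workers instead */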

64KB with system_unbound_wq on a busy system (GB/s):

| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
|      | vanilla  | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| 32   | 4.05     | 1.51 | 1.32 | 1.20 | 4.31 | 1.05  | 0.02  |
| 256  | 6.91     | 3.93 | 4.61 | 0.08 | 4.46 | 4.30  | 3.89  |
| 512  | 7.28     | 4.87 | 1.81 | 6.18 | 4.38 | 5.58  | 6.10  |
| 768  | 4.57     | 5.72 | 5.35 | 5.24 | 5.94 | 5.66  | 0.20  |
| 1024 | 7.88     | 5.73 | 5.81 | 6.52 | 7.29 | 6.06  | 5.62  |

2MB with system_unbound_wq on a busy system (GB/s):

| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| 1    | 1.38    | 0.59 | 1.45 | 1.99 | 1.59  | 2.18  | 1.48  |
| 2    | 1.13    | 3.08 | 3.11 | 1.85 | 0.32  | 1.46  | 2.53  |
| 4    | 8.31    | 4.02 | 5.68 | 3.22 | 2.96  | 5.77  | 2.91  |
| 8    | 8.16    | 5.09 | 1.19 | 4.96 | 4.50  | 3.36  | 4.99  |
| 16   | 3.47    | 5.13 | 5.72 | 7.06 | 5.90  | 6.49  | 5.34  |
| 32   | 8.42    | 6.97 | 0.13 | 6.77 | 7.69  | 7.56  | 2.87  |
| 64   | 7.45    | 8.06 | 7.22 | 8.60 | 8.07  | 7.16  | 0.57  |
| 128  | 7.77    | 7.93 | 7.29 | 8.31 | 7.77  | 9.05  | 0.92  |
| 256  | 6.91    | 7.20 | 6.80 | 8.56 | 7.81  | 10.13 | 11.21 |
| 512  | 6.72    | 7.22 | 7.77 | 9.71 | 10.68 | 10.35 | 10.40 |
| 768  | 6.87    | 7.18 | 7.98 | 9.28 | 10.85 | 10.83 | 14.17 |
| 1024 | 6.95    | 7.23 | 8.03 | 9.59 | 10.88 | 10.22 | 20.27 |



64KB with system_highpri_wq on a busy system (GB/s):

| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| 32   | 4.05    | 2.63 | 1.62 | 1.90  | 3.34  | 3.71  | 3.40  |
| 256  | 6.91    | 5.16 | 4.33 | 8.07  | 6.81  | 10.31 | 13.51 |
| 512  | 7.28    | 4.89 | 6.43 | 15.72 | 11.31 | 18.03 | 32.69 |
| 768  | 4.57    | 6.27 | 6.42 | 11.06 | 8.56  | 14.91 | 9.24  |
| 1024 | 7.88    | 6.73 | 0.49 | 17.09 | 19.34 | 23.60 | 18.12 |


2MB with system_highpri_wq on a busy system (GB/s):

| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2  | mt_4  | mt_8  | mt_16 | mt_32 |
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 1    | 1.38    | 1.18 | 1.17  | 5.00  | 1.68  | 3.86  | 2.46  |
| 2    | 1.13    | 1.78 | 1.05  | 0.01  | 3.52  | 1.84  | 1.80  |
| 4    | 8.31    | 3.91 | 5.24  | 4.30  | 4.12  | 2.93  | 3.44  |
| 8    | 8.16    | 6.09 | 3.67  | 7.81  | 11.10 | 8.47  | 15.21 |
| 16   | 3.47    | 6.02 | 8.44  | 11.80 | 9.56  | 12.84 | 9.81  |
| 32   | 8.42    | 7.34 | 10.10 | 13.79 | 23.03 | 26.68 | 45.24 |
| 64   | 7.45    | 7.90 | 12.27 | 19.99 | 36.08 | 35.11 | 60.26 |
| 128  | 7.77    | 7.57 | 13.35 | 24.67 | 35.03 | 41.40 | 51.68 |
| 256  | 6.91    | 7.40 | 14.13 | 25.37 | 38.83 | 62.18 | 51.37 |
| 512  | 6.72    | 7.26 | 14.72 | 27.37 | 43.99 | 66.84 | 69.63 |
| 768  | 6.87    | 7.29 | 14.84 | 26.34 | 47.21 | 67.51 | 80.32 |
| 1024 | 6.95    | 7.26 | 14.88 | 26.98 | 47.75 | 74.99 | 85.00 |



>
>>
>>
>> TODOs
>> ===
>> 1. The multi-threaded folio copy routine needs to consult the CPU scheduler
>> and use only idle CPUs to avoid interfering with userspace workloads. Of
>> course, more sophisticated policies can be built on top, based on the
>> priority of the thread issuing the migration.
>
> The other potential problem is that it is hard to attribute the CPU time
> consumed by the migration worker threads to CPU cgroups. In a multi-tenant
> environment this may result in unfair CPU time accounting. However, properly
> accounting CPU time for kernel threads is a chronic problem, and I'm not
> sure whether it has been solved.
>
>>
>> 2. Eliminate memory allocation in the multi-threaded folio copy routine if
>> possible.
>>
>> 3. A runtime check to decide when to use multi-threaded folio copy, e.g.
>> based on the cache hotness issue mentioned by Matthew [3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> AFAICT, arm64 already uses non-temporal instructions for copying pages.

Right. My current implementation uses memcpy(), which does not use
non-temporal instructions on ARM64, since a huge page can be copied by
multiple threads. A non-temporal memcpy can be added for this use case.
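
As a rough illustration (an assumption, not the patch code), a non-temporal
copy loop on arm64 could use STNP, the store-pair instruction with a
non-temporal hint, assuming 16-byte-aligned buffers and a length that is a
multiple of 16:

#include <linux/types.h>

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	const u64 *s = src;
	u64 *d = dst;

	/* STNP stores two registers with a non-temporal cache hint */
	for (; len >= 16; len -= 16, s += 2, d += 2)
		asm volatile("stnp %0, %1, [%2]"
			     : : "r"(s[0]), "r"(s[1]), "r"(d)
			     : "memory");
}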

Thank you for the inputs.

>
>>
>> 5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
>> configurations, since kmap_local_page() would be needed in each folio copy
>> worker thread and is expensive.
>>
>> 6. A better interface than copy_page_lists_mt(), so that DMA data copy can
>> be used as well.
>>
>> Let me know your thoughts. Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
>> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>>
>> Byungchul Park (1):
>>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>>
>> Zi Yan (4):
>>   mm/migrate: factor out code in move_to_new_folio() and
>>     migrate_folio_move()
>>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>>     operations
>>   mm/migrate: introduce multi-threaded page copy routine
>>   test: add sysctl for folio copy tests and adjust
>>     NR_MAX_BATCHED_MIGRATION
>>
>>  include/linux/migrate.h      |   3 +
>>  include/linux/migrate_mode.h |   2 +
>>  include/linux/mm.h           |   4 +
>>  include/linux/sysctl.h       |   1 +
>>  kernel/sysctl.c              |  29 ++-
>>  mm/Makefile                  |   2 +-
>>  mm/copy_pages.c              | 190 +++++++++++++++
>>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>>  8 files changed, 577 insertions(+), 97 deletions(-)
>>  create mode 100644 mm/copy_pages.c
>>
>> --
>> 2.45.2
>>


--
Best Regards,
Yan, Zi