[RFC,0/5] Enhancements to Page Migration with Batch Offloading via DMA

Message ID 20240614221525.19170-1-shivankg@amd.com (mailing list archive)

Shivank Garg June 14, 2024, 10:15 p.m. UTC
This series introduces enhancements to the page migration code to optimize
the "folio move" operations by batching them and enabling offload to DMA
hardware accelerators.

Page migration involves three key steps:
1. Unmap: Allocate dst folios and replace the src folio PTEs with
migration PTEs.
2. TLB Flush: Flush the TLB for all unmapped folios.
3. Move: Copy the page mappings, flags and contents from src to dst.
Update metadata, lists, refcounts and restore working PTEs.

While the first two steps (setting TLB flush pending for unmapped folios
and TLB batch flush) have already been optimized with batching, this series
focuses on optimizing the folio move step.

In the current design, the folio move operation is performed sequentially
for each folio:
for_each_folio() {
        Copy folio metadata like flags and mappings
        Copy the folio content from src to dst
        Update PTEs with new mappings
}

In the proposed design, we batch the folio copy operations to leverage DMA
offloading. The updated design is as follows:
for_each_folio() {
        Copy folio metadata like flags and mappings
}
Batch copy the page content from src to dst by offloading to DMA engine
for_each_folio() {
        Update PTEs with new mappings
}
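
For illustration only, the three-phase flow above can be modeled in
userspace C. This is a toy sketch, not the code in this series: struct
toy_folio, toy_batch_copy() and toy_batch_move() are hypothetical names,
and a memcpy() loop stands in for the DMA engine submission.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct toy_folio {
	unsigned long flags;	/* stands in for folio flags and mappings */
	void *data;		/* folio contents */
	size_t size;		/* bytes in this folio */
	int mapped;		/* 1 once the "PTEs" point at this folio */
};

/* Phase 2 backend: a CPU memcpy loop standing in for a DMA engine. */
static void toy_batch_copy(struct toy_folio *dst, const struct toy_folio *src,
			   size_t n)
{
	for (size_t i = 0; i < n; i++)
		memcpy(dst[i].data, src[i].data, src[i].size);
}

/* Move n folios from src[] to dst[] in three batched phases. */
static void toy_batch_move(struct toy_folio *dst, struct toy_folio *src,
			   size_t n)
{
	/* Phase 1: copy the metadata of every folio. */
	for (size_t i = 0; i < n; i++) {
		assert(dst[i].size >= src[i].size);
		dst[i].flags = src[i].flags;
	}

	/* Phase 2: one batched content copy (the step that can be offloaded). */
	toy_batch_copy(dst, src, n);

	/* Phase 3: point the mappings at the new folios. */
	for (size_t i = 0; i < n; i++) {
		src[i].mapped = 0;
		dst[i].mapped = 1;
	}
}
```

Because phase 2 sees the whole batch at once, the copy backend can be
swapped without touching the metadata or mapping loops.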

Motivation:
Data copying across NUMA nodes during page migration incurs significant
overhead. For instance, folio copy can take up to 26.6% of the total
migration cost for migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying. Utilizing these hardware accelerators will become essential for
large-scale tiered-memory systems with CXL nodes where lots of page
promotion and demotion can happen.
Following the trend of batching operations in the memory migration core
path (like batch migration and batch TLB flush), batch copying folio data
is a logical progression in this direction.

We conducted experiments to measure folio copy overheads for page
migration from a remote node to a local NUMA node, modeling page
promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).

Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
Enabled), 1 NUMA node connected to each socket.
Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
THP, compaction, numa_balancing are disabled to reduce interference.

migrate_pages() { <- t1
	..
	<- t2
	folio_copy()
	<- t3 
	..
} <- t4

Overhead fraction: F = (t3 - t2) / (t4 - t1)
Measurement: Mean ± SD, in CPU cycles per page
Generic Kernel
Size    migrate_pages       folio_copy          F
4KB     17799.00±4278.25    794±232.87          0.0478±0.0199
2MB     3478.42±94.93       493.84±28.21        0.1418±0.0050
256MB   3668.56±158.47      815.40±171.76       0.2206±0.0371
1GB     3769.98±55.79       804.68±60.07        0.2132±0.0134

Results with patched kernel:
1. Offload disabled - folios batch-move using CPU
Size    migrate_pages       folio_copy          F
4KB     14941.60±2556.53    799.60±211.66       0.0554±0.0190
2MB     3448.44±83.74       533.34±37.81        0.1545±0.0085
256MB   3723.56±132.93      907.64±132.63       0.2427±0.0270
1GB     3788.20±46.65       888.46±49.50        0.2344±0.0107

2. Offload enabled - folios batch-move using DMAengine
Size    migrate_pages       folio_copy          F
4KB     46739.80±4827.15    32222.40±3543.42    0.6904±0.0423
2MB     13798.10±205.33     10971.60±202.50     0.7951±0.0033
256MB   13217.20±163.99     10431.20±167.25     0.7891±0.0029
1GB     13309.70±113.93     10410.00±117.77     0.7821±0.0023
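
For reference, a minimal sketch of how the per-page costs and the
fraction F = (t3 - t2) / (t4 - t1) follow from the four timestamps. The
struct and function names are hypothetical. The sketch assumes one copy
interval per sample; since the folio move is performed per folio, a real
harness would have to accumulate the (t3 - t2) intervals over all folios.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps (in cycles) taken at the points marked t1..t4. */
struct migrate_sample {
	uint64_t t1, t2, t3, t4;
};

/* Cycles per page spent in the whole migrate_pages() call. */
static double total_cycles_per_page(const struct migrate_sample *s,
				    uint64_t nr_pages)
{
	assert(nr_pages > 0 && s->t4 >= s->t1);
	return (double)(s->t4 - s->t1) / (double)nr_pages;
}

/* Cycles per page spent copying folio contents. */
static double copy_cycles_per_page(const struct migrate_sample *s,
				   uint64_t nr_pages)
{
	assert(nr_pages > 0 && s->t3 >= s->t2);
	return (double)(s->t3 - s->t2) / (double)nr_pages;
}

/* Fraction of the migration time spent in the copy. */
static double overhead_fraction(const struct migrate_sample *s)
{
	assert(s->t4 > s->t1 && s->t3 >= s->t2);
	return (double)(s->t3 - s->t2) / (double)(s->t4 - s->t1);
}
```

Note that the mean of per-run F values generally differs from the ratio
of the mean folio_copy cost to the mean migrate_pages cost.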

Discussion:
The DMAEngine achieved a net throughput of 768MB/s. Additional optimizations
are needed to make DMA offloading beneficial compared to CPU-based
migration. These could include parallelism, specialized DMA hardware,
and asynchronous and speculative data migration.
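
As a back-of-envelope aid, a per-page cycle cost can be converted into a
copy throughput. The helper below is a hypothetical sketch; it assumes a
fixed clock rate and a fixed page size, and it is not necessarily how the
768MB/s figure above was derived.

```c
#include <assert.h>

/* Bytes per second implied by a per-page cycle cost at a fixed clock rate. */
static double bytes_per_second(double cycles_per_page, double page_bytes,
			       double cpu_hz)
{
	assert(cycles_per_page > 0.0 && page_bytes > 0.0 && cpu_hz > 0.0);
	/* seconds per page = cycles_per_page / cpu_hz */
	return page_bytes * cpu_hz / cycles_per_page;
}
```

With 4096-byte pages, the 2 GHz clock from the setup, and the 1GB
folio_copy cost of 10410.00 cycles/page with offload enabled, this gives
roughly 787 million bytes per second.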

Status:
Current patchset is functional, except for non-LRU folios.

Dependencies:
1. This series is based on Linux-v6.8.
2. Patches 1, 2 and 3 contain the preparatory work and implementation for
batching the folio move. Patch 4 adds support for DMA offload.
3. DMA hardware and driver support are required to enable DMA offload.
Without suitable support, CPU is used for batch migration. Requirements
are described in Patch 4.
4. Patch 5 adds a DMA driver using DMAengine APIs for end-to-end
testing and validation. 

Testing:
The patch series has been tested with migrate_pages(2) and move_pages(2)
using anonymous memory and memory-mapped files.

Byungchul Park (1):
  mm: separate move/undo doing on folio list from migrate_pages_batch()

Mike Day (1):
  mm: add support for DMA folio Migration

Shivank Garg (3):
  mm: add folios_copy() for copying pages in batch during migration
  mm: add migrate_folios_batch_move to batch the folio move operations
  dcbm: add dma core batch migrator for batch page offloading

 drivers/dma/Kconfig         |   2 +
 drivers/dma/Makefile        |   1 +
 drivers/dma/dcbm/Kconfig    |   7 +
 drivers/dma/dcbm/Makefile   |   1 +
 drivers/dma/dcbm/dcbm.c     | 229 +++++++++++++++++++++
 include/linux/migrate_dma.h |  36 ++++
 include/linux/mm.h          |   1 +
 mm/Kconfig                  |   8 +
 mm/Makefile                 |   1 +
 mm/migrate.c                | 385 +++++++++++++++++++++++++++++++-----
 mm/migrate_dma.c            |  51 +++++
 mm/util.c                   |  22 +++
 12 files changed, 692 insertions(+), 52 deletions(-)
 create mode 100644 drivers/dma/dcbm/Kconfig
 create mode 100644 drivers/dma/dcbm/Makefile
 create mode 100644 drivers/dma/dcbm/dcbm.c
 create mode 100644 include/linux/migrate_dma.h
 create mode 100644 mm/migrate_dma.c

Comments

Matthew Wilcox June 15, 2024, 4:02 a.m. UTC | #1
On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> We conducted experiments to measure folio copy overheads for page
> migration from a remote node to a local NUMA node, modeling page
> promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
> 
> Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
> Enabled), 1 NUMA node connected to each socket.
> Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
> THP, compaction, numa_balancing are disabled to reduce interfernce.
> 
> migrate_pages() { <- t1
> 	..
> 	<- t2
> 	folio_copy()
> 	<- t3 
> 	..
> } <- t4
> 
> overheads Fraction, F= (t3-t2)/(t4-t1)
> Measurement: Mean ± SD is measured in cpu_cycles/page
> Generic Kernel
> 4KB::   migrate_pages:17799.00±4278.25  folio_copy:794±232.87  F:0.0478±0.0199
> 2MB::   migrate_pages:3478.42±94.93  folio_copy:493.84±28.21  F:0.1418±0.0050
> 256MB:: migrate_pages:3668.56±158.47  folio_copy:815.40±171.76  F:0.2206±0.0371
> 1GB::   migrate_pages:3769.98±55.79  folio_copy:804.68±60.07  F:0.2132±0.0134
> 
> Results with patched kernel:
> 1. Offload disabled - folios batch-move using CPU
> 4KB::   migrate_pages:14941.60±2556.53  folio_copy:799.60±211.66  F:0.0554±0.0190
> 2MB::   migrate_pages:3448.44±83.74  folio_copy:533.34±37.81  F:0.1545±0.0085
> 256MB:: migrate_pages:3723.56±132.93  folio_copy:907.64±132.63  F:0.2427±0.0270
> 1GB::   migrate_pages:3788.20±46.65  folio_copy:888.46±49.50  F:0.2344±0.0107
> 
> 2. Offload enabled - folios batch-move using DMAengine
> 4KB::   migrate_pages:46739.80±4827.15  folio_copy:32222.40±3543.42  F:0.6904±0.0423
> 2MB::   migrate_pages:13798.10±205.33  folio_copy:10971.60±202.50  F:0.7951±0.0033
> 256MB:: migrate_pages:13217.20±163.99  folio_copy:10431.20±167.25  F:0.7891±0.0029
> 1GB::   migrate_pages:13309.70±113.93  folio_copy:10410.00±117.77  F:0.7821±0.0023

You haven't measured the important thing though -- what's the cost _to
userspace_?  When the CPU does the copy, the data is now cache-hot in
that CPU's cache.  When the DMA engine does the copy, it's not cache-hot
in any CPU.

Now, this may not be a big problem.  I don't think we do anything to
ensure that the CPU that is going to access the folio in userspace is
the one which does the copy.

But your methodology is wrong.

Shivank Garg June 17, 2024, 11:40 a.m. UTC | #2
Hi Matthew,

On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:

> 
> You haven't measured the important thing though -- what's the cost
> _to userspace_?  When the CPU does the copy, the data is now
> cache-hot in that CPU's cache.  When the DMA engine does the copy,
> it's not cache-hot in any CPU.
> 
> Now, this may not be a big problem.  I don't think we do anything to 
> ensure that the CPU that is going to access the folio in userspace
> is the one which does the copy.
> 
> But your methodology is wrong.

You're right about the importance of measuring the cost to userspace.
I initially focused on analyzing the folio_copy overheads within migrate_pages to identify potential optimization opportunities using DMA hardware accelerators.

To address this, I'm planning to extend my experiments to measure the cost to userspace specifically related to cache-hotness. This will involve accessing the migrated pages after the migration process is complete, and measuring the resulting read/write latency.

DMA offloading could possibly help in scenarios that involve bulk data copying with workload size >> cache capacity, or that incur a large shootdown overhead.

The userspace cost analysis will provide a more comprehensive picture of page migration using CPU vs. DMA offloading.

I appreciate your feedback.

Shivank

Shivank Garg June 25, 2024, 8:57 a.m. UTC | #3
Hi,

On 6/17/2024 5:10 PM, Garg, Shivank wrote:
> Hi Matthew,
> 
> On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
>> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> 
>>
>> You haven't measured the important thing though -- what's the cost
>> _to userspace_?  When the CPU does the copy, the data is now
>> cache-hot in that CPU's cache.  When the DMA engine does the copy,
>> it's not cache-hot in any CPU.
>>
>> Now, this may not be a big problem.  I don't think we do anything to 
>> ensure that the CPU that is going to access the folio in userspace
>> is the one which does the copy.
>>
>> But your methodology is wrong.
> 
> You're right about importance of measuring the cost to userspace.
> I initially focused on analyzing the folio_copy overheads within migrate_pages to identify potential optimizations opportunities using DMA hardware accelerators.
> 
> To address this, I'm planning extend my experiments to measure the cost to userspace specifically related to cache-hotness. This will involve the accessing the migrated pages after the migration process is complete, and measuring the resulting latency to read/write.
> 
> This approach of DMA-offloading could possibly help in scenarios involving bulk data copying with workload size >> cache capacity or incurs a large shootdown overhead.
> 
> The userspace cost analysis will provide a more comprehensive picture of page-migration using CPU v/s DMA-offloading.
> 
> I appreciate your feedback.



I extended my earlier experiments for page migration from a remote node to
a local NUMA node. This involves measuring the cost to userspace for
different workload sizes (4KB, 2MB, 512MB, and 1GB).
My experiments capture two scenarios: first, smaller workload sizes (4KB
and 2MB) that fit within the CPU cache; second, larger workload sizes
(512MB and 1GB) that exceed cache capacity.

move_pages for N pages from src_node=0 to dst_node=1

Measurement: Mean ± SD is reported in cpu cycles per page (normalized
w.r.t. number of pages = N)

move_pages: Cycles taken by move_pages(2) syscall (cost per page)
uncached_access: Cycles taken to access memory (just after clflush) for pages
on src node 1.
cached_access: Cycles taken to access memory (when everything is previously
touched) for pages on src node 1.
post_move_access: Cycles taken to access memory just after move_pages syscall
(when pages are moved to dst node 0)
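
A minimal sketch of how the move_pages cost per page could be timed from
userspace. This is an assumption about the harness, not the code used for
the numbers below: timed_move_pages() and now_ns() are hypothetical
helpers, the raw syscall is used to avoid a libnuma dependency, and a
monotonic nanosecond clock stands in for rdtsc().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Same value as MPOL_MF_MOVE in <linux/mempolicy.h>. */
#define TOY_MPOL_MF_MOVE (1 << 1)

/* Nanoseconds from a monotonic clock; a portable stand-in for rdtsc(). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Time one move_pages(2) call on the calling process and normalize by
 * the number of pages. nodes[i] is the target node of pages[i]; passing
 * nodes == NULL moves nothing and only reports each page's current node.
 * Per-page results are written to status[]. Returns the syscall result.
 */
static long timed_move_pages(unsigned long count, void **pages,
			     const int *nodes, int *status,
			     double *ns_per_page)
{
	assert(count > 0);

	uint64_t start = now_ns();
	long rc = syscall(SYS_move_pages, 0L /* calling process */, count,
			  pages, nodes, status, (long)TOY_MPOL_MF_MOVE);
	uint64_t end = now_ns();

	*ns_per_page = (double)(end - start) / (double)count;
	return rc;
}
```

A caller would fill nodes[] with the destination node for every page,
touch the pages first so they are populated, and then check status[] for
per-page errors before trusting the timing.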

Generic Kernel:
Size    move_pages            uncached_access   cached_access    post_move_access
4KB     193154.40±50519.59    1269.40±163.11    383.00±31.92     420.40±77.04
2MB     4930.36±100.74        793.46±82.39      208.59±2.07      181.34±11.55
512MB   4498.93±146.95        656.43±23.08      801.93±111.80    402.37±15.26
1GB     4419.88±203.91        627.85±13.24      776.01±94.27     384.24±7.33

Results with Patched Kernel:
1. Offload disabled - Folios batch-move using CPU
Size    move_pages            uncached_access   cached_access    post_move_access
4KB     206370.20±28303.18    1265.20±141.38    385.40±54.32     407.80±52.60
2MB     5110.16±188.60        794.05±72.25      208.65±1.75      177.48±9.93
512MB   4548.00±188.91        658.23±23.63      777.34±113.15    403.48±17.27
1GB     4521.19±195.13        628.85±14.72      750.85±98.22     387.79±9.49

2. Offload enabled - Folios batch-move using DMAengine
Size    move_pages            uncached_access   cached_access    post_move_access
4KB     222818.00±22710.80    1277.80±145.74    405.20±101.85    427.60±130.13
2MB     15590.80±288.89       799.36±76.60      208.79±2.11      183.21±11.67
512MB   14154.06±197.59       649.93±20.35      814.10±109.81    403.43±13.79
1GB     14415.04±303.83       629.03±14.83      731.16±97.67     385.08±7.62

Code snippet to access memory:
before = rdtsc();
for (int i = 0; i < num_pages; i++) {
	for (int j = 0; j < page_size; j += 64) {
		junk += *(long *)(pages[i] + j);
	}
}
after = rdtsc();
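
For anyone who wants to reproduce the access measurement, here is a
self-contained variant of the loop above. It is a sketch under
assumptions, not the exact harness: the helper names are hypothetical,
the flush step models the uncached_access case, and on non-x86 builds it
falls back to a nanosecond clock and skips the flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
static inline uint64_t read_cycles(void) { return __rdtsc(); }
static inline void flush_line(const void *p) { _mm_clflush(p); }
static inline void flush_fence(void) { _mm_mfence(); }
#else
/* Assumption: elsewhere, use a nanosecond clock and make the flush a no-op. */
#include <time.h>
static inline uint64_t read_cycles(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
static inline void flush_line(const void *p) { (void)p; }
static inline void flush_fence(void) { }
#endif

#define CACHE_LINE 64

/* Evict every cache line of every page (the uncached_access case). */
static void flush_pages(void **pages, size_t num_pages, size_t page_size)
{
	for (size_t i = 0; i < num_pages; i++)
		for (size_t j = 0; j < page_size; j += CACHE_LINE)
			flush_line((const char *)pages[i] + j);
	flush_fence();	/* wait for the flushes before timing starts */
}

/* Read one long per cache line; return elapsed cycles per page. */
static double access_cycles_per_page(void **pages, size_t num_pages,
				     size_t page_size, long *junk)
{
	long sum = 0;

	assert(num_pages > 0);
	uint64_t before = read_cycles();
	for (size_t i = 0; i < num_pages; i++)
		for (size_t j = 0; j < page_size; j += CACHE_LINE)
			sum += *(const volatile long *)((const char *)pages[i] + j);
	uint64_t after = read_cycles();

	*junk = sum;	/* hand the sum back so the loads are not optimized away */
	return (double)(after - before) / (double)num_pages;
}
```

Calling flush_pages() before access_cycles_per_page() models
uncached_access; calling the access function twice in a row models
cached_access on the second call.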

Discussion:
1. My analysis revealed no significant difference in post-move access times
between CPU and DMA migration.
2. For smaller workloads, cached accesses are significantly faster than
uncached accesses. However, for larger workloads, caches become less effective.
3. As expected, post-migration access times are significantly lower due to
NUMA locality.
4. Just to make sure prefetchers weren't messing with things, I ran another
test with them turned off. The post-migration access cycles for DMA and CPU
with prefetcher-disabled are still similar.

Thanks,
Shivank