Message ID: 20240614221525.19170-1-shivankg@amd.com (mailing list archive)
Series: Enhancements to Page Migration with Batch Offloading via DMA
On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> We conducted experiments to measure folio copy overheads for page
> migration from a remote node to a local NUMA node, modeling page
> promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
>
> Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
> Enabled), 1 NUMA node connected to each socket.
> Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
> THP, compaction, numa_balancing are disabled to reduce interference.
>
> migrate_pages() {      <- t1
>     ..
>                        <- t2
>     folio_copy()
>                        <- t3
>     ..
> }                      <- t4
>
> Overhead fraction, F = (t3-t2)/(t4-t1)
> Measurement: Mean ± SD is measured in cpu_cycles/page
>
> Generic Kernel:
> 4KB::   migrate_pages:17799.00±4278.25 folio_copy:794±232.87 F:0.0478±0.0199
> 2MB::   migrate_pages:3478.42±94.93 folio_copy:493.84±28.21 F:0.1418±0.0050
> 256MB:: migrate_pages:3668.56±158.47 folio_copy:815.40±171.76 F:0.2206±0.0371
> 1GB::   migrate_pages:3769.98±55.79 folio_copy:804.68±60.07 F:0.2132±0.0134
>
> Results with patched kernel:
> 1. Offload disabled - folios batch-move using CPU
> 4KB::   migrate_pages:14941.60±2556.53 folio_copy:799.60±211.66 F:0.0554±0.0190
> 2MB::   migrate_pages:3448.44±83.74 folio_copy:533.34±37.81 F:0.1545±0.0085
> 256MB:: migrate_pages:3723.56±132.93 folio_copy:907.64±132.63 F:0.2427±0.0270
> 1GB::   migrate_pages:3788.20±46.65 folio_copy:888.46±49.50 F:0.2344±0.0107
>
> 2. Offload enabled - folios batch-move using DMAengine
> 4KB::   migrate_pages:46739.80±4827.15 folio_copy:32222.40±3543.42 F:0.6904±0.0423
> 2MB::   migrate_pages:13798.10±205.33 folio_copy:10971.60±202.50 F:0.7951±0.0033
> 256MB:: migrate_pages:13217.20±163.99 folio_copy:10431.20±167.25 F:0.7891±0.0029
> 1GB::   migrate_pages:13309.70±113.93 folio_copy:10410.00±117.77 F:0.7821±0.0023

You haven't measured the important thing though -- what's the cost
_to userspace_?  When the CPU does the copy, the data is now cache-hot
in that CPU's cache.  When the DMA engine does the copy, it's not
cache-hot in any CPU.

Now, this may not be a big problem.  I don't think we do anything to
ensure that the CPU that is going to access the folio in userspace is
the one which does the copy.

But your methodology is wrong.
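(As a reference point for the F metric quoted above: below is a minimal,
self-contained userspace model of the F = (t3-t2)/(t4-t1) calculation.
A memcpy() of a 4 KiB buffer stands in for folio_copy(), and memset()
calls stand in for the surrounding unmap/alloc/remap work; these
stand-ins and the __rdtsc() timing are assumptions for illustration,
not the kernel-side instrumentation behind the numbers in this thread.)

    /*
     * Toy model of the t1..t4 timestamps and the overhead fraction F.
     * Build with: gcc -O1 f_model.c -o f_model
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <x86intrin.h>          /* __rdtsc() */

    #define PAGE_SIZE 4096

    int main(void)
    {
            char *src = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
            char *dst = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

            uint64_t t1 = __rdtsc();
            memset(src, 1, PAGE_SIZE);      /* stand-in: pre-copy work */
            uint64_t t2 = __rdtsc();
            memcpy(dst, src, PAGE_SIZE);    /* stand-in: folio_copy() */
            uint64_t t3 = __rdtsc();
            memset(dst, 0, PAGE_SIZE);      /* stand-in: post-copy work */
            uint64_t t4 = __rdtsc();

            printf("F = %.4f\n", (double)(t3 - t2) / (double)(t4 - t1));
            free(src);
            free(dst);
            return 0;
    }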
Hi Matthew,

On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
>
> You haven't measured the important thing though -- what's the cost
> _to userspace_?  When the CPU does the copy, the data is now
> cache-hot in that CPU's cache.  When the DMA engine does the copy,
> it's not cache-hot in any CPU.
>
> Now, this may not be a big problem.  I don't think we do anything to
> ensure that the CPU that is going to access the folio in userspace
> is the one which does the copy.
>
> But your methodology is wrong.

You're right about the importance of measuring the cost to userspace.
I initially focused on analyzing the folio_copy overheads within
migrate_pages to identify potential optimization opportunities using
DMA hardware accelerators.

To address this, I'm planning to extend my experiments to measure the
cost to userspace, specifically the effect of cache-hotness. This will
involve accessing the migrated pages after the migration completes and
measuring the resulting read/write latency.

DMA offloading could possibly still help in scenarios involving bulk
data copying where the workload size far exceeds the cache capacity or
the copy incurs a large shootdown overhead.

The userspace cost analysis will provide a more comprehensive picture
of page migration using the CPU vs. DMA offloading.

I appreciate your feedback.

Shivank
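(A minimal sketch of that planned measurement, assuming x86 rdtsc for
timing and the libnuma move_pages(2) wrapper; the page count, target
node and 64-byte stride below are illustrative and not taken from the
series.)

    /* Build with: gcc post_move.c -lnuma -o post_move */
    #define _GNU_SOURCE
    #include <numaif.h>             /* move_pages() */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <x86intrin.h>          /* __rdtsc() */

    #define NUM_PAGES 512

    int main(void)
    {
            long page_size = sysconf(_SC_PAGESIZE);
            void *pages[NUM_PAGES];
            int nodes[NUM_PAGES], status[NUM_PAGES];
            volatile long junk = 0;

            for (int i = 0; i < NUM_PAGES; i++) {
                    pages[i] = aligned_alloc(page_size, page_size);
                    memset(pages[i], 1, page_size); /* fault the page in */
                    nodes[i] = 1;                   /* illustrative dst node */
            }

            /* Migrate the pages, then time the first userspace access. */
            if (move_pages(0, NUM_PAGES, pages, nodes, status, MPOL_MF_MOVE) < 0)
                    perror("move_pages");

            uint64_t before = __rdtsc();
            for (int i = 0; i < NUM_PAGES; i++)
                    for (long j = 0; j < page_size; j += 64)  /* one load per cache line */
                            junk += *(long *)((char *)pages[i] + j);
            uint64_t after = __rdtsc();

            printf("post-move access: %.1f cycles/page\n",
                   (double)(after - before) / NUM_PAGES);
            return 0;
    }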
Hi,

On 6/17/2024 5:10 PM, Garg, Shivank wrote:
> Hi Matthew,
>
> On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
>> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
>>
>> You haven't measured the important thing though -- what's the cost
>> _to userspace_?  When the CPU does the copy, the data is now
>> cache-hot in that CPU's cache.  When the DMA engine does the copy,
>> it's not cache-hot in any CPU.
>>
>> Now, this may not be a big problem.  I don't think we do anything to
>> ensure that the CPU that is going to access the folio in userspace
>> is the one which does the copy.
>>
>> But your methodology is wrong.
>
> You're right about the importance of measuring the cost to userspace.
> I initially focused on analyzing the folio_copy overheads within
> migrate_pages to identify potential optimization opportunities using
> DMA hardware accelerators.
>
> To address this, I'm planning to extend my experiments to measure the
> cost to userspace, specifically the effect of cache-hotness. This will
> involve accessing the migrated pages after the migration completes and
> measuring the resulting read/write latency.
>
> DMA offloading could possibly still help in scenarios involving bulk
> data copying where the workload size far exceeds the cache capacity or
> the copy incurs a large shootdown overhead.
>
> The userspace cost analysis will provide a more comprehensive picture
> of page migration using the CPU vs. DMA offloading.
>
> I appreciate your feedback.

I extended my earlier experiments for page migration from a remote node
to a local NUMA node, this time measuring the cost to userspace for
different workload sizes (4KB, 2MB, 512MB, and 1GB).

The experiments capture two scenarios: first, smaller workload sizes
(4KB and 2MB) that fit within the CPU cache; second, larger workload
sizes (512MB and 1GB) that exceed the cache capacity.

move_pages for N pages from src_node=0 to dst_node=1
Measurement: Mean ± SD is reported in cpu cycles per page (normalized
w.r.t. number of pages = N)

move_pages:       cycles taken by the move_pages(2) syscall (cost per page)
uncached_access:  cycles taken to access memory (just after clflush) for
                  pages on the src node
cached_access:    cycles taken to access memory (when everything has been
                  touched previously) for pages on the src node
post_move_access: cycles taken to access memory just after the move_pages
                  syscall (when pages have been moved to the dst node)

Generic Kernel:
4KB::   move_pages:193154.40±50519.59 uncached_access:1269.40±163.11 cached_access:383.00±31.92 post_move_access:420.40±77.04
2MB::   move_pages:4930.36±100.74 uncached_access:793.46±82.39 cached_access:208.59±2.07 post_move_access:181.34±11.55
512MB:: move_pages:4498.93±146.95 uncached_access:656.43±23.08 cached_access:801.93±111.80 post_move_access:402.37±15.26
1GB::   move_pages:4419.88±203.91 uncached_access:627.85±13.24 cached_access:776.01±94.27 post_move_access:384.24±7.33

Results with Patched Kernel:
1. Offload disabled - Folios batch-move using CPU
4KB::   move_pages:206370.20±28303.18 uncached_access:1265.20±141.38 cached_access:385.40±54.32 post_move_access:407.80±52.60
2MB::   move_pages:5110.16±188.60 uncached_access:794.05±72.25 cached_access:208.65±1.75 post_move_access:177.48±9.93
512MB:: move_pages:4548.00±188.91 uncached_access:658.23±23.63 cached_access:777.34±113.15 post_move_access:403.48±17.27
1GB::   move_pages:4521.19±195.13 uncached_access:628.85±14.72 cached_access:750.85±98.22 post_move_access:387.79±9.49
2. Offload enabled - Folios batch-move using DMAengine
4KB::   move_pages:222818.00±22710.80 uncached_access:1277.80±145.74 cached_access:405.20±101.85 post_move_access:427.60±130.13
2MB::   move_pages:15590.80±288.89 uncached_access:799.36±76.60 cached_access:208.79±2.11 post_move_access:183.21±11.67
512MB:: move_pages:14154.06±197.59 uncached_access:649.93±20.35 cached_access:814.10±109.81 post_move_access:403.43±13.79
1GB::   move_pages:14415.04±303.83 uncached_access:629.03±14.83 cached_access:731.16±97.67 post_move_access:385.08±7.62

Code snippet to access memory:

    before = rdtsc();
    for (int i = 0; i < num_pages; i++) {
            for (int j = 0; j < page_size; j += 64) {
                    junk += *(long *)(pages[i] + j);
            }
    }
    after = rdtsc();

Discussion:
1. My analysis revealed no significant difference in post-move access
   times between CPU and DMA migration.
2. For smaller workloads, cached accesses are significantly faster than
   uncached accesses. However, for larger workloads, caches become less
   effective.
3. As expected, post-migration access times are significantly lower due
   to NUMA locality.
4. Just to make sure prefetchers weren't messing with things, I ran
   another test with them turned off. The post-migration access cycles
   for DMA and CPU with the prefetcher disabled are still similar.

Thanks,
Shivank
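(For completeness, a hedged sketch of how the uncached_access numbers
above could be gathered around the same read loop: flush every cache
line with clflush, fence, then time the pass. The helper name and
parameters are illustrative, not taken from the actual test program.)

    #include <stdint.h>
    #include <x86intrin.h>  /* _mm_clflush(), _mm_mfence(), __rdtsc() */

    /* Average cycles per page for a read pass over buffers that have
     * just been evicted from the cache hierarchy. */
    static uint64_t time_uncached_read(char **pages, int num_pages,
                                       long page_size)
    {
            volatile long junk = 0;

            /* Evict every cache line of the buffers before timing. */
            for (int i = 0; i < num_pages; i++)
                    for (long j = 0; j < page_size; j += 64)
                            _mm_clflush(pages[i] + j);
            _mm_mfence();

            uint64_t before = __rdtsc();
            for (int i = 0; i < num_pages; i++)
                    for (long j = 0; j < page_size; j += 64)
                            junk += *(long *)(pages[i] + j);
            uint64_t after = __rdtsc();

            (void)junk;
            return (after - before) / num_pages;
    }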