Message ID | 20240928021620.8369-1-kanchana.p.sridhar@intel.com (mailing list archive)
Series     | mm: zswap swap-out of large folios
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store large
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to mm-unstable as of 9-27-2024 in patch 6 of this series, and
> adapted based on code review comments received for v7 of the current
> patch-series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> The first few patches do the prep work for supporting large folios in
> zswap_store. Patch 6 provides the main functionality to swap-out large
> folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters
> that get incremented upon successful zswap_store of large folios:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> Patch 8 updates the documentation for the new sysfs "zswpout" counters.
>
> This patch-series is a pre-requisite for zswap compress batching of large
> folio swap-out and decompress batching of swap-ins based on
> swapin_readahead(), using Intel IAA hardware acceleration, which we would
> like to submit in subsequent patch-series, with performance improvement
> data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama and Ying for
> their helpful feedback, data reviews and suggestions!
>
> Co-development signoff request:
> ===============================
> I would like to thank Ryan Roberts for his original RFC [1] and request
> his co-developer signoff on patch 6 in this series. Thanks Ryan!

Ryan, could you help Kanchana out with a Signed-off-by please :)

>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 9-27-2024,
> commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000. Data was gathered
> without/with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
> zswap compressors : zstd, deflate-iaa
> zswap allocator   : zsmalloc
> vm.page-cluster   : 2
>
> In the experiments where "deflate-iaa" is used as the zswap compressor,
> IAA "compression verification" is enabled by default
> (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA
> compression will be decompressed internally by the "iaa_crypto" driver, the
> CRCs returned by the hardware will be compared and errors reported in case
> of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors, and the experimental data listed
> below is with verify_compress set to "1".
>
> Total and average throughput are derived from the individual 30 processes'
> throughputs reported by usemem. Elapsed/sys times are measured with perf.
>
> The vm stats and sysfs hugepages stats included with the performance data
> provide details on the swapout activity to zswap/swap device.
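A minimal way to sample the per-order counters that patch 7 exposes, from userspace, assuming a kernel with this series applied (an illustrative sketch only, not part of the series):

#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;
    size_t i;

    /* Per-order zswpout counters added by patch 7 of this series. */
    if (glob("/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout",
             0, NULL, &g))
        return 1;

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        unsigned long long n = 0;

        if (f && fscanf(f, "%llu", &n) == 1)
            printf("%s: %llu\n", g.gl_pathv[i], n);
        if (f)
            fclose(f);
    }
    globfree(&g);
    return 0;
}

Sampling these before and after a usemem run gives the ZSWPOUT-64kB / ZSWPOUT-2MB numbers quoted in the tables below.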
> > > Testing labels used in data summaries: > ====================================== > The data refers to these test configurations and the before/after > comparisons that they do: > > before-case1: > ------------- > mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K) > > In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios to be split > into 4K folios that get processed by zswap. > > before-case2: > ------------- > mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios) > > In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large > folios, which will then be stored by the SSD swap device. > > after: > ------ > v8 of this patch-series, CONFIG_THP_SWAP=Y > > The "after" is CONFIG_THP_SWAP=Y and v8 of this patch-series, that results > in 64K/2M folios to not be split, and to be processed by zswap_store. > > > Regression Testing: > =================== > I ran vm-scalability usemem without large folios, i.e., only 4K folios with > mm-unstable and this patch-series. The main goal was to make sure that > there is no functional or performance regression wrt the earlier zswap > behavior for 4K folios, now that 4K folios will be processed by the new > zswap_store() code. > > The data indicates there is no significant regression. > > ------------------------------------------------------------------------------- > 4K folios: > ========== > > zswap compressor zstd zstd zstd zstd v8 zstd v8 > before-case1 before-case2 after vs. vs. > case1 case2 > ------------------------------------------------------------------------------- > Total throughput (KB/s) 4,793,363 4,880,978 4,813,151 0% -1% > Average throughput (KB/s) 159,778 162,699 160,438 0% -1% > elapsed time (sec) 130.14 123.17 127.21 2% -3% > sys time (sec) 3,135.53 2,985.64 3,110.53 1% -4% > > memcg_high 446,826 444,626 448,231 > memcg_swap_fail 0 0 0 > pswpout 0 0 0 > pswpin 0 0 0 > zswpout 48,932,107 48,931,971 48,931,584 > zswpin 383 386 388 > thp_swpout 0 0 0 > thp_swpout_fallback 0 0 0 > 64kB-mthp_swpout_fallback 0 0 0 > pgmajfault 3,063 3,077 3,082 > swap_ra 93 94 93 > swap_ra_hit 47 47 47 > ZSWPOUT-64kB n/a n/a 0 > SWPOUT-64kB 0 0 0 > ------------------------------------------------------------------------------- > > > Performance Testing: > ==================== > > We list the data for 64K folios with before/after data per-compressor, > followed by the same for 2M pmd-mappable folios. > > > ------------------------------------------------------------------------------- > 64K folios: zstd: > ================= > > zswap compressor zstd zstd zstd zstd v8 > before-case1 before-case2 after vs. vs. > case1 case2 > ------------------------------------------------------------------------------- > Total throughput (KB/s) 5,222,213 1,076,611 6,227,367 19% 478% > Average throughput (KB/s) 174,073 35,887 207,578 19% 478% > elapsed time (sec) 120.50 347.16 109.21 9% 69% The diff here is supposed to be negative, right? (Same for the below results) Otherwise the results are looking really good, we have come a long way since the first version :) Thanks for working on this! I will look at individual patches later today or early next week. 
> > sys time (sec) 2,930.33 248.16 2,609.22 11% -951% > memcg_high 416,773 552,200 482,703 > memcg_swap_fail 3,192,906 1,293 944 > pswpout 0 40,778,448 0 > pswpin 0 16 0 > zswpout 48,931,583 20,903 48,931,271 > zswpin 384 363 392 > thp_swpout 0 0 0 > thp_swpout_fallback 0 0 0 > 64kB-mthp_swpout_fallback 3,192,906 1,293 944 > pgmajfault 3,452 3,072 3,095 > swap_ra 90 87 100 > swap_ra_hit 42 43 56 > ZSWPOUT-64kB n/a n/a 3,057,260 > SWPOUT-64kB 0 2,548,653 0 > ------------------------------------------------------------------------------- > > > ------------------------------------------------------------------------------- > 64K folios: deflate-iaa: > ======================== > > zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8 > before-case1 before-case2 after vs. vs. > case1 case2 > ------------------------------------------------------------------------------- > Total throughput (KB/s) 5,652,608 1,089,180 6,315,000 12% 480% > Average throughput (KB/s) 188,420 36,306 210,500 12% 480% > elapsed time (sec) 102.90 343.35 91.11 11% 73% > > > sys time (sec) 2,246.86 213.53 1,939.31 14% -808% > memcg_high 576,104 502,907 612,505 > memcg_swap_fail 4,016,117 1,407 1,660 > pswpout 0 40,862,080 0 > pswpin 0 20 0 > zswpout 61,163,423 22,444 57,317,607 > zswpin 401 368 449 > thp_swpout 0 0 0 > thp_swpout_fallback 0 0 0 > 64kB-mthp_swpout_fallback 4,016,117 1,407 1,660 > pgmajfault 3,063 3,153 3,167 > swap_ra 96 93 149 > swap_ra_hit 46 45 89 > ZSWPOUT-64kB n/a n/a 3,580,673 > SWPOUT-64kB 0 2,553,880 0 > ------------------------------------------------------------------------------- > > > ------------------------------------------------------------------------------- > 2M folios: zstd: > ================ > > zswap compressor zstd zstd zstd zstd v8 > before-case1 before-case2 after vs. vs. > case1 case2 > ------------------------------------------------------------------------------- > Total throughput (KB/s) 5,895,500 1,109,694 6,460,111 10% 482% > Average throughput (KB/s) 196,516 36,989 215,337 10% 482% > elapsed time (sec) 108.77 334.28 105.92 3% 68% > > > sys time (sec) 2,657.14 94.88 2,436.24 8% -2468% > memcg_high 64,200 66,316 60,300 > memcg_swap_fail 101,182 70 30 > pswpout 0 40,166,400 0 > pswpin 0 0 0 > zswpout 48,931,499 36,507 48,869,236 > zswpin 380 379 397 > thp_swpout 0 78,450 0 > thp_swpout_fallback 101,182 70 30 > 2MB-mthp_swpout_fallback 0 0 0 > pgmajfault 3,067 3,417 4,765 > swap_ra 91 90 5,073 > swap_ra_hit 45 45 5,024 > ZSWPOUT-2MB n/a n/a 95,408 > SWPOUT-2MB 0 78,450 0 > ------------------------------------------------------------------------------- > > > ------------------------------------------------------------------------------- > 2M folios: deflate-iaa: > ======================= > > zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8 > before-case1 before-case2 after vs. vs. 
> case1 case2 > ------------------------------------------------------------------------------- > Total throughput (KB/s) 6,286,587 1,126,785 7,569,560 20% 572% > Average throughput (KB/s) 209,552 37,559 252,318 20% 572% > elapsed time (sec) 96.19 333.03 81.96 15% 75% > > sys time (sec) 2,141.44 99.96 1,768.41 17% -1669% > memcg_high 99,253 64,666 75,139 > memcg_swap_fail 129,074 53 73 > pswpout 0 40,048,128 0 > pswpin 0 0 0 > zswpout 61,312,794 28,321 57,083,119 > zswpin 383 406 447 > thp_swpout 0 78,219 0 > thp_swpout_fallback 129,074 53 73 > 2MB-mthp_swpout_fallback 0 0 0 > pgmajfault 3,430 3,077 7,133 > swap_ra 91 103 11,978 > swap_ra_hit 47 46 11,920 > ZSWPOUT-2MB n/a n/a 111,390 > SWPOUT-2MB 0 78,219 0 > ------------------------------------------------------------------------------- > > And finally, this is a comparison of deflate-iaa vs. zstd with v8 of this > patch-series: > > --------------------------------------------- > zswap_store large folios v8 > Impr w/ deflate-iaa vs. zstd > > 64K folios 2M folios > --------------------------------------------- > Throughput (KB/s) 1% 17% > elapsed time (sec) 17% 23% > sys time (sec) 26% 27% > --------------------------------------------- > > > Conclusions based on the performance results: > ============================================= > > v8 wrt before-case1: > -------------------- > We see significant improvements in throughput, elapsed and sys time for > zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after > (THP_SWAP=Y) with zswap_store large folios. > > v8 wrt before-case2: > -------------------- > We see even more significant improvements in throughput and elapsed time > for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD) > vs. after (large-folio-zswap). The sys time increases with > large-folio-zswap as expected, due to the CPU compression time > vs. asynchronous disk write times, as pointed out by Ying and Yosry. > > In before-case2, when zswap does not store large folios, only allocations > and cgroup charging due to 4K folio zswap stores count towards the cgroup > memory limit. However, in the after scenario, with the introduction of > zswap_store() of large folios, there is an added component of the zswap > compressed pool usage from large folio stores from potentially all 30 > processes, that gets counted towards the memory limit. As a result, we see > higher swapout activity in the "after" data. > > > Summary: > ======== > The v8 data presented above shows that zswap_store of large folios > demonstrates good throughput/performance improvements compared to > conventional SSD swap of large folios with a sufficiently large 525G SSD > swap device. Hence, it seems reasonable for zswap_store to support large > folios, so that further performance improvements can be implemented. > > In the experimental setup used in this patchset, we have enabled IAA > compress verification to ensure additional hardware data integrity CRC > checks not currently done by the software compressors. We see good > throughput/latency improvements with deflate-iaa vs. zstd with zswap_store > of large folios. > > Some of the ideas for further reducing latency that have shown promise in > our experiments, are: > > 1) IAA compress/decompress batching. > 2) Distributing compress jobs across all IAA devices on the socket. > > The tests run for this patchset are using only 1 IAA device per core, that > avails of 2 compress engines on the device. 
In our experiments with IAA > batching, we distribute compress jobs from all cores to the 8 compress > engines available per socket. We further compress the pages in each folio > in parallel in the accelerator. As a result, we improve compress latency > and reclaim throughput. > > In decompress batching, we use swapin_readahead to generate a prefetch > batch of 4K folios that we decompress in parallel in IAA. > > ------------------------------------------------------------------------------ > IAA compress/decompress batching > Further improvements wrt v8 zswap_store Sequential > subpage store using "deflate-iaa": > > "deflate-iaa" Batching "deflate-iaa-canned" [2] Batching > Additional Impr Additional Impr > 64K folios 2M folios 64K folios 2M folios > ------------------------------------------------------------------------------ > Throughput (KB/s) 35% 34% 44% 44% > elapsed time (sec) 9% 10% 14% 17% > sys time (sec) 0.4% 4% 8% 15% > ------------------------------------------------------------------------------ > > > With zswap IAA compress/decompress batching, we are able to demonstrate > significant performance improvements and memory savings in server > scalability experiments in highly contended system scenarios under > significant memory pressure; as compared to software compressors. We hope > to submit this work in subsequent patch series. The current patch-series is > a prequisite for these future submissions. > > Thanks, > Kanchana > > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > [2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/ > > > Changes since v7: > ================= > 1) Rebased to mm-unstable as of 9-27-2024, > commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000. > 2) Added Nhat's 'Reviewed-by' to patches 1 and 2. Thanks Nhat! > 3) Implemented one-time obj_cgroup_may_zswap and zswap_check_limits at the > start of zswap_store. Implemented one-time batch updates to cgroup zswap > charging (with total compressed bytes), zswap_stored_pages and the > memcg/vm zswpout event stats (with folio_nr_pages()) only for successful > stores at the end of zswap_store. Thanks Yosry and Johannes for guidance > on this! > 4) Changed the existing zswap_pool_get() to zswap_pool_tryget(). Modified > zswap_pool_current_get() and zswap_pool_find_get() to call > zswap_pool_tryget(). Furthermore, zswap_store() obtains a reference to a > valid zswap_pool upfront by calling zswap_pool_tryget(), and errors out > if the tryget fails. Added a new zswap_pool_get() that calls > "percpu_ref_get(&pool->ref)" and is called in zswap_store_page(), as > suggested by Johannes & Yosry. Thanks both! > 5) Provided a new count_objcg_events() API for batch event updates. > 6) Changed "zswap_stored_pages" to atomic_long_t to support adding > folio_nr_pages() to it once a large folio is stored successfully. > 7) Deleted the refactoring done in v7 for the xarray updates in > zswap_store_page(); and unwinding of stored offsets in zswap_store() in > case of errors, as suggested by Johannes. > 8) Deleted the CONFIG_ZSWAP_STORE_THP_DEFAULT_ON config option and > "zswap_mthp_enabled" tunable, as recommended by Yosry, Johannes and > Nhat. > 9) Replaced references to "mTHP" with "large folios"; organized > before/after data per-compressor for easier visual comparisons; > incorporated Nhat's feedback in the documentation updates; moved > changelog to the end. Thanks Johannes, Yosry and Nhat! 
> 10) Moved the usemem testing configuration to 30 processes, each allocating > 10G within a 150G memory-limit constrained cgroup, maintaining the > allocated memory for 10 sec before exiting. Thanks Ying for this > suggestion! > > Changes since v6: > ================= > 1) Rebased to mm-unstable as of 9-23-2024, > commit acfabf7e197f7a5bedf4749dac1f39551417b049. > 2) Refactored into smaller commits, as suggested by Yosry and > Chengming. Thanks both! > 3) Reworded the commit log for patches 5 and 6 as per Yosry's > suggestion. Thanks Yosry! > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk > partition. Also, all experiments are run with usemem --sleep 10, so that > the memory allocated by the 70 processes remains in memory > longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for > their help with refining the performance characterization methodology. > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by > Nhat. Thanks Nhat! > > Changes since v5: > ================= > 1) Rebased to mm-unstable as of 8/29/2024, > commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642. > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to > enable/disable zswap_store() of mTHP folios. Thanks Nhat for the > suggestion to add a knob by which users can enable/disable this > change. Nhat, I hope this is along the lines of what you were > thinking. > 3) Added vm-scalability usemem data with 4K folios with > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure > there is no regression with this change. > 4) Added data with usemem with 64K and 2M THP for an alternate view of > before/after, as suggested by Yosry, so we can understand the impact > of when mTHPs are split into 4K folios in shrink_folio_list() > (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored > in zswap. Thanks Yosry for this suggestion. > > Changes since v4: > ================= > 1) Published before/after data with zstd, as suggested by Nhat (Thanks > Nhat for the data reviews!). > 2) Rebased to mm-unstable from 8/27/2024, > commit b659edec079c90012cf8d05624e312d1062b8b87. > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel > robot; as per Nhat's and Michal's suggestion to not require a separate > patch to fix the build errors (thanks both!). > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as > suggested by Yosry (Thanks Yosry!). > 5) Squashed the commits that define new mthp zswpout stat counters, and > invoke count_mthp_stat() after successful zswap_store()s; into a single > commit. Thanks Yosry for this suggestion! > > Changes since v3: > ================= > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > changes to count_mthp_stat() so that it's always defined, even when THP > is disabled. Barry, I have also made one other change in page_io.c > where count_mthp_stat() is called by count_swpout_vm_event(). I would > appreciate it if you can review this. Thanks! > Hopefully this should resolve the kernel robot build errors. > > Changes since v2: > ================= > 1) Gathered usemem data using SSD as the backing swap device for zswap, > as suggested by Ying Huang. Ying, I would appreciate it if you can > review the latest data. Thanks! 
> 2) Generated the base commit info in the patches to attempt to address > the kernel test robot build errors. > 3) No code changes to the individual patches themselves. > > Changes since RFC v1: > ===================== > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > Thanks Barry! > 2) Addressed some of the code review comments that Nhat Pham provided in > Ryan's initial RFC [1]: > - Added a comment about the cgroup zswap limit checks occuring once per > folio at the beginning of zswap_store(). > Nhat, Ryan, please do let me know if the comments convey the summary > from the RFC discussion. Thanks! > - Posted data on running the cgroup suite's zswap kselftest. > 3) Rebased to v6.11-rc3. > 4) Gathered performance data with usemem and the rebased patch-series. > > > > Kanchana P Sridhar (8): > mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined. > mm: zswap: Modify zswap_compress() to accept a page instead of a > folio. > mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget(). > mm: Provide a new count_objcg_events() API for batch event updates. > mm: zswap: Modify zswap_stored_pages to be atomic_long_t. > mm: zswap: Support large folios in zswap_store(). > mm: swap: Count successful large folio zswap stores in hugepage > zswpout stats. > mm: Document the newly added sysfs large folios zswpout stats. > > Documentation/admin-guide/mm/transhuge.rst | 8 +- > fs/proc/meminfo.c | 2 +- > include/linux/huge_mm.h | 1 + > include/linux/memcontrol.h | 24 ++ > include/linux/zswap.h | 2 +- > mm/huge_memory.c | 3 + > mm/page_io.c | 1 + > mm/zswap.c | 254 +++++++++++++++------ > 8 files changed, 219 insertions(+), 76 deletions(-) > > > base-commit: de2fbaa6d9c3576ec7133ed02a370ec9376bf000 > -- > 2.27.0 >
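As a rough illustration of the zswap_store() flow the cover letter and the "Changes since v7" notes describe for patch 6 -- folio-level limit checks and the pool reference taken once up front, one store per subpage, and batched accounting (cgroup charge, zswap_stored_pages, zswpout events) only after the whole folio succeeds -- here is a small userspace mock with stand-in names. This is not the kernel code from the series, just a sketch of the structure:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in counters; in the series these are the objcg zswap charge,
 * the atomic_long_t zswap_stored_pages, and the zswpout event counts. */
static long charged_bytes;
static long stored_pages;
static long zswpout_events;

/* Stand-in for compressing/storing one subpage; < 0 would mean failure. */
static long store_one_subpage(long idx)
{
    return 1000 + (idx % 4) * 128;  /* pretend compressed sizes */
}

static bool mock_store_large_folio(long nr_pages)
{
    long i, total = 0;

    /* One-time, folio-level checks (cgroup may_zswap, zswap limits,
     * taking a reference on the current pool) happen here, not per page. */

    for (i = 0; i < nr_pages; i++) {
        long bytes = store_one_subpage(i);

        if (bytes < 0)
            return false;  /* whole folio falls back to the swap device */
        total += bytes;
    }

    /* Batched accounting, once per successfully stored folio. */
    charged_bytes  += total;
    stored_pages   += nr_pages;
    zswpout_events += nr_pages;
    return true;
}

int main(void)
{
    mock_store_large_folio(16);  /* e.g. a 64K folio = 16 x 4K subpages */
    printf("pages %ld, zswpout events %ld, charged bytes %ld\n",
           stored_pages, zswpout_events, charged_bytes);
    return 0;
}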
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 7:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Ryan Roberts <ryan.roberts@arm.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote:
[..]
> > 64K folios: zstd:
> > =================
> >                              before-case1  before-case2        after   v8 vs.   v8 vs.
> >                                                                         case1    case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s)         5,222,213     1,076,611    6,227,367      19%     478%
> > Average throughput (KB/s)         174,073        35,887      207,578      19%     478%
> > elapsed time (sec)                 120.50        347.16       109.21       9%      69%
>
> The diff here is supposed to be negative, right?
> (Same for the below results)

So this is supposed to be positive to indicate the throughput improvement [(new-old)/old] with v8 as compared to the before-case1 and before-case2. For latency, a positive value indicates the latency reducing, since I calculate [(old-new)/old]. This is the metric used throughout.

Based on this convention, positive percentages are improvements in both throughput and latency.

>
> Otherwise the results are looking really good, we have come a long way
> since the first version :)
>
> Thanks for working on this! I will look at individual patches later
> today or early next week.

Many thanks Yosry :) I immensely appreciate your, Nhat's, Johannes', Ying's and others' help in getting here! Sure, this sounds good.

Thanks,
Kanchana
[..] > > > Performance Testing: > > > ==================== > > > > > > We list the data for 64K folios with before/after data per-compressor, > > > followed by the same for 2M pmd-mappable folios. > > > > > > > > > ------------------------------------------------------------------------------- > > > 64K folios: zstd: > > > ================= > > > > > > zswap compressor zstd zstd zstd zstd v8 > > > before-case1 before-case2 after vs. vs. > > > case1 case2 > > > ------------------------------------------------------------------------------- > > > Total throughput (KB/s) 5,222,213 1,076,611 6,227,367 19% 478% > > > Average throughput (KB/s) 174,073 35,887 207,578 19% 478% > > > elapsed time (sec) 120.50 347.16 109.21 9% 69% > > > > > > The diff here is supposed to be negative, right? > > (Same for the below results) > > So this is supposed to be positive to indicate the throughput improvement > [(new-old)/old] with v8 as compared to the before-case1 and before-case2. > For latency, a positive value indicates the latency reducing, since I calculate > [(old-new)/old]. This is the metric used throughout. > > Based on this convention, positive percentages are improvements in both, > throughput and latency. But you use negative percentages for sys time, we should at least be consistent with this.
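To make the convention concrete with numbers already in the thread (64K folios, zstd), here is a minimal sketch of the two formulas described above; it also shows why a growing sys time comes out negative under the [(old-new)/old] formula, which is the consistency point being raised:

#include <stdio.h>

/* throughput improvement: (new - old) / old, higher is better */
static double thrpt_impr(double o, double n) { return (n - o) / o * 100.0; }
/* time improvement:       (old - new) / old, lower is better  */
static double time_impr(double o, double n)  { return (o - n) / o * 100.0; }

int main(void)
{
    /* v8 vs. before-case1: both formulas give positive values (improvements) */
    printf("throughput %+.0f%%\n", thrpt_impr(5222213, 6227367)); /* ~ +19%  */
    printf("elapsed    %+.0f%%\n", time_impr(120.50, 109.21));    /* ~ +9%   */

    /* v8 vs. before-case2: sys time grows, so the same formula goes negative */
    printf("sys time   %+.0f%%\n", time_impr(248.16, 2609.22));   /* ~ -951% */
    return 0;
}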