[PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios

Message ID: 20240816054805.5201-1-kanchana.p.sridhar@intel.com

Sridhar, Kanchana P Aug. 16, 2024, 5:48 a.m. UTC
Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the 
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.11-rc3 in patch 2/4 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, this series attempts to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting mTHPs of any
order.

For instance, the determination of whether a folio is same-filled now maps
an index into the folio to derive the page to be checked. Likewise, a new
function "zswap_store_entry" stores a zswap_entry in the xarray.
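
A minimal sketch of the shape of these two helpers follows. It is
reconstructed only from the description above, not copied from the patch;
the actual signatures and bodies are in patch 2/4 (mm/zswap.c), and
"struct zswap_entry" is zswap's internal entry type:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/xarray.h>

/* True if the page at @index in @folio repeats one word-sized value. */
static bool zswap_is_folio_same_filled(struct folio *folio, long index,
				       unsigned long *value)
{
	unsigned long *data = kmap_local_folio(folio, index * PAGE_SIZE);
	unsigned long val = data[0];
	bool same = true;
	unsigned int pos;

	for (pos = 1; pos < PAGE_SIZE / sizeof(*data); pos++) {
		if (data[pos] != val) {
			same = false;
			break;
		}
	}
	kunmap_local(data);
	if (same)
		*value = val;
	return same;
}

/* Publish @entry in the swap-offset-indexed xarray; returns the old
 * entry (or an xa_err() pointer), exactly like xa_store() itself. */
static void *zswap_store_entry(struct xarray *tree, pgoff_t offset,
			       struct zswap_entry *entry)
{
	return xa_store(tree, offset, entry, GFP_KERNEL);
}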

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
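
For example, with 64K mTHP enabled, the counter for that order can be read
with:

    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout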

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration. We would like to submit that work
in subsequent RFC patch-series, along with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.

Performance Testing:
====================
Testing was done on the v6.11-rc3 mainline, with and without this
patch-series, on a dual-socket Intel Sapphire Rapids server with 56 cores
per socket and 4 IAA devices per socket.

The system has 503 GiB RAM and 176 GiB of swap/ZSWAP, with ZRAM as the
backing swap device. Core frequency was fixed at 2500 MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. Following a similar methodology as in Ryan Roberts'
"Swap-out mTHP without splitting" series [2], 70 usemem processes were
run, each allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    ZRAM Compressor   : LZO-RLE
    SWAP page-cluster : 2
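
For reference, these settings map to the usual sysfs/procfs knobs (paths
assumed to be the standard upstream locations, with "zram0" as the swap
device):

    echo lz4      > /sys/module/zswap/parameters/compressor
    echo zsmalloc > /sys/module/zswap/parameters/zpool
    echo lzo-rle  > /sys/block/zram0/comp_algorithm
    echo 2        > /proc/sys/vm/page-cluster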

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression is
decompressed internally by the "iaa_crypto" driver, the CRCs returned by
the hardware are compared, and errors are reported in case of mismatches.
Thus "deflate-iaa" helps ensure better data integrity than the software
compressors.
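
Verification is controlled by the iaa_crypto driver's "verify_compress"
attribute (assuming the standard driver interface); it can be toggled
with:

    echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress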

Throughput reported by usemem and perf sys time for running the test
are as follows:

 64KB mTHP:
 ==========
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |         16 |           0 |           0 |
 | pswpout                      |  7,770,720 |           0 |           0 |
 | zswpin                       |        547 |         695 |         579 |
 | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,786 |       3,541 |       3,367 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
  -----------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP:
 =======================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
  ------------------------------------------------------------------

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:               |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |  8,628,224 |           0 |           0 |
 | zswpin                         |        678 |      22,733 |       1,641 |
 | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |     16,852 |           0 |           0 |
 | thp_swpout_fallback            |          0 |           0 |           0 |
 | pgmajfault                     |      3,467 |      25,550 |       4,800 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
  -------------------------------------------------------------------------

As expected, in the "Before" experiment, there are relatively fewer
swapouts because ZRAM utilization is not accounted in the cgroup.

With the introduction of mTHP zswap_store, the "After" data reflects the
higher swapout activity, and the consequent throughput/sys time
degradation when LZ4 is used as the zswap compressor. However, we observe
considerable throughput and sys time improvement in the "After" data when
DEFLATE-IAA is the zswap compressor. This observation holds for both the
64K mTHP and 2MB THP experiments. IAA's higher compression ratio and
better compress latency lead to fewer swap-outs and major page-faults,
which in turn result in better throughput and sys time.

Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements and memory savings with IAA as compared to
software compressors.

cgroup zswap kselftest:
=======================

"Before":
=========
  Test run with v6.11-rc3 and no code changes:
    mTHP 64K set to 'always'
    zswap compressor set to 'lz4'
    page-cluster = 3
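
  These settings correspond to the following knobs (paths assumed from the
  standard sysfs/procfs interfaces; the shrinker toggle below is varied
  per sub-run):

    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
    echo lz4    > /sys/module/zswap/parameters/compressor
    echo 3      > /proc/sys/vm/page-cluster
    echo N      > /sys/module/zswap/parameters/shrinker_enabled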

  zswap shrinker_enabled = N:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  # zswpwb_after is 0 while wb is enabled
  not ok 4 test_zswap_writeback_enabled
  # Failed to reclaim all of the requested memory
  not ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  ok 7 test_no_invasive_cgroup_shrink

  zswap shrinker_enabled = Y:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  # zswpwb_after is 0 while wb is enabled
  not ok 4 test_zswap_writeback_enabled
  # Failed to reclaim all of the requested memory
  not ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

"After":
========
  Test run with this patch-series and v6.11-rc3:
    mTHP 64K set to 'always'
    zswap compressor set to 'deflate-iaa'
    page-cluster = 3

  zswap shrinker_enabled = N:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  ok 3 test_zswapin
  ok 4 test_zswap_writeback_enabled
  ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  ok 7 test_no_invasive_cgroup_shrink
  
  zswap shrinker_enabled = Y:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  ok 4 test_zswap_writeback_enabled
  ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

I haven't taken an in-depth look into the cgroup zswap tests, but it
looks like the results with the patch-series are no worse than without,
and are in some cases better (I'm not exactly sure why; this needs more
analysis).

I would greatly appreciate your code review comments and suggestions!

Thanks,
Kanchana

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/


Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.

 include/linux/huge_mm.h |   1 +
 mm/huge_memory.c        |   2 +
 mm/page_io.c            |   7 ++
 mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
 4 files changed, 184 insertions(+), 64 deletions(-)
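
Patches 3/4 and 4/4 wire the new counter into the swap-out path. Below is
a hedged sketch of the accounting call in mm/page_io.c, assuming the
existing count_mthp_stat() machinery and the new MTHP_STAT_ZSWPOUT item
(placement is illustrative; see patch 4/4 for the actual hunk):

	/* swap_writepage(): the whole folio was stored by zswap. */
	if (zswap_store(folio)) {
		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
		folio_unlock(folio);
		return 0;
	}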

Comments

Huang, Ying Aug. 16, 2024, 9:02 a.m. UTC | #1
Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

> [...]
>
> Performance Testing:
> ====================
> Testing was done on the v6.11-rc3 mainline, with and without this
> patch-series, on a dual-socket Intel Sapphire Rapids server with 56 cores
> per socket and 4 IAA devices per socket.
>
> The system has 503 GiB RAM and 176 GiB of swap/ZSWAP, with ZRAM as the
> backing swap device. Core frequency was fixed at 2500 MHz.

I don't think that this is a reasonable test configuration; there's no
benefit to using ZSWAP+ZRAM.  We should use a normal SSD as the backing
swap device.

> [...]
>
>  64KB mTHP:
>  ==========
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |

Because the test configuration isn't reasonable, the performance drop
isn't meaningful either.  We should compare zswap+SSD w/o mTHP zswap
against zswap+SSD w/ mTHP zswap.  I think that there should be a
performance improvement for that.

> [...]

--
Best Regards,
Huang, Ying
Sridhar, Kanchana P Aug. 16, 2024, 5:50 p.m. UTC | #2
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, August 16, 2024 2:03 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> 
> > [...]
> >
> > Performance Testing:
> > ====================
> > Testing was done on the v6.11-rc3 mainline, with and without this
> > patch-series, on a dual-socket Intel Sapphire Rapids server with 56 cores
> > per socket and 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM and 176 GiB of swap/ZSWAP, with ZRAM as the
> > backing swap device. Core frequency was fixed at 2500 MHz.
> 
> I don't think that this is a reasonable test configuration; there's no
> benefit to using ZSWAP+ZRAM.  We should use a normal SSD as the backing
> swap device.

Thanks for this suggestion. Sure, I will gather data using SSD instead of ZRAM
as the backing swap device.

> 
> > [...]
> >
> >  64KB mTHP:
> >  ==========
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
> 
> Because the test configuration isn't reasonable, the performance drop
> isn't meaningful either.  We should compare zswap+SSD w/o mTHP zswap
> against zswap+SSD w/ mTHP zswap.  I think that there should be a
> performance improvement for that.

Sure, I will gather and post the data with these two configurations.

Thanks,
Kanchana
