[0/2] Improve Zram by separating compression context from kswapd

Message ID 20250307120141.1566673-1-qun-wei.lin@mediatek.com (mailing list archive)

Qun-Wei Lin March 7, 2025, 12:01 p.m. UTC
This patch series introduces a new mechanism called kcompressd to
improve the efficiency of memory reclaiming in the operating system. The
main goal is to separate the tasks of page scanning and page compression
into distinct processes or threads, thereby reducing the load on the
kswapd thread and enhancing overall system performance under high memory
pressure conditions.

Problem:
 In the current system, the kswapd thread is responsible for both
 scanning the LRU pages and compressing pages into the ZRAM. This
 combined responsibility can lead to significant performance bottlenecks,
 especially under high memory pressure. The kswapd thread becomes a
 single point of contention, causing delays in memory reclaiming and
 overall system performance degradation.

Target:
 The target of this invention is to improve the efficiency of memory
 reclaiming. By separating the tasks of page scanning and page
 compression into distinct processes or threads, the system can handle
 memory pressure more effectively.
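
The split described above can be pictured with a small user-space toy. This is only a sketch under stated assumptions, not the kcompressd implementation: `reclaim` and `compress_worker` are hypothetical names, zlib stands in for whatever compressor zram is configured with, and a Python thread stands in for a kernel thread. The scanning side hands pages to a bounded queue instead of compressing them inline, and a separate worker drains the queue.

```python
import queue
import threading
import zlib

PAGE_SIZE = 4096
_SENTINEL = None  # tells the worker that no more pages are coming


def compress_worker(work: queue.Queue, results: list) -> None:
    """Hypothetical stand-in for the compression thread."""
    while True:
        page = work.get()
        if page is _SENTINEL:
            break
        results.append(zlib.compress(page))


def reclaim(pages: list) -> list:
    """Hypothetical stand-in for the scanning thread."""
    # Bounded queue: the scanner blocks instead of running far ahead
    # of the compressor.
    work = queue.Queue(maxsize=64)
    results = []
    worker = threading.Thread(target=compress_worker, args=(work, results))
    worker.start()
    for page in pages:      # "scan" step: visit each page in order
        work.put(page)      # hand off instead of compressing inline
    work.put(_SENTINEL)
    worker.join()
    return results


# Illustrative input only: 16 pages, each filled with one repeated byte.
pages = [bytes([i % 251]) * PAGE_SIZE for i in range(16)]
compressed = reclaim(pages)
```

With a single worker and a FIFO queue, the results come back in scan order. How far the scanner may run ahead (the queue bound here) is a design choice of this toy, not something the cover letter specifies.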

Patch 1:
- Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and
  SWP_READ_SYNCHRONOUS_IO.

Patch 2:
- Implements the core functionality of kcompressd and makes the necessary
  modifications to the zram driver to support it.

On our handheld devices, we found that applying this mechanism under high
memory pressure can increase the rate of pgsteal_anon per second by over
260% compared with kswapd alone.
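
For readers who want to reproduce this kind of measurement, here is a minimal sketch of how a pgsteal_anon rate and a percentage increase could be computed from two /proc/vmstat snapshots. The helper names are hypothetical, and the sample numbers are invented for illustration; they are not the measurements behind the figure above. Read literally, an increase of 260% means the new rate is 3.6 times the baseline.

```python
def read_counter(vmstat_text: str, name: str = "pgsteal_anon") -> int:
    """Return one counter from text in /proc/vmstat format ("name value")."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == name:
            return int(value)
    raise KeyError(name)


def rate_per_second(before: int, after: int, seconds: float) -> float:
    """Counter delta divided by the sampling interval."""
    return (after - before) / seconds


def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) * 100 / baseline


# Invented snapshots taken 5 seconds apart (illustration only).
SNAPSHOT_T0 = "pgsteal_kswapd 1234\npgsteal_anon 1000\npgsteal_file 500"
SNAPSHOT_T5 = "pgsteal_kswapd 9876\npgsteal_anon 6000\npgsteal_file 700"

rate = rate_per_second(read_counter(SNAPSHOT_T0),
                       read_counter(SNAPSHOT_T5), 5.0)

# On a live Linux system the snapshots would come from the file itself:
#     with open("/proc/vmstat") as f:
#         snapshot = f.read()
```

Running the same sampling with and without the series applied, then passing the two rates to `percent_increase`, gives a figure comparable to the one quoted.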

Qun-Wei Lin (2):
  mm: Split BLK_FEAT_SYNCHRONOUS and SWP_SYNCHRONOUS_IO into separate
    read and write flags
  kcompressd: Add Kcompressd for accelerated zram compression

 drivers/block/brd.c             |   3 +-
 drivers/block/zram/Kconfig      |  11 ++
 drivers/block/zram/Makefile     |   3 +-
 drivers/block/zram/kcompressd.c | 340 ++++++++++++++++++++++++++++++++
 drivers/block/zram/kcompressd.h |  25 +++
 drivers/block/zram/zram_drv.c   |  21 +-
 drivers/nvdimm/btt.c            |   3 +-
 drivers/nvdimm/pmem.c           |   5 +-
 include/linux/blkdev.h          |  24 ++-
 include/linux/swap.h            |  31 +--
 mm/memory.c                     |   4 +-
 mm/page_io.c                    |   6 +-
 mm/swapfile.c                   |   7 +-
 13 files changed, 446 insertions(+), 37 deletions(-)
 create mode 100644 drivers/block/zram/kcompressd.c
 create mode 100644 drivers/block/zram/kcompressd.h

Comments

Barry Song March 7, 2025, 7:34 p.m. UTC | #1
On Sat, Mar 8, 2025 at 1:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
>
> This patch series introduces a new mechanism called kcompressd to
> improve the efficiency of memory reclaiming in the operating system. The
> main goal is to separate the tasks of page scanning and page compression
> into distinct processes or threads, thereby reducing the load on the
> kswapd thread and enhancing overall system performance under high memory
> pressure conditions.
>
> Problem:
>  In the current system, the kswapd thread is responsible for both
>  scanning the LRU pages and compressing pages into the ZRAM. This
>  combined responsibility can lead to significant performance bottlenecks,
>  especially under high memory pressure. The kswapd thread becomes a
>  single point of contention, causing delays in memory reclaiming and
>  overall system performance degradation.
>
> Target:
>  The target of this invention is to improve the efficiency of memory
>  reclaiming. By separating the tasks of page scanning and page
>  compression into distinct processes or threads, the system can handle
>  memory pressure more effectively.

Sounds great. However, we also have a time window where folios under
writeback are kept, whereas previously, writeback was done synchronously
without your patch. This may temporarily increase memory usage until the
kept folios are re-scanned.

So, have you observed that folio_rotate_reclaimable() runs shortly after the
async thread completes compression, and that the kept folios are then
re-scanned soon afterwards?

>
> Patch 1:
> - Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and
>   SWP_READ_SYNCHRONOUS_IO.
>
> Patch 2:
> - Implemented the core functionality of Kcompressd and made necessary
>   modifications to the zram driver to support it.
>
> In our handheld devices, we found that applying this mechanism under high
> memory pressure scenarios can increase the rate of pgsteal_anon per second
> by over 260% compared to the situation with only kswapd.

Sounds really great.

What compression algorithm is being used? I assume that after switching to a
different compression algorithm, the benefits will change significantly. For
example, Zstd might not show as much improvement.
What CPU usage ratio between page scan/unmap and compression did you
observe before applying this patch?


Thanks
Barry

Nhat Pham March 7, 2025, 11:03 p.m. UTC | #2
On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
>
> This patch series introduces a new mechanism called kcompressd to
> improve the efficiency of memory reclaiming in the operating system. The
> main goal is to separate the tasks of page scanning and page compression
> into distinct processes or threads, thereby reducing the load on the
> kswapd thread and enhancing overall system performance under high memory
> pressure conditions.

Please excuse my ignorance, but from your cover letter I still don't
quite get what the problem is here. And how would decoupling compression
and scanning help?

>
> Problem:
>  In the current system, the kswapd thread is responsible for both
>  scanning the LRU pages and compressing pages into the ZRAM. This
>  combined responsibility can lead to significant performance bottlenecks,

What bottleneck are we talking about? Is one stage slower than the other?

>  especially under high memory pressure. The kswapd thread becomes a
>  single point of contention, causing delays in memory reclaiming and
>  overall system performance degradation.
>
> Target:
>  The target of this invention is to improve the efficiency of memory
>  reclaiming. By separating the tasks of page scanning and page
>  compression into distinct processes or threads, the system can handle
>  memory pressure more effectively.

I'm not a zram maintainer, so I'm definitely not trying to stop this
patch. But whatever problem zram is facing will likely occur with
zswap too, so I'd like to learn more :)

Barry Song March 8, 2025, 5:41 a.m. UTC | #3
On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
> >
> > This patch series introduces a new mechanism called kcompressd to
> > improve the efficiency of memory reclaiming in the operating system. The
> > main goal is to separate the tasks of page scanning and page compression
> > into distinct processes or threads, thereby reducing the load on the
> > kswapd thread and enhancing overall system performance under high memory
> > pressure conditions.
>
> Please excuse my ignorance, but from your cover letter I still don't
> quite get what the problem is here. And how would decoupling compression
> and scanning help?

My understanding is as follows:

When kswapd attempts to reclaim M anonymous folios and N file folios,
the process involves the following steps:

* t1: Time to scan and unmap anonymous folios
* t2: Time to compress anonymous folios
* t3: Time to reclaim file folios

Currently, these steps are executed sequentially, meaning the total time
required to reclaim M + N folios is t1 + t2 + t3.

However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel,
reducing the total time to max(t1 + t3, t2). This likely improves the
reclamation speed, potentially reducing allocation stalls.
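
The timing model above can be written out as a small calculation. This is only a sketch of the arithmetic; the function names are hypothetical and the sample times are made up, not measured.

```python
def sequential_time(t1: float, t2: float, t3: float) -> float:
    """Scan/unmap, compress, and file reclaim all run on one thread."""
    return t1 + t2 + t3


def overlapped_time(t1: float, t2: float, t3: float) -> float:
    """Compression (t2) runs in parallel with scan/unmap plus file reclaim."""
    return max(t1 + t3, t2)


def speedup(t1: float, t2: float, t3: float) -> float:
    """Ratio of sequential time to overlapped time."""
    return sequential_time(t1, t2, t3) / overlapped_time(t1, t2, t3)


# Made-up times in arbitrary units: t1 = 2, t2 = 5, t3 = 1.
example = speedup(2.0, 5.0, 1.0)  # (2 + 5 + 1) / max(2 + 1, 5)
```

In this model the gain is largest when t1 + t3 and t2 are close to each other, and it shrinks toward nothing when one side dominates, which is why the cost of the compression step matters to the result.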

I don’t have concrete data on this. Does Qun-Wei have detailed
performance data?

>
> >
> > Problem:
> >  In the current system, the kswapd thread is responsible for both
> >  scanning the LRU pages and compressing pages into the ZRAM. This
> >  combined responsibility can lead to significant performance bottlenecks,
>
> What bottleneck are we talking about? Is one stage slower than the other?
>
> >  especially under high memory pressure. The kswapd thread becomes a
> >  single point of contention, causing delays in memory reclaiming and
> >  overall system performance degradation.
> >
> > Target:
> >  The target of this invention is to improve the efficiency of memory
> >  reclaiming. By separating the tasks of page scanning and page
> >  compression into distinct processes or threads, the system can handle
> >  memory pressure more effectively.
>
> I'm not a zram maintainer, so I'm definitely not trying to stop this
> patch. But whatever problem zram is facing will likely occur with
> zswap too, so I'd like to learn more :)

Right, this is likely something that could be addressed more generally
for zswap and zram.

Thanks
Barry