[v2,0/6] mm: zswap: global shrinker fix and proactive shrink

Message ID 20240706022523.1104080-1-flintglass@gmail.com (mailing list archive)

Message

Takero Funaki July 6, 2024, 2:25 a.m. UTC
This series addresses some issues and introduces a minor improvement in
the zswap global shrinker.

This version addresses issues discovered during the review of v1:
https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
and includes three additional patches to fix another issue uncovered by
applying v1.

Changes in v2:
- Added 3 patches to reduce resource contention.
mm: zswap: fix global shrinker memcg iteration:
- Change the loop style (Yosry, Nhat, Shakeel)
- To not skip online memcgs, save zswap_last_shrink to detect cursor changes by the cleaner.  (Yosry, Nhat, Shakeel)
mm: zswap: fix global shrinker error handling logic:
- Change error code for no-writeback memcg. (Yosry)
- Use nr_scanned to check if lru is empty. (Yosry)

Changes in v1:
mm: zswap: fix global shrinker memcg iteration:
- Drop and reacquire spinlock before skipping a memcg.
- Add some comment to clarify the locking mechanism.
mm: zswap: proactive shrinking before pool size limit is hit:
- Remove unneeded check before scheduling work.
- Change shrink start threshold to accept_thr_percent + 1%.


Patches
===============================

1. Fix the memcg iteration logic that aborts iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking at accept threshold + 1%.
4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag to not block
   pageout.
5. Store incompressible pages as-is to accept all pages.
6. Interrupt the shrinker to avoid blocking page-in/out.

Patches 1 to 3 should be applied in this order to avoid potential loops
caused by the first issue. Patch 3 and later can be applied
independently, but the first two issues must be resolved to ensure the
shrinker can evict pages. Patches 4 to 6 address resource contention
issues in the current shrinker uncovered by applying patch 3.

With this series applied, the shrinker will proactively continue
evicting pages until the pool drops below accept_threshold_percent, as
documented in patch 3. As a side effect of the changes to the
hysteresis logic, zswap will no longer reject pages while under the max
pool limit.

The series is rebased on mainline 6.10-rc6.

Proactive shrinking (Patch 1 to 3)
===============================

Visible issue to resolve
-------------------------------
The visible issue with the current global shrinker is that pageout/in
operations from active processes are slow when zswap is near its max
pool size. This is particularly significant on small memory systems
where total swap usage exceeds what zswap can store. This results in old
pages occupying most of the zswap pool space, with recent pages using
the swap disk directly.

Root cause of the issue
-------------------------------
This issue is caused by zswap maintaining the pool size near 100%. Since
the shrinker fails to shrink the pool to accept_threshold_percent, zswap
rejects incoming pages more frequently than it should. The rejected
pages are directly written to disk while zswap protects old pages from
eviction, leading to slow pageout/in performance for recent pages.

Changes to resolve the issue
-------------------------------
If the pool size were shrunk proactively, rejections due to pool limit
hits would be less likely. New incoming pages could be accepted as the
pool gains some space in advance, while older pages are written back in
the background. zswap would then be filled with recent pages, as
expected by the LRU logic.

Patches 1 and 2 ensure the shrinker reduces the pool to the
accept_threshold_percent. Without patch 2, shrink_worker() stops
evicting pages on the 16th encounter with a memcg that has no stored
pages (e.g., tiny services managed under systemd).
Patch 3 makes zswap_store() trigger the shrinker before reaching the max
pool size.
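
As an illustration, the patch 3 trigger boils down to something like
the sketch below (not the actual diff; zswap_total_pages(),
zswap_max_pages(), shrink_wq and zswap_shrink_work are approximations
of the mm/zswap.c internals, and this would live inside mm/zswap.c):

/* Kick the global shrinker once the pool crosses
 * accept_thr_percent + 1%, instead of waiting for the hard
 * max_pool_percent limit.
 */
static bool zswap_should_start_shrink(void)
{
	unsigned long start = zswap_max_pages() *
			      (zswap_accept_thr_percent + 1) / 100;

	return zswap_total_pages() > start;
}

/* called at the end of a successful zswap_store() */
static void zswap_maybe_start_shrink(void)
{
	if (zswap_should_start_shrink())
		queue_work(shrink_wq, &zswap_shrink_work);
}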

With this series, zswap will prepare some space to reduce the
probability of problematic pool_limit_hit situations, thus reducing
slow reclaim and LRU priority inversion.
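
For reference, the patch 2 fix amounts to not counting expected
conditions toward the shrinker's failure limit. A simplified sketch of
the shrink_worker() loop (not the actual diff; locking and memcg
reference handling are omitted, helper names approximate mm/zswap.c,
and the error codes are only illustrative):

	struct mem_cgroup *memcg = NULL;
	int failures = 0;

	do {
		int ret;

		/* simplified round-robin walk; the real code keeps a
		 * persistent cursor (zswap_next_shrink) across runs */
		memcg = mem_cgroup_iter(NULL, memcg, NULL);
		ret = shrink_memcg(memcg);

		if (ret == -ENOENT || ret == -EINVAL)
			continue;  /* empty LRU or no-writeback memcg: expected */
		if (ret && ++failures == MAX_RECLAIM_RETRIES)
			break;     /* only unexpected errors abort the walk */

		cond_resched();
	} while (zswap_total_pages() > zswap_accept_thr_pages());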

Visible changes
-------------------------------
Once proactive shrinking reduces the pool size, pageouts complete
instantly as long as the space prepared by proactive shrinking can store
the reclaimed pages. If an admin sees a large pool_limit_hit, lowering
the accept_threshold_percent will improve active process performance.
The optimal point depends on the tradeoff between active process
pageout/in performance and background services' pagein performance.

Benchmark
-------------------------------
The benchmark below observes active process performance under memory
pressure, coexisting with background services occupying the zswap pool.

The tests were run on a VM with 2 vCPUs, 1GB RAM + 4GB swap, a 20% max
pool and a 50% accept threshold (about 100MB of proactive shrinking),
and zsmalloc/lz4. This test favors the active process over the
background processes to demonstrate my intended workload.

It runs `apt -qq update` 10 times under a 60MB memcg memory.max
constraint to emulate an active process running under pressure. The
zswap pool is filled prior to the test by allocating a large amount of
memory (~1.5GB) to emulate inactive background services.

The counters below are from vmstat and zswap debugfs stats. The times
are average seconds using /usr/bin/time -v.

pre-patch, 6.10-rc4
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 845         | 845         | 0          |
| pool_total_size    | 201539584   | 201539584   | 0          |
| stored_pages       | 63138       | 63138       | 0          |
| written_back_pages | 12          | 12          | 0          |
| pswpin             | 387         | 32412       | 32025      |
| pswpout            | 153078      | 197138      | 44060      |
| zswpin             | 0           | 0           | 0          |
| zswpout            | 63150       | 63150       | 0          |
| zswpwb             | 12          | 12          | 0          |

| Time              |              |
|-------------------|--------------|
| elapsed           | 8.473        |
| user              | 1.881        |
| system            | 0.814        |

post-patch, 6.10-rc4 with patch 1 to 5
|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 81861       | 81861       | 0          |
| pool_total_size    | 75001856    | 87117824    | 12115968   |
| reject_reclaim_fail| 0           | 32          | 32         |
| same_filled_pages  | 135         | 135         | 0          |
| stored_pages       | 23919       | 27549       | 3630       |
| written_back_pages | 97665       | 106994      | 10329      |
| pswpin             | 4981        | 8223        | 3242       |
| pswpout            | 179554      | 188883      | 9329       |
| zswpin             | 293863      | 590014      | 296151     |
| zswpout            | 455338      | 764882      | 309544     |
| zswpwb             | 97665       | 106994      | 10329      |

| Time              |              |
|-------------------|--------------|
| elapsed           | 4.525        |
| user              | 1.839        |
| system            | 1.467        |

Although pool_limit_hit did not increase in either case, zswap_store()
rejected pages before this patch. Note that, before this series,
zswap_store() did not increment pool_limit_hit on rejections caused by
the limit-hit hysteresis (only the first few hits were counted).

From the pre-patch result, the existing zswap global shrinker cannot
write back effectively and locks the old pages in the pool. The
pswpin/out indicates the active process uses the swap device directly.

From the post-patch result, zswpin/out/wb are increased as expected,
indicating the active process uses zswap and the old pages of the
background services are evicted from the pool. pswpin/out are
significantly reduced from pre-patch results.


System responsiveness issue (patch 4 to 6)
===============================
After applying patches 1 to 3, I encountered severe responsiveness
degradation while the zswap global shrinker was running under heavy
memory pressure.

Visible issue to resolve
-------------------------------
The visible issue happens with patches 1 to 3 applied when a large
amount of memory allocation occurs and zswap cannot store the incoming
pages. While the global shrinker is writing back pages, the system
stops responding as if it were under heavy memory thrashing.

This issue is less likely to happen without patches 1 to 3 or when
zswap is disabled. I believe this is because the global shrinker could
not write back a meaningful amount of pages, as described in patch 2.

Root cause and changes to resolve the issue
-------------------------------
It seems that the root cause of this responsiveness issue is the zswap
shrinker blocking the IO needed for memory reclaim and page faults. I
introduced three patches to reduce possible blocking in the following
problematic situations:

1. Contention on workqueue thread pools by shrink_worker() using
WQ_MEM_RECLAIM unnecessarily.

Although the shrinker runs simultaneously with memory reclaim, shrinking
is not required to reclaim memory since zswap_store() can reject pages
without interfering with memory reclaim progress. shrink_worker() should
not use WQ_MEM_RECLAIM and should be delayed when another work in
WQ_MEM_RECLAIM is reclaiming memory. The existing code requires
allocating memory inside shrink_worker(), potentially blocking other
latency-sensitive reclaim work.
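
A minimal sketch of the patch 4 change at workqueue creation (the
current mainline flags are shown approximately; WQ_MEM_RECLAIM reserves
a rescuer thread and ties the work into the reclaim forward-progress
guarantee, which the shrinker does not need):

	/* before (approximately): guaranteed to run even under pressure */
	shrink_wq = alloc_workqueue("zswap-shrink",
				    WQ_UNBOUND | WQ_MEM_RECLAIM, 1);

	/* after (patch 4 idea): background writeback is optional work,
	 * so let it wait behind genuine reclaim work */
	shrink_wq = alloc_workqueue("zswap-shrink", WQ_UNBOUND, 1);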

2. Contention on swap IO.

Since zswap_writeback_entry() performs write IO in 4KB pages, it
consumes a lot of IOPS, increasing the IO latency of swapout/in. We
should not perform IO for background shrinking while zswap_store() is
rejecting pages or zswap_load() is failing to find stored pages. This
series implements two mitigation logics to reduce the IO contention:

2-a. Do not reject pages in zswap_store().
This is mostly achieved by patch 3. With patch 3, zswap can prepare
space proactively and accept pages while the global shrinker is running.

To further reduce rejections, patch 5 (store incompressible pages) is
added. When zsmalloc is used, we can accept incompressible pages with
small memory overhead. It is a minor optimization, but I think it is
worth implementing. It does not improve performance with the current
zbud, but it does not incur a performance penalty either.
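
The patch 5 idea can be sketched as follows (illustrative only; the
hypothetical helper below stands in for the zswap_compress() path,
which in reality manages per-CPU buffers and zpool object mapping):

/*
 * Copy either the compressed buffer or, when compression failed or did
 * not shrink the page, the raw page itself into the allocated zpool
 * object.  Returns the length actually stored.
 */
static unsigned int zswap_fill_obj(void *dst, struct page *page,
				   void *comp_buf, unsigned int comp_len,
				   int comp_ret)
{
	if (!comp_ret && comp_len < PAGE_SIZE) {
		memcpy(dst, comp_buf, comp_len);
		return comp_len;
	}
	/* Incompressible: store the page as-is so it stays on the zswap
	 * LRU and can still be written back later in LRU order. */
	memcpy_from_page(dst, page, 0, PAGE_SIZE);
	return PAGE_SIZE;
}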

2-b. Interrupt writeback while pagein/out.
Once zswap runs out of prepared space, we cannot accept incoming pages,
incurring direct writes to the swap disk. At this moment, the shrinker
is proactively evicting pages, leading to IO contention with memory
reclaim.

Performing low-priority IO is straightforward but requires
reimplementing a low-priority version of __swap_writepage(). Instead, in
patch 6, I implemented a heuristic, delaying the next zswap writeback
based on the elapsed time since zswap_store() rejected a page.

When zswap_store() hits the max pool size and rejects pages,
swap_writepage() immediately performs the writeback to disk. The
current jiffies value is saved to tell shrink_worker() to sleep for up
to ZSWAP_GLOBAL_SHRINK_DELAY msec.

The same logic is applied to zswap_load(). When zswap cannot find a
page in the stored pool, the pagein requires read IO from the swap
device. The global shrinker should be interrupted here as well.

This patch proposes a constant delay of 500 milliseconds, aligning with
the mq-deadline target latency.
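
A sketch of the heuristic (not the actual diff;
ZSWAP_GLOBAL_SHRINK_DELAY is the constant named above, the other
identifiers are illustrative):

#define ZSWAP_GLOBAL_SHRINK_DELAY	500	/* msec */

static unsigned long zswap_last_swapdev_io;	/* jiffies */

/* called when zswap_store() rejects a page or zswap_load() misses */
static void zswap_note_swapdev_io(void)
{
	WRITE_ONCE(zswap_last_swapdev_io, jiffies);
}

/* called by shrink_worker() before each writeback batch */
static void zswap_shrink_backoff(void)
{
	unsigned long timeout = READ_ONCE(zswap_last_swapdev_io) +
				msecs_to_jiffies(ZSWAP_GLOBAL_SHRINK_DELAY);

	/* back off while pagein/out is likely still using the swap device */
	while (time_before(jiffies, timeout)) {
		msleep(jiffies_to_msecs(timeout - jiffies));
		timeout = READ_ONCE(zswap_last_swapdev_io) +
			  msecs_to_jiffies(ZSWAP_GLOBAL_SHRINK_DELAY);
	}
}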

Visible change
-------------------------------
With patches 4 to 6, the global shrinker pauses the writeback while
pagein/out operations are using the swap device. This change reduces
resource contention and makes memory reclaim/faults complete faster,
thereby reducing system responsiveness degradation. 

Intended scenario for memory reclaim:
1. zswap pool < accept_threshold as the initial state. This is achieved
   by patch 3, proactive shrinking.
2. Active processes start allocating pages. Pageout is buffered by zswap
   without IO.
3. zswap reaches shrink_start_threshold. zswap continues to buffer
   incoming pages and starts writeback immediately in the background.
4. zswap reaches max pool size. zswap interrupts the global shrinker and
   starts rejecting pages. Write IO for the rejected page will consume
   all IO resources.
5. Active processes stop allocating pages. After the delay, the shrinker
   resumes writeback until the accept threshold.

Benchmark
-------------------------------
To demonstrate that the shrinker writeback is not interfering with
pagein/out operations, I measured the elapsed time of allocating 2GB of
3/4 compressible data by a Python script, averaged over 10 runs:

|                      | elapsed| user  | sys   |
|----------------------|--------|-------|-------|
| With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
| With all patches     | 11.17  | 0.116 | 1.490 |
| zswap off (baseline) | 11.81  | 0.149 | 1.381 |

Although this test cannot distinguish responsiveness issues caused by
zswap writeback from normal memory thrashing between plain pagein/out,
the difference from the baseline indicates that the patches reduced
performance degradation on pageout caused by zswap writeback.

The tests were run on kernel 6.10-rc5 on a VM with 1GB RAM (idling Azure
VM with persistent block swap device), 2 vCPUs, zsmalloc/lz4, 25% max
pool, and 50% accept threshold.

---


Takero Funaki (6):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit
  mm: zswap: make writeback run in the background
  mm: zswap: store incompressible page as-is
  mm: zswap: interrupt shrinker writeback while pagein/out IO

 Documentation/admin-guide/mm/zswap.rst |  17 +-
 mm/zswap.c                             | 264 ++++++++++++++++++++-----
 2 files changed, 219 insertions(+), 62 deletions(-)

Comments

Andrew Morton July 6, 2024, 5:32 p.m. UTC | #1
On Sat,  6 Jul 2024 02:25:16 +0000 Takero Funaki <flintglass@gmail.com> wrote:

> This series addresses some issues and introduces a minor improvement in
> the zswap global shrinker.

Seems that patches 1 & 2 might be worthy of backporting into earlier
kernels?  Could you please provide a description of the
userspace-visible effects of the bugs so the desirability of such an
action can be better understood?
Takero Funaki July 7, 2024, 10:54 a.m. UTC | #2
On Sun, Jul 7, 2024 at 2:32 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> Seems that patches 1 & 2 might be worthy of backporting into earlier
> kernels?  Could you please provide a description of the
> userspace-visible effects of the bugs so the desirability of such an
> action can be better understood?

Patches 1 and 2 partially resolve a zswap global shrinker issue that
leads to performance degradation on small systems.  However, the fix
uncovers another issue addressed in patches 3 to 6.  Backporting only
the two patches can be a tradeoff with possible performance
degradation in some cases, and I am not sure that possible regression
is acceptable.

The visible issue is described in the cover letter:

> Visible issue to resolve
> -------------------------------
> The visible issue with the current global shrinker is that pageout/in
> operations from active processes are slow when zswap is near its max
> pool size. This is particularly significant on small memory systems
> where total swap usage exceeds what zswap can store. This results in old
> pages occupying most of the zswap pool space, with recent pages using
> the swap disk directly.
>
> Root cause of the issue
> -------------------------------
> This issue is caused by zswap maintaining the pool size near 100%. Since
> the shrinker fails to shrink the pool to accept_threshold_percent, zswap
> rejects incoming pages more frequently than it should. The rejected
> pages are directly written to disk while zswap protects old pages from
> eviction, leading to slow pageout/in performance for recent pages.

Patches 1 and 2 partially resolve the issue by fixing the iteration
logic. With the two patches applied, the zswap shrinker starts evicting
pages once the pool limit is hit, as described in the current zswap
documentation. However, this fix alone might not improve performance,
since it lacks the proactive shrinking required to prepare space before
the pool limit is hit, which is implemented in patch 3.

Unfortunately, the fix uncovers another issue described in the bottom
half of the cover letter. Because the shrinker performs writeback
simultaneously with pageout for rejected pages, the shrinker delays
actual memory reclaim unnecessarily. The first issue masked the second
by virtually disabling the global shrinker writeback. I think the
second issue only occurs under severe memory pressure, but may degrade
pageout performance as shown in the benchmark at the bottom of the
cover letter.
Nhat Pham July 9, 2024, 12:53 a.m. UTC | #3
On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@gmail.com> wrote:
>
>

Woah, thanks for the copious details :)

> This series addresses some issues and introduces a minor improvement in
> the zswap global shrinker.
>
> This version addresses issues discovered during the review of v1:
> https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@gmail.com/
> and includes three additional patches to fix another issue uncovered by
> applying v1.
>
> Changes in v2:
> - Added 3 patches to reduce resource contention.
> mm: zswap: fix global shrinker memcg iteration:
> - Change the loop style (Yosry, Nhat, Shakeel)
> - To not skip online memcg, save zswap_last_shrink to detect cursor change by cleaner.  (Yosry, Nhat, Shakeel)
> mm: zswap: fix global shrinker error handling logic:
> - Change error code for no-writeback memcg. (Yosry)
> - Use nr_scanned to check if lru is empty. (Yosry)
>
> Changes in v1:
> mm: zswap: fix global shrinker memcg iteration:
> - Drop and reacquire spinlock before skipping a memcg.
> - Add some comment to clarify the locking mechanism.
> mm: zswap: proactive shrinking before pool size limit is hit:
> - Remove unneeded check before scheduling work.
> - Change shrink start threshold to accept_thr_percent + 1%.
>
>
> Patches
> ===============================
>
> 1. Fix the memcg iteration logic that abort iteration on offline memcgs.
> 2. Fix the error path that aborts on expected error codes.
> 3. Add proactive shrinking at accept threshold + 1%.
> 4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag to not block
>    pageout.
> 5. Store incompressible pages as-is to accept all pages.
> 6. Interrupt the shrinker to avoid blocking page-in/out.
>
> Patches 1 to 3 should be applied in this order to avoid potential loops
> caused by the first issue. Patch 3 and later can be applied
> independently, but the first two issues must be resolved to ensure the
> shrinker can evict pages. Patches 4 to 6 address resource contention
> issues in the current shrinker uncovered by applying patch 3.
>
> With this series applied, the shrinker will continue to evict pages
> until the accept_threshold_percent proactively, as documented in patch
> 3. As a side effect of changes in the hysteresis logic, zswap will no
> longer reject pages under the max pool limit.
>
> The series is rebased on mainline 6.10-rc6.
>
>
> The counters below are from vmstat and zswap debugfs stats. The times
> are average seconds using /usr/bin/time -v.
>
> pre-patch, 6.10-rc4
> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 845         | 845         | 0          |
> | pool_total_size    | 201539584   | 201539584   | 0          |
> | stored_pages       | 63138       | 63138       | 0          |
> | written_back_pages | 12          | 12          | 0          |
> | pswpin             | 387         | 32412       | 32025      |
> | pswpout            | 153078      | 197138      | 44060      |
> | zswpin             | 0           | 0           | 0          |
> | zswpout            | 63150       | 63150       | 0          |
> | zswpwb             | 12          | 12          | 0          |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 8.473        |
> | user              | 1.881        |
> | system            | 0.814        |
>
> post-patch, 6.10-rc4 with patch 1 to 5

You mean 1 to 6? There are 6 patches, no?

Just out of pure curiosity, could you include the stats from patch 1-3 only?

> |                    | Start       | End         | Delta      |
> |--------------------|-------------|-------------|------------|
> | pool_limit_hit     | 81861       | 81861       | 0          |
> | pool_total_size    | 75001856    | 87117824    | 12115968   |
> | reject_reclaim_fail| 0           | 32          | 32         |
> | same_filled_pages  | 135         | 135         | 0          |
> | stored_pages       | 23919       | 27549       | 3630       |
> | written_back_pages | 97665       | 106994      | 10329      |
> | pswpin             | 4981        | 8223        | 3242       |
> | pswpout            | 179554      | 188883      | 9329       |

The pswpin delta actually decreases. Nice :)

> | zswpin             | 293863      | 590014      | 296151     |
> | zswpout            | 455338      | 764882      | 309544     |
> | zswpwb             | 97665       | 106994      | 10329      |
>
> | Time              |              |
> |-------------------|--------------|
> | elapsed           | 4.525        |
> | user              | 1.839        |
> | system            | 1.467        |
>
> Although the pool_limit_hit is not increased in both cases,
> zswap_store() rejected pages before this patch. Note that, before this
> series, zswap_store() did not increment pool_limit_hit on rejection by
> limit hit hysteresis (only the first few hits were counted).

Yeah hysteresis + the broken global shrinker puts the system in a very
bad state. Thanks for showing this.

>
> From the pre-patch result, the existing zswap global shrinker cannot
> write back effectively and locks the old pages in the pool. The
> pswpin/out indicates the active process uses the swap device directly.
>
> From the post-patch result, zswpin/out/wb are increased as expected,
> indicating the active process uses zswap and the old pages of the
> background services are evicted from the pool. pswpin/out are
> significantly reduced from pre-patch results.

Lovely :)

>
>
> System responsiveness issue (patch 4 to 6)
> ===============================
> After applying patches 1 to 3, I encountered severe responsiveness
> degradation while zswap global shrinker is running under heavy memory
> pressure.
>
> Visible issue to resolve
> -------------------------------
> The visible issue happens with patches 1 to 3 applied when a large
> amount of memory allocation happens and zswap cannot store the incoming
> pages.
> While global shrinker is writing back pages, system stops responding as
> if under heavy memory thrashing.
>
> This issue is less likely to happen without patches 1 to 3 or zswap is
> disabled. I believe this is because the global shrinker could not write
> back a meaningful amount of pages, as described in patch 2.
>
> Root cause and changes to resolve the issue
> -------------------------------
> It seems that zswap shrinker blocking IO for memory reclaim and faults
> is the root cause of this responsiveness issue. I introduced three
> patches to reduce possible blocking in the following problematic
> situations:
>
> 1. Contention on workqueue thread pools by shrink_worker() using
> WQ_MEM_RECLAIM unnecessarily.
>
> Although the shrinker runs simultaneously with memory reclaim, shrinking
> is not required to reclaim memory since zswap_store() can reject pages
> without interfering with memory reclaim progress. shrink_worker() should
> not use WQ_MEM_RECLAIM and should be delayed when another work in
> WQ_MEM_RECLAIM is reclaiming memory. The existing code requires
> allocating memory inside shrink_worker(), potentially blocking other
> latency-sensitive reclaim work.
>
> 2. Contention on swap IO.
>
> Since zswap_writeback_entry() performs write IO in 4KB pages, it
> consumes a lot of IOPS, increasing the IO latency of swapout/in. We
> should not perform IO for background shrinking while zswap_store() is
> rejecting pages or zswap_load() is failing to find stored pages. This
> series implements two mitigation logics to reduce the IO contention:
>
> 2-a. Do not reject pages in zswap_store().
> This is mostly achieved by patch 3. With patch 3, zswap can prepare
> space proactively and accept pages while the global shrinker is running.
>
> To avoid rejection further, patch 5 (store incompressible pages) is
> added. This reduces rejection by storing incompressible pages. When
> zsmalloc is used, we can accept incompressible pages with small memory
> overhead. It is a minor optimization, but I think it is worth
> implementing. This does not improve performance on current zbud but does
> not incur a performance penalty.
>
> 2-b. Interrupt writeback while pagein/out.
> Once zswap runs out of prepared space, we cannot accept incoming pages,
> incurring direct writes to the swap disk. At this moment, the shrinker
> is proactively evicting pages, leading to IO contention with memory
> reclaim.
>
> Performing low-priority IO is straightforward but requires
> reimplementing a low-priority version of __swap_writepage(). Instead, in
> patch 6, I implemented a heuristic, delaying the next zswap writeback
> based on the elapsed time since zswap_store() rejected a page.
>
> When zswap_store() hits the max pool size and rejects pages,
> swap_writepage() immediately performs the writeback to disk. The time
> jiffies is saved to tell shrink_worker() to sleep up to
> ZSWAP_GLOBAL_SHRINK_DELAY msec.
>
> The same logic applied to zswap_load(). When zswap cannot find a page in
> the stored pool, pagein requires read IO from the swap device. The
> global shrinker should be interrupted here.
>
> This patch proposes a constant delay of 500 milliseconds, aligning with
> the mq-deadline target latency.
>
> Visible change
> -------------------------------
> With patches 4 to 6, the global shrinker pauses the writeback while
> pagein/out operations are using the swap device. This change reduces
> resource contention and makes memory reclaim/faults complete faster,
> thereby reducing system responsiveness degradation.

Ah this is interesting. Did you actually see improvement in your real
deployment (i.e not the benchmark) with patch 4-6 in?

>
> Intended scenario for memory reclaim:
> 1. zswap pool < accept_threshold as the initial state. This is achieved
>    by patch 3, proactive shrinking.
> 2. Active processes start allocating pages. Pageout is buffered by zswap
>    without IO.
> 3. zswap reaches shrink_start_threshold. zswap continues to buffer
>    incoming pages and starts writeback immediately in the background.
> 4. zswap reaches max pool size. zswap interrupts the global shrinker and
>    starts rejecting pages. Write IO for the rejected page will consume
>    all IO resources.

This sounds like the proactive shrinker is still not aggressive
enough, and/or there are some sort of misspecifications of the zswap
setting... Correct me if I'm wrong, but the new proactive global
shrinker begins 1% after the acceptance threshold, and shrinks down to
acceptance threshold, right? How are we still hitting the pool
limit...

My concern is that we are knowingly (and perhaps unnecessarily)
creating an LRU inversion here - preferring swapping out the rejected
pages over the colder pages in the zswap pool. Shouldn't it be the
other way around? For instance, can we spiral into the following
scenario:

1. zswap pool becomes full.
2. Memory is still tight, so anonymous memory will be reclaimed. zswap
keeps rejecting incoming pages, and putting a hold on the global
shrinker.
3. The pages that are swapped out are warmer than the ones stored in
the zswap pool, so they will be more likely to be swapped in (which,
IIUC, will also further delay the global shrinker).

and the cycle keeps going on and on?

Have you experimented with synchronous reclaim in the case the pool is
full? All the way to the acceptance threshold is too aggressive of
course - you might need to find something in between :)

> 5. Active processes stop allocating pages. After the delay, the shrinker
>    resumes writeback until the accept threshold.
>
> Benchmark
> -------------------------------
> To demonstrate that the shrinker writeback is not interfering with
> pagein/out operations, I measured the elapsed time of allocating 2GB of
> 3/4 compressible data by a Python script, averaged over 10 runs:
>
> |                      | elapsed| user  | sys   |
> |----------------------|--------|-------|-------|
> | With patches 1 to 3  | 13.10  | 0.183 | 2.049 |
> | With all patches     | 11.17  | 0.116 | 1.490 |
> | zswap off (baseline) | 11.81  | 0.149 | 1.381 |
>
> Although this test cannot distinguish responsiveness issues caused by
> zswap writeback from normal memory thrashing between plain pagein/out,
> the difference from the baseline indicates that the patches reduced
> performance degradation on pageout caused by zswap writeback.
>

I wonder if this contention would show up in PSI metrics
(/proc/pressure/io, or the cgroup variants if you use them ). Maybe
correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
with IO pressure to show the pattern, i.e the contention problem was
there before, and is now resolved? :)
Takero Funaki July 10, 2024, 10:26 p.m. UTC | #4
On Tue, Jul 9, 2024 at 9:53 AM Nhat Pham <nphamcs@gmail.com> wrote:

> > post-patch, 6.10-rc4 with patch 1 to 5
>
> You mean 1 to 6? There are 6 patches, no?

oops. with patches 1 to 6.

>
> Just out of pure curiosity, could you include the stats from patch 1-3 only?
>

I will rerun the bench in v3. I assume this bench does not reflect
patches 4 to 6, as delta pool_limit_hit=0 means no rejection from
zswap.

> Ah this is interesting. Did you actually see improvement in your real
> deployment (i.e not the benchmark) with patch 4-6 in?
>

As replied in patch 6, yes, with memory-consuming tasks like `apt upgrade`, for instance.

> >
> > Intended scenario for memory reclaim:
> > 1. zswap pool < accept_threshold as the initial state. This is achieved
> >    by patch 3, proactive shrinking.
> > 2. Active processes start allocating pages. Pageout is buffered by zswap
> >    without IO.
> > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> >    incoming pages and starts writeback immediately in the background.
> > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> >    starts rejecting pages. Write IO for the rejected page will consume
> >    all IO resources.
>
> This sounds like the proactive shrinker is still not aggressive
> enough, and/or there are some sort of misspecifications of the zswap
> setting... Correct me if I'm wrong, but the new proactive global
> shrinker begins 1% after the acceptance threshold, and shrinks down to
> acceptance threshold, right? How are we still hitting the pool
> limit...
>

Proactive shrinking should not be aggressive. With patches 4 and 6, I
modified the global shrinker to be less aggressive against pagein/out.
Shrinking proactively cannot avoid hitting the pool limit when memory
pressure grows faster than writeback can keep up.

> My concern is that we are knowingly (and perhaps unnecessarily)
> creating an LRU inversion here - preferring swapping out the rejected
> pages over the colder pages in the zswap pool. Shouldn't it be the
> other way around? For instance, can we spiral into the following
> scenario:
>
> 1. zswap pool becomes full.
> 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> keeps rejecting incoming pages, and putting a hold on the global
> shrinker.
> 3. The pages that are swapped out are warmer than the ones stored in
> the zswap pool, so they will be more likely to be swapped in (which,
> IIUC, will also further delay the global shrinker).
>
> and the cycle keeps going on and on?

I agree this does not follow LRU, but I think the LRU priority
inversion is unavoidable once the pool limit is hit.
The accept_thr_percent should be lowered to reduce the probability of
LRU inversion if it matters. (That is why I implemented the proactive
shrinker.)

When the writeback throughput is slower than memory usage grows,
zswap_store() will have to reject pages sooner or later.
If we evict the oldest stored pages synchronously before rejecting a
new page (rotating the pool to keep LRU order), it will affect latency
depending on how much writeback is required to store the new page. If the oldest
pages were compressed well, we would have to evict too many pages to
store a warmer page, which blocks the reclaim progress. Fragmentation
in the zspool may also increase the required writeback amount.
We cannot both maintain LRU priority and keep pageout latency low.

Additionally, zswap_writeback_entry() is slower than direct pageout. I
assume this is because the shrinker performs 4KB IO synchronously. I am
seeing that shrinking throughput is limited to disk IOPS * 4KB, while
much higher throughput can be achieved by disabling zswap. Direct
pageout can be faster than zswap writeback, possibly because of bio
optimization or sequential allocation of swap.


> Have you experimented with synchronous reclaim in the case the pool is
> full? All the way to the acceptance threshold is too aggressive of
> course - you might need to find something in between :)
>

I don't get what the expected situation is.
The benchmark of patch 6 is performing synchronous reclaim in the case
the pool is full, since bulk memory allocation (write to mmapped
space) is much faster than writeback throughput. The zswap pool is
filled instantly at the beginning of benchmark runs. The
accept_thr_percent is not significant for the benchmark, I think.


>
> I wonder if this contention would show up in PSI metrics
> (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> with IO pressure to show the pattern, i.e the contention problem was
> there before, and is now resolved? :)

Unfortunately, I could not find a reliable metric other than elapsed
time. It seems PSI does not distinguish stalls for rejected pageout
from stalls for shrinker writeback.
For counters, this issue affects latency but does not increase the
number of pagein/out. Is there any better way to observe the origin of
contention?

Thanks.
Nhat Pham July 12, 2024, 11:02 p.m. UTC | #5
On Wed, Jul 10, 2024 at 3:27 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 9:53 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> > > post-patch, 6.10-rc4 with patch 1 to 5
> >
> > You mean 1 to 6? There are 6 patches, no?
>
> oops. with patches 1 to 6.
>
> >
> > Just out of pure curiosity, could you include the stats from patch 1-3 only?
> >
>
> I will rerun the bench in v3. I assume this bench does not reflect
> patches 4 to 6, as delta pool_limit_hit=0 means no rejection from
> zswap.
>
> > Ah this is interesting. Did you actually see improvement in your real
> > deployment (i.e not the benchmark) with patch 4-6 in?
> >
>
> As replied in patch 6, memory consuming tasks like `apt upgrade` for instance.
>
> > >
> > > Intended scenario for memory reclaim:
> > > 1. zswap pool < accept_threshold as the initial state. This is achieved
> > >    by patch 3, proactive shrinking.
> > > 2. Active processes start allocating pages. Pageout is buffered by zswap
> > >    without IO.
> > > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> > >    incoming pages and starts writeback immediately in the background.
> > > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> > >    starts rejecting pages. Write IO for the rejected page will consume
> > >    all IO resources.
> >
> > This sounds like the proactive shrinker is still not aggressive
> > enough, and/or there are some sort of misspecifications of the zswap
> > setting... Correct me if I'm wrong, but the new proactive global
> > shrinker begins 1% after the acceptance threshold, and shrinks down to
> > acceptance threshold, right? How are we still hitting the pool
> > limit...
> >
>
> Proactive shrinking should not be aggressive. With patches 4 and 6, I
> modified the global shrinker to be less aggressive against pagein/out.
> Shrinking proactively cannot avoid hitting the pool limit when memory
> pressure grows faster.
>
> > My concern is that we are knowingly (and perhaps unnecessarily)
> > creating an LRU inversion here - preferring swapping out the rejected
> > pages over the colder pages in the zswap pool. Shouldn't it be the
> > other way around? For instance, can we spiral into the following
> > scenario:
> >
> > 1. zswap pool becomes full.
> > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > keeps rejecting incoming pages, and putting a hold on the global
> > shrinker.
> > 3. The pages that are swapped out are warmer than the ones stored in
> > the zswap pool, so they will be more likely to be swapped in (which,
> > IIUC, will also further delay the global shrinker).
> >
> > and the cycle keeps going on and on?
>
> I agree this does not follow LRU, but I think the LRU priority
> inversion is unavoidable once the pool limit is hit.
> The accept_thr_percent should be lowered to reduce the probability of
> LRU inversion if it matters. (it is why I implemented proactive
> shrinker.)

And yet, in your own benchmark it fails to prevent that, no? I think
you lower it all the way down to 50%.

>
> When the writeback throughput is slower than memory usage grows,
> zswap_store() will have to reject pages sooner or later.
> If we evict the oldest stored pages synchronously before rejecting a
> new page (rotating pool to keep LRU), it will affect latency depending
> how much writeback is required to store the new page. If the oldest
> pages were compressed well, we would have to evict too many pages to
> store a warmer page, which blocks the reclaim progress. Fragmentation
> in the zspool may also increase the required writeback amount.
> We cannot accomplish both maintaining LRU priority and maintaining
> pageout latency.

Hmm yeah, I guess this is fair. Looks like there is not a lot of
choice, if you want to maintain decent pageout latency...

I could suggest that you have a budgeted zswap writeback on zswap
store - i.e if the pool is full, then try to zswap writeback until we
have enough space or if the budget is reached. But that feels like
even more engineering - the IO priority approach might even be easier
at that point LOL.

Oh well, global shrinker delay it is :)

>
> Additionally, zswap_writeback_entry() is slower than direct pageout. I
> assume this is because shrinker performs 4KB IO synchronously. I am
> seeing shrinking throughput is limited by disk IOPS * 4KB while much
> higher throughput can be achieved by disabling zswap. direct pageout
> can be faster than zswap writeback, possibly because of bio
> optimization or sequential allocation of swap.

Hah, this is interesting!

I wonder though, if the solution here is to perform some sort of
batching for zswap writeback.

BTW, what is the type of the storage device you are using for swap? Is
it SSD or HDD etc?

>
>
> > Have you experimented with synchronous reclaim in the case the pool is
> > full? All the way to the acceptance threshold is too aggressive of
> > course - you might need to find something in between :)
> >
>
> I don't get what the expected situation is.
> The benchmark of patch 6 is performing synchronous reclaim in the case
> the pool is full, since bulk memory allocation (write to mmapped
> space) is much faster than writeback throughput. The zswap pool is
> filled instantly at the beginning of benchmark runs. The
> accept_thr_percent is not significant for the benchmark, I think.

No. I meant synchronous reclaim as in triggering zswap writeback
within the zswap store path, to make space for the incoming new zswap
pages. But you already addressed it above :)


>
>
> >
> > I wonder if this contention would show up in PSI metrics
> > (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> > correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> > with IO pressure to show the pattern, i.e the contention problem was
> > there before, and is now resolved? :)
>
> Unfortunately, I could not find a reliable metric other than elapsed
> time. It seems PSI does not distinguish stalls for rejected pageout
> from stalls for shrinker writeback.
> For counters, this issue affects latency but does not increase the
> number of pagein/out. Is there any better way to observe the origin of
> contention?
>
> Thanks.
Takero Funaki July 15, 2024, 8:20 a.m. UTC | #6
On Sat, Jul 13, 2024 at 8:02 AM Nhat Pham <nphamcs@gmail.com> wrote:

> >
> > I agree this does not follow LRU, but I think the LRU priority
> > inversion is unavoidable once the pool limit is hit.
> > The accept_thr_percent should be lowered to reduce the probability of
> > LRU inversion if it matters. (it is why I implemented proactive
> > shrinker.)
>
> And yet, in your own benchmark it fails to prevent that, no? I think
> you lower it all the way down to 50%.
>
> >
> > When the writeback throughput is slower than memory usage grows,
> > zswap_store() will have to reject pages sooner or later.
> > If we evict the oldest stored pages synchronously before rejecting a
> > new page (rotating pool to keep LRU), it will affect latency depending
> > how much writeback is required to store the new page. If the oldest
> > pages were compressed well, we would have to evict too many pages to
> > store a warmer page, which blocks the reclaim progress. Fragmentation
> > in the zspool may also increase the required writeback amount.
> > We cannot accomplish both maintaining LRU priority and maintaining
> > pageout latency.
>
> Hmm yeah, I guess this is fair. Looks like there is not a lot of
> choice, if you want to maintain decent pageout latency...
>
> I could suggest that you have a budgeted zswap writeback on zswap
> store - i.e if the pool is full, then try to zswap writeback until we
> have enough space or if the budget is reached. But that feels like
> even more engineering - the IO priority approach might even be easier
> at that point LOL.
>
> Oh well, global shrinker delay it is :)
>
> >
> > Additionally, zswap_writeback_entry() is slower than direct pageout. I
> > assume this is because shrinker performs 4KB IO synchronously. I am
> > seeing shrinking throughput is limited by disk IOPS * 4KB while much
> > higher throughput can be achieved by disabling zswap. direct pageout
> > can be faster than zswap writeback, possibly because of bio
> > optimization or sequential allocation of swap.
>
> Hah, this is interesting!
>
> I wonder though, if the solution here is to perform some sort of
> batching for zswap writeback.
>
> BTW, what is the type of the storage device you are using for swap? Is
> it SSD or HDD etc?
>

It was tested on an Azure VM with SSD-backed storage. The total IOPS
was capped at 4K IOPS by the VM host. The max throughput of the global
shrinker was around 16 MB/s. Proactive shrinking cannot prevent
pool_limit_hit since memory allocation can be on the order of GB/s.
(The benchmark script allocates 2 GB sequentially, which was
compressed to 1.3 GB, while the zswap pool was limited to 200 MB.)


> >
> >
> > > Have you experimented with synchronous reclaim in the case the pool is
> > > full? All the way to the acceptance threshold is too aggressive of
> > > course - you might need to find something in between :)
> > >
> >
> > I don't get what the expected situation is.
> > The benchmark of patch 6 is performing synchronous reclaim in the case
> > the pool is full, since bulk memory allocation (write to mmapped
> > space) is much faster than writeback throughput. The zswap pool is
> > filled instantly at the beginning of benchmark runs. The
> > accept_thr_percent is not significant for the benchmark, I think.
>
> No. I meant synchronous reclaim as in triggering zswap writeback
> within the zswap store path, to make space for the incoming new zswap
> pages. But you already addressed it above :)
>
>
> >
> >
> > >
> > > I wonder if this contention would show up in PSI metrics
> > > (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> > > correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> > > with IO pressure to show the pattern, i.e the contention problem was
> > > there before, and is now resolved? :)
> >
> > Unfortunately, I could not find a reliable metric other than elapsed
> > time. It seems PSI does not distinguish stalls for rejected pageout
> > from stalls for shrinker writeback.
> > For counters, this issue affects latency but does not increase the
> > number of pagein/out. Is there any better way to observe the origin of
> > contention?
> >
> > Thanks.
Yosry Ahmed July 17, 2024, 2:53 a.m. UTC | #7
[..]
>
> > My concern is that we are knowingly (and perhaps unnecessarily)
> > creating an LRU inversion here - preferring swapping out the rejected
> > pages over the colder pages in the zswap pool. Shouldn't it be the
> > other way around? For instance, can we spiral into the following
> > scenario:
> >
> > 1. zswap pool becomes full.
> > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > keeps rejecting incoming pages, and putting a hold on the global
> > shrinker.
> > 3. The pages that are swapped out are warmer than the ones stored in
> > the zswap pool, so they will be more likely to be swapped in (which,
> > IIUC, will also further delay the global shrinker).
> >
> > and the cycle keeps going on and on?
>
> I agree this does not follow LRU, but I think the LRU priority
> inversion is unavoidable once the pool limit is hit.
> The accept_thr_percent should be lowered to reduce the probability of
> LRU inversion if it matters. (it is why I implemented proactive
> shrinker.)

Why?

Let's take a step back. You are suggesting that we throttle zswap
writeback to allow reclaim to swap out warmer pages to the swap device. As
Nhat said, we are proliferating LRU inversion instead of fixing it.

I think I had a similar discussion with Johannes about this before,
and we discussed that if zswap becomes full, we should instead
throttle reclaim and allow zswap writeback to proceed (i.e. the
opposite of what this series is doing). This would be similar to how
we throttle reclaim today to wait for dirty pages to be written back.

This should reduce/fix the LRU inversion instead of proliferating it,
and it should reduce the total amount of IO as colder pages should go
to disk while warmer pages go to zswap. I am wondering if we can reuse
the reclaim_throttle() mechanism here.
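
Very roughly, something like the following in the anon reclaim path
(zswap_is_full() is a stand-in for a predicate zswap would have to
expose, not an existing interface; reclaim_throttle() is the existing
helper in mm/vmscan.c):

	/* zswap is full: stall the reclaimer so zswap writeback can push
	 * the colder pages to disk first, instead of letting warmer pages
	 * bypass zswap straight to the swap device. */
	if (zswap_is_full())
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);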

One concern I have is that we will also throttle file pages if we use
reclaim_throttle(), since I don't see per-type throttling there. This
could be fine, since we similarly throttle zswap reclaim if there are
too many dirty file pages. I am not super familiar with reclaim
throttling, so maybe I missed something obvious or there is a better
way, but I believe that from a high level this should be the right way
to go.

I actually think if we do this properly, and throttle reclaim when
zswap becomes full, we may be able to drop the acceptance hysteresis
and rely on the throttling mechanism to make sure we stop reclaim
until we free up enough space in zswap to avoid consistently hitting
the limit, but this could be a future extension.

Johannes, any thoughts here?

Anyway, since patches 1-2 are independent of the rest of the series,
feel free to send them separately, and we can continue the discussion
on the best way forward for the rest of the series.
Nhat Pham July 17, 2024, 5:49 p.m. UTC | #8
On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> >
> > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > creating an LRU inversion here - preferring swapping out the rejected
> > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > other way around? For instance, can we spiral into the following
> > > scenario:
> > >
> > > 1. zswap pool becomes full.
> > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > keeps rejecting incoming pages, and putting a hold on the global
> > > shrinker.
> > > 3. The pages that are swapped out are warmer than the ones stored in
> > > the zswap pool, so they will be more likely to be swapped in (which,
> > > IIUC, will also further delay the global shrinker).
> > >
> > > and the cycle keeps going on and on?
> >
> > I agree this does not follow LRU, but I think the LRU priority
> > inversion is unavoidable once the pool limit is hit.
> > The accept_thr_percent should be lowered to reduce the probability of
> > LRU inversion if it matters. (it is why I implemented proactive
> > shrinker.)
>
> Why?
>
> Let's take a step back. You are suggesting that we throttle zswap
> writeback to allow reclaim to swapout warmer pages to swap device. As
> Nhat said, we are proliferating LRU inversion instead of fixing it.
>
> I think I had a similar discussion with Johannes about this before,
> and we discussed that if zswap becomes full, we should instead
> throttle reclaim and allow zswap writeback to proceed (i.e. the
> opposite of what this series is doing). This would be similar to how
> we throttle reclaim today to wait for dirty pages to be written back.
>

I completely agree with this analysis and proposal - it's somewhat
similar to what I have in mind, but more fleshed out :)

> This should reduce/fix the LRU inversion instead of proliferating it,
> and it should reduce the total amout of IO as colder pages should go
> to disk while warmer pages go to zswap. I am wondering if we can reuse
> the reclaim_throttle() mechanism here.
>
> One concern I have is that we will also throttle file pages if we use
> reclaim_throttle(), since I don't see per-type throttling there. This
> could be fine, since we similarly throttle zswap reclaim if there are
> too many dirty file pages. I am not super familiar with reclaim
> throttling, so maybe I missed something obvious or there is a better
> way, but I believe that from a high level this should be the right way
> to go.

I don't think we have any infrastructure for anon-only throttling in
vmscan logic, but it sounds trivial to implement if necessary :)

>
> I actually think if we do this properly, and throttle reclaim when
> zswap becomes full, we may be able to drop the acceptance hysteresis
> and rely on the throttling mechanism to make sure we stop reclaim
> until we free up enough space in zswap to avoid consistently hitting
> the limit, but this could be a future extension.

Agree - this hysteresis heuristics needs to die.

IMHO, I think we should still have the proactive global shrinking
action that Takero is proposing in patch 3. The throttling is nice,
but it'd be even nicer if we can get ahead of that :)

>
> Johannes, any thoughts here?
>
> Anyway, since patches 1-2 are independent of the rest of the series,
> feel free to send them separately, and we can continue the discussion
> on the best way forward for the rest of the series.

+1.
Yosry Ahmed July 17, 2024, 6:05 p.m. UTC | #9
On Wed, Jul 17, 2024 at 10:49 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> > >
> > > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > > creating an LRU inversion here - preferring swapping out the rejected
> > > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > > other way around? For instance, can we spiral into the following
> > > > scenario:
> > > >
> > > > 1. zswap pool becomes full.
> > > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > > keeps rejecting incoming pages, and putting a hold on the global
> > > > shrinker.
> > > > 3. The pages that are swapped out are warmer than the ones stored in
> > > > the zswap pool, so they will be more likely to be swapped in (which,
> > > > IIUC, will also further delay the global shrinker).
> > > >
> > > > and the cycle keeps going on and on?
> > >
> > > I agree this does not follow LRU, but I think the LRU priority
> > > inversion is unavoidable once the pool limit is hit.
> > > The accept_thr_percent should be lowered to reduce the probability of
> > > LRU inversion if it matters. (it is why I implemented proactive
> > > shrinker.)
> >
> > Why?
> >
> > Let's take a step back. You are suggesting that we throttle zswap
> > writeback to allow reclaim to swapout warmer pages to swap device. As
> > Nhat said, we are proliferating LRU inversion instead of fixing it.
> >
> > I think I had a similar discussion with Johannes about this before,
> > and we discussed that if zswap becomes full, we should instead
> > throttle reclaim and allow zswap writeback to proceed (i.e. the
> > opposite of what this series is doing). This would be similar to how
> > we throttle reclaim today to wait for dirty pages to be written back.
> >
>
> I completely agree with this analysis and proposal - it's somewhat
> similar to what I have in mind, but more fleshed out :)
>
> > This should reduce/fix the LRU inversion instead of proliferating it,
> > and it should reduce the total amout of IO as colder pages should go
> > to disk while warmer pages go to zswap. I am wondering if we can reuse
> > the reclaim_throttle() mechanism here.
> >
> > One concern I have is that we will also throttle file pages if we use
> > reclaim_throttle(), since I don't see per-type throttling there. This
> > could be fine, since we similarly throttle zswap reclaim if there are
> > too many dirty file pages. I am not super familiar with reclaim
> > throttling, so maybe I missed something obvious or there is a better
> > way, but I believe that from a high level this should be the right way
> > to go.
>
> I don't think we have any infrastructure for anon-only throttling in
> vmscan logic, but it sounds trivial to implement if necessary :)
>
> >
> > I actually think if we do this properly, and throttle reclaim when
> > zswap becomes full, we may be able to drop the acceptance hysteresis
> > and rely on the throttling mechanism to make sure we stop reclaim
> > until we free up enough space in zswap to avoid consistently hitting
> > the limit, but this could be a future extension.
>
> Agree - this hysteresis heuristics needs to die.
>
> IMHO, I think we should still have the proactive global shrinking
> action that Takero is proposing in patch 3. The throttling is nice,
> but it'd be even nicer if we can get ahead of that :)

I have always thought that the shrinker should play this role in one
way or another. Instead of an arbitrary watermark and asynchronous
work, it incrementally pushes the zswap LRU toward disk as reclaim
activity increases.

Is the point behind proactive shrinking to reduce the latency in
the reclaim path?
Nhat Pham July 17, 2024, 7:01 p.m. UTC | #10
On Wed, Jul 17, 2024 at 11:05 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I have always thought that the shrinker should play this role in one
> way or another. Instead of an arbitrary watermark and asynchronous
> work, it incrementally pushes the zswap LRU toward disk as reclaim
> activity increases.
>
> Is the point behind proactive shrinking is to reduce the latency in
> the reclaim path?

Yeah, reducing latency is the one benefit I have in mind :) I don't
feel too strongly regarding this though - in fact I'm more biased
towards the other shrinker in the first place.
Takero Funaki July 19, 2024, 2:55 p.m. UTC | #11
Thank you all for your reviews and comments.

On Thu, Jul 18, 2024 at 3:05 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jul 17, 2024 at 10:49 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jul 16, 2024 at 7:53 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > [..]
> > > >
> > > > > My concern is that we are knowingly (and perhaps unnecessarily)
> > > > > creating an LRU inversion here - preferring swapping out the rejected
> > > > > pages over the colder pages in the zswap pool. Shouldn't it be the
> > > > > other way around? For instance, can we spiral into the following
> > > > > scenario:
> > > > >
> > > > > 1. zswap pool becomes full.
> > > > > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > > > > keeps rejecting incoming pages, and putting a hold on the global
> > > > > shrinker.
> > > > > 3. The pages that are swapped out are warmer than the ones stored in
> > > > > the zswap pool, so they will be more likely to be swapped in (which,
> > > > > IIUC, will also further delay the global shrinker).
> > > > >
> > > > > and the cycle keeps going on and on?
> > > >
> > > > I agree this does not follow LRU, but I think the LRU priority
> > > > inversion is unavoidable once the pool limit is hit.
> > > > The accept_thr_percent should be lowered to reduce the probability of
> > > > LRU inversion if it matters. (it is why I implemented proactive
> > > > shrinker.)
> > >
> > > Why?
> > >
> > > Let's take a step back. You are suggesting that we throttle zswap
> > > writeback to allow reclaim to swapout warmer pages to swap device. As
> > > Nhat said, we are proliferating LRU inversion instead of fixing it.
> > >
> > > I think I had a similar discussion with Johannes about this before,
> > > and we discussed that if zswap becomes full, we should instead
> > > throttle reclaim and allow zswap writeback to proceed (i.e. the
> > > opposite of what this series is doing). This would be similar to how
> > > we throttle reclaim today to wait for dirty pages to be written back.
> > >
> >
> > I completely agree with this analysis and proposal - it's somewhat
> > similar to what I have in mind, but more fleshed out :)
> >
> > > This should reduce/fix the LRU inversion instead of proliferating it,
> > > and it should reduce the total amout of IO as colder pages should go
> > > to disk while warmer pages go to zswap. I am wondering if we can reuse
> > > the reclaim_throttle() mechanism here.
> > >
> > > One concern I have is that we will also throttle file pages if we use
> > > reclaim_throttle(), since I don't see per-type throttling there. This
> > > could be fine, since we similarly throttle zswap reclaim if there are
> > > too many dirty file pages. I am not super familiar with reclaim
> > > throttling, so maybe I missed something obvious or there is a better
> > > way, but I believe that from a high level this should be the right way
> > > to go.
> >
> > I don't think we have any infrastructure for anon-only throttling in
> > vmscan logic, but it sounds trivial to implement if necessary :)
> >
> > >
> > > I actually think if we do this properly, and throttle reclaim when
> > > zswap becomes full, we may be able to drop the acceptance hysteresis
> > > and rely on the throttling mechanism to make sure we stop reclaim
> > > until we free up enough space in zswap to avoid consistently hitting
> > > the limit, but this could be a future extension.
> >
> > Agree - this hysteresis heuristics needs to die.
> >
> > IMHO, I think we should still have the proactive global shrinking
> > action that Takero is proposing in patch 3. The throttling is nice,
> > but it'd be even nicer if we can get ahead of that :)
>
> I have always thought that the shrinker should play this role in one
> way or another. Instead of an arbitrary watermark and asynchronous
> work, it incrementally pushes the zswap LRU toward disk as reclaim
> activity increases.
>
> Is the point behind proactive shrinking is to reduce the latency in
> the reclaim path?

For proactive shrinking, I thought the latency and throughput of
pageout should be prioritized, assuming that delaying the reclaim
progress by rejection or synchronous writeback is not always
acceptable. Similarly, patch 6 accepted breaking LRU priority to avoid
degrading pageout performance compared to zswap-disabled systems.

But it seems like zswap prefers the LRU heuristics and a larger pool.
The shrinker should write back synchronously after the pool limit is
hit until the max pool size, and zswap should backpressure the
reclaim, right?  If so, my proposal is in the opposite direction. I
will submit patches 1 and 2 as v3.
Nhat Pham July 26, 2024, 6:13 p.m. UTC | #12
On Mon, Jul 15, 2024 at 1:20 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Sat, Jul 13, 2024 at 8:02 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> It was tested on an Azure VM with SSD-backed storage. The total IOPS
> was capped at 4K IOPS by the VM host. The max throughput of the global
> shrinker was around 16 MB/s. Proactive shrinking cannot prevent
> pool_limit_hit since memory allocation can be on the order of GB/s.
> (The benchmark script allocates 2 GB sequentially, which was
> compressed to 1.3 GB, while the zswap pool was limited to 200 MB.)

Hmmm I noticed that in a lot of other swap read/write paths (in
__read_swap_cache_async(), or in shrink_lruvec()), we are doing block
device plugging (blk_{start|finish}_plug()). The global shrinker path,
however, is currently not doing this - it's triggered in a workqueue,
separate from all these reclaim paths.

I wonder if there is any value in doing the same for the zswap global
shrinker. We do acquire a mutex (which can sleep) for every page,
which can unplug, but IIUC we only sleep when the mutex is currently
held by another task, and the mutex is per-CPU. The compression
algorithm is usually non-sleeping as well (for e.g, zstd). So maybe
there could be improvement in throughput here?
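
Roughly something like this around the shrink loop (helper names
approximated from mm/zswap.c, memcg iteration elided; the plug is
flushed anyway whenever the task actually sleeps):

	struct blk_plug plug;
	int failures = 0;

	blk_start_plug(&plug);
	while (zswap_total_pages() > zswap_accept_thr_pages()) {
		/* each call writes back a batch of entries via 4KB bios */
		if (shrink_memcg(zswap_next_shrink) &&
		    ++failures == MAX_RECLAIM_RETRIES)
			break;
		cond_resched();
	}
	blk_finish_plug(&plug);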

(Btw - friendly reminder that everyone should use zsmalloc as the default :))

Anyway, I haven't really played with this, and I don't have the right
setup that mimics your use case. If you do decide to give this a shot,
let me know :)
Nhat Pham July 26, 2024, 6:25 p.m. UTC | #13
On Fri, Jul 26, 2024 at 11:13 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> __read_swap_cache_async(), or in shrink_lruvec()), we are doing block

/s/__read_swap_cache_async()/swapin_{cluster|vma}_readahead()

We're initiating the plugging at the swapin callsites.