mbox series

[v3,0/5] workload-specific and memory pressure-driven zswap writeback

Message ID 20231017232152.2605440-1-nphamcs@gmail.com (mailing list archive)
Headers show
Series workload-specific and memory pressure-driven zswap writeback | expand

Message

Nhat Pham Oct. 17, 2023, 11:21 p.m. UTC
Changelog:
v3:
   * Add a patch to export per-cgroup zswap writeback counters
   * Add a patch to update zswap's kselftest
   * Separate the new list_lru functions into its own prep patch
   * Do not start from the top of the hierarchy when encounter a memcg
     that is not online for the global limit zswap writeback (patch 2)
     (suggested by Yosry Ahmed)
   * Do not remove the swap entry from list_lru in
     __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
   * Removed a redundant zswap pool getting (patch 2)
     (reported by Ryan Roberts)
   * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
     (patch 5) (suggested by Yosry Ahmed)
   * Remove the per-cgroup zswap shrinker knob (patch 5)
     (suggested by Yosry Ahmed)
v2:
   * Fix loongarch compiler errors
   * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.

Domenico Cerasuolo (3):
  zswap: make shrinking memcg-aware
  mm: memcg: add per-memcg zswap writeback stat
  selftests: cgroup: update per-memcg zswap writeback selftest

Nhat Pham (2):
  mm: list_lru: allow external numa node and cgroup tracking
  zswap: shrinks zswap pool based on memory pressure

 Documentation/admin-guide/mm/zswap.rst      |   7 +
 include/linux/list_lru.h                    |  38 +++
 include/linux/memcontrol.h                  |   7 +
 include/linux/mmzone.h                      |  14 +
 mm/list_lru.c                               |  43 ++-
 mm/memcontrol.c                             |  15 +
 mm/mmzone.c                                 |   3 +
 mm/swap.h                                   |   3 +-
 mm/swap_state.c                             |  38 ++-
 mm/zswap.c                                  | 335 ++++++++++++++++----
 tools/testing/selftests/cgroup/test_zswap.c |  74 +++--
 11 files changed, 485 insertions(+), 92 deletions(-)

Comments

Andrew Morton Oct. 19, 2023, 5:12 p.m. UTC | #1
On Tue, 17 Oct 2023 16:21:47 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> Subject: [PATCH v3 0/5] workload-specific and memory pressure-driven zswap writeback

We're at -rc6 and I'd prefer to drop this series from mm.git, have
another go during the next cycle.

However Hugh's v2 series "mempolicy: cleanups leading to NUMA mpol
without vma" has syntactic dependencies on this series and will need
rework, so I'd like to make that decision soon.

Do we feel that this series can be made into a mergeable state within
the next few days?

Thanks.
Yosry Ahmed Oct. 19, 2023, 5:33 p.m. UTC | #2
On Thu, Oct 19, 2023 at 10:12 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 17 Oct 2023 16:21:47 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
>
> > Subject: [PATCH v3 0/5] workload-specific and memory pressure-driven zswap writeback
>
> We're at -rc6 and I'd prefer to drop this series from mm.git, have
> another go during the next cycle.
>
> However Hugh's v2 series "mempolicy: cleanups leading to NUMA mpol
> without vma" has syntactic dependencies on this series and will need
> rework, so I'd like to make that decision soon.
>
> Do we feel that this series can be made into a mergeable state within
> the next few days?

There are parts of the code that I would feel more comfortable if
someone took a look at (which I mentioned in individual patches). So
unless this happens in the next few days I wouldn't say so.

>
> Thanks.
Nhat Pham Oct. 19, 2023, 6:31 p.m. UTC | #3
On Thu, Oct 19, 2023 at 10:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Oct 19, 2023 at 10:12 AM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 17 Oct 2023 16:21:47 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > > Subject: [PATCH v3 0/5] workload-specific and memory pressure-driven zswap writeback
> >
> > We're at -rc6 and I'd prefer to drop this series from mm.git, have
> > another go during the next cycle.
> >
> > However Hugh's v2 series "mempolicy: cleanups leading to NUMA mpol
> > without vma" has syntactic dependencies on this series and will need
> > rework, so I'd like to make that decision soon.
> >
> > Do we feel that this series can be made into a mergeable state within
> > the next few days?
>
> There are parts of the code that I would feel more comfortable if
> someone took a look at (which I mentioned in individual patches). So
> unless this happens in the next few days I wouldn't say so.
>

I'm not super familiar with the other series. How big is the dependency?
Looks like it's just a small part in the swapcache code right?

If this is the case, I feel like the best course of action is to rebase
the mempolicy patch series on top of mm-unstable, and resolve
this merge conflict. I will then send out v4 of the zswap shrinker,
rebased on top of the mempolicy patch series.

If this is not the case, one thing we can do is:

a) Fix bugs (there's one kernel test robot it seems)
b) Fix user-visible details (writeback counter for e.g)

and just merge the series for now. FWIW, this is an optional
feature and disabled by default. So performance optimization
and aesthetics change (list_lru_add() renaming etc.) can wait.

We can push out v4 by the end of today and early tomorrow
if all goes well. Then everyone can review and comment on it.

How does everyone feel about this strategy?

> >
> > Thanks.
Andrew Morton Oct. 19, 2023, 6:36 p.m. UTC | #4
On Thu, 19 Oct 2023 11:31:17 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> > There are parts of the code that I would feel more comfortable if
> > someone took a look at (which I mentioned in individual patches). So
> > unless this happens in the next few days I wouldn't say so.
> >
> 
> I'm not super familiar with the other series. How big is the dependency?
> Looks like it's just a small part in the swapcache code right?
> 
> If this is the case, I feel like the best course of action is to rebase
> the mempolicy patch series on top of mm-unstable, and resolve
> this merge conflict.

OK, thanks.

Hugh, do you have time to look at rebasing on the mm-stable which I
pushed out 15 minutes ago?

> I will then send out v4 of the zswap shrinker,
> rebased on top of the mempolicy patch series.
> 
> If this is not the case, one thing we can do is:
> 
> a) Fix bugs (there's one kernel test robot it seems)
> b) Fix user-visible details (writeback counter for e.g)
> 
> and just merge the series for now. FWIW, this is an optional
> feature and disabled by default. So performance optimization
> and aesthetics change (list_lru_add() renaming etc.) can wait.
> 
> We can push out v4 by the end of today and early tomorrow
> if all goes well. Then everyone can review and comment on it.
>
Hugh Dickins Oct. 19, 2023, 7:23 p.m. UTC | #5
On Thu, 19 Oct 2023, Andrew Morton wrote:
> On Thu, 19 Oct 2023 11:31:17 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> 
> > > There are parts of the code that I would feel more comfortable if
> > > someone took a look at (which I mentioned in individual patches). So
> > > unless this happens in the next few days I wouldn't say so.
> > >
> > 
> > I'm not super familiar with the other series. How big is the dependency?
> > Looks like it's just a small part in the swapcache code right?
> > 
> > If this is the case, I feel like the best course of action is to rebase
> > the mempolicy patch series on top of mm-unstable, and resolve
> > this merge conflict.
> 
> OK, thanks.
> 
> Hugh, do you have time to look at rebasing on the mm-stable which I
> pushed out 15 minutes ago?

Okay, I'm on it - but (unless you insist otherwise) it's only a v3 of
the 10/12 "mempolicy: alloc_pages_mpol() for NUMA policy without vma"
that I'm expecting to send you - the rest should just cherry-pick in
cleanly.  I'll check that of course, but I'm afraid of losing details
(e.g. any Acks you've meanwhile added) if I resend the lot.

Hugh

> 
> > I will then send out v4 of the zswap shrinker,
> > rebased on top of the mempolicy patch series.
> > 
> > If this is not the case, one thing we can do is:
> > 
> > a) Fix bugs (there's one kernel test robot it seems)
> > b) Fix user-visible details (writeback counter for e.g)
> > 
> > and just merge the series for now. FWIW, this is an optional
> > feature and disabled by default. So performance optimization
> > and aesthetics change (list_lru_add() renaming etc.) can wait.
> > 
> > We can push out v4 by the end of today and early tomorrow
> > if all goes well. Then everyone can review and comment on it.