
[1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker

Message ID 20240725232813.2260665-2-nphamcs@gmail.com (mailing list archive)
State New
Series improving dynamic zswap shrinker protection scheme

Commit Message

Nhat Pham July 25, 2024, 11:28 p.m. UTC
The current zswap shrinker's heuristics for preventing overshrinking are
brittle and inaccurate, specifically in the way we decay the protection
size (i.e. making pages in the zswap LRU eligible for reclaim).

We currently decay protection aggressively in zswap_lru_add() calls.
This leads to the following unfortunate effect: when a new batch of
pages enters zswap, the protection size rapidly decays to below 25% of
the zswap LRU size, which is way too low.

We have observed this effect in production, when experimenting with the
zswap shrinker: the rate of shrinking shoots up massively right after a
new batch of zswap stores. This is somewhat the opposite of what we
originally want - when new pages enter zswap, we want to protect both
these new pages AND the pages that are already protected in the zswap
LRU.

Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its reference bit
   is set.
2. When the zswap shrinker encounters a zswap entry with the reference
   bit set, give it a second chance - only clear the reference bit and
   rotate it in the LRU.
3. If the shrinker encounters the entry again, this time with its
   reference bit unset, then it can reclaim the entry.
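
In code terms (condensed from the hunks below), the reference bit is
set once at store time and consumed by the shrinker's LRU callback:

	/* zswap_lru_add(): a freshly stored entry gets its reference bit set */
	entry->referenced = true;

	/* shrink_memcg_cb(): consume the reference bit before writing back */
	if (entry->referenced) {
		entry->referenced = false;
		return LRU_ROTATE;	/* second chance: rotate, do not reclaim yet */
	}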

In this manner, the aging of the pages in the zswap LRUs is decoupled
from zswap stores, and it picks up pace with increasing memory pressure
(which is what we want).

We will still maintain the count of swapins, which is consumed and
subtracted from the LRU size in zswap_shrinker_count(), to further
penalize past overshrinking that led to disk swapins. The idea is that
had we considered this many more pages in the LRU active/protected, they
would not have been written back and we would not have had to swap
them in.
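
For illustration (made-up numbers): if a memcg/node zswap LRU holds
1000 entries and 300 recent swapins have been recorded,
zswap_shrinker_count() reports 1000 - 300 = 700 freeable entries
(before scaling by the compression ratio) and resets the swapin
counter to zero.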

To test the new heuristic, I built the kernel under a cgroup with
memory.max set to 2G, on a host with 36 cores:

With the old shrinker:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

With the second chance algorithm:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

(average over 5 runs)

We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
reduction in real time. Note that the number of swapped-in pages
dropped by 58%.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/zswap.h | 16 ++++-----
 mm/zswap.c            | 84 +++++++++++++++++++------------------------
 2 files changed, 44 insertions(+), 56 deletions(-)

Comments

Yosry Ahmed July 26, 2024, 9:58 p.m. UTC | #1
On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Current zswap shrinker's heursitics to prevent overshrinking is brittle
> and inaccurate, specifically in the way we decay the protection size
> (i.e making pages in the zswap LRU eligible for reclaim).

Thanks for working on this and experimenting with different
heuristics. I was not a huge fan of these, so I am glad we are trying
to replace them with something more intuitive.

>
> We currently decay protection aggressively in zswap_lru_add() calls.
> This leads to the following unfortunate effect: when a new batch of
> pages enter zswap, the protection size rapidly decays to below 25% of
> the zswap LRU size, which is way too low.
>
> We have observed this effect in production, when experimenting with the
> zswap shrinker: the rate of shrinking shoots up massively right after a
> new batch of zswap stores. This is somewhat the opposite of what we want
> originally - when new pages enter zswap, we want to protect both these
> new pages AND the pages that are already protected in the zswap LRU.
>
> Replace existing heuristics with a second chance algorithm
>
> 1. When a new zswap entry is stored in the zswap pool, its reference bit
>    is set.
> 2. When the zswap shrinker encounters a zswap entry with the reference
>    bit set, give it a second chance - only flips the reference bit and
>    rotate it in the LRU.
> 3. If the shrinker encounters the entry again, this time with its
>    reference bit unset, then it can reclaim the entry.

At first glance, this is similar to the reclaim algorithm. A
fundamental difference here is that the reference bit is only set
once, when the entry is created. That is different from the
conventional second chance page reclaim/replacement algorithm.

What this really does is slow down writeback by enforcing that we
iterate entries exactly twice before we write them back. This sounds a
little arbitrary and not very intuitive to me.

Taking a step back, what we really want is to writeback zswap entries
in order, and avoid writing back more entries than needed. I think the
key here is "when needed", which is defined by how much memory
pressure we have. The shrinker framework should already be taking this
into account.

Looking at do_shrink_slab(), in the case of zswap (seek = 2),
total_scan should boil down to:

total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority

, and this is bounded by zswap_shrinker_count() * 2.
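
Roughly, eliding the clamping and batching details (and assuming I am
reading do_shrink_slab() correctly):

freeable   = zswap_shrinker_count(...);  /* ~ (lru_size - protected) * compression ratio */
delta      = (freeable >> priority) * 4 / 2;  /* shrinker->seeks == DEFAULT_SEEKS == 2 */
total_scan = (nr_deferred >> priority) + delta;
total_scan = min(total_scan, 2 * freeable);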

Ignoring nr_deferred, we start by scanning 1/2048th of
zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
zswap_shrinker_count() at zero priority (before OOMs). At
NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().

Keep in mind that zswap_shrinker_count() does not return the number of
all zswap entries; it subtracts the protected part (or recent swapins)
and scales by the compression ratio. So this looks reasonable at first
sight - perhaps we want to tune the seek to slow down writeback if we
think it's too much, but that doesn't explain the scenario you are
describing.

Now let's factor in nr_deferred, which looks to me like it could be
the culprit here. I am assuming the intention is that if we counted
freeable slab objects before but didn't get to free them, we should do
it the next time around. This feels like it assumes that the objects
will remain there unless reclaimed by the shrinker. This does not
apply for zswap, because the objects can be swapped in.

Also, in the beginning, before we encounter too many swapins, the
protection will be very low, so zswap_shrinker_count() will return a
relatively high value. Even if we don't scan and write back this
amount, we will keep carrying this value forward in subsequent reclaim
operations, even if the number of existing zswap entries has decreased
due to swapins.

Could this be the problem? The number of deferred objects to be
scanned just keeps going forward as a high value, essentially
rendering the heuristics in zswap_shrinker_count() useless?

If we just need to slow down writeback by making sure we scan entries
twice, could something similar be achieved just by tuning the seek
without needing any heuristics to begin with?

I am just trying to understand if the main problem is that zswap does
not fit well into the shrinker framework as it is, and how we can
improve this.

Just to be clear, I am in favor of changing those heuristics to
something more intuitive and simpler, but I really want to understand
what is going on. The approach taken by this patch is definitely
simpler, but it doesn't feel more intuitive to me (at least not yet).

>
> In this manner, the aging of the pages in the zswap LRUs are decoupled
> from zswap stores, and picks up the pace with increasing memory pressure
> (which is what we want).
>
> We will still maintain the count of swapins, which is consumed and
> subtracted from the lru size in zswap_shrinker_count(), to further
> penalize past overshrinking that led to disk swapins. The idea is that
> had we considered this many more pages in the LRU active/protected, they
> would not have been written back and we would not have had to swapped
> them in.
>
> To test this new heuristics, I built the kernel under a cgroup with
> memory.max set to 2G, on a host with 36 cores:
>
> With the old shrinker:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> With the second chance algorithm:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> (average over 5 runs)
>
> We observe an 1.3% reduction in kernel CPU usage, and around 7.2%
> reduction in real time. Note that the number of swapped in pages
> dropped by 58%.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  include/linux/zswap.h | 16 ++++-----
>  mm/zswap.c            | 84 +++++++++++++++++++------------------------
>  2 files changed, 44 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 6cecb4a4f68b..b94b6ae262d5 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
>
>  struct zswap_lruvec_state {
>         /*
> -        * Number of pages in zswap that should be protected from the shrinker.
> -        * This number is an estimate of the following counts:
> +        * Number of swapped in pages, i.e not found in the zswap pool.
>          *
> -        * a) Recent page faults.
> -        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> -        *    as well as recent zswap LRU rotations.
> -        *
> -        * These pages are likely to be warm, and might incur IO if the are written
> -        * to swap.
> +        * This is consumed and subtracted from the lru size in
> +        * zswap_shrinker_count() to penalize past overshrinking that led to disk
> +        * swapins. The idea is that had we considered this many more pages in the
> +        * LRU active/protected and not written them back, we would not have had to
> +        * swapped them in.
>          */
> -       atomic_long_t nr_zswap_protected;
> +       atomic_long_t nr_swapins;
>  };
>
>  unsigned long zswap_total_pages(void);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index adeaf9c97fde..a24ee015d7bc 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -203,6 +203,7 @@ struct zswap_entry {
>         };
>         struct obj_cgroup *objcg;
>         struct list_head lru;
> +       bool referenced;

If we take this approach, this needs to be placed in the hole after
the length, to avoid increasing the size of the zswap_entry.

>  };
>
>  static struct xarray *zswap_trees[MAX_SWAPFILES];
> @@ -700,11 +701,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
>
>  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -       atomic_long_t *nr_zswap_protected;
> -       unsigned long lru_size, old, new;
>         int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
> +
> +       entry->referenced = true;
>
>         /*
>          * Note that it is safe to use rcu_read_lock() here, even in the face of
> @@ -722,19 +722,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>         memcg = mem_cgroup_from_entry(entry);
>         /* will always succeed */
>         list_lru_add(list_lru, &entry->lru, nid, memcg);
> -
> -       /* Update the protection area */
> -       lru_size = list_lru_count_one(list_lru, nid, memcg);
> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> -       nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> -       old = atomic_long_inc_return(nr_zswap_protected);
> -       /*
> -        * Decay to avoid overflow and adapt to changing workloads.
> -        * This is based on LRU reclaim cost decaying heuristics.
> -        */
> -       do {
> -               new = old > lru_size / 4 ? old / 2 : old;
> -       } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
>         rcu_read_unlock();
>  }
>
> @@ -752,7 +739,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>
>  void zswap_lruvec_state_init(struct lruvec *lruvec)
>  {
> -       atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
> +       atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
>  }
>
>  void zswap_folio_swapin(struct folio *folio)
> @@ -761,7 +748,7 @@ void zswap_folio_swapin(struct folio *folio)
>
>         if (folio) {
>                 lruvec = folio_lruvec(folio);
> -               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> +               atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
>         }
>  }
>
> @@ -1091,6 +1078,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>         enum lru_status ret = LRU_REMOVED_RETRY;
>         int writeback_result;
>
> +       /*
> +        * Second chance algorithm: if the entry has its reference bit set, give it
> +        * a second chance. Only clear the reference bit and rotate it in the
> +        * zswap's LRU list.
> +        */
> +       if (entry->referenced) {
> +               entry->referenced = false;
> +               return LRU_ROTATE;
> +       }
> +
>         /*
>          * As soon as we drop the LRU lock, the entry can be freed by
>          * a concurrent invalidation. This means the following:
> @@ -1157,8 +1154,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>                 struct shrink_control *sc)
>  {
> -       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> -       unsigned long shrink_ret, nr_protected, lru_size;
> +       unsigned long shrink_ret;
>         bool encountered_page_in_swapcache = false;
>
>         if (!zswap_shrinker_enabled ||
> @@ -1167,25 +1163,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>                 return SHRINK_STOP;
>         }
>
> -       nr_protected =
> -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> -
> -       /*
> -        * Abort if we are shrinking into the protected region.
> -        *
> -        * This short-circuiting is necessary because if we have too many multiple
> -        * concurrent reclaimers getting the freeable zswap object counts at the
> -        * same time (before any of them made reasonable progress), the total
> -        * number of reclaimed objects might be more than the number of unprotected
> -        * objects (i.e the reclaimers will reclaim into the protected area of the
> -        * zswap LRU).
> -        */
> -       if (nr_protected >= lru_size - sc->nr_to_scan) {
> -               sc->nr_scanned = 0;
> -               return SHRINK_STOP;
> -       }
> -
>         shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
>                 &encountered_page_in_swapcache);
>
> @@ -1200,7 +1177,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>  {
>         struct mem_cgroup *memcg = sc->memcg;
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> -       unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> +       atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins;
> +       unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain;
>
>         if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
>                 return 0;
> @@ -1233,14 +1211,26 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         if (!nr_stored)
>                 return 0;
>
> -       nr_protected =
> -               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -       nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
> +       lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
> +       if (!lru_size)
> +               return 0;
> +
>         /*
> -        * Subtract the lru size by an estimate of the number of pages
> -        * that should be protected.
> +        * Subtract the lru size by the number of pages that are recently swapped
> +        * in. The idea is that had we protect the zswap's LRU by this amount of
> +        * pages, these swap in would not have happened.
>          */
> -       nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> +       nr_swapins_cur = atomic_long_read(nr_swapins);
> +       do {
> +               if (lru_size >= nr_swapins_cur)
> +                       nr_remain = 0;
> +               else
> +                       nr_remain = nr_swapins_cur - lru_size;
> +       } while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
> +
> +       lru_size -= nr_swapins_cur - nr_remain;
> +       if (!lru_size)
> +               return 0;
>
>         /*
>          * Scale the number of freeable pages by the memory saving factor.
> @@ -1253,7 +1243,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>          * space. Hence, we may scale nr_freeable down a little bit more than we
>          * should if we have a lot of same-filled pages.
>          */
> -       return mult_frac(nr_freeable, nr_backing, nr_stored);
> +       return mult_frac(lru_size, nr_backing, nr_stored);
>  }
>
>  static struct shrinker *zswap_alloc_shrinker(void)
> --
> 2.43.0
>
Nhat Pham July 29, 2024, 11:07 p.m. UTC | #2
On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e making pages in the zswap LRU eligible for reclaim).
>
> Thanks for working on this and experimenting with different
> heuristics. I was not a huge fan of these, so I am glad we are trying
> to replace them with something more intuitive.

That is certainly the intention :) I'm not a huge fan of those
heuristics either - they seemed fairly brittle and arbitrary to me even
back then (as you have pointed out).

>
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enter zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we want
> > originally - when new pages enter zswap, we want to protect both these
> > new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> >    is set.
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> >    bit set, give it a second chance - only flips the reference bit and
> >    rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> >    reference bit unset, then it can reclaim the entry.
>
> At the first look, this is similar to the reclaim algorithm. A
> fundamental difference here is that the reference bit is only set
> once, when the entry is created. It is different from the conventional
> second chance page reclaim/replacement algorithm.

I suppose, yeah. We no longer have non-exclusive loading (in most
scenarios), so the reference bit is only set once, on store attempt.

The interpretation/justification is still there though (somewhat). We
are still giving each object another "chance" to stay in the zswap
pool, in case it is needed soon, before it is reclaimed.

> Taking a step back, what we really want is to writeback zswap entries
> in order, and avoid writing back more entries than needed. I think the
> key here is "when needed", which is defined by how much memory
> pressure we have. The shrinker framework should already be taking this
> into account.
>
> Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> total_scan should boil down to:
>
> total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
>
> , and this is bounded by zswap_shrinker_count() * 2.
>
> Ignoring nr_deferred, we start by scanning 1/2048th of
> zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> zswap_shrinker_count() at zero priority (before OOMs). At
> NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
>
> Keep in mind that zswap_shrinker_count() does not return the number of
> all zswap entries, it subtracts the protected part (or recent swapins)

Note that this "protection" decays rather aggressively with zswap
stores. From zswap_lru_add():

new = old > lru_size / 4 ? old / 2 : old;

IOW, if the protection size exceeds 25% of the lru_size, halve it. A
couple of zswap_stores() could easily slash this to 25% of the LRU (or
even below) rapidly.
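
For a concrete (made-up) example: with ~600 protected entries on a
~1000-entry LRU, the next store halves the protection to ~300, and the
one after that to ~150 - i.e. two stores take us from 60% down to 15%
of the LRU.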

I guess I can fiddle with this decaying attempt + decouple the
decaying from storing (i.e. move it somewhere else). But any formula
I can come up with (another multiplicative factor?) seems as arbitrary
as this one, and any place where I put the decaying could potentially
couple two actions, which leads to unintended effects.

The second chance algorithm creates a two-section LRU configuration
similar to the old scheme - protected/unprotected (or, analogous to the
page reclaim algorithm, active/inactive). However, it bypasses the need
for such an artificially constructed formula, which you can think of as
the rate at which objects age. It naturally ties the aging of objects
(i.e. moving them from one section to the other) in the zswap pool to
memory pressure, which makes sense to me, FWIW.

> and scales by the compression ratio. So this looks reasonable at first
> sight, perhaps we want to tune the seek to slow down writeback if we
> think it's too much, but that doesn't explain the scenario you are
> describing.
>
> Now let's factor in nr_deferred, which looks to me like it could be
> the culprit here. I am assuming the intention is that if we counted
> freeable slab objects before but didn't get to free them, we should do
> it the next time around. This feels like it assumes that the objects
> will remain there unless reclaimed by the shrinker. This does not
> apply for zswap, because the objects can be swapped in.
>
> Also, in the beginning, before we encounter too many swapins, the
> protection will be very low, so zswap_shrinker_count() will return a
> relatively high value. Even if we don't scan and writeback this
> amount, we will keep carrying this value forward in next reclaim
> operations, even if the number of existing zswap entries have
> decreased due to swapins.
>
> Could this be the problem? The number of deferred objects to be
> scanned just keeps going forward as a high value, essentially
> rendering the heuristics in zswap_shrinker_count() useless?

Interesting theory. I wonder if I can do some tracing to investigate
this (or maybe just spam trace_printk() lol).

But yeah, this nr_deferred part seems suspicious. The number of
freeable objects returned by the shrinker_count() function (at least
the one implemented by zswap) doesn't really track which objects became
newly freeable since the last shrinker_count() invocation. So an
object can be accounted for in a previous invocation, fail to be
reclaimed in that invocation, yet still count towards future shrinker
invocations? That sounds wrong :) Even without zswapins, that's still
double counting, no? A future shrinker_count() call will still consider
the previously-failed-to-reclaim object. Do other shrinkers somehow
track the objects that have been accounted for in previous shrinker
attempts, and use this in the freeable computation?

That said, how bad this is in practice depends on the rate of reclaim
failure, since nr_deferred only increments when we fail to scan.
Which is actually not very high - for instance, these are the statistics
from a couple of my benchmark runs (using the old zswap shrinker
protection scheme, i.e. the status quo):

/sys/kernel/debug/zswap/reject_reclaim_fail:292
/sys/kernel/debug/zswap/written_back_pages:6947364

IIRC, exclusive loading really lowers the number of reclaim rejections
(you're less likely to observe a page that is already loaded into
memory and in the swap cache - its zswap entry is already invalidated).

This could be a problem, and probably should be fixed (for instance,
by adding a no-defer mode for shrinkers similar to zswap), but there
seems to be an additional/orthogonal reason for overshrinking.

>
> If we just need to slow down writeback by making sure we scan entries
> twice, could something similar be achieved just by tuning the seek
> without needing any heuristics to begin with?

Ah but this is a good point. My only defence for not using seek would
be I have no clue how to interpret the seek value :) Based on the
documentation:

int seeks; /* seeks to recreate an obj */

So I'm guessing this sorta kinda represents the number of disk seeks
(or IO?) required to reconstruct the object. Which makes sense in the
special case of 0. But it seems really arbitrary otherwise -
DEFAULT_SEEKS is 2, but zswap does not really take 2 seeks/IOs to
reconstruct the object, no?
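
(Mechanically, if I am reading do_shrink_slab() right, seeks only
enters the math as a divisor on the scan target - delta =
(freeable >> priority) * 4 / seeks - so DEFAULT_SEEKS == 2 gives the
2x factor above, and e.g. seeks == 4 would simply halve the writeback
pace.)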

It might have been different in the past - I don't know the historical
context here. But now, I guess this is just a shorthand for writeback
pace. The second chance algorithm has a (nominal) justification that
backs it, even though in effect it's the same.

But yeah, if we only think about actual effects, perhaps we don't even
need the reference bit manipulation - just cutting the writeback pace
in half is good enough. There might be some differences with
non-exclusive loading, but now that's gone (for the most part). The
other difference I can think of is in the aging-reclaiming ordering:

1. With the reference bit scheme, shrinkers will age the entire LRU
before reclaiming anything (due to LRU rotation).
2. With an increasing seek value, we will start reclaiming right away
(albeit at half of our old pace).

The average reclaim rate should stay the same, but the former gives the
objects in the zswap pool more of a chance to be loaded back into main
memory at the beginning. Or maybe I'm overthinking this and it doesn't
really matter - benchmarking time? :)

>
> I am just trying to understand if the main problem is that zswap does
> not fit well into the shrinker framework as it is, and how we can
> improve this.

Agree. Another potential problem is that under highly concurrent
settings, we can have many shrinker instances hitting the same zswap
LRU. And since the protection mechanism is racy/best-effort at best, we
can easily write back well into the protected area.

I was seriously considering physically separating the protected
(active) and unprotected (inactive) LRUs for zswap, but the second
chance algorithm seems a natural way to implement this separation, and
plays quite well with the current shrinker framework - aging and
reclaiming are all serialized via the lru lock.

> Just to be clear, I am in favor of changing those heuristics to
> something more intuitive and simpler, but I really want to understand
> what is going on. The approach taken by this patch is definitely
> simpler, but it doesn't feel more intuitive to me (at least not yet).

Thanks, yeah I agree. I was actually thinking that the second chance
algorithm would be a bit more natural than some protection decaying
formula pulled out of thin air, but perhaps there are still problems
with it.

> If we take this approach, this needs to be placed in the hole after
> the length, to avoid increasing the size of the zswap_entry.

Very good point - I was actually thinking about this the other day,
but somehow forgot to fix it before submission :) I'll fix this.
Nhat Pham July 30, 2024, 12:07 a.m. UTC | #3
On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > > and inaccurate, specifically in the way we decay the protection size
> > > (i.e making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
>
> That is certainly the intention :) I'm not a huge fan of those
> heuristics either - they seem fairly brittle and arbitrary to me even
> back then (as you have pointed out).
>
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enter zswap, the protection size rapidly decays to below 25% of
> > > the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we want
> > > originally - when new pages enter zswap, we want to protect both these
> > > new pages AND the pages that are already protected in the zswap LRU.
> > >
> > > Replace existing heuristics with a second chance algorithm
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > >    is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > >    bit set, give it a second chance - only flips the reference bit and
> > >    rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At the first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
>
> I suppose, yeah. We no longer have non-exclusive loading (in most
> scenarios), so the reference bit is only set once, on store attempt.
>
> The interpretation/justification is still there though (somewhat). We
> are still giving each object another "chance" to stay in the zswap
> pool, in case it is needed soon, before it is reclaimed.
>
> > Taking a step back, what we really want is to writeback zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
>
> Note that this "protection" decays rather aggressively with zswap
> stores. From zswap_lru_add():
>
> new = old > lru_size / 4 ? old / 2 : old;
>
> IOW, if the protection size exceeds 25% lru_size, half it. A couple of
> zswap_stores() could easily slash this to 25% of the LRU (or even
> below) rapidly.
>
> I guess I can fiddle with this decaying attempt + decouple the
> decaying from storing (i.e moving it somewhere else). But any formula
> I can come up with (another multiplicative factor?) seems as arbitrary
> as this formula, and any places that I place the decaying could
> potentially couple two actions, which lead to unintended effect.
>
> The second chance algorithm creates a similar two-section LRU
> configuration as the old scheme - protected/unprotected (or, analogous
> to the page reclaim algorithm, active/inactive). However, it bypasses
> the need for such an artificially constructed formula, which you can
> think of as the rate of objects aging. It naturally ties the aging of
> objects (i.e moving the objects from one section to another) in the
> zswap pool with memory pressure, which makes sense to me, FWIW.
>
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply for zswap, because the objects can be swapped in.
> >
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and writeback this
> > amount, we will keep carrying this value forward in next reclaim
> > operations, even if the number of existing zswap entries have
> > decreased due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
>
> Interesting theory. I wonder if I can do some tracing to investigate
> this (or maybe just spam trace_printk() lol).

OK, I found this trace:

sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable

I'll give this a shot, and report any interesting findings in the data
collected :)


>
> But yeah, this nr_deferred part seems suspicious. The number of
> freeable objects returned by the shrinker_count() function (at least
> the one implemented by zswap) doesn't really track which object is
> newly freeable since the last shrinker_count() invocation. So an
> object can be accounted for in a previous invocation, fail to be
> reclaimed in that invocation yet still count towards future shrinker
> invocation? That sounds wrong :) Even without zswapin, that's still
> double counting, no? Future shrinker_count() call will still consider
> the previously-failed-to-reclaim object. Do other shrinkers somehow
> track the objects that have been accounted for in previous shrinker
> attempts, and use this in the freeable computation?
>
> That said, how bad it is in practice depends on the rate of reclaim
> failure, since the nr_deferred only increments when we fail to scan.
> Which is actually not very high - for instance, this is the statistics
> from a couple of my benchmark runs (using the old zswap shrinker
> protection scheme, i.e the status quo):
>
> /sys/kernel/debug/zswap/reject_reclaim_fail:292
> /sys/kernel/debug/zswap/written_back_pages:6947364
>
> IIRC, exclusive loading really lowers the number of reclaim rejection
> (you're less likely to observe a page that is already loaded into
> memory and in swap cache - its zswap entry is already invalidated).
>

Actually, on closer inspection, we also increase the deferred count
when we short-circuit in each of the following scenarios:

a) When we disable the zswap shrinker or zswap writeback in general. I
don't think this matters too much in practice - we're not reclaiming
here anyway :)
b) When we detect that there are fewer unprotected objects than the batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
Johannes Weiner July 30, 2024, 3:39 a.m. UTC | #4
On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e making pages in the zswap LRU eligible for reclaim).
> 
> Thanks for working on this and experimenting with different
> heuristics. I was not a huge fan of these, so I am glad we are trying
> to replace them with something more intuitive.
> 
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enter zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we want
> > originally - when new pages enter zswap, we want to protect both these
> > new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> >    is set.
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> >    bit set, give it a second chance - only flips the reference bit and
> >    rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> >    reference bit unset, then it can reclaim the entry.
> 
> At the first look, this is similar to the reclaim algorithm. A
> fundamental difference here is that the reference bit is only set
> once, when the entry is created. It is different from the conventional
> second chance page reclaim/replacement algorithm.
> 
> What this really does, is that it slows down writeback by enforcing
> that we need to iterate entries exactly twice before we write them
> back. This sounds a little arbitrary and not very intuitive to me.

This isn't different than other second chance algorithms. Those
usually set the reference bit again to buy the entry another round. In
our case, another reference causes a zswapin, which removes the entry
from the list - buying it another round. Entries will get reclaimed
once the scan rate catches up with the longest reuse distance.

The main goal, which was also the goal of the protection math, is to
slow down writebacks in proportion to new entries showing up. This
gives zswap a chance to solve memory pressure through compression. If
memory pressure persists, writeback should pick up.

If no new entries were to show up, then sure, this would be busy
work. In practice, new entries do show up at a varying rate. This all
happens in parallel to anon reclaim, after all. The key here is that
new entries will be interleaved with rotated entries, and they consume
scan work! This is what results in the proportional slowdown.
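
For example, roughly speaking: if 60 out of 100 scanned entries are
freshly stored (reference bit still set), that scan writes back at
most 40 of the older entries; if only 10 are fresh, up to 90 older
entries get written back - same scan effort, very different writeback
rates.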

> Taking a step back, what we really want is to writeback zswap entries
> in order, and avoid writing back more entries than needed. I think the
> key here is "when needed", which is defined by how much memory
> pressure we have. The shrinker framework should already be taking this
> into account.
> 
> Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> total_scan should boil down to:
> 
> total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> 
> , and this is bounded by zswap_shrinker_count() * 2.
> 
> Ignoring nr_deferred, we start by scanning 1/2048th of
> zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> zswap_shrinker_count() at zero priority (before OOMs). At
> NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> 
> Keep in mind that zswap_shrinker_count() does not return the number of
> all zswap entries, it subtracts the protected part (or recent swapins)
> and scales by the compression ratio. So this looks reasonable at first
> sight, perhaps we want to tune the seek to slow down writeback if we
> think it's too much, but that doesn't explain the scenario you are
> describing.
> 
> Now let's factor in nr_deferred, which looks to me like it could be
> the culprit here. I am assuming the intention is that if we counted
> freeable slab objects before but didn't get to free them, we should do
> it the next time around. This feels like it assumes that the objects
> will remain there unless reclaimed by the shrinker. This does not
> apply for zswap, because the objects can be swapped in.

Hm.

_count() returns (objects - protected) * compression_rate, then the
shrinker does the >> priority dance. So to_scan is expected to be a
small portion of unprotected objects.

_scan() bails if to_scan > (objects - protected).

How often does this actually abort in practice?

> Also, in the beginning, before we encounter too many swapins, the
> protection will be very low, so zswap_shrinker_count() will return a
> relatively high value. Even if we don't scan and writeback this
> amount, we will keep carrying this value forward in next reclaim
> operations, even if the number of existing zswap entries have
> decreased due to swapins.
> 
> Could this be the problem? The number of deferred objects to be
> scanned just keeps going forward as a high value, essentially
> rendering the heuristics in zswap_shrinker_count() useless?
> 
> If we just need to slow down writeback by making sure we scan entries
> twice, could something similar be achieved just by tuning the seek
> without needing any heuristics to begin with?

Seek is a fixed coefficient for the scan rate.

We want to slow writeback when recent zswapouts dominate the zswap
pool (expanding or thrashing), and speed it up when recent entries
make up a small share of the pool (stagnating).

This is what the second chance accomplishes.
Nhat Pham July 30, 2024, 6:23 a.m. UTC | #5
On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap
> pool (expanding or thrashing), and speed it up when recent entries
> make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Wow, this is something that I did not even consider. Thanks for
pointing this out, Johannes.

shrinker->seeks tuning allows you to adjust writeback pacing as a
ratio of the pool size. When the pool is static (no/few zswpins or
zswpouts), the two approaches are similar (on average). But with
concurrent activity (pages coming in and out), the dynamics can
potentially be different.

Second chance lets the dynamics vary depending on recent pool activity.
Recent zswpouts will be protected by virtue of their reference bits
(and given another chance, which will be taken if the page is used
again soon), and pages concurrently zswapped in obviously will be too,
whereas stale objects that have already been touched by the shrinker
once in the past will be evicted immediately. IOW, all of the above
activities (zswpin, zswpout, reclaim pressure) can harmonize seamlessly
to adjust the effective rate of writeback.

Without any additional heuristics (old or new), increasing seeks (i.e.
decreasing the writeback rate) by itself only has a static effect, and
definitely does not accomplish the aforementioned dynamic writeback
rate adjustment. Now, we can (and did try to) mimic the above behavior
somewhat with the protection size scheme: only return the unprotected
size, carefully increase it on zswpout and swpin (so that zswpouts are
not considered), carefully prevent the shrinker from reclaiming into
the protected area, etc. But it's incredibly brittle - with all these
hacks, it becomes even more complicated and unintuitive than the
second chance algorithm. If it were performing well, then sure, but
it's not. Might as well do the simpler thing? :)

Besides, the problem with the haphazard aging (i.e. protection
decaying) remains - at which point do we decay, and by how much?
Compare this with the new second chance scheme, which gives you a
natural aging mechanism, and can elegantly adjust itself to
reclaim/memory pressure.
Yosry Ahmed July 30, 2024, 6:46 p.m. UTC | #6
On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > > and inaccurate, specifically in the way we decay the protection size
> > > (i.e making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enter zswap, the protection size rapidly decays to below 25% of
> > > the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we want
> > > originally - when new pages enter zswap, we want to protect both these
> > > new pages AND the pages that are already protected in the zswap LRU.
> > >
> > > Replace existing heuristics with a second chance algorithm
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > >    is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > >    bit set, give it a second chance - only flips the reference bit and
> > >    rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At the first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
> >
> > What this really does, is that it slows down writeback by enforcing
> > that we need to iterate entries exactly twice before we write them
> > back. This sounds a little arbitrary and not very intuitive to me.
>
> This isn't different than other second chance algorithms. Those
> usually set the reference bit again to buy the entry another round. In
> our case, another reference causes a zswapin, which removes the entry
> from the list - buying it another round. Entries will get reclaimed
> once the scan rate catches up with the longest reuse distance.
>
> The main goal, which was also the goal of the protection math, is to
> slow down writebacks in proportion to new entries showing up. This
> gives zswap a chance to solve memory pressure through compression. If
> memory pressure persists, writeback should pick up.
>
> If no new entries were to show up, then sure, this would be busy
> work. In practice, new entries do show up at a varying rate. This all
> happens in parallel to anon reclaim, after all. The key here is that
> new entries will be interleaved with rotated entries, and they consume
> scan work! This is what results in the proportional slowdown.

The fact that in practice there will be new entries showing up because
we are doing anon reclaim is exactly what I had in mind. My point is
that in practice, for most cases, the reference bit could just be
causing a steady slowdown, which can be achieved just by tuning the
seek. But I have a better idea now about what you mean by the
proportional slowdown.

(more below)

>
> > Taking a step back, what we really want is to writeback zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply for zswap, because the objects can be swapped in.
>
> Hm.
>
> _count() returns (objects - protected) * compression_rate, then the
> shrinker does the >> priority dance. So to_scan is expected to be a
> small portion of unprotected objects.
>
> _scan() bails if to_scan > (objects - protected).
>
> How often does this actually abort in practice?

Good question :)

>
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and writeback this
> > amount, we will keep carrying this value forward in next reclaim
> > operations, even if the number of existing zswap entries have
> > decreased due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
> >
> > If we just need to slow down writeback by making sure we scan entries
> > twice, could something similar be achieved just by tuning the seek
> > without needing any heuristics to begin with?
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap
> pool (expanding or thrashing), and speed it up when recent entries
> make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Yeah, I am more convinced now. I wanted to check whether just tuning
the seek achieves a similar result, but the complexity from the
reference bit is not much anyway -- so maybe that isn't worth it.

I still think the nr_deferred handling in the shrinker framework does
not seem compatible with zswap. Entries could be double counted, and
previously counted entries could go away during swapins. Unless I am
missing something, it seems like the scanning can be arbitrary, and
not truly proportional to the reclaim priority and the zswap LRU size.

I also think it's important to clarify that there are two mechanisms
that control the writeback rate with this patch: the reference bit
proportional slowdown (protecting against writeback of recently
swapped out pages), and the nr_swapins protection (feedback from
swapins of recently written back pages).

Maybe we can move things around to make it more obvious how these
mechanisms work hand in hand, or have a comment somewhere describing
the writeback mechanism.
Nhat Pham July 30, 2024, 9:40 p.m. UTC | #7
On Tue, Jul 30, 2024 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> I still think the nr_deferred handling in the shrinker framework does
> not seem compatible with zswap. Entries could be double counted, and
> previously counted entries could go away during swapins. Unless I am
> missing something, it seems like the scanning can be arbitrary, and
> not truly proportional to the reclaim priority and zswap LRUs size.

We can do some investigation on the side. I did find the trace point
that collects the deferred numbers - maybe that can illustrate the problem?

It'd be an orthogonal issue though. I don't think fixing the
nr_deferred part would be enough to handle the various cases Johannes
pointed out.
Nhat Pham July 30, 2024, 10:04 p.m. UTC | #8
On Tue, Jul 30, 2024 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I also think it's important to clarify that there are two mechanisms
> that control the writeback rate with this patch, the reference bit
> proportional slowdown (protecting against writeback of recently
> swapped out pages), and nr_swapins protection (feedback from swapins
> of recently written back pages).

Regarding this - perhaps I wasn't being clear, but I did include both
in the changelog. The latter is this paragraph:

"We will still maintain the count of swapins, which is consumed and
subtracted from the lru size in zswap_shrinker_count(), to further
penalize past overshrinking that led to disk swapins. The idea is that
had we considered this many more pages in the LRU active/protected, they
would not have been written back and we would not have had to swapped
them in."

I also replicated this comment in the nr_swapins explanation - see
struct zswap_lruvec_state.

The same goes for the second chance algorithm - I briefly explained it
in shrink_memcg_cb(), as well as in the changelog.

>
> Maybe we can move things around to make it more obvious how these
> mechanisms work hand in hand, or have a comment somewhere describing
> the writeback mechanism.

Hmm let me add some documentation for this near
zswap_shrinker_{scan|count}() definitions:
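
Something along these lines, perhaps (a rough sketch - wording to be
refined):

/*
 * Writeback pacing, in short:
 *
 * 1. Every entry stored into zswap starts with its reference bit set. The
 *    shrinker clears the bit and rotates the entry on the first encounter,
 *    and only writes it back on the second encounter - so new stores slow
 *    down writeback of the rest of the pool in proportion to their rate.
 * 2. zswap_shrinker_count() subtracts recent swapins (nr_swapins) from the
 *    LRU size, penalizing past overshrinking that led to disk swapins.
 */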

Patch

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..b94b6ae262d5 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -13,17 +13,15 @@  extern atomic_t zswap_stored_pages;
 
 struct zswap_lruvec_state {
 	/*
-	 * Number of pages in zswap that should be protected from the shrinker.
-	 * This number is an estimate of the following counts:
+	 * Number of swapped in pages, i.e not found in the zswap pool.
 	 *
-	 * a) Recent page faults.
-	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
-	 *    as well as recent zswap LRU rotations.
-	 *
-	 * These pages are likely to be warm, and might incur IO if the are written
-	 * to swap.
+	 * This is consumed and subtracted from the lru size in
+	 * zswap_shrinker_count() to penalize past overshrinking that led to disk
+	 * swapins. The idea is that had we considered this many more pages in the
+	 * LRU active/protected and not written them back, we would not have had to
+	 * swap them in.
 	 */
-	atomic_long_t nr_zswap_protected;
+	atomic_long_t nr_swapins;
 };
 
 unsigned long zswap_total_pages(void);
diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..a24ee015d7bc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -203,6 +203,7 @@  struct zswap_entry {
 	};
 	struct obj_cgroup *objcg;
 	struct list_head lru;
+	bool referenced;
 };
 
 static struct xarray *zswap_trees[MAX_SWAPFILES];
@@ -700,11 +701,10 @@  static inline int entry_to_nid(struct zswap_entry *entry)
 
 static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 {
-	atomic_long_t *nr_zswap_protected;
-	unsigned long lru_size, old, new;
 	int nid = entry_to_nid(entry);
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
+
+	entry->referenced = true;
 
 	/*
 	 * Note that it is safe to use rcu_read_lock() here, even in the face of
@@ -722,19 +722,6 @@  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 	memcg = mem_cgroup_from_entry(entry);
 	/* will always succeed */
 	list_lru_add(list_lru, &entry->lru, nid, memcg);
-
-	/* Update the protection area */
-	lru_size = list_lru_count_one(list_lru, nid, memcg);
-	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-	nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
-	old = atomic_long_inc_return(nr_zswap_protected);
-	/*
-	 * Decay to avoid overflow and adapt to changing workloads.
-	 * This is based on LRU reclaim cost decaying heuristics.
-	 */
-	do {
-		new = old > lru_size / 4 ? old / 2 : old;
-	} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
 	rcu_read_unlock();
 }
 
@@ -752,7 +739,7 @@  static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
 
 void zswap_lruvec_state_init(struct lruvec *lruvec)
 {
-	atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
+	atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
 }
 
 void zswap_folio_swapin(struct folio *folio)
@@ -761,7 +748,7 @@  void zswap_folio_swapin(struct folio *folio)
 
 	if (folio) {
 		lruvec = folio_lruvec(folio);
-		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
 	}
 }
 
@@ -1091,6 +1078,16 @@  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
 
+	/*
+	 * Second chance algorithm: if the entry has its reference bit set, give it
+	 * a second chance. Only clear the reference bit and rotate it in the
+	 * zswap's LRU list.
+	 */
+	if (entry->referenced) {
+		entry->referenced = false;
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * As soon as we drop the LRU lock, the entry can be freed by
 	 * a concurrent invalidation. This means the following:
@@ -1157,8 +1154,7 @@  static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
-	unsigned long shrink_ret, nr_protected, lru_size;
+	unsigned long shrink_ret;
 	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
@@ -1167,25 +1163,6 @@  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
-
-	/*
-	 * Abort if we are shrinking into the protected region.
-	 *
-	 * This short-circuiting is necessary because if we have too many multiple
-	 * concurrent reclaimers getting the freeable zswap object counts at the
-	 * same time (before any of them made reasonable progress), the total
-	 * number of reclaimed objects might be more than the number of unprotected
-	 * objects (i.e the reclaimers will reclaim into the protected area of the
-	 * zswap LRU).
-	 */
-	if (nr_protected >= lru_size - sc->nr_to_scan) {
-		sc->nr_scanned = 0;
-		return SHRINK_STOP;
-	}
-
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
@@ -1200,7 +1177,8 @@  static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 {
 	struct mem_cgroup *memcg = sc->memcg;
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
-	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
+	atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins;
+	unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain;
 
 	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
@@ -1233,14 +1211,26 @@  static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!nr_stored)
 		return 0;
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
+	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
+	if (!lru_size)
+		return 0;
+
 	/*
-	 * Subtract the lru size by an estimate of the number of pages
-	 * that should be protected.
+	 * Subtract from the lru size the number of pages that were recently swapped
+	 * in. The idea is that had we protected the zswap LRU by this amount of
+	 * pages, these swapins would not have happened.
 	 */
-	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
+	nr_swapins_cur = atomic_long_read(nr_swapins);
+	do {
+		if (lru_size >= nr_swapins_cur)
+			nr_remain = 0;
+		else
+			nr_remain = nr_swapins_cur - lru_size;
+	} while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
+
+	lru_size -= nr_swapins_cur - nr_remain;
+	if (!lru_size)
+		return 0;
 
 	/*
 	 * Scale the number of freeable pages by the memory saving factor.
@@ -1253,7 +1243,7 @@  static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	 * space. Hence, we may scale nr_freeable down a little bit more than we
 	 * should if we have a lot of same-filled pages.
 	 */
-	return mult_frac(nr_freeable, nr_backing, nr_stored);
+	return mult_frac(lru_size, nr_backing, nr_stored);
 }
 
 static struct shrinker *zswap_alloc_shrinker(void)