Message ID | 20240725232813.2260665-2-nphamcs@gmail.com (mailing list archive)
---|---
State | New
Series | improving dynamic zswap shrinker protection scheme
On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Current zswap shrinker's heursitics to prevent overshrinking is brittle > and inaccurate, specifically in the way we decay the protection size > (i.e making pages in the zswap LRU eligible for reclaim). Thanks for working on this and experimenting with different heuristics. I was not a huge fan of these, so I am glad we are trying to replace them with something more intuitive. > > We currently decay protection aggressively in zswap_lru_add() calls. > This leads to the following unfortunate effect: when a new batch of > pages enter zswap, the protection size rapidly decays to below 25% of > the zswap LRU size, which is way too low. > > We have observed this effect in production, when experimenting with the > zswap shrinker: the rate of shrinking shoots up massively right after a > new batch of zswap stores. This is somewhat the opposite of what we want > originally - when new pages enter zswap, we want to protect both these > new pages AND the pages that are already protected in the zswap LRU. > > Replace existing heuristics with a second chance algorithm > > 1. When a new zswap entry is stored in the zswap pool, its reference bit > is set. > 2. When the zswap shrinker encounters a zswap entry with the reference > bit set, give it a second chance - only flips the reference bit and > rotate it in the LRU. > 3. If the shrinker encounters the entry again, this time with its > reference bit unset, then it can reclaim the entry. At the first look, this is similar to the reclaim algorithm. A fundamental difference here is that the reference bit is only set once, when the entry is created. It is different from the conventional second chance page reclaim/replacement algorithm. What this really does, is that it slows down writeback by enforcing that we need to iterate entries exactly twice before we write them back. This sounds a little arbitrary and not very intuitive to me. 
Taking a step back, what we really want is to writeback zswap entries in order, and avoid writing back more entries than needed. I think the key here is "when needed", which is defined by how much memory pressure we have. The shrinker framework should already be taking this into account. Looking at do_shrink_slab(), in the case of zswap (seek = 2), total_scan should boil down to: total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority , and this is bounded by zswap_shrinker_count() * 2. Ignoring nr_deferred, we start by scanning 1/2048th of zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * zswap_shrinker_count() at zero priority (before OOMs). At NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count(). Keep in mind that zswap_shrinker_count() does not return the number of all zswap entries, it subtracts the protected part (or recent swapins) and scales by the compression ratio. So this looks reasonable at first sight, perhaps we want to tune the seek to slow down writeback if we think it's too much, but that doesn't explain the scenario you are describing. Now let's factor in nr_deferred, which looks to me like it could be the culprit here. I am assuming the intention is that if we counted freeable slab objects before but didn't get to free them, we should do it the next time around. This feels like it assumes that the objects will remain there unless reclaimed by the shrinker. This does not apply for zswap, because the objects can be swapped in. Also, in the beginning, before we encounter too many swapins, the protection will be very low, so zswap_shrinker_count() will return a relatively high value. Even if we don't scan and writeback this amount, we will keep carrying this value forward in next reclaim operations, even if the number of existing zswap entries have decreased due to swapins. Could this be the problem? 
The number of deferred objects to be scanned just keeps going forward as a high value, essentially rendering the heuristics in zswap_shrinker_count() useless? If we just need to slow down writeback by making sure we scan entries twice, could something similar be achieved just by tuning the seek without needing any heuristics to begin with? I am just trying to understand if the main problem is that zswap does not fit well into the shrinker framework as it is, and how we can improve this. Just to be clear, I am in favor of changing those heuristics to something more intuitive and simpler, but I really want to understand what is going on. The approach taken by this patch is definitely simpler, but it doesn't feel more intuitive to me (at least not yet). > > In this manner, the aging of the pages in the zswap LRUs are decoupled > from zswap stores, and picks up the pace with increasing memory pressure > (which is what we want). > > We will still maintain the count of swapins, which is consumed and > subtracted from the lru size in zswap_shrinker_count(), to further > penalize past overshrinking that led to disk swapins. The idea is that > had we considered this many more pages in the LRU active/protected, they > would not have been written back and we would not have had to swapped > them in. > > To test this new heuristics, I built the kernel under a cgroup with > memory.max set to 2G, on a host with 36 cores: > > With the old shrinker: > > real: 263.89s > user: 4318.11s > sys: 673.29s > swapins: 227300.5 > > With the second chance algorithm: > > real: 244.85s > user: 4327.22s > sys: 664.39s > swapins: 94663 > > (average over 5 runs) > > We observe an 1.3% reduction in kernel CPU usage, and around 7.2% > reduction in real time. Note that the number of swapped in pages > dropped by 58%. 
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > --- > include/linux/zswap.h | 16 ++++----- > mm/zswap.c | 84 +++++++++++++++++++------------------------ > 2 files changed, 44 insertions(+), 56 deletions(-) > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index 6cecb4a4f68b..b94b6ae262d5 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages; > > struct zswap_lruvec_state { > /* > - * Number of pages in zswap that should be protected from the shrinker. > - * This number is an estimate of the following counts: > + * Number of swapped in pages, i.e not found in the zswap pool. > * > - * a) Recent page faults. > - * b) Recent insertion to the zswap LRU. This includes new zswap stores, > - * as well as recent zswap LRU rotations. > - * > - * These pages are likely to be warm, and might incur IO if the are written > - * to swap. > + * This is consumed and subtracted from the lru size in > + * zswap_shrinker_count() to penalize past overshrinking that led to disk > + * swapins. The idea is that had we considered this many more pages in the > + * LRU active/protected and not written them back, we would not have had to > + * swapped them in. > */ > - atomic_long_t nr_zswap_protected; > + atomic_long_t nr_swapins; > }; > > unsigned long zswap_total_pages(void); > diff --git a/mm/zswap.c b/mm/zswap.c > index adeaf9c97fde..a24ee015d7bc 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -203,6 +203,7 @@ struct zswap_entry { > }; > struct obj_cgroup *objcg; > struct list_head lru; > + bool referenced; If we take this approach, this needs to be placed in the hole after the length, to avoid increasing the size of the zswap_entry. 
> }; > > static struct xarray *zswap_trees[MAX_SWAPFILES]; > @@ -700,11 +701,10 @@ static inline int entry_to_nid(struct zswap_entry *entry) > > static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry) > { > - atomic_long_t *nr_zswap_protected; > - unsigned long lru_size, old, new; > int nid = entry_to_nid(entry); > struct mem_cgroup *memcg; > - struct lruvec *lruvec; > + > + entry->referenced = true; > > /* > * Note that it is safe to use rcu_read_lock() here, even in the face of > @@ -722,19 +722,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry) > memcg = mem_cgroup_from_entry(entry); > /* will always succeed */ > list_lru_add(list_lru, &entry->lru, nid, memcg); > - > - /* Update the protection area */ > - lru_size = list_lru_count_one(list_lru, nid, memcg); > - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); > - nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected; > - old = atomic_long_inc_return(nr_zswap_protected); > - /* > - * Decay to avoid overflow and adapt to changing workloads. > - * This is based on LRU reclaim cost decaying heuristics. > - */ > - do { > - new = old > lru_size / 4 ? 
old / 2 : old; > - } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new)); > rcu_read_unlock(); > } > > @@ -752,7 +739,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry) > > void zswap_lruvec_state_init(struct lruvec *lruvec) > { > - atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0); > + atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0); > } > > void zswap_folio_swapin(struct folio *folio) > @@ -761,7 +748,7 @@ void zswap_folio_swapin(struct folio *folio) > > if (folio) { > lruvec = folio_lruvec(folio); > - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected); > + atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins); > } > } > > @@ -1091,6 +1078,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o > enum lru_status ret = LRU_REMOVED_RETRY; > int writeback_result; > > + /* > + * Second chance algorithm: if the entry has its reference bit set, give it > + * a second chance. Only clear the reference bit and rotate it in the > + * zswap's LRU list. > + */ > + if (entry->referenced) { > + entry->referenced = false; > + return LRU_ROTATE; > + } > + > /* > * As soon as we drop the LRU lock, the entry can be freed by > * a concurrent invalidation. 
This means the following: > @@ -1157,8 +1154,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o > static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > struct shrink_control *sc) > { > - struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); > - unsigned long shrink_ret, nr_protected, lru_size; > + unsigned long shrink_ret; > bool encountered_page_in_swapcache = false; > > if (!zswap_shrinker_enabled || > @@ -1167,25 +1163,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > return SHRINK_STOP; > } > > - nr_protected = > - atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); > - lru_size = list_lru_shrink_count(&zswap_list_lru, sc); > - > - /* > - * Abort if we are shrinking into the protected region. > - * > - * This short-circuiting is necessary because if we have too many multiple > - * concurrent reclaimers getting the freeable zswap object counts at the > - * same time (before any of them made reasonable progress), the total > - * number of reclaimed objects might be more than the number of unprotected > - * objects (i.e the reclaimers will reclaim into the protected area of the > - * zswap LRU). 
> - */ > - if (nr_protected >= lru_size - sc->nr_to_scan) { > - sc->nr_scanned = 0; > - return SHRINK_STOP; > - } > - > shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb, > &encountered_page_in_swapcache); > > @@ -1200,7 +1177,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > { > struct mem_cgroup *memcg = sc->memcg; > struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); > - unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > + atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins; > + unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain; > > if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) > return 0; > @@ -1233,14 +1211,26 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > if (!nr_stored) > return 0; > > - nr_protected = > - atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); > - nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc); > + lru_size = list_lru_shrink_count(&zswap_list_lru, sc); > + if (!lru_size) > + return 0; > + > /* > - * Subtract the lru size by an estimate of the number of pages > - * that should be protected. > + * Subtract the lru size by the number of pages that are recently swapped > + * in. The idea is that had we protect the zswap's LRU by this amount of > + * pages, these swap in would not have happened. > */ > - nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0; > + nr_swapins_cur = atomic_long_read(nr_swapins); > + do { > + if (lru_size >= nr_swapins_cur) > + nr_remain = 0; > + else > + nr_remain = nr_swapins_cur - lru_size; > + } while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain)); > + > + lru_size -= nr_swapins_cur - nr_remain; > + if (!lru_size) > + return 0; > > /* > * Scale the number of freeable pages by the memory saving factor. 
> @@ -1253,7 +1243,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > * space. Hence, we may scale nr_freeable down a little bit more than we > * should if we have a lot of same-filled pages. > */ > - return mult_frac(nr_freeable, nr_backing, nr_stored); > + return mult_frac(lru_size, nr_backing, nr_stored); > } > > static struct shrinker *zswap_alloc_shrinker(void) > -- > 2.43.0 >
On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Current zswap shrinker's heursitics to prevent overshrinking is brittle > > and inaccurate, specifically in the way we decay the protection size > > (i.e making pages in the zswap LRU eligible for reclaim). > > Thanks for working on this and experimenting with different > heuristics. I was not a huge fan of these, so I am glad we are trying > to replace them with something more intuitive. That is certainly the intention :) I'm not a huge fan of those heuristics either - they seem fairly brittle and arbitrary to me even back then (as you have pointed out). > > > > > We currently decay protection aggressively in zswap_lru_add() calls. > > This leads to the following unfortunate effect: when a new batch of > > pages enter zswap, the protection size rapidly decays to below 25% of > > the zswap LRU size, which is way too low. > > > > We have observed this effect in production, when experimenting with the > > zswap shrinker: the rate of shrinking shoots up massively right after a > > new batch of zswap stores. This is somewhat the opposite of what we want > > originally - when new pages enter zswap, we want to protect both these > > new pages AND the pages that are already protected in the zswap LRU. > > > > Replace existing heuristics with a second chance algorithm > > > > 1. When a new zswap entry is stored in the zswap pool, its reference bit > > is set. > > 2. When the zswap shrinker encounters a zswap entry with the reference > > bit set, give it a second chance - only flips the reference bit and > > rotate it in the LRU. > > 3. If the shrinker encounters the entry again, this time with its > > reference bit unset, then it can reclaim the entry. > > At the first look, this is similar to the reclaim algorithm. A > fundamental difference here is that the reference bit is only set > once, when the entry is created. 
It is different from the conventional > second chance page reclaim/replacement algorithm. I suppose, yeah. We no longer have non-exclusive loading (in most scenarios), so the reference bit is only set once, on store attempt. The interpretation/justification is still there though (somewhat). We are still giving each object another "chance" to stay in the zswap pool, in case it is needed soon, before it is reclaimed. > Taking a step back, what we really want is to writeback zswap entries > in order, and avoid writing back more entries than needed. I think the > key here is "when needed", which is defined by how much memory > pressure we have. The shrinker framework should already be taking this > into account. > > Looking at do_shrink_slab(), in the case of zswap (seek = 2), > total_scan should boil down to: > > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority > > , and this is bounded by zswap_shrinker_count() * 2. > > Ignoring nr_deferred, we start by scanning 1/2048th of > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * > zswap_shrinker_count() at zero priority (before OOMs). At > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count(). > > Keep in mind that zswap_shrinker_count() does not return the number of > all zswap entries, it subtracts the protected part (or recent swapins) Note that this "protection" decays rather aggressively with zswap stores. From zswap_lru_add(): new = old > lru_size / 4 ? old / 2 : old; IOW, if the protection size exceeds 25% lru_size, half it. A couple of zswap_stores() could easily slash this to 25% of the LRU (or even below) rapidly. I guess I can fiddle with this decaying attempt + decouple the decaying from storing (i.e moving it somewhere else). But any formula I can come up with (another multiplicative factor?) seems as arbitrary as this formula, and any places that I place the decaying could potentially couple two actions, which lead to unintended effect. 
The second chance algorithm creates a similar two-section LRU configuration as the old scheme - protected/unprotected (or, analogous to the page reclaim algorithm, active/inactive). However, it bypasses the need for such an artificially constructed formula, which you can think of as the rate of objects aging. It naturally ties the aging of objects (i.e moving the objects from one section to another) in the zswap pool with memory pressure, which makes sense to me, FWIW. > and scales by the compression ratio. So this looks reasonable at first > sight, perhaps we want to tune the seek to slow down writeback if we > think it's too much, but that doesn't explain the scenario you are > describing. > > Now let's factor in nr_deferred, which looks to me like it could be > the culprit here. I am assuming the intention is that if we counted > freeable slab objects before but didn't get to free them, we should do > it the next time around. This feels like it assumes that the objects > will remain there unless reclaimed by the shrinker. This does not > apply for zswap, because the objects can be swapped in. > > Also, in the beginning, before we encounter too many swapins, the > protection will be very low, so zswap_shrinker_count() will return a > relatively high value. Even if we don't scan and writeback this > amount, we will keep carrying this value forward in next reclaim > operations, even if the number of existing zswap entries have > decreased due to swapins. > > Could this be the problem? The number of deferred objects to be > scanned just keeps going forward as a high value, essentially > rendering the heuristics in zswap_shrinker_count() useless? Interesting theory. I wonder if I can do some tracing to investigate this (or maybe just spam trace_printk() lol). But yeah, this nr_deferred part seems suspicious. 
The number of freeable objects returned by the shrinker_count() function (at least the one implemented by zswap) doesn't really track which object is newly freeable since the last shrinker_count() invocation. So an object can be accounted for in a previous invocation, fail to be reclaimed in that invocation yet still count towards future shrinker invocation? That sounds wrong :) Even without zswapin, that's still double counting, no? Future shrinker_count() call will still consider the previously-failed-to-reclaim object. Do other shrinkers somehow track the objects that have been accounted for in previous shrinker attempts, and use this in the freeable computation? That said, how bad it is in practice depends on the rate of reclaim failure, since the nr_deferred only increments when we fail to scan. Which is actually not very high - for instance, this is the statistics from a couple of my benchmark runs (using the old zswap shrinker protection scheme, i.e the status quo): /sys/kernel/debug/zswap/reject_reclaim_fail:292 /sys/kernel/debug/zswap/written_back_pages:6947364 IIRC, exclusive loading really lowers the number of reclaim rejection (you're less likely to observe a page that is already loaded into memory and in swap cache - its zswap entry is already invalidated). This could be a problem, and probably should be fixed (for instance, by adding a no-defer mode for shrinkers similar to zswap), but there seems to be an additional/orthogonal reason for overshrinking. > > If we just need to slow down writeback by making sure we scan entries > twice, could something similar be achieved just by tuning the seek > without needing any heuristics to begin with? Ah but this is a good point. My only defence for not using seek would be I have no clue how to interpret the seek value :) Based on the documentation: int seeks; /* seeks to recreate an obj */ So I'm guessing this sorta kinda represents the number of disk seeks (or IO?) required to reconstruct the object. 
Which makes sense in the special case of 0. But it seems really arbitrary otherwise - DEFAULT_SEEKS is 2, but zswap does not really take 2 seeks/IOs to reconstruct the object, no? It might have been different in the past - I don't know the historical context here. But now, I guess this is just a shorthand for writeback pace. The second chance algorithm has a (nominal) justification that backs it, even though in effect it's the same. But yeah, if we only think about actual effects, perhaps we don't even need the reference bit manipulation - just cutting the writeback pace in half is good enough. There might be some differences with non-exclusive loading, but now that's gone (for the most part). The other difference I can think of is in the aging-reclaiming ordering:

1. With the reference bit scheme, shrinkers will age the entire LRU before reclaiming anything (due to LRU rotation).
2. With an increased seek value, we will start reclaiming right away (albeit at half of our old pace).

Average reclaim rate should stay the same, but the former gives the objects in the zswap pool more chance to be loaded into main memory at the beginning. Or maybe I'm overthinking this and it doesn't really matter - benchmarking time? :)

> I am just trying to understand if the main problem is that zswap does
> not fit well into the shrinker framework as it is, and how we can
> improve this.

Agree. Another potential problem is that under highly concurrent settings, we can have many shrinker instances hitting the same zswap LRU. And since the protection mechanism is racy/best-effort at best, we can easily write back well into the protection area. I was seriously considering physically separating the protected (active) and unprotected (inactive) LRUs for zswap, but the second chance algorithm seems a natural way to implement this separation, and plays quite well with the current shrinker framework - aging and reclaiming are all serialized via the lru lock.
> Just to be clear, I am in favor of changing those heuristics to > something more intuitive and simpler, but I really want to understand > what is going on. The approach taken by this patch is definitely > simpler, but it doesn't feel more intuitive to me (at least not yet). Thanks, yeah I agree. I was actually thinking that the second chance algorithm would be a bit more natural than some protection decaying formulae I pull out of thin air, but perhaps there are still problems with it. > If we take this approach, this needs to be placed in the hole after > the length, to avoid increasing the size of the zswap_entry. Very good point - I was actually thinking about this the other day, but somehow forgot to fix it before submission :) I'll fix this.
On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > Current zswap shrinker's heursitics to prevent overshrinking is brittle > > > and inaccurate, specifically in the way we decay the protection size > > > (i.e making pages in the zswap LRU eligible for reclaim). > > > > Thanks for working on this and experimenting with different > > heuristics. I was not a huge fan of these, so I am glad we are trying > > to replace them with something more intuitive. > > That is certainly the intention :) I'm not a huge fan of those > heuristics either - they seem fairly brittle and arbitrary to me even > back then (as you have pointed out). > > > > > > > > > We currently decay protection aggressively in zswap_lru_add() calls. > > > This leads to the following unfortunate effect: when a new batch of > > > pages enter zswap, the protection size rapidly decays to below 25% of > > > the zswap LRU size, which is way too low. > > > > > > We have observed this effect in production, when experimenting with the > > > zswap shrinker: the rate of shrinking shoots up massively right after a > > > new batch of zswap stores. This is somewhat the opposite of what we want > > > originally - when new pages enter zswap, we want to protect both these > > > new pages AND the pages that are already protected in the zswap LRU. > > > > > > Replace existing heuristics with a second chance algorithm > > > > > > 1. When a new zswap entry is stored in the zswap pool, its reference bit > > > is set. > > > 2. When the zswap shrinker encounters a zswap entry with the reference > > > bit set, give it a second chance - only flips the reference bit and > > > rotate it in the LRU. > > > 3. If the shrinker encounters the entry again, this time with its > > > reference bit unset, then it can reclaim the entry. 
> > > > At the first look, this is similar to the reclaim algorithm. A > > fundamental difference here is that the reference bit is only set > > once, when the entry is created. It is different from the conventional > > second chance page reclaim/replacement algorithm. > > I suppose, yeah. We no longer have non-exclusive loading (in most > scenarios), so the reference bit is only set once, on store attempt. > > The interpretation/justification is still there though (somewhat). We > are still giving each object another "chance" to stay in the zswap > pool, in case it is needed soon, before it is reclaimed. > > > Taking a step back, what we really want is to writeback zswap entries > > in order, and avoid writing back more entries than needed. I think the > > key here is "when needed", which is defined by how much memory > > pressure we have. The shrinker framework should already be taking this > > into account. > > > > Looking at do_shrink_slab(), in the case of zswap (seek = 2), > > total_scan should boil down to: > > > > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority > > > > , and this is bounded by zswap_shrinker_count() * 2. > > > > Ignoring nr_deferred, we start by scanning 1/2048th of > > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * > > zswap_shrinker_count() at zero priority (before OOMs). At > > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count(). > > > > Keep in mind that zswap_shrinker_count() does not return the number of > > all zswap entries, it subtracts the protected part (or recent swapins) > > Note that this "protection" decays rather aggressively with zswap > stores. From zswap_lru_add(): > > new = old > lru_size / 4 ? old / 2 : old; > > IOW, if the protection size exceeds 25% lru_size, half it. A couple of > zswap_stores() could easily slash this to 25% of the LRU (or even > below) rapidly. 
> > I guess I can fiddle with this decaying attempt + decouple the > decaying from storing (i.e moving it somewhere else). But any formula > I can come up with (another multiplicative factor?) seems as arbitrary > as this formula, and any places that I place the decaying could > potentially couple two actions, which lead to unintended effect. > > The second chance algorithm creates a similar two-section LRU > configuration as the old scheme - protected/unprotected (or, analogous > to the page reclaim algorithm, active/inactive). However, it bypasses > the need for such an artificially constructed formula, which you can > think of as the rate of objects aging. It naturally ties the aging of > objects (i.e moving the objects from one section to another) in the > zswap pool with memory pressure, which makes sense to me, FWIW. > > > and scales by the compression ratio. So this looks reasonable at first > > sight, perhaps we want to tune the seek to slow down writeback if we > > think it's too much, but that doesn't explain the scenario you are > > describing. > > > > Now let's factor in nr_deferred, which looks to me like it could be > > the culprit here. I am assuming the intention is that if we counted > > freeable slab objects before but didn't get to free them, we should do > > it the next time around. This feels like it assumes that the objects > > will remain there unless reclaimed by the shrinker. This does not > > apply for zswap, because the objects can be swapped in. > > > > Also, in the beginning, before we encounter too many swapins, the > > protection will be very low, so zswap_shrinker_count() will return a > > relatively high value. Even if we don't scan and writeback this > > amount, we will keep carrying this value forward in next reclaim > > operations, even if the number of existing zswap entries have > > decreased due to swapins. > > > > Could this be the problem? 
The number of deferred objects to be > > scanned just keeps going forward as a high value, essentially > > rendering the heuristics in zswap_shrinker_count() useless? > > Interesting theory. I wonder if I can do some tracing to investigate > this (or maybe just spam trace_printk() lol). OK, I found this trace: sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable I'll give this a shot, and report any interesting findings in the data collected :) > > But yeah, this nr_deferred part seems suspicious. The number of > freeable objects returned by the shrinker_count() function (at least > the one implemented by zswap) doesn't really track which object is > newly freeable since the last shrinker_count() invocation. So an > object can be accounted for in a previous invocation, fail to be > reclaimed in that invocation yet still count towards future shrinker > invocation? That sounds wrong :) Even without zswapin, that's still > double counting, no? Future shrinker_count() call will still consider > the previously-failed-to-reclaim object. Do other shrinkers somehow > track the objects that have been accounted for in previous shrinker > attempts, and use this in the freeable computation? > > That said, how bad it is in practice depends on the rate of reclaim > failure, since the nr_deferred only increments when we fail to scan. > Which is actually not very high - for instance, this is the statistics > from a couple of my benchmark runs (using the old zswap shrinker > protection scheme, i.e the status quo): > > /sys/kernel/debug/zswap/reject_reclaim_fail:292 > /sys/kernel/debug/zswap/written_back_pages:6947364 > > IIRC, exclusive loading really lowers the number of reclaim rejection > (you're less likely to observe a page that is already loaded into > memory and in swap cache - its zswap entry is already invalidated). 
>
Actually, on closer inspection, we also increase the deferred count when we short-circuit in each of the following scenarios:

a) When we disable zswap shrinker or zswap writeback in general. I don't think this matters too much in practice - we're not reclaiming here anyway :)

b) When we detect that there are fewer unprotected objects than the batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Current zswap shrinker's heursitics to prevent overshrinking is brittle and inaccurate, specifically in the way we decay the protection size (i.e making pages in the zswap LRU eligible for reclaim).
>
> Thanks for working on this and experimenting with different heuristics. I was not a huge fan of these, so I am glad we are trying to replace them with something more intuitive.
>
> > We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit is set.
> > 2. When the zswap shrinker encounters a zswap entry with the reference bit set, give it a second chance - only flips the reference bit and rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its reference bit unset, then it can reclaim the entry.
>
> At the first look, this is similar to the reclaim algorithm. A fundamental difference here is that the reference bit is only set once, when the entry is created. It is different from the conventional second chance page reclaim/replacement algorithm.
>
> What this really does, is that it slows down writeback by enforcing that we need to iterate entries exactly twice before we write them back. This sounds a little arbitrary and not very intuitive to me.

This isn't different than other second chance algorithms. Those usually set the reference bit again to buy the entry another round. In our case, another reference causes a zswapin, which removes the entry from the list - buying it another round. Entries will get reclaimed once the scan rate catches up with the longest reuse distance.

The main goal, which was also the goal of the protection math, is to slow down writebacks in proportion to new entries showing up. This gives zswap a chance to solve memory pressure through compression. If memory pressure persists, writeback should pick up.

If no new entries were to show up, then sure, this would be busy work. In practice, new entries do show up at a varying rate. This all happens in parallel to anon reclaim, after all. The key here is that new entries will be interleaved with rotated entries, and they consume scan work! This is what results in the proportional slowdown.

> Taking a step back, what we really want is to writeback zswap entries in order, and avoid writing back more entries than needed. I think the key here is "when needed", which is defined by how much memory pressure we have. The shrinker framework should already be taking this into account.
>
> Looking at do_shrink_slab(), in the case of zswap (seek = 2), total_scan should boil down to:
>
> total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
>
> , and this is bounded by zswap_shrinker_count() * 2.
>
> Ignoring nr_deferred, we start by scanning 1/2048th of zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * zswap_shrinker_count() at zero priority (before OOMs). At NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
>
> Keep in mind that zswap_shrinker_count() does not return the number of all zswap entries, it subtracts the protected part (or recent swapins) and scales by the compression ratio. So this looks reasonable at first sight, perhaps we want to tune the seek to slow down writeback if we think it's too much, but that doesn't explain the scenario you are describing.
>
> Now let's factor in nr_deferred, which looks to me like it could be the culprit here. I am assuming the intention is that if we counted freeable slab objects before but didn't get to free them, we should do it the next time around. This feels like it assumes that the objects will remain there unless reclaimed by the shrinker. This does not apply for zswap, because the objects can be swapped in.

Hm.

_count() returns (objects - protected) * compression_rate, then the shrinker does the >> priority dance. So to_scan is expected to be a small portion of unprotected objects.

_scan() bails if to_scan > (objects - protected).

How often does this actually abort in practice?

> Also, in the beginning, before we encounter too many swapins, the protection will be very low, so zswap_shrinker_count() will return a relatively high value. Even if we don't scan and writeback this amount, we will keep carrying this value forward in next reclaim operations, even if the number of existing zswap entries have decreased due to swapins.
>
> Could this be the problem? The number of deferred objects to be scanned just keeps going forward as a high value, essentially rendering the heuristics in zswap_shrinker_count() useless?
>
> If we just need to slow down writeback by making sure we scan entries twice, could something similar be achieved just by tuning the seek without needing any heuristics to begin with?

Seek is a fixed coefficient for the scan rate.
We want to slow writeback when recent zswapouts dominate the zswap pool (expanding or thrashing), and speed it up when recent entries make up a small share of the pool (stagnating). This is what the second chance accomplishes.
On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap pool (expanding or thrashing), and speed it up when recent entries make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Wow, this is something that I did not even consider. Thanks for pointing this out, Johannes.

shrinker->seeks tuning allows you to adjust writeback pacing, as a ratio of the pool size. When the pool is static (no/few zswpin or zswpout), then these two are similar (on average). But with concurrent activities (pages coming in and out), the dynamics can potentially be different.

Second chance allows you to have different dynamics depending on recent pool activities. The recent zswpouts will be protected by virtue of the reference bits (and given another chance, which will be taken if they are used again soon), and the pages concurrently zswapped in obviously will be too, whereas the stale objects that have already been touched by the shrinker once in the past will be evicted immediately. IOW, all of the above activities (zswpin, zswpout, reclaim pressure) can harmonize seamlessly to adjust the effective rate of writeback.

Without any additional heuristics (old or new), increasing seek (i.e. decreasing the writeback rate) by itself only has a static effect, and definitely does not accomplish the aforementioned dynamic writeback rate adjustment.

Now, we can (and did try to) mimic the above behavior somewhat with the protection size scheme: only return the unprotected size, carefully increase it on zswpout and swpin (so that zswpouts are not considered), carefully prevent the shrinker from reclaiming into the protected area, etc. But it's incredibly brittle - with all these hacks, it becomes even more complicated and unintuitive than the second chance algorithm. If it's performing well, then sure, but it's not.
Might as well do the simpler thing? :)

Besides, the problem with the haphazard aging (i.e. protection decaying) remains - at which point do we decay, and how much do we decay? Compare this with the new second chance scheme, which gives you a natural aging mechanism that can elegantly adjust itself with reclaim/memory pressure.
On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Current zswap shrinker's heursitics to prevent overshrinking is brittle and inaccurate, specifically in the way we decay the protection size (i.e making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different heuristics. I was not a huge fan of these, so I am glad we are trying to replace them with something more intuitive.
> >
> > > We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU.
> > >
> > > Replace existing heuristics with a second chance algorithm
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference bit is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference bit set, give it a second chance - only flips the reference bit and rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its reference bit unset, then it can reclaim the entry.
> >
> > At the first look, this is similar to the reclaim algorithm. A fundamental difference here is that the reference bit is only set once, when the entry is created. It is different from the conventional second chance page reclaim/replacement algorithm.
> >
> > What this really does, is that it slows down writeback by enforcing that we need to iterate entries exactly twice before we write them back. This sounds a little arbitrary and not very intuitive to me.
>
> This isn't different than other second chance algorithms. Those usually set the reference bit again to buy the entry another round. In our case, another reference causes a zswapin, which removes the entry from the list - buying it another round. Entries will get reclaimed once the scan rate catches up with the longest reuse distance.
>
> The main goal, which was also the goal of the protection math, is to slow down writebacks in proportion to new entries showing up. This gives zswap a chance to solve memory pressure through compression. If memory pressure persists, writeback should pick up.
>
> If no new entries were to show up, then sure, this would be busy work. In practice, new entries do show up at a varying rate. This all happens in parallel to anon reclaim, after all. The key here is that new entries will be interleaved with rotated entries, and they consume scan work! This is what results in the proportional slowdown.

The fact that in practice there will be new entries showing up because we are doing anon reclaim is exactly what I had in mind. My point is that in practice, for most cases, the reference bit could just be causing a steady slowdown, which can be achieved just by tuning the seek. But I have a better idea now about what you mean by the proportional slowdown. (more below)

> > Taking a step back, what we really want is to writeback zswap entries in order, and avoid writing back more entries than needed. I think the key here is "when needed", which is defined by how much memory pressure we have. The shrinker framework should already be taking this into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2), total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * zswap_shrinker_count() at zero priority (before OOMs). At NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of all zswap entries, it subtracts the protected part (or recent swapins) and scales by the compression ratio. So this looks reasonable at first sight, perhaps we want to tune the seek to slow down writeback if we think it's too much, but that doesn't explain the scenario you are describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be the culprit here. I am assuming the intention is that if we counted freeable slab objects before but didn't get to free them, we should do it the next time around. This feels like it assumes that the objects will remain there unless reclaimed by the shrinker. This does not apply for zswap, because the objects can be swapped in.
>
> Hm.
>
> _count() returns (objects - protected) * compression_rate, then the shrinker does the >> priority dance. So to_scan is expected to be a small portion of unprotected objects.
>
> _scan() bails if to_scan > (objects - protected).
>
> How often does this actually abort in practice?

Good question :)

> > Also, in the beginning, before we encounter too many swapins, the protection will be very low, so zswap_shrinker_count() will return a relatively high value. Even if we don't scan and writeback this amount, we will keep carrying this value forward in next reclaim operations, even if the number of existing zswap entries have decreased due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be scanned just keeps going forward as a high value, essentially rendering the heuristics in zswap_shrinker_count() useless?
> >
> > If we just need to slow down writeback by making sure we scan entries twice, could something similar be achieved just by tuning the seek without needing any heuristics to begin with?
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap pool (expanding or thrashing), and speed it up when recent entries make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Yeah I am more convinced now. I wanted to check if just tuning the seek achieves a similar result, but the complexity from the reference bit is not much anyway -- so maybe it isn't worth it.

I still think the nr_deferred handling in the shrinker framework does not seem compatible with zswap. Entries could be double counted, and previously counted entries could go away during swapins. Unless I am missing something, it seems like the scanning can be arbitrary, and not truly proportional to the reclaim priority and zswap LRU size.

I also think it's important to clarify that there are two mechanisms that control the writeback rate with this patch: the reference bit proportional slowdown (protecting against writeback of recently swapped out pages), and the nr_swapins protection (feedback from swapins of recently written back pages).

Maybe we can move things around to make it more obvious how these mechanisms work hand in hand, or have a comment somewhere describing the writeback mechanism.
On Tue, Jul 30, 2024 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I still think the nr_deferred handling in the shrinker framework does not seem compatible with zswap. Entries could be double counted, and previously counted entries could go away during swapins. Unless I am missing something, it seems like the scanning can be arbitrary, and not truly proportional to the reclaim priority and zswap LRUs size.

We can do some investigation on the side. I did find the trace point that collects the deferred numbers - maybe that can illustrate the problem?

It'd be an orthogonal issue though. I don't think fixing the nr_deferred part would be enough to handle the various cases Johannes pointed out.
On Tue, Jul 30, 2024 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I also think it's important to clarify that there are two mechanisms that control the writeback rate with this patch, the reference bit proportional slowdown (protecting against writeback of recently swapped out pages), and nr_swapins protection (feedback from swapins of recently written back pages).

Regarding this - perhaps I wasn't being clear, but I did include both in the changelog. The latter is this paragraph:

"We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swapped them in."

I also replicate this comment in the nr_swapins explanation - see struct zswap_lruvec_state. The same goes for the second chance algorithm - I did briefly explain it in shrink_memcg_cb(), as well as in the changelog.

> Maybe we can move things around to make it more obvious how these mechanisms work hand in hand, or have a comment somewhere describing the writeback mechanism.

Hmm, let me add some documentation for this near the zswap_shrinker_{scan|count}() definitions:
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 6cecb4a4f68b..b94b6ae262d5 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
 
 struct zswap_lruvec_state {
 	/*
-	 * Number of pages in zswap that should be protected from the shrinker.
-	 * This number is an estimate of the following counts:
+	 * Number of swapped in pages, i.e not found in the zswap pool.
 	 *
-	 * a) Recent page faults.
-	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
-	 *    as well as recent zswap LRU rotations.
-	 *
-	 * These pages are likely to be warm, and might incur IO if the are written
-	 * to swap.
+	 * This is consumed and subtracted from the lru size in
+	 * zswap_shrinker_count() to penalize past overshrinking that led to disk
+	 * swapins. The idea is that had we considered this many more pages in the
+	 * LRU active/protected and not written them back, we would not have had to
+	 * swapped them in.
 	 */
-	atomic_long_t nr_zswap_protected;
+	atomic_long_t nr_swapins;
 };
 
 unsigned long zswap_total_pages(void);
diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..a24ee015d7bc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -203,6 +203,7 @@ struct zswap_entry {
 	};
 	struct obj_cgroup *objcg;
 	struct list_head lru;
+	bool referenced;
 };
 
 static struct xarray *zswap_trees[MAX_SWAPFILES];
@@ -700,11 +701,10 @@ static inline int entry_to_nid(struct zswap_entry *entry)
 
 static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 {
-	atomic_long_t *nr_zswap_protected;
-	unsigned long lru_size, old, new;
 	int nid = entry_to_nid(entry);
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
+
+	entry->referenced = true;
 
 	/*
 	 * Note that it is safe to use rcu_read_lock() here, even in the face of
@@ -722,19 +722,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
 	memcg = mem_cgroup_from_entry(entry);
 	/* will always succeed */
 	list_lru_add(list_lru, &entry->lru, nid, memcg);
-
-	/* Update the protection area */
-	lru_size = list_lru_count_one(list_lru, nid, memcg);
-	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-	nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
-	old = atomic_long_inc_return(nr_zswap_protected);
-	/*
-	 * Decay to avoid overflow and adapt to changing workloads.
-	 * This is based on LRU reclaim cost decaying heuristics.
-	 */
-	do {
-		new = old > lru_size / 4 ? old / 2 : old;
-	} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
 	rcu_read_unlock();
 }
 
@@ -752,7 +739,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
 
 void zswap_lruvec_state_init(struct lruvec *lruvec)
 {
-	atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
+	atomic_long_set(&lruvec->zswap_lruvec_state.nr_swapins, 0);
 }
 
 void zswap_folio_swapin(struct folio *folio)
@@ -761,7 +748,7 @@ void zswap_folio_swapin(struct folio *folio)
 
 	if (folio) {
 		lruvec = folio_lruvec(folio);
-		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
+		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_swapins);
 	}
 }
 
@@ -1091,6 +1078,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	enum lru_status ret = LRU_REMOVED_RETRY;
 	int writeback_result;
 
+	/*
+	 * Second chance algorithm: if the entry has its reference bit set, give it
+	 * a second chance. Only clear the reference bit and rotate it in the
+	 * zswap's LRU list.
+	 */
+	if (entry->referenced) {
+		entry->referenced = false;
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * As soon as we drop the LRU lock, the entry can be freed by
 	 * a concurrent invalidation. This means the following:
@@ -1157,8 +1154,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
-	unsigned long shrink_ret, nr_protected, lru_size;
+	unsigned long shrink_ret;
 	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
@@ -1167,25 +1163,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
-
-	/*
-	 * Abort if we are shrinking into the protected region.
-	 *
-	 * This short-circuiting is necessary because if we have too many multiple
-	 * concurrent reclaimers getting the freeable zswap object counts at the
-	 * same time (before any of them made reasonable progress), the total
-	 * number of reclaimed objects might be more than the number of unprotected
-	 * objects (i.e the reclaimers will reclaim into the protected area of the
-	 * zswap LRU).
-	 */
-	if (nr_protected >= lru_size - sc->nr_to_scan) {
-		sc->nr_scanned = 0;
-		return SHRINK_STOP;
-	}
-
 	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
@@ -1200,7 +1177,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 {
 	struct mem_cgroup *memcg = sc->memcg;
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
-	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
+	atomic_long_t *nr_swapins = &lruvec->zswap_lruvec_state.nr_swapins;
+	unsigned long nr_backing, nr_stored, lru_size, nr_swapins_cur, nr_remain;
 
 	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
@@ -1233,14 +1211,26 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!nr_stored)
 		return 0;
 
-	nr_protected =
-		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
+	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
+	if (!lru_size)
+		return 0;
+
 	/*
-	 * Subtract the lru size by an estimate of the number of pages
-	 * that should be protected.
+	 * Subtract the lru size by the number of pages that are recently swapped
+	 * in. The idea is that had we protect the zswap's LRU by this amount of
+	 * pages, these swap in would not have happened.
 	 */
-	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
+	nr_swapins_cur = atomic_long_read(nr_swapins);
+	do {
+		if (lru_size >= nr_swapins_cur)
+			nr_remain = 0;
+		else
+			nr_remain = nr_swapins_cur - lru_size;
+	} while (!atomic_long_try_cmpxchg(nr_swapins, &nr_swapins_cur, nr_remain));
+
+	lru_size -= nr_swapins_cur - nr_remain;
+	if (!lru_size)
+		return 0;
 
 	/*
 	 * Scale the number of freeable pages by the memory saving factor.
@@ -1253,7 +1243,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	 * space. Hence, we may scale nr_freeable down a little bit more than we
 	 * should if we have a lot of same-filled pages.
 	 */
-	return mult_frac(nr_freeable, nr_backing, nr_stored);
+	return mult_frac(lru_size, nr_backing, nr_stored);
 }
 
 static struct shrinker *zswap_alloc_shrinker(void)
Current zswap shrinker's heuristics to prevent overshrinking are brittle and inaccurate, specifically in the way we decay the protection size (i.e. making pages in the zswap LRU eligible for reclaim).

We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low.

We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU.

Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its reference bit is set.
2. When the zswap shrinker encounters a zswap entry with the reference bit set, give it a second chance - only flip the reference bit and rotate it in the LRU.
3. If the shrinker encounters the entry again, this time with its reference bit unset, then it can reclaim the entry.

In this manner, the aging of the pages in the zswap LRUs is decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want).

We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swap them in.
To test this new heuristic, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores:

With the old shrinker:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

With the second chance algorithm:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

(average over 5 runs)

We observe a 1.3% reduction in kernel CPU usage, and around a 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/zswap.h | 16 ++++-----
 mm/zswap.c            | 84 +++++++++++++++++++------------------------
 2 files changed, 44 insertions(+), 56 deletions(-)