Message ID | 20240312-zswap-xarray-v6-1-1b82027d7082@kernel.org |
---|---|
State | New |
Series | [v6] zswap: replace RB tree with xarray |
On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
>
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
>
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
>
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
>
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
>
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
>
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
>
>         mm-9a0181a3710eb    xarray v5
> user    3532.385            3535.658
> sys     536.231             530.083
> real    200.431             200.176

This is a great improvement code and complexity wise.

I have a few questions and comments below:

What kernel version is this based on? It doesn't apply to
mm-everything, and I can't find 9a0181a3710eb anywhere.

> @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>          entry->swpentry = swp;
>          entry->objcg = objcg;
> -        if (objcg) {
> -                obj_cgroup_charge_zswap(objcg, entry->length);
> -                /* Account before objcg ref is moved to tree */
> -                count_objcg_event(objcg, ZSWPOUT);
> -        }
>
> -        /* map */
> -        spin_lock(&tree->lock);
>          /*
>           * The folio may have been dirtied again, invalidate the
>           * possibly stale entry before inserting the new entry.
>           */

The comment is now somewhat stale and somewhat out of place. It should
be above that `if (old)` part... See below.

> -        if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -                zswap_invalidate_entry(tree, dupentry);
> -                WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +        old = xa_store(tree, offset, entry, GFP_KERNEL);
> +        if (xa_is_err(old)) {
> +                int err = xa_err(old);
> +                if (err == -ENOMEM)
> +                        zswap_reject_alloc_fail++;
> +                else
> +                        WARN_ONCE(err, "%s: xa_store failed: %d\n",
> +                                  __func__, err);
> +                goto store_failed;

No need to complicate it. If we have a bug there, an incorrect fail
stat bump is the least of our concerns. Also, no need for __func__
since that information is included in the WARN:

	if (xa_is_err(old)) {
		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

I think here is where that comment above should go:

	/*
	 * We may have had an existing entry that became stale when
	 * the folio was redirtied and now the new version is being
	 * swapped out. Get rid of the old.
	 */
> +        if (old)
> +                zswap_entry_free(old);
> +
> +        if (objcg) {
> +                obj_cgroup_charge_zswap(objcg, entry->length);
> +                /* Account before objcg ref is moved to tree */
> +                count_objcg_event(objcg, ZSWPOUT);
>          }
> +
>          if (entry->length) {
>                  INIT_LIST_HEAD(&entry->lru);
>                  zswap_lru_add(&zswap.list_lru, entry);
>                  atomic_inc(&zswap.nr_stored);
>          }
> -        spin_unlock(&tree->lock);

We previously relied on the tree lock to finish initializing the entry
while it's already in tree. Now we rely on something else:

1. Concurrent stores and invalidations are excluded by folio lock.

2. Writeback is excluded by the entry not being on the LRU yet.
   The publishing order matters to prevent writeback from seeing
   an incoherent entry.

I think this deserves a comment.

>          /* update stats */
>          atomic_inc(&zswap_stored_pages);
> @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
>
>          return true;
>
> +store_failed:
> +        if (!entry->length) {
> +                atomic_dec(&zswap_same_filled_pages);
> +                goto freepage;
> +        }

It'd be good to avoid the nested goto. Why not make the pool
operations conditional on entry->length instead:

store_failed:
	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		zpool_free(zswap_find_zpool(...));
put_pool:
		zswap_pool_put(entry->pool);
	}
freepage:

Not super pretty either, but it's a linear flow at least.
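For reference, a rough sketch of what the insertion path looks like with
both of the above suggestions folded in. This only assembles the snippets
quoted in this review (names and labels follow the posted patch); it is an
illustration of the proposed shape, not the code that was actually posted:

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		int err = xa_err(old);

		/* -ENOMEM is the only error expected from xa_store() here */
		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

	/*
	 * We may have had an existing entry that became stale when
	 * the folio was redirtied and now the new version is being
	 * swapped out. Get rid of the old.
	 */
	if (old)
		zswap_entry_free(old);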
Hi Johannes,

On Tue, Mar 12, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will reduce a bit due to removing the
> > RB node, which has two pointers and a color field. Xarray
> > store the pointer in the xarray tree rather than the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory,
> > if the swap entries are densely packed.
> >
> > Notice the zswap_rb_search and zswap_rb_insert always
> > followed by zswap_rb_erase. Use xa_erase and xa_store
> > directly. That saves one tree lookup as well.
> >
> > Remove zswap_invalidate_entry due to no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build testing 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled,
> > one 50GB swapfile, 24 HT core, 32 jobs)
> >
> >         mm-9a0181a3710eb    xarray v5
> > user    3532.385            3535.658
> > sys     536.231             530.083
> > real    200.431             200.176
>
> This is a great improvement code and complexity wise.

Thanks!

>
> I have a few questions and comments below:
>
> What kernel version is this based on? It doesn't apply to
> mm-everything, and I can't find 9a0181a3710eb anywhere.

It is based on an old version of the mm-unstable. I can try to rebase
on mm-everything or later mm-unstable.

>
> > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >          entry->swpentry = swp;
> >          entry->objcg = objcg;
> > -        if (objcg) {
> > -                obj_cgroup_charge_zswap(objcg, entry->length);
> > -                /* Account before objcg ref is moved to tree */
> > -                count_objcg_event(objcg, ZSWPOUT);
> > -        }
> >
> > -        /* map */
> > -        spin_lock(&tree->lock);
> >          /*
> >           * The folio may have been dirtied again, invalidate the
> >           * possibly stale entry before inserting the new entry.
> >           */
>
> The comment is now somewhat stale and somewhat out of place. It should
> be above that `if (old)` part... See below.

Ack.

>
> > -        if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > -                zswap_invalidate_entry(tree, dupentry);
> > -                WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > +        old = xa_store(tree, offset, entry, GFP_KERNEL);
> > +        if (xa_is_err(old)) {
> > +                int err = xa_err(old);
> > +                if (err == -ENOMEM)
> > +                        zswap_reject_alloc_fail++;
> > +                else
> > +                        WARN_ONCE(err, "%s: xa_store failed: %d\n",
> > +                                  __func__, err);
> > +                goto store_failed;
>
> No need to complicate it. If we have a bug there, an incorrect fail
> stat bump is the least of our concerns. Also, no need for __func__
> since that information is included in the WARN:
>
> 	if (xa_is_err(old)) {
> 		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> 		zswap_reject_alloc_fail++;
> 		goto store_failed;
> 	}

Ah, I see. Thanks for the simplification.

>
> I think here is where that comment above should go:

Ack.

>
> 	/*
> 	 * We may have had an existing entry that became stale when
> 	 * the folio was redirtied and now the new version is being
> 	 * swapped out. Get rid of the old.
> 	 */
> > +        if (old)
> > +                zswap_entry_free(old);
> > +
> > +        if (objcg) {
> > +                obj_cgroup_charge_zswap(objcg, entry->length);
> > +                /* Account before objcg ref is moved to tree */
> > +                count_objcg_event(objcg, ZSWPOUT);
> >          }
> > +
> >          if (entry->length) {
> >                  INIT_LIST_HEAD(&entry->lru);
> >                  zswap_lru_add(&zswap.list_lru, entry);
> >                  atomic_inc(&zswap.nr_stored);
> >          }
> > -        spin_unlock(&tree->lock);
>
> We previously relied on the tree lock to finish initializing the entry
> while it's already in tree. Now we rely on something else:
>
> 1. Concurrent stores and invalidations are excluded by folio lock.
>
> 2. Writeback is excluded by the entry not being on the LRU yet.
>    The publishing order matters to prevent writeback from seeing
>    an incoherent entry.
>
> I think this deserves a comment.

I will add your 1. and 2. into a comment block. Thanks for the suggestion.

>
> >          /* update stats */
> >          atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
> >
> >          return true;
> >
> > +store_failed:
> > +        if (!entry->length) {
> > +                atomic_dec(&zswap_same_filled_pages);
> > +                goto freepage;
> > +        }
>
> It'd be good to avoid the nested goto. Why not make the pool
> operations conditional on entry->length instead:
>
> store_failed:
> 	if (!entry->length)
> 		atomic_dec(&zswap_same_filled_pages);
> 	else {
> 		zpool_free(zswap_find_zpool(...));
> put_pool:
> 		zswap_pool_put(entry->pool);
> 	}
> freepage:

Sure, I have one internal version exactly like that. I later changed
again to get rid of the else. I can use your version as well.

>
> Not super pretty either, but it's a linear flow at least.

Thanks for your suggestions. I will send out a new version.

Chris
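The comment block Chris agrees to add might look something like the
following, paraphrasing points 1 and 2 above (suggested wording only, the
exact text is up to the next version of the patch):

	/*
	 * The entry is visible in the xarray before it is fully
	 * initialized. That is safe because:
	 * 1. concurrent stores and invalidations are excluded by the
	 *    folio lock held while the folio sits in the swapcache;
	 * 2. writeback is excluded until the entry is put on the LRU,
	 *    so publish to the LRU only after the rest of the entry
	 *    is set up.
	 */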
On Wed, Mar 13, 2024 at 12:31 AM Chris Li <chrisl@kernel.org> wrote:
>
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
>
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
>
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
>
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
>
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
>
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
>
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
>
>         mm-9a0181a3710eb    xarray v5
> user    3532.385            3535.658
> sys     536.231             530.083
> real    200.431             200.176
>
> ---
>
>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>

Apologies for the delayed review :) LGTM FWIW. Looks like you're
sending a fixlet to address Johannes' comments, but I assume it won't
change too much, so feel free to include:

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
[..]
>
> @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>          entry->swpentry = swp;
>          entry->objcg = objcg;
> -        if (objcg) {
> -                obj_cgroup_charge_zswap(objcg, entry->length);
> -                /* Account before objcg ref is moved to tree */

I do not understand this comment, but it seems to care about the
charging happening before the entry is added to the tree. This patch
will move it after the tree insertion.

Johannes, do you mind elaborating what this comment is referring to?
It should be clarified, updated, or removed as part of this movement.

>
> -                count_objcg_event(objcg, ZSWPOUT);
> -        }
>
> -        /* map */
> -        spin_lock(&tree->lock);
>          /*
>           * The folio may have been dirtied again, invalidate the
>           * possibly stale entry before inserting the new entry.
>           */
> -        if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -                zswap_invalidate_entry(tree, dupentry);
> -                WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +        old = xa_store(tree, offset, entry, GFP_KERNEL);
> +        if (xa_is_err(old)) {
> +                int err = xa_err(old);
> +                if (err == -ENOMEM)
> +                        zswap_reject_alloc_fail++;
> +                else
> +                        WARN_ONCE(err, "%s: xa_store failed: %d\n",
> +                                  __func__, err);
> +                goto store_failed;
> +        }
> +        if (old)
> +                zswap_entry_free(old);
> +
> +        if (objcg) {
> +                obj_cgroup_charge_zswap(objcg, entry->length);
> +                /* Account before objcg ref is moved to tree */
> +                count_objcg_event(objcg, ZSWPOUT);
>          }
> +
>          if (entry->length) {
>                  INIT_LIST_HEAD(&entry->lru);
>                  zswap_lru_add(&zswap.list_lru, entry);
>                  atomic_inc(&zswap.nr_stored);
>          }
> -        spin_unlock(&tree->lock);
>
>          /* update stats */
>          atomic_inc(&zswap_stored_pages);
[..]
On Fri, Mar 15, 2024 at 06:30:37PM -0700, Yosry Ahmed wrote:
> [..]
> >
> > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >          entry->swpentry = swp;
> >          entry->objcg = objcg;
> > -        if (objcg) {
> > -                obj_cgroup_charge_zswap(objcg, entry->length);
> > -                /* Account before objcg ref is moved to tree */
>
>
> I do not understand this comment, but it seems to care about the
> charging happening before the entry is added to the tree. This patch
> will move it after the tree insertion.
>
> Johannes, do you mind elaborating what this comment is referring to?
> It should be clarified, updated, or removed as part of this movement.

Wait, I wrote that? ^_^

The thinking was this: the objcg reference acquired in this context is
passed on to the tree. Once the entry is in the tree and the
tree->lock released, the entry is public and the current context
doesn't have its own pin on objcg anymore. Ergo, objcg is no longer
safe to access from this context.

This is a conservative take, though, considering the wider context:
the swapcache itself, through folio lock, prevents invalidation; and
reclaim/writeback cannot happen before the entry is on the LRU.

After Chris's patch, the tree is no longer a serialization point for
stores. The swapcache and the LRU are. I had asked Chris upthread to
add an explicit comment about that. I think once he does that, the
objcg situation should be self-evident as well.

So in the next version, please just remove this now stale one-liner.
On Sat, Mar 16, 2024 at 6:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 15, 2024 at 06:30:37PM -0700, Yosry Ahmed wrote:
> > [..]
> > >
> > > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> > >  insert_entry:
> > >          entry->swpentry = swp;
> > >          entry->objcg = objcg;
> > > -        if (objcg) {
> > > -                obj_cgroup_charge_zswap(objcg, entry->length);
> > > -                /* Account before objcg ref is moved to tree */
> >
> >
> > I do not understand this comment, but it seems to care about the
> > charging happening before the entry is added to the tree. This patch
> > will move it after the tree insertion.
> >
> > Johannes, do you mind elaborating what this comment is referring to?
> > It should be clarified, updated, or removed as part of this movement.
>
> Wait, I wrote that? ^_^

Well, past Johannes did :)

>
> The thinking was this: the objcg reference acquired in this context is
> passed on to the tree. Once the entry is in the tree and the
> tree->lock released, the entry is public and the current context
> doesn't have its own pin on objcg anymore. Ergo, objcg is no longer
> safe to access from this context.
>
> This is a conservative take, though, considering the wider context:
> the swapcache itself, through folio lock, prevents invalidation; and
> reclaim/writeback cannot happen before the entry is on the LRU.

Actually, I think just the folio being locked in the swapcache is
enough protection. Even if the entry is added to the LRU, the
writeback code will find it in the swapcache and abort.

>
> After Chris's patch, the tree is no longer a serialization point for
> stores. The swapcache and the LRU are. I had asked Chris upthread to
> add an explicit comment about that. I think once he does that, the
> objcg situation should be self-evident as well.

Perhaps it should be clarified that the swapcache on its own is enough
protection against both invalidation and reclaim/writeback, and the
entry not being on the LRU is *additional* protection on top of that
against reclaim/writeback. Right?

>
> So in the next version, please just remove this now stale one-liner.

Thanks for confirming. Chris, please remove this comment and update
the comment Johannes asked you to add as mentioned above. Thanks!
On Sat, Mar 16, 2024 at 6:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 15, 2024 at 06:30:37PM -0700, Yosry Ahmed wrote:
> > [..]
> > >
> > > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> > >  insert_entry:
> > >          entry->swpentry = swp;
> > >          entry->objcg = objcg;
> > > -        if (objcg) {
> > > -                obj_cgroup_charge_zswap(objcg, entry->length);
> > > -                /* Account before objcg ref is moved to tree */
> >
> >
> > I do not understand this comment, but it seems to care about the
> > charging happening before the entry is added to the tree. This patch
> > will move it after the tree insertion.
> >
> > Johannes, do you mind elaborating what this comment is referring to?
> > It should be clarified, updated, or removed as part of this movement.
>
> Wait, I wrote that? ^_^
>
> The thinking was this: the objcg reference acquired in this context is
> passed on to the tree. Once the entry is in the tree and the
> tree->lock released, the entry is public and the current context
> doesn't have its own pin on objcg anymore. Ergo, objcg is no longer
> safe to access from this context.
>
> This is a conservative take, though, considering the wider context:
> the swapcache itself, through folio lock, prevents invalidation; and
> reclaim/writeback cannot happen before the entry is on the LRU.
>
> After Chris's patch, the tree is no longer a serialization point for
> stores. The swapcache and the LRU are. I had asked Chris upthread to
> add an explicit comment about that. I think once he does that, the
> objcg situation should be self-evident as well.
>
> So in the next version, please just remove this now stale one-liner.

Ack.

Chris
On Sat, Mar 16, 2024 at 11:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sat, Mar 16, 2024 at 6:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Mar 15, 2024 at 06:30:37PM -0700, Yosry Ahmed wrote:
> > > [..]
> > > >
> > > > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> > > >  insert_entry:
> > > >          entry->swpentry = swp;
> > > >          entry->objcg = objcg;
> > > > -        if (objcg) {
> > > > -                obj_cgroup_charge_zswap(objcg, entry->length);
> > > > -                /* Account before objcg ref is moved to tree */
> > >
> > >
> > > I do not understand this comment, but it seems to care about the
> > > charging happening before the entry is added to the tree. This patch
> > > will move it after the tree insertion.
> > >
> > > Johannes, do you mind elaborating what this comment is referring to?
> > > It should be clarified, updated, or removed as part of this movement.
> >
> > Wait, I wrote that? ^_^
>
> Well, past Johannes did :)
>
> >
> > The thinking was this: the objcg reference acquired in this context is
> > passed on to the tree. Once the entry is in the tree and the
> > tree->lock released, the entry is public and the current context
> > doesn't have its own pin on objcg anymore. Ergo, objcg is no longer
> > safe to access from this context.
> >
> > This is a conservative take, though, considering the wider context:
> > the swapcache itself, through folio lock, prevents invalidation; and
> > reclaim/writeback cannot happen before the entry is on the LRU.
>
> Actually, I think just the folio being locked in the swapcache is
> enough protection. Even if the entry is added to the LRU, the
> writeback code will find it in the swapcache and abort.
>
> >
> > After Chris's patch, the tree is no longer a serialization point for
> > stores. The swapcache and the LRU are. I had asked Chris upthread to
> > add an explicit comment about that. I think once he does that, the
> > objcg situation should be self-evident as well.
>
> Perhaps it should be clarified that the swapcache on its own is enough
> protection against both invalidation and reclaim/writeback, and the
> entry not being on the LRU is *additional* protection on top of that
> against reclaim/writeback. Right?
>
> >
> > So in the next version, please just remove this now stale one-liner.
>
> Thanks for confirming. Chris, please remove this comment and update
> the comment Johannes asked you to add as mentioned above. Thanks!

Will do.

Chris
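Folding in Yosry's clarification above, the earlier suggested comment
block could be reworded along these lines before the next version is
posted (again, suggested wording only, not the text that ended up in
the patch):

	/*
	 * The entry is published before it is fully initialized.
	 * That is safe because:
	 * a) the folio is locked in the swapcache, which by itself
	 *    excludes concurrent stores, invalidations and writeback
	 *    (writeback finds the folio in the swapcache and aborts);
	 * b) in addition, writeback cannot see the entry at all until
	 *    it has been added to the LRU.
	 */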
diff --git a/mm/zswap.c b/mm/zswap.c
index 011e068eb355..2fc5eae9fcb4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -20,7 +20,6 @@
 #include <linux/spinlock.h>
 #include <linux/types.h>
 #include <linux/atomic.h>
-#include <linux/rbtree.h>
 #include <linux/swap.h>
 #include <linux/crypto.h>
 #include <linux/scatterlist.h>
@@ -196,7 +195,6 @@ static struct {
  * This structure contains the metadata for tracking a single compressed
  * page within zswap.
  *
- * rbnode - links the entry into red-black tree for the appropriate swap type
  * swpentry - associated swap entry, the offset indexes into the red-black tree
  * length - the length in bytes of the compressed page data. Needed during
  *          decompression. For a same value filled page length is 0, and both
@@ -208,7 +206,6 @@ static struct {
  * lru - handle to the pool's lru used to evict pages.
  */
 struct zswap_entry {
-        struct rb_node rbnode;
         swp_entry_t swpentry;
         unsigned int length;
         struct zswap_pool *pool;
@@ -220,12 +217,7 @@ struct zswap_entry {
         struct list_head lru;
 };
 
-struct zswap_tree {
-        struct rb_root rbroot;
-        spinlock_t lock;
-};
-
-static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static struct xarray *zswap_trees[MAX_SWAPFILES];
 static unsigned int nr_zswap_trees[MAX_SWAPFILES];
 
 /* RCU-protected iteration */
@@ -253,7 +245,7 @@ static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
-static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
+static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 {
         return &zswap_trees[swp_type(swp)][swp_offset(swp)
                 >> SWAP_ADDRESS_SPACE_SHIFT];
@@ -804,63 +796,6 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
         spin_unlock(&zswap.shrink_lock);
 }
 
-/*********************************
-* rbtree functions
-**********************************/
-static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
-{
-        struct rb_node *node = root->rb_node;
-        struct zswap_entry *entry;
-        pgoff_t entry_offset;
-
-        while (node) {
-                entry = rb_entry(node, struct zswap_entry, rbnode);
-                entry_offset = swp_offset(entry->swpentry);
-                if (entry_offset > offset)
-                        node = node->rb_left;
-                else if (entry_offset < offset)
-                        node = node->rb_right;
-                else
-                        return entry;
-        }
-        return NULL;
-}
-
-/*
- * In the case that a entry with the same offset is found, a pointer to
- * the existing entry is stored in dupentry and the function returns -EEXIST
- */
-static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
-                        struct zswap_entry **dupentry)
-{
-        struct rb_node **link = &root->rb_node, *parent = NULL;
-        struct zswap_entry *myentry;
-        pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
-
-        while (*link) {
-                parent = *link;
-                myentry = rb_entry(parent, struct zswap_entry, rbnode);
-                myentry_offset = swp_offset(myentry->swpentry);
-                if (myentry_offset > entry_offset)
-                        link = &(*link)->rb_left;
-                else if (myentry_offset < entry_offset)
-                        link = &(*link)->rb_right;
-                else {
-                        *dupentry = myentry;
-                        return -EEXIST;
-                }
-        }
-        rb_link_node(&entry->rbnode, parent, link);
-        rb_insert_color(&entry->rbnode, root);
-        return 0;
-}
-
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
-{
-        rb_erase(&entry->rbnode, root);
-        RB_CLEAR_NODE(&entry->rbnode);
-}
-
 /*********************************
 * zswap entry functions
 **********************************/
@@ -872,7 +807,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
         entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
         if (!entry)
                 return NULL;
-        RB_CLEAR_NODE(&entry->rbnode);
         return entry;
 }
 
@@ -914,17 +848,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
         zswap_update_total_size();
 }
 
-/*
- * The caller hold the tree lock and search the entry from the tree,
- * so it must be on the tree, remove it from the tree and free it.
- */
-static void zswap_invalidate_entry(struct zswap_tree *tree,
-                                   struct zswap_entry *entry)
-{
-        zswap_rb_erase(&tree->rbroot, entry);
-        zswap_entry_free(entry);
-}
-
 /*********************************
 * compressed storage functions
 **********************************/
@@ -1113,7 +1036,8 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
 static int zswap_writeback_entry(struct zswap_entry *entry,
                                  swp_entry_t swpentry)
 {
-        struct zswap_tree *tree;
+        struct xarray *tree;
+        pgoff_t offset = swp_offset(swpentry);
         struct folio *folio;
         struct mempolicy *mpol;
         bool folio_was_allocated;
@@ -1150,19 +1074,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
          * be dereferenced.
          */
         tree = swap_zswap_tree(swpentry);
-        spin_lock(&tree->lock);
-        if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
-                spin_unlock(&tree->lock);
+        if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
                 delete_from_swap_cache(folio);
                 folio_unlock(folio);
                 folio_put(folio);
                 return -ENOMEM;
         }
 
-        /* Safe to deref entry after the entry is verified above. */
-        zswap_rb_erase(&tree->rbroot, entry);
-        spin_unlock(&tree->lock);
-
         zswap_decompress(entry, &folio->page);
 
         count_vm_event(ZSWPWB);
@@ -1471,8 +1389,8 @@ bool zswap_store(struct folio *folio)
 {
         swp_entry_t swp = folio->swap;
         pgoff_t offset = swp_offset(swp);
-        struct zswap_tree *tree = swap_zswap_tree(swp);
-        struct zswap_entry *entry, *dupentry;
+        struct xarray *tree = swap_zswap_tree(swp);
+        struct zswap_entry *entry, *old;
         struct obj_cgroup *objcg = NULL;
         struct mem_cgroup *memcg = NULL;
 
@@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
 insert_entry:
         entry->swpentry = swp;
         entry->objcg = objcg;
-        if (objcg) {
-                obj_cgroup_charge_zswap(objcg, entry->length);
-                /* Account before objcg ref is moved to tree */
-                count_objcg_event(objcg, ZSWPOUT);
-        }
 
-        /* map */
-        spin_lock(&tree->lock);
         /*
          * The folio may have been dirtied again, invalidate the
          * possibly stale entry before inserting the new entry.
          */
-        if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
-                zswap_invalidate_entry(tree, dupentry);
-                WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
+        old = xa_store(tree, offset, entry, GFP_KERNEL);
+        if (xa_is_err(old)) {
+                int err = xa_err(old);
+                if (err == -ENOMEM)
+                        zswap_reject_alloc_fail++;
+                else
+                        WARN_ONCE(err, "%s: xa_store failed: %d\n",
+                                  __func__, err);
+                goto store_failed;
+        }
+        if (old)
+                zswap_entry_free(old);
+
+        if (objcg) {
+                obj_cgroup_charge_zswap(objcg, entry->length);
+                /* Account before objcg ref is moved to tree */
+                count_objcg_event(objcg, ZSWPOUT);
         }
+
         if (entry->length) {
                 INIT_LIST_HEAD(&entry->lru);
                 zswap_lru_add(&zswap.list_lru, entry);
                 atomic_inc(&zswap.nr_stored);
         }
-        spin_unlock(&tree->lock);
 
         /* update stats */
         atomic_inc(&zswap_stored_pages);
@@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
 
         return true;
 
+store_failed:
+        if (!entry->length) {
+                atomic_dec(&zswap_same_filled_pages);
+                goto freepage;
+        }
+        zpool_free(zswap_find_zpool(entry), entry->handle);
 put_pool:
         zswap_pool_put(entry->pool);
 freepage:
@@ -1598,11 +1529,9 @@ bool zswap_store(struct folio *folio)
          * possibly stale entry which was previously stored at this offset.
          * Otherwise, writeback could overwrite the new data in the swapfile.
          */
-        spin_lock(&tree->lock);
-        entry = zswap_rb_search(&tree->rbroot, offset);
+        entry = xa_erase(tree, offset);
         if (entry)
-                zswap_invalidate_entry(tree, entry);
-        spin_unlock(&tree->lock);
+                zswap_entry_free(entry);
         return false;
 
 shrink:
@@ -1615,20 +1544,15 @@ bool zswap_load(struct folio *folio)
         swp_entry_t swp = folio->swap;
         pgoff_t offset = swp_offset(swp);
         struct page *page = &folio->page;
-        struct zswap_tree *tree = swap_zswap_tree(swp);
+        struct xarray *tree = swap_zswap_tree(swp);
         struct zswap_entry *entry;
         u8 *dst;
 
         VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
-        spin_lock(&tree->lock);
-        entry = zswap_rb_search(&tree->rbroot, offset);
-        if (!entry) {
-                spin_unlock(&tree->lock);
+        entry = xa_erase(tree, offset);
+        if (!entry)
                 return false;
-        }
-        zswap_rb_erase(&tree->rbroot, entry);
-        spin_unlock(&tree->lock);
 
         if (entry->length)
                 zswap_decompress(entry, page);
@@ -1652,19 +1576,17 @@ bool zswap_load(struct folio *folio)
 void zswap_invalidate(swp_entry_t swp)
 {
         pgoff_t offset = swp_offset(swp);
-        struct zswap_tree *tree = swap_zswap_tree(swp);
+        struct xarray *tree = swap_zswap_tree(swp);
         struct zswap_entry *entry;
 
-        spin_lock(&tree->lock);
-        entry = zswap_rb_search(&tree->rbroot, offset);
+        entry = xa_erase(tree, offset);
         if (entry)
-                zswap_invalidate_entry(tree, entry);
-        spin_unlock(&tree->lock);
+                zswap_entry_free(entry);
 }
 
 int zswap_swapon(int type, unsigned long nr_pages)
 {
-        struct zswap_tree *trees, *tree;
+        struct xarray *trees, *tree;
         unsigned int nr, i;
 
         nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
@@ -1674,11 +1596,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
                 return -ENOMEM;
         }
 
-        for (i = 0; i < nr; i++) {
-                tree = trees + i;
-                tree->rbroot = RB_ROOT;
-                spin_lock_init(&tree->lock);
-        }
+        for (i = 0; i < nr; i++)
+                xa_init(trees + i);
 
         nr_zswap_trees[type] = nr;
         zswap_trees[type] = trees;
@@ -1687,7 +1606,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
 
 void zswap_swapoff(int type)
 {
-        struct zswap_tree *trees = zswap_trees[type];
+        struct xarray *trees = zswap_trees[type];
         unsigned int i;
 
         if (!trees)
@@ -1695,7 +1614,7 @@ void zswap_swapoff(int type)
 
         /* try_to_unuse() invalidated all the entries already */
         for (i = 0; i < nr_zswap_trees[type]; i++)
-                WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
+                WARN_ON_ONCE(!xa_empty(trees + i));
 
         kvfree(trees);
         nr_zswap_trees[type] = 0;
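For readers less familiar with the xarray calls the patch leans on, here is
a small, self-contained sketch of their semantics as used above. The
demo_* names are hypothetical and only illustrate the API; the calls
themselves (xa_store, xa_erase, xa_cmpxchg, xa_is_err, xa_err) are the
ones from <linux/xarray.h> used in the diff:

	#include <linux/xarray.h>

	static DEFINE_XARRAY(demo_tree);

	/*
	 * Insert or replace: xa_store() returns the previous entry at the
	 * index, or an xa_err()-encoded pointer if node allocation failed.
	 * A non-NULL return is a stale entry the caller now owns and must free.
	 */
	static int demo_store(unsigned long index, void *new)
	{
		void *old = xa_store(&demo_tree, index, new, GFP_KERNEL);

		if (xa_is_err(old))
			return xa_err(old);	/* e.g. -ENOMEM */
		return 0;
	}

	/*
	 * Remove and return in a single lookup, as zswap_load() and
	 * zswap_invalidate() do after this patch.
	 */
	static void *demo_erase(unsigned long index)
	{
		return xa_erase(&demo_tree, index);
	}

	/*
	 * Conditional removal, as in the writeback path: only clear the
	 * slot if it still holds the entry we expect, and report whether
	 * we were the ones who removed it.
	 */
	static bool demo_erase_if_matches(unsigned long index, void *expected)
	{
		return xa_cmpxchg(&demo_tree, index, expected, NULL,
				  GFP_KERNEL) == expected;
	}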