
[v4] zswap: replace RB tree with xarray

Message ID 20240304-zswap-xarray-v4-1-c4b45670cc30@kernel.org (mailing list archive)
State New
Series [v4] zswap: replace RB tree with xarray

Commit Message

Chris Li March 4, 2024, 9:32 p.m. UTC
A very deep RB tree requires rebalancing at times. That
contributes to the zswap fault latencies. The xarray does not
need to perform tree rebalancing. Replacing the RB tree with
an xarray can give a small performance gain.

One small difference is that an xarray insert might fail with
ENOMEM, because the xarray may need to allocate internal nodes,
while an RB tree insert does not allocate additional memory.
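
A caller therefore needs to check the value returned by xa_store()
for an encoded error, roughly like this (a minimal sketch, not the
exact code in this patch):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old))
		return xa_err(old);	/* typically -ENOMEM */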

The zswap_entry size will shrink a bit due to removing the
RB node, which has two pointers and a color field. The xarray
stores the entry pointer in the xarray tree rather than in the
zswap_entry, so every entry costs one pointer in the xarray
tree. Overall, switching to the xarray should save some memory,
if the swap entries are densely packed.

Notice that zswap_rb_search and zswap_rb_insert are always
followed by zswap_rb_erase. Use xa_erase directly instead, and
fold the erase of a duplicate entry into zswap_xa_insert. That
saves one tree lookup.

Remove zswap_invalidate_entry since there is no need to call
zswap_rb_erase any more. Use zswap_entry_free instead.

The "struct zswap_tree" has been replaced by "struct xarray".
The tree spin lock has transferred to the xarray lock.
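
For example, the invalidate path goes from a search plus erase under
the tree spin lock to a single xa_erase() call, with the locking handled
inside the xarray (a simplified sketch of the zswap_invalidate() change
in the diff below):

	/* before: two tree walks under the external lock */
	spin_lock(&tree->lock);
	entry = zswap_rb_search(&tree->rbroot, offset);
	if (entry)
		zswap_invalidate_entry(tree, entry);
	spin_unlock(&tree->lock);

	/* after: one walk, xarray takes its own lock */
	entry = xa_erase(tree, offset);
	if (entry)
		zswap_entry_free(entry);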

The kernel build test was run 10 times for each version; the numbers
below are averages.
(memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)

            mm-9a0181a3710eb    xarray v4
user        3526.829            3526.930
sys         532.754             526.525
real        198.748             198.850

---


Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v4:
- Remove zswap_xa_search_and_erase, use xa_erase directly.
- Move the objcg charge to after zswap_xa_insert.
- Avoid erasing the old entry on the insert failure error path.
- Remove unneeded swap_zswap_tree change.
- Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org

Changes in v3:
- Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
- Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
- Fix xa_store error handling for same page fill case.
- Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org

Changes in v2:
- Replace struct zswap_tree with struct xarray.
- Remove zswap_tree spinlock, use xarray lock instead.
- Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
- Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
- Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
---
 mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
 1 file changed, 72 insertions(+), 114 deletions(-)


---
base-commit: 9a0181a3710eba1f5c6d19eadcca888be3d54e4f
change-id: 20240104-zswap-xarray-716260e541e3

Best regards,

Comments

Chengming Zhou March 5, 2024, 2:52 a.m. UTC | #1
Hi Chris,

On 2024/3/5 05:32, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
> 
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
> 
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
> 
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase directly. The entry
> erase into zswap_xa_insert as well. That saves one tree
> lookup as well.
> 
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
> 
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
> 
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
> 
> mm-9a0181a3710eb                    xarray v4
> user       3526.829                            3526.930
> sys        532.754                             526.525
> real       198.748                             198.850
> 
> ---
> 
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v4:
> - Remove zswap_xa_search_and_earse, use xa_erase directly.
> - Move charge of objcg after zswap_xa_insert.
> - Avoid erase old entry on insert fail error path.
> - Remove not needed swap_zswap_tree change
> - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> 
> Changes in v3:
> - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> - Fix xa_store error handling for same page fill case.
> - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> 
> Changes in v2:
> - Replace struct zswap_tree with struct xarray.
> - Remove zswap_tree spinlock, use xarray lock instead.
> - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> ---
>  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
>  1 file changed, 72 insertions(+), 114 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 011e068eb355..4f4a3f452b76 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -20,7 +20,6 @@
>  #include <linux/spinlock.h>
>  #include <linux/types.h>
>  #include <linux/atomic.h>
> -#include <linux/rbtree.h>
>  #include <linux/swap.h>
>  #include <linux/crypto.h>
>  #include <linux/scatterlist.h>
> @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
>  static u64 zswap_reject_kmemcache_fail;
> +/* Store failed because xarray can't insert the entry*/
> +static u64 zswap_reject_xarray_fail;
>  
>  /* Shrinker work queue */
>  static struct workqueue_struct *shrink_wq;
> @@ -196,7 +197,6 @@ static struct {
>   * This structure contains the metadata for tracking a single compressed
>   * page within zswap.
>   *
> - * rbnode - links the entry into red-black tree for the appropriate swap type
>   * swpentry - associated swap entry, the offset indexes into the red-black tree
>   * length - the length in bytes of the compressed page data.  Needed during
>   *          decompression. For a same value filled page length is 0, and both
> @@ -208,7 +208,6 @@ static struct {
>   * lru - handle to the pool's lru used to evict pages.
>   */
>  struct zswap_entry {
> -	struct rb_node rbnode;
>  	swp_entry_t swpentry;
>  	unsigned int length;
>  	struct zswap_pool *pool;
> @@ -220,12 +219,7 @@ struct zswap_entry {
>  	struct list_head lru;
>  };
>  
> -struct zswap_tree {
> -	struct rb_root rbroot;
> -	spinlock_t lock;
> -};
> -
> -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> +static struct xarray *zswap_trees[MAX_SWAPFILES];
>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>  
>  /* RCU-protected iteration */
> @@ -253,7 +247,7 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>  
> -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  {
>  	return &zswap_trees[swp_type(swp)][swp_offset(swp)
>  		>> SWAP_ADDRESS_SPACE_SHIFT];
> @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  }
>  
>  /*********************************
> -* rbtree functions
> +* xarray functions
>  **********************************/
> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> -{
> -	struct rb_node *node = root->rb_node;
> -	struct zswap_entry *entry;
> -	pgoff_t entry_offset;
> -
> -	while (node) {
> -		entry = rb_entry(node, struct zswap_entry, rbnode);
> -		entry_offset = swp_offset(entry->swpentry);
> -		if (entry_offset > offset)
> -			node = node->rb_left;
> -		else if (entry_offset < offset)
> -			node = node->rb_right;
> -		else
> -			return entry;
> -	}
> -	return NULL;
> -}
>  
>  /*
>   * In the case that a entry with the same offset is found, a pointer to
> - * the existing entry is stored in dupentry and the function returns -EEXIST
> + * the existing entry is stored in old and erased from the tree.
> + * Function return error on insert.
>   */
> -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> -			struct zswap_entry **dupentry)
> +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> +			   struct zswap_entry **old)
>  {
> -	struct rb_node **link = &root->rb_node, *parent = NULL;
> -	struct zswap_entry *myentry;
> -	pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> -
> -	while (*link) {
> -		parent = *link;
> -		myentry = rb_entry(parent, struct zswap_entry, rbnode);
> -		myentry_offset = swp_offset(myentry->swpentry);
> -		if (myentry_offset > entry_offset)
> -			link = &(*link)->rb_left;
> -		else if (myentry_offset < entry_offset)
> -			link = &(*link)->rb_right;
> -		else {
> -			*dupentry = myentry;
> -			return -EEXIST;
> -		}
> -	}
> -	rb_link_node(&entry->rbnode, parent, link);
> -	rb_insert_color(&entry->rbnode, root);
> -	return 0;
> -}
> +	int err;
> +	struct zswap_entry *e;
> +	pgoff_t offset = swp_offset(entry->swpentry);
>  
> -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> -{
> -	rb_erase(&entry->rbnode, root);
> -	RB_CLEAR_NODE(&entry->rbnode);
> +	e = xa_store(tree, offset, entry, GFP_KERNEL);
> +	err = xa_err(e);
> +
> +	if (err) {
> +		e = xa_erase(tree, offset);
> +		if (err == -ENOMEM)
> +			zswap_reject_alloc_fail++;
> +		else
> +			zswap_reject_xarray_fail++;
> +	}
> +	*old = e;
> +	return err;
>  }
>  
>  /*********************************
> @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
>  	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
>  	if (!entry)
>  		return NULL;
> -	RB_CLEAR_NODE(&entry->rbnode);
>  	return entry;
>  }
>  
> @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
>  	zswap_update_total_size();
>  }
>  
> -/*
> - * The caller hold the tree lock and search the entry from the tree,
> - * so it must be on the tree, remove it from the tree and free it.
> - */
> -static void zswap_invalidate_entry(struct zswap_tree *tree,
> -				   struct zswap_entry *entry)
> -{
> -	zswap_rb_erase(&tree->rbroot, entry);
> -	zswap_entry_free(entry);
> -}
> -
>  /*********************************
>  * compressed storage functions
>  **********************************/
> @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
>  static int zswap_writeback_entry(struct zswap_entry *entry,
>  				 swp_entry_t swpentry)
>  {
> -	struct zswap_tree *tree;
> +	struct xarray *tree;
> +	pgoff_t offset = swp_offset(swpentry);
> +	struct zswap_entry *e;
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * be dereferenced.
>  	 */
>  	tree = swap_zswap_tree(swpentry);
> -	spin_lock(&tree->lock);
> -	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> -		spin_unlock(&tree->lock);
> +	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> +	if (e != entry) {

Maybe "if (xa_cmpxchg() != entry)" look better, so "e" variable can be removed,
since we don't use it.
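
I.e. something like this, assuming the rest of the error path in the
hunk above stays the same:

	if (xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) != entry) {
		delete_from_swap_cache(folio);
		folio_unlock(folio);
		folio_put(folio);
		return -ENOMEM;
	}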

>  		delete_from_swap_cache(folio);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return -ENOMEM;
>  	}
>  
> -	/* Safe to deref entry after the entry is verified above. */
> -	zswap_rb_erase(&tree->rbroot, entry);
> -	spin_unlock(&tree->lock);
> -
>  	zswap_decompress(entry, &folio->page);
>  
>  	count_vm_event(ZSWPWB);
> @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
>  {
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
> -	struct zswap_tree *tree = swap_zswap_tree(swp);
> -	struct zswap_entry *entry, *dupentry;
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	struct zswap_entry *entry, *old;
>  	struct obj_cgroup *objcg = NULL;
>  	struct mem_cgroup *memcg = NULL;
> +	int err;
> +	bool old_erased = false;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
>  			kunmap_local(src);
>  			entry->length = 0;
>  			entry->value = value;
> +			entry->pool = NULL;
>  			atomic_inc(&zswap_same_filled_pages);
>  			goto insert_entry;
>  		}
> @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>  	entry->swpentry = swp;
>  	entry->objcg = objcg;
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, entry->length);
> -		/* Account before objcg ref is moved to tree */
> -		count_objcg_event(objcg, ZSWPOUT);
> -	}
>  
>  	/* map */
> -	spin_lock(&tree->lock);
>  	/*
>  	 * The folio may have been dirtied again, invalidate the
>  	 * possibly stale entry before inserting the new entry.
>  	 */
> -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -		zswap_invalidate_entry(tree, dupentry);
> -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +	err = zswap_xa_insert(tree, entry, &old);
> +	if (old)
> +		zswap_entry_free(old);
> +	if (err) {
> +		old_erased = true;
> +		goto insert_failed;
>  	}

It looks a little complicated to me :( What do you think of something
like the below?

old = xa_store(tree, offset, entry, GFP_KERNEL);
if (xa_is_err(old))
	goto store_failed;

if (old)
	zswap_entry_free(old);

Then the zswap_xa_insert() wrapper can be removed since it has no other users.
So the error handling path is kept much the same as before, and simpler.

> +
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, entry->length);
> +		/* Account before objcg ref is moved to tree */
> +		count_objcg_event(objcg, ZSWPOUT);
> +	}
> +
>  	if (entry->length) {
>  		INIT_LIST_HEAD(&entry->lru);
>  		zswap_lru_add(&zswap.list_lru, entry);
>  		atomic_inc(&zswap.nr_stored);
>  	}
> -	spin_unlock(&tree->lock);
>  
>  	/* update stats */
>  	atomic_inc(&zswap_stored_pages);
> @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
>  
>  	return true;
>  
> +insert_failed:
> +	if (!entry->length) {
> +		atomic_dec(&zswap_same_filled_pages);
> +		goto freepage;
> +	}
> +	zpool_free(zswap_find_zpool(entry), entry->handle);

entry->pool can be used here instead of zswap_find_zpool().

Thanks!

>  put_pool:
>  	zswap_pool_put(entry->pool);
>  freepage:
> @@ -1592,17 +1556,19 @@ bool zswap_store(struct folio *folio)
>  reject:
>  	if (objcg)
>  		obj_cgroup_put(objcg);
> +
> +	if (old_erased)
> +		goto failed;
>  check_old:
>  	/*
>  	 * If the zswap store fails or zswap is disabled, we must invalidate the
>  	 * possibly stale entry which was previously stored at this offset.
>  	 * Otherwise, writeback could overwrite the new data in the swapfile.
>  	 */
> -	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> +	entry = xa_erase(tree, offset);
>  	if (entry)
> -		zswap_invalidate_entry(tree, entry);
> -	spin_unlock(&tree->lock);
> +		zswap_entry_free(entry);
> +failed:
>  	return false;
>  
>  shrink:
> @@ -1615,20 +1581,15 @@ bool zswap_load(struct folio *folio)
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct page *page = &folio->page;
> -	struct zswap_tree *tree = swap_zswap_tree(swp);
> +	struct xarray *tree = swap_zswap_tree(swp);
>  	struct zswap_entry *entry;
>  	u8 *dst;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  
> -	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> -	if (!entry) {
> -		spin_unlock(&tree->lock);
> +	entry = xa_erase(tree, offset);
> +	if (!entry)
>  		return false;
> -	}
> -	zswap_rb_erase(&tree->rbroot, entry);
> -	spin_unlock(&tree->lock);
>  
>  	if (entry->length)
>  		zswap_decompress(entry, page);
> @@ -1652,19 +1613,17 @@ bool zswap_load(struct folio *folio)
>  void zswap_invalidate(swp_entry_t swp)
>  {
>  	pgoff_t offset = swp_offset(swp);
> -	struct zswap_tree *tree = swap_zswap_tree(swp);
> +	struct xarray *tree = swap_zswap_tree(swp);
>  	struct zswap_entry *entry;
>  
> -	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> +	entry = xa_erase(tree, offset);
>  	if (entry)
> -		zswap_invalidate_entry(tree, entry);
> -	spin_unlock(&tree->lock);
> +		zswap_entry_free(entry);
>  }
>  
>  int zswap_swapon(int type, unsigned long nr_pages)
>  {
> -	struct zswap_tree *trees, *tree;
> +	struct xarray *trees, *tree;
>  	unsigned int nr, i;
>  
>  	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> @@ -1674,11 +1633,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
>  		return -ENOMEM;
>  	}
>  
> -	for (i = 0; i < nr; i++) {
> -		tree = trees + i;
> -		tree->rbroot = RB_ROOT;
> -		spin_lock_init(&tree->lock);
> -	}
> +	for (i = 0; i < nr; i++)
> +		xa_init(trees + i);
>  
>  	nr_zswap_trees[type] = nr;
>  	zswap_trees[type] = trees;
> @@ -1687,7 +1643,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
>  
>  void zswap_swapoff(int type)
>  {
> -	struct zswap_tree *trees = zswap_trees[type];
> +	struct xarray *trees = zswap_trees[type];
>  	unsigned int i;
>  
>  	if (!trees)
> @@ -1695,7 +1651,7 @@ void zswap_swapoff(int type)
>  
>  	/* try_to_unuse() invalidated all the entries already */
>  	for (i = 0; i < nr_zswap_trees[type]; i++)
> -		WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
> +		WARN_ON_ONCE(!xa_empty(trees + i));
>  
>  	kvfree(trees);
>  	nr_zswap_trees[type] = 0;
> @@ -1727,6 +1683,8 @@ static int zswap_debugfs_init(void)
>  			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
>  	debugfs_create_u64("reject_compress_fail", 0444,
>  			   zswap_debugfs_root, &zswap_reject_compress_fail);
> +	debugfs_create_u64("reject_xarray_fail", 0444,
> +			   zswap_debugfs_root, &zswap_reject_xarray_fail);
>  	debugfs_create_u64("reject_compress_poor", 0444,
>  			   zswap_debugfs_root, &zswap_reject_compress_poor);
>  	debugfs_create_u64("written_back_pages", 0444,
> 
> ---
> base-commit: 9a0181a3710eba1f5c6d19eadcca888be3d54e4f
> change-id: 20240104-zswap-xarray-716260e541e3
> 
> Best regards,
Yosry Ahmed March 5, 2024, 5:08 a.m. UTC | #2
Hi Chris,

On Mon, Mar 04, 2024 at 01:32:20PM -0800, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
> 
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
> 
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
> 
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase directly. The entry
> erase into zswap_xa_insert as well. That saves one tree
> lookup as well.
> 
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
> 
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
> 
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)

Please add details about the number of threads and the number of CPUs as
well.

> 
> mm-9a0181a3710eb                    xarray v4
> user       3526.829                            3526.930
> sys        532.754                             526.525
> real       198.748                             198.850
> 
> ---
> 
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v4:
> - Remove zswap_xa_search_and_earse, use xa_erase directly.
> - Move charge of objcg after zswap_xa_insert.
> - Avoid erase old entry on insert fail error path.
> - Remove not needed swap_zswap_tree change
> - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> 
> Changes in v3:
> - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> - Fix xa_store error handling for same page fill case.
> - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> 
> Changes in v2:
> - Replace struct zswap_tree with struct xarray.
> - Remove zswap_tree spinlock, use xarray lock instead.
> - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> ---
>  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
>  1 file changed, 72 insertions(+), 114 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 011e068eb355..4f4a3f452b76 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -20,7 +20,6 @@
>  #include <linux/spinlock.h>
>  #include <linux/types.h>
>  #include <linux/atomic.h>
> -#include <linux/rbtree.h>
>  #include <linux/swap.h>
>  #include <linux/crypto.h>
>  #include <linux/scatterlist.h>
> @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
>  static u64 zswap_reject_kmemcache_fail;
> +/* Store failed because xarray can't insert the entry*/
> +static u64 zswap_reject_xarray_fail;
>  
>  /* Shrinker work queue */
>  static struct workqueue_struct *shrink_wq;
> @@ -196,7 +197,6 @@ static struct {
>   * This structure contains the metadata for tracking a single compressed
>   * page within zswap.
>   *
> - * rbnode - links the entry into red-black tree for the appropriate swap type
>   * swpentry - associated swap entry, the offset indexes into the red-black tree
>   * length - the length in bytes of the compressed page data.  Needed during
>   *          decompression. For a same value filled page length is 0, and both
> @@ -208,7 +208,6 @@ static struct {
>   * lru - handle to the pool's lru used to evict pages.
>   */
>  struct zswap_entry {
> -	struct rb_node rbnode;
>  	swp_entry_t swpentry;
>  	unsigned int length;
>  	struct zswap_pool *pool;
> @@ -220,12 +219,7 @@ struct zswap_entry {
>  	struct list_head lru;
>  };
>  
> -struct zswap_tree {
> -	struct rb_root rbroot;
> -	spinlock_t lock;
> -};
> -
> -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> +static struct xarray *zswap_trees[MAX_SWAPFILES];
>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>  
>  /* RCU-protected iteration */
> @@ -253,7 +247,7 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>  
> -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  {
>  	return &zswap_trees[swp_type(swp)][swp_offset(swp)
>  		>> SWAP_ADDRESS_SPACE_SHIFT];
> @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  }
>  
>  /*********************************
> -* rbtree functions
> +* xarray functions
>  **********************************/
> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> -{
> -	struct rb_node *node = root->rb_node;
> -	struct zswap_entry *entry;
> -	pgoff_t entry_offset;
> -
> -	while (node) {
> -		entry = rb_entry(node, struct zswap_entry, rbnode);
> -		entry_offset = swp_offset(entry->swpentry);
> -		if (entry_offset > offset)
> -			node = node->rb_left;
> -		else if (entry_offset < offset)
> -			node = node->rb_right;
> -		else
> -			return entry;
> -	}
> -	return NULL;
> -}
>  
>  /*
>   * In the case that a entry with the same offset is found, a pointer to
> - * the existing entry is stored in dupentry and the function returns -EEXIST
> + * the existing entry is stored in old and erased from the tree.
> + * Function return error on insert.
>   */
> -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> -			struct zswap_entry **dupentry)
> +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> +			   struct zswap_entry **old)
>  {
> -	struct rb_node **link = &root->rb_node, *parent = NULL;
> -	struct zswap_entry *myentry;
> -	pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> -
> -	while (*link) {
> -		parent = *link;
> -		myentry = rb_entry(parent, struct zswap_entry, rbnode);
> -		myentry_offset = swp_offset(myentry->swpentry);
> -		if (myentry_offset > entry_offset)
> -			link = &(*link)->rb_left;
> -		else if (myentry_offset < entry_offset)
> -			link = &(*link)->rb_right;
> -		else {
> -			*dupentry = myentry;
> -			return -EEXIST;
> -		}
> -	}
> -	rb_link_node(&entry->rbnode, parent, link);
> -	rb_insert_color(&entry->rbnode, root);
> -	return 0;
> -}
> +	int err;
> +	struct zswap_entry *e;
> +	pgoff_t offset = swp_offset(entry->swpentry);
>  
> -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> -{
> -	rb_erase(&entry->rbnode, root);
> -	RB_CLEAR_NODE(&entry->rbnode);
> +	e = xa_store(tree, offset, entry, GFP_KERNEL);
> +	err = xa_err(e);
> +
> +	if (err) {
> +		e = xa_erase(tree, offset);
> +		if (err == -ENOMEM)
> +			zswap_reject_alloc_fail++;
> +		else
> +			zswap_reject_xarray_fail++;

I think this is too complicated, and as Chengming pointed out, I believe
we can use xa_store() directly in zswap_store().

I am also not sure what the need for zswap_reject_xarray_fail is. Are
there any reasons why the store here can fail other than -ENOMEM? The
docs say the only other option is -EINVAL, and looking at __xa_store(),
it seems like this is only possible if xa_is_internal() is true (which
means we are not passing in a properly aligned pointer IIUC).

> +	}
> +	*old = e;
> +	return err;
>  }
>  
>  /*********************************
> @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
>  	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
>  	if (!entry)
>  		return NULL;
> -	RB_CLEAR_NODE(&entry->rbnode);
>  	return entry;
>  }
>  
> @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
>  	zswap_update_total_size();
>  }
>  
> -/*
> - * The caller hold the tree lock and search the entry from the tree,
> - * so it must be on the tree, remove it from the tree and free it.
> - */
> -static void zswap_invalidate_entry(struct zswap_tree *tree,
> -				   struct zswap_entry *entry)
> -{
> -	zswap_rb_erase(&tree->rbroot, entry);
> -	zswap_entry_free(entry);
> -}
> -
>  /*********************************
>  * compressed storage functions
>  **********************************/
> @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
>  static int zswap_writeback_entry(struct zswap_entry *entry,
>  				 swp_entry_t swpentry)
>  {
> -	struct zswap_tree *tree;
> +	struct xarray *tree;
> +	pgoff_t offset = swp_offset(swpentry);
> +	struct zswap_entry *e;
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * be dereferenced.
>  	 */
>  	tree = swap_zswap_tree(swpentry);
> -	spin_lock(&tree->lock);
> -	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> -		spin_unlock(&tree->lock);
> +	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> +	if (e != entry) {

I think we can avoid adding 'e' and 'offset' local variables here and
just do everything in the if condition. If you want to avoid the line
break, then introducing 'offset' is fine, but I don't see any value in
'e'.


>  		delete_from_swap_cache(folio);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return -ENOMEM;
>  	}
>  
> -	/* Safe to deref entry after the entry is verified above. */
> -	zswap_rb_erase(&tree->rbroot, entry);
> -	spin_unlock(&tree->lock);
> -
>  	zswap_decompress(entry, &folio->page);
>  
>  	count_vm_event(ZSWPWB);
> @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
>  {
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
> -	struct zswap_tree *tree = swap_zswap_tree(swp);
> -	struct zswap_entry *entry, *dupentry;
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	struct zswap_entry *entry, *old;
>  	struct obj_cgroup *objcg = NULL;
>  	struct mem_cgroup *memcg = NULL;
> +	int err;
> +	bool old_erased = false;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
>  			kunmap_local(src);
>  			entry->length = 0;
>  			entry->value = value;
> +			entry->pool = NULL;

Why do we need to initialize the pool here? Is this a bug fix for an
existing problem or just keeping things clean? Either way I think it
should be done separately, unless it is related to a change in this
patch.

>  			atomic_inc(&zswap_same_filled_pages);
>  			goto insert_entry;
>  		}
> @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>  	entry->swpentry = swp;
>  	entry->objcg = objcg;
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, entry->length);
> -		/* Account before objcg ref is moved to tree */
> -		count_objcg_event(objcg, ZSWPOUT);
> -	}
>  
>  	/* map */

Let's remove this comment while we are here; it's pointless AFAICT.

> -	spin_lock(&tree->lock);
>  	/*
>  	 * The folio may have been dirtied again, invalidate the
>  	 * possibly stale entry before inserting the new entry.
>  	 */
> -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -		zswap_invalidate_entry(tree, dupentry);
> -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +	err = zswap_xa_insert(tree, entry, &old);
> +	if (old)
> +		zswap_entry_free(old);
> +	if (err) {
> +		old_erased = true;

I think this can be made simpler if we open code xa_store() here,
especially since we already have cleanup code below under 'check_old'
that removes the existing old entry. As it stands, zswap_xa_insert()
replicates this cleanup, and then we add this 'old_erased' boolean to
avoid doing the cleanup below. It seems like it would be much more
straightforward to open-code xa_store() here and rely on the existing
cleanup for the old entry. Also, if we initialize 'old' to NULL, we can
use its value to figure out whether any cleanup is needed under
'check_old' or not.

Taking a step back, I think we can further simplify this. What if we
move the tree insertion to right after we allocate the zswap entry? In
this case, if the tree insertion fails, we don't need to decrement the
same filled counter. If the tree insertion succeeds and then something
else fails, the existing cleanup code under 'check_old' will already
clean up the tree insertion for us.

If this works, we don't need to add extra cleanup code or move any code
around. Something like:

	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
	if (!entry) {
		zswap_reject_kmemcache_fail++;
		goto reject;
	}

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		old = NULL;
		goto freepage;
	}
	if (old)
		zswap_free_entry(old);

	...
check_old:
	if (!old) {
		old = xa_erase(tree, offset);
		if (old)
			zswap_entry_free(old);
	}


WDYT?

> +		goto insert_failed;
>  	}
> +
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, entry->length);
> +		/* Account before objcg ref is moved to tree */
> +		count_objcg_event(objcg, ZSWPOUT);
> +	}
> +
>  	if (entry->length) {
>  		INIT_LIST_HEAD(&entry->lru);
>  		zswap_lru_add(&zswap.list_lru, entry);
>  		atomic_inc(&zswap.nr_stored);
>  	}
> -	spin_unlock(&tree->lock);
>  
>  	/* update stats */
>  	atomic_inc(&zswap_stored_pages);
> @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
>  
>  	return true;
>  
> +insert_failed:
> +	if (!entry->length) {
> +		atomic_dec(&zswap_same_filled_pages);
> +		goto freepage;
> +	}
> +	zpool_free(zswap_find_zpool(entry), entry->handle);
>  put_pool:
>  	zswap_pool_put(entry->pool);
>  freepage:
> @@ -1592,17 +1556,19 @@ bool zswap_store(struct folio *folio)
>  reject:
>  	if (objcg)
>  		obj_cgroup_put(objcg);
> +
> +	if (old_erased)
> +		goto failed;
>  check_old:
>  	/*
>  	 * If the zswap store fails or zswap is disabled, we must invalidate the
>  	 * possibly stale entry which was previously stored at this offset.
>  	 * Otherwise, writeback could overwrite the new data in the swapfile.
>  	 */
> -	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> +	entry = xa_erase(tree, offset);
>  	if (entry)
> -		zswap_invalidate_entry(tree, entry);
> -	spin_unlock(&tree->lock);
> +		zswap_entry_free(entry);
> +failed:
>  	return false;
>  
>  shrink:
[..]
Chris Li March 6, 2024, 9:12 p.m. UTC | #3
Hi Chengming,

On Mon, Mar 4, 2024 at 6:52 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Hi Chris,
>
> On 2024/3/5 05:32, Chris Li wrote:
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will reduce a bit due to removing the
> > RB node, which has two pointers and a color field. Xarray
> > store the pointer in the xarray tree rather than the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory,
> > if the swap entries are densely packed.
> >
> > Notice the zswap_rb_search and zswap_rb_insert always
> > followed by zswap_rb_erase. Use xa_erase directly. The entry
> > erase into zswap_xa_insert as well. That saves one tree
> > lookup as well.
> >
> > Remove zswap_invalidate_entry due to no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build testing 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
> >
> > mm-9a0181a3710eb                    xarray v4
> > user       3526.829                            3526.930
> > sys        532.754                             526.525
> > real       198.748                             198.850
> >
> > ---
> >
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > Changes in v4:
> > - Remove zswap_xa_search_and_earse, use xa_erase directly.
> > - Move charge of objcg after zswap_xa_insert.
> > - Avoid erase old entry on insert fail error path.
> > - Remove not needed swap_zswap_tree change
> > - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> >
> > Changes in v3:
> > - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> > - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> > - Fix xa_store error handling for same page fill case.
> > - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> >
> > Changes in v2:
> > - Replace struct zswap_tree with struct xarray.
> > - Remove zswap_tree spinlock, use xarray lock instead.
> > - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> > - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> > - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> > ---
> >  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
> >  1 file changed, 72 insertions(+), 114 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 011e068eb355..4f4a3f452b76 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -20,7 +20,6 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/types.h>
> >  #include <linux/atomic.h>
> > -#include <linux/rbtree.h>
> >  #include <linux/swap.h>
> >  #include <linux/crypto.h>
> >  #include <linux/scatterlist.h>
> > @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> >  static u64 zswap_reject_kmemcache_fail;
> > +/* Store failed because xarray can't insert the entry*/
> > +static u64 zswap_reject_xarray_fail;
> >
> >  /* Shrinker work queue */
> >  static struct workqueue_struct *shrink_wq;
> > @@ -196,7 +197,6 @@ static struct {
> >   * This structure contains the metadata for tracking a single compressed
> >   * page within zswap.
> >   *
> > - * rbnode - links the entry into red-black tree for the appropriate swap type
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> >   * length - the length in bytes of the compressed page data.  Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> > @@ -208,7 +208,6 @@ static struct {
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> >  struct zswap_entry {
> > -     struct rb_node rbnode;
> >       swp_entry_t swpentry;
> >       unsigned int length;
> >       struct zswap_pool *pool;
> > @@ -220,12 +219,7 @@ struct zswap_entry {
> >       struct list_head lru;
> >  };
> >
> > -struct zswap_tree {
> > -     struct rb_root rbroot;
> > -     spinlock_t lock;
> > -};
> > -
> > -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> > +static struct xarray *zswap_trees[MAX_SWAPFILES];
> >  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
> >
> >  /* RCU-protected iteration */
> > @@ -253,7 +247,7 @@ static bool zswap_has_pool;
> >  * helpers and fwd declarations
> >  **********************************/
> >
> > -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> > +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >  {
> >       return &zswap_trees[swp_type(swp)][swp_offset(swp)
> >               >> SWAP_ADDRESS_SPACE_SHIFT];
> > @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >  }
> >
> >  /*********************************
> > -* rbtree functions
> > +* xarray functions
> >  **********************************/
> > -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > -{
> > -     struct rb_node *node = root->rb_node;
> > -     struct zswap_entry *entry;
> > -     pgoff_t entry_offset;
> > -
> > -     while (node) {
> > -             entry = rb_entry(node, struct zswap_entry, rbnode);
> > -             entry_offset = swp_offset(entry->swpentry);
> > -             if (entry_offset > offset)
> > -                     node = node->rb_left;
> > -             else if (entry_offset < offset)
> > -                     node = node->rb_right;
> > -             else
> > -                     return entry;
> > -     }
> > -     return NULL;
> > -}
> >
> >  /*
> >   * In the case that a entry with the same offset is found, a pointer to
> > - * the existing entry is stored in dupentry and the function returns -EEXIST
> > + * the existing entry is stored in old and erased from the tree.
> > + * Function return error on insert.
> >   */
> > -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > -                     struct zswap_entry **dupentry)
> > +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> > +                        struct zswap_entry **old)
> >  {
> > -     struct rb_node **link = &root->rb_node, *parent = NULL;
> > -     struct zswap_entry *myentry;
> > -     pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> > -
> > -     while (*link) {
> > -             parent = *link;
> > -             myentry = rb_entry(parent, struct zswap_entry, rbnode);
> > -             myentry_offset = swp_offset(myentry->swpentry);
> > -             if (myentry_offset > entry_offset)
> > -                     link = &(*link)->rb_left;
> > -             else if (myentry_offset < entry_offset)
> > -                     link = &(*link)->rb_right;
> > -             else {
> > -                     *dupentry = myentry;
> > -                     return -EEXIST;
> > -             }
> > -     }
> > -     rb_link_node(&entry->rbnode, parent, link);
> > -     rb_insert_color(&entry->rbnode, root);
> > -     return 0;
> > -}
> > +     int err;
> > +     struct zswap_entry *e;
> > +     pgoff_t offset = swp_offset(entry->swpentry);
> >
> > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > -{
> > -     rb_erase(&entry->rbnode, root);
> > -     RB_CLEAR_NODE(&entry->rbnode);
> > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > +     err = xa_err(e);
> > +
> > +     if (err) {
> > +             e = xa_erase(tree, offset);
> > +             if (err == -ENOMEM)
> > +                     zswap_reject_alloc_fail++;
> > +             else
> > +                     zswap_reject_xarray_fail++;
> > +     }
> > +     *old = e;
> > +     return err;
> >  }
> >
> >  /*********************************
> > @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> >       entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> >       if (!entry)
> >               return NULL;
> > -     RB_CLEAR_NODE(&entry->rbnode);
> >       return entry;
> >  }
> >
> > @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> >       zswap_update_total_size();
> >  }
> >
> > -/*
> > - * The caller hold the tree lock and search the entry from the tree,
> > - * so it must be on the tree, remove it from the tree and free it.
> > - */
> > -static void zswap_invalidate_entry(struct zswap_tree *tree,
> > -                                struct zswap_entry *entry)
> > -{
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     zswap_entry_free(entry);
> > -}
> > -
> >  /*********************************
> >  * compressed storage functions
> >  **********************************/
> > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> >  static int zswap_writeback_entry(struct zswap_entry *entry,
> >                                swp_entry_t swpentry)
> >  {
> > -     struct zswap_tree *tree;
> > +     struct xarray *tree;
> > +     pgoff_t offset = swp_offset(swpentry);
> > +     struct zswap_entry *e;
> >       struct folio *folio;
> >       struct mempolicy *mpol;
> >       bool folio_was_allocated;
> > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >        * be dereferenced.
> >        */
> >       tree = swap_zswap_tree(swpentry);
> > -     spin_lock(&tree->lock);
> > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > -             spin_unlock(&tree->lock);
> > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > +     if (e != entry) {
>
> Maybe "if (xa_cmpxchg() != entry)" look better, so "e" variable can be removed,
> since we don't use it.

I thought about that. The main reason is that "if (xa_cmpxchg(tree, offset,
entry, NULL, GFP_KERNEL) != entry)" is a long expression. Having
"!= entry" at the end of the expression makes it harder to read. Having
the temporary variable IMHO helps with reading the comparison.

I am not sure I understand the motivation to avoid a temporary variable. The
compiler always generates temporary variables internally. Having one
here does not affect the compiled result at all. It is just there to
aid readability. If you still think it has the opposite effect on
readability, I can of course remove it. It generates the same code anyway.

>
> >               delete_from_swap_cache(folio);
> >               folio_unlock(folio);
> >               folio_put(folio);
> >               return -ENOMEM;
> >       }
> >
> > -     /* Safe to deref entry after the entry is verified above. */
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     spin_unlock(&tree->lock);
> > -
> >       zswap_decompress(entry, &folio->page);
> >
> >       count_vm_event(ZSWPWB);
> > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> >  {
> >       swp_entry_t swp = folio->swap;
> >       pgoff_t offset = swp_offset(swp);
> > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > -     struct zswap_entry *entry, *dupentry;
> > +     struct xarray *tree = swap_zswap_tree(swp);
> > +     struct zswap_entry *entry, *old;
> >       struct obj_cgroup *objcg = NULL;
> >       struct mem_cgroup *memcg = NULL;
> > +     int err;
> > +     bool old_erased = false;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> >                       kunmap_local(src);
> >                       entry->length = 0;
> >                       entry->value = value;
> > +                     entry->pool = NULL;
> >                       atomic_inc(&zswap_same_filled_pages);
> >                       goto insert_entry;
> >               }
> > @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >       entry->swpentry = swp;
> >       entry->objcg = objcg;
> > -     if (objcg) {
> > -             obj_cgroup_charge_zswap(objcg, entry->length);
> > -             /* Account before objcg ref is moved to tree */
> > -             count_objcg_event(objcg, ZSWPOUT);
> > -     }
> >
> >       /* map */
> > -     spin_lock(&tree->lock);
> >       /*
> >        * The folio may have been dirtied again, invalidate the
> >        * possibly stale entry before inserting the new entry.
> >        */
> > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > -             zswap_invalidate_entry(tree, dupentry);
> > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > +     err = zswap_xa_insert(tree, entry, &old);
> > +     if (old)
> > +             zswap_entry_free(old);
> > +     if (err) {
> > +             old_erased = true;
> > +             goto insert_failed;
> >       }
>
> It looks a little complicated for me :( How do you think this like below?
>
> old = xa_store(tree, offset, entry, GFP_KERNEL);
> if (xa_is_err(old))

We might want to bump up the ENOMEM error counter if it is out of memory,
since zswap tracks that.

>         goto store_failed;
>
> if (old)
>         zswap_entry_free(old);
>
> Then zswap_xa_insert() wrapper can be removed since no use elsewhere.
> So the error handling path is kept much the same as before and simpler.

Sure.
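
Something like the below then, keeping the reject counter (just a rough,
untested sketch of how I would combine your suggestion with the ENOMEM
accounting):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		int err = xa_err(old);

		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

	if (old)
		zswap_entry_free(old);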

>
> > +
> > +     if (objcg) {
> > +             obj_cgroup_charge_zswap(objcg, entry->length);
> > +             /* Account before objcg ref is moved to tree */
> > +             count_objcg_event(objcg, ZSWPOUT);
> > +     }
> > +
> >       if (entry->length) {
> >               INIT_LIST_HEAD(&entry->lru);
> >               zswap_lru_add(&zswap.list_lru, entry);
> >               atomic_inc(&zswap.nr_stored);
> >       }
> > -     spin_unlock(&tree->lock);
> >
> >       /* update stats */
> >       atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
> >
> >       return true;
> >
> > +insert_failed:
> > +     if (!entry->length) {
> > +             atomic_dec(&zswap_same_filled_pages);
> > +             goto freepage;
> > +     }
> > +     zpool_free(zswap_find_zpool(entry), entry->handle);
>
> entry->pool can be used here instead of zswap_find_zpool().

Ack.

I was following the code in zswap_entry_free(). Does that mean
zswap_find_zpool(entry) can also use entry->pool?

Thanks for the feedback.

Chris

>
> Thanks!
>
> >  put_pool:
> >       zswap_pool_put(entry->pool);
> >  freepage:
> > @@ -1592,17 +1556,19 @@ bool zswap_store(struct folio *folio)
> >  reject:
> >       if (objcg)
> >               obj_cgroup_put(objcg);
> > +
> > +     if (old_erased)
> > +             goto failed;
> >  check_old:
> >       /*
> >        * If the zswap store fails or zswap is disabled, we must invalidate the
> >        * possibly stale entry which was previously stored at this offset.
> >        * Otherwise, writeback could overwrite the new data in the swapfile.
> >        */
> > -     spin_lock(&tree->lock);
> > -     entry = zswap_rb_search(&tree->rbroot, offset);
> > +     entry = xa_erase(tree, offset);
> >       if (entry)
> > -             zswap_invalidate_entry(tree, entry);
> > -     spin_unlock(&tree->lock);
> > +             zswap_entry_free(entry);
> > +failed:
> >       return false;
> >
> >  shrink:
> > @@ -1615,20 +1581,15 @@ bool zswap_load(struct folio *folio)
> >       swp_entry_t swp = folio->swap;
> >       pgoff_t offset = swp_offset(swp);
> >       struct page *page = &folio->page;
> > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > +     struct xarray *tree = swap_zswap_tree(swp);
> >       struct zswap_entry *entry;
> >       u8 *dst;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >
> > -     spin_lock(&tree->lock);
> > -     entry = zswap_rb_search(&tree->rbroot, offset);
> > -     if (!entry) {
> > -             spin_unlock(&tree->lock);
> > +     entry = xa_erase(tree, offset);
> > +     if (!entry)
> >               return false;
> > -     }
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     spin_unlock(&tree->lock);
> >
> >       if (entry->length)
> >               zswap_decompress(entry, page);
> > @@ -1652,19 +1613,17 @@ bool zswap_load(struct folio *folio)
> >  void zswap_invalidate(swp_entry_t swp)
> >  {
> >       pgoff_t offset = swp_offset(swp);
> > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > +     struct xarray *tree = swap_zswap_tree(swp);
> >       struct zswap_entry *entry;
> >
> > -     spin_lock(&tree->lock);
> > -     entry = zswap_rb_search(&tree->rbroot, offset);
> > +     entry = xa_erase(tree, offset);
> >       if (entry)
> > -             zswap_invalidate_entry(tree, entry);
> > -     spin_unlock(&tree->lock);
> > +             zswap_entry_free(entry);
> >  }
> >
> >  int zswap_swapon(int type, unsigned long nr_pages)
> >  {
> > -     struct zswap_tree *trees, *tree;
> > +     struct xarray *trees, *tree;
> >       unsigned int nr, i;
> >
> >       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> > @@ -1674,11 +1633,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
> >               return -ENOMEM;
> >       }
> >
> > -     for (i = 0; i < nr; i++) {
> > -             tree = trees + i;
> > -             tree->rbroot = RB_ROOT;
> > -             spin_lock_init(&tree->lock);
> > -     }
> > +     for (i = 0; i < nr; i++)
> > +             xa_init(trees + i);
> >
> >       nr_zswap_trees[type] = nr;
> >       zswap_trees[type] = trees;
> > @@ -1687,7 +1643,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
> >
> >  void zswap_swapoff(int type)
> >  {
> > -     struct zswap_tree *trees = zswap_trees[type];
> > +     struct xarray *trees = zswap_trees[type];
> >       unsigned int i;
> >
> >       if (!trees)
> > @@ -1695,7 +1651,7 @@ void zswap_swapoff(int type)
> >
> >       /* try_to_unuse() invalidated all the entries already */
> >       for (i = 0; i < nr_zswap_trees[type]; i++)
> > -             WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
> > +             WARN_ON_ONCE(!xa_empty(trees + i));
> >
> >       kvfree(trees);
> >       nr_zswap_trees[type] = 0;
> > @@ -1727,6 +1683,8 @@ static int zswap_debugfs_init(void)
> >                          zswap_debugfs_root, &zswap_reject_kmemcache_fail);
> >       debugfs_create_u64("reject_compress_fail", 0444,
> >                          zswap_debugfs_root, &zswap_reject_compress_fail);
> > +     debugfs_create_u64("reject_xarray_fail", 0444,
> > +                        zswap_debugfs_root, &zswap_reject_xarray_fail);
> >       debugfs_create_u64("reject_compress_poor", 0444,
> >                          zswap_debugfs_root, &zswap_reject_compress_poor);
> >       debugfs_create_u64("written_back_pages", 0444,
> >
> > ---
> > base-commit: 9a0181a3710eba1f5c6d19eadcca888be3d54e4f
> > change-id: 20240104-zswap-xarray-716260e541e3
> >
> > Best regards,
>
Chris Li March 6, 2024, 9:37 p.m. UTC | #4
Hi Yosry,

On Mon, Mar 4, 2024 at 9:08 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hi Chris,
>
> On Mon, Mar 04, 2024 at 01:32:20PM -0800, Chris Li wrote:
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will reduce a bit due to removing the
> > RB node, which has two pointers and a color field. Xarray
> > store the pointer in the xarray tree rather than the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory,
> > if the swap entries are densely packed.
> >
> > Notice the zswap_rb_search and zswap_rb_insert always
> > followed by zswap_rb_erase. Use xa_erase directly. The entry
> > erase into zswap_xa_insert as well. That saves one tree
> > lookup as well.
> >
> > Remove zswap_invalidate_entry due to no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build testing 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
>
> Please add details about the number of threads and the number of CPUs as
> well.

Sure, I will update the commit message in V5. It is a 24 HT core machine
running 32 make jobs.

>
> >
> > mm-9a0181a3710eb                    xarray v4
> > user       3526.829                            3526.930
> > sys        532.754                             526.525
> > real       198.748                             198.850
> >
> > ---
> >
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > Changes in v4:
> > - Remove zswap_xa_search_and_earse, use xa_erase directly.
> > - Move charge of objcg after zswap_xa_insert.
> > - Avoid erase old entry on insert fail error path.
> > - Remove not needed swap_zswap_tree change
> > - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> >
> > Changes in v3:
> > - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> > - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> > - Fix xa_store error handling for same page fill case.
> > - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> >
> > Changes in v2:
> > - Replace struct zswap_tree with struct xarray.
> > - Remove zswap_tree spinlock, use xarray lock instead.
> > - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> > - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> > - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> > ---
> >  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
> >  1 file changed, 72 insertions(+), 114 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 011e068eb355..4f4a3f452b76 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -20,7 +20,6 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/types.h>
> >  #include <linux/atomic.h>
> > -#include <linux/rbtree.h>
> >  #include <linux/swap.h>
> >  #include <linux/crypto.h>
> >  #include <linux/scatterlist.h>
> > @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> >  static u64 zswap_reject_kmemcache_fail;
> > +/* Store failed because xarray can't insert the entry*/
> > +static u64 zswap_reject_xarray_fail;
> >
> >  /* Shrinker work queue */
> >  static struct workqueue_struct *shrink_wq;
> > @@ -196,7 +197,6 @@ static struct {
> >   * This structure contains the metadata for tracking a single compressed
> >   * page within zswap.
> >   *
> > - * rbnode - links the entry into red-black tree for the appropriate swap type
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> >   * length - the length in bytes of the compressed page data.  Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> > @@ -208,7 +208,6 @@ static struct {
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> >  struct zswap_entry {
> > -     struct rb_node rbnode;
> >       swp_entry_t swpentry;
> >       unsigned int length;
> >       struct zswap_pool *pool;
> > @@ -220,12 +219,7 @@ struct zswap_entry {
> >       struct list_head lru;
> >  };
> >
> > -struct zswap_tree {
> > -     struct rb_root rbroot;
> > -     spinlock_t lock;
> > -};
> > -
> > -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> > +static struct xarray *zswap_trees[MAX_SWAPFILES];
> >  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
> >
> >  /* RCU-protected iteration */
> > @@ -253,7 +247,7 @@ static bool zswap_has_pool;
> >  * helpers and fwd declarations
> >  **********************************/
> >
> > -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> > +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >  {
> >       return &zswap_trees[swp_type(swp)][swp_offset(swp)
> >               >> SWAP_ADDRESS_SPACE_SHIFT];
> > @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >  }
> >
> >  /*********************************
> > -* rbtree functions
> > +* xarray functions
> >  **********************************/
> > -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > -{
> > -     struct rb_node *node = root->rb_node;
> > -     struct zswap_entry *entry;
> > -     pgoff_t entry_offset;
> > -
> > -     while (node) {
> > -             entry = rb_entry(node, struct zswap_entry, rbnode);
> > -             entry_offset = swp_offset(entry->swpentry);
> > -             if (entry_offset > offset)
> > -                     node = node->rb_left;
> > -             else if (entry_offset < offset)
> > -                     node = node->rb_right;
> > -             else
> > -                     return entry;
> > -     }
> > -     return NULL;
> > -}
> >
> >  /*
> >   * In the case that a entry with the same offset is found, a pointer to
> > - * the existing entry is stored in dupentry and the function returns -EEXIST
> > + * the existing entry is stored in old and erased from the tree.
> > + * Function return error on insert.
> >   */
> > -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > -                     struct zswap_entry **dupentry)
> > +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> > +                        struct zswap_entry **old)
> >  {
> > -     struct rb_node **link = &root->rb_node, *parent = NULL;
> > -     struct zswap_entry *myentry;
> > -     pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> > -
> > -     while (*link) {
> > -             parent = *link;
> > -             myentry = rb_entry(parent, struct zswap_entry, rbnode);
> > -             myentry_offset = swp_offset(myentry->swpentry);
> > -             if (myentry_offset > entry_offset)
> > -                     link = &(*link)->rb_left;
> > -             else if (myentry_offset < entry_offset)
> > -                     link = &(*link)->rb_right;
> > -             else {
> > -                     *dupentry = myentry;
> > -                     return -EEXIST;
> > -             }
> > -     }
> > -     rb_link_node(&entry->rbnode, parent, link);
> > -     rb_insert_color(&entry->rbnode, root);
> > -     return 0;
> > -}
> > +     int err;
> > +     struct zswap_entry *e;
> > +     pgoff_t offset = swp_offset(entry->swpentry);
> >
> > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > -{
> > -     rb_erase(&entry->rbnode, root);
> > -     RB_CLEAR_NODE(&entry->rbnode);
> > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > +     err = xa_err(e);
> > +
> > +     if (err) {
> > +             e = xa_erase(tree, offset);
> > +             if (err == -ENOMEM)
> > +                     zswap_reject_alloc_fail++;
> > +             else
> > +                     zswap_reject_xarray_fail++;
>
> I think this is too complicated, and as Chengming pointed out, I believe
> we can use xa_store() directly in zswap_store().

Sure.

> I am also not sure what the need for zswap_reject_xarray_fail is. Are
> there any reasons why the store here can fail other than -ENOMEM? The
> docs say the only other option is -EINVAL, and looking at __xa_store(),
> it seems like this is only possible if xa_is_internal() is true (which
> means we are not passing in a properly aligned pointer IIUC).

Because the xa_store() documentation says it can return two error
codes. I see zswap tries to classify the errors it hits into separate
counters, which is why I added zswap_reject_xarray_fail.

>
> > +     }
> > +     *old = e;
> > +     return err;
> >  }
> >
> >  /*********************************
> > @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> >       entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> >       if (!entry)
> >               return NULL;
> > -     RB_CLEAR_NODE(&entry->rbnode);
> >       return entry;
> >  }
> >
> > @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> >       zswap_update_total_size();
> >  }
> >
> > -/*
> > - * The caller hold the tree lock and search the entry from the tree,
> > - * so it must be on the tree, remove it from the tree and free it.
> > - */
> > -static void zswap_invalidate_entry(struct zswap_tree *tree,
> > -                                struct zswap_entry *entry)
> > -{
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     zswap_entry_free(entry);
> > -}
> > -
> >  /*********************************
> >  * compressed storage functions
> >  **********************************/
> > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> >  static int zswap_writeback_entry(struct zswap_entry *entry,
> >                                swp_entry_t swpentry)
> >  {
> > -     struct zswap_tree *tree;
> > +     struct xarray *tree;
> > +     pgoff_t offset = swp_offset(swpentry);
> > +     struct zswap_entry *e;
> >       struct folio *folio;
> >       struct mempolicy *mpol;
> >       bool folio_was_allocated;
> > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >        * be dereferenced.
> >        */
> >       tree = swap_zswap_tree(swpentry);
> > -     spin_lock(&tree->lock);
> > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > -             spin_unlock(&tree->lock);
> > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > +     if (e != entry) {
>
> I think we can avoid adding 'e' and 'offset' local variables here and
> just do everything in the if condition. If you want to avoid the line
> break, then introducing 'offset' is fine, but I don't see any value from
> 'e'.

As I said in my other email, I don't think having this kind of local
variable affects the compiler negatively. The compiler generally uses
its own temporary to track the expression anyway. So I am not sure
about the motivation to remove local variables on their own, if they
help the reading. I feel the line "if (xa_cmpxchg(tree, offset, entry,
NULL, GFP_KERNEL) != entry)" is too long and complicated to put inside
the if condition. That is just me. Not a big deal.
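
Just to make the comparison concrete, the two forms being discussed
are roughly the following (they should generate the same code either
way; it is purely a readability question):

        e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
        if (e != entry) {
                ...
        }

vs.

        if (xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) != entry) {
                ...
        }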

>
>
> >               delete_from_swap_cache(folio);
> >               folio_unlock(folio);
> >               folio_put(folio);
> >               return -ENOMEM;
> >       }
> >
> > -     /* Safe to deref entry after the entry is verified above. */
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     spin_unlock(&tree->lock);
> > -
> >       zswap_decompress(entry, &folio->page);
> >
> >       count_vm_event(ZSWPWB);
> > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> >  {
> >       swp_entry_t swp = folio->swap;
> >       pgoff_t offset = swp_offset(swp);
> > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > -     struct zswap_entry *entry, *dupentry;
> > +     struct xarray *tree = swap_zswap_tree(swp);
> > +     struct zswap_entry *entry, *old;
> >       struct obj_cgroup *objcg = NULL;
> >       struct mem_cgroup *memcg = NULL;
> > +     int err;
> > +     bool old_erased = false;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> >                       kunmap_local(src);
> >                       entry->length = 0;
> >                       entry->value = value;
> > +                     entry->pool = NULL;
>
> Why do we need to initialize the pool here? Is this is a bug fix for an
> existing problem or just keeping things clean? Either way I think it
> should be done separately, unless it is related to a change in this
> patch.

I noticed that entry->pool is left uninitialized in this path. I think
it should be cleaned up. It is just a cleanup and does not need to
happen in this patch; I can do it as a separate patch.

>
> >                       atomic_inc(&zswap_same_filled_pages);
> >                       goto insert_entry;
> >               }
> > @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >       entry->swpentry = swp;
> >       entry->objcg = objcg;
> > -     if (objcg) {
> > -             obj_cgroup_charge_zswap(objcg, entry->length);
> > -             /* Account before objcg ref is moved to tree */
> > -             count_objcg_event(objcg, ZSWPOUT);
> > -     }
> >
> >       /* map */
>
> Let's remove this comment while we are here, it's pointless AFAICT.

Sure.

>
> > -     spin_lock(&tree->lock);
> >       /*
> >        * The folio may have been dirtied again, invalidate the
> >        * possibly stale entry before inserting the new entry.
> >        */
> > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > -             zswap_invalidate_entry(tree, dupentry);
> > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > +     err = zswap_xa_insert(tree, entry, &old);
> > +     if (old)
> > +             zswap_entry_free(old);
> > +     if (err) {
> > +             old_erased = true;
>
> I think this can be made simpler if we open code xa_store() here,
> especially that we already have cleanup code below under 'check_old'
> that removes the exisitng old entry. So zswap_xa_insert() replicates
> this cleanup, then we add this 'old_erased' boolean to avoid doing the
> cleanup below. It seems like it would much more straightforward with
> open-coding xa_store() here and relying on the existing cleanup for the
> old entry.  Also, if we initialize 'old' to NULL, we can use its value
> to figure out whether any cleanup is needed under 'check_old' or not.

I think that is very similar to what Chengming was suggesting.

>
> Taking a step back, I think we can further simplify this. What if we
> move the tree insertion to right after we allocate the zswap entry? In
> this case, if the tree insertion fails, we don't need to decrement the
> same filled counter. If the tree insertion succeeds and then something
> else fails, the existing cleanup code under 'check_old' will already
> clean up the tree insertion for us.

That will create complications: if zswap compression fails (e.g. the
compression ratio is too poor), you will have to remove the entry from
the tree as cleanup. You end up with both an xa_store() and an
xa_erase() where the current code just does one xa_erase() on
compression failure.
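
A rough sketch of the two orderings on the compression-failure path,
just to illustrate the point (not actual code):

        /* current code, on compression failure: */
        xa_erase(tree, offset);         /* drop a possibly stale old entry */

        /* with the insert moved before compression, the same failure does: */
        xa_store(tree, offset, entry, GFP_KERNEL);  /* insert the new entry */
        /* ... compression fails ... */
        xa_erase(tree, offset);                     /* and remove it again */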

>
> If this works, we don't need to add extra cleanup code or move any code
> around. Something like:

Due to the extra xa_insert() on compression failure, I think
Chengming's or your earlier suggestion is better.

BTW, while you are here, can you confirm that the race discussed in
the earlier email can't happen? Chengming convinced me it shouldn't
happen, but I'd like to hear your thoughts.

 CPU1                         CPU2

xa_store()
                                   entry = xa_erase()
                                   zswap_free_entry(entry)

if (entry->length)
     ...
CPU1 is using entry after free.

Chris


>
>         entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
>         if (!entry) {
>                 zswap_reject_kmemcache_fail++;
>                 goto reject;
>         }
>
>         old = xa_store(tree, offset, entry, GFP_KERNEL);
>         if (xa_is_err(old)) {
>                 old = NULL;
>                 goto freepage;
>         }
>         if (old)
>                 zswap_free_entry(old);
>
>         ...
> check_old:
>         if (!old) {
>                 old = xa_erase(tree, offset);
>                 if (old)
>                         zswap_entry_free(old);
>         }
>
>
> WDYT?
>
> > +             goto insert_failed;
> >       }
> > +
> > +     if (objcg) {
> > +             obj_cgroup_charge_zswap(objcg, entry->length);
> > +             /* Account before objcg ref is moved to tree */
> > +             count_objcg_event(objcg, ZSWPOUT);
> > +     }
> > +
> >       if (entry->length) {
> >               INIT_LIST_HEAD(&entry->lru);
> >               zswap_lru_add(&zswap.list_lru, entry);
> >               atomic_inc(&zswap.nr_stored);
> >       }
> > -     spin_unlock(&tree->lock);
> >
> >       /* update stats */
> >       atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
> >
> >       return true;
> >
> > +insert_failed:
> > +     if (!entry->length) {
> > +             atomic_dec(&zswap_same_filled_pages);
> > +             goto freepage;
> > +     }
> > +     zpool_free(zswap_find_zpool(entry), entry->handle);
> >  put_pool:
> >       zswap_pool_put(entry->pool);
> >  freepage:
> > @@ -1592,17 +1556,19 @@ bool zswap_store(struct folio *folio)
> >  reject:
> >       if (objcg)
> >               obj_cgroup_put(objcg);
> > +
> > +     if (old_erased)
> > +             goto failed;
> >  check_old:
> >       /*
> >        * If the zswap store fails or zswap is disabled, we must invalidate the
> >        * possibly stale entry which was previously stored at this offset.
> >        * Otherwise, writeback could overwrite the new data in the swapfile.
> >        */
> > -     spin_lock(&tree->lock);
> > -     entry = zswap_rb_search(&tree->rbroot, offset);
> > +     entry = xa_erase(tree, offset);
> >       if (entry)
> > -             zswap_invalidate_entry(tree, entry);
> > -     spin_unlock(&tree->lock);
> > +             zswap_entry_free(entry);
> > +failed:
> >       return false;
> >
> >  shrink:
> [..]
>
Chris Li March 6, 2024, 11:43 p.m. UTC | #5
On Mon, Mar 4, 2024 at 6:52 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Hi Chris,
>
> On 2024/3/5 05:32, Chris Li wrote:
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will reduce a bit due to removing the
> > RB node, which has two pointers and a color field. Xarray
> > store the pointer in the xarray tree rather than the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory,
> > if the swap entries are densely packed.
> >
> > Notice the zswap_rb_search and zswap_rb_insert always
> > followed by zswap_rb_erase. Use xa_erase directly. The entry
> > erase into zswap_xa_insert as well. That saves one tree
> > lookup as well.
> >
> > Remove zswap_invalidate_entry due to no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build testing 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
> >
> > mm-9a0181a3710eb                    xarray v4
> > user       3526.829                            3526.930
> > sys        532.754                             526.525
> > real       198.748                             198.850
> >
> > ---
> >
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > Changes in v4:
> > - Remove zswap_xa_search_and_earse, use xa_erase directly.
> > - Move charge of objcg after zswap_xa_insert.
> > - Avoid erase old entry on insert fail error path.
> > - Remove not needed swap_zswap_tree change
> > - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> >
> > Changes in v3:
> > - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> > - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> > - Fix xa_store error handling for same page fill case.
> > - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> >
> > Changes in v2:
> > - Replace struct zswap_tree with struct xarray.
> > - Remove zswap_tree spinlock, use xarray lock instead.
> > - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> > - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> > - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> > ---
> >  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
> >  1 file changed, 72 insertions(+), 114 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 011e068eb355..4f4a3f452b76 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -20,7 +20,6 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/types.h>
> >  #include <linux/atomic.h>
> > -#include <linux/rbtree.h>
> >  #include <linux/swap.h>
> >  #include <linux/crypto.h>
> >  #include <linux/scatterlist.h>
> > @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> >  static u64 zswap_reject_kmemcache_fail;
> > +/* Store failed because xarray can't insert the entry*/
> > +static u64 zswap_reject_xarray_fail;
> >
> >  /* Shrinker work queue */
> >  static struct workqueue_struct *shrink_wq;
> > @@ -196,7 +197,6 @@ static struct {
> >   * This structure contains the metadata for tracking a single compressed
> >   * page within zswap.
> >   *
> > - * rbnode - links the entry into red-black tree for the appropriate swap type
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> >   * length - the length in bytes of the compressed page data.  Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> > @@ -208,7 +208,6 @@ static struct {
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> >  struct zswap_entry {
> > -     struct rb_node rbnode;
> >       swp_entry_t swpentry;
> >       unsigned int length;
> >       struct zswap_pool *pool;
> > @@ -220,12 +219,7 @@ struct zswap_entry {
> >       struct list_head lru;
> >  };
> >
> > -struct zswap_tree {
> > -     struct rb_root rbroot;
> > -     spinlock_t lock;
> > -};
> > -
> > -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> > +static struct xarray *zswap_trees[MAX_SWAPFILES];
> >  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
> >
> >  /* RCU-protected iteration */
> > @@ -253,7 +247,7 @@ static bool zswap_has_pool;
> >  * helpers and fwd declarations
> >  **********************************/
> >
> > -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> > +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >  {
> >       return &zswap_trees[swp_type(swp)][swp_offset(swp)
> >               >> SWAP_ADDRESS_SPACE_SHIFT];
> > @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >  }
> >
> >  /*********************************
> > -* rbtree functions
> > +* xarray functions
> >  **********************************/
> > -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > -{
> > -     struct rb_node *node = root->rb_node;
> > -     struct zswap_entry *entry;
> > -     pgoff_t entry_offset;
> > -
> > -     while (node) {
> > -             entry = rb_entry(node, struct zswap_entry, rbnode);
> > -             entry_offset = swp_offset(entry->swpentry);
> > -             if (entry_offset > offset)
> > -                     node = node->rb_left;
> > -             else if (entry_offset < offset)
> > -                     node = node->rb_right;
> > -             else
> > -                     return entry;
> > -     }
> > -     return NULL;
> > -}
> >
> >  /*
> >   * In the case that a entry with the same offset is found, a pointer to
> > - * the existing entry is stored in dupentry and the function returns -EEXIST
> > + * the existing entry is stored in old and erased from the tree.
> > + * Function return error on insert.
> >   */
> > -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > -                     struct zswap_entry **dupentry)
> > +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> > +                        struct zswap_entry **old)
> >  {
> > -     struct rb_node **link = &root->rb_node, *parent = NULL;
> > -     struct zswap_entry *myentry;
> > -     pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> > -
> > -     while (*link) {
> > -             parent = *link;
> > -             myentry = rb_entry(parent, struct zswap_entry, rbnode);
> > -             myentry_offset = swp_offset(myentry->swpentry);
> > -             if (myentry_offset > entry_offset)
> > -                     link = &(*link)->rb_left;
> > -             else if (myentry_offset < entry_offset)
> > -                     link = &(*link)->rb_right;
> > -             else {
> > -                     *dupentry = myentry;
> > -                     return -EEXIST;
> > -             }
> > -     }
> > -     rb_link_node(&entry->rbnode, parent, link);
> > -     rb_insert_color(&entry->rbnode, root);
> > -     return 0;
> > -}
> > +     int err;
> > +     struct zswap_entry *e;
> > +     pgoff_t offset = swp_offset(entry->swpentry);
> >
> > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > -{
> > -     rb_erase(&entry->rbnode, root);
> > -     RB_CLEAR_NODE(&entry->rbnode);
> > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > +     err = xa_err(e);
> > +
> > +     if (err) {
> > +             e = xa_erase(tree, offset);
> > +             if (err == -ENOMEM)
> > +                     zswap_reject_alloc_fail++;
> > +             else
> > +                     zswap_reject_xarray_fail++;
> > +     }
> > +     *old = e;
> > +     return err;
> >  }
> >
> >  /*********************************
> > @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> >       entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> >       if (!entry)
> >               return NULL;
> > -     RB_CLEAR_NODE(&entry->rbnode);
> >       return entry;
> >  }
> >
> > @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> >       zswap_update_total_size();
> >  }
> >
> > -/*
> > - * The caller hold the tree lock and search the entry from the tree,
> > - * so it must be on the tree, remove it from the tree and free it.
> > - */
> > -static void zswap_invalidate_entry(struct zswap_tree *tree,
> > -                                struct zswap_entry *entry)
> > -{
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     zswap_entry_free(entry);
> > -}
> > -
> >  /*********************************
> >  * compressed storage functions
> >  **********************************/
> > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> >  static int zswap_writeback_entry(struct zswap_entry *entry,
> >                                swp_entry_t swpentry)
> >  {
> > -     struct zswap_tree *tree;
> > +     struct xarray *tree;
> > +     pgoff_t offset = swp_offset(swpentry);
> > +     struct zswap_entry *e;
> >       struct folio *folio;
> >       struct mempolicy *mpol;
> >       bool folio_was_allocated;
> > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >        * be dereferenced.
> >        */
> >       tree = swap_zswap_tree(swpentry);
> > -     spin_lock(&tree->lock);
> > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > -             spin_unlock(&tree->lock);
> > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > +     if (e != entry) {
>
> Maybe "if (xa_cmpxchg() != entry)" look better, so "e" variable can be removed,
> since we don't use it.
>
> >               delete_from_swap_cache(folio);
> >               folio_unlock(folio);
> >               folio_put(folio);
> >               return -ENOMEM;
> >       }
> >
> > -     /* Safe to deref entry after the entry is verified above. */
> > -     zswap_rb_erase(&tree->rbroot, entry);
> > -     spin_unlock(&tree->lock);
> > -
> >       zswap_decompress(entry, &folio->page);
> >
> >       count_vm_event(ZSWPWB);
> > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> >  {
> >       swp_entry_t swp = folio->swap;
> >       pgoff_t offset = swp_offset(swp);
> > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > -     struct zswap_entry *entry, *dupentry;
> > +     struct xarray *tree = swap_zswap_tree(swp);
> > +     struct zswap_entry *entry, *old;
> >       struct obj_cgroup *objcg = NULL;
> >       struct mem_cgroup *memcg = NULL;
> > +     int err;
> > +     bool old_erased = false;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> >                       kunmap_local(src);
> >                       entry->length = 0;
> >                       entry->value = value;
> > +                     entry->pool = NULL;
> >                       atomic_inc(&zswap_same_filled_pages);
> >                       goto insert_entry;
> >               }
> > @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >       entry->swpentry = swp;
> >       entry->objcg = objcg;
> > -     if (objcg) {
> > -             obj_cgroup_charge_zswap(objcg, entry->length);
> > -             /* Account before objcg ref is moved to tree */
> > -             count_objcg_event(objcg, ZSWPOUT);
> > -     }
> >
> >       /* map */
> > -     spin_lock(&tree->lock);
> >       /*
> >        * The folio may have been dirtied again, invalidate the
> >        * possibly stale entry before inserting the new entry.
> >        */
> > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > -             zswap_invalidate_entry(tree, dupentry);
> > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > +     err = zswap_xa_insert(tree, entry, &old);
> > +     if (old)
> > +             zswap_entry_free(old);
> > +     if (err) {
> > +             old_erased = true;
> > +             goto insert_failed;
> >       }
>
> It looks a little complicated for me :( How do you think this like below?
>
> old = xa_store(tree, offset, entry, GFP_KERNEL);
> if (xa_is_err(old))
>         goto store_failed;
>
> if (old)
>         zswap_entry_free(old);
>
> Then zswap_xa_insert() wrapper can be removed since no use elsewhere.
> So the error handling path is kept much the same as before and simpler.
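
Just to spell it out for myself, the error label would then need to do
roughly what the insert_failed block in the patch does, plus the stat
accounting (untested sketch, the label name just follows your
suggestion):

store_failed:
        if (xa_err(old) == -ENOMEM)
                zswap_reject_alloc_fail++;
        if (!entry->length) {
                atomic_dec(&zswap_same_filled_pages);
                goto freepage;
        }
        zpool_free(zswap_find_zpool(entry), entry->handle);
        goto put_pool;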
>
> > +
> > +     if (objcg) {
> > +             obj_cgroup_charge_zswap(objcg, entry->length);
> > +             /* Account before objcg ref is moved to tree */
> > +             count_objcg_event(objcg, ZSWPOUT);
> > +     }
> > +
> >       if (entry->length) {
> >               INIT_LIST_HEAD(&entry->lru);
> >               zswap_lru_add(&zswap.list_lru, entry);
> >               atomic_inc(&zswap.nr_stored);
> >       }
> > -     spin_unlock(&tree->lock);
> >
> >       /* update stats */
> >       atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
> >
> >       return true;
> >
> > +insert_failed:
> > +     if (!entry->length) {
> > +             atomic_dec(&zswap_same_filled_pages);
> > +             goto freepage;
> > +     }
> > +     zpool_free(zswap_find_zpool(entry), entry->handle);
>
> entry->pool can be used here instead of zswap_find_zpool().

Actually you can't use entry->pool there; I get a compile error.

static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
{
        int i = 0;

        if (ZSWAP_NR_ZPOOLS > 1)
                i = hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS));

        return entry->pool->zpools[i];
}

zswap_find_zpool() returns one of entry->pool->zpools[i], which is not
the same thing as entry->pool.
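
To put it another way, the two have different types: entry->pool is a
struct zswap_pool *, while zpool_free() takes the struct zpool * that
zswap_find_zpool() picks out of the pool's zpools[] array. Roughly:

        struct zswap_pool *pool = entry->pool;         /* the zswap pool */
        struct zpool *zpool = zswap_find_zpool(entry); /* one of pool->zpools[i] */

        zpool_free(zpool, entry->handle);              /* needs the zpool, not the pool */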

Chris
Chengming Zhou March 7, 2024, 2:17 a.m. UTC | #6
On 2024/3/7 05:12, Chris Li wrote:
> Hi Chengming,
> 
> On Mon, Mar 4, 2024 at 6:52 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Hi Chris,
>>
>> On 2024/3/5 05:32, Chris Li wrote:
>>> Very deep RB tree requires rebalance at times. That
>>> contributes to the zswap fault latencies. Xarray does not
>>> need to perform tree rebalance. Replacing RB tree to xarray
>>> can have some small performance gain.
>>>
>>> One small difference is that xarray insert might fail with
>>> ENOMEM, while RB tree insert does not allocate additional
>>> memory.
>>>
>>> The zswap_entry size will reduce a bit due to removing the
>>> RB node, which has two pointers and a color field. Xarray
>>> store the pointer in the xarray tree rather than the
>>> zswap_entry. Every entry has one pointer from the xarray
>>> tree. Overall, switching to xarray should save some memory,
>>> if the swap entries are densely packed.
>>>
>>> Notice the zswap_rb_search and zswap_rb_insert always
>>> followed by zswap_rb_erase. Use xa_erase directly. The entry
>>> erase into zswap_xa_insert as well. That saves one tree
>>> lookup as well.
>>>
>>> Remove zswap_invalidate_entry due to no need to call
>>> zswap_rb_erase any more. Use zswap_free_entry instead.
>>>
>>> The "struct zswap_tree" has been replaced by "struct xarray".
>>> The tree spin lock has transferred to the xarray lock.
>>>
>>> Run the kernel build testing 10 times for each version, averages:
>>> (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
>>>
>>> mm-9a0181a3710eb                    xarray v4
>>> user       3526.829                            3526.930
>>> sys        532.754                             526.525
>>> real       198.748                             198.850
>>>
>>> ---
>>>
>>>
>>> Signed-off-by: Chris Li <chrisl@kernel.org>
>>> ---
>>> Changes in v4:
>>> - Remove zswap_xa_search_and_earse, use xa_erase directly.
>>> - Move charge of objcg after zswap_xa_insert.
>>> - Avoid erase old entry on insert fail error path.
>>> - Remove not needed swap_zswap_tree change
>>> - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
>>>
>>> Changes in v3:
>>> - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
>>> - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
>>> - Fix xa_store error handling for same page fill case.
>>> - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
>>>
>>> Changes in v2:
>>> - Replace struct zswap_tree with struct xarray.
>>> - Remove zswap_tree spinlock, use xarray lock instead.
>>> - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
>>> - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
>>> - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
>>> ---
>>>  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
>>>  1 file changed, 72 insertions(+), 114 deletions(-)
>>>
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> index 011e068eb355..4f4a3f452b76 100644
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -20,7 +20,6 @@
>>>  #include <linux/spinlock.h>
>>>  #include <linux/types.h>
>>>  #include <linux/atomic.h>
>>> -#include <linux/rbtree.h>
>>>  #include <linux/swap.h>
>>>  #include <linux/crypto.h>
>>>  #include <linux/scatterlist.h>
>>> @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
>>>  static u64 zswap_reject_alloc_fail;
>>>  /* Store failed because the entry metadata could not be allocated (rare) */
>>>  static u64 zswap_reject_kmemcache_fail;
>>> +/* Store failed because xarray can't insert the entry*/
>>> +static u64 zswap_reject_xarray_fail;
>>>
>>>  /* Shrinker work queue */
>>>  static struct workqueue_struct *shrink_wq;
>>> @@ -196,7 +197,6 @@ static struct {
>>>   * This structure contains the metadata for tracking a single compressed
>>>   * page within zswap.
>>>   *
>>> - * rbnode - links the entry into red-black tree for the appropriate swap type
>>>   * swpentry - associated swap entry, the offset indexes into the red-black tree
>>>   * length - the length in bytes of the compressed page data.  Needed during
>>>   *          decompression. For a same value filled page length is 0, and both
>>> @@ -208,7 +208,6 @@ static struct {
>>>   * lru - handle to the pool's lru used to evict pages.
>>>   */
>>>  struct zswap_entry {
>>> -     struct rb_node rbnode;
>>>       swp_entry_t swpentry;
>>>       unsigned int length;
>>>       struct zswap_pool *pool;
>>> @@ -220,12 +219,7 @@ struct zswap_entry {
>>>       struct list_head lru;
>>>  };
>>>
>>> -struct zswap_tree {
>>> -     struct rb_root rbroot;
>>> -     spinlock_t lock;
>>> -};
>>> -
>>> -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
>>> +static struct xarray *zswap_trees[MAX_SWAPFILES];
>>>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
>>>
>>>  /* RCU-protected iteration */
>>> @@ -253,7 +247,7 @@ static bool zswap_has_pool;
>>>  * helpers and fwd declarations
>>>  **********************************/
>>>
>>> -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>>> +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>>>  {
>>>       return &zswap_trees[swp_type(swp)][swp_offset(swp)
>>>               >> SWAP_ADDRESS_SPACE_SHIFT];
>>> @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>>>  }
>>>
>>>  /*********************************
>>> -* rbtree functions
>>> +* xarray functions
>>>  **********************************/
>>> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>> -{
>>> -     struct rb_node *node = root->rb_node;
>>> -     struct zswap_entry *entry;
>>> -     pgoff_t entry_offset;
>>> -
>>> -     while (node) {
>>> -             entry = rb_entry(node, struct zswap_entry, rbnode);
>>> -             entry_offset = swp_offset(entry->swpentry);
>>> -             if (entry_offset > offset)
>>> -                     node = node->rb_left;
>>> -             else if (entry_offset < offset)
>>> -                     node = node->rb_right;
>>> -             else
>>> -                     return entry;
>>> -     }
>>> -     return NULL;
>>> -}
>>>
>>>  /*
>>>   * In the case that a entry with the same offset is found, a pointer to
>>> - * the existing entry is stored in dupentry and the function returns -EEXIST
>>> + * the existing entry is stored in old and erased from the tree.
>>> + * Function return error on insert.
>>>   */
>>> -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
>>> -                     struct zswap_entry **dupentry)
>>> +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
>>> +                        struct zswap_entry **old)
>>>  {
>>> -     struct rb_node **link = &root->rb_node, *parent = NULL;
>>> -     struct zswap_entry *myentry;
>>> -     pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
>>> -
>>> -     while (*link) {
>>> -             parent = *link;
>>> -             myentry = rb_entry(parent, struct zswap_entry, rbnode);
>>> -             myentry_offset = swp_offset(myentry->swpentry);
>>> -             if (myentry_offset > entry_offset)
>>> -                     link = &(*link)->rb_left;
>>> -             else if (myentry_offset < entry_offset)
>>> -                     link = &(*link)->rb_right;
>>> -             else {
>>> -                     *dupentry = myentry;
>>> -                     return -EEXIST;
>>> -             }
>>> -     }
>>> -     rb_link_node(&entry->rbnode, parent, link);
>>> -     rb_insert_color(&entry->rbnode, root);
>>> -     return 0;
>>> -}
>>> +     int err;
>>> +     struct zswap_entry *e;
>>> +     pgoff_t offset = swp_offset(entry->swpentry);
>>>
>>> -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
>>> -{
>>> -     rb_erase(&entry->rbnode, root);
>>> -     RB_CLEAR_NODE(&entry->rbnode);
>>> +     e = xa_store(tree, offset, entry, GFP_KERNEL);
>>> +     err = xa_err(e);
>>> +
>>> +     if (err) {
>>> +             e = xa_erase(tree, offset);
>>> +             if (err == -ENOMEM)
>>> +                     zswap_reject_alloc_fail++;
>>> +             else
>>> +                     zswap_reject_xarray_fail++;
>>> +     }
>>> +     *old = e;
>>> +     return err;
>>>  }
>>>
>>>  /*********************************
>>> @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
>>>       entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
>>>       if (!entry)
>>>               return NULL;
>>> -     RB_CLEAR_NODE(&entry->rbnode);
>>>       return entry;
>>>  }
>>>
>>> @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
>>>       zswap_update_total_size();
>>>  }
>>>
>>> -/*
>>> - * The caller hold the tree lock and search the entry from the tree,
>>> - * so it must be on the tree, remove it from the tree and free it.
>>> - */
>>> -static void zswap_invalidate_entry(struct zswap_tree *tree,
>>> -                                struct zswap_entry *entry)
>>> -{
>>> -     zswap_rb_erase(&tree->rbroot, entry);
>>> -     zswap_entry_free(entry);
>>> -}
>>> -
>>>  /*********************************
>>>  * compressed storage functions
>>>  **********************************/
>>> @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
>>>  static int zswap_writeback_entry(struct zswap_entry *entry,
>>>                                swp_entry_t swpentry)
>>>  {
>>> -     struct zswap_tree *tree;
>>> +     struct xarray *tree;
>>> +     pgoff_t offset = swp_offset(swpentry);
>>> +     struct zswap_entry *e;
>>>       struct folio *folio;
>>>       struct mempolicy *mpol;
>>>       bool folio_was_allocated;
>>> @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>>        * be dereferenced.
>>>        */
>>>       tree = swap_zswap_tree(swpentry);
>>> -     spin_lock(&tree->lock);
>>> -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
>>> -             spin_unlock(&tree->lock);
>>> +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
>>> +     if (e != entry) {
>>
>> Maybe "if (xa_cmpxchg() != entry)" look better, so "e" variable can be removed,
>> since we don't use it.
> 
> I thought about that. The main reason "if (xa_cmpxchg(tree, offset,
> entry, NULL, GFP_KERNEL) != entry)" is a long expression. Having
> != entry at the end of the expression makes it harder to read. Have
> the temporary variable IMHO help reading the comparison.
> 
> Not sure I understand the motivation to save a temporary variable. The
> compiler always generates temporary variables internally. Having one
> here does not affect the compiling result at all. It is just there to
> help reading. If you still think it has the reverse effect on reading
> I can of course remove it. It generates the same code anyway.
You are right, it was just a minor suggestion; not much difference if you keep it.

> 
>>
>>>               delete_from_swap_cache(folio);
>>>               folio_unlock(folio);
>>>               folio_put(folio);
>>>               return -ENOMEM;
>>>       }
>>>
>>> -     /* Safe to deref entry after the entry is verified above. */
>>> -     zswap_rb_erase(&tree->rbroot, entry);
>>> -     spin_unlock(&tree->lock);
>>> -
>>>       zswap_decompress(entry, &folio->page);
>>>
>>>       count_vm_event(ZSWPWB);
>>> @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
>>>  {
>>>       swp_entry_t swp = folio->swap;
>>>       pgoff_t offset = swp_offset(swp);
>>> -     struct zswap_tree *tree = swap_zswap_tree(swp);
>>> -     struct zswap_entry *entry, *dupentry;
>>> +     struct xarray *tree = swap_zswap_tree(swp);
>>> +     struct zswap_entry *entry, *old;
>>>       struct obj_cgroup *objcg = NULL;
>>>       struct mem_cgroup *memcg = NULL;
>>> +     int err;
>>> +     bool old_erased = false;
>>>
>>>       VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>>> @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
>>>                       kunmap_local(src);
>>>                       entry->length = 0;
>>>                       entry->value = value;
>>> +                     entry->pool = NULL;
>>>                       atomic_inc(&zswap_same_filled_pages);
>>>                       goto insert_entry;
>>>               }
>>> @@ -1555,28 +1510,31 @@ bool zswap_store(struct folio *folio)
>>>  insert_entry:
>>>       entry->swpentry = swp;
>>>       entry->objcg = objcg;
>>> -     if (objcg) {
>>> -             obj_cgroup_charge_zswap(objcg, entry->length);
>>> -             /* Account before objcg ref is moved to tree */
>>> -             count_objcg_event(objcg, ZSWPOUT);
>>> -     }
>>>
>>>       /* map */
>>> -     spin_lock(&tree->lock);
>>>       /*
>>>        * The folio may have been dirtied again, invalidate the
>>>        * possibly stale entry before inserting the new entry.
>>>        */
>>> -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
>>> -             zswap_invalidate_entry(tree, dupentry);
>>> -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
>>> +     err = zswap_xa_insert(tree, entry, &old);
>>> +     if (old)
>>> +             zswap_entry_free(old);
>>> +     if (err) {
>>> +             old_erased = true;
>>> +             goto insert_failed;
>>>       }
>>
>> It looks a little complicated for me :( How do you think this like below?
>>
>> old = xa_store(tree, offset, entry, GFP_KERNEL);
>> if (xa_is_err(old))
> 
> Might want to bump up the ENOMEM error count if it is out of memory
> since zswap tracks it.

Right.

> 
>>         goto store_failed;
>>
>> if (old)
>>         zswap_entry_free(old);
>>
>> Then zswap_xa_insert() wrapper can be removed since no use elsewhere.
>> So the error handling path is kept much the same as before and simpler.
> 
> Sure.
> 
>>
>>> +
>>> +     if (objcg) {
>>> +             obj_cgroup_charge_zswap(objcg, entry->length);
>>> +             /* Account before objcg ref is moved to tree */
>>> +             count_objcg_event(objcg, ZSWPOUT);
>>> +     }
>>> +
>>>       if (entry->length) {
>>>               INIT_LIST_HEAD(&entry->lru);
>>>               zswap_lru_add(&zswap.list_lru, entry);
>>>               atomic_inc(&zswap.nr_stored);
>>>       }
>>> -     spin_unlock(&tree->lock);
>>>
>>>       /* update stats */
>>>       atomic_inc(&zswap_stored_pages);
>>> @@ -1585,6 +1543,12 @@ bool zswap_store(struct folio *folio)
>>>
>>>       return true;
>>>
>>> +insert_failed:
>>> +     if (!entry->length) {
>>> +             atomic_dec(&zswap_same_filled_pages);
>>> +             goto freepage;
>>> +     }
>>> +     zpool_free(zswap_find_zpool(entry), entry->handle);
>>
>> entry->pool can be used here instead of zswap_find_zpool().
> 
> Ack.
> 
> I was following the code in zswap_entry_free(). Does it mean the
> zswap_find_zpool(entry) can also use entry->pool?

Ah, I just saw your other email; you are right, this is the zpool.

Thanks.

> 
> Thanks for the feedback.
> 
> Chris
> 
>>
>> Thanks!
>>
>>>  put_pool:
>>>       zswap_pool_put(entry->pool);
>>>  freepage:
>>> @@ -1592,17 +1556,19 @@ bool zswap_store(struct folio *folio)
>>>  reject:
>>>       if (objcg)
>>>               obj_cgroup_put(objcg);
>>> +
>>> +     if (old_erased)
>>> +             goto failed;
>>>  check_old:
>>>       /*
>>>        * If the zswap store fails or zswap is disabled, we must invalidate the
>>>        * possibly stale entry which was previously stored at this offset.
>>>        * Otherwise, writeback could overwrite the new data in the swapfile.
>>>        */
>>> -     spin_lock(&tree->lock);
>>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>>> +     entry = xa_erase(tree, offset);
>>>       if (entry)
>>> -             zswap_invalidate_entry(tree, entry);
>>> -     spin_unlock(&tree->lock);
>>> +             zswap_entry_free(entry);
>>> +failed:
>>>       return false;
>>>
>>>  shrink:
>>> @@ -1615,20 +1581,15 @@ bool zswap_load(struct folio *folio)
>>>       swp_entry_t swp = folio->swap;
>>>       pgoff_t offset = swp_offset(swp);
>>>       struct page *page = &folio->page;
>>> -     struct zswap_tree *tree = swap_zswap_tree(swp);
>>> +     struct xarray *tree = swap_zswap_tree(swp);
>>>       struct zswap_entry *entry;
>>>       u8 *dst;
>>>
>>>       VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>
>>> -     spin_lock(&tree->lock);
>>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>>> -     if (!entry) {
>>> -             spin_unlock(&tree->lock);
>>> +     entry = xa_erase(tree, offset);
>>> +     if (!entry)
>>>               return false;
>>> -     }
>>> -     zswap_rb_erase(&tree->rbroot, entry);
>>> -     spin_unlock(&tree->lock);
>>>
>>>       if (entry->length)
>>>               zswap_decompress(entry, page);
>>> @@ -1652,19 +1613,17 @@ bool zswap_load(struct folio *folio)
>>>  void zswap_invalidate(swp_entry_t swp)
>>>  {
>>>       pgoff_t offset = swp_offset(swp);
>>> -     struct zswap_tree *tree = swap_zswap_tree(swp);
>>> +     struct xarray *tree = swap_zswap_tree(swp);
>>>       struct zswap_entry *entry;
>>>
>>> -     spin_lock(&tree->lock);
>>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>>> +     entry = xa_erase(tree, offset);
>>>       if (entry)
>>> -             zswap_invalidate_entry(tree, entry);
>>> -     spin_unlock(&tree->lock);
>>> +             zswap_entry_free(entry);
>>>  }
>>>
>>>  int zswap_swapon(int type, unsigned long nr_pages)
>>>  {
>>> -     struct zswap_tree *trees, *tree;
>>> +     struct xarray *trees, *tree;
>>>       unsigned int nr, i;
>>>
>>>       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
>>> @@ -1674,11 +1633,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
>>>               return -ENOMEM;
>>>       }
>>>
>>> -     for (i = 0; i < nr; i++) {
>>> -             tree = trees + i;
>>> -             tree->rbroot = RB_ROOT;
>>> -             spin_lock_init(&tree->lock);
>>> -     }
>>> +     for (i = 0; i < nr; i++)
>>> +             xa_init(trees + i);
>>>
>>>       nr_zswap_trees[type] = nr;
>>>       zswap_trees[type] = trees;
>>> @@ -1687,7 +1643,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
>>>
>>>  void zswap_swapoff(int type)
>>>  {
>>> -     struct zswap_tree *trees = zswap_trees[type];
>>> +     struct xarray *trees = zswap_trees[type];
>>>       unsigned int i;
>>>
>>>       if (!trees)
>>> @@ -1695,7 +1651,7 @@ void zswap_swapoff(int type)
>>>
>>>       /* try_to_unuse() invalidated all the entries already */
>>>       for (i = 0; i < nr_zswap_trees[type]; i++)
>>> -             WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
>>> +             WARN_ON_ONCE(!xa_empty(trees + i));
>>>
>>>       kvfree(trees);
>>>       nr_zswap_trees[type] = 0;
>>> @@ -1727,6 +1683,8 @@ static int zswap_debugfs_init(void)
>>>                          zswap_debugfs_root, &zswap_reject_kmemcache_fail);
>>>       debugfs_create_u64("reject_compress_fail", 0444,
>>>                          zswap_debugfs_root, &zswap_reject_compress_fail);
>>> +     debugfs_create_u64("reject_xarray_fail", 0444,
>>> +                        zswap_debugfs_root, &zswap_reject_xarray_fail);
>>>       debugfs_create_u64("reject_compress_poor", 0444,
>>>                          zswap_debugfs_root, &zswap_reject_compress_poor);
>>>       debugfs_create_u64("written_back_pages", 0444,
>>>
>>> ---
>>> base-commit: 9a0181a3710eba1f5c6d19eadcca888be3d54e4f
>>> change-id: 20240104-zswap-xarray-716260e541e3
>>>
>>> Best regards,
>>
Yosry Ahmed March 7, 2024, 9:07 a.m. UTC | #7
[..]
> > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > -{
> > > -     rb_erase(&entry->rbnode, root);
> > > -     RB_CLEAR_NODE(&entry->rbnode);
> > > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > > +     err = xa_err(e);
> > > +
> > > +     if (err) {
> > > +             e = xa_erase(tree, offset);
> > > +             if (err == -ENOMEM)
> > > +                     zswap_reject_alloc_fail++;
> > > +             else
> > > +                     zswap_reject_xarray_fail++;
> >
> > I think this is too complicated, and as Chengming pointed out, I believe
> > we can use xa_store() directly in zswap_store().
> 
> Sure.
> 
> > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > there any reasons why the store here can fail other than -ENOMEM? The
> > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > it seems like this is only possible if xa_is_internal() is true (which
> > means we are not passing in a properly aligned pointer IIUC).
> 
> Because the xa_store document said it can return two error codes. I
> see zswap try to classify the error count it hit, that is why I add
> the zswap_reject_xarray_fail.

Right, but I think we should not get -EINVAL in this case. I think it
would be more appropriate to have a WARN_ON() or VM_WARN_ON() here?
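
Something like this, for example (rough sketch, the label name just
follows Chengming's earlier suggestion):

        old = xa_store(tree, offset, entry, GFP_KERNEL);
        if (xa_is_err(old)) {
                /* only -ENOMEM is expected here; anything else is a bug */
                WARN_ON(xa_err(old) != -ENOMEM);
                zswap_reject_alloc_fail++;
                goto store_failed;
        }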

[..]
> > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> > >  static int zswap_writeback_entry(struct zswap_entry *entry,
> > >                                swp_entry_t swpentry)
> > >  {
> > > -     struct zswap_tree *tree;
> > > +     struct xarray *tree;
> > > +     pgoff_t offset = swp_offset(swpentry);
> > > +     struct zswap_entry *e;
> > >       struct folio *folio;
> > >       struct mempolicy *mpol;
> > >       bool folio_was_allocated;
> > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >        * be dereferenced.
> > >        */
> > >       tree = swap_zswap_tree(swpentry);
> > > -     spin_lock(&tree->lock);
> > > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > > -             spin_unlock(&tree->lock);
> > > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > +     if (e != entry) {
> >
> > I think we can avoid adding 'e' and 'offset' local variables here and
> > just do everything in the if condition. If you want to avoid the line
> > break, then introducing 'offset' is fine, but I don't see any value from
> > 'e'.
> 
> As I said in my other email. I don't think having this type of local
> variable affects the compiler negatively. The compiler generally uses
> their own local variable to track the expression anyway. So I am not
> sure about the motivation to remove local variables alone, if it helps
> the reading. I feel the line  "if (xa_cmpxchg(tree, offset, entry,
> NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> condition. That is just me. Not a big deal.

I just think 'e' is not providing any readability improvements. If
anything, people need to pay closer attention to figure out 'e' is only
a temp variable and 'entry' is the real deal.

I vote for:
	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL))

[..]
> > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> > >  {
> > >       swp_entry_t swp = folio->swap;
> > >       pgoff_t offset = swp_offset(swp);
> > > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > > -     struct zswap_entry *entry, *dupentry;
> > > +     struct xarray *tree = swap_zswap_tree(swp);
> > > +     struct zswap_entry *entry, *old;
> > >       struct obj_cgroup *objcg = NULL;
> > >       struct mem_cgroup *memcg = NULL;
> > > +     int err;
> > > +     bool old_erased = false;
> > >
> > >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> > >                       kunmap_local(src);
> > >                       entry->length = 0;
> > >                       entry->value = value;
> > > +                     entry->pool = NULL;
> >
> > Why do we need to initialize the pool here? Is this is a bug fix for an
> > existing problem or just keeping things clean? Either way I think it
> > should be done separately, unless it is related to a change in this
> > patch.
> 
> I notice the entry->pool will leave uninitialized. I think it should
> be cleaned up. It is a clean up, it does not need to happen in this
> patch. I can do that as a separate patch.

Yes please.
 
[..]
> >
> > >       /*
> > >        * The folio may have been dirtied again, invalidate the
> > >        * possibly stale entry before inserting the new entry.
> > >        */
> > > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > -             zswap_invalidate_entry(tree, dupentry);
> > > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > +     err = zswap_xa_insert(tree, entry, &old);
> > > +     if (old)
> > > +             zswap_entry_free(old);
> > > +     if (err) {
> > > +             old_erased = true;
> >
> > I think this can be made simpler if we open code xa_store() here,
> > especially that we already have cleanup code below under 'check_old'
> > that removes the exisitng old entry. So zswap_xa_insert() replicates
> > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > cleanup below. It seems like it would much more straightforward with
> > open-coding xa_store() here and relying on the existing cleanup for the
> > old entry.  Also, if we initialize 'old' to NULL, we can use its value
> > to figure out whether any cleanup is needed under 'check_old' or not.
> 
> I think that is very similar to what Chengming was suggesting.
> 
> >
> > Taking a step back, I think we can further simplify this. What if we
> > move the tree insertion to right after we allocate the zswap entry? In
> > this case, if the tree insertion fails, we don't need to decrement the
> > same filled counter. If the tree insertion succeeds and then something
> > else fails, the existing cleanup code under 'check_old' will already
> > clean up the tree insertion for us.
> 
> That will create complications that, if the zswap compression fails
> the compression ratio, you will have to remove the entry from the tree
> as clean up. You have both xa_store() and xa_erase() where the current
> code just does one xa_erase() on compression failure.

Not really. If xa_store() fails because of -ENOMEM, then I think by
definition we do not need xa_erase() as there shouldn't be any stale
entries. I also think -ENOMEM should be the only valid errno from
xa_store() in this context. So we can avoid the check_old code if
xa_store() is called (whether it fails or succeeds) IIUC.

I prefer calling xa_store() early and avoiding the extra 'insert_failed'
cleanup code, especially that unlike other cleanup code, it has its own
branching based on entry->length. I am also planning a cleanup for
zswap_store() to split the code better for the same_filled case and
avoid some unnecessary checks and failures, so it would be useful to
keep the common code path together.

> 
> >
> > If this works, we don't need to add extra cleanup code or move any code
> > around. Something like:
> 
> Due to the extra xa_insert() on compression failure, I think
> Chengming's or your earlier suggestion is better.
> 
> BTW, while you are here, can you confirm this race discussed in
> earlier email can't happen? Chengming convinced me this shouldn't
> happen. Like to hear your thoughts.
> 
>  CPU1                         CPU2
> 
> xa_store()
>                                    entry = xa_erase()
>                                    zswap_free_entry(entry)
> 
> if (entry->length)
>      ...
> CPU1 is using entry after free.

IIUC, CPU1 is in zswap_store(), CPU2 could either be in zswap_invalidate()
or zswap_load().

For zswap_load(), I think synchronization is done in the core swap code
to ensure we are not doing parallel swapin/swapout at the same entry,
right? In this specific case, I think the folio would be in the
swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
will read the folio and not call zswap_load().

Actually, if we do not prevent parallel swapin/swapout at the same entry,
I suspect we may have problems even outside of zswap. For example, we
may read a partially written swap entry from disk, right? Or does the
block layer synchronize this somehow?

For zswap_invalidate(), the core swap code calls it when the swap entry
is no longer used and before we free it for reuse, so IIUC parallel
swapouts (i.e. zswap_store()) should not be possible here as well.
Chris Li March 7, 2024, 6:45 p.m. UTC | #8
On Wed, Mar 6, 2024 at 6:17 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2024/3/7 05:12, Chris Li wrote:
> > Hi Chengming,
> >
> > On Mon, Mar 4, 2024 at 6:52 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> Hi Chris,
> >>
> >> On 2024/3/5 05:32, Chris Li wrote:
> >>> Very deep RB tree requires rebalance at times. That
> >>> contributes to the zswap fault latencies. Xarray does not
> >>> need to perform tree rebalance. Replacing RB tree to xarray
> >>> can have some small performance gain.
> >>>
> >>> One small difference is that xarray insert might fail with
> >>> ENOMEM, while RB tree insert does not allocate additional
> >>> memory.
> >>>
> >>> The zswap_entry size will reduce a bit due to removing the
> >>> RB node, which has two pointers and a color field. Xarray
> >>> store the pointer in the xarray tree rather than the
> >>> zswap_entry. Every entry has one pointer from the xarray
> >>> tree. Overall, switching to xarray should save some memory,
> >>> if the swap entries are densely packed.
> >>>
> >>> Notice the zswap_rb_search and zswap_rb_insert always
> >>> followed by zswap_rb_erase. Use xa_erase directly. The entry
> >>> erase into zswap_xa_insert as well. That saves one tree
> >>> lookup as well.
> >>>
> >>> Remove zswap_invalidate_entry due to no need to call
> >>> zswap_rb_erase any more. Use zswap_free_entry instead.
> >>>
> >>> The "struct zswap_tree" has been replaced by "struct xarray".
> >>> The tree spin lock has transferred to the xarray lock.
> >>>
> >>> Run the kernel build testing 10 times for each version, averages:
> >>> (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
> >>>
> >>> mm-9a0181a3710eb                    xarray v4
> >>> user       3526.829                            3526.930
> >>> sys        532.754                             526.525
> >>> real       198.748                             198.850
> >>>
> >>> ---
> >>>
> >>>
> >>> Signed-off-by: Chris Li <chrisl@kernel.org>
> >>> ---
> >>> Changes in v4:
> >>> - Remove zswap_xa_search_and_earse, use xa_erase directly.
> >>> - Move charge of objcg after zswap_xa_insert.
> >>> - Avoid erase old entry on insert fail error path.
> >>> - Remove not needed swap_zswap_tree change
> >>> - Link to v3: https://lore.kernel.org/r/20240302-zswap-xarray-v3-1-5900252f2302@kernel.org
> >>>
> >>> Changes in v3:
> >>> - Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
> >>> - Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
> >>> - Fix xa_store error handling for same page fill case.
> >>> - Link to v2: https://lore.kernel.org/r/20240229-zswap-xarray-v2-1-e50284dfcdb1@kernel.org
> >>>
> >>> Changes in v2:
> >>> - Replace struct zswap_tree with struct xarray.
> >>> - Remove zswap_tree spinlock, use xarray lock instead.
> >>> - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> >>> - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> >>> - Link to v1: https://lore.kernel.org/r/20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
> >>> ---
> >>>  mm/zswap.c | 186 ++++++++++++++++++++++++-------------------------------------
> >>>  1 file changed, 72 insertions(+), 114 deletions(-)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index 011e068eb355..4f4a3f452b76 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -20,7 +20,6 @@
> >>>  #include <linux/spinlock.h>
> >>>  #include <linux/types.h>
> >>>  #include <linux/atomic.h>
> >>> -#include <linux/rbtree.h>
> >>>  #include <linux/swap.h>
> >>>  #include <linux/crypto.h>
> >>>  #include <linux/scatterlist.h>
> >>> @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
> >>>  static u64 zswap_reject_alloc_fail;
> >>>  /* Store failed because the entry metadata could not be allocated (rare) */
> >>>  static u64 zswap_reject_kmemcache_fail;
> >>> +/* Store failed because xarray can't insert the entry*/
> >>> +static u64 zswap_reject_xarray_fail;
> >>>
> >>>  /* Shrinker work queue */
> >>>  static struct workqueue_struct *shrink_wq;
> >>> @@ -196,7 +197,6 @@ static struct {
> >>>   * This structure contains the metadata for tracking a single compressed
> >>>   * page within zswap.
> >>>   *
> >>> - * rbnode - links the entry into red-black tree for the appropriate swap type
> >>>   * swpentry - associated swap entry, the offset indexes into the red-black tree
> >>>   * length - the length in bytes of the compressed page data.  Needed during
> >>>   *          decompression. For a same value filled page length is 0, and both
> >>> @@ -208,7 +208,6 @@ static struct {
> >>>   * lru - handle to the pool's lru used to evict pages.
> >>>   */
> >>>  struct zswap_entry {
> >>> -     struct rb_node rbnode;
> >>>       swp_entry_t swpentry;
> >>>       unsigned int length;
> >>>       struct zswap_pool *pool;
> >>> @@ -220,12 +219,7 @@ struct zswap_entry {
> >>>       struct list_head lru;
> >>>  };
> >>>
> >>> -struct zswap_tree {
> >>> -     struct rb_root rbroot;
> >>> -     spinlock_t lock;
> >>> -};
> >>> -
> >>> -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> >>> +static struct xarray *zswap_trees[MAX_SWAPFILES];
> >>>  static unsigned int nr_zswap_trees[MAX_SWAPFILES];
> >>>
> >>>  /* RCU-protected iteration */
> >>> @@ -253,7 +247,7 @@ static bool zswap_has_pool;
> >>>  * helpers and fwd declarations
> >>>  **********************************/
> >>>
> >>> -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> >>> +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >>>  {
> >>>       return &zswap_trees[swp_type(swp)][swp_offset(swp)
> >>>               >> SWAP_ADDRESS_SPACE_SHIFT];
> >>> @@ -805,60 +799,33 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >>>  }
> >>>
> >>>  /*********************************
> >>> -* rbtree functions
> >>> +* xarray functions
> >>>  **********************************/
> >>> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> >>> -{
> >>> -     struct rb_node *node = root->rb_node;
> >>> -     struct zswap_entry *entry;
> >>> -     pgoff_t entry_offset;
> >>> -
> >>> -     while (node) {
> >>> -             entry = rb_entry(node, struct zswap_entry, rbnode);
> >>> -             entry_offset = swp_offset(entry->swpentry);
> >>> -             if (entry_offset > offset)
> >>> -                     node = node->rb_left;
> >>> -             else if (entry_offset < offset)
> >>> -                     node = node->rb_right;
> >>> -             else
> >>> -                     return entry;
> >>> -     }
> >>> -     return NULL;
> >>> -}
> >>>
> >>>  /*
> >>>   * In the case that a entry with the same offset is found, a pointer to
> >>> - * the existing entry is stored in dupentry and the function returns -EEXIST
> >>> + * the existing entry is stored in old and erased from the tree.
> >>> + * Function return error on insert.
> >>>   */
> >>> -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> >>> -                     struct zswap_entry **dupentry)
> >>> +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> >>> +                        struct zswap_entry **old)
> >>>  {
> >>> -     struct rb_node **link = &root->rb_node, *parent = NULL;
> >>> -     struct zswap_entry *myentry;
> >>> -     pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> >>> -
> >>> -     while (*link) {
> >>> -             parent = *link;
> >>> -             myentry = rb_entry(parent, struct zswap_entry, rbnode);
> >>> -             myentry_offset = swp_offset(myentry->swpentry);
> >>> -             if (myentry_offset > entry_offset)
> >>> -                     link = &(*link)->rb_left;
> >>> -             else if (myentry_offset < entry_offset)
> >>> -                     link = &(*link)->rb_right;
> >>> -             else {
> >>> -                     *dupentry = myentry;
> >>> -                     return -EEXIST;
> >>> -             }
> >>> -     }
> >>> -     rb_link_node(&entry->rbnode, parent, link);
> >>> -     rb_insert_color(&entry->rbnode, root);
> >>> -     return 0;
> >>> -}
> >>> +     int err;
> >>> +     struct zswap_entry *e;
> >>> +     pgoff_t offset = swp_offset(entry->swpentry);
> >>>
> >>> -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> >>> -{
> >>> -     rb_erase(&entry->rbnode, root);
> >>> -     RB_CLEAR_NODE(&entry->rbnode);
> >>> +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> >>> +     err = xa_err(e);
> >>> +
> >>> +     if (err) {
> >>> +             e = xa_erase(tree, offset);
> >>> +             if (err == -ENOMEM)
> >>> +                     zswap_reject_alloc_fail++;
> >>> +             else
> >>> +                     zswap_reject_xarray_fail++;
> >>> +     }
> >>> +     *old = e;
> >>> +     return err;
> >>>  }
> >>>
> >>>  /*********************************
> >>> @@ -872,7 +839,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> >>>       entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> >>>       if (!entry)
> >>>               return NULL;
> >>> -     RB_CLEAR_NODE(&entry->rbnode);
> >>>       return entry;
> >>>  }
> >>>
> >>> @@ -914,17 +880,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> >>>       zswap_update_total_size();
> >>>  }
> >>>
> >>> -/*
> >>> - * The caller hold the tree lock and search the entry from the tree,
> >>> - * so it must be on the tree, remove it from the tree and free it.
> >>> - */
> >>> -static void zswap_invalidate_entry(struct zswap_tree *tree,
> >>> -                                struct zswap_entry *entry)
> >>> -{
> >>> -     zswap_rb_erase(&tree->rbroot, entry);
> >>> -     zswap_entry_free(entry);
> >>> -}
> >>> -
> >>>  /*********************************
> >>>  * compressed storage functions
> >>>  **********************************/
> >>> @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> >>>  static int zswap_writeback_entry(struct zswap_entry *entry,
> >>>                                swp_entry_t swpentry)
> >>>  {
> >>> -     struct zswap_tree *tree;
> >>> +     struct xarray *tree;
> >>> +     pgoff_t offset = swp_offset(swpentry);
> >>> +     struct zswap_entry *e;
> >>>       struct folio *folio;
> >>>       struct mempolicy *mpol;
> >>>       bool folio_was_allocated;
> >>> @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>>        * be dereferenced.
> >>>        */
> >>>       tree = swap_zswap_tree(swpentry);
> >>> -     spin_lock(&tree->lock);
> >>> -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> >>> -             spin_unlock(&tree->lock);
> >>> +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> >>> +     if (e != entry) {
> >>
> >> Maybe "if (xa_cmpxchg() != entry)" look better, so "e" variable can be removed,
> >> since we don't use it.
> >
> > I thought about that. The main reason is that "if (xa_cmpxchg(tree, offset,
> > entry, NULL, GFP_KERNEL) != entry)" is a long expression. Having
> > != entry at the end of the expression makes it harder to read. Having
> > the temporary variable IMHO helps reading the comparison.
> >
> > Not sure I understand the motivation to save a temporary variable. The
> > compiler always generates temporary variables internally. Having one
> > here does not affect the compiling result at all. It is just there to
> > help reading. If you still think it has the reverse effect on reading
> > I can of course remove it. It generates the same code anyway.
> You are right, just a minor style suggestion, not much difference if you keep it.

Thanks for the understanding and flexibility. I agree this is more of a
personal preference kind of thing; it doesn't have a big enough impact on
readability to force it one way or the other.

Chris
Chris Li March 7, 2024, 7:39 p.m. UTC | #9
On Thu, Mar 7, 2024 at 1:07 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > > -{
> > > > -     rb_erase(&entry->rbnode, root);
> > > > -     RB_CLEAR_NODE(&entry->rbnode);
> > > > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > > > +     err = xa_err(e);
> > > > +
> > > > +     if (err) {
> > > > +             e = xa_erase(tree, offset);
> > > > +             if (err == -ENOMEM)
> > > > +                     zswap_reject_alloc_fail++;
> > > > +             else
> > > > +                     zswap_reject_xarray_fail++;
> > >
> > > I think this is too complicated, and as Chengming pointed out, I believe
> > > we can use xa_store() directly in zswap_store().
> >
> > Sure.
> >
> > > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > > there any reasons why the store here can fail other than -ENOMEM? The
> > > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > > it seems like this is only possible if xa_is_internal() is true (which
> > > means we are not passing in a properly aligned pointer IIUC).
> >
> > Because the xa_store documentation says it can return two error codes. I
> > see zswap tries to classify the errors it hits; that is why I added
> > zswap_reject_xarray_fail.
>
> Right, but I think we should not get -EINVAL in this case. I think it
> would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
> case?

In the context of the existing zswap code, zswap keeps track of
counters for the different failure reasons, so that we can read the
failure counters and have a good clue about where in the code the
failure originates.

If the document said it can return -EINVAL, I wouldn't just look at
the current implementation and assume it does not generate -EINVAL.
What if the zswap entry allocation has a bug that returns some strange
alignment that makes the xarray unhappy? There are a lot of assumptions
about which error codes can or cannot show up. Just to be consistent with
the other parts of the zswap code, I give it a counter to indicate it
is the xarray causing the failure. That is the safe thing to
do. It is a good thing this kind of error branch never hits. But if it
does, it gives better information for debugging. It is just following
the same spirit of the rest of the zswap error handling. What do you
say?

>
> [..]
> > > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> > > >  static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >                                swp_entry_t swpentry)
> > > >  {
> > > > -     struct zswap_tree *tree;
> > > > +     struct xarray *tree;
> > > > +     pgoff_t offset = swp_offset(swpentry);
> > > > +     struct zswap_entry *e;
> > > >       struct folio *folio;
> > > >       struct mempolicy *mpol;
> > > >       bool folio_was_allocated;
> > > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >        * be dereferenced.
> > > >        */
> > > >       tree = swap_zswap_tree(swpentry);
> > > > -     spin_lock(&tree->lock);
> > > > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > > > -             spin_unlock(&tree->lock);
> > > > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > > +     if (e != entry) {
> > >
> > > I think we can avoid adding 'e' and 'offset' local variables here and
> > > just do everything in the if condition. If you want to avoid the line
> > > break, then introducing 'offset' is fine, but I don't see any value from
> > > 'e'.
> >
> > As I said in my other email. I don't think having this type of local
> > variable affects the compiler negatively. The compiler generally uses
> > their own local variable to track the expression anyway. So I am not
> > sure about the motivation to remove local variables alone, if it helps
> > the reading. I feel the line  "if (xa_cmpxchg(tree, offset, entry,
> > NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> > condition. That is just me. Not a big deal.
>
> I just think 'e' is not providing any readability improvements. If
> anything, people need to pay closer attention to figure out 'e' is only
> a temp variable and 'entry' is the real deal.

Well, 'e' is assigned in the previous line and used in the very next
line (only once).
I can hardly see this affecting anyone's understanding of what it is
doing. I can also come up with a reason (for argument's sake) that this
maps out to the machine code better.  In the machine code it generates,
you see xa_cmpxchg first, then the "if" branch. You can also ask the
debugger what the current value of "e" is, and which register it is
currently mapped to if it is still in the frame context. But that is
not important.

There are also C macro symbols that are defined and used just once.
Are we following a rule to remove all symbols that are only defined
and used once?

Anyway, the bottom line is that it generates the exact same machine
code. Readability is subjective, there is some personal preference in
it. IMHO the difference in readability is so trivial here that it does
not justify having to flip one way or the other in this case.

>
> I vote for:
>         if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL))

I heard you.

>
> [..]
> > > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> > > >  {
> > > >       swp_entry_t swp = folio->swap;
> > > >       pgoff_t offset = swp_offset(swp);
> > > > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > > > -     struct zswap_entry *entry, *dupentry;
> > > > +     struct xarray *tree = swap_zswap_tree(swp);
> > > > +     struct zswap_entry *entry, *old;
> > > >       struct obj_cgroup *objcg = NULL;
> > > >       struct mem_cgroup *memcg = NULL;
> > > > +     int err;
> > > > +     bool old_erased = false;
> > > >
> > > >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> > > >                       kunmap_local(src);
> > > >                       entry->length = 0;
> > > >                       entry->value = value;
> > > > +                     entry->pool = NULL;
> > >
> > > Why do we need to initialize the pool here? Is this a bug fix for an
> > > existing problem or just keeping things clean? Either way I think it
> > > should be done separately, unless it is related to a change in this
> > > patch.
> >
> > I noticed the entry->pool is left uninitialized. I think it should
> > be cleaned up. It is a cleanup and does not need to happen in this
> > patch. I can do that as a separate patch.
>
> Yes please.
>
> [..]
> > >
> > > >       /*
> > > >        * The folio may have been dirtied again, invalidate the
> > > >        * possibly stale entry before inserting the new entry.
> > > >        */
> > > > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > > -             zswap_invalidate_entry(tree, dupentry);
> > > > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > > +     err = zswap_xa_insert(tree, entry, &old);
> > > > +     if (old)
> > > > +             zswap_entry_free(old);
> > > > +     if (err) {
> > > > +             old_erased = true;
> > >
> > > I think this can be made simpler if we open code xa_store() here,
> > > especially that we already have cleanup code below under 'check_old'
> > > that removes the existing old entry. So zswap_xa_insert() replicates
> > > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > > cleanup below. It seems like it would be much more straightforward with
> > > open-coding xa_store() here and relying on the existing cleanup for the
> > > old entry.  Also, if we initialize 'old' to NULL, we can use its value
> > > to figure out whether any cleanup is needed under 'check_old' or not.
> >
> > I think that is very similar to what Chengming was suggesting.
> >
> > >
> > > Taking a step back, I think we can further simplify this. What if we
> > > move the tree insertion to right after we allocate the zswap entry? In
> > > this case, if the tree insertion fails, we don't need to decrement the
> > > same filled counter. If the tree insertion succeeds and then something
> > > else fails, the existing cleanup code under 'check_old' will already
> > > clean up the tree insertion for us.
> >
> > That will create complications: if the zswap compression fails or the
> > compression ratio is too poor, you will have to remove the entry from the
> > tree as cleanup. You end up with both xa_store() and xa_erase(), where the
> > current code just does one xa_erase() on compression failure.
>
> Not really. If xa_store() fails because of -ENOMEM, then I think by
> definition we do not need xa_erase() as there shouldn't be any stale
> entries. I also think -ENOMEM should be the only valid errno from
> xa_store() in this context. So we can avoid the check_old code if
> xa_store() is called (whether it fails or succeeds) IIUC.
>
> I prefer calling xa_store() early and avoiding the extra 'insert_failed'
> cleanup code, especially since, unlike the other cleanup code, it has its
> own branching based on entry->length. I am also planning a cleanup for
> zswap_store() to split the code better for the same_filled case and
> avoid some unnecessary checks and failures, so it would be useful to
> keep the common code path together.

I would like to point out that, in real-world usage, we will likely
see more pages fail due to compression than due to xa_store failure.
There is a non-zero percentage of pages that are not compressible.
-ENOMEM from xa_store is very rare, much rarer than a non-compressible
page. So in real-world runs, insert-first will do one more xa_store
than compression-first. I would argue compression-first will perform
slightly better, because xa_store needs to take the xarray spin lock;
avoiding the extra xa_store avoids taking the lock altogether. There
are simply more compression failures than xa_store failures.
It also feels slightly strange to insert the entry into the xarray
while the entry is not yet ready to use. I don't think that is
triggerable as a bug right now, but it makes me feel a little strange.
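
To make the comparison concrete, for an incompressible page the two
orderings roughly do this (sketch only, with the compression step
abbreviated as a comment):

	/* compress-first (this patch): compression fails */
	/*   -> goto put_pool/reject; the new entry never touched the tree, */
	/*      only the stale-entry cleanup under check_old runs:          */
	old = xa_erase(tree, offset);		/* one xarray lock acquisition */

	/* insert-first (suggested): compression fails after the insert */
	old = xa_store(tree, offset, entry, GFP_KERNEL);	/* first lock acquisition */
	/*   ... compression fails ...                                      */
	xa_erase(tree, offset);			/* second lock acquisition to undo it */
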

>
> >
> > >
> > > If this works, we don't need to add extra cleanup code or move any code
> > > around. Something like:
> >
> > Due to the extra xa_insert() on compression failure, I think
> > Chengming's or your earlier suggestion is better.
> >
> > BTW, while you are here, can you confirm this race discussed in
> > earlier email can't happen? Chengming convinced me this shouldn't
> > happen. Like to hear your thoughts.
> >
> >  CPU1                         CPU2
> >
> > xa_store()
> >                                    entry = xa_erase()
> >                                    zswap_free_entry(entry)
> >
> > if (entry->length)
> >      ...
> > CPU1 is using entry after free.
>
> IIUC, CPU1 is in zswap_store(), CPU2 could either be in zswap_invalidate()
> or zswap_load().
>
> For zswap_load(), I think synchronization is done in the core swap code
> to ensure we are not doing parallel swapin/swapout at the same entry,
> right? In this specific case, I think the folio would be in the
> swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
> will read the folio and not call zswap_load().
>
> Actually, if we do not prevent parallel swapin/swapout at the same entry,
> I suspect we may have problems even outside of zswap. For example, we
> may read a partially written swap entry from disk, right? Or does the
> block layer synchronize this somehow?

I don't think the block layer is doing the synchronization here. Zswap
does not go to the block layer at all.
I think after looking it up from the swap cache, the folio will be
locked, which prevents the other paths. I would like to hear others'
thoughts on that; this needs more auditing.

>
> For zswap_invalidate(), the core swap code calls it when the swap entry
> is no longer used and before we free it for reuse, so IIUC parallel
> swapouts (i.e. zswap_store()) should not be possible here as well.

There is also the writeback path, but I don't think it is triggerable
because the entry has not been inserted into the LRU list yet.

Chris
Yosry Ahmed March 7, 2024, 9:28 p.m. UTC | #10
[..]
> > > > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > > > there any reasons why the store here can fail other than -ENOMEM? The
> > > > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > > > it seems like this is only possible if xa_is_internal() is true (which
> > > > means we are not passing in a properly aligned pointer IIUC).
> > >
> > > Because the xa_store documentation says it can return two error codes. I
> > > see zswap tries to classify the errors it hits; that is why I added
> > > zswap_reject_xarray_fail.
> >
> > Right, but I think we should not get -EINVAL in this case. I think it
> > would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
> > case?
> 
> In the context of the existing zswap code, zswap keeps track of
> counters for the different failure reasons, so that we can read the
> failure counters and have a good clue about where in the code the
> failure originates.

zswap keeps track of different *valid* failure reasons. I think in this
case this would be a bug, which warrants a warning rather than a counter
(the same reasoning behind removing the duplicate entry counter).

> 
> If the document said it can return -EINVAL, I wouldn't just look at
> the current implementation and assume it does not generate -EINVAL.

My assumption is that a properly-aligned pointer will always be valid to
store in the xarray, so the only valid return error is -ENOMEM, but
perhaps Matthew will correct me here.

Maybe we can have an xa_is_valid() helper to check this separately and
guarantee no -EINVAL returns, but I *think* the fact that it is a
slab-allocated pointer may be enough.

> What if the zswap entry allocation has a bug that returns some strange
> alignment that makes the xarray unhappy? There are a lot of assumptions
> about which error codes can or cannot show up. Just to be consistent with
> the other parts of the zswap code, I give it a counter to indicate it
> is the xarray causing the failure. That is the safe thing to
> do. It is a good thing this kind of error branch never hits. But if it
> does, it gives better information for debugging. It is just following
> the same spirit of the rest of the zswap error handling. What do you
> say?

This is exactly my point: these counters are not for bugs. We should
have a warning if there is a bug.
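
For example, something along these lines (just a sketch of the direction
I mean, assuming -ENOMEM is the only errno we expect from xa_store()
here):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	err = xa_err(old);
	if (err) {
		WARN_ONCE(err != -ENOMEM, "zswap: unexpected xa_store() error %d\n", err);
		zswap_reject_alloc_fail++;
		goto insert_failed;
	}
	if (old)
		zswap_entry_free(old);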

[..]
> > > > > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > > > +     if (e != entry) {
> > > >
> > > > I think we can avoid adding 'e' and 'offset' local variables here and
> > > > just do everything in the if condition. If you want to avoid the line
> > > > break, then introducing 'offset' is fine, but I don't see any value from
> > > > 'e'.
> > >
> > > As I said in my other email. I don't think having this type of local
> > > variable affects the compiler negatively. The compiler generally uses
> > > their own local variable to track the expression anyway. So I am not
> > > sure about the motivation to remove local variables alone, if it helps
> > > the reading. I feel the line  "if (xa_cmpxchg(tree, offset, entry,
> > > NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> > > condition. That is just me. Not a big deal.
> >
> > I just think 'e' is not providing any readability improvements. If
> > anything, people need to pay closer attention to figure out 'e' is only
> > a temp variable and 'entry' is the real deal.
> 
> Well, 'e' is assigned in the previous line and used in the very next
> line (only once).
> I can hardly see this affecting anyone's understanding of what it is
> doing. I can also come up with a reason (for argument's sake) that this
> maps out to the machine code better.  In the machine code it generates,
> you see xa_cmpxchg first, then the "if" branch. You can also ask the
> debugger what the current value of "e" is, and which register it is
> currently mapped to if it is still in the frame context. But that is
> not important.
>
> There are also C macro symbols that are defined and used just once.
> Are we following a rule to remove all symbols that are only defined
> and used once?
> 
> Anyway, the bottom line is that it generates the exact same machine
> code. Readability is subjective, there is some personal preference in
> it. IMHO the difference in readability is so trivial here that it does
> not justify having to flip one way or the other in this case.

My argument was solely based on readability. I agree there is no
difference in machine code, and that this is personal preference, which
is why I stated my vote below (and I believe Chengming had the same
preference). This is not a blocker either way :)

[..]
> > > > >       /*
> > > > >        * The folio may have been dirtied again, invalidate the
> > > > >        * possibly stale entry before inserting the new entry.
> > > > >        */
> > > > > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > > > -             zswap_invalidate_entry(tree, dupentry);
> > > > > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > > > +     err = zswap_xa_insert(tree, entry, &old);
> > > > > +     if (old)
> > > > > +             zswap_entry_free(old);
> > > > > +     if (err) {
> > > > > +             old_erased = true;
> > > >
> > > > I think this can be made simpler if we open code xa_store() here,
> > > > especially that we already have cleanup code below under 'check_old'
> > > > that removes the existing old entry. So zswap_xa_insert() replicates
> > > > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > > > cleanup below. It seems like it would be much more straightforward with
> > > > open-coding xa_store() here and relying on the existing cleanup for the
> > > > old entry.  Also, if we initialize 'old' to NULL, we can use its value
> > > > to figure out whether any cleanup is needed under 'check_old' or not.
> > >
> > > I think that is very similar to what Chengming was suggesting.
> > >
> > > >
> > > > Taking a step back, I think we can further simplify this. What if we
> > > > move the tree insertion to right after we allocate the zswap entry? In
> > > > this case, if the tree insertion fails, we don't need to decrement the
> > > > same filled counter. If the tree insertion succeeds and then something
> > > > else fails, the existing cleanup code under 'check_old' will already
> > > > clean up the tree insertion for us.
> > >
> > > That will create complications: if the zswap compression fails or the
> > > compression ratio is too poor, you will have to remove the entry from the
> > > tree as cleanup. You end up with both xa_store() and xa_erase(), where the
> > > current code just does one xa_erase() on compression failure.
> >
> > Not really. If xa_store() fails because of -ENOMEM, then I think by
> > definition we do not need xa_erase() as there shouldn't be any stale
> > entries. I also think -ENOMEM should be the only valid errno from
> > xa_store() in this context. So we can avoid the check_old code if
> > xa_store() is called (whether it fails or succeeds) IIUC.
> >
> > I prefer calling xa_store() early and avoiding the extra 'insert_failed'
> > cleanup code, especially since, unlike the other cleanup code, it has its
> > own branching based on entry->length. I am also planning a cleanup for
> > zswap_store() to split the code better for the same_filled case and
> > avoid some unnecessary checks and failures, so it would be useful to
> > keep the common code path together.
> 
> I would like to point out that, in real-world usage, we will likely
> see more pages fail due to compression than due to xa_store failure.
> There is a non-zero percentage of pages that are not compressible.
> -ENOMEM from xa_store is very rare, much rarer than a non-compressible
> page. So in real-world runs, insert-first will do one more xa_store
> than compression-first. I would argue compression-first will perform
> slightly better, because xa_store needs to take the xarray spin lock;
> avoiding the extra xa_store avoids taking the lock altogether. There
> are simply more compression failures than xa_store failures.
> It also feels slightly strange to insert the entry into the xarray
> while the entry is not yet ready to use. I don't think that is
> triggerable as a bug right now, but it makes me feel a little strange.

Good point. I don't think the performance difference will be
significant, but you are right. Let's keep the insertion here and we can
revisit this later when we clean up the same_filled code.

nit: please rename the cleanup label from 'insert_failed' to
'store_failed' :)

[..]
> > > BTW, while you are here, can you confirm this race discussed in
> > > earlier email can't happen? Chengming convinced me this shouldn't
> > > happen. Like to hear your thoughts.
> > >
> > >  CPU1                         CPU2
> > >
> > > xa_store()
> > >                                    entry = xa_erase()
> > >                                    zswap_free_entry(entry)
> > >
> > > if (entry->length)
> > >      ...
> > > CPU1 is using entry after free.
> >
> > IIUC, CPU1 is in zswap_store(), CPU2 could either be in zswap_invalidate()
> > or zswap_load().
> >
> > For zswap_load(), I think synchronization is done in the core swap code
> > to ensure we are not doing parallel swapin/swapout at the same entry,
> > right? In this specific case, I think the folio would be in the
> > swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
> > will read the folio and not call zswap_load().
> >
> > Actually, if we do not prevent parallel swapin/swapout at the same entry,
> > I suspect we may have problems even outside of zswap. For example, we
> > may read a partially written swap entry from disk, right? Or does the
> > block layer synchronize this somehow?
> 
> I don't think the block layer is doing the synchronization here. Zswap
> does not go to the block layer at all.
> I think after looking it up from the swap cache, the folio will be
> locked, which prevents the other paths. I would like to hear others'
> thoughts on that; this needs more auditing.

I did not mean that the block layer is doing any synchronization here.
It comes from the swap cache as I outlined. I think it's safe.

I was just wondering what happens with parallel swapin/swapout if there
is no swap cache synchronization for normal swapfiles. I was wondering
if the block layer will handle the synchronization, in which case zswap
may have different requirements than normal swapfiles. Just something
that seemed good to know :)

> 
> >
> > For zswap_invalidate(), the core swap code calls it when the swap entry
> > is no longer used and before we free it for reuse, so IIUC parallel
> > swapouts (i.e. zswap_store()) should not be possible here as well.
> 
> There is also the writeback path, but I don't think it is triggerable
> because the entry has not been inserted into the LRU list yet.

Ah yes, I looked at the callsites for xa_erase(); there is an xa_cmpxchg()
there. Actually I think the swap cache synchronizes here too.
zswap_writeback_entry() is like a swapin, so this case is like two
parallel swapins. Both of them should call __read_swap_cache_async(),
and only one of them will actually do the swapin.

If zswap_writeback_entry() wins the race, zswap_load() will not be
called in the first place; the swap cache entry will be used. If swapin
(i.e. zswap_load()) wins the race, zswap_writeback_entry() should skip.

Right?
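
In terms of the patch below, the exclusion boils down to (folio handling
elided; both snippets are lifted from the patch):

	/* zswap_writeback_entry(): only proceed if we still own the slot */
	if (xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) != entry)
		return -ENOMEM;		/* load/invalidate already took the entry, skip */

	/* zswap_load()/zswap_invalidate(): xa_erase() hands the entry to
	 * exactly one caller, so only one side ever frees it */
	entry = xa_erase(tree, offset);
	if (!entry)
		return false;
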
diff mbox series

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 011e068eb355..4f4a3f452b76 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -20,7 +20,6 @@ 
 #include <linux/spinlock.h>
 #include <linux/types.h>
 #include <linux/atomic.h>
-#include <linux/rbtree.h>
 #include <linux/swap.h>
 #include <linux/crypto.h>
 #include <linux/scatterlist.h>
@@ -71,6 +70,8 @@  static u64 zswap_reject_compress_poor;
 static u64 zswap_reject_alloc_fail;
 /* Store failed because the entry metadata could not be allocated (rare) */
 static u64 zswap_reject_kmemcache_fail;
+/* Store failed because xarray can't insert the entry */
+static u64 zswap_reject_xarray_fail;
 
 /* Shrinker work queue */
 static struct workqueue_struct *shrink_wq;
@@ -196,7 +197,6 @@  static struct {
  * This structure contains the metadata for tracking a single compressed
  * page within zswap.
  *
- * rbnode - links the entry into red-black tree for the appropriate swap type
  * swpentry - associated swap entry, the offset indexes into the red-black tree
  * length - the length in bytes of the compressed page data.  Needed during
  *          decompression. For a same value filled page length is 0, and both
@@ -208,7 +208,6 @@  static struct {
  * lru - handle to the pool's lru used to evict pages.
  */
 struct zswap_entry {
-	struct rb_node rbnode;
 	swp_entry_t swpentry;
 	unsigned int length;
 	struct zswap_pool *pool;
@@ -220,12 +219,7 @@  struct zswap_entry {
 	struct list_head lru;
 };
 
-struct zswap_tree {
-	struct rb_root rbroot;
-	spinlock_t lock;
-};
-
-static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static struct xarray *zswap_trees[MAX_SWAPFILES];
 static unsigned int nr_zswap_trees[MAX_SWAPFILES];
 
 /* RCU-protected iteration */
@@ -253,7 +247,7 @@  static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
-static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
+static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 {
 	return &zswap_trees[swp_type(swp)][swp_offset(swp)
 		>> SWAP_ADDRESS_SPACE_SHIFT];
@@ -805,60 +799,33 @@  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 }
 
 /*********************************
-* rbtree functions
+* xarray functions
 **********************************/
-static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
-{
-	struct rb_node *node = root->rb_node;
-	struct zswap_entry *entry;
-	pgoff_t entry_offset;
-
-	while (node) {
-		entry = rb_entry(node, struct zswap_entry, rbnode);
-		entry_offset = swp_offset(entry->swpentry);
-		if (entry_offset > offset)
-			node = node->rb_left;
-		else if (entry_offset < offset)
-			node = node->rb_right;
-		else
-			return entry;
-	}
-	return NULL;
-}
 
 /*
  * In the case that a entry with the same offset is found, a pointer to
- * the existing entry is stored in dupentry and the function returns -EEXIST
+ * the existing entry is stored in old and erased from the tree.
+ * The function returns an error if the insert fails.
  */
-static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
-			struct zswap_entry **dupentry)
+static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
+			   struct zswap_entry **old)
 {
-	struct rb_node **link = &root->rb_node, *parent = NULL;
-	struct zswap_entry *myentry;
-	pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
-
-	while (*link) {
-		parent = *link;
-		myentry = rb_entry(parent, struct zswap_entry, rbnode);
-		myentry_offset = swp_offset(myentry->swpentry);
-		if (myentry_offset > entry_offset)
-			link = &(*link)->rb_left;
-		else if (myentry_offset < entry_offset)
-			link = &(*link)->rb_right;
-		else {
-			*dupentry = myentry;
-			return -EEXIST;
-		}
-	}
-	rb_link_node(&entry->rbnode, parent, link);
-	rb_insert_color(&entry->rbnode, root);
-	return 0;
-}
+	int err;
+	struct zswap_entry *e;
+	pgoff_t offset = swp_offset(entry->swpentry);
 
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
-{
-	rb_erase(&entry->rbnode, root);
-	RB_CLEAR_NODE(&entry->rbnode);
+	e = xa_store(tree, offset, entry, GFP_KERNEL);
+	err = xa_err(e);
+
+	if (err) {
+		e = xa_erase(tree, offset);
+		if (err == -ENOMEM)
+			zswap_reject_alloc_fail++;
+		else
+			zswap_reject_xarray_fail++;
+	}
+	*old = e;
+	return err;
 }
 
 /*********************************
@@ -872,7 +839,6 @@  static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
 	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
 	if (!entry)
 		return NULL;
-	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
 
@@ -914,17 +880,6 @@  static void zswap_entry_free(struct zswap_entry *entry)
 	zswap_update_total_size();
 }
 
-/*
- * The caller hold the tree lock and search the entry from the tree,
- * so it must be on the tree, remove it from the tree and free it.
- */
-static void zswap_invalidate_entry(struct zswap_tree *tree,
-				   struct zswap_entry *entry)
-{
-	zswap_rb_erase(&tree->rbroot, entry);
-	zswap_entry_free(entry);
-}
-
 /*********************************
 * compressed storage functions
 **********************************/
@@ -1113,7 +1068,9 @@  static void zswap_decompress(struct zswap_entry *entry, struct page *page)
 static int zswap_writeback_entry(struct zswap_entry *entry,
 				 swp_entry_t swpentry)
 {
-	struct zswap_tree *tree;
+	struct xarray *tree;
+	pgoff_t offset = swp_offset(swpentry);
+	struct zswap_entry *e;
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
@@ -1150,19 +1107,14 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
-	spin_lock(&tree->lock);
-	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
-		spin_unlock(&tree->lock);
+	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
+	if (e != entry) {
 		delete_from_swap_cache(folio);
 		folio_unlock(folio);
 		folio_put(folio);
 		return -ENOMEM;
 	}
 
-	/* Safe to deref entry after the entry is verified above. */
-	zswap_rb_erase(&tree->rbroot, entry);
-	spin_unlock(&tree->lock);
-
 	zswap_decompress(entry, &folio->page);
 
 	count_vm_event(ZSWPWB);
@@ -1471,10 +1423,12 @@  bool zswap_store(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
-	struct zswap_tree *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *dupentry;
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry, *old;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
+	int err;
+	bool old_erased = false;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1526,6 +1480,7 @@  bool zswap_store(struct folio *folio)
 			kunmap_local(src);
 			entry->length = 0;
 			entry->value = value;
+			entry->pool = NULL;
 			atomic_inc(&zswap_same_filled_pages);
 			goto insert_entry;
 		}
@@ -1555,28 +1510,31 @@  bool zswap_store(struct folio *folio)
 insert_entry:
 	entry->swpentry = swp;
 	entry->objcg = objcg;
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		/* Account before objcg ref is moved to tree */
-		count_objcg_event(objcg, ZSWPOUT);
-	}
 
 	/* map */
-	spin_lock(&tree->lock);
 	/*
 	 * The folio may have been dirtied again, invalidate the
 	 * possibly stale entry before inserting the new entry.
 	 */
-	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
-		zswap_invalidate_entry(tree, dupentry);
-		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
+	err = zswap_xa_insert(tree, entry, &old);
+	if (old)
+		zswap_entry_free(old);
+	if (err) {
+		old_erased = true;
+		goto insert_failed;
 	}
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, entry->length);
+		/* Account before objcg ref is moved to tree */
+		count_objcg_event(objcg, ZSWPOUT);
+	}
+
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap.list_lru, entry);
 		atomic_inc(&zswap.nr_stored);
 	}
-	spin_unlock(&tree->lock);
 
 	/* update stats */
 	atomic_inc(&zswap_stored_pages);
@@ -1585,6 +1543,12 @@  bool zswap_store(struct folio *folio)
 
 	return true;
 
+insert_failed:
+	if (!entry->length) {
+		atomic_dec(&zswap_same_filled_pages);
+		goto freepage;
+	}
+	zpool_free(zswap_find_zpool(entry), entry->handle);
 put_pool:
 	zswap_pool_put(entry->pool);
 freepage:
@@ -1592,17 +1556,19 @@  bool zswap_store(struct folio *folio)
 reject:
 	if (objcg)
 		obj_cgroup_put(objcg);
+
+	if (old_erased)
+		goto failed;
 check_old:
 	/*
 	 * If the zswap store fails or zswap is disabled, we must invalidate the
 	 * possibly stale entry which was previously stored at this offset.
 	 * Otherwise, writeback could overwrite the new data in the swapfile.
 	 */
-	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = xa_erase(tree, offset);
 	if (entry)
-		zswap_invalidate_entry(tree, entry);
-	spin_unlock(&tree->lock);
+		zswap_entry_free(entry);
+failed:
 	return false;
 
 shrink:
@@ -1615,20 +1581,15 @@  bool zswap_load(struct folio *folio)
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct page *page = &folio->page;
-	struct zswap_tree *tree = swap_zswap_tree(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 	u8 *dst;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
-	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
-	if (!entry) {
-		spin_unlock(&tree->lock);
+	entry = xa_erase(tree, offset);
+	if (!entry)
 		return false;
-	}
-	zswap_rb_erase(&tree->rbroot, entry);
-	spin_unlock(&tree->lock);
 
 	if (entry->length)
 		zswap_decompress(entry, page);
@@ -1652,19 +1613,17 @@  bool zswap_load(struct folio *folio)
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-	struct zswap_tree *tree = swap_zswap_tree(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
-	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = xa_erase(tree, offset);
 	if (entry)
-		zswap_invalidate_entry(tree, entry);
-	spin_unlock(&tree->lock);
+		zswap_entry_free(entry);
 }
 
 int zswap_swapon(int type, unsigned long nr_pages)
 {
-	struct zswap_tree *trees, *tree;
+	struct xarray *trees, *tree;
 	unsigned int nr, i;
 
 	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
@@ -1674,11 +1633,8 @@  int zswap_swapon(int type, unsigned long nr_pages)
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < nr; i++) {
-		tree = trees + i;
-		tree->rbroot = RB_ROOT;
-		spin_lock_init(&tree->lock);
-	}
+	for (i = 0; i < nr; i++)
+		xa_init(trees + i);
 
 	nr_zswap_trees[type] = nr;
 	zswap_trees[type] = trees;
@@ -1687,7 +1643,7 @@  int zswap_swapon(int type, unsigned long nr_pages)
 
 void zswap_swapoff(int type)
 {
-	struct zswap_tree *trees = zswap_trees[type];
+	struct xarray *trees = zswap_trees[type];
 	unsigned int i;
 
 	if (!trees)
@@ -1695,7 +1651,7 @@  void zswap_swapoff(int type)
 
 	/* try_to_unuse() invalidated all the entries already */
 	for (i = 0; i < nr_zswap_trees[type]; i++)
-		WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
+		WARN_ON_ONCE(!xa_empty(trees + i));
 
 	kvfree(trees);
 	nr_zswap_trees[type] = 0;
@@ -1727,6 +1683,8 @@  static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
 	debugfs_create_u64("reject_compress_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
+	debugfs_create_u64("reject_xarray_fail", 0444,
+			   zswap_debugfs_root, &zswap_reject_xarray_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,