[RFC,5/6] zsmalloc: introduce handle mapping API

Message ID 20250127080254.1302026-6-senozhatsky@chromium.org (mailing list archive)
State New
Series zsmalloc: make zsmalloc preemptible

Commit Message

Sergey Senozhatsky Jan. 27, 2025, 7:59 a.m. UTC
Introduce new API to map/unmap zsmalloc handle/object.  The key
difference is that this API does not impose atomicity restrictions
on its users, unlike zs_map_object() which returns with page-faults
and preemption disabled - handle mapping API does not need a per-CPU
vm-area because the users are required to provide an aux buffer for
objects that span several physical pages.

Keep zs_map_object/zs_unmap_object for the time being, as there are
still users of it, but eventually old API will be removed.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h |  29 ++++++++
 mm/zsmalloc.c            | 148 ++++++++++++++++++++++++++++-----------
 2 files changed, 138 insertions(+), 39 deletions(-)

Comments

Yosry Ahmed Jan. 27, 2025, 9:26 p.m. UTC | #1
On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> Introduce new API to map/unmap zsmalloc handle/object.  The key
> difference is that this API does not impose atomicity restrictions
> on its users, unlike zs_map_object() which returns with page-faults
> and preemption disabled

I think that's not entirely accurate, see below.

[..]
> @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  		goto out;
>  	}
>  
> -	/* this object spans two pages */
> -	zpdescs[0] = zpdesc;
> -	zpdescs[1] = get_next_zpdesc(zpdesc);
> -	BUG_ON(!zpdescs[1]);
> +	ret = area->vm_buf;
> +	/* disable page faults to match kmap_local_page() return conditions */
> +	pagefault_disable();

Is this accurate/necessary? I am looking at kmap_local_page() and I
don't see it. Maybe that's remnant from the old code using
kmap_atomic()?

> +	if (mm != ZS_MM_WO) {
> +		/* this object spans two pages */
> +		zs_obj_copyin(area->vm_buf, zpdesc, off, class->size);
> +	}
>  
> -	ret = __zs_map_object(area, zpdescs, off, class->size);
>  out:
>  	if (likely(!ZsHugePage(zspage)))
>  		ret += ZS_HANDLE_SIZE;
Yosry Ahmed Jan. 27, 2025, 9:58 p.m. UTC | #2
On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> Introduce new API to map/unmap zsmalloc handle/object.  The key
> difference is that this API does not impose atomicity restrictions
> on its users, unlike zs_map_object() which returns with page-faults
> and preemption disabled - handle mapping API does not need a per-CPU
> vm-area because the users are required to provide an aux buffer for
> objects that span several physical pages.

I like the idea of supplying the buffer directly to zsmalloc, and zswap
already has per-CPU buffers allocated. This will help remove the special
case to handle not being able to sleep in zswap_decompress().

That being said, I am not a big fan of the new API for several reasons:
- The interface seems complicated, why do we need struct
zs_handle_mapping? Can't the user just pass an extra parameter to
zs_map_object/zs_unmap_object() to supply the buffer, and the return
value is the pointer to the data within the buffer?

- This seems to require an additional buffer on the compress side. Right
now, zswap compresses the page into its own buffer, maps the handle,
and copies to it. Now the map operation will require an extra buffer.
I guess in the WO case the buffer is not needed and we can just pass
NULL?

Taking a step back, it actually seems to me that the mapping interface
may not be the best, at least from a zswap perspective. In both cases,
we map, copy from/to the handle, then unmap. The special casing here is
essentially handling the copy direction. Zram looks fairly similar but I
didn't look too closely.

I wonder if the API should store/load instead. You either pass a buffer
to be stored (equivalent to today's alloc + map + copy), or pass a
buffer to load into (equivalent to today's map + copy). What we really
need on the zswap side is zs_store() and zs_load(), not zs_map() with
different mapping types and an optional buffer if we are going to
eventually store. I guess that's part of a larger overhaul and we'd need
to update other zpool allocators (or remove them, z3fold should be
coming soon).

Anyway this is mostly just me ranting because improving the interface to
avoid the atomicity requires making it even more complicated, when it's
really simple when you think about it in terms of what you really want
to do (i.e. store and load).

> Keep zs_map_object/zs_unmap_object for the time being, as there are
> still users of it, but eventually old API will be removed.
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  include/linux/zsmalloc.h |  29 ++++++++
>  mm/zsmalloc.c            | 148 ++++++++++++++++++++++++++++-----------
>  2 files changed, 138 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..72d84537dd38 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -58,4 +58,33 @@ unsigned long zs_compact(struct zs_pool *pool);
>  unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
>  
>  void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
> +
> +struct zs_handle_mapping {
> +	unsigned long handle;
> +	/* Points to start of the object data either within local_copy or
> +	 * within local_mapping. This is what callers should use to access
> +	 * or modify handle data.
> +	 */
> +	void *handle_mem;
> +
> +	enum zs_mapmode mode;
> +	union {
> +		/*
> +		 * Handle object data copied, because it spans across several
> +		 * (non-contiguous) physical pages. This pointer should be
> +		 * set by the zs_map_handle() caller beforehand and should
> +		 * never be accessed directly.
> +		 */
> +		void *local_copy;
> +		/*
> +		 * Handle object mapped directly. Should never be used
> +		 * directly.
> +		 */
> +		void *local_mapping;
> +	};
> +};
> +
> +int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
> +void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
> +
>  #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index a5c1f9852072..281bba4a3277 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1132,18 +1132,14 @@ static inline void __zs_cpu_down(struct mapping_area *area)
>  	area->vm_buf = NULL;
>  }
>  
> -static void *__zs_map_object(struct mapping_area *area,
> -			struct zpdesc *zpdescs[2], int off, int size)
> +static void zs_obj_copyin(void *buf, struct zpdesc *zpdesc, int off, int size)
>  {
> +	struct zpdesc *zpdescs[2];
>  	size_t sizes[2];
> -	char *buf = area->vm_buf;
> -
> -	/* disable page faults to match kmap_local_page() return conditions */
> -	pagefault_disable();
>  
> -	/* no read fastpath */
> -	if (area->vm_mm == ZS_MM_WO)
> -		goto out;
> +	zpdescs[0] = zpdesc;
> +	zpdescs[1] = get_next_zpdesc(zpdesc);
> +	BUG_ON(!zpdescs[1]);
>  
>  	sizes[0] = PAGE_SIZE - off;
>  	sizes[1] = size - sizes[0];
> @@ -1151,21 +1147,17 @@ static void *__zs_map_object(struct mapping_area *area,
>  	/* copy object to per-cpu buffer */
>  	memcpy_from_page(buf, zpdesc_page(zpdescs[0]), off, sizes[0]);
>  	memcpy_from_page(buf + sizes[0], zpdesc_page(zpdescs[1]), 0, sizes[1]);
> -out:
> -	return area->vm_buf;
>  }
>  
> -static void __zs_unmap_object(struct mapping_area *area,
> -			struct zpdesc *zpdescs[2], int off, int size)
> +static void zs_obj_copyout(void *buf, struct zpdesc *zpdesc, int off, int size)
>  {
> +	struct zpdesc *zpdescs[2];
>  	size_t sizes[2];
> -	char *buf;
>  
> -	/* no write fastpath */
> -	if (area->vm_mm == ZS_MM_RO)
> -		goto out;
> +	zpdescs[0] = zpdesc;
> +	zpdescs[1] = get_next_zpdesc(zpdesc);
> +	BUG_ON(!zpdescs[1]);
>  
> -	buf = area->vm_buf;
>  	buf = buf + ZS_HANDLE_SIZE;
>  	size -= ZS_HANDLE_SIZE;
>  	off += ZS_HANDLE_SIZE;
> @@ -1176,10 +1168,6 @@ static void __zs_unmap_object(struct mapping_area *area,
>  	/* copy per-cpu buffer to object */
>  	memcpy_to_page(zpdesc_page(zpdescs[0]), off, buf, sizes[0]);
>  	memcpy_to_page(zpdesc_page(zpdescs[1]), 0, buf + sizes[0], sizes[1]);
> -
> -out:
> -	/* enable page faults to match kunmap_local() return conditions */
> -	pagefault_enable();
>  }
>  
>  static int zs_cpu_prepare(unsigned int cpu)
> @@ -1260,6 +1248,8 @@ EXPORT_SYMBOL_GPL(zs_get_total_pages);
>   * against nested mappings.
>   *
>   * This function returns with preemption and page faults disabled.
> + *
> + * NOTE: this function is deprecated and will be removed.
>   */
>  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  			enum zs_mapmode mm)
> @@ -1268,10 +1258,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  	struct zpdesc *zpdesc;
>  	unsigned long obj, off;
>  	unsigned int obj_idx;
> -
>  	struct size_class *class;
>  	struct mapping_area *area;
> -	struct zpdesc *zpdescs[2];
>  	void *ret;
>  
>  	/*
> @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  		goto out;
>  	}
>  
> -	/* this object spans two pages */
> -	zpdescs[0] = zpdesc;
> -	zpdescs[1] = get_next_zpdesc(zpdesc);
> -	BUG_ON(!zpdescs[1]);
> +	ret = area->vm_buf;
> +	/* disable page faults to match kmap_local_page() return conditions */
> +	pagefault_disable();
> +	if (mm != ZS_MM_WO) {
> +		/* this object spans two pages */
> +		zs_obj_copyin(area->vm_buf, zpdesc, off, class->size);
> +	}
>  
> -	ret = __zs_map_object(area, zpdescs, off, class->size);
>  out:
>  	if (likely(!ZsHugePage(zspage)))
>  		ret += ZS_HANDLE_SIZE;
> @@ -1323,13 +1313,13 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  }
>  EXPORT_SYMBOL_GPL(zs_map_object);
>  
> +/* NOTE: this function is deprecated and will be removed. */
>  void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>  {
>  	struct zspage *zspage;
>  	struct zpdesc *zpdesc;
>  	unsigned long obj, off;
>  	unsigned int obj_idx;
> -
>  	struct size_class *class;
>  	struct mapping_area *area;
>  
> @@ -1340,23 +1330,103 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>  	off = offset_in_page(class->size * obj_idx);
>  
>  	area = this_cpu_ptr(&zs_map_area);
> -	if (off + class->size <= PAGE_SIZE)
> +	if (off + class->size <= PAGE_SIZE) {
>  		kunmap_local(area->vm_addr);
> -	else {
> -		struct zpdesc *zpdescs[2];
> +		goto out;
> +	}
>  
> -		zpdescs[0] = zpdesc;
> -		zpdescs[1] = get_next_zpdesc(zpdesc);
> -		BUG_ON(!zpdescs[1]);
> +	if (area->vm_mm != ZS_MM_RO)
> +		zs_obj_copyout(area->vm_buf, zpdesc, off, class->size);
> +	/* enable page faults to match kunmap_local() return conditions */
> +	pagefault_enable();
>  
> -		__zs_unmap_object(area, zpdescs, off, class->size);
> -	}
> +out:
>  	local_unlock(&zs_map_area.lock);
> -
>  	zspage_read_unlock(zspage);
>  }
>  EXPORT_SYMBOL_GPL(zs_unmap_object);
>  
> +void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
> +{
> +	struct zspage *zspage;
> +	struct zpdesc *zpdesc;
> +	unsigned long obj, off;
> +	unsigned int obj_idx;
> +	struct size_class *class;
> +
> +	obj = handle_to_obj(map->handle);
> +	obj_to_location(obj, &zpdesc, &obj_idx);
> +	zspage = get_zspage(zpdesc);
> +	class = zspage_class(pool, zspage);
> +	off = offset_in_page(class->size * obj_idx);
> +
> +	if (off + class->size <= PAGE_SIZE) {
> +		kunmap_local(map->local_mapping);
> +		goto out;
> +	}
> +
> +	if (map->mode != ZS_MM_RO)
> +		zs_obj_copyout(map->local_copy, zpdesc, off, class->size);
> +
> +out:
> +	zspage_read_unlock(zspage);
> +}
> +EXPORT_SYMBOL_GPL(zs_unmap_handle);
> +
> +int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
> +{
> +	struct zspage *zspage;
> +	struct zpdesc *zpdesc;
> +	unsigned long obj, off;
> +	unsigned int obj_idx;
> +	struct size_class *class;
> +
> +	WARN_ON(in_interrupt());
> +
> +	/* It guarantees it can get zspage from handle safely */
> +	pool_read_lock(pool);
> +	obj = handle_to_obj(map->handle);
> +	obj_to_location(obj, &zpdesc, &obj_idx);
> +	zspage = get_zspage(zpdesc);
> +
> +	/*
> +	 * migration cannot move any zpages in this zspage. Here, class->lock
> +	 * is too heavy since callers would take some time until they calls
> +	 * zs_unmap_object API so delegate the locking from class to zspage
> +	 * which is smaller granularity.
> +	 */
> +	zspage_read_lock(zspage);
> +	pool_read_unlock(pool);
> +
> +	class = zspage_class(pool, zspage);
> +	off = offset_in_page(class->size * obj_idx);
> +
> +	if (off + class->size <= PAGE_SIZE) {
> +		/* this object is contained entirely within a page */
> +		map->local_mapping = kmap_local_zpdesc(zpdesc);
> +		map->handle_mem = map->local_mapping + off;
> +		goto out;
> +	}
> +
> +	if (WARN_ON_ONCE(!map->local_copy)) {
> +		zspage_read_unlock(zspage);
> +		return -EINVAL;
> +	}
> +
> +	map->handle_mem = map->local_copy;
> +	if (map->mode != ZS_MM_WO) {
> +		/* this object spans two pages */
> +		zs_obj_copyin(map->local_copy, zpdesc, off, class->size);
> +	}
> +
> +out:
> +	if (likely(!ZsHugePage(zspage)))
> +		map->handle_mem += ZS_HANDLE_SIZE;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(zs_map_handle);
> +
>  /**
>   * zs_huge_class_size() - Returns the size (in bytes) of the first huge
>   *                        zsmalloc &size_class.
> -- 
> 2.48.1.262.g85cc9f2d1e-goog
>
Sergey Senozhatsky Jan. 28, 2025, 12:37 a.m. UTC | #3
On (25/01/27 21:26), Yosry Ahmed wrote:
> On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > Introduce new API to map/unmap zsmalloc handle/object.  The key
> > difference is that this API does not impose atomicity restrictions
> > on its users, unlike zs_map_object() which returns with page-faults
> > and preemption disabled
> 
> I think that's not entirely accurate, see below.

Preemption is disabled via zspage-s rwlock_t - zs_map_object() returns
with it being locked and it's being unlocked in zs_unmap_object().  Then
the function disables pagefaults and per-CPU local lock (protects per-CPU
vm-area) additionally disables preemption.

> [..]
> > @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >  		goto out;
> >  	}
> >  
> > -	/* this object spans two pages */
> > -	zpdescs[0] = zpdesc;
> > -	zpdescs[1] = get_next_zpdesc(zpdesc);
> > -	BUG_ON(!zpdescs[1]);
> > +	ret = area->vm_buf;
> > +	/* disable page faults to match kmap_local_page() return conditions */
> > +	pagefault_disable();
> 
> Is this accurate/necessary? I am looking at kmap_local_page() and I
> don't see it. Maybe that's remnant from the old code using
> kmap_atomic()?

No, this looks neither accurate nor necessary to me.  I assume that's from
a very long time ago, but regardless of that I don't really understand
why that API wants to resemble kmap_atomic() (I think that was the
intention).  This interface is expected to be gone so I didn't want
to dig into it and fix it.
Yosry Ahmed Jan. 28, 2025, 12:49 a.m. UTC | #4
On Tue, Jan 28, 2025 at 09:37:20AM +0900, Sergey Senozhatsky wrote:
> On (25/01/27 21:26), Yosry Ahmed wrote:
> > On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > > Introduce new API to map/unmap zsmalloc handle/object.  The key
> > > difference is that this API does not impose atomicity restrictions
> > > on its users, unlike zs_map_object() which returns with page-faults
> > > and preemption disabled
> > 
> > I think that's not entirely accurate, see below.
> 
> Preemption is disabled via zspage-s rwlock_t - zs_map_object() returns
> with it being locked and it's being unlocked in zs_unmap_object().  Then
> the function disables pagefaults and per-CPU local lock (protects per-CPU
> vm-area) additionally disables preemption.

Right, I meant it does not always disable page faults.

> 
> > [..]
> > > @@ -1309,12 +1297,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > >  		goto out;
> > >  	}
> > >  
> > > -	/* this object spans two pages */
> > > -	zpdescs[0] = zpdesc;
> > > -	zpdescs[1] = get_next_zpdesc(zpdesc);
> > > -	BUG_ON(!zpdescs[1]);
> > > +	ret = area->vm_buf;
> > > +	/* disable page faults to match kmap_local_page() return conditions */
> > > +	pagefault_disable();
> > 
> > Is this accurate/necessary? I am looking at kmap_local_page() and I
> > don't see it. Maybe that's remnant from the old code using
> > kmap_atomic()?
> 
> No, this looks neither accurate nor necessary to me.  I assume that's from
> a very long time ago, but regardless of that I don't really understand
> why that API wants to resemble kmap_atomic() (I think that was the
> intention).  This interface is expected to be gone so I didn't want
> to dig into it and fix it.

My assumption has been that back when we were using kmap_atomic(), which
disables page faults, we wanted to make this API's behavior consistent
for users whether or not we called kmap_atomic() -- so this makes sure it
always disables page faults.

Now that we switched to kmap_local_page(), which doesn't disable page
faults, this was left behind, ultimately making the interface
inconsistent and contradicting the purpose of its existence.

This is 100% speculation on my end :)

Anyway, if this function will be removed soon then it's not worth
revisiting it now.
Sergey Senozhatsky Jan. 28, 2025, 12:59 a.m. UTC | #5
On (25/01/27 21:58), Yosry Ahmed wrote:
> On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > Introduce new API to map/unmap zsmalloc handle/object.  The key
> > difference is that this API does not impose atomicity restrictions
> > on its users, unlike zs_map_object() which returns with page-faults
> > and preemption disabled - handle mapping API does not need a per-CPU
> > vm-area because the users are required to provide an aux buffer for
> > objects that span several physical pages.
> 
> I like the idea of supplying the buffer directly to zsmalloc, and zswap
> already has per-CPU buffers allocated. This will help remove the special
> case to handle not being able to sleep in zswap_decompress().

The interface, basically, is what we currently have, but the state
is moved out of zsmalloc internal per-CPU vm-area.

> That being said, I am not a big fan of the new API for several reasons:
> - The interface seems complicated, why do we need struct
> zs_handle_mapping? Can't the user just pass an extra parameter to
> zs_map_object/zs_unmap_object() to supply the buffer, and the return
> value is the pointer to the data within the buffer?

At least now we need to save some state - e.g. direction of the map()
so that during unmap zsmalloc determines if it needs to perform copy-out
or not.  It also needs that state in order to know if the buffer needs
to be unmapped.

zsmalloc MAP has two cases:
a) the object spans several physical non-contig pages: copy-in object into
  aux buffer and return (linear) pointer to that buffer
b) the object is contained within a physical page: kmap that page and
  return (linear) pointer to that mapping, unmap in zs_unmap_object().
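
Cases (a) and (b) above, together with the state that unmap needs, can be
sketched as a small userspace model.  This is not the zsmalloc code: the
names (struct mapping, model_map, model_unmap, PAGE_SZ) are hypothetical,
plain arrays stand in for physical pages, and the kmap, locking and
ZS_HANDLE_SIZE handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u   /* assumed page size for this model */

enum map_mode { MM_RW, MM_RO, MM_WO };   /* mirrors ZS_MM_* by name only */

/* Hypothetical stand-in for struct zs_handle_mapping. */
struct mapping {
	size_t off;           /* object offset inside the first page */
	size_t size;          /* object size in bytes */
	enum map_mode mode;   /* remembered so unmap can skip the copy-out */
	void *aux;            /* caller-supplied buffer, used only in case (a) */
	void *mem;            /* what the caller reads and writes */
};

/* pages[0] and pages[1] model two non-contiguous physical pages. */
int model_map(unsigned char *pages[2], struct mapping *m)
{
	size_t first;

	assert(m->off < PAGE_SZ && m->size <= PAGE_SZ);
	if (m->off + m->size <= PAGE_SZ) {
		/* case (b): object fits in one page, hand out a direct pointer */
		m->mem = pages[0] + m->off;
		return 0;
	}
	if (!m->aux)
		return -1;        /* case (a) cannot work without the aux buffer */

	m->mem = m->aux;
	if (m->mode != MM_WO) {
		/* case (a): copy-in both halves so the caller sees linear data */
		first = PAGE_SZ - m->off;
		memcpy(m->aux, pages[0] + m->off, first);
		memcpy((unsigned char *)m->aux + first, pages[1], m->size - first);
	}
	return 0;
}

void model_unmap(unsigned char *pages[2], struct mapping *m)
{
	size_t first;

	if (m->off + m->size <= PAGE_SZ)
		return;           /* case (b): the kernel would only kunmap here */
	if (m->mode == MM_RO)
		return;           /* read-only mapping: nothing to write back */

	first = PAGE_SZ - m->off;
	memcpy(pages[0] + m->off, m->aux, first);
	memcpy(pages[1], (unsigned char *)m->aux + first, m->size - first);
}
```

The mode and the aux pointer have to survive from map to unmap, which is
the state the struct carries in this model.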

> - This seems to require an additional buffer on the compress side. Right
> now, zswap compresses the page into its own buffer, maps the handle,
> and copies to it. Now the map operation will require an extra buffer.

Yes, for (a) mentioned above.

> I guess in the WO case the buffer is not needed and we can just pass
> NULL?

Yes.

> Taking a step back, it actually seems to me that the mapping interface
> may not be the best, at least from a zswap perspective. In both cases,
> we map, copy from/to the handle, then unmap. The special casing here is
> essentially handling the copy direction. Zram looks fairly similar but I
> didn't look too closely.
> 
> I wonder if the API should store/load instead. You either pass a buffer
> to be stored (equivalent to today's alloc + map + copy), or pass a
> buffer to load into (equivalent to today's map + copy). What we really
> need on the zswap side is zs_store() and zs_load(), not zs_map() with
> different mapping types and an optional buffer if we are going to
> eventually store. I guess that's part of a larger overhaul and we'd need
> to update other zpool allocators (or remove them, z3fold should be
> coming soon).

So I thought about it: load and store.

zs_obj_load()
{
	zspage->page kmap, etc.
	memcpy buf page   # if direction is not WO
	unmap
}

zs_obj_store()
{
	zspage->page kmap, etc.
	memcpy page buf   # if direction is not RO
	unmap
}
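
A minimal userspace sketch of what the load/store pair above would do,
assuming an object is either contained in one page or split across two.
The function names and PAGE_SZ are hypothetical, arrays model the pages,
and the kmap/unmap steps and locking are omitted.  The point it shows is
that both directions always pay a memcpy(), including for objects that
fit in a single page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u   /* assumed page size for this model */

/* Copy an object out of the modelled pages into the caller's buffer. */
void model_obj_load(void *dst, unsigned char *pages[2], size_t off, size_t size)
{
	size_t first = size;

	assert(off < PAGE_SZ && size <= PAGE_SZ);
	if (off + size > PAGE_SZ)
		first = PAGE_SZ - off;   /* bytes that live in the first page */

	memcpy(dst, pages[0] + off, first);
	if (first < size)
		memcpy((unsigned char *)dst + first, pages[1], size - first);
}

/* Copy the caller's buffer into the modelled pages. */
void model_obj_store(unsigned char *pages[2], size_t off,
		     const void *src, size_t size)
{
	size_t first = size;

	assert(off < PAGE_SZ && size <= PAGE_SZ);
	if (off + size > PAGE_SZ)
		first = PAGE_SZ - off;

	memcpy(pages[0] + off, src, first);
	if (first < size)
		memcpy(pages[1], (const unsigned char *)src + first, size - first);
}
```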

load+store would not require zsmalloc to be preemptible internally, we
could just keep existing atomic locks and it would make things a little
simpler on the zram side (slot-free-notification is called from atomic
section).

But, and it's a big but.  And it's (b) from the above.  I wasn't brave
enough to just drop (b) optimization and replace it with memcpy(),
especially when we work with relatively large objects (say size-class
3600 bytes and above).  This certainly would not make battery powered
devices happier.  Maybe in zswap the page is only read once (is that
correct?), but in zram page can be read multiple times (e.g. when zram
is used as a raw block-dev, or has a mounted fs on it) which means
multiple extra memcpy()-s.
Sergey Senozhatsky Jan. 28, 2025, 1:13 a.m. UTC | #6
On (25/01/28 00:49), Yosry Ahmed wrote:
> > Preemption is disabled via zspage-s rwlock_t - zs_map_object() returns
> > with it being locked and it's being unlocked in zs_unmap_object().  Then
> > the function disables pagefaults and per-CPU local lock (protects per-CPU
> > vm-area) additionally disables preemption.
> 
> Right, I meant it does not always disable page faults.

I'll add "sometimes" :)

[..]
> Anyway, if this function will be removed soon then it's not worth
> revisiting it now.

Ack.
Yosry Ahmed Jan. 28, 2025, 1:36 a.m. UTC | #7
On Tue, Jan 28, 2025 at 09:59:55AM +0900, Sergey Senozhatsky wrote:
> On (25/01/27 21:58), Yosry Ahmed wrote:
> > On Mon, Jan 27, 2025 at 04:59:30PM +0900, Sergey Senozhatsky wrote:
> > > Introduce new API to map/unmap zsmalloc handle/object.  The key
> > > difference is that this API does not impose atomicity restrictions
> > > on its users, unlike zs_map_object() which returns with page-faults
> > > and preemption disabled - handle mapping API does not need a per-CPU
> > > vm-area because the users are required to provide an aux buffer for
> > > objects that span several physical pages.
> > 
> > I like the idea of supplying the buffer directly to zsmalloc, and zswap
> > already has per-CPU buffers allocated. This will help remove the special
> > case to handle not being able to sleep in zswap_decompress().
> 
> The interface, basically, is what we currently have, but the state
> is moved out of zsmalloc internal per-CPU vm-area.
> 
> > That being said, I am not a big fan of the new API for several reasons:
> > - The interface seems complicated, why do we need struct
> > zs_handle_mapping? Can't the user just pass an extra parameter to
> > zs_map_object/zs_unmap_object() to supply the buffer, and the return
> > value is the pointer to the data within the buffer?
> 
> At least now we need to save some state - e.g. direction of the map()
> so that during unmap zsmalloc determines if it needs to perform copy-out
> or not.  It also needs that state in order to know if the buffer needs
> to be unmapped.
> 
> zsmalloc MAP has two cases:
> a) the object spans several physical non-contig pages: copy-in object into
>   aux buffer and return (linear) pointer to that buffer
> b) the object is contained within a physical page: kmap that page and
>   return (linear) pointer to that mapping, unmap in zs_unmap_object().

Ack. See below.

> > - This seems to require an additional buffer on the compress side. Right
> > now, zswap compresses the page into its own buffer, maps the handle,
> > and copies to it. Now the map operation will require an extra buffer.
> 
> Yes, for (a) mentioned above.
> 
> > I guess in the WO case the buffer is not needed and we can just pass
> > NULL?
> 
> Yes.

Perhaps we want to document this and enforce it (make sure that the
NULL-ness of the buffer matches the access type).

> > Taking a step back, it actually seems to me that the mapping interface
> > may not be the best, at least from a zswap perspective. In both cases,
> > we map, copy from/to the handle, then unmap. The special casing here is
> > essentially handling the copy direction. Zram looks fairly similar but I
> > didn't look too closely.
> > 
> > I wonder if the API should store/load instead. You either pass a buffer
> > to be stored (equivalent to today's alloc + map + copy), or pass a
> > buffer to load into (equivalent to today's map + copy). What we really
> > need on the zswap side is zs_store() and zs_load(), not zs_map() with
> > different mapping types and an optional buffer if we are going to
> > eventually store. I guess that's part of a larger overhaul and we'd need
> > to update other zpool allocators (or remove them, z3fold should be
> > coming soon).
> 
> So I thought about it: load and store.
> 
> zs_obj_load()
> {
> 	zspage->page kmap, etc.
> 	memcpy buf page   # if direction is not WO
> 	unmap
> }
> 
> zs_obj_store()
> {
> 	zspage->page kmap, etc.
> 	memcpy page buf   # if direction is not RO
> 	unmap
> }
> 
> load+store would not require zsmalloc to be preemptible internally, we
> could just keep existing atomic locks and it would make things a little
> simpler on the zram side (slot-free-notification is called from atomic
> section).
> 
> But, and it's a big but.  And it's (b) from the above.  I wasn't brave
> enough to just drop (b) optimization and replace it with memcpy(),
> especially when we work with relatively large objects (say size-class
> 3600 bytes and above).  This certainly would not make battery powered
> devices happier.  Maybe in zswap the page is only read once (is that
> correct?), but in zram page can be read multiple times (e.g. when zram
> is used as a raw block-dev, or has a mounted fs on it) which means
> multiple extra memcpy()-s.

In zswap, because we use the crypto_acomp API, when we cannot sleep with
the object mapped (which is true for zsmalloc), we just copy the
compressed object into a preallocated buffer anyway. So having a
zs_obj_load() interface would move that copy inside zsmalloc.

With your series, zswap can drop the memcpy and save some cycles on the
compress side. I didn't realize that zram does not perform any copies on the
read/decompress side.

Maybe the load interface can still provide a buffer to avoid the copy
where possible? I suspect with that we don't need the state and can
just pass a pointer. We'd need another call to potentially unmap, so
maybe load_start/load_end, or read_start/read_end.

Something like:

zs_obj_read_start(.., buf)
{
	if (contained in one page)
		return kmapped obj
	else
		memcpy to buf
		return buf
}

zs_obj_read_end(.., buf)
{
	if (contained in one page)
		kunmap
}
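
The read_start/read_end idea above can be modelled in userspace roughly
as follows.  Everything here is an assumption made for illustration: the
names, the array-backed "pages", and in particular the trick of telling
the two cases apart in read_end by comparing the returned pointer with
the caller's buffer, which is one possible way to avoid carrying state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u   /* assumed page size for this model */

/*
 * Hypothetical read_start: returns a pointer to linear object data.
 * pages[0] and pages[1] stand in for two non-contiguous physical pages.
 */
const void *model_read_start(unsigned char *pages[2], size_t off,
			     size_t size, void *buf)
{
	size_t first;

	assert(off < PAGE_SZ && size <= PAGE_SZ);
	if (off + size <= PAGE_SZ)
		return pages[0] + off;   /* contained in one page: no copy */

	assert(buf != NULL);
	first = PAGE_SZ - off;
	memcpy(buf, pages[0] + off, first);
	memcpy((unsigned char *)buf + first, pages[1], size - first);
	return buf;
}

/*
 * Hypothetical read_end: returns 1 when the pointer was a direct mapping
 * (where the kernel would kunmap), 0 when it was the caller's buffer.
 */
int model_read_end(const void *ptr, const void *buf)
{
	return ptr != buf;
}
```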

The interface is more straightforward and we can drop the map flags
entirely, unless I missed something here. Unfortunately you'd still need
the locking changes in zsmalloc to make zram reads fully preemptible.

I am not suggesting that we have to go this way, just throwing out
ideas.

BTW, are we positive that the locking changes made in this series are
not introducing regressions? I'd hate for us to avoid an extra copy but
end up paying for it in lock contention.
Sergey Senozhatsky Jan. 28, 2025, 5:29 a.m. UTC | #8
On (25/01/28 01:36), Yosry Ahmed wrote:
> > Yes, for (a) mentioned above.
> > 
> > > I guess in the WO case the buffer is not needed and we can just pass
> > > NULL?
> > 
> > Yes.
> 
> Perhaps we want to document this and enforce it (make sure that the
> NULL-ness of the buffer matches the access type).

Right.

> > But, and it's a big but.  And it's (b) from the above.  I wasn't brave
> > enough to just drop (b) optimization and replace it with memcpy(),
> > especially when we work with relatively large objects (say size-class
> > 3600 bytes and above).  This certainly would not make battery powered
> > devices happier.  Maybe in zswap the page is only read once (is that
> > correct?), but in zram page can be read multiple times (e.g. when zram
> > is used as a raw block-dev, or has a mounted fs on it) which means
> > multiple extra memcpy()-s.
> 
> In zswap, because we use the crypto_acomp API, when we cannot sleep with
> the object mapped (which is true for zsmalloc), we just copy the
> compressed object into a preallocated buffer anyway. So having a
> zs_obj_load() interface would move that copy inside zsmalloc.

Yeah, I saw zpool_can_sleep_mapped() and had the same thought.  zram,
as of now, doesn't support algos that can/need schedule internally for
whatever reason - kmalloc, mutex, H/W wait, etc.

> With your series, zswap can drop the memcpy and save some cycles on the
> compress side. I didn't realize that zram does not perform any copies on the
> read/decompress side.
> 
> Maybe the load interface can still provide a buffer to avoid the copy
> where possible? I suspect with that we don't need the state and can
> just pass a pointer. We'd need another call to potentially unmap, so
> maybe load_start/load_end, or read_start/read_end.
> 
> Something like:
> 
> zs_obj_read_start(.., buf)
> {
> 	if (contained in one page)
> 		return kmapped obj
> 	else
> 		memcpy to buf
> 		return buf
> }
> 
> zs_obj_read_end(.., buf)
> {
> 	if (contained in one page)
> 		kunmap
> }
> 
> The interface is more straightforward and we can drop the map flags
> entirely, unless I missed something here. Unfortunately you'd still need
> the locking changes in zsmalloc to make zram reads fully preemptible.

Agreed, the interface part is less of a problem, the atomicity of zsmalloc
is a much bigger issue.  We, technically, only need to mark zspage as "being
used, don't free" so that zsfree/compaction/migration don't mess with it,
but this is only "technically".  In practice we then have

	CPU0                    CPU1

	zs_map_object
	set READ bit            migrate
	schedule                pool rwlock
	                        size class spin-lock
	                        wait for READ bit to clear
	...                     set WRITE bit
	clear READ bit

and the whole thing collapses like a house of cards.  I wasn't able
to trigger a watchdog on my tests, but the pattern is there and it's
enough.  Maybe we can teach compaction and migration to try-WRITE and
bail out if the page is locked, but I don't know.
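
The try-WRITE idea can be illustrated with a tiny userspace lock built on
C11 atomics.  This is only a model of the pattern, not the zspage lock
from the series: struct model_lock and its helpers are hypothetical, and
the encoding (positive reader count, -1 for a writer) is an assumption.
The writer side never waits for readers; it either gets the lock
immediately or reports failure so the caller can skip that page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* state > 0: number of readers, 0: unlocked, -1: one writer */
struct model_lock {
	atomic_int state;
};

void model_read_lock(struct model_lock *l)
{
	int old = atomic_load(&l->state);

	for (;;) {
		if (old < 0) {
			/* writer active: re-read and retry */
			old = atomic_load(&l->state);
			continue;
		}
		/* on failure, old is refreshed with the current value */
		if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
			return;
	}
}

void model_read_unlock(struct model_lock *l)
{
	atomic_fetch_sub(&l->state, 1);
}

/* Compaction/migration side: take the lock only if nobody holds it. */
bool model_write_trylock(struct model_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void model_write_unlock(struct model_lock *l)
{
	atomic_store(&l->state, 0);
}
```

With this shape a reader that sleeps while holding the lock only makes
the writer bail out; the writer does not spin under other locks waiting
for the reader to come back.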

> I am not suggesting that we have to go this way, just throwing out
> ideas.

Sure, load+store is still an option.  While that zs_map_object()
optimization is nice, it may have two sides [in zram case].  On
one hand, we save memcpy() [but only for certain objects], on the
other hand, we keep the page locked for the entire decompression
duration, which can be quite a while (e.g. when algorithm is
configured with a very high compression level):

	CPU0                        CPU1

	zs_map_object
	read lock page rwlock       write lock page rwlock
	                            spin
	decompress()                ... spin a lot
	read unlock page rwlock

Maybe copy-in is just an okay thing to do.  Let me try to measure.

> BTW, are we positive that the locking changes made in this series are
> not introducing regressions?

Cannot claim that with confidence.  Our workloads don't match, we don't
even use zsmalloc in the same way :)  Here be dragons.
Sergey Senozhatsky Jan. 28, 2025, 9:38 a.m. UTC | #9
On (25/01/28 14:29), Sergey Senozhatsky wrote:
> Maybe copy-in is just an okay thing to do.  Let me try to measure.

Naaah, not really okay.  On our memory-pressure test (4GB device, 4
CPUs) that kmap_local thingy appears to save approx 6GB of memcpy().

CPY stats: 734954 1102903168 4926116 6566654656

There were 734954 cases when we memcpy() [object spans two pages] with
accumulated size of 1102903168 bytes, and 4926116 cases when we took
a shortcut via kmap_local and avoided memcpy(), with accumulated size
of 6566654656 bytes.

In both cases I counted only RO direction for map, and WO direction
for unmap.
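
To put the CPY numbers above in proportion, here is a small helper that
derives the share of bytes that skipped memcpy() and converts byte
counts to GiB.  The struct and function names are made up for this
illustration; only the four counters come from the stats line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four values of the "CPY stats" line. */
struct cpy_stats {
	uint64_t copied_cnt;	/* objects that span two pages (memcpy) */
	uint64_t copied_bytes;
	uint64_t mapped_cnt;	/* objects mapped via kmap_local (no memcpy) */
	uint64_t mapped_bytes;
};

/* Fraction of all accessed bytes that did not need a memcpy(). */
static double cpy_avoided_fraction(const struct cpy_stats *s)
{
	uint64_t total = s->copied_bytes + s->mapped_bytes;

	return total ? (double)s->mapped_bytes / (double)total : 0.0;
}

static double bytes_to_gib(uint64_t bytes)
{
	return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For { 734954, 1102903168, 4926116, 6566654656 } this gives roughly
6.1 GiB mapped directly versus about 1.0 GiB copied, i.e. around 86%
of the bytes avoided the copy on this particular workload.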
Sergey Senozhatsky Jan. 28, 2025, 11:10 a.m. UTC | #10
On (25/01/28 14:29), Sergey Senozhatsky wrote:
> Maybe we can teach compaction and migration to try-WRITE and
> bail out if the page is locked, but I don't know.

This seems to be working just fine.
Yosry Ahmed Jan. 28, 2025, 5:21 p.m. UTC | #11
On Tue, Jan 28, 2025 at 06:38:35PM +0900, Sergey Senozhatsky wrote:
> On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > Maybe copy-in is just an okay thing to do.  Let me try to measure.
> 
> Naaah, not really okay.  On our memory-pressure test (4GB device, 4
> CPUs) that kmap_local thingy appears to save approx 6GB of memcpy().
> 
> CPY stats: 734954 1102903168 4926116 6566654656
> 
> There were 734954 cases when we memcpy() [object spans two pages] with
> accumulated size of 1102903168 bytes, and 4926116 cases when we took
> a shortcut via kmap_local and avoided memcpy(), with accumulated size
> of 6566654656 bytes.
> 
> In both cases I counted only RO direction for map, and WO direction
> for unmap.

Yeah seems like the optimization is effective, at least on that
workload, unless the memcpy() is cheap and avoiding it is not buying as
much (do you know if that's the case?).

Anyway, we can keep the optimization and zswap could start making use of
it if zsmalloc becomes preemptible, so that's still a win.
Yosry Ahmed Jan. 28, 2025, 5:22 p.m. UTC | #12
On Tue, Jan 28, 2025 at 08:10:10PM +0900, Sergey Senozhatsky wrote:
> On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > Maybe we can teach compaction and migration to try-WRITE and
> > bail out if the page is locked, but I don't know.
> 
> This seems to be working just fine.

Does this mean we won't need as many locking changes to get zsmalloc to
be preemptible?

I am slightly worried about how these changes will affect performance
tbh.
Sergey Senozhatsky Jan. 28, 2025, 11:01 p.m. UTC | #13
On (25/01/28 17:22), Yosry Ahmed wrote:
> On Tue, Jan 28, 2025 at 08:10:10PM +0900, Sergey Senozhatsky wrote:
> > On (25/01/28 14:29), Sergey Senozhatsky wrote:
> > > Maybe we can teach compaction and migration to try-WRITE and
> > > bail out if the page is locked, but I don't know.
> > 
> > This seems to be working just fine.
> 
> Does this mean we won't need as many locking changes to get zsmalloc to
> be preemptible?

Correct, only zspage lock is getting converted.

Patch
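
Before the diff itself, a userspace model of the calling convention it
introduces may help.  This is a sketch, not kernel code: everything
prefixed model_/MODEL_ is hypothetical, the "pool" is just two tiny
pages, and locking as well as the ZS_HANDLE_SIZE offset are left out.
It only mirrors the two paths of zs_map_handle()/zs_unmap_handle():
an object inside one page is handed out directly, while an object that
spans two pages goes through the caller-provided local_copy buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16	/* tiny page so spanning objects are easy */

enum model_mapmode { MODEL_MM_RW, MODEL_MM_RO, MODEL_MM_WO };

/* Two separate "physical pages"; the gap keeps them non-adjacent. */
struct model_pool {
	char page0[MODEL_PAGE_SIZE];
	char gap[MODEL_PAGE_SIZE];
	char page1[MODEL_PAGE_SIZE];
};

struct model_mapping {
	size_t off;		/* stands in for the handle: offset in page0 */
	size_t size;		/* object size */
	void *handle_mem;	/* what the caller reads or writes */
	enum model_mapmode mode;
	void *local_copy;	/* aux buffer, set by the caller beforehand */
};

static int model_map_handle(struct model_pool *pool,
			    struct model_mapping *map)
{
	if (map->off + map->size <= MODEL_PAGE_SIZE) {
		/* object sits within one page: direct access, no copy */
		map->handle_mem = pool->page0 + map->off;
		return 0;
	}

	if (!map->local_copy)
		return -1;	/* the patch returns -EINVAL here */

	map->handle_mem = map->local_copy;
	if (map->mode != MODEL_MM_WO) {
		size_t first = MODEL_PAGE_SIZE - map->off;

		/* copy-in: stitch both halves into the aux buffer */
		memcpy(map->local_copy, pool->page0 + map->off, first);
		memcpy((char *)map->local_copy + first, pool->page1,
		       map->size - first);
	}
	return 0;
}

static void model_unmap_handle(struct model_pool *pool,
			       struct model_mapping *map)
{
	if (map->off + map->size <= MODEL_PAGE_SIZE)
		return;		/* direct mapping: nothing to write back */

	if (map->mode != MODEL_MM_RO) {
		size_t first = MODEL_PAGE_SIZE - map->off;

		/* copy-out: split the aux buffer back across both pages */
		memcpy(pool->page0 + map->off, map->local_copy, first);
		memcpy(pool->page1, (char *)map->local_copy + first,
		       map->size - first);
	}
}
```

One difference worth keeping in mind: in the patch local_copy and
local_mapping share a union, so the direct-mapping path overwrites the
buffer pointer the caller stored; the model keeps them apart for
simplicity.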

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..72d84537dd38 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,33 @@  unsigned long zs_compact(struct zs_pool *pool);
 unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
 
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+struct zs_handle_mapping {
+	unsigned long handle;
+	/* Points to start of the object data either within local_copy or
+	 * within local_mapping. This is what callers should use to access
+	 * or modify handle data.
+	 */
+	void *handle_mem;
+
+	enum zs_mapmode mode;
+	union {
+		/*
+		 * Handle object data copied, because it spans across several
+		 * (non-contiguous) physical pages. This pointer should be
+		 * set by the zs_map_handle() caller beforehand and should
+		 * never be accessed directly.
+		 */
+		void *local_copy;
+		/*
+		 * Handle object mapped directly. Should never be used
+		 * directly.
+		 */
+		void *local_mapping;
+	};
+};
+
+int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
+void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map);
+
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a5c1f9852072..281bba4a3277 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1132,18 +1132,14 @@  static inline void __zs_cpu_down(struct mapping_area *area)
 	area->vm_buf = NULL;
 }
 
-static void *__zs_map_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
+static void zs_obj_copyin(void *buf, struct zpdesc *zpdesc, int off, int size)
 {
+	struct zpdesc *zpdescs[2];
 	size_t sizes[2];
-	char *buf = area->vm_buf;
-
-	/* disable page faults to match kmap_local_page() return conditions */
-	pagefault_disable();
 
-	/* no read fastpath */
-	if (area->vm_mm == ZS_MM_WO)
-		goto out;
+	zpdescs[0] = zpdesc;
+	zpdescs[1] = get_next_zpdesc(zpdesc);
+	BUG_ON(!zpdescs[1]);
 
 	sizes[0] = PAGE_SIZE - off;
 	sizes[1] = size - sizes[0];
@@ -1151,21 +1147,17 @@  static void *__zs_map_object(struct mapping_area *area,
 	/* copy object to per-cpu buffer */
 	memcpy_from_page(buf, zpdesc_page(zpdescs[0]), off, sizes[0]);
 	memcpy_from_page(buf + sizes[0], zpdesc_page(zpdescs[1]), 0, sizes[1]);
-out:
-	return area->vm_buf;
 }
 
-static void __zs_unmap_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
+static void zs_obj_copyout(void *buf, struct zpdesc *zpdesc, int off, int size)
 {
+	struct zpdesc *zpdescs[2];
 	size_t sizes[2];
-	char *buf;
 
-	/* no write fastpath */
-	if (area->vm_mm == ZS_MM_RO)
-		goto out;
+	zpdescs[0] = zpdesc;
+	zpdescs[1] = get_next_zpdesc(zpdesc);
+	BUG_ON(!zpdescs[1]);
 
-	buf = area->vm_buf;
 	buf = buf + ZS_HANDLE_SIZE;
 	size -= ZS_HANDLE_SIZE;
 	off += ZS_HANDLE_SIZE;
@@ -1176,10 +1168,6 @@  static void __zs_unmap_object(struct mapping_area *area,
 	/* copy per-cpu buffer to object */
 	memcpy_to_page(zpdesc_page(zpdescs[0]), off, buf, sizes[0]);
 	memcpy_to_page(zpdesc_page(zpdescs[1]), 0, buf + sizes[0], sizes[1]);
-
-out:
-	/* enable page faults to match kunmap_local() return conditions */
-	pagefault_enable();
 }
 
 static int zs_cpu_prepare(unsigned int cpu)
@@ -1260,6 +1248,8 @@  EXPORT_SYMBOL_GPL(zs_get_total_pages);
  * against nested mappings.
  *
  * This function returns with preemption and page faults disabled.
+ *
+ * NOTE: this function is deprecated and will be removed.
  */
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 			enum zs_mapmode mm)
@@ -1268,10 +1258,8 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	struct zpdesc *zpdesc;
 	unsigned long obj, off;
 	unsigned int obj_idx;
-
 	struct size_class *class;
 	struct mapping_area *area;
-	struct zpdesc *zpdescs[2];
 	void *ret;
 
 	/*
@@ -1309,12 +1297,14 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 		goto out;
 	}
 
-	/* this object spans two pages */
-	zpdescs[0] = zpdesc;
-	zpdescs[1] = get_next_zpdesc(zpdesc);
-	BUG_ON(!zpdescs[1]);
+	ret = area->vm_buf;
+	/* disable page faults to match kmap_local_page() return conditions */
+	pagefault_disable();
+	if (mm != ZS_MM_WO) {
+		/* this object spans two pages */
+		zs_obj_copyin(area->vm_buf, zpdesc, off, class->size);
+	}
 
-	ret = __zs_map_object(area, zpdescs, off, class->size);
 out:
 	if (likely(!ZsHugePage(zspage)))
 		ret += ZS_HANDLE_SIZE;
@@ -1323,13 +1313,13 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 }
 EXPORT_SYMBOL_GPL(zs_map_object);
 
+/* NOTE: this function is deprecated and will be removed. */
 void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 {
 	struct zspage *zspage;
 	struct zpdesc *zpdesc;
 	unsigned long obj, off;
 	unsigned int obj_idx;
-
 	struct size_class *class;
 	struct mapping_area *area;
 
@@ -1340,23 +1330,103 @@  void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 	off = offset_in_page(class->size * obj_idx);
 
 	area = this_cpu_ptr(&zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
+	if (off + class->size <= PAGE_SIZE) {
 		kunmap_local(area->vm_addr);
-	else {
-		struct zpdesc *zpdescs[2];
+		goto out;
+	}
 
-		zpdescs[0] = zpdesc;
-		zpdescs[1] = get_next_zpdesc(zpdesc);
-		BUG_ON(!zpdescs[1]);
+	if (area->vm_mm != ZS_MM_RO)
+		zs_obj_copyout(area->vm_buf, zpdesc, off, class->size);
+	/* enable page faults to match kunmap_local() return conditions */
+	pagefault_enable();
 
-		__zs_unmap_object(area, zpdescs, off, class->size);
-	}
+out:
 	local_unlock(&zs_map_area.lock);
-
 	zspage_read_unlock(zspage);
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
+void zs_unmap_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	obj = handle_to_obj(map->handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		kunmap_local(map->local_mapping);
+		goto out;
+	}
+
+	if (map->mode != ZS_MM_RO)
+		zs_obj_copyout(map->local_copy, zpdesc, off, class->size);
+
+out:
+	zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_unmap_handle);
+
+int zs_map_handle(struct zs_pool *pool, struct zs_handle_mapping *map)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	WARN_ON(in_interrupt());
+
+	/* It guarantees it can get zspage from handle safely */
+	pool_read_lock(pool);
+	obj = handle_to_obj(map->handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+
+	/*
+	 * migration cannot move any zpages in this zspage. Here, class->lock
+	 * is too heavy since callers would take some time until they call
+	 * zs_unmap_handle API so delegate the locking from class to zspage
+	 * which is smaller granularity.
+	 */
+	zspage_read_lock(zspage);
+	pool_read_unlock(pool);
+
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		map->local_mapping = kmap_local_zpdesc(zpdesc);
+		map->handle_mem = map->local_mapping + off;
+		goto out;
+	}
+
+	if (WARN_ON_ONCE(!map->local_copy)) {
+		zspage_read_unlock(zspage);
+		return -EINVAL;
+	}
+
+	map->handle_mem = map->local_copy;
+	if (map->mode != ZS_MM_WO) {
+		/* this object spans two pages */
+		zs_obj_copyin(map->local_copy, zpdesc, off, class->size);
+	}
+
+out:
+	if (likely(!ZsHugePage(zspage)))
+		map->handle_mem += ZS_HANDLE_SIZE;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(zs_map_handle);
+
 /**
  * zs_huge_class_size() - Returns the size (in bytes) of the first huge
  *                        zsmalloc &size_class.