| Message ID | 20250129064853.2210753-5-senozhatsky@chromium.org |
|---|---|
| State | New |
| Series | zsmalloc: preemptible object mapping |
On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
> Current object mapping API is a little cumbersome. First, it's
> inconsistent, sometimes it returns with page-faults disabled and
> sometimes with page-faults enabled. Second, and most importantly,
> it enforces atomicity restrictions on its users. zs_map_object()
> has to return a linear object address which is not always possible
> because some objects span multiple physical (non-contiguous)
> pages. For such objects zsmalloc uses a per-CPU buffer to which
> object's data is copied before a pointer to that per-CPU buffer
> is returned back to the caller. This leads to another, final,
> issue - extra memcpy(). Since the caller gets a pointer to
> per-CPU buffer it can memcpy() data only to that buffer, and
> during zs_unmap_object() zsmalloc will memcpy() from that per-CPU
> buffer to physical pages that object in question spans across.
>
> New API splits functions by access mode:
> - zs_obj_read_begin(handle, local_copy)
>   Returns a pointer to handle memory. For objects that span two
>   physical pages a local_copy buffer is used to store object's
>   data before the address is returned to the caller. Otherwise
>   the object's page is kmap_local mapped directly.
>
> - zs_obj_read_end(handle, buf)
>   Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
>
> - zs_obj_write(handle, buf, len)
>   Copies len-bytes from compression buffer to handle memory
>   (takes care of objects that span two pages). This does not
>   need any additional (e.g. per-CPU) buffers and writes the data
>   directly to zsmalloc pool pages.
>
> The old API will stay around until the remaining users switch
> to the new one. After that we'll also remove zsmalloc per-CPU
> buffer and CPU hotplug handling.

I will propose removing zbud (in addition to z3fold) soon. If that gets
in then we'd only need to update zpool and zswap code to use the new
API. I can take care of that if you want.

>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

I have a couple of questions below, but generally LGTM:

Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

> ---
>  include/linux/zsmalloc.h |   8 +++
>  mm/zsmalloc.c            | 129 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 137 insertions(+)
>
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..625adae8e547 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
>  unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
>
>  void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
> +
> +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> +                     void *handle_mem);
> +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> +                        void *local_copy);

Nit: Any reason to put 'end' before 'begin'? Same for the function
definitions.

> +void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> +                  void *handle_mem, size_t mem_len);
> +
>  #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 8f4011713bc8..0e21bc57470b 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1371,6 +1371,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>  }
>  EXPORT_SYMBOL_GPL(zs_unmap_object);
>
> +void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> +                  void *handle_mem, size_t mem_len)
> +{
> +        struct zspage *zspage;
> +        struct zpdesc *zpdesc;
> +        unsigned long obj, off;
> +        unsigned int obj_idx;
> +        struct size_class *class;
> +
> +        WARN_ON(in_interrupt());
> +
> +        /* Guarantee we can get zspage from handle safely */
> +        pool_read_lock(pool);
> +        obj = handle_to_obj(handle);
> +        obj_to_location(obj, &zpdesc, &obj_idx);
> +        zspage = get_zspage(zpdesc);
> +
> +        /* Make sure migration doesn't move any pages in this zspage */
> +        zspage_read_lock(zspage);
> +        pool_read_unlock(pool);
> +
> +        class = zspage_class(pool, zspage);
> +        off = offset_in_page(class->size * obj_idx);
> +
> +        if (off + class->size <= PAGE_SIZE) {
> +                /* this object is contained entirely within a page */
> +                void *dst = kmap_local_zpdesc(zpdesc);
> +
> +                if (!ZsHugePage(zspage))
> +                        off += ZS_HANDLE_SIZE;
> +                memcpy(dst + off, handle_mem, mem_len);
> +                kunmap_local(dst);
> +        } else {
> +                size_t sizes[2];
> +
> +                /* this object spans two pages */
> +                off += ZS_HANDLE_SIZE;

Are huge pages always stored in a single page? If yes, can we just do
this before the if block for both cases:

        if (!ZsHugePage(zspage))
                off += ZS_HANDLE_SIZE;

> +                sizes[0] = PAGE_SIZE - off;
> +                sizes[1] = mem_len - sizes[0];
> +
> +                memcpy_to_page(zpdesc_page(zpdesc), off,
> +                               handle_mem, sizes[0]);
> +                zpdesc = get_next_zpdesc(zpdesc);
> +                memcpy_to_page(zpdesc_page(zpdesc), 0,
> +                               handle_mem + sizes[0], sizes[1]);
> +        }
> +
> +        zspage_read_unlock(zspage);
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_write);
> +
> +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> +                     void *handle_mem)
> +{
> +        struct zspage *zspage;
> +        struct zpdesc *zpdesc;
> +        unsigned long obj, off;
> +        unsigned int obj_idx;
> +        struct size_class *class;
> +
> +        obj = handle_to_obj(handle);
> +        obj_to_location(obj, &zpdesc, &obj_idx);
> +        zspage = get_zspage(zpdesc);
> +        class = zspage_class(pool, zspage);
> +        off = offset_in_page(class->size * obj_idx);
> +
> +        if (off + class->size <= PAGE_SIZE) {
> +                if (!ZsHugePage(zspage))
> +                        off += ZS_HANDLE_SIZE;
> +                handle_mem -= off;
> +                kunmap_local(handle_mem);
> +        }
> +
> +        zspage_read_unlock(zspage);
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_read_end);
> +
> +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> +                        void *local_copy)
> +{
> +        struct zspage *zspage;
> +        struct zpdesc *zpdesc;
> +        unsigned long obj, off;
> +        unsigned int obj_idx;
> +        struct size_class *class;
> +        void *addr;
> +
> +        WARN_ON(in_interrupt());
> +
> +        /* Guarantee we can get zspage from handle safely */
> +        pool_read_lock(pool);
> +        obj = handle_to_obj(handle);
> +        obj_to_location(obj, &zpdesc, &obj_idx);
> +        zspage = get_zspage(zpdesc);
> +
> +        /* Make sure migration doesn't move any pages in this zspage */
> +        zspage_read_lock(zspage);
> +        pool_read_unlock(pool);
> +
> +        class = zspage_class(pool, zspage);
> +        off = offset_in_page(class->size * obj_idx);
> +
> +        if (off + class->size <= PAGE_SIZE) {
> +                /* this object is contained entirely within a page */
> +                addr = kmap_local_zpdesc(zpdesc);
> +                addr += off;
> +        } else {
> +                size_t sizes[2];
> +
> +                /* this object spans two pages */
> +                sizes[0] = PAGE_SIZE - off;
> +                sizes[1] = class->size - sizes[0];
> +                addr = local_copy;
> +
> +                memcpy_from_page(addr, zpdesc_page(zpdesc),
> +                                 off, sizes[0]);
> +                zpdesc = get_next_zpdesc(zpdesc);
> +                memcpy_from_page(addr + sizes[0],
> +                                 zpdesc_page(zpdesc),
> +                                 0, sizes[1]);
> +        }
> +
> +        if (!ZsHugePage(zspage))
> +                addr += ZS_HANDLE_SIZE;
> +
> +        return addr;
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_read_begin);
> +
>  /**
>   * zs_huge_class_size() - Returns the size (in bytes) of the first huge
>   * zsmalloc &size_class.
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
On (25/01/29 17:31), Yosry Ahmed wrote:
> On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
[..]
> > The old API will stay around until the remaining users switch
> > to the new one. After that we'll also remove zsmalloc per-CPU
> > buffer and CPU hotplug handling.
>
> I will propose removing zbud (in addition to z3fold) soon. If that gets
> in then we'd only need to update zpool and zswap code to use the new
> API. I can take care of that if you want.

Sounds like a plan. I think I saw zbud deprecation patch (along with z3fold
removal). I guess you still want to keep zpool, just because it's there
already?

> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> I have a couple of questions below, but generally LGTM:
>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

Thanks.

[..]
> > +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> > +                     void *handle_mem);
> > +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> > +                        void *local_copy);
>
> Nit: Any reason to put 'end' before 'begin'? Same for the function
> definitions.

An old habit, I just put release() before init() (or end() before begin())
because often times you call end() from begin() error path. Not in this
particular case, but I just do that semi-automatically.

[..]
> > +        if (off + class->size <= PAGE_SIZE) {
> > +                /* this object is contained entirely within a page */
> > +                void *dst = kmap_local_zpdesc(zpdesc);
> > +
> > +                if (!ZsHugePage(zspage))
> > +                        off += ZS_HANDLE_SIZE;
> > +                memcpy(dst + off, handle_mem, mem_len);
> > +                kunmap_local(dst);
> > +        } else {
> > +                size_t sizes[2];
> > +
> > +                /* this object spans two pages */
> > +                off += ZS_HANDLE_SIZE;
>
> Are huge pages always stored in a single page? If yes, can we just do
> this before the if block for both cases:

Yes.

>         if (!ZsHugePage(zspage))
>                 off += ZS_HANDLE_SIZE;

Looks good.
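For readers unfamiliar with that convention, here is a minimal generic sketch (not zsmalloc code; the struct and function names are made up) of why defining the teardown helper first can be convenient: the error path of begin() can simply reuse end() without needing a forward declaration.

```c
#include <linux/slab.h>

struct thing {
        void *a;
        void *b;
};

/* teardown defined first ... */
static void thing_end(struct thing *t)
{
        /* kfree(NULL) is a no-op, so partially set up state is fine */
        kfree(t->b);
        kfree(t->a);
}

/* ... so that the error path of begin() can simply call it */
static int thing_begin(struct thing *t)
{
        t->a = kzalloc(64, GFP_KERNEL);
        t->b = kzalloc(64, GFP_KERNEL);
        if (!t->a || !t->b) {
                thing_end(t);
                return -ENOMEM;
        }
        return 0;
}
```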
On (25/01/30 12:21), Sergey Senozhatsky wrote:
> [..]
> > > +        if (off + class->size <= PAGE_SIZE) {
> > > +                /* this object is contained entirely within a page */
> > > +                void *dst = kmap_local_zpdesc(zpdesc);
> > > +
> > > +                if (!ZsHugePage(zspage))
> > > +                        off += ZS_HANDLE_SIZE;
> > > +                memcpy(dst + off, handle_mem, mem_len);
> > > +                kunmap_local(dst);
> > > +        } else {
> > > +                size_t sizes[2];
> > > +
> > > +                /* this object spans two pages */
> > > +                off += ZS_HANDLE_SIZE;
> >
> > Are huge pages always stored in a single page? If yes, can we just do
> > this before the if block for both cases:
>
> Yes.
>
> >         if (!ZsHugePage(zspage))
> >                 off += ZS_HANDLE_SIZE;
>
> Looks good.

Ah, now I see why I didn't do it.  off += ZS_HANDLE_SIZE before
if (off + size <= PAGE_SIZE) messes it up.
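To spell the ordering issue out with made-up numbers (a standalone userspace illustration, not code from the patch): with a 4096-byte page, an 8-byte handle and a 1024-byte size class, an object whose slot starts at in-page offset 3072 fits exactly into one page, but adding the handle size before the check would misclassify it as spanning two pages.

```c
#include <stdio.h>

#define PAGE_SIZE      4096UL
#define ZS_HANDLE_SIZE 8UL

int main(void)
{
        unsigned long size = 1024;      /* class->size                */
        unsigned long off  = 3072;      /* slot start within its page */

        /* Correct order: classify first, adjust for the handle later. */
        printf("fits in one page:    %d\n", off + size <= PAGE_SIZE);

        /* Hoisting "off += ZS_HANDLE_SIZE" flips the classification. */
        off += ZS_HANDLE_SIZE;
        printf("fits after hoisting: %d\n", off + size <= PAGE_SIZE);

        return 0;
}
```

This is why the adjustment has to stay inside the branches, after the object has been classified.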
On Thu, Jan 30, 2025 at 12:21:14PM +0900, Sergey Senozhatsky wrote:
> On (25/01/29 17:31), Yosry Ahmed wrote:
> > On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
> [..]
> > > The old API will stay around until the remaining users switch
> > > to the new one. After that we'll also remove zsmalloc per-CPU
> > > buffer and CPU hotplug handling.
> >
> > I will propose removing zbud (in addition to z3fold) soon. If that gets
> > in then we'd only need to update zpool and zswap code to use the new
> > API. I can take care of that if you want.
>
> Sounds like a plan. I think I saw zbud deprecation patch (along with z3fold
> removal). I guess you still want to keep zpool, just because it's there
> already?

Now the proposal is to remove zbud right away (patch already sent). If this
lands then our lives become easier. I am keeping zpool around for now because
it is not doing any harm, and we can remove it later. As for zbud/z3fold,
their presence is a problem due to bit rotting, and due to having to support
new APIs (like this one) in them if we want to use them unconditionally in
zswap.
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..625adae8e547 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
 unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
 
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+                     void *handle_mem);
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+                        void *local_copy);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+                  void *handle_mem, size_t mem_len);
+
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 8f4011713bc8..0e21bc57470b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1371,6 +1371,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+                  void *handle_mem, size_t mem_len)
+{
+        struct zspage *zspage;
+        struct zpdesc *zpdesc;
+        unsigned long obj, off;
+        unsigned int obj_idx;
+        struct size_class *class;
+
+        WARN_ON(in_interrupt());
+
+        /* Guarantee we can get zspage from handle safely */
+        pool_read_lock(pool);
+        obj = handle_to_obj(handle);
+        obj_to_location(obj, &zpdesc, &obj_idx);
+        zspage = get_zspage(zpdesc);
+
+        /* Make sure migration doesn't move any pages in this zspage */
+        zspage_read_lock(zspage);
+        pool_read_unlock(pool);
+
+        class = zspage_class(pool, zspage);
+        off = offset_in_page(class->size * obj_idx);
+
+        if (off + class->size <= PAGE_SIZE) {
+                /* this object is contained entirely within a page */
+                void *dst = kmap_local_zpdesc(zpdesc);
+
+                if (!ZsHugePage(zspage))
+                        off += ZS_HANDLE_SIZE;
+                memcpy(dst + off, handle_mem, mem_len);
+                kunmap_local(dst);
+        } else {
+                size_t sizes[2];
+
+                /* this object spans two pages */
+                off += ZS_HANDLE_SIZE;
+                sizes[0] = PAGE_SIZE - off;
+                sizes[1] = mem_len - sizes[0];
+
+                memcpy_to_page(zpdesc_page(zpdesc), off,
+                               handle_mem, sizes[0]);
+                zpdesc = get_next_zpdesc(zpdesc);
+                memcpy_to_page(zpdesc_page(zpdesc), 0,
+                               handle_mem + sizes[0], sizes[1]);
+        }
+
+        zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_write);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+                     void *handle_mem)
+{
+        struct zspage *zspage;
+        struct zpdesc *zpdesc;
+        unsigned long obj, off;
+        unsigned int obj_idx;
+        struct size_class *class;
+
+        obj = handle_to_obj(handle);
+        obj_to_location(obj, &zpdesc, &obj_idx);
+        zspage = get_zspage(zpdesc);
+        class = zspage_class(pool, zspage);
+        off = offset_in_page(class->size * obj_idx);
+
+        if (off + class->size <= PAGE_SIZE) {
+                if (!ZsHugePage(zspage))
+                        off += ZS_HANDLE_SIZE;
+                handle_mem -= off;
+                kunmap_local(handle_mem);
+        }
+
+        zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_end);
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+                        void *local_copy)
+{
+        struct zspage *zspage;
+        struct zpdesc *zpdesc;
+        unsigned long obj, off;
+        unsigned int obj_idx;
+        struct size_class *class;
+        void *addr;
+
+        WARN_ON(in_interrupt());
+
+        /* Guarantee we can get zspage from handle safely */
+        pool_read_lock(pool);
+        obj = handle_to_obj(handle);
+        obj_to_location(obj, &zpdesc, &obj_idx);
+        zspage = get_zspage(zpdesc);
+
+        /* Make sure migration doesn't move any pages in this zspage */
+        zspage_read_lock(zspage);
+        pool_read_unlock(pool);
+
+        class = zspage_class(pool, zspage);
+        off = offset_in_page(class->size * obj_idx);
+
+        if (off + class->size <= PAGE_SIZE) {
+                /* this object is contained entirely within a page */
+                addr = kmap_local_zpdesc(zpdesc);
+                addr += off;
+        } else {
+                size_t sizes[2];
+
+                /* this object spans two pages */
+                sizes[0] = PAGE_SIZE - off;
+                sizes[1] = class->size - sizes[0];
+                addr = local_copy;
+
+                memcpy_from_page(addr, zpdesc_page(zpdesc),
+                                 off, sizes[0]);
+                zpdesc = get_next_zpdesc(zpdesc);
+                memcpy_from_page(addr + sizes[0],
+                                 zpdesc_page(zpdesc),
+                                 0, sizes[1]);
+        }
+
+        if (!ZsHugePage(zspage))
+                addr += ZS_HANDLE_SIZE;
+
+        return addr;
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_begin);
+
 /**
  * zs_huge_class_size() - Returns the size (in bytes) of the first huge
  * zsmalloc &size_class.
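For illustration, a minimal sketch of how a caller might drive the read side of this API. The decompress() helper and the way the scratch buffer is managed are assumptions made for the example, not part of this series; the scratch buffer only needs to be large enough for the object's size class.

```c
#include <linux/zsmalloc.h>

static int example_read_obj(struct zs_pool *pool, unsigned long handle,
                            size_t obj_len, void *local_copy, void *dst)
{
        void *src;
        int err;

        /*
         * src points either at a kmap_local() mapping of the object's
         * page, or at local_copy if the object spans two pages.
         */
        src = zs_obj_read_begin(pool, handle, local_copy);
        err = decompress(dst, src, obj_len);    /* hypothetical helper */
        zs_obj_read_end(pool, handle, src);

        return err;
}
```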
Current object mapping API is a little cumbersome. First, it's
inconsistent, sometimes it returns with page-faults disabled and
sometimes with page-faults enabled. Second, and most importantly,
it enforces atomicity restrictions on its users. zs_map_object()
has to return a linear object address which is not always possible
because some objects span multiple physical (non-contiguous)
pages. For such objects zsmalloc uses a per-CPU buffer to which
object's data is copied before a pointer to that per-CPU buffer
is returned back to the caller. This leads to another, final,
issue - extra memcpy(). Since the caller gets a pointer to
per-CPU buffer it can memcpy() data only to that buffer, and
during zs_unmap_object() zsmalloc will memcpy() from that per-CPU
buffer to physical pages that object in question spans across.

New API splits functions by access mode:

- zs_obj_read_begin(handle, local_copy)
  Returns a pointer to handle memory. For objects that span two
  physical pages a local_copy buffer is used to store object's
  data before the address is returned to the caller. Otherwise
  the object's page is kmap_local mapped directly.

- zs_obj_read_end(handle, buf)
  Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().

- zs_obj_write(handle, buf, len)
  Copies len-bytes from compression buffer to handle memory
  (takes care of objects that span two pages). This does not
  need any additional (e.g. per-CPU) buffers and writes the data
  directly to zsmalloc pool pages.

The old API will stay around until the remaining users switch
to the new one. After that we'll also remove zsmalloc per-CPU
buffer and CPU hotplug handling.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h |   8 +++
 mm/zsmalloc.c            | 129 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+)
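To make the extra-memcpy() argument above concrete, here is a rough before/after sketch of a caller's write path for an object that spans two pages. The old_write()/new_write() wrappers and the comp_buf/comp_len names are illustrative only, not from this series.

```c
#include <linux/string.h>
#include <linux/zsmalloc.h>

/*
 * Old API: for a spanning object the mapped address is a per-CPU bounce
 * buffer, so the data is copied twice before it reaches the pool pages.
 */
static void old_write(struct zs_pool *pool, unsigned long handle,
                      void *comp_buf, size_t comp_len)
{
        void *dst = zs_map_object(pool, handle, ZS_MM_WO);

        memcpy(dst, comp_buf, comp_len);        /* copy #1: into bounce buffer */
        zs_unmap_object(pool, handle);          /* copy #2: into pool pages    */
}

/* New API: a single call writes directly to the pool pages. */
static void new_write(struct zs_pool *pool, unsigned long handle,
                      void *comp_buf, size_t comp_len)
{
        zs_obj_write(pool, handle, comp_buf, comp_len);
}
```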