
[v9,14/19] zsmalloc: introduce new object mapping API

Message ID 20250227043618.88380-15-senozhatsky@chromium.org (mailing list archive)
State New
Series zsmalloc/zram: there be preemption

Commit Message

Sergey Senozhatsky Feb. 27, 2025, 4:35 a.m. UTC
The current object mapping API is a little cumbersome.  First, it is
inconsistent: sometimes it returns with page-faults disabled and
sometimes with page-faults enabled.  Second, and most importantly,
it enforces atomicity restrictions on its users.  zs_map_object()
has to return a linear object address, which is not always possible
because some objects span multiple physical (non-contiguous)
pages.  For such objects zsmalloc uses a per-CPU buffer to which
the object's data is copied before a pointer to that per-CPU buffer
is returned to the caller.  This leads to another, final, issue:
an extra memcpy().  Since the caller gets a pointer to the per-CPU
buffer, it can memcpy() data only into that buffer, and during
zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer
to the physical pages that the object in question spans.
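For reference, the pattern being replaced looks roughly like this
(a sketch only; the pool/handle names are illustrative):

	/* old API: src is either a kmap_local mapping or the per-CPU buffer */
	void *src = zs_map_object(pool, handle, ZS_MM_RO);

	/* ... read/decompress from src, possibly with page-faults disabled ... */
	zs_unmap_object(pool, handle);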

The new API splits functions by access mode (a usage sketch follows
the list):
- zs_obj_read_begin(handle, local_copy)
  Returns a pointer to handle memory.  For objects that span two
  physical pages, a local_copy buffer is used to store the object's
  data before the address is returned to the caller.  Otherwise
  the object's page is kmap_local mapped directly.

- zs_obj_read_end(handle, buf)
  Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().

- zs_obj_write(handle, buf, len)
  Copies len bytes from the compression buffer to handle memory
  (taking care of objects that span two pages).  This does not
  need any additional (e.g. per-CPU) buffers and writes the data
  directly to zsmalloc pool pages.
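
A hypothetical caller-side sketch of the new read and write paths
(buffer names are illustrative; local_copy must be large enough to
hold the largest object, e.g. PAGE_SIZE):

	void *mem;

	/* read: mem points into a kmap_local mapping or into local_copy */
	mem = zs_obj_read_begin(pool, handle, local_copy);
	/* ... decompress or otherwise read the object via mem ... */
	zs_obj_read_end(pool, handle, mem);

	/* write: copy compr_len bytes of compressed data into the object */
	zs_obj_write(pool, handle, compr_buf, compr_len);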

In terms of performance, on a synthetic and completely reproducible
test that allocates a fixed number of objects of fixed sizes and
iterates over those objects, first mapping in RO and then in RW mode:

OLD API
=======

First 3 results out of 10

  369,205,778      instructions        #    0.80  insn per cycle
   40,467,926      branches            #  113.732 M/sec

  369,002,122      instructions        #    0.62  insn per cycle
   40,426,145      branches            #  189.361 M/sec

  369,036,706      instructions        #    0.63  insn per cycle
   40,430,860      branches            #  204.105 M/sec

[..]

NEW API
=======

First 3 results out of 10

  265,799,293      instructions        #    0.51  insn per cycle
   29,834,567      branches            #  170.281 M/sec

  265,765,970      instructions        #    0.55  insn per cycle
   29,829,019      branches            #  161.602 M/sec

  265,764,702      instructions        #    0.51  insn per cycle
   29,828,015      branches            #  189.677 M/sec

[..]

T-test on all 10 runs
=====================

Difference at 95.0% confidence
   -1.03219e+08 +/- 55308.7
   -27.9705% +/- 0.0149878%
   (Student's t, pooled s = 58864.4)

The old API will stay around until the remaining users switch
to the new one.  After that we'll also remove the zsmalloc per-CPU
buffer and the CPU hotplug handling.

The split of map(RO) and map(WO) into read_{begin/end}/write was
suggested by Yosry Ahmed.

Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/zsmalloc.h |   8 +++
 mm/zsmalloc.c            | 125 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 133 insertions(+)

Comments

Yosry Ahmed Feb. 27, 2025, 5:48 a.m. UTC | #1
On Thu, Feb 27, 2025 at 01:35:32PM +0900, Sergey Senozhatsky wrote:
> Current object mapping API is a little cumbersome.  First, it's
> inconsistent, sometimes it returns with page-faults disabled and
> sometimes with page-faults enabled.  Second, and most importantly,
> it enforces atomicity restrictions on its users.  zs_map_object()
> has to return a linear object address which is not always possible
> because some objects span multiple physical (non-contiguous)
> pages.  For such objects zsmalloc uses a per-CPU buffer to which
> object's data is copied before a pointer to that per-CPU buffer
> is returned back to the caller.  This leads to another, final,
> issue - extra memcpy().  Since the caller gets a pointer to
> per-CPU buffer it can memcpy() data only to that buffer, and
> during zs_unmap_object() zsmalloc will memcpy() from that per-CPU
> buffer to physical pages that object in question spans across.
> 
> New API splits functions by access mode:
> - zs_obj_read_begin(handle, local_copy)
>   Returns a pointer to handle memory.  For objects that span two
>   physical pages a local_copy buffer is used to store object's
>   data before the address is returned to the caller.  Otherwise
>   the object's page is kmap_local mapped directly.
> 
> - zs_obj_read_end(handle, buf)
>   Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
> 
> - zs_obj_write(handle, buf, len)
>   Copies len-bytes from compression buffer to handle memory
>   (takes care of objects that span two pages).  This does not
>   need any additional (e.g. per-CPU) buffers and writes the data
>   directly to zsmalloc pool pages.
> 
> In terms of performance, on a synthetic and completely reproducible
> test that allocates fixed number of objects of fixed sizes and
> iterates over those objects, first mapping in RO then in RW mode:
> 
> OLD API
> =======
> 
> 3 first results out of 10
> 
>   369,205,778      instructions        #    0.80  insn per cycle
>    40,467,926      branches            #  113.732 M/sec
> 
>   369,002,122      instructions        #    0.62  insn per cycle
>    40,426,145      branches            #  189.361 M/sec
> 
>   369,036,706      instructions        #    0.63  insn per cycle
>    40,430,860      branches            #  204.105 M/sec
> 
> [..]
> 
> NEW API
> =======
> 
> 3 first results out of 10
> 
>   265,799,293      instructions        #    0.51  insn per cycle
>    29,834,567      branches            #  170.281 M/sec
> 
>   265,765,970      instructions        #    0.55  insn per cycle
>    29,829,019      branches            #  161.602 M/sec
> 
>   265,764,702      instructions        #    0.51  insn per cycle
>    29,828,015      branches            #  189.677 M/sec
> 
> [..]
> 
> T-test on all 10 runs
> =====================
> 
> Difference at 95.0% confidence
>    -1.03219e+08 +/- 55308.7
>    -27.9705% +/- 0.0149878%
>    (Student's t, pooled s = 58864.4)
> 
> The old API will stay around until the remaining users switch
> to the new one.  After that we'll also remove zsmalloc per-CPU
> buffer and CPU hotplug handling.
> 
> The split of map(RO) and map(WO) into read_{begin/end}/write is
> suggested by Yosry Ahmed.
> 
> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

I see my Reviewed-by was removed at some point. Did something change in
this patch (do I need to review it again) or was it just lost?
Sergey Senozhatsky Feb. 27, 2025, 6:54 a.m. UTC | #2
On (25/02/27 05:48), Yosry Ahmed wrote:
> > The old API will stay around until the remaining users switch
> > to the new one.  After that we'll also remove zsmalloc per-CPU
> > buffer and CPU hotplug handling.
> > 
> > The split of map(RO) and map(WO) into read_{begin/end}/write is
> > suggested by Yosry Ahmed.
> > 
> > Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> 
> I see my Reviewed-by was removed at some point. Did something change in
> this patch (do I need to review it again) or was it just lost?

No, nothing has changed.  I am actually not sure why it's not there
anymore... Sorry.
Yosry Ahmed Feb. 27, 2025, 7:09 a.m. UTC | #3
On Thu, Feb 27, 2025 at 03:54:19PM +0900, Sergey Senozhatsky wrote:
> On (25/02/27 05:48), Yosry Ahmed wrote:
> > > The old API will stay around until the remaining users switch
> > > to the new one.  After that we'll also remove zsmalloc per-CPU
> > > buffer and CPU hotplug handling.
> > > 
> > > The split of map(RO) and map(WO) into read_{begin/end}/write is
> > > suggested by Yosry Ahmed.
> > > 
> > > Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> > 
> > I see my Reviewed-by was removed at some point. Did something change in
> > this patch (do I need to review it again) or was it just lost?
> 
> No, nothing has changed.  I am actually not sure why it's not there
> anymore... Sorry.

No worries, I am just glad I don't have to review it again :P
Herbert Xu March 1, 2025, 7:22 a.m. UTC | #4
Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> New API splits functions by access mode:
> - zs_obj_read_begin(handle, local_copy)
>  Returns a pointer to handle memory.  For objects that span two
>  physical pages a local_copy buffer is used to store object's
>  data before the address is returned to the caller.  Otherwise
>  the object's page is kmap_local mapped directly.

I presume this buffer is always given to the compression algorithm
to decompress? In that case there should be no need to linearise
them at all.

Just return a two-entry SG list, and give it to the Crypto API
to deal with.  Both software and hardware algorithms can handle
non-linear input.  Yes software decompression is currently
linearising all input with a copy, but that is no different
to the copy that you're making in zsmalloc.

So please change this API to create an SG list instead of copying.
That way we can then optimise the software decompression to read
non-linear input directly and skip the copying altogether.
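
A minimal sketch of such a two-entry SG list, assuming the generic
scatterlist helpers (the page/offset/size variables below are
illustrative, not part of this series):

	struct scatterlist sg[2];

	sg_init_table(sg, 2);
	/* first chunk: from 'off' to the end of the first physical page */
	sg_set_page(&sg[0], first_page, PAGE_SIZE - off, off);
	/* second chunk: the remainder at the start of the next page */
	sg_set_page(&sg[1], next_page, obj_size - (PAGE_SIZE - off), 0);
	/* such a list could then be passed to the acomp API as the source */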

Cheers,
Sergey Senozhatsky March 3, 2025, 2:42 a.m. UTC | #5
On (25/03/01 15:22), Herbert Xu wrote:
> Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> >
> > New API splits functions by access mode:
> > - zs_obj_read_begin(handle, local_copy)
> >  Returns a pointer to handle memory.  For objects that span two
> >  physical pages a local_copy buffer is used to store object's
> >  data before the address is returned to the caller.  Otherwise
> >  the object's page is kmap_local mapped directly.
> 
> I presume this buffer is always given to the compression algorithm
> to decompress? In that case there should be no need to linearise
> them at all.
> 
> Just return a two-entry SG list, and give it to the Crypto API
> to deal with.  Both software and hardware algorithms can handle
> non-linear input.  Yes software decompression is currently
> linearising all input with a copy, but that is no different
> to the copy that you're making in zsmalloc.
> 
> So please change this API to create an SG list instead of copying.
> That way we can then optimise the software decompression to read
> non-linear input directly and skip the copying altogether.

A heads up:

Discussed with Herbert privately, we will look into SG-list API
later (future dev cycles).
Herbert Xu March 3, 2025, 2:51 a.m. UTC | #6
On Mon, Mar 03, 2025 at 11:42:28AM +0900, Sergey Senozhatsky wrote:
>
> Discussed with Herbert privately, we will look into SG-list API
> later (future dev cycles).

Yes let's proceed with this and hopefully we can remove all the
complexity caused by linearising the compressed data later.

Thanks,

Patch

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..7d70983cf398 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,12 @@  unsigned long zs_compact(struct zs_pool *pool);
 unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
 
 void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+			void *local_copy);
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+		     void *handle_mem);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+		  void *handle_mem, size_t mem_len);
+
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 74a7aaebf7a0..147915ba04f9 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1364,6 +1364,131 @@  void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+			void *local_copy)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+	void *addr;
+
+	/* Guarantee we can get zspage from handle safely */
+	read_lock(&pool->lock);
+	obj = handle_to_obj(handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+
+	/* Make sure migration doesn't move any pages in this zspage */
+	zspage_read_lock(zspage);
+	read_unlock(&pool->lock);
+
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		addr = kmap_local_zpdesc(zpdesc);
+		addr += off;
+	} else {
+		size_t sizes[2];
+
+		/* this object spans two pages */
+		sizes[0] = PAGE_SIZE - off;
+		sizes[1] = class->size - sizes[0];
+		addr = local_copy;
+
+		memcpy_from_page(addr, zpdesc_page(zpdesc),
+				 off, sizes[0]);
+		zpdesc = get_next_zpdesc(zpdesc);
+		memcpy_from_page(addr + sizes[0],
+				 zpdesc_page(zpdesc),
+				 0, sizes[1]);
+	}
+
+	if (!ZsHugePage(zspage))
+		addr += ZS_HANDLE_SIZE;
+
+	return addr;
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_begin);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+		     void *handle_mem)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	obj = handle_to_obj(handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		if (!ZsHugePage(zspage))
+			off += ZS_HANDLE_SIZE;
+		handle_mem -= off;
+		kunmap_local(handle_mem);
+	}
+
+	zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_end);
+
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+		  void *handle_mem, size_t mem_len)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+	struct size_class *class;
+
+	/* Guarantee we can get zspage from handle safely */
+	read_lock(&pool->lock);
+	obj = handle_to_obj(handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+
+	/* Make sure migration doesn't move any pages in this zspage */
+	zspage_read_lock(zspage);
+	read_unlock(&pool->lock);
+
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		void *dst = kmap_local_zpdesc(zpdesc);
+
+		if (!ZsHugePage(zspage))
+			off += ZS_HANDLE_SIZE;
+		memcpy(dst + off, handle_mem, mem_len);
+		kunmap_local(dst);
+	} else {
+		/* this object spans two pages */
+		size_t sizes[2];
+
+		off += ZS_HANDLE_SIZE;
+		sizes[0] = PAGE_SIZE - off;
+		sizes[1] = mem_len - sizes[0];
+
+		memcpy_to_page(zpdesc_page(zpdesc), off,
+			       handle_mem, sizes[0]);
+		zpdesc = get_next_zpdesc(zpdesc);
+		memcpy_to_page(zpdesc_page(zpdesc), 0,
+			       handle_mem + sizes[0], sizes[1]);
+	}
+
+	zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_write);
+
 /**
  * zs_huge_class_size() - Returns the size (in bytes) of the first huge
  *                        zsmalloc &size_class.