[RFC,7/7] mm: zswap: Use acomp virtual address interface

Message ID	153c340a52090f2ff82f8f066203186a932d3f99.1740651138.git.herbert@gondor.apana.org.au (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Thu, 27 Feb 2025 18:15:09 +0800 Message-Id: <153c340a52090f2ff82f8f066203186a932d3f99.1740651138.git.herbert@gondor.apana.org.au> In-Reply-To: <cover.1740651138.git.herbert@gondor.apana.org.au> References: <cover.1740651138.git.herbert@gondor.apana.org.au> From: Herbert Xu <herbert@gondor.apana.org.au> Subject: [RFC PATCH 7/7] mm: zswap: Use acomp virtual address interface To: Linux Crypto Mailing List <linux-crypto@vger.kernel.org> Cc: linux-mm@kvack.org, Yosry Ahmed <yosry.ahmed@linux.dev> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	crypto: acomp - Add request chaining and virtual address support \| expand [0/7] crypto: acomp - Add request chaining and virtual address support [1/7] crypto: iaa - Test the correct request flag [2/7] crypto: acomp - Remove acomp request flags [3/7] crypto: acomp - Add request chaining and virtual addresses [4/7] crypto: testmgr - Remove NULL dst acomp tests [5/7] crypto: scomp - Remove support for non-trivial SG lists [6/7] crypto: scomp - Add chaining and virtual address support [RFC,7/7] mm: zswap: Use acomp virtual address interface

Herbert Xu Feb. 27, 2025, 10:15 a.m. UTC

Use the acomp virtual address interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 mm/zswap.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

Yosry Ahmed Feb. 27, 2025, 6:11 p.m. UTC | #1

On Thu, Feb 27, 2025 at 06:15:09PM +0800, Herbert Xu wrote:
> Use the acomp virtual address interface.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I can't speak to the rest of the series, but I like what this patch is
doing in zswap. Together with the recent zsmalloc changes, we should be
able to drop the alternative memcpy path completely, without needing to
use kmap_to_page() or removing the warning the highmem code.

Thanks for doing this!

> ---
>  mm/zswap.c | 23 ++++++++---------------
>  1 file changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6504174fbc6a..2b5a2398a9be 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -925,27 +925,20 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  			   struct zswap_pool *pool)
>  {
>  	struct crypto_acomp_ctx *acomp_ctx;
> -	struct scatterlist input, output;
>  	int comp_ret = 0, alloc_ret = 0;
>  	unsigned int dlen = PAGE_SIZE;
>  	unsigned long handle;
>  	struct zpool *zpool;
> +	const u8 *src;
>  	char *buf;
>  	gfp_t gfp;
>  	u8 *dst;
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> +	src = kmap_local_page(page);
>  	dst = acomp_ctx->buffer;
> -	sg_init_table(&input, 1);
> -	sg_set_page(&input, page, PAGE_SIZE, 0);
>  
> -	/*
> -	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> -	 * and hardware-accelerators may won't check the dst buffer size, so
> -	 * giving the dst buffer with enough length to avoid buffer overflow.
> -	 */
> -	sg_init_one(&output, dst, PAGE_SIZE * 2);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +	acomp_request_set_virt(acomp_ctx->req, src, dst, PAGE_SIZE, dlen);
>  
>  	/*
>  	 * it maybe looks a little bit silly that we send an asynchronous request,
> @@ -960,6 +953,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	 * acomp instance, so multiple threads can do (de)compression in parallel.
>  	 */
>  	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> +	kunmap_local(src);
>  	dlen = acomp_ctx->req->dlen;
>  	if (comp_ret)
>  		goto unlock;
> @@ -994,9 +988,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>  	struct zpool *zpool = entry->pool->zpool;
> -	struct scatterlist input, output;
>  	struct crypto_acomp_ctx *acomp_ctx;
>  	u8 *src;
> +	u8 *dst;
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
>  	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> @@ -1016,11 +1010,10 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  		zpool_unmap_handle(zpool, entry->handle);
>  	}
>  
> -	sg_init_one(&input, src, entry->length);
> -	sg_init_table(&output, 1);
> -	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> +	dst = kmap_local_folio(folio, 0);
> +	acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
>  	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> +	kunmap_local(dst);
>  	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
>  
>  	if (src != acomp_ctx->buffer)
> -- 
> 2.39.5
>

Eric Biggers Feb. 27, 2025, 6:38 p.m. UTC | #2

On Thu, Feb 27, 2025 at 06:11:04PM +0000, Yosry Ahmed wrote:
> On Thu, Feb 27, 2025 at 06:15:09PM +0800, Herbert Xu wrote:
> > Use the acomp virtual address interface.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> I can't speak to the rest of the series, but I like what this patch is
> doing in zswap. Together with the recent zsmalloc changes, we should be
> able to drop the alternative memcpy path completely, without needing to
> use kmap_to_page() or removing the warning the highmem code.
> 
> Thanks for doing this!

Well, unfortunately this patchset still uses sg_init_one() on the virtual
address, so AFAICS it doesn't work with arbitrary virtual addresses.

- Eric

Yosry Ahmed Feb. 27, 2025, 9:43 p.m. UTC | #3

On Thu, Feb 27, 2025 at 10:38:47AM -0800, Eric Biggers wrote:
> On Thu, Feb 27, 2025 at 06:11:04PM +0000, Yosry Ahmed wrote:
> > On Thu, Feb 27, 2025 at 06:15:09PM +0800, Herbert Xu wrote:
> > > Use the acomp virtual address interface.
> > > 
> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > 
> > I can't speak to the rest of the series, but I like what this patch is
> > doing in zswap. Together with the recent zsmalloc changes, we should be
> > able to drop the alternative memcpy path completely, without needing to
> > use kmap_to_page() or removing the warning the highmem code.
> > 
> > Thanks for doing this!
> 
> Well, unfortunately this patchset still uses sg_init_one() on the virtual
> address, so AFAICS it doesn't work with arbitrary virtual addresses.

If we cannot pass in highmem addresses then the problem is not solved.
Thanks for pointing this out.

Herbert Xu Feb. 28, 2025, 8:13 a.m. UTC | #4

On Thu, Feb 27, 2025 at 09:43:52PM +0000, Yosry Ahmed wrote:
>
> If we cannot pass in highmem addresses then the problem is not solved.
> Thanks for pointing this out.

Oh I didn't realise this even existed.

It's much better to handle this in the Crypto API where the copy
and be done only when it's needed for DMA.

I'll respin this.

Thanks,

Herbert Xu Feb. 28, 2025, 9:54 a.m. UTC | #5

On Fri, Feb 28, 2025 at 04:13:08PM +0800, Herbert Xu wrote:
>
> I'll respin this.

FWIW this is what the interface looks like.  Does it look OK?
Longer term hardware offload drivers should handle these non-DMA
pointers directly by having their own buffers.  For the time being
I'm simply redirecting these to a software fallback.

diff --git a/mm/zswap.c b/mm/zswap.c
index 2b5a2398a9be..2fd241c65f80 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -994,30 +994,16 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	/*
-	 * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
-	 * to do crypto_acomp_decompress() which might sleep. In such cases, we must
-	 * resort to copying the buffer to a temporary one.
-	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped buffer,
-	 * such as a kmap address of high memory or even ever a vmap address.
-	 * However, sg_init_one is only equipped to handle linearly mapped low memory.
-	 * In such cases, we also must copy the buffer to a temporary and lowmem one.
-	 */
-	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
-	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
-		zpool_unmap_handle(zpool, entry->handle);
-	}
-
 	dst = kmap_local_folio(folio, 0);
-	acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
+	if (!zpool_can_sleep_mapped(zpool) || !virt_addr_valid(src))
+		acomp_request_set_nondma(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
+	else
+		acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
 	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
 	kunmap_local(dst);
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
-		zpool_unmap_handle(zpool, entry->handle);
+	zpool_unmap_handle(zpool, entry->handle);
 	acomp_ctx_put_unlock(acomp_ctx);
 }

Yosry Ahmed Feb. 28, 2025, 3:56 p.m. UTC | #6

On Fri, Feb 28, 2025 at 05:54:53PM +0800, Herbert Xu wrote:
> On Fri, Feb 28, 2025 at 04:13:08PM +0800, Herbert Xu wrote:
> >
> > I'll respin this.
> 
> FWIW this is what the interface looks like.  Does it look OK?
> Longer term hardware offload drivers should handle these non-DMA
> pointers directly by having their own buffers.  For the time being
> I'm simply redirecting these to a software fallback.
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2b5a2398a9be..2fd241c65f80 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -994,30 +994,16 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
>  	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> -	/*
> -	 * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
> -	 * to do crypto_acomp_decompress() which might sleep. In such cases, we must
> -	 * resort to copying the buffer to a temporary one.
> -	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped buffer,
> -	 * such as a kmap address of high memory or even ever a vmap address.
> -	 * However, sg_init_one is only equipped to handle linearly mapped low memory.
> -	 * In such cases, we also must copy the buffer to a temporary and lowmem one.
> -	 */
> -	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> -	    !virt_addr_valid(src)) {
> -		memcpy(acomp_ctx->buffer, src, entry->length);
> -		src = acomp_ctx->buffer;
> -		zpool_unmap_handle(zpool, entry->handle);
> -	}
> -
>  	dst = kmap_local_folio(folio, 0);
> -	acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
> +	if (!zpool_can_sleep_mapped(zpool) || !virt_addr_valid(src))

Why is the acomp_ctx->is_sleepable check no longer needed?

Also, the zpool_can_sleep_mapped() cases will go away soon-ish, so I was
kinda hoping that the !virt_addr_valid() case goes away too and is
handled internally in the crypto library. Right now the problem is that
virt_to_page() is used to get the underlying page, which doesn't work
for kmap addresses.

> +		acomp_request_set_nondma(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
> +	else
> +		acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
>  	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
>  	kunmap_local(dst);
>  	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
>  
> -	if (src != acomp_ctx->buffer)
> -		zpool_unmap_handle(zpool, entry->handle);
> +	zpool_unmap_handle(zpool, entry->handle);
>  	acomp_ctx_put_unlock(acomp_ctx);
>  }
> 
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert Xu March 1, 2025, 6:36 a.m. UTC | #7

On Fri, Feb 28, 2025 at 03:56:42PM +0000, Yosry Ahmed wrote:
> 
> Why is the acomp_ctx->is_sleepable check no longer needed?

Because the Crypto API knows which driver uses DMA so there is
no need to check that here.  All the API needs is for you to tell
it whether the memory can undergo DMA.

> Also, the zpool_can_sleep_mapped() cases will go away soon-ish, so I was
> kinda hoping that the !virt_addr_valid() case goes away too and is
> handled internally in the crypto library. Right now the problem is that
> virt_to_page() is used to get the underlying page, which doesn't work
> for kmap addresses.

Actually this check looks buggy.  The issue is that there is no
way in general to tell whether a given pointer can be used for
DMA or not.  You have to go to the source (e.g., is it SLAB,
vmalloc, or something else) to make that determination.

So there is no simple check that the Crypto API can do to determine
this.  We have to rely on you to tell us whether it's OK to do DMA.
Otherwise the assumption must be that it is not safe to do DMA and
a copy is always required.

Now of course this is all irrelevant if you use software crypto
that does not involve DMA.  So regardless of whether you can do
DMA or not, if you're going through the software path it will just
use that memory as is without any extra copies.

The difference only comes into play if your request is routed to
hardware offload.

In fact it looks like this check and fallback path is wrong to
begin with.  It's perfectly OK to do DMA to high memory, assuming
that the device can handle it.  And if not the device would need
to take care of it anyway since an SG list can always live in
highmem.

I thought this was a lot more complicated and you had some weird
arbtirary pointer from an unknown source.  But if it's just highmem
I can get rid of the memcpy for you right now.

Cheers,

Herbert Xu March 1, 2025, 7:03 a.m. UTC | #8

On Sat, Mar 01, 2025 at 02:36:50PM +0800, Herbert Xu wrote:
>
> I thought this was a lot more complicated and you had some weird
> arbtirary pointer from an unknown source.  But if it's just highmem
> I can get rid of the memcpy for you right now.

So it appears that your highmem usage is coming from zsmalloc.  In
that case you don't want virtual addresses at all, you want an SG
list.

In fact you've already gone through a totally unnecessary copy
in _zs_map_object.  Had it simply given us a 2-entry SG list,
the Crypto API can process the data directly with no copies at
all.

The whole point of SG lists is to deal with memory fragmentation.
When your object is too big to fit in a single page, you need an
SG list to describe it.  Forcing virtual addresses just leads to
an unnecessary copy.

Chers,

Herbert Xu March 1, 2025, 7:34 a.m. UTC | #9

On Thu, Feb 27, 2025 at 10:38:47AM -0800, Eric Biggers wrote:
>
> Well, unfortunately this patchset still uses sg_init_one() on the virtual
> address, so AFAICS it doesn't work with arbitrary virtual addresses.

Right, you can't do sg_init_one on highmem memory.  But the problem
here is that this pointer (the source for decompression) should
never have been linear to start with.

Had it been an SG list, things would have just worked.

In fact, digging deeper reveals even more reasons why it should be
non-linear.  The object that we're decompressing is often split over
two physical pages.  In order to produce a linear pointer, the
zsmalloc code is actually forcing a copy (incidentally, that's
why an extra copy is being added in zswap because the zsmalloc
copy is done with a per-CPU buffer that cannot be used with async).

Had this be done with SG lists none of these issues would've existed
and things would just work whether you're doing software compression
or hardware offload.

So I'm withdrawing my acomp patch-set because the premise was wrong.
The only user for acomp actually wants SG lists.

Cheers,

Herbert Xu March 1, 2025, 7:38 a.m. UTC | #10

On Sat, Mar 01, 2025 at 03:34:33PM +0800, Herbert Xu wrote:
>
> So I'm withdrawing my acomp patch-set because the premise was wrong.
> The only user for acomp actually wants SG lists.

For a laugh I'm going to rewrite the LZO decomp code to take
an SG list as input.  That would then allow the decompression
path to run with zero copies even when the data spans two pages.

Cheers,

Yosry Ahmed March 3, 2025, 8:17 p.m. UTC | #11

On Sat, Mar 01, 2025 at 03:03:48PM +0800, Herbert Xu wrote:
> On Sat, Mar 01, 2025 at 02:36:50PM +0800, Herbert Xu wrote:
> >
> > I thought this was a lot more complicated and you had some weird
> > arbtirary pointer from an unknown source.  But if it's just highmem
> > I can get rid of the memcpy for you right now.
> 
> So it appears that your highmem usage is coming from zsmalloc.  In
> that case you don't want virtual addresses at all, you want an SG
> list.
> 
> In fact you've already gone through a totally unnecessary copy
> in _zs_map_object.  Had it simply given us a 2-entry SG list,
> the Crypto API can process the data directly with no copies at
> all.
> 
> The whole point of SG lists is to deal with memory fragmentation.
> When your object is too big to fit in a single page, you need an
> SG list to describe it.  Forcing virtual addresses just leads to
> an unnecessary copy.

I have seen the other thread with Sergey, I believe the conclusion is
that zsmalloc will be updated to use SG lists, at which point zswap can
just pass this as-is to the crypto API, and we don't need any copies in
either zsmalloc or zswap.

Is this correct?

Will this patch series be dropped?

> 
> Chers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert Xu March 4, 2025, 3:29 a.m. UTC | #12

On Mon, Mar 03, 2025 at 08:17:27PM +0000, Yosry Ahmed wrote:
>
> I have seen the other thread with Sergey, I believe the conclusion is
> that zsmalloc will be updated to use SG lists, at which point zswap can
> just pass this as-is to the crypto API, and we don't need any copies in
> either zsmalloc or zswap.
> 
> Is this correct?

That is my hope yes.

So there are two reasons why zswap should be using SG lists:

1) Non-linear memory because compressed object spans two pages;
2) Highmem.

> Will this patch series be dropped?

Not comletely, I rather liked the simplification of the scomp scratch
code.  And the chaining functionality is still needed for the batching
work.  The virtual address support will disappear for now but could
always come back if someone wants to do that.

However, I will reinstate the scomp scratch buffer in a more limited
form just to cater for the need to linearise things if the algorithm
does not support non-linear input (I hope to modify the algorithms
we care about to support non-linear input but that's going to be a
long-term project apart from LZO which I've already completed).

Right now there is a proliferation of per-cpu buffers throughout the
zswap/acomp call-stack.  I'm going to consolidate them so that there
is a single per-cpu buffer for everything (the stream memory, and
the linearisation buffer), and that it only comes into play when
needed (so hopefully LZO decompression will become completely
preemptible and not use any per-cpu buffers at all).

Once that is done I will repost this.

Thanks,

Yosry Ahmed March 4, 2025, 4:30 a.m. UTC | #13

On Tue, Mar 04, 2025 at 11:29:44AM +0800, Herbert Xu wrote:
> On Mon, Mar 03, 2025 at 08:17:27PM +0000, Yosry Ahmed wrote:
> >
> > I have seen the other thread with Sergey, I believe the conclusion is
> > that zsmalloc will be updated to use SG lists, at which point zswap can
> > just pass this as-is to the crypto API, and we don't need any copies in
> > either zsmalloc or zswap.
> > 
> > Is this correct?
> 
> That is my hope yes.
> 
> So there are two reasons why zswap should be using SG lists:
> 
> 1) Non-linear memory because compressed object spans two pages;
> 2) Highmem.
> 
> > Will this patch series be dropped?
> 
> Not comletely, I rather liked the simplification of the scomp scratch
> code.  And the chaining functionality is still needed for the batching
> work.  The virtual address support will disappear for now but could
> always come back if someone wants to do that.
> 
> However, I will reinstate the scomp scratch buffer in a more limited
> form just to cater for the need to linearise things if the algorithm
> does not support non-linear input (I hope to modify the algorithms
> we care about to support non-linear input but that's going to be a
> long-term project apart from LZO which I've already completed).
> 
> Right now there is a proliferation of per-cpu buffers throughout the
> zswap/acomp call-stack.  I'm going to consolidate them so that there
> is a single per-cpu buffer for everything (the stream memory, and
> the linearisation buffer), and that it only comes into play when
> needed (so hopefully LZO decompression will become completely
> preemptible and not use any per-cpu buffers at all).

Looking forward to this :)

> 
> Once that is done I will repost this.
> 
> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert Xu March 4, 2025, 6:10 a.m. UTC | #14

On Tue, Mar 04, 2025 at 04:30:01AM +0000, Yosry Ahmed wrote:
>
> Looking forward to this :)

Here is something that is entirely untested.  It doesn't depend
on my acomp patch-set and should apply on mainline.

commit 71955ecc7d6680386bb685f7f7f3951b5da9da9a
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 27 18:10:32 2025 +0800

    mm: zswap: Give non-linear objects to Crypto API
    
    Instead of copying non-linear objects into a buffer, use the
    scatterlist to give them directly to the Crypto API.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index a67d62b79698..0250a4f0760d 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -12,6 +12,7 @@
 #ifndef _ZPOOL_H_
 #define _ZPOOL_H_
 
+struct scatterlist;
 struct zpool;
 
 /*
@@ -53,6 +54,11 @@ void *zpool_map_handle(struct zpool *pool, unsigned long handle,
 
 void zpool_unmap_handle(struct zpool *pool, unsigned long handle);
 
+void zpool_map_sg(struct zpool *pool, unsigned long handle,
+		  enum zpool_mapmode mm, struct scatterlist *sg);
+
+void zpool_unmap_sg(struct zpool *pool, unsigned long handle);
+
 u64 zpool_get_total_pages(struct zpool *pool);
 
 
@@ -66,7 +72,9 @@ u64 zpool_get_total_pages(struct zpool *pool);
  * @free:	free mem from a pool.
  * @sleep_mapped: whether zpool driver can sleep during map.
  * @map:	map a handle.
+ * @map_sg:	map a handle into a two-entry SG list
  * @unmap:	unmap a handle.
+ * @unmap:	unmap a handle given to map_sg.
  * @total_size:	get total size of a pool.
  *
  * This is created by a zpool implementation and registered
@@ -89,7 +97,11 @@ struct zpool_driver {
 	bool sleep_mapped;
 	void *(*map)(void *pool, unsigned long handle,
 				enum zpool_mapmode mm);
+	void (*map_sg)(void *pool, unsigned long handle,
+				enum zpool_mapmode mm,
+				struct scatterlist *sg);
 	void (*unmap)(void *pool, unsigned long handle);
+	void (*unmap_sg)(void *pool, unsigned long handle);
 
 	u64 (*total_pages)(void *pool);
 };
diff --git a/mm/z3fold.c b/mm/z3fold.c
index 379d24b4fef9..7be8e3b3ba13 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -36,6 +36,7 @@
 #include <linux/percpu.h>
 #include <linux/preempt.h>
 #include <linux/workqueue.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/zpool.h>
@@ -1402,6 +1403,15 @@ static void z3fold_zpool_unmap(void *pool, unsigned long handle)
 	z3fold_unmap(pool, handle);
 }
 
+static void z3fold_zpool_map_sg(void *pool, unsigned long handle,
+				 enum zpool_mapmode mm,
+				 struct scatterlist sg[2])
+{
+	void *buf = z3fold_map(pool, handle);
+
+	sg_init_one(sg, buf, PAGE_SIZE - offset_in_page(buf));
+}
+
 static u64 z3fold_zpool_total_pages(void *pool)
 {
 	return z3fold_get_pool_pages(pool);
@@ -1416,6 +1426,7 @@ static struct zpool_driver z3fold_zpool_driver = {
 	.malloc =	z3fold_zpool_malloc,
 	.free =		z3fold_zpool_free,
 	.map =		z3fold_zpool_map,
+	.map_sg =	z3fold_zpool_map_sg,
 	.unmap =	z3fold_zpool_unmap,
 	.total_pages =	z3fold_zpool_total_pages,
 };
diff --git a/mm/zbud.c b/mm/zbud.c
index e9836fff9438..f6a4da93c985 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -49,6 +49,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/preempt.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/zpool.h>
@@ -410,6 +411,14 @@ static void zbud_zpool_unmap(void *pool, unsigned long handle)
 	zbud_unmap(pool, handle);
 }
 
+static void zbud_zpool_map_sg(void *pool, unsigned long handle,
+			      enum zpool_mapmode mm, struct scatterlist sg[2])
+{
+	void *buf = (void *)handle;
+
+	sg_init_one(sg, buf, PAGE_SIZE - offset_in_page(buf));
+}
+
 static u64 zbud_zpool_total_pages(void *pool)
 {
 	return zbud_get_pool_pages(pool);
@@ -424,7 +433,9 @@ static struct zpool_driver zbud_zpool_driver = {
 	.malloc =	zbud_zpool_malloc,
 	.free =		zbud_zpool_free,
 	.map =		zbud_zpool_map,
+	.map_sg =	zbud_zpool_map_sg,
 	.unmap =	zbud_zpool_unmap,
+	.unmap_sg =	zbud_zpool_unmap,
 	.total_pages =	zbud_zpool_total_pages,
 };
 
diff --git a/mm/zpool.c b/mm/zpool.c
index b9fda1fa857d..120dbca8ca6e 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -13,6 +13,7 @@
 #include <linux/list.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
@@ -305,6 +306,33 @@ void *zpool_map_handle(struct zpool *zpool, unsigned long handle,
 	return zpool->driver->map(zpool->pool, handle, mapmode);
 }
 
+/**
+ * zpool_map_handle() - Map a previously allocated handle into an SG list
+ * @zpool:	The zpool that the handle was allocated from
+ * @handle:	The handle to map
+ * @mapmode:	How the memory should be mapped
+ * @sg:		2-entry SG list to store the mapping
+ *
+ * This maps a previously allocated handle into an SG list.  The
+ * @mapmode param indicates to the implementation how the memory
+ * will be * used, i.e. read-only, write-only, read-write.  If the
+ * implementation does not support it, the memory will be treated
+ * as read-write.
+ *
+ * This may hold locks, disable interrupts, and/or preemption,
+ * and the zpool_unmap_handle() must be called to undo those
+ * actions.  The code that uses the mapped handle should complete
+ * its operations on the mapped handle memory quickly and unmap
+ * as soon as possible.  As the implementation may use per-cpu
+ * data, multiple handles should not be mapped concurrently on
+ * any cpu.
+ */
+void zpool_map_sg(struct zpool *zpool, unsigned long handle,
+		   enum zpool_mapmode mapmode, struct scatterlist *sg)
+{
+	zpool->driver->map_sg(zpool->pool, handle, mapmode, sg);
+}
+
 /**
  * zpool_unmap_handle() - Unmap a previously mapped handle
  * @zpool:	The zpool that the handle was allocated from
@@ -320,6 +348,21 @@ void zpool_unmap_handle(struct zpool *zpool, unsigned long handle)
 	zpool->driver->unmap(zpool->pool, handle);
 }
 
+/**
+ * zpool_unmap_sg() - Unmap a previously SG-mapped handle
+ * @zpool:	The zpool that the handle was allocated from
+ * @handle:	The handle to unmap
+ *
+ * This unmaps a previously mapped handle.  Any locks or other
+ * actions that the implementation took in zpool_map_handle()
+ * will be undone here.  The memory area returned from
+ * zpool_map_handle() should no longer be used after this.
+ */
+void zpool_unmap_sg(struct zpool *zpool, unsigned long handle)
+{
+	zpool->driver->unmap_sg(zpool->pool, handle);
+}
+
 /**
  * zpool_get_total_pages() - The total size of the pool
  * @zpool:	The zpool to check
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 6d0e47f7ae33..122294dd4105 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -49,6 +49,7 @@
 #include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/local_lock.h>
+#include <linux/scatterlist.h>
 #include "zpdesc.h"
 
 #define ZSPAGE_MAGIC	0x58
@@ -306,6 +307,10 @@ static void init_deferred_free(struct zs_pool *pool) {}
 static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
 #endif
 
+static void zs_map_object_sg(struct zs_pool *pool, unsigned long handle,
+			     enum zs_mapmode mm, struct scatterlist sg[2]);
+static void zs_unmap_object_sg(struct zs_pool *pool, unsigned long handle);
+
 static int create_cache(struct zs_pool *pool)
 {
 	char *name;
@@ -426,6 +431,32 @@ static void zs_zpool_unmap(void *pool, unsigned long handle)
 	zs_unmap_object(pool, handle);
 }
 
+static void zs_zpool_map_sg(void *pool, unsigned long handle,
+			    enum zpool_mapmode mm, struct scatterlist sg[2])
+{
+	enum zs_mapmode zs_mm;
+
+	switch (mm) {
+	case ZPOOL_MM_RO:
+		zs_mm = ZS_MM_RO;
+		break;
+	case ZPOOL_MM_WO:
+		zs_mm = ZS_MM_WO;
+		break;
+	case ZPOOL_MM_RW:
+	default:
+		zs_mm = ZS_MM_RW;
+		break;
+	}
+
+	zs_map_object_sg(pool, handle, zs_mm, sg);
+}
+
+static void zs_zpool_unmap_sg(void *pool, unsigned long handle)
+{
+	zs_unmap_object_sg(pool, handle);
+}
+
 static u64 zs_zpool_total_pages(void *pool)
 {
 	return zs_get_total_pages(pool);
@@ -440,7 +471,9 @@ static struct zpool_driver zs_zpool_driver = {
 	.malloc =		  zs_zpool_malloc,
 	.free =			  zs_zpool_free,
 	.map =			  zs_zpool_map,
+	.map_sg =		  zs_zpool_map_sg,
 	.unmap =		  zs_zpool_unmap,
+	.unmap_sg =		  zs_zpool_unmap_sg,
 	.total_pages =		  zs_zpool_total_pages,
 };
 
@@ -1281,6 +1314,72 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
+static void zs_map_object_sg(struct zs_pool *pool, unsigned long handle,
+			     enum zs_mapmode mm, struct scatterlist sg[2])
+{
+	int handle_size = ZS_HANDLE_SIZE;
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned long obj, off;
+	unsigned int obj_idx;
+
+	struct size_class *class;
+	struct zpdesc *zpdescs[2];
+
+	/* It guarantees it can get zspage from handle safely */
+	read_lock(&pool->migrate_lock);
+	obj = handle_to_obj(handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+
+	/*
+	 * migration cannot move any zpages in this zspage. Here, class->lock
+	 * is too heavy since callers would take some time until they calls
+	 * zs_unmap_object API so delegate the locking from class to zspage
+	 * which is smaller granularity.
+	 */
+	migrate_read_lock(zspage);
+	read_unlock(&pool->migrate_lock);
+
+	class = zspage_class(pool, zspage);
+	off = offset_in_page(class->size * obj_idx);
+
+	if (unlikely(ZsHugePage(zspage)))
+		handle_size = 0;
+
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		sg_init_table(sg, 1);
+		sg_set_page(sg, zpdesc_page(zpdesc), class->size - handle_size,
+			    off + handle_size);
+		return;
+	}
+
+	/* this object spans two pages */
+	zpdescs[0] = zpdesc;
+	zpdescs[1] = get_next_zpdesc(zpdesc);
+	BUG_ON(!zpdescs[1]);
+
+	sg_init_table(sg, 2);
+	sg_set_page(sg, zpdesc_page(zpdescs[0]),
+		    PAGE_SIZE - off - handle_size, off + handle_size);
+	sg_set_page(&sg[1], zpdesc_page(zpdescs[1]),
+		    class->size - (PAGE_SIZE - off - handle_size), 0);
+}
+
+static void zs_unmap_object_sg(struct zs_pool *pool, unsigned long handle)
+{
+	struct zspage *zspage;
+	struct zpdesc *zpdesc;
+	unsigned int obj_idx;
+	unsigned long obj;
+
+	obj = handle_to_obj(handle);
+	obj_to_location(obj, &zpdesc, &obj_idx);
+	zspage = get_zspage(zpdesc);
+	migrate_read_unlock(zspage);
+}
+
 /**
  * zs_huge_class_size() - Returns the size (in bytes) of the first huge
  *                        zsmalloc &size_class.
diff --git a/mm/zswap.c b/mm/zswap.c
index 6504174fbc6a..004fdf26da61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -13,6 +13,8 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <crypto/acompress.h>
+#include <crypto/scatterwalk.h>
 #include <linux/module.h>
 #include <linux/cpu.h>
 #include <linux/highmem.h>
@@ -26,7 +28,6 @@
 #include <linux/mempolicy.h>
 #include <linux/mempool.h>
 #include <linux/zpool.h>
-#include <crypto/acompress.h>
 #include <linux/zswap.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
@@ -928,9 +929,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	struct scatterlist input, output;
 	int comp_ret = 0, alloc_ret = 0;
 	unsigned int dlen = PAGE_SIZE;
+	struct scatterlist sg[2];
 	unsigned long handle;
 	struct zpool *zpool;
-	char *buf;
 	gfp_t gfp;
 	u8 *dst;
 
@@ -972,9 +973,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	if (alloc_ret)
 		goto unlock;
 
-	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
-	memcpy(buf, dst, dlen);
-	zpool_unmap_handle(zpool, handle);
+	zpool_map_sg(zpool, handle, ZPOOL_MM_WO, sg);
+	memcpy_to_sglist(sg, 0, dst, dlen);
+	zpool_unmap_sg(zpool, handle);
 
 	entry->handle = handle;
 	entry->length = dlen;
@@ -994,37 +995,19 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 {
 	struct zpool *zpool = entry->pool->zpool;
-	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src;
+	struct scatterlist input[2];
+	struct scatterlist output;
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	/*
-	 * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
-	 * to do crypto_acomp_decompress() which might sleep. In such cases, we must
-	 * resort to copying the buffer to a temporary one.
-	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped buffer,
-	 * such as a kmap address of high memory or even ever a vmap address.
-	 * However, sg_init_one is only equipped to handle linearly mapped low memory.
-	 * In such cases, we also must copy the buffer to a temporary and lowmem one.
-	 */
-	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
-	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
-		zpool_unmap_handle(zpool, entry->handle);
-	}
-
-	sg_init_one(&input, src, entry->length);
+	zpool_map_sg(zpool, entry->handle, ZPOOL_MM_RO, input);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->req, input, &output, entry->length, PAGE_SIZE);
 	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
-		zpool_unmap_handle(zpool, entry->handle);
+	zpool_unmap_sg(zpool, entry->handle);
 	acomp_ctx_put_unlock(acomp_ctx);
 }

Sergey Senozhatsky March 4, 2025, 8:33 a.m. UTC | #15

On (25/03/04 14:10), Herbert Xu wrote:
[..]
> +static void zs_map_object_sg(struct zs_pool *pool, unsigned long handle,
> +			     enum zs_mapmode mm, struct scatterlist sg[2])
> +{
> +	int handle_size = ZS_HANDLE_SIZE;
> +	struct zspage *zspage;
> +	struct zpdesc *zpdesc;
> +	unsigned long obj, off;
> +	unsigned int obj_idx;
> +
> +	struct size_class *class;
> +	struct zpdesc *zpdescs[2];
> +
> +	/* It guarantees it can get zspage from handle safely */
> +	read_lock(&pool->migrate_lock);
> +	obj = handle_to_obj(handle);
> +	obj_to_location(obj, &zpdesc, &obj_idx);
> +	zspage = get_zspage(zpdesc);
> +
> +	/*
> +	 * migration cannot move any zpages in this zspage. Here, class->lock
> +	 * is too heavy since callers would take some time until they calls
> +	 * zs_unmap_object API so delegate the locking from class to zspage
> +	 * which is smaller granularity.
> +	 */
> +	migrate_read_lock(zspage);
> +	read_unlock(&pool->migrate_lock);
> +
> +	class = zspage_class(pool, zspage);
> +	off = offset_in_page(class->size * obj_idx);
> +
> +	if (unlikely(ZsHugePage(zspage)))
> +		handle_size = 0;
> +
> +	if (off + class->size <= PAGE_SIZE) {
> +		/* this object is contained entirely within a page */
> +		sg_init_table(sg, 1);
> +		sg_set_page(sg, zpdesc_page(zpdesc), class->size - handle_size,
> +			    off + handle_size);
> +		return;
> +	}
> +
> +	/* this object spans two pages */
> +	zpdescs[0] = zpdesc;
> +	zpdescs[1] = get_next_zpdesc(zpdesc);
> +	BUG_ON(!zpdescs[1]);
> +
> +	sg_init_table(sg, 2);
> +	sg_set_page(sg, zpdesc_page(zpdescs[0]),
> +		    PAGE_SIZE - off - handle_size, off + handle_size);
> +	sg_set_page(&sg[1], zpdesc_page(zpdescs[1]),
> +		    class->size - (PAGE_SIZE - off - handle_size), 0);
> +}

[..]

>  static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>  	struct zpool *zpool = entry->pool->zpool;
> -	struct scatterlist input, output;
>  	struct crypto_acomp_ctx *acomp_ctx;
> -	u8 *src;
> +	struct scatterlist input[2];
> +	struct scatterlist output;
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> -	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> -	/*
> -	 * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
> -	 * to do crypto_acomp_decompress() which might sleep. In such cases, we must
> -	 * resort to copying the buffer to a temporary one.
> -	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped buffer,
> -	 * such as a kmap address of high memory or even ever a vmap address.
> -	 * However, sg_init_one is only equipped to handle linearly mapped low memory.
> -	 * In such cases, we also must copy the buffer to a temporary and lowmem one.
> -	 */
> -	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> -	    !virt_addr_valid(src)) {
> -		memcpy(acomp_ctx->buffer, src, entry->length);
> -		src = acomp_ctx->buffer;
> -		zpool_unmap_handle(zpool, entry->handle);
> -	}
> -
> -	sg_init_one(&input, src, entry->length);
> +	zpool_map_sg(zpool, entry->handle, ZPOOL_MM_RO, input);
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> +	acomp_request_set_params(acomp_ctx->req, input, &output, entry->length, PAGE_SIZE);
>  	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
>  	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);

And at some point you do memcpy() from SG list to a local buffer?

zsmalloc map() has a shortcut - for objects that fit one physical
page (that includes huge incompressible PAGE_SIZE-ed objects)
zsmalloc kmap the physical page in question and returns a pointer
to that mapping.

Herbert Xu March 4, 2025, 8:42 a.m. UTC | #16

On Tue, Mar 04, 2025 at 05:33:05PM +0900, Sergey Senozhatsky wrote:
>
> And at some point you do memcpy() from SG list to a local buffer?
> 
> zsmalloc map() has a shortcut - for objects that fit one physical
> page (that includes huge incompressible PAGE_SIZE-ed objects)
> zsmalloc kmap the physical page in question and returns a pointer
> to that mapping.

If the SG list only has a single entry, there will be no copies
whatsoever even with the existing scomp code (crypto/scompress.c):

static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
{

	...
        if (sg_nents(req->src) == 1 && !PageHighMem(sg_page(req->src))) {
                src = page_to_virt(sg_page(req->src)) + req->src->offset;
	} else {
		Use scratch buffer and do a copy
	}

This still does an unnecessary copy for highmem, but that will
disappear after my acomp patch-set:

                if (sg_nents(req->src) == 1 &&
                    (!PageHighMem(sg_page(req->src)) ||
                     req->src->offset + slen <= PAGE_SIZE))
                        src = kmap_local_page(sg_page(req->src)) + req->src->offset;
		else
			Use scratch buffer and do a copy

I've also modified LZO decompression to handle SG lists which I will
post soon.  That will mean that no copies will ever occur for LZO
decompression.  The same change could be extended to other algorithms
if someone wishes to eliminate the copy for their favourite algorithm.

Cheers,

Sergey Senozhatsky March 4, 2025, 8:45 a.m. UTC | #17

On (25/03/04 16:42), Herbert Xu wrote:
> On Tue, Mar 04, 2025 at 05:33:05PM +0900, Sergey Senozhatsky wrote:
> >
> > And at some point you do memcpy() from SG list to a local buffer?
> > 
> > zsmalloc map() has a shortcut - for objects that fit one physical
> > page (that includes huge incompressible PAGE_SIZE-ed objects)
> > zsmalloc kmap the physical page in question and returns a pointer
> > to that mapping.
> 
> If the SG list only has a single entry, there will be no copies
> whatsoever even with the existing scomp code (crypto/scompress.c):

Nice.

[..]

> This still does an unnecessary copy for highmem, but that will
> disappear after my acomp patch-set:
> 
>                 if (sg_nents(req->src) == 1 &&
>                     (!PageHighMem(sg_page(req->src)) ||
>                      req->src->offset + slen <= PAGE_SIZE))
>                         src = kmap_local_page(sg_page(req->src)) + req->src->offset;
> 		else
> 			Use scratch buffer and do a copy
> 
> I've also modified LZO decompression to handle SG lists which I will
> post soon.  That will mean that no copies will ever occur for LZO
> decompression.  The same change could be extended to other algorithms
> if someone wishes to eliminate the copy for their favourite algorithm.

That's interesting.

Sergey Senozhatsky March 4, 2025, 1:19 p.m. UTC | #18

On (25/03/04 14:10), Herbert Xu wrote:
> +static void zs_map_object_sg(struct zs_pool *pool, unsigned long handle,
> +			     enum zs_mapmode mm, struct scatterlist sg[2])
> +{
[..]
> +	sg_init_table(sg, 2);
> +	sg_set_page(sg, zpdesc_page(zpdescs[0]),
> +		    PAGE_SIZE - off - handle_size, off + handle_size);
> +	sg_set_page(&sg[1], zpdesc_page(zpdescs[1]),
> +		    class->size - (PAGE_SIZE - off - handle_size), 0);
> +}
> +
> +static void zs_unmap_object_sg(struct zs_pool *pool, unsigned long handle)
> +{
> +	struct zspage *zspage;
> +	struct zpdesc *zpdesc;
> +	unsigned int obj_idx;
> +	unsigned long obj;
> +
> +	obj = handle_to_obj(handle);
> +	obj_to_location(obj, &zpdesc, &obj_idx);
> +	zspage = get_zspage(zpdesc);
> +	migrate_read_unlock(zspage);
> +}

One thing to notice is that these functions don't actually map/unmap.

And the handling is spread out over different parts of the stack,
sg list is set in zsmalloc, but the actual zsmalloc map local page is
done in crypto, and then zswap does memcpy() to write to object and so
on.  The "new" zsmalloc map API, which we plan on landing soon, handles
most of the things within zsmalloc.  Would it be possible to do something
similar with the sg API?

> @@ -928,9 +929,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	struct scatterlist input, output;
>  	int comp_ret = 0, alloc_ret = 0;
>  	unsigned int dlen = PAGE_SIZE;
> +	struct scatterlist sg[2];
>  	unsigned long handle;
>  	struct zpool *zpool;
> -	char *buf;
>  	gfp_t gfp;
>  	u8 *dst;
>  
> @@ -972,9 +973,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	if (alloc_ret)
>  		goto unlock;
>  
> -	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> -	memcpy(buf, dst, dlen);
> -	zpool_unmap_handle(zpool, handle);
> +	zpool_map_sg(zpool, handle, ZPOOL_MM_WO, sg);
> +	memcpy_to_sglist(sg, 0, dst, dlen);
> +	zpool_unmap_sg(zpool, handle);

You can give zsmalloc a handle and a compressed buffer (u8) and
zsmalloc should be able to figure it out.  WO direction map()
seems, a bit, like an extra step.

Yosry Ahmed March 4, 2025, 8:47 p.m. UTC | #19

On Tue, Mar 04, 2025 at 10:19:51PM +0900, Sergey Senozhatsky wrote:
> On (25/03/04 14:10), Herbert Xu wrote:
> > +static void zs_map_object_sg(struct zs_pool *pool, unsigned long handle,
> > +			     enum zs_mapmode mm, struct scatterlist sg[2])
> > +{
> [..]
> > +	sg_init_table(sg, 2);
> > +	sg_set_page(sg, zpdesc_page(zpdescs[0]),
> > +		    PAGE_SIZE - off - handle_size, off + handle_size);
> > +	sg_set_page(&sg[1], zpdesc_page(zpdescs[1]),
> > +		    class->size - (PAGE_SIZE - off - handle_size), 0);
> > +}
> > +
> > +static void zs_unmap_object_sg(struct zs_pool *pool, unsigned long handle)
> > +{
> > +	struct zspage *zspage;
> > +	struct zpdesc *zpdesc;
> > +	unsigned int obj_idx;
> > +	unsigned long obj;
> > +
> > +	obj = handle_to_obj(handle);
> > +	obj_to_location(obj, &zpdesc, &obj_idx);
> > +	zspage = get_zspage(zpdesc);
> > +	migrate_read_unlock(zspage);
> > +}
> 
> One thing to notice is that these functions don't actually map/unmap.
> 
> And the handling is spread out over different parts of the stack,
> sg list is set in zsmalloc, but the actual zsmalloc map local page is
> done in crypto, and then zswap does memcpy() to write to object and so
> on.  The "new" zsmalloc map API, which we plan on landing soon, handles
> most of the things within zsmalloc.  Would it be possible to do something
> similar with the sg API?

Yeah I have the same feeling that the handling is all over the place.
Also, we don't want to introduce new map APIs, so anything we do for
zswap should ideally work for zram.

We need to agree on the APIs between zsmalloc <-> zswap/zcomp <->
crypto.

In the compression path, zswap currently passes in the page to the
crypto API to get it compressed, and then allocates an object in
zsmalloc and memcpy() the compressed page to it. Then, zsmalloc may
internally memcpy() again.

These two copies should become one once zswap starts using
zs_obj_write(), and I think this is the bare minimum because we cannot
allocate the object before we are done with the compression, so we need
at least one copy.

In the decompression path, zswap gets the compressed object from
zsmalloc, which will memcpy() to a buffer if it spans two pages. Zswap
will memcpy() again if needed (highmem / not sleepable). Then we pass
it to the crypto API. I am not sure if we do extra copies internally,
but probably not.

The not sleepable case will disappear with the new zsmalloc API as well,
so the only copy in the zswap code will be if we use highmem. This goes
away if the crypto API can deal with highmem addresses, or if we use
kmap_to_page() in the zswap code.

IIUC, what Herbert is suggesting is that we rework all of this to use SG
lists to reduce copies, but I am not sure which copies can go away? We
have one copy in the compression path that probably cannot go away.
After the zsmalloc changes (and ignoring highmem), we have one copy in
the decompression path for when objects span two pages. I think this
will still happen with SG lists, except internally in the crypto API.

So I am not sure what is the advantage of using SG lists here? The only
improvement that we can make is to eliminate the copy in the highmem
case, but I think we don't really need SG lists for this.

Am I missing something?

Herbert Xu March 5, 2025, 3:40 a.m. UTC | #20

On Tue, Mar 04, 2025 at 10:19:51PM +0900, Sergey Senozhatsky wrote:
>
> One thing to notice is that these functions don't actually map/unmap.

Fair enough.  I'll rename them to pin and unpin.

> And the handling is spread out over different parts of the stack,
> sg list is set in zsmalloc, but the actual zsmalloc map local page is
> done in crypto, and then zswap does memcpy() to write to object and so
> on.  The "new" zsmalloc map API, which we plan on landing soon, handles
> most of the things within zsmalloc.  Would it be possible to do something
> similar with the sg API?

If by mapping you're referring to kmap then it's only being done
in the Crypto API.  zswap is not doing any mappings with my patch,
even the copy to SG list operation after compression calls Crypto
API code (the newly introduced memcpy_to_sglist from crypto/scatterwalk.c.

The data is only ever read/written by Crypto API code so it
would seem to be more natural to map it when and where the
data is needed.

This also eliminates unnecessary mappings when the data is passed
to hardware offload, since there is no point in mapping the data
into CPU address space at all if it's only going to be accessed
with DMA.

> > @@ -972,9 +973,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >  	if (alloc_ret)
> >  		goto unlock;
> >  
> > -	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > -	memcpy(buf, dst, dlen);
> > -	zpool_unmap_handle(zpool, handle);
> > +	zpool_map_sg(zpool, handle, ZPOOL_MM_WO, sg);
> > +	memcpy_to_sglist(sg, 0, dst, dlen);
> > +	zpool_unmap_sg(zpool, handle);
> 
> You can give zsmalloc a handle and a compressed buffer (u8) and
> zsmalloc should be able to figure it out.  WO direction map()
> seems, a bit, like an extra step.

Sure, this part can be dropped since your patch-set already provides
an interface for writing to the buffer.  Here is the same patch rebased
on top of your read_begin series.

commit d5891a27df516192e381047b4c79de4e9f7df4cd
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 27 18:10:32 2025 +0800

    mm: zswap: Give non-linear objects to Crypto API
    
    Instead of copying non-linear objects into a buffer, use the
    scatterlist to give them directly to the Crypto API.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index a67d62b79698..795f8e3ad964 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -12,27 +12,9 @@
 #ifndef _ZPOOL_H_
 #define _ZPOOL_H_
 
+struct scatterlist;
 struct zpool;
 
-/*
- * Control how a handle is mapped.  It will be ignored if the
- * implementation does not support it.  Its use is optional.
- * Note that this does not refer to memory protection, it
- * refers to how the memory will be copied in/out if copying
- * is necessary during mapping; read-write is the safest as
- * it copies the existing memory in on map, and copies the
- * changed memory back out on unmap.  Write-only does not copy
- * in the memory and should only be used for initialization.
- * If in doubt, use ZPOOL_MM_DEFAULT which is read-write.
- */
-enum zpool_mapmode {
-	ZPOOL_MM_RW, /* normal read-write mapping */
-	ZPOOL_MM_RO, /* read-only (no copy-out at unmap time) */
-	ZPOOL_MM_WO, /* write-only (no copy-in at map time) */
-
-	ZPOOL_MM_DEFAULT = ZPOOL_MM_RW
-};
-
 bool zpool_has_pool(char *type);
 
 struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp);
@@ -48,10 +30,13 @@ int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
-void *zpool_map_handle(struct zpool *pool, unsigned long handle,
-			enum zpool_mapmode mm);
+void zpool_pin_handle(struct zpool *pool, unsigned long handle,
+		      struct scatterlist *sg);
 
-void zpool_unmap_handle(struct zpool *pool, unsigned long handle);
+void zpool_unpin_handle(struct zpool *pool, unsigned long handle);
+
+void zpool_write_handle(struct zpool *pool, unsigned long handle,
+			void *handle_mem, size_t mem_len);
 
 u64 zpool_get_total_pages(struct zpool *pool);
 
@@ -64,9 +49,9 @@ u64 zpool_get_total_pages(struct zpool *pool);
  * @destroy:	destroy a pool.
  * @malloc:	allocate mem from a pool.
  * @free:	free mem from a pool.
- * @sleep_mapped: whether zpool driver can sleep during map.
- * @map:	map a handle.
- * @unmap:	unmap a handle.
+ * @pin:	pin a handle and write it into a two-entry SG list.
+ * @unpin:	unpin a handle.
+ * @write:	write buffer to a handle.
  * @total_size:	get total size of a pool.
  *
  * This is created by a zpool implementation and registered
@@ -86,10 +71,10 @@ struct zpool_driver {
 				unsigned long *handle);
 	void (*free)(void *pool, unsigned long handle);
 
-	bool sleep_mapped;
-	void *(*map)(void *pool, unsigned long handle,
-				enum zpool_mapmode mm);
-	void (*unmap)(void *pool, unsigned long handle);
+	void (*pin)(void *pool, unsigned long handle, struct scatterlist *sg);
+	void (*unpin)(void *pool, unsigned long handle);
+	void (*write)(void *pool, unsigned long handle,
+				void *handle_mem, size_t mem_len);
 
 	u64 (*total_pages)(void *pool);
 };
@@ -98,6 +83,4 @@ void zpool_register_driver(struct zpool_driver *driver);
 
 int zpool_unregister_driver(struct zpool_driver *driver);
 
-bool zpool_can_sleep_mapped(struct zpool *pool);
-
 #endif
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 7d70983cf398..c26baf9fb331 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -16,23 +16,6 @@
 
 #include <linux/types.h>
 
-/*
- * zsmalloc mapping modes
- *
- * NOTE: These only make a difference when a mapped object spans pages.
- */
-enum zs_mapmode {
-	ZS_MM_RW, /* normal read-write mapping */
-	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
-	ZS_MM_WO /* write-only (no copy-in at map time) */
-	/*
-	 * NOTE: ZS_MM_WO should only be used for initializing new
-	 * (uninitialized) allocations.  Partial writes to already
-	 * initialized allocations should use ZS_MM_RW to preserve the
-	 * existing data.
-	 */
-};
-
 struct zs_pool_stats {
 	/* How many pages were migrated (freed) */
 	atomic_long_t pages_compacted;
@@ -48,10 +31,6 @@ void zs_free(struct zs_pool *pool, unsigned long obj);
 
 size_t zs_huge_class_size(struct zs_pool *pool);
 
-void *zs_map_object(struct zs_pool *pool, unsigned long handle,
-			enum zs_mapmode mm);
-void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
-
 unsigned long zs_get_total_pages(struct zs_pool *pool);
 unsigned long zs_compact(struct zs_pool *pool);
 
diff --git a/mm/z3fold.c b/mm/z3fold.c
index 379d24b4fef9..f0dc45cf9138 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -36,6 +36,7 @@
 #include <linux/percpu.h>
 #include <linux/preempt.h>
 #include <linux/workqueue.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/zpool.h>
@@ -1392,16 +1393,28 @@ static void z3fold_zpool_free(void *pool, unsigned long handle)
 	z3fold_free(pool, handle);
 }
 
-static void *z3fold_zpool_map(void *pool, unsigned long handle,
-			enum zpool_mapmode mm)
+static void z3fold_zpool_pin(void *pool, unsigned long handle,
+			     struct scatterlist sg[2])
 {
-	return z3fold_map(pool, handle);
+	void *buf = z3fold_map(pool, handle);
+
+	sg_init_one(sg, buf, PAGE_SIZE - offset_in_page(buf));
 }
-static void z3fold_zpool_unmap(void *pool, unsigned long handle)
+
+static void z3fold_zpool_unpin(void *pool, unsigned long handle)
 {
 	z3fold_unmap(pool, handle);
 }
 
+static void z3fold_zpool_write(void *pool, unsigned long handle,
+			       void *handle_mem, size_t mem_len)
+{
+	void *buf = z3fold_map(pool, handle);
+
+	memcpy(buf, handle_mem, mem_len);
+	z3fold_unmap(pool, handle);
+}
+
 static u64 z3fold_zpool_total_pages(void *pool)
 {
 	return z3fold_get_pool_pages(pool);
@@ -1409,14 +1422,14 @@ static u64 z3fold_zpool_total_pages(void *pool)
 
 static struct zpool_driver z3fold_zpool_driver = {
 	.type =		"z3fold",
-	.sleep_mapped = true,
 	.owner =	THIS_MODULE,
 	.create =	z3fold_zpool_create,
 	.destroy =	z3fold_zpool_destroy,
 	.malloc =	z3fold_zpool_malloc,
 	.free =		z3fold_zpool_free,
-	.map =		z3fold_zpool_map,
-	.unmap =	z3fold_zpool_unmap,
+	.pin =		z3fold_zpool_pin,
+	.unpin =	z3fold_zpool_unpin,
+	.write =	z3fold_zpool_write,
 	.total_pages =	z3fold_zpool_total_pages,
 };
 
diff --git a/mm/zbud.c b/mm/zbud.c
index e9836fff9438..21c0a9c26abe 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -36,10 +36,9 @@
  *
  * The zbud API differs from that of conventional allocators in that the
  * allocation function, zbud_alloc(), returns an opaque handle to the user,
- * not a dereferenceable pointer.  The user must map the handle using
- * zbud_map() in order to get a usable pointer by which to access the
- * allocation data and unmap the handle with zbud_unmap() when operations
- * on the allocation data are complete.
+ * not a dereferenceable pointer.  The user must pin the handle using
+ * zbud_pin() in order to access the allocation data and unpin the handle
+ * with zbud_unpin() when operations on the allocation data are complete.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -49,6 +48,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/preempt.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/zpool.h>
@@ -339,28 +339,30 @@ static void zbud_free(struct zbud_pool *pool, unsigned long handle)
 }
 
 /**
- * zbud_map() - maps the allocation associated with the given handle
+ * zbud_pin() - pins the allocation associated with the given handle
  * @pool:	pool in which the allocation resides
- * @handle:	handle associated with the allocation to be mapped
+ * @handle:	handle associated with the allocation to be pinned
+ * @sg:		2-entry scatter list to store the memory pointers
  *
- * While trivial for zbud, the mapping functions for others allocators
+ * While trivial for zbud, the pinning functions for others allocators
  * implementing this allocation API could have more complex information encoded
  * in the handle and could create temporary mappings to make the data
  * accessible to the user.
- *
- * Returns: a pointer to the mapped allocation
  */
-static void *zbud_map(struct zbud_pool *pool, unsigned long handle)
+static void zbud_pin(struct zbud_pool *pool, unsigned long handle,
+		      struct scatterlist sg[2])
 {
-	return (void *)(handle);
+	void *buf = (void *)handle;
+
+	sg_init_one(sg, buf, PAGE_SIZE - offset_in_page(buf));
 }
 
 /**
- * zbud_unmap() - maps the allocation associated with the given handle
+ * zbud_unpin() - unpins the allocation associated with the given handle
  * @pool:	pool in which the allocation resides
- * @handle:	handle associated with the allocation to be unmapped
+ * @handle:	handle associated with the allocation to be unpinned
  */
-static void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
+static void zbud_unpin(struct zbud_pool *pool, unsigned long handle)
 {
 }
 
@@ -400,14 +402,20 @@ static void zbud_zpool_free(void *pool, unsigned long handle)
 	zbud_free(pool, handle);
 }
 
-static void *zbud_zpool_map(void *pool, unsigned long handle,
-			enum zpool_mapmode mm)
+static void zbud_zpool_pin(void *pool, unsigned long handle,
+			   struct scatterlist sg[2])
 {
-	return zbud_map(pool, handle);
+	zbud_pin(pool, handle, sg);
 }
-static void zbud_zpool_unmap(void *pool, unsigned long handle)
+static void zbud_zpool_unpin(void *pool, unsigned long handle)
 {
-	zbud_unmap(pool, handle);
+	zbud_unpin(pool, handle);
+}
+
+static void zbud_zpool_write(void *pool, unsigned long handle,
+			     void *handle_mem, size_t mem_len)
+{
+	memcpy((void *)handle, handle_mem, mem_len);
 }
 
 static u64 zbud_zpool_total_pages(void *pool)
@@ -417,14 +425,14 @@ static u64 zbud_zpool_total_pages(void *pool)
 
 static struct zpool_driver zbud_zpool_driver = {
 	.type =		"zbud",
-	.sleep_mapped = true,
 	.owner =	THIS_MODULE,
 	.create =	zbud_zpool_create,
 	.destroy =	zbud_zpool_destroy,
 	.malloc =	zbud_zpool_malloc,
 	.free =		zbud_zpool_free,
-	.map =		zbud_zpool_map,
-	.unmap =	zbud_zpool_unmap,
+	.pin =		zbud_zpool_pin,
+	.unpin =	zbud_zpool_unpin,
+	.write =	zbud_zpool_write,
 	.total_pages =	zbud_zpool_total_pages,
 };
 
diff --git a/mm/zpool.c b/mm/zpool.c
index b9fda1fa857d..304639959b90 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -13,6 +13,7 @@
 #include <linux/list.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
@@ -278,46 +279,53 @@ void zpool_free(struct zpool *zpool, unsigned long handle)
 }
 
 /**
- * zpool_map_handle() - Map a previously allocated handle into memory
+ * zpool_pin_handle() - Pin a previously allocated handle into memory
  * @zpool:	The zpool that the handle was allocated from
- * @handle:	The handle to map
- * @mapmode:	How the memory should be mapped
+ * @handle:	The handle to pin
+ * @sg:		2-entry scatterlist to store pointers to the memmory
  *
- * This maps a previously allocated handle into memory.  The @mapmode
- * param indicates to the implementation how the memory will be
- * used, i.e. read-only, write-only, read-write.  If the
- * implementation does not support it, the memory will be treated
- * as read-write.
+ * This pins a previously allocated handle into memory.
  *
  * This may hold locks, disable interrupts, and/or preemption,
- * and the zpool_unmap_handle() must be called to undo those
- * actions.  The code that uses the mapped handle should complete
- * its operations on the mapped handle memory quickly and unmap
- * as soon as possible.  As the implementation may use per-cpu
- * data, multiple handles should not be mapped concurrently on
- * any cpu.
- *
- * Returns: A pointer to the handle's mapped memory area.
+ * and the zpool_unpin_handle() must be called to undo those
+ * actions.  The code that uses the pinned handle should complete
+ * its operations on the pinned handle memory quickly and unpin
+ * as soon as possible.
  */
-void *zpool_map_handle(struct zpool *zpool, unsigned long handle,
-			enum zpool_mapmode mapmode)
+void zpool_pin_handle(struct zpool *zpool, unsigned long handle,
+		      struct scatterlist *sg)
 {
-	return zpool->driver->map(zpool->pool, handle, mapmode);
+	zpool->driver->pin(zpool->pool, handle, sg);
 }
 
 /**
- * zpool_unmap_handle() - Unmap a previously mapped handle
+ * zpool_unpin_handle() - Unpin a previously pinned handle
  * @zpool:	The zpool that the handle was allocated from
- * @handle:	The handle to unmap
+ * @handle:	The handle to unpin
  *
- * This unmaps a previously mapped handle.  Any locks or other
- * actions that the implementation took in zpool_map_handle()
+ * This unpins a previously pinned handle.  Any locks or other
+ * actions that the implementation took in zpool_pin_handle()
  * will be undone here.  The memory area returned from
- * zpool_map_handle() should no longer be used after this.
+ * zpool_pin_handle() should no longer be used after this.
  */
-void zpool_unmap_handle(struct zpool *zpool, unsigned long handle)
+void zpool_unpin_handle(struct zpool *zpool, unsigned long handle)
 {
-	zpool->driver->unmap(zpool->pool, handle);
+	zpool->driver->unpin(zpool->pool, handle);
+}
+
+/**
+ * zpool_write_handle() - Write to a previously allocated handle
+ * @zpool:	The zpool that the handle was allocated from
+ * @handle:	The handle to write
+ * @handle_mem:	Data to write from
+ * @mem_len:	Length of data to be written
+ *
+ * This writes data to a previously allocated handle.
+ */
+void zpool_write_handle(struct zpool *zpool, unsigned long handle,
+			void *handle_mem, size_t mem_len)
+{
+	zpool->driver->write(zpool->pool, handle, handle_mem, mem_len);
 }
 
 /**
@@ -333,23 +341,5 @@ u64 zpool_get_total_pages(struct zpool *zpool)
 	return zpool->driver->total_pages(zpool->pool);
 }
 
-/**
- * zpool_can_sleep_mapped - Test if zpool can sleep when do mapped.
- * @zpool:	The zpool to test
- *
- * Some allocators enter non-preemptible context in ->map() callback (e.g.
- * disable pagefaults) and exit that context in ->unmap(), which limits what
- * we can do with the mapped object. For instance, we cannot wait for
- * asynchronous crypto API to decompress such an object or take mutexes
- * since those will call into the scheduler. This function tells us whether
- * we use such an allocator.
- *
- * Returns: true if zpool can sleep; false otherwise.
- */
-bool zpool_can_sleep_mapped(struct zpool *zpool)
-{
-	return zpool->driver->sleep_mapped;
-}
-
 MODULE_AUTHOR("Dan Streetman <ddstreet@ieee.org>");
 MODULE_DESCRIPTION("Common API for compressed memory storage");
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63c99db71dc1..934b3be467e6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -23,6 +23,7 @@
  *	zspage->lock
  */
 
+#include <crypto/scatterwalk.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
@@ -49,6 +50,7 @@
 #include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/local_lock.h>
+#include <linux/scatterlist.h>
 #include "zpdesc.h"
 
 #define ZSPAGE_MAGIC	0x58
@@ -281,13 +283,6 @@ struct zspage {
 	struct zspage_lock zsl;
 };
 
-struct mapping_area {
-	local_lock_t lock;
-	char *vm_buf; /* copy buffer for objects that span pages */
-	char *vm_addr; /* address of kmap_local_page()'ed pages */
-	enum zs_mapmode vm_mm; /* mapping mode */
-};
-
 static void zspage_lock_init(struct zspage *zspage)
 {
 	static struct lock_class_key __key;
@@ -453,6 +448,10 @@ static void record_obj(unsigned long handle, unsigned long obj)
 
 #ifdef CONFIG_ZPOOL
 
+static int zs_pin_object(struct zs_pool *pool, unsigned long handle,
+			 struct scatterlist sg[2]);
+static void zs_unpin_object(struct zs_pool *pool, unsigned long handle);
+
 static void *zs_zpool_create(const char *name, gfp_t gfp)
 {
 	/*
@@ -482,29 +481,21 @@ static void zs_zpool_free(void *pool, unsigned long handle)
 	zs_free(pool, handle);
 }
 
-static void *zs_zpool_map(void *pool, unsigned long handle,
-			enum zpool_mapmode mm)
+static void zs_zpool_pin(void *pool, unsigned long handle,
+			 struct scatterlist sg[2])
 {
-	enum zs_mapmode zs_mm;
-
-	switch (mm) {
-	case ZPOOL_MM_RO:
-		zs_mm = ZS_MM_RO;
-		break;
-	case ZPOOL_MM_WO:
-		zs_mm = ZS_MM_WO;
-		break;
-	case ZPOOL_MM_RW:
-	default:
-		zs_mm = ZS_MM_RW;
-		break;
-	}
-
-	return zs_map_object(pool, handle, zs_mm);
+	zs_pin_object(pool, handle, sg);
 }
-static void zs_zpool_unmap(void *pool, unsigned long handle)
+
+static void zs_zpool_unpin(void *pool, unsigned long handle)
 {
-	zs_unmap_object(pool, handle);
+	zs_unpin_object(pool, handle);
+}
+
+static void zs_zpool_write(void *pool, unsigned long handle,
+			   void *handle_mem, size_t mem_len)
+{
+	zs_obj_write(pool, handle, handle_mem, mem_len);
 }
 
 static u64 zs_zpool_total_pages(void *pool)
@@ -520,19 +511,15 @@ static struct zpool_driver zs_zpool_driver = {
 	.malloc_support_movable = true,
 	.malloc =		  zs_zpool_malloc,
 	.free =			  zs_zpool_free,
-	.map =			  zs_zpool_map,
-	.unmap =		  zs_zpool_unmap,
+	.pin =			  zs_zpool_pin,
+	.unpin =		  zs_zpool_unpin,
+	.write =		  zs_zpool_write,
 	.total_pages =		  zs_zpool_total_pages,
 };
 
 MODULE_ALIAS("zpool-zsmalloc");
 #endif /* CONFIG_ZPOOL */
 
-/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = {
-	.lock	= INIT_LOCAL_LOCK(lock),
-};
-
 static inline bool __maybe_unused is_first_zpdesc(struct zpdesc *zpdesc)
 {
 	return PagePrivate(zpdesc_page(zpdesc));
@@ -1117,93 +1104,6 @@ static struct zspage *find_get_zspage(struct size_class *class)
 	return zspage;
 }
 
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm_buf)
-		return 0;
-	area->vm_buf = kmalloc(ZS_MAX_ALLOC_SIZE, GFP_KERNEL);
-	if (!area->vm_buf)
-		return -ENOMEM;
-	return 0;
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	kfree(area->vm_buf);
-	area->vm_buf = NULL;
-}
-
-static void *__zs_map_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
-{
-	size_t sizes[2];
-	char *buf = area->vm_buf;
-
-	/* disable page faults to match kmap_local_page() return conditions */
-	pagefault_disable();
-
-	/* no read fastpath */
-	if (area->vm_mm == ZS_MM_WO)
-		goto out;
-
-	sizes[0] = PAGE_SIZE - off;
-	sizes[1] = size - sizes[0];
-
-	/* copy object to per-cpu buffer */
-	memcpy_from_page(buf, zpdesc_page(zpdescs[0]), off, sizes[0]);
-	memcpy_from_page(buf + sizes[0], zpdesc_page(zpdescs[1]), 0, sizes[1]);
-out:
-	return area->vm_buf;
-}
-
-static void __zs_unmap_object(struct mapping_area *area,
-			struct zpdesc *zpdescs[2], int off, int size)
-{
-	size_t sizes[2];
-	char *buf;
-
-	/* no write fastpath */
-	if (area->vm_mm == ZS_MM_RO)
-		goto out;
-
-	buf = area->vm_buf;
-	buf = buf + ZS_HANDLE_SIZE;
-	size -= ZS_HANDLE_SIZE;
-	off += ZS_HANDLE_SIZE;
-
-	sizes[0] = PAGE_SIZE - off;
-	sizes[1] = size - sizes[0];
-
-	/* copy per-cpu buffer to object */
-	memcpy_to_page(zpdesc_page(zpdescs[0]), off, buf, sizes[0]);
-	memcpy_to_page(zpdesc_page(zpdescs[1]), 0, buf + sizes[0], sizes[1]);
-
-out:
-	/* enable page faults to match kunmap_local() return conditions */
-	pagefault_enable();
-}
-
-static int zs_cpu_prepare(unsigned int cpu)
-{
-	struct mapping_area *area;
-
-	area = &per_cpu(zs_map_area, cpu);
-	return __zs_cpu_up(area);
-}
-
-static int zs_cpu_dead(unsigned int cpu)
-{
-	struct mapping_area *area;
-
-	area = &per_cpu(zs_map_area, cpu);
-	__zs_cpu_down(area);
-	return 0;
-}
-
 static bool can_merge(struct size_class *prev, int pages_per_zspage,
 					int objs_per_zspage)
 {
@@ -1251,126 +1151,15 @@ unsigned long zs_get_total_pages(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_get_total_pages);
 
-/**
- * zs_map_object - get address of allocated object from handle.
- * @pool: pool from which the object was allocated
- * @handle: handle returned from zs_malloc
- * @mm: mapping mode to use
- *
- * Before using an object allocated from zs_malloc, it must be mapped using
- * this function. When done with the object, it must be unmapped using
- * zs_unmap_object.
- *
- * Only one object can be mapped per cpu at a time. There is no protection
- * against nested mappings.
- *
- * This function returns with preemption and page faults disabled.
- */
-void *zs_map_object(struct zs_pool *pool, unsigned long handle,
-			enum zs_mapmode mm)
-{
-	struct zspage *zspage;
-	struct zpdesc *zpdesc;
-	unsigned long obj, off;
-	unsigned int obj_idx;
-
-	struct size_class *class;
-	struct mapping_area *area;
-	struct zpdesc *zpdescs[2];
-	void *ret;
-
-	/*
-	 * Because we use per-cpu mapping areas shared among the
-	 * pools/users, we can't allow mapping in interrupt context
-	 * because it can corrupt another users mappings.
-	 */
-	BUG_ON(in_interrupt());
-
-	/* It guarantees it can get zspage from handle safely */
-	read_lock(&pool->lock);
-	obj = handle_to_obj(handle);
-	obj_to_location(obj, &zpdesc, &obj_idx);
-	zspage = get_zspage(zpdesc);
-
-	/*
-	 * migration cannot move any zpages in this zspage. Here, class->lock
-	 * is too heavy since callers would take some time until they calls
-	 * zs_unmap_object API so delegate the locking from class to zspage
-	 * which is smaller granularity.
-	 */
-	zspage_read_lock(zspage);
-	read_unlock(&pool->lock);
-
-	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
-
-	local_lock(&zs_map_area.lock);
-	area = this_cpu_ptr(&zs_map_area);
-	area->vm_mm = mm;
-	if (off + class->size <= PAGE_SIZE) {
-		/* this object is contained entirely within a page */
-		area->vm_addr = kmap_local_zpdesc(zpdesc);
-		ret = area->vm_addr + off;
-		goto out;
-	}
-
-	/* this object spans two pages */
-	zpdescs[0] = zpdesc;
-	zpdescs[1] = get_next_zpdesc(zpdesc);
-	BUG_ON(!zpdescs[1]);
-
-	ret = __zs_map_object(area, zpdescs, off, class->size);
-out:
-	if (likely(!ZsHugePage(zspage)))
-		ret += ZS_HANDLE_SIZE;
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(zs_map_object);
-
-void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
-{
-	struct zspage *zspage;
-	struct zpdesc *zpdesc;
-	unsigned long obj, off;
-	unsigned int obj_idx;
-
-	struct size_class *class;
-	struct mapping_area *area;
-
-	obj = handle_to_obj(handle);
-	obj_to_location(obj, &zpdesc, &obj_idx);
-	zspage = get_zspage(zpdesc);
-	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
-
-	area = this_cpu_ptr(&zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
-		kunmap_local(area->vm_addr);
-	else {
-		struct zpdesc *zpdescs[2];
-
-		zpdescs[0] = zpdesc;
-		zpdescs[1] = get_next_zpdesc(zpdesc);
-		BUG_ON(!zpdescs[1]);
-
-		__zs_unmap_object(area, zpdescs, off, class->size);
-	}
-	local_unlock(&zs_map_area.lock);
-
-	zspage_read_unlock(zspage);
-}
-EXPORT_SYMBOL_GPL(zs_unmap_object);
-
-void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
-			void *local_copy)
+static int zs_pin_object(struct zs_pool *pool, unsigned long handle,
+			 struct scatterlist sg[2])
 {
+	int handle_size = ZS_HANDLE_SIZE;
 	struct zspage *zspage;
 	struct zpdesc *zpdesc;
 	unsigned long obj, off;
 	unsigned int obj_idx;
 	struct size_class *class;
-	void *addr;
 
 	/* Guarantee we can get zspage from handle safely */
 	read_lock(&pool->lock);
@@ -1385,33 +1174,56 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
 	class = zspage_class(pool, zspage);
 	off = offset_in_page(class->size * obj_idx);
 
+	if (ZsHugePage(zspage))
+		handle_size = 0;
+
 	if (off + class->size <= PAGE_SIZE) {
 		/* this object is contained entirely within a page */
-		addr = kmap_local_zpdesc(zpdesc);
-		addr += off;
+		sg_init_table(sg, 1);
+		sg_set_page(sg, zpdesc_page(zpdesc),
+			    class->size - handle_size, off + handle_size);
 	} else {
 		size_t sizes[2];
 
 		/* this object spans two pages */
 		sizes[0] = PAGE_SIZE - off;
 		sizes[1] = class->size - sizes[0];
-		addr = local_copy;
 
-		memcpy_from_page(addr, zpdesc_page(zpdesc),
-				 off, sizes[0]);
+		sg_init_table(sg, 2);
+		sg_set_page(sg, zpdesc_page(zpdesc), sizes[0] - handle_size,
+			    off + handle_size);
 		zpdesc = get_next_zpdesc(zpdesc);
-		memcpy_from_page(addr + sizes[0],
-				 zpdesc_page(zpdesc),
-				 0, sizes[1]);
+		sg_set_page(&sg[1], zpdesc_page(zpdesc), sizes[1], 0);
 	}
 
-	if (!ZsHugePage(zspage))
-		addr += ZS_HANDLE_SIZE;
+	return class->size - handle_size;
+}
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+			void *local_copy)
+{
+	struct scatterlist sg[2];
+	void *addr;
+	int len;
+
+	len = zs_pin_object(pool, handle, sg);
+	if (sg_is_last(sg)) {
+		addr = kmap_local_page(sg_page(sg));
+		addr += sg[0].offset;;
+	} else {
+		addr = local_copy;
+		memcpy_from_sglist(addr, sg, 0, len);
+	}
 
 	return addr;
 }
 EXPORT_SYMBOL_GPL(zs_obj_read_begin);
 
+static void zs_unpin_object(struct zs_pool *pool, unsigned long handle)
+{
+	zs_obj_read_end(pool, handle, NULL);
+}
+
 void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
 		     void *handle_mem)
 {
@@ -1427,7 +1239,7 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
 	class = zspage_class(pool, zspage);
 	off = offset_in_page(class->size * obj_idx);
 
-	if (off + class->size <= PAGE_SIZE) {
+	if (handle_mem && off + class->size <= PAGE_SIZE) {
 		if (!ZsHugePage(zspage))
 			off += ZS_HANDLE_SIZE;
 		handle_mem -= off;
@@ -1441,49 +1253,11 @@ EXPORT_SYMBOL_GPL(zs_obj_read_end);
 void zs_obj_write(struct zs_pool *pool, unsigned long handle,
 		  void *handle_mem, size_t mem_len)
 {
-	struct zspage *zspage;
-	struct zpdesc *zpdesc;
-	unsigned long obj, off;
-	unsigned int obj_idx;
-	struct size_class *class;
+	struct scatterlist sg[2];
 
-	/* Guarantee we can get zspage from handle safely */
-	read_lock(&pool->lock);
-	obj = handle_to_obj(handle);
-	obj_to_location(obj, &zpdesc, &obj_idx);
-	zspage = get_zspage(zpdesc);
-
-	/* Make sure migration doesn't move any pages in this zspage */
-	zspage_read_lock(zspage);
-	read_unlock(&pool->lock);
-
-	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
-
-	if (off + class->size <= PAGE_SIZE) {
-		/* this object is contained entirely within a page */
-		void *dst = kmap_local_zpdesc(zpdesc);
-
-		if (!ZsHugePage(zspage))
-			off += ZS_HANDLE_SIZE;
-		memcpy(dst + off, handle_mem, mem_len);
-		kunmap_local(dst);
-	} else {
-		/* this object spans two pages */
-		size_t sizes[2];
-
-		off += ZS_HANDLE_SIZE;
-		sizes[0] = PAGE_SIZE - off;
-		sizes[1] = mem_len - sizes[0];
-
-		memcpy_to_page(zpdesc_page(zpdesc), off,
-			       handle_mem, sizes[0]);
-		zpdesc = get_next_zpdesc(zpdesc);
-		memcpy_to_page(zpdesc_page(zpdesc), 0,
-			       handle_mem + sizes[0], sizes[1]);
-	}
-
-	zspage_read_unlock(zspage);
+	zs_pin_object(pool, handle, sg);
+	memcpy_to_sglist(sg, 0, handle_mem, mem_len);
+	zs_unpin_object(pool, handle);
 }
 EXPORT_SYMBOL_GPL(zs_obj_write);
 
@@ -2465,13 +2239,6 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
 
 static int __init zs_init(void)
 {
-	int ret;
-
-	ret = cpuhp_setup_state(CPUHP_MM_ZS_PREPARE, "mm/zsmalloc:prepare",
-				zs_cpu_prepare, zs_cpu_dead);
-	if (ret)
-		goto out;
-
 #ifdef CONFIG_ZPOOL
 	zpool_register_driver(&zs_zpool_driver);
 #endif
@@ -2479,9 +2246,6 @@ static int __init zs_init(void)
 	zs_stat_init();
 
 	return 0;
-
-out:
-	return ret;
 }
 
 static void __exit zs_exit(void)
@@ -2489,7 +2253,6 @@ static void __exit zs_exit(void)
 #ifdef CONFIG_ZPOOL
 	zpool_unregister_driver(&zs_zpool_driver);
 #endif
-	cpuhp_remove_state(CPUHP_MM_ZS_PREPARE);
 
 	zs_stat_exit();
 }
diff --git a/mm/zswap.c b/mm/zswap.c
index 6504174fbc6a..74252187d763 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -147,7 +147,6 @@ struct crypto_acomp_ctx {
 	struct crypto_wait wait;
 	u8 *buffer;
 	struct mutex mutex;
-	bool is_sleepable;
 };
 
 /*
@@ -865,7 +864,6 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 
 	acomp_ctx->buffer = buffer;
 	acomp_ctx->acomp = acomp;
-	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 	acomp_ctx->req = req;
 	mutex_unlock(&acomp_ctx->mutex);
 	return 0;
@@ -930,7 +928,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	unsigned int dlen = PAGE_SIZE;
 	unsigned long handle;
 	struct zpool *zpool;
-	char *buf;
 	gfp_t gfp;
 	u8 *dst;
 
@@ -972,9 +969,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	if (alloc_ret)
 		goto unlock;
 
-	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
-	memcpy(buf, dst, dlen);
-	zpool_unmap_handle(zpool, handle);
+	zpool_write_handle(zpool, handle, dst, dlen);
 
 	entry->handle = handle;
 	entry->length = dlen;
@@ -994,37 +989,19 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 {
 	struct zpool *zpool = entry->pool->zpool;
-	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src;
+	struct scatterlist input[2];
+	struct scatterlist output;
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	/*
-	 * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
-	 * to do crypto_acomp_decompress() which might sleep. In such cases, we must
-	 * resort to copying the buffer to a temporary one.
-	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped buffer,
-	 * such as a kmap address of high memory or even ever a vmap address.
-	 * However, sg_init_one is only equipped to handle linearly mapped low memory.
-	 * In such cases, we also must copy the buffer to a temporary and lowmem one.
-	 */
-	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
-	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
-		zpool_unmap_handle(zpool, entry->handle);
-	}
-
-	sg_init_one(&input, src, entry->length);
+	zpool_pin_handle(zpool, entry->handle, input);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->req, input, &output, entry->length, PAGE_SIZE);
 	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
-		zpool_unmap_handle(zpool, entry->handle);
+	zpool_unpin_handle(zpool, entry->handle);
 	acomp_ctx_put_unlock(acomp_ctx);
 }
 

Thanks,

Herbert Xu March 5, 2025, 3:45 a.m. UTC | #21

On Tue, Mar 04, 2025 at 08:47:48PM +0000, Yosry Ahmed wrote:
>
> Yeah I have the same feeling that the handling is all over the place.
> Also, we don't want to introduce new map APIs, so anything we do for
> zswap should ideally work for zram.

I will be getting to zram next.  AFAIK all that's missing from
acomp is parameter support.  Once that is added we can convert
zram over to acomp and get rid of zcomp altogether.

> IIUC, what Herbert is suggesting is that we rework all of this to use SG
> lists to reduce copies, but I am not sure which copies can go away? We
> have one copy in the compression path that probably cannot go away.
> After the zsmalloc changes (and ignoring highmem), we have one copy in
> the decompression path for when objects span two pages. I think this
> will still happen with SG lists, except internally in the crypto API.

It's the decompression copy when the object spans two pages that
will disappear.  Because I have added SG support to LZO:

commit a81b9ed5287424aa7d6c191fca7019820fc1d130
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Sun Mar 2 13:56:22 2025 +0800

    crypto: lib/lzo - Add decompression scatterlist support
    
    Add lzo1x_decompress_safe_sg which handles a scatterlist as its
    input.  This is useful as pages often compress into large objects
    that straddle page boundaries so it takes extra effort to linearise
    them in the face of memory fragmentation.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/lzo.h b/include/linux/lzo.h
index 4d30e3624acd..f3686ec4aa84 100644
--- a/include/linux/lzo.h
+++ b/include/linux/lzo.h
@@ -1,6 +1,11 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __LZO_H__
 #define __LZO_H__
+
+#include <linux/types.h>
+
+struct scatterlist;
+
 /*
  *  LZO Public Kernel Interface
  *  A mini subset of the LZO real-time data compression library
@@ -40,6 +45,10 @@ int lzorle1x_1_compress_safe(const unsigned char *src, size_t src_len,
 int lzo1x_decompress_safe(const unsigned char *src, size_t src_len,
 			  unsigned char *dst, size_t *dst_len);
 
+/* decompression with source SG list */
+int lzo1x_decompress_safe_sg(struct scatterlist *src, size_t src_len,
+			     unsigned char *dst, size_t *dst_len);
+
 /*
  * Return values (< 0 = Error)
  */
diff --git a/lib/lzo/Makefile b/lib/lzo/Makefile
index fc7b2b7ef4b2..276a7246af72 100644
--- a/lib/lzo/Makefile
+++ b/lib/lzo/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lzo_compress-objs := lzo1x_compress.o lzo1x_compress_safe.o
-lzo_decompress-objs := lzo1x_decompress_safe.o
+lzo_decompress-objs := lzo1x_decompress_safe.o lzo1x_decompress_safe_sg.o
 
 obj-$(CONFIG_LZO_COMPRESS) += lzo_compress.o
 obj-$(CONFIG_LZO_DECOMPRESS) += lzo_decompress.o
diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
index c94f4928e188..e1da9725e33b 100644
--- a/lib/lzo/lzo1x_decompress_safe.c
+++ b/lib/lzo/lzo1x_decompress_safe.c
@@ -12,17 +12,26 @@
  *  Richard Purdie <rpurdie@openedhand.com>
  */
 
-#ifndef STATIC
+#include <linux/lzo.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
-#endif
-#include <linux/unaligned.h>
-#include <linux/lzo.h>
+#include <linux/scatterlist.h>
 #include "lzodefs.h"
 
-#define HAVE_IP(x)      ((size_t)(ip_end - ip) >= (size_t)(x))
-#define HAVE_OP(x)      ((size_t)(op_end - op) >= (size_t)(x))
+#undef INPUT_IS_LINEAR
+#ifndef HAVE_IP
+#define IP_LEFT()	(in + in_len - ip)
+#define HAVE_IP(x)      ((size_t)(in + in_len - ip) >= (size_t)(x))
 #define NEED_IP(x)      if (!HAVE_IP(x)) goto input_overrun
+#define GET_IP()	ip = in
+#define CHECK_IP()	NEED_IP(1)
+#define PUT_IP()	do {} while (0)
+#define INPUT		const unsigned char *in
+#define LZO_SG(name)	name
+#define INPUT_IS_LINEAR	1
+#endif
+
+#define HAVE_OP(x)      ((size_t)(op_end - op) >= (size_t)(x))
 #define NEED_OP(x)      if (!HAVE_OP(x)) goto output_overrun
 #define TEST_LB(m_pos)  if ((m_pos) < out) goto lookbehind_overrun
 
@@ -36,34 +45,34 @@
  */
 #define MAX_255_COUNT      ((((size_t)~0) / 255) - 2)
 
-int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
-			  unsigned char *out, size_t *out_len)
+int LZO_SG(lzo1x_decompress_safe)(INPUT, size_t in_len,
+				  unsigned char *out, size_t *out_len)
 {
+	struct sg_mapping_iter miter __maybe_unused;
 	unsigned char *op;
 	const unsigned char *ip;
 	size_t t, next;
 	size_t state = 0;
 	const unsigned char *m_pos;
-	const unsigned char * const ip_end = in + in_len;
 	unsigned char * const op_end = out + *out_len;
-
 	unsigned char bitstream_version;
+	int err;
 
 	op = out;
-	ip = in;
-
-	if (unlikely(in_len < 3))
-		goto input_overrun;
+	GET_IP();
 
 	if (likely(in_len >= 5) && likely(*ip == 17)) {
-		bitstream_version = ip[1];
-		ip += 2;
+		ip++;
+		CHECK_IP();
+		bitstream_version = *ip++;
+		CHECK_IP();
 	} else {
 		bitstream_version = 0;
 	}
 
 	if (*ip > 17) {
 		t = *ip++ - 17;
+		CHECK_IP();
 		if (t < 4) {
 			next = t;
 			goto match_next;
@@ -73,22 +82,23 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 
 	for (;;) {
 		t = *ip++;
+		CHECK_IP();
 		if (t < 16) {
 			if (likely(state == 0)) {
 				if (unlikely(t == 0)) {
-					size_t offset;
-					const unsigned char *ip_last = ip;
+					size_t offset = 0;
 
 					while (unlikely(*ip == 0)) {
 						ip++;
-						NEED_IP(1);
+						CHECK_IP();
+						offset++;
 					}
-					offset = ip - ip_last;
 					if (unlikely(offset > MAX_255_COUNT))
 						return LZO_E_ERROR;
 
 					offset = (offset << 8) - offset;
 					t += offset + 15 + *ip++;
+					CHECK_IP();
 				}
 				t += 3;
 copy_literal_run:
@@ -110,9 +120,9 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 #endif
 				{
 					NEED_OP(t);
-					NEED_IP(t + 3);
 					do {
 						*op++ = *ip++;
+						CHECK_IP();
 					} while (--t > 0);
 				}
 				state = 4;
@@ -122,6 +132,7 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 				m_pos = op - 1;
 				m_pos -= t >> 2;
 				m_pos -= *ip++ << 2;
+				CHECK_IP();
 				TEST_LB(m_pos);
 				NEED_OP(2);
 				op[0] = m_pos[0];
@@ -133,6 +144,7 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 				m_pos = op - (1 + M2_MAX_OFFSET);
 				m_pos -= t >> 2;
 				m_pos -= *ip++ << 2;
+				CHECK_IP();
 				t = 3;
 			}
 		} else if (t >= 64) {
@@ -140,45 +152,48 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 			m_pos = op - 1;
 			m_pos -= (t >> 2) & 7;
 			m_pos -= *ip++ << 3;
+			CHECK_IP();
 			t = (t >> 5) - 1 + (3 - 1);
 		} else if (t >= 32) {
 			t = (t & 31) + (3 - 1);
 			if (unlikely(t == 2)) {
-				size_t offset;
-				const unsigned char *ip_last = ip;
+				size_t offset = 0;
 
 				while (unlikely(*ip == 0)) {
 					ip++;
-					NEED_IP(1);
+					CHECK_IP();
+					offset++;
 				}
-				offset = ip - ip_last;
 				if (unlikely(offset > MAX_255_COUNT))
 					return LZO_E_ERROR;
 
 				offset = (offset << 8) - offset;
 				t += offset + 31 + *ip++;
-				NEED_IP(2);
+				CHECK_IP();
 			}
 			m_pos = op - 1;
-			next = get_unaligned_le16(ip);
-			ip += 2;
+			next = *ip++;
+			CHECK_IP();
+			next += *ip++ << 8;
+			CHECK_IP();
 			m_pos -= next >> 2;
 			next &= 3;
 		} else {
-			NEED_IP(2);
-			next = get_unaligned_le16(ip);
+			next = *ip++;
+			CHECK_IP();
+			next += *ip++ << 8;
 			if (((next & 0xfffc) == 0xfffc) &&
 			    ((t & 0xf8) == 0x18) &&
 			    likely(bitstream_version)) {
-				NEED_IP(3);
 				t &= 7;
-				t |= ip[2] << 3;
+				CHECK_IP();
+				t |= *ip++ << 3;
+				CHECK_IP();
 				t += MIN_ZERO_RUN_LENGTH;
 				NEED_OP(t);
 				memset(op, 0, t);
 				op += t;
 				next &= 3;
-				ip += 3;
 				goto match_next;
 			} else {
 				m_pos = op;
@@ -186,26 +201,43 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 				t = (t & 7) + (3 - 1);
 				if (unlikely(t == 2)) {
 					size_t offset;
-					const unsigned char *ip_last = ip;
+					size_t tip;
 
-					while (unlikely(*ip == 0)) {
-						ip++;
-						NEED_IP(1);
+					CHECK_IP();
+					if (!next) {
+						offset = 2;
+						while (unlikely(*ip == 0)) {
+							ip++;
+							CHECK_IP();
+							offset++;
+						}
+
+						tip = *ip++;
+						CHECK_IP();
+						next = *ip++;
+						CHECK_IP();
+					} else if (!(next & 0xff)) {
+						offset = 1;
+						tip = next >> 8;
+						next = *ip++;
+						CHECK_IP();
+					} else {
+						offset = 0;
+						tip = next & 0xff;
+						next >>= 8;
 					}
-					offset = ip - ip_last;
 					if (unlikely(offset > MAX_255_COUNT))
 						return LZO_E_ERROR;
 
 					offset = (offset << 8) - offset;
-					t += offset + 7 + *ip++;
-					NEED_IP(2);
-					next = get_unaligned_le16(ip);
+					t += offset + 7 + tip;
+					next += *ip++ << 8;
 				}
-				ip += 2;
 				m_pos -= next >> 2;
 				next &= 3;
 				if (m_pos == op)
 					goto eof_found;
+				CHECK_IP();
 				m_pos -= 0x4000;
 			}
 		}
@@ -260,36 +292,42 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 		} else
 #endif
 		{
-			NEED_IP(t + 3);
 			NEED_OP(t);
 			while (t > 0) {
 				*op++ = *ip++;
+				CHECK_IP();
 				t--;
 			}
 		}
 	}
 
 eof_found:
-	*out_len = op - out;
-	return (t != 3       ? LZO_E_ERROR :
-		ip == ip_end ? LZO_E_OK :
-		ip <  ip_end ? LZO_E_INPUT_NOT_CONSUMED : LZO_E_INPUT_OVERRUN);
+	err = t != 3		? LZO_E_ERROR :
+	      !IP_LEFT()	? LZO_E_OK :
+	      IP_LEFT() > 0	? LZO_E_INPUT_NOT_CONSUMED :
+				  LZO_E_INPUT_OVERRUN;
+	goto out;
 
 input_overrun:
-	*out_len = op - out;
-	return LZO_E_INPUT_OVERRUN;
+	err = LZO_E_INPUT_OVERRUN;
+	goto out;
 
 output_overrun:
-	*out_len = op - out;
-	return LZO_E_OUTPUT_OVERRUN;
+	err = LZO_E_OUTPUT_OVERRUN;
+	goto out;
 
 lookbehind_overrun:
-	*out_len = op - out;
-	return LZO_E_LOOKBEHIND_OVERRUN;
-}
-#ifndef STATIC
-EXPORT_SYMBOL_GPL(lzo1x_decompress_safe);
+	err = LZO_E_LOOKBEHIND_OVERRUN;
+	goto out;
 
+out:
+	PUT_IP();
+	*out_len = op - out;
+	return err;
+}
+EXPORT_SYMBOL_GPL(LZO_SG(lzo1x_decompress_safe));
+
+#ifdef INPUT_IS_LINEAR
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("LZO1X Decompressor");
 
diff --git a/lib/lzo/lzo1x_decompress_safe_sg.c b/lib/lzo/lzo1x_decompress_safe_sg.c
new file mode 100644
index 000000000000..7312ac1b9412
--- /dev/null
+++ b/lib/lzo/lzo1x_decompress_safe_sg.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  LZO1X Decompressor from LZO
+ *
+ *  Copyright (C) 1996-2012 Markus F.X.J. Oberhumer <markus@oberhumer.com>
+ *
+ *  The full LZO package can be found at:
+ *  http://www.oberhumer.com/opensource/lzo/
+ *
+ *  Changed for Linux kernel use by:
+ *  Nitin Gupta <nitingupta910@gmail.com>
+ *  Richard Purdie <rpurdie@openedhand.com>
+ */
+
+#include <linux/scatterlist.h>
+#include <linux/types.h>
+
+#define IP_LEFT()	((u8 *)miter.addr + miter.length - ip)
+#define HAVE_IP(x)	((size_t)((u8 *)miter.addr + miter.length - ip) >= (size_t)(x))
+#define GET_IP()	do { \
+				sg_miter_start(&miter, sg, sg_nents(sg), SG_MITER_ATOMIC); \
+				if (!lzo_sg_miter_next(&miter)) \
+					goto input_overrun; \
+				ip = miter.addr; \
+			} while (0)
+#define PUT_IP()	sg_miter_stop(&miter)
+#define CHECK_IP()	do { \
+				if (!HAVE_IP(1)) { \
+					if (!lzo_sg_miter_next(&miter)) \
+						goto input_overrun; \
+					ip = miter.addr; \
+				} \
+			} while (0)
+
+#define INPUT		struct scatterlist *sg
+#define LZO_SG(name)	name##_sg
+
+static bool lzo_sg_miter_next(struct sg_mapping_iter *miter)
+{
+	do {
+		if (!sg_miter_next(miter))
+			return false;
+	} while (!miter->length);
+
+	return true;
+}
+
+#include "lzo1x_decompress_safe.c"
diff --git a/lib/lzo/lzodefs.h b/lib/lzo/lzodefs.h
index b60851fcf6ce..5d5a029983e6 100644
--- a/lib/lzo/lzodefs.h
+++ b/lib/lzo/lzodefs.h
@@ -12,6 +12,7 @@
  *  Richard Purdie <rpurdie@openedhand.com>
  */
 
+#include <linux/unaligned.h>
 
 /* Version
  * 0: original lzo version

Cheers,

Herbert Xu March 5, 2025, 7:41 a.m. UTC | #22

On Wed, Mar 05, 2025 at 06:18:25AM +0000, Yosry Ahmed wrote:
>
> I think there are other motivations for zcomp. Nhat was actually talking
> about switch zswap to use zcomp for other reasons. Please see this
> thread:
> https://lore.kernel.org/lkml/CAKEwX=O8zQj3Vj=2G6aCjK7e2DDs+VBUhRd25AefTdcvFOT-=A@mail.gmail.com/.

The only reason I saw was the support for algorithm parameters.
Yes that will of course be added to crypto_acomp before I attempt
to replace zcomp.

Cheers,

Herbert Xu March 5, 2025, 7:46 a.m. UTC | #23

On Wed, Mar 05, 2025 at 06:17:01AM +0000, Yosry Ahmed wrote:
>
> Actually I just sent out a series that I had sitting in my local tree
> for a bit to complete Sergey's work and completely remove the map/unmap
> APIs:
> https://lore.kernel.org/lkml/20250305061134.4105762-1-yosry.ahmed@linux.dev/.

Looks good to me!
 
> I am not objecting to switch the API to use SG lists if we intend to
> switch multiple compression algorithms to use them and will completely
> switch to using SG-based APIs in both zswap and zram. But I don't want
> us to have two separate interfaces please.

Fair enough.  I will wait until crypto_acomp can replace zcomp
before posting more SG list changes to zswap.

> Also, please take a look at patch 2 in this series for another reason, I
> want to make sure if your virtual address series can be used to remove
> the !virt_addr_valid() memcpy() case completely.

Yes it should work provided that you specify the memory as nondma.

Cheers,

Herbert Xu March 5, 2025, 2:10 p.m. UTC | #24

On Wed, Mar 05, 2025 at 03:46:18PM +0800, Herbert Xu wrote:
>
> > Also, please take a look at patch 2 in this series for another reason, I
> > want to make sure if your virtual address series can be used to remove
> > the !virt_addr_valid() memcpy() case completely.
> 
> Yes it should work provided that you specify the memory as nondma.

Actually you can do better than that, specify the memory as nondma
if IS_ENABLED(CONFIG_HIGHMEM), and otherwise as virt.

The idea is that we only have two hardware offload drivers, both
from Intel and they're probably not going to be used on platforms
with HIGHMEM.

So something like:

	if (IS_ENABLED(CONFIG_HIGHMEM))
		acomp_request_set_nondma(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
	else
		acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);

Of course all this would disappear if we used SG lists properly.

Cheers,

Yosry Ahmed March 5, 2025, 4:25 p.m. UTC | #25

On Wed, Mar 05, 2025 at 10:10:46PM +0800, Herbert Xu wrote:
> On Wed, Mar 05, 2025 at 03:46:18PM +0800, Herbert Xu wrote:
> >
> > > Also, please take a look at patch 2 in this series for another reason, I
> > > want to make sure if your virtual address series can be used to remove
> > > the !virt_addr_valid() memcpy() case completely.
> > 
> > Yes it should work provided that you specify the memory as nondma.
> 
> Actually you can do better than that, specify the memory as nondma
> if IS_ENABLED(CONFIG_HIGHMEM), and otherwise as virt.
> 
> The idea is that we only have two hardware offload drivers, both
> from Intel and they're probably not going to be used on platforms
> with HIGHMEM.
> 
> So something like:
> 
> 	if (IS_ENABLED(CONFIG_HIGHMEM))
> 		acomp_request_set_nondma(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);
> 	else
> 		acomp_request_set_virt(acomp_ctx->req, src, dst, entry->length, PAGE_SIZE);

Well, ideally it would be based on whether the address itself is a
highmem address or not, it may not be, even if CONFIG_HIGHMEM is
enabled.

> 
> Of course all this would disappear if we used SG lists properly.

Zswap is already using an SG list when calling into the crypto API. The
problem is that SGs (i.e. sg_init_one()) does not support kmap highmem
addresses. Is there a fundamental reason this can't happen, or is it
just sg_set_bug() using virt_to_page().

Also, since the crypto API is using SG lists internally as far as I can
tell, how does acomp_request_set_nondma() essentially deal with this? I
don't understand why we need to use a separate nondma API for highmem.

Nhat Pham March 5, 2025, 5:07 p.m. UTC | #26

On Tue, Mar 4, 2025 at 11:41 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Mar 05, 2025 at 06:18:25AM +0000, Yosry Ahmed wrote:
> >
> > I think there are other motivations for zcomp. Nhat was actually talking
> > about switch zswap to use zcomp for other reasons. Please see this
> > thread:
> > https://lore.kernel.org/lkml/CAKEwX=O8zQj3Vj=2G6aCjK7e2DDs+VBUhRd25AefTdcvFOT-=A@mail.gmail.com/.
>
> The only reason I saw was the support for algorithm parameters.
> Yes that will of course be added to crypto_acomp before I attempt
> to replace zcomp.

For the record, that's also the only reason why I was thinking about
it. :) I have no passion for zcomp or anything - as long as we support
all the cases (hardware acceleration/offloading, algorithms
parameters, etc.), I'm happy :)

Thanks for the hard work, Herbert, and I look forward to seeing all of
this work.

Herbert Xu March 6, 2025, 12:40 a.m. UTC | #27

On Wed, Mar 05, 2025 at 04:25:44PM +0000, Yosry Ahmed wrote:
>
> Zswap is already using an SG list when calling into the crypto API. The
> problem is that SGs (i.e. sg_init_one()) does not support kmap highmem
> addresses. Is there a fundamental reason this can't happen, or is it
> just sg_set_bug() using virt_to_page().

The whole point of SG lists is so that you don't need to kmap it
until the data is actually accessed.  Putting kmapped memory into
the SG lists defeats the purpose.

> Also, since the crypto API is using SG lists internally as far as I can
> tell, how does acomp_request_set_nondma() essentially deal with this? I
> don't understand why we need to use a separate nondma API for highmem.

I will post another acomp update soon where the hardware offload
path (the only one that can't deal with nondma) will simply fall
back to software compression if it sees a nondma address.

Cheers,

Sergey Senozhatsky March 6, 2025, 2:48 a.m. UTC | #28

On (25/03/05 09:07), Nhat Pham wrote:
> On Tue, Mar 4, 2025 at 11:41 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Wed, Mar 05, 2025 at 06:18:25AM +0000, Yosry Ahmed wrote:
> > >
> > > I think there are other motivations for zcomp. Nhat was actually talking
> > > about switch zswap to use zcomp for other reasons. Please see this
> > > thread:
> > > https://lore.kernel.org/lkml/CAKEwX=O8zQj3Vj=2G6aCjK7e2DDs+VBUhRd25AefTdcvFOT-=A@mail.gmail.com/.
> >
> > The only reason I saw was the support for algorithm parameters.
> > Yes that will of course be added to crypto_acomp before I attempt
> > to replace zcomp.
> 
> For the record, that's also the only reason why I was thinking about
> it. :) I have no passion for zcomp or anything - as long as we support
> all the cases (hardware acceleration/offloading, algorithms
> parameters, etc.), I'm happy :)
> 
> Thanks for the hard work, Herbert, and I look forward to seeing all of
> this work.

zcomp arrived at the right time and served its purpose.

Back in the days, when I started adding params to comp algos, zram was
still using *legacy* crypto (scomp?) API and Herbert made it clear that
parameters would be added only to a new acomp API, which was a blocker for
zram (zram by design did not support anything async or even sleepable).
So the decision was to drop scomp from zram (this should have happened
sooner or later anyway), enable parameters support (so that we could start
playing around with acceleration levels, user C/D dicts, etc.) and begin
preparing zram for async API.  The last part turned up to be a little more
complicated than was anticipated (as usual), but now we are reaching the
point [1] when zram and zsmalloc become async ready.

With this we can start moving parameters support to acomp, switch zram
to acomp and sunset zcomp.

[1] https://lore.kernel.org/linux-mm/20250303022425.285971-1-senozhatsky@chromium.org

Yosry Ahmed March 6, 2025, 4:58 p.m. UTC | #29

On Thu, Mar 06, 2025 at 08:40:27AM +0800, Herbert Xu wrote:
> On Wed, Mar 05, 2025 at 04:25:44PM +0000, Yosry Ahmed wrote:
> >
> > Zswap is already using an SG list when calling into the crypto API. The
> > problem is that SGs (i.e. sg_init_one()) does not support kmap highmem
> > addresses. Is there a fundamental reason this can't happen, or is it
> > just sg_set_bug() using virt_to_page().
> 
> The whole point of SG lists is so that you don't need to kmap it
> until the data is actually accessed.  Putting kmapped memory into
> the SG lists defeats the purpose.

I saw your patch in the other thread and I like just passing a SG list
from zsmalloc to zswap, and passing it as-is to the crypto API. The
problem of virt vs highmem addresses organically goes away with that.

Thanks.

[RFC,7/7] mm: zswap: Use acomp virtual address interface

Commit Message

Comments

Patch