diff mbox series

zsmalloc: do not use bit_spin_lock

Message ID 20201220002228.38697-1-vitaly.wool@konsulko.com (mailing list archive)
State New, archived
Headers show
Series zsmalloc: do not use bit_spin_lock | expand

Commit Message

Vitaly Wool Dec. 20, 2020, 12:22 a.m. UTC
zsmalloc takes bit spinlock in its _map() callback and releases it
only in unmap() which is unsafe and leads to zswap complaining
about scheduling in atomic context.

To fix that and to improve RT properties of zsmalloc, remove that
bit spinlock completely and use a bit flag instead.

Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
---
 mm/zsmalloc.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox Dec. 20, 2020, 1:18 a.m. UTC | #1
On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
> 
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

Isn't this just "I open coded bit spinlock to make the lockdep
warnings go away"?
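For context, bit_spin_lock() is itself little more than a test_and_set_bit() loop, but one that keeps preemption disabled for as long as the bit is held. A simplified sketch of the helpers in include/linux/bit_spinlock.h (the !SMP and debug variants are omitted):

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	preempt_disable();	/* stays disabled while the bit is held */
	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
		preempt_enable();
		do {
			cpu_relax();
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
}

static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
{
	clear_bit_unlock(bitnum, addr);
	preempt_enable();	/* only re-enabled at unlock time */
}

Seen next to the pin_tag()/unpin_tag() replacement in the patch, the main functional difference is that the patch re-enables preemption as soon as the bit has been set.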
Mike Galbraith Dec. 20, 2020, 1:23 a.m. UTC | #2
On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

It also does get_cpu_var() in map(), put_cpu_var() in unmap().

	-Mike
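For reference, get_cpu_var() disables preemption until the matching put_cpu_var(), so the per-CPU mapping area keeps preemption off across the whole map()/unmap() window independently of the bit spinlock. Roughly, simplified from include/linux/percpu-defs.h:

#define get_cpu_var(var)				\
(*({						\
	preempt_disable();			\
	this_cpu_ptr(&var);			\
}))

#define put_cpu_var(var)				\
do {						\
	(void)&(var);				\
	preempt_enable();			\
} while (0)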
Mike Galbraith Dec. 20, 2020, 1:56 a.m. UTC | #3
On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.


> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
>  {
> -	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	preempt_disable();
> +	while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> +		cpu_relax();
> +	preempt_enable();
>  }

If try doesn't need to disable preemption, neither does pin.

	-Mike
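In other words, since trypin_tag() is a bare test_and_set_bit() with no preemption fiddling, the pinning loop could shrink to something like the sketch below; this is presumably the "preempt disabled spin which I whacked" that Mike mentions later in the thread (a sketch, not a posted patch):

static void pin_tag(unsigned long handle)
{
	while (test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
		cpu_relax();
}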
Mike Galbraith Dec. 20, 2020, 4:11 a.m. UTC | #4
On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

Bah, I forgot to mention the config-dependent rwlock; it's held across
map()/unmap() as well, so there are two more hurdles, not one.

	-Mike
Vitaly Wool Dec. 20, 2020, 7:21 a.m. UTC | #5
On Sun, Dec 20, 2020 at 2:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> Isn't this just "I open coded bit spinlock to make the lockdep
> warnings go away"?

Not really, because bit_spin_lock() leaves preemption disabled.
Mike Galbraith Dec. 20, 2020, 7:47 a.m. UTC | #6
On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

That aside, the bit spinlock removal seems to hold up to beating in RT.
I stripped out the RT changes to replace the bit spinlocks, applied the
still needed atm might_sleep() fix, and ltp zram and zswap test are
running in a loop with no signs that it's a bad idea, so I hope that
makes it in (minus the preempt disabled spin which I whacked), as it
makes zsmalloc markedly more RT friendly.

RT changes go from:
 1 file changed, 79 insertions(+), 6 deletions(-)
to:
 1 file changed, 8 insertions(+), 3 deletions(-)

	-Mike
Song Bao Hua (Barry Song) Dec. 20, 2020, 9:20 p.m. UTC | #7
> On Sun, Dec 20, 2020 at 8:48 PM Mike Galbraith <efault@gmx.de> wrote:
> 
> On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> > On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > It also does get_cpu_var() in map(), put_cpu_var() in unmap().
> 
> That aside, the bit spinlock removal seems to hold up to beating in RT.
> I stripped out the RT changes to replace the bit spinlocks, applied the
> still needed atm might_sleep() fix, and ltp zram and zswap test are
> running in a loop with no signs that it's a bad idea, so I hope that
> makes it in (minus the preempt disabled spin which I whacked), as it
> makes zsmalloc markedly more RT friendly.
> 
> RT changes go from:
>  1 file changed, 79 insertions(+), 6 deletions(-)
> to:
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 

Sorry, could you show the change for
"8 insertions(+), 3 deletions(-)"?

BTW, your original patch doesn't look right, as
crypto_wait_req(crypto_acomp_decompress()...)
can sleep too.

[copy from your original patch with comment]
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned

 	/* decompress */
 	dlen = PAGE_SIZE;
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
 	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
 	if (zpool_evictable(entry->pool->zpool))
 		src += sizeof(struct zswap_header);

-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);

/*!!!!!!!!!!!!!!!!
 * here crypto could sleep
 !!!!!!!!!!!!!!*/

 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
-	mutex_unlock(acomp_ctx->mutex);

 	zpool_unmap_handle(entry->pool->zpool, entry->handle);
+	mutex_unlock(acomp_ctx->mutex);
 	BUG_ON(ret);

 freeentry:

[end]

so I guess we have to fix zsmalloc.


> 	-Mike

Thanks
Barry
Mike Galbraith Dec. 20, 2020, 10:10 p.m. UTC | #8
On Sun, 2020-12-20 at 21:20 +0000, Song Bao Hua (Barry Song) wrote:
>
> > [...]
>
> Sorry, could you show the change for
> "8 insertions(+), 3 deletions(-)"?

Sure.
---
 mm/zsmalloc.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -57,6 +57,7 @@
 #include <linux/wait.h>
 #include <linux/pagemap.h>
 #include <linux/fs.h>
+#include <linux/local_lock.h>

 #define ZSPAGE_MAGIC	0x58

@@ -293,6 +294,7 @@ struct zspage {
 };

 struct mapping_area {
+	local_lock_t lock;
 	char *vm_buf; /* copy buffer for objects that span pages */
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
@@ -455,7 +457,9 @@ MODULE_ALIAS("zpool-zsmalloc");
 #endif /* CONFIG_ZPOOL */

 /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
+static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = {
+	.lock	= INIT_LOCAL_LOCK(lock),
+};

 static bool is_zspage_isolated(struct zspage *zspage)
 {
@@ -1276,7 +1280,8 @@ void *zs_map_object(struct zs_pool *pool
 	class = pool->size_class[class_idx];
 	off = (class->size * obj_idx) & ~PAGE_MASK;

-	area = &get_cpu_var(zs_map_area);
+	local_lock(&zs_map_area.lock);
+	area = this_cpu_ptr(&zs_map_area);
 	area->vm_mm = mm;
 	if (off + class->size <= PAGE_SIZE) {
 		/* this object is contained entirely within a page */
@@ -1330,7 +1335,7 @@ void zs_unmap_object(struct zs_pool *poo

 		__zs_unmap_object(area, pages, off, class->size);
 	}
-	put_cpu_var(zs_map_area);
+	local_unlock(&zs_map_area.lock);

 	migrate_read_unlock(zspage);
 	unpin_tag(handle);
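Why the local_lock conversion helps RT: on a non-PREEMPT_RT kernel local_lock() boils down to preempt_disable(), preserving the old get_cpu_var() behaviour, while on PREEMPT_RT it takes a per-CPU spinlock_t, which is itself a sleeping lock there, so the zs_map_object()/zs_unmap_object() window stays preemptible. As a rough illustration (see Documentation/locking/locktypes.rst for the authoritative description):

/*
 * Rough equivalence, illustration only:
 *
 *   !PREEMPT_RT:  local_lock(&zs_map_area.lock)   ~ preempt_disable()
 *                 local_unlock(&zs_map_area.lock) ~ preempt_enable()
 *
 *   PREEMPT_RT:   local_lock() acquires a per-CPU spinlock_t, which is
 *                 a sleeping lock on RT, so the critical section stays
 *                 preemptible while the per-CPU data stays protected.
 */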


> BTW, your original patch doesn't look right, as
> crypto_wait_req(crypto_acomp_decompress()...)
> can sleep too.
>
> [copy from your original patch with comment]
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned
>
>  	/* decompress */
>  	dlen = PAGE_SIZE;
> +	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +	mutex_lock(acomp_ctx->mutex);
>  	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
>  	if (zpool_evictable(entry->pool->zpool))
>  		src += sizeof(struct zswap_header);
>
> -	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -	mutex_lock(acomp_ctx->mutex);
>  	sg_init_one(&input, src, entry->length);
>  	sg_init_table(&output, 1);
>  	sg_set_page(&output, page, PAGE_SIZE, 0);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
>
> /*!!!!!!!!!!!!!!!!
>  * here crypto could sleep
>  !!!!!!!!!!!!!!*/

Hohum, another one for my Bitmaster-9000 patch shredder.

	-Mike
Minchan Kim Dec. 21, 2020, 5:24 p.m. UTC | #9
On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
> 
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

I don't want to use such open code for the lock.

I see from Mike's patch that a recent zswap change introduced the lockdep
splat bug, and that you want to improve zsmalloc to fix that zswap bug, so
you introduce this patch which allows preemption to stay enabled.

https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/

zs_[un/map]_object is designed to be used in the fast path (i.e.,
zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
perfectly fine from an API point of view. However, zswap introduced
a use of the API together with mutex_lock/crypto_wait_req, where
preemption is allowed, and that was wrong.
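The intended fast-path pattern described above, as a minimal sketch (variable setup assumed; names as in include/linux/zsmalloc.h):

	void *src;

	src = zs_map_object(pool, handle, ZS_MM_RO);
	memcpy(buf, src, PAGE_SIZE);	/* the "4K page copy" step; nothing here may sleep */
	zs_unmap_object(pool, handle);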

Furthermore, zs_map_object already has a few more places that
disable preemption (migrate_read_lock, get_cpu_var and kmap_atomic).

Without making those locks preemptible all at once, zswap will still
see the lockdep warning.

> 
> Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
> ---
>  mm/zsmalloc.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7289f502ffac..ff26546a7fed 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)
>  
>  static inline int testpin_tag(unsigned long handle)
>  {
> -	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>  
>  static inline int trypin_tag(unsigned long handle)
>  {
> -	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>  
> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
>  {
> -	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	preempt_disable();
> +	while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> +		cpu_relax();
> +	preempt_enable();
>  }
>  
>  static void unpin_tag(unsigned long handle) __releases(bitlock)
>  {
> -	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>  
>  static void reset_page(struct page *page)
> -- 
> 2.20.1
>
Vitaly Wool Dec. 21, 2020, 7:20 p.m. UTC | #10
On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> I don't want to use such open code for the lock.
>
> I see from Mike's patch that a recent zswap change introduced the lockdep
> splat bug, and that you want to improve zsmalloc to fix that zswap bug, so
> you introduce this patch which allows preemption to stay enabled.

This understanding is upside down. The code in zswap you are referring
to is not buggy.  You may claim that it is suboptimal but there is
nothing wrong in taking a mutex.

> https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
>
> zs_[un/map]_object is designed to be used in the fast path (i.e.,
> zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> perfectly fine from an API point of view. However, zswap introduced
> a use of the API together with mutex_lock/crypto_wait_req, where
> preemption is allowed, and that was wrong.

Taking a spinlock in one callback and releasing it in another is
unsafe and error prone. What if unmap was called on completion of a
DMA-like transfer from another context, like a threaded IRQ handler?
In that case this spinlock might never be released.

Anyway I can come up with a zswap patch explicitly stating that
zsmalloc is not fully compliant with zswap / zpool API to avoid
confusion for the time being. Would that be ok with you?

Best regards,
   Vitaly

Shakeel Butt Dec. 21, 2020, 7:50 p.m. UTC | #11
On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> wrote:
>
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > I don't want to use such open code for the lock.
> >
> > I see from Mike's patch that a recent zswap change introduced the lockdep
> > splat bug, and that you want to improve zsmalloc to fix that zswap bug, so
> > you introduce this patch which allows preemption to stay enabled.
>
> This understanding is upside down. The code in zswap you are referring
> to is not buggy.  You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.
>

Is this suboptimal for all or just the hardware accelerators? Sorry, I
am not very familiar with the crypto API. If I select lzo or lz4 as a
zswap compressor will the [de]compression be async or sync?

> > https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
> >
> > zs_[un/map]_object is designed to be used in the fast path (i.e.,
> > zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> > perfectly fine from an API point of view. However, zswap introduced
> > a use of the API together with mutex_lock/crypto_wait_req, where
> > preemption is allowed, and that was wrong.
>
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
>
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with zswap / zpool API

The documentation of zpool_map_handle() clearly states "This may hold
locks, disable interrupts, and/or preemption, ...", so how come
zsmalloc is not fully compliant?

> to avoid
> confusion for the time being. Would that be ok with you?
Song Bao Hua (Barry Song) Dec. 21, 2020, 8:05 p.m. UTC | #12
> On Tue, Dec 22, 2020 at 8:50 AM Shakeel Butt <shakeelb@google.com> wrote:
> 
> On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> wrote:
> > [...]
> >
> > This understanding is upside down. The code in zswap you are referring
> > to is not buggy.  You may claim that it is suboptimal but there is
> > nothing wrong in taking a mutex.
> >
> 
> Is this suboptimal for all or just the hardware accelerators? Sorry, I
> am not very familiar with the crypto API. If I select lzo or lz4 as a
> zswap compressor will the [de]compression be async or sync?

Right now, in the crypto subsystem, new drivers are required to be written
against the async APIs. The old sync API can't be used by new accelerator
drivers, as it is not supported there at all.

The old drivers are sync, but they've got async wrappers so that they
support the async APIs too. E.g.
crypto: acomp - add support for lz4 via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d

crypto: acomp - add support for lzo via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e

so they support the async APIs, but they still work in sync mode since
those old drivers don't sleep.
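A minimal sketch of the acomp calling convention being discussed (error handling and scatterlist setup trimmed; input/output, slen and dlen stand for caller-provided scatterlists and lengths; see include/crypto/acompress.h). With an scomp-wrapped software algorithm such as "lzo" the request completes synchronously and crypto_wait_req() never has to sleep; a genuinely async driver returns -EINPROGRESS and completes through the callback:

	struct crypto_acomp *tfm = crypto_alloc_acomp("lzo", 0, 0);
	struct acomp_req *req = acomp_request_alloc(tfm);
	struct crypto_wait wait;
	int ret;

	crypto_init_wait(&wait);
	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	acomp_request_set_params(req, &input, &output, slen, dlen);
	ret = crypto_wait_req(crypto_acomp_compress(req), &wait);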

> 
> > >
> > > https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
> > >
> > > zs_[un/map]_object is designed to be used in the fast path (i.e.,
> > > zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> > > perfectly fine from an API point of view. However, zswap introduced
> > > a use of the API together with mutex_lock/crypto_wait_req, where
> > > preemption is allowed, and that was wrong.
> >
> > Taking a spinlock in one callback and releasing it in another is
> > unsafe and error prone. What if unmap was called on completion of a
> > DMA-like transfer from another context, like a threaded IRQ handler?
> > In that case this spinlock might never be released.
> >
> > Anyway I can come up with a zswap patch explicitly stating that
> > zsmalloc is not fully compliant with zswap / zpool API
> 
> The documentation of zpool_map_handle() clearly states "This may hold
> locks, disable interrupts, and/or preemption, ...", so how come
> zsmalloc is not fully compliant?

Zbud and z3fold haven't really done this. If we hold a spinlock before
entering zswap and release that spinlock only after calling zswap, this
puts zswap in an atomic context, which isn't necessarily needed.


Thanks
Barry
Minchan Kim Dec. 21, 2020, 8:22 p.m. UTC | #13
On Mon, Dec 21, 2020 at 08:20:26PM +0100, Vitaly Wool wrote:
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > I don't want to use such open code for the lock.
> >
> > I see from Mike's patch that a recent zswap change introduced the lockdep
> > splat bug, and that you want to improve zsmalloc to fix that zswap bug, so
> > you introduce this patch which allows preemption to stay enabled.
> 
> This understanding is upside down. The code in zswap you are referring
> to is not buggy.  You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.

No, it's surely a breakage coming from zswap, since zpool/zsmalloc has
worked like this all along, and now you are saying "nothing wrong" even
though it breaks the rule.

> 
> > https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
> >
> > zs_[un/map]_object is designed to be used in the fast path (i.e.,
> > zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> > perfectly fine from an API point of view. However, zswap introduced
> > a use of the API together with mutex_lock/crypto_wait_req, where
> > preemption is allowed, and that was wrong.
> 
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
> 
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with zswap / zpool API to avoid
> confusion for the time being. Would that be ok with you?

It's your call since you are the maintainer of zswap now, and you are
breaking the rule we have kept for a long time.


Shakeel Butt Dec. 21, 2020, 9:02 p.m. UTC | #14
On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
<song.bao.hua@hisilicon.com> wrote:
> > [...]
> >
> > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > zswap compressor will the [de]compression be async or sync?
>
> Right now, in the crypto subsystem, new drivers are required to be written
> against the async APIs. The old sync API can't be used by new accelerator
> drivers, as it is not supported there at all.
>
> The old drivers are sync, but they've got async wrappers so that they
> support the async APIs too. E.g.
> crypto: acomp - add support for lz4 via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
>
> crypto: acomp - add support for lzo via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
>
> so they support the async APIs, but they still work in sync mode since
> those old drivers don't sleep.
>

Good to know that those are sync because I want them to be sync.
Please note that zswap is a cache in front of a real swap and the load
operation is latency sensitive as it comes in the page fault path and
directly impacts the applications. I doubt decompressing synchronously
a 4k page on a cpu will be costlier than asynchronously decompressing
the same page from hardware accelerators.
Song Bao Hua (Barry Song) Dec. 21, 2020, 9:25 p.m. UTC | #15
> On Tue, Dec 22, 2020 at 10:03 AM Shakeel Butt <shakeelb@google.com> wrote:
> 
> On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com> wrote:
> > [...]
> 
> Good to know that those are sync because I want them to be sync.
> Please note that zswap is a cache in front of a real swap and the load
> operation is latency sensitive as it comes in the page fault path and
> directly impacts the applications. I doubt decompressing synchronously
> a 4k page on a cpu will be costlier than asynchronously decompressing
> the same page from hardware accelerators.

If you read the old paper:
https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
Because the hardware accelerator speeds up compression, looking at the zswap
metrics we observed that there were more store and load requests in a given
amount of time, which filled up the zswap pool faster than a software
compression run. Because of this behavior, we set the max_pool_percent
parameter to 30 for the hardware compression runs - this means that zswap
can use up to 30% of the 10GB of total memory.

So using hardware accelerators, we get a chance to speed up compression
while decreasing cpu utilization.

BTW, if it is not easy to change zsmalloc, one quick workaround we might do
in zswap is adding the below after applying Mike's original patch:

if (in_atomic()) /* for zsmalloc */
	while (!try_wait_for_completion(&req->done))
		;
else /* for zbud, z3fold */
	crypto_wait_req(....);

crypto_wait_req() is actually doing wait_for_completion():
static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
	switch (err) {
	case -EINPROGRESS:
	case -EBUSY:
		wait_for_completion(&wait->completion);
		reinit_completion(&wait->completion);
		err = wait->err;
		break;
	}

	return err;
}

Thanks
Barry
Vitaly Wool Dec. 21, 2020, 10:11 p.m. UTC | #16
On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
<song.bao.hua@hisilicon.com> wrote:
> [...]
>
> BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> in zswap is adding the below after applying Mike's original patch:
>
> if (in_atomic()) /* for zsmalloc */
>         while (!try_wait_for_completion(&req->done))
>                 ;
> else /* for zbud, z3fold */
>         crypto_wait_req(....);

I don't think I'm going to ack this, sorry.

Best regards,
   Vitaly

Song Bao Hua (Barry Song) Dec. 21, 2020, 10:42 p.m. UTC | #17
> On Tue, Dec 22, 2020 at 11:12 AM Vitaly Wool <vitaly.wool@konsulko.com> wrote:
> 
> On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com> wrote:
> > [...]
> >
> > BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> > in zswap is adding the below after applying Mike's original patch:
> >
> > if (in_atomic()) /* for zsmalloc */
> >         while (!try_wait_for_completion(&req->done))
> >                 ;
> > else /* for zbud, z3fold */
> >         crypto_wait_req(....);
> 
> I don't think I'm going to ack this, sorry.
> 

Fair enough. And I am also wondering whether we can move zpool_unmap_handle()
up to right after zpool_map_handle(), as below:

	dlen = PAGE_SIZE;
	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);
+	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);
	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
	mutex_unlock(acomp_ctx->mutex);

-	zpool_unmap_handle(entry->pool->zpool, entry->handle);

This works since src is always lowmem and we only need its virtual address
to get the page of src in sg_init_one(); we don't actually read it with
the CPU anywhere.
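For reference, sg_init_one() only records the page and offset of the buffer via sg_set_buf(); it does not read the data. Roughly, simplified from include/linux/scatterlist.h (the CONFIG_DEBUG_SG check is omitted):

static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}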

Thanks
Barry
Shakeel Butt Dec. 21, 2020, 10:46 p.m. UTC | #18
On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song)
<song.bao.hua@hisilicon.com> wrote:
> > [...]
> >
> > Good to know that those are sync because I want them to be sync.
> > Please note that zswap is a cache in front of a real swap and the load
> > operation is latency sensitive as it comes in the page fault path and
> > directly impacts the applications. I doubt decompressing synchronously
> > a 4k page on a cpu will be costlier than asynchronously decompressing
> > the same page from hardware accelerators.
>
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> Because the hardware accelerator speeds up compression, looking at the zswap
> metrics we observed that there were more store and load requests in a given
> amount of time, which filled up the zswap pool faster than a software
> compression run. Because of this behavior, we set the max_pool_percent
> parameter to 30 for the hardware compression runs - this means that zswap
> can use up to 30% of the 10GB of total memory.
>
> So using hardware accelerators, we get a chance to speed up compression
> while decreasing cpu utilization.
>

I don't care much about the compression. It's the decompression or
more specifically the latency of decompression I really care about.

Compression happens on reclaim, so latency is not really an issue.
Reclaim can be pressure-based or proactive. I think async batched
compression by accelerators makes a lot of sense. Though I doubt zswap
is the right layer for that. To me adding "async batched compression
support by accelerators" in zram looks more natural as the kernel
already has async block I/O support.

For decompression, I would like as low latency as possible which I
think is only possible by doing decompression on a cpu synchronously.
Song Bao Hua (Barry Song) Dec. 21, 2020, 11:02 p.m. UTC | #19
> On Tue, Dec 22, 2020 at 11:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> 
> On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com> wrote:
> > > [...]
> > >
> > > Good to know that those are sync because I want them to be sync.
> > > Please note that zswap is a cache in front of a real swap and the load
> > > operation is latency sensitive as it comes in the page fault path and
> > > directly impacts the applications. I doubt decompressing synchronously
> > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > the same page from hardware accelerators.
> >
> > If you read the old paper:
> >
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> y
> > Because the hardware accelerator speeds up compression, looking at the zswap
> > metrics we observed that there were more store and load requests in a given
> > amount of time, which filled up the zswap pool faster than a software
> > compression run. Because of this behavior, we set the max_pool_percent
> > parameter to 30 for the hardware compression runs - this means that zswap
> > can use up to 30% of the 10GB of total memory.
> >
> > So using hardware accelerators, we get a chance to speed up compression
> > while decreasing cpu utilization.
> >
> 
> I don't care much about the compression. It's the decompression or
> more specifically the latency of decompression I really care about.
> 
> Compression happens on reclaim, so latency is not really an issue.
> Reclaim can be pressure-based or proactive. I think async batched
> compression by accelerators makes a lot of sense. Though I doubt zswap
> is the right layer for that. To me adding "async batched compression
> support by accelerators" in zram looks more natural as the kernel
> already has async block I/O support.

Yep, zram is one of the targets I have been thinking of for acomp support.

> 
> For decompression, I would like as low latency as possible which I
> think is only possible by doing decompression on a cpu synchronously.

One possibility is to make the HW accelerator driver busy-poll for
decompression. It would still be built on the async API, since that is
the framework nowadays, but the driver wouldn't actually block: the
request would complete before the driver returns, so crypto_wait_req()
would return without sleeping.
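
As a rough sketch of the idea (only an illustration: hw_submit(),
hw_done() and hw_status() are made-up placeholders for whatever
descriptor/MMIO access a real driver would use):

static int hw_acomp_decompress_poll(struct acomp_req *req)
{
	/* queue the job on the accelerator (placeholder) */
	hw_submit(req);

	/* busy-poll the completion status instead of waiting for an IRQ */
	while (!hw_done(req))
		cpu_relax();

	/*
	 * Returning the final status (0 on success) instead of
	 * -EINPROGRESS means crypto_wait_req() never sleeps.
	 */
	return hw_status(req);
}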

Thanks
Barry
Song Bao Hua (Barry Song) Dec. 21, 2020, 11:35 p.m. UTC | #20
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
> [...]
>
> > > BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> > > in zswap is adding the below after applying Mike's original patch:
> > >
> > > if (in_atomic()) /* for zsmalloc */
> > >         while (!try_wait_for_completion(&req->done));
> > > else /* for zbud, z3fold */
> > >         crypto_wait_req(....);
> >
> > I don't think I'm going to ack this, sorry.
> >
> 
> Fair enough. And I am also thinking about whether we can move
> zpool_unmap_handle() up, right after zpool_map_handle(), as below:
> 
> 	dlen = PAGE_SIZE;
> 	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> 	if (zpool_evictable(entry->pool->zpool))
> 		src += sizeof(struct zswap_header);
> +	zpool_unmap_handle(entry->pool->zpool, entry->handle);
> 
> 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> 	mutex_lock(acomp_ctx->mutex);
> 	sg_init_one(&input, src, entry->length);
> 	sg_init_table(&output, 1);
> 	sg_set_page(&output, page, PAGE_SIZE, 0);
> 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> 	mutex_unlock(acomp_ctx->mutex);
> 
> -	zpool_unmap_handle(entry->pool->zpool, entry->handle);
> 
> Since src is always in low memory, we only need its virtual address
> to get the page of src in sg_init_one(); we don't actually read it
> with the CPU anywhere.

The below code might be better:

	dlen = PAGE_SIZE;
	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);

+	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	mutex_lock(acomp_ctx->mutex);
	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
	mutex_unlock(acomp_ctx->mutex);

-	zpool_unmap_handle(entry->pool->zpool, entry->handle);

> 
> > Best regards,
> >    Vitaly
> >
> > > crypto_wait_req() is actually doing wait_for_completion():
> > > static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> > > {
> > >         switch (err) {
> > >         case -EINPROGRESS:
> > >         case -EBUSY:
> > >                 wait_for_completion(&wait->completion);
> > >                 reinit_completion(&wait->completion);
> > >                 err = wait->err;
> > >                 break;
> > >         }
> > >
> > >         return err;
> > > }

Thanks
Barry
Vitaly Wool Dec. 22, 2020, 12:59 a.m. UTC | #21
On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
<song.bao.hua@hisilicon.com> wrote:
>
>
>
> > [...]
>
> The below code might be better:
>
>         dlen = PAGE_SIZE;
>         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
>         if (zpool_evictable(entry->pool->zpool))
>                 src += sizeof(struct zswap_header);
>
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>
> +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
>
>         mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
>         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
>         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
>         mutex_unlock(acomp_ctx->mutex);
>
> -       zpool_unmap_handle(entry->pool->zpool, entry->handle);

I don't see how this is going to work since we can't guarantee src
will be a valid pointer after the zpool_unmap_handle() call, can we?
Could you please elaborate?

~Vitaly

Song Bao Hua (Barry Song) Dec. 22, 2020, 1:10 a.m. UTC | #22
> -----Original Message-----
> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> 
> On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com> wrote:
> >
> > [...]
> >
> > The below code might be better:
> >
> >         dlen = PAGE_SIZE;
> >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> >         if (zpool_evictable(entry->pool->zpool))
> >                 src += sizeof(struct zswap_header);
> >
> >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >
> > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> >         mutex_lock(acomp_ctx->mutex);
> >         sg_init_one(&input, src, entry->length);
> >         sg_init_table(&output, 1);
> >         sg_set_page(&output, page, PAGE_SIZE, 0);
> >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> >         mutex_unlock(acomp_ctx->mutex);
> >
> > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> 
> I don't see how this is going to work since we can't guarantee src
> will be a valid pointer after the zpool_unmap_handle() call, can we?
> Could you please elaborate?

A valid pointer is only needed when the CPU reads or writes through it.
Here the CPU doesn't touch the data at all; we only need to get the
page struct from the address.

void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
{
	sg_init_table(sg, 1);
	sg_set_buf(sg, buf, buflen);
}

static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
#ifdef CONFIG_DEBUG_SG
	BUG_ON(!virt_addr_valid(buf));
#endif
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}

sg_init_one() always takes an address that is linearly mapped to a
physical address, so once we have the value of src we can derive the
page struct directly.

src is in the linear mapping, so this doesn't require the page table
walk that vmalloc_to_page() needs.

The req only needs the page to initialize the sg table; if we end up
doing CPU-based (de)compression, the crypto driver will kmap() it
again.
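
In other words, something like the below sketch. This is not the actual
patch, just an illustration; page and offset are local variables I am
introducing here:

	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);

	/* resolve the page while the handle is still mapped */
	page = virt_to_page(src);
	offset = offset_in_page(src);

	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	/* src is never dereferenced from here on */
	sg_init_table(&input, 1);
	sg_set_page(&input, page, entry->length, offset);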

> 
> ~Vitaly

Thanks
Barry
Song Bao Hua (Barry Song) Dec. 22, 2020, 1:42 a.m. UTC | #23
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> 
> 
> 
> > [...]
> >
> > I don't see how this is going to work since we can't guarantee src
> > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > Could you please elaborate?
>
> A valid pointer is only needed when the CPU reads or writes through it.
> Here the CPU doesn't touch the data at all; we only need to get the
> page struct from the address.
>
> [...]
>
> sg_init_one() always takes an address that is linearly mapped to a
> physical address, so once we have the value of src we can derive the
> page struct directly.
>
> The req only needs the page to initialize the sg table; if we end up
> doing CPU-based (de)compression, the crypto driver will kmap() it
> again.

Probably I introduced another bug here: for zsmalloc, it is possible to
get a highmem page for the zpool, since its malloc_support_movable =
true.

	if (zpool_malloc_support_movable(entry->pool->zpool))
		gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
	ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);

On a 64-bit system there is no highmem, but on a 32-bit system we could
trigger this bug.

So zswap should really have used kmap_to_page(), which handles both
linearly and non-linearly mapped addresses; sg_init_one() only supports
linear mappings.
But it doesn't change the fact: once the req is initialized with the
page struct, we can unmap src. If we use a HW accelerator, the access
is a DMA; if we use CPU decompression, the crypto driver will kmap()
the page again.
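
Under that assumption, a sketch of the mapping-safe variant (again only
an illustration, with page and offset as local variables of mine):

	/* kmap_to_page() handles both lowmem and kmap()ed highmem addresses */
	page = kmap_to_page(src);
	offset = offset_in_page(src);

	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	sg_init_table(&input, 1);
	sg_set_page(&input, page, entry->length, offset);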

> 
> >
> > ~Vitaly
> 

Thanks
Barry
Vitaly Wool Dec. 22, 2020, 1:57 a.m. UTC | #24
On Tue, 22 Dec 2020, 02:42 Song Bao Hua (Barry Song), <song.bao.hua@hisilicon.com> wrote:

>
>
> > -----Original Message-----
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 2:06 PM
> > To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> Mike
> > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > <akpm@linux-foundation.org>
> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> >
> >
> > > -----Original Message-----
> > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> > > Sent: Tuesday, December 22, 2020 2:00 PM
> > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <
> minchan@kernel.org>;
> > Mike
> > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>;
> linux-mm
> > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de
> >;
> > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > <akpm@linux-foundation.org>
> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> > > <song.bao.hua@hisilicon.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Song Bao Hua (Barry Song)
> > > > > Sent: Tuesday, December 22, 2020 11:38 AM
> > > > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <
> minchan@kernel.org>;
> > > Mike
> > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>;
> linux-mm
> > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <
> bigeasy@linutronix.de>;
> > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > > > <akpm@linux-foundation.org>
> > > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> > > > > > Sent: Tuesday, December 22, 2020 11:12 AM
> > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim
> > <minchan@kernel.org>;
> > > > > Mike
> > > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>;
> > linux-mm
> > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de>;
> > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > > > > <akpm@linux-foundation.org>
> > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > >
> > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > > > > <song.bao.hua@hisilicon.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com]
> > > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > > > > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim
> > > > > > <minchan@kernel.org>;
> > > > > > > > Mike Galbraith <efault@gmx.de>; LKML <
> linux-kernel@vger.kernel.org>;
> > > > > > linux-mm
> > > > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior
> > > <bigeasy@linutronix.de>;
> > > > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > > > > > > <akpm@linux-foundation.org>
> > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > > >
> > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > > > > <song.bao.hua@hisilicon.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com]
> > > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com>
> > > > > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith
> > > <efault@gmx.de>;
> > > > > > LKML
> > > > > > > > > > <linux-kernel@vger.kernel.org>; linux-mm <
> linux-mm@kvack.org>;
> > > Song
> > > > > > Bao
> > > > > > > > Hua
> > > > > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian
> Andrzej
> > > Siewior
> > > > > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>;
> Sergey
> > > > > > > > Senozhatsky
> > > > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > > > > > > > > <akpm@linux-foundation.org>
> > > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool
> > > <vitaly.wool@konsulko.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <
> minchan@kernel.org>
> > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly
> Wool wrote:
> > > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback
> and
> > releases
> > > > > > it
> > > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap
> complaining
> > > > > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To fix that and to improve RT properties of
> zsmalloc,
> > remove
> > > > > that
> > > > > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > > > > >
> > > > > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > > > > >
> > > > > > > > > > > > I see from Mike's patch, recent zswap change
> introduced
> > the
> > > lockdep
> > > > > > > > > > > > splat bug and you want to improve zsmalloc to fix
> the zswap
> > > bug
> > > > > > and
> > > > > > > > > > > > introduce this patch with allowing preemption
> enabling.
> > > > > > > > > > >
> > > > > > > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > > > > > > nothing wrong in taking a mutex.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > > > > > > > zswap compressor will the [de]compression be async or sync?
> > > > > > > > >
> > > > > > > > > Right now, in the crypto subsystem, new drivers are required to be
> > > > > > > > > written based on the async APIs. The old sync API can't work in new
> > > > > > > > > accelerator drivers as it is not supported there at all.
> > > > > > > > >
> > > > > > > > > Old drivers used to be sync, but they've got async wrappers to support
> > > > > > > > > the async APIs. E.g.:
> > > > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > > > >
> > > > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > > > >
> > > > > > > > > So they are supporting async APIs but are still working in sync mode,
> > > > > > > > > as those old drivers don't sleep.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > > > Please note that zswap is a cache in front of a real swap and the load
> > > > > > > > operation is latency sensitive as it comes in the page fault path and
> > > > > > > > directly impacts the applications. I doubt decompressing a 4k page
> > > > > > > > synchronously on a cpu will be costlier than asynchronously decompressing
> > > > > > > > the same page from hardware accelerators.
> > > > > > >
> > > > > > > If you read the old paper:
> > > > > > >
> > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> > > > > > > Because the hardware accelerator speeds up compression, looking at the zswap
> > > > > > > metrics we observed that there were more store and load requests in a given
> > > > > > > amount of time, which filled up the zswap pool faster than a software
> > > > > > > compression run. Because of this behavior, we set the max_pool_percent
> > > > > > > parameter to 30 for the hardware compression runs - this means that zswap
> > > > > > > can use up to 30% of the 10GB of total memory.
> > > > > > >
> > > > > > > So using hardware accelerators, we get a chance to speed up compression
> > > > > > > while decreasing cpu utilization.
> > > > > > >
> > > > > > > BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> > > > > > > in zswap is adding the below after applying Mike's original patch:
> > > > > > >
> > > > > > > if (in_atomic()) /* for zsmalloc */
> > > > > > >         while (!try_wait_for_completion(&req->done));
> > > > > > > else /* for zbud, z3fold */
> > > > > > >         crypto_wait_req(....);
> > > > > >
> > > > > > I don't think I'm going to ack this, sorry.
> > > > > >
> > > > >
> > > > > Fair enough. And I am also thinking about whether we can move
> > > > > zpool_unmap_handle() to right after zpool_map_handle() as below:
> > > > >
> > > > >       dlen = PAGE_SIZE;
> > > > >       src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > > >       if (zpool_evictable(entry->pool->zpool))
> > > > >               src += sizeof(struct zswap_header);
> > > > > +     zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > >       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > > >       mutex_lock(acomp_ctx->mutex);
> > > > >       sg_init_one(&input, src, entry->length);
> > > > >       sg_init_table(&output, 1);
> > > > >       sg_set_page(&output, page, PAGE_SIZE, 0);
> > > > >       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > > >       ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > > >       mutex_unlock(acomp_ctx->mutex);
> > > > >
> > > > > -     zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > > Since src is always low memory, we only need its virtual address
> > > > > to get the page of src in sg_init_one(). We don't actually read it
> > > > > with the CPU anywhere.
> > > >
> > > > The below code might be better:
> > > >
> > > >         dlen = PAGE_SIZE;
> > > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > >         if (zpool_evictable(entry->pool->zpool))
> > > >                 src += sizeof(struct zswap_header);
> > > >
> > > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > >
> > > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > >         mutex_lock(acomp_ctx->mutex);
> > > >         sg_init_one(&input, src, entry->length);
> > > >         sg_init_table(&output, 1);
> > > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         mutex_unlock(acomp_ctx->mutex);
> > > >
> > > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > I don't see how this is going to work since we can't guarantee src
> > > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > > Could you please elaborate?
> >
> > A valid pointer is needed for the CPU to read and write. Here, the CPU doesn't
> > read or write it; we only need to get the page struct from the address.
> >
> > void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
> > {
> >       sg_init_table(sg, 1);
> >       sg_set_buf(sg, buf, buflen);
> > }
> >
> > static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> >                             unsigned int buflen)
> > {
> > #ifdef CONFIG_DEBUG_SG
> >       BUG_ON(!virt_addr_valid(buf));
> > #endif
> >       sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
> > }
> >
> > sg_init_one() always uses an address which has a linear mapping to a
> > physical address. So once we get the value of src, we can get the page struct.
> >
> > src has a linear mapping to a physical address. It doesn't require the
> > page table walk which vmalloc_to_page() needs.
> >
> > The req only requires the page to initialize the sg table. I think if
> > we are going to use cpu-based (de)compression, the crypto
> > driver will kmap it again.
>
> Probably I made another mistake here: for zsmalloc, it is possible to
> get highmem for the zpool since its malloc_support_movable = true.
>
>         if (zpool_malloc_support_movable(entry->pool->zpool))
>                 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
>         ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);
>
> For a 64-bit system, there is never highmem. For a 32-bit system, we may
> trigger this bug.
>
> So actually zswap should have used kmap_to_page(), which can support
> both linear and non-linear mappings; sg_init_one() only supports
> linear mappings.
> But it doesn't change the fact: once req is initialized with the page
> struct, we can unmap src. If we are going to use a HW accelerator,
> it would be DMA; if we are going to use CPU decompression, the crypto
> driver will kmap() again.
>

I'm still not convinced. Will kmap what, src? At this point src might
become just a bogus pointer. Why couldn't the object have been moved
somewhere else (due to the compaction mechanism for instance) at the time
DMA kicks in?


> >
> > >
> > > ~Vitaly
> >
>
> Thanks
> Barry
>
Song Bao Hua (Barry Song) Dec. 22, 2020, 2:07 a.m. UTC | #25
> I'm still not convinced. Will kmap what, src? At this point src might become just a bogus pointer. 

As long as the memory is still there, we can kmap it by its page struct. But if
it is not there anymore, we have no way.
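
For illustration, a minimal sketch (untested; input/src/entry as in the zswap
snippet earlier in the thread, and assuming the object doesn't cross a page
boundary):

	/* resolve the backing page while the zpool mapping is still valid */
	struct page *spage = kmap_to_page(src);	/* lowmem or kmap()ed highmem */
	unsigned int soff = offset_in_page(src);

	sg_init_table(&input, 1);
	sg_set_page(&input, spage, entry->length, soff);
	/* from here on only the page/offset pair is used, not src itself */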

> Why couldn't the object have been moved somewhere else (due to the compaction mechanism for instance)
> at the time DMA kicks in?

So zs_map_object() will guarantee that src won't be moved, by holding those preemption-disabled locks?
If so, it seems we have to drop the MOVABLE gfp in zswap for the zsmalloc case?

> 
> >
> > ~Vitaly
> 

Thanks
Barry
Song Bao Hua (Barry Song) Dec. 22, 2020, 2:10 a.m. UTC | #26
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 3:03 PM
> To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> 
> 
> > I'm still not convinced. Will kmap what, src? At this point src might become
> just a bogus pointer.
> 
> As long as the memory is still there, we can kmap it by its page struct. But
> if
> it is not there anymore, we have no way.
> 
> > Why couldn't the object have been moved somewhere else (due to the compaction
> mechanism for instance)
> > at the time DMA kicks in?
> 
> So zs_map_object() will guarantee the src won't be moved by holding those
> preemption-disabled lock?
> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> 

Or we can do get_page() to avoid the movement of the page.
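
Roughly like this (a sketch only - whether an elevated refcount is really
enough to stop zsmalloc's compaction from moving the object is exactly the
open question here):

	struct page *p = kmap_to_page(src);

	get_page(p);				/* pin the backing page */
	zpool_unmap_handle(entry->pool->zpool, entry->handle);
	/* ... set up the sg table / DMA against p ... */
	put_page(p);				/* drop the pin when done */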

> >
> > >
> > > ~Vitaly
> >
> 
> Thanks
> Barry
David Laight Dec. 22, 2020, 9:20 a.m. UTC | #27
From: Song Bao Hua
> Sent: 21 December 2020 23:02
...
> > For decompression, I would like as low latency as possible which I
> > think is only possible by doing decompression on a cpu synchronously.
> 
> One possibility is that we change HW accelerator driver to be sync
> polling for decompression. But this still depends on async api as
> this is the framework nowadays, the difference would be the driver
> won't really block. crypto_wait_req() will return without actual
> sleep.

How long does the HW accelerated compress/decompress need to take before
it is actually worth sleeping the process?
While the HW version might be faster than the SW one, it may not be
enough faster to allow for the hardware interrupt and process sleep.
So it may be worth just spinning (polling the hardware register)
until the request completes.

If decompress are done that way, but compress left async, then
the decompress might need to fallback to SW if the HW is busy.
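
In pseudo-C, something like this (the hw_*() hooks are made up purely to
illustrate the shape; no such driver API exists today):

	if (hw_decompress_submit(dev, src, dst) == -EBUSY) {
		sw_decompress(src, dst);		/* HW busy: fall back to CPU */
	} else {
		while (!hw_decompress_done(dev))	/* poll, no interrupt/sleep */
			cpu_relax();
	}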

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Vitaly Wool Dec. 22, 2020, 9:32 a.m. UTC | #28
On Tue, 22 Dec 2020, 10:20 David Laight, <David.Laight@aculab.com> wrote:

> From: Song Bao Hua
> > Sent: 21 December 2020 23:02
> ...
> > > For decompression, I would like as low latency as possible which I
> > > think is only possible by doing decompression on a cpu synchronously.
> >
> > One possibility is that we change HW accelerator driver to be sync
> > polling for decompression. But this still depends on async api as
> > this is the framework nowadays, the difference would be the driver
> > won't really block. crypto_wait_req() will return without actual
> > sleep.
>
> How long does the HW accelerated compress/decompress need to be before
> it is actually worth sleeping the process?
> While the HW version might be faster than the SW one, it may not be
> enough faster to allow for the hardware interrupt and process sleep.
> So it may be worth just spinning (polling the hardware register)
> until the request completes.
>
> If decompress are done that way, but compress left async, then
> the decompress might need to fallback to SW if the HW is busy.
>

This is an option too, of course. However, please bear in mind that kernel
mutexes are hybrid in the sense that there will normally be some spinning
first.

~Vitaly


>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1
> 1PT, UK
> Registration No: 1397386 (Wales)
>
Vitaly Wool Dec. 22, 2020, 9:44 a.m. UTC | #29
On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
<song.bao.hua@hisilicon.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 3:03 PM
> > To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
> > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> >
> > > I'm still not convinced. Will kmap what, src? At this point src might become
> > just a bogus pointer.
> >
> > As long as the memory is still there, we can kmap it by its page struct. But
> > if
> > it is not there anymore, we have no way.
> >
> > > Why couldn't the object have been moved somewhere else (due to the compaction
> > mechanism for instance)
> > > at the time DMA kicks in?
> >
> > So zs_map_object() will guarantee the src won't be moved by holding those
> > preemption-disabled lock?
> > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> >
>
> Or we can do get_page() to avoid the movement of the page.


I would like to discuss this more in zswap context than zsmalloc's.
Since zsmalloc does not implement a reclaim callback, using it in zswap
is a corner case anyway.

zswap, on the other hand, may be dealing with some new backends in the
future which have more chances to become mainstream. Imagine typical
NUMA-like cases, i.e. a zswap pool allocated in some kind of SRAM, or in
unused video memory. In such a case, if you try to use a pointer to an
invalidated zpool mapping, you are on the way to thrashing the system.
So: no assumption that the zswap pool is in regular linear RAM should
be made.

~Vitaly
>
>
> > >
> > > >
> > > > ~Vitaly
> > >
> >
> > Thanks
> > Barry
>
>
Song Bao Hua (Barry Song) Dec. 22, 2020, 9:06 p.m. UTC | #30
> -----Original Message-----
> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> Sent: Tuesday, December 22, 2020 10:44 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> 
> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> <song.bao.hua@hisilicon.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Song Bao Hua (Barry Song)
> > > Sent: Tuesday, December 22, 2020 3:03 PM
> > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> Mike
> > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > >
> > > > I'm still not convinced. Will kmap what, src? At this point src might
> become
> > > just a bogus pointer.
> > >
> > > As long as the memory is still there, we can kmap it by its page struct.
> But
> > > if
> > > it is not there anymore, we have no way.
> > >
> > > > Why couldn't the object have been moved somewhere else (due to the compaction
> > > mechanism for instance)
> > > > at the time DMA kicks in?
> > >
> > > So zs_map_object() will guarantee the src won't be moved by holding those
> > > preemption-disabled lock?
> > > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> > >
> >
> > Or we can do get_page() to avoid the movement of the page.
> 
> 
> I would like to discuss this more in zswap context than zsmalloc's.
> Since zsmalloc does not implement reclaim callback, using it in zswap
> is a corner case anyway.

I see. But it seems we still need a solution for the compatibility
of zsmalloc and zswap? This will require a change in either zsmalloc
or zswap.
Or do you want to make zswap depend on !ZSMALLOC?

> 
> zswap, on the other hand, may be dealing with some new backends in
> future which have more chances to become mainstream. Imagine typical
> NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
> unused video memory. In such a case if you try to use a pointer to an
> invalidated zpool mapping, you are on the way to thrash the system.
> So: no assumptions that the zswap pool is in regular linear RAM should
> be made.
> 
> ~Vitaly

Thanks
Barry
Vitaly Wool Dec. 23, 2020, 12:11 a.m. UTC | #31
On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
<song.bao.hua@hisilicon.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> > Sent: Tuesday, December 22, 2020 10:44 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
> > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> > <song.bao.hua@hisilicon.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Song Bao Hua (Barry Song)
> > > > Sent: Tuesday, December 22, 2020 3:03 PM
> > > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> > Mike
> > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> > > > <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > >
> > > > > I'm still not convinced. Will kmap what, src? At this point src might
> > become
> > > > just a bogus pointer.
> > > >
> > > > As long as the memory is still there, we can kmap it by its page struct.
> > But
> > > > if
> > > > it is not there anymore, we have no way.
> > > >
> > > > > Why couldn't the object have been moved somewhere else (due to the compaction
> > > > mechanism for instance)
> > > > > at the time DMA kicks in?
> > > >
> > > > So zs_map_object() will guarantee the src won't be moved by holding those
> > > > preemption-disabled lock?
> > > > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> > > >
> > >
> > > Or we can do get_page() to avoid the movement of the page.
> >
> >
> > I would like to discuss this more in zswap context than zsmalloc's.
> > Since zsmalloc does not implement reclaim callback, using it in zswap
> > is a corner case anyway.
>
> I see. But it seems we still need a solution for the compatibility
> of zsmalloc and zswap? this will require change in either zsmalloc
> or zswap.
> or do you want to make zswap depend on !ZSMALLOC?

No, I really don't think we should go that far. What if we add a flag
to zpool, named something like "can_sleep_mapped", and have it set for
zbud/z3fold?
Then zswap could go the current path if the flag is set; and if it's
not set, and mutex_trylock fails, copy data from src to a temporary
buffer, then unmap the handle, take the mutex, process the buffer
instead of src. Not the nicest thing to do but at least it won't break
anything.
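
In pseudo-code, the load path would then look roughly like this (a sketch
only; zpool_can_sleep_mapped() is the proposed flag accessor and tmp a
temporary buffer, neither of which exists yet):

	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
		memcpy(tmp, src, entry->length);	/* copy out while mapped */
		zpool_unmap_handle(entry->pool->zpool, entry->handle);
		src = tmp;				/* decompress from the copy */
	}
	mutex_lock(acomp_ctx->mutex);
	/* ... decompress from src; unmap afterwards only in the mapped case ... */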

~Vitaly

> > zswap, on the other hand, may be dealing with some new backends in
> > future which have more chances to become mainstream. Imagine typical
> > NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
> > unused video memory. In such a case if you try to use a pointer to an
> > invalidated zpool mapping, you are on the way to thrash the system.
> > So: no assumptions that the zswap pool is in regular linear RAM should
> > be made.
> >
> > ~Vitaly
>
> Thanks
> Barry
tiantao (H) Dec. 23, 2020, 12:44 p.m. UTC | #32
On 2020/12/23 8:11, Vitaly Wool wrote:
> On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> <song.bao.hua@hisilicon.com> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
>>> Sent: Tuesday, December 22, 2020 10:44 PM
>>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
>>> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
>>> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
>>> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
>>> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
>>> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
>>> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
>>> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>>>
>>> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
>>> <song.bao.hua@hisilicon.com> wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Song Bao Hua (Barry Song)
>>>>> Sent: Tuesday, December 22, 2020 3:03 PM
>>>>> To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
>>>>> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
>>> Mike
>>>>> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
>>>>> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
>>>>> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
>>>>> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
>>>>> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
>>>>> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>>>>>
>>>>>
>>>>>> I'm still not convinced. Will kmap what, src? At this point src might
>>> become
>>>>> just a bogus pointer.
>>>>>
>>>>> As long as the memory is still there, we can kmap it by its page struct.
>>> But
>>>>> if
>>>>> it is not there anymore, we have no way.
>>>>>
>>>>>> Why couldn't the object have been moved somewhere else (due to the compaction
>>>>> mechanism for instance)
>>>>>> at the time DMA kicks in?
>>>>> So zs_map_object() will guarantee the src won't be moved by holding those
>>>>> preemption-disabled lock?
>>>>> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
>>>>>
>>>> Or we can do get_page() to avoid the movement of the page.
>>>
>>> I would like to discuss this more in zswap context than zsmalloc's.
>>> Since zsmalloc does not implement reclaim callback, using it in zswap
>>> is a corner case anyway.
>> I see. But it seems we still need a solution for the compatibility
>> of zsmalloc and zswap? this will require change in either zsmalloc
>> or zswap.
>> or do you want to make zswap depend on !ZSMALLOC?
> No, I really don't think we should go that far. What if we add a flag
> to zpool, named like "can_sleep_mapped", and have it set for
> zbud/z3fold?
> Then zswap could go the current path if the flag is set; and if it's
> not set, and mutex_trylock fails, copy data from src to a temporary
> buffer, then unmap the handle, take the mutex, process the buffer
> instead of src. Not the nicest thing to do but at least it won't break
> anything.

I wrote the following patch according to your idea; what do you think?

--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
        struct zswap_entry *entry;
        struct scatterlist input, output;
        struct crypto_acomp_ctx *acomp_ctx;
-       u8 *src, *dst;
+       u8 *src, *dst, *tmp;
        unsigned int dlen;
        int ret;

@@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
        if (zpool_evictable(entry->pool->zpool))
                src += sizeof(struct zswap_header);

+       if (!zpool_can_sleep_mapped(entry->pool->zpool) && !mutex_trylock(acomp_ctx->mutex)) {
+               tmp = kmemdup(src, entry->length, GFP_ATOMIC);
+               zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
+               if (!tmp)
+                       goto freeentry;
+       }
        acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
        mutex_lock(acomp_ctx->mutex);
-       sg_init_one(&input, src, entry->length);
+       sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ? src : tmp, entry->length);
        sg_init_table(&output, 1);
        sg_set_page(&output, page, PAGE_SIZE, 0);
        acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
        ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
        mutex_unlock(acomp_ctx->mutex);

-       zpool_unmap_handle(entry->pool->zpool, entry->handle);
+       if (zpool_can_sleep_mapped(entry->pool->zpool))
+               zpool_unmap_handle(entry->pool->zpool, entry->handle);
+       else
+               kfree(tmp);
+
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)

 static struct zpool_driver zs_zpool_driver = {
        .type =                   "zsmalloc",
+       .sleep_mapped =           false,
        .owner =                  THIS_MODULE,
        .create =                 zs_zpool_create,
        .destroy =                zs_zpool_destroy,
>
> ~Vitaly
>
>>> zswap, on the other hand, may be dealing with some new backends in
>>> future which have more chances to become mainstream. Imagine typical
>>> NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
>>> unused video memory. In such a case if you try to use a pointer to an
>>> invalidated zpool mapping, you are on the way to thrash the system.
>>> So: no assumptions that the zswap pool is in regular linear RAM should
>>> be made.
>>>
>>> ~Vitaly
>> Thanks
>> Barry
Vitaly Wool Dec. 23, 2020, 6:25 p.m. UTC | #33
On Wed, Dec 23, 2020 at 1:44 PM tiantao (H) <tiantao6@huawei.com> wrote:
>
>
> On 2020/12/23 8:11, Vitaly Wool wrote:
> > On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> > <song.bao.hua@hisilicon.com> wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> >>> Sent: Tuesday, December 22, 2020 10:44 PM
> >>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> >>> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike
> >>> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> >>> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> >>> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> >>> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> >>> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> >>> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >>>
> >>> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> >>> <song.bao.hua@hisilicon.com> wrote:
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Song Bao Hua (Barry Song)
> >>>>> Sent: Tuesday, December 22, 2020 3:03 PM
> >>>>> To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> >>>>> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> >>> Mike
> >>>>> Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm
> >>>>> <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>;
> >>>>> NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky
> >>>>> <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> >>>>> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> >>>>> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >>>>>
> >>>>>
> >>>>>> I'm still not convinced. Will kmap what, src? At this point src might
> >>> become
> >>>>> just a bogus pointer.
> >>>>>
> >>>>> As long as the memory is still there, we can kmap it by its page struct.
> >>> But
> >>>>> if
> >>>>> it is not there anymore, we have no way.
> >>>>>
> >>>>>> Why couldn't the object have been moved somewhere else (due to the compaction
> >>>>> mechanism for instance)
> >>>>>> at the time DMA kicks in?
> >>>>> So zs_map_object() will guarantee the src won't be moved by holding those
> >>>>> preemption-disabled lock?
> >>>>> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> >>>>>
> >>>> Or we can do get_page() to avoid the movement of the page.
> >>>
> >>> I would like to discuss this more in zswap context than zsmalloc's.
> >>> Since zsmalloc does not implement reclaim callback, using it in zswap
> >>> is a corner case anyway.
> >> I see. But it seems we still need a solution for the compatibility
> >> of zsmalloc and zswap? this will require change in either zsmalloc
> >> or zswap.
> >> or do you want to make zswap depend on !ZSMALLOC?
> > No, I really don't think we should go that far. What if we add a flag
> > to zpool, named like "can_sleep_mapped", and have it set for
> > zbud/z3fold?
> > Then zswap could go the current path if the flag is set; and if it's
> > not set, and mutex_trylock fails, copy data from src to a temporary
> > buffer, then unmap the handle, take the mutex, process the buffer
> > instead of src. Not the nicest thing to do but at least it won't break
> > anything.
>
> write the following patch according to your idea, what do you think ?

Yep, that is basically what I was thinking of. Some nitpicks below:

> --- a/mm/zswap.c
>
> +++ b/mm/zswap.c
> @@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>          struct zswap_entry *entry;
>          struct scatterlist input, output;
>          struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst;
> +       u8 *src, *dst, *tmp;
>          unsigned int dlen;
>          int ret;
>
> @@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>          if (zpool_evictable(entry->pool->zpool))
>                  src += sizeof(struct zswap_header);
>
> +       if (!zpool_can_sleep_mapped(entry->pool->zpool) && !mutex_trylock(acomp_ctx->mutex)) {
> +               tmp = kmemdup(src, entry->length, GFP_ATOMIC);

kmemdup? just use memcpy :)

> +               zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
> +               if (!tmp)
> +                       goto freeentry;

Jumping to freeentry results in returning success, which isn't
appropriate here. You should return -ENOMEM in this case.

> +       }
>          acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>          mutex_lock(acomp_ctx->mutex);
> -       sg_init_one(&input, src, entry->length);
> +       sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ?
> src : tmp, entry->length);

This is kind of hard to read; I would rather point src at tmp after the
memcpy, and then no condition check would be needed here.

>          sg_init_table(&output, 1);
>          sg_set_page(&output, page, PAGE_SIZE, 0);
>          acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
>          ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
>          mutex_unlock(acomp_ctx->mutex);
>
> -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +       if (zpool_can_sleep_mapped(entry->pool->zpool))
> +               zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +       else
> +               kfree(tmp);
> +
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)
>
>   static struct zpool_driver zs_zpool_driver = {
>          .type =                   "zsmalloc",
> +       .sleep_mapped =           false,
>          .owner =                  THIS_MODULE,
>          .create =                 zs_zpool_create,
>          .destroy =                zs_zpool_destroy,

Best regards,
   Vitaly

> >
> > ~Vitaly
> >
> >>> zswap, on the other hand, may be dealing with some new backends in
> >>> future which have more chances to become mainstream. Imagine typical
> >>> NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
> >>> unused video memory. In such a case if you try to use a pointer to an
> >>> invalidated zpool mapping, you are on the way to thrash the system.
> >>> So: no assumptions that the zswap pool is in regular linear RAM should
> >>> be made.
> >>>
> >>> ~Vitaly
> >> Thanks
> >> Barry
>
Sebastian Andrzej Siewior Jan. 14, 2021, 4:17 p.m. UTC | #34
On 2020-12-20 08:21:37 [+0100], Vitaly Wool wrote:
> Not really because bit spinlock leaves preemption disabled.

It leaves it disabled for a reason. Now you just spin until the original
context gets back on the CPU. On UP with preemption, if the "lock owner"
gets preempted, the next lock attempt will lock-up the system.
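
Spelling that out against the pin_tag() hunk at the bottom of this thread:

	preempt_disable();
	while (test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
		cpu_relax();	/* UP: the bit holder was preempted before we
				 * disabled preemption; it can never run again
				 * to clear the bit, so this never terminates */
	preempt_enable();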

Sebastian
Sebastian Andrzej Siewior Jan. 14, 2021, 4:18 p.m. UTC | #35
On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > write the following patch according to your idea, what do you think ?
> 
> Yep, that is basically what I was thinking of. Some nitpicks below:

Did this go somewhere? The thread just ends here on my end.
Mike, is this patch fixing / helping your case in any way?

> Best regards,
>    Vitaly

Sebastian
Vitaly Wool Jan. 14, 2021, 4:29 p.m. UTC | #36
On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
<bigeasy@linutronix.de> wrote:
>
> On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > write the following patch according to your idea, what do you think ?
> >
> > Yep, that is basically what I was thinking of. Some nitpicks below:
>
> Did this go somewhere? The thread just ends here on my end.
> Mike, is this patch fixing / helping your case in anyway?

Please see
* https://marc.info/?l=linux-mm&m=160889419514019&w=2
* https://marc.info/?l=linux-mm&m=160889418114011&w=2
* https://marc.info/?l=linux-mm&m=160889448814057&w=2

Haven't had time to test these yet but seem to be alright.

Best regards,
   Vitaly
Sebastian Andrzej Siewior Jan. 14, 2021, 4:56 p.m. UTC | #37
On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> <bigeasy@linutronix.de> wrote:
> >
> > On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > > write the following patch according to your idea, what do you think ?
> > >
> > > Yep, that is basically what I was thinking of. Some nitpicks below:
> >
> > Did this go somewhere? The thread just ends here on my end.
> > Mike, is this patch fixing / helping your case in anyway?
> 
> Please see
> * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> * https://marc.info/?l=linux-mm&m=160889448814057&w=2

Thank you, that would be
   1608894171-54174-1-git-send-email-tiantao6@hisilicon.com

for b4 compatibility :)

> Haven't had time to test these yet but seem to be alright.

So zs_map_object() still disables preemption but the mutex part is
avoided by the patch?

> Best regards,
>    Vitaly

Sebastian
Vitaly Wool Jan. 14, 2021, 5:15 p.m. UTC | #38
On Thu, Jan 14, 2021 at 5:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> > On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > > > write the following patch according to your idea, what do you think ?
> > > >
> > > > Yep, that is basically what I was thinking of. Some nitpicks below:
> > >
> > > Did this go somewhere? The thread just ends here on my end.
> > > Mike, is this patch fixing / helping your case in anyway?
> >
> > Please see
> > * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> > * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> > * https://marc.info/?l=linux-mm&m=160889448814057&w=2
>
> Thank you, that would be
>    1608894171-54174-1-git-send-email-tiantao6@hisilicon.com
>
> for b4 compatibility :)
>
> > Haven't had time to test these yet but seem to be alright.
>
> So zs_map_object() still disables preemption but the mutex part is
> avoided by the patch?

Basically, yes. Minchan was very clear that he didn't want to remove
that inter-function locking, so be it.
I wouldn't really advise using zsmalloc with zswap because zsmalloc
has no support for reclaim; nevertheless, I wouldn't like this
configuration to stop working for those who are already using it.

Would you or Mike be up for testing Tian Tao's patchset?

Best regards,
   Vitaly
Sebastian Andrzej Siewior Jan. 14, 2021, 5:18 p.m. UTC | #39
On 2021-01-14 18:15:08 [+0100], Vitaly Wool wrote:
> 
> Basically, yes. Minchan was very clear that he didn't want to remove
> that inter-function locking, so be it.
> I wouldn't really advise to use zsmalloc with zswap because zsmalloc
> has no support for reclaim, nevertheless I wouldn't like this
> configuration to stop working for those who are already using it.
> 
> Would you or Mike be up for testing Tian Taos's patchset?

I will try to reproduce Mike's original report and the fix early next
week.

> Best regards,
>    Vitaly

Sebastian
diff mbox series

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7289f502ffac..ff26546a7fed 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -876,22 +876,25 @@  static unsigned long obj_to_head(struct page *page, void *obj)
 
 static inline int testpin_tag(unsigned long handle)
 {
-	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+	return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static inline int trypin_tag(unsigned long handle)
 {
-	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
-static void pin_tag(unsigned long handle) __acquires(bitlock)
+static void pin_tag(unsigned long handle)
 {
-	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	preempt_disable();
+	while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
+		cpu_relax();
+	preempt_enable();
 }
 
 static void unpin_tag(unsigned long handle) __releases(bitlock)
 {
-	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void reset_page(struct page *page)