Message ID | 20201220002228.38697-1-vitaly.wool@konsulko.com (mailing list archive)
---|---
State | New, archived
Series | zsmalloc: do not use bit_spin_lock
On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

Isn't this just "I open coded bit spinlock to make the lockdep
warnings go away"?
On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

It also does get_cpu_var() in map(), put_cpu_var() in unmap().

	-Mike
On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.
>
> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
>  {
> -	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	preempt_disable();
> +	while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> +		cpu_relax();
> +	preempt_enable();
>  }

If try doesn't need to disable preemption, neither does pin.

	-Mike
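The simplification Mike is hinting at would look roughly like this (a
sketch of his remark, not a posted patch): trypin_tag() in the patch
takes the bit without touching the preempt count, so pin_tag() can spin
on the same primitive directly:

```c
/* Sketch of Mike's remark, not a posted patch: trypin_tag() does not
 * fiddle with preemption, so the spinning variant need not either. */
static void pin_tag(unsigned long handle)
{
	while (test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
		cpu_relax();
}
```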
On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

Bah, I forgot to mention the config dependent rwlock, it's held across
map()/unmap() as well, so there are two more hurdles, not one.

	-Mike
On Sun, Dec 20, 2020 at 2:18 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> Isn't this just "I open coded bit spinlock to make the lockdep
> warnings go away"?

Not really, because bit spinlock leaves preemption disabled.
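For context: bit_spin_lock() keeps preemption disabled for the whole
time the bit is held, which is the difference Vitaly is pointing at.
Paraphrased from include/linux/bit_spinlock.h (SMP case), not quoted
verbatim:

```c
static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	/* Preemption goes off here and stays off on success ... */
	preempt_disable();
	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
		preempt_enable();
		do {
			cpu_relax();
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
	/* ... until bit_spin_unlock() re-enables it, possibly much
	 * later in zs_unmap_object(). The patch above re-enables
	 * preemption as soon as the bit is taken, which is the
	 * behavioural change under discussion. */
	__acquire(bitlock);
}
```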
On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

That aside, the bit spinlock removal seems to hold up to beating in RT.
I stripped out the RT changes to replace the bit spinlocks, applied the
still needed atm might_sleep() fix, and ltp zram and zswap tests are
running in a loop with no signs that it's a bad idea, so I hope that
makes it in (minus the preempt disabled spin which I whacked), as it
makes zsmalloc markedly more RT friendly.

RT changes go from:
 1 file changed, 79 insertions(+), 6 deletions(-)
to:
 1 file changed, 8 insertions(+), 3 deletions(-)

	-Mike
> -----Original Message-----
> From: Mike Galbraith [mailto:efault@gmx.de]
> Sent: Sunday, December 20, 2020 8:48 PM
> To: Vitaly Wool <vitaly.wool@konsulko.com>; LKML <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>
> Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; Minchan Kim <minchan@kernel.org>; NitinGupta <ngupta@vflare.org>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> That aside, the bit spinlock removal seems to hold up to beating in RT.
> I stripped out the RT changes to replace the bit spinlocks, applied the
> still needed atm might_sleep() fix, and ltp zram and zswap tests are
> running in a loop with no signs that it's a bad idea, so I hope that
> makes it in (minus the preempt disabled spin which I whacked), as it
> makes zsmalloc markedly more RT friendly.
>
> RT changes go from:
>  1 file changed, 79 insertions(+), 6 deletions(-)
> to:
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> 	-Mike

Sorry, would you like to show the change for
"8 insertions(+), 3 deletions(-)"?

BTW, your original patch looks not right as
crypto_wait_req(crypto_acomp_decompress()...)
can sleep too.

[copy from your original patch with comment]

--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned

 	/* decompress */
 	dlen = PAGE_SIZE;
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
 	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
 	if (zpool_evictable(entry->pool->zpool))
 		src += sizeof(struct zswap_header);

-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);

	/*!!!!!!!!!!!!!!!!
	 * here crypto could sleep
	 !!!!!!!!!!!!!!*/
 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
-	mutex_unlock(acomp_ctx->mutex);
 	zpool_unmap_handle(entry->pool->zpool, entry->handle);
+	mutex_unlock(acomp_ctx->mutex);

 	BUG_ON(ret);
 freeentry:
[end]

so I guess we have to fix zsmalloc.

Thanks
Barry
On Sun, 2020-12-20 at 21:20 +0000, Song Bao Hua (Barry Song) wrote:
> > That aside, the bit spinlock removal seems to hold up to beating in RT.
> > I stripped out the RT changes to replace the bit spinlocks, applied the
> > still needed atm might_sleep() fix, and ltp zram and zswap tests are
> > running in a loop with no signs that it's a bad idea, so I hope that
> > makes it in (minus the preempt disabled spin which I whacked), as it
> > makes zsmalloc markedly more RT friendly.
> >
> > RT changes go from:
> >  1 file changed, 79 insertions(+), 6 deletions(-)
> > to:
> >  1 file changed, 8 insertions(+), 3 deletions(-)
>
> Sorry, would you like to show the change for
> "8 insertions(+), 3 deletions(-)"?

Sure.

---
 mm/zsmalloc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -57,6 +57,7 @@
 #include <linux/wait.h>
 #include <linux/pagemap.h>
 #include <linux/fs.h>
+#include <linux/local_lock.h>

 #define ZSPAGE_MAGIC	0x58

@@ -293,6 +294,7 @@ struct zspage {
 };

 struct mapping_area {
+	local_lock_t lock;
 	char *vm_buf; /* copy buffer for objects that span pages */
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
@@ -455,7 +457,9 @@ MODULE_ALIAS("zpool-zsmalloc");
 #endif /* CONFIG_ZPOOL */

 /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
+static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = {
+	.lock	= INIT_LOCAL_LOCK(lock),
+};

 static bool is_zspage_isolated(struct zspage *zspage)
 {
@@ -1276,7 +1280,8 @@ void *zs_map_object(struct zs_pool *pool
 	class = pool->size_class[class_idx];
 	off = (class->size * obj_idx) & ~PAGE_MASK;

-	area = &get_cpu_var(zs_map_area);
+	local_lock(&zs_map_area.lock);
+	area = this_cpu_ptr(&zs_map_area);
 	area->vm_mm = mm;
 	if (off + class->size <= PAGE_SIZE) {
 		/* this object is contained entirely within a page */
@@ -1330,7 +1335,7 @@ void zs_unmap_object(struct zs_pool *poo
 		__zs_unmap_object(area, pages, off, class->size);
 	}
-	put_cpu_var(zs_map_area);
+	local_unlock(&zs_map_area.lock);

 	migrate_read_unlock(zspage);
 	unpin_tag(handle);

> BTW, your original patch looks not right as
> crypto_wait_req(crypto_acomp_decompress()...)
> can sleep too.
>
> [copy from your original patch with comment]
> [...]
> 	/*!!!!!!!!!!!!!!!!
> 	 * here crypto could sleep
> 	 !!!!!!!!!!!!!!*/
> [...]

Hohum, another one for my Bitmaster-9000 patch shredder.

	-Mike
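As a standalone illustration of the local_lock pattern the diff applies
to zs_map_area (all names below are made up for the sketch): on
!PREEMPT_RT, local_lock() boils down to preempt_disable(), while on
PREEMPT_RT it becomes a per-CPU sleeping lock, so the mapped section
stays preemptible while keeping per-CPU exclusion:

```c
#include <linux/local_lock.h>
#include <linux/percpu.h>

/* Illustrative stand-in for struct mapping_area */
struct demo_area {
	local_lock_t lock;
	char *buf;
};

static DEFINE_PER_CPU(struct demo_area, demo_area) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void demo_use(void)
{
	struct demo_area *area;

	/* !PREEMPT_RT: preempt_disable(); PREEMPT_RT: per-CPU spinlock */
	local_lock(&demo_area.lock);
	area = this_cpu_ptr(&demo_area);
	/* ... use area->buf, safe against other users of this CPU's area ... */
	local_unlock(&demo_area.lock);
}
```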
On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

I don't want to use such open code for the lock.

I see from Mike's patch that the recent zswap change introduced the
lockdep splat bug, and you want to improve zsmalloc to fix the zswap
bug and introduce this patch, which allows enabling preemption.

https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/

zs_[un/map]_object is designed to be used in fast path (i.e.,
zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
perfectly fine from an API point of view. However, zswap introduced
use of the API with mutex_lock/crypto_wait_req, where preemption is
allowed, which was wrong.

Furthermore, zs_map_object already has a few more places where it
disables preemption (migrate_read_lock, get_cpu_var and kmap_atomic).

Without making those locks preemptible all at once, zswap will still
see the lockdep warning.

> Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
> ---
>  mm/zsmalloc.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7289f502ffac..ff26546a7fed 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)
>
>  static inline int testpin_tag(unsigned long handle)
>  {
> -	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>
>  static inline int trypin_tag(unsigned long handle)
>  {
> -	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>
> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
>  {
> -	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	preempt_disable();
> +	while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> +		cpu_relax();
> +	preempt_enable();
>  }
>
>  static void unpin_tag(unsigned long handle) __releases(bitlock)
>  {
> -	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> +	clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
>  }
>
>  static void reset_page(struct page *page)
> --
> 2.20.1
On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> I don't want to use such open code for the lock.
>
> I see from Mike's patch that the recent zswap change introduced the
> lockdep splat bug, and you want to improve zsmalloc to fix the zswap
> bug and introduce this patch, which allows enabling preemption.

This understanding is upside down. The code in zswap you are referring
to is not buggy. You may claim that it is suboptimal but there is
nothing wrong in taking a mutex.

> https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
>
> zs_[un/map]_object is designed to be used in fast path (i.e.,
> zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> perfectly fine from an API point of view. However, zswap introduced
> use of the API with mutex_lock/crypto_wait_req, where preemption is
> allowed, which was wrong.

Taking a spinlock in one callback and releasing it in another is
unsafe and error prone. What if unmap was called on completion of a
DMA-like transfer from another context, like a threaded IRQ handler?
In that case this spinlock might never be released.

Anyway I can come up with a zswap patch explicitly stating that
zsmalloc is not fully compliant with the zswap / zpool API to avoid
confusion for the time being. Would that be ok with you?

Best regards,
   Vitaly

> Furthermore, zs_map_object already has a few more places where it
> disables preemption (migrate_read_lock, get_cpu_var and kmap_atomic).
>
> Without making those locks preemptible all at once, zswap will still
> see the lockdep warning.
>
> > Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
> > [...]
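A hypothetical usage sketching the scenario Vitaly describes;
start_dma() and its completion wiring are assumed helpers, nothing like
this exists in zsmalloc itself:

```c
struct dma_ctx {
	struct zs_pool *pool;
	unsigned long handle;
};

/* Runs later, from a threaded IRQ handler: releases the per-handle
 * lock that submit_decompress() took below. If this completion never
 * fires, the lock is never dropped. */
static void decompress_done(struct dma_ctx *ctx)
{
	zs_unmap_object(ctx->pool, ctx->handle);
}

static void submit_decompress(struct dma_ctx *ctx)
{
	/* Bit spinlock taken here, preemption off on return ... */
	void *src = zs_map_object(ctx->pool, ctx->handle, ZS_MM_RO);

	start_dma(src, decompress_done, ctx);	/* assumed async helper */
	/* ... and still held when this function returns. */
}
```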
On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> wrote:
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> > [...]
> > I see from Mike's patch that the recent zswap change introduced the
> > lockdep splat bug, and you want to improve zsmalloc to fix the zswap
> > bug and introduce this patch, which allows enabling preemption.
>
> This understanding is upside down. The code in zswap you are referring
> to is not buggy. You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.

Is this suboptimal for all or just the hardware accelerators? Sorry, I
am not very familiar with the crypto API. If I select lzo or lz4 as a
zswap compressor will the [de]compression be async or sync?

> > zs_[un/map]_object is designed to be used in fast path (i.e.,
> > zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> > perfectly fine from an API point of view. However, zswap introduced
> > use of the API with mutex_lock/crypto_wait_req, where preemption is
> > allowed, which was wrong.
>
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
>
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with the zswap / zpool API

The documentation of zpool_map_handle() clearly states "This may hold
locks, disable interrupts, and/or preemption, ...", so how come
zsmalloc is not fully compliant?

> to avoid
> confusion for the time being. Would that be ok with you?
>
> Best regards,
> Vitaly
> -----Original Message-----
> From: Shakeel Butt [mailto:shakeelb@google.com]
> Sent: Tuesday, December 22, 2020 8:50 AM
> To: Vitaly Wool <vitaly.wool@konsulko.com>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> [...]
>
> Is this suboptimal for all or just the hardware accelerators? Sorry, I
> am not very familiar with the crypto API. If I select lzo or lz4 as a
> zswap compressor will the [de]compression be async or sync?

Right now, in the crypto subsystem, new drivers are required to be
written against the async APIs. The old sync API can't work in new
accelerator drivers as it is not supported at all.

The old drivers are sync, but they've got async wrappers to support the
async APIs, e.g.:

crypto: acomp - add support for lz4 via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d

crypto: acomp - add support for lzo via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e

So they are supporting the async APIs but they are still working in
sync mode, as those old drivers don't sleep.

> > Anyway I can come up with a zswap patch explicitly stating that
> > zsmalloc is not fully compliant with the zswap / zpool API
>
> The documentation of zpool_map_handle() clearly states "This may hold
> locks, disable interrupts, and/or preemption, ...", so how come
> zsmalloc is not fully compliant?

zbud and z3fold haven't really done this. If we hold a spinlock before
entering zswap and release the spinlock after calling zswap, this will
put zswap in an atomic context which isn't necessarily needed.

> > to avoid
> > confusion for the time being. Would that be ok with you?

Thanks
Barry
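For readers less familiar with the acomp interface under discussion,
this is the call pattern zswap uses, condensed from the
zswap_frontswap_load() excerpt earlier in the thread (a sketch: the
parameter types follow the structures in mm/zswap.c of that time, and
error handling is omitted):

```c
/* Condensed from the zswap_frontswap_load() excerpt above (sketch). */
static int demo_acomp_decompress(struct zswap_entry *entry,
				 struct crypto_acomp_ctx *acomp_ctx,
				 void *src, struct page *page)
{
	struct scatterlist input, output;
	unsigned int dlen = PAGE_SIZE;

	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output,
				 entry->length, dlen);

	/* Async drivers return -EINPROGRESS/-EBUSY here and complete
	 * later; crypto_wait_req() then sleeps on a completion. This is
	 * why the call must not sit inside an atomic section opened by
	 * zpool_map_handle(). */
	return crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			       &acomp_ctx->wait);
}
```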
On Mon, Dec 21, 2020 at 08:20:26PM +0100, Vitaly Wool wrote:
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> > [...]
> > I see from Mike's patch that the recent zswap change introduced the
> > lockdep splat bug, and you want to improve zsmalloc to fix the zswap
> > bug and introduce this patch, which allows enabling preemption.
>
> This understanding is upside down. The code in zswap you are referring
> to is not buggy. You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.

No, it surely breaks zswap, since zpool/zsmalloc has worked like this,
and now you are saying "nothing wrong" even though it breaks the rule.

> > https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.camel@gmx.de/
> >
> > zs_[un/map]_object is designed to be used in fast path (i.e.,
> > zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
> > perfectly fine from an API point of view. However, zswap introduced
> > use of the API with mutex_lock/crypto_wait_req, where preemption is
> > allowed, which was wrong.
>
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
>
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with the zswap / zpool API to avoid
> confusion for the time being. Would that be ok with you?

It's your call, since you are the maintainer of zswap now and you are
breaking the rule we have kept for a long time.

> > Furthermore, zs_map_object already has a few more places where it
> > disables preemption (migrate_read_lock, get_cpu_var and kmap_atomic).
> >
> > Without making those locks preemptible all at once, zswap will still
> > see the lockdep warning.
> >
> > > Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
> > > [...]
On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> wrote:
> [...]
> Right now, in the crypto subsystem, new drivers are required to be
> written against the async APIs. The old sync API can't work in new
> accelerator drivers as it is not supported at all.
>
> The old drivers are sync, but they've got async wrappers to support
> the async APIs, e.g.:
>
> crypto: acomp - add support for lz4 via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
>
> crypto: acomp - add support for lzo via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
>
> So they are supporting the async APIs but they are still working in
> sync mode, as those old drivers don't sleep.

Good to know that those are sync because I want them to be sync.
Please note that zswap is a cache in front of a real swap and the load
operation is latency sensitive as it comes in the page fault path and
directly impacts the applications. I doubt that synchronously
decompressing a 4k page on a cpu will be costlier than asynchronously
decompressing the same page on a hardware accelerator.
> -----Original Message-----
> From: Shakeel Butt [mailto:shakeelb@google.com]
> Sent: Tuesday, December 22, 2020 10:03 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> [...]
>
> Good to know that those are sync because I want them to be sync.
> Please note that zswap is a cache in front of a real swap and the load
> operation is latency sensitive as it comes in the page fault path and
> directly impacts the applications. I doubt that synchronously
> decompressing a 4k page on a cpu will be costlier than asynchronously
> decompressing the same page on a hardware accelerator.

If you read the old paper:
https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality

"Because the hardware accelerator speeds up compression, looking at the
zswap metrics we observed that there were more store and load requests
in a given amount of time, which filled up the zswap pool faster than a
software compression run. Because of this behavior, we set the
max_pool_percent parameter to 30 for the hardware compression runs -
this means that zswap can use up to 30% of the 10GB of total memory."

So using hardware accelerators, we get a chance to speed up compression
while decreasing cpu utilization.

BTW, if it is not easy to change zsmalloc, one quick workaround we
might do in zswap is adding the below after applying Mike's original
patch:

	if (in_atomic()) /* for zsmalloc */
		while (!try_wait_for_completion(&req->done))
			;
	else /* for zbud, z3fold */
		crypto_wait_req(....);

crypto_wait_req() is actually doing wait_for_completion():

static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
	switch (err) {
	case -EINPROGRESS:
	case -EBUSY:
		wait_for_completion(&wait->completion);
		reinit_completion(&wait->completion);
		err = wait->err;
		break;
	}

	return err;
}

Thanks
Barry
On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> wrote:
> [...]
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
>
> "Because the hardware accelerator speeds up compression, looking at the
> zswap metrics we observed that there were more store and load requests
> in a given amount of time, which filled up the zswap pool faster than a
> software compression run. [...]"
>
> So using hardware accelerators, we get a chance to speed up compression
> while decreasing cpu utilization.
>
> BTW, if it is not easy to change zsmalloc, one quick workaround we
> might do in zswap is adding the below after applying Mike's original
> patch:
>
> 	if (in_atomic()) /* for zsmalloc */
> 		while (!try_wait_for_completion(&req->done))
> 			;
> 	else /* for zbud, z3fold */
> 		crypto_wait_req(....);

I don't think I'm going to ack this, sorry.

Best regards,
   Vitaly

> crypto_wait_req() is actually doing wait_for_completion():
> [...]
> -----Original Message-----
> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> Sent: Tuesday, December 22, 2020 11:12 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> [...]
>
> > BTW, if it is not easy to change zsmalloc, one quick workaround we
> > might do in zswap is adding the below after applying Mike's original
> > patch:
> >
> > 	if (in_atomic()) /* for zsmalloc */
> > 		while (!try_wait_for_completion(&req->done))
> > 			;
> > 	else /* for zbud, z3fold */
> > 		crypto_wait_req(....);
>
> I don't think I'm going to ack this, sorry.

Fair enough. And I am also thinking about whether we can move
zpool_unmap_handle() to right after zpool_map_handle() as below:

 	dlen = PAGE_SIZE;
 	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
 	if (zpool_evictable(entry->pool->zpool))
 		src += sizeof(struct zswap_header);
+	zpool_unmap_handle(entry->pool->zpool, entry->handle);

 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
 	mutex_unlock(acomp_ctx->mutex);
-	zpool_unmap_handle(entry->pool->zpool, entry->handle);

Since src is always low memory and we only need its virtual address to
get the page of src in sg_init_one(), we don't actually read it by CPU
anywhere.

> Best regards,
> Vitaly
>
> > crypto_wait_req() is actually doing wait_for_completion():
> > [...]

Thanks
Barry
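Barry's reasoning rests on sg_init_one() only recording page, offset
and length without ever dereferencing the buffer. Paraphrased from
include/linux/scatterlist.h (the debug-build sanity check is omitted):

```c
/* Paraphrased from include/linux/scatterlist.h: sg_init_one() reduces
 * to sg_set_buf(), which stores the page, length and offset of the
 * buffer -- the CPU never reads through 'buf' here, so a still-valid
 * lowmem virtual address is all that is needed. */
static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}
```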
On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> wrote:
> [...]
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
>
> "Because the hardware accelerator speeds up compression, looking at the
> zswap metrics we observed that there were more store and load requests
> in a given amount of time, which filled up the zswap pool faster than a
> software compression run. [...]"
>
> So using hardware accelerators, we get a chance to speed up compression
> while decreasing cpu utilization.

I don't care much about the compression. It's the decompression, or
more specifically the latency of decompression, I really care about.

Compression happens on reclaim, so latency is not really an issue.
Reclaim can be pressure-based or proactive. I think async batched
compression by accelerators makes a lot of sense, though I doubt zswap
is the right layer for that. To me, adding "async batched compression
support by accelerators" in zram looks more natural, as the kernel
already has async block I/O support.

For decompression, I would like as low latency as possible, which I
think is only possible by doing decompression on a cpu synchronously.
> -----Original Message----- > From: Shakeel Butt [mailto:shakeelb@google.com] > Sent: Tuesday, December 22, 2020 11:46 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim <minchan@kernel.org>; > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > <akpm@linux-foundation.org> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song) > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > -----Original Message----- > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > <minchan@kernel.org>; > > > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > linux-mm > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > <akpm@linux-foundation.org> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith <efault@gmx.de>; > LKML > > > > > <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; Song > Bao > > > Hua > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej Siewior > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey > > > Senozhatsky > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > <akpm@linux-foundation.org> > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> > > > wrote: > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote: > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases > it > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining > > > > > > > > about scheduling in atomic context. > > > > > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove that > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug > and > > > > > > > introduce this patch with allowing preemption enabling. > > > > > > > > > > > > This understanding is upside down. The code in zswap you are referring > > > > > > to is not buggy. You may claim that it is suboptimal but there is > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? 
Sorry, > I > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as > a > > > > > zswap compressor will the [de]compression be async or sync? > > > > > > > > Right now, in crypto subsystem, new drivers are required to write based > on > > > > async APIs. The old sync API can't work in new accelerator drivers as > they > > > > are not supported at all. > > > > > > > > Old drivers are used to sync, but they've got async wrappers to support > async > > > > APIs. Eg. > > > > crypto: acomp - add support for lz4 via scomp > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d > > > > > > > > crypto: acomp - add support for lzo via scomp > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e > > > > > > > > so they are supporting async APIs but they are still working in sync mode > > > as > > > > those old drivers don't sleep. > > > > > > > > > > Good to know that those are sync because I want them to be sync. > > > Please note that zswap is a cache in front of a real swap and the load > > > operation is latency sensitive as it comes in the page fault path and > > > directly impacts the applications. I doubt decompressing synchronously > > > a 4k page on a cpu will be costlier than asynchronously decompressing > > > the same page from hardware accelerators. > > > > If you read the old paper: > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit > y > > Because the hardware accelerator speeds up compression, looking at the zswap > > metrics we observed that there were more store and load requests in a given > > amount of time, which filled up the zswap pool faster than a software > > compression run. Because of this behavior, we set the max_pool_percent > > parameter to 30 for the hardware compression runs - this means that zswap > > can use up to 30% of the 10GB of total memory. > > > > So using hardware accelerators, we get a chance to speed up compression > > while decreasing cpu utilization. > > > > I don't care much about the compression. It's the decompression or > more specifically the latency of decompression I really care about. > > Compression happens on reclaim, so latency is not really an issue. > Reclaim can be pressure-based or proactive. I think async batched > compression by accelerators makes a lot of sense. Though I doubt zswap > is the right layer for that. To me adding "async batched compression > support by accelerators" in zram looks more natural as the kernel > already has async block I/O support. Yep. zram is one of the targets I have thought about to support acomp. > > For decompression, I would like as low latency as possible which I > think is only possible by doing decompression on a cpu synchronously. One possibility is that we change HW accelerator driver to be sync polling for decompression. But this still depends on async api as this is the framework nowadays, the difference would be the driver won't really block. crypto_wait_req() will return without actual sleep. Thanks Barry
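A rough sketch of the sync-polling idea on the driver side (hw_submit_job(), hw_job_done() and hw_job_status() are made-up stand-ins for a real device's queue and status registers):

	/* Hypothetical: complete the acomp request inline by polling the
	 * device instead of waiting for its IRQ.  The caller still goes
	 * through the async API, but the driver returns 0 rather than
	 * -EINPROGRESS, so crypto_wait_req() never actually sleeps. */
	static int hw_acomp_decompress(struct acomp_req *req)
	{
		struct hw_dev *hw = to_hw_dev(req);	/* hypothetical helper */

		hw_submit_job(hw, req);
		while (!hw_job_done(hw))
			cpu_relax();			/* spin, don't sleep */

		return hw_job_status(hw);
	}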
> -----Original Message----- > From: Song Bao Hua (Barry Song) > Sent: Tuesday, December 22, 2020 11:38 AM > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > <akpm@linux-foundation.org> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > -----Original Message----- > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > Sent: Tuesday, December 22, 2020 11:12 AM > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > Mike > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > <akpm@linux-foundation.org> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > -----Original Message----- > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > > <minchan@kernel.org>; > > > > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > > linux-mm > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > <akpm@linux-foundation.org> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith <efault@gmx.de>; > > LKML > > > > > > <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; Song > > Bao > > > > Hua > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej Siewior > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey > > > > Senozhatsky > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > <akpm@linux-foundation.org> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> > > > > wrote: > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> > wrote: > > > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote: > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases > > it > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining > > > > > > > > > about scheduling in atomic context. 
> > > > > > > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove > that > > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug > > and > > > > > > > > introduce this patch with allowing preemption enabling. > > > > > > > > > > > > > > This understanding is upside down. The code in zswap you are referring > > > > > > > to is not buggy. You may claim that it is suboptimal but there is > > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry, > > I > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as > > a > > > > > > zswap compressor will the [de]compression be async or sync? > > > > > > > > > > Right now, in crypto subsystem, new drivers are required to write based > > on > > > > > async APIs. The old sync API can't work in new accelerator drivers as > > they > > > > > are not supported at all. > > > > > > > > > > Old drivers are used to sync, but they've got async wrappers to support > > async > > > > > APIs. Eg. > > > > > crypto: acomp - add support for lz4 via scomp > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d > > > > > > > > > > crypto: acomp - add support for lzo via scomp > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e > > > > > > > > > > so they are supporting async APIs but they are still working in sync > mode > > > > as > > > > > those old drivers don't sleep. > > > > > > > > > > > > > Good to know that those are sync because I want them to be sync. > > > > Please note that zswap is a cache in front of a real swap and the load > > > > operation is latency sensitive as it comes in the page fault path and > > > > directly impacts the applications. I doubt decompressing synchronously > > > > a 4k page on a cpu will be costlier than asynchronously decompressing > > > > the same page from hardware accelerators. > > > > > > If you read the old paper: > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit > > y > > > Because the hardware accelerator speeds up compression, looking at the zswap > > > metrics we observed that there were more store and load requests in a given > > > amount of time, which filled up the zswap pool faster than a software > > > compression run. Because of this behavior, we set the max_pool_percent > > > parameter to 30 for the hardware compression runs - this means that zswap > > > can use up to 30% of the 10GB of total memory. > > > > > > So using hardware accelerators, we get a chance to speed up compression > > > while decreasing cpu utilization. > > > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we might > do > > > in zswap is adding the below after applying Mike's original patch: > > > > > > if(in_atomic()) /* for zsmalloc */ > > > while(!try_wait_for_completion(&req->done); > > > else /* for zbud, z3fold */ > > > crypto_wait_req(....); > > > > I don't think I'm going to ack this, sorry. > > > > Fair enough. 
And I am also thinking if we can move zpool_unmap_handle() > quite after zpool_map_handle() as below: > > dlen = PAGE_SIZE; > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); > if (zpool_evictable(entry->pool->zpool)) > src += sizeof(struct zswap_header); > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > mutex_lock(acomp_ctx->mutex); > sg_init_one(&input, src, entry->length); > sg_init_table(&output, 1); > sg_set_page(&output, page, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, > dlen); > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > &acomp_ctx->wait); > mutex_unlock(acomp_ctx->mutex); > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > Since src is always low memory and we only need its virtual address > to get the page of src in sg_init_one(). We don't actually read it > by CPU anywhere. The below code might be better: dlen = PAGE_SIZE; src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); if (zpool_evictable(entry->pool->zpool)) src += sizeof(struct zswap_header); acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); + zpool_unmap_handle(entry->pool->zpool, entry->handle); mutex_lock(acomp_ctx->mutex); sg_init_one(&input, src, entry->length); sg_init_table(&output, 1); sg_set_page(&output, page, PAGE_SIZE, 0); acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); mutex_unlock(acomp_ctx->mutex); - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > Best regards, > > Vitaly > > > > > crypto_wait_req() is actually doing wait_for_completion(): > > > static inline int crypto_wait_req(int err, struct crypto_wait *wait) > > > { > > > switch (err) { > > > case -EINPROGRESS: > > > case -EBUSY: > > > wait_for_completion(&wait->completion); > > > reinit_completion(&wait->completion); > > > err = wait->err; > > > break; > > > } > > > > > > return err; > > > } Thanks Barry
On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> wrote: > > > > > -----Original Message----- > > From: Song Bao Hua (Barry Song) > > Sent: Tuesday, December 22, 2020 11:38 AM > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > <akpm@linux-foundation.org> > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > -----Original Message----- > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > > Sent: Tuesday, December 22, 2020 11:12 AM > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > > Mike > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > <akpm@linux-foundation.org> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > > > <minchan@kernel.org>; > > > > > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > > > linux-mm > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > <akpm@linux-foundation.org> > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith <efault@gmx.de>; > > > LKML > > > > > > > <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; Song > > > Bao > > > > > Hua > > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej Siewior > > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey > > > > > Senozhatsky > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > > <akpm@linux-foundation.org> > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <vitaly.wool@konsulko.com> > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> > > wrote: > > > > > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote: > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback 
and releases > > > it > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining > > > > > > > > > > about scheduling in atomic context. > > > > > > > > > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove > > that > > > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug > > > and > > > > > > > > > introduce this patch with allowing preemption enabling. > > > > > > > > > > > > > > > > This understanding is upside down. The code in zswap you are referring > > > > > > > > to is not buggy. You may claim that it is suboptimal but there is > > > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry, > > > I > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as > > > a > > > > > > > zswap compressor will the [de]compression be async or sync? > > > > > > > > > > > > Right now, in crypto subsystem, new drivers are required to write based > > > on > > > > > > async APIs. The old sync API can't work in new accelerator drivers as > > > they > > > > > > are not supported at all. > > > > > > > > > > > > Old drivers are used to sync, but they've got async wrappers to support > > > async > > > > > > APIs. Eg. > > > > > > crypto: acomp - add support for lz4 via scomp > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d > > > > > > > > > > > > crypto: acomp - add support for lzo via scomp > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e > > > > > > > > > > > > so they are supporting async APIs but they are still working in sync > > mode > > > > > as > > > > > > those old drivers don't sleep. > > > > > > > > > > > > > > > > Good to know that those are sync because I want them to be sync. > > > > > Please note that zswap is a cache in front of a real swap and the load > > > > > operation is latency sensitive as it comes in the page fault path and > > > > > directly impacts the applications. I doubt decompressing synchronously > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing > > > > > the same page from hardware accelerators. > > > > > > > > If you read the old paper: > > > > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit > > > y > > > > Because the hardware accelerator speeds up compression, looking at the zswap > > > > metrics we observed that there were more store and load requests in a given > > > > amount of time, which filled up the zswap pool faster than a software > > > > compression run. Because of this behavior, we set the max_pool_percent > > > > parameter to 30 for the hardware compression runs - this means that zswap > > > > can use up to 30% of the 10GB of total memory. > > > > > > > > So using hardware accelerators, we get a chance to speed up compression > > > > while decreasing cpu utilization. 
> > > > > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we might > > do > > > > in zswap is adding the below after applying Mike's original patch: > > > > > > > > if(in_atomic()) /* for zsmalloc */ > > > > while(!try_wait_for_completion(&req->done); > > > > else /* for zbud, z3fold */ > > > > crypto_wait_req(....); > > > > > > I don't think I'm going to ack this, sorry. > > > > > > > Fair enough. And I am also thinking if we can move zpool_unmap_handle() > > quite after zpool_map_handle() as below: > > > > dlen = PAGE_SIZE; > > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); > > if (zpool_evictable(entry->pool->zpool)) > > src += sizeof(struct zswap_header); > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > mutex_lock(acomp_ctx->mutex); > > sg_init_one(&input, src, entry->length); > > sg_init_table(&output, 1); > > sg_set_page(&output, page, PAGE_SIZE, 0); > > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, > > dlen); > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > > &acomp_ctx->wait); > > mutex_unlock(acomp_ctx->mutex); > > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > Since src is always low memory and we only need its virtual address > > to get the page of src in sg_init_one(). We don't actually read it > > by CPU anywhere. > > The below code might be better: > > dlen = PAGE_SIZE; > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); > if (zpool_evictable(entry->pool->zpool)) > src += sizeof(struct zswap_header); > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > mutex_lock(acomp_ctx->mutex); > sg_init_one(&input, src, entry->length); > sg_init_table(&output, 1); > sg_set_page(&output, page, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); > mutex_unlock(acomp_ctx->mutex); > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); I don't see how this is going to work since we can't guarantee src will be a valid pointer after the zpool_unmap_handle() call, can we? Could you please elaborate? ~Vitaly > > > > > Best regards, > > > Vitaly > > > > > > > crypto_wait_req() is actually doing wait_for_completion(): > > > > static inline int crypto_wait_req(int err, struct crypto_wait *wait) > > > > { > > > > switch (err) { > > > > case -EINPROGRESS: > > > > case -EBUSY: > > > > wait_for_completion(&wait->completion); > > > > reinit_completion(&wait->completion); > > > > err = wait->err; > > > > break; > > > > } > > > > > > > > return err; > > > > } > > Thanks > Barry
> -----Original Message----- > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > Sent: Tuesday, December 22, 2020 2:00 PM > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > <akpm@linux-foundation.org> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song) > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > -----Original Message----- > > > From: Song Bao Hua (Barry Song) > > > Sent: Tuesday, December 22, 2020 11:38 AM > > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > Mike > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > <akpm@linux-foundation.org> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > -----Original Message----- > > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > > > Sent: Tuesday, December 22, 2020 11:12 AM > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > > > Mike > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > <akpm@linux-foundation.org> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > > > > <minchan@kernel.org>; > > > > > > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > > > > linux-mm > > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior > <bigeasy@linutronix.de>; > > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > <akpm@linux-foundation.org> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith > <efault@gmx.de>; > > > > LKML > > > > > > > > <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; > Song > > > > Bao > > > > > > Hua 
> > > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej > Siewior > > > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey > > > > > > Senozhatsky > > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > > > <akpm@linux-foundation.org> > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool > <vitaly.wool@konsulko.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> > > > wrote: > > > > > > > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote: > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases > > > > it > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining > > > > > > > > > > > about scheduling in atomic context. > > > > > > > > > > > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove > > > that > > > > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced the > lockdep > > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap > bug > > > > and > > > > > > > > > > introduce this patch with allowing preemption enabling. > > > > > > > > > > > > > > > > > > This understanding is upside down. The code in zswap you are > referring > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there > is > > > > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? > Sorry, > > > > I > > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 > as > > > > a > > > > > > > > zswap compressor will the [de]compression be async or sync? > > > > > > > > > > > > > > Right now, in crypto subsystem, new drivers are required to write > based > > > > on > > > > > > > async APIs. The old sync API can't work in new accelerator drivers > as > > > > they > > > > > > > are not supported at all. > > > > > > > > > > > > > > Old drivers are used to sync, but they've got async wrappers to > support > > > > async > > > > > > > APIs. Eg. > > > > > > > crypto: acomp - add support for lz4 via scomp > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d > > > > > > > > > > > > > > crypto: acomp - add support for lzo via scomp > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e > > > > > > > > > > > > > > so they are supporting async APIs but they are still working in > sync > > > mode > > > > > > as > > > > > > > those old drivers don't sleep. > > > > > > > > > > > > > > > > > > > Good to know that those are sync because I want them to be sync. > > > > > > Please note that zswap is a cache in front of a real swap and the > load > > > > > > operation is latency sensitive as it comes in the page fault path > and > > > > > > directly impacts the applications. 
I doubt decompressing synchronously > > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing > > > > > > the same page from hardware accelerators. > > > > > > > > > > If you read the old paper: > > > > > > > > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit > > > > y > > > > > Because the hardware accelerator speeds up compression, looking at the > zswap > > > > > metrics we observed that there were more store and load requests in > a given > > > > > amount of time, which filled up the zswap pool faster than a software > > > > > compression run. Because of this behavior, we set the max_pool_percent > > > > > parameter to 30 for the hardware compression runs - this means that > zswap > > > > > can use up to 30% of the 10GB of total memory. > > > > > > > > > > So using hardware accelerators, we get a chance to speed up compression > > > > > while decreasing cpu utilization. > > > > > > > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we might > > > do > > > > > in zswap is adding the below after applying Mike's original patch: > > > > > > > > > > if(in_atomic()) /* for zsmalloc */ > > > > > while(!try_wait_for_completion(&req->done); > > > > > else /* for zbud, z3fold */ > > > > > crypto_wait_req(....); > > > > > > > > I don't think I'm going to ack this, sorry. > > > > > > > > > > Fair enough. And I am also thinking if we can move zpool_unmap_handle() > > > quite after zpool_map_handle() as below: > > > > > > dlen = PAGE_SIZE; > > > src = zpool_map_handle(entry->pool->zpool, entry->handle, > ZPOOL_MM_RO); > > > if (zpool_evictable(entry->pool->zpool)) > > > src += sizeof(struct zswap_header); > > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > > mutex_lock(acomp_ctx->mutex); > > > sg_init_one(&input, src, entry->length); > > > sg_init_table(&output, 1); > > > sg_set_page(&output, page, PAGE_SIZE, 0); > > > acomp_request_set_params(acomp_ctx->req, &input, &output, > entry->length, > > > dlen); > > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > > > &acomp_ctx->wait); > > > mutex_unlock(acomp_ctx->mutex); > > > > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > > > Since src is always low memory and we only need its virtual address > > > to get the page of src in sg_init_one(). We don't actually read it > > > by CPU anywhere. > > > > The below code might be better: > > > > dlen = PAGE_SIZE; > > src = zpool_map_handle(entry->pool->zpool, entry->handle, > ZPOOL_MM_RO); > > if (zpool_evictable(entry->pool->zpool)) > > src += sizeof(struct zswap_header); > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > mutex_lock(acomp_ctx->mutex); > > sg_init_one(&input, src, entry->length); > > sg_init_table(&output, 1); > > sg_set_page(&output, page, PAGE_SIZE, 0); > > acomp_request_set_params(acomp_ctx->req, &input, &output, > entry->length, dlen); > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > &acomp_ctx->wait); > > mutex_unlock(acomp_ctx->mutex); > > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > I don't see how this is going to work since we can't guarantee src > will be a valid pointer after the zpool_unmap_handle() call, can we? > Could you please elaborate? A valid pointer is for cpu to read and write. Here, cpu doesn't read and write it, we only need to get page struct from the address. 
void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
{
	sg_init_table(sg, 1);
	sg_set_buf(sg, buf, buflen);
}

static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
#ifdef CONFIG_DEBUG_SG
	BUG_ON(!virt_addr_valid(buf));
#endif
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}

sg_init_one() always takes an address that lives in the linear mapping of
physical memory, so once we have the value of src we can get its page
struct directly. Because src is linearly mapped, no page table walk is
needed, unlike what vmalloc_to_page() requires.

The req only needs the page struct to initialize the sg table; if we end
up doing cpu-based (de)compression, the crypto driver will kmap it again.

>
> ~Vitaly

Thanks
Barry
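Spelled out, the intent of the reorder could be made explicit like this (a sketch, assuming src really is in the linear map): capture the page and offset while the handle is still mapped, and never touch src after the unmap:

	struct page *page_in;
	unsigned int off_in;

	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);

	/* capture page/offset while the mapping is live */
	page_in = virt_to_page(src);
	off_in  = offset_in_page(src);

	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	/* from here on only the page struct is used, never src itself */
	sg_init_table(&input, 1);
	sg_set_page(&input, page_in, entry->length, off_in);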
> -----Original Message----- > From: Song Bao Hua (Barry Song) > Sent: Tuesday, December 22, 2020 2:06 PM > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; Mike > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > <akpm@linux-foundation.org> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > -----Original Message----- > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > Sent: Tuesday, December 22, 2020 2:00 PM > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > Mike > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > <akpm@linux-foundation.org> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song) > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > -----Original Message----- > > > > From: Song Bao Hua (Barry Song) > > > > Sent: Tuesday, December 22, 2020 11:38 AM > > > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > > Mike > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > <akpm@linux-foundation.org> > > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > > > > Sent: Tuesday, December 22, 2020 11:12 AM > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim > <minchan@kernel.org>; > > > > Mike > > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > linux-mm > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior > <bigeasy@linutronix.de>; > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > <akpm@linux-foundation.org> > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > > > > > <minchan@kernel.org>; > > > > > > > Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > > > > > linux-mm > > > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior > > <bigeasy@linutronix.de>; > > > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > 
> > > > > <akpm@linux-foundation.org> > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith > > <efault@gmx.de>; > > > > > LKML > > > > > > > > > <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; > > Song > > > > > Bao > > > > > > > Hua > > > > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian Andrzej > > Siewior > > > > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey > > > > > > > Senozhatsky > > > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > > > > <akpm@linux-foundation.org> > > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool > > <vitaly.wool@konsulko.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote: > > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and > releases > > > > > it > > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining > > > > > > > > > > > > about scheduling in atomic context. > > > > > > > > > > > > > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, > remove > > > > that > > > > > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced > the > > lockdep > > > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap > > bug > > > > > and > > > > > > > > > > > introduce this patch with allowing preemption enabling. > > > > > > > > > > > > > > > > > > > > This understanding is upside down. The code in zswap you are > > referring > > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there > > is > > > > > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? > > Sorry, > > > > > I > > > > > > > > > am not very familiar with the crypto API. If I select lzo or > lz4 > > as > > > > > a > > > > > > > > > zswap compressor will the [de]compression be async or sync? > > > > > > > > > > > > > > > > Right now, in crypto subsystem, new drivers are required to write > > based > > > > > on > > > > > > > > async APIs. The old sync API can't work in new accelerator drivers > > as > > > > > they > > > > > > > > are not supported at all. > > > > > > > > > > > > > > > > Old drivers are used to sync, but they've got async wrappers to > > support > > > > > async > > > > > > > > APIs. Eg. 
> > > > > > > > crypto: acomp - add support for lz4 via scomp > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d > > > > > > > > > > > > > > > > crypto: acomp - add support for lzo via scomp > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > > > > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e > > > > > > > > > > > > > > > > so they are supporting async APIs but they are still working in > > sync > > > > mode > > > > > > > as > > > > > > > > those old drivers don't sleep. > > > > > > > > > > > > > > > > > > > > > > Good to know that those are sync because I want them to be sync. > > > > > > > Please note that zswap is a cache in front of a real swap and the > > load > > > > > > > operation is latency sensitive as it comes in the page fault path > > and > > > > > > > directly impacts the applications. I doubt decompressing synchronously > > > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing > > > > > > > the same page from hardware accelerators. > > > > > > > > > > > > If you read the old paper: > > > > > > > > > > > > > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit > > > > > y > > > > > > Because the hardware accelerator speeds up compression, looking at > the > > zswap > > > > > > metrics we observed that there were more store and load requests in > > a given > > > > > > amount of time, which filled up the zswap pool faster than a software > > > > > > compression run. Because of this behavior, we set the max_pool_percent > > > > > > parameter to 30 for the hardware compression runs - this means that > > zswap > > > > > > can use up to 30% of the 10GB of total memory. > > > > > > > > > > > > So using hardware accelerators, we get a chance to speed up compression > > > > > > while decreasing cpu utilization. > > > > > > > > > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we > might > > > > do > > > > > > in zswap is adding the below after applying Mike's original patch: > > > > > > > > > > > > if(in_atomic()) /* for zsmalloc */ > > > > > > while(!try_wait_for_completion(&req->done); > > > > > > else /* for zbud, z3fold */ > > > > > > crypto_wait_req(....); > > > > > > > > > > I don't think I'm going to ack this, sorry. > > > > > > > > > > > > > Fair enough. 
And I am also thinking if we can move zpool_unmap_handle() > > > > quite after zpool_map_handle() as below: > > > > > > > > dlen = PAGE_SIZE; > > > > src = zpool_map_handle(entry->pool->zpool, entry->handle, > > ZPOOL_MM_RO); > > > > if (zpool_evictable(entry->pool->zpool)) > > > > src += sizeof(struct zswap_header); > > > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > > > mutex_lock(acomp_ctx->mutex); > > > > sg_init_one(&input, src, entry->length); > > > > sg_init_table(&output, 1); > > > > sg_set_page(&output, page, PAGE_SIZE, 0); > > > > acomp_request_set_params(acomp_ctx->req, &input, &output, > > entry->length, > > > > dlen); > > > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > > > > &acomp_ctx->wait); > > > > mutex_unlock(acomp_ctx->mutex); > > > > > > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > > > > > Since src is always low memory and we only need its virtual address > > > > to get the page of src in sg_init_one(). We don't actually read it > > > > by CPU anywhere. > > > > > > The below code might be better: > > > > > > dlen = PAGE_SIZE; > > > src = zpool_map_handle(entry->pool->zpool, entry->handle, > > ZPOOL_MM_RO); > > > if (zpool_evictable(entry->pool->zpool)) > > > src += sizeof(struct zswap_header); > > > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > > > > > + zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > > > mutex_lock(acomp_ctx->mutex); > > > sg_init_one(&input, src, entry->length); > > > sg_init_table(&output, 1); > > > sg_set_page(&output, page, PAGE_SIZE, 0); > > > acomp_request_set_params(acomp_ctx->req, &input, &output, > > entry->length, dlen); > > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), > > &acomp_ctx->wait); > > > mutex_unlock(acomp_ctx->mutex); > > > > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > > > I don't see how this is going to work since we can't guarantee src > > will be a valid pointer after the zpool_unmap_handle() call, can we? > > Could you please elaborate? > > A valid pointer is for cpu to read and write. Here, cpu doesn't read > and write it, we only need to get page struct from the address. > > void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen) > { > sg_init_table(sg, 1); > sg_set_buf(sg, buf, buflen); > } > > static inline void sg_set_buf(struct scatterlist *sg, const void *buf, > unsigned int buflen) > { > #ifdef CONFIG_DEBUG_SG > BUG_ON(!virt_addr_valid(buf)); > #endif > sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf)); > } > > sg_init_one() is always using an address which has a linear mapping > with physical address. > So once we get the value of src, we can get the page struct. > > src has a linear mapping with physical address. It doesn't require > page table walk which vmalloc_to_page() wants. > > The req only requires page to initialize sg table, I think if > we are going to use a cpu-based (de)compression, the crypto > driver will kmap it again. Probably I made another bug here. for zsmalloc, it is possible to get highmem for zpool since its malloc_support_movable = true. if (zpool_malloc_support_movable(entry->pool->zpool)) gfp |= __GFP_HIGHMEM | __GFP_MOVABLE; ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle); For 64bit system, there is never a highmem. For 32bit system, we may trigger this bug. 
So zswap should actually have used kmap_to_page(), which handles both
linear and non-linear mappings; sg_init_one() only supports linear
mappings.

But it doesn't change the fact: once the req is initialized with the
page struct, we can unmap src. If we go through a HW accelerator, the
access is a DMA; if we use CPU decompression, the crypto driver will
kmap() the page again.

> > ~Vitaly

Thanks
Barry
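A sketch of the kmap_to_page() variant Barry mentions, which also covers the 32-bit highmem case where src comes from a kmap rather than the linear map:

	/* kmap_to_page() resolves both lowmem (linear map) and kmap'd
	 * highmem virtual addresses back to their page struct. */
	struct page *page_in = kmap_to_page(src);
	unsigned int off_in = offset_in_page(src);

	sg_init_table(&input, 1);
	sg_set_page(&input, page_in, entry->length, off_in);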
On Tue, 22 Dec 2020, 02:42 Song Bao Hua (Barry Song), < song.bao.hua@hisilicon.com> wrote: > > > > -----Original Message----- > > From: Song Bao Hua (Barry Song) > > Sent: Tuesday, December 22, 2020 2:06 PM > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>; > Mike > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; linux-mm > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > <akpm@linux-foundation.org> > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > -----Original Message----- > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > > Sent: Tuesday, December 22, 2020 2:00 PM > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim < > minchan@kernel.org>; > > Mike > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > linux-mm > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior <bigeasy@linutronix.de > >; > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > <akpm@linux-foundation.org> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song) > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Song Bao Hua (Barry Song) > > > > > Sent: Tuesday, December 22, 2020 11:38 AM > > > > > To: 'Vitaly Wool' <vitaly.wool@konsulko.com> > > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim < > minchan@kernel.org>; > > > Mike > > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > linux-mm > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior < > bigeasy@linutronix.de>; > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > <akpm@linux-foundation.org> > > > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com] > > > > > > Sent: Tuesday, December 22, 2020 11:12 AM > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > > Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim > > <minchan@kernel.org>; > > > > > Mike > > > > > > Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>; > > linux-mm > > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior > > <bigeasy@linutronix.de>; > > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > <akpm@linux-foundation.org> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song) > > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM > > > > > > > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > > > > > > > > Cc: Vitaly Wool <vitaly.wool@konsulko.com>; Minchan Kim > > > > > > <minchan@kernel.org>; > > > > > > > > Mike Galbraith <efault@gmx.de>; LKML < > 
linux-kernel@vger.kernel.org>; > > > > > > linux-mm > > > > > > > > <linux-mm@kvack.org>; Sebastian Andrzej Siewior > > > <bigeasy@linutronix.de>; > > > > > > > > NitinGupta <ngupta@vflare.org>; Sergey Senozhatsky > > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > > > <akpm@linux-foundation.org> > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song) > > > > > > > > <song.bao.hua@hisilicon.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com] > > > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM > > > > > > > > > > To: Vitaly Wool <vitaly.wool@konsulko.com> > > > > > > > > > > Cc: Minchan Kim <minchan@kernel.org>; Mike Galbraith > > > <efault@gmx.de>; > > > > > > LKML > > > > > > > > > > <linux-kernel@vger.kernel.org>; linux-mm < > linux-mm@kvack.org>; > > > Song > > > > > > Bao > > > > > > > > Hua > > > > > > > > > > (Barry Song) <song.bao.hua@hisilicon.com>; Sebastian > Andrzej > > > Siewior > > > > > > > > > > <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; > Sergey > > > > > > > > Senozhatsky > > > > > > > > > > <sergey.senozhatsky.work@gmail.com>; Andrew Morton > > > > > > > > > > <akpm@linux-foundation.org> > > > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock > > > > > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool > > > <vitaly.wool@konsulko.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim < > minchan@kernel.org> > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly > Wool wrote: > > > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback > and > > releases > > > > > > it > > > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap > complaining > > > > > > > > > > > > > about scheduling in atomic context. > > > > > > > > > > > > > > > > > > > > > > > > > > To fix that and to improve RT properties of > zsmalloc, > > remove > > > > > that > > > > > > > > > > > > > bit spinlock completely and use a bit flag instead. > > > > > > > > > > > > > > > > > > > > > > > > I don't want to use such open code for the lock. > > > > > > > > > > > > > > > > > > > > > > > > I see from Mike's patch, recent zswap change > introduced > > the > > > lockdep > > > > > > > > > > > > splat bug and you want to improve zsmalloc to fix > the zswap > > > bug > > > > > > and > > > > > > > > > > > > introduce this patch with allowing preemption > enabling. > > > > > > > > > > > > > > > > > > > > > > This understanding is upside down. The code in zswap > you are > > > referring > > > > > > > > > > > to is not buggy. You may claim that it is suboptimal > but there > > > is > > > > > > > > > > > nothing wrong in taking a mutex. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is this suboptimal for all or just the hardware > accelerators? > > > Sorry, > > > > > > I > > > > > > > > > > am not very familiar with the crypto API. If I select > lzo or > > lz4 > > > as > > > > > > a > > > > > > > > > > zswap compressor will the [de]compression be async or > sync? > > > > > > > > > > > > > > > > > > Right now, in crypto subsystem, new drivers are required > to write > > > based > > > > > > on > > > > > > > > > async APIs. 
> > > > > > > > > The old sync API can't work in new accelerator drivers
> > > > > > > > > as they are not supported at all.
> > > > > > > > >
> > > > > > > > > Old drivers are used to sync, but they've got async
> > > > > > > > > wrappers to support async APIs. Eg.
> > > > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > > > > so they are supporting async APIs but they are still
> > > > > > > > > working in sync mode as those old drivers don't sleep.
> > > > > > > >
> > > > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > > > Please note that zswap is a cache in front of a real swap and the
> > > > > > > > load operation is latency sensitive as it comes in the page fault
> > > > > > > > path and directly impacts the applications. I doubt decompressing
> > > > > > > > synchronously a 4k page on a cpu will be costlier than
> > > > > > > > asynchronously decompressing the same page from hardware
> > > > > > > > accelerators.
> > > > > > >
> > > > > > > If you read the old paper:
> > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> > > > > > > Because the hardware accelerator speeds up compression, looking at
> > > > > > > the zswap metrics we observed that there were more store and load
> > > > > > > requests in a given amount of time, which filled up the zswap pool
> > > > > > > faster than a software compression run. Because of this behavior,
> > > > > > > we set the max_pool_percent parameter to 30 for the hardware
> > > > > > > compression runs - this means that zswap can use up to 30% of the
> > > > > > > 10GB of total memory.
> > > > > > >
> > > > > > > So using hardware accelerators, we get a chance to speed up
> > > > > > > compression while decreasing cpu utilization.
> > > > > > >
> > > > > > > BTW, if it is not easy to change zsmalloc, one quick workaround we
> > > > > > > might do in zswap is adding the below after applying Mike's
> > > > > > > original patch:
> > > > > > >
> > > > > > > if (in_atomic()) /* for zsmalloc */
> > > > > > >         while (!try_wait_for_completion(&req->done));
> > > > > > > else /* for zbud, z3fold */
> > > > > > >         crypto_wait_req(....);
> > > > > >
> > > > > > I don't think I'm going to ack this, sorry.
> > > > >
> > > > > Fair enough. And I am also thinking if we can move
> > > > > zpool_unmap_handle() to right after zpool_map_handle() as below:
> > > > >
> > > > >         dlen = PAGE_SIZE;
> > > > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > > >         if (zpool_evictable(entry->pool->zpool))
> > > > >                 src += sizeof(struct zswap_header);
> > > > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > > >         mutex_lock(acomp_ctx->mutex);
> > > > >         sg_init_one(&input, src, entry->length);
> > > > >         sg_init_table(&output, 1);
> > > > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > > > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > > >         mutex_unlock(acomp_ctx->mutex);
> > > > >
> > > > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > > Since src is always low memory, we only need its virtual address
> > > > > to get the page of src in sg_init_one(). We don't actually read it
> > > > > by CPU anywhere.
> > > >
> > > > The below code might be better:
> > > >
> > > >         dlen = PAGE_SIZE;
> > > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > >         if (zpool_evictable(entry->pool->zpool))
> > > >                 src += sizeof(struct zswap_header);
> > > >
> > > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > >
> > > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > >         mutex_lock(acomp_ctx->mutex);
> > > >         sg_init_one(&input, src, entry->length);
> > > >         sg_init_table(&output, 1);
> > > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         mutex_unlock(acomp_ctx->mutex);
> > > >
> > > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > I don't see how this is going to work since we can't guarantee src
> > > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > > Could you please elaborate?
> >
> > A valid pointer is for cpu to read and write. Here, cpu doesn't read
> > and write it, we only need to get the page struct from the address.
> >
> > void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
> > {
> >         sg_init_table(sg, 1);
> >         sg_set_buf(sg, buf, buflen);
> > }
> >
> > static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> >                               unsigned int buflen)
> > {
> > #ifdef CONFIG_DEBUG_SG
> >         BUG_ON(!virt_addr_valid(buf));
> > #endif
> >         sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
> > }
> >
> > sg_init_one() is always using an address which has a linear mapping
> > with the physical address. So once we get the value of src, we can get
> > the page struct.
> >
> > src has a linear mapping with the physical address; it doesn't require
> > the page table walk which vmalloc_to_page() wants.
> >
> > The req only requires the page to initialize the sg table. I think if
> > we are going to use a cpu-based (de)compression, the crypto driver
> > will kmap it again.
>
> Probably I made another bug here: for zsmalloc, it is possible to
> get highmem for the zpool since its malloc_support_movable = true.
>
>         if (zpool_malloc_support_movable(entry->pool->zpool))
>                 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
>         ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);
>
> For a 64bit system, there is never a highmem. For a 32bit system, we
> may trigger this bug.
>
> So actually zswap should have used kmap_to_page(), which can support
> both linear mapping and non-linear mapping; sg_init_one() only
> supports linear mapping.
>
> But it doesn't change the fact: once the req is initialized with the
> page struct, we can unmap src. If we are going to use a HW accelerator,
> it would be a DMA; if we are going to use CPU decompression, the crypto
> driver will kmap() again.

I'm still not convinced. Will kmap what, src? At this point src might
become just a bogus pointer. Why couldn't the object have been moved
somewhere else (due to the compaction mechanism for instance) at the
time the DMA kicks in?

> > > > > > ~Vitaly
>
> Thanks
> Barry
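For reference, the kmap_to_page() idea above could look roughly like
the helper below. This is a hypothetical sketch, not part of any posted
patch; the helper name is invented. The point is that sg_set_buf() (and
hence sg_init_one()) relies on virt_to_page(), which only covers the
linear mapping, while kmap_to_page() also resolves kmap()ed highmem
addresses:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* hypothetical: build a single-entry scatterlist from buf even when
 * buf is a kmap()ed highmem address */
static void sg_init_one_kmapped(struct scatterlist *sg, void *buf,
				unsigned int buflen)
{
	sg_init_table(sg, 1);
	sg_set_page(sg, kmap_to_page(buf), buflen, offset_in_page(buf));
}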
> I'm still not convinced. Will kmap what, src? At this point src might
> become just a bogus pointer.

As long as the memory is still there, we can kmap it by its page
struct. But if it is not there anymore, we have no way.

> Why couldn't the object have been moved somewhere else (due to the
> compaction mechanism for instance) at the time the DMA kicks in?

So zs_map_object() will guarantee the src won't be moved by holding
those preemption-disabled locks? If so, it seems we have to drop the
MOVABLE gfp in zswap for the zsmalloc case?

> > > > > ~Vitaly

Thanks
Barry
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 3:03 PM
> To: 'Vitaly Wool' <vitaly.wool@konsulko.com>
> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>;
> linux-mm <linux-mm@kvack.org>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey
> Senozhatsky <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
> > I'm still not convinced. Will kmap what, src? At this point src might
> > become just a bogus pointer.
>
> As long as the memory is still there, we can kmap it by its page
> struct. But if it is not there anymore, we have no way.
>
> > Why couldn't the object have been moved somewhere else (due to the
> > compaction mechanism for instance) at the time the DMA kicks in?
>
> So zs_map_object() will guarantee the src won't be moved by holding
> those preemption-disabled locks? If so, it seems we have to drop the
> MOVABLE gfp in zswap for the zsmalloc case?

Or we can do get_page() to avoid the movement of the page.

> > > > > > ~Vitaly
> >
> > Thanks
> > Barry
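A minimal sketch of the get_page() idea, under the assumption that
pinning the struct page is enough. Note that the extra reference only
prevents the page from being migrated or freed; zsmalloc's compaction
can still move an object between pages, which is the concern raised
elsewhere in the thread:

	/* hypothetical: pin the backing page across the async request */
	struct page *p = virt_to_page(src);

	get_page(p);	/* extra reference: page can't be migrated or freed */
	zpool_unmap_handle(entry->pool->zpool, entry->handle);
	/* ... submit the async decompression targeting p ... */
	put_page(p);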
From: Song Bao Hua
> Sent: 21 December 2020 23:02
...
> > For decompression, I would like as low latency as possible which I
> > think is only possible by doing decompression on a cpu synchronously.
>
> One possibility is that we change the HW accelerator driver to do sync
> polling for decompression. But this still depends on the async api as
> this is the framework nowadays; the difference would be that the driver
> won't really block. crypto_wait_req() will return without an actual
> sleep.

How long does the HW-accelerated compress/decompress need to take
before it is actually worth sleeping the process?
While the HW version might be faster than the SW one, it may not be
enough faster to allow for the hardware interrupt and process sleep.
So it may be worth just spinning (polling the hardware register)
until the request completes.

If decompression is done that way, but compression is left async, then
the decompress might need to fall back to SW if the HW is busy.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
MK1 1PT, UK
Registration No: 1397386 (Wales)
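A sketch of the polling approach on top of the existing acomp API —
hypothetical, assuming the request was set up with crypto_req_done()
as its callback (as zswap does), so the crypto_wait completion gets
signalled when the hardware finishes:

	ret = crypto_acomp_decompress(acomp_ctx->req);
	if (ret == -EINPROGRESS || ret == -EBUSY) {
		/* busy-poll instead of sleeping in crypto_wait_req() */
		while (!try_wait_for_completion(&acomp_ctx->wait.completion))
			cpu_relax();
		reinit_completion(&acomp_ctx->wait.completion);
		ret = acomp_ctx->wait.err;
	}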
On Tue, 22 Dec 2020, 10:20 David Laight, <David.Laight@aculab.com> wrote:

> From: Song Bao Hua
> > Sent: 21 December 2020 23:02
> ...
> > > For decompression, I would like as low latency as possible which I
> > > think is only possible by doing decompression on a cpu synchronously.
> >
> > One possibility is that we change the HW accelerator driver to do sync
> > polling for decompression. But this still depends on the async api as
> > this is the framework nowadays; the difference would be that the driver
> > won't really block. crypto_wait_req() will return without an actual
> > sleep.
>
> How long does the HW-accelerated compress/decompress need to take
> before it is actually worth sleeping the process?
> While the HW version might be faster than the SW one, it may not be
> enough faster to allow for the hardware interrupt and process sleep.
> So it may be worth just spinning (polling the hardware register)
> until the request completes.
>
> If decompression is done that way, but compression is left async, then
> the decompress might need to fall back to SW if the HW is busy.

This is an option too, of course. However, please bear in mind that
kernel mutexes are hybrid in the sense that there will normally be
some spinning first.

~Vitaly

> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
> MK1 1PT, UK
> Registration No: 1397386 (Wales)
On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
<song.bao.hua@hisilicon.com> wrote:
>
> [...]
>
> > So zs_map_object() will guarantee the src won't be moved by holding
> > those preemption-disabled locks? If so, it seems we have to drop the
> > MOVABLE gfp in zswap for the zsmalloc case?
>
> Or we can do get_page() to avoid the movement of the page.

I would like to discuss this more in the zswap context than in
zsmalloc's. Since zsmalloc does not implement a reclaim callback, using
it in zswap is a corner case anyway.

zswap, on the other hand, may be dealing with some new backends in the
future which have more chances to become mainstream. Imagine typical
NUMA-like cases, i. e. a zswap pool allocated in some kind of SRAM, or
in unused video memory. In such a case, if you try to use a pointer to
an invalidated zpool mapping, you are on the way to thrash the system.
So: no assumptions that the zswap pool is in regular linear RAM should
be made.

~Vitaly
> -----Original Message-----
> From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> Sent: Tuesday, December 22, 2020 10:44 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Shakeel Butt <shakeelb@google.com>; Minchan Kim <minchan@kernel.org>;
> Mike Galbraith <efault@gmx.de>; LKML <linux-kernel@vger.kernel.org>;
> linux-mm <linux-mm@kvack.org>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; NitinGupta <ngupta@vflare.org>; Sergey
> Senozhatsky <sergey.senozhatsky.work@gmail.com>; Andrew Morton
> <akpm@linux-foundation.org>; tiantao (H) <tiantao6@hisilicon.com>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> <song.bao.hua@hisilicon.com> wrote:
> >
> > [...]
> >
> > Or we can do get_page() to avoid the movement of the page.
>
> I would like to discuss this more in the zswap context than in
> zsmalloc's. Since zsmalloc does not implement a reclaim callback,
> using it in zswap is a corner case anyway.

I see. But it seems we still need a solution for the compatibility
of zsmalloc and zswap? This will require a change in either zsmalloc
or zswap. Or do you want to make zswap depend on !ZSMALLOC?

> zswap, on the other hand, may be dealing with some new backends in the
> future which have more chances to become mainstream. Imagine typical
> NUMA-like cases, i. e. a zswap pool allocated in some kind of SRAM, or
> in unused video memory. In such a case, if you try to use a pointer to
> an invalidated zpool mapping, you are on the way to thrash the system.
> So: no assumptions that the zswap pool is in regular linear RAM should
> be made.
>
> ~Vitaly

Thanks
Barry
On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
<song.bao.hua@hisilicon.com> wrote:
>
> [...]
>
> > I would like to discuss this more in the zswap context than in
> > zsmalloc's. Since zsmalloc does not implement a reclaim callback,
> > using it in zswap is a corner case anyway.
>
> I see. But it seems we still need a solution for the compatibility
> of zsmalloc and zswap? This will require a change in either zsmalloc
> or zswap. Or do you want to make zswap depend on !ZSMALLOC?

No, I really don't think we should go that far. What if we add a flag
to zpool, named like "can_sleep_mapped", and have it set for
zbud/z3fold?
Then zswap could go the current path if the flag is set; and if it's
not set, and mutex_trylock fails, copy data from src to a temporary
buffer, then unmap the handle, take the mutex, and process the buffer
instead of src. Not the nicest thing to do but at least it won't break
anything.

~Vitaly

> > zswap, on the other hand, may be dealing with some new backends in the
> > future which have more chances to become mainstream. Imagine typical
> > NUMA-like cases, i. e. a zswap pool allocated in some kind of SRAM, or
> > in unused video memory. In such a case, if you try to use a pointer to
> > an invalidated zpool mapping, you are on the way to thrash the system.
> > So: no assumptions that the zswap pool is in regular linear RAM should
> > be made.
> >
> > ~Vitaly
>
> Thanks
> Barry
On 2020/12/23 8:11, Vitaly Wool wrote:
> On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> <song.bao.hua@hisilicon.com> wrote:
>>
>> [...]
>>
>> I see. But it seems we still need a solution for the compatibility
>> of zsmalloc and zswap? This will require a change in either zsmalloc
>> or zswap. Or do you want to make zswap depend on !ZSMALLOC?
>
> No, I really don't think we should go that far. What if we add a flag
> to zpool, named like "can_sleep_mapped", and have it set for
> zbud/z3fold?
> Then zswap could go the current path if the flag is set; and if it's
> not set, and mutex_trylock fails, copy data from src to a temporary
> buffer, then unmap the handle, take the mutex, and process the buffer
> instead of src. Not the nicest thing to do but at least it won't break
> anything.

I wrote the following patch according to your idea; what do you think?
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	struct zswap_entry *entry;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src, *dst;
+	u8 *src, *dst, *tmp;
 	unsigned int dlen;
 	int ret;

@@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	if (zpool_evictable(entry->pool->zpool))
 		src += sizeof(struct zswap_header);

+	if (!zpool_can_sleep_mapped(entry->pool->zpool) && !mutex_trylock(acomp_ctx->mutex)) {
+		tmp = kmemdup(src, entry->length, GFP_ATOMIC);
+		zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
+		if (!tmp)
+			goto freeentry;
+	}
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(acomp_ctx->mutex);
-	sg_init_one(&input, src, entry->length);
+	sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ? src : tmp, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
 	mutex_unlock(acomp_ctx->mutex);

-	zpool_unmap_handle(entry->pool->zpool, entry->handle);
+	if (zpool_can_sleep_mapped(entry->pool->zpool))
+		zpool_unmap_handle(entry->pool->zpool, entry->handle);
+	else
+		kfree(tmp);
+
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)

 static struct zpool_driver zs_zpool_driver = {
 	.type =			"zsmalloc",
+	.sleep_mapped =		false,
 	.owner =		THIS_MODULE,
 	.create =		zs_zpool_create,
 	.destroy =		zs_zpool_destroy,
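The draft relies on a zpool_can_sleep_mapped() helper that isn't shown;
a minimal sketch of how it might be plumbed through zpool, mirroring
zpool_evictable() (the field and function names are assumptions, not
from the posted diff):

	/* mm/zpool.c, hypothetical */
	bool zpool_can_sleep_mapped(struct zpool *zpool)
	{
		/* report whether the backend allows sleeping while
		 * a handle is mapped (true for zbud/z3fold, false
		 * for zsmalloc) */
		return zpool->driver->sleep_mapped;
	}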
On Wed, Dec 23, 2020 at 1:44 PM tiantao (H) <tiantao6@huawei.com> wrote:
>
> On 2020/12/23 8:11, Vitaly Wool wrote:
> > [...]
> >
> > No, I really don't think we should go that far. What if we add a flag
> > to zpool, named like "can_sleep_mapped", and have it set for
> > zbud/z3fold?
> > Then zswap could go the current path if the flag is set; and if it's
> > not set, and mutex_trylock fails, copy data from src to a temporary
> > buffer, then unmap the handle, take the mutex, and process the buffer
> > instead of src. Not the nicest thing to do but at least it won't break
> > anything.
>
> I wrote the following patch according to your idea; what do you think?

Yep, that is basically what I was thinking of.
Some nitpicks below:

> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> 	struct zswap_entry *entry;
> 	struct scatterlist input, output;
> 	struct crypto_acomp_ctx *acomp_ctx;
> -	u8 *src, *dst;
> +	u8 *src, *dst, *tmp;
> 	unsigned int dlen;
> 	int ret;
>
> @@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> 	if (zpool_evictable(entry->pool->zpool))
> 		src += sizeof(struct zswap_header);
>
> +	if (!zpool_can_sleep_mapped(entry->pool->zpool) && !mutex_trylock(acomp_ctx->mutex)) {
> +		tmp = kmemdup(src, entry->length, GFP_ATOMIC);

kmemdup? Just use memcpy :)

> +		zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
> +		if (!tmp)
> +			goto freeentry;

Jumping to freeentry results in returning success, which isn't
appropriate here. You should return -ENOMEM in this case.

> +	}
> 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> 	mutex_lock(acomp_ctx->mutex);
> -	sg_init_one(&input, src, entry->length);
> +	sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ? src : tmp, entry->length);

This is kind of hard to read, I would rather assign src to tmp after
memcpy and then no condition check would be needed here.

> 	sg_init_table(&output, 1);
> 	sg_set_page(&output, page, PAGE_SIZE, 0);
> 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> 	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> 	mutex_unlock(acomp_ctx->mutex);
>
> -	zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +	if (zpool_can_sleep_mapped(entry->pool->zpool))
> +		zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +	else
> +		kfree(tmp);
> +
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)
>
>  static struct zpool_driver zs_zpool_driver = {
>  	.type =			"zsmalloc",
> +	.sleep_mapped =		false,
>  	.owner =		THIS_MODULE,
>  	.create =		zs_zpool_create,
>  	.destroy =		zs_zpool_destroy,

Best regards,
Vitaly
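Taken together, the nitpicks suggest a load path roughly like the
following — a sketch under the assumptions above (zpool_can_sleep_mapped()
and the freeentry error path exist as in the draft), not a posted patch:

	dlen = PAGE_SIZE;
	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);

	if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
		/* copy out and unmap before taking the sleeping mutex */
		tmp = kmalloc(entry->length, GFP_ATOMIC);
		if (!tmp) {
			ret = -ENOMEM;	/* don't return success on failure */
			goto freeentry;
		}
		memcpy(tmp, src, entry->length);
		zpool_unmap_handle(entry->pool->zpool, entry->handle);
		src = tmp;		/* no conditional needed below */
	}

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);
	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
	mutex_unlock(acomp_ctx->mutex);

	if (zpool_can_sleep_mapped(entry->pool->zpool))
		zpool_unmap_handle(entry->pool->zpool, entry->handle);
	else
		kfree(tmp);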
On 2020-12-20 08:21:37 [+0100], Vitaly Wool wrote:
> Not really because bit spinlock leaves preemption disabled.
It leaves it disabled for a reason. Now you just spin until the original
context gets back on the CPU. On UP with preemption, if the "lock owner"
gets preempted, the next lock attempt will lock up the system.
Sebastian
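For reference, bit_spin_lock() from include/linux/bit_spinlock.h looks
roughly like this: preemption stays disabled for the whole critical
section (the matching preempt_enable() only happens at unlock), and on
UP the spin body compiles away entirely — which is exactly what the
open-coded pin_tag() above loses:

	static inline void bit_spin_lock(int bitnum, unsigned long *addr)
	{
		/* preemption stays off until bit_spin_unlock() */
		preempt_disable();
	#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
		while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
			preempt_enable();
			do {
				cpu_relax();
			} while (test_bit(bitnum, addr));
			preempt_disable();
		}
	#endif
		__acquire(bitlock);
	}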
On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > I wrote the following patch according to your idea; what do you think?
>
> Yep, that is basically what I was thinking of. Some nitpicks below:

Did this go somewhere? The thread just ends here on my end.
Mike, is this patch fixing / helping your case in any way?

> Best regards,
> Vitaly

Sebastian
On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
<bigeasy@linutronix.de> wrote:
>
> On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > I wrote the following patch according to your idea; what do you think?
> >
> > Yep, that is basically what I was thinking of. Some nitpicks below:
>
> Did this go somewhere? The thread just ends here on my end.
> Mike, is this patch fixing / helping your case in any way?

Please see
* https://marc.info/?l=linux-mm&m=160889419514019&w=2
* https://marc.info/?l=linux-mm&m=160889418114011&w=2
* https://marc.info/?l=linux-mm&m=160889448814057&w=2

Haven't had time to test these yet but they seem to be alright.

Best regards,
Vitaly
On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> <bigeasy@linutronix.de> wrote:
> >
> > Did this go somewhere? The thread just ends here on my end.
> > Mike, is this patch fixing / helping your case in any way?
>
> Please see
> * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> * https://marc.info/?l=linux-mm&m=160889448814057&w=2

Thank you, that would be
1608894171-54174-1-git-send-email-tiantao6@hisilicon.com
for b4 compatibility :)

> Haven't had time to test these yet but they seem to be alright.

So zs_map_object() still disables preemption but the mutex part is
avoided by the patch?

> Best regards,
> Vitaly

Sebastian
On Thu, Jan 14, 2021 at 5:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> > Please see
> > * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> > * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> > * https://marc.info/?l=linux-mm&m=160889448814057&w=2
>
> Thank you, that would be
> 1608894171-54174-1-git-send-email-tiantao6@hisilicon.com
> for b4 compatibility :)
>
> > Haven't had time to test these yet but they seem to be alright.
>
> So zs_map_object() still disables preemption but the mutex part is
> avoided by the patch?

Basically, yes. Minchan was very clear that he didn't want to remove
that inter-function locking, so be it.
I wouldn't really advise using zsmalloc with zswap because zsmalloc
has no support for reclaim; nevertheless I wouldn't like this
configuration to stop working for those who are already using it.

Would you or Mike be up for testing Tian Tao's patchset?

Best regards,
Vitaly
On 2021-01-14 18:15:08 [+0100], Vitaly Wool wrote:
>
> Basically, yes. Minchan was very clear that he didn't want to remove
> that inter-function locking, so be it.
> I wouldn't really advise using zsmalloc with zswap because zsmalloc
> has no support for reclaim; nevertheless I wouldn't like this
> configuration to stop working for those who are already using it.
>
> Would you or Mike be up for testing Tian Tao's patchset?

I will try to reproduce Mike's original report and the fix early next
week.

> Best regards,
> Vitaly

Sebastian
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7289f502ffac..ff26546a7fed 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)

 static inline int testpin_tag(unsigned long handle)
 {
-	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+	return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }

 static inline int trypin_tag(unsigned long handle)
 {
-	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }

-static void pin_tag(unsigned long handle) __acquires(bitlock)
+static void pin_tag(unsigned long handle)
 {
-	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	preempt_disable();
+	while (test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
+		cpu_relax();
+	preempt_enable();
 }

 static void unpin_tag(unsigned long handle) __releases(bitlock)
 {
-	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
+	clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }

 static void reset_page(struct page *page)
zsmalloc takes bit spinlock in its _map() callback and releases it
only in unmap() which is unsafe and leads to zswap complaining
about scheduling in atomic context.

To fix that and to improve RT properties of zsmalloc, remove that
bit spinlock completely and use a bit flag instead.

Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
---
 mm/zsmalloc.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)