Message ID: 20250303084724.6490-13-kanchana.p.sridhar@intel.com (mailing list archive)
State: Superseded
Delegated to: Herbert Xu
Series: zswap IAA compress batching
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > This patch modifies the acomp_ctx resources' lifetime to be from pool > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > "struct crypto_acomp_ctx" which simplify a few things: > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > to 0 or NULL and only then initialize the mutex. > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > to true, without locking the mutex. > 3) CPU hotunplug will lock the mutex before setting __online to false. It > will not delete any resources. > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > is true, and if so, return the mutex for use in zswap compress and > decompress ops. > 5) CPU onlining after offlining will simply check if either __online or > nr_reqs are non-0, and return 0 if so, without re-allocating the > resources. > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > delete the acomp_ctx resources. > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > acomp_ctx_dealloc(). > > The CPU hot[un]plug callback functions are moved to "pool functions" > accordingly. > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > offlining, and only deleting them when the pool is destroyed, is as follows: > > IAA with batching: 64.8 KB > Software compressors: 8.2 KB > > I would appreciate code review comments on whether this memory cost is > acceptable, for the latency improvement that it provides due to a faster > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > if so, set __online to true and return, and reclaim can proceed. I like the idea of allocating the resources on CPU hotplug but leaving them allocated until the pool is torn down. It avoids allocating unnecessary memory if some CPUs are never onlined, and it simplifies things because we don't have to synchronize against the resources being freed in CPU offline. The only case that would suffer from this AFAICT is if someone onlines many CPUs, uses them once, and then offlines them and does not use them again. I am not familiar with CPU hotplug use cases so I can't tell if that's something people do, but I am inclined to agree with this simplification. > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------ > 1 file changed, 182 insertions(+), 91 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 10f2a16e7586..cff96df1df8b 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > struct crypto_acomp_ctx { > struct crypto_acomp *acomp; > struct acomp_req *req; > - struct crypto_wait wait; Is there a reason for moving this? If not please avoid unrelated changes. > u8 *buffer; > + u8 nr_reqs; > + struct crypto_wait wait; > struct mutex mutex; > bool is_sleepable; > + bool __online; I don't believe we need this. If we are not freeing resources during CPU offlining, then we do not need a CPU offline callback and acomp_ctx->__online serves no purpose. The whole point of synchronizing between offlining and compress/decompress operations is to avoid UAF.
If offlining does not free resources, then we can hold the mutex directly in the compress/decompress path and drop the hotunplug callback completely. I also believe nr_reqs can be dropped from this patch, as it seems like it's only used to know when to set __online. > }; > > /* > @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) > **********************************/ > static void __zswap_pool_empty(struct percpu_ref *ref); > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > +{ > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { > + > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > + acomp_request_free(acomp_ctx->req); > + acomp_ctx->req = NULL; > + > + kfree(acomp_ctx->buffer); > + acomp_ctx->buffer = NULL; > + > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > + crypto_free_acomp(acomp_ctx->acomp); > + > + acomp_ctx->nr_reqs = 0; > + } > +} Please split the pure refactoring into a separate patch to make it easier to review. > + > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) Why is the function moved while being changed? It's really hard to see the diff this way. If the function needs to be moved please do that separately as well. I also see some ordering changes inside the function (e.g. we now allocate the request before the buffer). Not sure if these are intentional. If not, please keep the diff to the required changes only. > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + int ret = -ENOMEM; > + > + /* > + * Just to be even more fail-safe against changes in assumptions and/or > + * implementation of the CPU hotplug code. > + */ > + if (acomp_ctx->__online) > + return 0; > + > + if (acomp_ctx->nr_reqs) { > + acomp_ctx->__online = true; > + return 0; > + } > + > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); > + if (IS_ERR(acomp_ctx->acomp)) { > + pr_err("could not alloc crypto acomp %s : %ld\n", > + pool->tfm_name, PTR_ERR(acomp_ctx->acomp)); > + ret = PTR_ERR(acomp_ctx->acomp); > + goto fail; > + } > + > + acomp_ctx->nr_reqs = 1; > + > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); > + if (!acomp_ctx->req) { > + pr_err("could not alloc crypto acomp_request %s\n", > + pool->tfm_name); > + ret = -ENOMEM; > + goto fail; > + } > + > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); > + if (!acomp_ctx->buffer) { > + ret = -ENOMEM; > + goto fail; > + } > + > + crypto_init_wait(&acomp_ctx->wait); > + > + /* > + * if the backend of acomp is async zip, crypto_req_done() will wakeup > + * crypto_wait_req(); if the backend of acomp is scomp, the callback > + * won't be called, crypto_wait_req() will return without blocking. 
> + */ > + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, > + crypto_req_done, &acomp_ctx->wait); > + > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp); > + > + acomp_ctx->__online = true; > + > + return 0; > + > +fail: > + acomp_ctx_dealloc(acomp_ctx); > + > + return ret; > +} > + > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + mutex_lock(&acomp_ctx->mutex); > + acomp_ctx->__online = false; > + mutex_unlock(&acomp_ctx->mutex); > + > + return 0; > +} > + > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node) > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + /* > + * The lifetime of acomp_ctx resources is from pool creation to > + * pool deletion. > + * > + * Reclaims should not be happening because, we get to this routine only > + * in two scenarios: > + * > + * 1) pool creation failures before/during the pool ref initialization. > + * 2) we are in the process of releasing the pool, it is off the > + * zswap_pools list and has no references. > + * > + * Hence, there is no need for locks. > + */ > + acomp_ctx->__online = false; > + acomp_ctx_dealloc(acomp_ctx); Since __online can be dropped, we can probably drop zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > +} > + > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > { > struct zswap_pool *pool; > @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > goto error; > } > > - for_each_possible_cpu(cpu) > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > + for_each_possible_cpu(cpu) { > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + acomp_ctx->acomp = NULL; > + acomp_ctx->req = NULL; > + acomp_ctx->buffer = NULL; > + acomp_ctx->__online = false; > + acomp_ctx->nr_reqs = 0; Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them right away? If it is in fact needed we should probably just use __GFP_ZERO. > + mutex_init(&acomp_ctx->mutex); > + } > > ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > if (ret) > - goto error; > + goto ref_fail; > > /* being the current pool takes 1 ref; this func expects the > * caller to always add the new pool as the current pool > @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > return pool; > > ref_fail: > + for_each_possible_cpu(cpu) > + zswap_cpu_comp_dealloc(cpu, &pool->node); > + > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); I am wondering if we can guard these by hlist_empty(&pool->node) instead of having separate labels. If we do that we can probably make all the cleanup calls conditional and merge this cleanup code with zswap_pool_destroy(). Although I am not too sure about whether or not we should rely on hlist_empty() for this. I am just thinking out loud, no need to do anything here. If you decide to pursue this tho please make it a separate refactoring patch. 
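For concreteness, a rough sketch of that idea (purely hypothetical, with one wrinkle called out: pool->node is a struct hlist_node rather than a list head, so the test would presumably be hlist_unhashed() rather than hlist_empty()):

static void zswap_pool_cleanup(struct zswap_pool *pool)
{
	int cpu;

	/*
	 * pool->node starts out zeroed (unhashed) and only becomes
	 * hashed once cpuhp_state_add_instance() succeeds, so an
	 * unhashed node means there is no hotplug state to tear down.
	 */
	if (!hlist_unhashed(&pool->node)) {
		cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
					    &pool->node);
		for_each_possible_cpu(cpu)
			acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
	}
	free_percpu(pool->acomp_ctx);
}

zswap_pool_create()'s error path and zswap_pool_destroy() could then share this single helper instead of maintaining separate labels.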
> error: > if (pool->acomp_ctx) > @@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) > > static void zswap_pool_destroy(struct zswap_pool *pool) > { > + int cpu; > + > zswap_pool_debug("destroying", pool); > > + for_each_possible_cpu(cpu) > + zswap_cpu_comp_dealloc(cpu, &pool->node); > + > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); > free_percpu(pool->acomp_ctx); > > @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry) > /********************************* > * compressed storage functions > **********************************/ > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) > -{ > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > - struct crypto_acomp *acomp = NULL; > - struct acomp_req *req = NULL; > - u8 *buffer = NULL; > - int ret; > - > - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); > - if (!buffer) { > - ret = -ENOMEM; > - goto fail; > - } > - > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); > - if (IS_ERR(acomp)) { > - pr_err("could not alloc crypto acomp %s : %ld\n", > - pool->tfm_name, PTR_ERR(acomp)); > - ret = PTR_ERR(acomp); > - goto fail; > - } > - > - req = acomp_request_alloc(acomp); > - if (!req) { > - pr_err("could not alloc crypto acomp_request %s\n", > - pool->tfm_name); > - ret = -ENOMEM; > - goto fail; > - } > - > - /* > - * Only hold the mutex after completing allocations, otherwise we may > - * recurse into zswap through reclaim and attempt to hold the mutex > - * again resulting in a deadlock. > - */ > - mutex_lock(&acomp_ctx->mutex); > - crypto_init_wait(&acomp_ctx->wait); > - > - /* > - * if the backend of acomp is async zip, crypto_req_done() will wakeup > - * crypto_wait_req(); if the backend of acomp is scomp, the callback > - * won't be called, crypto_wait_req() will return without blocking. > - */ > - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, > - crypto_req_done, &acomp_ctx->wait); > - > - acomp_ctx->buffer = buffer; > - acomp_ctx->acomp = acomp; > - acomp_ctx->is_sleepable = acomp_is_async(acomp); > - acomp_ctx->req = req; > - mutex_unlock(&acomp_ctx->mutex); > - return 0; > - > -fail: > - if (acomp) > - crypto_free_acomp(acomp); > - kfree(buffer); > - return ret; > -} > - > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) > -{ > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > - > - mutex_lock(&acomp_ctx->mutex); > - if (!IS_ERR_OR_NULL(acomp_ctx)) { > - if (!IS_ERR_OR_NULL(acomp_ctx->req)) > - acomp_request_free(acomp_ctx->req); > - acomp_ctx->req = NULL; > - if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > - crypto_free_acomp(acomp_ctx->acomp); > - kfree(acomp_ctx->buffer); > - } > - mutex_unlock(&acomp_ctx->mutex); > - > - return 0; > -} > > static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > { > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > for (;;) { > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > - mutex_lock(&acomp_ctx->mutex); > - if (likely(acomp_ctx->req)) > - return acomp_ctx; > /* > - * It is possible that we were migrated to a different CPU after > - * getting the per-CPU ctx but before the mutex was acquired. 
If > - * the old CPU got offlined, zswap_cpu_comp_dead() could have > - * already freed ctx->req (among other things) and set it to > - * NULL. Just try again on the new CPU that we ended up on. > + * If the CPU onlining code successfully allocates acomp_ctx resources, > + * it sets acomp_ctx->__online to true. Until this happens, we have > + * two options: > + * > + * 1. Return NULL and fail all stores on this CPU. > + * 2. Retry, until onlining has finished allocating resources. > + * > + * In theory, option 1 could be more appropriate, because it > + * allows the calling procedure to decide how it wants to handle > + * reclaim racing with CPU hotplug. For instance, it might be Ok > + * for compress to return an error for the backing swap device > + * to store the folio. Decompress could wait until we get a > + * valid and locked mutex after onlining has completed. For now, > + * we go with option 2 because adding a do-while in > + * zswap_decompress() adds latency for software compressors. > + * > + * Once initialized, the resources will be de-allocated only > + * when the pool is destroyed. The acomp_ctx will hold on to the > + * resources through CPU offlining/onlining at any time until > + * the pool is destroyed. > + * > + * This prevents races/deadlocks between reclaim and CPU acomp_ctx > + * resource allocation that are a dependency for reclaim. > + * It further simplifies the interaction with CPU onlining and > + * offlining: > + * > + * - CPU onlining does not take the mutex. It only allocates > + * resources and sets __online to true. > + * - CPU offlining acquires the mutex before setting > + * __online to false. If reclaim has acquired the mutex, > + * offlining will have to wait for reclaim to complete before > + * hotunplug can proceed. Further, hotplug merely sets > + * __online to false. It does not delete the acomp_ctx > + * resources. > + * > + * Option 1 is better than potentially not exiting the earlier > + * for (;;) loop because the system is running low on memory > + * and/or CPUs are getting offlined for whatever reason. At > + * least failing this store will prevent data loss by failing > + * zswap_store(), and saving the data in the backing swap device. > */ I believe this can be dropped. I don't think we can have any store/load operations on a CPU before it's fully onlined, and we should always have a reference on the pool here, so the resources cannot go away. So unless I missed something we can drop this completely now and just hold the mutex directly in the load/store paths. > + mutex_lock(&acomp_ctx->mutex); > + if (likely(acomp_ctx->__online)) > + return acomp_ctx; > + > mutex_unlock(&acomp_ctx->mutex); > } > } > -- > 2.27.0 >
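To make the suggested end state concrete, a minimal sketch of what the locking helper could collapse to once __online and the retry loop are dropped (a sketch of the suggestion above, not the posted patch; it assumes the per-CPU resources live from pool creation until zswap_pool_destroy()):

static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
	struct crypto_acomp_ctx *acomp_ctx;

	/*
	 * The caller holds a pool reference and the per-CPU resources
	 * are only freed when the pool is destroyed, so the ctx stays
	 * valid even if we migrate off this CPU before taking the mutex.
	 */
	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);
	return acomp_ctx;
}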
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > This patch modifies the acomp_ctx resources' lifetime to be from pool > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > "struct crypto_acomp_ctx" which simplify a few things: > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > to 0 or NULL and only then initialize the mutex. > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > to true, without locking the mutex. > 3) CPU hotunplug will lock the mutex before setting __online to false. It > will not delete any resources. > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > is true, and if so, return the mutex for use in zswap compress and > decompress ops. > 5) CPU onlining after offlining will simply check if either __online or > nr_reqs are non-0, and return 0 if so, without re-allocating the > resources. > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > delete the acomp_ctx resources. > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > acomp_ctx_dealloc(). > > The CPU hot[un]plug callback functions are moved to "pool functions" > accordingly. > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > offlining, and only deleting them when the pool is destroyed, is as follows: > > IAA with batching: 64.8 KB > Software compressors: 8.2 KB I am assuming this is specifically on x86_64, so let's call that out.
On Thu, Mar 06, 2025 at 07:35:36PM +0000, Yosry Ahmed wrote: > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > "struct crypto_acomp_ctx" which simplify a few things: > > > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > > to 0 or NULL and only then initialize the mutex. > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > to true, without locking the mutex. > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > will not delete any resources. > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > is true, and if so, return the mutex for use in zswap compress and > > decompress ops. > > 5) CPU onlining after offlining will simply check if either __online or > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > resources. > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > > delete the acomp_ctx resources. > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > acomp_ctx_dealloc(). > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > accordingly. > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > IAA with batching: 64.8 KB > > Software compressors: 8.2 KB > > > > I would appreciate code review comments on whether this memory cost is > > acceptable, for the latency improvement that it provides due to a faster > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > if so, set __online to true and return, and reclaim can proceed. > > I like the idea of allocating the resources on memory hotplug but > leaving them allocated until the pool is torn down. It avoids allocating > unnecessary memory if some CPUs are never onlined, but it simplifies > things because we don't have to synchronize against the resources being > freed in CPU offline. > > The only case that would suffer from this AFAICT is if someone onlines > many CPUs, uses them once, and then offline them and not use them again. > I am not familiar with CPU hotplug use cases so I can't tell if that's > something people do, but I am inclined to agree with this > simplification. > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------ > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 10f2a16e7586..cff96df1df8b 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > struct crypto_acomp_ctx { > > struct crypto_acomp *acomp; > > struct acomp_req *req; > > - struct crypto_wait wait; > > Is there a reason for moving this? If not please avoid unrelated changes. > > > u8 *buffer; > > + u8 nr_reqs; > > + struct crypto_wait wait; > > struct mutex mutex; > > bool is_sleepable; > > + bool __online; > > I don't believe we need this. > > If we are not freeing resources during CPU offlining, then we do not > need a CPU offline callback and acomp_ctx->__online serves no purpose. 
> > The whole point of synchronizing between offlining and > compress/decompress operations is to avoid UAF. If offlining does not > free resources, then we can hold the mutex directly in the > compress/decompress path and drop the hotunplug callback completely. > > I also believe nr_reqs can be dropped from this patch, as it seems like > it's only used know when to set __online. > > > }; > > > > /* > > @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) > > **********************************/ > > static void __zswap_pool_empty(struct percpu_ref *ref); > > > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > > +{ > > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { Also, we can just return early here to save an indentation level: if (IS_ERR_OR_NULL(acomp_ctx) || !acomp_ctx->nr_reqs) return; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > + acomp_request_free(acomp_ctx->req); > > + acomp_ctx->req = NULL; > > + > > + kfree(acomp_ctx->buffer); > > + acomp_ctx->buffer = NULL; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > + crypto_free_acomp(acomp_ctx->acomp); > > + > > + acomp_ctx->nr_reqs = 0; > > + } > > +} > > Please split the pure refactoring into a separate patch to make it > easier to review.
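Applying that early return to the helper quoted above would yield something like this sketch:

static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
{
	/* Nothing was ever allocated for this ctx: nothing to free. */
	if (IS_ERR_OR_NULL(acomp_ctx) || !acomp_ctx->nr_reqs)
		return;

	if (!IS_ERR_OR_NULL(acomp_ctx->req))
		acomp_request_free(acomp_ctx->req);
	acomp_ctx->req = NULL;

	kfree(acomp_ctx->buffer);
	acomp_ctx->buffer = NULL;

	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
		crypto_free_acomp(acomp_ctx->acomp);

	acomp_ctx->nr_reqs = 0;
}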
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Thursday, March 6, 2025 11:36 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > "struct crypto_acomp_ctx" which simplify a few things: > > > > 1) zswap_pool_create() will initialize all members of each percpu > acomp_ctx > > to 0 or NULL and only then initialize the mutex. > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > to true, without locking the mutex. > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > will not delete any resources. > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > is true, and if so, return the mutex for use in zswap compress and > > decompress ops. > > 5) CPU onlining after offlining will simply check if either __online or > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > resources. > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() > to > > delete the acomp_ctx resources. > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > acomp_ctx_dealloc(). > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > accordingly. > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > CPU > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > IAA with batching: 64.8 KB > > Software compressors: 8.2 KB > > > > I would appreciate code review comments on whether this memory cost is > > acceptable, for the latency improvement that it provides due to a faster > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > if so, set __online to true and return, and reclaim can proceed. > > I like the idea of allocating the resources on memory hotplug but > leaving them allocated until the pool is torn down. It avoids allocating > unnecessary memory if some CPUs are never onlined, but it simplifies > things because we don't have to synchronize against the resources being > freed in CPU offline. > > The only case that would suffer from this AFAICT is if someone onlines > many CPUs, uses them once, and then offline them and not use them again. > I am not familiar with CPU hotplug use cases so I can't tell if that's > something people do, but I am inclined to agree with this > simplification. Thanks Yosry, for your code review comments! Good to know that this simplification is acceptable. 
> > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++-------------- > ---- > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 10f2a16e7586..cff96df1df8b 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > struct crypto_acomp_ctx { > > struct crypto_acomp *acomp; > > struct acomp_req *req; > > - struct crypto_wait wait; > > Is there a reason for moving this? If not please avoid unrelated changes. The reason is so that req/buffer, and reqs/buffers with batching, go together logically, hence I found this easier to understand. I can restore this to the original order, if that's preferable. > > > u8 *buffer; > > + u8 nr_reqs; > > + struct crypto_wait wait; > > struct mutex mutex; > > bool is_sleepable; > > + bool __online; > > I don't believe we need this. > > If we are not freeing resources during CPU offlining, then we do not > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > The whole point of synchronizing between offlining and > compress/decompress operations is to avoid UAF. If offlining does not > free resources, then we can hold the mutex directly in the > compress/decompress path and drop the hotunplug callback completely. > > I also believe nr_reqs can be dropped from this patch, as it seems like > it's only used know when to set __online. All great points! In fact, that was the original solution I had implemented (not having an offline callback). But then, I spent some time understanding the v6.13 hotfix for synchronizing freeing of resources, and this comment in zswap_cpu_comp_prepare(): /* * Only hold the mutex after completing allocations, otherwise we may * recurse into zswap through reclaim and attempt to hold the mutex * again resulting in a deadlock. */ Hence, I figured the constraint of "recurse into zswap through reclaim" was something to comprehend in the simplification (even though I had a tough time imagining how this could happen). Hence, I added the "bool __online" because zswap_cpu_comp_prepare() does not acquire the mutex lock while allocating resources. We have already initialized the mutex, so in theory, it is possible for compress/decompress to acquire the mutex lock. The __online acts as a way to indicate whether compress/decompress can proceed reliably to use the resources. The "nr_reqs" was needed as a way to distinguish between initial and subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that goes through an online-offline-online sequence. In the initial onlining, we need to allocate resources because nr_reqs=0. If resources are to be allocated, we set acomp_ctx->nr_reqs and proceed to allocate reqs/buffers/etc. In the subsequent onlining, we can quickly inspect nr_reqs as being greater than 0 and return, thus avoiding any latency delays before reclaim/page-faults can be handled on that CPU. Please let me know if this rationale seems reasonable for why __online and nr_reqs were introduced. 
> > > }; > > > > /* > > @@ -246,6 +248,122 @@ static inline struct xarray > *swap_zswap_tree(swp_entry_t swp) > > **********************************/ > > static void __zswap_pool_empty(struct percpu_ref *ref); > > > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > > +{ > > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > + acomp_request_free(acomp_ctx->req); > > + acomp_ctx->req = NULL; > > + > > + kfree(acomp_ctx->buffer); > > + acomp_ctx->buffer = NULL; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > + crypto_free_acomp(acomp_ctx->acomp); > > + > > + acomp_ctx->nr_reqs = 0; > > + } > > +} > > Please split the pure refactoring into a separate patch to make it > easier to review. Sure, will do. > > > + > > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node > *node) > > Why is the function moved while being changed? It's really hard to see > the diff this way. If the function needs to be moved please do that > separately as well. Sure, will do. > > I also see some ordering changes inside the function (e.g. we now > allocate the request before the buffer). Not sure if these are > intentional. If not, please keep the diff to the required changes only. The reason for this was, I am trying to organize the allocations based on dependencies. Unless requests are allocated, there is no point in allocating buffers. Please let me know if this is Ok. > > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + int ret = -ENOMEM; > > + > > + /* > > + * Just to be even more fail-safe against changes in assumptions > and/or > > + * implementation of the CPU hotplug code. > > + */ > > + if (acomp_ctx->__online) > > + return 0; > > + > > + if (acomp_ctx->nr_reqs) { > > + acomp_ctx->__online = true; > > + return 0; > > + } > > + > > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, > 0, cpu_to_node(cpu)); > > + if (IS_ERR(acomp_ctx->acomp)) { > > + pr_err("could not alloc crypto acomp %s : %ld\n", > > + pool->tfm_name, PTR_ERR(acomp_ctx->acomp)); > > + ret = PTR_ERR(acomp_ctx->acomp); > > + goto fail; > > + } > > + > > + acomp_ctx->nr_reqs = 1; > > + > > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); > > + if (!acomp_ctx->req) { > > + pr_err("could not alloc crypto acomp_request %s\n", > > + pool->tfm_name); > > + ret = -ENOMEM; > > + goto fail; > > + } > > + > > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, > cpu_to_node(cpu)); > > + if (!acomp_ctx->buffer) { > > + ret = -ENOMEM; > > + goto fail; > > + } > > + > > + crypto_init_wait(&acomp_ctx->wait); > > + > > + /* > > + * if the backend of acomp is async zip, crypto_req_done() will > wakeup > > + * crypto_wait_req(); if the backend of acomp is scomp, the callback > > + * won't be called, crypto_wait_req() will return without blocking. 
> > + */ > > + acomp_request_set_callback(acomp_ctx->req, > CRYPTO_TFM_REQ_MAY_BACKLOG, > > + crypto_req_done, &acomp_ctx->wait); > > + > > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp); > > + > > + acomp_ctx->__online = true; > > + > > + return 0; > > + > > +fail: > > + acomp_ctx_dealloc(acomp_ctx); > > + > > + return ret; > > +} > > + > > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node > *node) > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + mutex_lock(&acomp_ctx->mutex); > > + acomp_ctx->__online = false; > > + mutex_unlock(&acomp_ctx->mutex); > > + > > + return 0; > > +} > > + > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node > *node) > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + /* > > + * The lifetime of acomp_ctx resources is from pool creation to > > + * pool deletion. > > + * > > + * Reclaims should not be happening because, we get to this routine > only > > + * in two scenarios: > > + * > > + * 1) pool creation failures before/during the pool ref initialization. > > + * 2) we are in the process of releasing the pool, it is off the > > + * zswap_pools list and has no references. > > + * > > + * Hence, there is no need for locks. > > + */ > > + acomp_ctx->__online = false; > > + acomp_ctx_dealloc(acomp_ctx); > > Since __online can be dropped, we can probably drop > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? I suppose there is value in having a way in zswap to know for sure, that resource allocation has completed, and it is safe for compress/decompress to proceed. Especially because the mutex has been initialized before we get to resource allocation. Would you agree? > > > +} > > + > > static struct zswap_pool *zswap_pool_create(char *type, char > *compressor) > > { > > struct zswap_pool *pool; > > @@ -285,13 +403,21 @@ static struct zswap_pool > *zswap_pool_create(char *type, char *compressor) > > goto error; > > } > > > > - for_each_possible_cpu(cpu) > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > + for_each_possible_cpu(cpu) { > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + acomp_ctx->acomp = NULL; > > + acomp_ctx->req = NULL; > > + acomp_ctx->buffer = NULL; > > + acomp_ctx->__online = false; > > + acomp_ctx->nr_reqs = 0; > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > right away? Yes, I figured this is needed for two reasons: 1) For the error handling in zswap_cpu_comp_prepare() and calls into zswap_cpu_comp_dealloc() to be handled by the common procedure "acomp_ctx_dealloc()" unambiguously. 2) The second scenario I thought of that would need this, is let's say the zswap compressor is switched immediately after setting the compressor. Some cores have executed the onlining code and some haven't. Because there are no pool refs held, zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured it would help to initialize these acomp_ctx members before the hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). Please let me know if these are valid considerations. > > If it is in fact needed we should probably just use __GFP_ZERO. Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()" for the acomp_ctx? 
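For reference, a sketch of what the zeroed percpu allocation could look like in zswap_pool_create(), assuming the per-field initialization above is then dropped and only the mutex still needs explicit setup:

	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);
	if (!pool->acomp_ctx) {
		pr_err("percpu alloc failed\n");
		goto error;
	}

	/* All acomp_ctx fields start out zero/NULL/false. */
	for_each_possible_cpu(cpu)
		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);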
> > > + mutex_init(&acomp_ctx->mutex); > > + } > > > > ret = > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > > &pool->node); > > if (ret) > > - goto error; > > + goto ref_fail; > > > > /* being the current pool takes 1 ref; this func expects the > > * caller to always add the new pool as the current pool > > @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char > *type, char *compressor) > > return pool; > > > > ref_fail: > > + for_each_possible_cpu(cpu) > > + zswap_cpu_comp_dealloc(cpu, &pool->node); > > + > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > > I am wondering if we can guard these by hlist_empty(&pool->node) instead > of having separate labels. If we do that we can probably make all the > cleanup calls conditional and merge this cleanup code with > zswap_pool_destroy(). > > Although I am not too sure about whether or not we should rely on > hlist_empty() for this. I am just thinking out loud, no need to do > anything here. If you decide to pursue this tho please make it a > separate refactoring patch. Sure, makes sense. > > > error: > > if (pool->acomp_ctx) > > @@ -361,8 +490,13 @@ static struct zswap_pool > *__zswap_pool_create_fallback(void) > > > > static void zswap_pool_destroy(struct zswap_pool *pool) > > { > > + int cpu; > > + > > zswap_pool_debug("destroying", pool); > > > > + for_each_possible_cpu(cpu) > > + zswap_cpu_comp_dealloc(cpu, &pool->node); > > + > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > > free_percpu(pool->acomp_ctx); > > > > @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry > *entry) > > /********************************* > > * compressed storage functions > > **********************************/ > > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node > *node) > > -{ > > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > - struct crypto_acomp *acomp = NULL; > > - struct acomp_req *req = NULL; > > - u8 *buffer = NULL; > > - int ret; > > - > > - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, > cpu_to_node(cpu)); > > - if (!buffer) { > > - ret = -ENOMEM; > > - goto fail; > > - } > > - > > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, > cpu_to_node(cpu)); > > - if (IS_ERR(acomp)) { > > - pr_err("could not alloc crypto acomp %s : %ld\n", > > - pool->tfm_name, PTR_ERR(acomp)); > > - ret = PTR_ERR(acomp); > > - goto fail; > > - } > > - > > - req = acomp_request_alloc(acomp); > > - if (!req) { > > - pr_err("could not alloc crypto acomp_request %s\n", > > - pool->tfm_name); > > - ret = -ENOMEM; > > - goto fail; > > - } > > - > > - /* > > - * Only hold the mutex after completing allocations, otherwise we > may > > - * recurse into zswap through reclaim and attempt to hold the mutex > > - * again resulting in a deadlock. > > - */ > > - mutex_lock(&acomp_ctx->mutex); > > - crypto_init_wait(&acomp_ctx->wait); > > - > > - /* > > - * if the backend of acomp is async zip, crypto_req_done() will > wakeup > > - * crypto_wait_req(); if the backend of acomp is scomp, the callback > > - * won't be called, crypto_wait_req() will return without blocking. 
> > - */ > > - acomp_request_set_callback(req, > CRYPTO_TFM_REQ_MAY_BACKLOG, > > - crypto_req_done, &acomp_ctx->wait); > > - > > - acomp_ctx->buffer = buffer; > > - acomp_ctx->acomp = acomp; > > - acomp_ctx->is_sleepable = acomp_is_async(acomp); > > - acomp_ctx->req = req; > > - mutex_unlock(&acomp_ctx->mutex); > > - return 0; > > - > > -fail: > > - if (acomp) > > - crypto_free_acomp(acomp); > > - kfree(buffer); > > - return ret; > > -} > > - > > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node > *node) > > -{ > > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > - > > - mutex_lock(&acomp_ctx->mutex); > > - if (!IS_ERR_OR_NULL(acomp_ctx)) { > > - if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > - acomp_request_free(acomp_ctx->req); > > - acomp_ctx->req = NULL; > > - if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > - crypto_free_acomp(acomp_ctx->acomp); > > - kfree(acomp_ctx->buffer); > > - } > > - mutex_unlock(&acomp_ctx->mutex); > > - > > - return 0; > > -} > > > > static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct > zswap_pool *pool) > > { > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > for (;;) { > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > - mutex_lock(&acomp_ctx->mutex); > > - if (likely(acomp_ctx->req)) > > - return acomp_ctx; > > /* > > - * It is possible that we were migrated to a different CPU > after > > - * getting the per-CPU ctx but before the mutex was > acquired. If > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > have > > - * already freed ctx->req (among other things) and set it to > > - * NULL. Just try again on the new CPU that we ended up on. > > + * If the CPU onlining code successfully allocates acomp_ctx > resources, > > + * it sets acomp_ctx->__online to true. Until this happens, we > have > > + * two options: > > + * > > + * 1. Return NULL and fail all stores on this CPU. > > + * 2. Retry, until onlining has finished allocating resources. > > + * > > + * In theory, option 1 could be more appropriate, because it > > + * allows the calling procedure to decide how it wants to > handle > > + * reclaim racing with CPU hotplug. For instance, it might be > Ok > > + * for compress to return an error for the backing swap device > > + * to store the folio. Decompress could wait until we get a > > + * valid and locked mutex after onlining has completed. For > now, > > + * we go with option 2 because adding a do-while in > > + * zswap_decompress() adds latency for software > compressors. > > + * > > + * Once initialized, the resources will be de-allocated only > > + * when the pool is destroyed. The acomp_ctx will hold on to > the > > + * resources through CPU offlining/onlining at any time until > > + * the pool is destroyed. > > + * > > + * This prevents races/deadlocks between reclaim and CPU > acomp_ctx > > + * resource allocation that are a dependency for reclaim. > > + * It further simplifies the interaction with CPU onlining and > > + * offlining: > > + * > > + * - CPU onlining does not take the mutex. It only allocates > > + * resources and sets __online to true. > > + * - CPU offlining acquires the mutex before setting > > + * __online to false. If reclaim has acquired the mutex, > > + * offlining will have to wait for reclaim to complete before > > + * hotunplug can proceed. Further, hotplug merely sets > > + * __online to false. 
It does not delete the acomp_ctx > > + * resources. > > + * > > + * Option 1 is better than potentially not exiting the earlier > > + * for (;;) loop because the system is running low on memory > > + * and/or CPUs are getting offlined for whatever reason. At > > + * least failing this store will prevent data loss by failing > > + * zswap_store(), and saving the data in the backing swap > device. > > */ > > I believe we can dropped. I don't think we can have any store/load > operations on a CPU before it's fully onlined, and we should always have > a reference on the pool here, so the resources cannot go away. > > So unless I missed something we can drop this completely now and just > hold the mutex directly in the load/store paths. Based on the above explanations, please let me know if it is a good idea to keep the __online, or if you think further simplification is possible. Thanks, Kanchana > > > + mutex_lock(&acomp_ctx->mutex); > > + if (likely(acomp_ctx->__online)) > > + return acomp_ctx; > > + > > mutex_unlock(&acomp_ctx->mutex); > > } > > } > > -- > > 2.27.0 > >
On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > Sent: Thursday, March 6, 2025 11:36 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > > Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > > allocation/deletion and mutex lock usage. > > > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > > "struct crypto_acomp_ctx" which simplify a few things: > > > > > > 1) zswap_pool_create() will initialize all members of each percpu > > acomp_ctx > > > to 0 or NULL and only then initialize the mutex. > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > > to true, without locking the mutex. > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > > will not delete any resources. > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > > is true, and if so, return the mutex for use in zswap compress and > > > decompress ops. > > > 5) CPU onlining after offlining will simply check if either __online or > > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > > resources. > > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() > > to > > > delete the acomp_ctx resources. > > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > > acomp_ctx_dealloc(). > > > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > > accordingly. > > > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > > CPU > > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > > > IAA with batching: 64.8 KB > > > Software compressors: 8.2 KB > > > > > > I would appreciate code review comments on whether this memory cost is > > > acceptable, for the latency improvement that it provides due to a faster > > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > > if so, set __online to true and return, and reclaim can proceed. > > > > I like the idea of allocating the resources on memory hotplug but > > leaving them allocated until the pool is torn down. It avoids allocating > > unnecessary memory if some CPUs are never onlined, but it simplifies > > things because we don't have to synchronize against the resources being > > freed in CPU offline. > > > > The only case that would suffer from this AFAICT is if someone onlines > > many CPUs, uses them once, and then offline them and not use them again. 
> > I am not familiar with CPU hotplug use cases so I can't tell if that's > > something people do, but I am inclined to agree with this > > simplification. > > Thanks Yosry, for your code review comments! Good to know that this > simplification is acceptable. > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > --- > > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++-------------- > > ---- > > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 10f2a16e7586..cff96df1df8b 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > > struct crypto_acomp_ctx { > > > struct crypto_acomp *acomp; > > > struct acomp_req *req; > > > - struct crypto_wait wait; > > > > Is there a reason for moving this? If not please avoid unrelated changes. > > The reason is so that req/buffer, and reqs/buffers with batching, go together > logically, hence I found this easier to understand. I can restore this to the > original order, if that's preferable. I see. In that case, this fits better in the patch that actually adds support for having multiple requests and buffers, and please call it out explicitly in the commit message. > > > > > > u8 *buffer; > > > + u8 nr_reqs; > > > + struct crypto_wait wait; > > > struct mutex mutex; > > > bool is_sleepable; > > > + bool __online; > > > > I don't believe we need this. > > > > If we are not freeing resources during CPU offlining, then we do not > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > The whole point of synchronizing between offlining and > > compress/decompress operations is to avoid UAF. If offlining does not > > free resources, then we can hold the mutex directly in the > > compress/decompress path and drop the hotunplug callback completely. > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > it's only used know when to set __online. > > All great points! In fact, that was the original solution I had implemented > (not having an offline callback). But then, I spent some time understanding > the v6.13 hotfix for synchronizing freeing of resources, and this comment > in zswap_cpu_comp_prepare(): > > /* > * Only hold the mutex after completing allocations, otherwise we may > * recurse into zswap through reclaim and attempt to hold the mutex > * again resulting in a deadlock. > */ > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > something to comprehend in the simplification (even though I had a tough > time imagining how this could happen). The constraint here is about zswap_cpu_comp_prepare() holding the mutex, making an allocation which internally triggers reclaim, then recursing into zswap and trying to hold the same mutex again causing a deadlock. If zswap_cpu_comp_prepare() does not need to hold the mutex to begin with, the constraint naturally goes away. > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > does not acquire the mutex lock while allocating resources. We have already > initialized the mutex, so in theory, it is possible for compress/decompress > to acquire the mutex lock. The __online acts as a way to indicate whether > compress/decompress can proceed reliably to use the resources. 
For compress/decompress to acquire the mutex they need to run on that CPU, and I don't think that's possible before onlining completes, so zswap_cpu_comp_prepare() must have already completed before compress/decompress can use that CPU IIUC. > > The "nr_reqs" was needed as a way to distinguish between initial and > subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that > goes through an online-offline-online sequence. In the initial onlining, > we need to allocate resources because nr_reqs=0. If resources are to > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect > nr_reqs as being greater than 0 and return, thus avoiding any latency > delays before reclaim/page-faults can be handled on that CPU. > > Please let me know if this rationale seems reasonable for why > __online and nr_reqs were introduced. Based on what I said, I still don't believe they are needed, but please correct me if I am wrong. [..] > > I also see some ordering changes inside the function (e.g. we now > > allocate the request before the buffer). Not sure if these are > > intentional. If not, please keep the diff to the required changes only. > > The reason for this was, I am trying to organize the allocations based > on dependencies. Unless requests are allocated, there is no point in > allocating buffers. Please let me know if this is Ok. Please separate refactoring changes in general from functional changes because it makes code review harder. In this specific instance, I think moving the code is probably not worth it, as there's also no point in allocating requests if we cannot allocate buffers. In fact, since the buffers are larger, in theory their allocation is more likely to fail, so it makes sense to do it first. Anyway, please propose such refactoring changes separately and they can be discussed as such. [..] > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node > > *node) > > > +{ > > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > > node); > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > >acomp_ctx, cpu); > > > + > > > + /* > > > + * The lifetime of acomp_ctx resources is from pool creation to > > > + * pool deletion. > > > + * > > > + * Reclaims should not be happening because, we get to this routine > > only > > > + * in two scenarios: > > > + * > > > + * 1) pool creation failures before/during the pool ref initialization. > > > + * 2) we are in the process of releasing the pool, it is off the > > > + * zswap_pools list and has no references. > > > + * > > > + * Hence, there is no need for locks. > > > + */ > > > + acomp_ctx->__online = false; > > > + acomp_ctx_dealloc(acomp_ctx); > > > > Since __online can be dropped, we can probably drop > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > > I suppose there is value in having a way in zswap to know for sure, that > resource allocation has completed, and it is safe for compress/decompress > to proceed. Especially because the mutex has been initialized before we > get to resource allocation. Would you agree? As I mentioned above, I believe compress/decompress cannot run on a CPU before the onlining completes. Please correct me if I am wrong. 
> > > > > > +} > > > + > > > static struct zswap_pool *zswap_pool_create(char *type, char > > *compressor) > > > { > > > struct zswap_pool *pool; > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > *zswap_pool_create(char *type, char *compressor) > > > goto error; > > > } > > > > > > - for_each_possible_cpu(cpu) > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > + for_each_possible_cpu(cpu) { > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > >acomp_ctx, cpu); > > > + > > > + acomp_ctx->acomp = NULL; > > > + acomp_ctx->req = NULL; > > > + acomp_ctx->buffer = NULL; > > > + acomp_ctx->__online = false; > > > + acomp_ctx->nr_reqs = 0; > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > right away? > > Yes, I figured this is needed for two reasons: > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > zswap_cpu_comp_dealloc() to be handled by the common procedure > "acomp_ctx_dealloc()" unambiguously. This makes sense. When you move the refactoring to create acomp_ctx_dealloc() to a separate patch, please include this change in it and call it out explicitly in the commit message. > 2) The second scenario I thought of that would need this, is let's say > the zswap compressor is switched immediately after setting the > compressor. Some cores have executed the onlining code and > some haven't. Because there are no pool refs held, > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > it would help to initialize these acomp_ctx members before the > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). I believe cpuhp_state_add_instance() calls the onlining function synchronously on all present CPUs, so I don't think it's possible to end up in a state where the pool is being destroyed and some CPU executed zswap_cpu_comp_prepare() while others haven't. That being said, this made me think of a different problem. If pool destruction races with CPU onlining, there could be a race between zswap_cpu_comp_prepare() allocating resources and zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. I believe we must always call cpuhp_state_remove_instance() *before* freeing the resources to prevent this race from happening. This needs to be documented with a comment. Let me know if I missed something. > > Please let me know if these are valid considerations. > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()" > for the acomp_ctx? Yeah if we need to initialize all/most fields to 0 let's use alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. [..] > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > for (;;) { > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > - mutex_lock(&acomp_ctx->mutex); > > > - if (likely(acomp_ctx->req)) > > > - return acomp_ctx; > > > /* > > > - * It is possible that we were migrated to a different CPU > > after > > > - * getting the per-CPU ctx but before the mutex was > > acquired. If > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > have > > > - * already freed ctx->req (among other things) and set it to > > > - * NULL. Just try again on the new CPU that we ended up on. > > > + * If the CPU onlining code successfully allocates acomp_ctx > > resources, > > > + * it sets acomp_ctx->__online to true. 
Until this happens, we > > have > > > + * two options: > > > + * > > > + * 1. Return NULL and fail all stores on this CPU. > > > + * 2. Retry, until onlining has finished allocating resources. > > > + * > > > + * In theory, option 1 could be more appropriate, because it > > > + * allows the calling procedure to decide how it wants to > > handle > > > + * reclaim racing with CPU hotplug. For instance, it might be > > Ok > > > + * for compress to return an error for the backing swap device > > > + * to store the folio. Decompress could wait until we get a > > > + * valid and locked mutex after onlining has completed. For > > now, > > > + * we go with option 2 because adding a do-while in > > > + * zswap_decompress() adds latency for software > > compressors. > > > + * > > > + * Once initialized, the resources will be de-allocated only > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > the > > > + * resources through CPU offlining/onlining at any time until > > > + * the pool is destroyed. > > > + * > > > + * This prevents races/deadlocks between reclaim and CPU > > acomp_ctx > > > + * resource allocation that are a dependency for reclaim. > > > + * It further simplifies the interaction with CPU onlining and > > > + * offlining: > > > + * > > > + * - CPU onlining does not take the mutex. It only allocates > > > + * resources and sets __online to true. > > > + * - CPU offlining acquires the mutex before setting > > > + * __online to false. If reclaim has acquired the mutex, > > > + * offlining will have to wait for reclaim to complete before > > > + * hotunplug can proceed. Further, hotplug merely sets > > > + * __online to false. It does not delete the acomp_ctx > > > + * resources. > > > + * > > > + * Option 1 is better than potentially not exiting the earlier > > > + * for (;;) loop because the system is running low on memory > > > + * and/or CPUs are getting offlined for whatever reason. At > > > + * least failing this store will prevent data loss by failing > > > + * zswap_store(), and saving the data in the backing swap > > device. > > > */ > > > > I believe we can dropped. I don't think we can have any store/load > > operations on a CPU before it's fully onlined, and we should always have > > a reference on the pool here, so the resources cannot go away. > > > > So unless I missed something we can drop this completely now and just > > hold the mutex directly in the load/store paths. > > Based on the above explanations, please let me know if it is a good idea > to keep the __online, or if you think further simplification is possible. I still think it's not needed. Let me know if I missed anything.
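For concreteness, the simplification being argued for here would reduce acomp_ctx_get_cpu_lock() to roughly the sketch below. This is illustrative only, not code from any posted version; it assumes the acomp_ctx resources live from pool creation to pool destruction and that every caller holds a pool reference:

static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
        struct crypto_acomp_ctx *acomp_ctx;

        /*
         * No retry loop and no __online check: the per-CPU resources are
         * allocated before this CPU can run tasks, and are only freed at
         * pool destruction, so if we can execute here they must exist.
         */
        acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
        mutex_lock(&acomp_ctx->mutex);
        return acomp_ctx;
}

Under this model the zswap_cpu_comp_dead() callback and the __online flag disappear entirely.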
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Friday, March 7, 2025 11:30 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > > Sent: Thursday, March 6, 2025 11:36 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; > > > Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > > > allocation/deletion and mutex lock usage. > > > > > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > > > "struct crypto_acomp_ctx" which simplify a few things: > > > > > > > > 1) zswap_pool_create() will initialize all members of each percpu > > > acomp_ctx > > > > to 0 or NULL and only then initialize the mutex. > > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > > > to true, without locking the mutex. > > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > > > will not delete any resources. > > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > > > is true, and if so, return the mutex for use in zswap compress and > > > > decompress ops. > > > > 5) CPU onlining after offlining will simply check if either __online or > > > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > > > resources. > > > > 6) zswap_pool_destroy() will call a newly added > zswap_cpu_comp_dealloc() > > > to > > > > delete the acomp_ctx resources. > > > > 7) Common resource deletion code in case of > zswap_cpu_comp_prepare() > > > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a > new > > > > acomp_ctx_dealloc(). > > > > > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > > > accordingly. 
> > > > > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > > > CPU > > > > offlining, and only deleting them when the pool is destroyed, is as > follows: > > > > > > > > IAA with batching: 64.8 KB > > > > Software compressors: 8.2 KB > > > > > > > > I would appreciate code review comments on whether this memory cost > is > > > > acceptable, for the latency improvement that it provides due to a faster > > > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, > and > > > > if so, set __online to true and return, and reclaim can proceed. > > > > > > I like the idea of allocating the resources on memory hotplug but > > > leaving them allocated until the pool is torn down. It avoids allocating > > > unnecessary memory if some CPUs are never onlined, but it simplifies > > > things because we don't have to synchronize against the resources being > > > freed in CPU offline. > > > > > > The only case that would suffer from this AFAICT is if someone onlines > > > many CPUs, uses them once, and then offline them and not use them > again. > > > I am not familiar with CPU hotplug use cases so I can't tell if that's > > > something people do, but I am inclined to agree with this > > > simplification. > > > > Thanks Yosry, for your code review comments! Good to know that this > > simplification is acceptable. > > > > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > > --- > > > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++---------- > ---- > > > ---- > > > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 10f2a16e7586..cff96df1df8b 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > > > struct crypto_acomp_ctx { > > > > struct crypto_acomp *acomp; > > > > struct acomp_req *req; > > > > - struct crypto_wait wait; > > > > > > Is there a reason for moving this? If not please avoid unrelated changes. > > > > The reason is so that req/buffer, and reqs/buffers with batching, go together > > logically, hence I found this easier to understand. I can restore this to the > > original order, if that's preferable. > > I see. In that case, this fits better in the patch that actually adds > support for having multiple requests and buffers, and please call it out > explicitly in the commit message. Thanks Yosry, for the follow up comments. Sure, this makes sense. > > > > > > > > > > u8 *buffer; > > > > + u8 nr_reqs; > > > > + struct crypto_wait wait; > > > > struct mutex mutex; > > > > bool is_sleepable; > > > > + bool __online; > > > > > > I don't believe we need this. > > > > > > If we are not freeing resources during CPU offlining, then we do not > > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > > > The whole point of synchronizing between offlining and > > > compress/decompress operations is to avoid UAF. If offlining does not > > > free resources, then we can hold the mutex directly in the > > > compress/decompress path and drop the hotunplug callback completely. > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > > it's only used know when to set __online. > > > > All great points! In fact, that was the original solution I had implemented > > (not having an offline callback). 
But then, I spent some time understanding > > the v6.13 hotfix for synchronizing freeing of resources, and this comment > > in zswap_cpu_comp_prepare(): > > > > /* > > * Only hold the mutex after completing allocations, otherwise we > may > > * recurse into zswap through reclaim and attempt to hold the mutex > > * again resulting in a deadlock. > > */ > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > > something to comprehend in the simplification (even though I had a tough > > time imagining how this could happen). > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex, > making an allocation which internally triggers reclaim, then recursing > into zswap and trying to hold the same mutex again causing a deadlock. > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin > with, the constraint naturally goes away. Actually, if it is possible for the allocations in zswap_cpu_comp_prepare() to trigger reclaim, then I believe we need some way for reclaim to know if the acomp_ctx resources are available. Hence, this seems like a potential for deadlock regardless of the mutex. I verified that all the zswap_cpu_comp_prepare() allocations are done with GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare() in general, i.e., aside from this patchset. I can think of the following options to resolve this, and would welcome other suggestions: 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check if acomp_ctx->__online is true, and if so, return the mutex. If acomp_ctx->__online is false, then it returns NULL. In other words, we don't have the for loop. - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare() to fail, cpuhotplug to fail. However, there is no deadlock. - zswap_compress() will need to detect NULL returned by acomp_ctx_get_cpu_lock(), and return an error. - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling acomp_ctx_get_cpu_lock(). - We won't be migrated to a different CPU because we hold the mutex, hence zswap_cpu_comp_dead() will wait on the mutex. 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h, but for non-THP allocations. If we decide to adopt this approach, we would need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this new gfp_t, in lieu of GFP_KERNEL. > > > > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > > does not acquire the mutex lock while allocating resources. We have > already > > initialized the mutex, so in theory, it is possible for compress/decompress > > to acquire the mutex lock. The __online acts as a way to indicate whether > > compress/decompress can proceed reliably to use the resources. > > For compress/decompress to acquire the mutex they need to run on that > CPU, and I don't think that's possible before onlining completes, so > zswap_cpu_comp_prepare() must have already completed before > compress/decompress can use that CPU IIUC. If we can make this assumption, that would be great! However, I am not totally sure because of the GFP_KERNEL allocations in zswap_cpu_comp_prepare(). 
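A minimal sketch of option (1) above, assuming __online is kept; this is hypothetical, and the posted patch retains the retry loop instead:

static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
        struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

        mutex_lock(&acomp_ctx->mutex);
        if (likely(acomp_ctx->__online))
                return acomp_ctx;

        /*
         * Onlining has not finished allocating resources: fail fast rather
         * than loop, so a reclaim recursion cannot spin or deadlock here.
         * zswap_compress() would treat a NULL return as a store failure.
         */
        mutex_unlock(&acomp_ctx->mutex);
        return NULL;
}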
> > > > > The "nr_reqs" was needed as a way to distinguish between initial and > > subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that > > goes through an online-offline-online sequence. In the initial onlining, > > we need to allocate resources because nr_reqs=0. If resources are to > > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate > > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect > > nr_reqs as being greater than 0 and return, thus avoiding any latency > > delays before reclaim/page-faults can be handled on that CPU. > > > > Please let me know if this rationale seems reasonable for why > > __online and nr_reqs were introduced. > > Based on what I said, I still don't believe they are needed, but please > correct me if I am wrong. Same comments as above. > > [..] > > > I also see some ordering changes inside the function (e.g. we now > > > allocate the request before the buffer). Not sure if these are > > > intentional. If not, please keep the diff to the required changes only. > > > > The reason for this was, I am trying to organize the allocations based > > on dependencies. Unless requests are allocated, there is no point in > > allocating buffers. Please let me know if this is Ok. > > Please separate refactoring changes in general from functional changes > because it makes code review harder. Sure, I will do so. > > In this specific instance, I think moving the code is probably not worth > it, as there's also no point in allocating requests if we cannot > allocate buffers. In fact, since the buffers are larger, in theory their > allocation is more likely to fail, so it makes since to do it first. Understood, makes better sense than allocating the requests first. > > Anyway, please propose such refactoring changes separately and they can > be discussed as such. Ok. > > [..] > > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct > hlist_node > > > *node) > > > > +{ > > > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > > > node); > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > >acomp_ctx, cpu); > > > > + > > > > + /* > > > > + * The lifetime of acomp_ctx resources is from pool creation to > > > > + * pool deletion. > > > > + * > > > > + * Reclaims should not be happening because, we get to this routine > > > only > > > > + * in two scenarios: > > > > + * > > > > + * 1) pool creation failures before/during the pool ref initialization. > > > > + * 2) we are in the process of releasing the pool, it is off the > > > > + * zswap_pools list and has no references. > > > > + * > > > > + * Hence, there is no need for locks. > > > > + */ > > > > + acomp_ctx->__online = false; > > > > + acomp_ctx_dealloc(acomp_ctx); > > > > > > Since __online can be dropped, we can probably drop > > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > > > > I suppose there is value in having a way in zswap to know for sure, that > > resource allocation has completed, and it is safe for compress/decompress > > to proceed. Especially because the mutex has been initialized before we > > get to resource allocation. Would you agree? > > As I mentioned above, I believe compress/decompress cannot run on a CPU > before the onlining completes. Please correct me if I am wrong. 
> > > > > > > > > > +} > > > > + > > > > static struct zswap_pool *zswap_pool_create(char *type, char > > > *compressor) > > > > { > > > > struct zswap_pool *pool; > > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > > *zswap_pool_create(char *type, char *compressor) > > > > goto error; > > > > } > > > > > > > > - for_each_possible_cpu(cpu) > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > > + for_each_possible_cpu(cpu) { > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > >acomp_ctx, cpu); > > > > + > > > > + acomp_ctx->acomp = NULL; > > > > + acomp_ctx->req = NULL; > > > > + acomp_ctx->buffer = NULL; > > > > + acomp_ctx->__online = false; > > > > + acomp_ctx->nr_reqs = 0; > > > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > > right away? > > > > Yes, I figured this is needed for two reasons: > > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > > zswap_cpu_comp_dealloc() to be handled by the common procedure > > "acomp_ctx_dealloc()" unambiguously. > > This makes sense. When you move the refactoring to create > acomp_ctx_dealloc() to a separate patch, please include this change in > it and call it out explicitly in the commit message. Sure. > > > 2) The second scenario I thought of that would need this, is let's say > > the zswap compressor is switched immediately after setting the > > compressor. Some cores have executed the onlining code and > > some haven't. Because there are no pool refs held, > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > > it would help to initialize these acomp_ctx members before the > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). > > I believe cpuhp_state_add_instance() calls the onlining function > synchronously on all present CPUs, so I don't think it's possible to end > up in a state where the pool is being destroyed and some CPU executed > zswap_cpu_comp_prepare() while others haven't. I looked at the cpuhotplug code some more. The startup callback is invoked sequentially for_each_present_cpu(). If an error occurs for any one of them, it calls the teardown callback only on the earlier cores that have already finished running the startup callback. However, zswap_cpu_comp_dealloc() will be called for all cores, even the ones for which the startup callback was not run. Hence, I believe the zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO) to allocate the acomp_ctx. > > That being said, this made me think of a different problem. If pool > destruction races with CPU onlining, there could be a race between > zswap_cpu_comp_prepare() allocating resources and > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. > > I believe we must always call cpuhp_state_remove_instance() *before* > freeing the resources to prevent this race from happening. This needs to > be documented with a comment. Yes, this race condition is possible, thanks for catching this! The problem with calling cpuhp_state_remove_instance() before freeing the resources is that cpuhp_state_add_instance() and cpuhp_state_remove_instance() both acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence are serialized. For the reasons motivating why acomp_ctx->__online is set to false in zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance() before calling acomp_ctx_dealloc() because the latter could wait until acomp_ctx->__online to be true before deleting the resources. 
I will think about this some more. Another possibility is to not rely on cpuhotplug in zswap, and instead manage the per-cpu acomp_ctx resource allocation entirely in zswap_pool_create(), and deletion entirely in zswap_pool_destroy(), along with the necessary error handling. Let me think about this some more as well. > > Let me know if I missed something. > > > > > Please let me know if these are valid considerations. > > > > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of > "alloc_percpu()" > > for the acomp_ctx? > > Yeah if we need to initialize all/most fields to 0 let's use > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. Sounds good. > > [..] > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > > > for (;;) { > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > > - mutex_lock(&acomp_ctx->mutex); > > > > - if (likely(acomp_ctx->req)) > > > > - return acomp_ctx; > > > > /* > > > > - * It is possible that we were migrated to a different CPU > > > after > > > > - * getting the per-CPU ctx but before the mutex was > > > acquired. If > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > > have > > > > - * already freed ctx->req (among other things) and set it to > > > > - * NULL. Just try again on the new CPU that we ended up on. > > > > + * If the CPU onlining code successfully allocates acomp_ctx > > > resources, > > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > > have > > > > + * two options: > > > > + * > > > > + * 1. Return NULL and fail all stores on this CPU. > > > > + * 2. Retry, until onlining has finished allocating resources. > > > > + * > > > > + * In theory, option 1 could be more appropriate, because it > > > > + * allows the calling procedure to decide how it wants to > > > handle > > > > + * reclaim racing with CPU hotplug. For instance, it might be > > > Ok > > > > + * for compress to return an error for the backing swap device > > > > + * to store the folio. Decompress could wait until we get a > > > > + * valid and locked mutex after onlining has completed. For > > > now, > > > > + * we go with option 2 because adding a do-while in > > > > + * zswap_decompress() adds latency for software > > > compressors. > > > > + * > > > > + * Once initialized, the resources will be de-allocated only > > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > > the > > > > + * resources through CPU offlining/onlining at any time until > > > > + * the pool is destroyed. > > > > + * > > > > + * This prevents races/deadlocks between reclaim and CPU > > > acomp_ctx > > > > + * resource allocation that are a dependency for reclaim. > > > > + * It further simplifies the interaction with CPU onlining and > > > > + * offlining: > > > > + * > > > > + * - CPU onlining does not take the mutex. It only allocates > > > > + * resources and sets __online to true. > > > > + * - CPU offlining acquires the mutex before setting > > > > + * __online to false. If reclaim has acquired the mutex, > > > > + * offlining will have to wait for reclaim to complete before > > > > + * hotunplug can proceed. Further, hotplug merely sets > > > > + * __online to false. It does not delete the acomp_ctx > > > > + * resources. 
> > > > + * > > > > + * Option 1 is better than potentially not exiting the earlier > > > > + * for (;;) loop because the system is running low on memory > > > > + * and/or CPUs are getting offlined for whatever reason. At > > > > + * least failing this store will prevent data loss by failing > > > > + * zswap_store(), and saving the data in the backing swap > > > device. > > > > */ > > > > > > I believe we can dropped. I don't think we can have any store/load > > > operations on a CPU before it's fully onlined, and we should always have > > > a reference on the pool here, so the resources cannot go away. > > > > > > So unless I missed something we can drop this completely now and just > > > hold the mutex directly in the load/store paths. > > > > Based on the above explanations, please let me know if it is a good idea > > to keep the __online, or if you think further simplification is possible. > > I still think it's not needed. Let me know if I missed anything. Let me think some more about whether it is feasible to not have cpuhotplug manage the acomp_ctx resource allocation, and instead have this be done through the pool creation/deletion routines. Thanks, Kanchana
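For reference, one teardown ordering that removes the hotplug instance before any frees, assuming __online is dropped as suggested above; a sketch only, which elides the zpool teardown and kfree that the real zswap_pool_destroy() also performs:

static void zswap_pool_destroy(struct zswap_pool *pool)
{
        int cpu;

        zswap_pool_debug("destroying", pool);

        /*
         * Unregister first: once cpuhp_state_remove_instance() returns,
         * no zswap_cpu_comp_prepare() can run concurrently and allocate
         * resources that the loop below is about to free.
         */
        cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);

        for_each_possible_cpu(cpu)
                acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));

        free_percpu(pool->acomp_ctx);
}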
On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote: > [..] > > > > > u8 *buffer; > > > > > + u8 nr_reqs; > > > > > + struct crypto_wait wait; > > > > > struct mutex mutex; > > > > > bool is_sleepable; > > > > > + bool __online; > > > > > > > > I don't believe we need this. > > > > > > > > If we are not freeing resources during CPU offlining, then we do not > > > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > > > > > The whole point of synchronizing between offlining and > > > > compress/decompress operations is to avoid UAF. If offlining does not > > > > free resources, then we can hold the mutex directly in the > > > > compress/decompress path and drop the hotunplug callback completely. > > > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > > > it's only used know when to set __online. > > > > > > All great points! In fact, that was the original solution I had implemented > > > (not having an offline callback). But then, I spent some time understanding > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment > > > in zswap_cpu_comp_prepare(): > > > > > > /* > > > * Only hold the mutex after completing allocations, otherwise we > > may > > > * recurse into zswap through reclaim and attempt to hold the mutex > > > * again resulting in a deadlock. > > > */ > > > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > > > something to comprehend in the simplification (even though I had a tough > > > time imagining how this could happen). > > > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex, > > making an allocation which internally triggers reclaim, then recursing > > into zswap and trying to hold the same mutex again causing a deadlock. > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin > > with, the constraint naturally goes away. > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare() > to trigger reclaim, then I believe we need some way for reclaim to know if > the acomp_ctx resources are available. Hence, this seems like a potential > for deadlock regardless of the mutex. I took a closer look and I believe my hotfix was actually unnecessary. I sent it out in response to a syzbot report, but upon closer look it seems like it was not an actual problem. Sorry if my patch confused you. Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above says: * PREPARE: The callbacks are invoked on a control CPU before the * hotplugged CPU is started up or after the hotplugged CPU has died. So even if we go into reclaim during zswap_cpu_comp_prepare(), it will never be on the CPU that we are allocating resources for. The other case where zswap_cpu_comp_prepare() could race with compression/decompression is when a pool is being created. In this case, reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same CPU AFAICT. However, because the pool is still under creation, it will not be used (i.e. zswap_pool_current_get() won't find it). So I think we don't need to worry about zswap_cpu_comp_prepare() racing with compression or decompression for the same pool and CPU. > > I verified that all the zswap_cpu_comp_prepare() allocations are done with > GFP_KERNEL, which implicitly allows direct reclaim. 
So this appears to be a > risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare() > in general, i.e., aside from this patchset. > > I can think of the following options to resolve this, and would welcome > other suggestions: > > 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check > if acomp_ctx->__online is true, and if so, return the mutex. If > acomp_ctx->__online is false, then it returns NULL. In other words, we > don't have the for loop. > - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare() > to fail, cpuhotplug to fail. However, there is no deadlock. > - zswap_compress() will need to detect NULL returned by > acomp_ctx_get_cpu_lock(), and return an error. > - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling > acomp_ctx_get_cpu_lock(). > - We won't be migrated to a different CPU because we hold the mutex, hence > zswap_cpu_comp_dead() will wait on the mutex. > > 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim > and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h, > but for non-THP allocations. If we decide to adopt this approach, we would > need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c > to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this > new gfp_t, in lieu of GFP_KERNEL. > > > > > > > > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > > > does not acquire the mutex lock while allocating resources. We have > > already > > > initialized the mutex, so in theory, it is possible for compress/decompress > > > to acquire the mutex lock. The __online acts as a way to indicate whether > > > compress/decompress can proceed reliably to use the resources. > > > > For compress/decompress to acquire the mutex they need to run on that > > CPU, and I don't think that's possible before onlining completes, so > > zswap_cpu_comp_prepare() must have already completed before > > compress/decompress can use that CPU IIUC. > > If we can make this assumption, that would be great! However, I am not > totally sure because of the GFP_KERNEL allocations in > zswap_cpu_comp_prepare(). As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one of two situations: - The pool is under creation, so we cannot race with stores/loads from that same pool. - The CPU is being onlined, in which case zswap_cpu_comp_prepare() is called from a control CPU before tasks start running on the CPU being onlined. Please correct me if I am wrong. [..] > > > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > > > *zswap_pool_create(char *type, char *compressor) > > > > > goto error; > > > > > } > > > > > > > > > > - for_each_possible_cpu(cpu) > > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > > >acomp_ctx, cpu); > > > > > + > > > > > + acomp_ctx->acomp = NULL; > > > > > + acomp_ctx->req = NULL; > > > > > + acomp_ctx->buffer = NULL; > > > > > + acomp_ctx->__online = false; > > > > > + acomp_ctx->nr_reqs = 0; > > > > > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > > > right away? > > > > > > Yes, I figured this is needed for two reasons: > > > > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > > > zswap_cpu_comp_dealloc() to be handled by the common procedure > > > "acomp_ctx_dealloc()" unambiguously. > > > > This makes sense. 
When you move the refactoring to create > > acomp_ctx_dealloc() to a separate patch, please include this change in > > it and call it out explicitly in the commit message. > > Sure. > > > > > > 2) The second scenario I thought of that would need this, is let's say > > > the zswap compressor is switched immediately after setting the > > > compressor. Some cores have executed the onlining code and > > > some haven't. Because there are no pool refs held, > > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > > > it would help to initialize these acomp_ctx members before the > > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). > > > > I believe cpuhp_state_add_instance() calls the onlining function > > synchronously on all present CPUs, so I don't think it's possible to end > > up in a state where the pool is being destroyed and some CPU executed > > zswap_cpu_comp_prepare() while others haven't. > > I looked at the cpuhotplug code some more. The startup callback is > invoked sequentially for_each_present_cpu(). If an error occurs for any > one of them, it calls the teardown callback only on the earlier cores that > have already finished running the startup callback. However, > zswap_cpu_comp_dealloc() will be called for all cores, even the ones > for which the startup callback was not run. Hence, I believe the > zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO) > to allocate the acomp_ctx. Yeah this is point (1) above IIUC, and I agree about zero initialization for that. > > > > > That being said, this made me think of a different problem. If pool > > destruction races with CPU onlining, there could be a race between > > zswap_cpu_comp_prepare() allocating resources and > > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. > > > > I believe we must always call cpuhp_state_remove_instance() *before* > > freeing the resources to prevent this race from happening. This needs to > > be documented with a comment. > > Yes, this race condition is possible, thanks for catching this! The problem with > calling cpuhp_state_remove_instance() before freeing the resources is that > cpuhp_state_add_instance() and cpuhp_state_remove_instance() both > acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence > are serialized. > > For the reasons motivating why acomp_ctx->__online is set to false in > zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance() > before calling acomp_ctx_dealloc() because the latter could wait until > acomp_ctx->__online to be true before deleting the resources. I will > think about this some more. I believe this problem goes away with acomp_ctx->__online going away, right? > > Another possibility is to not rely on cpuhotplug in zswap, and instead > manage the per-cpu acomp_ctx resource allocation entirely in > zswap_pool_create(), and deletion entirely in zswap_pool_destroy(), > along with the necessary error handling. Let me think about this some > more as well. > > > > > Let me know if I missed something. > > > > > > > > Please let me know if these are valid considerations. > > > > > > > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > > > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of > > "alloc_percpu()" > > > for the acomp_ctx? > > > > Yeah if we need to initialize all/most fields to 0 let's use > > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. > > Sounds good. > > > > > [..] 
> > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > > > > > for (;;) { > > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > > > - mutex_lock(&acomp_ctx->mutex); > > > > > - if (likely(acomp_ctx->req)) > > > > > - return acomp_ctx; > > > > > /* > > > > > - * It is possible that we were migrated to a different CPU > > > > after > > > > > - * getting the per-CPU ctx but before the mutex was > > > > acquired. If > > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > > > have > > > > > - * already freed ctx->req (among other things) and set it to > > > > > - * NULL. Just try again on the new CPU that we ended up on. > > > > > + * If the CPU onlining code successfully allocates acomp_ctx > > > > resources, > > > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > > > have > > > > > + * two options: > > > > > + * > > > > > + * 1. Return NULL and fail all stores on this CPU. > > > > > + * 2. Retry, until onlining has finished allocating resources. > > > > > + * > > > > > + * In theory, option 1 could be more appropriate, because it > > > > > + * allows the calling procedure to decide how it wants to > > > > handle > > > > > + * reclaim racing with CPU hotplug. For instance, it might be > > > > Ok > > > > > + * for compress to return an error for the backing swap device > > > > > + * to store the folio. Decompress could wait until we get a > > > > > + * valid and locked mutex after onlining has completed. For > > > > now, > > > > > + * we go with option 2 because adding a do-while in > > > > > + * zswap_decompress() adds latency for software > > > > compressors. > > > > > + * > > > > > + * Once initialized, the resources will be de-allocated only > > > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > > > the > > > > > + * resources through CPU offlining/onlining at any time until > > > > > + * the pool is destroyed. > > > > > + * > > > > > + * This prevents races/deadlocks between reclaim and CPU > > > > acomp_ctx > > > > > + * resource allocation that are a dependency for reclaim. > > > > > + * It further simplifies the interaction with CPU onlining and > > > > > + * offlining: > > > > > + * > > > > > + * - CPU onlining does not take the mutex. It only allocates > > > > > + * resources and sets __online to true. > > > > > + * - CPU offlining acquires the mutex before setting > > > > > + * __online to false. If reclaim has acquired the mutex, > > > > > + * offlining will have to wait for reclaim to complete before > > > > > + * hotunplug can proceed. Further, hotplug merely sets > > > > > + * __online to false. It does not delete the acomp_ctx > > > > > + * resources. > > > > > + * > > > > > + * Option 1 is better than potentially not exiting the earlier > > > > > + * for (;;) loop because the system is running low on memory > > > > > + * and/or CPUs are getting offlined for whatever reason. At > > > > > + * least failing this store will prevent data loss by failing > > > > > + * zswap_store(), and saving the data in the backing swap > > > > device. > > > > > */ > > > > > > > > I believe we can dropped. I don't think we can have any store/load > > > > operations on a CPU before it's fully onlined, and we should always have > > > > a reference on the pool here, so the resources cannot go away. > > > > > > > > So unless I missed something we can drop this completely now and just > > > > hold the mutex directly in the load/store paths. 
> > > > > > Based on the above explanations, please let me know if it is a good idea > > > to keep the __online, or if you think further simplification is possible. > > > > I still think it's not needed. Let me know if I missed anything. > > Let me think some more about whether it is feasible to not have cpuhotplug > manage the acomp_ctx resource allocation, and instead have this be done > through the pool creation/deletion routines. I don't think this is necessary, see my comments above.
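Putting the zero-initialization agreement into code, the per-CPU allocation in zswap_pool_create() might look like the sketch below (it assumes the discussed switch to alloc_percpu_gfp(); the error label and pr_err mirror the existing function):

        pool->acomp_ctx = alloc_percpu_gfp(struct crypto_acomp_ctx,
                                           GFP_KERNEL | __GFP_ZERO);
        if (!pool->acomp_ctx) {
                pr_err("percpu alloc failed\n");
                goto error;
        }

        /* Every field starts 0/NULL/false; only the mutex needs explicit init. */
        for_each_possible_cpu(cpu)
                mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);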
diff --git a/mm/zswap.c b/mm/zswap.c
index 10f2a16e7586..cff96df1df8b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
 struct crypto_acomp_ctx {
         struct crypto_acomp *acomp;
         struct acomp_req *req;
-        struct crypto_wait wait;
         u8 *buffer;
+        u8 nr_reqs;
+        struct crypto_wait wait;
         struct mutex mutex;
         bool is_sleepable;
+        bool __online;
 };
 
 /*
@@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 **********************************/
 static void __zswap_pool_empty(struct percpu_ref *ref);
 
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+        if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
+
+                if (!IS_ERR_OR_NULL(acomp_ctx->req))
+                        acomp_request_free(acomp_ctx->req);
+                acomp_ctx->req = NULL;
+
+                kfree(acomp_ctx->buffer);
+                acomp_ctx->buffer = NULL;
+
+                if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+                        crypto_free_acomp(acomp_ctx->acomp);
+
+                acomp_ctx->nr_reqs = 0;
+        }
+}
+
+static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
+{
+        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+        int ret = -ENOMEM;
+
+        /*
+         * Just to be even more fail-safe against changes in assumptions and/or
+         * implementation of the CPU hotplug code.
+         */
+        if (acomp_ctx->__online)
+                return 0;
+
+        if (acomp_ctx->nr_reqs) {
+                acomp_ctx->__online = true;
+                return 0;
+        }
+
+        acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+        if (IS_ERR(acomp_ctx->acomp)) {
+                pr_err("could not alloc crypto acomp %s : %ld\n",
+                       pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+                ret = PTR_ERR(acomp_ctx->acomp);
+                goto fail;
+        }
+
+        acomp_ctx->nr_reqs = 1;
+
+        acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+        if (!acomp_ctx->req) {
+                pr_err("could not alloc crypto acomp_request %s\n",
+                       pool->tfm_name);
+                ret = -ENOMEM;
+                goto fail;
+        }
+
+        acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+        if (!acomp_ctx->buffer) {
+                ret = -ENOMEM;
+                goto fail;
+        }
+
+        crypto_init_wait(&acomp_ctx->wait);
+
+        /*
+         * if the backend of acomp is async zip, crypto_req_done() will wakeup
+         * crypto_wait_req(); if the backend of acomp is scomp, the callback
+         * won't be called, crypto_wait_req() will return without blocking.
+         */
+        acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+                                   crypto_req_done, &acomp_ctx->wait);
+
+        acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
+
+        acomp_ctx->__online = true;
+
+        return 0;
+
+fail:
+        acomp_ctx_dealloc(acomp_ctx);
+
+        return ret;
+}
+
+static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
+{
+        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+        mutex_lock(&acomp_ctx->mutex);
+        acomp_ctx->__online = false;
+        mutex_unlock(&acomp_ctx->mutex);
+
+        return 0;
+}
+
+static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node)
+{
+        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+        /*
+         * The lifetime of acomp_ctx resources is from pool creation to
+         * pool deletion.
+         *
+         * Reclaims should not be happening because, we get to this routine only
+         * in two scenarios:
+         *
+         * 1) pool creation failures before/during the pool ref initialization.
+         * 2) we are in the process of releasing the pool, it is off the
+         *    zswap_pools list and has no references.
+         *
+         * Hence, there is no need for locks.
+         */
+        acomp_ctx->__online = false;
+        acomp_ctx_dealloc(acomp_ctx);
+}
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
         struct zswap_pool *pool;
@@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
                 goto error;
         }
 
-        for_each_possible_cpu(cpu)
-                mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
+        for_each_possible_cpu(cpu) {
+                struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+                acomp_ctx->acomp = NULL;
+                acomp_ctx->req = NULL;
+                acomp_ctx->buffer = NULL;
+                acomp_ctx->__online = false;
+                acomp_ctx->nr_reqs = 0;
+                mutex_init(&acomp_ctx->mutex);
+        }
 
         ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
                                        &pool->node);
         if (ret)
-                goto error;
+                goto ref_fail;
 
         /* being the current pool takes 1 ref; this func expects the
          * caller to always add the new pool as the current pool
@@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
         return pool;
 
 ref_fail:
+        for_each_possible_cpu(cpu)
+                zswap_cpu_comp_dealloc(cpu, &pool->node);
+
         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 
 error:
         if (pool->acomp_ctx)
@@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
 
 static void zswap_pool_destroy(struct zswap_pool *pool)
 {
+        int cpu;
+
         zswap_pool_debug("destroying", pool);
 
+        for_each_possible_cpu(cpu)
+                zswap_cpu_comp_dealloc(cpu, &pool->node);
+
         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
         free_percpu(pool->acomp_ctx);
@@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
 /*********************************
 * compressed storage functions
 **********************************/
-static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
-{
-        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-        struct crypto_acomp *acomp = NULL;
-        struct acomp_req *req = NULL;
-        u8 *buffer = NULL;
-        int ret;
-
-        buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-        if (!buffer) {
-                ret = -ENOMEM;
-                goto fail;
-        }
-
-        acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-        if (IS_ERR(acomp)) {
-                pr_err("could not alloc crypto acomp %s : %ld\n",
-                       pool->tfm_name, PTR_ERR(acomp));
-                ret = PTR_ERR(acomp);
-                goto fail;
-        }
-
-        req = acomp_request_alloc(acomp);
-        if (!req) {
-                pr_err("could not alloc crypto acomp_request %s\n",
-                       pool->tfm_name);
-                ret = -ENOMEM;
-                goto fail;
-        }
-
-        /*
-         * Only hold the mutex after completing allocations, otherwise we may
-         * recurse into zswap through reclaim and attempt to hold the mutex
-         * again resulting in a deadlock.
-         */
-        mutex_lock(&acomp_ctx->mutex);
-        crypto_init_wait(&acomp_ctx->wait);
-
-        /*
-         * if the backend of acomp is async zip, crypto_req_done() will wakeup
-         * crypto_wait_req(); if the backend of acomp is scomp, the callback
-         * won't be called, crypto_wait_req() will return without blocking.
-         */
-        acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
-                                   crypto_req_done, &acomp_ctx->wait);
-
-        acomp_ctx->buffer = buffer;
-        acomp_ctx->acomp = acomp;
-        acomp_ctx->is_sleepable = acomp_is_async(acomp);
-        acomp_ctx->req = req;
-        mutex_unlock(&acomp_ctx->mutex);
-        return 0;
-
-fail:
-        if (acomp)
-                crypto_free_acomp(acomp);
-        kfree(buffer);
-        return ret;
-}
-
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
-        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-
-        mutex_lock(&acomp_ctx->mutex);
-        if (!IS_ERR_OR_NULL(acomp_ctx)) {
-                if (!IS_ERR_OR_NULL(acomp_ctx->req))
-                        acomp_request_free(acomp_ctx->req);
-                acomp_ctx->req = NULL;
-                if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
-                        crypto_free_acomp(acomp_ctx->acomp);
-                kfree(acomp_ctx->buffer);
-        }
-        mutex_unlock(&acomp_ctx->mutex);
-
-        return 0;
-}
 
 static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
 {
@@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
 
         for (;;) {
                 acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-                mutex_lock(&acomp_ctx->mutex);
-                if (likely(acomp_ctx->req))
-                        return acomp_ctx;
                 /*
-                 * It is possible that we were migrated to a different CPU after
-                 * getting the per-CPU ctx but before the mutex was acquired. If
-                 * the old CPU got offlined, zswap_cpu_comp_dead() could have
-                 * already freed ctx->req (among other things) and set it to
-                 * NULL. Just try again on the new CPU that we ended up on.
+                 * If the CPU onlining code successfully allocates acomp_ctx resources,
+                 * it sets acomp_ctx->__online to true. Until this happens, we have
+                 * two options:
+                 *
+                 * 1. Return NULL and fail all stores on this CPU.
+                 * 2. Retry, until onlining has finished allocating resources.
+                 *
+                 * In theory, option 1 could be more appropriate, because it
+                 * allows the calling procedure to decide how it wants to handle
+                 * reclaim racing with CPU hotplug. For instance, it might be Ok
+                 * for compress to return an error for the backing swap device
+                 * to store the folio. Decompress could wait until we get a
+                 * valid and locked mutex after onlining has completed. For now,
+                 * we go with option 2 because adding a do-while in
+                 * zswap_decompress() adds latency for software compressors.
+                 *
+                 * Once initialized, the resources will be de-allocated only
+                 * when the pool is destroyed. The acomp_ctx will hold on to the
+                 * resources through CPU offlining/onlining at any time until
+                 * the pool is destroyed.
+                 *
+                 * This prevents races/deadlocks between reclaim and CPU acomp_ctx
+                 * resource allocation that are a dependency for reclaim.
+                 * It further simplifies the interaction with CPU onlining and
+                 * offlining:
+                 *
+                 * - CPU onlining does not take the mutex. It only allocates
+                 *   resources and sets __online to true.
+                 * - CPU offlining acquires the mutex before setting
+                 *   __online to false. If reclaim has acquired the mutex,
+                 *   offlining will have to wait for reclaim to complete before
+                 *   hotunplug can proceed. Further, hotplug merely sets
+                 *   __online to false. It does not delete the acomp_ctx
+                 *   resources.
+                 *
+                 * Option 1 is better than potentially not exiting the earlier
+                 * for (;;) loop because the system is running low on memory
+                 * and/or CPUs are getting offlined for whatever reason. At
+                 * least failing this store will prevent data loss by failing
+                 * zswap_store(), and saving the data in the backing swap device.
                  */
+                mutex_lock(&acomp_ctx->mutex);
+                if (likely(acomp_ctx->__online))
+                        return acomp_ctx;
+                mutex_unlock(&acomp_ctx->mutex);
         }
 }
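The first hunk above also moves crypto_wait below buffer so that the request and buffer fields sit together; per the review discussion, that move belongs in the later batching patch. Under that patch the struct would plausibly grow array fields along these lines (a sketch; the reqs/buffers names are assumptions about the batching patch, not its actual definition):

struct crypto_acomp_ctx {
        struct crypto_acomp *acomp;
        struct acomp_req **reqs;        /* nr_reqs requests ... */
        u8 **buffers;                   /* ... each with its own dst buffer */
        u8 nr_reqs;
        struct crypto_wait wait;
        struct mutex mutex;
        bool is_sleepable;
};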
This patch modifies the acomp_ctx resources' lifetime to be from pool
creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
"struct crypto_acomp_ctx" which simplify a few things:

1) zswap_pool_create() will initialize all members of each percpu acomp_ctx
   to 0 or NULL and only then initialize the mutex.
2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
   to true, without locking the mutex.
3) CPU hotunplug will lock the mutex before setting __online to false. It
   will not delete any resources.
4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
   is true, and if so, return the mutex for use in zswap compress and
   decompress ops.
5) CPU onlining after offlining will simply check if either __online or
   nr_reqs are non-0, and return 0 if so, without re-allocating the
   resources.
6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to
   delete the acomp_ctx resources.
7) Common resource deletion code in case of zswap_cpu_comp_prepare()
   errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
   acomp_ctx_dealloc().

The CPU hot[un]plug callback functions are moved to "pool functions"
accordingly.

The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as
follows:

  IAA with batching:     64.8 KB
  Software compressors:   8.2 KB

I would appreciate code review comments on whether this memory cost is
acceptable, for the latency improvement that it provides due to a faster
reclaim restart after a CPU hotunplug-hotplug sequence - all that the
hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and
if so, set __online to true and return, and reclaim can proceed.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 182 insertions(+), 91 deletions(-)
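As a rough sanity check on the quoted figures (the thread does not break them down): the software-compressor cost is dominated by the per-CPU dst buffer of 2 * PAGE_SIZE = 8 KB on 4 KB-page systems, with the remaining ~0.2 KB plausibly covering the acomp transform and request; the 64.8 KB IAA figure is consistent with a batch of eight requests, each with its own two-page buffer (8 * 8 KB = 64 KB), plus similar per-request overheads.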