Message ID | 20250107222236.2715883-2-yosryahmed@google.com (mailing list archive)
---|---
State | New
Series | [v2,1/2] Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"
On Wed, Jan 8, 2025 at 11:22 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
> current CPU at the beginning of the operation is retrieved and used
> throughout. However, since neither preemption nor migration are disabled,
> it is possible that the operation continues on a different CPU.
>
> If the original CPU is hotunplugged while the acomp_ctx is still in use,
> we run into a UAF bug as the resources attached to the acomp_ctx are freed
> during hotunplug in zswap_cpu_comp_dead().
>
> The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
> crypto_acomp API for hardware acceleration") when the switch to the
> crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
> retrieved using get_cpu_ptr() which disables preemption and makes sure the
> CPU cannot go away from under us. Preemption cannot be disabled with the
> crypto_acomp API as a sleepable context is needed.
>
> Commit 8ba2f844f050 ("mm/zswap: change per-cpu mutex and buffer to
> per-acomp_ctx") increased the UAF surface area by making the per-CPU
> buffers dynamic, adding yet another resource that can be freed from under
> zswap compression/decompression by CPU hotunplug.
>
> This cannot be fixed by holding cpus_read_lock(), as it is possible for
> code already holding the lock to fall into reclaim and enter zswap
> (causing a deadlock). It also cannot be fixed by wrapping the usage of
> acomp_ctx in an SRCU critical section and using synchronize_srcu() in
> zswap_cpu_comp_dead(), because synchronize_srcu() is not allowed in
> CPU-hotplug notifiers (see
> Documentation/RCU/Design/Requirements/Requirements.rst).
>
> This can be fixed by refcounting the acomp_ctx, but it involves
> complexity in handling the race between the refcount dropping to zero in
> zswap_[de]compress() and the refcount being re-initialized when the CPU
> is onlined.
>
> Keep things simple for now and just disable migration while using the
> per-CPU acomp_ctx to block CPU hotunplug until the usage is over.
>
> Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
> Reported-by: Sam Sun <samsun1006219@gmail.com>
> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/
> ---
>  mm/zswap.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb236..ecd86153e8a32 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -880,6 +880,18 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>          return 0;
>  }
>
> +/* Remain on the CPU while using its acomp_ctx to stop it from going offline */
> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
> +{
> +        migrate_disable();

I'm not entirely sure, but I feel it is quite unsafe. Allowing sleep between
migrate_disable() and migrate_enable() would require the entire scheduler,
runqueue, waitqueue, and CPU hotplug mechanisms to be aware that a task is
pinned to a specific CPU.

If there is no sleep during this period, it seems to be only a runqueue
issue: CPU hotplug can wait for the task to be unpinned while it always
remains on a runqueue. However, if sleep is involved, the situation becomes
significantly more complex.

If static data doesn't consume much memory, it could be the simplest solution.

> +        return raw_cpu_ptr(acomp_ctx);
> +}
> +
> +static void acomp_ctx_put_cpu(void)
> +{
> +        migrate_enable();
> +}
> +
>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>                             struct zswap_pool *pool)
>  {
> @@ -893,8 +905,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          gfp_t gfp;
>          u8 *dst;
>
> -        acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> -
> +        acomp_ctx = acomp_ctx_get_cpu(pool->acomp_ctx);
>          mutex_lock(&acomp_ctx->mutex);
>
>          dst = acomp_ctx->buffer;
> @@ -950,6 +961,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>                  zswap_reject_alloc_fail++;
>
>          mutex_unlock(&acomp_ctx->mutex);
> +        acomp_ctx_put_cpu();
>          return comp_ret == 0 && alloc_ret == 0;
>  }
>
> @@ -960,7 +972,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>          struct crypto_acomp_ctx *acomp_ctx;
>          u8 *src;
>
> -        acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +        acomp_ctx = acomp_ctx_get_cpu(entry->pool->acomp_ctx);
>          mutex_lock(&acomp_ctx->mutex);
>
>          src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> @@ -990,6 +1002,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
>          if (src != acomp_ctx->buffer)
>                  zpool_unmap_handle(zpool, entry->handle);
> +        acomp_ctx_put_cpu();
>  }
>
>  /*********************************
> --
> 2.47.1.613.gc27f4b7a9f-goog

Thanks,
Barry
On Tue, Jan 7, 2025 at 2:47 PM Barry Song <baohua@kernel.org> wrote:
> On Wed, Jan 8, 2025 at 11:22 AM Yosry Ahmed <yosryahmed@google.com> wrote:
[..]
> > +/* Remain on the CPU while using its acomp_ctx to stop it from going offline */
> > +static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
> > +{
> > +        migrate_disable();
>
> I'm not entirely sure, but I feel it is quite unsafe. Allowing sleep between
> migrate_disable() and migrate_enable() would require the entire scheduler,
> runqueue, waitqueue, and CPU hotplug mechanisms to be aware that a task is
> pinned to a specific CPU.

My understanding is that sleeping is already allowed when migration is
disabled (unlike preemption). See delete_all_elements() in
kernel/bpf/hashtab.c for example, or __bpf_prog_enter_sleepable() in
kernel/bpf/trampoline.c. I am not sure exactly what you mean.

> If there is no sleep during this period, it seems to be only a runqueue
> issue: CPU hotplug can wait for the task to be unpinned while it always
> remains on a runqueue. However, if sleep is involved, the situation becomes
> significantly more complex.
>
> If static data doesn't consume much memory, it could be the simplest solution.

Do you mean allocating the buffers and requests for all possible CPUs
instead of allocating them dynamically in CPU hotplug notifiers? I am
not sure how much more memory this would be. Seems like it depends on
CONFIG options and the firmware.
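For reference, a minimal sketch of the distinction being debated, reusing the
acomp_ctx naming from the patch (illustrative only, not code from the series):
preempt_disable() opens an atomic section in which sleeping is forbidden,
while migrate_disable() only pins the task to its current CPU and leaves it
preemptible and free to sleep, for example on the acomp_ctx mutex.

        /* Atomic section: mutex_lock(), GFP_KERNEL allocations, or waiting
         * for the compressor here would trigger "scheduling while atomic". */
        preempt_disable();
        /* ... only non-sleeping work is allowed ... */
        preempt_enable();

        /* Migration-disabled section: the task may be preempted and may
         * sleep, but it is guaranteed to resume on this same CPU. */
        migrate_disable();
        acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
        mutex_lock(&acomp_ctx->mutex);          /* may sleep */
        /* ... compress/decompress using acomp_ctx->req and acomp_ctx->buffer ... */
        mutex_unlock(&acomp_ctx->mutex);
        migrate_enable();

This is also why the old code could use get_cpu_ptr() (which disables
preemption) with the synchronous crypto_comp API, while the sleepable
crypto_acomp API needs something weaker such as migrate_disable().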
On Wed, Jan 8, 2025 at 11:47 AM Barry Song <baohua@kernel.org> wrote:
> On Wed, Jan 8, 2025 at 11:22 AM Yosry Ahmed <yosryahmed@google.com> wrote:
[..]
> > +/* Remain on the CPU while using its acomp_ctx to stop it from going offline */
> > +static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
> > +{
> > +        migrate_disable();
>
> I'm not entirely sure, but I feel it is quite unsafe. Allowing sleep between
> migrate_disable() and migrate_enable() would require the entire scheduler,
> runqueue, waitqueue, and CPU hotplug mechanisms to be aware that a task is
> pinned to a specific CPU.
>
> If there is no sleep during this period, it seems to be only a runqueue
> issue: CPU hotplug can wait for the task to be unpinned while it always
> remains on a runqueue. However, if sleep is involved, the situation becomes
> significantly more complex.

After double-checking the scheduler's code, it seems fine.

When a task is scheduled out, __schedule() will set its allowable CPU via
migrate_disable_switch(rq, prev):

static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
{
        struct affinity_context ac = {
                .new_mask  = cpumask_of(rq->cpu),
                .flags     = SCA_MIGRATE_DISABLE,
        };

        if (likely(!p->migration_disabled))
                return;

        if (p->cpus_ptr != &p->cpus_mask)
                return;

        /*
         * Violates locking rules! see comment in __do_set_cpus_allowed().
         */
        __do_set_cpus_allowed(p, &ac);
}

When it is woken up, the previous CPU will be selected:

/*
 * The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
 */
static inline
int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
{
        lockdep_assert_held(&p->pi_lock);

        if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p))
                cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
        else
                cpu = cpumask_any(p->cpus_ptr);
        ...
}

Anyway, not an expert :-) Hopefully, other experts can provide their input
to confirm whether sleeping during migrate_disable() is all good.

> If static data doesn't consume much memory, it could be the simplest solution.
[..]

Thanks,
Barry
On Wed, Jan 8, 2025 at 12:26 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Tue, Jan 7, 2025 at 2:47 PM Barry Song <baohua@kernel.org> wrote:
[..]
> > I'm not entirely sure, but I feel it is quite unsafe. Allowing sleep between
> > migrate_disable() and migrate_enable() would require the entire scheduler,
> > runqueue, waitqueue, and CPU hotplug mechanisms to be aware that a task is
> > pinned to a specific CPU.
>
> My understanding is that sleeping is already allowed when migration is
> disabled (unlike preemption). See delete_all_elements() in
> kernel/bpf/hashtab.c for example, or __bpf_prog_enter_sleepable() in
> kernel/bpf/trampoline.c. I am not sure exactly what you mean.

Initially, I had some doubts about whether the scheduler handled this
correctly, but after reviewing the scheduler code again, it seems fine. While
a task is dequeued, its cpus_ptr is updated. When the task is woken up, it
uses its cpus_ptr instead of selecting a suitable CPU (e.g., an idle or
wake-affine CPU). Only after migrate_enable() drops this_rq()->nr_pinned back
to zero can CPU hotplug take down the CPU.

Therefore, I sent another email to correct myself:
https://lore.kernel.org/linux-mm/CAGsJ_4yb03yo6So-8wZwcy2fa-tURRrgJ+P-XhDL-RHgg1DvVA@mail.gmail.com/

> > If there is no sleep during this period, it seems to be only a runqueue
> > issue: CPU hotplug can wait for the task to be unpinned while it always
> > remains on a runqueue. However, if sleep is involved, the situation
> > becomes significantly more complex.
> >
> > If static data doesn't consume much memory, it could be the simplest solution.
>
> Do you mean allocating the buffers and requests for all possible CPUs
> instead of allocating them dynamically in CPU hotplug notifiers? I am
> not sure how much more memory this would be. Seems like it depends on
> CONFIG options and the firmware.

Correct, the firmware/devicetree will help identify all possible CPUs. Even
if CONFIG is large, only those possible CPUs will be allocated memory.
However, if migrate_disable() and migrate_enable() work as expected, we
don't need to consider this option.

Thanks,
Barry
On Wed, Jan 8, 2025 at 12:38 PM Barry Song <baohua@kernel.org> wrote:
> On Wed, Jan 8, 2025 at 12:26 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[..]
> > Do you mean allocating the buffers and requests for all possible CPUs
> > instead of allocating them dynamically in CPU hotplug notifiers? I am
> > not sure how much more memory this would be. Seems like it depends on
> > CONFIG options and the firmware.
>
> Correct, the firmware/devicetree will help identify all possible CPUs. Even
> if CONFIG is large, only those possible CPUs will be allocated memory.
> However, if migrate_disable() and migrate_enable() work as expected, we
> don't need to consider this option.

By the way, there could be a slight performance regression if the previous
CPU is occupied with other tasks, as we would have to wait for preemption.
However, when migrate_disable() is not in effect, the woken-up task can
quickly select a neighboring idle CPU and continue execution.

When the entire system is not busy, there might be no difference. However,
when the system is packed with tasks, it could make a difference. Hopefully,
the regression is small enough to be negligible :-)

Thanks,
Barry
On Tue, Jan 7, 2025 at 3:56 PM Barry Song <baohua@kernel.org> wrote:
> On Wed, Jan 8, 2025 at 12:38 PM Barry Song <baohua@kernel.org> wrote:
[..]
> By the way, there could be a slight performance regression if the previous
> CPU is occupied with other tasks, as we would have to wait for preemption.
> However, when migrate_disable() is not in effect, the woken-up task can
> quickly select a neighboring idle CPU and continue execution.
>
> When the entire system is not busy, there might be no difference. However,
> when the system is packed with tasks, it could make a difference. Hopefully,
> the regression is small enough to be negligible :-)

Hopefully the robot will let us know if it isn't :)
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, January 7, 2025 2:23 PM
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; Nhat Pham <nphamcs@gmail.com>;
> Chengming Zhou <chengming.zhou@linux.dev>; Vitaly Wool <vitalywool@gmail.com>;
> Barry Song <baohua@kernel.org>; Sam Sun <samsun1006219@gmail.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Yosry Ahmed <yosryahmed@google.com>;
> stable@vger.kernel.org
> Subject: [PATCH v2 2/2] mm: zswap: disable migration while using per-CPU
> acomp_ctx
>
[..]
> @@ -950,6 +961,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>                  zswap_reject_alloc_fail++;
>
>          mutex_unlock(&acomp_ctx->mutex);
> +        acomp_ctx_put_cpu();

I have observed that disabling/enabling preemption in this portion of the
code can result in scheduling while atomic errors. Is the same possible
while disabling/enabling migration?

Couple of possibly related thoughts:

1) I have been thinking some more about the purpose of this per-cpu
acomp_ctx mutex. It appears that the main benefit is that it causes task
blocked errors (which are useful for detecting problems) if any computes in
the code section it covers take a long duration. Other than that, it does
not protect a resource, nor prevent cpu offlining from deleting its
containing structure.

2) Seems like the overall problem is applicable to any per-cpu data that is
being used by a process, vis-a-vis cpu hotunplug. Could it be that a
solution in cpu hotunplug can safeguard more generally? Really not sure
about the specifics of any solution, but it occurred to me that this may not
be a problem unique to zswap.

Thanks,
Kanchana
On Tue, Jan 7, 2025 at 4:02 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
[..]
> I have observed that disabling/enabling preemption in this portion of the
> code can result in scheduling while atomic errors. Is the same possible
> while disabling/enabling migration?

IIUC no, migration disabled is not an atomic context.

> Couple of possibly related thoughts:
>
> 1) I have been thinking some more about the purpose of this per-cpu
> acomp_ctx mutex. It appears that the main benefit is that it causes task
> blocked errors (which are useful for detecting problems) if any computes in
> the code section it covers take a long duration. Other than that, it does
> not protect a resource, nor prevent cpu offlining from deleting its
> containing structure.

It does protect resources. Consider this case:
- Process A runs on CPU #1, gets the acomp_ctx, and locks it, then is
  migrated to CPU #2.
- Process B runs on CPU #1, gets the same acomp_ctx, and tries to lock it,
  then waits for process A. Without the lock, they would be using the same
  acomp_ctx (and its request and buffer) at the same time.

It is also possible that process A simply gets preempted while running on
CPU #1 by another task that also tries to use the acomp_ctx. The mutex also
protects against this case.

> 2) Seems like the overall problem is applicable to any per-cpu data that is
> being used by a process, vis-a-vis cpu hotunplug. Could it be that a
> solution in cpu hotunplug can safeguard more generally? Really not sure
> about the specifics of any solution, but it occurred to me that this may not
> be a problem unique to zswap.

Not really. Static per-CPU data and data allocated with alloc_percpu()
should be available for all possible CPUs, regardless of whether they are
online or not, so CPU hotunplug is not relevant. It is relevant here because
we allocate the memory dynamically for online CPUs only, to save memory. I
am not sure how important this is, as I am not aware what the difference
between the number of online and possible CPUs can be in real life
deployments.
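For reference, a rough sketch of the static-allocation alternative mentioned
above, reusing zswap's structure names; the helper name and the two-page
buffer size are assumptions, error unwinding is omitted, and this is
illustrative rather than a proposed implementation:

static int zswap_pool_alloc_all_possible_cpus(struct zswap_pool *pool)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct crypto_acomp_ctx *ctx = per_cpu_ptr(pool->acomp_ctx, cpu);

                ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
                                           cpu_to_node(cpu));
                if (!ctx->buffer)
                        return -ENOMEM; /* a real version would unwind earlier CPUs */
                mutex_init(&ctx->mutex);
                /* the acomp transform and request would be allocated here as well */
        }
        return 0;
}

Since every possible CPU would own its resources for the lifetime of the
pool, zswap_cpu_comp_dead() would have nothing to free and hotunplug could no
longer pull the buffers out from under a running [de]compression, at the cost
of memory for CPUs that may never come online.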
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, January 7, 2025 4:12 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Johannes Weiner
> <hannes@cmpxchg.org>; Nhat Pham <nphamcs@gmail.com>; Chengming Zhou
> <chengming.zhou@linux.dev>; Vitaly Wool <vitalywool@gmail.com>; Barry Song
> <baohua@kernel.org>; Sam Sun <samsun1006219@gmail.com>; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; stable@vger.kernel.org
> Subject: Re: [PATCH v2 2/2] mm: zswap: disable migration while using per-CPU
> acomp_ctx
>
> On Tue, Jan 7, 2025 at 4:02 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
[..]
> > I have observed that disabling/enabling preemption in this portion of the
> > code can result in scheduling while atomic errors. Is the same possible
> > while disabling/enabling migration?
>
> IIUC no, migration disabled is not an atomic context.

Ok, thanks.

[..]
> It does protect resources. Consider this case:
> - Process A runs on CPU #1, gets the acomp_ctx, and locks it, then is
>   migrated to CPU #2.
> - Process B runs on CPU #1, gets the same acomp_ctx, and tries to lock it,
>   then waits for process A. Without the lock, they would be using the same
>   acomp_ctx (and its request and buffer) at the same time.
>
> It is also possible that process A simply gets preempted while running on
> CPU #1 by another task that also tries to use the acomp_ctx. The mutex also
> protects against this case.

Got it, thanks for the explanations. It seems with this patch, the mutex
would be redundant in the first example. Would this also be true of the
second example where process A gets preempted? If not, is it worth figuring
out a solution that works for both migration and preemption?

[..]
> Not really. Static per-CPU data and data allocated with alloc_percpu()
> should be available for all possible CPUs, regardless of whether they are
> online or not, so CPU hotunplug is not relevant. It is relevant here because
> we allocate the memory dynamically for online CPUs only, to save memory. I
> am not sure how important this is, as I am not aware what the difference
> between the number of online and possible CPUs can be in real life
> deployments.

Thought I would clarify what I meant: the problem of per-cpu data that gets
allocated dynamically using cpu hotplug and deleted even while in use by cpu
hotunplug may not be unique to zswap. If so, I was wondering if a more
generic solution in the cpu hotunplug code would be feasible/worth exploring.

Thanks,
Kanchana
[..]
> > It does protect resources. Consider this case:
> > - Process A runs on CPU #1, gets the acomp_ctx, and locks it, then is
> >   migrated to CPU #2.
> > - Process B runs on CPU #1, gets the same acomp_ctx, and tries to lock it,
> >   then waits for process A. Without the lock, they would be using the same
> >   acomp_ctx (and its request and buffer) at the same time.
> >
> > It is also possible that process A simply gets preempted while running on
> > CPU #1 by another task that also tries to use the acomp_ctx. The mutex
> > also protects against this case.
>
> Got it, thanks for the explanations. It seems with this patch, the mutex
> would be redundant in the first example. Would this also be true of the
> second example where process A gets preempted?

I think the mutex is still required in the second example. Migration being
disabled does not prevent other processes from running on the same CPU and
attempting to use the same acomp_ctx.

> If not, is it worth figuring out a solution that works for both migration
> and preemption?

Not sure exactly what you mean here. I suspect you mean have a single
mechanism to protect against concurrent usage and CPU hotunplug rather than
disabling migration and having a mutex. Yeah, that would be ideal, but
probably not for a hotfix.

> Thought I would clarify what I meant: the problem of per-cpu data that gets
> allocated dynamically using cpu hotplug and deleted even while in use by cpu
> hotunplug may not be unique to zswap. If so, I was wondering if a more
> generic solution in the cpu hotunplug code would be feasible/worth exploring.

I didn't look too closely. If there's something out there, or if something
can be easily developed, I'd be open to updating the zswap code accordingly,
but I don't have time to look into it tbh, and it's too late in the release
cycle to get creative imo.
On Tue, Jan 7, 2025 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > > Couple of possibly related thoughts: > > > > > > > > 1) I have been thinking some more about the purpose of this per-cpu > > > acomp_ctx > > > > mutex. It appears that the main benefit is it causes task blocked errors > > > (which are > > > > useful to detect problems) if any computes in the code section it covers > > > take a > > > > long duration. Other than that, it does not protect a resource, nor > > > prevent > > > > cpu offlining from deleting its containing structure. > > > > > > It does protect resources. Consider this case: > > > - Process A runs on CPU #1, gets the acomp_ctx, and locks it, then is > > > migrated to CPU #2. > > > - Process B runs on CPU #1, gets the same acomp_ctx, and tries to lock > > > it then waits for process A. Without the lock they would be using the > > > same lock. > > > > > > It is also possible that process A simply gets preempted while running > > > on CPU #1 by another task that also tries to use the acomp_ctx. The > > > mutex also protects against this case. > > > > Got it, thanks for the explanations. It seems with this patch, the mutex > > would be redundant in the first example. Would this also be true of the > > second example where process A gets preempted? > > I think the mutex is still required in the second example. Migration > being disabled does not prevent other processes from running on the > same CPU and attempting to use the same acomp_ctx. > > > If not, is it worth > > figuring out a solution that works for both migration and preemption? > > Not sure exactly what you mean here. I suspect you mean have a single > mechanism to protect against concurrent usage and CPU hotunplug rather > than disabling migration and having a mutex. Yeah that would be ideal, > but probably not for a hotfix. Actually, using the mutex to protect against CPU hotunplug is not too complicated. The following diff is one way to do it (lightly tested). Johannes, Nhat, any preferences between this patch (disabling migration) and the following diff? 
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb236..4d6817c679a54 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
+	mutex_lock(&acomp_ctx->mutex);
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
 		if (!IS_ERR_OR_NULL(acomp_ctx->req))
 			acomp_request_free(acomp_ctx->req);
+		acomp_ctx->req = NULL;
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
 		kfree(acomp_ctx->buffer);
 	}
+	mutex_unlock(&acomp_ctx->mutex);
 
 	return 0;
 }
 
+static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked(
+		struct crypto_acomp_ctx __percpu *acomp_ctx)
+{
+	struct crypto_acomp_ctx *ctx;
+
+	for (;;) {
+		ctx = raw_cpu_ptr(acomp_ctx);
+		mutex_lock(&ctx->mutex);
+		if (likely(ctx->req))
+			return ctx;
+		/* Raced with zswap_cpu_comp_dead() on CPU hotunplug */
+		mutex_unlock(&ctx->mutex);
+	}
+}
+
+static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *ctx)
+{
+	mutex_unlock(&ctx->mutex);
+}
+
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -893,10 +916,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 
-	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-
-	mutex_lock(&acomp_ctx->mutex);
-
+	acomp_ctx = acomp_ctx_get_cpu_locked(pool->acomp_ctx);
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -949,7 +969,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	mutex_unlock(&acomp_ctx->mutex);
+	acomp_ctx_put_unlock(acomp_ctx);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -960,9 +980,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	struct crypto_acomp_ctx *acomp_ctx;
 	u8 *src;
 
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(&acomp_ctx->mutex);
-
+	acomp_ctx = acomp_ctx_get_cpu_locked(entry->pool->acomp_ctx);
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
 	/*
On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > Actually, using the mutex to protect against CPU hotunplug is not too > complicated. The following diff is one way to do it (lightly tested). > Johannes, Nhat, any preferences between this patch (disabling > migration) and the following diff? I mean if this works, this over migration diasbling any day? :) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f6316b66fb236..4d6817c679a54 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > struct hlist_node *node) > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > + mutex_lock(&acomp_ctx->mutex); > if (!IS_ERR_OR_NULL(acomp_ctx)) { > if (!IS_ERR_OR_NULL(acomp_ctx->req)) > acomp_request_free(acomp_ctx->req); > + acomp_ctx->req = NULL; > if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > crypto_free_acomp(acomp_ctx->acomp); > kfree(acomp_ctx->buffer); > } > + mutex_unlock(&acomp_ctx->mutex); > > return 0; > } > > +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > + struct crypto_acomp_ctx __percpu *acomp_ctx) > +{ > + struct crypto_acomp_ctx *ctx; > + > + for (;;) { > + ctx = raw_cpu_ptr(acomp_ctx); > + mutex_lock(&ctx->mutex); I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this cpu-local data (including the mutex) from being invalidated under us while we're sleeping and waiting for the mutex? If it is somehow protected, then yeah this seems quite elegant :) > + if (likely(ctx->req)) > + return ctx; > + /* Raced with zswap_cpu_comp_dead() on CPU hotunplug */ > + mutex_unlock(&ctx->mutex); > + } > +} > + > +static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *ctx) > +{ > + mutex_unlock(&ctx->mutex); > +} > + > static bool zswap_compress(struct page *page, struct zswap_entry *entry, > struct zswap_pool *pool) > { > @@ -893,10 +916,7 @@ static bool zswap_compress(struct page *page, > struct zswap_entry *entry, > gfp_t gfp; > u8 *dst; > > - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > - > - mutex_lock(&acomp_ctx->mutex); > - > + acomp_ctx = acomp_ctx_get_cpu_locked(pool->acomp_ctx); > dst = acomp_ctx->buffer; > sg_init_table(&input, 1); > sg_set_page(&input, page, PAGE_SIZE, 0); > @@ -949,7 +969,7 @@ static bool zswap_compress(struct page *page, > struct zswap_entry *entry, > else if (alloc_ret) > zswap_reject_alloc_fail++; > > - mutex_unlock(&acomp_ctx->mutex); > + acomp_ctx_put_unlock(acomp_ctx); > return comp_ret == 0 && alloc_ret == 0; > } > > @@ -960,9 +980,7 @@ static void zswap_decompress(struct zswap_entry > *entry, struct folio *folio) > struct crypto_acomp_ctx *acomp_ctx; > u8 *src; > > - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > - mutex_lock(&acomp_ctx->mutex); > - > + acomp_ctx = acomp_ctx_get_cpu_locked(entry->pool->acomp_ctx); > src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); > /*
On 2025/1/8 12:46, Nhat Pham wrote: > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> >> Actually, using the mutex to protect against CPU hotunplug is not too >> complicated. The following diff is one way to do it (lightly tested). >> Johannes, Nhat, any preferences between this patch (disabling >> migration) and the following diff? > > I mean if this works, this over migration diasbling any day? :) > >> >> diff --git a/mm/zswap.c b/mm/zswap.c >> index f6316b66fb236..4d6817c679a54 100644 >> --- a/mm/zswap.c >> +++ b/mm/zswap.c >> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, >> struct hlist_node *node) >> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); >> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); >> >> + mutex_lock(&acomp_ctx->mutex); >> if (!IS_ERR_OR_NULL(acomp_ctx)) { >> if (!IS_ERR_OR_NULL(acomp_ctx->req)) >> acomp_request_free(acomp_ctx->req); >> + acomp_ctx->req = NULL; >> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) >> crypto_free_acomp(acomp_ctx->acomp); >> kfree(acomp_ctx->buffer); >> } >> + mutex_unlock(&acomp_ctx->mutex); >> >> return 0; >> } >> >> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( >> + struct crypto_acomp_ctx __percpu *acomp_ctx) >> +{ >> + struct crypto_acomp_ctx *ctx; >> + >> + for (;;) { >> + ctx = raw_cpu_ptr(acomp_ctx); >> + mutex_lock(&ctx->mutex); > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > cpu-local data (including the mutex) from being invalidated under us > while we're sleeping and waiting for the mutex? Yeah, it's not safe, we can only use this_cpu_ptr(), which will disable preempt (so cpu offline can't kick in), and get refcount of ctx. Since we can't mutex_lock in the preempt disabled section. Thanks. > > If it is somehow protected, then yeah this seems quite elegant :) > >> + if (likely(ctx->req)) >> + return ctx; >> + /* Raced with zswap_cpu_comp_dead() on CPU hotunplug */ >> + mutex_unlock(&ctx->mutex); >> + } >> +} >> + >> +static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *ctx) >> +{ >> + mutex_unlock(&ctx->mutex); >> +} >> + >> static bool zswap_compress(struct page *page, struct zswap_entry *entry, >> struct zswap_pool *pool) >> { >> @@ -893,10 +916,7 @@ static bool zswap_compress(struct page *page, >> struct zswap_entry *entry, >> gfp_t gfp; >> u8 *dst; >> >> - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); >> - >> - mutex_lock(&acomp_ctx->mutex); >> - >> + acomp_ctx = acomp_ctx_get_cpu_locked(pool->acomp_ctx); >> dst = acomp_ctx->buffer; >> sg_init_table(&input, 1); >> sg_set_page(&input, page, PAGE_SIZE, 0); >> @@ -949,7 +969,7 @@ static bool zswap_compress(struct page *page, >> struct zswap_entry *entry, >> else if (alloc_ret) >> zswap_reject_alloc_fail++; >> >> - mutex_unlock(&acomp_ctx->mutex); >> + acomp_ctx_put_unlock(acomp_ctx); >> return comp_ret == 0 && alloc_ret == 0; >> } >> >> @@ -960,9 +980,7 @@ static void zswap_decompress(struct zswap_entry >> *entry, struct folio *folio) >> struct crypto_acomp_ctx *acomp_ctx; >> u8 *src; >> >> - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); >> - mutex_lock(&acomp_ctx->mutex); >> - >> + acomp_ctx = acomp_ctx_get_cpu_locked(entry->pool->acomp_ctx); >> src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); >> /*
On Wed, Jan 8, 2025 at 5:46 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > Actually, using the mutex to protect against CPU hotunplug is not too > > complicated. The following diff is one way to do it (lightly tested). > > Johannes, Nhat, any preferences between this patch (disabling > > migration) and the following diff? > > I mean if this works, this over migration diasbling any day? :) > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index f6316b66fb236..4d6817c679a54 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > > struct hlist_node *node) > > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > > > + mutex_lock(&acomp_ctx->mutex); > > if (!IS_ERR_OR_NULL(acomp_ctx)) { > > if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > acomp_request_free(acomp_ctx->req); > > + acomp_ctx->req = NULL; > > if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > crypto_free_acomp(acomp_ctx->acomp); > > kfree(acomp_ctx->buffer); > > } > > + mutex_unlock(&acomp_ctx->mutex); > > > > return 0; > > } > > > > +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > > + struct crypto_acomp_ctx __percpu *acomp_ctx) > > +{ > > + struct crypto_acomp_ctx *ctx; > > + > > + for (;;) { > > + ctx = raw_cpu_ptr(acomp_ctx); > > + mutex_lock(&ctx->mutex); > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > cpu-local data (including the mutex) from being invalidated under us > while we're sleeping and waiting for the mutex? > > If it is somehow protected, then yeah this seems quite elegant :) thought about this again. Could it be the following? bool cpus_is_read_locked(void) { return percpu_is_read_locked(&cpu_hotplug_lock); } in zswap: bool locked = cpus_is_read_locked(); if (!locked) cpus_read_lock(); .... // do our job if (!locked) cpus_read_unlock(); This seems to resolve all three problems: 1. if our context has held read lock, we won't hold it again; 2. if other contexts are holding write lock, we wait for the completion of cpuhotplug by acquiring read lock 3. if our context hasn't held a read lock, we hold it. 
> > > + if (likely(ctx->req)) > > + return ctx; > > + /* Raced with zswap_cpu_comp_dead() on CPU hotunplug */ > > + mutex_unlock(&ctx->mutex); > > + } > > +} > > + > > +static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *ctx) > > +{ > > + mutex_unlock(&ctx->mutex); > > +} > > + > > static bool zswap_compress(struct page *page, struct zswap_entry *entry, > > struct zswap_pool *pool) > > { > > @@ -893,10 +916,7 @@ static bool zswap_compress(struct page *page, > > struct zswap_entry *entry, > > gfp_t gfp; > > u8 *dst; > > > > - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > - > > - mutex_lock(&acomp_ctx->mutex); > > - > > + acomp_ctx = acomp_ctx_get_cpu_locked(pool->acomp_ctx); > > dst = acomp_ctx->buffer; > > sg_init_table(&input, 1); > > sg_set_page(&input, page, PAGE_SIZE, 0); > > @@ -949,7 +969,7 @@ static bool zswap_compress(struct page *page, > > struct zswap_entry *entry, > > else if (alloc_ret) > > zswap_reject_alloc_fail++; > > > > - mutex_unlock(&acomp_ctx->mutex); > > + acomp_ctx_put_unlock(acomp_ctx); > > return comp_ret == 0 && alloc_ret == 0; > > } > > > > @@ -960,9 +980,7 @@ static void zswap_decompress(struct zswap_entry > > *entry, struct folio *folio) > > struct crypto_acomp_ctx *acomp_ctx; > > u8 *src; > > > > - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > - mutex_lock(&acomp_ctx->mutex); > > - > > + acomp_ctx = acomp_ctx_get_cpu_locked(entry->pool->acomp_ctx); > > src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); > > /* Thanks Barry
On Wed, Jan 8, 2025 at 6:06 PM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Jan 8, 2025 at 5:46 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > Actually, using the mutex to protect against CPU hotunplug is not too > > > complicated. The following diff is one way to do it (lightly tested). > > > Johannes, Nhat, any preferences between this patch (disabling > > > migration) and the following diff? > > > > I mean if this works, this over migration diasbling any day? :) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index f6316b66fb236..4d6817c679a54 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > > > struct hlist_node *node) > > > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > > > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > > > > > + mutex_lock(&acomp_ctx->mutex); > > > if (!IS_ERR_OR_NULL(acomp_ctx)) { > > > if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > > acomp_request_free(acomp_ctx->req); > > > + acomp_ctx->req = NULL; > > > if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > > crypto_free_acomp(acomp_ctx->acomp); > > > kfree(acomp_ctx->buffer); > > > } > > > + mutex_unlock(&acomp_ctx->mutex); > > > > > > return 0; > > > } > > > > > > +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > > > + struct crypto_acomp_ctx __percpu *acomp_ctx) > > > +{ > > > + struct crypto_acomp_ctx *ctx; > > > + > > > + for (;;) { > > > + ctx = raw_cpu_ptr(acomp_ctx); > > > + mutex_lock(&ctx->mutex); > > > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > > cpu-local data (including the mutex) from being invalidated under us > > while we're sleeping and waiting for the mutex? > > > > If it is somehow protected, then yeah this seems quite elegant :) > > thought about this again. Could it be the following? > > bool cpus_is_read_locked(void) > { > return percpu_is_read_locked(&cpu_hotplug_lock); > } > > in zswap: > > bool locked = cpus_is_read_locked(); > > if (!locked) > cpus_read_lock(); > > .... // do our job > > if (!locked) > cpus_read_unlock(); > > This seems to resolve all three problems: > 1. if our context has held read lock, we won't hold it again; > 2. if other contexts are holding write lock, we wait for the > completion of cpuhotplug > by acquiring read lock > 3. if our context hasn't held a read lock, we hold it. > sorry for the noise. This won't work because percpu_is_read_locked() is a sum: bool percpu_is_read_locked(struct percpu_rw_semaphore *sem) { return per_cpu_sum(*sem->read_count) != 0 && !atomic_read(&sem->block); } EXPORT_SYMBOL_GPL(percpu_is_read_locked); If other CPUs hold the read lock, it will also return true. However, once those CPUs release the lock, our data might still be released by CPU hotplug. 
This approach would require something like percpu_is_read_locked_by_me() :-( > > > > > + if (likely(ctx->req)) > > > + return ctx; > > > + /* Raced with zswap_cpu_comp_dead() on CPU hotunplug */ > > > + mutex_unlock(&ctx->mutex); > > > + } > > > +} > > > + > > > +static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *ctx) > > > +{ > > > + mutex_unlock(&ctx->mutex); > > > +} > > > + > > > static bool zswap_compress(struct page *page, struct zswap_entry *entry, > > > struct zswap_pool *pool) > > > { > > > @@ -893,10 +916,7 @@ static bool zswap_compress(struct page *page, > > > struct zswap_entry *entry, > > > gfp_t gfp; > > > u8 *dst; > > > > > > - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > - > > > - mutex_lock(&acomp_ctx->mutex); > > > - > > > + acomp_ctx = acomp_ctx_get_cpu_locked(pool->acomp_ctx); > > > dst = acomp_ctx->buffer; > > > sg_init_table(&input, 1); > > > sg_set_page(&input, page, PAGE_SIZE, 0); > > > @@ -949,7 +969,7 @@ static bool zswap_compress(struct page *page, > > > struct zswap_entry *entry, > > > else if (alloc_ret) > > > zswap_reject_alloc_fail++; > > > > > > - mutex_unlock(&acomp_ctx->mutex); > > > + acomp_ctx_put_unlock(acomp_ctx); > > > return comp_ret == 0 && alloc_ret == 0; > > > } > > > > > > @@ -960,9 +980,7 @@ static void zswap_decompress(struct zswap_entry > > > *entry, struct folio *folio) > > > struct crypto_acomp_ctx *acomp_ctx; > > > u8 *src; > > > > > > - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > > - mutex_lock(&acomp_ctx->mutex); > > > - > > > + acomp_ctx = acomp_ctx_get_cpu_locked(entry->pool->acomp_ctx); > > > src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); > > > /* > > Thanks > Barry
On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: > > On 2025/1/8 12:46, Nhat Pham wrote: > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >> > >> > >> Actually, using the mutex to protect against CPU hotunplug is not too > >> complicated. The following diff is one way to do it (lightly tested). > >> Johannes, Nhat, any preferences between this patch (disabling > >> migration) and the following diff? > > > > I mean if this works, this over migration diasbling any day? :) > > > >> > >> diff --git a/mm/zswap.c b/mm/zswap.c > >> index f6316b66fb236..4d6817c679a54 100644 > >> --- a/mm/zswap.c > >> +++ b/mm/zswap.c > >> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > >> struct hlist_node *node) > >> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > >> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > >> > >> + mutex_lock(&acomp_ctx->mutex); > >> if (!IS_ERR_OR_NULL(acomp_ctx)) { > >> if (!IS_ERR_OR_NULL(acomp_ctx->req)) > >> acomp_request_free(acomp_ctx->req); > >> + acomp_ctx->req = NULL; > >> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > >> crypto_free_acomp(acomp_ctx->acomp); > >> kfree(acomp_ctx->buffer); > >> } > >> + mutex_unlock(&acomp_ctx->mutex); > >> > >> return 0; > >> } > >> > >> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > >> + struct crypto_acomp_ctx __percpu *acomp_ctx) > >> +{ > >> + struct crypto_acomp_ctx *ctx; > >> + > >> + for (;;) { > >> + ctx = raw_cpu_ptr(acomp_ctx); > >> + mutex_lock(&ctx->mutex); > > > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > > cpu-local data (including the mutex) from being invalidated under us > > while we're sleeping and waiting for the mutex? Please correct me if I am wrong, but my understanding is that memory allocated with alloc_percpu() is allocated for each *possible* CPU, and does not go away when CPUs are offlined. We allocate the per-CPU crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so they should not go away with CPU offlining. OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, and crypto_acomp_ctx.buffer only for online CPUs through the CPU hotplug notifiers (i.e. zswap_cpu_comp_prepare() and zswap_cpu_comp_dead()). These are the resources that can go away with CPU offlining, and what we need to protect about. The approach I am taking here is to hold the per-CPU mutex in the CPU offlining code while we free these resources, and set crypto_acomp_ctx.req to NULL. In acomp_ctx_get_cpu_locked(), we hold the mutex of the current CPU, and check if crypto_acomp_ctx.req is NULL. If it is NULL, then the CPU is offlined between raw_cpu_ptr() and acquiring the mutex, and we retry on the new CPU that we end up on. If it is not NULL, then we are guaranteed that the resources will not be freed by CPU offlining until acomp_ctx_put_unlock() is called and the mutex is unlocked. > > Yeah, it's not safe, we can only use this_cpu_ptr(), which will disable > preempt (so cpu offline can't kick in), and get refcount of ctx. Since > we can't mutex_lock in the preempt disabled section. My understanding is that the purpose of this_cpu_ptr() disabling preemption is to prevent multiple CPUs accessing per-CPU data of a single CPU concurrently. In the zswap case, we don't really need that because we use the mutex to protect against it (and we cannot disable preemption anyway).
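A rough sketch of the interleaving that the ctx->req NULL check and retry loop guard against (illustration only, not taken from the diff itself):

  zswap_[de]compress() on CPU #1           CPU #1 going offline
  ------------------------------           ------------------------------
  ctx = raw_cpu_ptr(acomp_ctx)
  /* preempted or migrated away */         zswap_cpu_comp_dead():
                                             mutex_lock(&ctx->mutex)
                                             free req, acomp, buffer
                                             ctx->req = NULL
                                             mutex_unlock(&ctx->mutex)
  mutex_lock(&ctx->mutex)
  ctx->req == NULL -> unlock and retry,
  this time on the CPU we now run on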
On Tue, Jan 7, 2025 at 9:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: > > > > On 2025/1/8 12:46, Nhat Pham wrote: > > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > >> > > >> > > >> Actually, using the mutex to protect against CPU hotunplug is not too > > >> complicated. The following diff is one way to do it (lightly tested). > > >> Johannes, Nhat, any preferences between this patch (disabling > > >> migration) and the following diff? > > > > > > I mean if this works, this over migration diasbling any day? :) > > > > > >> > > >> diff --git a/mm/zswap.c b/mm/zswap.c > > >> index f6316b66fb236..4d6817c679a54 100644 > > >> --- a/mm/zswap.c > > >> +++ b/mm/zswap.c > > >> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > > >> struct hlist_node *node) > > >> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > > >> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > >> > > >> + mutex_lock(&acomp_ctx->mutex); > > >> if (!IS_ERR_OR_NULL(acomp_ctx)) { > > >> if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > >> acomp_request_free(acomp_ctx->req); > > >> + acomp_ctx->req = NULL; > > >> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > >> crypto_free_acomp(acomp_ctx->acomp); > > >> kfree(acomp_ctx->buffer); > > >> } > > >> + mutex_unlock(&acomp_ctx->mutex); > > >> > > >> return 0; > > >> } > > >> > > >> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > > >> + struct crypto_acomp_ctx __percpu *acomp_ctx) > > >> +{ > > >> + struct crypto_acomp_ctx *ctx; > > >> + > > >> + for (;;) { > > >> + ctx = raw_cpu_ptr(acomp_ctx); > > >> + mutex_lock(&ctx->mutex); > > > > > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > > > cpu-local data (including the mutex) from being invalidated under us > > > while we're sleeping and waiting for the mutex? > > Please correct me if I am wrong, but my understanding is that memory > allocated with alloc_percpu() is allocated for each *possible* CPU, > and does not go away when CPUs are offlined. We allocate the per-CPU > crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so > they should not go away with CPU offlining. > > OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, > and crypto_acomp_ctx.buffer only for online CPUs through the CPU > hotplug notifiers (i.e. zswap_cpu_comp_prepare() and > zswap_cpu_comp_dead()). These are the resources that can go away with > CPU offlining, and what we need to protect about. ..and now that I explain all of this I am wondering if the complexity is warranted in the first place. It goes back all the way to the first zswap commit, so I can't tell the justification for it. I am not sure if they are setups that have significantly different numbers of online and possible CPUs. Maybe we should just bite the bullet and just allocate everything with alloc_percpu() and rip out the hotplug callbacks completely.
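For illustration, a rough sketch of that direction (hypothetical code, not part of any posted patch; the helper name is made up and error unwinding is omitted) -- allocate the request, transform and buffer once per possible CPU at pool creation, so the hotplug callbacks have nothing left to tear down:

static int zswap_pool_alloc_acomp_ctxs(struct zswap_pool *pool)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct crypto_acomp_ctx *ctx = per_cpu_ptr(pool->acomp_ctx, cpu);

		mutex_init(&ctx->mutex);

		/* Roughly what zswap_cpu_comp_prepare() allocates today, done up front */
		ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
						     cpu_to_node(cpu));
		if (IS_ERR(ctx->acomp))
			return PTR_ERR(ctx->acomp);

		ctx->req = acomp_request_alloc(ctx->acomp);
		ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
					   cpu_to_node(cpu));
		if (!ctx->req || !ctx->buffer)
			return -ENOMEM;
	}
	return 0;
}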
On 2025/1/8 13:34, Yosry Ahmed wrote: > On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: >> >> On 2025/1/8 12:46, Nhat Pham wrote: >>> On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>>> >>>> >>>> Actually, using the mutex to protect against CPU hotunplug is not too >>>> complicated. The following diff is one way to do it (lightly tested). >>>> Johannes, Nhat, any preferences between this patch (disabling >>>> migration) and the following diff? >>> >>> I mean if this works, this over migration diasbling any day? :) >>> >>>> >>>> diff --git a/mm/zswap.c b/mm/zswap.c >>>> index f6316b66fb236..4d6817c679a54 100644 >>>> --- a/mm/zswap.c >>>> +++ b/mm/zswap.c >>>> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, >>>> struct hlist_node *node) >>>> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); >>>> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); >>>> >>>> + mutex_lock(&acomp_ctx->mutex); >>>> if (!IS_ERR_OR_NULL(acomp_ctx)) { >>>> if (!IS_ERR_OR_NULL(acomp_ctx->req)) >>>> acomp_request_free(acomp_ctx->req); >>>> + acomp_ctx->req = NULL; >>>> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) >>>> crypto_free_acomp(acomp_ctx->acomp); >>>> kfree(acomp_ctx->buffer); >>>> } >>>> + mutex_unlock(&acomp_ctx->mutex); >>>> >>>> return 0; >>>> } >>>> >>>> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( >>>> + struct crypto_acomp_ctx __percpu *acomp_ctx) >>>> +{ >>>> + struct crypto_acomp_ctx *ctx; >>>> + >>>> + for (;;) { >>>> + ctx = raw_cpu_ptr(acomp_ctx); >>>> + mutex_lock(&ctx->mutex); >>> >>> I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this >>> cpu-local data (including the mutex) from being invalidated under us >>> while we're sleeping and waiting for the mutex? > > Please correct me if I am wrong, but my understanding is that memory > allocated with alloc_percpu() is allocated for each *possible* CPU, > and does not go away when CPUs are offlined. We allocate the per-CPU > crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so > they should not go away with CPU offlining. Ah, right! I missed that only buffer and req is dynamically allocated by the cpu online callback. Then your fix is safe to me, thanks for your explanation! > > OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, > and crypto_acomp_ctx.buffer only for online CPUs through the CPU > hotplug notifiers (i.e. zswap_cpu_comp_prepare() and > zswap_cpu_comp_dead()). These are the resources that can go away with > CPU offlining, and what we need to protect about. > > The approach I am taking here is to hold the per-CPU mutex in the CPU > offlining code while we free these resources, and set > crypto_acomp_ctx.req to NULL. In acomp_ctx_get_cpu_locked(), we hold > the mutex of the current CPU, and check if crypto_acomp_ctx.req is > NULL. > > If it is NULL, then the CPU is offlined between raw_cpu_ptr() and > acquiring the mutex, and we retry on the new CPU that we end up on. If > it is not NULL, then we are guaranteed that the resources will not be > freed by CPU offlining until acomp_ctx_put_unlock() is called and the > mutex is unlocked. > >> >> Yeah, it's not safe, we can only use this_cpu_ptr(), which will disable >> preempt (so cpu offline can't kick in), and get refcount of ctx. Since >> we can't mutex_lock in the preempt disabled section. 
> My understanding is that the purpose of this_cpu_ptr() disabling
> preemption is to prevent multiple CPUs accessing per-CPU data of a
> single CPU concurrently. In the zswap case, we don't really need that
> because we use the mutex to protect against it (and we cannot disable
> preemption anyway).

Yes, your fix is correct; disabling preemption is not needed in this case.

Thanks.
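As a minimal illustration of that last point (not from any of the patches): the old get_cpu_ptr() pattern cannot simply be restored here, because mutex_lock() and the acomp API may sleep, and sleeping is not allowed while preemption is disabled:

	acomp_ctx = get_cpu_ptr(pool->acomp_ctx);	/* disables preemption */
	mutex_lock(&acomp_ctx->mutex);			/* may sleep: not allowed here */
	...
	put_cpu_ptr(pool->acomp_ctx);			/* re-enables preemption */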
On Wed, Jan 8, 2025 at 6:56 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jan 7, 2025 at 9:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: > > > > > > On 2025/1/8 12:46, Nhat Pham wrote: > > > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > >> > > > >> > > > >> Actually, using the mutex to protect against CPU hotunplug is not too > > > >> complicated. The following diff is one way to do it (lightly tested). > > > >> Johannes, Nhat, any preferences between this patch (disabling > > > >> migration) and the following diff? > > > > > > > > I mean if this works, this over migration diasbling any day? :) > > > > > > > >> > > > >> diff --git a/mm/zswap.c b/mm/zswap.c > > > >> index f6316b66fb236..4d6817c679a54 100644 > > > >> --- a/mm/zswap.c > > > >> +++ b/mm/zswap.c > > > >> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > > > >> struct hlist_node *node) > > > >> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > > > >> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > > >> > > > >> + mutex_lock(&acomp_ctx->mutex); > > > >> if (!IS_ERR_OR_NULL(acomp_ctx)) { > > > >> if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > > >> acomp_request_free(acomp_ctx->req); > > > >> + acomp_ctx->req = NULL; > > > >> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > > >> crypto_free_acomp(acomp_ctx->acomp); > > > >> kfree(acomp_ctx->buffer); > > > >> } > > > >> + mutex_unlock(&acomp_ctx->mutex); > > > >> > > > >> return 0; > > > >> } > > > >> > > > >> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > > > >> + struct crypto_acomp_ctx __percpu *acomp_ctx) > > > >> +{ > > > >> + struct crypto_acomp_ctx *ctx; > > > >> + > > > >> + for (;;) { > > > >> + ctx = raw_cpu_ptr(acomp_ctx); > > > >> + mutex_lock(&ctx->mutex); > > > > > > > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > > > > cpu-local data (including the mutex) from being invalidated under us > > > > while we're sleeping and waiting for the mutex? > > > > Please correct me if I am wrong, but my understanding is that memory > > allocated with alloc_percpu() is allocated for each *possible* CPU, > > and does not go away when CPUs are offlined. We allocate the per-CPU > > crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so > > they should not go away with CPU offlining. > > > > OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, > > and crypto_acomp_ctx.buffer only for online CPUs through the CPU > > hotplug notifiers (i.e. zswap_cpu_comp_prepare() and > > zswap_cpu_comp_dead()). These are the resources that can go away with > > CPU offlining, and what we need to protect about. > > ..and now that I explain all of this I am wondering if the complexity > is warranted in the first place. It goes back all the way to the first > zswap commit, so I can't tell the justification for it. Personally, I would vote for the added complexity, as it avoids the potential negative side effects of reverting the scheduler's optimization for selecting a suitable CPU for a woken-up task and I have been looking for an approach to resolve it by cpuhotplug lock (obviously quite hacky and more complex than using this mutex) for (;;) in acomp_ctx_get_cpu_locked() is a bit tricky but correct and really interesting, maybe it needs some comments. 
> > I am not sure if they are setups that have significantly different > numbers of online and possible CPUs. Maybe we should just bite the > bullet and just allocate everything with alloc_percpu() and rip out > the hotplug callbacks completely. Thanks Barry
On Tue, Jan 7, 2025 at 11:57 PM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Jan 8, 2025 at 6:56 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Jan 7, 2025 at 9:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: > > > > > > > > On 2025/1/8 12:46, Nhat Pham wrote: > > > > > On Wed, Jan 8, 2025 at 9:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > >> > > > > >> > > > > >> Actually, using the mutex to protect against CPU hotunplug is not too > > > > >> complicated. The following diff is one way to do it (lightly tested). > > > > >> Johannes, Nhat, any preferences between this patch (disabling > > > > >> migration) and the following diff? > > > > > > > > > > I mean if this works, this over migration diasbling any day? :) > > > > > > > > > >> > > > > >> diff --git a/mm/zswap.c b/mm/zswap.c > > > > >> index f6316b66fb236..4d6817c679a54 100644 > > > > >> --- a/mm/zswap.c > > > > >> +++ b/mm/zswap.c > > > > >> @@ -869,17 +869,40 @@ static int zswap_cpu_comp_dead(unsigned int cpu, > > > > >> struct hlist_node *node) > > > > >> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > > > > >> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > > > > >> > > > > >> + mutex_lock(&acomp_ctx->mutex); > > > > >> if (!IS_ERR_OR_NULL(acomp_ctx)) { > > > > >> if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > > > >> acomp_request_free(acomp_ctx->req); > > > > >> + acomp_ctx->req = NULL; > > > > >> if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > > > >> crypto_free_acomp(acomp_ctx->acomp); > > > > >> kfree(acomp_ctx->buffer); > > > > >> } > > > > >> + mutex_unlock(&acomp_ctx->mutex); > > > > >> > > > > >> return 0; > > > > >> } > > > > >> > > > > >> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_locked( > > > > >> + struct crypto_acomp_ctx __percpu *acomp_ctx) > > > > >> +{ > > > > >> + struct crypto_acomp_ctx *ctx; > > > > >> + > > > > >> + for (;;) { > > > > >> + ctx = raw_cpu_ptr(acomp_ctx); > > > > >> + mutex_lock(&ctx->mutex); > > > > > > > > > > I'm a bit confused. IIUC, ctx is per-cpu right? What's protecting this > > > > > cpu-local data (including the mutex) from being invalidated under us > > > > > while we're sleeping and waiting for the mutex? > > > > > > Please correct me if I am wrong, but my understanding is that memory > > > allocated with alloc_percpu() is allocated for each *possible* CPU, > > > and does not go away when CPUs are offlined. We allocate the per-CPU > > > crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so > > > they should not go away with CPU offlining. > > > > > > OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, > > > and crypto_acomp_ctx.buffer only for online CPUs through the CPU > > > hotplug notifiers (i.e. zswap_cpu_comp_prepare() and > > > zswap_cpu_comp_dead()). These are the resources that can go away with > > > CPU offlining, and what we need to protect about. > > > > ..and now that I explain all of this I am wondering if the complexity > > is warranted in the first place. It goes back all the way to the first > > zswap commit, so I can't tell the justification for it. 
> > Personally, I would vote for the added complexity, as it avoids the > potential negative side effects of reverting the scheduler's optimization > for selecting a suitable CPU for a woken-up task and I have been looking > for an approach to resolve it by cpuhotplug lock (obviously quite hacky > and more complex than using this mutex) Oh, I was not talking about my proposed diff, but the existing logic that allocates the requests and buffers in the hotplug callbacks instead of just using alloc_percpu() to allocate them once for each possible CPU. I was wondering if there are actual setups where this matters and a significant amount of memory is being saved. Otherwise we should simplify things and just rip out the hotplug callbacks. Anyway, for now I will cleanup and send the mutex diff as a new patch. > > for (;;) in acomp_ctx_get_cpu_locked() is a bit tricky but correct and > really interesting, maybe it needs some comments. > > > > > I am not sure if they are setups that have significantly different > > numbers of online and possible CPUs. Maybe we should just bite the > > bullet and just allocate everything with alloc_percpu() and rip out > > the hotplug callbacks completely. > > Thanks > Barry
On Wed, Jan 8, 2025 at 12:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jan 7, 2025 at 9:00 PM Chengming Zhou <chengming.zhou@linux.dev> wrote: > Please correct me if I am wrong, but my understanding is that memory > allocated with alloc_percpu() is allocated for each *possible* CPU, > and does not go away when CPUs are offlined. We allocate the per-CPU > crypto_acomp_ctx structs with alloc_percpu() (including the mutex), so > they should not go away with CPU offlining. > > OTOH, we allocate the crypto_acomp_ctx.acompx, crypto_acomp_ctx.req, > and crypto_acomp_ctx.buffer only for online CPUs through the CPU > hotplug notifiers (i.e. zswap_cpu_comp_prepare() and > zswap_cpu_comp_dead()). These are the resources that can go away with > CPU offlining, and what we need to protect about. > > The approach I am taking here is to hold the per-CPU mutex in the CPU > offlining code while we free these resources, and set > crypto_acomp_ctx.req to NULL. In acomp_ctx_get_cpu_locked(), we hold > the mutex of the current CPU, and check if crypto_acomp_ctx.req is > NULL. > > If it is NULL, then the CPU is offlined between raw_cpu_ptr() and > acquiring the mutex, and we retry on the new CPU that we end up on. If > it is not NULL, then we are guaranteed that the resources will not be > freed by CPU offlining until acomp_ctx_put_unlock() is called and the > mutex is unlocked. > Ah you're right, that makes a lot of sense now :)
On Wed, Jan 8, 2025 at 10:36 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > Oh, I was not talking about my proposed diff, but the existing logic > that allocates the requests and buffers in the hotplug callbacks > instead of just using alloc_percpu() to allocate them once for each > possible CPU. I was wondering if there are actual setups where this > matters and a significant amount of memory is being saved. Otherwise > we should simplify things and just rip out the hotplug callbacks. My vote is for ripping the hotplug callbacks (eventually) :) In addition to the discrepancy in the number of possible and online CPUs, we also need a relatively smaller memory size for the discrepancy to matter, no? Systems with hundreds of CPUs (hopefully) should have hundreds of GBs worth of memory available (if not more). Anyhow, we can just go with the diff you sent for now (and for past kernels). Seems simple enough, and wouldn't get in the way of the eventual hotplug logic removal (if you decide to pursue it).
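To put a rough, hypothetical number on it (assuming the dominant per-CPU cost is the two-page scratch buffer, i.e. 8 KiB with 4 KiB pages, plus a comparatively small request and transform): a machine with 512 possible CPUs of which only 64 are online would pre-allocate about 448 * 8 KiB ~= 3.5 MiB that never gets used -- noise on a machine of that size.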
On Wed, Jan 8, 2025 at 7:50 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Wed, Jan 8, 2025 at 10:36 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > Oh, I was not talking about my proposed diff, but the existing logic > > that allocates the requests and buffers in the hotplug callbacks > > instead of just using alloc_percpu() to allocate them once for each > > possible CPU. I was wondering if there are actual setups where this > > matters and a significant amount of memory is being saved. Otherwise > > we should simplify things and just rip out the hotplug callbacks. > > My vote is for ripping the hotplug callbacks (eventually) :) In > addition to the discrepancy in the number of possible and online CPUs, > we also need a relatively smaller memory size for the discrepancy to > matter, no? Systems with hundreds of CPUs (hopefully) should have > hundreds of GBs worth of memory available (if not more). > > Anyhow, we can just go with the diff you sent for now (and for past > kernels). Seems simple enough, and wouldn't get in the way of the > eventual hotplug logic removal (if you decide to pursue it). Yeah I sent that out just now: https://lore.kernel.org/lkml/20250108161529.3193825-1-yosryahmed@google.com/ I will look into sending an RFC to rip out the hotplug callbacks later.
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb236..ecd86153e8a32 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -880,6 +880,18 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+/* Remain on the CPU while using its acomp_ctx to stop it from going offline */
+static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
+{
+	migrate_disable();
+	return raw_cpu_ptr(acomp_ctx);
+}
+
+static void acomp_ctx_put_cpu(void)
+{
+	migrate_enable();
+}
+
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -893,8 +905,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 
-	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-
+	acomp_ctx = acomp_ctx_get_cpu(pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
 
 	dst = acomp_ctx->buffer;
@@ -950,6 +961,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 		zswap_reject_alloc_fail++;
 
 	mutex_unlock(&acomp_ctx->mutex);
+	acomp_ctx_put_cpu();
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -960,7 +972,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	struct crypto_acomp_ctx *acomp_ctx;
 	u8 *src;
 
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	acomp_ctx = acomp_ctx_get_cpu(entry->pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
 
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
@@ -990,6 +1002,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 
 	if (src != acomp_ctx->buffer)
 		zpool_unmap_handle(zpool, entry->handle);
+	acomp_ctx_put_cpu();
 }
 
 /*********************************
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead(). The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") when the switch to the crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was retrieved using get_cpu_ptr() which disables preemption and makes sure the CPU cannot go away from under us. Preemption cannot be disabled with the crypto_acomp API as a sleepable context is needed. Commit 8ba2f844f050 ("mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx") increased the UAF surface area by making the per-CPU buffers dynamic, adding yet another resource that can be freed from under zswap compression/decompression by CPU hotunplug. This cannot be fixed by holding cpus_read_lock(), as it is possible for code already holding the lock to fall into reclaim and enter zswap (causing a deadlock). It also cannot be fixed by wrapping the usage of acomp_ctx in an SRCU critical section and using synchronize_srcu() in zswap_cpu_comp_dead(), because synchronize_srcu() is not allowed in CPU-hotplug notifiers (see Documentation/RCU/Design/Requirements/Requirements.rst). This can be fixed by refcounting the acomp_ctx, but it involves complexity in handling the race between the refcount dropping to zero in zswap_[de]compress() and the refcount being re-initialized when the CPU is onlined. Keep things simple for now and just disable migration while using the per-CPU acomp_ctx to block CPU hotunplug until the usage is over. Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") Cc: <stable@vger.kernel.org> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/ Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/ --- mm/zswap.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-)