Message ID | 20231120083559.285174-1-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [V4,resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly | expand |
On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote: > group_cpus_evenly() could be part of storage driver's error handler, > such as nvme driver, when may happen during CPU hotplug, in which > storage queue has to drain its pending IOs because all CPUs associated > with the queue are offline and the queue is becoming inactive. And > handling IO needs error handler to provide forward progress. > > Then dead lock is caused: > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > handler is waiting for inflight IO > > 2) error handler is waiting for CPU hotplug lock > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > error handling can't provide forward progress. > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > in which two stage spreads are taken: 1) the 1st stage is over all present > CPUs; 2) the end stage is over all other CPUs. > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > remove the CPU hotplug lock by storing it into one local cache. This way > doesn't change correctness, because all CPUs are still covered. I'm not sure what is the intended merge path for this, but I can do lib/. Do you think that a -stable backport is needed? It sounds that way. If so, are we able to identify a suitable Fixes: target? That would predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into lib/").
On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote: > > > group_cpus_evenly() could be part of storage driver's error handler, > > such as nvme driver, when may happen during CPU hotplug, in which > > storage queue has to drain its pending IOs because all CPUs associated > > with the queue are offline and the queue is becoming inactive. And > > handling IO needs error handler to provide forward progress. > > > > Then dead lock is caused: > > > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > > handler is waiting for inflight IO > > > > 2) error handler is waiting for CPU hotplug lock > > > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > > error handling can't provide forward progress. > > > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > > in which two stage spreads are taken: 1) the 1st stage is over all present > > CPUs; 2) the end stage is over all other CPUs. > > > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > > remove the CPU hotplug lock by storing it into one local cache. This way > > doesn't change correctness, because all CPUs are still covered. > > I'm not sure what is the intended merge path for this, but I can do lib/. > > Do you think that a -stable backport is needed? It sounds that way. > > If so, are we able to identify a suitable Fixes: target? That would > predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into > lib/"). No? I think this predates 428e211641ed8 ("genirq/affinity: Replace deprecated CPU-hotplug functions." also. I'll slap a cc:stable on it and I'll let you and the -stable maintainers figure it out.
Hi Ming, On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote: > group_cpus_evenly() could be part of storage driver's error handler, > such as nvme driver, when may happen during CPU hotplug, in which > storage queue has to drain its pending IOs because all CPUs associated > with the queue are offline and the queue is becoming inactive. And > handling IO needs error handler to provide forward progress. > > Then dead lock is caused: > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > handler is waiting for inflight IO > > 2) error handler is waiting for CPU hotplug lock > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > error handling can't provide forward progress. > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > in which two stage spreads are taken: 1) the 1st stage is over all present > CPUs; 2) the end stage is over all other CPUs. > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > remove the CPU hotplug lock by storing it into one local cache. This way > doesn't change correctness, because all CPUs are still covered. > > Cc: Keith Busch <kbusch@kernel.org> > Cc: linux-nvme@lists.infradead.org > Cc: linux-block@vger.kernel.org > Reported-by: Yi Zhang <yi.zhang@redhat.com> > Reported-by: Guangwu Zhang <guazhang@redhat.com> > Tested-by: Guangwu Zhang <guazhang@redhat.com> > Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> > Reviewed-by: Jens Axboe <axboe@kernel.dk> > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > lib/group_cpus.c | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/lib/group_cpus.c b/lib/group_cpus.c > index aa3f6815bb12..ee272c4cefcc 100644 > --- a/lib/group_cpus.c > +++ b/lib/group_cpus.c > @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) > if (!masks) > goto fail_node_to_cpumask; > > - /* Stabilize the cpumasks */ > - cpus_read_lock(); > build_node_to_cpumask(node_to_cpumask); > > + /* > + * Make a local cache of 'cpu_present_mask', so the two stages > + * spread can observe consistent 'cpu_present_mask' without holding > + * cpu hotplug lock, then we can reduce deadlock risk with cpu > + * hotplug code. > + * > + * Here CPU hotplug may happen when reading `cpu_present_mask`, and > + * we can live with the case because it only affects that hotplug > + * CPU is handled in the 1st or 2nd stage, and either way is correct > + * from API user viewpoint since 2-stage spread is sort of > + * optimization. > + */ > + cpumask_copy(npresmsk, data_race(cpu_present_mask)); Now that you initialize the npresmsk explicitly, you can allocate it using alloc_cpumask_var(). The same actually holds for nmsk too, and even before this patch. Maybe fix it in a separate prepending patch? > + > /* grouping present CPUs first */ > ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, > - cpu_present_mask, nmsk, masks); > + npresmsk, nmsk, masks); > if (ret < 0) > goto fail_build_affinity; > nr_present = ret; > @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) > curgrp = 0; > else > curgrp = nr_present; > - cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); > + cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk); > ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, > npresmsk, nmsk, masks); The first thing the helper does is checking if nprepmask is empty. cpumask_andnot() returns false in that case. So, assuming that present cpumask in the previous call can't be empty, we can save few cycles if drop corresponding check in the helper and do like this: if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) { nr_others = 0; goto fail_build_affinity; } ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, npresmsk, nmsk, masks); Although, it's not related to this patch directly. So, if you fix zalloc_cpumask_var(), the patch looks good to me. Reviewed-by: Yury Norov <yury.norov@gmail.com> > if (ret >= 0) > nr_others = ret; > > fail_build_affinity: > - cpus_read_unlock(); > - > if (ret >= 0) > WARN_ON(nr_present + nr_others < numgrps); > > -- > 2.41.0 >
On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote: > Although, it's not related to this patch directly. So, if you fix > zalloc_cpumask_var(), the patch looks good to me. > > Reviewed-by: Yury Norov <yury.norov@gmail.com> Thanks. I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch, so I'd prefer not to have to redo it at this stage. Let's do these things (with which I agree) as a followup patch, please.
On Wed, Dec 6, 2023 at 4:54 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote: > > > Although, it's not related to this patch directly. So, if you fix > > zalloc_cpumask_var(), the patch looks good to me. > > > > Reviewed-by: Yury Norov <yury.norov@gmail.com> > > Thanks. > > I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch, > so I'd prefer not to have to redo it at this stage. > > Let's do these things (with which I agree) as a followup patch, please. Sure, I can do that, and append to my series.
On Wed, Dec 06, 2023 at 04:41:44PM -0800, Yury Norov wrote: > Hi Ming, > > On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote: > > group_cpus_evenly() could be part of storage driver's error handler, > > such as nvme driver, when may happen during CPU hotplug, in which > > storage queue has to drain its pending IOs because all CPUs associated > > with the queue are offline and the queue is becoming inactive. And > > handling IO needs error handler to provide forward progress. > > > > Then dead lock is caused: > > > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > > handler is waiting for inflight IO > > > > 2) error handler is waiting for CPU hotplug lock > > > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > > error handling can't provide forward progress. > > > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > > in which two stage spreads are taken: 1) the 1st stage is over all present > > CPUs; 2) the end stage is over all other CPUs. > > > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > > remove the CPU hotplug lock by storing it into one local cache. This way > > doesn't change correctness, because all CPUs are still covered. > > > > Cc: Keith Busch <kbusch@kernel.org> > > Cc: linux-nvme@lists.infradead.org > > Cc: linux-block@vger.kernel.org > > Reported-by: Yi Zhang <yi.zhang@redhat.com> > > Reported-by: Guangwu Zhang <guazhang@redhat.com> > > Tested-by: Guangwu Zhang <guazhang@redhat.com> > > Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> > > Reviewed-by: Jens Axboe <axboe@kernel.dk> > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > lib/group_cpus.c | 22 ++++++++++++++++------ > > 1 file changed, 16 insertions(+), 6 deletions(-) > > > > diff --git a/lib/group_cpus.c b/lib/group_cpus.c > > index aa3f6815bb12..ee272c4cefcc 100644 > > --- a/lib/group_cpus.c > > +++ b/lib/group_cpus.c > > @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) > > if (!masks) > > goto fail_node_to_cpumask; > > > > - /* Stabilize the cpumasks */ > > - cpus_read_lock(); > > build_node_to_cpumask(node_to_cpumask); > > > > + /* > > + * Make a local cache of 'cpu_present_mask', so the two stages > > + * spread can observe consistent 'cpu_present_mask' without holding > > + * cpu hotplug lock, then we can reduce deadlock risk with cpu > > + * hotplug code. > > + * > > + * Here CPU hotplug may happen when reading `cpu_present_mask`, and > > + * we can live with the case because it only affects that hotplug > > + * CPU is handled in the 1st or 2nd stage, and either way is correct > > + * from API user viewpoint since 2-stage spread is sort of > > + * optimization. > > + */ > > + cpumask_copy(npresmsk, data_race(cpu_present_mask)); > > Now that you initialize the npresmsk explicitly, you can allocate it > using alloc_cpumask_var(). Indeed, but this way is actually before this patch, and not related with this fix. > > The same actually holds for nmsk too, and even before this patch. Maybe > fix it in a separate prepending patch? Yeah, 'nmsk' is similar with 'npresmsk', and it is not fix, just one optimization. group_cpus_evenly() is only run in slow path, so this kind of micro-optimization is not urgent and should be done in standalone patch, and even we can live with it. > > > + > > /* grouping present CPUs first */ > > ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, > > - cpu_present_mask, nmsk, masks); > > + npresmsk, nmsk, masks); > > if (ret < 0) > > goto fail_build_affinity; > > nr_present = ret; > > @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) > > curgrp = 0; > > else > > curgrp = nr_present; > > - cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); > > + cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk); > > ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, > > npresmsk, nmsk, masks); > > The first thing the helper does is checking if nprepmask is empty. > cpumask_andnot() returns false in that case. So, assuming that present > cpumask in the previous call can't be empty, we can save few cycles if > drop corresponding check in the helper and do like this: > > if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) { > nr_others = 0; > goto fail_build_affinity; > } > > ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, > npresmsk, nmsk, masks); > > Although, it's not related to this patch directly. So, if you fix > zalloc_cpumask_var(), the patch looks good to me. I'd rather not make things complicated, as mentioned this API is only run in slow path. > > Reviewed-by: Yury Norov <yury.norov@gmail.com> Thanks for the review! Thanks, Ming
On Wed, Dec 06, 2023 at 03:12:46PM -0800, Andrew Morton wrote: > On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote: > > > > > group_cpus_evenly() could be part of storage driver's error handler, > > > such as nvme driver, when may happen during CPU hotplug, in which > > > storage queue has to drain its pending IOs because all CPUs associated > > > with the queue are offline and the queue is becoming inactive. And > > > handling IO needs error handler to provide forward progress. > > > > > > Then dead lock is caused: > > > > > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > > > handler is waiting for inflight IO > > > > > > 2) error handler is waiting for CPU hotplug lock > > > > > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > > > error handling can't provide forward progress. > > > > > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > > > in which two stage spreads are taken: 1) the 1st stage is over all present > > > CPUs; 2) the end stage is over all other CPUs. > > > > > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > > > remove the CPU hotplug lock by storing it into one local cache. This way > > > doesn't change correctness, because all CPUs are still covered. > > > > I'm not sure what is the intended merge path for this, but I can do lib/. > > > > Do you think that a -stable backport is needed? It sounds that way. > > > > If so, are we able to identify a suitable Fixes: target? That would > > predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into > > lib/"). > > No? I think this predates 428e211641ed8 ("genirq/affinity: Replace > deprecated CPU-hotplug functions." also. > > I'll slap a cc:stable on it and I'll let you and the -stable > maintainers figure it out. The issue should be introduced since 3ee0ce2a54df ("genirq/affinity: Use get/put_online_cpus around cpumask operations") in v4.8, but the logic has been changed a lot, so may take some effort to backport to longterm stables. The issue is reported from RH QA test, in which both cpu hotplug and nvme error recovering are triggered at same time, and easy to duplicate in QE lab, but may be hard to trigger in production environment. Thanks, Ming
diff --git a/lib/group_cpus.c b/lib/group_cpus.c index aa3f6815bb12..ee272c4cefcc 100644 --- a/lib/group_cpus.c +++ b/lib/group_cpus.c @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) if (!masks) goto fail_node_to_cpumask; - /* Stabilize the cpumasks */ - cpus_read_lock(); build_node_to_cpumask(node_to_cpumask); + /* + * Make a local cache of 'cpu_present_mask', so the two stages + * spread can observe consistent 'cpu_present_mask' without holding + * cpu hotplug lock, then we can reduce deadlock risk with cpu + * hotplug code. + * + * Here CPU hotplug may happen when reading `cpu_present_mask`, and + * we can live with the case because it only affects that hotplug + * CPU is handled in the 1st or 2nd stage, and either way is correct + * from API user viewpoint since 2-stage spread is sort of + * optimization. + */ + cpumask_copy(npresmsk, data_race(cpu_present_mask)); + /* grouping present CPUs first */ ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, - cpu_present_mask, nmsk, masks); + npresmsk, nmsk, masks); if (ret < 0) goto fail_build_affinity; nr_present = ret; @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) curgrp = 0; else curgrp = nr_present; - cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask); + cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk); ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, npresmsk, nmsk, masks); if (ret >= 0) nr_others = ret; fail_build_affinity: - cpus_read_unlock(); - if (ret >= 0) WARN_ON(nr_present + nr_others < numgrps);