diff mbox

[2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

Message ID 20180117123444.18393-3-ming.lei@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ming Lei Jan. 17, 2018, 12:34 p.m. UTC
We know this WARN_ON is harmless and the stack trace isn't useful too,
so convert it to printk(), and avoid to confuse people.

Also add comment about two releated races here.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Comments

Jens Axboe Jan. 17, 2018, 3:46 p.m. UTC | #1
On 1/17/18 5:34 AM, Ming Lei wrote:
> We know this WARN_ON is harmless and the stack trace isn't useful too,
> so convert it to printk(), and avoid to confuse people.

I disagree, it is useful to know the exact path it happened from,
in case it's a valid warning. It could be an inline run and
we screwed up the logic, or it could be from a workqueue and
the reason would be entirely different.
Christian Borntraeger Jan. 17, 2018, 4:05 p.m. UTC | #2
On 01/17/2018 04:46 PM, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
>> We know this WARN_ON is harmless and the stack trace isn't useful too,
>> so convert it to printk(), and avoid to confuse people.
> 
> I disagree, it is useful to know the exact path it happened from,
> in case it's a valid warning. It could be an inline run and
> we screwed up the logic, or it could be from a workqueue and
> the reason would be entirely different.

Then add a dump_stack or whatever, but WARN_ON does have fatal effects
for some setups. If it can happen then WARN_ON is just wrong.
Jens Axboe Jan. 17, 2018, 4:07 p.m. UTC | #3
On 1/17/18 9:05 AM, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 04:46 PM, Jens Axboe wrote:
>> On 1/17/18 5:34 AM, Ming Lei wrote:
>>> We know this WARN_ON is harmless and the stack trace isn't useful too,
>>> so convert it to printk(), and avoid to confuse people.
>>
>> I disagree, it is useful to know the exact path it happened from,
>> in case it's a valid warning. It could be an inline run and
>> we screwed up the logic, or it could be from a workqueue and
>> the reason would be entirely different.
> 
> Then add a dump_stack or whatever, but WARN_ON does have fatal effects
> for some setups. If it can happen then WARN_ON is just wrong.

dump_stack() is fine - and the intent is for it to never happen, it
would be nice to close those holes so we're only catching cases that
are due to bad code.
Ming Lei Jan. 17, 2018, 4:17 p.m. UTC | #4
On Wed, Jan 17, 2018 at 08:46:40AM -0700, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
> > We know this WARN_ON is harmless and the stack trace isn't useful too,
> > so convert it to printk(), and avoid to confuse people.
> 
> I disagree, it is useful to know the exact path it happened from,
> in case it's a valid warning. It could be an inline run and

inline run is only possible in theory, since the window is too
small.

> we screwed up the logic, or it could be from a workqueue and
> the reason would be entirely different.

OK, if you don't agree, I will just add the comment on the races
we found.
diff mbox

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dc4066d28323..6562360bf108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1391,9 +1391,25 @@  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	/*
 	 * We should be running this queue from one of the CPUs that
 	 * are mapped to it.
+	 *
+	 * There are at least two related races now between setting
+	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
+	 * __blk_mq_run_hw_queue():
+	 *
+	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
+	 *   but later it becomes online, then this warning is harmless
+	 *   at all
+	 *
+	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
+	 *   but later it becomes offline, then the warning can't be
+	 *   triggered, and we depend on blk-mq timeout handler to
+	 *   handle dispatched requests to this hctx
 	 */
-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-		cpu_online(hctx->next_cpu));
+	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+		cpu_online(hctx->next_cpu))
+		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
+			raw_smp_processor_id(),
+			cpumask_empty(hctx->cpumask) ? "inactive": "active");
 
 	/*
 	 * We can't run the queue inline with ints disabled. Ensure that