[2/2] blk-mq: add flag for drivers wanting blocking ->queue_rq()

Message ID 1474555980-2787-3-git-send-email-axboe@fb.com (mailing list archive)
State New, archived

Commit Message

Jens Axboe Sept. 22, 2016, 2:53 p.m. UTC
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that the handler
is always called from a safe context.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c         | 2 +-
 include/linux/blk-mq.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
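
For reference, a driver opts into the new flag by setting it in its tag set
before calling blk_mq_alloc_tag_set(). A minimal sketch with a hypothetical
driver (the mydrv names are illustrative, not from this patch):

	/* Hypothetical driver setup: opt into a blocking ->queue_rq(). */
	static int mydrv_init_tags(struct mydrv_device *dev)
	{
		memset(&dev->tag_set, 0, sizeof(dev->tag_set));
		dev->tag_set.ops = &mydrv_mq_ops;
		dev->tag_set.nr_hw_queues = 1;
		dev->tag_set.queue_depth = 128;
		dev->tag_set.numa_node = NUMA_NO_NODE;
		dev->tag_set.flags = BLK_MQ_F_BLOCKING;	/* ->queue_rq() may sleep */
		return blk_mq_alloc_tag_set(&dev->tag_set);
	}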

Comments

Christoph Hellwig Sept. 22, 2016, 2:59 p.m. UTC | #1
On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
> ->queue_rq() handler. For that case, blk-mq ensures that the handler
> is always called from a safe context.

First, can you provide a more useful definition of blocking?  Lots of current
drivers will already block in ->queue_rq, mostly to allocate memory.

Second, we looked at something similar a few times in the past, most notably
when converting loop and rbd, and came to the conclusion that
performance sucks when we only have that single per-hctx work struct
for a busy device.  I think Ming has done a lot of the benchmarking,
so I've added him to Cc.
Jens Axboe Sept. 22, 2016, 3:03 p.m. UTC | #2
On 09/22/2016 08:59 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
>> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
>> ->queue_rq() handler. For that case, blk-mq ensures that the handler
>> is always called from a safe context.
>
> First, can you provide a more useful definition of blocking?  Lots of current
> drivers will already block in ->queue_rq, mostly to allocate memory.

Having to grab a mutex, for instance. We invoke ->queue_rq() with
preemption disabled, so I'd hope that would not be the case... Which
drivers block in ->queue_rq()?
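
The constraint comes from the direct-issue path in blk_mq_run_hw_queue(),
which brackets the run with get_cpu()/put_cpu(); schematically:

	int cpu = get_cpu();			/* disables preemption */
	if (cpumask_test_cpu(cpu, hctx->cpumask))
		__blk_mq_run_hw_queue(hctx);	/* invokes ->queue_rq() */
	put_cpu();				/* re-enables preemption */

Anything that can sleep between get_cpu() and put_cpu(), a mutex_lock() for
example, triggers a "sleeping function called from invalid context" splat.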

> Second, we looked at something similar a few times in the past, most notably
> when converting loop and rbd, and came to the conclusion that
> performance sucks when we only have that single per-hctx work struct
> for a busy device.  I think Ming has done a lot of the benchmarking,
> so I've added him to Cc.

Loop was another case that was on my radar to get rid of the queue_work
it currently has to do. Josef is currently testing the nbd driver using
this approach, so we should get some numbers there too. If we have to,
we can always bump up the concurrency to mimic more of the behavior of
having multiple workers running on the hardware queue. I'd prefer to
still do that in blk-mq, instead of having drivers reinventing their own
work offload functionality.
Josef Bacik Sept. 22, 2016, 3:04 p.m. UTC | #3
On 09/22/2016 10:59 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
>> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
>> ->queue_rq() handler. For that case, blk-mq ensures that the handler
>> is always called from a safe context.
>
> First, can you provide a more useful definition of blocking?  Lots of current
> drivers will already block in ->queue_rq, mostly to allocate memory.
>

NBD needs to take a mutex before it sends.  This patch was born out of a
kbuild error I got caused by scheduling while atomic; this was the backtrace:
[   17.698207] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[   17.700903] in_atomic(): 1, irqs_disabled(): 0, pid: 1050, name: mount
[   17.702350] 1 lock held by mount/1050:
[   17.703395]  #0:  (&type->s_umount_key#20/1){+.+.+.}, at: [<ffffffff8122627b>] sget_userns+0x2f7/0x4b0
[   17.706064] CPU: 0 PID: 1050 Comm: mount Not tainted 4.8.0-rc4-00007-gfd8383fd #1
[   17.708149] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   17.710523]  0000000000000000 ffff880034dd39f0 ffffffff8178e23d ffff8800350e4d00
[   17.712773]  000000000000041a ffff880034dd3a18 ffffffff81108c39 ffffffff839f1476
[   17.714978]  000000000000026c 0000000000000000 ffff880034dd3a40 ffffffff81108cb5
[   17.717224] Call Trace:
[   17.718130]  [<ffffffff8178e23d>] dump_stack+0x82/0xb8
[   17.719474]  [<ffffffff81108c39>] ___might_sleep+0x1bd/0x1c4
[   17.720813]  [<ffffffff81108cb5>] __might_sleep+0x75/0x7c
[   17.722100]  [<ffffffff82ef53e2>] mutex_lock_nested+0x3e/0x396
[   17.723449]  [<ffffffff8114aace>] ? mod_timer+0x10/0x12
[   17.724705]  [<ffffffff81770002>] ? blk_add_timer+0xe0/0xe5
[   17.726088]  [<ffffffff81c2a35c>] nbd_queue_rq+0x7b/0x14b
[   17.727367]  [<ffffffff81c2a35c>] ? nbd_queue_rq+0x7b/0x14b
[   17.728653]  [<ffffffff81772b66>] __blk_mq_run_hw_queue+0x1c7/0x2c8
[   17.730118]  [<ffffffff81772942>] blk_mq_run_hw_queue+0x5e/0xbb
[   17.731454]  [<ffffffff81773d53>] blk_sq_make_request+0x1a1/0x1ba
[   17.732806]  [<ffffffff81768e24>] generic_make_request+0xbd/0x160
[   17.734153]  [<ffffffff81768fbd>] submit_bio+0xf6/0xff
[   17.735365]  [<ffffffff81252806>] submit_bh_wbc+0x136/0x143
[   17.736719]  [<ffffffff81252c00>] submit_bh+0x10/0x12
[   17.737888]  [<ffffffff81252c52>] __bread_gfp+0x50/0x6f
[   17.739212]  [<ffffffff812f290a>] ext4_fill_super+0x1f4/0x27ec
[   17.740492]  [<ffffffff81798a59>] ? vsnprintf+0x22d/0x3b7
[   17.741685]  [<ffffffff812265e3>] mount_bdev+0x144/0x197
[   17.742928]  [<ffffffff812f2716>] ? ext4_calculate_overhead+0x2bd/0x2bd
[   17.744275]  [<ffffffff812ede93>] ext4_mount+0x15/0x17
[   17.745406]  [<ffffffff81227049>] mount_fs+0x67/0x131
[   17.746518]  [<ffffffff8123ee2f>] vfs_kern_mount+0x6b/0xdb
[   17.747675]  [<ffffffff81241759>] do_mount+0x708/0x97d
[   17.748833]  [<ffffffff811e87f0>] ? __might_fault+0x7e/0x84
[   17.750074]  [<ffffffff811da511>] ? strndup_user+0x3f/0x53
[   17.751198]  [<ffffffff81268eb6>] compat_SyS_mount+0x185/0x1ae
[   17.752433]  [<ffffffff81003b6f>] do_int80_syscall_32+0x5c/0x6b
[   17.753592]  [<ffffffff82efa6d8>] entry_INT80_compat+0x38/0x50
[   17.754952] block nbd11: Attempted send on closed socket
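
Schematically, the offending pattern is a sleeping lock on the send path of
->queue_rq(); a sketch of the shape (not the exact nbd code):

	static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
	{
		struct nbd_device *nbd = hctx->queue->queuedata;

		blk_mq_start_request(bd->rq);

		mutex_lock(&nbd->tx_lock);	/* may sleep: invalid with preemption off */
		/* ... send the request over the socket ... */
		mutex_unlock(&nbd->tx_lock);

		return BLK_MQ_RQ_QUEUE_OK;
	}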

> Second, we looked at something similar a few times in the past, most notably
> when converting loop and rbd, and came to the conclusion that
> performance sucks when we only have that single per-hctx work struct
> for a busy device.  I think Ming has done a lot of the benchmarking,
> so I've added him to Cc.
>

Yeah, this looks to be about a 5% hit on my microbenchmarks for NBD, but nobody
accuses NBD of being particularly fast either, so I wasn't too worried about it.
If we want to say that we can block in ->queue_rq, that would make me happy too.
Thanks,

Josef
Christoph Hellwig Sept. 22, 2016, 3:12 p.m. UTC | #4
On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
> Having to grab a mutex, for instance. We invoke ->queue_rq() with
> preemption disabled, so I'd hope that would not be the case... Which
> drivers block in ->queue_rq()?

I thought I had converted a lot of them to GFP_NOIO instead of GFP_ATOMIC
allocations, but I can't find any evidence of that.  Maybe it was just
my imagination, or an unsubmitted patch series.  Sorry for the
confusion.

> Loop was another case that was on my radar to get rid of the queue_work
> it currently has to do. Josef is currently testing the nbd driver using
> this approach, so we should get some numbers there too. If we have to,
> we can always bump up the concurrency to mimic more of the behavior of
> having multiple workers running on the hardware queue. I'd prefer to
> still do that in blk-mq, instead of having drivers reinventing their own
> work offload functionality.

There should be a lot of numbers in the list archives from the time
that Ming did the loop conversion, as I've been trying to steer him
that way, and he actually implemented and benchmarked it.

We can't just increase the concurrency, as a single work_struct item
can't be queued multiple times even on a high-concurrency workqueue.
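
For reference, that limit is inherent to the workqueue API: queuing a
work_struct that is still pending is a no-op, so one item yields at most
one execution in flight, however the workqueue is configured. A sketch:

	static void run_queue_fn(struct work_struct *work)
	{
		/* would run the hardware queue here */
	}
	static DECLARE_WORK(run_work, run_queue_fn);

	static void kick_twice(void)
	{
		queue_work(system_unbound_wq, &run_work);	/* true: queued */
		queue_work(system_unbound_wq, &run_work);	/* false if still pending */
	}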
Jens Axboe Sept. 22, 2016, 3:17 p.m. UTC | #5
On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>> preemption disabled, so I'd hope that would not be the case... Which
>> drivers block in ->queue_rq()?
>
> I thought I had converted a lot of them to GFP_NOIO instead of GFP_ATOMIC
> allocations, but I can't find any evidence of that.  Maybe it was just
> my imagination, or an unsubmitted patch series.  Sorry for the
> confusion.

OK, that makes more sense. Pretty sure we would have had complaints!

>> Loop was another case that was on my radar to get rid of the queue_work
>> it currently has to do. Josef is currently testing the nbd driver using
>> this approach, so we should get some numbers there too. If we have to,
>> we can always bump up the concurrency to mimic more of the behavior of
>> having multiple workers running on the hardware queue. I'd prefer to
>> still do that in blk-mq, instead of having drivers reinventing their own
>> work offload functionality.
>
> There should be a lot of numbers in the list archives from the time
> that Ming did the loop conversion, as I've been trying to steer him
> that way, and he actually implemented and benchmarked it.
>
> We can't just increase the concurrency, as a single work_struct item
> can't be queued multiple times even on a high-concurrency workqueue.

But we could have more work items, if we had to. Even if loop isn't a
drop-in replacement for this simpler approach, I think it'll work well
enough for nbd. The 5% number from Josef is comparing to not having any
offload at all; I suspect the number from just converting to queue_work
in the driver would be similar.
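
What "more work items" could look like is sketched below; a hctx only has
a single work item for running the queue today, so the array here is
purely hypothetical:

	#define HCTX_NR_WORKS	4

	/* Hypothetical: several work items per hctx, so an unbound
	 * workqueue can have several runs of the queue in flight. */
	struct hctx_works {
		struct work_struct work[HCTX_NR_WORKS];
		atomic_t next;
	};

	static void hctx_kick(struct hctx_works *hw)
	{
		int i = atomic_inc_return(&hw->next) % HCTX_NR_WORKS;

		kblockd_schedule_work(&hw->work[i]);
	}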
Ming Lei Sept. 23, 2016, 1:35 a.m. UTC | #6
On Thu, Sep 22, 2016 at 11:12 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>> preemption disabled, so I'd hope that would not be the case... Which
>> drivers block in ->queue_rq()?
>
> I thought I had converted a lot of them to GFP_NOIO instead of GFP_ATOMIC
> allocations, but I can't find any evidence of that.  Maybe it was just
> my imagination, or an unsubmitted patch series.  Sorry for the
> confusion.
>
>> Loop was another case that was on my radar to get rid of the queue_work
>> it currently has to do. Josef is currently testing the nbd driver using
>> this approach, so we should get some numbers there too. If we have to,
>> we can always bump up the concurrency to mimic more of the behavior of
>> having multiple workers running on the hardware queue. I'd prefer to
>> still do that in blk-mq, instead of having drivers reinventing their own
>> work offload functionality.
>
> There should be a lot of numbers in the list archives from the time
> that Ming did the loop conversion, as I've been trying to steer him
> that way, and he actually implemented and benchmarked it.
>
> We can't just increase the concurrency, as a single work_struct item
> can't be queued multiple times even on a high-concurrency workqueue.

Yes, that is it, and that is why we couldn't convert to this approach last time.

Thanks,
Ming Lei Sept. 23, 2016, 1:43 a.m. UTC | #7
On Thu, Sep 22, 2016 at 11:17 PM, Jens Axboe <axboe@fb.com> wrote:
> On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
>>
>> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>>>
>>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>>> preemption disabled, so I'd hope that would not be the case... Which
>>> drivers block in ->queue_rq()?
>>
>>
>> I thought I had converted a lot of them to GFP_NOIO instead of GFP_ATOMIC
>> allocations, but I can't find any evidence of that.  Maybe it was just
>> my imagination, or an unsubmitted patch series.  Sorry for the
>> confusion.
>
>
> OK, that makes more sense. Pretty sure we would have had complaints!
>
>>> Loop was another case that was on my radar to get rid of the queue_work
>>> it currently has to do. Josef is currently testing the nbd driver using
>>> this approach, so we should get some numbers there too. If we have to,
>>> we can always bump up the concurrency to mimic more of the behavior of
>>> having multiple workers running on the hardware queue. I'd prefer to
>>> still do that in blk-mq, instead of having drivers reinventing their own
>>> work offload functionality.
>>
>>
>> There should be a lot of numbers in the list archives from the time
>> that Ming did the loop conversion, as I've been trying to steer him
>> that way, and he actually implemented and benchmarked it.
>>
>> We can't just increase the concurrency, as a single work_struct item
>> can't be queued multiple times even on a high-concurrency workqueue.
>
>
> But we could have more work items, if we had to. Even if loop isn't a
> drop-in replacement for this simpler approach, I think it'll work well
> enough for nbd. The 5% number from Josef is comparing to not having any
> offload at all; I suspect the number from just converting to queue_work
> in the driver would be similar.

5% might just be noise.

Looking at nbd's .queue_rq(), kthread_work should match its case: one
per-nbd mutex is required and all commands sent to the nbd device are
serialized, so using common work items may hurt performance through
unnecessary context switches.
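
A sketch of the kthread_worker variant described here: one dedicated worker
thread per nbd device keeps sends serialized without bouncing commands across
pool workers. (Not from this thread, and the API spelling below is the later
kthread_create_worker()/kthread_queue_work() one; 4.8-era kernels spell these
differently.)

	#include <linux/kthread.h>

	struct nbd_send_ctx {
		struct kthread_worker *worker;
		struct kthread_work work;
	};

	static int nbd_send_ctx_init(struct nbd_send_ctx *ctx,
				     void (*fn)(struct kthread_work *))
	{
		ctx->worker = kthread_create_worker(0, "nbd-send");
		if (IS_ERR(ctx->worker))
			return PTR_ERR(ctx->worker);
		kthread_init_work(&ctx->work, fn);
		return 0;
	}

	static void nbd_send_ctx_kick(struct nbd_send_ctx *ctx)
	{
		/* All executions run on the one dedicated thread. */
		kthread_queue_work(ctx->worker, &ctx->work);
	}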

thanks,
Josef Bacik Sept. 23, 2016, 2:44 a.m. UTC | #8
> On Sep 22, 2016, at 9:44 PM, Ming Lei <ming.lei@canonical.com> wrote:
> 
>> On Thu, Sep 22, 2016 at 11:17 PM, Jens Axboe <axboe@fb.com> wrote:
>>> On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
>>> 
>>>> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>>>> 
>>>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>>>> preemption disabled, so I'd hope that would not be the case... Which
>>>> drivers block in ->queue_rq()?
>>> 
>>> 
>>> I thought I had converted a lot of them to GFP_NOIO instead of GFP_ATOMIC
>>> allocations, but I can't find any evidence of that.  Maybe it was just
>>> my imagination, or an unsubmitted patch series.  Sorry for the
>>> confusion.
>> 
>> 
>> OK, that makes more sense. Pretty sure we would have had complaints!
>> 
>>>> Loop was another case that was on my radar to get rid of the queue_work
>>>> it currently has to do. Josef is currently testing the nbd driver using
>>>> this approach, so we should get some numbers there too. If we have to,
>>>> we can always bump up the concurrency to mimic more of the behavior of
>>>> having multiple workers running on the hardware queue. I'd prefer to
>>>> still do that in blk-mq, instead of having drivers reinventing their own
>>>> work offload functionality.
>>> 
>>> 
>>> There should be a lot of numbers in the list archives from the time
>>> that Ming did the loop conversion, as I've been trying to steer him
>>> that way, and he actually implemented and benchmarked it.
>>> 
>> We can't just increase the concurrency, as a single work_struct item
>> can't be queued multiple times even on a high-concurrency workqueue.
>> 
>> 
>> But we could have more work items, if we had to. Even if loop isn't a
>> drop-in replacement for this simpler approach, I think it'll work well
>> enough for nbd. The 5% number from Josef is comparing to not having any
>> offload at all; I suspect the number from just converting to queue_work
>> in the driver would be similar.
> 
> 5% might just be noise.
> 
> Looking at nbd's .queue_rq(), kthread_work should match its case: one
> per-nbd mutex is required and all commands sent to the nbd device are
> serialized, so using common work items may hurt performance through
> unnecessary context switches.

This is with my multi-connection patch, with 4 connections per device (so 4 hw queues).  The 5% is real, but I don't think it'll get better by punting to a worker per command, so this approach is fine for NBD.  Thanks,

Josef

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c29700010b5c..eae2f12f011f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -908,7 +908,7 @@  void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	    !blk_mq_hw_queue_mapped(hctx)))
 		return;
 
-	if (!async) {
+	if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		int cpu = get_cpu();
 		if (cpumask_test_cpu(cpu, hctx->cpumask)) {
 			__blk_mq_run_hw_queue(hctx);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fbcfdf323243..5daa0ef756dd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -155,6 +155,7 @@  enum {
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
+	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,