diff mbox

[1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue

Message ID 1456938434-20387-2-git-send-email-hch@lst.de (mailing list archive)
State Accepted, archived
Delegated to: Jens Axboe
Headers show

Commit Message

Christoph Hellwig March 2, 2016, 5:07 p.m. UTC
There is no reason to treat the admin queue different in terms of
request merging.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Sagi Grimberg March 2, 2016, 5:35 p.m. UTC | #1
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 680f578..19be56a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>   		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>   		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> +		dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;

I still don't understand what's the point?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Keith Busch March 2, 2016, 6:15 p.m. UTC | #2
On Wed, Mar 02, 2016 at 07:35:56PM +0200, Sagi Grimberg wrote:
> 
> >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> >index 680f578..19be56a 100644
> >--- a/drivers/nvme/host/pci.c
> >+++ b/drivers/nvme/host/pci.c
> >@@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> >  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> >  		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> >+		dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> 
> I still don't understand what's the point?

I agree, this looks like a no-op. The admin request queue isn't for
block data, and the admin queue only uses cmd_type REQ_TYPE_DRV_PRIV. I
believe only requests of REQ_TYPE_FS type are mergable.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe March 2, 2016, 8:01 p.m. UTC | #3
On 03/02/2016 10:07 AM, Christoph Hellwig wrote:
> There is no reason to treat the admin queue different in terms of
> request merging.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
> ---
>   drivers/nvme/host/pci.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 680f578..19be56a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>   		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>   		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> +		dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>   		dev->admin_tagset.driver_data = dev;
>
>   		if (blk_mq_alloc_tag_set(&dev->admin_tagset))

Agree with Keith and others that his doesn't make a lot of sense. It's 
not that it'll break anything, but it won't change anything either.

I've applied 2-5/5 to for-linus. I guess that's one more day or not 
shipping to Linus...
Christoph Hellwig March 2, 2016, 9:17 p.m. UTC | #4
On Wed, Mar 02, 2016 at 07:35:56PM +0200, Sagi Grimberg wrote:
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 680f578..19be56a 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>>   		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>>   		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>>   		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
>> +		dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>
> I still don't understand what's the point?

still? :)

But yes, we're not going to hit the merge case for the passthrough
command indeed.  Except for keeping the flags in sync, which sounds
like a good ?dea in general there is not reall need for this one.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ming Lei March 2, 2016, 11:55 p.m. UTC | #5
On Thu, Mar 3, 2016 at 5:17 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 02, 2016 at 07:35:56PM +0200, Sagi Grimberg wrote:
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index 680f578..19be56a 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>>>              dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>>>              dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>>>              dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
>>> +            dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>>
>> I still don't understand what's the point?
>
> still? :)
>
> But yes, we're not going to hit the merge case for the passthrough
> command indeed.  Except for keeping the flags in sync, which sounds
> like a good ?dea in general there is not reall need for this one.

NO_SG_MERGE has been bypassed after bio splitting is introduced,
and looks it should have been removed.
Christoph Hellwig March 3, 2016, 8:46 a.m. UTC | #6
On Thu, Mar 03, 2016 at 07:55:15AM +0800, Ming Lei wrote:
> > command indeed.  Except for keeping the flags in sync, which sounds
> > like a good ?dea in general there is not reall need for this one.
> 
> NO_SG_MERGE has been bypassed after bio splitting is introduced,
> and looks it should have been removed.

Partially at least.  Anyway, let's start a discussion on blk-mq merge
flags.

First I really hate the way how any merges are opt-in.  I'd be much
happier with an opt-out to get the sane behavior by default.  Every
drivers set BLK_MQ_F_SHOULD_MERGE, and while only a few drivers set
BLK_MQ_F_SG_MERGE I suspect all should as well.  Especially as the
SG merging is cheaper and more useful than the bio mergig in many
cases.

Jens, what are you fine with simply adding a BLK_MQ_F_NOMERGE as an
inverse flag that disables both bio and sg merging (where we control it),
and just default to doing them?

> 
> 
> -- 
> Ming Lei
---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ming Lei March 3, 2016, 11:24 a.m. UTC | #7
On Thu, Mar 3, 2016 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 03, 2016 at 07:55:15AM +0800, Ming Lei wrote:
>> > command indeed.  Except for keeping the flags in sync, which sounds
>> > like a good ?dea in general there is not reall need for this one.
>>
>> NO_SG_MERGE has been bypassed after bio splitting is introduced,
>> and looks it should have been removed.
>
> Partially at least.  Anyway, let's start a discussion on blk-mq merge
> flags.
>
> First I really hate the way how any merges are opt-in.  I'd be much
> happier with an opt-out to get the sane behavior by default.  Every
> drivers set BLK_MQ_F_SHOULD_MERGE, and while only a few drivers set
> BLK_MQ_F_SG_MERGE I suspect all should as well.  Especially as the
> SG merging is cheaper and more useful than the bio mergig in many
> cases.

After bio splitting is introduced, SG merging becomes much cheaper, and
I can't see the difference between SG_MERGE and NO_SG_MERGE when
doing test over null_blk.

Once multipage bvecs is supported, SG merging should be always because
the bvec stored in bio->bi_io_vec can be thought as sort of segment.

Also from hardware view, it is often more efficient to handle less segments.
Last time Keith mentioned the point too about NVMe.

That is why I think we might consider to remove the flag.
Christoph Hellwig March 3, 2016, 11:59 a.m. UTC | #8
On Thu, Mar 03, 2016 at 07:24:10PM +0800, Ming Lei wrote:
> After bio splitting is introduced, SG merging becomes much cheaper, and
> I can't see the difference between SG_MERGE and NO_SG_MERGE when
> doing test over null_blk.

Ok.

> Once multipage bvecs is supported, SG merging should be always because
> the bvec stored in bio->bi_io_vec can be thought as sort of segment.
> 
> Also from hardware view, it is often more efficient to handle less segments.
> Last time Keith mentioned the point too about NVMe.
> 
> That is why I think we might consider to remove the flag.

I hate these sorts of flags, but we have to weight this against the
reason they were introduced.  I think the rational was for workloads
that do small (4k or less) I/Os and really care about latency.

But I'd be much happier if we'd only allow this to be turned off at
runtime through sysfs, and not requiring the drivers to opt for
sane behavior at compile time.

I also agree about the multipage biovecs - if we build large iovecs
directly in bio_add_page we get S/G merging for a two comparisms and
a branch, so it'll be essentially for free.  But for that we need
multipage biovecs first.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig March 15, 2016, 4:08 p.m. UTC | #9
Jens, Keith -

any comments?  I think the lack of BLK_MQ_F_SG_MERGE for nvme
is rather harmful at the moment, as I get much better request
packing with it.  I'd rather fix this in the block layer, but
if that's not ok I can also send a pure NVMe patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe March 15, 2016, 10:13 p.m. UTC | #10
On 03/15/2016 09:08 AM, Christoph Hellwig wrote:
> Jens, Keith -
>
> any comments?  I think the lack of BLK_MQ_F_SG_MERGE for nvme
> is rather harmful at the moment, as I get much better request
> packing with it.  I'd rather fix this in the block layer, but
> if that's not ok I can also send a pure NVMe patch.

We can default to merging, and have drivers opt out, if they need to. 
NVMe is probably the only one that doesn't set the flag currently, so 
the net effect of the two different approaches would be the same. We'll 
have a slightly higher CPU overhead without a perf win for workloads 
that don't get merging, but overall it's always a win if we can merge 
commands. So a better default, I think, even for nvme.
diff mbox

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 680f578..19be56a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1289,6 +1289,7 @@  static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+		dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))