
[1/3] nvme/pci: Start request after doorbell ring

Message ID 20171221204636.2924-2-keith.busch@intel.com (mailing list archive)
State New, archived

Commit Message

Keith Busch Dec. 21, 2017, 8:46 p.m. UTC
This is a performance optimization that allows the hardware to work on
a command in parallel with the kernel's stats and timeout tracking.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Jens Axboe Dec. 21, 2017, 8:49 p.m. UTC | #1
On 12/21/17 1:46 PM, Keith Busch wrote:
> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/pci.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f5800c3c9082..df5550ce0531 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  			goto out_cleanup_iod;
>  	}
>  
> -	blk_mq_start_request(req);
> -
>  	spin_lock_irq(&nvmeq->q_lock);
>  	if (unlikely(nvmeq->cq_vector < 0)) {
>  		ret = BLK_STS_IOERR;
> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		goto out_cleanup_iod;
>  	}
>  	__nvme_submit_cmd(nvmeq, &cmnd);
> +	blk_mq_start_request(req);
>  	nvme_process_cq(nvmeq);
>  	spin_unlock_irq(&nvmeq->q_lock);
>  	return BLK_STS_OK;

I guess this will work since we hold the q_lock, but you should probably
include a comment to that effect, since it's important that we can't find
the completion before we've marked the request started.

Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
shown up yet, but I'm guessing it kills the cq check after submission),
since if we have multiple CPUs on the same hardware queue, one of
the other CPUs could find a completion event between your
__nvme_submit_cmd() and blk_mq_start_request() call. Very short window,
but it does exist.
Jens Axboe Dec. 21, 2017, 8:53 p.m. UTC | #2
On 12/21/17 1:49 PM, Jens Axboe wrote:
> On 12/21/17 1:46 PM, Keith Busch wrote:
>> This is a performance optimization that allows the hardware to work on
>> a command in parallel with the kernel's stats and timeout tracking.
>>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  drivers/nvme/host/pci.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index f5800c3c9082..df5550ce0531 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  			goto out_cleanup_iod;
>>  	}
>>  
>> -	blk_mq_start_request(req);
>> -
>>  	spin_lock_irq(&nvmeq->q_lock);
>>  	if (unlikely(nvmeq->cq_vector < 0)) {
>>  		ret = BLK_STS_IOERR;
>> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  		goto out_cleanup_iod;
>>  	}
>>  	__nvme_submit_cmd(nvmeq, &cmnd);
>> +	blk_mq_start_request(req);
>>  	nvme_process_cq(nvmeq);
>>  	spin_unlock_irq(&nvmeq->q_lock);
>>  	return BLK_STS_OK;
> 
> I guess this will work since we hold the q_lock, but you should probably
> include a comment to that effect, since it's important that we can't find
> the completion before we've marked the request started.
> 
> Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
> shown up yet, but I'm guessing it kills the cq check after submission),
> since if we have multiple CPUs on the same hardware queue, one of
> the other CPUs could find a completion event between your
> __nvme_submit_cmd() and blk_mq_start_request() call. Very short window,
> but it does exist.

Turns out that wasn't what patch 2 was. And the code is right there
above as well, and under the q_lock, so I guess that race doesn't
exist.

But that does bring up the question of whether we should always be doing
the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
maybe it's better to make the submission path faster and skip it?
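
For illustration only (a sketch against the hunk above, not something from
this series), dropping the opportunistic poll would reduce the tail of
nvme_queue_rq() to something like:

	__nvme_submit_cmd(nvmeq, &cmnd);
	blk_mq_start_request(req);
	/*
	 * No opportunistic nvme_process_cq() here; completions are
	 * reaped only from the interrupt (or poll) path instead.
	 */
	spin_unlock_irq(&nvmeq->q_lock);
	return BLK_STS_OK;

Whether to skip the poll only for hipri/polled requests or for everything
would still be an open question.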
Jens Axboe Dec. 21, 2017, 9:01 p.m. UTC | #3
On 12/21/17 2:02 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
>> Turns out that wasn't what patch 2 was. And the code is right there
>> above as well, and under the q_lock, so I guess that race doesn't
>> exist.
>>
>> But that does bring up the question of whether we should always be doing
>> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>> maybe it's better to make the submission path faster and skip it?
> 
> Yes, I am okay with removing the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

I haven't either, but curious if others had. It's mostly just extra
overhead; I haven't seen it provide a latency reduction of any kind.
Keith Busch Dec. 21, 2017, 9:02 p.m. UTC | #4
On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
> Turns out that wasn't what patch 2 was. And the code is right there
> above as well, and under the q_lock, so I guess that race doesn't
> exist.
> 
> But that does bring up the question of whether we should always be doing
> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
> maybe it's better to make the submission path faster and skip it?

Yes, I am okay with removing the opportunistic nvme_process_cq in the
submission path. Even under deeply queued IO, I've not seen this provide
any measurable benefit.
Sagi Grimberg Dec. 25, 2017, 10:11 a.m. UTC | #5
> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.

Curious to know what this buys us?
Sagi Grimberg Dec. 25, 2017, 10:12 a.m. UTC | #6
>> But that does bring up the question of whether we should always be doing
>> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>> maybe it's better to make the submission path faster and skip it?
> 
> Yes, I am okay with removing the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

+100 for removing it. Never really understood why it gets us anything...
Keith Busch Dec. 26, 2017, 8:35 p.m. UTC | #7
On Mon, Dec 25, 2017 at 12:11:47PM +0200, Sagi Grimberg wrote:
> > This is a performance optimization that allows the hardware to work on
> > a command in parallel with the kernel's stats and timeout tracking.
> 
> Curious to know what this buys us?

blk_mq_start_request doesn't do anything to make the command ready to
submit to hardware, so this is pure software overhead, somewhere between
200 and 300 nanoseconds as far as I can measure (YMMV). We can post the
command to hardware before taking on this software overhead so the
hardware gets to fetch the command that much sooner.

One thing to remember is we need to ensure the driver can't complete the
request before starting it. The driver's existing locking does ensure
that with this patch, so it's just something to keep in mind if anything
should ever change in the future.
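
To make that explicit (just a sketch of a possible comment, not final
wording for the patch), the reordered hunk could carry something like:

	__nvme_submit_cmd(nvmeq, &cmnd);
	/*
	 * It is only safe to start the request after posting it because
	 * both this submission path and nvme_process_cq() run under
	 * nvmeq->q_lock, so a completion for this command cannot be
	 * processed before blk_mq_start_request() has returned.
	 */
	blk_mq_start_request(req);
	nvme_process_cq(nvmeq);
	spin_unlock_irq(&nvmeq->q_lock);
	return BLK_STS_OK;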
Sagi Grimberg Dec. 27, 2017, 9:02 a.m. UTC | #8
>> Curious to know what this buys us?
> 
> blk_mq_start_request doesn't do anything to make the command ready to
> submit to hardware, so this is pure software overhead, somewhere between
> 200 and 300 nanoseconds as far as I can measure (YMMV). We can post the
> command to hardware before taking on this software overhead so the
> hardware gets to fetch the command that much sooner.
> 
> One thing to remember is we need to ensure the driver can't complete the
> request before starting it. The driver's existing locking does ensure
> that with this patch, so it's just something to keep in mind if anything
> should ever change in the future.

Needs to be well documented.

This means we won't be able to split the submission and completion locks
now... So it does present a tradeoff.
Christoph Hellwig Dec. 29, 2017, 9:44 a.m. UTC | #9
Looks fine, although I agree a comment on the q_lock exclusion would
be useful.
Christoph Hellwig Dec. 29, 2017, 9:44 a.m. UTC | #10
On Mon, Dec 25, 2017 at 12:12:52PM +0200, Sagi Grimberg wrote:
>
>>> But that does bring up the question of whether we should always be doing
>>> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>>> maybe it's better to make the submission path faster and skip it?
>>
>> Yes, I am okay with removing the opportunistic nvme_process_cq in the
>> submission path. Even under deeply queued IO, I've not seen this provide
>> any measurable benefit.
>
> +100 for removing it. Never really understood why it gets us anything...

Yes, we should get rid of it.
Keith Busch Jan. 3, 2018, 8:21 p.m. UTC | #11
On Thu, Dec 21, 2017 at 02:01:44PM -0700, Jens Axboe wrote:
> I haven't either, but curious if others had. It's mostly just extra
> overhead; I haven't seen it provide a latency reduction of any kind.

I've removed the submission side poll in a local build, and amazingly I
am observing a not insignificant increase in latency without it on some
workloads with certain hardware. I will have to delay recommending/posting
removal of this pending further investigation.
Keith Busch Jan. 23, 2018, 12:16 a.m. UTC | #12
On Wed, Jan 03, 2018 at 01:21:05PM -0700, Keith Busch wrote:
> 
> I've removed the submission side poll in a local build, and amazingly I
> am observing a not insignificant increase in latency without it on some
> workloads with certain hardware. I will have to delay recommending/posting
> removal of this pending further investigation.

This took longer to narrow down than I hoped. I had been getting very wild
results, and there were two things to blame: a kernel irq issue[*],
and a platform BIOS causing a lot of CPU thermal frequency throttling.

Now that's sorted out, I am back to tuning this, and the opportunistic
nvme submission side polling once again does not produce a measurable
difference. My world is sane again. Will try to send an update for
consideration this week after some more testing.

 * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a0c9259dc4

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..df5550ce0531 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -886,8 +886,6 @@  static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			goto out_cleanup_iod;
 	}
 
-	blk_mq_start_request(req);
-
 	spin_lock_irq(&nvmeq->q_lock);
 	if (unlikely(nvmeq->cq_vector < 0)) {
 		ret = BLK_STS_IOERR;
@@ -895,6 +893,7 @@  static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out_cleanup_iod;
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
+	blk_mq_start_request(req);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return BLK_STS_OK;