
[LSF/MM,TOPIC,LSF/MM,ATTEND] NAPI polling for block drivers

Message ID aaf5c35f-7e81-fa44-d467-53b574debdd4@suse.de (mailing list archive)
State New, archived

Commit Message

Hannes Reinecke Jan. 18, 2017, 3:39 p.m. UTC
On 01/18/2017 04:16 PM, Johannes Thumshirn wrote:
> On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
>>
>>> Hannes just spotted this:
>>> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>                         const struct blk_mq_queue_data *bd)
>>> {
>>> [...]
>>>        __nvme_submit_cmd(nvmeq, &cmnd);
>>>        nvme_process_cq(nvmeq);
>>>        spin_unlock_irq(&nvmeq->q_lock);
>>>        return BLK_MQ_RQ_QUEUE_OK;
>>> out_cleanup_iod:
>>>        nvme_free_iod(dev, req);
>>> out_free_cmd:
>>>        nvme_cleanup_cmd(req);
>>>        return ret;
>>> }
>>>
>>> So we're draining the CQ on submit. This of course makes polling for
>>> completions in the IRQ handler rather pointless, as we already did it in
>>> the submission path.
>>
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> 
> I indeed did, thanks.
> 
But it doesn't help.

We're still having to wait for the first interrupt, and if we're really
fast that's the only completion we have to process.

Try this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
        }
        __nvme_submit_cmd(nvmeq, &cmnd);
        spin_unlock(&nvmeq->sq_lock);
+       disable_irq_nosync(nvmeq_irq(irq));
+       irq_poll_sched(&nvmeq->iop);

That should avoid the first interrupt, and with a bit of luck reduce the
number of interrupts _drastically_.
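
(For context: the irq_poll_sched() call in the patch above only makes sense if
the queue also registers a poll callback that drains the CQ in softirq context
and re-enables the interrupt once the queue runs dry. A minimal sketch of what
that side could look like follows; nvme_irqpoll_handler, nvme_reap_cq and
NVME_IRQPOLL_WEIGHT are illustrative names rather than functions from the
actual driver, nvmeq_irq() is assumed to map a queue to its Linux irq number
as in the patch, while irq_poll_init()/irq_poll_complete() are the real
<linux/irq_poll.h> API.)

/*
 * Illustrative sketch only -- the poll side that would pair with the
 * irq_poll_sched() call above.  nvme_reap_cq() stands in for whatever
 * helper processes up to 'budget' CQ entries and returns the count.
 */
static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
	int done;

	done = nvme_reap_cq(nvmeq, budget);		/* hypothetical helper */
	if (done < budget) {
		/* CQ drained: stop polling, let the device interrupt again */
		irq_poll_complete(iop);
		enable_irq(nvmeq_irq(nvmeq));
	}
	return done;
}

/* registered once per queue, e.g. during queue initialisation */
irq_poll_init(&nvmeq->iop, NVME_IRQPOLL_WEIGHT, nvme_irqpoll_handler);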

Cheers,

Hannes

Comments

Sagi Grimberg Jan. 19, 2017, 8:12 a.m. UTC | #1
>>> I think you missed:
>>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>>
>> I indeed did, thanks.
>>
> But it doesn't help.
>
> We're still having to wait for the first interrupt, and if we're really
> fast that's the only completion we have to process.
>
> Try this:
>
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b4b32e6..e2dd9e2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>         }
>         __nvme_submit_cmd(nvmeq, &cmnd);
>         spin_unlock(&nvmeq->sq_lock);
> +       disable_irq_nosync(nvmeq_irq(irq));
> +       irq_poll_sched(&nvmeq->iop);

a. This would trigger a condition where we disable the irq twice, which
is wrong, not least because it will generate a warning.

b. This would cause way too many triggers of ksoftirqd. In order for
it to be effective we need it to run only when it should, and optimally
when the completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is
needed for this to work. The problem with nvme coalescing, as Jens said, is
the death penalty of its 100us granularity. Hannes, Johannes, how does it
look with the devices you are testing with?
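
(For reference, NVMe exposes interrupt coalescing via Set Features, feature ID
08h, where the aggregation time is encoded in 100us units -- which is exactly
the granularity complained about above. A rough sketch of how a driver could
program it; nvme_set_irq_coalescing() is an invented wrapper, the Dword 11
layout follows the spec, and nvme_submit_sync_cmd() is the existing
synchronous admin helper.)

/*
 * Sketch only: program NVMe interrupt coalescing (Set Features, FID 08h).
 * Per the spec, Dword 11 carries the aggregation threshold in bits 7:0
 * and the aggregation time in bits 15:8, in 100us units -- hence the
 * coarse granularity discussed above.
 */
static int nvme_set_irq_coalescing(struct nvme_ctrl *ctrl, u8 thr, u8 time_100us)
{
	struct nvme_command c = { };

	c.features.opcode = nvme_admin_set_features;
	c.features.fid = cpu_to_le32(NVME_FEAT_IRQ_COALESCE);
	c.features.dword11 = cpu_to_le32((time_100us << 8) | thr);

	return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
}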

Also, I think that adaptive moderation is needed in order for it to
work well. I know that some networking drivers implemented adaptive
moderation in SW before having HW support for it. It can be done by
maintaining stats and having a periodic work that looks at them and
changes the moderation parameters.
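
(A rough sketch of what such software-only adaptive moderation could look
like, building on the nvme_set_irq_coalescing() wrapper sketched earlier; the
struct, the thresholds and the 100ms period are all made up for illustration,
not existing code.)

/* Hypothetical per-controller moderation state. */
struct nvme_mod_stats {
	struct nvme_ctrl	*ctrl;
	atomic64_t		completions;	/* bumped from the CQ path */
	u64			last_completions;
	u8			cur_time_100us;
	struct delayed_work	work;		/* INIT_DELAYED_WORK() at setup */
};

static void nvme_adaptive_moderation_work(struct work_struct *work)
{
	struct nvme_mod_stats *s =
		container_of(to_delayed_work(work), struct nvme_mod_stats, work);
	u64 total = atomic64_read(&s->completions);
	u64 rate = total - s->last_completions;	/* completions per interval */

	s->last_completions = total;

	/* Made-up thresholds: coalesce harder under load, back off when idle. */
	if (rate > 100000 && s->cur_time_100us < 4)
		s->cur_time_100us++;
	else if (rate < 10000 && s->cur_time_100us > 0)
		s->cur_time_100us--;

	nvme_set_irq_coalescing(s->ctrl, 8, s->cur_time_100us);

	schedule_delayed_work(&s->work, msecs_to_jiffies(100));
}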

Does anyone think that this is something we should consider?
Sagi Grimberg Jan. 19, 2017, 8:23 a.m. UTC | #2
Christoph suggested to me once that we can take a hybrid
approach where we consume a small number of completions (say 4)
right away from the interrupt handler, and if we have more
we schedule irq-poll to reap the rest. But back then it
didn't work better, which is not aligned with my observation
that we consume only one completion per interrupt...

I can give it another go... What do people think about it?
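
(To make the hybrid idea concrete, here is a rough sketch of what such an
interrupt handler could look like; nvme_reap_cq() and nvme_cq_has_more() are
hypothetical helpers, and the budget of 4 is just the number mentioned above.)

/*
 * Sketch of the hybrid scheme: reap a few completions directly in the
 * hard interrupt handler and only fall back to irq_poll when more are
 * pending.
 */
#define NVME_IRQ_BUDGET	4	/* the "say 4" from above */

static irqreturn_t nvme_irq_hybrid(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	int done;

	done = nvme_reap_cq(nvmeq, NVME_IRQ_BUDGET);	/* hypothetical helper */
	if (done == NVME_IRQ_BUDGET && nvme_cq_has_more(nvmeq)) {
		/* a batch is queued up: let softirq polling reap the rest */
		disable_irq_nosync(irq);
		irq_poll_sched(&nvmeq->iop);
	}
	return done ? IRQ_HANDLED : IRQ_NONE;
}
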
Johannes Thumshirn Jan. 19, 2017, 9:13 a.m. UTC | #3
On Thu, Jan 19, 2017 at 10:12:17AM +0200, Sagi Grimberg wrote:
> 
> >>>I think you missed:
> >>>http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> >>
> >>I indeed did, thanks.
> >>
> >But it doesn't help.
> >
> >We're still having to wait for the first interrupt, and if we're really
> >fast that's the only completion we have to process.
> >
> >Try this:
> >
> >
> >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> >index b4b32e6..e2dd9e2 100644
> >--- a/drivers/nvme/host/pci.c
> >+++ b/drivers/nvme/host/pci.c
> >@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >        }
> >        __nvme_submit_cmd(nvmeq, &cmnd);
> >        spin_unlock(&nvmeq->sq_lock);
> >+       disable_irq_nosync(nvmeq_irq(irq));
> >+       irq_poll_sched(&nvmeq->iop);
> 
> a. This would trigger a condition where we disable the irq twice, which
> is wrong, not least because it will generate a warning.
> 
> b. This would cause way too many triggers of ksoftirqd. In order for
> it to be effective we need it to run only when it should, and optimally
> when the completion queue has a batch of completions waiting.
> 
> After a deeper analysis, I agree with Bart that interrupt coalescing is
> needed for this to work. The problem with nvme coalescing, as Jens said, is
> the death penalty of its 100us granularity. Hannes, Johannes, how does it
> look with the devices you are testing with?

I haven't had a look at AHCI's Command Completion Coalescing yet, but hopefully
I'll find the time today (+SSD testing!!!).

I don't know if Hannes did (but I _think_ not). The problem is we've already
maxed out our test HW w/o irq_poll, so the only change we're currently seeing
is an increase in wasted CPU cycles. Not what we wanted.

> 
> Also, I think that adaptive moderation is needed in order for it to
> work well. I know that some networking drivers implemented adaptive
> moderation in SW before having HW support for it. It can be done by
> maintaining stats and having a periodic work that looks at them and
> changes the moderation parameters.
> 
> Does anyone think that this is something we should consider?

Yes, we've been discussing this internally as well and it sounds good, but
that's still all pure theory; nothing has actually been implemented and tested.

Byte,
	Johannes
Johannes Thumshirn Jan. 19, 2017, 9:18 a.m. UTC | #4
On Thu, Jan 19, 2017 at 10:23:28AM +0200, Sagi Grimberg wrote:
> Christoph suggested to me once that we can take a hybrid
> approach where we consume a small number of completions (say 4)
> right away from the interrupt handler, and if we have more
> we schedule irq-poll to reap the rest. But back then it
> didn't work better, which is not aligned with my observation
> that we consume only one completion per interrupt...
> 
> I can give it another go... What do people think about it?

This could be good.

What's also possible (see the answer to my previous mail) is measuring the time
it takes for a completion to arrive, and if the average time is lower than the
context-switch time, just busy-loop instead of waiting for the IRQ to arrive.
If it is higher, we can always schedule a timer to fire _before_ the IRQ will
likely arrive and start polling. Does this sound reasonable to you guys
as well?
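
(A very rough sketch of that heuristic, purely for illustration: the EWMA
bookkeeping, the helpers nvme_cq_has_more()/nvme_reap_cq(), and the idea that
ctx_switch_ns comes from some calibration are all assumptions, not existing
code.)

/*
 * Rough sketch: keep an EWMA of submit-to-completion latency and decide
 * between busy-polling and arming a timer slightly before the expected
 * completion.
 */
struct nvme_lat_hint {
	u64		avg_ns;		/* EWMA of completion latency */
	u64		ctx_switch_ns;	/* calibrated cost of sleep + wakeup */
	struct hrtimer	poll_timer;	/* callback (not shown) kicks polling */
};

static void nvme_update_latency(struct nvme_lat_hint *h, u64 sample_ns)
{
	/* EWMA, 1/8 weight for the new sample */
	h->avg_ns = (h->avg_ns * 7 + sample_ns) / 8;
}

static void nvme_wait_for_completion(struct nvme_queue *nvmeq,
				     struct nvme_lat_hint *h)
{
	if (h->avg_ns < h->ctx_switch_ns) {
		/* cheaper to spin than to sleep and be woken by the IRQ */
		while (!nvme_cq_has_more(nvmeq))	/* hypothetical helper */
			cpu_relax();
		nvme_reap_cq(nvmeq, INT_MAX);		/* hypothetical helper */
	} else {
		/* start polling just before the completion is expected */
		hrtimer_start(&h->poll_timer,
			      ns_to_ktime(h->avg_ns - h->ctx_switch_ns),
			      HRTIMER_MODE_REL);
	}
}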

	Johannes

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@  static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
        }
        __nvme_submit_cmd(nvmeq, &cmnd);
        spin_unlock(&nvmeq->sq_lock);
+       disable_irq_nosync(nvmeq_irq(irq));
+       irq_poll_sched(&nvmeq->iop);
        return BLK_MQ_RQ_QUEUE_OK;
 out_cleanup_iod:
        nvme_free_iod(dev, req);