Message ID: 20181211233652.9705-1-sagi@grimberg.me (mailing list archive)
Series: restore polling to nvme-rdma

On Tue, Dec 11, 2018 at 03:36:47PM -0800, Sagi Grimberg wrote:
> Add an additional queue mapping for polling queues that will
> host polling for latency critical I/O.
>
> One caveat is that we don't want these queues to be pure polling
> as we don't want to bother with polling for the initial nvmf connect
> I/O. Hence, introduce ib_change_cq_ctx that will modify the cq polling
> context from SOFTIRQ to DIRECT.

So do we really care? Yes, polling for the initial connect is not
exactly efficient, but then again it doesn't happen all that often.

Except for efficiency is there any problem with just starting out
in polling mode?

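[For context, the transition the cover letter describes could look roughly
like the sketch below: bring the queue up with the CQ in IB_POLL_SOFTIRQ so
the fabrics connect completes via interrupts, then flip the CQ to
IB_POLL_DIRECT once the queue is idle. The exact signature of the proposed
ib_change_cq_ctx() and the nvme_rdma_start_poll_queue() helper name are
assumptions for illustration, not code taken from the series.]

--
/*
 * Illustrative sketch only (not from the posted patches): connect the
 * queue while the CQ still completes in IB_POLL_SOFTIRQ context, then
 * switch it to IB_POLL_DIRECT once no I/O is in flight.  The signature
 * of ib_change_cq_ctx() is assumed here.
 */
static int nvme_rdma_start_poll_queue(struct nvme_rdma_queue *queue)
{
	int ret;

	/* connect I/O completes via the interrupt-driven context */
	ret = nvmf_connect_io_queue(&queue->ctrl->ctrl,
			nvme_rdma_queue_idx(queue));
	if (ret)
		return ret;

	/*
	 * The queue is idle at this point, so switching the completion
	 * context is safe; from here on blk_poll() reaps completions.
	 */
	return ib_change_cq_ctx(queue->ib_cq, IB_POLL_DIRECT);
}
--
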
>> Add an additional queue mapping for polling queues that will
>> host polling for latency critical I/O.
>>
>> One caveat is that we don't want these queues to be pure polling
>> as we don't want to bother with polling for the initial nvmf connect
>> I/O. Hence, introduce ib_change_cq_ctx that will modify the cq polling
>> context from SOFTIRQ to DIRECT.
>
> So do we really care? Yes, polling for the initial connect is not
> exactly efficient, but then again it doesn't happen all that often.
>
> Except for efficiency is there any problem with just starting out
> in polling mode?

I found it cumbersome so I didn't really consider it...
Isn't it a bit awkward? we will need to implement polled connect
locally in nvme-rdma (because fabrics doesn't know anything about
queues, hctx or polling).

I'm open to looking at it if you think that this is better. Note that if
we had the CQ in our hands, we would do exactly what we did here
effectively: use interrupt for the connect and then simply not
re-arm it again and poll... Should we poll the connect just because
we are behind the CQ API?

On Tue, Dec 11, 2018 at 11:16:31PM -0800, Sagi Grimberg wrote:
>>> Add an additional queue mapping for polling queues that will
>>> host polling for latency critical I/O.
>>>
>>> One caveat is that we don't want these queues to be pure polling
>>> as we don't want to bother with polling for the initial nvmf connect
>>> I/O. Hence, introduce ib_change_cq_ctx that will modify the cq polling
>>> context from SOFTIRQ to DIRECT.
>>
>> So do we really care? Yes, polling for the initial connect is not
>> exactly efficient, but then again it doesn't happen all that often.
>>
>> Except for efficiency is there any problem with just starting out
>> in polling mode?
>
> I found it cumbersome so I didn't really consider it...
> Isn't it a bit awkward? we will need to implement polled connect
> locally in nvme-rdma (because fabrics doesn't know anything about
> queues, hctx or polling).

Well, it should just be a little blk_poll loop, right?

> I'm open to looking at it if you think that this is better. Note that if
> we had the CQ in our hands, we would do exactly what we did here
> effectively: use interrupt for the connect and then simply not
> re-arm it again and poll... Should we poll the connect just because
> we are behind the CQ API?

I'm just worried that the switch between the different contexts looks
like a way too easy way to shoot yourself in the foot, so if we can
avoid exposing that it would make for a harder to abuse API.

>> I found it cumbersome so I didn't really consider it...
>> Isn't it a bit awkward? we will need to implement polled connect
>> locally in nvme-rdma (because fabrics doesn't know anything about
>> queues, hctx or polling).
>
> Well, it should just be a little blk_poll loop, right?

Not so much about the poll loop, but the fact that we will need
to check if we need to poll for this special case every time in
.queue_rq and it's somewhat annoying...

>> I'm open to looking at it if you think that this is better. Note that if
>> we had the CQ in our hands, we would do exactly what we did here
>> effectively: use interrupt for the connect and then simply not
>> re-arm it again and poll... Should we poll the connect just because
>> we are behind the CQ API?
>
> I'm just worried that the switch between the different contexts looks
> like a way too easy way to shoot yourself in the foot, so if we can
> avoid exposing that it would make for a harder to abuse API.

Well, it would have been 100% safe if we could undo a cq re-arm that
we did in the past... The code documents that the caller must make
sure that there is no inflight I/O during the invocation of the
routine.

We could be creative if we really want to make it 100% safe for
inflight I/O (although no one should ever need to use that). We can
flush the current cq context (work/irq), switch to polling context,
then create a single entry QP attached to this CQ and drain it :)
That would make it safe but it's brain-dead...

Anyway, if people think it's really a bad idea we'll go ahead and
poll the nvmf connect...

On Wed, Dec 12, 2018 at 12:53:47AM -0800, Sagi Grimberg wrote:
>> Well, it should just be a little blk_poll loop, right?
>
> Not so much about the poll loop, but the fact that we will need
> to check if we need to poll for this special case every time in
> .queue_rq and it's somewhat annoying...

I don't think we need to add any special check in queue_rq. Any
polled command should be marked as HIPRI from the submitting code,
and that includes the queue creation.

> Anyway, if people think it's really a bad idea we'll go ahead and
> poll the nvmf connect...

I'm really just worried about the completion context switching. If
the RDMA folks are happy with that API and you strongly favour this
version I'll let you decide.

Hey Sagi,

> Subject: [PATCH RFC 0/4] restore polling to nvme-rdma
>
> Add an additional queue mapping for polling queues that will
> host polling for latency critical I/O.
>
> One caveat is that we don't want these queues to be pure polling
> as we don't want to bother with polling for the initial nvmf connect
> I/O. Hence, introduce ib_change_cq_ctx that will modify the cq polling
> context from SOFTIRQ to DIRECT. Note that this function is not safe
> with inflight I/O so the caller must make sure not to call it without
> having all I/O quiesced (we also relax the ib_cq_completion_direct warning
> as we have a scenario that this can happen).

Is there no way to handle this in the core? Maybe have the polling context
transition to DIRECT when the queue becomes empty and before re-arming the
CQ?

So ib_change_cq_ctx() would be called to indicate the change should
happen when it is safe to do so.

Just a thought..

> Hey Sagi,

Hi Steve,

> Is there no way to handle this in the core? Maybe have the polling context
> transition to DIRECT when the queue becomes empty and before re-arming the
> CQ?

That is what I suggested, but that would mean that we need to drain
the cq before making the switch, which means we need to allocate a
dedicated qp for that cq, and even that doesn't guarantee that the
ULP is not posting other wrs on its own qp(s)...

So making this safe for inflight I/O would be a challenge... If we end
up agreeing that we are ok with this functionality, I'd much rather not
deal with it and simply document "use with care".

> So ib_change_cq_ctx() would be called to indicate the change should
> happen when it is safe to do so.

You lost me here... ib_change_cq_ctx would get called by who and when?

> > Hey Sagi,
>
> Hi Steve,
>
> > Is there no way to handle this in the core? Maybe have the polling context
> > transition to DIRECT when the queue becomes empty and before re-arming the
> > CQ?
>
> That is what I suggested, but that would mean that we need to drain
> the cq before making the switch, which means we need to allocate a
> dedicated qp for that cq, and even that doesn't guarantee that the
> ULP is not posting other wrs on its own qp(s)...
>
> So making this safe for inflight I/O would be a challenge... If we end
> up agreeing that we are ok with this functionality, I'd much rather not
> deal with it and simply document "use with care".
>
> > So ib_change_cq_ctx() would be called to indicate the change should
> > happen when it is safe to do so.
>
> You lost me here... ib_change_cq_ctx would get called by who and when?

I didn't look in detail at your changes, but ib_change_cq_ctx() is called
by the application, right? I was just asking what if the semantics of the
call were "change the context when it is safe to do so" vs "do it
immediately and hope there are no outstanding WRs". But I don't think this
semantic change simplifies the problem.

>>> Well, it should just be a little blk_poll loop, right?
>>
>> Not so much about the poll loop, but the fact that we will need
>> to check if we need to poll for this special case every time in
>> .queue_rq and it's somewhat annoying...
>
> I don't think we need to add any special check in queue_rq. Any
> polled command should be marked as HIPRI from the submitting code,
> and that includes the queue creation.

Hmm, good idea, perhaps adding something like the following:
--
/**
 * blk_execute_rq_polled - execute a request and poll for its completion
 * @q:		queue to insert the request in
 * @bd_disk:	matching gendisk
 * @rq:		request to insert
 * @at_head:	insert request at head or tail of queue
 *
 * Description:
 *    Insert a fully prepared request at the back of the I/O scheduler
 *    queue for execution and poll for its completion.
 */
void blk_execute_rq_polled(struct request_queue *q, struct gendisk *bd_disk,
		struct request *rq, int at_head)
{
	DECLARE_COMPLETION_ONSTACK(wait);

	rq->cmd_flags |= REQ_HIPRI;
	rq->end_io_data = &wait;
	blk_execute_rq_nowait(q, bd_disk, rq, at_head, blk_end_sync_rq);

	while (!completion_done(&wait))
		blk_poll(q, 0, true);
}
EXPORT_SYMBOL(blk_execute_rq_polled);
--

and __nvme_submit_sync_cmd() can receive a poll argument as well to
call blk_execute_rq_polled()?

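[Following up on that last question, a minimal sketch of what threading a
poll flag through __nvme_submit_sync_cmd() might look like; the 'poll'
parameter name and its placement are assumptions, and the rest simply
mirrors the existing helper rather than any posted patch:]

--
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
		union nvme_result *result, void *buffer, unsigned bufflen,
		unsigned timeout, int qid, int at_head,
		blk_mq_req_flags_t flags, bool poll)
{
	struct request *req;
	int ret;

	req = nvme_alloc_request(q, cmd, flags, qid);
	if (IS_ERR(req))
		return PTR_ERR(req);

	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;

	if (buffer && bufflen) {
		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
		if (ret)
			goto out;
	}

	/* the only change: busy-poll for the completion when asked to */
	if (poll)
		blk_execute_rq_polled(req->q, NULL, req, at_head);
	else
		blk_execute_rq(req->q, NULL, req, at_head);

	if (result)
		*result = nvme_req(req)->result;
	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
		ret = -EINTR;
	else
		ret = nvme_req(req)->status;
 out:
	blk_mq_free_request(req);
	return ret;
}
--

[Existing callers would pass false, and the fabrics connect path could pass
true for queues mapped to the new polling queue map.]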