
[3/9] IB: add a helper to safely drain a QP

Message ID 1447422410-20891-4-git-send-email-hch@lst.de (mailing list archive)
State Superseded

Commit Message

Christoph Hellwig Nov. 13, 2015, 1:46 p.m. UTC
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/core/cq.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h      |  2 ++
 2 files changed, 48 insertions(+)

Comments

Steve Wise Nov. 13, 2015, 4:16 p.m. UTC | #1
On 11/13/2015 7:46 AM, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/infiniband/core/cq.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
>   include/rdma/ib_verbs.h      |  2 ++
>   2 files changed, 48 insertions(+)
>
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index d9eb796..bf2a079 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -206,3 +206,49 @@ void ib_free_cq(struct ib_cq *cq)
>   	WARN_ON_ONCE(ret);
>   }
>   EXPORT_SYMBOL(ib_free_cq);
> +
> +struct ib_stop_cqe {
> +	struct ib_cqe	cqe;
> +	struct completion done;
> +};
> +
> +static void ib_stop_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ib_stop_cqe *stop =
> +		container_of(wc->wr_cqe, struct ib_stop_cqe, cqe);
> +
> +	complete(&stop->done);
> +}
> +
> +/*
> + * Change a queue pair into the error state and wait until all receive
> + * completions have been processed before destroying it. This avoids that
> + * the receive completion handler can access the queue pair while it is
> + * being destroyed.
> + */
> +void ib_drain_qp(struct ib_qp *qp)
> +{
> +	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> +	struct ib_stop_cqe stop = { };
> +	struct ib_recv_wr wr, *bad_wr;
> +	int ret;
> +
> +	wr.wr_cqe = &stop.cqe;
> +	stop.cqe.done = ib_stop_done;
> +	init_completion(&stop.done);
> +
> +	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> +	if (ret) {
> +		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> +		return;
> +	}
> +
> +	ret = ib_post_recv(qp, &wr, &bad_wr);
> +	if (ret) {
> +		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> +		return;
> +	}
> +
> +	wait_for_completion(&stop.done);
> +}
> +EXPORT_SYMBOL(ib_drain_qp);

This won't work with iWARP QPs.  Once the QP is in the ERROR state,
post_send/post_recv can return a synchronous error instead of completing
asynchronously via the CQ.  The IB spec explicitly states that posts made
while in ERROR will be completed with "flushed" status via the CQ.

From http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-6.2.4:

    *   At some point in the execution of the flushing operation, the RI
        MUST begin to return an Immediate Error for any attempt to post
        a WR to a Work Queue; prior to that point, any WQEs posted to a
        Work Queue MUST be enqueued and then flushed as described above
        (e.g. The PostSQ is done in Non-Privileged Mode and the Non-
        Privileged Mode portion of the RI has not yet been informed that
        the QP is in the Error state).


Also, pending send work requests can be completed with "flushed" status,
so I would think we need to do something similar for send WRs.  We
definitely see this with cxgb4 in the presence of unsignaled WRs that
aren't followed by a signaled WR at the time the QP is moved out of RTS.
The driver has no way to know whether these pending unsignaled WRs
completed or not, so it completes them with "flushed" status.

So how can we do this for iWARP?  It seems like all that might be needed
is to modify the QP state to IDLE, retrying until it succeeds:

    If the QP is transitioning to the Error state, or has not yet
    finished flushing the Work Queues, a Modify QP request to transition
    to the IDLE state MUST fail with an Immediate Error. If none of the
    prior conditions are true, a Modify QP to the Idle state MUST take
    the QP to the Idle state. No other state transitions out of Error
    are supported. Any attempt to transition the QP to a state other
    than Idle MUST result in an Immediate Error.
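
For illustration, a rough (untested) sketch of such a retry loop; the helper
name is made up, which ib_qp_state value the iWARP IDLE state maps to is
left open here, and the 10ms backoff is an arbitrary choice:

static void ib_modify_qp_retry(struct ib_qp *qp, enum ib_qp_state state)
{
	struct ib_qp_attr attr = { .qp_state = state };

	/*
	 * Per the draft quoted above, the modify MUST fail with an
	 * immediate error until the flush has finished, so keep
	 * retrying until the provider accepts the transition.
	 */
	while (ib_modify_qp(qp, &attr, IB_QP_STATE))
		msleep(10);
}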


Steve.

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index e11e038..f59a8d3 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -3075,4 +3075,6 @@ int ib_sg_to_pages(struct ib_mr *mr,
>   		   int sg_nents,
>   		   int (*set_page)(struct ib_mr *, u64));
>   
> +void ib_drain_qp(struct ib_qp *qp);
> +
>   #endif /* IB_VERBS_H */

--
Christoph Hellwig Nov. 14, 2015, 7:05 a.m. UTC | #2
On Fri, Nov 13, 2015 at 10:16:04AM -0600, Steve Wise wrote:
> So how can we do this for iwarp?  It seems like all that might be needed is 
> to modify the QP state to idle, retrying until it succeeds:
>
>    If the QP is transitioning to the Error state, or has not yet
>    finished flushing the Work Queues, a Modify QP request to transition
>    to the IDLE state MUST fail with an Immediate Error. If none of the
>    prior conditions are true, a Modify QP to the Idle state MUST take
>    the QP to the Idle state. No other state transitions out of Error
>    are supported. Any attempt to transition the QP to a state other
>    than Idle MUST result in an Immediate Error.

Can you try to write up some code for this?  We could then wire it up
in the common helper.
--
Sagi Grimberg Nov. 15, 2015, 9:34 a.m. UTC | #3
> +
> +struct ib_stop_cqe {
> +	struct ib_cqe	cqe;
> +	struct completion done;
> +};
> +
> +static void ib_stop_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ib_stop_cqe *stop =
> +		container_of(wc->wr_cqe, struct ib_stop_cqe, cqe);
> +
> +	complete(&stop->done);
> +}
> +
> +/*
> + * Change a queue pair into the error state and wait until all receive
> + * completions have been processed before destroying it. This avoids that
> + * the receive completion handler can access the queue pair while it is
> + * being destroyed.
> + */
> +void ib_drain_qp(struct ib_qp *qp)
> +{
> +	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> +	struct ib_stop_cqe stop = { };
> +	struct ib_recv_wr wr, *bad_wr;
> +	int ret;
> +
> +	wr.wr_cqe = &stop.cqe;
> +	stop.cqe.done = ib_stop_done;
> +	init_completion(&stop.done);
> +
> +	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> +	if (ret) {
> +		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> +		return;
> +	}
> +
> +	ret = ib_post_recv(qp, &wr, &bad_wr);
> +	if (ret) {
> +		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> +		return;
> +	}
> +
> +	wait_for_completion(&stop.done);
> +}

This is taken from srp. srp drains using a recv wr because of a race that
caused a use-after-free condition: srp re-posts a recv buffer from its recv
completion handler, which could still run while the QP was being destroyed.
srp does not really care if there are pending send flushes.

I'm not sure if there are ordering rules between the send and recv queues
in terms of flush completions, meaning that even after all recv flushes
have been consumed, send flushes may still be pending.

I think that for a general drain helper it would be useful to
make sure that both the recv _and_ send flushes were drained.

So, something like:

void ib_drain_qp(struct ib_qp *qp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	struct ib_stop_cqe rstop, sstop;
	struct ib_recv_wr rwr = {}, *bad_rwr;
	struct ib_send_wr swr = {}, *bad_swr;
	int ret;

	rwr.wr_cqe = &rstop.cqe;
	rstop.cqe.done = ib_stop_done;
	init_completion(&rstop.done);

	swr.wr_cqe = &sstop.cqe;
	sstop.cqe.done = ib_stop_done;
	init_completion(&sstop.done);

	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret) {
		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
		return;
	}

	ret = ib_post_recv(qp, &rwr, &bad_rwr);
	if (ret) {
		WARN_ONCE(ret, "failed to drain recv queue: %d\n", ret);
		return;
	}

	ret = ib_post_send(qp, &swr, &bad_swr);
	if (ret) {
		WARN_ONCE(ret, "failed to drain send queue: %d\n", ret);
		return;
	}

	wait_for_completion(&rstop.done);
	wait_for_completion(&sstop.done);
}

Thoughts?
--
Steve Wise Nov. 16, 2015, 4:38 p.m. UTC | #4
On 11/15/2015 3:34 AM, Sagi Grimberg wrote:
>
>> +
>> +struct ib_stop_cqe {
>> +    struct ib_cqe    cqe;
>> +    struct completion done;
>> +};
>> +
>> +static void ib_stop_done(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +    struct ib_stop_cqe *stop =
>> +        container_of(wc->wr_cqe, struct ib_stop_cqe, cqe);
>> +
>> +    complete(&stop->done);
>> +}
>> +
>> +/*
>> + * Change a queue pair into the error state and wait until all receive
>> + * completions have been processed before destroying it. This avoids 
>> that
>> + * the receive completion handler can access the queue pair while it is
>> + * being destroyed.
>> + */
>> +void ib_drain_qp(struct ib_qp *qp)
>> +{
>> +    struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
>> +    struct ib_stop_cqe stop = { };
>> +    struct ib_recv_wr wr, *bad_wr;
>> +    int ret;
>> +
>> +    wr.wr_cqe = &stop.cqe;
>> +    stop.cqe.done = ib_stop_done;
>> +    init_completion(&stop.done);
>> +
>> +    ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>> +    if (ret) {
>> +        WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
>> +        return;
>> +    }
>> +
>> +    ret = ib_post_recv(qp, &wr, &bad_wr);
>> +    if (ret) {
>> +        WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
>> +        return;
>> +    }
>> +
>> +    wait_for_completion(&stop.done);
>> +}
>
> This is taken from srp, and srp drains using a recv wr due to a race
> causing a use-after-free condition in srp which re-posts a recv buffer
> in the recv completion handler. srp does not really care if there are
> pending send flushes.
>
> I'm not sure if there are ordering rules for send/recv queues in
> terms of flush completions, meaning that even if all recv flushes
> were consumed maybe there are send flushes still pending.
>
> I think that for a general drain helper it would be useful to
> make sure that both the recv _and_ send flushes were drained.
>
> So, something like:
>
> void ib_drain_qp(struct ib_qp *qp)
> {
>     struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
>     struct ib_stop_cqe rstop, sstop;
>     struct ib_recv_wr rwr = {}, *bad_rwr;
>     struct ib_send_wr swr = {}, *bad_swr;
>     int ret;
>
>     rwr.wr_cqe = &rstop.cqe;
>     rstop.cqe.done = ib_stop_done;
>     init_completion(&rstop.done);
>
>     swr.wr_cqe = &sstop.cqe;
>     sstop.cqe.done = ib_stop_done;
>     init_completion(&sstop.done);
>
>     ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>     if (ret) {
>         WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
>         return;
>     }
>
>     ret = ib_post_recv(qp, &rwr, &bad_rwr);
>     if (ret) {
>         WARN_ONCE(ret, "failed to drain recv queue: %d\n", ret);
>         return;
>     }
>
>     ret = ib_post_send(qp, &swr, &bad_swr);
>     if (ret) {
>         WARN_ONCE(ret, "failed to drain send queue: %d\n", ret);
>         return;
>     }
>
>     wait_for_completion(&rstop.done);
>     wait_for_completion(&sstop.done);
> }
>
> Thoughts?

This won't work for iWARP as per my previous email.  But I will code 
something up that will.

Steve
--
Steve Wise Nov. 16, 2015, 6:30 p.m. UTC | #5
> -----Original Message-----
> From: Steve Wise [mailto:swise@opengridcomputing.com]
> Sent: Monday, November 16, 2015 10:38 AM
> To: Sagi Grimberg; Christoph Hellwig; linux-rdma@vger.kernel.org
> Cc: bart.vanassche@sandisk.com; axboe@fb.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/9] IB: add a helper to safely drain a QP
> 
> On 11/15/2015 3:34 AM, Sagi Grimberg wrote:
> >
> >> +
> >> +struct ib_stop_cqe {
> >> +    struct ib_cqe    cqe;
> >> +    struct completion done;
> >> +};
> >> +
> >> +static void ib_stop_done(struct ib_cq *cq, struct ib_wc *wc)
> >> +{
> >> +    struct ib_stop_cqe *stop =
> >> +        container_of(wc->wr_cqe, struct ib_stop_cqe, cqe);
> >> +
> >> +    complete(&stop->done);
> >> +}
> >> +
> >> +/*
> >> + * Change a queue pair into the error state and wait until all receive
> >> + * completions have been processed before destroying it. This avoids
> >> that
> >> + * the receive completion handler can access the queue pair while it is
> >> + * being destroyed.
> >> + */
> >> +void ib_drain_qp(struct ib_qp *qp)
> >> +{
> >> +    struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> >> +    struct ib_stop_cqe stop = { };
> >> +    struct ib_recv_wr wr, *bad_wr;
> >> +    int ret;
> >> +
> >> +    wr.wr_cqe = &stop.cqe;
> >> +    stop.cqe.done = ib_stop_done;
> >> +    init_completion(&stop.done);
> >> +
> >> +    ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> >> +    if (ret) {
> >> +        WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> >> +        return;
> >> +    }
> >> +
> >> +    ret = ib_post_recv(qp, &wr, &bad_wr);
> >> +    if (ret) {
> >> +        WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> >> +        return;
> >> +    }
> >> +
> >> +    wait_for_completion(&stop.done);
> >> +}
> >
> > This is taken from srp, and srp drains using a recv wr due to a race
> > causing a use-after-free condition in srp which re-posts a recv buffer
> > in the recv completion handler. srp does not really care if there are
> > pending send flushes.
> >
> > I'm not sure if there are ordering rules for send/recv queues in
> > terms of flush completions, meaning that even if all recv flushes
> > were consumed maybe there are send flushes still pending.
> >
> > I think that for a general drain helper it would be useful to
> > make sure that both the recv _and_ send flushes were drained.
> >
> > So, something like:
> >
> > void ib_drain_qp(struct ib_qp *qp)
> > {
> >     struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> >     struct ib_stop_cqe rstop, sstop;
> >     struct ib_recv_wr rwr = {}, *bad_rwr;
> >     struct ib_send_wr swr = {}, *bad_swr;
> >     int ret;
> >
> >     rwr.wr_cqe = &rstop.cqe;
> >     rstop.cqe.done = ib_stop_done;
> >     init_completion(&rstop.done);
> >
> >     swr.wr_cqe = &sstop.cqe;
> >     sstop.cqe.done = ib_stop_done;
> >     init_completion(&sstop.done);
> >
> >     ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> >     if (ret) {
> >         WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
> >         return;
> >     }
> >
> >     ret = ib_post_recv(qp, &rwr, &bad_rwr);
> >     if (ret) {
> >         WARN_ONCE(ret, "failed to drain recv queue: %d\n", ret);
> >         return;
> >     }
> >
> >     ret = ib_post_send(qp, &swr, &bad_swr);
> >     if (ret) {
> >         WARN_ONCE(ret, "failed to drain send queue: %d\n", ret);
> >         return;
> >     }
> >
> >     wait_for_completion(&rstop.done);
> >     wait_for_completion(&sstop.done);
> > }
> >
> > Thoughts?
> 
> This won't work for iWARP as per my previous email.  But I will code
> something up that will.
> 
> Steve

After looking at the nes driver, I don't see any common way to support drain w/o some serious driver mods.  Since SRP is the only
user, perhaps we can ignore iWARP for this function...

--
Sagi Grimberg Nov. 16, 2015, 6:37 p.m. UTC | #6
> After looking at the nes driver, I don't see any common way to support drain w/o some serious driver mods.  Since SRP is the only
> user, perhaps we can ignore iWARP for this function...

But iser/isert essentially do it too (and I think xprtrdma will have
it soon)...

The modify_qp is invoked from rdma_disconnect(), and we post an 'empty'
wr to wait for all the flushes to drain (see iser_conn_terminate).
--
Steve Wise Nov. 16, 2015, 7:03 p.m. UTC | #7
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Sagi Grimberg
> Sent: Monday, November 16, 2015 12:38 PM
> To: Steve Wise; 'Christoph Hellwig'; linux-rdma@vger.kernel.org
> Cc: bart.vanassche@sandisk.com; axboe@fb.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/9] IB: add a helper to safely drain a QP
> 
> 
> > After looking at the nes driver, I don't see any common way to support drain w/o some serious driver mods.  Since SRP is the only
> > user, perhaps we can ignore iWARP for this function...
> 
> But iser/isert essentially does it too (and I think xprtrdma will have
> it soon)...
> 
> the modify_qp is invoked from rdma_disconnect() and we do post
> an 'empty' wr to wait for all the flushes to drain (see
> iser_conn_terminate).

That won't work for iWARP.  Is this code new?  I didn't see any errors that would result from this code when I tested iSER over
cxgb4 with the old iwarp support patches.   

Perhaps we need another way to do this?  Like a completion object in the QP that gets triggered when the SQ and RQ become empty
after a transition to ERROR (and CLOSING for iWARP), plus a core service that just waits until the QP is empty.  Implementing this
design would touch the providers, though, since only they know when the flush has completed.

Alternatively, I could enable post-while-in-error support in cxgb4 and ignore the spec in this regard.  But I'd rather not do that.
:)

Steve.

--
Sagi Grimberg Nov. 17, 2015, 8:54 a.m. UTC | #8
> That won't work for iWARP.  Is this code new?  I didn't see any errors that would result from this code when I tested iSER over
> cxgb4 with the old iwarp support patches.

It's been there since ~3.17 I think...

>
> Perhaps we need another way to do this?  Like a completion object in the QP that gets triggered when the SQ and RQ become empty
> after a transition to ERROR (and CLOSING for iwarp).  Then a core service that just waits until the QP is empty.  Implementation of
> this design would hit the providers though since only they know when the flush is completed.

ULPs need a drain functionality, so ib_drain_qp() is the way to go...

How about we add a drain_qp() callout and have:

	if (qp->device->drain_qp) {
		qp->device->drain_qp(qp);
		return;
	}

	IB drain qp logic...

This way iWARP devices can have their own magic on how to implement this
functionality.
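
Wired into the helper from this patch it could look roughly like the sketch
below (purely illustrative: ib_device has no such callout today, and
__ib_drain_qp stands in for the existing move-to-error + flush-WR body):

	/* hypothetical addition to struct ib_device */
	void	(*drain_qp)(struct ib_qp *qp);

void ib_drain_qp(struct ib_qp *qp)
{
	/* let providers (e.g. iWARP) supply their own drain method */
	if (qp->device->drain_qp) {
		qp->device->drain_qp(qp);
		return;
	}

	/* otherwise use the generic IB drain logic */
	__ib_drain_qp(qp);
}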

Thoughts?
--
Bart Van Assche Nov. 17, 2015, 5:06 p.m. UTC | #9
On 11/15/2015 01:34 AM, Sagi Grimberg wrote:
> This is taken from srp, and srp drains using a recv wr due to a race
> causing a use-after-free condition in srp which re-posts a recv buffer
> in the recv completion handler.

Hello Sagi,

Would it be possible to clarify this?  Does this refer to an existing
race, or to a race that would only occur if the code were modified?

Thanks,

Bart.
--
Sagi Grimberg Nov. 18, 2015, 7:59 a.m. UTC | #10
On 17/11/2015 19:06, Bart Van Assche wrote:
> On 11/15/2015 01:34 AM, Sagi Grimberg wrote:
>> This is taken from srp, and srp drains using a recv wr due to a race
>> causing a use-after-free condition in srp which re-posts a recv buffer
>> in the recv completion handler.
>
> Hello Sagi,
>
> Would it be possible to clarify this ? Does this refer to an existing
> race or a race that would only occur if the code would be modified ?

I was referring to a bug that srp_destroy_qp() was designed to
address:

commit 7dad6b2e440d810273946b0e7092a8fe043c3b8a
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Tue Oct 21 18:00:35 2014 +0200

     IB/srp: Fix a race condition triggered by destroying a queue pair

     At least LID reassignment can trigger a race condition in the SRP
     initiator driver, namely the receive completion handler trying to
     post a request on a QP during or after QP destruction and before
     the CQ's have been destroyed. Avoid this race by modifying a QP
     into the error state and by waiting until all receive completions
     have been processed before destroying a QP.

     Reported-by: Max Gurtuvoy <maxg@mellanox.com>
     Signed-off-by: Bart Van Assche <bvanassche@acm.org>
     Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
     Signed-off-by: Christoph Hellwig <hch@lst.de>
--
Sagi Grimberg Nov. 18, 2015, 11:32 a.m. UTC | #11
Christoph,

Given the discussion around this patch I think it would
be a good idea to remove it from the patchset since it's not
mandatory for the CQ abstraction. I think that we should
take it up with Steve to come up with a complete solution for
this bit.

Thoughts?
--
Christoph Hellwig Nov. 18, 2015, 2:06 p.m. UTC | #12
On Wed, Nov 18, 2015 at 01:32:19PM +0200, Sagi Grimberg wrote:
> Christoph,
>
> Given the discussion around this patch I think it would
> be a good idea remove it from the patchset since it's not
> mandatory for the CQ abstraction. I think that we should
> take it with Steve to come up with a complete solution for
> this bit.
>
> Thoughts?

Yes, let's drop it for now.
--
Steve Wise Nov. 18, 2015, 3:21 p.m. UTC | #13
On 11/18/2015 8:06 AM, Christoph Hellwig wrote:
> On Wed, Nov 18, 2015 at 01:32:19PM +0200, Sagi Grimberg wrote:
>> Christoph,
>>
>> Given the discussion around this patch I think it would
>> be a good idea remove it from the patchset since it's not
>> mandatory for the CQ abstraction. I think that we should
>> take it with Steve to come up with a complete solution for
>> this bit.
>>
>> Thoughts?
> Yes, let's drop it for now.
>

Fine with me.

--
Sagi Grimberg Nov. 23, 2015, 10:28 a.m. UTC | #14
> That won't work for iWARP.  Is this code new?  I didn't see any errors that would result from this code when I tested iSER over
> cxgb4 with the old iwarp support patches.

Steve,

I think I figured out why this works with iWARP.

For iWARP, rdma_disconnect() calls iw_cm_disconnect() with abrupt=0,
which makes iw_cm_disconnect() move the QP into the SQ_DRAIN state:

int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
{
	...

         if (qp) {
                 if (abrupt)
                         ret = iwcm_modify_qp_err(qp);
                 else
                         ret = iwcm_modify_qp_sqd(qp);

                 /*
                  * If both sides are disconnecting the QP could
                  * already be in ERR or SQD states
                  */
                 ret = 0;
	}
}

AFAIK, the SQD state allows the ULP to post work requests on the send
queue and expect these work requests to FLUSH.

So maybe we should have:
void ib_drain_qp(struct ib_qp *qp)
{
     struct ib_qp_attr attr = { };
     struct ib_stop_cqe rstop, sstop;
     struct ib_recv_wr rwr = {}, *bad_rwr;
     struct ib_send_wr swr = {}, *bad_swr;
     enum ib_qp_state state;
     int ret;

     if (rdma_cap_ib_cm(id->device, id->port_num))
         state = IB_QPS_ERR;
     else if (rdma_cap_iw_cm(id->device, id->port_num))
         state = IB_QPS_SQD;
     else
         return;

     rwr.wr_cqe = &rstop.cqe;
     rstop.cqe.done = ib_stop_done;
     init_completion(&rstop.done);

     swr.wr_cqe = &sstop.cqe;
     sstop.cqe.done = ib_stop_done;
     swr.send_flags = IB_SEND_SIGNALED;
     init_completion(&sstop.done);

     attr.qp_state = state;
     ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
     if (ret) {
         WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
         return;
     }

     ret = ib_post_recv(qp, &rwr, &bad_rwr);
     if (ret) {
         WARN_ONCE(ret, "failed to drain recv queue: %d\n", ret);
         return;
     }

     ret = ib_post_send(qp, &swr, &bad_swr);
     if (ret) {
         WARN_ONCE(ret, "failed to drain send queue: %d\n", ret);
         return;
     }

     wait_for_completion(&rstop.done);
     wait_for_completion(&sstop.done);
}

Thoughts?
--
Sagi Grimberg Nov. 23, 2015, 10:35 a.m. UTC | #15
> So Maybe we should have:
> void ib_drain_qp(struct ib_qp *qp)

Christoph suggested that this flushing would be taken care
of by rdma_disconnect which sounds even better I think..
--
Christoph Hellwig Nov. 23, 2015, 2:33 p.m. UTC | #16
On Mon, Nov 23, 2015 at 12:35:44PM +0200, Sagi Grimberg wrote:
>
>> So Maybe we should have:
>> void ib_drain_qp(struct ib_qp *qp)
>
> Christoph suggested that this flushing would be taken care
> of by rdma_disconnect which sounds even better I think..

Note that will only work once we've converted all drivers to
the new CQ API under development.  Without that, every completion
handler needs a special case for the drain WC.
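
To illustrate: with the old API, every consumer's poll loop would need to
special-case the drain completion by hand, roughly like this (DRAIN_WRID,
priv and handle_wc are made-up names for this sketch):

	struct ib_wc wc;

	while (ib_poll_cq(cq, 1, &wc) > 0) {
		if (wc.wr_id == DRAIN_WRID) {
			/* the marker WR posted by the drain/disconnect path */
			complete(&priv->drain_done);
			continue;
		}
		handle_wc(priv, &wc);
	}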
--
Steve Wise Nov. 23, 2015, 2:44 p.m. UTC | #17
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Monday, November 23, 2015 4:29 AM
> To: Steve Wise; 'Christoph Hellwig'; linux-rdma@vger.kernel.org
> Cc: bart.vanassche@sandisk.com; axboe@fb.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/9] IB: add a helper to safely drain a QP
> 
> 
> > That won't work for iWARP.  Is this code new?  I didn't see any errors that would result from this code when I tested iSER over
> > cxgb4 with the old iwarp support patches.
> 
> Steve,
> 
> I think I figured out why this works with iWARP.
> 
> For iWARP, rdma_disconnect() calls iw_cm_disconnect() with abrupt=0
> which would make iw_cm_disconnect() move the QP into SQ_DRAIN state"
>

Yes.  Note: SQ_DRAIN == the CLOSING state for iWARP QPs.  The CLOSING state means the transport will try to do an orderly shutdown.
More on this below.
 
> int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
> {
> 	...
> 
>          if (qp) {
>                  if (abrupt)
>                          ret = iwcm_modify_qp_err(qp);
>                  else
>                          ret = iwcm_modify_qp_sqd(qp);
> 
>                  /*
>                   * If both sides are disconnecting the QP could
>                   * already be in ERR or SQD states
>                   */
>                  ret = 0;
> 	}
> }
> 
> IFAIK, SQD state allows the ULP to post work requests on the send
> queue and expect these work requests to FLUSH.
> 

The iWARP QP states unfortunately differ from IB's, and the way iWARP was plugged into the original IB-centric RDMA subsystem makes
this difference not very visible.  Moving an iWARP QP to CLOSING/SQD begins an "orderly close" of the TCP connection, i.e. TCP FIN,
FIN/ACK, ACK.

> So Maybe we should have:
> void ib_drain_qp(struct ib_qp *qp)
> {
>      struct ib_qp_attr attr = { };
>      struct ib_stop_cqe rstop, sstop;
>      struct ib_recv_wr rwr = {}, *bad_rwr;
>      struct ib_send_wr swr = {}, *bad_swr;
>      enum ib_qp_state state;
>      int ret;
> 
>      if rdma_cap_ib_cm(id->device, id->port_num) {
> 	state = IB_QPS_ERR;
>      else if rdma_cap_iw_cm(id->device, id->port_num)
>          state = IB_QPS_SQD;
>      else
>         return;
> 
>      rwr.wr_cqe = &rstop.cqe;
>      rstop.cqe.done = ib_stop_done;
>      init_completion(&rstop.done);
> 
>      swr.wr_cqe = &sstop.cqe;
>      sstop.cqe.done = ib_stop_done;
>      swr.send_flags = IB_SEND_SIGNALED;
>      init_completion(&sstop.done);
> 
>      attr.qp_state = state;
>      ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>      if (ret) {
>          WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
>          return;
>      }
> 
>      ret = ib_post_recv(qp, &rwr, &bad_rwr);
>      if (ret) {
>          WARN_ONCE(ret, "failed to drain recv queue: %d\n", ret);
>          return;
>      }
> 
>      ret = ib_post_send(qp, &swr, &bad_swr);
>      if (ret) {
>          WARN_ONCE(ret, "failed to drain send queue: %d\n", ret);
>          return;
>      }
> 
>      wait_for_completion(&rstop.done);
>      wait_for_completion(&sstop.done);
> }
> 
> Thoughts?

The problem with moving the QP -> CLOSING (aka SQD) is this: per the iWARP Verbs spec, ULPs _must_ quiesce the SQ before moving
it to CLOSING, i.e. make sure there are no outstanding SQ WRs.  So the drain operation really has to be done _before_ the move to
CLOSING/SQD. :(  If there _are_ outstanding SQ WRs when an attempt is made to move the QP to CLOSING, or an ingress RDMA operation
arrives while the QP is in CLOSING (and doing a TCP FIN/FIN-ACK exchange), the QP is immediately moved to ERROR.  Also, no WR posts
are allowed while the QP is in CLOSING, unlike the IB SQD state.

The drain logic that I think needs to be implemented to support iWARP would use one of two methods:

1) As I said before, enhance the ib_qp struct to have a "flush complete" completion object, and change the providers to complete
that object when a) the QP is in ERROR and b) the SQ and RQ become empty (or are already empty).  Then ib_drain_qp() just waits for
this completion (a rough sketch of this appears below).

2) Change the iWARP providers to allow posting WRs while in ERROR.  One way to do this while still supporting the requirement that
"at some point while in error, the provider must synchronously fail posts" is to allow the posts if the SQ or RQ still has pending
WRs, but fail immediately if the SQ or RQ is already empty.  Thus the "drain" WRs issued by iw_drain_qp() would work if they were
needed, and fail immediately if they are not needed.  In either case, the flush operation is complete.
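
For illustration only, a rough, untested sketch of method 1); the field and
helper names are invented here, and each provider would still need its own
logic to detect when both work queues have drained:

struct ib_qp {
	...
	/* completed by the provider once both work queues have emptied */
	struct completion	flush_done;
};

/* core helper: move the QP to error, then wait for the provider's signal */
void ib_drain_qp(struct ib_qp *qp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	int ret;

	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (WARN_ONCE(ret, "failed to move QP to error: %d\n", ret))
		return;

	wait_for_completion(&qp->flush_done);
}

/*
 * Provider side: complete qp->flush_done wherever the provider observes
 * that both queues of an ERROR QP have emptied, including the case where
 * they were already empty at the transition to ERROR.
 */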

I really wish the iWARP spec architects had avoided these sorts of divergences from the IB spec....

Steve.



--
Steve Wise Nov. 23, 2015, 2:48 p.m. UTC | #18
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Sagi Grimberg
> Sent: Monday, November 23, 2015 4:36 AM
> To: Steve Wise; 'Christoph Hellwig'; linux-rdma@vger.kernel.org
> Cc: bart.vanassche@sandisk.com; axboe@fb.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/9] IB: add a helper to safely drain a QP
> 
> 
> > So Maybe we should have:
> > void ib_drain_qp(struct ib_qp *qp)
> 
> Christoph suggested that this flushing would be taken care
> of by rdma_disconnect which sounds even better I think..
> --

Agreed. 

--

Patch

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index d9eb796..bf2a079 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -206,3 +206,49 @@  void ib_free_cq(struct ib_cq *cq)
 	WARN_ON_ONCE(ret);
 }
 EXPORT_SYMBOL(ib_free_cq);
+
+struct ib_stop_cqe {
+	struct ib_cqe	cqe;
+	struct completion done;
+};
+
+static void ib_stop_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ib_stop_cqe *stop =
+		container_of(wc->wr_cqe, struct ib_stop_cqe, cqe);
+
+	complete(&stop->done);
+}
+
+/*
+ * Change a queue pair into the error state and wait until all receive
+ * completions have been processed before destroying it. This avoids that
+ * the receive completion handler can access the queue pair while it is
+ * being destroyed.
+ */
+void ib_drain_qp(struct ib_qp *qp)
+{
+	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
+	struct ib_stop_cqe stop = { };
+	struct ib_recv_wr wr, *bad_wr;
+	int ret;
+
+	wr.wr_cqe = &stop.cqe;
+	stop.cqe.done = ib_stop_done;
+	init_completion(&stop.done);
+
+	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
+	if (ret) {
+		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
+		return;
+	}
+
+	ret = ib_post_recv(qp, &wr, &bad_wr);
+	if (ret) {
+		WARN_ONCE(ret, "failed to drain QP: %d\n", ret);
+		return;
+	}
+
+	wait_for_completion(&stop.done);
+}
+EXPORT_SYMBOL(ib_drain_qp);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e11e038..f59a8d3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3075,4 +3075,6 @@  int ib_sg_to_pages(struct ib_mr *mr,
 		   int sg_nents,
 		   int (*set_page)(struct ib_mr *, u64));
 
+void ib_drain_qp(struct ib_qp *qp);
+
 #endif /* IB_VERBS_H */