diff mbox

[v1,03/18] xprtrdma: Remove completion polling budgets

Message ID 20150917204435.19671.56195.stgit@manet.1015granger.net (mailing list archive)
State New, archived
Headers show

Commit Message

Chuck Lever Sept. 17, 2015, 8:44 p.m. UTC
Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
handler") was supposed to prevent xprtrdma's upcall handlers from
starving other softIRQ work by letting them return to the provider
before all CQEs have been polled.

The logic assumes the provider will call the upcall handler again
immediately if the CQ is re-armed while there are still queued CQEs.

This assumption is invalid. The IBTA spec says that after a CQ is
armed, the hardware must interrupt only when a new CQE is inserted.
xprtrdma can't rely on the provider calling again, even though some
providers do.

Therefore, leaving CQEs on queue makes sense only when there is
another mechanism that ensures all remaining CQEs are consumed in a
timely fashion. xprtrdma does not have such a mechanism. If a CQE
remains queued, the transport can wait forever to send the next RPC.

Finally, move the wcs array back onto the stack to ensure that the
poll array is always local to the CPU where the completion upcall is
running.

Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c     |  100 ++++++++++++++++++---------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    5 --
 2 files changed, 45 insertions(+), 60 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Devesh Sharma Sept. 18, 2015, 6:52 a.m. UTC | #1
On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
> handler") was supposed to prevent xprtrdma's upcall handlers from
> starving other softIRQ work by letting them return to the provider
> before all CQEs have been polled.
>
> The logic assumes the provider will call the upcall handler again
> immediately if the CQ is re-armed while there are still queued CQEs.
>
> This assumption is invalid. The IBTA spec says that after a CQ is
> armed, the hardware must interrupt only when a new CQE is inserted.
> xprtrdma can't rely on the provider calling again, even though some
> providers do.
>
> Therefore, leaving CQEs on queue makes sense only when there is
> another mechanism that ensures all remaining CQEs are consumed in a
> timely fashion. xprtrdma does not have such a mechanism. If a CQE
> remains queued, the transport can wait forever to send the next RPC.
>
> Finally, move the wcs array back onto the stack to ensure that the
> poll array is always local to the CPU where the completion upcall is
> running.
>
> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c     |  100 ++++++++++++++++++---------------------
>  net/sunrpc/xprtrdma/xprt_rdma.h |    5 --
>  2 files changed, 45 insertions(+), 60 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 8a477e2..f2e3863 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>         }
>  }
>
> -static int
> +/* The wc array is on stack: automatic memory is always CPU-local.
> + *
> + * The common case is a single completion is ready. By asking
> + * for two entries, a return code of 1 means there is exactly
> + * one completion and no more. We don't have to poll again to
> + * know that the CQ is now empty.
> + */
> +static void
>  rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>  {
> -       struct ib_wc *wcs;
> -       int budget, count, rc;
> +       struct ib_wc *pos, wcs[2];
> +       int count, rc;
>
> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>         do {
> -               wcs = ep->rep_send_wcs;
> +               pos = wcs;
>
> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
> -               if (rc <= 0)
> -                       return rc;
> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
> +               if (rc < 0)
> +                       goto out_warn;
>
>                 count = rc;
>                 while (count-- > 0)
> -                       rpcrdma_sendcq_process_wc(wcs++);
> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
> -       return 0;
> +                       rpcrdma_sendcq_process_wc(pos++);
> +       } while (rc == ARRAY_SIZE(wcs));

I think I have missed something and am not able to understand the reason
for polling 2 CQEs in one poll. It is possible that in a given poll_cq
call you end up getting only 1 completion, while the other completion is
delayed due to some reason. Would it be better to poll for 1 in every
poll call? Or
otherwise have this
while ( rc <= ARRAY_SIZE(wcs) && rc);

> +       return;
> +
> +out_warn:
> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>  }
>
> -/*
> - * Handle send, fast_reg_mr, and local_inv completions.
> - *
> - * Send events are typically suppressed and thus do not result
> - * in an upcall. Occasionally one is signaled, however. This
> - * prevents the provider's completion queue from wrapping and
> - * losing a completion.
> +/* Handle provider send completion upcalls.
>   */
>  static void
>  rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>         struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>         int rc;
>
> -       rc = rpcrdma_sendcq_poll(cq, ep);
> -       if (rc) {
> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> -                       __func__, rc);
> -               return;
> -       }
> +       rpcrdma_sendcq_poll(cq, ep);
>
>         rc = ib_req_notify_cq(cq,
>                         IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> @@ -247,44 +245,41 @@ out_fail:
>         goto out_schedule;
>  }
>
> -static int
> +/* The wc array is on stack: automatic memory is always CPU-local.
> + *
> + * struct ib_wc is 64 bytes, making the poll array potentially
> + * large. But this is at the bottom of the call chain. Further
> + * substantial work is done in another thread.
> + */
> +static void
>  rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>  {
> -       struct list_head sched_list;
> -       struct ib_wc *wcs;
> -       int budget, count, rc;
> +       struct ib_wc *pos, wcs[4];
> +       LIST_HEAD(sched_list);
> +       int count, rc;
>
> -       INIT_LIST_HEAD(&sched_list);
> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>         do {
> -               wcs = ep->rep_recv_wcs;
> +               pos = wcs;
>
> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
> -               if (rc <= 0)
> -                       goto out_schedule;
> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
> +               if (rc < 0)
> +                       goto out_warn;
>
>                 count = rc;
>                 while (count-- > 0)
> -                       rpcrdma_recvcq_process_wc(wcs++, &sched_list);
> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
> -       rc = 0;
> +                       rpcrdma_recvcq_process_wc(pos++, &sched_list);
> +       } while (rc == ARRAY_SIZE(wcs));
>
>  out_schedule:
>         rpcrdma_schedule_tasklet(&sched_list);
> -       return rc;
> +       return;
> +
> +out_warn:
> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
> +       goto out_schedule;
>  }
>
> -/*
> - * Handle receive completions.
> - *
> - * It is reentrant but processes single events in order to maintain
> - * ordering of receives to keep server credits.
> - *
> - * It is the responsibility of the scheduled tasklet to return
> - * recv buffers to the pool. NOTE: this affects synchronization of
> - * connection shutdown. That is, the structures required for
> - * the completion of the reply handler must remain intact until
> - * all memory has been reclaimed.
> +/* Handle provider receive completion upcalls.
>   */
>  static void
>  rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
> @@ -292,12 +287,7 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>         struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>         int rc;
>
> -       rc = rpcrdma_recvcq_poll(cq, ep);
> -       if (rc) {
> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> -                       __func__, rc);
> -               return;
> -       }
> +       rpcrdma_recvcq_poll(cq, ep);
>
>         rc = ib_req_notify_cq(cq,
>                         IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index c09414e..42c8d44 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -77,9 +77,6 @@ struct rpcrdma_ia {
>   * RDMA Endpoint -- one per transport instance
>   */
>
> -#define RPCRDMA_WC_BUDGET      (128)
> -#define RPCRDMA_POLLSIZE       (16)
> -
>  struct rpcrdma_ep {
>         atomic_t                rep_cqcount;
>         int                     rep_cqinit;
> @@ -89,8 +86,6 @@ struct rpcrdma_ep {
>         struct rdma_conn_param  rep_remote_cma;
>         struct sockaddr_storage rep_remote_addr;
>         struct delayed_work     rep_connect_worker;
> -       struct ib_wc            rep_send_wcs[RPCRDMA_POLLSIZE];
> -       struct ib_wc            rep_recv_wcs[RPCRDMA_POLLSIZE];
>  };
>
>  /*
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Sept. 18, 2015, 2:19 p.m. UTC | #2
Hi Devesh-


On Sep 18, 2015, at 2:52 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:

> On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
>> handler") was supposed to prevent xprtrdma's upcall handlers from
>> starving other softIRQ work by letting them return to the provider
>> before all CQEs have been polled.
>> 
>> The logic assumes the provider will call the upcall handler again
>> immediately if the CQ is re-armed while there are still queued CQEs.
>> 
>> This assumption is invalid. The IBTA spec says that after a CQ is
>> armed, the hardware must interrupt only when a new CQE is inserted.
>> xprtrdma can't rely on the provider calling again, even though some
>> providers do.
>> 
>> Therefore, leaving CQEs on queue makes sense only when there is
>> another mechanism that ensures all remaining CQEs are consumed in a
>> timely fashion. xprtrdma does not have such a mechanism. If a CQE
>> remains queued, the transport can wait forever to send the next RPC.
>> 
>> Finally, move the wcs array back onto the stack to ensure that the
>> poll array is always local to the CPU where the completion upcall is
>> running.
>> 
>> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xprtrdma/verbs.c     |  100 ++++++++++++++++++---------------------
>> net/sunrpc/xprtrdma/xprt_rdma.h |    5 --
>> 2 files changed, 45 insertions(+), 60 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 8a477e2..f2e3863 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>>        }
>> }
>> 
>> -static int
>> +/* The wc array is on stack: automatic memory is always CPU-local.
>> + *
>> + * The common case is a single completion is ready. By asking
>> + * for two entries, a return code of 1 means there is exactly
>> + * one completion and no more. We don't have to poll again to
>> + * know that the CQ is now empty.
>> + */
>> +static void
>> rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>> {
>> -       struct ib_wc *wcs;
>> -       int budget, count, rc;
>> +       struct ib_wc *pos, wcs[2];
>> +       int count, rc;
>> 
>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>        do {
>> -               wcs = ep->rep_send_wcs;
>> +               pos = wcs;
>> 
>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>> -               if (rc <= 0)
>> -                       return rc;
>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>> +               if (rc < 0)
>> +                       goto out_warn;
>> 
>>                count = rc;
>>                while (count-- > 0)
>> -                       rpcrdma_sendcq_process_wc(wcs++);
>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>> -       return 0;
>> +                       rpcrdma_sendcq_process_wc(pos++);
>> +       } while (rc == ARRAY_SIZE(wcs));
> 
> I think I have missed something and not able to understand the reason
> for polling 2 CQEs in one poll?

See the block comment above.

When ib_poll_cq() returns the same number of WCs as the
consumer requested, there may still be CQEs waiting to
be polled. Another call to ib_poll_cq() is needed to find
out if that's the case.

When ib_poll_cq() returns fewer WCs than the consumer
requested, the consumer doesn't have to call again to
know that the CQ is empty.

The common case, by far, is that one CQE is ready. By
requesting two, the number returned is less than the
number requested, and the consumer can tell immediately
that the CQ is drained. The extra ib_poll_cq call is
avoided.

Note that the existing logic also relies on polling
multiple WCs at once, and stopping the loop when the
number of returned WCs is less than the size of the
array.


> It is possible that in a given poll_cq
> call you end up getting on 1 completion, the other completion is
> delayed due to some reason.

If a CQE is allowed to be delayed, how does polling
again guarantee that the consumer can retrieve it?

What happens if a signal occurs, there is only one CQE,
but it is delayed? ib_poll_cq would return 0 in that
case, and the consumer would never call again, thinking
the CQ is empty. There's no way the consumer can know
for sure when a CQ is drained.

If the delayed CQE happens only when there is more
than one CQE, how can polling multiple WCs ever work
reliably?

Maybe I don't understand what is meant by delayed.


> Would it be better to poll for 1 in every
> poll call Or
> otherwise have this
> while ( rc <= ARRAY_SIZE(wcs) && rc);
> 
>> +       return;
>> +
>> +out_warn:
>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>> }
>> 
>> -/*
>> - * Handle send, fast_reg_mr, and local_inv completions.
>> - *
>> - * Send events are typically suppressed and thus do not result
>> - * in an upcall. Occasionally one is signaled, however. This
>> - * prevents the provider's completion queue from wrapping and
>> - * losing a completion.
>> +/* Handle provider send completion upcalls.
>>  */
>> static void
>> rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>        int rc;
>> 
>> -       rc = rpcrdma_sendcq_poll(cq, ep);
>> -       if (rc) {
>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>> -                       __func__, rc);
>> -               return;
>> -       }
>> +       rpcrdma_sendcq_poll(cq, ep);
>> 
>>        rc = ib_req_notify_cq(cq,
>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>> @@ -247,44 +245,41 @@ out_fail:
>>        goto out_schedule;
>> }
>> 
>> -static int
>> +/* The wc array is on stack: automatic memory is always CPU-local.
>> + *
>> + * struct ib_wc is 64 bytes, making the poll array potentially
>> + * large. But this is at the bottom of the call chain. Further
>> + * substantial work is done in another thread.
>> + */
>> +static void
>> rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>> {
>> -       struct list_head sched_list;
>> -       struct ib_wc *wcs;
>> -       int budget, count, rc;
>> +       struct ib_wc *pos, wcs[4];
>> +       LIST_HEAD(sched_list);
>> +       int count, rc;
>> 
>> -       INIT_LIST_HEAD(&sched_list);
>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>        do {
>> -               wcs = ep->rep_recv_wcs;
>> +               pos = wcs;
>> 
>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>> -               if (rc <= 0)
>> -                       goto out_schedule;
>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>> +               if (rc < 0)
>> +                       goto out_warn;
>> 
>>                count = rc;
>>                while (count-- > 0)
>> -                       rpcrdma_recvcq_process_wc(wcs++, &sched_list);
>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>> -       rc = 0;
>> +                       rpcrdma_recvcq_process_wc(pos++, &sched_list);
>> +       } while (rc == ARRAY_SIZE(wcs));
>> 
>> out_schedule:
>>        rpcrdma_schedule_tasklet(&sched_list);
>> -       return rc;
>> +       return;
>> +
>> +out_warn:
>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>> +       goto out_schedule;
>> }
>> 
>> -/*
>> - * Handle receive completions.
>> - *
>> - * It is reentrant but processes single events in order to maintain
>> - * ordering of receives to keep server credits.
>> - *
>> - * It is the responsibility of the scheduled tasklet to return
>> - * recv buffers to the pool. NOTE: this affects synchronization of
>> - * connection shutdown. That is, the structures required for
>> - * the completion of the reply handler must remain intact until
>> - * all memory has been reclaimed.
>> +/* Handle provider receive completion upcalls.
>>  */
>> static void
>> rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>> @@ -292,12 +287,7 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>        int rc;
>> 
>> -       rc = rpcrdma_recvcq_poll(cq, ep);
>> -       if (rc) {
>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>> -                       __func__, rc);
>> -               return;
>> -       }
>> +       rpcrdma_recvcq_poll(cq, ep);
>> 
>>        rc = ib_req_notify_cq(cq,
>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index c09414e..42c8d44 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -77,9 +77,6 @@ struct rpcrdma_ia {
>>  * RDMA Endpoint -- one per transport instance
>>  */
>> 
>> -#define RPCRDMA_WC_BUDGET      (128)
>> -#define RPCRDMA_POLLSIZE       (16)
>> -
>> struct rpcrdma_ep {
>>        atomic_t                rep_cqcount;
>>        int                     rep_cqinit;
>> @@ -89,8 +86,6 @@ struct rpcrdma_ep {
>>        struct rdma_conn_param  rep_remote_cma;
>>        struct sockaddr_storage rep_remote_addr;
>>        struct delayed_work     rep_connect_worker;
>> -       struct ib_wc            rep_send_wcs[RPCRDMA_POLLSIZE];
>> -       struct ib_wc            rep_recv_wcs[RPCRDMA_POLLSIZE];
>> };
>> 
>> /*
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Sept. 20, 2015, 10:35 a.m. UTC | #3
>> It is possible that in a given poll_cq
>> call you end up getting on 1 completion, the other completion is
>> delayed due to some reason.
>
> If a CQE is allowed to be delayed, how does polling
> again guarantee that the consumer can retrieve it?
>
> What happens if a signal occurs, there is only one CQE,
> but it is delayed? ib_poll_cq would return 0 in that
> case, and the consumer would never call again, thinking
> the CQ is empty. There's no way the consumer can know
> for sure when a CQ is drained.
>
> If the delayed CQE happens only when there is more
> than one CQE, how can polling multiple WCs ever work
> reliably?
>
> Maybe I don't understand what is meant by delayed.
>

If I'm not mistaken, Devesh meant that if another CQE was generated
between the ib_poll_cq call (where you polled the last 2 wcs) and the
while statement, then you lose a bit of efficiency. Correct?

>
>> Would it be better to poll for 1 in every
>> poll call Or
>> otherwise have this
>> while ( rc <= ARRAY_SIZE(wcs) && rc);
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma Sept. 21, 2015, 8:51 a.m. UTC | #4
On Fri, Sep 18, 2015 at 7:49 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Devesh-
>
>
> On Sep 18, 2015, at 2:52 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
>
>> On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
>>> handler") was supposed to prevent xprtrdma's upcall handlers from
>>> starving other softIRQ work by letting them return to the provider
>>> before all CQEs have been polled.
>>>
>>> The logic assumes the provider will call the upcall handler again
>>> immediately if the CQ is re-armed while there are still queued CQEs.
>>>
>>> This assumption is invalid. The IBTA spec says that after a CQ is
>>> armed, the hardware must interrupt only when a new CQE is inserted.
>>> xprtrdma can't rely on the provider calling again, even though some
>>> providers do.
>>>
>>> Therefore, leaving CQEs on queue makes sense only when there is
>>> another mechanism that ensures all remaining CQEs are consumed in a
>>> timely fashion. xprtrdma does not have such a mechanism. If a CQE
>>> remains queued, the transport can wait forever to send the next RPC.
>>>
>>> Finally, move the wcs array back onto the stack to ensure that the
>>> poll array is always local to the CPU where the completion upcall is
>>> running.
>>>
>>> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c     |  100 ++++++++++++++++++---------------------
>>> net/sunrpc/xprtrdma/xprt_rdma.h |    5 --
>>> 2 files changed, 45 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 8a477e2..f2e3863 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>>>        }
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * The common case is a single completion is ready. By asking
>>> + * for two entries, a return code of 1 means there is exactly
>>> + * one completion and no more. We don't have to poll again to
>>> + * know that the CQ is now empty.
>>> + */
>>> +static void
>>> rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> -       struct ib_wc *wcs;
>>> -       int budget, count, rc;
>>> +       struct ib_wc *pos, wcs[2];
>>> +       int count, rc;
>>>
>>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>>        do {
>>> -               wcs = ep->rep_send_wcs;
>>> +               pos = wcs;
>>>
>>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> -               if (rc <= 0)
>>> -                       return rc;
>>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> +               if (rc < 0)
>>> +                       goto out_warn;
>>>
>>>                count = rc;
>>>                while (count-- > 0)
>>> -                       rpcrdma_sendcq_process_wc(wcs++);
>>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> -       return 0;
>>> +                       rpcrdma_sendcq_process_wc(pos++);
>>> +       } while (rc == ARRAY_SIZE(wcs));
>>
>> I think I have missed something and not able to understand the reason
>> for polling 2 CQEs in one poll?
>
> See the block comment above.
>
> When ib_poll_cq() returns the same number of WCs as the
> consumer requested, there may still be CQEs waiting to
> be polled. Another call to ib_poll_cq() is needed to find
> out if that's the case.

True...

>
> When ib_poll_cq() returns fewer WCs than the consumer
> requested, the consumer doesn't have to call again to
> know that the CQ is empty.

Agreed, the while loop will terminate here. But what if, immediately
after vendor_poll_cq() decided to report only 1 CQE and terminated the
polling loop, another CQE is added? That new CQE will be polled only
after T usec (where T is the interrupt latency).

>
> The common case, by far, is that one CQE is ready. By
> requesting two, the number returned is less than the
> number requested, and the consumer can tell immediately
> that the CQE is drained. The extra ib_poll_cq call is
> avoided.
>
> Note that the existing logic also relies on polling
> multiple WCs at once, and stopping the loop when the
> number of returned WCs is less than the size of the
> array.

There is also logic to perform extra polling if arm_cq reports missed CQEs.
If we change the while-loop, the arm-cq missed-CQE reporting logic may be removed.

>
>
>> It is possible that in a given poll_cq
>> call you end up getting on 1 completion, the other completion is
>> delayed due to some reason.
>
> If a CQE is allowed to be delayed, how does polling
> again guarantee that the consumer can retrieve it?

It's possible that, at the moment vendor_poll_cq() looked at the CQ
memory buffer and decided to report 1 CQE, another CQE was in flight,
but poll_cq had already decided not to report 2.

>
> What happens if a signal occurs, there is only one CQE,
> but it is delayed? ib_poll_cq would return 0 in that
> case, and the consumer would never call again, thinking
> the CQ is empty. There's no way the consumer can know
> for sure when a CQ is drained.

Hardware usually guarantees to signal only after the CQE is dma'ed.

>
> If the delayed CQE happens only when there is more
> than one CQE, how can polling multiple WCs ever work
> reliably?
>
> Maybe I don't understand what is meant by delayed.
>
>
>> Would it be better to poll for 1 in every
>> poll call Or
>> otherwise have this
>> while ( rc <= ARRAY_SIZE(wcs) && rc);
>>
>>> +       return;
>>> +
>>> +out_warn:
>>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> }
>>>
>>> -/*
>>> - * Handle send, fast_reg_mr, and local_inv completions.
>>> - *
>>> - * Send events are typically suppressed and thus do not result
>>> - * in an upcall. Occasionally one is signaled, however. This
>>> - * prevents the provider's completion queue from wrapping and
>>> - * losing a completion.
>>> +/* Handle provider send completion upcalls.
>>>  */
>>> static void
>>> rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>>        int rc;
>>>
>>> -       rc = rpcrdma_sendcq_poll(cq, ep);
>>> -       if (rc) {
>>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> -                       __func__, rc);
>>> -               return;
>>> -       }
>>> +       rpcrdma_sendcq_poll(cq, ep);
>>>
>>>        rc = ib_req_notify_cq(cq,
>>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> @@ -247,44 +245,41 @@ out_fail:
>>>        goto out_schedule;
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * struct ib_wc is 64 bytes, making the poll array potentially
>>> + * large. But this is at the bottom of the call chain. Further
>>> + * substantial work is done in another thread.
>>> + */
>>> +static void
>>> rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> -       struct list_head sched_list;
>>> -       struct ib_wc *wcs;
>>> -       int budget, count, rc;
>>> +       struct ib_wc *pos, wcs[4];
>>> +       LIST_HEAD(sched_list);
>>> +       int count, rc;
>>>
>>> -       INIT_LIST_HEAD(&sched_list);
>>> -       budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>>        do {
>>> -               wcs = ep->rep_recv_wcs;
>>> +               pos = wcs;
>>>
>>> -               rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> -               if (rc <= 0)
>>> -                       goto out_schedule;
>>> +               rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> +               if (rc < 0)
>>> +                       goto out_warn;
>>>
>>>                count = rc;
>>>                while (count-- > 0)
>>> -                       rpcrdma_recvcq_process_wc(wcs++, &sched_list);
>>> -       } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> -       rc = 0;
>>> +                       rpcrdma_recvcq_process_wc(pos++, &sched_list);
>>> +       } while (rc == ARRAY_SIZE(wcs));
>>>
>>> out_schedule:
>>>        rpcrdma_schedule_tasklet(&sched_list);
>>> -       return rc;
>>> +       return;
>>> +
>>> +out_warn:
>>> +       pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
>>> +       goto out_schedule;
>>> }
>>>
>>> -/*
>>> - * Handle receive completions.
>>> - *
>>> - * It is reentrant but processes single events in order to maintain
>>> - * ordering of receives to keep server credits.
>>> - *
>>> - * It is the responsibility of the scheduled tasklet to return
>>> - * recv buffers to the pool. NOTE: this affects synchronization of
>>> - * connection shutdown. That is, the structures required for
>>> - * the completion of the reply handler must remain intact until
>>> - * all memory has been reclaimed.
>>> +/* Handle provider receive completion upcalls.
>>>  */
>>> static void
>>> rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>> @@ -292,12 +287,7 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>>        struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
>>>        int rc;
>>>
>>> -       rc = rpcrdma_recvcq_poll(cq, ep);
>>> -       if (rc) {
>>> -               dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> -                       __func__, rc);
>>> -               return;
>>> -       }
>>> +       rpcrdma_recvcq_poll(cq, ep);
>>>
>>>        rc = ib_req_notify_cq(cq,
>>>                        IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index c09414e..42c8d44 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -77,9 +77,6 @@ struct rpcrdma_ia {
>>>  * RDMA Endpoint -- one per transport instance
>>>  */
>>>
>>> -#define RPCRDMA_WC_BUDGET      (128)
>>> -#define RPCRDMA_POLLSIZE       (16)
>>> -
>>> struct rpcrdma_ep {
>>>        atomic_t                rep_cqcount;
>>>        int                     rep_cqinit;
>>> @@ -89,8 +86,6 @@ struct rpcrdma_ep {
>>>        struct rdma_conn_param  rep_remote_cma;
>>>        struct sockaddr_storage rep_remote_addr;
>>>        struct delayed_work     rep_connect_worker;
>>> -       struct ib_wc            rep_send_wcs[RPCRDMA_POLLSIZE];
>>> -       struct ib_wc            rep_recv_wcs[RPCRDMA_POLLSIZE];
>>> };
>>>
>>> /*
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma Sept. 21, 2015, 8:51 a.m. UTC | #5
On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>> It is possible that in a given poll_cq
>>> call you end up getting on 1 completion, the other completion is
>>> delayed due to some reason.
>>
>>
>> If a CQE is allowed to be delayed, how does polling
>> again guarantee that the consumer can retrieve it?
>>
>> What happens if a signal occurs, there is only one CQE,
>> but it is delayed? ib_poll_cq would return 0 in that
>> case, and the consumer would never call again, thinking
>> the CQ is empty. There's no way the consumer can know
>> for sure when a CQ is drained.
>>
>> If the delayed CQE happens only when there is more
>> than one CQE, how can polling multiple WCs ever work
>> reliably?
>>
>> Maybe I don't understand what is meant by delayed.
>>
>
> If I'm not mistaken, Devesh meant that if between ib_poll_cq (where you
> polled the last 2 wcs) until the while statement another CQE was
> generated then you lost a bit of efficiency. Correct?

Yes, That's the point.

>
>
>>
>>> Would it be better to poll for 1 in every
>>> poll call Or
>>> otherwise have this
>>> while ( rc <= ARRAY_SIZE(wcs) && rc);
>>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Sept. 21, 2015, 3:45 p.m. UTC | #6
> On Sep 21, 2015, at 1:51 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
> 
> On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>> It is possible that in a given poll_cq
>>>> call you end up getting on 1 completion, the other completion is
>>>> delayed due to some reason.
>>> 
>>> 
>>> If a CQE is allowed to be delayed, how does polling
>>> again guarantee that the consumer can retrieve it?
>>> 
>>> What happens if a signal occurs, there is only one CQE,
>>> but it is delayed? ib_poll_cq would return 0 in that
>>> case, and the consumer would never call again, thinking
>>> the CQ is empty. There's no way the consumer can know
>>> for sure when a CQ is drained.
>>> 
>>> If the delayed CQE happens only when there is more
>>> than one CQE, how can polling multiple WCs ever work
>>> reliably?
>>> 
>>> Maybe I don't understand what is meant by delayed.
>>> 
>> 
>> If I'm not mistaken, Devesh meant that if between ib_poll_cq (where you
>> polled the last 2 wcs) until the while statement another CQE was
>> generated then you lost a bit of efficiency. Correct?
> 
> Yes, That's the point.

I’m optimizing for the common case where 1 CQE is ready
to be polled. How much of an efficiency loss are you
talking about, how often would this loss occur, and is
this a problem for all providers / devices?

Is this an issue for the current arrangement where 8 WCs
are polled at a time?


—
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Devesh Sharma Sept. 22, 2015, 5:32 p.m. UTC | #7
On Mon, Sep 21, 2015 at 9:15 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Sep 21, 2015, at 1:51 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
>>
>> On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>>> It is possible that in a given poll_cq
>>>>> call you end up getting on 1 completion, the other completion is
>>>>> delayed due to some reason.
>>>>
>>>>
>>>> If a CQE is allowed to be delayed, how does polling
>>>> again guarantee that the consumer can retrieve it?
>>>>
>>>> What happens if a signal occurs, there is only one CQE,
>>>> but it is delayed? ib_poll_cq would return 0 in that
>>>> case, and the consumer would never call again, thinking
>>>> the CQ is empty. There's no way the consumer can know
>>>> for sure when a CQ is drained.
>>>>
>>>> If the delayed CQE happens only when there is more
>>>> than one CQE, how can polling multiple WCs ever work
>>>> reliably?
>>>>
>>>> Maybe I don't understand what is meant by delayed.
>>>>
>>>
>>> If I'm not mistaken, Devesh meant that if between ib_poll_cq (where you
>>> polled the last 2 wcs) until the while statement another CQE was
>>> generated then you lost a bit of efficiency. Correct?
>>
>> Yes, That's the point.
>
> I’m optimizing for the common case where 1 CQE is ready
> to be polled. How much of an efficiency loss are you
> talking about, how often would this loss occur, and is
> this a problem for all providers / devices?

Whether the scenario would happen or not is difficult to predict, but it's
quite possible with any vendor, depending on the load on the PCI bus, I guess.
This may affect the latency figures though.

>
> Is this an issue for the current arrangement where 8 WCs
> are polled at a time?

Yes, its there even today.

>
>
> —
> Chuck Lever
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Oct. 1, 2015, 4:37 p.m. UTC | #8
> On Sep 22, 2015, at 1:32 PM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
> 
> On Mon, Sep 21, 2015 at 9:15 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> On Sep 21, 2015, at 1:51 AM, Devesh Sharma <devesh.sharma@avagotech.com> wrote:
>>> 
>>> On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>>>> It is possible that in a given poll_cq
>>>>>> call you end up getting on 1 completion, the other completion is
>>>>>> delayed due to some reason.
>>>>> 
>>>>> 
>>>>> If a CQE is allowed to be delayed, how does polling
>>>>> again guarantee that the consumer can retrieve it?
>>>>> 
>>>>> What happens if a signal occurs, there is only one CQE,
>>>>> but it is delayed? ib_poll_cq would return 0 in that
>>>>> case, and the consumer would never call again, thinking
>>>>> the CQ is empty. There's no way the consumer can know
>>>>> for sure when a CQ is drained.
>>>>> 
>>>>> If the delayed CQE happens only when there is more
>>>>> than one CQE, how can polling multiple WCs ever work
>>>>> reliably?
>>>>> 
>>>>> Maybe I don't understand what is meant by delayed.
>>>>> 
>>>> 
>>>> If I'm not mistaken, Devesh meant that if between ib_poll_cq (where you
>>>> polled the last 2 wcs) until the while statement another CQE was
>>>> generated then you lost a bit of efficiency. Correct?
>>> 
>>> Yes, That's the point.
>> 
>> I’m optimizing for the common case where 1 CQE is ready
>> to be polled. How much of an efficiency loss are you
>> talking about, how often would this loss occur, and is
>> this a problem for all providers / devices?
> 
> The scenario would happen or not is difficult to predict, but its
> quite possible with any vendor based on load on PCI bus I guess.
> This may affect the latency figures though.
> 
>> 
>> Is this an issue for the current arrangement where 8 WCs
>> are polled at a time?
> 
> Yes, its there even today.

This review comment does not feel closed yet. Maybe it’s
because I don’t understand exactly what the issue is.

Is this the problem that REPORT_MISSED_EVENTS is supposed to
resolve? 

A missed WC will result in an RPC/RDMA transport deadlock. In
fact that is the reason for this particular patch (although
it addresses only one source of missed WCs). So I would like
to see that there are no windows here.

I’ve been told the only sure way to address this for every
provider is to use the classic but inefficient mechanism of
poll one WC at a time until no WC is returned; re-arm; poll
again until no WC is returned.

In the common case this means two extra poll_cq calls that
return nothing. So I claim the current status quo isn’t
good enough :-)

Doug and others have suggested the best place to address
problems with missed WC signals is in the drivers. All of
them should live up to the ib_poll_cq() API contract the
same way. In addition I’d really like to see

 - polling and arming work without having to perform extra
    unneeded locking of the CQ, and
 - polling arrays work without introducing races

Can we have that discussion now, since there is already
some discussion of IB core API fix-ups?


—
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 1, 2015, 5:13 p.m. UTC | #9
On Thu, Oct 01, 2015 at 12:37:36PM -0400, Chuck Lever wrote:

> A missed WC will result in an RPC/RDMA transport deadlock. In
> fact that is the reason for this particular patch (although
> it addresses only one source of missed WCs). So I would like
> to see that there are no windows here.

WCs are never missed.

The issue is a race where re-arming the CQ might not work, meaning you
don't get an event.

You can certainly use arrays with poll_cq. There is no race in the API
here.

But you have to use the IB_CQ_REPORT_MISSED_EVENTS scheme to guarantee
the CQ is actually armed or continue to loop again.

Basically you have to loop until ib_req_notify_cq succeeds.

Any driver that doesn't support this is broken; do we know of any?

while (1) {
  struct ib_wc wcs[100];
  int rc = ib_poll_cq(cw, NELEMS(wcs), wcs);

  .. process rc wcs ..

  if (rc != NELEMS(wcs))
    if (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
        IB_CQ_REPORT_MISSED_EVENTS) == 0)
     break;
}

API wise, we should probably look at forcing
IB_CQ_REPORT_MISSED_EVENTS on and dropping the flag.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Oct. 1, 2015, 5:36 p.m. UTC | #10
> On Oct 1, 2015, at 1:13 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Thu, Oct 01, 2015 at 12:37:36PM -0400, Chuck Lever wrote:
> 
>> A missed WC will result in an RPC/RDMA transport deadlock. In
>> fact that is the reason for this particular patch (although
>> it addresses only one source of missed WCs). So I would like
>> to see that there are no windows here.
> 
> WCs are never missed.

The review comment earlier in this thread suggested there is
a race condition where a WC can be “delayed” resulting in,
well, I’m still not certain what the consequences are.

There was some interest in going back to a single WC array
argument when calling ib_poll_cq. I’d like to avoid that.

If there is an issue using a WC array, I’d like to
understand what it is.


> The issue is a race where re-arming the CQ might not work, meaning you
> don't get an event.
> 
> You can certainly use arrays with poll_cq. There is no race in the API
> here.
> 
> But you have to use the IB_CQ_REPORT_MISSED_EVENTS scheme to guarantee
> the CQ is actually armed or continue to loop again.
> 
> Basically you have to loop until ib_req_notify_cq succeeds.
> 
> Any driver that doesn't support this is broken, do we know of any?
> 
> while (1) {
>  struct ib_wc wcs[100];
>  int rc = ib_poll_cq(cw, NELEMS(wcs), wcs);
> 
>  .. process rc wcs ..
> 
>  if (rc != NELEMS(wcs))
>    if (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
>        IB_CQ_REPORT_MISSED_EVENTS) == 0)
>     break;
> }
> 
> API wise, we should probably look at forcing
> IB_CQ_REPORT_MISSED_EVENTS on and dropping the flag.

It’s been suggested that it’s not clear what a positive
return value from ib_req_notify_cq() means when the
REPORT_MISSED_EVENTS flags is set: does it mean that
the CQ has been re-armed? I had assumed that a positive
RC meant both missed events and a successful re-arm,
but the pseudo-code above suggests that is not the
case.


—
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 1, 2015, 6:15 p.m. UTC | #11
On Thu, Oct 01, 2015 at 01:36:26PM -0400, Chuck Lever wrote:

> >> A missed WC will result in an RPC/RDMA transport deadlock. In
> >> fact that is the reason for this particular patch (although
> >> it addresses only one source of missed WCs). So I would like
> >> to see that there are no windows here.
> > 
> > WCs are never missed.
> 
> The review comment earlier in this thread suggested there is
> a race condition where a WC can be “delayed” resulting in,
> well, I’m still not certain what the consequences are.

Yes. The consequence would typically be lockup of CQ processing.

> > while (1) {
> >  struct ib_wc wcs[100];
> >  int rc = ib_poll_cq(cw, NELEMS(wcs), wcs);
> > 
> >  .. process rc wcs ..
> > 
> >  if (rc != NELEMS(wcs))
> >    if (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
> >        IB_CQ_REPORT_MISSED_EVENTS) == 0)
> >     break;
> > }
> > 
> > API wise, we should probably look at forcing
> > IB_CQ_REPORT_MISSED_EVENTS on and dropping the flag.
> 
> It’s been suggested that it’s not clear what a positive
> return value from ib_req_notify_cq() means when the
> REPORT_MISSED_EVENTS flags is set: does it mean that
> the CQ has been re-armed? I had assumed that a positive
> RC meant both missed events and a successful re-arm,
> but the pseudo-code above suggests that is not the
> case.

The ULP must assume the CQ has NOT been armed after a positive return.

What the driver does to the arm state is undefined - for instance the
driver may trigger a callback and still return 1 here.

However, the driver must make this guarantee:

 If ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns 0 then
 the callback will always be called when the CQ is non-empty.

The ULP must loop doing polling until the above happens, otherwise the
event notification may be missed.

I.e., the above is guaranteed to close the WC delay/lockup race.

Again, if there has been confusion on the driver side, drivers that
don't implement the above are broken.

Review Roland's original commit comments on this feature.

 https://github.com/jgunthorpe/linux/commit/ed23a72778f3dbd465e55b06fe31629e7e1dd2f3

I'm not sure where we are at on the 'significant overhead for some
low-level drivers' issue, but assuming that is still the case, then
the recommendation is this:

bool exiting = false;
while (1) {
   struct ib_wc wcs[100];
   int rc = ib_poll_cq(cq, NELEMS(wcs), wcs);
   if (rc == 0 && exiting)
        break;

   .. process rc wcs ..

   if (rc != NELEMS(wcs)) {
        ib_req_notify_cq(cq, IB_CQ_NEXT_COMP)
	exiting = true;
   } else
	exiting = false;
}

ie a double poll.

AFAIK, this is a common pattern in the ULPs. Perhaps we should
implement this as a core API:

struct ib_wc wcs[100];
while ((rc = ib_poll_cq_and_arm(cq, NELEMS(wcs), wcs)) != 0) {
   .. process rc wcs  ..

 ib_poll_cq_and_arm reads wcs off the CQ. If it returns 0 then the
 callback is guaranteed to happen when the CQ is non-empty.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Oct. 1, 2015, 6:31 p.m. UTC | #12
> On Oct 1, 2015, at 2:15 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Thu, Oct 01, 2015 at 01:36:26PM -0400, Chuck Lever wrote:
> 
>>>> A missed WC will result in an RPC/RDMA transport deadlock. In
>>>> fact that is the reason for this particular patch (although
>>>> it addresses only one source of missed WCs). So I would like
>>>> to see that there are no windows here.
>>> 
>>> WCs are never missed.
>> 
>> The review comment earlier in this thread suggested there is
>> a race condition where a WC can be “delayed” resulting in,
>> well, I’m still not certain what the consequences are.
> 
> Yes. The consequence would typically be lockup of CQ processing.
> 
>>> while (1) {
>>> struct ib_wc wcs[100];
>>> int rc = ib_poll_cq(cw, NELEMS(wcs), wcs);
>>> 
>>> .. process rc wcs ..
>>> 
>>> if (rc != NELEMS(wcs))
>>>   if (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
>>>       IB_CQ_REPORT_MISSED_EVENTS) == 0)
>>>    break;
>>> }
>>> 
>>> API wise, we should probably look at forcing
>>> IB_CQ_REPORT_MISSED_EVENTS on and dropping the flag.
>> 
>> It’s been suggested that it’s not clear what a positive
>> return value from ib_req_notify_cq() means when the
>> REPORT_MISSED_EVENTS flags is set: does it mean that
>> the CQ has been re-armed? I had assumed that a positive
>> RC meant both missed events and a successful re-arm,
>> but the pseudo-code above suggests that is not the
>> case.
> 
> The ULP must assume the CQ has NOT been armed after a positive return.

OK, I will fix this when I revise 03/18.


> What the driver does to the arm state is undefined - for instance the
> driver may trigger a callback and still return 1 here.
> 
> However, the driver must make this guarentee:
> 
> If ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns 0 then
> the call back will always be called when the CQ is non-empty.
> 
> The ULP must loop doing polling until the above happens, otherwise the
> event notification may be missed.
> 
> ie the above is guarnteed to close the WC delay/lockup race.
> 
> Again, if there has been confusion on the driver side, drivers that
> don't implement the above are broken.
> 
> Review Roland's original commit comments on this feature.
> 
> https://github.com/jgunthorpe/linux/commit/ed23a72778f3dbd465e55b06fe31629e7e1dd2f3
> 
> I'm not sure where we are at on the 'significant overhead for some
> low-level drivers' issue, but assuming that is still the case, then
> the recommendation is this:
> 
> bool exiting = false;
> while (1) {
>   struct ib_wc wcs[100];
>   int rc = ib_poll_cq(cq, NELEMS(wcs), wcs);
>   if (rc == 0 && exiting)
>        break;
> 
>   .. process rc wcs ..
> 
>   if (rc != NELEMS(wcs)) {
>        ib_req_notify_cq(cq, IB_CQ_NEXT_COMP)
> 	exiting = true;
>   } else
> 	exiting = false;
> }
> 
> ie a double poll.
> AFAIK, this is a common pattern in the ULPs.. Perhaps we should
> implement this as a core API:
> 
> struct ib_wc wcs[100];
> while ((rc = ib_poll_cq_and_arm(cq, NELEMS(wcs), wcs)) != 0) {
>   .. process rc wcs  ..
> 
> ib_poll_cq_and_arm reads wcs off the CQ. If it returns 0 then the
> callback is guarenteed to happen when the CQ is non empty.

This makes sense to me, especially if it means fewer
times grabbing and releasing the CQ’s spinlock.

Does a ULP want to continue polling if ib_poll_cq{_and_arm)
returns a negative RC?


—
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Oct. 1, 2015, 6:38 p.m. UTC | #13
On Thu, Oct 01, 2015 at 02:31:41PM -0400, Chuck Lever wrote:

> Does a ULP want to continue polling if ib_poll_cq{_and_arm)
> returns a negative RC?

No. We should try and figure out if that can even happen; if not, get
rid of the possibility.

If it can happen, it can only mean the CQ is busted, and needs to be
destroyed. Ie the whole ULP probably needs to restart.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 8a477e2..f2e3863 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -158,34 +158,37 @@  rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 	}
 }
 
-static int
+/* The wc array is on stack: automatic memory is always CPU-local.
+ *
+ * The common case is a single completion is ready. By asking
+ * for two entries, a return code of 1 means there is exactly
+ * one completion and no more. We don't have to poll again to
+ * know that the CQ is now empty.
+ */
+static void
 rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 {
-	struct ib_wc *wcs;
-	int budget, count, rc;
+	struct ib_wc *pos, wcs[2];
+	int count, rc;
 
-	budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
 	do {
-		wcs = ep->rep_send_wcs;
+		pos = wcs;
 
-		rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
-		if (rc <= 0)
-			return rc;
+		rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
+		if (rc < 0)
+			goto out_warn;
 
 		count = rc;
 		while (count-- > 0)
-			rpcrdma_sendcq_process_wc(wcs++);
-	} while (rc == RPCRDMA_POLLSIZE && --budget);
-	return 0;
+			rpcrdma_sendcq_process_wc(pos++);
+	} while (rc == ARRAY_SIZE(wcs));
+	return;
+
+out_warn:
+	pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
 }
 
-/*
- * Handle send, fast_reg_mr, and local_inv completions.
- *
- * Send events are typically suppressed and thus do not result
- * in an upcall. Occasionally one is signaled, however. This
- * prevents the provider's completion queue from wrapping and
- * losing a completion.
+/* Handle provider send completion upcalls.
  */
 static void
 rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
@@ -193,12 +196,7 @@  rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
 	struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
 	int rc;
 
-	rc = rpcrdma_sendcq_poll(cq, ep);
-	if (rc) {
-		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
-			__func__, rc);
-		return;
-	}
+	rpcrdma_sendcq_poll(cq, ep);
 
 	rc = ib_req_notify_cq(cq,
 			IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
@@ -247,44 +245,41 @@  out_fail:
 	goto out_schedule;
 }
 
-static int
+/* The wc array is on stack: automatic memory is always CPU-local.
+ *
+ * struct ib_wc is 64 bytes, making the poll array potentially
+ * large. But this is at the bottom of the call chain. Further
+ * substantial work is done in another thread.
+ */
+static void
 rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 {
-	struct list_head sched_list;
-	struct ib_wc *wcs;
-	int budget, count, rc;
+	struct ib_wc *pos, wcs[4];
+	LIST_HEAD(sched_list);
+	int count, rc;
 
-	INIT_LIST_HEAD(&sched_list);
-	budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
 	do {
-		wcs = ep->rep_recv_wcs;
+		pos = wcs;
 
-		rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
-		if (rc <= 0)
-			goto out_schedule;
+		rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
+		if (rc < 0)
+			goto out_warn;
 
 		count = rc;
 		while (count-- > 0)
-			rpcrdma_recvcq_process_wc(wcs++, &sched_list);
-	} while (rc == RPCRDMA_POLLSIZE && --budget);
-	rc = 0;
+			rpcrdma_recvcq_process_wc(pos++, &sched_list);
+	} while (rc == ARRAY_SIZE(wcs));
 
 out_schedule:
 	rpcrdma_schedule_tasklet(&sched_list);
-	return rc;
+	return;
+
+out_warn:
+	pr_warn("RPC:       %s: ib_poll_cq() failed %i\n", __func__, rc);
+	goto out_schedule;
 }
 
-/*
- * Handle receive completions.
- *
- * It is reentrant but processes single events in order to maintain
- * ordering of receives to keep server credits.
- *
- * It is the responsibility of the scheduled tasklet to return
- * recv buffers to the pool. NOTE: this affects synchronization of
- * connection shutdown. That is, the structures required for
- * the completion of the reply handler must remain intact until
- * all memory has been reclaimed.
+/* Handle provider receive completion upcalls.
  */
 static void
 rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
@@ -292,12 +287,7 @@  rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
 	struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
 	int rc;
 
-	rc = rpcrdma_recvcq_poll(cq, ep);
-	if (rc) {
-		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
-			__func__, rc);
-		return;
-	}
+	rpcrdma_recvcq_poll(cq, ep);
 
 	rc = ib_req_notify_cq(cq,
 			IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c09414e..42c8d44 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -77,9 +77,6 @@  struct rpcrdma_ia {
  * RDMA Endpoint -- one per transport instance
  */
 
-#define RPCRDMA_WC_BUDGET	(128)
-#define RPCRDMA_POLLSIZE	(16)
-
 struct rpcrdma_ep {
 	atomic_t		rep_cqcount;
 	int			rep_cqinit;
@@ -89,8 +86,6 @@  struct rpcrdma_ep {
 	struct rdma_conn_param	rep_remote_cma;
 	struct sockaddr_storage	rep_remote_addr;
 	struct delayed_work	rep_connect_worker;
-	struct ib_wc		rep_send_wcs[RPCRDMA_POLLSIZE];
-	struct ib_wc		rep_recv_wcs[RPCRDMA_POLLSIZE];
 };
 
 /*