
[v1,2/8] xprtrdma: Do not post Receives after disconnect

Message ID 161721937122.515226.14731175629421422152.stgit@manet.1015granger.net (mailing list archive)
State New
Series xprtrdma Receive Queue fixes

Commit Message

Chuck Lever III March 31, 2021, 7:36 p.m. UTC
Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.

This patch is probably not sufficient to fully close that window,
but does significantly reduce the opportunity for a crash to
occur without incurring undue performance overhead.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    7 +++++++
 1 file changed, 7 insertions(+)

Comments

Tom Talpey March 31, 2021, 7:59 p.m. UTC | #1
On 3/31/2021 3:36 PM, Chuck Lever wrote:
> Currently the Receive completion handler refreshes the Receive Queue
> whenever a successful Receive completion occurs.
> 
> On disconnect, xprtrdma drains the Receive Queue. The first few
> Receive completions after a disconnect are typically successful,
> until the first flushed Receive.
> 
> This means the Receive completion handler continues to post more
> Receive WRs after the drain sentinel has been posted. The late-
> posted Receives flush after the drain sentinel has completed,
> leading to a crash later in rpcrdma_xprt_disconnect().
> 
> To prevent this crash, xprtrdma has to ensure that the Receive
> handler stops posting Receives before ib_drain_rq() posts its
> drain sentinel.
> 
> This patch is probably not sufficient to fully close that window,

"Probably" is not a word I'd like to use in a stable:cc...

> but does significantly reduce the opportunity for a crash to
> occur without incurring undue performance overhead.
> 
> Cc: stable@vger.kernel.org # v5.7
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   net/sunrpc/xprtrdma/verbs.c |    7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index ec912cf9c618..1d88685badbe 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>   {
>   	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>   	struct rpcrdma_ep *ep = r_xprt->rx_ep;
> +	struct ib_qp_init_attr init_attr;
>   	struct ib_recv_wr *wr, *bad_wr;
>   	struct rpcrdma_rep *rep;
> +	struct ib_qp_attr attr;
>   	int needed, count, rc;
>   
>   	rc = 0;
> @@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>   	if (!temp)
>   		needed += RPCRDMA_MAX_RECV_BATCH;
>   
> +	if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
> +		goto out;

This call isn't completely cheap.

> +	if (attr.qp_state == IB_QPS_ERR)
> +		goto out;

But the QP is free to disconnect or go to error right now. This approach
just reduces the timing hole. Is it not possible to mark the WRs as
being part of a batch, and allowing them to flush? You could borrow a
bit in the completion cookie, and check it when the CQE pops out. Maybe.

> +
>   	/* fast path: all needed reps can be found on the free list */
>   	wr = NULL;
>   	while (needed) {
> 
> 
>
Chuck Lever III March 31, 2021, 8:31 p.m. UTC | #2
On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 3/31/2021 3:36 PM, Chuck Lever wrote:
> > Currently the Receive completion handler refreshes the Receive Queue
> > whenever a successful Receive completion occurs.
> >
> > On disconnect, xprtrdma drains the Receive Queue. The first few
> > Receive completions after a disconnect are typically successful,
> > until the first flushed Receive.
> >
> > This means the Receive completion handler continues to post more
> > Receive WRs after the drain sentinel has been posted. The late-
> > posted Receives flush after the drain sentinel has completed,
> > leading to a crash later in rpcrdma_xprt_disconnect().
> >
> > To prevent this crash, xprtrdma has to ensure that the Receive
> > handler stops posting Receives before ib_drain_rq() posts its
> > drain sentinel.
> >
> > This patch is probably not sufficient to fully close that window,
>
> "Probably" is not a word I'd like to use in a stable:cc...

Well, I could be easily convinced to remove the Cc: stable
for this one, since it's not a full fix. But this is a pretty pervasive
problem with disconnect handling.


> > but does significantly reduce the opportunity for a crash to
> > occur without incurring undue performance overhead.
> >
> > Cc: stable@vger.kernel.org # v5.7
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >   net/sunrpc/xprtrdma/verbs.c |    7 +++++++
> >   1 file changed, 7 insertions(+)
> >
> > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> > index ec912cf9c618..1d88685badbe 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> >   {
> >       struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> >       struct rpcrdma_ep *ep = r_xprt->rx_ep;
> > +     struct ib_qp_init_attr init_attr;
> >       struct ib_recv_wr *wr, *bad_wr;
> >       struct rpcrdma_rep *rep;
> > +     struct ib_qp_attr attr;
> >       int needed, count, rc;
> >
> >       rc = 0;
> > @@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> >       if (!temp)
> >               needed += RPCRDMA_MAX_RECV_BATCH;
> >
> > +     if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
> > +             goto out;
>
> This call isn't completely cheap.

True, but it's done only once every 7 Receive completions.

The other option is to use re_connect_status, and add some memory
barriers to ensure we get the latest value. That doesn't help us get the
race window closed any further, though.

> > +     if (attr.qp_state == IB_QPS_ERR)
> > +             goto out;
>
> But the QP is free to disconnect or go to error right now. This approach
> just reduces the timing hole.

Agreed 100%. I just couldn't think of a better approach. I'm definitely open
to better ideas.

> Is it not possible to mark the WRs as
> being part of a batch, and allowing them to flush? You could borrow a
> bit in the completion cookie, and check it when the CQE pops out. Maybe.

It's not an issue with batching, it's an issue with posting Receives from the
Receive completion handler. I'd think that any of the ULPs that post Receives
in their completion handler would have the same issue.

The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
are no more WRs in flight so that the hardware resources can be safely
destroyed. If the Receive completion handler continues to post Receive WRs
after the drain sentinel has been posted, leaks and crashes become possible.


> > +
> >       /* fast path: all needed reps can be found on the free list */
> >       wr = NULL;
> >       while (needed) {
> >
> >
> >
Tom Talpey March 31, 2021, 9:22 p.m. UTC | #3
On 3/31/2021 4:31 PM, Chuck Lever wrote:
> On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <tom@talpey.com> wrote:
>>
>> On 3/31/2021 3:36 PM, Chuck Lever wrote:
>>> Currently the Receive completion handler refreshes the Receive Queue
>>> whenever a successful Receive completion occurs.
>>>
>>> On disconnect, xprtrdma drains the Receive Queue. The first few
>>> Receive completions after a disconnect are typically successful,
>>> until the first flushed Receive.
snip
>> Is it not possible to mark the WRs as
>> being part of a batch, and allowing them to flush? You could borrow a
>> bit in the completion cookie, and check it when the CQE pops out. Maybe.
> 
> It's not an issue with batching, it's an issue with posting Receives from the
> Receive completion handler. I'd think that any of the ULPs that post Receives
> in their completion handler would have the same issue.
> 
> The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
> are no more WRs in flight so that the hardware resources can be safely
> destroyed. If the Receive completion handler continues to post Receive WRs
> after the drain sentinel has been posted, leaks and crashes become possible.
Well, why not do an atomic_set() of a flag just before posting the
sentinel, and check it with atomic_get() before any other RQ post?


Tom.
Chuck Lever April 1, 2021, 4:56 p.m. UTC | #4
On Wed, Mar 31, 2021 at 5:22 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 3/31/2021 4:31 PM, Chuck Lever wrote:
> > On Wed, Mar 31, 2021 at 4:01 PM Tom Talpey <tom@talpey.com> wrote:
> >>
> >> On 3/31/2021 3:36 PM, Chuck Lever wrote:
> >>> Currently the Receive completion handler refreshes the Receive Queue
> >>> whenever a successful Receive completion occurs.
> >>>
> >>> On disconnect, xprtrdma drains the Receive Queue. The first few
> >>> Receive completions after a disconnect are typically successful,
> >>> until the first flushed Receive.
> snip
> >> Is it not possible to mark the WRs as
> >> being part of a batch, and allowing them to flush? You could borrow a
> >> bit in the completion cookie, and check it when the CQE pops out. Maybe.
> >
> > It's not an issue with batching, it's an issue with posting Receives from the
> > Receive completion handler. I'd think that any of the ULPs that post Receives
> > in their completion handler would have the same issue.
> >
> > The purpose of the QP drain in rpcrdma_xprt_disconnect() is to ensure there
> > are no more WRs in flight so that the hardware resources can be safely
> > destroyed. If the Receive completion handler continues to post Receive WRs
> > after the drain sentinel has been posted, leaks and crashes become possible.

> Well, why not do an atomic_set() of a flag just before posting the
> sentinel, and check it with atomic_get() before any other RQ post?

After a couple of private exchanges, Tom and I agree that doing an
ib_query_qp() in rpcrdma_post_recvs() has some drawbacks and
does not fully close the race window.

There is nothing that prevents rpcrdma_xprt_disconnect() from starting
the RQ drain while rpcrdma_post_recvs is still running. There needs to
be serialization between ib_drain_rq() and rpcrdma_post_recvs() so
that the drain sentinel is guaranteed to be the final Receive WR posted
on the RQ.

It would be great if the core ib_drain_rq() API itself could handle the
exclusion.

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ec912cf9c618..1d88685badbe 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1371,8 +1371,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
 {
 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
 	struct rpcrdma_ep *ep = r_xprt->rx_ep;
+	struct ib_qp_init_attr init_attr;
 	struct ib_recv_wr *wr, *bad_wr;
 	struct rpcrdma_rep *rep;
+	struct ib_qp_attr attr;
 	int needed, count, rc;
 
 	rc = 0;
@@ -1385,6 +1387,11 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
 	if (!temp)
 		needed += RPCRDMA_MAX_RECV_BATCH;
 
+	if (ib_query_qp(ep->re_id->qp, &attr, IB_QP_STATE, &init_attr))
+		goto out;
+	if (attr.qp_state == IB_QPS_ERR)
+		goto out;
+
 	/* fast path: all needed reps can be found on the free list */
 	wr = NULL;
 	while (needed) {