[v1,00/11] NFS/RDMA client side connection overhaul

Message ID: 20200221214906.2072.32572.stgit@manet.1015granger.net
Chuck Lever Feb. 21, 2020, 10 p.m. UTC
Howdy.

I've had reports (and personal experience) where the Linux NFS/RDMA
client waits for a very long time after a disruption of the network
or NFS server.

There is a disconnect time wait in the Connection Manager which
blocks the RPC/RDMA transport from tearing down a connection for a
few minutes when the remote cannot respond to DREQ messages.

An RPC/RDMA transport has only one slot for connection state, so the
transport is prevented from establishing a fresh connection until
the time wait completes.

This patch series refactors the connection endpoint data structures
to enable one active and multiple zombie connections. Now, while a
defunct connection is waiting to die, it is separated from the
transport, clearing the way for the immediate creation of a new
connection. Clean-up of the old connection's data structures and
resources then completes in the background.

Well, that's the idea, anyway. Review and comments welcome. Hoping
this can be merged in v5.7.

---

Chuck Lever (11):
      xprtrdma: Invoke rpcrdma_ep_create() in the connect worker
      xprtrdma: Refactor frwr_init_mr()
      xprtrdma: Clean up the post_send path
      xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()
      xprtrdma: Allocate Protection Domain in rpcrdma_ep_create()
      xprtrdma: Invoke rpcrdma_ia_open in the connect worker
      xprtrdma: Remove rpcrdma_ia::ri_flags
      xprtrdma: Disconnect on flushed completion
      xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_ep
      xprtrdma: Extract sockaddr from struct rdma_cm_id
      xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt


 include/trace/events/rpcrdma.h    |   97 ++---
 net/sunrpc/xprtrdma/backchannel.c |    8 
 net/sunrpc/xprtrdma/frwr_ops.c    |  152 ++++----
 net/sunrpc/xprtrdma/rpc_rdma.c    |   32 +-
 net/sunrpc/xprtrdma/transport.c   |   72 +---
 net/sunrpc/xprtrdma/verbs.c       |  681 ++++++++++++++-----------------------
 net/sunrpc/xprtrdma/xprt_rdma.h   |   89 ++---
 7 files changed, 445 insertions(+), 686 deletions(-)

--
Chuck Lever

Comments

Tom Talpey March 1, 2020, 6:09 p.m. UTC | #1
On 2/21/2020 2:00 PM, Chuck Lever wrote:
> Howdy.
> 
> I've had reports (and personal experience) where the Linux NFS/RDMA
> client waits for a very long time after a disruption of the network
> or NFS server.
> 
> There is a disconnect time wait in the Connection Manager which
> blocks the RPC/RDMA transport from tearing down a connection for a
> few minutes when the remote cannot respond to DREQ messages.

This seems really unfortunate. Why such a long wait in the RDMA layer?
I can see a backoff, to prevent connection attempt flooding, but a
constant "few minute" pause is a very blunt instrument.

> An RPC/RDMA transport has only one slot for connection state, so the
> transport is prevented from establishing a fresh connection until
> the time wait completes.
> 
> This patch series refactors the connection end point data structures
> to enable one active and multiple zombie connections. Now, while a
> defunct connection is waiting to die, it is separated from the
> transport, clearing the way for the immediate creation of a new
> connection. Clean-up of the old connection's data structures and
> resources then completes in the background.

This is a good idea in any case. It separates the layers, and leads
to better connection establishment throughput.

Does the RPCRDMA layer ensure it backs off if connection retries
fail? Or are you depending on the NFS upper layer for this?

Tom.

Chuck Lever March 1, 2020, 6:12 p.m. UTC | #2
> On Mar 1, 2020, at 1:09 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 2/21/2020 2:00 PM, Chuck Lever wrote:
>> Howdy.
>> I've had reports (and personal experience) where the Linux NFS/RDMA
>> client waits for a very long time after a disruption of the network
>> or NFS server.
>> There is a disconnect time wait in the Connection Manager which
>> blocks the RPC/RDMA transport from tearing down a connection for a
>> few minutes when the remote cannot respond to DREQ messages.
> 
> This seems really unfortunate. Why such a long wait in the RDMA layer?
> I can see a backoff, to prevent connection attempt flooding, but a
> constant "few minute" pause is a very blunt instrument.

The last clause here is the operative conundrum: "when the remote
cannot respond". That should be pretty rare, but it's frequent
enough to be bothersome in some environments.

As to why the time wait is so long, I don't know the answer to that.


>> An RPC/RDMA transport has only one slot for connection state, so the
>> transport is prevented from establishing a fresh connection until
>> the time wait completes.
>> This patch series refactors the connection end point data structures
>> to enable one active and multiple zombie connections. Now, while a
>> defunct connection is waiting to die, it is separated from the
>> transport, clearing the way for the immediate creation of a new
>> connection. Clean-up of the old connection's data structures and
>> resources then completes in the background.
> 
> This is a good idea in any case. It separates the layers, and leads
> to better connection establishment throughput.
> 
> Does the RPCRDMA layer ensure it backs off, if connection retries
> fail? Or are you depending on the NFS upper layer for this.

There is a complicated back-off scheme that is modeled on the TCP
connection back-off logic.



--
Chuck Lever
Chuck Lever March 11, 2020, 3:27 p.m. UTC | #3
Hi Anna, I don't recall receiving any comments that require modifying
this series. Do you want me to resend it for the next merge window?


Schumaker, Anna March 11, 2020, 5:16 p.m. UTC | #4
Hi Chuck,

On Wed, 2020-03-11 at 11:27 -0400, Chuck Lever wrote:
> Hi Anna, I don't recall receiving any comments that require modifying
> this series. Do you want me to resend it for the next merge window?

If there haven't been any changes, then I'll just use the version you've already
posted. No need to resend.

Thanks for checking!
Anna
