
NFS: enable nconnect for RDMA

Message ID 20240228213523.35819-1-trondmy@kernel.org
State New
Series NFS: enable nconnect for RDMA

Commit Message

Trond Myklebust Feb. 28, 2024, 9:35 p.m. UTC
From: Trond Myklebust <trond.myklebust@hammerspace.com>

It appears that in certain cases, RDMA capable transports can benefit
from the ability to establish multiple connections to increase their
throughput. This patch therefore enables the use of the "nconnect" mount
option for those use cases.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3client.c | 1 +
 fs/nfs/nfs4client.c | 2 ++
 2 files changed, 3 insertions(+)

Comments

Chuck Lever March 3, 2024, 6:35 p.m. UTC | #1
On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> It appears that in certain cases, RDMA capable transports can benefit
> from the ability to establish multiple connections to increase their
> throughput. This patch therefore enables the use of the "nconnect" mount
> option for those use cases.
> 
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

No objection to this patch.

You don't mention here if you have root-caused the throughput issue.
One thing I've noticed is that contention for the transport's
queue_lock is holding back the RPC/RDMA Receive completion handler,
which is single-threaded per transport.

A way to mitigate this would be to replace the recv_queue
R-B tree with a data structure that can perform a lookup while
holding only the RCU read lock. I have experimented with using an
xarray for the recv_queue, and saw improvement.
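
An untested sketch of what the lookup side might look like (the
recv_xa field and the kref in rpc_rqst are hypothetical here, just
to illustrate the idea, not what I actually experimented with):

	struct rpc_rqst *xprt_lookup_rqst_rcu(struct rpc_xprt *xprt, __be32 xid)
	{
		struct rpc_rqst *req;

		rcu_read_lock();
		/* xa_load() is safe under only the RCU read lock;
		 * no need to take xprt->queue_lock here. */
		req = xa_load(&xprt->recv_xa, (unsigned long)be32_to_cpu(xid));
		/* pin the request before leaving the RCU section */
		if (req && !kref_get_unless_zero(&req->rq_kref))
			req = NULL;
		rcu_read_unlock();
		return req;
	}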

The workload on that data structure is different for TCP versus
RDMA, though: on RDMA, the number of RPCs in flight is significantly
smaller. For tens of thousands of RPCs in flight, an xarray might
not scale well. The newer Maple tree or rose bush hash, or maybe a
more classic data structure like rhashtable, might handle this
workload better.

It's also worth considering deleting each RPC from the recv_queue
in a less performance-sensitive context, for example, xprt_release,
rather than in xprt_complete_rqst.
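
The removal would then happen off the hot path, roughly like this
(again assuming the hypothetical recv_xa field above):

	/* Sketch only: called from xprt_release() in process context,
	 * after the RPC has completed, so the erase stays off the
	 * Receive completion hot path.
	 */
	static void xprt_request_dequeue_receive(struct rpc_rqst *req)
	{
		struct rpc_xprt *xprt = req->rq_xprt;

		xa_erase(&xprt->recv_xa, (unsigned long)be32_to_cpu(req->rq_xid));
	}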


> ---
>  fs/nfs/nfs3client.c | 1 +
>  fs/nfs/nfs4client.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
> index 674c012868b1..b0c8a39c2bbd 100644
> --- a/fs/nfs/nfs3client.c
> +++ b/fs/nfs/nfs3client.c
> @@ -111,6 +111,7 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
>  	cl_init.hostname = buf;
>  
>  	switch (ds_proto) {
> +	case XPRT_TRANSPORT_RDMA:
>  	case XPRT_TRANSPORT_TCP:
>  	case XPRT_TRANSPORT_TCP_TLS:
>  		if (mds_clp->cl_nconnect > 1)
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 11e3a285594c..84573df5cf5a 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -924,6 +924,7 @@ static int nfs4_set_client(struct nfs_server *server,
>  	else
>  		cl_init.max_connect = max_connect;
>  	switch (proto) {
> +	case XPRT_TRANSPORT_RDMA:
>  	case XPRT_TRANSPORT_TCP:
>  	case XPRT_TRANSPORT_TCP_TLS:
>  		cl_init.nconnect = nconnect;
> @@ -1000,6 +1001,7 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
>  	cl_init.hostname = buf;
>  
>  	switch (ds_proto) {
> +	case XPRT_TRANSPORT_RDMA:
>  	case XPRT_TRANSPORT_TCP:
>  	case XPRT_TRANSPORT_TCP_TLS:
>  		if (mds_clp->cl_nconnect > 1) {
> -- 
> 2.44.0
> 
>
Olga Kornievskaia March 4, 2024, 7:01 p.m. UTC | #2
On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > It appears that in certain cases, RDMA capable transports can benefit
> > from the ability to establish multiple connections to increase their
> > throughput. This patch therefore enables the use of the "nconnect" mount
> > option for those use cases.
> >
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> No objection to this patch.
>
> You don't mention here if you have root-caused the throughput issue.
> One thing I've noticed is that contention for the transport's
> queue_lock is holding back the RPC/RDMA Receive completion handler,
> which is single-threaded per transport.

Curious: how is a queue_lock per transport a problem for
nconnect? nconnect would create its own transport, wouldn't it, and so
it would have its own queue_lock (per nconnect).

> A way to mitigate this would be to replace the recv_queue
> R-B tree with a data structure that can perform a lookup while
> holding only the RCU read lock. I have experimented with using an
> xarray for the recv_queue, and saw improvement.
>
> The workload on that data structure is different for TCP versus
> RDMA, though: on RDMA, the number of RPCs in flight is significantly
> smaller. For tens of thousands of RPCs in flight, an xarray might
> not scale well. The newer Maple tree or rose bush hash, or maybe a
> more classic data structure like rhashtable, might handle this
> workload better.
>
> It's also worth considering deleting each RPC from the recv_queue
> in a less performance-sensitive context, for example, xprt_release,
> rather than in xprt_complete_rqst.
>
>
> > ---
> >  fs/nfs/nfs3client.c | 1 +
> >  fs/nfs/nfs4client.c | 2 ++
> >  2 files changed, 3 insertions(+)
> >
> > diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
> > index 674c012868b1..b0c8a39c2bbd 100644
> > --- a/fs/nfs/nfs3client.c
> > +++ b/fs/nfs/nfs3client.c
> > @@ -111,6 +111,7 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
> >       cl_init.hostname = buf;
> >
> >       switch (ds_proto) {
> > +     case XPRT_TRANSPORT_RDMA:
> >       case XPRT_TRANSPORT_TCP:
> >       case XPRT_TRANSPORT_TCP_TLS:
> >               if (mds_clp->cl_nconnect > 1)
> > diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> > index 11e3a285594c..84573df5cf5a 100644
> > --- a/fs/nfs/nfs4client.c
> > +++ b/fs/nfs/nfs4client.c
> > @@ -924,6 +924,7 @@ static int nfs4_set_client(struct nfs_server *server,
> >       else
> >               cl_init.max_connect = max_connect;
> >       switch (proto) {
> > +     case XPRT_TRANSPORT_RDMA:
> >       case XPRT_TRANSPORT_TCP:
> >       case XPRT_TRANSPORT_TCP_TLS:
> >               cl_init.nconnect = nconnect;
> > @@ -1000,6 +1001,7 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
> >       cl_init.hostname = buf;
> >
> >       switch (ds_proto) {
> > +     case XPRT_TRANSPORT_RDMA:
> >       case XPRT_TRANSPORT_TCP:
> >       case XPRT_TRANSPORT_TCP_TLS:
> >               if (mds_clp->cl_nconnect > 1) {
> > --
> > 2.44.0
> >
> >
>
> --
> Chuck Lever
>
Chuck Lever March 4, 2024, 7:32 p.m. UTC | #3
> On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>> 
>>> It appears that in certain cases, RDMA capable transports can benefit
>>> from the ability to establish multiple connections to increase their
>>> throughput. This patch therefore enables the use of the "nconnect" mount
>>> option for those use cases.
>>> 
>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> 
>> No objection to this patch.
>> 
>> You don't mention here if you have root-caused the throughput issue.
>> One thing I've noticed is that contention for the transport's
>> queue_lock is holding back the RPC/RDMA Receive completion handler,
>> which is single-threaded per transport.
> 
> Curious: how is a queue_lock per transport a problem for
> nconnect? nconnect would create its own transport, wouldn't it, and so
> it would have its own queue_lock (per nconnect).

I did not mean to imply that queue_lock contention is a
problem for nconnect or would increase when there are
multiple transports.

But there is definitely lock contention between the send and
receive code paths, and that could be one source of the relief
that Trond saw by adding more transports. IMO that contention
should be addressed at some point.

I'm not asking for a change to the proposed patch. But I am
suggesting some possible future work.


>> A way to mitigate this would be to replace the recv_queue
>> R-B tree with a data structure that can perform a lookup while
>> holding only the RCU read lock. I have experimented with using an
>> xarray for the recv_queue, and saw improvement.
>> 
>> The workload on that data structure is different for TCP versus
>> RDMA, though: on RDMA, the number of RPCs in flight is significantly
>> smaller. For tens of thousands of RPCs in flight, an xarray might
>> not scale well. The newer Maple tree or rose bush hash, or maybe a
>> more classic data structure like rhashtable, might handle this
>> workload better.
>> 
>> It's also worth considering deleting each RPC from the recv_queue
>> in a less performance-sensitive context, for example, xprt_release,
>> rather than in xprt_complete_rqst.


--
Chuck Lever
Olga Kornievskaia March 4, 2024, 8:25 p.m. UTC | #4
On Mon, Mar 4, 2024 at 2:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> >>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >>>
> >>> It appears that in certain cases, RDMA capable transports can benefit
> >>> from the ability to establish multiple connections to increase their
> >>> throughput. This patch therefore enables the use of the "nconnect" mount
> >>> option for those use cases.
> >>>
> >>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> >>
> >> No objection to this patch.
> >>
> >> You don't mention here if you have root-caused the throughput issue.
> >> One thing I've noticed is that contention for the transport's
> >> queue_lock is holding back the RPC/RDMA Receive completion handler,
> >> which is single-threaded per transport.
> >
> > Curious: how is a queue_lock per transport a problem for
> > nconnect? nconnect would create its own transport, wouldn't it, and so
> > it would have its own queue_lock (per nconnect).
>
> I did not mean to imply that queue_lock contention is a
> problem for nconnect or would increase when there are
> multiple transports.

Got it. I wanted to make sure I didn't misunderstand something. You are
stating that even for a single transport there could be improvements
made to make sending and receiving more independent of each other.


> But there is definitely lock contention between the send and
> receive code paths, and that could be one source of the relief
> that Trond saw by adding more transports. IMO that contention
> should be addressed at some point.
>
> I'm not asking for a change to the proposed patch. But I am
> suggesting some possible future work.
>
>
> >> A way to mitigate this would be to replace the recv_queue
> >> R-B tree with a data structure that can perform a lookup while
> >> holding only the RCU read lock. I have experimented with using an
> >> xarray for the recv_queue, and saw improvement.
> >>
> >> The workload on that data structure is different for TCP versus
> >> RDMA, though: on RDMA, the number of RPCs in flight is significantly
> >> smaller. For tens of thousands of RPCs in flight, an xarray might
> >> not scale well. The newer Maple tree or rose bush hash, or maybe a
> >> more classic data structure like rhashtable, might handle this
> >> workload better.
> >>
> >> It's also worth considering deleting each RPC from the recv_queue
> >> in a less performance-sensitive context, for example, xprt_release,
> >> rather than in xprt_complete_rqst.
>
>
> --
> Chuck Lever
>
>
Trond Myklebust March 4, 2024, 11:08 p.m. UTC | #5
On Mon, 2024-03-04 at 19:32 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu>
> > wrote:
> > 
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com>
> > wrote:
> > > 
> > > On Wed, Feb 28, 2024 at 04:35:23PM -0500,
> > > trondmy@kernel.org wrote:
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > 
> > > > It appears that in certain cases, RDMA capable transports can
> > > > benefit
> > > > from the ability to establish multiple connections to increase
> > > > their
> > > > throughput. This patch therefore enables the use of the
> > > > "nconnect" mount
> > > > option for those use cases.
> > > > 
> > > > Signed-off-by: Trond Myklebust
> > > > <trond.myklebust@hammerspace.com>
> > > 
> > > No objection to this patch.
> > > 
> > > You don't mention here if you have root-caused the throughput
> > > issue.
> > > One thing I've noticed is that contention for the transport's
> > > queue_lock is holding back the RPC/RDMA Receive completion
> > > handler,
> > > which is single-threaded per transport.
> > 
> > Curious: how is a queue_lock per transport a problem for
> > nconnect? nconnect would create its own transport, wouldn't it, and
> > so it would have its own queue_lock (per nconnect).
> 
> I did not mean to imply that queue_lock contention is a
> problem for nconnect or would increase when there are
> multiple transports.
> 
> But there is definitely lock contention between the send and
> receive code paths, and that could be one source of the relief
> that Trond saw by adding more transports. IMO that contention
> should be addressed at some point.
> 
> I'm not asking for a change to the proposed patch. But I am
> suggesting some possible future work.
> 

We were comparing NFS/RDMA performance to that of NFS/TCP, and it was
clear that the nconnect value was giving the latter a major boost. Once
we enabled nconnect for the RDMA channel, the values evened out a lot
more.

Once we fixed the nconnect issue, what we were seeing when the RDMA
code maxed out was actually that the CPU got pegged running the IB
completion work queues on writes.

We can certainly look into improving the performance of
xprt_lookup_rqst() if we have evidence that it is slow, but I'm not yet
sure that was what we were seeing.
Chuck Lever March 5, 2024, 2:02 p.m. UTC | #6
On Mon, Mar 04, 2024 at 11:08:00PM +0000, Trond Myklebust wrote:
> On Mon, 2024-03-04 at 19:32 +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu>
> > > wrote:
> > > 
> > > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com>
> > > wrote:
> > > > 
> > > > On Wed, Feb 28, 2024 at 04:35:23PM -0500,
> > > > trondmy@kernel.org wrote:
> > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > > 
> > > > > It appears that in certain cases, RDMA capable transports can
> > > > > benefit
> > > > > from the ability to establish multiple connections to increase
> > > > > their
> > > > > throughput. This patch therefore enables the use of the
> > > > > "nconnect" mount
> > > > > option for those use cases.
> > > > > 
> > > > > Signed-off-by: Trond Myklebust
> > > > > <trond.myklebust@hammerspace.com>
> > > > 
> > > > No objection to this patch.
> > > > 
> > > > You don't mention here if you have root-caused the throughput
> > > > issue.
> > > > One thing I've noticed is that contention for the transport's
> > > > queue_lock is holding back the RPC/RDMA Receive completion
> > > > handler,
> > > > which is single-threaded per transport.
> > > 
> > > Curious: how does a queue_lock per transport is a problem for
> > > nconnect? nconnect would create its own transport, would it now and
> > > so
> > > it would have its own queue_lock (per nconnect).
> > 
> > I did not mean to imply that queue_lock contention is a
> > problem for nconnect or would increase when there are
> > multiple transports.
> > 
> > But there is definitely lock contention between the send and
> > receive code paths, and that could be one source of the relief
> > that Trond saw by adding more transports. IMO that contention
> > should be addressed at some point.
> > 
> > I'm not asking for a change to the proposed patch. But I am
> > suggesting some possible future work.
> > 
> 
> We were comparing NFS/RDMA performance to that of NFS/TCP, and it was
> clear that the nconnect value was giving the latter a major boost. Once
> we enabled nconnect for the RDMA channel, the values evened out a lot
> more.
> 
> Once we fixed the nconnect issue, what we were seeing when the RDMA
> code maxed out was actually that the CPU got pegged running the IB
> completion work queues on writes.
> 
> We can certainly look into improving the performance of
> xprt_lookup_rqst() if we have evidence that it is slow, but I'm not yet
> sure that was what we were seeing.

One observation: the Receive completion handler doesn't do anything
that is CPU-intensive. If ib_comp_wq is hot, that's an indication of
lock contention.

I've found there are typically two contended locks when handling
RPC/RDMA Receive completions:

 - The workqueue pool lock. Tejun mitigated that issue in v6.7.

 - The queue_lock, as described above.

A flame graph might narrow the issue.

Patch

diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index 674c012868b1..b0c8a39c2bbd 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -111,6 +111,7 @@  struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
 	cl_init.hostname = buf;
 
 	switch (ds_proto) {
+	case XPRT_TRANSPORT_RDMA:
 	case XPRT_TRANSPORT_TCP:
 	case XPRT_TRANSPORT_TCP_TLS:
 		if (mds_clp->cl_nconnect > 1)
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 11e3a285594c..84573df5cf5a 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -924,6 +924,7 @@  static int nfs4_set_client(struct nfs_server *server,
 	else
 		cl_init.max_connect = max_connect;
 	switch (proto) {
+	case XPRT_TRANSPORT_RDMA:
 	case XPRT_TRANSPORT_TCP:
 	case XPRT_TRANSPORT_TCP_TLS:
 		cl_init.nconnect = nconnect;
@@ -1000,6 +1001,7 @@  struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
 	cl_init.hostname = buf;
 
 	switch (ds_proto) {
+	case XPRT_TRANSPORT_RDMA:
 	case XPRT_TRANSPORT_TCP:
 	case XPRT_TRANSPORT_TCP_TLS:
 		if (mds_clp->cl_nconnect > 1) {