
[v1,13/16] NFS: Add sidecar RPC client support

Message ID 20141016194000.13414.83844.stgit@manet.1015granger.net (mailing list archive)
State New, archived

Commit Message

Chuck Lever Oct. 16, 2014, 7:40 p.m. UTC
So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/client.c           |    1 +
 fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nfs_fs_sb.h |    2 ++
 3 files changed, 57 insertions(+)



Comments

Schumaker, Anna Oct. 20, 2014, 5:33 p.m. UTC | #1
On 10/16/14 15:40, Chuck Lever wrote:
> So far, TCP is the only transport that supports bi-directional RPC.
>
> When mounting with NFSv4.1 using a transport that does not support
> bi-directional RPC, establish a TCP sidecar connection to handle
> backchannel traffic for a session. The sidecar transport does not
> use its forward channel except for sending BIND_CONN_TO_SESSION
> operations.
>
> This commit adds logic to create and destroy the sidecar transport.
> Subsequent commits add logic to use the transport.

I thought NFS v4.0 also uses a separate connection for the backchannel?  Can any of that code be reused here, rather than creating new sidecar structures?

Anna

>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfs/client.c           |    1 +
>  fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/nfs_fs_sb.h |    2 ++
>  3 files changed, 57 insertions(+)
>
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 6a4f366..19f49bf 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
>  	.stats			= &nfs_rpcstat,
>  	.pipe_dir_name		= NFS_PIPE_DIRNAME,
>  };
> +EXPORT_SYMBOL_GPL(nfs_program);
>  
>  struct rpc_stat nfs_rpcstat = {
>  	.program		= &nfs_program
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 5f4b818..b1cc35e 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
>  {
>  	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
>  		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
> +	if (clp->cl_bc_rpcclient)
> +		rpc_shutdown_client(clp->cl_bc_rpcclient);
>  }
>  
>  static void nfs4_shutdown_client(struct nfs_client *clp)
> @@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
>  
>  #if defined(CONFIG_NFS_V4_1)
>  
> +/*
> + * Create a separate rpc_clnt using TCP that can provide a
> + * backchannel service.
> + */
> +static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
> +{
> +	struct sockaddr_storage address;
> +	struct sockaddr *sap = (struct sockaddr *)&address;
> +	struct rpc_create_args args = {
> +		.net		= clp->cl_net,
> +		.protocol	= XPRT_TRANSPORT_TCP,
> +		.address	= sap,
> +		.addrsize	= clp->cl_addrlen,
> +		.servername	= clp->cl_hostname,
> +		.program	= &nfs_program,
> +		.version	= clp->rpc_ops->version,
> +		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
> +				  RPC_CLNT_CREATE_NOPING),
> +	};
> +	struct rpc_clnt *clnt;
> +	struct rpc_cred *cred;
> +
> +	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
> +		return 0;
> +
> +	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
> +		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
> +	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
> +	rpc_set_port(sap, NFS_PORT);
> +	cred = nfs4_get_clid_cred(clp);
> +	if (cred) {
> +		args.authflavor = cred->cr_auth->au_flavor;
> +		put_rpccred(cred);
> +	} else
> +		args.authflavor = RPC_AUTH_UNIX;
> +
> +	clnt = rpc_create(&args);
> +	if (IS_ERR(clnt)) {
> +		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
> +			__func__, PTR_ERR(clnt));
> +		return PTR_ERR(clnt);
> +	}
> +
> +	clp->cl_bc_rpcclient = clnt;
> +	return 0;
> +}
> +
>  /**
>   * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
>   * @clp - nfs_client to initialize
> @@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
>  int nfs41_init_client(struct nfs_client *clp)
>  {
>  	struct nfs4_session *session = NULL;
> +	int ret;
> +
> +	ret = nfs41_create_sidecar_rpc_client(clp);
> +	if (ret)
> +		return ret;
>  
>  	/*
>  	 * Create the session and mark it expired.
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 922be2e..159d703 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -87,6 +87,8 @@ struct nfs_client {
>  
>  	/* The sequence id to use for the next CREATE_SESSION */
>  	u32			cl_seqid;
> +	/* The optional sidecar backchannel transport */
> +	struct rpc_clnt		*cl_bc_rpcclient;
>  	/* The flags used for obtaining the clientid during EXCHANGE_ID */
>  	u32			cl_exchange_flags;
>  	struct nfs4_session	*cl_session;	/* shared session */
>

Chuck Lever Oct. 20, 2014, 6:09 p.m. UTC | #2
On Oct 20, 2014, at 1:33 PM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:

> On 10/16/14 15:40, Chuck Lever wrote:
>> So far, TCP is the only transport that supports bi-directional RPC.
>> 
>> When mounting with NFSv4.1 using a transport that does not support
>> bi-directional RPC, establish a TCP sidecar connection to handle
>> backchannel traffic for a session. The sidecar transport does not
>> use its forward channel except for sending BIND_CONN_TO_SESSION
>> operations.
>> 
>> This commit adds logic to create and destroy the sidecar transport.
>> Subsequent commits add logic to use the transport.
> 
> I thought NFS v4.0 also uses a separate connection for the backchannel?

For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.

> Can any of that code be reused here, rather than creating new sidecar structures?

Since it’s the client opening a connection in this case,
I don’t see any obvious common code paths that I haven’t
already re-used. I’m open to suggestions.

> Anna
> 
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfs/client.c           |    1 +
>> fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/nfs_fs_sb.h |    2 ++
>> 3 files changed, 57 insertions(+)
>> 
>> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
>> index 6a4f366..19f49bf 100644
>> --- a/fs/nfs/client.c
>> +++ b/fs/nfs/client.c
>> @@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
>> 	.stats			= &nfs_rpcstat,
>> 	.pipe_dir_name		= NFS_PIPE_DIRNAME,
>> };
>> +EXPORT_SYMBOL_GPL(nfs_program);
>> 
>> struct rpc_stat nfs_rpcstat = {
>> 	.program		= &nfs_program
>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>> index 5f4b818..b1cc35e 100644
>> --- a/fs/nfs/nfs4client.c
>> +++ b/fs/nfs/nfs4client.c
>> @@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
>> {
>> 	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
>> 		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
>> +	if (clp->cl_bc_rpcclient)
>> +		rpc_shutdown_client(clp->cl_bc_rpcclient);
>> }
>> 
>> static void nfs4_shutdown_client(struct nfs_client *clp)
>> @@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
>> 
>> #if defined(CONFIG_NFS_V4_1)
>> 
>> +/*
>> + * Create a separate rpc_clnt using TCP that can provide a
>> + * backchannel service.
>> + */
>> +static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
>> +{
>> +	struct sockaddr_storage address;
>> +	struct sockaddr *sap = (struct sockaddr *)&address;
>> +	struct rpc_create_args args = {
>> +		.net		= clp->cl_net,
>> +		.protocol	= XPRT_TRANSPORT_TCP,
>> +		.address	= sap,
>> +		.addrsize	= clp->cl_addrlen,
>> +		.servername	= clp->cl_hostname,
>> +		.program	= &nfs_program,
>> +		.version	= clp->rpc_ops->version,
>> +		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
>> +				  RPC_CLNT_CREATE_NOPING),
>> +	};
>> +	struct rpc_clnt *clnt;
>> +	struct rpc_cred *cred;
>> +
>> +	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
>> +		return 0;
>> +
>> +	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
>> +		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
>> +	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
>> +	rpc_set_port(sap, NFS_PORT);
>> +	cred = nfs4_get_clid_cred(clp);
>> +	if (cred) {
>> +		args.authflavor = cred->cr_auth->au_flavor;
>> +		put_rpccred(cred);
>> +	} else
>> +		args.authflavor = RPC_AUTH_UNIX;
>> +
>> +	clnt = rpc_create(&args);
>> +	if (IS_ERR(clnt)) {
>> +		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
>> +			__func__, PTR_ERR(clnt));
>> +		return PTR_ERR(clnt);
>> +	}
>> +
>> +	clp->cl_bc_rpcclient = clnt;
>> +	return 0;
>> +}
>> +
>> /**
>>  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
>>  * @clp - nfs_client to initialize
>> @@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
>> int nfs41_init_client(struct nfs_client *clp)
>> {
>> 	struct nfs4_session *session = NULL;
>> +	int ret;
>> +
>> +	ret = nfs41_create_sidecar_rpc_client(clp);
>> +	if (ret)
>> +		return ret;
>> 
>> 	/*
>> 	 * Create the session and mark it expired.
>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>> index 922be2e..159d703 100644
>> --- a/include/linux/nfs_fs_sb.h
>> +++ b/include/linux/nfs_fs_sb.h
>> @@ -87,6 +87,8 @@ struct nfs_client {
>> 
>> 	/* The sequence id to use for the next CREATE_SESSION */
>> 	u32			cl_seqid;
>> +	/* The optional sidecar backchannel transport */
>> +	struct rpc_clnt		*cl_bc_rpcclient;
>> 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
>> 	u32			cl_exchange_flags;
>> 	struct nfs4_session	*cl_session;	/* shared session */
>> 
> 

--
Chuck Lever



Trond Myklebust Oct. 20, 2014, 7:40 p.m. UTC | #3
On Mon, Oct 20, 2014 at 9:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
> On Oct 20, 2014, at 1:33 PM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:
>
> > On 10/16/14 15:40, Chuck Lever wrote:
> >> So far, TCP is the only transport that supports bi-directional RPC.
> >>
> >> When mounting with NFSv4.1 using a transport that does not support
> >> bi-directional RPC, establish a TCP sidecar connection to handle
> >> backchannel traffic for a session. The sidecar transport does not
> >> use its forward channel except for sending BIND_CONN_TO_SESSION
> >> operations.
> >>
> >> This commit adds logic to create and destroy the sidecar transport.
> >> Subsequent commits add logic to use the transport.
> >
> > I thought NFS v4.0 also uses a separate connection for the backchannel?
>
> For NFSv4.0, the server opens a connection to the client. For
> NFSv4.1, the client opens the side car connection to the
> server, and then performs BIND_CONN_TO_SESSION, an operation
> not in NFSv4.0. The processes are not terribly similar.
>
> > Can any of that code be reused here, rather than creating new sidecar structures?
>
> Since it’s the client opening a connection in this case,
> I don’t see any obvious common code paths that I haven’t
> already re-used. I’m open to suggestions.

Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
Chuck Lever Oct. 20, 2014, 8:11 p.m. UTC | #4
Hi Trond-

On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> Why aren't we doing the callbacks via RDMA as per the recommendation
> in RFC5667 section 5.1?

There’s no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It’s very quick to implement, re-using most of
the client backchannel implementation that already exists.

I’ve discussed this with an author of RFC 5667 [cc’d], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.

Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK’d it.

But I don’t see an explicit recommendation in section 5.1. Which
text are you referring to?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Trond Myklebust Oct. 20, 2014, 10:31 p.m. UTC | #5
On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Trond-
>
> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> Why aren't we doing the callbacks via RDMA as per the recommendation
>> in RFC5667 section 5.1?
>
> There’s no benefit to it. With a side car, the server requires
> few or no changes. There are no CB operations that benefit
> from using RDMA. It’s very quick to implement, re-using most of
> the client backchannel implementation that already exists.
>
> I’ve discussed this with an author of RFC 5667 [cc’d], and also
> with the implementors of an existing NFSv4.1 server that supports
> RDMA. They both agree that a side car is an acceptable, or even a
> preferable, way to approach backchannel support.
>
> Also, when I discussed this with you months ago, you also felt
> that a side car was better than adding backchannel support to the
> xprtrdma transport. I took this approach only because you OK’d it.
>
> But I don’t see an explicit recommendation in section 5.1. Which
> text are you referring to?

The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.

So the questions are:

1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?

2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
 a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
 b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)

All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs. However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927
Chuck Lever Oct. 21, 2014, 1:06 a.m. UTC | #6
On Oct 20, 2014, at 6:31 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> Hi Trond-
>> 
>> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>>> Why aren't we doing the callbacks via RDMA as per the recommendation
>>> in RFC5667 section 5.1?
>> 
>> There’s no benefit to it. With a side car, the server requires
>> few or no changes. There are no CB operations that benefit
>> from using RDMA. It’s very quick to implement, re-using most of
>> the client backchannel implementation that already exists.
>> 
>> I’ve discussed this with an author of RFC 5667 [cc’d], and also
>> with the implementors of an existing NFSv4.1 server that supports
>> RDMA. They both agree that a side car is an acceptable, or even a
>> preferable, way to approach backchannel support.
>> 
>> Also, when I discussed this with you months ago, you also felt
>> that a side car was better than adding backchannel support to the
>> xprtrdma transport. I took this approach only because you OK’d it.
>> 
>> But I don’t see an explicit recommendation in section 5.1. Which
>> text are you referring to?
> 
> The very first paragraph argues that because callback messages don't
> carry bulk data, there is no problem with using RPC/RDMA and, in
> particular, with using RDMA_MSG provided that the buffer sizes are
> negotiated correctly.

The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to “go there.”

> So the questions are:
> 
> 1) Where is the discussion of the merits for and against adding
> bi-directional support to the xprtrdma layer in Linux? What is the
> showstopper preventing implementation of a design based around
> RFC5667?

There is no show-stopper (see Section 5.1, after all). It’s
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.

Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
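
An illustrative sketch, not from the posted series: the flag in question would be SESSION4_BACK_CHAN in the CREATE_SESSION results, assuming those result flags end up in the session's flags field as on the Linux client. The helper name is invented.

static bool nfs41_needs_sidecar(const struct nfs4_session *session)
{
        /* If the server agreed to use this connection's backchannel,
         * no separate TCP side-car transport would be needed. */
        return !(session->flags & SESSION4_BACK_CHAN);
}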

> 2) Why do we instead have to solve the whole backchannel problem in
> the NFSv4.1 layer, and where is the discussion of the merits for and
> against that particular solution? As far as I can tell, it imposes at
> least 2 extra requirements:
> a) NFSv4.1 client+server must have support either for session
> trunking or for clientid trunking

Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.

Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were “if that doesn’t work, it’s a bug”).

Would this restrictive form of trunking present a problem?

> b) NFSv4.1 client must be able to set up a TCP connection to the
> server (that can be session/clientid trunked with the existing RDMA
> channel)

Also very minimal changes. The changes are already done,
posted in v1 of this patch series.

> All I've found so far on googling these questions is a 5 1/2 year old
> email exchange between Tom Tucker and Ricardo where the conclusion
> appears to be that we can, in time, implement both designs.

You and I spoke about this on Feb 13, 2014 during pub night.
At the time you stated that a side-car was the only spec-
compliant way to approach this. I said I would go forward
with the idea in Linux, and you did not object.

> However
> there is no explanation of why we would want to do so.
> http://comments.gmane.org/gmane.linux.nfs/22927

I’ve implemented exactly what Ricardo proposed in this
thread, including dealing with connection loss:

> > The thinking is that NFSRDMA could initially use a TCP callback channel.
> > We'll implement BIND_CONN_TO_SESSION so that the backchannel does not
> > need to be tied to the forechannel connection.  This should address the
> > case where you have NFSRDMA for the forechannel and TCP for the
> > backchannel.  BIND_CONN_TO_SESSION is also required to reestablish
> > dropped connections effectively (to avoid losing the reply cache).


And here’s what you had to say in support of the idea:

> Given what they're hoping to achieve, I'm fine with
> doing a simple implementation of sessions first, then progressively
> refining it.

What’s the next step?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Trond Myklebust Oct. 21, 2014, 7:45 a.m. UTC | #7
On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Oct 20, 2014, at 6:31 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> Hi Trond-
>>>
>>> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>
>>>> Why aren't we doing the callbacks via RDMA as per the recommendation
>>>> in RFC5667 section 5.1?
>>>
>>> There’s no benefit to it. With a side car, the server requires
>>> few or no changes. There are no CB operations that benefit
>>> from using RDMA. It’s very quick to implement, re-using most of
>>> the client backchannel implementation that already exists.
>>>
>>> I’ve discussed this with an author of RFC 5667 [cc’d], and also
>>> with the implementors of an existing NFSv4.1 server that supports
>>> RDMA. They both agree that a side car is an acceptable, or even a
>>> preferable, way to approach backchannel support.
>>>
>>> Also, when I discussed this with you months ago, you also felt
>>> that a side car was better than adding backchannel support to the
>>> xprtrdma transport. I took this approach only because you OK’d it.
>>>
>>> But I don’t see an explicit recommendation in section 5.1. Which
>>> text are you referring to?
>>
>> The very first paragraph argues that because callback messages don't
>> carry bulk data, there is no problem with using RPC/RDMA and, in
>> particular, with using RDMA_MSG provided that the buffer sizes are
>> negotiated correctly.
>
> The opening paragraph is advice that applies to all forms
> of NFSv4 callback, including NFSv4.0, which uses a separate
> transport initiated from the NFS server. Specific advice about
> NFSv4.1 bi-directional RPC is left to the next two paragraphs,
> but they suggest there be dragons. I rather think this is a
> warning not to “go there.”

All I see is a warning not to rely on XIDs to distinguish between
backchannel and forward channel RPC requests. How is that particular
to RDMA?

>> So the questions are:
>>
>> 1) Where is the discussion of the merits for and against adding
>> bi-directional support to the xprtrdma layer in Linux? What is the
>> showstopper preventing implementation of a design based around
>> RFC5667?
>
> There is no show-stopper (see Section 5.1, after all). It’s
> simply a matter of development effort: a side-car is much
> less work than implementing full RDMA backchannel support for
> both a client and server, especially since TCP backchannel
> already works and can be used immediately.
>
> Also, no problem with eventually implementing RDMA backchannel
> if the complexity, and any performance overhead it introduces in
> the forward channel, can be justified. The client can use the
> CREATE_SESSION flags to detect what a server supports.

What complexity and performance overhead does it introduce in the
forward channel? I've never been presented with a complete discussion
of this alternative either from you or from anybody else.

>> 2) Why do we instead have to solve the whole backchannel problem in
>> the NFSv4.1 layer, and where is the discussion of the merits for and
>> against that particular solution? As far as I can tell, it imposes at
>> least 2 extra requirements:
>> a) NFSv4.1 client+server must have support either for session
>> trunking or for clientid trunking
>
> Very minimal trunking support. The only operation allowed on
> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>
> Bruce told me that associating multiple transports to a
> clientid/session should not be an issue for his server (his
> words were “if that doesn’t work, it’s a bug”).
>
> Would this restrictive form of trunking present a problem?
>
>> b) NFSv4.1 client must be able to set up a TCP connection to the
>> server (that can be session/clientid trunked with the existing RDMA
>> channel)
>
> Also very minimal changes. The changes are already done,
> posted in v1 of this patch series.

I'm not asking for details on the size of the changesets, but for a
justification of the design itself. If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
Chuck Lever Oct. 21, 2014, 5:11 p.m. UTC | #8
On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> There is no show-stopper (see Section 5.1, after all). It’s
>> simply a matter of development effort: a side-car is much
>> less work than implementing full RDMA backchannel support for
>> both a client and server, especially since TCP backchannel
>> already works and can be used immediately.
>> 
>> Also, no problem with eventually implementing RDMA backchannel
>> if the complexity, and any performance overhead it introduces in
>> the forward channel, can be justified. The client can use the
>> CREATE_SESSION flags to detect what a server supports.
> 
> What complexity and performance overhead does it introduce in the
> forward channel?

The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it’s
never been measured, to my knowledge.

The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That’s typically what
implementations use to steer RPC reply processing.

Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
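
For illustration only (not code from this series), the "peek" amounts to reading the message-type word that follows the XID in the RPC header; the helper below is hypothetical, with RPC_CALL/RPC_REPLY as defined by RFC 5531.

static bool rpc_hdr_is_reverse_call(const __be32 *rpc_words)
{
        /* Word 0 of the RPC header is the XID; word 1 is the direction:
         * RPC_CALL (0) marks a backchannel request arriving on this
         * connection, RPC_REPLY (1) a normal forward channel reply. */
        return rpc_words[1] == cpu_to_be32(RPC_CALL);
}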

The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.

>>> 2) Why do we instead have to solve the whole backchannel problem in
>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>> against that particular solution? As far as I can tell, it imposes at
>>> least 2 extra requirements:
>>> a) NFSv4.1 client+server must have support either for session
>>> trunking or for clientid trunking
>> 
>> Very minimal trunking support. The only operation allowed on
>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>> 
>> Bruce told me that associating multiple transports to a
>> clientid/session should not be an issue for his server (his
>> words were “if that doesn’t work, it’s a bug”).
>> 
>> Would this restrictive form of trunking present a problem?
>> 
>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>> server (that can be session/clientid trunked with the existing RDMA
>>> channel)
>> 
>> Also very minimal changes. The changes are already done,
>> posted in v1 of this patch series.
> 
> I'm not asking for details on the size of the changesets, but for a
> justification of the design itself.

The size of the changeset _is_ the justification. It’s
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.

Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.

That’s really all there is to it. It’s almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.

> If it is possible to confine all
> the changes to the RPC/RDMA layer, then why consider patches that
> change the NFSv4.1 layer at all?

The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.

And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).

The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?

There’s also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Trond Myklebust Oct. 22, 2014, 8:39 a.m. UTC | #9
On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> There is no show-stopper (see Section 5.1, after all). It’s
>>> simply a matter of development effort: a side-car is much
>>> less work than implementing full RDMA backchannel support for
>>> both a client and server, especially since TCP backchannel
>>> already works and can be used immediately.
>>>
>>> Also, no problem with eventually implementing RDMA backchannel
>>> if the complexity, and any performance overhead it introduces in
>>> the forward channel, can be justified. The client can use the
>>> CREATE_SESSION flags to detect what a server supports.
>>
>> What complexity and performance overhead does it introduce in the
>> forward channel?
>
> The benefit of RDMA is that there are opportunities to
> reduce host CPU interaction with incoming data.
> Bi-direction requires that the transport look at the RPC
> header to determine the direction of the message. That
> could have an impact on the forward channel, but it’s
> never been measured, to my knowledge.
>
> The reason this is more of an issue for RPC/RDMA is that
> a copy of the XID appears in the RPC/RDMA header to avoid
> the need to look at the RPC header. That’s typically what
> implementations use to steer RPC reply processing.
>
> Often the RPC/RDMA header and RPC header land in
> disparate buffers. The RPC/RDMA reply handler looks
> strictly at the RPC/RDMA header, and runs in a tasklet
> usually on a different CPU. Adding bi-direction would mean
> the transport would have to peek into the upper layer
> headers, possibly resulting in cache line bouncing.

Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?

> The complexity would be the addition of over a hundred
> new lines of code on the client, and possibly a similar
> amount of new code on the server. Small, perhaps, but
> not insignificant.

Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.

>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>> against that particular solution? As far as I can tell, it imposes at
>>>> least 2 extra requirements:
>>>> a) NFSv4.1 client+server must have support either for session
>>>> trunking or for clientid trunking
>>>
>>> Very minimal trunking support. The only operation allowed on
>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>
>>> Bruce told me that associating multiple transports to a
>>> clientid/session should not be an issue for his server (his
>>> words were “if that doesn’t work, it’s a bug”).
>>>
>>> Would this restrictive form of trunking present a problem?
>>>
>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>> server (that can be session/clientid trunked with the existing RDMA
>>>> channel)
>>>
>>> Also very minimal changes. The changes are already done,
>>> posted in v1 of this patch series.
>>
>> I'm not asking for details on the size of the changesets, but for a
>> justification of the design itself.
>
> The size of the changeset _is_ the justification. It’s
> a much less invasive change to add a TCP side-car than
> it is to implement RDMA backchannel on both server and
> client.

Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".

> Most servers would require almost no change. Linux needs
> only a bug fix or two. Effectively zero-impact for
> servers that already support NFSv4.0 on RDMA to get
> NFSv4.1 and pNFS on RDMA, with working callbacks.
>
> That’s really all there is to it. It’s almost entirely a
> practical consideration: we have the infrastructure and
> can make it work in just a few lines of code.
>
>> If it is possible to confine all
>> the changes to the RPC/RDMA layer, then why consider patches that
>> change the NFSv4.1 layer at all?
>
> The fast new transport bring-up benefit is probably the
> biggest win. A TCP side-car makes bringing up any new
> transport implementation simpler.

That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA.
- implementing bi-directional transports in the RPC layer is non-simple

Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.

> And, RPC/RDMA offers zero performance benefit for
> backchannel traffic, especially since CB traffic would
> never move via RDMA READ/WRITE (as per RFC 5667 section
> 5.1).
>
> The primary benefit to doing an RPC/RDMA-only solution
> is that there is no upper layer impact. Is that a design
> requirement?
>
> There’s also been no discussion of issues with adding a
> very restricted amount of transport trunking. Can you
> elaborate on the problems this could introduce?

I haven't looked into those problems and I'm not the one asking anyone
to implement trunking.
Chuck Lever Oct. 22, 2014, 5:20 p.m. UTC | #10
> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>> 
>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>> simply a matter of development effort: a side-car is much
>>>> less work than implementing full RDMA backchannel support for
>>>> both a client and server, especially since TCP backchannel
>>>> already works and can be used immediately.
>>>> 
>>>> Also, no problem with eventually implementing RDMA backchannel
>>>> if the complexity, and any performance overhead it introduces in
>>>> the forward channel, can be justified. The client can use the
>>>> CREATE_SESSION flags to detect what a server supports.
>>> 
>>> What complexity and performance overhead does it introduce in the
>>> forward channel?
>> 
>> The benefit of RDMA is that there are opportunities to
>> reduce host CPU interaction with incoming data.
>> Bi-direction requires that the transport look at the RPC
>> header to determine the direction of the message. That
>> could have an impact on the forward channel, but it’s
>> never been measured, to my knowledge.
>> 
>> The reason this is more of an issue for RPC/RDMA is that
>> a copy of the XID appears in the RPC/RDMA header to avoid
>> the need to look at the RPC header. That’s typically what
>> implementations use to steer RPC reply processing.
>> 
>> Often the RPC/RDMA header and RPC header land in
>> disparate buffers. The RPC/RDMA reply handler looks
>> strictly at the RPC/RDMA header, and runs in a tasklet
>> usually on a different CPU. Adding bi-direction would mean
>> the transport would have to peek into the upper layer
>> headers, possibly resulting in cache line bouncing.
> 
> Under what circumstances would you expect to receive a valid NFSv4.1
> callback with an RDMA header that spans multiple cache lines?

The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.

For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.
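
A rough size sketch (illustrative arithmetic only, using the wire layout in include/linux/sunrpc/rpc_rdma.h): the fixed RPC/RDMA header is four 32-bit words, each read chunk adds a discriminator/position pair plus a handle/length/offset segment, and the chunk lists are terminated by one word apiece, so even a single chunk takes the header past 32 bytes.

static size_t rpcrdma_est_hdr_size(unsigned int nsegs)
{
        return 4 * sizeof(__be32)                     /* xid, vers, credits, proc */
                + nsegs * sizeof(struct rpcrdma_read_chunk)
                + 3 * sizeof(__be32);                 /* list terminators */
}

With one read chunk this already comes to roughly 52 bytes.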

I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?

>> The complexity would be the addition of over a hundred
>> new lines of code on the client, and possibly a similar
>> amount of new code on the server. Small, perhaps, but
>> not insignificant.
> 
> Until there are RDMA users, I care a lot less about code changes to
> xprtrdma than to NFS.
> 
>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>> least 2 extra requirements:
>>>>> a) NFSv4.1 client+server must have support either for session
>>>>> trunking or for clientid trunking
>>>> 
>>>> Very minimal trunking support. The only operation allowed on
>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>> 
>>>> Bruce told me that associating multiple transports to a
>>>> clientid/session should not be an issue for his server (his
>>>> words were “if that doesn’t work, it’s a bug”).
>>>> 
>>>> Would this restrictive form of trunking present a problem?
>>>> 
>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>> channel)
>>>> 
>>>> Also very minimal changes. The changes are already done,
>>>> posted in v1 of this patch series.
>>> 
>>> I'm not asking for details on the size of the changesets, but for a
>>> justification of the design itself.
>> 
>> The size of the changeset _is_ the justification. It’s
>> a much less invasive change to add a TCP side-car than
>> it is to implement RDMA backchannel on both server and
>> client.
> 
> Please define your use of the word "invasive" in the above context. To
> me "invasive" means "will affect code that is in use by others".

The server side, then, is non-invasive. The client side makes minor
changes to state management.

> 
>> Most servers would require almost no change. Linux needs
>> only a bug fix or two. Effectively zero-impact for
>> servers that already support NFSv4.0 on RDMA to get
>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>> 
>> That’s really all there is to it. It’s almost entirely a
>> practical consideration: we have the infrastructure and
>> can make it work in just a few lines of code.
>> 
>>> If it is possible to confine all
>>> the changes to the RPC/RDMA layer, then why consider patches that
>>> change the NFSv4.1 layer at all?
>> 
>> The fast new transport bring-up benefit is probably the
>> biggest win. A TCP side-car makes bringing up any new
>> transport implementation simpler.
> 
> That's an assertion that assumes:
> - we actually want to implement more transports aside from RDMA

So you no longer consider RPC/SCTP a possibility?

> - implementing bi-directional transports in the RPC layer is non-simple

I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.

> Right now, the benefit is only to RDMA users. Nobody else is asking
> for such a change.
> 
>> And, RPC/RDMA offers zero performance benefit for
>> backchannel traffic, especially since CB traffic would
>> never move via RDMA READ/WRITE (as per RFC 5667 section
>> 5.1).
>> 
>> The primary benefit to doing an RPC/RDMA-only solution
>> is that there is no upper layer impact. Is that a design
>> requirement?

Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
Trond Myklebust Oct. 22, 2014, 8:53 p.m. UTC | #11
On Wed, Oct 22, 2014 at 8:20 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>
>>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>
>>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>
>>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>>> simply a matter of development effort: a side-car is much
>>>>> less work than implementing full RDMA backchannel support for
>>>>> both a client and server, especially since TCP backchannel
>>>>> already works and can be used immediately.
>>>>>
>>>>> Also, no problem with eventually implementing RDMA backchannel
>>>>> if the complexity, and any performance overhead it introduces in
>>>>> the forward channel, can be justified. The client can use the
>>>>> CREATE_SESSION flags to detect what a server supports.
>>>>
>>>> What complexity and performance overhead does it introduce in the
>>>> forward channel?
>>>
>>> The benefit of RDMA is that there are opportunities to
>>> reduce host CPU interaction with incoming data.
>>> Bi-direction requires that the transport look at the RPC
>>> header to determine the direction of the message. That
>>> could have an impact on the forward channel, but it’s
>>> never been measured, to my knowledge.
>>>
>>> The reason this is more of an issue for RPC/RDMA is that
>>> a copy of the XID appears in the RPC/RDMA header to avoid
>>> the need to look at the RPC header. That’s typically what
>>> implementations use to steer RPC reply processing.
>>>
>>> Often the RPC/RDMA header and RPC header land in
>>> disparate buffers. The RPC/RDMA reply handler looks
>>> strictly at the RPC/RDMA header, and runs in a tasklet
>>> usually on a different CPU. Adding bi-direction would mean
>>> the transport would have to peek into the upper layer
>>> headers, possibly resulting in cache line bouncing.
>>
>> Under what circumstances would you expect to receive a valid NFSv4.1
>> callback with an RDMA header that spans multiple cache lines?
>
> The RPC header and RPC/RDMA header are separate entities, but
> together can span multiple cache lines if the server has returned a
> chunk list containing multiple entries.
>
> For example, RDMA_NOMSG would send the RPC/RDMA header
> via RDMA SEND with a chunk list that represents the RPC and NFS
> payload. That list could make the header larger than 32 bytes.
>
>> I expect that any callback that involves more than 1024 bytes of
> RPC payload will need to use RDMA_NOMSG. A long device
> info list might fit that category?

Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)

The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
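
For reference, a hedged sketch of what such a cap looks like on the client side, using the existing nfs41_create_session_args / nfs4_channel_attrs field names (the helper itself is invented):

static void nfs41_cap_backchannel_sizes(struct nfs41_create_session_args *args)
{
        /* Advertise small backchannel buffers at CREATE_SESSION time;
         * the server must not send CB messages larger than this. */
        args->bc_attrs.max_rqst_sz = PAGE_SIZE;
        args->bc_attrs.max_resp_sz = PAGE_SIZE;
}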

>>> The complexity would be the addition of over a hundred
>>> new lines of code on the client, and possibly a similar
>>> amount of new code on the server. Small, perhaps, but
>>> not insignificant.
>>
>> Until there are RDMA users, I care a lot less about code changes to
>> xprtrdma than to NFS.
>>
>>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>>> least 2 extra requirements:
>>>>>> a) NFSv4.1 client+server must have support either for session
>>>>>> trunking or for clientid trunking
>>>>>
>>>>> Very minimal trunking support. The only operation allowed on
>>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>>>
>>>>> Bruce told me that associating multiple transports to a
>>>>> clientid/session should not be an issue for his server (his
>>>>> words were “if that doesn’t work, it’s a bug”).
>>>>>
>>>>> Would this restrictive form of trunking present a problem?
>>>>>
>>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>>> channel)
>>>>>
>>>>> Also very minimal changes. The changes are already done,
>>>>> posted in v1 of this patch series.
>>>>
>>>> I'm not asking for details on the size of the changesets, but for a
>>>> justification of the design itself.
>>>
>>> The size of the changeset _is_ the justification. It’s
>>> a much less invasive change to add a TCP side-car than
>>> it is to implement RDMA backchannel on both server and
>>> client.
>>
>> Please define your use of the word "invasive" in the above context. To
>> me "invasive" means "will affect code that is in use by others".
>
> The server side, then, is non-invasive. The client side makes minor
> changes to state management.
>
>>
>>> Most servers would require almost no change. Linux needs
>>> only a bug fix or two. Effectively zero-impact for
>>> servers that already support NFSv4.0 on RDMA to get
>>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>>>
>>> That’s really all there is to it. It’s almost entirely a
>>> practical consideration: we have the infrastructure and
>>> can make it work in just a few lines of code.
>>>
>>>> If it is possible to confine all
>>>> the changes to the RPC/RDMA layer, then why consider patches that
>>>> change the NFSv4.1 layer at all?
>>>
>>> The fast new transport bring-up benefit is probably the
>>> biggest win. A TCP side-car makes bringing up any new
>>> transport implementation simpler.
>>
>> That's an assertion that assumes:
>> - we actually want to implement more transports aside from RDMA
>
> So you no longer consider RPC/SCTP a possibility?

I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.

>> - implementing bi-directional transports in the RPC layer is non-simple
>
> I don't care to generalize about that. In the RPC/RDMA case, there
> are some complications that make it non-simple, but not impossible.
> So we have an example of a non-simple case, IMO.
>
>> Right now, the benefit is only to RDMA users. Nobody else is asking
>> for such a change.
>>
>>> And, RPC/RDMA offers zero performance benefit for
>>> backchannel traffic, especially since CB traffic would
>>> never move via RDMA READ/WRITE (as per RFC 5667 section
>>> 5.1).
>>>
>>> The primary benefit to doing an RPC/RDMA-only solution
>>> is that there is no upper layer impact. Is that a design
>>> requirement?
>
> Based on your objections, it appears that "no upper layer
> impact" is a hard design requirement. I will take this as a
> NACK for the side-car approach.

There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both...
Chuck Lever Oct. 22, 2014, 10:38 p.m. UTC | #12
On Oct 22, 2014, at 4:53 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Wed, Oct 22, 2014 at 8:20 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>> 
>>>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>>>> simply a matter of development effort: a side-car is much
>>>>>> less work than implementing full RDMA backchannel support for
>>>>>> both a client and server, especially since TCP backchannel
>>>>>> already works and can be used immediately.
>>>>>> 
>>>>>> Also, no problem with eventually implementing RDMA backchannel
>>>>>> if the complexity, and any performance overhead it introduces in
>>>>>> the forward channel, can be justified. The client can use the
>>>>>> CREATE_SESSION flags to detect what a server supports.
>>>>> 
>>>>> What complexity and performance overhead does it introduce in the
>>>>> forward channel?
>>>> 
>>>> The benefit of RDMA is that there are opportunities to
>>>> reduce host CPU interaction with incoming data.
>>>> Bi-direction requires that the transport look at the RPC
>>>> header to determine the direction of the message. That
>>>> could have an impact on the forward channel, but it’s
>>>> never been measured, to my knowledge.
>>>> 
>>>> The reason this is more of an issue for RPC/RDMA is that
>>>> a copy of the XID appears in the RPC/RDMA header to avoid
>>>> the need to look at the RPC header. That’s typically what
>>>> implementations use to steer RPC reply processing.
>>>> 
>>>> Often the RPC/RDMA header and RPC header land in
>>>> disparate buffers. The RPC/RDMA reply handler looks
>>>> strictly at the RPC/RDMA header, and runs in a tasklet
>>>> usually on a different CPU. Adding bi-direction would mean
>>>> the transport would have to peek into the upper layer
>>>> headers, possibly resulting in cache line bouncing.
>>> 
>>> Under what circumstances would you expect to receive a valid NFSv4.1
>>> callback with an RDMA header that spans multiple cache lines?
>> 
>> The RPC header and RPC/RDMA header are separate entities, but
>> together can span multiple cache lines if the server has returned a
>> chunk list containing multiple entries.
>> 
>> For example, RDMA_NOMSG would send the RPC/RDMA header
>> via RDMA SEND with a chunk list that represents the RPC and NFS
>> payload. That list could make the header larger than 32 bytes.
>> 
>> I expect that any callback that involves more than 1024 byte of
>> RPC payload will need to use RDMA_NOMSG. A long device
>> info list might fit that category?
> 
> Right, but are there any callbacks that would do that? AFAICS, most of
> them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
> | some single CB_operation)

That is a question only a pNFS layout developer can answer.

Allowing larger CB operations might be important. I thought
I heard Matt list a couple of examples that might move bulk
data via CB, but probably none that have implementations
currently.

I’m not familiar with block or flex-file, so I can’t make
any kind of guess about those.

> The point is that we can set finite limits on the size of callbacks in
> the CREATE_SESSION. As long as those limits are reasonable (and 1K
> does seem more than reasonable for existing use cases) then why
> shouldn't we be able to expect the server to use RDMA_MSG?

The spec allows both RDMA_MSG and RDMA_NOMSG for CB
RPC. That provides some implementation flexibility, but
it also means either end can use either MSG type. An
interoperable CB service would have to support all
scenarios.

RDMA_NOMSG can support small or large payloads. RDMA_MSG
can support only a payload that fits in the receiver’s
pre-posted buffer (and that includes the RPC and NFS
headers) because NFS CB RPCs are not allowed to use
read or write chunks.

Since there are no implementations, currently, I was
hoping everyone might agree to stick with using only
RDMA_NOMSG for NFSv4.1 CB RPCs on RPC/RDMA. That would
mean there was a high limit on CB RPC payload size, and
RDMA READ would be used to move CB RPC calls and replies
in all cases.

NFS uses NOMSG so infrequently that it would be a good
way to limit churn in the transport’s hot paths.
Particularly NFS READ and WRITE are required to use
RDMA_MSG with their payload encoded in chunks. If the
incoming message is MSG, then clearly it can’t be a
reverse RPC, and there’s no need to go looking for it
(as long as we all agree on the “CBs use only NOMSG”
convention).

The receiver can tell just by looking at the RPC/RDMA
header that extra processing won’t be needed in the
common case.
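
A sketch of that check, assuming the "CBs use only NOMSG" convention (not code from the series; rm_type and the RDMA_* proc values are from include/linux/sunrpc/rpc_rdma.h):

static bool rpcrdma_may_be_reverse_rpc(const struct rpcrdma_msg *headerp)
{
        /* Under this convention RDMA_MSG always carries a forward
         * channel reply; only RDMA_NOMSG warrants the extra look at
         * the RPC call direction. */
        return be32_to_cpu(headerp->rm_type) == RDMA_NOMSG;
}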

Clearly we need a prototype or two to understand these
issues. And probably some WG discussion is warranted.

>>>> The complexity would be the addition of over a hundred
>>>> new lines of code on the client, and possibly a similar
>>>> amount of new code on the server. Small, perhaps, but
>>>> not insignificant.
>>> 
>>> Until there are RDMA users, I care a lot less about code changes to
>>> xprtrdma than to NFS.
>>> 
>>>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>>>> least 2 extra requirements:
>>>>>>> a) NFSv4.1 client+server must have support either for session
>>>>>>> trunking or for clientid trunking
>>>>>> 
>>>>>> Very minimal trunking support. The only operation allowed on
>>>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>>>> 
>>>>>> Bruce told me that associating multiple transports to a
>>>>>> clientid/session should not be an issue for his server (his
>>>>>> words were “if that doesn’t work, it’s a bug”).
>>>>>> 
>>>>>> Would this restrictive form of trunking present a problem?
>>>>>> 
>>>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>>>> channel)
>>>>>> 
>>>>>> Also very minimal changes. The changes are already done,
>>>>>> posted in v1 of this patch series.
>>>>> 
>>>>> I'm not asking for details on the size of the changesets, but for a
>>>>> justification of the design itself.
>>>> 
>>>> The size of the changeset _is_ the justification. It’s
>>>> a much less invasive change to add a TCP side-car than
>>>> it is to implement RDMA backchannel on both server and
>>>> client.
>>> 
>>> Please define your use of the word "invasive" in the above context. To
>>> me "invasive" means "will affect code that is in use by others".
>> 
>> The server side, then, is non-invasive. The client side makes minor
>> changes to state management.
>> 
>>> 
>>>> Most servers would require almost no change. Linux needs
>>>> only a bug fix or two. Effectively zero-impact for
>>>> servers that already support NFSv4.0 on RDMA to get
>>>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>>>> 
>>>> That’s really all there is to it. It’s almost entirely a
>>>> practical consideration: we have the infrastructure and
>>>> can make it work in just a few lines of code.
>>>> 
>>>>> If it is possible to confine all
>>>>> the changes to the RPC/RDMA layer, then why consider patches that
>>>>> change the NFSv4.1 layer at all?
>>>> 
>>>> The fast new transport bring-up benefit is probably the
>>>> biggest win. A TCP side-car makes bringing up any new
>>>> transport implementation simpler.
>>> 
>>> That's an assertion that assumes:
>>> - we actually want to implement more transports aside from RDMA
>> 
>> So you no longer consider RPC/SCTP a possibility?
> 
> I'd still like to consider it, but the whole point would be to _avoid_
> doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
> the transport level, meaning that we don't have to deal with tracking
> connections, state, replaying messages, etc.
> Doing bi-directional RPC with SCTP is not an issue, since the
> transport is fully symmetric.
> 
>>> - implementing bi-directional transports in the RPC layer is non-simple
>> 
>> I don't care to generalize about that. In the RPC/RDMA case, there
>> are some complications that make it non-simple, but not impossible.
>> So we have an example of a non-simple case, IMO.
>> 
>>> Right now, the benefit is only to RDMA users. Nobody else is asking
>>> for such a change.
>>> 
>>>> And, RPC/RDMA offers zero performance benefit for
>>>> backchannel traffic, especially since CB traffic would
>>>> never move via RDMA READ/WRITE (as per RFC 5667 section
>>>> 5.1).
>>>> 
>>>> The primary benefit to doing an RPC/RDMA-only solution
>>>> is that there is no upper layer impact. Is that a design
>>>> requirement?
>> 
>> Based on your objections, it appears that "no upper layer
>> impact" is a hard design requirement. I will take this as a
>> NACK for the side-car approach.
> 
> There is not a hard NACK yet, but I am asking for stronger
> justification. I do _not_ want to find myself in a situation 2 or 3
> years down the road where I have to argue against someone telling me
> that we additionally have to implement callbacks over IB/RDMA because
> the TCP sidecar is an incomplete solution. We should do either one or
> the other, but not both…

It is impossible to predict the future. However, I’m not
sure there’s a problem with building both eventually,
especially since there is no spec guidance I’m aware of
about which bi-directional RPC mechanisms MUST be
supported for NFS/RDMA.

I might be wrong, but we have enough mechanism in
CREATE_SESSION for a client with bi-directional RPC/RDMA
support to detect a server without it, and to use the
side-car only in that case.
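
Something along these lines, as a sketch only (not part
of the posted series): nfs41_needs_sidecar() is a made-up
helper, it assumes the CREATE_SESSION reply flags end up
in cl_session->flags, and rpc_xprt_is_bidirectional() is
the helper this series adds.

/* Decide, after CREATE_SESSION completes, whether a TCP
 * side-car is needed to provide a backchannel.
 */
static bool nfs41_needs_sidecar(struct nfs_client *clp)
{
	struct nfs4_session *session = clp->cl_session;

	/* The server granted a backchannel on the forward
	 * connection (SESSION4_BACK_CHAN): no side-car needed,
	 * even on RPC/RDMA. */
	if (session->flags & SESSION4_BACK_CHAN)
		return false;

	/* A bi-directional transport (TCP) can bind its own
	 * backchannel later with BIND_CONN_TO_SESSION. */
	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
		return false;

	return true;
}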

If we need more discussion, I can drop the side-car patches
for 3.19 and do some more research.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



J. Bruce Fields Oct. 23, 2014, 1:32 p.m. UTC | #13
On Tue, Oct 21, 2014 at 01:11:26PM -0400, Chuck Lever wrote:
> The size of the changeset _is_ the justification. It’s
> a much less invasive change to add a TCP side-car than
> it is to implement RDMA backchannel on both server and
> client.

Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?

Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.

--b.
Chuck Lever Oct. 23, 2014, 1:55 p.m. UTC | #14
On Oct 23, 2014, at 9:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:

> On Tue, Oct 21, 2014 at 01:11:26PM -0400, Chuck Lever wrote:
>> The size of the changeset _is_ the justification. It’s
>> a much less invasive change to add a TCP side-car than
>> it is to implement RDMA backchannel on both server and
>> client.
> 
> Something I'm confused about: is bidirectional RPC/RDMA optional or
> mandatory for servers to implement?

IMO bi-directional RPC/RDMA is not required anywhere in RFCs 5666
(the RPC/RDMA spec) or 5667 (the NFS on RPC/RDMA spec).

> Something somewhere has to be mandatory if we want to guarantee a
> working backchannel between any two implementations.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



diff mbox

Patch

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@  const struct rpc_program nfs_program = {
 	.stats			= &nfs_rpcstat,
 	.pipe_dir_name		= NFS_PIPE_DIRNAME,
 };
+EXPORT_SYMBOL_GPL(nfs_program);
 
 struct rpc_stat nfs_rpcstat = {
 	.program		= &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@  static void nfs4_destroy_callback(struct nfs_client *clp)
 {
 	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
 		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+	if (clp->cl_bc_rpcclient)
+		rpc_shutdown_client(clp->cl_bc_rpcclient);
 }
 
 static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@  int nfs40_init_client(struct nfs_client *clp)
 
 #if defined(CONFIG_NFS_V4_1)
 
+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+	struct sockaddr_storage address;
+	struct sockaddr *sap = (struct sockaddr *)&address;
+	struct rpc_create_args args = {
+		.net		= clp->cl_net,
+		.protocol	= XPRT_TRANSPORT_TCP,
+		.address	= sap,
+		.addrsize	= clp->cl_addrlen,
+		.servername	= clp->cl_hostname,
+		.program	= &nfs_program,
+		.version	= clp->rpc_ops->version,
+		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
+				  RPC_CLNT_CREATE_NOPING),
+	};
+	struct rpc_clnt *clnt;
+	struct rpc_cred *cred;
+
+	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+		return 0;
+
+	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+	rpc_set_port(sap, NFS_PORT);
+	cred = nfs4_get_clid_cred(clp);
+	if (cred) {
+		args.authflavor = cred->cr_auth->au_flavor;
+		put_rpccred(cred);
+	} else
+		args.authflavor = RPC_AUTH_UNIX;
+
+	clnt = rpc_create(&args);
+	if (IS_ERR(clnt)) {
+		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+			__func__, PTR_ERR(clnt));
+		return PTR_ERR(clnt);
+	}
+
+	clp->cl_bc_rpcclient = clnt;
+	return 0;
+}
+
 /**
  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
  * @clp - nfs_client to initialize
@@ -300,6 +349,11 @@  int nfs40_init_client(struct nfs_client *clp)
 int nfs41_init_client(struct nfs_client *clp)
 {
 	struct nfs4_session *session = NULL;
+	int ret;
+
+	ret = nfs41_create_sidecar_rpc_client(clp);
+	if (ret)
+		return ret;
 
 	/*
 	 * Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@  struct nfs_client {
 
 	/* The sequence id to use for the next CREATE_SESSION */
 	u32			cl_seqid;
+	/* The optional sidecar backchannel transport */
+	struct rpc_clnt		*cl_bc_rpcclient;
 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
 	u32			cl_exchange_flags;
 	struct nfs4_session	*cl_session;	/* shared session */
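
Not part of this patch: a sketch of how a later patch in
the series might use cl_bc_rpcclient, binding the side-car
connection to the session so the server can send callbacks
on it. nfs4_proc_bind_conn_to_session_clnt() is a
hypothetical variant of nfs4_proc_bind_conn_to_session()
that takes an explicit rpc_clnt.

/* Bind the side-car connection to the session as a
 * backchannel-capable connection (sketch only).
 */
static int nfs41_bind_sidecar_backchannel(struct nfs_client *clp)
{
	struct rpc_cred *cred;
	int status;

	/* No side-car: the forward channel is bi-directional. */
	if (clp->cl_bc_rpcclient == NULL)
		return 0;

	cred = nfs4_get_clid_cred(clp);
	status = nfs4_proc_bind_conn_to_session_clnt(clp->cl_bc_rpcclient,
						     clp, cred);
	if (cred != NULL)
		put_rpccred(cred);
	return status;
}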