[v1,10/10] svcrdma: Handle additional inline content
diff mbox

Message ID 20150109192319.4901.89444.stgit@klimt.1015granger.net
State New, archived
Headers show

Commit Message

Chuck Lever Jan. 9, 2015, 7:23 p.m. UTC
Most NFS RPCs place large payload arguments at the end of the RPC
header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
sends the complete RPC header inline, and the payload argument in a
read list.

One important case is not like this, however. NFSv4 WRITE compounds
can have an operation after the WRITE operation. The proper way to
convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
the read list position. (Note Linux clients currently do not do
this, but they will be changed to do it in the future).

The receiver could put trailing inline content in the XDR tail
buffer. But the Linux server's NFSv4 compound processing does not
consider the XDR tail buffer.

So, move trailing inline content to the end of the page list. This
presents the incoming compound to upper layers the same way the
socket code does.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   62 +++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sagi Grimberg Jan. 11, 2015, 6:01 p.m. UTC | #1
On 1/9/2015 9:23 PM, Chuck Lever wrote:
> Most NFS RPCs place large payload arguments at the end of the RPC
> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
> sends the complete RPC header inline, and the payload argument in a
> read list.
>
> One important case is not like this, however. NFSv4 WRITE compounds
> can have an operation after the WRITE operation. The proper way to
> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
> the read list position. (Note Linux clients currently do not do
> this, but they will be changed to do it in the future).
>
> The receiver could put trailing inline content in the XDR tail
> buffer. But the Linux server's NFSv4 compound processing does not
> consider the XDR tail buffer.
>
> So, move trailing inline content to the end of the page list. This
> presents the incoming compound to upper layers the same way the
> socket code does.
>

Would this memcpy be saved if you just posted a larger receive buffer
and the client would used it "really inline" as part of it's post_send?

I'm just trying to understand if this complicated logic is worth the
extra bytes of a larger recv buffer you are saving...

Will this code path happen a lot? If so you might get some overhead
you may want to avoid.

I may not see the full picture here... Just thought I'd ask...

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Jan. 12, 2015, 1:13 a.m. UTC | #2
On Jan 11, 2015, at 1:01 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 1/9/2015 9:23 PM, Chuck Lever wrote:
>> Most NFS RPCs place large payload arguments at the end of the RPC
>> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
>> sends the complete RPC header inline, and the payload argument in a
>> read list.
>> 
>> One important case is not like this, however. NFSv4 WRITE compounds
>> can have an operation after the WRITE operation. The proper way to
>> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
>> the read list position. (Note Linux clients currently do not do
>> this, but they will be changed to do it in the future).
>> 
>> The receiver could put trailing inline content in the XDR tail
>> buffer. But the Linux server's NFSv4 compound processing does not
>> consider the XDR tail buffer.
>> 
>> So, move trailing inline content to the end of the page list. This
>> presents the incoming compound to upper layers the same way the
>> socket code does.
>> 
> 
> Would this memcpy be saved if you just posted a larger receive buffer
> and the client would used it "really inline" as part of it's post_send?

The receive buffer doesn’t need to be larger. Clients already should
construct this trailing inline content in their SEND buffers.

Not that the  Linux client doesn’t yet send the extra inline via RDMA
SEND, it uses a separate RDMA READ to move the extra bytes, and that’s
a bug.

If the client does send this inline as it’s supposed to, the server
would receive it in its pre-posted RECV buffer. This patch simply
moves that content into the XDR buffer page list, where the server’s
XDR decoder can find it.

The TCP transport already places this content at the end of the XDR
buffer’s page list, and svcrdma puts this content in the same spot
if the client sends it via RDMA READ. So now, if the client sends the
extra bytes via SEND, svcrdma will put the bytes in the same spot as
the other cases.

The goal is to provide support for trailing inline content without
altering the upper layers of the NFS server.

> I'm just trying to understand if this complicated logic is worth the
> extra bytes of a larger recv buffer you are saving...

The complexity here arises only because we have to deal with the
possibility that copying the inline content may cross into a new page.
If we could guarantee that will never happen, then half of
rdma_copy_tail() can be removed.

> Will this code path happen a lot? If so you might get some overhead
> you may want to avoid.

It happens for every NFSv4 WRITE compound sent by a Linux NFS client,
and amounts to 16 bytes. (Yes, we’re now doing a 16-byte RDMA READ to
bring that data over. As I said, that’s a bug).

The overhead on the server for moving the extra 16 bytes is tiny, and
we get to re-use the current server’s NFSv4 compound decoder instead
of creating upper layer code just for svcrdma.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 13, 2015, 10:11 a.m. UTC | #3
On 1/12/2015 3:13 AM, Chuck Lever wrote:
>
> On Jan 11, 2015, at 1:01 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> On 1/9/2015 9:23 PM, Chuck Lever wrote:
>>> Most NFS RPCs place large payload arguments at the end of the RPC
>>> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
>>> sends the complete RPC header inline, and the payload argument in a
>>> read list.
>>>
>>> One important case is not like this, however. NFSv4 WRITE compounds
>>> can have an operation after the WRITE operation. The proper way to
>>> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
>>> the read list position. (Note Linux clients currently do not do
>>> this, but they will be changed to do it in the future).
>>>
>>> The receiver could put trailing inline content in the XDR tail
>>> buffer. But the Linux server's NFSv4 compound processing does not
>>> consider the XDR tail buffer.
>>>
>>> So, move trailing inline content to the end of the page list. This
>>> presents the incoming compound to upper layers the same way the
>>> socket code does.
>>>
>>
>> Would this memcpy be saved if you just posted a larger receive buffer
>> and the client would used it "really inline" as part of it's post_send?
>
> The receive buffer doesn’t need to be larger. Clients already should
> construct this trailing inline content in their SEND buffers.
>
> Not that the  Linux client doesn’t yet send the extra inline via RDMA
> SEND, it uses a separate RDMA READ to move the extra bytes, and that’s
> a bug.
>
> If the client does send this inline as it’s supposed to, the server
> would receive it in its pre-posted RECV buffer. This patch simply
> moves that content into the XDR buffer page list, where the server’s
> XDR decoder can find it.

Would it make more sense to manipulate pointers instead of copying data?
But if this is only 16 bytes than maybe it's not worth the trouble...
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Jan. 13, 2015, 2:35 p.m. UTC | #4
On Jan 13, 2015, at 5:11 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 1/12/2015 3:13 AM, Chuck Lever wrote:
>> 
>> On Jan 11, 2015, at 1:01 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>> 
>>> On 1/9/2015 9:23 PM, Chuck Lever wrote:
>>>> Most NFS RPCs place large payload arguments at the end of the RPC
>>>> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
>>>> sends the complete RPC header inline, and the payload argument in a
>>>> read list.
>>>> 
>>>> One important case is not like this, however. NFSv4 WRITE compounds
>>>> can have an operation after the WRITE operation. The proper way to
>>>> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
>>>> the read list position. (Note Linux clients currently do not do
>>>> this, but they will be changed to do it in the future).
>>>> 
>>>> The receiver could put trailing inline content in the XDR tail
>>>> buffer. But the Linux server's NFSv4 compound processing does not
>>>> consider the XDR tail buffer.
>>>> 
>>>> So, move trailing inline content to the end of the page list. This
>>>> presents the incoming compound to upper layers the same way the
>>>> socket code does.
>>>> 
>>> 
>>> Would this memcpy be saved if you just posted a larger receive buffer
>>> and the client would used it "really inline" as part of it's post_send?
>> 
>> The receive buffer doesn’t need to be larger. Clients already should
>> construct this trailing inline content in their SEND buffers.
>> 
>> Not that the  Linux client doesn’t yet send the extra inline via RDMA
>> SEND, it uses a separate RDMA READ to move the extra bytes, and that’s
>> a bug.
>> 
>> If the client does send this inline as it’s supposed to, the server
>> would receive it in its pre-posted RECV buffer. This patch simply
>> moves that content into the XDR buffer page list, where the server’s
>> XDR decoder can find it.
> 
> Would it make more sense to manipulate pointers instead of copying data?

It would. My first approach was to use the tail iovec in xdr_buf.
Simply point the tail’s iov_addr at trailing inline content in the
RECV buffer.

But as mentioned, the server’s XDR decoders don’t look at the tail
iovec.

The socket transport delivers this little piece of data at the end of
the xdr_buf page list, because all it has to do is read data off the
socket and stick it in pages.

So svcrdma can do that too. It’s a little more awkward, but the upper
layer code stays the same.

> But if this is only 16 bytes than maybe it's not worth the trouble…

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index a345cad..f44bf4e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -364,6 +364,63 @@  rdma_rcl_chunk_count(struct rpcrdma_read_chunk *ch)
 	return count;
 }
 
+/* If there was additional inline content, append it to the end of arg.pages.
+ * Tail copy has to be done after the reader function has determined how many
+ * pages are needed for RDMA READ.
+ */
+static int
+rdma_copy_tail(struct svc_rqst *rqstp, struct svc_rdma_op_ctxt *head,
+	       u32 position, u32 byte_count, u32 page_offset, int page_no)
+{
+	char *srcp, *destp;
+	int ret;
+
+	ret = 0;
+	srcp = head->arg.head[0].iov_base + position;
+	byte_count = head->arg.head[0].iov_len - position;
+	if (byte_count > PAGE_SIZE) {
+		dprintk("svcrdma: large tail unsupported\n");
+		goto err;
+	}
+
+	/* Fit as much of the tail on the current page as possible */
+	if (page_offset != PAGE_SIZE) {
+		destp = page_address(rqstp->rq_arg.pages[page_no]);
+		destp += page_offset;
+
+		while (byte_count--) {
+			*destp++ = *srcp++;
+			page_offset++;
+			if (page_offset == PAGE_SIZE)
+				break;
+		}
+
+		goto done;
+	}
+
+	/* Fit the rest on the next page */
+	page_no++;
+	if (!rqstp->rq_arg.pages[page_no]) {
+		dprintk("svcrdma: no more room for tail\n");
+		goto err;
+	}
+	destp = page_address(rqstp->rq_arg.pages[page_no]);
+	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+	while (byte_count--)
+		*destp++ = *srcp++;
+
+done:
+	ret = 1;
+	byte_count = head->arg.head[0].iov_len - position;
+	head->arg.page_len += byte_count;
+	head->arg.len += byte_count;
+	head->arg.buflen += byte_count;
+
+err:
+	return ret;
+}
+
 static int rdma_read_chunks(struct svcxprt_rdma *xprt,
 			    struct rpcrdma_msg *rmsgp,
 			    struct svc_rqst *rqstp,
@@ -440,9 +497,14 @@  static int rdma_read_chunks(struct svcxprt_rdma *xprt,
 		head->arg.page_len += pad;
 		head->arg.len += pad;
 		head->arg.buflen += pad;
+		page_offset += pad;
 	}
 
 	ret = 1;
+	if (position && position < head->arg.head[0].iov_len)
+		ret = rdma_copy_tail(rqstp, head, position,
+				     byte_count, page_offset, page_no);
+	head->arg.head[0].iov_len = position;
 	head->position = position;
 
  err: