
[v1,07/12] xprtrdma: Don't provide a reply chunk when expecting a short reply

Message ID 20150709204246.26247.10367.stgit@manet.1015granger.net (mailing list archive)
State Not Applicable

Commit Message

Chuck Lever III July 9, 2015, 8:42 p.m. UTC
Currently Linux always offers a reply chunk, even for small replies
(unless a read or write list is needed for the RPC operation).

A comment in rpcrdma_marshal_req() reads:

> Currently we try to not actually use read inline.
> Reply chunks have the desirable property that
> they land, packed, directly in the target buffers
> without headers, so they require no fixup. The
> additional RDMA Write op sends the same amount
> of data, streams on-the-wire and adds no overhead
> on receive. Therefore, we request a reply chunk
> for non-writes wherever feasible and efficient.

This considers only the network bandwidth cost of sending the RPC
reply. For replies which are only a few dozen bytes, this is
typically not a good trade-off.

If the server chooses to return the reply inline:

 - The client has registered and invalidated a memory region to
   catch the reply, which is then not used

If the server chooses to use the reply chunk:

 - The server sends a few bytes using a heavyweight RDMA WRITE
   operation. The entire RPC reply is conveyed in two RDMA
   operations (WRITE_ONLY, SEND) instead of one.

Note that both the server and client have to prepare or copy the
reply data anyway to construct these replies. There's no benefit to
using an RDMA transfer since the host CPU has to be involved.
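
After this patch, the reply chunk decision at the top of
rpcrdma_marshal_req() reduces to roughly the following sketch. The
inline threshold test is paraphrased here and the exact macro name
may differ in the code:

	if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
		wtype = rpcrdma_writech;
	else if (rqst->rq_rcv_buf.buflen <= RPCRDMA_INLINE_READ_THRESHOLD(rqst))
		wtype = rpcrdma_noch;	/* small reply: returned inline in the SEND */
	else
		wtype = rpcrdma_replych;	/* large non-read reply: reply chunk */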

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)



Comments

Sagi Grimberg July 12, 2015, 2:58 p.m. UTC | #1
On 7/9/2015 11:42 PM, Chuck Lever wrote:
> Currently Linux always offers a reply chunk, even for small replies
> (unless a read or write list is needed for the RPC operation).
>
> A comment in rpcrdma_marshal_req() reads:
>
>> Currently we try to not actually use read inline.
>> Reply chunks have the desirable property that
>> they land, packed, directly in the target buffers
>> without headers, so they require no fixup. The
>> additional RDMA Write op sends the same amount
>> of data, streams on-the-wire and adds no overhead
>> on receive. Therefore, we request a reply chunk
>> for non-writes wherever feasible and efficient.
>
> This considers only the network bandwidth cost of sending the RPC
> reply. For replies which are only a few dozen bytes, this is
> typically not a good trade-off.
>
> If the server chooses to return the reply inline:
>
>   - The client has registered and invalidated a memory region to
>     catch the reply, which is then not used
>
> If the server chooses to use the reply chunk:
>
>   - The server sends a few bytes using a heavyweight RDMA WRITE
>     operation. The entire RPC reply is conveyed in two RDMA
>     operations (WRITE_ONLY, SEND) instead of one.

Pipelined WRITE+SEND operations are hardly an overhead compared to
copying chunks of data.

>
> Note that both the server and client have to prepare or copy the
> reply data anyway to construct these replies. There's no benefit to
> using an RDMA transfer since the host CPU has to be involved.

I think that preparation (posting 1 or 2 WQEs) and copying
chunks of data of say 8K-16K might be different.

I understand that you probably see better performance scaling. But this
might be HW dependent. Also, this might backfire on you if your
configuration is one-to-many. Then, data copy CPU cycles might become
more expensive.

I don't really know what is better, but just thought I'd present
another side to this.
Chuck Lever III July 12, 2015, 6:38 p.m. UTC | #2
Hi Sagi-


On Jul 12, 2015, at 10:58 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 7/9/2015 11:42 PM, Chuck Lever wrote:
>> Currently Linux always offers a reply chunk, even for small replies
>> (unless a read or write list is needed for the RPC operation).
>> 
>> A comment in rpcrdma_marshal_req() reads:
>> 
>>> Currently we try to not actually use read inline.
>>> Reply chunks have the desirable property that
>>> they land, packed, directly in the target buffers
>>> without headers, so they require no fixup. The
>>> additional RDMA Write op sends the same amount
>>> of data, streams on-the-wire and adds no overhead
>>> on receive. Therefore, we request a reply chunk
>>> for non-writes wherever feasible and efficient.
>> 
>> This considers only the network bandwidth cost of sending the RPC
>> reply. For replies which are only a few dozen bytes, this is
>> typically not a good trade-off.
>> 
>> If the server chooses to return the reply inline:
>> 
>>  - The client has registered and invalidated a memory region to
>>    catch the reply, which is then not used
>> 
>> If the server chooses to use the reply chunk:
>> 
>>  - The server sends a few bytes using a heavyweight RDMA WRITE
>>    operation. The entire RPC reply is conveyed in two RDMA
>>    operations (WRITE_ONLY, SEND) instead of one.
> 
> Pipelined WRITE+SEND operations are hardly an overhead compared to
> copying chunks of data.
> 
>> 
>> Note that both the server and client have to prepare or copy the
>> reply data anyway to construct these replies. There's no benefit to
>> using an RDMA transfer since the host CPU has to be involved.
> 
> I think that preparation (posting 1 or 2 WQEs) and copying
> chunks of data of say 8K-16K might be different.

Two points that are probably not clear from my patch description:

1. This patch affects only replies (usually much) smaller than the
   client’s inline threshold (1KB). Anything larger will continue
   to use RDMA transfer.

2. These replies are constructed in the RPC buffer by the server,
   and parsed in the receive buffer by the client. They are not
   simple data copies on either endpoint.

Think NFS GETATTR: the server is gathering metadata from multiple
sources, and XDR encoding it in the reply send buffer. The data
is not copied, it is manipulated before the SEND.

The client then XDR decodes the received stream and scatters the
decoded results into multiple in-memory data structures.

Because XDR encoding/decoding is involved, there really is no
benefit to an RDMA transfer for these replies.
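
As an aside, here is a minimal, hypothetical sketch of what decoding
such a small reply looks like on the client: individual XDR fields are
pulled out of the receive buffer and scattered into in-memory
structures, so there is no contiguous payload for an RDMA WRITE to
land in. (struct example_attrs and its fields are invented for
illustration; this is not actual GETATTR decoding.)

	struct example_attrs {
		u32 mode;
		u32 nlink;
		u32 uid;
	};

	static int decode_example_attrs(struct xdr_stream *xdr,
					struct example_attrs *attrs)
	{
		__be32 *p;

		/* Each field is XDR-decoded and stored individually. */
		p = xdr_inline_decode(xdr, 3 * sizeof(__be32));
		if (unlikely(!p))
			return -EIO;
		attrs->mode  = be32_to_cpup(p++);
		attrs->nlink = be32_to_cpup(p++);
		attrs->uid   = be32_to_cpup(p);
		return 0;
	}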


> I understand that you probably see better performance scaling. But this
> might be HW dependent. Also, this might backfire on you if your
> configuration is one-to-many. Then, data copy CPU cycles might become
> more expensive.
> 
> I don't really know what is better, but just thought I'd present
> another side to this.

Thanks for your review!

--
Chuck Lever



Sagi Grimberg July 14, 2015, 9:54 a.m. UTC | #3
On 7/12/2015 9:38 PM, Chuck Lever wrote:
> Hi Sagi-
>
>
> On Jul 12, 2015, at 10:58 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> On 7/9/2015 11:42 PM, Chuck Lever wrote:
>>> Currently Linux always offers a reply chunk, even for small replies
>>> (unless a read or write list is needed for the RPC operation).
>>>
>>> A comment in rpcrdma_marshal_req() reads:
>>>
>>>> Currently we try to not actually use read inline.
>>>> Reply chunks have the desirable property that
>>>> they land, packed, directly in the target buffers
>>>> without headers, so they require no fixup. The
>>>> additional RDMA Write op sends the same amount
>>>> of data, streams on-the-wire and adds no overhead
>>>> on receive. Therefore, we request a reply chunk
>>>> for non-writes wherever feasible and efficient.
>>>
>>> This considers only the network bandwidth cost of sending the RPC
>>> reply. For replies which are only a few dozen bytes, this is
>>> typically not a good trade-off.
>>>
>>> If the server chooses to return the reply inline:
>>>
>>>   - The client has registered and invalidated a memory region to
>>>     catch the reply, which is then not used
>>>
>>> If the server chooses to use the reply chunk:
>>>
>>>   - The server sends a few bytes using a heavyweight RDMA WRITE
>>>     operation. The entire RPC reply is conveyed in two RDMA
>>>     operations (WRITE_ONLY, SEND) instead of one.
>>
>> Pipelined WRITE+SEND operations are hardly an overhead compared to
>> copying chunks of data.
>>
>>>
>>> Note that both the server and client have to prepare or copy the
>>> reply data anyway to construct these replies. There's no benefit to
>>> using an RDMA transfer since the host CPU has to be involved.
>>
>> I think that preparation (posting 1 or 2 WQEs) and copying
>> chunks of data of say 8K-16K might be different.
>
> Two points that are probably not clear from my patch description:
>
> 1. This patch affects only replies (usually much) smaller than the
>     client’s inline threshold (1KB). Anything larger will continue
>     to use RDMA transfer.
>
> 2. These replies are constructed in the RPC buffer by the server,
>     and parsed in the receive buffer by the client. They are not
>     simple data copies on either endpoint.
>
> Think NFS GETATTR: the server is gathering metadata from multiple
> sources, and XDR encoding it in the reply send buffer. The data
> is not copied, it is manipulated before the SEND.
>
> The client then XDR decodes the received stream and scatters the
> decoded results into multiple in-memory data structures.
>
> Because XDR encoding/decoding is involved, there really is no
> benefit to an RDMA transfer for these replies.

I see. Thanks for the clarification.

Reviewed-By: Sagi Grimberg <sagig@mellanox.com>

Patch

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index e569da4..8ac1448c 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -429,7 +429,7 @@  rpcrdma_marshal_req(struct rpc_rqst *rqst)
 	 *
 	 * o Read ops return data as write chunk(s), header as inline.
 	 * o If the expected result is under the inline threshold, all ops
-	 *   return as inline (but see later).
+	 *   return as inline.
 	 * o Large non-read ops return as a single reply chunk.
 	 */
 	if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
@@ -503,18 +503,6 @@  rpcrdma_marshal_req(struct rpc_rqst *rqst)
 			headerp->rm_body.rm_nochunks.rm_empty[2] = xdr_zero;
 			/* new length after pullup */
 			rpclen = rqst->rq_svec[0].iov_len;
-			/*
-			 * Currently we try to not actually use read inline.
-			 * Reply chunks have the desirable property that
-			 * they land, packed, directly in the target buffers
-			 * without headers, so they require no fixup. The
-			 * additional RDMA Write op sends the same amount
-			 * of data, streams on-the-wire and adds no overhead
-			 * on receive. Therefore, we request a reply chunk
-			 * for non-writes wherever feasible and efficient.
-			 */
-			if (wtype == rpcrdma_noch)
-				wtype = rpcrdma_replych;
 		}
 	}