[v7,21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

Message ID 20171107085514.12693-22-longli@exchange.microsoft.com (mailing list archive)
State New, archived

Commit Message

Long Li Nov. 7, 2017, 8:55 a.m. UTC
From: Long Li <longli@microsoft.com>

If the I/O size is at least rdma_readwrite_threshold, use RDMA write for
SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or
SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on the SMB
dialect in use. Append a smbd_buffer_descriptor_v1 to the end of the SMB
packet and fill in the other fields to indicate that this SMB read uses RDMA
write.

There is no need to read the incoming payload from the transport. By the time
the SMB read response comes back, the data has already been transferred and
placed in the pages by the RDMA hardware.

When the SMB read finishes, deregister the memory region if RDMA write was
used for this SMB read. smbd_deregister_mr may need to do local invalidation
and sleep if server remote invalidation is not used.

There are situations where the MID may not be created on I/O failure. In that
case the memory region is deregistered when the read data context is released.

Signed-off-by: Long Li <longli@microsoft.com>
---
 fs/cifs/file.c    | 17 +++++++++++++++--
 fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 3 deletions(-)
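
The channel choice described in the commit message can be sketched in user space as follows. This is only an illustration of the decision, not the kernel code: pick_read_channel and the CHANNEL_* / DIALECT_SMB30 names are hypothetical, the numeric channel values are the ones MS-SMB2 assigns to the READ request Channel field, and the threshold comparison and dialect test mirror the smb2_new_read_req hunk of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Channel values of the SMB2 READ request (MS-SMB2). Names are local. */
enum {
	CHANNEL_NONE = 0,
	CHANNEL_RDMA_V1 = 1,
	CHANNEL_RDMA_V1_INVALIDATE = 2,
};

/* Dialect revision code for SMB 3.0 (hypothetical local name). */
#define DIALECT_SMB30 0x0300u

/*
 * Hypothetical helper: decide which channel one read would use.
 * - no RDMA transport, or I/O below the threshold: plain read
 * - SMB 3.0: RDMA_V1, the client invalidates the region itself
 * - later dialects: RDMA_V1_INVALIDATE, the server may invalidate remotely
 */
uint32_t pick_read_channel(bool rdma, uint32_t io_bytes,
			   uint32_t threshold, uint16_t dialect)
{
	if (!rdma || io_bytes < threshold)
		return CHANNEL_NONE;
	if (dialect == DIALECT_SMB30)
		return CHANNEL_RDMA_V1;
	return CHANNEL_RDMA_V1_INVALIDATE;
}
```

For example, with a threshold of 4096 bytes, an 8192-byte read on an SMB 3.0.2 connection would select CHANNEL_RDMA_V1_INVALIDATE, while the same read on SMB 3.0 would select CHANNEL_RDMA_V1.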

Comments

Tom Talpey Sept. 19, 2018, 5:59 a.m. UTC | #1
Replying to a very old message, but it's something we
discussed today at the IOLab event so to capture it:

On 11/7/2017 12:55 AM, Long Li wrote:
> From: Long Li <longli@microsoft.com>
> 
> ---
>   fs/cifs/file.c    | 17 +++++++++++++++--
>   fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 59 insertions(+), 3 deletions(-)
> ...
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index c8afb83..8a5ff90 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -2379,7 +2379,40 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
>   	req->MinimumCount = 0;
>   	req->Length = cpu_to_le32(io_parms->length);
>   	req->Offset = cpu_to_le64(io_parms->offset);
> +#ifdef CONFIG_CIFS_SMB_DIRECT
> +	/*
> +	 * If we want to do a RDMA write, fill in and append
> +	 * smbd_buffer_descriptor_v1 to the end of read request
> +	 */
> +	if (server->rdma && rdata &&
> +		rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
> +
> +		struct smbd_buffer_descriptor_v1 *v1;
> +		bool need_invalidate =
> +			io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
> +
> +		rdata->mr = smbd_register_mr(
> +				server->smbd_conn, rdata->pages,
> +				rdata->nr_pages, rdata->tailsz,
> +				true, need_invalidate);
> +		if (!rdata->mr)
> +			return -ENOBUFS;
> +
> +		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
> +		if (need_invalidate)
> +			req->Channel = SMB2_CHANNEL_RDMA_V1;
> +		req->ReadChannelInfoOffset =
> +			offsetof(struct smb2_read_plain_req, Buffer);
> +		req->ReadChannelInfoLength =
> +			sizeof(struct smbd_buffer_descriptor_v1);
> +		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
> +		v1->offset = rdata->mr->mr->iova;

It's unnecessary, and possibly leaking kernel information, to use
the IOVA as the offset of a memory region which is registered using
an FRWR. Because such regions are based on the exact bytes targeted
by the memory handle, the offset can be set to any value, typically
zero, but nearly arbitrary. As long as the (offset + length) does
not wrap or otherwise overflow, offset can be set to anything
convenient.

Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
least significant 23 bits, which should guarantee it. The other 41
bits, party on. You could randomize them, pass some clever identifier
such as MID sequence, whatever.

Tom.

> +		v1->token = rdata->mr->mr->rkey;
> +		v1->length = rdata->mr->mr->length;
Long Li Sept. 20, 2018, 5:01 p.m. UTC | #2
> Subject: Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via
> RDMA write through memory registration
> 
> Replying to a very old message, but it's something we discussed today at the
> IOLab event so to capture it:
> 
> On 11/7/2017 12:55 AM, Long Li wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > ---
> >   fs/cifs/file.c    | 17 +++++++++++++++--
> >   fs/cifs/smb2pdu.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> >   2 files changed, 59 insertions(+), 3 deletions(-)
> > ...
> > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> > index c8afb83..8a5ff90 100644
> > --- a/fs/cifs/smb2pdu.c
> > +++ b/fs/cifs/smb2pdu.c
> > @@ -2379,7 +2379,40 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
> >   	req->MinimumCount = 0;
> >   	req->Length = cpu_to_le32(io_parms->length);
> >   	req->Offset = cpu_to_le64(io_parms->offset);
> > +#ifdef CONFIG_CIFS_SMB_DIRECT
> > +	/*
> > +	 * If we want to do a RDMA write, fill in and append
> > +	 * smbd_buffer_descriptor_v1 to the end of read request
> > +	 */
> > +	if (server->rdma && rdata &&
> > +		rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
> > +
> > +		struct smbd_buffer_descriptor_v1 *v1;
> > +		bool need_invalidate =
> > +			io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
> > +
> > +		rdata->mr = smbd_register_mr(
> > +				server->smbd_conn, rdata->pages,
> > +				rdata->nr_pages, rdata->tailsz,
> > +				true, need_invalidate);
> > +		if (!rdata->mr)
> > +			return -ENOBUFS;
> > +
> > +		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
> > +		if (need_invalidate)
> > +			req->Channel = SMB2_CHANNEL_RDMA_V1;
> > +		req->ReadChannelInfoOffset =
> > +			offsetof(struct smb2_read_plain_req, Buffer);
> > +		req->ReadChannelInfoLength =
> > +			sizeof(struct smbd_buffer_descriptor_v1);
> > +		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
> > +		v1->offset = rdata->mr->mr->iova;
> 
> It's unnecessary, and possibly leaking kernel information, to use the IOVA as
> the offset of a memory region which is registered using an FRWR. Because
> such regions are based on the exact bytes targeted by the memory handle,
> the offset can be set to any value, typically zero, but nearly arbitrary. As long
> as the (offset + length) does not wrap or otherwise overflow, offset can be
> set to anything convenient.
> 
> Since SMB reads and writes range up to 8MB, I'd suggest zeroing the least
> significant 23 bits, which should guarantee it. The other 41 bits, party on. You
> could randomize them, pass some clever identifier such as MID sequence,
> whatever.
> 
> Tom.

Thanks Tom. I will fix this.

> 
> > +		v1->token = rdata->mr->mr->rkey;
> > +		v1->length = rdata->mr->mr->length;
Stefan Metzmacher Sept. 22, 2018, 3:56 a.m. UTC | #3
Hi,

>> +        req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
>> +        if (need_invalidate)
>> +            req->Channel = SMB2_CHANNEL_RDMA_V1;
>> +        req->ReadChannelInfoOffset =
>> +            offsetof(struct smb2_read_plain_req, Buffer);
>> +        req->ReadChannelInfoLength =
>> +            sizeof(struct smbd_buffer_descriptor_v1);
>> +        v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
>> +        v1->offset = rdata->mr->mr->iova;
> 
> It's unnecessary, and possibly leaking kernel information, to use
> the IOVA as the offset of a memory region which is registered using
> an FRWR. Because such regions are based on the exact bytes targeted
> by the memory handle, the offset can be set to any value, typically
> zero, but nearly arbitrary. As long as the (offset + length) does
> not wrap or otherwise overflow, offset can be set to anything
> convenient.
> 
> Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
> least significant 23 bits, which should guarantee it. The other 41
> bits, party on. You could randomize them, pass some clever identifier
> such as MID sequence, whatever.

I just tested that setting:

mr->iova &= (PAGE_SIZE - 1);
mr->iova |= 0xFFFFFFFF00000000;

after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to work.

metze
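
The effect of the two masking statements above can be seen on a sample value in user space. This is a sketch under assumptions: SKETCH_PAGE_SIZE stands in for the kernel's PAGE_SIZE and is assumed to be 4096, and mask_iova is a hypothetical wrapper around the two statements, not a kernel function.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)	/* assumption: 4 KiB pages */

/* Hypothetical wrapper around the two statements tested in the thread. */
uint64_t mask_iova(uint64_t iova)
{
	iova &= (SKETCH_PAGE_SIZE - 1);		/* keep only the in-page offset */
	iova |= UINT64_C(0xFFFFFFFF00000000);	/* fixed pattern in the upper 32 bits */
	return iova;
}
```

With these assumptions, an input of 0x00007f3a12345678 becomes 0xFFFFFFFF00000678: the DMA address bits above the page offset no longer appear. The largest possible result is 0xFFFFFFFF00000FFF, and adding an 8 MB length to it gives 0xFFFFFFFF00800FFF, so the offset + length sum does not wrap.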
Tom Talpey Sept. 22, 2018, 5:16 p.m. UTC | #4
On 9/21/2018 8:56 PM, Stefan Metzmacher wrote:
> Hi,
> 
>>> +        req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
>>> +        if (need_invalidate)
>>> +            req->Channel = SMB2_CHANNEL_RDMA_V1;
>>> +        req->ReadChannelInfoOffset =
>>> +            offsetof(struct smb2_read_plain_req, Buffer);
>>> +        req->ReadChannelInfoLength =
>>> +            sizeof(struct smbd_buffer_descriptor_v1);
>>> +        v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
>>> +        v1->offset = rdata->mr->mr->iova;
>>
>> It's unnecessary, and possibly leaking kernel information, to use
>> the IOVA as the offset of a memory region which is registered using
>> an FRWR. Because such regions are based on the exact bytes targeted
>> by the memory handle, the offset can be set to any value, typically
>> zero, but nearly arbitrary. As long as the (offset + length) does
>> not wrap or otherwise overflow, offset can be set to anything
>> convenient.
>>
>> Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
>> least significant 23 bits, which should guarantee it. The other 41
>> bits, party on. You could randomize them, pass some clever identifier
>> such as MID sequence, whatever.
> 
> I just tested that setting:
> 
> mr->iova &= (PAGE_SIZE - 1);
> mr->iova |= 0xFFFFFFFF00000000;
> 
> after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to work.

Good! As you know, we were concerned about it after seeing that
the ib_dma_map_sg() code was unconditionally setting it to the
dma_mapped address. By salting those FFFF's with varying data,
this should give your FRWR regions stronger integrity in addition
to not leaking kernel "addresses" to the wire.

Tom.
Stefan Metzmacher Sept. 23, 2018, 9:24 p.m. UTC | #5
Hi Tom,

>> I just tested that setting:
>>
>> mr->iova &= (PAGE_SIZE - 1);
>> mr->iova |= 0xFFFFFFFF00000000;
>>
>> after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to
>> work.
> 
> Good! As you know, we were concerned about it after seeing that
> the ib_dma_map_sg() code was unconditionally setting it to the
> dma_mapped address. By salting those FFFF's with varying data,
> this should give your FRWR regions stronger integrity in addition
> to not leaking kernel "addresses" to the wire.

Just wondering... Isn't the thing we use called FRMR?

metze
Tom Talpey Sept. 24, 2018, 4 a.m. UTC | #6
On 9/23/2018 2:24 PM, Stefan Metzmacher wrote:
> Hi Tom,
> 
>>> I just tested that setting:
>>>
>>> mr->iova &= (PAGE_SIZE - 1);
>>> mr->iova |= 0xFFFFFFFF00000000;
>>>
>>> after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to
>>> work.
>>
>> Good! As you know, we were concerned about it after seeing that
>> the ib_dma_map_sg() code was unconditionally setting it to the
>> dma_mapped address. By salting those FFFF's with varying data,
>> this should give your FRWR regions stronger integrity in addition
>> to not leaking kernel "addresses" to the wire.
> 
> Just wondering... Isn't the thing we use called FRMR?

They're basically the same concept, it's a subtle difference.

FRMR = Fast Register Memory Region
FRWR = Fast Register Work Request

The memory region is the mr itself, this is created early on.

The work request is built when actually binding the physical
pages to the region, and setting the offset, length, etc, which
is what's happening in the routine that I made the comment on.

So, for this discussion I chose to say FRWR. Sorry for any
confusion!

Tom.
Stefan Metzmacher Sept. 24, 2018, 4:07 a.m. UTC | #7
> They're basically the same concept, it's a subtle difference.
> 
> FRMR = Fast Register Memory Region
> FRWR = Fast Register Work Request
> 
> The memory region is the mr itself, this is created early on.
> 
> The work request is built when actually binding the physical
> pages to the region, and setting the offset, length, etc, which
> is what's happening in the routine that I made the comment on.
> 
> So, for this discussion I chose to say FRWR. Sorry for any
> confusion!

Ah, thanks! Confusion resolved:-)

metze
Patch

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0786f19..464776a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -42,7 +42,7 @@ 
 #include "cifs_debug.h"
 #include "cifs_fs_sb.h"
 #include "fscache.h"
-
+#include "smbdirect.h"
 
 static inline int cifs_convert_flags(unsigned int flags)
 {
@@ -2908,7 +2908,12 @@  cifs_readdata_release(struct kref *refcount)
 {
 	struct cifs_readdata *rdata = container_of(refcount,
 					struct cifs_readdata, refcount);
-
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	if (rdata->mr) {
+		smbd_deregister_mr(rdata->mr);
+		rdata->mr = NULL;
+	}
+#endif
 	if (rdata->cfile)
 		cifsFileInfo_put(rdata->cfile);
 
@@ -3037,6 +3042,10 @@  uncached_fill_pages(struct TCP_Server_Info *server,
 		}
 		if (iter)
 			result = copy_page_from_iter(page, 0, n, iter);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+		else if (rdata->mr)
+			result = n;
+#endif
 		else
 			result = cifs_read_page_from_socket(server, page, n);
 		if (result < 0)
@@ -3606,6 +3615,10 @@  readpages_fill_pages(struct TCP_Server_Info *server,
 
 		if (iter)
 			result = copy_page_from_iter(page, 0, n, iter);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+		else if (rdata->mr)
+			result = n;
+#endif
 		else
 			result = cifs_read_page_from_socket(server, page, n);
 		if (result < 0)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index c8afb83..8a5ff90 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2379,7 +2379,40 @@  smb2_new_read_req(void **buf, unsigned int *total_len,
 	req->MinimumCount = 0;
 	req->Length = cpu_to_le32(io_parms->length);
 	req->Offset = cpu_to_le64(io_parms->offset);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	/*
+	 * If we want to do a RDMA write, fill in and append
+	 * smbd_buffer_descriptor_v1 to the end of read request
+	 */
+	if (server->rdma && rdata &&
+		rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
+
+		struct smbd_buffer_descriptor_v1 *v1;
+		bool need_invalidate =
+			io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+		rdata->mr = smbd_register_mr(
+				server->smbd_conn, rdata->pages,
+				rdata->nr_pages, rdata->tailsz,
+				true, need_invalidate);
+		if (!rdata->mr)
+			return -ENOBUFS;
+
+		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+		if (need_invalidate)
+			req->Channel = SMB2_CHANNEL_RDMA_V1;
+		req->ReadChannelInfoOffset =
+			offsetof(struct smb2_read_plain_req, Buffer);
+		req->ReadChannelInfoLength =
+			sizeof(struct smbd_buffer_descriptor_v1);
+		v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+		v1->offset = rdata->mr->mr->iova;
+		v1->token = rdata->mr->mr->rkey;
+		v1->length = rdata->mr->mr->length;
 
+		*total_len += sizeof(*v1) - 1;
+	}
+#endif
 	if (request_type & CHAINED_REQUEST) {
 		if (!(request_type & END_OF_CHAIN)) {
 			/* next 8-byte aligned request */
@@ -2458,7 +2491,17 @@  smb2_readv_callback(struct mid_q_entry *mid)
 		if (rdata->result != -ENODATA)
 			rdata->result = -EIO;
 	}
-
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	/*
+	 * If this rdata has a memory region registered, free it now.
+	 * MRs must be freed as soon as the I/O finishes to prevent deadlock,
+	 * because their number is limited and future I/Os need them.
+	 */
+	if (rdata->mr) {
+		smbd_deregister_mr(rdata->mr);
+		rdata->mr = NULL;
+	}
+#endif
 	if (rdata->result)
 		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
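
The length adjustment in the smb2_new_read_req hunk, `*total_len += sizeof(*v1) - 1`, subtracts one because the descriptor is written starting at Buffer[0], and that one placeholder byte appears to be counted in the request length already. The sketch below shows the arithmetic with toy structures. Assumptions: toy_read_req is NOT the real smb2_read_plain_req layout, only a packed struct that also ends in a one-byte Buffer; toy_descriptor_v1 follows the 8 + 4 + 4 byte layout of the MS-SMBD Buffer Descriptor V1; len_with_descriptor is a hypothetical helper; the packed attribute is GCC/Clang syntax.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the fixed part of a read request (not the real layout). */
struct toy_read_req {
	uint16_t StructureSize;
	uint32_t Length;
	uint64_t Offset;
	uint8_t  Buffer[1];	/* one placeholder byte, counted by sizeof */
} __attribute__((packed));

/* Same field sizes as the MS-SMBD Buffer Descriptor V1: 16 bytes total. */
struct toy_descriptor_v1 {
	uint64_t offset;
	uint32_t token;
	uint32_t length;
} __attribute__((packed));

/*
 * Hypothetical helper: new request length after the descriptor is
 * placed at Buffer[0]. The placeholder byte is overwritten, so it is
 * subtracted once.
 */
size_t len_with_descriptor(size_t total_len)
{
	return total_len + sizeof(struct toy_descriptor_v1) - 1;
}
```

In this toy layout the result equals offsetof(struct toy_read_req, Buffer) + sizeof(struct toy_descriptor_v1), which is exactly where the descriptor ends.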