
svcrdma: fix offset calculation for non-page aligned sge entries

Message ID 1394635867-19089-1-git-send-email-jlayton@redhat.com (mailing list archive)
State New, archived

Commit Message

Jeff Layton March 12, 2014, 2:51 p.m. UTC
The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
offset into the page to be mapped. For the case of the pages array, the
existing calculation always seems to end up at 0, which causes data
corruption when a non-page-aligned READ request comes in. The server ends
up doing the RDMA_WRITE from the wrong part of the page.

Override the xdr_off value with the page_base in that situation to fix
the issue. Obviously, this method can't cope with a page_base that is
larger than PAGE_SIZE. I'm not sure whether that ever happens, but add a
WARN_ON in the event that it does.

Cc: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Tom Tucker March 12, 2014, 8:43 p.m. UTC | #1
Hi Jeff,


On 3/12/14 9:51 AM, Jeff Layton wrote:
> The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> offset into the page to be mapped. For the case of the pages array, the
> existing calculation always seems to end up at 0, which causes data
> corruption when a non-page-aligned READ request comes in. The server ends
> up doing the RDMA_WRITE from the wrong part of the page.
>
> Override the xdr_off value with the page_base in that situation to fix
> the issue. Obviously, this method can't contend with a page_base that is
> larger than PAGE_SIZE.
I think we saw page_base > PAGE_SIZE when interoperating with Sun, but 
maybe someone else on the list has a clearer recollection.

I am curious, however, why xdr_off was always zero if, in fact, page_base 
was not page-aligned.

Thanks,
Tom
> I'm not sure if that's ever the case, but add a
> WARN_ON in the event that that ever happens.
>
> Cc: Tom Tucker <tom@opengridcomputing.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>   net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index c1d124dc772b..76e524b428d0 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -265,8 +265,11 @@ static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
>   		xdr_off -= xdr->head[0].iov_len;
>   		if (xdr_off < xdr->page_len) {
>   			/* This offset is in the page list */
> -			page = xdr->pages[xdr_off >> PAGE_SHIFT];
> -			xdr_off &= ~PAGE_MASK;
> +			int pgnum = xdr_off >> PAGE_SHIFT;
> +
> +			page = xdr->pages[pgnum];
> +			WARN_ON(xdr->page_base > PAGE_SIZE);
> +			xdr_off = pgnum ? 0 : xdr->page_base;
>   		} else {
>   			/* This offset is in the tail */
>   			xdr_off -= xdr->page_len;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton March 12, 2014, 9:18 p.m. UTC | #2
On Wed, 12 Mar 2014 15:43:10 -0500
Tom Tucker <tom@opengridcomputing.com> wrote:

> Hi Jeff,
> 
> 
> On 3/12/14 9:51 AM, Jeff Layton wrote:
> > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > offset into the page to be mapped. For the case of the pages array, the
> > existing calculation always seems to end up at 0, which causes data
> > corruption when a non-page-aligned READ request comes in. The server ends
> > up doing the RDMA_WRITE from the wrong part of the page.
> >
> > Override the xdr_off value with the page_base in that situation to fix
> > the issue. Obviously, this method can't contend with a page_base that is
> > larger than PAGE_SIZE.
> I think we saw page base > PAGE_SIZE when inter-operating with Sun, but 
> maybe someone else on the list has a clearer recollection.
> 

Ok, I may need to rejigger that logic to account for that case. I'll do
that and send a v2 once I test it out...

> I am curious, however, why xdr_off was always zero if in fact, page_base 
> was not page aligned.
> 

Mostly I was testing with sub-page sized READ requests (which are of
course done via RDMA_WRITE). Assume we have a READ request for 512
bytes.

On the first pass into send_write, we call dma_map_xdr to map the
head. On that call xdr_off is 0.

On the next pass, we go to send the page data. At that point,
xdr_off == head[0].iov_len:

-------------------8<----------------------
        } else {
                xdr_off -= xdr->head[0].iov_len;
                if (xdr_off < xdr->page_len) {
                        /* This offset is in the page list */
                        page = xdr->pages[xdr_off >> PAGE_SHIFT];
                        xdr_off &= ~PAGE_MASK;
                } else {
-------------------8<----------------------

...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.

The problem is that you can't determine where in the pages array the
data to send actually starts from xdr_off alone. That just tells you
how far into the xdr buffer you are. If the data you actually want
isn't aligned to the start of the page (xdr->page_base != 0), then you
can't tell from that.

That info is tracked in xdr->page_base, so we need to use it to
determine what the mapping offset should be.
Jeff Layton March 15, 2014, 1:30 p.m. UTC | #3
On Wed, 12 Mar 2014 17:18:32 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 12 Mar 2014 15:43:10 -0500
> Tom Tucker <tom@opengridcomputing.com> wrote:
> 
> > Hi Jeff,
> > 
> > 
> > On 3/12/14 9:51 AM, Jeff Layton wrote:
> > > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > > offset into the page to be mapped. For the case of the pages array, the
> > > existing calculation always seems to end up at 0, which causes data
> > > corruption when a non-page-aligned READ request comes in. The server ends
> > > up doing the RDMA_WRITE from the wrong part of the page.
> > >
> > > Override the xdr_off value with the page_base in that situation to fix
> > > the issue. Obviously, this method can't contend with a page_base that is
> > > larger than PAGE_SIZE.
> > I think we saw page base > PAGE_SIZE when inter-operating with Sun, but 
> > maybe someone else on the list has a clearer recollection.
> > 
> 
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...
> 

Self-NAK on this patch...

I think page_base is not so much the problem as the fact that we can get
a READ request that is not page-aligned and that spans multiple pages
in the array.

For instance, if you do (pseudocode):

    fd = open("/file/on/nfsordma", O_RDONLY|O_DIRECT);
    lseek(fd, 1024, SEEK_SET);
    read(fd, buf, PAGE_SIZE);

...then you get different data than if you didn't use O_DIRECT. I think
we could probably cook up a similar test using pynfs, fwiw...

The existing code doesn't handle that correctly, and neither does the
code after my patch. Fixing this correctly doesn't look trivial though,
as I think we'll have to push some awareness of the alignment of the
pages array farther up the call chain.


> > I am curious, however, why xdr_off was always zero if in fact, page_base 
> > was not page aligned.
> > 
> 
> Mostly I was testing with sub-page sized READ requests (which are of
> course done via RDMA_WRITE). Assume we have a READ request for 512
> bytes.
> 
> On the first pass into send_write, we call dma_map_xdr to map the
> head. On that call xdr_off is 0.
> 
> On the next pass, we go to send the page data. At that point,
> xdr_off == head[0].iov_len:
> 
> -------------------8<----------------------
>         } else {
>                 xdr_off -= xdr->head[0].iov_len;
>                 if (xdr_off < xdr->page_len) {
>                         /* This offset is in the page list */
>                         page = xdr->pages[xdr_off >> PAGE_SHIFT];
>                         xdr_off &= ~PAGE_MASK;
>                 } else {
> -------------------8<----------------------
> 
> ...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.
> 
> The problem is that you can't determine where in the pages array the
> data to send actually starts just from the xdr_off. That just tells you
> how far into the xdr buffer you are. If the data you actually want
> isn't aligned to the start of the page (xdr->page_base !=0) then you
> can't tell from that.
> 
> That info is tracked inside the xdr->page_base, so we need use that to
> determine what the mapping offset should be.
>
Jeff Layton March 17, 2014, 5:17 p.m. UTC | #4
On Wed, 12 Mar 2014 17:18:32 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 12 Mar 2014 15:43:10 -0500
> Tom Tucker <tom@opengridcomputing.com> wrote:
> 
> > Hi Jeff,
> > 
> > 
> > On 3/12/14 9:51 AM, Jeff Layton wrote:
> > > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > > offset into the page to be mapped. For the case of the pages array, the
> > > existing calculation always seems to end up at 0, which causes data
> > > corruption when a non-page-aligned READ request comes in. The server ends
> > > up doing the RDMA_WRITE from the wrong part of the page.
> > >
> > > Override the xdr_off value with the page_base in that situation to fix
> > > the issue. Obviously, this method can't contend with a page_base that is
> > > larger than PAGE_SIZE.
> > I think we saw page base > PAGE_SIZE when inter-operating with Sun, but 
> > maybe someone else on the list has a clearer recollection.
> > 
> 
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...
> 
> > I am curious, however, why xdr_off was always zero if in fact, page_base 
> > was not page aligned.
> > 
> 
> Mostly I was testing with sub-page sized READ requests (which are of
> course done via RDMA_WRITE). Assume we have a READ request for 512
> bytes.
> 
> On the first pass into send_write, we call dma_map_xdr to map the
> head. On that call xdr_off is 0.
> 
> On the next pass, we go to send the page data. At that point,
> xdr_off == head[0].iov_len:
> 
> -------------------8<----------------------
>         } else {
>                 xdr_off -= xdr->head[0].iov_len;
>                 if (xdr_off < xdr->page_len) {
>                         /* This offset is in the page list */
>                         page = xdr->pages[xdr_off >> PAGE_SHIFT];
>                         xdr_off &= ~PAGE_MASK;
>                 } else {
> -------------------8<----------------------
> 
> ...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.
> 
> The problem is that you can't determine where in the pages array the
> data to send actually starts just from the xdr_off. That just tells you
> how far into the xdr buffer you are. If the data you actually want
> isn't aligned to the start of the page (xdr->page_base !=0) then you
> can't tell from that.
> 
> That info is tracked inside the xdr->page_base, so we need use that to
> determine what the mapping offset should be.
> 

Ok, so the above is only part of the story. It's true that I never get
an xdr_off > 0 after subtracting the head's iov_len, but only if the
requested read call doesn't span more than one page, or more than one
wc_array entry.

I went over the existing code again over the last few days and came to
the conclusion that it's basically correct, and the only thing missing
is making it add in the xdr->page_base. The patch I just sent adds that
in, and from what I can tell it does the right thing on every READ call
that I tested.

I'm still seeing panics in the write codepath, which I suspect is the
same problem that Steve Wise was looking at, but I haven't yet had a
chance to look at that closely.
Boaz Harrosh April 3, 2014, 9:43 a.m. UTC | #5
On 03/12/2014 11:18 PM, Jeff Layton wrote:
> 
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...
> 

Hi Jeff in obj-layout we have this code:

	if (*p_pgbase > PAGE_SIZE) {
		dprintk("%s: pgbase(0x%x) > PAGE_SIZE\n", __func__, *p_pgbase);
		*p_pages += *p_pgbase >> PAGE_SHIFT;
		*p_pgbase &= ~PAGE_MASK;
	}

ie. advance the page_array pointer and keep the pgbase within page.

If I recall correctly, this case happens when you return a short
read/write; the retry then comes again with the same page_array, but
with the base jumping past the previous short I/O and the length set to
the remainder.

I'm not sure which code path feeds your xdr stuff, but it might not
filter this out the way we do in obj-lo. The ORE is the same; it
assumes pgbase < PAGE_SIZE.

Cheers
Boaz

Jeff Layton April 3, 2014, 11:20 a.m. UTC | #6
On Thu, 3 Apr 2014 12:43:00 +0300
Boaz Harrosh <bharrosh@panasas.com> wrote:

> On 03/12/2014 11:18 PM, Jeff Layton wrote:
> > 
> > Ok, I may need to rejigger that logic to account for that case. I'll do
> > that and send a v2 once I test it out...
> > 
> 
> Hi Jeff in obj-layout we have this code:
> 
> 	if (*p_pgbase > PAGE_SIZE) {
> 		dprintk("%s: pgbase(0x%x) > PAGE_SIZE\n", __func__, *p_pgbase);
> 		*p_pages += *p_pgbase >> PAGE_SHIFT;
> 		*p_pgbase &= ~PAGE_MASK;
> 	}
> 
> ie. advance the page_array pointer and keep the pgbase within page.
> 
> if I recall correctly this case happens when you return a short read/write
> then the retry comes again with same page_array but with base jumping over
> the previous short IO, and length with the reminder.
> 
> I'm not sure what is the code path that feeds your xdr stuff but it
> might not filter out the way we do in obj-lo. the ORE is the same
> it assumes pgbase < PAGE_SIZE.
> 
> Cheers
> Boaz
> 

Thanks Boaz,

After taking a fresh look at the logic in dma_map_xdr(), I figured out
that it should cover all of the cases we care about. It was just that
the page_base wasn't being respected. The updated patch is this
one-liner that I sent to the list:

     [PATCH v2] svcrdma: fix offset calculation for non-page aligned sge entries

That seems to fix the broken testcase that I had, which was that
sub-page DIO reads would get corrupted. Does that patch look ok to you,
or have I missed any cases?

Patch

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index c1d124dc772b..76e524b428d0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -265,8 +265,11 @@  static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
 		xdr_off -= xdr->head[0].iov_len;
 		if (xdr_off < xdr->page_len) {
 			/* This offset is in the page list */
-			page = xdr->pages[xdr_off >> PAGE_SHIFT];
-			xdr_off &= ~PAGE_MASK;
+			int pgnum = xdr_off >> PAGE_SHIFT;
+
+			page = xdr->pages[pgnum];
+			WARN_ON(xdr->page_base > PAGE_SIZE);
+			xdr_off = pgnum ? 0 : xdr->page_base;
 		} else {
 			/* This offset is in the tail */
 			xdr_off -= xdr->page_len;