| Message ID | 1394635867-19089-1-git-send-email-jlayton@redhat.com (mailing list archive) |
| --- | --- |
| State | New, archived |
Hi Jeff,

On 3/12/14 9:51 AM, Jeff Layton wrote:
> The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> offset into the page to be mapped. For the case of the pages array, the
> existing calculation always seems to end up at 0, which causes data
> corruption when a non-page-aligned READ request comes in. The server ends
> up doing the RDMA_WRITE from the wrong part of the page.
>
> Override the xdr_off value with the page_base in that situation to fix
> the issue. Obviously, this method can't contend with a page_base that is
> larger than PAGE_SIZE.

I think we saw page_base > PAGE_SIZE when interoperating with Sun, but
maybe someone else on the list has a clearer recollection.

I am curious, however, why xdr_off was always zero if, in fact, page_base
was not page aligned.

Thanks,
Tom

> I'm not sure if that's ever the case, but add a
> WARN_ON in the event that that ever happens.
>
> Cc: Tom Tucker <tom@opengridcomputing.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index c1d124dc772b..76e524b428d0 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -265,8 +265,11 @@ static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
>  		xdr_off -= xdr->head[0].iov_len;
>  		if (xdr_off < xdr->page_len) {
>  			/* This offset is in the page list */
> -			page = xdr->pages[xdr_off >> PAGE_SHIFT];
> -			xdr_off &= ~PAGE_MASK;
> +			int pgnum = xdr_off >> PAGE_SHIFT;
> +
> +			page = xdr->pages[pgnum];
> +			WARN_ON(xdr->page_base > PAGE_SIZE);
> +			xdr_off = pgnum ? 0 : xdr->page_base;
>  		} else {
>  			/* This offset is in the tail */
>  			xdr_off -= xdr->page_len;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 12 Mar 2014 15:43:10 -0500
Tom Tucker <tom@opengridcomputing.com> wrote:

> Hi Jeff,
>
> On 3/12/14 9:51 AM, Jeff Layton wrote:
> > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > offset into the page to be mapped. For the case of the pages array, the
> > existing calculation always seems to end up at 0, which causes data
> > corruption when a non-page-aligned READ request comes in. The server ends
> > up doing the RDMA_WRITE from the wrong part of the page.
> >
> > Override the xdr_off value with the page_base in that situation to fix
> > the issue. Obviously, this method can't contend with a page_base that is
> > larger than PAGE_SIZE.
>
> I think we saw page_base > PAGE_SIZE when interoperating with Sun, but
> maybe someone else on the list has a clearer recollection.

Ok, I may need to rejigger that logic to account for that case. I'll do
that and send a v2 once I test it out...

> I am curious, however, why xdr_off was always zero if, in fact, page_base
> was not page aligned.

Mostly I was testing with sub-page sized READ requests (which are of
course done via RDMA_WRITE). Assume we have a READ request for 512
bytes.

On the first pass into send_write, we call dma_map_xdr to map the
head. On that call, xdr_off is 0.

On the next pass, we go to send the page data. At that point,
xdr_off == head[0].iov_len:

-------------------8<----------------------
	} else {
		xdr_off -= xdr->head[0].iov_len;
		if (xdr_off < xdr->page_len) {
			/* This offset is in the page list */
			page = xdr->pages[xdr_off >> PAGE_SHIFT];
			xdr_off &= ~PAGE_MASK;
		} else {
-------------------8<----------------------

...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.

The problem is that you can't determine where in the pages array the
data to send actually starts just from the xdr_off. That just tells you
how far into the xdr buffer you are. If the data you actually want
isn't aligned to the start of the page (xdr->page_base != 0), then you
can't tell from that.

That info is tracked inside xdr->page_base, so we need to use that to
determine what the mapping offset should be.
On Wed, 12 Mar 2014 17:18:32 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 12 Mar 2014 15:43:10 -0500
> Tom Tucker <tom@opengridcomputing.com> wrote:
>
> > Hi Jeff,
> >
> > On 3/12/14 9:51 AM, Jeff Layton wrote:
> > > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > > offset into the page to be mapped. For the case of the pages array, the
> > > existing calculation always seems to end up at 0, which causes data
> > > corruption when a non-page-aligned READ request comes in. The server ends
> > > up doing the RDMA_WRITE from the wrong part of the page.
> > >
> > > Override the xdr_off value with the page_base in that situation to fix
> > > the issue. Obviously, this method can't contend with a page_base that is
> > > larger than PAGE_SIZE.
> >
> > I think we saw page_base > PAGE_SIZE when interoperating with Sun, but
> > maybe someone else on the list has a clearer recollection.
>
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...

Self-NAK on this patch...

page_base, I think, is not so much a problem as the fact that we can get
a READ request that is not page-aligned and that spans multiple pages in
the array. For instance, if you do (pseudocode):

    fd = open("/file/on/nfsordma", O_RDONLY|O_DIRECT);
    lseek(fd, 1024, SEEK_SET);
    read(fd, buf, PAGE_SIZE);

...then you get different data than if you didn't use O_DIRECT. I think
we could probably cook up a similar test using pynfs, fwiw...

The existing code doesn't handle that correctly, and neither does the
code after my patch. Fixing this correctly doesn't look trivial though,
as I think we'll have to push some awareness of the alignment of the
pages array farther up the call chain.

> > I am curious, however, why xdr_off was always zero if, in fact, page_base
> > was not page aligned.
>
> Mostly I was testing with sub-page sized READ requests (which are of
> course done via RDMA_WRITE). Assume we have a READ request for 512
> bytes.
>
> On the first pass into send_write, we call dma_map_xdr to map the
> head. On that call, xdr_off is 0.
>
> On the next pass, we go to send the page data. At that point,
> xdr_off == head[0].iov_len:
>
> -------------------8<----------------------
> 	} else {
> 		xdr_off -= xdr->head[0].iov_len;
> 		if (xdr_off < xdr->page_len) {
> 			/* This offset is in the page list */
> 			page = xdr->pages[xdr_off >> PAGE_SHIFT];
> 			xdr_off &= ~PAGE_MASK;
> 		} else {
> -------------------8<----------------------
>
> ...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.
>
> The problem is that you can't determine where in the pages array the
> data to send actually starts just from the xdr_off. That just tells you
> how far into the xdr buffer you are. If the data you actually want
> isn't aligned to the start of the page (xdr->page_base != 0), then you
> can't tell from that.
>
> That info is tracked inside xdr->page_base, so we need to use that to
> determine what the mapping offset should be.
On Wed, 12 Mar 2014 17:18:32 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 12 Mar 2014 15:43:10 -0500
> Tom Tucker <tom@opengridcomputing.com> wrote:
>
> > Hi Jeff,
> >
> > On 3/12/14 9:51 AM, Jeff Layton wrote:
> > > The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
> > > offset into the page to be mapped. For the case of the pages array, the
> > > existing calculation always seems to end up at 0, which causes data
> > > corruption when a non-page-aligned READ request comes in. The server ends
> > > up doing the RDMA_WRITE from the wrong part of the page.
> > >
> > > Override the xdr_off value with the page_base in that situation to fix
> > > the issue. Obviously, this method can't contend with a page_base that is
> > > larger than PAGE_SIZE.
> >
> > I think we saw page_base > PAGE_SIZE when interoperating with Sun, but
> > maybe someone else on the list has a clearer recollection.
>
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...
>
> > I am curious, however, why xdr_off was always zero if, in fact, page_base
> > was not page aligned.
>
> Mostly I was testing with sub-page sized READ requests (which are of
> course done via RDMA_WRITE). Assume we have a READ request for 512
> bytes.
>
> On the first pass into send_write, we call dma_map_xdr to map the
> head. On that call, xdr_off is 0.
>
> On the next pass, we go to send the page data. At that point,
> xdr_off == head[0].iov_len:
>
> -------------------8<----------------------
> 	} else {
> 		xdr_off -= xdr->head[0].iov_len;
> 		if (xdr_off < xdr->page_len) {
> 			/* This offset is in the page list */
> 			page = xdr->pages[xdr_off >> PAGE_SHIFT];
> 			xdr_off &= ~PAGE_MASK;
> 		} else {
> -------------------8<----------------------
>
> ...so we subtract xdr->head[0].iov_len, and xdr_off is now 0.
>
> The problem is that you can't determine where in the pages array the
> data to send actually starts just from the xdr_off. That just tells you
> how far into the xdr buffer you are. If the data you actually want
> isn't aligned to the start of the page (xdr->page_base != 0), then you
> can't tell from that.
>
> That info is tracked inside xdr->page_base, so we need to use that to
> determine what the mapping offset should be.

Ok, so the above is only part of the story. It's true that I never get
an xdr_off > 0 after subtracting the head's iov_len, but only if the
requested read call doesn't span more than one page, or more than one
wc_array entry.

I went over the existing code again over the last few days and came to
the conclusion that it's basically correct, and the only thing missing
is making it add in the xdr->page_base. The patch I just sent adds that
in, and from what I can tell it does the right thing on every READ call
that I tested.

I'm still seeing panics in the write codepath, which I suspect is the
same problem that Steve Wise was looking at, but I haven't yet had a
chance to look at it closely.
On 03/12/2014 11:18 PM, Jeff Layton wrote:
>
> Ok, I may need to rejigger that logic to account for that case. I'll do
> that and send a v2 once I test it out...

Hi Jeff, in obj-layout we have this code:

	if (*p_pgbase > PAGE_SIZE) {
		dprintk("%s: pgbase(0x%x) > PAGE_SIZE\n", __func__, *p_pgbase);
		*p_pages += *p_pgbase >> PAGE_SHIFT;
		*p_pgbase &= ~PAGE_MASK;
	}

i.e. advance the page-array pointer and keep the pgbase within a page.

If I recall correctly, this case happens when you return a short
read/write: the retry then comes again with the same page array, but
with the base jumping over the previous short IO, and the length set to
the remainder.

I'm not sure what the code path is that feeds your xdr stuff, but it
might not filter this out the way we do in obj-lo. The ORE is the same;
it assumes pgbase < PAGE_SIZE.

Cheers
Boaz
On Thu, 3 Apr 2014 12:43:00 +0300
Boaz Harrosh <bharrosh@panasas.com> wrote:

> On 03/12/2014 11:18 PM, Jeff Layton wrote:
> >
> > Ok, I may need to rejigger that logic to account for that case. I'll do
> > that and send a v2 once I test it out...
>
> Hi Jeff, in obj-layout we have this code:
>
> 	if (*p_pgbase > PAGE_SIZE) {
> 		dprintk("%s: pgbase(0x%x) > PAGE_SIZE\n", __func__, *p_pgbase);
> 		*p_pages += *p_pgbase >> PAGE_SHIFT;
> 		*p_pgbase &= ~PAGE_MASK;
> 	}
>
> i.e. advance the page-array pointer and keep the pgbase within a page.
>
> If I recall correctly, this case happens when you return a short
> read/write: the retry then comes again with the same page array, but
> with the base jumping over the previous short IO, and the length set to
> the remainder.
>
> I'm not sure what the code path is that feeds your xdr stuff, but it
> might not filter this out the way we do in obj-lo. The ORE is the same;
> it assumes pgbase < PAGE_SIZE.
>
> Cheers
> Boaz

Thanks Boaz,

After I had a fresh look at the logic in dma_map_xdr(), I figured out
that the logic in there should cover all of the cases we care about. It
was just that the page_base wasn't being respected.

The updated patch is this one-liner that I sent to the list:

    [PATCH v2] svcrdma: fix offset calculation for non-page aligned sge entries

That seems to fix the broken testcase that I had, which was that
sub-page DIO reads would get corrupted.

Does that patch look ok to you, or have I missed any cases?
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index c1d124dc772b..76e524b428d0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -265,8 +265,11 @@ static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
 		xdr_off -= xdr->head[0].iov_len;
 		if (xdr_off < xdr->page_len) {
 			/* This offset is in the page list */
-			page = xdr->pages[xdr_off >> PAGE_SHIFT];
-			xdr_off &= ~PAGE_MASK;
+			int pgnum = xdr_off >> PAGE_SHIFT;
+
+			page = xdr->pages[pgnum];
+			WARN_ON(xdr->page_base > PAGE_SIZE);
+			xdr_off = pgnum ? 0 : xdr->page_base;
 		} else {
 			/* This offset is in the tail */
 			xdr_off -= xdr->page_len;
The xdr_off value in dma_map_xdr gets passed to ib_dma_map_page as the
offset into the page to be mapped. For the case of the pages array, the
existing calculation always seems to end up at 0, which causes data
corruption when a non-page-aligned READ request comes in. The server
ends up doing the RDMA_WRITE from the wrong part of the page.

Override the xdr_off value with the page_base in that situation to fix
the issue. Obviously, this method can't contend with a page_base that is
larger than PAGE_SIZE. I'm not sure if that's ever the case, but add a
WARN_ON in the event that it ever happens.

Cc: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)