NFS DIO overhaul - Patchwork

Message ID	20110517104532.72b7f588@tlielax.poochiereds.net (mailing list archive)
State	New, archived
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.3) with ESMTP id p4HEjb09011226 for <patchwork-linux-nfs@patchwork.kernel.org>; Tue, 17 May 2011 14:45:38 GMT Date: Tue, 17 May 2011 10:45:32 -0400 From: Jeff Layton <jlayton@redhat.com> To: Trond.Myklebust@netapp.com Cc: Badari Pulavarty <pbadari@us.ibm.com>, linux-nfs@vger.kernel.org Subject: NFS DIO overhaul Message-ID: <20110517104532.72b7f588@tlielax.poochiereds.net> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk

Message ID

20110517104532.72b7f588@tlielax.poochiereds.net (mailing list archive)

State

New, archived

Headers

Date: Tue, 17 May 2011 10:45:32 -0400
From: Jeff Layton <jlayton@redhat.com>
To: Trond.Myklebust@netapp.com
Cc: Badari Pulavarty <pbadari@us.ibm.com>, linux-nfs@vger.kernel.org
Subject: NFS DIO overhaul
Message-ID: <20110517104532.72b7f588@tlielax.poochiereds.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk

Commit Message

Jeff Layton May 17, 2011, 2:45 p.m. UTC

Hi Trond,

You recently mentioned on the list that you were working on an overhaul
of the DIO code that would merge large parts of it with the buffered IO
routines.

I also started some work on an overhaul of the DIO code to make it work
better with arbitrary sized arrays of iovecs. I've only really started
it, but I have the following routine so far. The idea here is to always
try to max out the size of a direct I/O READ or WRITE regardless of the
layout of the iovec array. Once we have the count, we can then just
marshal up a single read or write and send it, advance the iov_iter and
repeat until we're done.

Obviously, it's rather complicated logic, but I think this is doable...

My question is -- is this something you would consider taking if I were
to finish it up in the near future? Or should I not bother since you're
working on an overhaul of this code?

My main concern is that we have some existing shipping code where we
need to fix this, and  the changes you're planning to make may be too
invasive to backport. If that's the case, then I may need to do
something along these lines anyway (or just take Badari's latest patch).

Thanks...My "starter" patch follows:

------------------------[snip]------------------------

[PATCH] nfs: add a direct I/O count routine

...to count the number of bytes that we can stick in a single r/w req.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/direct.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 90 insertions(+), 0 deletions(-)

Comments

Chuck Lever May 17, 2011, 2:50 p.m. UTC | #1

Hi Jeff, Badari-

On May 17, 2011, at 10:45 AM, Jeff Layton wrote:

> Hi Trond,
> 
> You recently mentioned on the list that you were working on an overhaul
> of the DIO code that would merge large parts of it with the buffered IO
> routines.
> 
> I also started some work on an overhaul of the DIO code to make it work
> better with arbitrary sized arrays of iovecs. I've only really started
> it, but I have the following routine so far. The idea here is to always
> try to max out the size of a direct I/O READ or WRITE regardless of the
> layout of the iovec array. Once we have the count, we can then just
> marshal up a single read or write and send it, advance the iov_iter and
> repeat until we're done.
> 
> Obviously, it's rather complicated logic, but I think this is doable...
> 
> My question is -- is this something you would consider taking if I were
> to finish it up in the near future? Or should I not bother since you're
> working on an overhaul of this code?
> 
> My main concern is that we have some existing shipping code where we
> need to fix this, and  the changes you're planning to make may be too
> invasive to backport. If that's the case, then I may need to do
> something along these lines anyway (or just take Badari's latest patch).

Do we have detailed analysis now on why simply sending these as separate UNSTABLE writes followed by a COMMIT is not sufficient to address performance issues?

> Thanks...My "starter" patch follows:
> 
> ------------------------[snip]------------------------
> 
> [PATCH] nfs: add a direct I/O count routine
> 
> ...to count the number of bytes that we can stick in a single r/w req.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> fs/nfs/direct.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 90 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index c8add8d..15658fe 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -227,6 +227,96 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
> }
> 
> /*
> + * Walk an iovec array, and count bytes until it hits "maxsize" or until it
> + * hits a "discontinuity" in the array.
> + */
> +unsigned int
> +nfs_direct_count(const struct iov_iter *i_orig, unsigned int maxsize)
> +{
> +	struct iov_iter i;
> +	unsigned int count = 0, end_of_page, end_of_iov, end_of_seg;
> +	unsigned long base;
> +
> +	/* copy iov_iter so original is preserved */
> +	memcpy(&i, i_orig, sizeof(i));
> +
> +	/*
> +	 * Determine how many bytes to end of page. Advance to end of page or
> +	 * end of current iovec, or "size" bytes, whichever is closer.
> +	 *
> +	 * If end of a page, then continue to next page and repeat.
> +	 *
> +	 * If end of iovec, then see whether current iov ends on
> +	 * page boundary and next one starts on one. If so, then
> +	 * repeat.
> +	 *
> +	 * If not, then see if the iov's are "continuous". Are previous
> +	 * end and current beginning on same page? If so, then see if
> +	 * current end is just before next beginning. If so, repeat.
> +	 *
> +	 * If none of the conditions above are true, then break and
> +	 * return current count of bytes.
> +	 */
> +	for (;;) {
> +		base = (unsigned long)i.iov->iov_base + i.iov_offset;
> +
> +		/* bytes to end of page */
> +		end_of_page = PAGE_SIZE - (base & (PAGE_SIZE - 1));
> +
> +		/* bytes to end of iov */
> +		end_of_iov = i.iov->iov_len - i.iov_offset;
> +
> +		/* find min of end of page, end of iov, and r/wsize */
> +		end_of_seg = min(end_of_page, end_of_iov);
> +		end_of_seg = min(end_of_seg, maxsize);
> +
> +		iov_iter_advance(&i, end_of_seg);
> +		maxsize -= end_of_seg;
> +		count += end_of_seg;
> +
> +		/* Did we hit the end of array or maxsize? */
> +		if (maxsize - end_of_seg == 0 || i.count == 0)
> +			break;
> +
> +		/*
> +		 * Else end_of_seg either == end_of_page or end_of_iov,
> +		 * and iov has more data in it.
> +		 */
> +
> +		/* Did we hit a page boundary, but next segement is in same
> +		 * iov? If so, then carry on.
> +		 */
> +		if (end_of_page != end_of_iov && end_of_seg == end_of_page)
> +			goto advance_ii;
> +
> +		/* Did iov end on a page boundary? Ensure next starts on one. */
> +		if (end_of_page == end_of_iov &&
> +		    (unsigned long)i.iov[1].iov_base & (PAGE_SIZE - 1))
> +			break;
> +
> +		/* Make sure next iovec starts on same page as first */
> +		if ((base & PAGE_MASK) !=
> +		    ((unsigned long)i.iov[1].iov_base & PAGE_MASK))
> +			break;
> +
> +		/* and the offsets match up */
> +		if (end_of_iov + 1 != (unsigned long)i.iov[1].iov_base)
> +			break;
> +
> +advance_ii:
> +		/* Everything checks out. Advance into next iov and continue */
> +		iov_iter_advance(&i, 1);
> +		--maxsize;
> +		++count;
> +		/* Did we hit the end of array or maxsize? */
> +		if (maxsize == 0 || i.count == 0)
> +			break;
> +	}
> +
> +	return count;
> +}
> +
> +/*
>  * We must hold a reference to all the pages in this direct read request
>  * until the RPCs complete.  This could be long *after* we are woken up in
>  * nfs_direct_wait (for instance, if someone hits ^C on a slow server).
> -- 
> 1.7.1
> 
> 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jeff Layton May 17, 2011, 2:54 p.m. UTC | #2

On Tue, 17 May 2011 10:50:23 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> Hi Jeff, Badari-
> 
> On May 17, 2011, at 10:45 AM, Jeff Layton wrote:
> 
> > Hi Trond,
> > 
> > You recently mentioned on the list that you were working on an overhaul
> > of the DIO code that would merge large parts of it with the buffered IO
> > routines.
> > 
> > I also started some work on an overhaul of the DIO code to make it work
> > better with arbitrary sized arrays of iovecs. I've only really started
> > it, but I have the following routine so far. The idea here is to always
> > try to max out the size of a direct I/O READ or WRITE regardless of the
> > layout of the iovec array. Once we have the count, we can then just
> > marshal up a single read or write and send it, advance the iov_iter and
> > repeat until we're done.
> > 
> > Obviously, it's rather complicated logic, but I think this is doable...
> > 
> > My question is -- is this something you would consider taking if I were
> > to finish it up in the near future? Or should I not bother since you're
> > working on an overhaul of this code?
> > 
> > My main concern is that we have some existing shipping code where we
> > need to fix this, and  the changes you're planning to make may be too
> > invasive to backport. If that's the case, then I may need to do
> > something along these lines anyway (or just take Badari's latest patch).
> 
> Do we have detailed analysis now on why simply sending these as separate UNSTABLE writes followed by a COMMIT is not sufficient to address performance issues?
> 

They've done some benchmarking that shows that performance plateaus out
earlier with that approach. That approach also does absolutely nothing
for the read codepath which has similar problems.


> > Thanks...My "starter" patch follows:
> > 
> > ------------------------[snip]------------------------
> > 
> > [PATCH] nfs: add a direct I/O count routine
> > 
> > ...to count the number of bytes that we can stick in a single r/w req.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/nfs/direct.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 90 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> > index c8add8d..15658fe 100644
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -227,6 +227,96 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
> > }
> > 
> > /*
> > + * Walk an iovec array, and count bytes until it hits "maxsize" or until it
> > + * hits a "discontinuity" in the array.
> > + */
> > +unsigned int
> > +nfs_direct_count(const struct iov_iter *i_orig, unsigned int maxsize)
> > +{
> > +	struct iov_iter i;
> > +	unsigned int count = 0, end_of_page, end_of_iov, end_of_seg;
> > +	unsigned long base;
> > +
> > +	/* copy iov_iter so original is preserved */
> > +	memcpy(&i, i_orig, sizeof(i));
> > +
> > +	/*
> > +	 * Determine how many bytes to end of page. Advance to end of page or
> > +	 * end of current iovec, or "size" bytes, whichever is closer.
> > +	 *
> > +	 * If end of a page, then continue to next page and repeat.
> > +	 *
> > +	 * If end of iovec, then see whether current iov ends on
> > +	 * page boundary and next one starts on one. If so, then
> > +	 * repeat.
> > +	 *
> > +	 * If not, then see if the iov's are "continuous". Are previous
> > +	 * end and current beginning on same page? If so, then see if
> > +	 * current end is just before next beginning. If so, repeat.
> > +	 *
> > +	 * If none of the conditions above are true, then break and
> > +	 * return current count of bytes.
> > +	 */
> > +	for (;;) {
> > +		base = (unsigned long)i.iov->iov_base + i.iov_offset;
> > +
> > +		/* bytes to end of page */
> > +		end_of_page = PAGE_SIZE - (base & (PAGE_SIZE - 1));
> > +
> > +		/* bytes to end of iov */
> > +		end_of_iov = i.iov->iov_len - i.iov_offset;
> > +
> > +		/* find min of end of page, end of iov, and r/wsize */
> > +		end_of_seg = min(end_of_page, end_of_iov);
> > +		end_of_seg = min(end_of_seg, maxsize);
> > +
> > +		iov_iter_advance(&i, end_of_seg);
> > +		maxsize -= end_of_seg;
> > +		count += end_of_seg;
> > +
> > +		/* Did we hit the end of array or maxsize? */
> > +		if (maxsize - end_of_seg == 0 || i.count == 0)
> > +			break;
> > +
> > +		/*
> > +		 * Else end_of_seg either == end_of_page or end_of_iov,
> > +		 * and iov has more data in it.
> > +		 */
> > +
> > +		/* Did we hit a page boundary, but next segement is in same
> > +		 * iov? If so, then carry on.
> > +		 */
> > +		if (end_of_page != end_of_iov && end_of_seg == end_of_page)
> > +			goto advance_ii;
> > +
> > +		/* Did iov end on a page boundary? Ensure next starts on one. */
> > +		if (end_of_page == end_of_iov &&
> > +		    (unsigned long)i.iov[1].iov_base & (PAGE_SIZE - 1))
> > +			break;
> > +
> > +		/* Make sure next iovec starts on same page as first */
> > +		if ((base & PAGE_MASK) !=
> > +		    ((unsigned long)i.iov[1].iov_base & PAGE_MASK))
> > +			break;
> > +
> > +		/* and the offsets match up */
> > +		if (end_of_iov + 1 != (unsigned long)i.iov[1].iov_base)
> > +			break;
> > +
> > +advance_ii:
> > +		/* Everything checks out. Advance into next iov and continue */
> > +		iov_iter_advance(&i, 1);
> > +		--maxsize;
> > +		++count;
> > +		/* Did we hit the end of array or maxsize? */
> > +		if (maxsize == 0 || i.count == 0)
> > +			break;
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +/*
> >  * We must hold a reference to all the pages in this direct read request
> >  * until the RPCs complete.  This could be long *after* we are woken up in
> >  * nfs_direct_wait (for instance, if someone hits ^C on a slow server).
> > -- 
> > 1.7.1
> > 
> > 
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c8add8d..15658fe 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -227,6 +227,96 @@  static void nfs_direct_complete(struct nfs_direct_req *dreq)
 }
 
 /*
+ * Walk an iovec array, and count bytes until it hits "maxsize" or until it
+ * hits a "discontinuity" in the array.
+ */
+unsigned int
+nfs_direct_count(const struct iov_iter *i_orig, unsigned int maxsize)
+{
+	struct iov_iter i;
+	unsigned int count = 0, end_of_page, end_of_iov, end_of_seg;
+	unsigned long base;
+
+	/* copy iov_iter so original is preserved */
+	memcpy(&i, i_orig, sizeof(i));
+
+	/*
+	 * Determine how many bytes to end of page. Advance to end of page or
+	 * end of current iovec, or "size" bytes, whichever is closer.
+	 *
+	 * If end of a page, then continue to next page and repeat.
+	 *
+	 * If end of iovec, then see whether current iov ends on
+	 * page boundary and next one starts on one. If so, then
+	 * repeat.
+	 *
+	 * If not, then see if the iov's are "continuous". Are previous
+	 * end and current beginning on same page? If so, then see if
+	 * current end is just before next beginning. If so, repeat.
+	 *
+	 * If none of the conditions above are true, then break and
+	 * return current count of bytes.
+	 */
+	for (;;) {
+		base = (unsigned long)i.iov->iov_base + i.iov_offset;
+
+		/* bytes to end of page */
+		end_of_page = PAGE_SIZE - (base & (PAGE_SIZE - 1));
+
+		/* bytes to end of iov */
+		end_of_iov = i.iov->iov_len - i.iov_offset;
+
+		/* find min of end of page, end of iov, and r/wsize */
+		end_of_seg = min(end_of_page, end_of_iov);
+		end_of_seg = min(end_of_seg, maxsize);
+
+		iov_iter_advance(&i, end_of_seg);
+		maxsize -= end_of_seg;
+		count += end_of_seg;
+
+		/* Did we hit the end of array or maxsize? */
+		if (maxsize - end_of_seg == 0 || i.count == 0)
+			break;
+
+		/*
+		 * Else end_of_seg either == end_of_page or end_of_iov,
+		 * and iov has more data in it.
+		 */
+
+		/* Did we hit a page boundary, but next segement is in same
+		 * iov? If so, then carry on.
+		 */
+		if (end_of_page != end_of_iov && end_of_seg == end_of_page)
+			goto advance_ii;
+
+		/* Did iov end on a page boundary? Ensure next starts on one. */
+		if (end_of_page == end_of_iov &&
+		    (unsigned long)i.iov[1].iov_base & (PAGE_SIZE - 1))
+			break;
+
+		/* Make sure next iovec starts on same page as first */
+		if ((base & PAGE_MASK) !=
+		    ((unsigned long)i.iov[1].iov_base & PAGE_MASK))
+			break;
+
+		/* and the offsets match up */
+		if (end_of_iov + 1 != (unsigned long)i.iov[1].iov_base)
+			break;
+
+advance_ii:
+		/* Everything checks out. Advance into next iov and continue */
+		iov_iter_advance(&i, 1);
+		--maxsize;
+		++count;
+		/* Did we hit the end of array or maxsize? */
+		if (maxsize == 0 || i.count == 0)
+			break;
+	}
+
+	return count;
+}
+
+/*
  * We must hold a reference to all the pages in this direct read request
  * until the RPCs complete.  This could be long *after* we are woken up in
  * nfs_direct_wait (for instance, if someone hits ^C on a slow server).