
[RFC] Vector read/write support for NFS (DIO) client

Message ID 1302622335.3877.62.camel@badari-desktop (mailing list archive)
State New, archived
Headers show

Commit Message

Badari Pulavarty April 12, 2011, 3:32 p.m. UTC
Hi,

We recently ran into a serious performance issue with the NFS client.
It turned out to be due to the lack of readv/writev support in the
NFS (O_DIRECT) client.

Here is our use-case:

In our cloud environment, our storage is over NFS. Files
on NFS are passed as block devices to the guest (using
O_DIRECT). When the guest does IO on these block devices,
it ends up as O_DIRECT writes to NFS (on the KVM host).

QEMU (on the host) gets a vector from the virtio ring and
submits it. Old versions of QEMU linearized the vector
they got from KVM (copied it into a single buffer) and
submitted that buffer, so the NFS client always received a
single buffer.

Later versions of QEMU eliminated this copy and submit the
vector directly using preadv()/pwritev().

The NFS client loops through the vector and submits each
element as a separate request whenever the IO is < wsize. In our
case (negotiated wsize=1MB), a 256K IO arrives as 64 elements of
4K each, so we end up submitting 64 4K FILE_SYNC IOs and the
server has to complete each 4K write synchronously. This causes
serious performance degradation. We are checking whether the
performance improves if we convert the IOs to ASYNC - but
our initial results don't look good.

Supporting readv/writev in the NFS client for all possible cases
is hard. However, if all vector elements are page-aligned and the
IO sizes are page multiples, it fits the current code easily.
Luckily, the QEMU use case meets these requirements.
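
To make the requirement concrete, here is a minimal userspace sketch
(illustrative only; the file name and sizes are assumptions) of the kind
of vector that qualifies: every iovec base is page-aligned, every length
is a page multiple, and the total fits within wsize - the pattern QEMU
generates.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	struct iovec iov[64];
	int fd, i;

	/* hypothetical image file on the NFS mount */
	fd = open("/mnt/nfs/disk.img", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* 64 one-page segments: a 256K IO with 4K pages */
	for (i = 0; i < 64; i++) {
		if (posix_memalign(&iov[i].iov_base, psize, psize))
			return 1;
		iov[i].iov_len = psize;
	}

	/* one syscall carries the whole vector to the NFS client */
	if (pwritev(fd, iov, 64, 0) < 0)
		return 1;

	close(fd);
	return 0;
}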

Here is the patch to add this support. Comments?

Thanks,
Badari

Special readv/writev support for NFS O_DIRECT. Currently the NFS
client O_DIRECT read/write path iterates over the individual elements
in the vector, submits an IO for each one, and waits for them to
finish. If the individual IO size is < r/wsize, each IO is sent
FILE_SYNC and the server has to complete each of the smaller IOs
synchronously. This causes serious performance degradation.

This patch adds a special case that submits one larger IO from the
client, but only if all the IO buffers are page-aligned, each
individual IO size is a page multiple, and the total IO size is no
larger than r/wsize.

This is the common case for QEMU usage.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
---
 fs/nfs/direct.c |  259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 259 insertions(+)




Comments

Chuck Lever III April 12, 2011, 3:36 p.m. UTC | #1
On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:

> Hi,
> 
> We recently ran into serious performance issue with NFS client.
> It turned out that its due to lack of readv/write support for
> NFS (O_DIRECT) client.
> 
> Here is our use-case:
> 
> In our cloud environment, our storage is over NFS. Files
> on NFS are passed as a blockdevices to the guest (using
> O_DIRECT). When guest is doing IO on these block devices,
> they will end up as O_DIRECT writes to NFS (on KVM host).
> 
> QEMU (on the host) gets a vector from virtio-ring and
> submits them. Old versions of QEMU, linearized the vector
> it got from KVM (copied them into a buffer) and submits
> the buffer. So, NFS client always received a single buffer.
> 
> Later versions of QEMU, eliminated this copy and submits
> a vector directly using preadv/pwritev().
> 
> NFS client loops through the vector and submits each
> vector as separate request for each IO < wsize. In our
> case (negotiated wsize=1MB), for 256K IO - we get 64 
> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> Server end up doing each 4K synchronously. This causes
> serious performance degrade. We are trying to see if the
> performance improves if we convert IOs to ASYNC - but
> our initial results doesn't look good.
> 
> readv/writev support NFS client for all possible cases is
> hard. Instead, if all vectors are page-aligned and 
> iosizes page-multiple - it fits the current code easily.
> Luckily, QEMU use-case fits these requirements.
> 
> Here is the patch to add this support. Comments ?

Restricting buffer alignment requirements would be an onerous API change, IMO.

If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
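
As an illustration of that idea (a rough sketch only, not Jeff Layton's
actual patch; the helper name is made up): pick the stable-how value
based on whether the O_DIRECT write will be split into multiple WRITE
RPCs, and let a single COMMIT cover the unstable pieces.

#include <linux/nfs.h>	/* enum nfs3_stable_how */

/* Hypothetical helper: decide how a direct write should be committed. */
static int nfs_direct_choose_stable(size_t total_bytes, size_t wsize,
				    unsigned long nr_segs)
{
	/* a write that fits in one RPC can stay FILE_SYNC */
	if (nr_segs == 1 && total_bytes <= wsize)
		return NFS_FILE_SYNC;

	/* anything that will be split goes UNSTABLE, followed by COMMIT */
	return NFS_UNSTABLE;
}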

> 
> Thanks,
> Badari
> 
> Special readv/writev support for NFS O_DIRECT. Currently NFS 
> client O_DIRECT read/write iterates over individual elements 
> in the vector and submits the IO and waits for them to finish. 
> If individual IOsize is < r/wsize, each IO will be FILE_SYNC. 
> Server has to finish individual smaller IOs synchronous. This 
> causes serious performance degrade.
> 
> This patch is a special case to submit larger IOs from the client
> only if all the IO buffers are page-aligned and each individual
> IOs are page-multiple + total IO size is < r/wsize. 
> 
> This is the common case for QEMU usage.
> 
> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> ---
> fs/nfs/direct.c |  259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 259 insertions(+)
> 
> Index: linux-2.6.38.2/fs/nfs/direct.c
> ===================================================================
> --- linux-2.6.38.2.orig/fs/nfs/direct.c	2011-04-12 10:56:29.374266292 -0400
> +++ linux-2.6.38.2/fs/nfs/direct.c	2011-04-12 11:01:21.883266310 -0400
> @@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
> 	.rpc_release = nfs_direct_read_release,
> };
> 
> +static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
> +				      const struct iovec *iov,
> +                                      unsigned long nr_segs, int rw)
> +{
> +	unsigned long seg;
> +	int pages = 0;
> +	struct nfs_open_context *ctx = dreq->ctx;
> +	struct inode *inode = ctx->path.dentry->d_inode;
> +	size_t size;
> +
> +
> +	/* If its a single vector - use old code */
> +	if (nr_segs == 1)
> +		return pages;
> +	/*
> +	 * Check to see if all IO buffers are page-aligned and
> +	 * individual IO sizes are page-multiple.
> +	 * If so, we can submit them in a single IO.
> +	 * Otherwise, use old code
> +	 */
> +	for (seg = 0; seg < nr_segs; seg++) {
> +		unsigned long user_addr = (unsigned long)iov[seg].iov_base;
> +
> +		if ((user_addr & (PAGE_SIZE - 1)) ||
> +			(iov[seg].iov_len & (PAGE_SIZE - 1))) {
> +			pages = 0;
> +			break;
> +		}
> +		pages += (iov[seg].iov_len >> PAGE_SHIFT);
> +	}
> +
> +	if (rw)
> +		size = NFS_SERVER(inode)->wsize;
> +	else
> +		size = NFS_SERVER(inode)->rsize;
> +
> +	/*
> +         * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
> +         * - fall back to old code.
> +         */
> +	if (pages > (size >> PAGE_SHIFT))
> +		pages = 0;
> +
> +	return pages;
> +}
> +
> /*
>  * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
>  * operation.  If nfs_readdata_alloc() or get_user_pages() fails,
> @@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
> 	return result < 0 ? (ssize_t) result : -EFAULT;
> }
> 
> +/*
> + * Special Case: Efficient readv() support - only if all the IO buffers
> + * in the vector are page-aligned and IO sizes are page-multiple
> + * + total IO size is < rsize. We can map all of the vectors together
> + * and submit them in a single IO instead of operating on single entry
> + * in the vector.
> + */
> +static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
> +						 const struct iovec *iov,
> +						 unsigned long nr_segs,
> +						 int pages,
> +						 loff_t pos)
> +{
> +	struct nfs_open_context *ctx = dreq->ctx;
> +	struct inode *inode = ctx->path.dentry->d_inode;
> +	unsigned long user_addr = (unsigned long)iov->iov_base;
> +	size_t bytes = pages << PAGE_SHIFT;
> +	struct rpc_task *task;
> +	struct rpc_message msg = {
> +		.rpc_cred = ctx->cred,
> +	};
> +	struct rpc_task_setup task_setup_data = {
> +		.rpc_client = NFS_CLIENT(inode),
> +		.rpc_message = &msg,
> +		.callback_ops = &nfs_read_direct_ops,
> +		.workqueue = nfsiod_workqueue,
> +		.flags = RPC_TASK_ASYNC,
> +	};
> +	unsigned int pgbase;
> +	int result;
> +	int seg;
> +	int mapped = 0;
> +	int nr_pages;
> +	ssize_t started = 0;
> +	struct nfs_read_data *data;
> +
> +	result = -ENOMEM;
> +	data = nfs_readdata_alloc(pages);
> +	if (unlikely(!data))
> +		return result;
> +
> +	pgbase = user_addr & ~PAGE_MASK;
> +
> +	for (seg = 0; seg < nr_segs; seg++) {
> +		user_addr = (unsigned long)iov[seg].iov_base;
> +		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> +
> +		down_read(&current->mm->mmap_sem);
> +		result = get_user_pages(current, current->mm, user_addr,
> +					nr_pages, 1, 0, &data->pagevec[mapped],
> +					NULL);
> +		up_read(&current->mm->mmap_sem);
> +		if (result < 0) {
> +			/* unmap what is done so far */
> +			nfs_direct_release_pages(data->pagevec, mapped);
> +			nfs_readdata_free(data);
> +			return result;
> +		}
> +		mapped += result;
> +	}
> +
> +	get_dreq(dreq);
> +
> +	data->req = (struct nfs_page *) dreq;
> +	data->inode = inode;
> +	data->cred = msg.rpc_cred;
> +	data->args.fh = NFS_FH(inode);
> +	data->args.context = ctx;
> +	data->args.offset = pos;
> +	data->args.pgbase = pgbase;
> +	data->args.pages = data->pagevec;
> +	data->args.count = bytes;
> +	data->res.fattr = &data->fattr;
> +	data->res.eof = 0;
> +	data->res.count = bytes;
> +	nfs_fattr_init(&data->fattr);
> +
> +	task_setup_data.task = &data->task;
> +	task_setup_data.callback_data = data;
> +	msg.rpc_argp = &data->args;
> +	msg.rpc_resp = &data->res;
> +	NFS_PROTO(inode)->read_setup(data, &msg);
> +
> +	task = rpc_run_task(&task_setup_data);
> +	if (IS_ERR(task))
> +		goto out;
> +	rpc_put_task(task);
> +
> +	started += bytes;
> +out:
> +	if (started)
> +		return started;
> +	return result < 0 ? (ssize_t) result : -EFAULT;
> +}
> +
> static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
> 					      const struct iovec *iov,
> 					      unsigned long nr_segs,
> @@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_
> 
> 	get_dreq(dreq);
> 
> +	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
> +	if (result) {
> +		result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
> +							result, pos);
> +		if (result < 0)
> +			goto out;
> +		requested_bytes += result;
> +		pos += result;
> +		goto out;
> +	}
> 	for (seg = 0; seg < nr_segs; seg++) {
> 		const struct iovec *vec = &iov[seg];
> 		result = nfs_direct_read_schedule_segment(dreq, vec, pos);
> @@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
> 		return result < 0 ? result : -EIO;
> 	}
> 
> +out:
> 	if (put_dreq(dreq))
> 		nfs_direct_complete(dreq);
> 	return 0;
> @@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
> 	return result < 0 ? (ssize_t) result : -EFAULT;
> }
> 
> +/*
> + * Special Case: Efficient writev() support - only if all the IO buffers
> + * in the vector are page-aligned and IO sizes are page-multiple
> + * + total IO size is < wsize. We can map all of the vectors together
> + * and submit them in a single IO instead of operating on single entry
> + * in the vector.
> + */
> +static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
> +						 const struct iovec *iov,
> +						 unsigned long nr_segs,
> +						 int pages,
> +						 loff_t pos, int sync)
> +{
> +	struct nfs_open_context *ctx = dreq->ctx;
> +	struct inode *inode = ctx->path.dentry->d_inode;
> +	unsigned long user_addr = (unsigned long)iov->iov_base;
> +	size_t bytes = pages << PAGE_SHIFT;
> +	struct rpc_task *task;
> +	struct rpc_message msg = {
> +		.rpc_cred = ctx->cred,
> +	};
> +	struct rpc_task_setup task_setup_data = {
> +		.rpc_client = NFS_CLIENT(inode),
> +		.rpc_message = &msg,
> +		.callback_ops = &nfs_write_direct_ops,
> +		.workqueue = nfsiod_workqueue,
> +		.flags = RPC_TASK_ASYNC,
> +	};
> +	unsigned int pgbase;
> +	int result;
> +	int seg;
> +	int mapped = 0;
> +	int nr_pages;
> +	ssize_t started = 0;
> +	struct nfs_write_data *data;
> +
> +	result = -ENOMEM;
> +	data = nfs_writedata_alloc(pages);
> +	if (unlikely(!data))
> +		return result;
> +
> +	pgbase = user_addr & ~PAGE_MASK;
> +
> +	for (seg = 0; seg < nr_segs; seg++) {
> +		user_addr = (unsigned long)iov[seg].iov_base;
> +		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> +
> +		down_read(&current->mm->mmap_sem);
> +		result = get_user_pages(current, current->mm, user_addr,
> +					nr_pages, 0, 0, &data->pagevec[mapped], NULL);
> +		up_read(&current->mm->mmap_sem);
> +		if (result < 0) {
> +			/* unmap what is done so far */
> +			nfs_direct_release_pages(data->pagevec, mapped);
> +			nfs_writedata_free(data);
> +			return result;
> +		}
> +		mapped += result;
> +	}
> +
> +	get_dreq(dreq);
> +	list_move_tail(&data->pages, &dreq->rewrite_list);
> +
> +	data->req = (struct nfs_page *) dreq;
> +	data->inode = inode;
> +	data->cred = msg.rpc_cred;
> +	data->args.fh = NFS_FH(inode);
> +	data->args.context = ctx;
> +	data->args.offset = pos;
> +	data->args.pgbase = pgbase;
> +	data->args.pages = data->pagevec;
> +	data->args.count = bytes;
> +	data->args.stable = sync;
> +	data->res.fattr = &data->fattr;
> +	data->res.count = bytes;
> +	data->res.verf = &data->verf;
> +	nfs_fattr_init(&data->fattr);
> +
> +	task_setup_data.task = &data->task;
> +	task_setup_data.callback_data = data;
> +	msg.rpc_argp = &data->args;
> +	msg.rpc_resp = &data->res;
> +	NFS_PROTO(inode)->write_setup(data, &msg);
> +
> +	task = rpc_run_task(&task_setup_data);
> +	if (IS_ERR(task))
> +		goto out;
> +	rpc_put_task(task);
> +
> +	started += bytes;
> +out:
> +	if (started)
> +		return started;
> +	return result < 0 ? (ssize_t) result : -EFAULT;
> +}
> +
> static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
> 					       const struct iovec *iov,
> 					       unsigned long nr_segs,
> @@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule
> 
> 	get_dreq(dreq);
> 
> +	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
> +	if (result) {
> +		result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
> +							result, pos, sync);
> +		if (result < 0)
> +			goto out;
> +		requested_bytes += result;
> +		pos += result;
> +		goto out;
> +	}
> 	for (seg = 0; seg < nr_segs; seg++) {
> 		const struct iovec *vec = &iov[seg];
> 		result = nfs_direct_write_schedule_segment(dreq, vec,
> @@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
> 		return result < 0 ? result : -EIO;
> 	}
> 
> +out:
> 	if (put_dreq(dreq))
> 		nfs_direct_write_complete(dreq, dreq->inode);
> 	return 0;
> 
> 
Trond Myklebust April 12, 2011, 3:49 p.m. UTC | #2
On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> Hi,
> 
> We recently ran into serious performance issue with NFS client.
> It turned out that its due to lack of readv/write support for
> NFS (O_DIRECT) client.
> 
> Here is our use-case:
> 
> In our cloud environment, our storage is over NFS. Files
> on NFS are passed as a blockdevices to the guest (using
> O_DIRECT). When guest is doing IO on these block devices,
> they will end up as O_DIRECT writes to NFS (on KVM host).
> 
> QEMU (on the host) gets a vector from virtio-ring and
> submits them. Old versions of QEMU, linearized the vector
> it got from KVM (copied them into a buffer) and submits
> the buffer. So, NFS client always received a single buffer.
> 
> Later versions of QEMU, eliminated this copy and submits
> a vector directly using preadv/pwritev().
> 
> NFS client loops through the vector and submits each
> vector as separate request for each IO < wsize. In our
> case (negotiated wsize=1MB), for 256K IO - we get 64 
> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> Server end up doing each 4K synchronously. This causes
> serious performance degrade. We are trying to see if the
> performance improves if we convert IOs to ASYNC - but
> our initial results doesn't look good.
> 
> readv/writev support NFS client for all possible cases is
> hard. Instead, if all vectors are page-aligned and 
> iosizes page-multiple - it fits the current code easily.
> Luckily, QEMU use-case fits these requirements.
> 
> Here is the patch to add this support. Comments ?

Your approach goes in the direction of further special-casing O_DIRECT
in the NFS client. I'd like to move away from that and towards
integration with the ordinary read/write codepaths so that aside from
adding request coalescing, we can also enable pNFS support.

Cheers
   Trond

Badari Pulavarty April 12, 2011, 4:15 p.m. UTC | #3
On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> 
> > Hi,
> > 
> > We recently ran into serious performance issue with NFS client.
> > It turned out that its due to lack of readv/write support for
> > NFS (O_DIRECT) client.
> > 
> > Here is our use-case:
> > 
> > In our cloud environment, our storage is over NFS. Files
> > on NFS are passed as a blockdevices to the guest (using
> > O_DIRECT). When guest is doing IO on these block devices,
> > they will end up as O_DIRECT writes to NFS (on KVM host).
> > 
> > QEMU (on the host) gets a vector from virtio-ring and
> > submits them. Old versions of QEMU, linearized the vector
> > it got from KVM (copied them into a buffer) and submits
> > the buffer. So, NFS client always received a single buffer.
> > 
> > Later versions of QEMU, eliminated this copy and submits
> > a vector directly using preadv/pwritev().
> > 
> > NFS client loops through the vector and submits each
> > vector as separate request for each IO < wsize. In our
> > case (negotiated wsize=1MB), for 256K IO - we get 64 
> > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> > Server end up doing each 4K synchronously. This causes
> > serious performance degrade. We are trying to see if the
> > performance improves if we convert IOs to ASYNC - but
> > our initial results doesn't look good.
> > 
> > readv/writev support NFS client for all possible cases is
> > hard. Instead, if all vectors are page-aligned and 
> > iosizes page-multiple - it fits the current code easily.
> > Luckily, QEMU use-case fits these requirements.
> > 
> > Here is the patch to add this support. Comments ?
> 
> Restricting buffer alignment requirements would be an onerous API change, IMO.

I am not suggesting an API change at all. All I am doing is: if all
the IOs are aligned, we can take a fast path and submit them as a
single IO request (as if we got a single buffer). Otherwise, we do it
the hard way as today - loop through each element and do them individually.

> 
> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.

We are trying that patch. It does improve the performance a little,
but nowhere close to doing it as a single vector/buffer.

Khoa, can you share your performance data for all the
suggestions/patches you have tried so far?

Thanks,
Badari


Badari Pulavarty April 12, 2011, 4:17 p.m. UTC | #4
On Tue, 2011-04-12 at 11:49 -0400, Trond Myklebust wrote:
> On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> > Hi,
> > 
> > We recently ran into serious performance issue with NFS client.
> > It turned out that its due to lack of readv/write support for
> > NFS (O_DIRECT) client.
> > 
> > Here is our use-case:
> > 
> > In our cloud environment, our storage is over NFS. Files
> > on NFS are passed as a blockdevices to the guest (using
> > O_DIRECT). When guest is doing IO on these block devices,
> > they will end up as O_DIRECT writes to NFS (on KVM host).
> > 
> > QEMU (on the host) gets a vector from virtio-ring and
> > submits them. Old versions of QEMU, linearized the vector
> > it got from KVM (copied them into a buffer) and submits
> > the buffer. So, NFS client always received a single buffer.
> > 
> > Later versions of QEMU, eliminated this copy and submits
> > a vector directly using preadv/pwritev().
> > 
> > NFS client loops through the vector and submits each
> > vector as separate request for each IO < wsize. In our
> > case (negotiated wsize=1MB), for 256K IO - we get 64 
> > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> > Server end up doing each 4K synchronously. This causes
> > serious performance degrade. We are trying to see if the
> > performance improves if we convert IOs to ASYNC - but
> > our initial results doesn't look good.
> > 
> > readv/writev support NFS client for all possible cases is
> > hard. Instead, if all vectors are page-aligned and 
> > iosizes page-multiple - it fits the current code easily.
> > Luckily, QEMU use-case fits these requirements.
> > 
> > Here is the patch to add this support. Comments ?
> 
> Your approach goes in the direction of further special-casing O_DIRECT
> in the NFS client. I'd like to move away from that and towards
> integration with the ordinary read/write codepaths so that aside from
> adding request coalescing, we can also enable pNFS support.
> 

I completely agree. But it's a major undertaking :(

Thanks,
Badari

Trond Myklebust April 12, 2011, 4:26 p.m. UTC | #5
On Tue, 2011-04-12 at 09:17 -0700, Badari Pulavarty wrote:
> On Tue, 2011-04-12 at 11:49 -0400, Trond Myklebust wrote:
> > On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> > > Hi,
> > > 
> > > We recently ran into serious performance issue with NFS client.
> > > It turned out that its due to lack of readv/write support for
> > > NFS (O_DIRECT) client.
> > > 
> > > Here is our use-case:
> > > 
> > > In our cloud environment, our storage is over NFS. Files
> > > on NFS are passed as a blockdevices to the guest (using
> > > O_DIRECT). When guest is doing IO on these block devices,
> > > they will end up as O_DIRECT writes to NFS (on KVM host).
> > > 
> > > QEMU (on the host) gets a vector from virtio-ring and
> > > submits them. Old versions of QEMU, linearized the vector
> > > it got from KVM (copied them into a buffer) and submits
> > > the buffer. So, NFS client always received a single buffer.
> > > 
> > > Later versions of QEMU, eliminated this copy and submits
> > > a vector directly using preadv/pwritev().
> > > 
> > > NFS client loops through the vector and submits each
> > > vector as separate request for each IO < wsize. In our
> > > case (negotiated wsize=1MB), for 256K IO - we get 64 
> > > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> > > Server end up doing each 4K synchronously. This causes
> > > serious performance degrade. We are trying to see if the
> > > performance improves if we convert IOs to ASYNC - but
> > > our initial results doesn't look good.
> > > 
> > > readv/writev support NFS client for all possible cases is
> > > hard. Instead, if all vectors are page-aligned and 
> > > iosizes page-multiple - it fits the current code easily.
> > > Luckily, QEMU use-case fits these requirements.
> > > 
> > > Here is the patch to add this support. Comments ?
> > 
> > Your approach goes in the direction of further special-casing O_DIRECT
> > in the NFS client. I'd like to move away from that and towards
> > integration with the ordinary read/write codepaths so that aside from
> > adding request coalescing, we can also enable pNFS support.
> > 
> 
> I completely agree. But its a major under-taking :(

Sure, but it is one that I'm working on. I'm just explaining why I'd
prefer not to include more stop-gap O_DIRECT patches at this point. We
can afford to wait for one more release cycle if it means fixing
O_DIRECT once and for all.

Cheers,
  Trond

Chuck Lever III April 12, 2011, 4:42 p.m. UTC | #6
On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:

> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>> 
>>> Hi,
>>> 
>>> We recently ran into serious performance issue with NFS client.
>>> It turned out that its due to lack of readv/write support for
>>> NFS (O_DIRECT) client.
>>> 
>>> Here is our use-case:
>>> 
>>> In our cloud environment, our storage is over NFS. Files
>>> on NFS are passed as a blockdevices to the guest (using
>>> O_DIRECT). When guest is doing IO on these block devices,
>>> they will end up as O_DIRECT writes to NFS (on KVM host).
>>> 
>>> QEMU (on the host) gets a vector from virtio-ring and
>>> submits them. Old versions of QEMU, linearized the vector
>>> it got from KVM (copied them into a buffer) and submits
>>> the buffer. So, NFS client always received a single buffer.
>>> 
>>> Later versions of QEMU, eliminated this copy and submits
>>> a vector directly using preadv/pwritev().
>>> 
>>> NFS client loops through the vector and submits each
>>> vector as separate request for each IO < wsize. In our
>>> case (negotiated wsize=1MB), for 256K IO - we get 64 
>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
>>> Server end up doing each 4K synchronously. This causes
>>> serious performance degrade. We are trying to see if the
>>> performance improves if we convert IOs to ASYNC - but
>>> our initial results doesn't look good.
>>> 
>>> readv/writev support NFS client for all possible cases is
>>> hard. Instead, if all vectors are page-aligned and 
>>> iosizes page-multiple - it fits the current code easily.
>>> Luckily, QEMU use-case fits these requirements.
>>> 
>>> Here is the patch to add this support. Comments ?
>> 
>> Restricting buffer alignment requirements would be an onerous API change, IMO.
> 
> I am not suggesting an API change at all. All I am doing is, if all
> the IOs are aligned - we can do a fast path as we can do in a single
> IO request. (as if we got a single buffer). Otherwise, we do hard
> way as today - loop through each one and do them individually.

Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.

>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
> 
> We are trying that patch. It does improve the performance by little,
> but not anywhere close to doing it as a single vector/buffer.
> 
> Khoa, can you share your performance data for all the
> suggestions/patches you tried so far ?

The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?

Badari Pulavarty April 12, 2011, 5:46 p.m. UTC | #7
On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> 
> > On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> >> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> >> 
> >>> Hi,
> >>> 
> >>> We recently ran into serious performance issue with NFS client.
> >>> It turned out that its due to lack of readv/write support for
> >>> NFS (O_DIRECT) client.
> >>> 
> >>> Here is our use-case:
> >>> 
> >>> In our cloud environment, our storage is over NFS. Files
> >>> on NFS are passed as a blockdevices to the guest (using
> >>> O_DIRECT). When guest is doing IO on these block devices,
> >>> they will end up as O_DIRECT writes to NFS (on KVM host).
> >>> 
> >>> QEMU (on the host) gets a vector from virtio-ring and
> >>> submits them. Old versions of QEMU, linearized the vector
> >>> it got from KVM (copied them into a buffer) and submits
> >>> the buffer. So, NFS client always received a single buffer.
> >>> 
> >>> Later versions of QEMU, eliminated this copy and submits
> >>> a vector directly using preadv/pwritev().
> >>> 
> >>> NFS client loops through the vector and submits each
> >>> vector as separate request for each IO < wsize. In our
> >>> case (negotiated wsize=1MB), for 256K IO - we get 64 
> >>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> >>> Server end up doing each 4K synchronously. This causes
> >>> serious performance degrade. We are trying to see if the
> >>> performance improves if we convert IOs to ASYNC - but
> >>> our initial results doesn't look good.
> >>> 
> >>> readv/writev support NFS client for all possible cases is
> >>> hard. Instead, if all vectors are page-aligned and 
> >>> iosizes page-multiple - it fits the current code easily.
> >>> Luckily, QEMU use-case fits these requirements.
> >>> 
> >>> Here is the patch to add this support. Comments ?
> >> 
> >> Restricting buffer alignment requirements would be an onerous API change, IMO.
> > 
> > I am not suggesting an API change at all. All I am doing is, if all
> > the IOs are aligned - we can do a fast path as we can do in a single
> > IO request. (as if we got a single buffer). Otherwise, we do hard
> > way as today - loop through each one and do them individually.
> 
> Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> 
> >> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
> > 
> > We are trying that patch. It does improve the performance by little,
> > but not anywhere close to doing it as a single vector/buffer.
> > 
> > Khoa, can you share your performance data for all the
> > suggestions/patches you tried so far ?
> 
> The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?

We haven't tried both fixes together (RPC slot increase, turning the
IOs into ASYNC). Each one individually didn't help much. We will try
them together.

Thanks,
Badari

Christoph Hellwig April 15, 2011, 5:33 p.m. UTC | #8
On Tue, Apr 12, 2011 at 11:49:29AM -0400, Trond Myklebust wrote:
> Your approach goes in the direction of further special-casing O_DIRECT
> in the NFS client. I'd like to move away from that and towards
> integration with the ordinary read/write codepaths so that aside from
> adding request coalescing, we can also enable pNFS support.

What is the exact plan?  Split the direct I/O into two passes, one
to lock down the user pages and then a second one to send the pages
over the wire, which is shared with the writeback code?  If that's
the case, it should naturally allow plugging in a scheme like Badari's
to send pages from different iovecs in a single on-the-wire request -
after all, page cache pages are non-contiguous in virtual and physical
memory, too.

When do you plan to release your read/write code rewrite?  If it's
not anytime soon, how is applying Badari's patch going to hurt?  Most
of it will probably get reverted by a complete rewrite, but at least
the logic to check which direct I/O iovecs can be coalesced would stay
in the new world order.

Trond Myklebust April 15, 2011, 6 p.m. UTC | #9
On Fri, 2011-04-15 at 13:33 -0400, Christoph Hellwig wrote:
> On Tue, Apr 12, 2011 at 11:49:29AM -0400, Trond Myklebust wrote:
> > Your approach goes in the direction of further special-casing O_DIRECT
> > in the NFS client. I'd like to move away from that and towards
> > integration with the ordinary read/write codepaths so that aside from
> > adding request coalescing, we can also enable pNFS support.
> 
> What is the exact plan?  Split the direct I/O into two passes, one
> to lock down the user pages and then a second one to send the pages
> over the wire, which is shared with the writeback code?  If that's
> the case it should naturally allow plugging in a scheme like Badari
> to send pages from different iovecs in a single on the wire request -
> after all page cache pages are non-continuous in virtual and physical
> memory, too.

You can't lock the user pages unfortunately: they may need to be faulted
in.

What I was thinking of doing is splitting out the code in the RPC
callbacks that plays around with page flags and puts the pages on the
inode's dirty list so that they don't get called in the case of
O_DIRECT.
I then want to attach the O_DIRECT pages to the nfsi->nfs_page_tree
radix tree so that they can be tracked by the NFS layer. I'm assuming
that nobody is going to be silly enough to require simultaneous writes
via O_DIRECT to the same locations.
Then we can feed the O_DIRECT pages into nfs_page_async_flush() so that
they share the existing page cache write coalescing and pnfs code.

The commit code will be easy to reuse too, since the requests are listed
in the radix tree and so nfs_scan_list() can find and process them in
the usual fashion.

The main problem that I have yet to figure out is what to do if the
server flags a reboot and the requests need to be resent. One option I'm
looking into is using the aio 'kick handler' to resubmit the writes.
Another may be to just resend directly from the nfsiod work queue.
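
The nfsiod option could be as small as the sketch below: park the
to-be-resent requests on the dreq and kick a work item.  The
work_resend field and nfs_direct_resend_writes() are assumptions made
for illustration; INIT_WORK()/queue_work() and nfsiod_workqueue are
the existing pieces being proposed here.

/* Sketch only: resend O_DIRECT writes from the nfsiod workqueue. */
static void nfs_direct_resend_work(struct work_struct *work)
{
	struct nfs_direct_req *dreq =
		container_of(work, struct nfs_direct_req, work_resend);

	/* hypothetical: walk dreq->rewrite_list and re-issue each WRITE */
	nfs_direct_resend_writes(dreq);
}

static void nfs_direct_schedule_resend(struct nfs_direct_req *dreq)
{
	INIT_WORK(&dreq->work_resend, nfs_direct_resend_work);
	queue_work(nfsiod_workqueue, &dreq->work_resend);
}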

> When do you plan to release your read/write code rewrite?  If it's
> not anytime soon, how is applying Badari's patch going to hurt?  Most
> of it probably will get reverted by a complete rewrite, but at least
> the logic to check which direct I/O iovecs can be coalesced would stay
> in the new world order.

I'm hoping that I can do the rewrite fairly quickly once the resend
problem is solved. It shouldn't be more than a couple of weeks of
coding.
diff mbox

Patch

Index: linux-2.6.38.2/fs/nfs/direct.c
===================================================================
--- linux-2.6.38.2.orig/fs/nfs/direct.c	2011-04-12 10:56:29.374266292 -0400
+++ linux-2.6.38.2/fs/nfs/direct.c	2011-04-12 11:01:21.883266310 -0400
@@ -271,6 +271,52 @@  static const struct rpc_call_ops nfs_rea
 	.rpc_release = nfs_direct_read_release,
 };
 
+static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
+				      const struct iovec *iov,
+                                      unsigned long nr_segs, int rw)
+{
+	unsigned long seg;
+	int pages = 0;
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	size_t size;
+
+
+	/* If it's a single vector - use the old code */
+	if (nr_segs == 1)
+		return pages;
+	/*
+	 * Check to see if all IO buffers are page-aligned and
+	 * individual IO sizes are page-multiple.
+	 * If so, we can submit them in a single IO.
+	 * Otherwise, use old code
+	 */
+	for (seg = 0; seg < nr_segs; seg++) {
+		unsigned long user_addr = (unsigned long)iov[seg].iov_base;
+
+		if ((user_addr & (PAGE_SIZE - 1)) ||
+			(iov[seg].iov_len & (PAGE_SIZE - 1))) {
+			pages = 0;
+			break;
+		}
+		pages += (iov[seg].iov_len >> PAGE_SHIFT);
+	}
+
+	if (rw)
+		size = NFS_SERVER(inode)->wsize;
+	else
+		size = NFS_SERVER(inode)->rsize;
+
+	/*
+	 * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
+	 * - fall back to old code.
+	 */
+	if (pages > (size >> PAGE_SHIFT))
+		pages = 0;
+
+	return pages;
+}
+
 /*
  * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
  * operation.  If nfs_readdata_alloc() or get_user_pages() fails,
@@ -385,6 +431,101 @@  static ssize_t nfs_direct_read_schedule_
 	return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
+/*
+ * Special Case: Efficient readv() support - only if all the IO buffers
+ * in the vector are page-aligned, each IO size is a page-multiple,
+ * and the total IO size is <= rsize. We can then map all of the vectors
+ * together and submit them as a single IO instead of operating on one
+ * entry in the vector at a time.
+ */
+static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
+						 const struct iovec *iov,
+						 unsigned long nr_segs,
+						 int pages,
+						 loff_t pos)
+{
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	unsigned long user_addr = (unsigned long)iov->iov_base;
+	size_t bytes = pages << PAGE_SHIFT;
+	struct rpc_task *task;
+	struct rpc_message msg = {
+		.rpc_cred = ctx->cred,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.rpc_client = NFS_CLIENT(inode),
+		.rpc_message = &msg,
+		.callback_ops = &nfs_read_direct_ops,
+		.workqueue = nfsiod_workqueue,
+		.flags = RPC_TASK_ASYNC,
+	};
+	unsigned int pgbase;
+	int result;
+	int seg;
+	int mapped = 0;
+	int nr_pages;
+	ssize_t started = 0;
+	struct nfs_read_data *data;
+
+	result = -ENOMEM;
+	data = nfs_readdata_alloc(pages);
+	if (unlikely(!data))
+		return result;
+
+	pgbase = user_addr & ~PAGE_MASK;
+
+	for (seg = 0; seg < nr_segs; seg++) {
+		user_addr = (unsigned long)iov[seg].iov_base;
+		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
+		down_read(&current->mm->mmap_sem);
+		result = get_user_pages(current, current->mm, user_addr,
+					nr_pages, 1, 0, &data->pagevec[mapped],
+					NULL);
+		up_read(&current->mm->mmap_sem);
+		if (result < 0) {
+			/* unmap what is done so far */
+			nfs_direct_release_pages(data->pagevec, mapped);
+			nfs_readdata_free(data);
+			return result;
+		}
+		mapped += result;
+	}
+
+	get_dreq(dreq);
+
+	data->req = (struct nfs_page *) dreq;
+	data->inode = inode;
+	data->cred = msg.rpc_cred;
+	data->args.fh = NFS_FH(inode);
+	data->args.context = ctx;
+	data->args.offset = pos;
+	data->args.pgbase = pgbase;
+	data->args.pages = data->pagevec;
+	data->args.count = bytes;
+	data->res.fattr = &data->fattr;
+	data->res.eof = 0;
+	data->res.count = bytes;
+	nfs_fattr_init(&data->fattr);
+
+	task_setup_data.task = &data->task;
+	task_setup_data.callback_data = data;
+	msg.rpc_argp = &data->args;
+	msg.rpc_resp = &data->res;
+	NFS_PROTO(inode)->read_setup(data, &msg);
+
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		goto out;
+	rpc_put_task(task);
+
+	started += bytes;
+out:
+	if (started)
+		return started;
+	return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
 static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 					      const struct iovec *iov,
 					      unsigned long nr_segs,
@@ -396,6 +537,16 @@  static ssize_t nfs_direct_read_schedule_
 
 	get_dreq(dreq);
 
+	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
+	if (result) {
+		result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
+							result, pos);
+		if (result < 0)
+			goto out;
+		requested_bytes += result;
+		pos += result;
+		goto out;
+	}
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
 		result = nfs_direct_read_schedule_segment(dreq, vec, pos);
@@ -416,6 +567,7 @@  static ssize_t nfs_direct_read_schedule_
 		return result < 0 ? result : -EIO;
 	}
 
+out:
 	if (put_dreq(dreq))
 		nfs_direct_complete(dreq);
 	return 0;
@@ -821,6 +973,102 @@  static ssize_t nfs_direct_write_schedule
 	return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
+/*
+ * Special Case: Efficient writev() support - only if all the IO buffers
+ * in the vector are page-aligned, each IO size is a page-multiple,
+ * and the total IO size is <= wsize. We can then map all of the vectors
+ * together and submit them as a single IO instead of operating on one
+ * entry in the vector at a time.
+ */
+static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
+						 const struct iovec *iov,
+						 unsigned long nr_segs,
+						 int pages,
+						 loff_t pos, int sync)
+{
+	struct nfs_open_context *ctx = dreq->ctx;
+	struct inode *inode = ctx->path.dentry->d_inode;
+	unsigned long user_addr = (unsigned long)iov->iov_base;
+	size_t bytes = pages << PAGE_SHIFT;
+	struct rpc_task *task;
+	struct rpc_message msg = {
+		.rpc_cred = ctx->cred,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.rpc_client = NFS_CLIENT(inode),
+		.rpc_message = &msg,
+		.callback_ops = &nfs_write_direct_ops,
+		.workqueue = nfsiod_workqueue,
+		.flags = RPC_TASK_ASYNC,
+	};
+	unsigned int pgbase;
+	int result;
+	int seg;
+	int mapped = 0;
+	int nr_pages;
+	ssize_t started = 0;
+	struct nfs_write_data *data;
+
+	result = -ENOMEM;
+	data = nfs_writedata_alloc(pages);
+	if (unlikely(!data))
+		return result;
+
+	pgbase = user_addr & ~PAGE_MASK;
+
+	for (seg = 0; seg < nr_segs; seg++) {
+		user_addr = (unsigned long)iov[seg].iov_base;
+		nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
+		down_read(&current->mm->mmap_sem);
+		result = get_user_pages(current, current->mm, user_addr,
+					nr_pages, 0, 0, &data->pagevec[mapped], NULL);
+		up_read(&current->mm->mmap_sem);
+		if (result < 0) {
+			/* unmap what is done so far */
+			nfs_direct_release_pages(data->pagevec, mapped);
+			nfs_writedata_free(data);
+			return result;
+		}
+		mapped += result;
+	}
+
+	get_dreq(dreq);
+	list_move_tail(&data->pages, &dreq->rewrite_list);
+
+	data->req = (struct nfs_page *) dreq;
+	data->inode = inode;
+	data->cred = msg.rpc_cred;
+	data->args.fh = NFS_FH(inode);
+	data->args.context = ctx;
+	data->args.offset = pos;
+	data->args.pgbase = pgbase;
+	data->args.pages = data->pagevec;
+	data->args.count = bytes;
+	data->args.stable = sync;
+	data->res.fattr = &data->fattr;
+	data->res.count = bytes;
+	data->res.verf = &data->verf;
+	nfs_fattr_init(&data->fattr);
+
+	task_setup_data.task = &data->task;
+	task_setup_data.callback_data = data;
+	msg.rpc_argp = &data->args;
+	msg.rpc_resp = &data->res;
+	NFS_PROTO(inode)->write_setup(data, &msg);
+
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		goto out;
+	rpc_put_task(task);
+
+	started += bytes;
+out:
+	if (started)
+		return started;
+	return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       const struct iovec *iov,
 					       unsigned long nr_segs,
@@ -832,6 +1080,16 @@  static ssize_t nfs_direct_write_schedule
 
 	get_dreq(dreq);
 
+	result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
+	if (result) {
+		result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
+							result, pos, sync);
+		if (result < 0)
+			goto out;
+		requested_bytes += result;
+		pos += result;
+		goto out;
+	}
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
 		result = nfs_direct_write_schedule_segment(dreq, vec,
@@ -853,6 +1111,7 @@  static ssize_t nfs_direct_write_schedule
 		return result < 0 ? result : -EIO;
 	}
 
+out:
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq, dreq->inode);
 	return 0;
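
For reference, the I/O pattern this fast path targets looks like the
following from userspace: every buffer page-aligned, every length a
page-multiple, and the total within wsize.  This is a minimal sketch
(the /mnt/nfs/disk.img path is made up); with a 1MB wsize the patched
client would turn these 64 x 4KB segments into one 256KB WRITE instead
of 64 FILE_SYNC 4KB WRITEs.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define NSEGS 64
#define SEGSZ 4096

int main(void)
{
	struct iovec iov[NSEGS];
	int i, fd;

	fd = open("/mnt/nfs/disk.img", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	for (i = 0; i < NSEGS; i++) {
		/* page-aligned, page-multiple buffers qualify for coalescing */
		if (posix_memalign(&iov[i].iov_base, 4096, SEGSZ))
			return 1;
		memset(iov[i].iov_base, 0, SEGSZ);
		iov[i].iov_len = SEGSZ;
	}

	/* 64 x 4KB segments, 256KB total: one WRITE on the wire */
	if (pwritev(fd, iov, NSEGS, 0) < 0)
		return 1;

	close(fd);
	return 0;
}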