diff mbox

[RFC] Vector read/write support for NFS (DIO) client

Message ID 20110413083656.12e54a91@tlielax.poochiereds.net
State New, archived
Headers show

Commit Message

Jeff Layton April 13, 2011, 12:36 p.m. UTC
On Tue, 12 Apr 2011 10:46:00 -0700
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> > On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> > 
> > > On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> > >> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> > >> 
> > >>> Hi,
> > >>> 
> > >>> We recently ran into serious performance issue with NFS client.
> > >>> It turned out that its due to lack of readv/write support for
> > >>> NFS (O_DIRECT) client.
> > >>> 
> > >>> Here is our use-case:
> > >>> 
> > >>> In our cloud environment, our storage is over NFS. Files
> > >>> on NFS are passed as a blockdevices to the guest (using
> > >>> O_DIRECT). When guest is doing IO on these block devices,
> > >>> they will end up as O_DIRECT writes to NFS (on KVM host).
> > >>> 
> > >>> QEMU (on the host) gets a vector from virtio-ring and
> > >>> submits them. Old versions of QEMU, linearized the vector
> > >>> it got from KVM (copied them into a buffer) and submits
> > >>> the buffer. So, NFS client always received a single buffer.
> > >>> 
> > >>> Later versions of QEMU, eliminated this copy and submits
> > >>> a vector directly using preadv/pwritev().
> > >>> 
> > >>> NFS client loops through the vector and submits each
> > >>> vector as separate request for each IO < wsize. In our
> > >>> case (negotiated wsize=1MB), for 256K IO - we get 64 
> > >>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs. 
> > >>> Server end up doing each 4K synchronously. This causes
> > >>> serious performance degrade. We are trying to see if the
> > >>> performance improves if we convert IOs to ASYNC - but
> > >>> our initial results doesn't look good.
> > >>> 
> > >>> readv/writev support NFS client for all possible cases is
> > >>> hard. Instead, if all vectors are page-aligned and 
> > >>> iosizes page-multiple - it fits the current code easily.
> > >>> Luckily, QEMU use-case fits these requirements.
> > >>> 
> > >>> Here is the patch to add this support. Comments ?
> > >> 
> > >> Restricting buffer alignment requirements would be an onerous API change, IMO.
> > > 
> > > I am not suggesting an API change at all. All I am doing is, if all
> > > the IOs are aligned - we can do a fast path as we can do in a single
> > > IO request. (as if we got a single buffer). Otherwise, we do hard
> > > way as today - loop through each one and do them individually.
> > 
> > Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> > 
> > >> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
> > > 
> > > We are trying that patch. It does improve the performance by little,
> > > but not anywhere close to doing it as a single vector/buffer.
> > > 
> > > Khoa, can you share your performance data for all the
> > > suggestions/patches you tried so far ?
> > 
> > The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?
> 
> We haven't tried both fixes together (RPC slot increase, Turn into
> ASYNC). Each one individually didn't help much. We will try them
> together.
> 

I must have missed some of these emails, but here's the patch that
Chuck mentioned and based on his suggestion. It may be reasonable as a
stopgap solution until Trond's overhaul of the DIO code is complete.

With this + a larger slot table size, I would think you'd have a
substantial write performance improvement. Probably not as good as if
the writes were coalesced, but it should help.

Badari, if you can let us know whether this plus increasing the slot
table size helps, then I'll plan to "officially" submit it. This one is
against RHEL6 but it should apply to mainline kernels with a small
offset.

-----------------[snip]-----------------

BZ#694309: nfs: use unstable writes for bigger groups of DIO writes

Currently, the client uses FILE_SYNC whenever it's writing less than or
equal data to the wsize. This is a problem though if we have a bunch
of small iovec's batched up in a single writev call. The client will
iterate over them and do a single FILE_SYNC WRITE for each.

Instead, change the code to do unstable writes when we'll need to do
multiple WRITE RPC's in order to satisfy the request. While we're at
it, optimize away the allocation of commit_data when we aren't going
to use it anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/direct.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

Comments

Badari Pulavarty April 13, 2011, 1:43 p.m. UTC | #1
On 4/13/2011 5:36 AM, Jeff Layton wrote:
> On Tue, 12 Apr 2011 10:46:00 -0700
> Badari Pulavarty<pbadari@us.ibm.com>  wrote:
>
>    
>> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
>>      
>>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>>>
>>>        
>>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>>>>          
>>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>>>>
>>>>>            
>>>>>> Hi,
>>>>>>
>>>>>> We recently ran into serious performance issue with NFS client.
>>>>>> It turned out that its due to lack of readv/write support for
>>>>>> NFS (O_DIRECT) client.
>>>>>>
>>>>>> Here is our use-case:
>>>>>>
>>>>>> In our cloud environment, our storage is over NFS. Files
>>>>>> on NFS are passed as a blockdevices to the guest (using
>>>>>> O_DIRECT). When guest is doing IO on these block devices,
>>>>>> they will end up as O_DIRECT writes to NFS (on KVM host).
>>>>>>
>>>>>> QEMU (on the host) gets a vector from virtio-ring and
>>>>>> submits them. Old versions of QEMU, linearized the vector
>>>>>> it got from KVM (copied them into a buffer) and submits
>>>>>> the buffer. So, NFS client always received a single buffer.
>>>>>>
>>>>>> Later versions of QEMU, eliminated this copy and submits
>>>>>> a vector directly using preadv/pwritev().
>>>>>>
>>>>>> NFS client loops through the vector and submits each
>>>>>> vector as separate request for each IO<  wsize. In our
>>>>>> case (negotiated wsize=1MB), for 256K IO - we get 64
>>>>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
>>>>>> Server end up doing each 4K synchronously. This causes
>>>>>> serious performance degrade. We are trying to see if the
>>>>>> performance improves if we convert IOs to ASYNC - but
>>>>>> our initial results doesn't look good.
>>>>>>
>>>>>> readv/writev support NFS client for all possible cases is
>>>>>> hard. Instead, if all vectors are page-aligned and
>>>>>> iosizes page-multiple - it fits the current code easily.
>>>>>> Luckily, QEMU use-case fits these requirements.
>>>>>>
>>>>>> Here is the patch to add this support. Comments ?
>>>>>>              
>>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
>>>>>            
>>>> I am not suggesting an API change at all. All I am doing is, if all
>>>> the IOs are aligned - we can do a fast path as we can do in a single
>>>> IO request. (as if we got a single buffer). Otherwise, we do hard
>>>> way as today - loop through each one and do them individually.
>>>>          
>>> Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
>>>
>>>        
>>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
>>>>>            
>>>> We are trying that patch. It does improve the performance by little,
>>>> but not anywhere close to doing it as a single vector/buffer.
>>>>
>>>> Khoa, can you share your performance data for all the
>>>> suggestions/patches you tried so far ?
>>>>          
>>> The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?
>>>        
>> We haven't tried both fixes together (RPC slot increase, Turn into
>> ASYNC). Each one individually didn't help much. We will try them
>> together.
>>
>>      
> I must have missed some of these emails, but here's the patch that
> Chuck mentioned and based on his suggestion. It may be reasonable as a
> stopgap solution until Trond's overhaul of the DIO code is complete.
>
> With this + a larger slot table size, I would think you'd have a
> substantial write performance improvement. Probably not as good as if
> the writes were coalesced, but it should help.
>
> Badari, if you can let us know whether this plus increasing the slot
> table size helps, then I'll plan to "officially" submit it. This one is
> against RHEL6 but it should apply to mainline kernels with a small
> offset.
>
> -----------------[snip]-----------------
>
> BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
>
> Currently, the client uses FILE_SYNC whenever it's writing less than or
> equal data to the wsize. This is a problem though if we have a bunch
> of small iovec's batched up in a single writev call. The client will
> iterate over them and do a single FILE_SYNC WRITE for each.
>
> Instead, change the code to do unstable writes when we'll need to do
> multiple WRITE RPC's in order to satisfy the request. While we're at
> it, optimize away the allocation of commit_data when we aren't going
> to use it anyway.
>
> Signed-off-by: Jeff Layton<jlayton@redhat.com>
>    

Khoa is running tests with this + RPC table change. Single thread 
performance improved,
but multi-thread tests doesn't scale (I guess running out of RPC table 
entires even with 128 ?).
He will share the results shortly.

Thanks,
Badari

> ---
>   fs/nfs/direct.c |   13 +++++++++++--
>   1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 398f8ed..d2ed659 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
>   	dreq = nfs_direct_req_alloc();
>   	if (!dreq)
>   		goto out;
> -	nfs_alloc_commit_data(dreq);
>
> -	if (dreq->commit_data == NULL || count<= wsize)
> +	if (count>  wsize || nr_segs>  1)
> +		nfs_alloc_commit_data(dreq);
> +	else
> +		dreq->commit_data = NULL;
> +
> +	/*
> +	 * If we couldn't allocate commit data, or we'll just be doing a
> +	 * single write, then make this a NFS_FILE_SYNC write and do away
> +	 * with the commit.
> +	 */
> +	if (dreq->commit_data == NULL)
>   		sync = NFS_FILE_SYNC;
>
>   	dreq->inode = inode;
>    


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton April 13, 2011, 2:02 p.m. UTC | #2
On Wed, 13 Apr 2011 06:43:53 -0700
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> On 4/13/2011 5:36 AM, Jeff Layton wrote:
> > On Tue, 12 Apr 2011 10:46:00 -0700
> > Badari Pulavarty<pbadari@us.ibm.com>  wrote:
> >
> >    
> >> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> >>      
> >>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> >>>
> >>>        
> >>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> >>>>          
> >>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> >>>>>
> >>>>>            
> >>>>>> Hi,
> >>>>>>
> >>>>>> We recently ran into serious performance issue with NFS client.
> >>>>>> It turned out that its due to lack of readv/write support for
> >>>>>> NFS (O_DIRECT) client.
> >>>>>>
> >>>>>> Here is our use-case:
> >>>>>>
> >>>>>> In our cloud environment, our storage is over NFS. Files
> >>>>>> on NFS are passed as a blockdevices to the guest (using
> >>>>>> O_DIRECT). When guest is doing IO on these block devices,
> >>>>>> they will end up as O_DIRECT writes to NFS (on KVM host).
> >>>>>>
> >>>>>> QEMU (on the host) gets a vector from virtio-ring and
> >>>>>> submits them. Old versions of QEMU, linearized the vector
> >>>>>> it got from KVM (copied them into a buffer) and submits
> >>>>>> the buffer. So, NFS client always received a single buffer.
> >>>>>>
> >>>>>> Later versions of QEMU, eliminated this copy and submits
> >>>>>> a vector directly using preadv/pwritev().
> >>>>>>
> >>>>>> NFS client loops through the vector and submits each
> >>>>>> vector as separate request for each IO<  wsize. In our
> >>>>>> case (negotiated wsize=1MB), for 256K IO - we get 64
> >>>>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> >>>>>> Server end up doing each 4K synchronously. This causes
> >>>>>> serious performance degrade. We are trying to see if the
> >>>>>> performance improves if we convert IOs to ASYNC - but
> >>>>>> our initial results doesn't look good.
> >>>>>>
> >>>>>> readv/writev support NFS client for all possible cases is
> >>>>>> hard. Instead, if all vectors are page-aligned and
> >>>>>> iosizes page-multiple - it fits the current code easily.
> >>>>>> Luckily, QEMU use-case fits these requirements.
> >>>>>>
> >>>>>> Here is the patch to add this support. Comments ?
> >>>>>>              
> >>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
> >>>>>            
> >>>> I am not suggesting an API change at all. All I am doing is, if all
> >>>> the IOs are aligned - we can do a fast path as we can do in a single
> >>>> IO request. (as if we got a single buffer). Otherwise, we do hard
> >>>> way as today - loop through each one and do them individually.
> >>>>          
> >>> Thanks for the clarification.  That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> >>>
> >>>        
> >>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed.  I think Jeff Layton already has a patch that does this.
> >>>>>            
> >>>> We are trying that patch. It does improve the performance by little,
> >>>> but not anywhere close to doing it as a single vector/buffer.
> >>>>
> >>>> Khoa, can you share your performance data for all the
> >>>> suggestions/patches you tried so far ?
> >>>>          
> >>> The individual WRITEs should be submitted in parallel.  If there is additional performance overhead, it is probably due to the small RPC slot table size.  Have you tried increasing it?
> >>>        
> >> We haven't tried both fixes together (RPC slot increase, Turn into
> >> ASYNC). Each one individually didn't help much. We will try them
> >> together.
> >>
> >>      
> > I must have missed some of these emails, but here's the patch that
> > Chuck mentioned and based on his suggestion. It may be reasonable as a
> > stopgap solution until Trond's overhaul of the DIO code is complete.
> >
> > With this + a larger slot table size, I would think you'd have a
> > substantial write performance improvement. Probably not as good as if
> > the writes were coalesced, but it should help.
> >
> > Badari, if you can let us know whether this plus increasing the slot
> > table size helps, then I'll plan to "officially" submit it. This one is
> > against RHEL6 but it should apply to mainline kernels with a small
> > offset.
> >
> > -----------------[snip]-----------------
> >
> > BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
> >
> > Currently, the client uses FILE_SYNC whenever it's writing less than or
> > equal data to the wsize. This is a problem though if we have a bunch
> > of small iovec's batched up in a single writev call. The client will
> > iterate over them and do a single FILE_SYNC WRITE for each.
> >
> > Instead, change the code to do unstable writes when we'll need to do
> > multiple WRITE RPC's in order to satisfy the request. While we're at
> > it, optimize away the allocation of commit_data when we aren't going
> > to use it anyway.
> >
> > Signed-off-by: Jeff Layton<jlayton@redhat.com>
> >    
> 
> Khoa is running tests with this + RPC table change. Single thread 
> performance improved,
> but multi-thread tests doesn't scale (I guess running out of RPC table 
> entires even with 128 ?).
> He will share the results shortly.
> 
> Thanks,
> Badari
> 

Ok, good to know.

I tend to think that the whole slot table concept needs some
reconsideration anyway. Off the top of my head...

We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
mempool with a minimum number of slots. Have them all be allocated with
GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
the waitqueue like it does today. Then, the clients can allocate
rpc_rqst's as they need as long as memory holds out for it.

We have the reserve_xprt stuff to handle congestion control anyway so I
don't really see the value in the artificial limits that the slot table
provides.

Maybe I should hack up a patchset for this...
Trond Myklebust April 13, 2011, 2:22 p.m. UTC | #3
On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> mempool with a minimum number of slots. Have them all be allocated with
> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> the waitqueue like it does today. Then, the clients can allocate
> rpc_rqst's as they need as long as memory holds out for it.
> 
> We have the reserve_xprt stuff to handle congestion control anyway so I
> don't really see the value in the artificial limits that the slot table
> provides.
> 
> Maybe I should hack up a patchset for this...

This issue has come up several times recently. My preference would be to
tie the availability of slots to the TCP window size, and basically say
that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
off allocating more slots until we get a ->write_space() callback which
clears that flag.

For the RDMA case, we can continue to use the current system of a fixed
number of preallocated slots.

Cheers
  Trond
Andy Adamson April 13, 2011, 2:27 p.m. UTC | #4
On Apr 13, 2011, at 10:22 AM, Trond Myklebust wrote:

> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>> mempool with a minimum number of slots. Have them all be allocated with
>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>> the waitqueue like it does today. Then, the clients can allocate
>> rpc_rqst's as they need as long as memory holds out for it.
>> 
>> We have the reserve_xprt stuff to handle congestion control anyway so I
>> don't really see the value in the artificial limits that the slot table
>> provides.
>> 
>> Maybe I should hack up a patchset for this...
> 
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.

I am scoping the dynamic rpc_slot allocation work and plan to prototype.

-->Andy
 
> 
> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.
> 
> Cheers
>  Trond
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton April 13, 2011, 5:20 p.m. UTC | #5
On Wed, 13 Apr 2011 10:22:13 -0400
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> > We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> > mempool with a minimum number of slots. Have them all be allocated with
> > GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> > the waitqueue like it does today. Then, the clients can allocate
> > rpc_rqst's as they need as long as memory holds out for it.
> > 
> > We have the reserve_xprt stuff to handle congestion control anyway so I
> > don't really see the value in the artificial limits that the slot table
> > provides.
> > 
> > Maybe I should hack up a patchset for this...
> 
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.
> 
> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.
> 

I take it then that we'd want a similar scheme for UDP as well? I guess
I'm just not sure what the slot table is supposed to be for.

Possibly naive question, and maybe you or Andy have scoped this out
already...

Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
needed, and manage congestion control in reserve_xprt ? It appears that
that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
variant (xprt_reserve_xprt) doesn't do that currently, but we could do
it there and that would seem to make for more parity between the TCP
and UDP in this sense.

We could do that similarly for RDMA too. Simply keep track of how many
RPCs are in flight and only allow reserving the xprt when that number
hasn't crossed the max number of slots...
Trond Myklebust April 13, 2011, 5:35 p.m. UTC | #6
On Wed, 2011-04-13 at 13:20 -0400, Jeff Layton wrote:
> On Wed, 13 Apr 2011 10:22:13 -0400
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> 
> > On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> > > We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> > > mempool with a minimum number of slots. Have them all be allocated with
> > > GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> > > the waitqueue like it does today. Then, the clients can allocate
> > > rpc_rqst's as they need as long as memory holds out for it.
> > > 
> > > We have the reserve_xprt stuff to handle congestion control anyway so I
> > > don't really see the value in the artificial limits that the slot table
> > > provides.
> > > 
> > > Maybe I should hack up a patchset for this...
> > 
> > This issue has come up several times recently. My preference would be to
> > tie the availability of slots to the TCP window size, and basically say
> > that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> > off allocating more slots until we get a ->write_space() callback which
> > clears that flag.
> > 
> > For the RDMA case, we can continue to use the current system of a fixed
> > number of preallocated slots.
> > 
> 
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

No. I don't see UDP as having much of a future in 10GigE and other high
bandwidth environments (or even in 1GigE setups). Let's just leave it as
it is...

> Possibly naive question, and maybe you or Andy have scoped this out
> already...
> 
> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> needed, and manage congestion control in reserve_xprt ? It appears that
> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
> it there and that would seem to make for more parity between the TCP
> and UDP in this sense.
> 
> We could do that similarly for RDMA too. Simply keep track of how many
> RPCs are in flight and only allow reserving the xprt when that number
> hasn't crossed the max number of slots...

What is the point of allocating a lot of resources when you lack the
bandwidth to do anything with them?

The reason for tying this to the TCP window size is to try to queue up
as much data as we can possibly transmit, without eating too much out of
the same GFP_ATOMIC pool of memory that the networking layer also uses.
Andy Adamson April 13, 2011, 5:56 p.m. UTC | #7
On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:

> On Wed, 13 Apr 2011 10:22:13 -0400
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> 
>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>>> mempool with a minimum number of slots. Have them all be allocated with
>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>>> the waitqueue like it does today. Then, the clients can allocate
>>> rpc_rqst's as they need as long as memory holds out for it.
>>> 
>>> We have the reserve_xprt stuff to handle congestion control anyway so I
>>> don't really see the value in the artificial limits that the slot table
>>> provides.
>>> 
>>> Maybe I should hack up a patchset for this...
>> 
>> This issue has come up several times recently. My preference would be to
>> tie the availability of slots to the TCP window size, and basically say
>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>> off allocating more slots until we get a ->write_space() callback which
>> clears that flag.
>> 
>> For the RDMA case, we can continue to use the current system of a fixed
>> number of preallocated slots.
>> 
> 
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

[andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
There is no reason to allocate more rpc_rqsts that can fit on the wire. 

> 
> Possibly naive question, and maybe you or Andy have scoped this out
> already...
> 
> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> needed, and manage congestion control in reserve_xprt ?

[andros] Congestion control is not what the rpc_slot table is managing. It does need to have
a minimum which experience has set at 16. It's the maximum that needs to be dynamic.
Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
is not always the case when 16 slots is not enough to fill the wire, and the administrator has 
not changed the # of rpc_slots.

> It appears that
> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
> it there and that would seem to make for more parity between the TCP
> and UDP in this sense.
> 
> We could do that similarly for RDMA too. Simply keep track of how many
> RPCs are in flight and only allow reserving the xprt when that number
> hasn't crossed the max number of slots...


> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust April 13, 2011, 6:14 p.m. UTC | #8
On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
> 
> > On Wed, 13 Apr 2011 10:22:13 -0400
> > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> > 
> >> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> >>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> >>> mempool with a minimum number of slots. Have them all be allocated with
> >>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> >>> the waitqueue like it does today. Then, the clients can allocate
> >>> rpc_rqst's as they need as long as memory holds out for it.
> >>> 
> >>> We have the reserve_xprt stuff to handle congestion control anyway so I
> >>> don't really see the value in the artificial limits that the slot table
> >>> provides.
> >>> 
> >>> Maybe I should hack up a patchset for this...
> >> 
> >> This issue has come up several times recently. My preference would be to
> >> tie the availability of slots to the TCP window size, and basically say
> >> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >> off allocating more slots until we get a ->write_space() callback which
> >> clears that flag.
> >> 
> >> For the RDMA case, we can continue to use the current system of a fixed
> >> number of preallocated slots.
> >> 
> > 
> > I take it then that we'd want a similar scheme for UDP as well? I guess
> > I'm just not sure what the slot table is supposed to be for.
> 
> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
> There is no reason to allocate more rpc_rqsts that can fit on the wire. 

Agreed, but as I said earlier, there is no reason to even try to use UDP
on high bandwidth links, so I suggest we just leave it as-is.
 
> > Possibly naive question, and maybe you or Andy have scoped this out
> > already...
> > 
> > Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> > needed, and manage congestion control in reserve_xprt ?
> 
> [andros] Congestion control is not what the rpc_slot table is managing. It does need to have
> a minimum which experience has set at 16. It's the maximum that needs to be dynamic.
> Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
> is not always the case when 16 slots is not enough to fill the wire, and the administrator has 
> not changed the # of rpc_slots.

Agreed. However we do need to ensure is that the networking layer is
aware that we have more data to send, and that it should negotiate a
window size increase with the server if possible. To do that, we need to
allocate just enough slots to hit the 'SOCK_ASYNC_NOSPACE' limit and
then wait.
Chuck Lever April 13, 2011, 6:47 p.m. UTC | #9
On Apr 13, 2011, at 2:14 PM, Trond Myklebust wrote:

> On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
>> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
>> 
>>> On Wed, 13 Apr 2011 10:22:13 -0400
>>> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
>>> 
>>>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>>>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>>>>> mempool with a minimum number of slots. Have them all be allocated with
>>>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>>>>> the waitqueue like it does today. Then, the clients can allocate
>>>>> rpc_rqst's as they need as long as memory holds out for it.
>>>>> 
>>>>> We have the reserve_xprt stuff to handle congestion control anyway so I
>>>>> don't really see the value in the artificial limits that the slot table
>>>>> provides.
>>>>> 
>>>>> Maybe I should hack up a patchset for this...
>>>> 
>>>> This issue has come up several times recently. My preference would be to
>>>> tie the availability of slots to the TCP window size, and basically say
>>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>>> off allocating more slots until we get a ->write_space() callback which
>>>> clears that flag.
>>>> 
>>>> For the RDMA case, we can continue to use the current system of a fixed
>>>> number of preallocated slots.
>>>> 
>>> 
>>> I take it then that we'd want a similar scheme for UDP as well? I guess
>>> I'm just not sure what the slot table is supposed to be for.
>> 
>> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
>> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
>> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
>> There is no reason to allocate more rpc_rqsts that can fit on the wire. 
> 
> Agreed, but as I said earlier, there is no reason to even try to use UDP
> on high bandwidth links, so I suggest we just leave it as-is.

I think Jeff is suggesting that all the transports should use the same logic, but UDP and RDMA should simply have fixed upper limits on their slot table size.  UDP would then behave the same as before, but would share code with the others.  That might be cleaner than maintaining separate slot allocation mechanisms for each transport.

In other words, share the code, but parametrize it so that UDP and RDMA have effectively fixed slot tables as before, but TCP is allowed to expand.

My two pfennig's worth.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton April 13, 2011, 7:04 p.m. UTC | #10
On Wed, 13 Apr 2011 14:47:05 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Apr 13, 2011, at 2:14 PM, Trond Myklebust wrote:
> 
> > On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
> >> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
> >> 
> >>> On Wed, 13 Apr 2011 10:22:13 -0400
> >>> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> >>> 
> >>>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> >>>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> >>>>> mempool with a minimum number of slots. Have them all be allocated with
> >>>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> >>>>> the waitqueue like it does today. Then, the clients can allocate
> >>>>> rpc_rqst's as they need as long as memory holds out for it.
> >>>>> 
> >>>>> We have the reserve_xprt stuff to handle congestion control anyway so I
> >>>>> don't really see the value in the artificial limits that the slot table
> >>>>> provides.
> >>>>> 
> >>>>> Maybe I should hack up a patchset for this...
> >>>> 
> >>>> This issue has come up several times recently. My preference would be to
> >>>> tie the availability of slots to the TCP window size, and basically say
> >>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >>>> off allocating more slots until we get a ->write_space() callback which
> >>>> clears that flag.
> >>>> 
> >>>> For the RDMA case, we can continue to use the current system of a fixed
> >>>> number of preallocated slots.
> >>>> 
> >>> 
> >>> I take it then that we'd want a similar scheme for UDP as well? I guess
> >>> I'm just not sure what the slot table is supposed to be for.
> >> 
> >> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> >> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
> >> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
> >> There is no reason to allocate more rpc_rqsts that can fit on the wire. 
> > 
> > Agreed, but as I said earlier, there is no reason to even try to use UDP
> > on high bandwidth links, so I suggest we just leave it as-is.
> 
> I think Jeff is suggesting that all the transports should use the same logic, but UDP and RDMA should simply have fixed upper limits on their slot table size.  UDP would then behave the same as before, but would share code with the others.  That might be cleaner than maintaining separate slot allocation mechanisms for each transport.
> 
> In other words, share the code, but parametrize it so that UDP and RDMA have effectively fixed slot tables as before, but TCP is allowed to expand.
>

That was my initial thought, but Trond has a point that there's no
reason to allocate info for a call that we're not able to send.

The idea of hooking up congestion feedback from the networking layer
into the slot allocation code sounds intriguing, so for now I'll stop
armchair quarterbacking and just wait to see what Andy comes up with :)
Dean Hildebrand April 14, 2011, 12:21 a.m. UTC | #11
>>> This issue has come up several times recently. My preference would be to
>>> tie the availability of slots to the TCP window size, and basically say
>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>> off allocating more slots until we get a ->write_space() callback which
>>> clears that flag.
>>>
>>> For the RDMA case, we can continue to use the current system of a fixed
>>> number of preallocated slots.
>>>
>> I take it then that we'd want a similar scheme for UDP as well? I guess
>> I'm just not sure what the slot table is supposed to be for.
> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
> There is no reason to allocate more rpc_rqsts that can fit on the wire.
I agree with checking for space on the link.

The above formula is a good lower bound on the maximum number of slots, 
but there are many times when a client could use more slots than the 
above formula.  For example, we don't want to punish writes if rsize > 
wsize.  Also, you have to account for the server memory, which can 
sometimes hold several write requests while waiting for them to  be 
sync'd to disk, leaving the TCP buffers less than full.

Also, I think any solution should allow admins to limit the maximum 
number of slots.  Too many slots can increase request randomness at the 
server, and sometimes severely reduce performance.

Dean

>> Possibly naive question, and maybe you or Andy have scoped this out
>> already...
>>
>> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
>> needed, and manage congestion control in reserve_xprt ?
> [andros] Congestion control is not what the rpc_slot table is managing. It does need to have
> a minimum which experience has set at 16. It's the maximum that needs to be dynamic.
> Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
> is not always the case when 16 slots is not enough to fill the wire, and the administrator has
> not changed the # of rpc_slots.
>
>> It appears that
>> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
>> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
>> it there and that would seem to make for more parity between the TCP
>> and UDP in this sense.
>>
>> We could do that similarly for RDMA too. Simply keep track of how many
>> RPCs are in flight and only allow reserving the xprt when that number
>> hasn't crossed the max number of slots...
>
>> -- 
>> Jeff Layton<jlayton@redhat.com>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust April 14, 2011, 12:42 a.m. UTC | #12
On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
> >>> This issue has come up several times recently. My preference would be to
> >>> tie the availability of slots to the TCP window size, and basically say
> >>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >>> off allocating more slots until we get a ->write_space() callback which
> >>> clears that flag.
> >>>
> >>> For the RDMA case, we can continue to use the current system of a fixed
> >>> number of preallocated slots.
> >>>
> >> I take it then that we'd want a similar scheme for UDP as well? I guess
> >> I'm just not sure what the slot table is supposed to be for.
> > [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> > can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
> > For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
> > There is no reason to allocate more rpc_rqsts that can fit on the wire.
> I agree with checking for space on the link.
> 
> The above formula is a good lower bound on the maximum number of slots, 
> but there are many times when a client could use more slots than the 
> above formula.  For example, we don't want to punish writes if rsize > 
> wsize.  Also, you have to account for the server memory, which can 
> sometimes hold several write requests while waiting for them to  be 
> sync'd to disk, leaving the TCP buffers less than full.

Err... No... On the contrary, it is a good _upper_ bound on the number
of slots. There is no point in allocating a slot for an RPC request
which you know you have no ability to transmit. That has nothing to do
with rsize or wsize values: if the socket is backed up, it won't take
more data.

Trond
Dean Hildebrand April 14, 2011, 6:39 a.m. UTC | #13
On 4/13/11 5:42 PM, Trond Myklebust wrote:
> On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
>>>>> This issue has come up several times recently. My preference would be to
>>>>> tie the availability of slots to the TCP window size, and basically say
>>>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>>>> off allocating more slots until we get a ->write_space() callback which
>>>>> clears that flag.
>>>>>
>>>>> For the RDMA case, we can continue to use the current system of a fixed
>>>>> number of preallocated slots.
>>>>>
>>>> I take it then that we'd want a similar scheme for UDP as well? I guess
>>>> I'm just not sure what the slot table is supposed to be for.
>>> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
>>> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
>>> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
>>> There is no reason to allocate more rpc_rqsts that can fit on the wire.
>> I agree with checking for space on the link.
>>
>> The above formula is a good lower bound on the maximum number of slots,
>> but there are many times when a client could use more slots than the
>> above formula.  For example, we don't want to punish writes if rsize>
>> wsize.  Also, you have to account for the server memory, which can
>> sometimes hold several write requests while waiting for them to  be
>> sync'd to disk, leaving the TCP buffers less than full.
> Err... No... On the contrary, it is a good _upper_ bound on the number
> of slots. There is no point in allocating a slot for an RPC request
> which you know you have no ability to transmit. That has nothing to do
> with rsize or wsize values: if the socket is backed up, it won't take
> more data.

Absolutely, I'm just trying to point out that checking the 
SOCK_ASYNC_NOSPACE flag seems to be the only way to guarantee it won't 
take more data.
Dean
> Trond
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 398f8ed..d2ed659 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -870,9 +870,18 @@  static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	dreq = nfs_direct_req_alloc();
 	if (!dreq)
 		goto out;
-	nfs_alloc_commit_data(dreq);
 
-	if (dreq->commit_data == NULL || count <= wsize)
+	if (count > wsize || nr_segs > 1)
+		nfs_alloc_commit_data(dreq);
+	else
+		dreq->commit_data = NULL;
+
+	/*
+	 * If we couldn't allocate commit data, or we'll just be doing a
+	 * single write, then make this a NFS_FILE_SYNC write and do away
+	 * with the commit.
+	 */
+	if (dreq->commit_data == NULL)
 		sync = NFS_FILE_SYNC;
 
 	dreq->inode = inode;