
Is tcp autotuning really what NFS wants?

Message ID 20130716140021.312b5b07@notabene.brown (mailing list archive)
State New, archived

Commit Message

NeilBrown July 16, 2013, 4 a.m. UTC
On Mon, 15 Jul 2013 21:58:03 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
wrote:

> On Mon, Jul 15, 2013 at 02:32:03PM +1000, NeilBrown wrote:
> > On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> > wrote:
> > 
> > > On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > > > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> > > > wrote:
> > > > 
> > > > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > > > > 
> > > > > > Hi,
> > > > > >  I just noticed this commit:
> > > > > > 
> > > > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > > > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > > > > > Date:   Tue Oct 21 14:13:47 2008 -0400
> > > > > > 
> > > > > >     svcrpc: take advantage of tcp autotuning
> > > > > > 
> > > > > > 
> > > > > > which I must confess surprised me.  I wonder if the full implications of
> > > > > > removing that functionality were understood.
> > > > > > 
> > > > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > > > ensure there is plenty to hold all replies.  Now it doesn't.
> > > > > > 
> > > > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > > > buffer to send a reply.  This is important to ensure that each reply gets
> > > > > > sent atomically without blocking and there is no risk of replies getting
> > > > > > interleaved.
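
(For illustration only -- a minimal sketch of the gate Neil is describing
here, not the literal svc_tcp_has_wspace() source; reply_space_available()
is an invented name.)

#include <net/sock.h>
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svc_xprt.h>

/* Simplified sketch: a request is only picked up if the socket's free
 * send-buffer space covers every reply already promised on this
 * transport (xpt_reserved) plus one more worst-case reply (sv_max_mesg).
 */
static bool reply_space_available(struct svc_xprt *xprt,
				  struct svc_serv *serv, struct sock *sk)
{
	int required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;

	return sk_stream_wspace(sk) >= required;
}
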
> > > 
> > > By the way, it's xpt_mutex that really guarantees non-interleaving,
> > > isn't it?
> > 
> > Yes.  I had the two different things confused.  The issue is only about
> > blocking on writes and consequently tying up nfsd threads.
> > 
> > > 
> > > I think of the svc_tcp_has_wspace checks as solving a problem of
> > > fairness between clients.  If we only had one client, everything would
> > > work without them.  But when we have multiple clients we don't want a
> > > single slow client to be able to tie up every thread waiting for that
> > > client to receive read data.  Is that accurate?
> > 
> > It agrees with my understanding - yes.
> > 
> > 
> > > 
> > > > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > > > NFSv3 and v2 it quickly adjusts this down to something realistic.  For NFSv4
> > > > > > it is much harder to estimate the space needed so it just assumes every
> > > > > > reply will require 1M of space.
> > > > > > 
> > > > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > > > requests on that connection will be ignored.
> > > > > >
> > > > > > This could significantly limit the amount of parallelism that can be achieved
> > > > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > > > a single connection now, this could become more of an issue).
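
(Rough, illustrative numbers -- not measurements from the thread:)

    auto-tuned send buffer (say):      4 MB
    reserved per NFSv4 request:        1 MB
    requests admitted concurrently:    4 MB / 1 MB = 4 per connection
    send buffer shrunk below 1 MB:     0 -- the connection stalls
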
> > > > > 
> > > > > Worse, I believe it can deadlock completely if the transmit buffer
> > > > > shrinks too far, and people really have run into this:
> > > > > 
> > > > > 	http://mid.gmane.org/<20130125185748.GC29596@fieldses.org>
> > > > > 
> > > > > Trond's suggestion looked at the time like it might work and be doable:
> > > > > 
> > > > > 	http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > > > > 
> > > > > but I dropped it.
> > > > 
> > > > I would probably generalise Trond's suggestion and allow "N" extra requests
> > > > through that exceed the reservation, where N is related to the number of idle
> > > > threads.  Square root might be nice, but half is probably easiest.
> > > > 
> > > > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > > > the connection so a really bad connection won't block threads indefinitely.
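
(A purely hypothetical sketch of that heuristic -- Neil's actual patch is not
shown here, and both counters below would have to be tracked elsewhere; they
are not existing kernel state.)

/* Sketch only: admit a request that exceeds the transport's send-space
 * reservation while fewer than (idle threads / 2) such requests are
 * already outstanding.  Both arguments are assumed inputs, not fields
 * that exist in the kernel as-is. */
static bool may_exceed_reservation(int nr_idle_threads, int nr_overcommitted)
{
	return nr_overcommitted < nr_idle_threads / 2;
}
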
> > > > 
> > > > 
> > > > And yes - a nice test case would be good.
> > > > 
> > > > What do you think of the following (totally untested - just for comment)?
> > > 
> > > In cases where the disk is the bottleneck, there aren't actually going
> > > to be idle threads, are there?  In which case I don't think this helps
> > > save a stuck client.
> > 
> > Do we care about the disk being a bottleneck?  We don't currently try to
> > handle that situation at all - if the disk is slow, everything slows down.
> 
> Right, that's fine, it's just that:
> 
> Suppose you have a fast network, several fast clients all continuously
> reading, and one slow disk.
> 
> Then in the steady state all the threads will be waiting on the disk,
> and the receive buffers will all be full of read requests from the
> various clients.  When a thread completes a read it will immediately
> send the response, pick up the next request, and go back to waiting on
> the disk.
> 
> Suppose the send buffer of one of the clients drops below the minimum
> necessary.
> 
> Then after that point, I don't think that one client will ever have
> another rpc processed, even after your proposed change.

That would be because I only allowed extra requests up to half the number of
idle threads, and there would be zero idle threads.

So yes - I see your point.

I now think that it is sufficient to just allow one request through per
socket.  While there is high memory pressure, that is sufficient.  When the
memory pressure drops, that should be enough to cause sndbuf to grow.

We should be able to use xpt_reserved to check if this is the only request
or not:

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..5b832ef 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
 		return true;
 	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
-		return xprt->xpt_ops->xpo_has_wspace(xprt);
+		return xprt->xpt_ops->xpo_has_wspace(xprt) ||
+			atomic_read(&xprt->xpt_reserved) == 0;
 	return false;
 }
 
@@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
 		newxpt = xprt->xpt_ops->xpo_accept(xprt);
 		if (newxpt)
 			svc_add_new_temp_xprt(serv, newxpt);
-	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+	} else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
+		   atomic_read(&xprt->xpt_reserved) == 0) {
 		/* XPT_DATA|XPT_DEFERRED case: */
 		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 			rqstp, rqstp->rq_pool->sp_id, xprt,

Thoughts?

I tried testing this, but xpo_has_wspace never fails for me, even if I remove
the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
more).

> 
> Am I missing anything?

Not as far as I can see - thanks!

NeilBrown

Comments

J. Bruce Fields July 16, 2013, 2:24 p.m. UTC | #1
Adding Ben Myers to Cc: in case he has any testing help or advice (see
below):

On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> That would be because I only allowed extra requests up to half the number of
> idle threads, and there would be zero idle threads.
> 
> So yes - I see your point.
> 
> I now think that it is sufficient to just allow one request through per
> socket.  While there is high memory pressure, that is sufficient.  When the
> memory pressure drops, that should be enough to cause sndbuf to grow.
> 
> We should be able to use xpt_reserved to check if this is the only request
> or not: 

A remaining possible bad case is if the number of "bad" sockets is more
than the number of threads, and if the problem is something that won't
resolve itself by just waiting a little while.

I don't know if that's likely in practice.  Maybe a network failure cuts
you off from a large swath (but not all) of your clients?  Or maybe some
large proportion of your clients are just slow?

But this is two lines and looks likely to solve the immediate
problem:

> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..5b832ef 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
>  	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
>  		return true;
>  	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> -		return xprt->xpt_ops->xpo_has_wspace(xprt);
> +		return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> +			atomic_read(&xprt->xpt_reserved) == 0;
>  	return false;
>  }
>  
> @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
>  		newxpt = xprt->xpt_ops->xpo_accept(xprt);
>  		if (newxpt)
>  			svc_add_new_temp_xprt(serv, newxpt);
> -	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> +	} else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> +		   atomic_read(&xprt->xpt_reserved) == 0) {
>  		/* XPT_DATA|XPT_DEFERRED case: */
>  		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
>  			rqstp, rqstp->rq_pool->sp_id, xprt,
> 
> 
> Thoughts?
> 
> I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> more).

Ben, do you still have a setup that can reproduce the problem?  Or did
you ever find an easier way to reproduce?

--b.
Ben Myers July 18, 2013, 12:03 a.m. UTC | #2
Hey Bruce,

On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> Adding Ben Myers to Cc: in case he has any testing help or advice (see
> below):
> 
> On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > That would be because I only allowed extra requests up to half the number of
> > idle threads, and there would be zero idle threads.
> > 
> > So yes - I see your point.
> > 
> > I now think that it is sufficient to just allow one request through per
> > socket.  While there is high memory pressure, that is sufficient.  When the
> > memory pressure drops, that should be enough to cause sndbuf to grow.
> > 
> > We should be able to use xpt_reserved to check if this is the only request
> > or not: 
> 
> A remaining possible bad case is if the number of "bad" sockets is more
> than the number of threads, and if the problem is something that won't
> resolve itself by just waiting a little while.
> 
> I don't know if that's likely in practice.  Maybe a network failure cuts
> you off from a large swath (but not all) of your clients?  Or maybe some
> large proportion of your clients are just slow?
> 
> But this is two lines and looks likely to solve the immediate
> problem:
> 
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 80a6640..5b832ef 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> >  	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> >  		return true;
> >  	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > -		return xprt->xpt_ops->xpo_has_wspace(xprt);
> > +		return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > +			atomic_read(&xprt->xpt_reserved) == 0;
> >  	return false;
> >  }
> >  
> > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> >  		newxpt = xprt->xpt_ops->xpo_accept(xprt);
> >  		if (newxpt)
> >  			svc_add_new_temp_xprt(serv, newxpt);
> > -	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > +	} else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > +		   atomic_read(&xprt->xpt_reserved) == 0) {
> >  		/* XPT_DATA|XPT_DEFERRED case: */
> >  		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> >  			rqstp, rqstp->rq_pool->sp_id, xprt,
> > 
> > 
> > Thoughts?
> > 
> > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > more).
> 
> Ben, do you still have a setup that can reproduce the problem?  Or did you
> ever find an easier way to reproduce?

Unfortunately I don't still have that test rig handy, and I did not find an
easier way to reproduce this.  I do agree that it 'is sufficient to just allow
one request through per socket'... probably that would have solved the problem
in our case.  Maybe it would help to limit the amount of memory in your system
to try and induce additional memory pressure?  IIRC it was a 1mb block size
read workload where we would hit this.

Thanks,
Ben
J. Bruce Fields July 24, 2013, 9:07 p.m. UTC | #3
On Wed, Jul 17, 2013 at 07:03:19PM -0500, Ben Myers wrote:
> Hey Bruce,
> 
> On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> > Adding Ben Myers to Cc: in case he has any testing help or advice (see
> > below):
> > 
> > On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > > That would be because I only allowed extra requests up to half the number of
> > > idle threads, and there would be zero idle threads.
> > > 
> > > So yes - I see your point.
> > > 
> > > I now think that it is sufficient to just allow one request through per
> > > socket.  While there is high memory pressure, that is sufficient.  When the
> > > memory pressure drops, that should be enough to cause sndbuf to grow.
> > > 
> > > We should be able to use xpt_reserved to check if this is the only request
> > > or not: 
> > 
> > A remaining possible bad case is if the number of "bad" sockets is more
> > than the number of threads, and if the problem is something that won't
> > resolve itself by just waiting a little while.
> > 
> > I don't know if that's likely in practice.  Maybe a network failure cuts
> > you off from a large swath (but not all) of your clients?  Or maybe some
> > large proportion of your clients are just slow?
> > 
> > But this is two lines and looks likely to solve the immediate
> > problem:
> > 
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 80a6640..5b832ef 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> > >  	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> > >  		return true;
> > >  	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > > -		return xprt->xpt_ops->xpo_has_wspace(xprt);
> > > +		return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > +			atomic_read(&xprt->xpt_reserved) == 0;
> > >  	return false;
> > >  }
> > >  
> > > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> > >  		newxpt = xprt->xpt_ops->xpo_accept(xprt);
> > >  		if (newxpt)
> > >  			svc_add_new_temp_xprt(serv, newxpt);
> > > -	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > > +	} else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > +		   atomic_read(&xprt->xpt_reserved) == 0) {
> > >  		/* XPT_DATA|XPT_DEFERRED case: */
> > >  		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> > >  			rqstp, rqstp->rq_pool->sp_id, xprt,
> > > 
> > > 
> > > Thoughts?
> > > 
> > > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > > more).
> > 
> > Ben, do you still have a setup that can reproduce the problem?  Or did you
> > ever find an easier way to reproduce?
> 
> Unfortunately I don't still have that test rig handy, and I did not find an
> easier way to reproduce this.  I do agree that it 'is sufficient to just allow
> one request through per socket'... probably that would have solved the problem
> in our case.  Maybe it would help to limit the amount of memory in your system
> to try and induce additional memory pressure?  IIRC it was a 1mb block size
> read workload where we would hit this.

That makes sense.

Or I wonder if you could cheat and artificially force the tcp code to
behave as if there was memory pressure.  (E.g. by calling
tcp_enter_memory_pressure on some trigger).
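
(A rough sketch of that sort of test hack -- entirely hypothetical: the
force_tcp_pressure parameter and the helper are invented, and it assumes
tcp_enter_memory_pressure() is exported to modules.)

#include <linux/module.h>
#include <net/tcp.h>
#include <linux/sunrpc/svcsock.h>

/* Debug-only sketch, not real kernel code: pretend the TCP stack is
 * under memory pressure so autotuning shrinks send buffers and the
 * wspace checks start failing. */
static bool force_tcp_pressure;
module_param(force_tcp_pressure, bool, 0644);

static void maybe_fake_memory_pressure(struct svc_xprt *xprt)
{
	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);

	if (force_tcp_pressure)
		tcp_enter_memory_pressure(svsk->sk_sk);
}
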

Neil, are you still looking at this?  (If you're done with it, could you
resend that patch with a signed-off-by?)

--b.

Patch

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..5b832ef 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
 		return true;
 	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
-		return xprt->xpt_ops->xpo_has_wspace(xprt);
+		return xprt->xpt_ops->xpo_has_wspace(xprt) ||
+			atomic_read(&xprt->xpt_reserved) == 0;
 	return false;
 }
 
@@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
 		newxpt = xprt->xpt_ops->xpo_accept(xprt);
 		if (newxpt)
 			svc_add_new_temp_xprt(serv, newxpt);
-	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+	} else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
+		   atomic_read(&xprt->xpt_reserved) == 0) {
 		/* XPT_DATA|XPT_DEFERRED case: */
 		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 			rqstp, rqstp->rq_pool->sp_id, xprt,