diff mbox

NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

Message ID 20130725201805.GB17962@fieldses.org (mailing list archive)
State New, archived
Headers show

Commit Message

J. Bruce Fields July 25, 2013, 8:18 p.m. UTC
On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> 
> Since we enabled auto-tuning for sunrpc TCP connections we do not
> guarantee that there is enough write-space on each connection to
> queue a reply.
> 
> If memory pressure causes the window to shrink too small, the request
> throttling in sunrpc/svc will not accept any requests so no more requests
> will be handled.  Even when pressure decreases the window will not
> grow again until data is sent on the connection.
> This means we get a deadlock:  no requests will be handled until there
> is more space, and no space will be allocated until a request is
> handled.
> 
> This can be simulated by modifying svc_tcp_has_wspace to inflate the
> number of byte required and removing the 'svc_sock_setbufsize' calls
> in svc_setup_socket.

Ah-hah!

> I found that multiplying by 16 was enough to make the requirement
> exceed the default allocation.  With this modification in place:
>    mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> would block and eventually time out because the nfs server could not
> accept any requests.

So, this?:


> This patch relaxes the request throttling to always allow at least one
> request through per connection.  It does this by checking both
>   sk_stream_min_wspace() and xprt->xpt_reserved
> are zero.
> The first is zero when the TCP transmit queue is empty.
> The second is zero when there are no RPC requests being processed.
> When both of these are zero the socket is idle and so one more
> request can safely be allowed through.
> 
> Applying this patch allows the above mount command to succeed cleanly.
> Tracing shows that the allocated write buffer space quickly grows and
> after a few requests are handled, the extra tests are no longer needed
> to permit further requests to be processed.
> 
> The main purpose of request throttling is to handle the case when one
> client is slow at collecting replies and the send queue gets full of
> replies that the client hasn't acknowledged (at the TCP level) yet.
> As we only change behaviour when the send queue is empty this main
> purpose is still preserved.
> 
> Reported-by: Ben Myers <bpm@sgi.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> --
> As you can see I've changed the patch.  While writing up the above 
> description realised there was a weakness and so added the sk_stream_min_wspace
> test.  That allowed me to write the final paragraph.

This is great, thanks!

Inclined to queue it up for 3.11 and stable....

--b.

> 
> NeilBrown
> 
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..7762b9f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
>  	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
>  		return 1;
>  	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> -	if (sk_stream_wspace(svsk->sk_sk) >= required)
> +	if (sk_stream_wspace(svsk->sk_sk) >= required ||
> +	    (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> +	     atomic_read(&xprt->xpt_reserved) == 0))
>  		return 1;
>  	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
>  	return 0;


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

NeilBrown July 25, 2013, 8:33 p.m. UTC | #1
On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
wrote:

> On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > 
> > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > guarantee that there is enough write-space on each connection to
> > queue a reply.
> > 
> > If memory pressure causes the window to shrink too small, the request
> > throttling in sunrpc/svc will not accept any requests so no more requests
> > will be handled.  Even when pressure decreases the window will not
> > grow again until data is sent on the connection.
> > This means we get a deadlock:  no requests will be handled until there
> > is more space, and no space will be allocated until a request is
> > handled.
> > 
> > This can be simulated by modifying svc_tcp_has_wspace to inflate the
> > number of byte required and removing the 'svc_sock_setbufsize' calls
> > in svc_setup_socket.
> 
> Ah-hah!
> 
> > I found that multiplying by 16 was enough to make the requirement
> > exceed the default allocation.  With this modification in place:
> >    mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> > would block and eventually time out because the nfs server could not
> > accept any requests.
> 
> So, this?:

Close enough.  I just put "//" in front of the lines I didn't want rather
than delete them.  But yes: exactly that effect.


> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..36de50d 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,6 +1193,7 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
>  	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
>  		return 1;
>  	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> +	required *= 16;
>  	if (sk_stream_wspace(svsk->sk_sk) >= required)
>  		return 1;
>  	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> @@ -1378,14 +1379,8 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>  	/* Initialize the socket */
>  	if (sock->type == SOCK_DGRAM)
>  		svc_udp_init(svsk, serv);
> -	else {
> -		/* initialise setting must have enough space to
> -		 * receive and respond to one request.
> -		 */
> -		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> -					4 * serv->sv_max_mesg);
> +	else
>  		svc_tcp_init(svsk, serv);
> -	}
>  
>  	dprintk("svc: svc_setup_socket created %p (inet %p)\n",
>  				svsk, svsk->sk_sk);
> 
> > This patch relaxes the request throttling to always allow at least one
> > request through per connection.  It does this by checking both
> >   sk_stream_min_wspace() and xprt->xpt_reserved
> > are zero.
> > The first is zero when the TCP transmit queue is empty.
> > The second is zero when there are no RPC requests being processed.
> > When both of these are zero the socket is idle and so one more
> > request can safely be allowed through.
> > 
> > Applying this patch allows the above mount command to succeed cleanly.
> > Tracing shows that the allocated write buffer space quickly grows and
> > after a few requests are handled, the extra tests are no longer needed
> > to permit further requests to be processed.
> > 
> > The main purpose of request throttling is to handle the case when one
> > client is slow at collecting replies and the send queue gets full of
> > replies that the client hasn't acknowledged (at the TCP level) yet.
> > As we only change behaviour when the send queue is empty this main
> > purpose is still preserved.
> > 
> > Reported-by: Ben Myers <bpm@sgi.com>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > 
> > --
> > As you can see I've changed the patch.  While writing up the above 
> > description realised there was a weakness and so added the sk_stream_min_wspace
> > test.  That allowed me to write the final paragraph.
> 
> This is great, thanks!
> 
> Inclined to queue it up for 3.11 and stable....

I'd agree for 3.11.
It feels a bit border-line for stable.  "dead-lock" and "has been seen in the
wild" are technically enough justification...
I'd probably mark it as "pleas don't apply to -stable until 3.11 is released"
or something like that, just for a bit of breathing space.
Your call though.

NeilBrown

> 
> --b.
> 
> > 
> > NeilBrown
> > 
> > 
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 305374d..7762b9f 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> >  	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> >  		return 1;
> >  	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> > -	if (sk_stream_wspace(svsk->sk_sk) >= required)
> > +	if (sk_stream_wspace(svsk->sk_sk) >= required ||
> > +	    (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> > +	     atomic_read(&xprt->xpt_reserved) == 0))
> >  		return 1;
> >  	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> >  	return 0;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields July 26, 2013, 2:19 p.m. UTC | #2
On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> wrote:
> 
> > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > 
> > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > guarantee that there is enough write-space on each connection to
> > > queue a reply.
...
> > This is great, thanks!
> > 
> > Inclined to queue it up for 3.11 and stable....
> 
> I'd agree for 3.11.
> It feels a bit border-line for stable.  "dead-lock" and "has been seen in the
> wild" are technically enough justification...
> I'd probably mark it as "pleas don't apply to -stable until 3.11 is released"
> or something like that, just for a bit of breathing space.
> Your call though.


So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
Greg were requesting that:

	- criteria for -stable and late -rc's should really be about the
	  same, and
	- people should follow Documentation/stable-kernel-rules.txt.

So as an exercise to remind me what those rules are:

Easy questions:

	- "no bigger than 100 lines, with context."  Check.
	- "It must fix only one thing."  Check.
	- "real bug that bothers people".  Check.
	- "tested": yep.  It doesn't actually say "tested on stable
	  trees", and I recall this did land you with a tricky bug one
	  time when a prerequisite was omitted from the backport.

Judgement calls:

	- "obviously correct": it's short, but admittedly subtle, and
	  performance regressions can take a while to get sorted out.
	- "It must fix a problem that causes a build error (but not for
	  things marked CONFIG_BROKEN), an oops, a hang, data
	  corruption, a real security issue, or some "oh, that's not
	  good" issue.  In short, something critical."  We could argue
	  that "server stops responding" is critical, though not to the
	  same degree as a panic.
	- OR: alternatively: "Serious issues as reported by a user of a
	  distribution kernel may also be considered if they fix a
	  notable performance or interactivity issue." The only bz I've
	  personally seen was the result of artificial testing of some
	  kind, and it sounds like your case involved a disk failure?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown July 30, 2013, 2:48 a.m. UTC | #3
On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
wrote:

> On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> > wrote:
> > 
> > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > > 
> > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > guarantee that there is enough write-space on each connection to
> > > > queue a reply.
> ...
> > > This is great, thanks!
> > > 
> > > Inclined to queue it up for 3.11 and stable....
> > 
> > I'd agree for 3.11.
> > It feels a bit border-line for stable.  "dead-lock" and "has been seen in the
> > wild" are technically enough justification...
> > I'd probably mark it as "pleas don't apply to -stable until 3.11 is released"
> > or something like that, just for a bit of breathing space.
> > Your call though.
> 
> 
> So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> Greg were requesting that:
> 
> 	- criteria for -stable and late -rc's should really be about the
> 	  same, and
> 	- people should follow Documentation/stable-kernel-rules.txt.
> 
> So as an exercise to remind me what those rules are:
> 
> Easy questions:
> 
> 	- "no bigger than 100 lines, with context."  Check.
> 	- "It must fix only one thing."  Check.
> 	- "real bug that bothers people".  Check.
> 	- "tested": yep.  It doesn't actually say "tested on stable
> 	  trees", and I recall this did land you with a tricky bug one
> 	  time when a prerequisite was omitted from the backport.
> 
> Judgement calls:
> 
> 	- "obviously correct": it's short, but admittedly subtle, and
> 	  performance regressions can take a while to get sorted out.
> 	- "It must fix a problem that causes a build error (but not for
> 	  things marked CONFIG_BROKEN), an oops, a hang, data
> 	  corruption, a real security issue, or some "oh, that's not
> 	  good" issue.  In short, something critical."  We could argue
> 	  that "server stops responding" is critical, though not to the
> 	  same degree as a panic.
> 	- OR: alternatively: "Serious issues as reported by a user of a
> 	  distribution kernel may also be considered if they fix a
> 	  notable performance or interactivity issue." The only bz I've
> 	  personally seen was the result of artificial testing of some
> 	  kind, and it sounds like your case involved a disk failure?
> 
> --b.

Looks like  good analysis ... except that it doesn't seem conclusive.  Being
conclusive would make it really good. :-)

The case that brought it to my attention doesn't require the fix.
A file system was mis-behaving (blocking when it should return EJUKEBOX) and
this resulted in nfsd behaviour different than my expectation.
I expected nfsd to keep accepting requests until all threads were blocks.
However only 4 requests were accepted (which is actually better behaviour,
but not what I expected).
So I looked into it and thought that what I found wasn't really right.  Which
turned out to be the case, but not the way I thought...

So my direct experience doesn't argue for the patch going to -stable at all.
If the only other reports are from artificial testing then I'd leave it out
of -stable.  I don't feel -rc4 (that's next I think) is too late for it
though.

NeilBrown
J. Bruce Fields Aug. 1, 2013, 2:49 a.m. UTC | #4
On Tue, Jul 30, 2013 at 12:48:57PM +1000, NeilBrown wrote:
> On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> wrote:
> 
> > On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
> > > wrote:
> > > 
> > > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > > > 
> > > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > > guarantee that there is enough write-space on each connection to
> > > > > queue a reply.
> > ...
> > > > This is great, thanks!
> > > > 
> > > > Inclined to queue it up for 3.11 and stable....
> > > 
> > > I'd agree for 3.11.
> > > It feels a bit border-line for stable.  "dead-lock" and "has been seen in the
> > > wild" are technically enough justification...
> > > I'd probably mark it as "pleas don't apply to -stable until 3.11 is released"
> > > or something like that, just for a bit of breathing space.
> > > Your call though.
> > 
> > 
> > So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> > Greg were requesting that:
> > 
> > 	- criteria for -stable and late -rc's should really be about the
> > 	  same, and
> > 	- people should follow Documentation/stable-kernel-rules.txt.
> > 
> > So as an exercise to remind me what those rules are:
> > 
> > Easy questions:
> > 
> > 	- "no bigger than 100 lines, with context."  Check.
> > 	- "It must fix only one thing."  Check.
> > 	- "real bug that bothers people".  Check.
> > 	- "tested": yep.  It doesn't actually say "tested on stable
> > 	  trees", and I recall this did land you with a tricky bug one
> > 	  time when a prerequisite was omitted from the backport.
> > 
> > Judgement calls:
> > 
> > 	- "obviously correct": it's short, but admittedly subtle, and
> > 	  performance regressions can take a while to get sorted out.
> > 	- "It must fix a problem that causes a build error (but not for
> > 	  things marked CONFIG_BROKEN), an oops, a hang, data
> > 	  corruption, a real security issue, or some "oh, that's not
> > 	  good" issue.  In short, something critical."  We could argue
> > 	  that "server stops responding" is critical, though not to the
> > 	  same degree as a panic.
> > 	- OR: alternatively: "Serious issues as reported by a user of a
> > 	  distribution kernel may also be considered if they fix a
> > 	  notable performance or interactivity issue." The only bz I've
> > 	  personally seen was the result of artificial testing of some
> > 	  kind, and it sounds like your case involved a disk failure?
> > 
> > --b.
> 
> Looks like  good analysis ... except that it doesn't seem conclusive.  Being
> conclusive would make it really good. :-)

Hah, yeah sorry.

> The case that brought it to my attention doesn't require the fix.
> A file system was mis-behaving (blocking when it should return EJUKEBOX) and
> this resulted in nfsd behaviour different than my expectation.
> I expected nfsd to keep accepting requests until all threads were blocks.
> However only 4 requests were accepted (which is actually better behaviour,
> but not what I expected).
> So I looked into it and thought that what I found wasn't really right.  Which
> turned out to be the case, but not the way I thought...
> 
> So my direct experience doesn't argue for the patch going to -stable at all.
> If the only other reports are from artificial testing then I'd leave it out
> of -stable.  I don't feel -rc4 (that's next I think) is too late for it
> though.

OK, I think I agree.  We can pass it along to stable later if more
complaints surface....

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 305374d..36de50d 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1193,6 +1193,7 @@  static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
 		return 1;
 	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+	required *= 16;
 	if (sk_stream_wspace(svsk->sk_sk) >= required)
 		return 1;
 	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
@@ -1378,14 +1379,8 @@  static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
 	/* Initialize the socket */
 	if (sock->type == SOCK_DGRAM)
 		svc_udp_init(svsk, serv);
-	else {
-		/* initialise setting must have enough space to
-		 * receive and respond to one request.
-		 */
-		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
-					4 * serv->sv_max_mesg);
+	else
 		svc_tcp_init(svsk, serv);
-	}
 
 	dprintk("svc: svc_setup_socket created %p (inet %p)\n",
 				svsk, svsk->sk_sk);