From patchwork Tue Feb 11 14:33:31 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Trond Myklebust
X-Patchwork-Id: 3627491
Message-ID: <1392129211.5763.5.camel@leira.trondhjem.org>
Subject: Re: xprt_wait_for_buffer_space changes causes a hang.
From: Trond Myklebust
To: NeilBrown
Cc: NFS
Date: Tue, 11 Feb 2014 09:33:31 -0500
In-Reply-To: <20140210170315.33dfc621@notabene.brown>
References: <20140210170315.33dfc621@notabene.brown>
Organization: PrimaryData Inc
X-Mailing-List: linux-nfs@vger.kernel.org

On Mon, 2014-02-10 at 17:03 +1100, NeilBrown wrote:
> Hi,
> We have a customer who reports occasional but reproducible hangs on our 3.0
> based kernel.
> I managed to deduce that
>
> commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
> Author: Trond Myklebust
> Date: Fri Feb 22 14:57:57 2013 -0500
>
>     SUNRPC: Don't start the retransmission timer when out of socket space
>
> was to blame (it got into our kernel through -stable ... not sure why it
> deserved to be in -stable). Reverting that patch fixes the problem. However I
> don't fully understand why.
>

The reason why that patch was put into stable was that the connection
breakage triggered by the timeouts was causing nasty behaviour when
servers (or the network) are heavily loaded.
Instead of clearing the logjam, breaking the connection and then
reconnecting would aggravate it, causing hangs.

Anyhow, does the following patch help to break the race?

8<------------------------------------------------------------------
From e4c0373be4b8deae2667a7478d34415b99924abc Mon Sep 17 00:00:00 2001
From: Trond Myklebust
Date: Tue, 11 Feb 2014 09:15:54 -0500
Subject: [PATCH] SUNRPC: Fix races in xs_nospace()

When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.

Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown
Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust
---
 net/sunrpc/xprtsock.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 6497c221612c..966763d735e9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -510,6 +510,7 @@ static int xs_nospace(struct rpc_task *task)
 	struct rpc_rqst *req = task->tk_rqstp;
 	struct rpc_xprt *xprt = req->rq_xprt;
 	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	struct sock *sk = transport->inet;
 	int ret = -EAGAIN;
 
 	dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
@@ -527,7 +528,7 @@ static int xs_nospace(struct rpc_task *task)
 			 * window size
 			 */
 			set_bit(SOCK_NOSPACE, &transport->sock->flags);
-			transport->inet->sk_write_pending++;
+			sk->sk_write_pending++;
 			/* ...and wait for more buffer space */
 			xprt_wait_for_buffer_space(task, xs_nospace_callback);
 		}
@@ -537,6 +538,9 @@ static int xs_nospace(struct rpc_task *task)
 	}
 
 	spin_unlock_bh(&xprt->transport_lock);
+
+	/* Race breaker in case memory is freed before above code is called */
+	sk->sk_write_space(sk);
 	return ret;
 }
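
For anyone trying to picture the race described in the changelog above, here is a rough user-space sketch of the lost wakeup and of why the extra write_space call closes it. It is only a toy model, not kernel code: struct toy_sock and every toy_* function are hypothetical names invented for illustration, and the real sk_write_space() callback and xs_nospace() do considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket/transport state; not kernel types. */
struct toy_sock {
	bool nospace;       /* models SOCK_NOSPACE: a sender has asked for a wakeup */
	bool has_space;     /* models free space in the send buffer */
	bool task_sleeping; /* models the RPC task waiting for buffer space */
};

/* Toy model of sk->sk_write_space(): wake the sleeper only if one is armed. */
static void toy_write_space(struct toy_sock *sk)
{
	if (sk->nospace && sk->has_space) {
		sk->nospace = false;
		sk->task_sleeping = false;
	}
}

/* Toy model of xs_nospace(): arm the wait, then optionally run the race breaker. */
static void toy_nospace(struct toy_sock *sk, bool race_breaker)
{
	sk->nospace = true;
	sk->task_sleeping = true;
	if (race_breaker)
		toy_write_space(sk); /* re-check in case space was freed already */
}

/* Replay the racy ordering; returns true if the task is left asleep (hung). */
static bool toy_replay_race(bool race_breaker)
{
	struct toy_sock sk = { .nospace = false, .has_space = false,
			       .task_sleeping = false };

	/* The send has just failed because the buffer was full. The buffer
	 * then drains before the wait is armed, so the notification finds
	 * nobody to wake. */
	sk.has_space = true;
	toy_write_space(&sk);
	assert(!sk.task_sleeping);

	/* Only now does the sender get round to arming the wait. */
	toy_nospace(&sk, race_breaker);
	return sk.task_sleeping;
}
```

In this model toy_replay_race(false) leaves the task asleep with nothing left to wake it, while toy_replay_race(true) wakes it straight away because the second write_space call sees the space that was freed earlier.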