From patchwork Tue Jan 8 19:44:25 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Trond Myklebust X-Patchwork-Id: 1948021 Return-Path: X-Original-To: patchwork-linux-nfs@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork1.kernel.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by patchwork1.kernel.org (Postfix) with ESMTP id A9E0C3FED4 for ; Tue, 8 Jan 2013 19:44:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755921Ab3AHTo2 (ORCPT ); Tue, 8 Jan 2013 14:44:28 -0500 Received: from mx1.netapp.com ([216.240.18.38]:52166 "EHLO mx1.netapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756134Ab3AHTo1 (ORCPT ); Tue, 8 Jan 2013 14:44:27 -0500 X-IronPort-AV: E=Sophos;i="4.84,433,1355126400"; d="scan'208,223";a="234410076" Received: from smtp2.corp.netapp.com ([10.57.159.114]) by mx1-out.netapp.com with ESMTP; 08 Jan 2013 11:44:27 -0800 Received: from vmwexceht01-prd.hq.netapp.com (vmwexceht01-prd.hq.netapp.com [10.106.76.239]) by smtp2.corp.netapp.com (8.13.1/8.13.1/NTAP-1.6) with ESMTP id r08JiQxP025810; Tue, 8 Jan 2013 11:44:26 -0800 (PST) Received: from SACEXCMBX04-PRD.hq.netapp.com ([169.254.6.100]) by vmwexceht01-prd.hq.netapp.com ([10.106.76.239]) with mapi id 14.02.0328.009; Tue, 8 Jan 2013 11:44:26 -0800 From: "Myklebust, Trond" To: Chris Perl CC: "linux-nfs@vger.kernel.org" Subject: Re: Possible Race Condition on SIGKILL Thread-Topic: Possible Race Condition on SIGKILL Thread-Index: AQHN7QkawRBw2ZLtLEa0y38Zfddd9Zg+zGsAgAAIdICAAAKEgIAAAbgAgAAX1ICAAVpIgIAAEe0A Date: Tue, 8 Jan 2013 19:44:25 +0000 Message-ID: <4FA345DA4F4AE44899BD2B03EEEC2FA911993608@SACEXCMBX04-PRD.hq.netapp.com> References: <20130107185848.GB16957@nyc-qws-132.nyc.delacy.com> <4FA345DA4F4AE44899BD2B03EEEC2FA91199197E@SACEXCMBX04-PRD.hq.netapp.com> <20130107202021.GC16957@nyc-qws-132.nyc.delacy.com> <1357590561.28341.11.camel@lade.trondhjem.org> <4FA345DA4F4AE44899BD2B03EEEC2FA911991BE9@SACEXCMBX04-PRD.hq.netapp.com> <20130107220047.GA30814@nyc-qws-132.nyc.delacy.com> <20130108184011.GA30872@nyc-qws-132.nyc.delacy.com> In-Reply-To: <20130108184011.GA30872@nyc-qws-132.nyc.delacy.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: x-originating-ip: [10.104.60.117] MIME-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On Tue, 2013-01-08 at 13:40 -0500, Chris Perl wrote: > On Mon, Jan 07, 2013 at 05:00:47PM -0500, Chris Perl wrote: > Anyway, it appears that on mount the rpc_tasks tk_client member is NULL > and therefore the double dereference of task->tk_xprt is what blew > things up. I ammended the patch for this [1] and am testing it > now. > > Thus far, I've still hit hangs, it just seems to take longer. I'll have > to dig in a bit further to see what's going on now. > > Is this CentOS 6.3 kernel this system too old for you guys to care? > I.e. should I spend time reporting digging into and reporting problems > for this system as well or you only care about the fedora system? My main interest is always the upstream (Linus) kernel, however the RPC client in the CentOS 6.3 kernel does actually contain a lot of code that was recently backported from upstream. As such, it is definitely of interest to figure out corner case bugs so that we can compare to upstream... > I'll report back again when I have further info and after testing the > fedora system. > > [1] linux-kernel-test.patch I've attached the latest copy of the patch (v4). In addition to the check for tk_client!=NULL, it needed a couple of changes to deal with the RCU code. Cheers Trond From 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Mon, 7 Jan 2013 14:30:46 -0500 Subject: [PATCH v4] SUNRPC: Ensure we release the socket write lock if the rpc_task exits early If the rpc_task exits while holding the socket write lock before it has allocated an rpc slot, then the usual mechanism for releasing the write lock in xprt_release() is defeated. The problem occurs if the call to xprt_lock_write() initially fails, so that the rpc_task is put on the xprt->sending wait queue. If the task exits after being assigned the lock by __xprt_lock_write_func, but before it has retried the call to xprt_lock_and_alloc_slot(), then it calls xprt_release() while holding the write lock, but will immediately exit due to the test for task->tk_rqstp != NULL. Reported-by: Chris Perl Signed-off-by: Trond Myklebust Cc: stable@vger.kernel.org [>= 3.1] --- net/sunrpc/sched.c | 3 +-- net/sunrpc/xprt.c | 12 ++++++++++-- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index b4133bd..bfa3171 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -972,8 +972,7 @@ static void rpc_async_release(struct work_struct *work) static void rpc_release_resources_task(struct rpc_task *task) { - if (task->tk_rqstp) - xprt_release(task); + xprt_release(task); if (task->tk_msg.rpc_cred) { put_rpccred(task->tk_msg.rpc_cred); task->tk_msg.rpc_cred = NULL; diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index bd462a5..33811db 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -1136,10 +1136,18 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt) void xprt_release(struct rpc_task *task) { struct rpc_xprt *xprt; - struct rpc_rqst *req; + struct rpc_rqst *req = task->tk_rqstp; - if (!(req = task->tk_rqstp)) + if (req == NULL) { + if (task->tk_client) { + rcu_read_lock(); + xprt = rcu_dereference(task->tk_client->cl_xprt); + if (xprt->snd_task == task) + xprt_release_write(xprt, task); + rcu_read_unlock(); + } return; + } xprt = req->rq_xprt; if (task->tk_ops->rpc_count_stats != NULL) -- 1.7.11.7