From patchwork Fri Jan 3 16:52:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuck Lever X-Patchwork-Id: 11317069 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6BBB1109A for ; Fri, 3 Jan 2020 16:52:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 49A7D217F4 for ; Fri, 3 Jan 2020 16:52:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="RWASNqSp" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728021AbgACQwO (ORCPT ); Fri, 3 Jan 2020 11:52:14 -0500 Received: from mail-yw1-f68.google.com ([209.85.161.68]:38875 "EHLO mail-yw1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727912AbgACQwO (ORCPT ); Fri, 3 Jan 2020 11:52:14 -0500 Received: by mail-yw1-f68.google.com with SMTP id 10so18746477ywv.5 for ; Fri, 03 Jan 2020 08:52:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=lx2ByXRb2E6HSft0879G0aUxzQETJaxyjGgs2Ax3D8Y=; b=RWASNqSp452XjftnMpTEHLWK/UXIsYOWAczclNA4+xSAJ97yyvst2pQUpICp4lkbpY 2y5q6dB+CAtMh/HjNKMes5Au5Eu0ZINymzAhEQ/tf2dA0z+jS2LjgC+dBKnIsH7u/z1v RbXwXUPv4nIYL9XO0gxZD/EDZkx95vXnTtUyyJ0KCFz2JBKwC3tOZqUly/Z0ABeAztm8 Xv62ZKvBVH7FIvLNjXCHSB3wniYOO1PvGpzKlgETRVmpjwMif+yZQgD7+hF+QiynStvi xtJf+Rcf72BYPEBXQCE53ygejxr0WkEfVMt72qRsUulwZjteLM5WVkMAkaX16+dZeguT 6/bw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:subject:from:to:cc:date:message-id :in-reply-to:references:user-agent:mime-version 
:content-transfer-encoding; bh=lx2ByXRb2E6HSft0879G0aUxzQETJaxyjGgs2Ax3D8Y=; b=qrqhaN59AawqUGeV/8ctWumg4V9ibYZUYrYuSfAVXnogHE29RrVFcwTaTMndJA9BIP mkGpVoI05Vyo/ZAVIDG6jZryineONwdkIEux0uHUcogSBLQiHw6DrZyzR1lixq7q5By8 NcDQN3AR1jHVPOPHq8vV2gN/3bS3HQvfFgkRPypxAx/Y+Jn87kqw6NPoTzRixjsd48WA tPm/HBsLsyzSpZ0Dcl3yupIgURj0jD9Ucbt0WOo/q5W01Kw+7FFCwQqNihPe8ILSP4Vc EoNBHbKoOFf+E8xBet7qwzaUVmEy0BmxE0k2nbbIQr5aj163MtoCPW+mmTzTpzmbCU93 iQGg== X-Gm-Message-State: APjAAAVyrPFxXTBt4WMoKq53l5QcjMHE/naMrqkIt38UKgw+WqradA0H N0SV7J/RtPEAq1wIjVGDfjudWCjK X-Google-Smtp-Source: APXvYqyQPbxk73awJP2lDE7y/nbon+cWitTWwiTeERy+JK0JOU6HVwAQWcEmxzypz6bCIoioPdS7QQ== X-Received: by 2002:a0d:d84b:: with SMTP id a72mr64758107ywe.33.1578070333541; Fri, 03 Jan 2020 08:52:13 -0800 (PST) Received: from gateway.1015granger.net (c-68-61-232-219.hsd1.mi.comcast.net. [68.61.232.219]) by smtp.gmail.com with ESMTPSA id v38sm24520114ywh.63.2020.01.03.08.52.13 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Jan 2020 08:52:13 -0800 (PST) Received: from morisot.1015granger.net (morisot.1015granger.net [192.168.1.67]) by gateway.1015granger.net (8.14.7/8.14.7) with ESMTP id 003GqC6H016369; Fri, 3 Jan 2020 16:52:12 GMT Subject: [PATCH 1/3] xprtrdma: Fix create_qp crash on device unload From: Chuck Lever To: anna.schumaker@netapp.com Cc: linux-nfs@vger.kernel.org Date: Fri, 03 Jan 2020 11:52:12 -0500 Message-ID: <157807033222.3637.2110603095451994959.stgit@morisot.1015granger.net> In-Reply-To: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> References: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> User-Agent: StGit/0.19 MIME-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On device re-insertion, the RDMA device driver crashes trying to set up a new QP: Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0 Nov 27 16:32:06 manet kernel: #PF: 
supervisor write access in kernel mode Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page Nov 27 16:32:06 manet kernel: PGD 0 P4D 0 Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852 Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12 Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00 Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046 Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000 Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0 Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000 Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8 Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000 Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000 Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0 Nov 27 16:32:06 manet kernel: Call Trace: Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib] Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a Nov 27 16:32:06 manet kernel: ? 
__kmalloc+0x125/0x139 Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib] The fix is to copy the qp_init_attr struct that was just created by rpcrdma_ep_create() instead of using the one from the previous connection instance. Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect") Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/verbs.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 77c7dd7f05e8..3a56458e8c05 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -599,6 +599,7 @@ static int rpcrdma_ep_recreate_xprt(struct rpcrdma_xprt *r_xprt, struct ib_qp_init_attr *qp_init_attr) { struct rpcrdma_ia *ia = &r_xprt->rx_ia; + struct rpcrdma_ep *ep = &r_xprt->rx_ep; int rc, err; trace_xprtrdma_reinsert(r_xprt); @@ -613,6 +614,7 @@ static int rpcrdma_ep_recreate_xprt(struct rpcrdma_xprt *r_xprt, pr_err("rpcrdma: rpcrdma_ep_create returned %d\n", err); goto out2; } + memcpy(qp_init_attr, &ep->rep_attr, sizeof(*qp_init_attr)); rc = -ENETUNREACH; err = rdma_create_qp(ia->ri_id, ia->ri_pd, qp_init_attr); From patchwork Fri Jan 3 16:52:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuck Lever X-Patchwork-Id: 11317075 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 90C70109A for ; Fri, 3 Jan 2020 16:52:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6CA1F222C4 for ; Fri, 3 Jan 2020 16:52:21 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ESxcjERX" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727988AbgACQwV (ORCPT ); Fri, 3 Jan 2020 
11:52:21 -0500 Received: from mail-yb1-f196.google.com ([209.85.219.196]:40791 "EHLO mail-yb1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727912AbgACQwU (ORCPT ); Fri, 3 Jan 2020 11:52:20 -0500 Received: by mail-yb1-f196.google.com with SMTP id a2so18847324ybr.7 for ; Fri, 03 Jan 2020 08:52:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=6eCy8RTnkTp4jqu5RbgWVFw00E8wG1/G+IUpPmQpIIk=; b=ESxcjERXaFHk/os8AMdhgKrnUxBerGkq/XI6alyYlz/i3/uDK3e7E7P1rNhe4h1Dyd OYQquKzVboBjvRwq7QJkfZhWPRKBReC+/MFhOkkQ/0OwsDN1fVKFvod3BWEccCSdwnfN AfTdQWIB/gCNdqGhf88qsKVhMm+40I3kb+PIYLdhxuPrX27nC5s2cayjZx6IsWjX3NNE xV9Stl07pNZ//HqhLdI1KNKbiIRzPfISulVAjx2b6PCkdEB5f60/SXKXLzAgS/zjtQpC P8vBPiS7/ecGRdvbUsoEusXM4sedFPLmEeVJmErvsoYSKHgao6utn9ZBKImVT8RtZq3m AW2g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:subject:from:to:cc:date:message-id :in-reply-to:references:user-agent:mime-version :content-transfer-encoding; bh=6eCy8RTnkTp4jqu5RbgWVFw00E8wG1/G+IUpPmQpIIk=; b=BCgT1HGDWOyfGkbYxivV+sB7yzQ1Z4Y62Evg5ZKsWY53xUI7OCWT1s5pjYnC+6Jywn kQ+G31CgxqYvEyQnq5Vdw5Mrz11l+zHBRNDlsk/H6lL9su1/CsihGQxoR2j34UkHuf98 VN6R4Q/H1Vm6iwJv0u261LNY2Ej1a32dM256DbQIfYQYpHMO8K+yNT705AIDm6WaQJ/7 7uzmQqcncKYOtGXZZpE0pEDF40fvsQbgoZm8dDVZl/Y9CueoH2mjWyShpz7Ps9Xf5qhv 3W8Vj5cETFAW2icJwfHc4PZy3KUDl0aEk8yIpvMEoGHZxd/Vj4Kmqk8mSbj46QAC3KMB 9uFg== X-Gm-Message-State: APjAAAVPwjZxYBcaAOgYOW3p+zIuSyuMQiDk6P2QKMhIpFa8XqC5J2kM 3yjLDdBvBrvOKJ8C7i36acSJzZe/ X-Google-Smtp-Source: APXvYqyOI4g6jxD1GFnZAJ5ke683WiLEiGRjsk8/ZdPy/tgVD4NfTkpFM8aY0Wa4ciqllCjOS/6uAw== X-Received: by 2002:a25:d657:: with SMTP id n84mr66777035ybg.408.1578070338833; Fri, 03 Jan 2020 08:52:18 -0800 (PST) Received: from gateway.1015granger.net (c-68-61-232-219.hsd1.mi.comcast.net. 
[68.61.232.219]) by smtp.gmail.com with ESMTPSA id z2sm23099967ywb.13.2020.01.03.08.52.18 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Jan 2020 08:52:18 -0800 (PST) Received: from morisot.1015granger.net (morisot.1015granger.net [192.168.1.67]) by gateway.1015granger.net (8.14.7/8.14.7) with ESMTP id 003GqHWo016372; Fri, 3 Jan 2020 16:52:17 GMT Subject: [PATCH 2/3] xprtrdma: Fix completion wait during device removal From: Chuck Lever To: anna.schumaker@netapp.com Cc: linux-nfs@vger.kernel.org Date: Fri, 03 Jan 2020 11:52:17 -0500 Message-ID: <157807033753.3637.8863288300598635285.stgit@morisot.1015granger.net> In-Reply-To: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> References: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> User-Agent: StGit/0.19 MIME-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org I've found that on occasion, "rmmod " will hang while an NFS mount is under load. Ensure that ri_remove_done is initialized only just before the transport is woken up to force a close. This avoids the completion possibly getting initialized again while the CM event handler is waiting for a wake-up. 
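
For readers unfamiliar with the hazard, here is a rough user-space illustration of why re-initializing a completion under a waiter can lose the wake-up. It is only a toy, single-threaded model: struct toy_completion and every toy_* and waiter_released_* name below are hypothetical stand-ins, not the kernel's struct completion or any xprtrdma function, and the "old" ordering is an assumed interleaving based on the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a completion. NOT the kernel's struct completion. */
struct toy_completion {
	unsigned int done;	/* wake-ups posted but not yet consumed */
	unsigned int waiters;	/* callers currently parked on this object */
};

/* Reset the object. Like a real re-init, it forgets any parked waiter. */
static void toy_init(struct toy_completion *c)
{
	assert(c);
	c->done = 0;
	c->waiters = 0;
}

/* Park a waiter unless a wake-up is already pending. */
static void toy_begin_wait(struct toy_completion *c)
{
	assert(c);
	if (c->done) {
		c->done--;
		return;
	}
	c->waiters++;
}

/* Post one wake-up. Returns how many parked waiters were released. */
static unsigned int toy_complete(struct toy_completion *c)
{
	assert(c);
	if (c->waiters == 0) {
		c->done++;	/* nobody known to be waiting: just record it */
		return 0;
	}
	c->waiters--;
	return 1;
}

/* Assumed old ordering: init happens on every ID creation, so it can
 * run again after the removal handler has already started waiting.
 */
static bool waiter_released_old_order(void)
{
	struct toy_completion c;

	toy_init(&c);		/* first ID creation */
	toy_begin_wait(&c);	/* removal handler parks */
	toy_init(&c);		/* reconnect creates a new ID: waiter forgotten */
	return toy_complete(&c) == 1;
}

/* Patched ordering: a single init immediately before the forced close. */
static bool waiter_released_new_order(void)
{
	struct toy_completion c;

	toy_init(&c);		/* in the removal handler itself */
	toy_begin_wait(&c);	/* handler parks */
	return toy_complete(&c) == 1;
}
```

In this model waiter_released_old_order() reports that the parked waiter is never released, which corresponds to the hang, while waiter_released_new_order() reports a normal wake-up. The real code is concurrent, so this shows only the ordering problem, not the actual locking or wait-queue behavior.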
Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount") Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/verbs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 3a56458e8c05..2c40465a19e1 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -244,6 +244,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) ia->ri_id->device->name, rpcrdma_addrstr(r_xprt), rpcrdma_portstr(r_xprt)); #endif + init_completion(&ia->ri_remove_done); set_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags); ep->rep_connected = -ENODEV; xprt_force_disconnect(xprt); @@ -297,7 +298,6 @@ rpcrdma_create_id(struct rpcrdma_xprt *xprt, struct rpcrdma_ia *ia) int rc; init_completion(&ia->ri_done); - init_completion(&ia->ri_remove_done); id = rdma_create_id(xprt->rx_xprt.xprt_net, rpcrdma_cm_event_handler, xprt, RDMA_PS_TCP, IB_QPT_RC); From patchwork Fri Jan 3 16:52:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuck Lever X-Patchwork-Id: 11317077 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7744B138C for ; Fri, 3 Jan 2020 16:52:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4D47A222C4 for ; Fri, 3 Jan 2020 16:52:26 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="WWTAGc7O" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728088AbgACQwZ (ORCPT ); Fri, 3 Jan 2020 11:52:25 -0500 Received: from mail-yb1-f194.google.com ([209.85.219.194]:36155 "EHLO mail-yb1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S1727912AbgACQwZ (ORCPT ); Fri, 3 Jan 2020 11:52:25 -0500 Received: by mail-yb1-f194.google.com with SMTP id w126so16945091yba.3 for ; Fri, 03 Jan 2020 08:52:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=rXiIYT7cfNWtLb99yIcbO3RwkHeET8RunTTH+yoGcQ4=; b=WWTAGc7Opngpc5j7520gv+PS8KRq03i+4PewTRCcJ22VttU6EP/9hcequQme4ndPqr mCt5YDeaLhzXJwjaA/EL2QiTh46YqoTbqFg8LW+Ba+4agVvR3mCdufSAb2i4SP8R46sJ UoK2F/0O6DvoGdhLJBFHlo3G9HeJmbUq9gK2LxYqLtDnoPzvh/FwuN2kLV8igyooXZUZ XjLE3wdIq2LneObU5rnKjsQFRBzDm8/mU3xuOnMcfrgs/WkyIo8jni92t38ByYTiKeaF AUf68J0/RGx56yesdC/uARpt35+q1ZP6hg1e1UItmQPVCKQ+KNGoCpTSzEm3JibwPoj8 MCgg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:subject:from:to:cc:date:message-id :in-reply-to:references:user-agent:mime-version :content-transfer-encoding; bh=rXiIYT7cfNWtLb99yIcbO3RwkHeET8RunTTH+yoGcQ4=; b=lCiNwqJxFxYVy7c+MVxOsgBoZV0a0TE+OtjHNkoLcS5i6rzqM000Ty5jPlU9H6ePcq 04VIATfSNwHSgf2dJT+/Gac9uJAZ++QGy/pWTKsgKOsOMSeU2pmrCEyF/xRAp0B74rDq JuAtwEkhqvahARmP7fkdAWKd0mBrAxOYlbMiZsiZXMU6UCfNdSldbUB2AOdNEFG901zy XVBiq5uvXASZ8wM07L7E/GnN6ASpC8jHSOM0Pr6FruHhH773/HD/MDx9l4NsI3RWhobr 1Go6+3MOzTD3RYSt6ZJL/7n6qNtIq5bg0pR9EsdnRy6ewCWDxaxFyQpSQYxLLgRN7GpH yFhQ== X-Gm-Message-State: APjAAAUxnX4zh/kE5K5Bkg3TRiq7Tg/gfDyosMewirt7Ds7pUFLxKHwk o0N7j71gNVFWbNOF6g32QBB72k8C X-Google-Smtp-Source: APXvYqz7S+trXZvsnI+ZUX9ysD0G0pBena1esKywCszx9tEwmkvVqGDFs0LXNJ3Rb+4TFNCXSuWuNA== X-Received: by 2002:a25:a2c2:: with SMTP id c2mr63804115ybn.246.1578070344153; Fri, 03 Jan 2020 08:52:24 -0800 (PST) Received: from gateway.1015granger.net (c-68-61-232-219.hsd1.mi.comcast.net. 
[68.61.232.219]) by smtp.gmail.com with ESMTPSA id g190sm23629820ywf.41.2020.01.03.08.52.23 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Jan 2020 08:52:23 -0800 (PST) Received: from morisot.1015granger.net (morisot.1015granger.net [192.168.1.67]) by gateway.1015granger.net (8.14.7/8.14.7) with ESMTP id 003GqM4A016375; Fri, 3 Jan 2020 16:52:22 GMT Subject: [PATCH 3/3] xprtrdma: Fix oops in Receive handler after device removal From: Chuck Lever To: anna.schumaker@netapp.com Cc: linux-nfs@vger.kernel.org Date: Fri, 03 Jan 2020 11:52:22 -0500 Message-ID: <157807034285.3637.1107321602862156718.stgit@morisot.1015granger.net> In-Reply-To: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> References: <157807026361.3637.2531475820164100233.stgit@morisot.1015granger.net> User-Agent: StGit/0.19 MIME-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org Since v5.4, a device removal occasionally triggered this oops: Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219 Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page Dec 2 17:13:53 manet kernel: PGD 0 P4D 0 Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883 Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma] Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0 Dec 2 17:13:53 manet kernel: RSP: 
0018:ffffc900035dfe00 EFLAGS: 00010246 Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048 Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001 Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000 Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428 Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010 Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000 Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0 Dec 2 17:13:53 manet kernel: Call Trace: Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core] Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core] Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9 Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74 Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30 The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that is still pointing to the old ib_device, which has been freed. The only way that is possible is if this rpcrdma_rep was not destroyed by rpcrdma_ia_remove. Debugging showed that was indeed the case: this rpcrdma_rep was still in use by a completing RPC at the time of the device removal, and thus wasn't on the rep free list. So, it was not found by rpcrdma_reps_destroy(). The fix is to introduce a list of all rpcrdma_reps so that they all can be found when a device is removed. 
That list is used to perform only regbuf DMA unmapping, replacing that call to rpcrdma_reps_destroy(). Meanwhile, to prevent corruption of this list, I've moved the destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs(). rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus protecting the rb_all_reps list. Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps") Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/verbs.c | 25 +++++++++++++++++++------ net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 2c40465a19e1..fda3889993cb 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -77,7 +77,7 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt, struct rpcrdma_sendctx *sc); static void rpcrdma_reqs_reset(struct rpcrdma_xprt *r_xprt); -static void rpcrdma_reps_destroy(struct rpcrdma_buffer *buf); +static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt); static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt); static void rpcrdma_mrs_destroy(struct rpcrdma_xprt *r_xprt); static struct rpcrdma_regbuf * @@ -421,7 +421,7 @@ rpcrdma_ia_remove(struct rpcrdma_ia *ia) /* The ULP is responsible for ensuring all DMA * mappings and MRs are gone. 
*/ - rpcrdma_reps_destroy(buf); + rpcrdma_reps_unmap(r_xprt); list_for_each_entry(req, &buf->rb_allreqs, rl_all) { rpcrdma_regbuf_dma_unmap(req->rl_rdmabuf); rpcrdma_regbuf_dma_unmap(req->rl_sendbuf); @@ -1092,6 +1092,7 @@ static struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt, rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov; rep->rr_recv_wr.num_sge = 1; rep->rr_temp = temp; + list_add(&rep->rr_all, &r_xprt->rx_buf.rb_all_reps); return rep; out_free: @@ -1102,6 +1103,7 @@ static struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt, static void rpcrdma_rep_destroy(struct rpcrdma_rep *rep) { + list_del(&rep->rr_all); rpcrdma_regbuf_free(rep->rr_rdmabuf); kfree(rep); } @@ -1120,10 +1122,16 @@ static struct rpcrdma_rep *rpcrdma_rep_get_locked(struct rpcrdma_buffer *buf) static void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep) { - if (!rep->rr_temp) - llist_add(&rep->rr_node, &buf->rb_free_reps); - else - rpcrdma_rep_destroy(rep); + llist_add(&rep->rr_node, &buf->rb_free_reps); +} + +static void rpcrdma_reps_unmap(struct rpcrdma_xprt *r_xprt) +{ + struct rpcrdma_buffer *buf = &r_xprt->rx_buf; + struct rpcrdma_rep *rep; + + list_for_each_entry(rep, &buf->rb_all_reps, rr_all) + rpcrdma_regbuf_dma_unmap(rep->rr_rdmabuf); } static void rpcrdma_reps_destroy(struct rpcrdma_buffer *buf) @@ -1154,6 +1162,7 @@ int rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt) INIT_LIST_HEAD(&buf->rb_send_bufs); INIT_LIST_HEAD(&buf->rb_allreqs); + INIT_LIST_HEAD(&buf->rb_all_reps); rc = -ENOMEM; for (i = 0; i < buf->rb_max_requests; i++) { @@ -1506,6 +1515,10 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp) wr = NULL; while (needed) { rep = rpcrdma_rep_get_locked(buf); + if (rep && rep->rr_temp) { + rpcrdma_rep_destroy(rep); + continue; + } if (!rep) rep = rpcrdma_rep_create(r_xprt, temp); if (!rep) diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 5d15140a0266..d796d68609ed 
100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -203,6 +203,7 @@ struct rpcrdma_rep { struct xdr_stream rr_stream; struct llist_node rr_node; struct ib_recv_wr rr_recv_wr; + struct list_head rr_all; }; /* To reduce the rate at which a transport invokes ib_post_recv @@ -368,6 +369,7 @@ struct rpcrdma_buffer { struct list_head rb_allreqs; struct list_head rb_all_mrs; + struct list_head rb_all_reps; struct llist_head rb_free_reps;