From patchwork Fri Aug 19 13:28:38 2016
X-Patchwork-Submitter: Gabriel Krisman Bertazi
X-Patchwork-Id: 9290307
From: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
To: Keith Busch
Cc: Christoph Hellwig, linux-nvme@lists.infradead.org, Jens Axboe, Brian King, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: Oops when completing request on the wrong queue
References: <87a8gltgks.fsf@linux.vnet.ibm.com>
Date: Fri, 19 Aug 2016 10:28:38 -0300
In-Reply-To: <87a8gltgks.fsf@linux.vnet.ibm.com> (Gabriel Krisman Bertazi's message of "Wed, 10 Aug 2016 01:04:35 -0300")
Message-Id: <871t1kq455.fsf@linux.vnet.ibm.com>

Gabriel Krisman Bertazi writes:

> We at IBM have been experiencing occasional Oopses when stressing I/O
> at the same time we add/remove processors. The Oops happens in the
> IRQ path, when we try to complete a request that was apparently meant
> for another queue.
>
> In __nvme_process_cq, the driver uses cqe.command_id and nvmeq->tags
> to find, via blk_mq_tag_to_rq, the request that initiated the I/O.
> Occasionally, the request returned by that function is not
> initialized, and we crash inside __blk_mq_complete_request, as shown
> in the trace below.
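(Interjecting a bit of context before the trace, for anyone not
following the nvme code: the lookup in question boils down to roughly
the following. This is a simplified sketch of the per-CQE work inside
__nvme_process_cq's loop, not the verbatim driver code.)

	/*
	 * Sketch: for each completion queue entry, map the command id
	 * back to the originating request through the tag set this
	 * nvmeq believes it is paired with, then complete it.
	 */
	struct request *req;

	req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
	blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);

If nvmeq->tags ends up pointing at the tags of a different hardware
queue, blk_mq_tag_to_rq() happily returns a request that belongs to
(or was never started on) some other queue, which matches the crash
signature below.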
>
> [ 2679.050701] Faulting instruction address: 0xc000000000549500
> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
> rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc
> pseries_rng rtc_generic sunrpc autofs4 btrfs xor raid6_pq ibmvscsi
> ibmveth nvme
> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300  Not tainted (4.4.0-22-generic)
> [ 2679.050807] MSR: 8000000100009033 CR: 28022428 XER: 00000000
> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0
> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00
> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001
> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68
> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50
> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002
> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff
> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00
> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050930] Call Trace:
> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
> [ 2679.051054] [c000000451bc33a0] [c000000000002594] hardware_interrupt_common+0x114/0x180
>
> I added some debugging to skip the Oops, and it made me think that we
> are fetching the wrong request because we are actually looking at the
> wrong set of tags.
>
> The log below illustrates the problem. Instead of crashing in the
> Oops, I made the execution skip the request completion and simply
> consider that odd CQE as handled. The first line of the log does
> that.
>
>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>> nvme nvme0: I/O 309 QID 63 timeout, aborting
>> nvme nvme0: Abort status: 0x0
>> nvme nvme0: I/O 309 QID 63 timeout, reset controller
>> nvme nvme0: completing aborted command with status: fffffffc
>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>
> In the first line, we find request 309 on queue 35, which would have
> crashed the execution; my debug patch just ignores it.
>
> Then, we can see that eventually the same I/O, 309, times out on
> another QID, which was actually waiting for it. The Abort request
> gets sent and completed, but we never receive the completion of the
> aborted request, so we time out again and reset the controller.
>
> This only happens when we are changing the SMT settings very
> frequently, which points back to the way we correlate the hctx->tags
> with the nvmeq structure, but I failed to find the exact code spot
> where things get messed up.
>
> I've put some effort into this code lately, but I definitely need
> some assistance on this one. Since, as far as I understand, this
> issue is much more likely to reproduce on boxes with a lot of
> hotpluggable processors, I'm willing to test any patches you may find
> suitable on our systems.
>

Hi,

[Adding linux-block, linux-scsi and Jens in CC for blk-mq input]

Made some progress on tracking down the issue. It's looking more like
a blk-mq bug than an nvme one to me now. Let me know if I made any
wrong assumption below:

I started to wonder whether blk-mq is issuing the I/O on the wrong
queue, or whether the request is being moved to another queue during
hotplug, as would happen in blk_mq_hctx_cpu_offline. I verified that
when the failure occurs, the request was never touched in
blk_mq_hctx_cpu_offline. Also, and more interestingly, I added the
following BUG_ON to nvme_queue_rq:

	BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);

In my experiments, I could never hit this during normal operation
(without SMT changes). But as soon as I start triggering hotplug
changes, I hit it pretty fast. By changing it to a WARN_ON, I see that
the Oops happens almost immediately after this request is submitted,
once __nvme_process_cq tries to complete it.

Also, when applying the patch below, we send a few -EIOs back to the
upper layers, but it prevents the crash.

I also think I saw a similar hang when running on a virtio_scsi disk.
I couldn't reproduce it a second time, so we still lack confirmation,
but if confirmed, it indeed points away from nvme and back to the
block layer.

Any suggestions to further track this down?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b927ecf..10f2257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -599,6 +599,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if (blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
+		ret = BLK_MQ_RQ_QUEUE_ERROR;
+		goto out;
+	}
+
 	ret = nvme_setup_cmd(ns, req, &cmnd);
 	if (ret)
 		goto out;
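For reference, the nvmeq->tags / hctx->tags correlation I'm suspicious
of is established once, at hctx init time. From memory of my 4.4-based
tree (trimmed, so take it as a sketch rather than the verbatim source):

	static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
				  unsigned int hctx_idx)
	{
		struct nvme_dev *dev = data;
		struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

		/*
		 * The first hctx to land on this nvmeq decides which
		 * tag set the completion path will use for lookups
		 * from then on.
		 */
		if (!nvmeq->tags)
			nvmeq->tags = &dev->tagset.tags[hctx_idx];

		WARN_ON(dev->tagset.tags[hctx_idx] != hctx->tags);
		hctx->driver_data = nvmeq;
		return 0;
	}

If a hotplug-triggered remap ever pairs an hctx with a different nvmeq
without that assignment being redone, the completion path would keep
calling blk_mq_tag_to_rq() against stale tags, which would be
consistent with everything above. As for the patch itself: returning
BLK_MQ_RQ_QUEUE_ERROR from ->queue_rq makes blk-mq fail the request
back to the submitter instead of dispatching it, which is where the
few -EIOs I mentioned come from.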