Message ID | 871t1kq455.fsf@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote: > Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: > >> We, IBM, have been experiencing eventual Oops when stressing IO at the >> same time we add/remove processors. The Oops happens in the IRQ path, >> when we try to complete a request that was apparently meant for another >> queue. >> >> In __nvme_process_cq, the driver will use the cqe.command_id and the >> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that >> initiated the IO. Eventually, it happens that the request returned by >> that function is not initialized, and we crash inside >> __blk_mq_complete_request, as shown below. >> >> [ 2679.050701] Faulting instruction address: 0xc000000000549500 >> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1] >> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries >> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c >> rpcsec_gss_krb5 auth_rpcgss >> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc >> autofs4 btrfs xor >> raid6_pq ibmvscsi ibmveth nvme >> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu >> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000 >> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680 >> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted (4.4.0-22-generic) >> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28022428 XER: 00000000 >> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 >> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 >> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 >> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 >> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 >> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 >> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 >> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff >> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 >> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0 >> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >> [ 2679.050930] Call Trace: >> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable) >> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme] >> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320 >> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0 >> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0 >> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80 >> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0 >> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24 >> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140 >> [ 2679.051054] [c000000451bc33a0] [c000000000002594] >> hardware_interrupt_common+0x114/0x180 >> >> I added some debugging which skipped the Oops and made me think that we >> are fetching the wrong request because we are actually looking at the >> wrong set of tags. >> >> The log below exemplifies what I am saying. Instead of crashing in the >> Oops, I made the execution skip the request completion and simply >> consider that odd CQE as handled. The first line of the log does that. >> >>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q >>> nvme nvme0: I/O 309 QID 63 timeout, aborting >>> nvme nvme0: Abort status: 0x0 >>> nvme nvme0: I/O 309 QID 63 timeout, reset controller >>> nvme nvme0: completing aborted command with status: fffffffc >>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass >>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536 >> >> In the first line, we found the request 309 for queue 35, which would >> have crashed the execution. My debug patch just ignores it. >> >> Then, we can see that eventually, the same IO, 309, will timeout in >> another QID, which was actually expecting for it. The Abort request >> gets sent and completed, but we never receive the completion of the >> aborted request, thus we timeout again and restart the device. >> >> This only happens when we are changing the SMT settings very >> frequently, which points back to the way we correlate the hctx->tags to >> the nvmeq structure, but I failed to find the exact code spot were >> things get messed up. >> >> I've put a some effort on this code lately, but I definitely need some >> assistance on this one. Since, as far as I understand, this issue is >> much more likely to reproduce on boxes with a lot of hotpluggable >> processors, I'm willing to test any patches you may find suitable in our >> systems. >> > > Hi, > > [Adding linux-block, linux-scsi and Jens in CC for blk-mq input] > > Made some progress on tracking down the issue. It's looking more of a > blk-mq bug than nvme to me now. Let me know if I made any wrong > assumption below: > > I started to wonder whether blk-mq is requesting the IO in a wrong queue > or if the request is being moved to another queue during hotplug, as it > would happen in blk_mq_hctx_cpu_offline. > > I verified that when the fail occurs, the request never was never > touched in blk_mq_hctx_cpu_offline. Also, and more interesting, I added > the following BUG_ON to nvme_queue_req: > > BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req); > > From my experiments, I could never hit this during normal operation > (without SMT). But, as soon as I start triggering hotplug changes, I hit > it pretty fast. By changing to WARN_ON, I see that the Oops will happen > almost immediately after this request is submitted, once __process_cq > tries to complete it. > > Also, when adding the patch below, we send a few -EIOs back to upper > layers, but it prevents the crash. > > I also think i saw a similar hang when running on a virtio_scsi disk. I > couldn't reproduce it a second time, so we still lack confirmation, but > if confirmed, it indeed points away from nvme and back to the block > layer. > > Any suggestions to further track this down? Some good detective work so far! I agree, this looks like a blk-mq core bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when req->tag != nvmeq->tags? I don't immediately see how this could happen, the freezing should protect us from this, unless it's broken somehow. I'll take a look at this.
On 08/19/2016 08:13 AM, Jens Axboe wrote: > On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote: >> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> >>> We, IBM, have been experiencing eventual Oops when stressing IO at the >>> same time we add/remove processors. The Oops happens in the IRQ path, >>> when we try to complete a request that was apparently meant for another >>> queue. >>> >>> In __nvme_process_cq, the driver will use the cqe.command_id and the >>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that >>> initiated the IO. Eventually, it happens that the request returned by >>> that function is not initialized, and we crash inside >>> __blk_mq_complete_request, as shown below. >>> >>> [ 2679.050701] Faulting instruction address: 0xc000000000549500 >>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1] >>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries >>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c >>> rpcsec_gss_krb5 auth_rpcgss >>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc >>> autofs4 btrfs xor >>> raid6_pq ibmvscsi ibmveth nvme >>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted >>> 4.4.0-22-generic #40-Ubuntu >>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: >>> c000000451bc0000 >>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: >>> c000000000549680 >>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted >>> (4.4.0-22-generic) >>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: >>> 28022428 XER: 00000000 >>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: >>> 40000000 SOFTE: 0 >>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 >>> c0000002799a0a00 >>> GPR04: 0000000000000000 0000000000000105 0000000000000004 >>> 0000000000000001 >>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 >>> d0000000046bdc68 >>> GPR12: c000000000549680 c000000007b1c780 0000000000000008 >>> 0000000000000001 >>> GPR16: 0000000000000000 0000000000000000 0000000000000000 >>> 0000000010005b50 >>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 >>> 0000000000000002 >>> GPR24: 00000000000001e3 0000000000000000 0000000000000000 >>> ffffffffffffffff >>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 >>> c0000002799a0a00 >>> [ 2679.050908] NIP [c000000000549500] >>> __blk_mq_complete_request+0x30/0x1b0 >>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >>> [ 2679.050930] Call Trace: >>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] >>> 0xc0000004fff2fd40 (unreliable) >>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] >>> __nvme_process_cq+0xf4/0x2c0 [nvme] >>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] >>> nvme_irq+0x3c/0xb0 [nvme] >>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] >>> handle_irq_event_percpu+0xa0/0x320 >>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] >>> handle_irq_event+0x68/0xc0 >>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] >>> handle_fasteoi_irq+0xec/0x2b0 >>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] >>> generic_handle_irq+0x54/0x80 >>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0 >>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] >>> call_do_irq+0x14/0x24 >>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140 >>> [ 2679.051054] [c000000451bc33a0] [c000000000002594] >>> hardware_interrupt_common+0x114/0x180 >>> >>> I added some debugging which skipped the Oops and made me think that we >>> are fetching the wrong request because we are actually looking at the >>> wrong set of tags. >>> >>> The log below exemplifies what I am saying. Instead of crashing in the >>> Oops, I made the execution skip the request completion and simply >>> consider that odd CQE as handled. The first line of the log does that. >>> >>>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q >>>> nvme nvme0: I/O 309 QID 63 timeout, aborting >>>> nvme nvme0: Abort status: 0x0 >>>> nvme nvme0: I/O 309 QID 63 timeout, reset controller >>>> nvme nvme0: completing aborted command with status: fffffffc >>>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass >>>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536 >>> >>> In the first line, we found the request 309 for queue 35, which would >>> have crashed the execution. My debug patch just ignores it. >>> >>> Then, we can see that eventually, the same IO, 309, will timeout in >>> another QID, which was actually expecting for it. The Abort request >>> gets sent and completed, but we never receive the completion of the >>> aborted request, thus we timeout again and restart the device. >>> >>> This only happens when we are changing the SMT settings very >>> frequently, which points back to the way we correlate the hctx->tags to >>> the nvmeq structure, but I failed to find the exact code spot were >>> things get messed up. >>> >>> I've put a some effort on this code lately, but I definitely need some >>> assistance on this one. Since, as far as I understand, this issue is >>> much more likely to reproduce on boxes with a lot of hotpluggable >>> processors, I'm willing to test any patches you may find suitable in our >>> systems. >>> >> >> Hi, >> >> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input] >> >> Made some progress on tracking down the issue. It's looking more of a >> blk-mq bug than nvme to me now. Let me know if I made any wrong >> assumption below: >> >> I started to wonder whether blk-mq is requesting the IO in a wrong queue >> or if the request is being moved to another queue during hotplug, as it >> would happen in blk_mq_hctx_cpu_offline. >> >> I verified that when the fail occurs, the request never was never >> touched in blk_mq_hctx_cpu_offline. Also, and more interesting, I added >> the following BUG_ON to nvme_queue_req: >> >> BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req); >> >> From my experiments, I could never hit this during normal operation >> (without SMT). But, as soon as I start triggering hotplug changes, I hit >> it pretty fast. By changing to WARN_ON, I see that the Oops will happen >> almost immediately after this request is submitted, once __process_cq >> tries to complete it. >> >> Also, when adding the patch below, we send a few -EIOs back to upper >> layers, but it prevents the crash. >> >> I also think i saw a similar hang when running on a virtio_scsi disk. I >> couldn't reproduce it a second time, so we still lack confirmation, but >> if confirmed, it indeed points away from nvme and back to the block >> layer. >> >> Any suggestions to further track this down? > > Some good detective work so far! I agree, this looks like a blk-mq core > bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when > req->tag != nvmeq->tags? I don't immediately see how this could happen, > the freezing should protect us from this, unless it's broken somehow. > I'll take a look at this. Can you share what you ran to online/offline CPUs? I can't reproduce this here.
Jens Axboe <axboe@kernel.dk> writes: >> Some good detective work so far! I agree, this looks like a blk-mq core >> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when >> req->tag != nvmeq->tags? I don't immediately see how this could happen, >> the freezing should protect us from this, unless it's broken somehow. >> I'll take a look at this. Nothing much on the BUG() stack trace, at least to me. We are coming from an async execution of __blk_mq_run_hw_queue: This is from 4.7. I'll give another run on 4.8-rc1. [ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602! cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0] pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme] lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme] sp: c000007870d3ba20 msr: 900000010282b033 current = 0xc000007870d9e400 paca = 0xc000000002bd8800 softe: 0 irq_happened: 0x01 pid = 9205, comm = kworker/72:1H kernel BUG at ../drivers/nvme/host/pci.c:602! Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2. 1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016 enter ? for help [c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640 [c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590 [c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660 [c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130 [c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c 48:mon> Another interesting data point is that we always reach the WARN_ON in __blk_mq_run_hw_queue immediately before submitting the bad request. Maybe we are touching some percpu variable from the wrong cpu? See the WARN_ON trace below. [ 369.953825] ------------[ cut here ]------------ [ 369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x6 40 [ 369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_ misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 ra id456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng [ 369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150 [ 369.955085] Workqueue: kblockd blk_mq_run_work_fn [ 369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000 [ 369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000 [ 369.955314] REGS: c0000078cb027890 TRAP: 0700 Not tainted (4.7.0.mainline+) [ 369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28002224 XER: 20000000 [ 369.955579] CFAR: c000000000569a18 SOFTE: 1 GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008 GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001 GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003 GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000 GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800 GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88 GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700 [ 369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640 [ 369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640 [ 369.956433] Call Trace: [ 369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable) [ 369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590 [ 369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660 [ 369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130 [ 369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c [ 369.956791] Instruction dump: [ 369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a [ 369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060 [ 369.957020] ---[ end trace 1398dd60e3057065 ]--- [ 369.957094] ------------[ cut here ]------------ > Can you share what you ran to online/offline CPUs? I can't reproduce > this here. I was using the ppc64_cpu tool, which shouldn't do nothing more than write to sysfs. but I just reproduced it with the script below. Note that this is ppc64le. I don't have a x86 in hand to attempt to reproduce right now, but I'll look for one and see how it goes. #!/bin/bash MAXCPUs=159 STATE=1 while true; do for i in $(seq 1 ${MAXCPUS}); do if (($i%8)); then echo $STATE > /sys/devices/system/cpu/cpu$i/online fi done if [[ $STATE -eq 1 ]]; then STATE=0 else STATE=1 fi done
Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> Can you share what you ran to online/offline CPUs? I can't reproduce >> this here. > > I was using the ppc64_cpu tool, which shouldn't do nothing more than > write to sysfs. but I just reproduced it with the script below. > > Note that this is ppc64le. I don't have a x86 in hand to attempt to > reproduce right now, but I'll look for one and see how it goes. Hi, Any luck on reproducing it? We were initially reproducing with a proprietary stress test, but I gave a try to a generated fio jobfile associated with the SMT script I shared earlier and I could reproduce the crash consistently in less than 10 minutes of execution. this was still ppc64le, though. I couldn't get my hands on nvme on x86 yet. The job file I used, as well as the smt.sh script, in case you want to give it a try: jobfile: http://krisman.be/k/nvmejob.fio smt.sh: http://krisman.be/k/smt.sh Still, the trigger seems to be consistently a heavy load of IO associated with CPU addition/removal. Let me share my progress from the last couple days in the hope that it rings a bell for you. Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the request_queue's freeze_depth is 0, which points away from a fault in the freeze/unfreeze mechanism. If a request was escaping and going through the block layer during a freeze, we'd see freeze_depth >= 1. Before that, I had also tried to keep the q_usage_counter in atomic mode, in case of a bug in the percpu refcount. No luck, the BUG_ON was still hit. Also, I don't see anything special about the request that reaches the BUG_ON. It's a REQ_TYPE_FS request and, at least in the last time I reproduced, it was a READ that came from the stress test task through submit_bio. So nothing remarkable about it too, as far as I can see. I'm still thinking about a case in which the mapping get's screwed up, where a ctx would appear into two hctxs bitmaps after a remap, or if the ctx got remaped to another hctx. I'm still learning my way through the cpumap code, so I'm not sure it's a real possibility, but I'm not convinced it isn't. Some preliminary tests don't suggest it's the case at play, but I wanna spend a little more time on this theory (maybe for my lack of better ideas :) On a side note, probably unrelated to this crash, it also got me thinking about the current usefulness of blk_mq_hctx_notify. Since CPU is dead, no more requests would be coming through its ctx. I think we could force a queue run in blk_mq_queue_reinit_notify, before remapping, which would cause the hctx to fetch the remaining requests from that dead ctx (since it's not unmapped yet). This way, we could maintain a single hotplug notification hook and simplify the hotplug path. I haven't written code for it yet, but I'll see if I can come up with something and send to the list.
On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: > Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: > >>> Can you share what you ran to online/offline CPUs? I can't reproduce >>> this here. >> >> I was using the ppc64_cpu tool, which shouldn't do nothing more than >> write to sysfs. but I just reproduced it with the script below. >> >> Note that this is ppc64le. I don't have a x86 in hand to attempt to >> reproduce right now, but I'll look for one and see how it goes. > > Hi, > > Any luck on reproducing it? We were initially reproducing with a > proprietary stress test, but I gave a try to a generated fio jobfile > associated with the SMT script I shared earlier and I could reproduce > the crash consistently in less than 10 minutes of execution. this was > still ppc64le, though. I couldn't get my hands on nvme on x86 yet. Nope, I have not been able to reproduce it. How long does the CPU offline/online actions take on ppc64? It's pretty slow on x86, which may hide the issue. I took out the various printk's associated with bringing a CPU off/online, as well as IRQ breaking parts, but didn't help in reproducing it. > The job file I used, as well as the smt.sh script, in case you want to > give it a try: > > jobfile: http://krisman.be/k/nvmejob.fio > smt.sh: http://krisman.be/k/smt.sh > > Still, the trigger seems to be consistently a heavy load of IO > associated with CPU addition/removal. My workload looks similar to yours, in that it's high depth and with a lot of jobs to keep most CPUs loaded. My bash script is different than yours, I'll try that and see if it helps here.
On 08/23/2016 03:11 PM, Jens Axboe wrote: > On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: >> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> >>>> Can you share what you ran to online/offline CPUs? I can't reproduce >>>> this here. >>> >>> I was using the ppc64_cpu tool, which shouldn't do nothing more than >>> write to sysfs. but I just reproduced it with the script below. >>> >>> Note that this is ppc64le. I don't have a x86 in hand to attempt to >>> reproduce right now, but I'll look for one and see how it goes. >> >> Hi, >> >> Any luck on reproducing it? We were initially reproducing with a >> proprietary stress test, but I gave a try to a generated fio jobfile >> associated with the SMT script I shared earlier and I could reproduce >> the crash consistently in less than 10 minutes of execution. this was >> still ppc64le, though. I couldn't get my hands on nvme on x86 yet. > > Nope, I have not been able to reproduce it. How long does the CPU > offline/online actions take on ppc64? It's pretty slow on x86, which may > hide the issue. I took out the various printk's associated with bringing > a CPU off/online, as well as IRQ breaking parts, but didn't help in > reproducing it. > >> The job file I used, as well as the smt.sh script, in case you want to >> give it a try: >> >> jobfile: http://krisman.be/k/nvmejob.fio >> smt.sh: http://krisman.be/k/smt.sh >> >> Still, the trigger seems to be consistently a heavy load of IO >> associated with CPU addition/removal. > > My workload looks similar to yours, in that it's high depth and with a > lot of jobs to keep most CPUs loaded. My bash script is different than > yours, I'll try that and see if it helps here. Actually, I take that back. You're not using O_DIRECT, hence all your jobs are running at QD=1, not the 256 specified. That looks odd, but I'll try, maybe it'll hit something different.
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote: > On 08/23/2016 03:11 PM, Jens Axboe wrote: > >My workload looks similar to yours, in that it's high depth and with a > >lot of jobs to keep most CPUs loaded. My bash script is different than > >yours, I'll try that and see if it helps here. > > Actually, I take that back. You're not using O_DIRECT, hence all your > jobs are running at QD=1, not the 256 specified. That looks odd, but > I'll try, maybe it'll hit something different. I haven't recreated this either, but I think I can logically see why this failure is happening. I sent an nvme driver patch earlier on this thread to exit the hardware context, which I thought would do the trick if the hctx's tags were being moved. That turns out to be wrong for a couple reasons. First, we can't release the nvmeq->tags when a hctx exits because that nvmeq may be used by other namespaces that need to point to the device's tag set. The other reason is that blk-mq doesn't exit or init hardware contexts when remapping for a CPU event, leaving the nvme driver unaware a hardware context points to a different tag set. So I think I see why this test would fail; don't know about a fix yet. Maybe the nvme driver needs some indirection instead of pointing directly to the tagset after init_hctx. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index b927ecf..10f2257 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -599,6 +599,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, if (ret) return ret; + if(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) { + ret = BLK_MQ_RQ_QUEUE_ERROR; + goto out; + } + ret = nvme_setup_cmd(ns, req, &cmnd); if (ret) goto out;