diff mbox

Oops when completing request on the wrong queue

Message ID 871t1kq455.fsf@linux.vnet.ibm.com
State New, archived
Headers show

Commit Message

Gabriel Krisman Bertazi Aug. 19, 2016, 1:28 p.m. UTC
Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:

> We, IBM, have been experiencing eventual Oops when stressing IO at the
> same time we add/remove processors.  The Oops happens in the IRQ path,
> when we try to complete a request that was apparently meant for another
> queue.
>
> In __nvme_process_cq, the driver will use the cqe.command_id and the
> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
> initiated the IO.  Eventually, it happens that the request returned by
> that function is not initialized, and we crash inside
> __blk_mq_complete_request, as shown below.
>
> [ 2679.050701] Faulting instruction address: 0xc000000000549500
> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
> rpcsec_gss_krb5 auth_rpcgss
> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
> autofs4 btrfs xor
> raid6_pq ibmvscsi ibmveth nvme
> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted  (4.4.0-22-generic)
> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  XER: 00000000
> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 
> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 
> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 
> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 
> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 
> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 
> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff 
> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 
> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050930] Call Trace:
> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
> hardware_interrupt_common+0x114/0x180
>
> I added some debugging which skipped the Oops and made me think that we
> are fetching the wrong request because we are actually looking at the
> wrong set of tags.
>
> The log below exemplifies what I am saying.  Instead of crashing in the
> Oops, I made the execution skip the request completion and simply
> consider that odd CQE as handled.  The first line of the log does that.
>
>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>  nvme nvme0: Abort status: 0x0
>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>  nvme nvme0: completing aborted command with status: fffffffc
>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>
> In the first line, we found the request 309 for queue 35, which would
> have crashed the execution.  My debug patch just ignores it.
>
> Then, we can see that eventually, the same IO, 309, will timeout in
> another QID, which was actually expecting for it.  The Abort request
> gets sent and completed, but we never receive the completion of the
> aborted request, thus we timeout again and restart the device.
>
> This only happens when we are changing the SMT settings very
> frequently, which points back to the way we correlate the hctx->tags to
> the nvmeq structure, but I failed to find the exact code spot were
> things get messed up.
>
> I've put a some effort on this code lately, but I definitely need some
> assistance on this one.  Since, as far as I understand, this issue is
> much more likely to reproduce on boxes with a lot of hotpluggable
> processors, I'm willing to test any patches you may find suitable in our
> systems.
>

Hi,

[Adding linux-block, linux-scsi and Jens in CC for blk-mq input]

Made some progress on tracking down the issue.  It's looking more of a
blk-mq bug than nvme to me now.  Let me know if I made any wrong
assumption below:

I started to wonder whether blk-mq is requesting the IO in a wrong queue
or if the request is being moved to another queue during hotplug, as it
would happen in blk_mq_hctx_cpu_offline.

I verified that when the fail occurs, the request never was never
touched in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added
the following BUG_ON to nvme_queue_req:

    BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);

From my experiments, I could never hit this during normal operation
(without SMT). But, as soon as I start triggering hotplug changes, I hit
it pretty fast.  By changing to WARN_ON, I see that the Oops will happen
almost immediately after this request is submitted, once __process_cq
tries to complete it.

Also, when adding the patch below, we send a few -EIOs back to upper
layers, but it prevents the crash.

I also think i saw a similar hang when running on a virtio_scsi disk.  I
couldn't reproduce it a second time, so we still lack confirmation, but
if confirmed, it indeed points away from nvme and back to the block
layer.

Any suggestions to further track this down?



--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jens Axboe Aug. 19, 2016, 2:13 p.m. UTC | #1
On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>
>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>> same time we add/remove processors.  The Oops happens in the IRQ path,
>> when we try to complete a request that was apparently meant for another
>> queue.
>>
>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>> initiated the IO.  Eventually, it happens that the request returned by
>> that function is not initialized, and we crash inside
>> __blk_mq_complete_request, as shown below.
>>
>> [ 2679.050701] Faulting instruction address: 0xc000000000549500
>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
>> rpcsec_gss_krb5 auth_rpcgss
>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
>> autofs4 btrfs xor
>> raid6_pq ibmvscsi ibmveth nvme
>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted  (4.4.0-22-generic)
>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  XER: 00000000
>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0
>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00
>> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001
>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68
>> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001
>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50
>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002
>> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff
>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00
>> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>> [ 2679.050930] Call Trace:
>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
>> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
>> hardware_interrupt_common+0x114/0x180
>>
>> I added some debugging which skipped the Oops and made me think that we
>> are fetching the wrong request because we are actually looking at the
>> wrong set of tags.
>>
>> The log below exemplifies what I am saying.  Instead of crashing in the
>> Oops, I made the execution skip the request completion and simply
>> consider that odd CQE as handled.  The first line of the log does that.
>>
>>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>>  nvme nvme0: Abort status: 0x0
>>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>>  nvme nvme0: completing aborted command with status: fffffffc
>>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>>
>> In the first line, we found the request 309 for queue 35, which would
>> have crashed the execution.  My debug patch just ignores it.
>>
>> Then, we can see that eventually, the same IO, 309, will timeout in
>> another QID, which was actually expecting for it.  The Abort request
>> gets sent and completed, but we never receive the completion of the
>> aborted request, thus we timeout again and restart the device.
>>
>> This only happens when we are changing the SMT settings very
>> frequently, which points back to the way we correlate the hctx->tags to
>> the nvmeq structure, but I failed to find the exact code spot were
>> things get messed up.
>>
>> I've put a some effort on this code lately, but I definitely need some
>> assistance on this one.  Since, as far as I understand, this issue is
>> much more likely to reproduce on boxes with a lot of hotpluggable
>> processors, I'm willing to test any patches you may find suitable in our
>> systems.
>>
>
> Hi,
>
> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input]
>
> Made some progress on tracking down the issue.  It's looking more of a
> blk-mq bug than nvme to me now.  Let me know if I made any wrong
> assumption below:
>
> I started to wonder whether blk-mq is requesting the IO in a wrong queue
> or if the request is being moved to another queue during hotplug, as it
> would happen in blk_mq_hctx_cpu_offline.
>
> I verified that when the fail occurs, the request never was never
> touched in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added
> the following BUG_ON to nvme_queue_req:
>
>     BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);
>
> From my experiments, I could never hit this during normal operation
> (without SMT). But, as soon as I start triggering hotplug changes, I hit
> it pretty fast.  By changing to WARN_ON, I see that the Oops will happen
> almost immediately after this request is submitted, once __process_cq
> tries to complete it.
>
> Also, when adding the patch below, we send a few -EIOs back to upper
> layers, but it prevents the crash.
>
> I also think i saw a similar hang when running on a virtio_scsi disk.  I
> couldn't reproduce it a second time, so we still lack confirmation, but
> if confirmed, it indeed points away from nvme and back to the block
> layer.
>
> Any suggestions to further track this down?

Some good detective work so far! I agree, this looks like a blk-mq core
bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
req->tag != nvmeq->tags? I don't immediately see how this could happen,
the freezing should protect us from this, unless it's broken somehow.
I'll take a look at this.
Jens Axboe Aug. 19, 2016, 3:51 p.m. UTC | #2
On 08/19/2016 08:13 AM, Jens Axboe wrote:
> On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>
>>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>>> same time we add/remove processors.  The Oops happens in the IRQ path,
>>> when we try to complete a request that was apparently meant for another
>>> queue.
>>>
>>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>>> initiated the IO.  Eventually, it happens that the request returned by
>>> that function is not initialized, and we crash inside
>>> __blk_mq_complete_request, as shown below.
>>>
>>> [ 2679.050701] Faulting instruction address: 0xc000000000549500
>>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
>>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
>>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
>>> rpcsec_gss_krb5 auth_rpcgss
>>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
>>> autofs4 btrfs xor
>>> raid6_pq ibmvscsi ibmveth nvme
>>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted
>>> 4.4.0-22-generic #40-Ubuntu
>>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti:
>>> c000000451bc0000
>>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR:
>>> c000000000549680
>>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted
>>> (4.4.0-22-generic)
>>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR:
>>> 28022428  XER: 00000000
>>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR:
>>> 40000000 SOFTE: 0
>>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300
>>> c0000002799a0a00
>>> GPR04: 0000000000000000 0000000000000105 0000000000000004
>>> 0000000000000001
>>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000
>>> d0000000046bdc68
>>> GPR12: c000000000549680 c000000007b1c780 0000000000000008
>>> 0000000000000001
>>> GPR16: 0000000000000000 0000000000000000 0000000000000000
>>> 0000000010005b50
>>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001
>>> 0000000000000002
>>> GPR24: 00000000000001e3 0000000000000000 0000000000000000
>>> ffffffffffffffff
>>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180
>>> c0000002799a0a00
>>> [ 2679.050908] NIP [c000000000549500]
>>> __blk_mq_complete_request+0x30/0x1b0
>>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>>> [ 2679.050930] Call Trace:
>>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40]
>>> 0xc0000004fff2fd40 (unreliable)
>>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84]
>>> __nvme_process_cq+0xf4/0x2c0 [nvme]
>>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c]
>>> nvme_irq+0x3c/0xb0 [nvme]
>>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660]
>>> handle_irq_event_percpu+0xa0/0x320
>>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948]
>>> handle_irq_event+0x68/0xc0
>>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c]
>>> handle_fasteoi_irq+0xec/0x2b0
>>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844]
>>> generic_handle_irq+0x54/0x80
>>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
>>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800]
>>> call_do_irq+0x14/0x24
>>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
>>> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
>>> hardware_interrupt_common+0x114/0x180
>>>
>>> I added some debugging which skipped the Oops and made me think that we
>>> are fetching the wrong request because we are actually looking at the
>>> wrong set of tags.
>>>
>>> The log below exemplifies what I am saying.  Instead of crashing in the
>>> Oops, I made the execution skip the request completion and simply
>>> consider that odd CQE as handled.  The first line of the log does that.
>>>
>>>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>>>  nvme nvme0: Abort status: 0x0
>>>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>>>  nvme nvme0: completing aborted command with status: fffffffc
>>>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>>>
>>> In the first line, we found the request 309 for queue 35, which would
>>> have crashed the execution.  My debug patch just ignores it.
>>>
>>> Then, we can see that eventually, the same IO, 309, will timeout in
>>> another QID, which was actually expecting for it.  The Abort request
>>> gets sent and completed, but we never receive the completion of the
>>> aborted request, thus we timeout again and restart the device.
>>>
>>> This only happens when we are changing the SMT settings very
>>> frequently, which points back to the way we correlate the hctx->tags to
>>> the nvmeq structure, but I failed to find the exact code spot were
>>> things get messed up.
>>>
>>> I've put a some effort on this code lately, but I definitely need some
>>> assistance on this one.  Since, as far as I understand, this issue is
>>> much more likely to reproduce on boxes with a lot of hotpluggable
>>> processors, I'm willing to test any patches you may find suitable in our
>>> systems.
>>>
>>
>> Hi,
>>
>> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input]
>>
>> Made some progress on tracking down the issue.  It's looking more of a
>> blk-mq bug than nvme to me now.  Let me know if I made any wrong
>> assumption below:
>>
>> I started to wonder whether blk-mq is requesting the IO in a wrong queue
>> or if the request is being moved to another queue during hotplug, as it
>> would happen in blk_mq_hctx_cpu_offline.
>>
>> I verified that when the fail occurs, the request never was never
>> touched in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added
>> the following BUG_ON to nvme_queue_req:
>>
>>     BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);
>>
>> From my experiments, I could never hit this during normal operation
>> (without SMT). But, as soon as I start triggering hotplug changes, I hit
>> it pretty fast.  By changing to WARN_ON, I see that the Oops will happen
>> almost immediately after this request is submitted, once __process_cq
>> tries to complete it.
>>
>> Also, when adding the patch below, we send a few -EIOs back to upper
>> layers, but it prevents the crash.
>>
>> I also think i saw a similar hang when running on a virtio_scsi disk.  I
>> couldn't reproduce it a second time, so we still lack confirmation, but
>> if confirmed, it indeed points away from nvme and back to the block
>> layer.
>>
>> Any suggestions to further track this down?
>
> Some good detective work so far! I agree, this looks like a blk-mq core
> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
> req->tag != nvmeq->tags? I don't immediately see how this could happen,
> the freezing should protect us from this, unless it's broken somehow.
> I'll take a look at this.

Can you share what you ran to online/offline CPUs? I can't reproduce
this here.
Gabriel Krisman Bertazi Aug. 19, 2016, 4:38 p.m. UTC | #3
Jens Axboe <axboe@kernel.dk> writes:

>> Some good detective work so far! I agree, this looks like a blk-mq core
>> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
>> req->tag != nvmeq->tags? I don't immediately see how this could happen,
>> the freezing should protect us from this, unless it's broken somehow.
>> I'll take a look at this.

Nothing much on the BUG() stack trace, at least to me.  We are coming
from an async execution of __blk_mq_run_hw_queue:

This is from 4.7. I'll give another run on 4.8-rc1.

[ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602!
cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0]
    pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme]
    lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme]
    sp: c000007870d3ba20
   msr: 900000010282b033
  current = 0xc000007870d9e400
  paca    = 0xc000000002bd8800   softe: 0        irq_happened: 0x01
    pid   = 9205, comm = kworker/72:1H
kernel BUG at ../drivers/nvme/host/pci.c:602!
Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.
1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016
enter ? for help
[c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640
[c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660
[c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130
[c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c
48:mon>


Another interesting data point is that we always reach the WARN_ON in
__blk_mq_run_hw_queue immediately before submitting the bad request.

Maybe we are touching some percpu variable from the wrong cpu?

See the WARN_ON trace below.


[  369.953825] ------------[ cut here ]------------
[  369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x6
40
[  369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_
misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv
 nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm
ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 ra
id456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
 multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
 drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng
[  369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150
[  369.955085] Workqueue: kblockd blk_mq_run_work_fn
[  369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000
[  369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000
[  369.955314] REGS: c0000078cb027890 TRAP: 0700   Not tainted  (4.7.0.mainline+)
[  369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 28002224  XER:
 20000000
[  369.955579] CFAR: c000000000569a18 SOFTE: 1
GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008
GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001
GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003
GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000
GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800
GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88
GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb
GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700
[  369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640
[  369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640
[  369.956433] Call Trace:
[  369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable)
[  369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590
[  369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660
[  369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130
[  369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c
[  369.956791] Instruction dump:
[  369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a
[  369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060
[  369.957020] ---[ end trace 1398dd60e3057065 ]---
[  369.957094] ------------[ cut here ]------------


> Can you share what you ran to online/offline CPUs? I can't reproduce
> this here.

I was using the ppc64_cpu tool, which shouldn't do nothing more than
write to sysfs.  but I just reproduced it with the script below.

Note that this is ppc64le.  I don't have a x86 in hand to attempt to
reproduce right now, but I'll look for one and see how it goes.


#!/bin/bash

MAXCPUs=159
STATE=1

while true; do
    for i in $(seq 1 ${MAXCPUS}); do
        if (($i%8)); then
            echo $STATE > /sys/devices/system/cpu/cpu$i/online
        fi
    done
    if [[ $STATE -eq 1 ]]; then
        STATE=0
    else
        STATE=1
    fi
done
Gabriel Krisman Bertazi Aug. 23, 2016, 8:54 p.m. UTC | #4
Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do nothing more than
> write to sysfs.  but I just reproduced it with the script below.
>
> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it?  We were initially reproducing with a
proprietary stress test, but I gave a try to a generated fio jobfile
associated with the SMT script I shared earlier and I could reproduce
the crash consistently in less than 10 minutes of execution.  this was
still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple days in the hope that it
rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism.  If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1.  Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount.  No luck, the BUG_ON was still
hit.

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least in the last time I
reproduced, it was a READ that came from the stress test task through
submit_bio.  So nothing remarkable about it too, as far as I can see.

I'm still thinking about a case in which the mapping get's screwed up,
where a ctx would appear into two hctxs bitmaps after a remap, or if the
ctx got remaped to another hctx.  I'm still learning my way through the
cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't.  Some preliminary tests don't suggest it's the case
at play, but I wanna spend a little more time on this theory (maybe
for my lack of better ideas :)

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since CPU
is dead, no more requests would be coming through its ctx.  I think we
could force a queue run in blk_mq_queue_reinit_notify, before remapping,
which would cause the hctx to fetch the remaining requests from that
dead ctx (since it's not unmapped yet).  This way, we could maintain a
single hotplug notification hook and simplify the hotplug path.  I
haven't written code for it yet, but I'll see if I can come up with
something and send to the list.
Jens Axboe Aug. 23, 2016, 9:11 p.m. UTC | #5
On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>
>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>> this here.
>>
>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>> write to sysfs.  but I just reproduced it with the script below.
>>
>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>> reproduce right now, but I'll look for one and see how it goes.
>
> Hi,
>
> Any luck on reproducing it?  We were initially reproducing with a
> proprietary stress test, but I gave a try to a generated fio jobfile
> associated with the SMT script I shared earlier and I could reproduce
> the crash consistently in less than 10 minutes of execution.  this was
> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

Nope, I have not been able to reproduce it. How long does the CPU
offline/online actions take on ppc64? It's pretty slow on x86, which may
hide the issue. I took out the various printk's associated with bringing
a CPU off/online, as well as IRQ breaking parts, but didn't help in
reproducing it.

> The job file I used, as well as the smt.sh script, in case you want to
> give it a try:
>
> jobfile: http://krisman.be/k/nvmejob.fio
> smt.sh:  http://krisman.be/k/smt.sh
>
> Still, the trigger seems to be consistently a heavy load of IO
> associated with CPU addition/removal.

My workload looks similar to yours, in that it's high depth and with a
lot of jobs to keep most CPUs loaded. My bash script is different than
yours, I'll try that and see if it helps here.
Jens Axboe Aug. 23, 2016, 9:14 p.m. UTC | #6
On 08/23/2016 03:11 PM, Jens Axboe wrote:
> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>
>>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>>> this here.
>>>
>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>>> write to sysfs.  but I just reproduced it with the script below.
>>>
>>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>>> reproduce right now, but I'll look for one and see how it goes.
>>
>> Hi,
>>
>> Any luck on reproducing it?  We were initially reproducing with a
>> proprietary stress test, but I gave a try to a generated fio jobfile
>> associated with the SMT script I shared earlier and I could reproduce
>> the crash consistently in less than 10 minutes of execution.  this was
>> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.
>
> Nope, I have not been able to reproduce it. How long does the CPU
> offline/online actions take on ppc64? It's pretty slow on x86, which may
> hide the issue. I took out the various printk's associated with bringing
> a CPU off/online, as well as IRQ breaking parts, but didn't help in
> reproducing it.
>
>> The job file I used, as well as the smt.sh script, in case you want to
>> give it a try:
>>
>> jobfile: http://krisman.be/k/nvmejob.fio
>> smt.sh:  http://krisman.be/k/smt.sh
>>
>> Still, the trigger seems to be consistently a heavy load of IO
>> associated with CPU addition/removal.
>
> My workload looks similar to yours, in that it's high depth and with a
> lot of jobs to keep most CPUs loaded. My bash script is different than
> yours, I'll try that and see if it helps here.

Actually, I take that back. You're not using O_DIRECT, hence all your
jobs are running at QD=1, not the 256 specified. That looks odd, but
I'll try, maybe it'll hit something different.
Keith Busch Aug. 23, 2016, 10:49 p.m. UTC | #7
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
> >My workload looks similar to yours, in that it's high depth and with a
> >lot of jobs to keep most CPUs loaded. My bash script is different than
> >yours, I'll try that and see if it helps here.
> 
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

I haven't recreated this either, but I think I can logically see why
this failure is happening.

I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple reasons.

First, we can't release the nvmeq->tags when a hctx exits because
that nvmeq may be used by other namespaces that need to point to
the device's tag set.

The other reason is that blk-mq doesn't exit or init hardware contexts
when remapping for a CPU event, leaving the nvme driver unaware a hardware
context points to a different tag set.

So I think I see why this test would fail; don't know about a fix yet.
Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b927ecf..10f2257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -599,6 +599,11 @@  static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
        if (ret)
                return ret;

+       if(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
+         ret = BLK_MQ_RQ_QUEUE_ERROR;
+         goto out;
+       }
+
        ret = nvme_setup_cmd(ns, req, &cmnd);
        if (ret)
                goto out;