
[dm-devel] split scsi passthrough fields out of struct request V2

Message ID 9cbf0ce5-ed79-0252-fd2d-34bebaafffa3@fb.com (mailing list archive)
State New, archived

Commit Message

Jens Axboe Jan. 27, 2017, 8:04 a.m. UTC
On 01/26/2017 11:40 PM, Jens Axboe wrote:
> On 01/26/2017 06:22 PM, Jens Axboe wrote:
>> On 01/26/2017 06:15 PM, Bart Van Assche wrote:
>>> On Thu, 2017-01-26 at 17:41 -0700, Jens Axboe wrote:
>>>> On 01/26/2017 05:38 PM, Bart Van Assche wrote:
>>>>> I see similar behavior with the blk-mq-sched branch of
>>>>> git://git.kernel.dk/linux-block.git (git commit ID 0efe27068ecf):
>>>>> booting happens much slower than usual and I/O hangs if I run the
>>>>> srp-test software.
>>>>
>>>> Please don't run that, run for-4.11/block and merge it to master.
>>>> Same behavior?
>>>
>>> I have not yet had the chance to run the srp-test software against that
>>> kernel. But I already see that booting takes more than ten times longer
>>> than usual. Note: as far as I know the dm-mpath driver is not involved
>>> in the boot process of my test system.
>>
>> What's your boot device? I've been booting this on a variety of setups,
>> no problems observed. It's booting my laptop, and on SCSI and SATA as
>> well. What is your root drive? What is the queue depth of it?
>> Controller?
> 
> Are you using dm for your root device?
> 
> I think I see what is going on. The scheduler framework put the
> insertion of flushes on the side, whereas it's integrated "nicely"
> on the legacy side.
> 
> Can you try with this applied? This is on top of the previous two that
> we already went through. Or, you can just pull:
> 
> git://git.kernel.dk/linux-block for-4.11/next
> 
> which is for-4.11/block with the next set of fixes on top that I haven't
> pulled in yet.

The previous patch had a bug if you didn't use a scheduler; here's a
version that should work fine in both cases. I've also updated the
above mentioned branch, so feel free to pull that as well and merge to
master like before.

commit 2f54ba92a274a7c1a5ceb34a56565f84f7b994b7
Author: Jens Axboe <axboe@fb.com>
Date:   Fri Jan 27 01:00:47 2017 -0700

    blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()
    
    Instead of letting the caller check this and handle the details
    of inserting a flush request, put the logic in the scheduler
    insertion function. This fixes direct flush insertion outside
    of the usual make_request_fn calls, like from dm via
    blk_insert_cloned_request().
    
    Signed-off-by: Jens Axboe <axboe@fb.com>

Comments

Bart Van Assche Jan. 27, 2017, 4:52 p.m. UTC | #1
On Fri, 2017-01-27 at 01:04 -0700, Jens Axboe wrote:
> The previous patch had a bug if you didn't use a scheduler, here's a
> version that should work fine in both cases. I've also updated the
> above mentioned branch, so feel free to pull that as well and merge to
> master like before.

Booting time is back to normal with commit f3a8ab7d55bc merged with
v4.10-rc5. That's a great improvement. However, running the srp-test
software now triggers a new complaint:

[  215.600386] sd 11:0:0:0: [sdh] Attached SCSI disk
[  215.609485] sd 11:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
[  215.722900] scsi 13:0:0:0: alua: Detached
[  215.724452] general protection fault: 0000 [#1] SMP
[  215.724484] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm msr configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel hid_generic kvm usbhid irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel mlx4_core ghash_clmulni_intel iTCO_wdt dcdbas pcbc tg3
[  215.724629]  iTCO_vendor_support ptp aesni_intel pps_core aes_x86_64 pcspkr crypto_simd libphy ipmi_si glue_helper cryptd ipmi_devintf tpm_tis devlink fjes ipmi_msghandler tpm_tis_core tpm mei_me lpc_ich mei mfd_core button shpchp wmi mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm sr_mod cdrom ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
[  215.724719] CPU: 9 PID: 8043 Comm: multipathd Not tainted 4.10.0-rc5-dbg+ #1
[  215.724748] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
[  215.724775] task: ffff8801717998c0 task.stack: ffffc90002a9c000
[  215.724804] RIP: 0010:scsi_device_put+0xb/0x30
[  215.724829] RSP: 0018:ffffc90002a9faa0 EFLAGS: 00010246
[  215.724855] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88038bf85698 RCX: 0000000000000006
[  215.724880] RDX: 0000000000000006 RSI: ffff88017179a108 RDI: ffff88038bf85698
[  215.724906] RBP: ffffc90002a9faa8 R08: ffff880384786008 R09: 0000000100170007
[  215.724932] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88038bf85698
[  215.724958] R13: ffff88038919f090 R14: dead000000000100 R15: ffff88038a41dd28
[  215.724983] FS:  00007fbf8c6cf700(0000) GS:ffff88046f440000(0000) knlGS:0000000000000000
[  215.725010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  215.725035] CR2: 00007f1262ef3ee0 CR3: 000000044f6cc000 CR4: 00000000001406e0
[  215.725060] Call Trace:
[  215.725086]  scsi_disk_put+0x2d/0x40
[  215.725110]  sd_release+0x3d/0xb0
[  215.725137]  __blkdev_put+0x29e/0x360
[  215.725163]  blkdev_put+0x49/0x170
[  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
[  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
[  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
[  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
[  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
[  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
[  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
[  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
[  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
[  215.725515]  do_vfs_ioctl+0x8f/0x700
[  215.725589]  SyS_ioctl+0x3c/0x70
[  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
[  215.725641] RIP: 0033:0x7fbf8aca0667
[  215.725665] RSP: 002b:00007fbf8c6cd668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  215.725692] RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007fbf8aca0667
[  215.725716] RDX: 00007fbf8006b940 RSI: 00000000c138fd06 RDI: 0000000000000007
[  215.725743] RBP: 0000000000000009 R08: 00007fbf8c6cb3c0 R09: 00007fbf8b68d8d8
[  215.725768] R10: 0000000000000075 R11: 0000000000000246 R12: 00007fbf8c6cd770
[  215.725793] R13: 0000000000000013 R14: 00000000006168f0 R15: 0000000000f74780
[  215.725820] Code: bc 24 b8 00 00 00 e8 55 c8 1c 00 48 83 c4 08 48 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 00 55 48 89 e5 53 48 8b 07 48 89 fb <48> 8b 80 a8 01 00 00 48 8b 38 e8 f6 68 c5 ff 48 8d bb 38 02 00 
[  215.725903] RIP: scsi_device_put+0xb/0x30 RSP: ffffc90002a9faa0

(gdb) list *(scsi_device_put+0xb)
0xffffffff8149fc2b is in scsi_device_put (drivers/scsi/scsi.c:957).
952      * count of the underlying LLDD module.  The device is freed once the last
953      * user vanishes.
954      */
955     void scsi_device_put(struct scsi_device *sdev)
956     {
957             module_put(sdev->host->hostt->module);
958             put_device(&sdev->sdev_gendev);
959     }
960     EXPORT_SYMBOL(scsi_device_put);
961
(gdb) disas scsi_device_put
Dump of assembler code for function scsi_device_put:
   0xffffffff8149fc20 <+0>:     push   %rbp
   0xffffffff8149fc21 <+1>:     mov    %rsp,%rbp
   0xffffffff8149fc24 <+4>:     push   %rbx
   0xffffffff8149fc25 <+5>:     mov    (%rdi),%rax
   0xffffffff8149fc28 <+8>:     mov    %rdi,%rbx
   0xffffffff8149fc2b <+11>:    mov    0x1a8(%rax),%rax
   0xffffffff8149fc32 <+18>:    mov    (%rax),%rdi
   0xffffffff8149fc35 <+21>:    callq  0xffffffff810f6530 <module_put>
   0xffffffff8149fc3a <+26>:    lea    0x238(%rbx),%rdi
   0xffffffff8149fc41 <+33>:    callq  0xffffffff814714b0 <put_device>
   0xffffffff8149fc46 <+38>:    pop    %rbx
   0xffffffff8149fc47 <+39>:    pop    %rbp
   0xffffffff8149fc48 <+40>:    retq    
End of assembler dump.
(gdb) print &((struct Scsi_Host *)0)->hostt  
$2 = (struct scsi_host_template **) 0x1a8 <irq_stack_union+424>

Apparently scsi_device_put() was called for a SCSI device that was already
freed (memory poisoning was enabled in my test). This is something I had
not yet seen before.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe Jan. 27, 2017, 4:56 p.m. UTC | #2
On 01/27/2017 09:52 AM, Bart Van Assche wrote:
> On Fri, 2017-01-27 at 01:04 -0700, Jens Axboe wrote:
>> The previous patch had a bug if you didn't use a scheduler, here's a
>> version that should work fine in both cases. I've also updated the
>> above mentioned branch, so feel free to pull that as well and merge to
>> master like before.
> 
> Booting time is back to normal with commit f3a8ab7d55bc merged with
> v4.10-rc5. That's a great improvement. However, running the srp-test
> software now triggers a new complaint:
> 
> [  215.600386] sd 11:0:0:0: [sdh] Attached SCSI disk
> [  215.609485] sd 11:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
> [  215.722900] scsi 13:0:0:0: alua: Detached
> [  215.724452] general protection fault: 0000 [#1] SMP
> [  215.724484] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm msr configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel hid_generic kvm usbhid irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel mlx4_core ghash_clmulni_intel iTCO_wdt dcdbas pcbc tg3
> [  215.724629]  iTCO_vendor_support ptp aesni_intel pps_core aes_x86_64 pcspkr crypto_simd libphy ipmi_si glue_helper cryptd ipmi_devintf tpm_tis devlink fjes ipmi_msghandler tpm_tis_core tpm mei_me lpc_ich mei mfd_core button shpchp wmi mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm sr_mod cdrom ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
> [  215.724719] CPU: 9 PID: 8043 Comm: multipathd Not tainted 4.10.0-rc5-dbg+ #1
> [  215.724748] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
> [  215.724775] task: ffff8801717998c0 task.stack: ffffc90002a9c000
> [  215.724804] RIP: 0010:scsi_device_put+0xb/0x30
> [  215.724829] RSP: 0018:ffffc90002a9faa0 EFLAGS: 00010246
> [  215.724855] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88038bf85698 RCX: 0000000000000006
> [  215.724880] RDX: 0000000000000006 RSI: ffff88017179a108 RDI: ffff88038bf85698
> [  215.724906] RBP: ffffc90002a9faa8 R08: ffff880384786008 R09: 0000000100170007
> [  215.724932] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88038bf85698
> [  215.724958] R13: ffff88038919f090 R14: dead000000000100 R15: ffff88038a41dd28
> [  215.724983] FS:  00007fbf8c6cf700(0000) GS:ffff88046f440000(0000) knlGS:0000000000000000
> [  215.725010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  215.725035] CR2: 00007f1262ef3ee0 CR3: 000000044f6cc000 CR4: 00000000001406e0
> [  215.725060] Call Trace:
> [  215.725086]  scsi_disk_put+0x2d/0x40
> [  215.725110]  sd_release+0x3d/0xb0
> [  215.725137]  __blkdev_put+0x29e/0x360
> [  215.725163]  blkdev_put+0x49/0x170
> [  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
> [  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
> [  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
> [  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
> [  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
> [  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
> [  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
> [  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
> [  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
> [  215.725515]  do_vfs_ioctl+0x8f/0x700
> [  215.725589]  SyS_ioctl+0x3c/0x70
> [  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
> [  215.725641] RIP: 0033:0x7fbf8aca0667
> [  215.725665] RSP: 002b:00007fbf8c6cd668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  215.725692] RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007fbf8aca0667
> [  215.725716] RDX: 00007fbf8006b940 RSI: 00000000c138fd06 RDI: 0000000000000007
> [  215.725743] RBP: 0000000000000009 R08: 00007fbf8c6cb3c0 R09: 00007fbf8b68d8d8
> [  215.725768] R10: 0000000000000075 R11: 0000000000000246 R12: 00007fbf8c6cd770
> [  215.725793] R13: 0000000000000013 R14: 00000000006168f0 R15: 0000000000f74780
> [  215.725820] Code: bc 24 b8 00 00 00 e8 55 c8 1c 00 48 83 c4 08 48 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 00 55 48 89 e5 53 48 8b 07 48 89 fb <48> 8b 80 a8 01 00 00 48 8b 38 e8 f6 68 c5 ff 48 8d bb 38 02 00 
> [  215.725903] RIP: scsi_device_put+0xb/0x30 RSP: ffffc90002a9faa0
> 
> (gdb) list *(scsi_device_put+0xb)
> 0xffffffff8149fc2b is in scsi_device_put (drivers/scsi/scsi.c:957).
> 952      * count of the underlying LLDD module.  The device is freed once the last
> 953      * user vanishes.
> 954      */
> 955     void scsi_device_put(struct scsi_device *sdev)
> 956     {
> 957             module_put(sdev->host->hostt->module);
> 958             put_device(&sdev->sdev_gendev);
> 959     }
> 960     EXPORT_SYMBOL(scsi_device_put);
> 961
> (gdb) disas scsi_device_put
> Dump of assembler code for function scsi_device_put:
>    0xffffffff8149fc20 <+0>:     push   %rbp
>    0xffffffff8149fc21 <+1>:     mov    %rsp,%rbp
>    0xffffffff8149fc24 <+4>:     push   %rbx
>    0xffffffff8149fc25 <+5>:     mov    (%rdi),%rax
>    0xffffffff8149fc28 <+8>:     mov    %rdi,%rbx
>    0xffffffff8149fc2b <+11>:    mov    0x1a8(%rax),%rax
>    0xffffffff8149fc32 <+18>:    mov    (%rax),%rdi
>    0xffffffff8149fc35 <+21>:    callq  0xffffffff810f6530 <module_put>
>    0xffffffff8149fc3a <+26>:    lea    0x238(%rbx),%rdi
>    0xffffffff8149fc41 <+33>:    callq  0xffffffff814714b0 <put_device>
>    0xffffffff8149fc46 <+38>:    pop    %rbx
>    0xffffffff8149fc47 <+39>:    pop    %rbp
>    0xffffffff8149fc48 <+40>:    retq    
> End of assembler dump.
> (gdb) print &((struct Scsi_Host *)0)->hostt  
> $2 = (struct scsi_host_template **) 0x1a8 <irq_stack_union+424>
> 
> Apparently scsi_device_put() was called for a SCSI device that was already
> freed (memory poisoning was enabled in my test). This is something I had
> not yet seen before.

I have no idea what this is, I haven't messed with life time or devices
or queues at all in that branch.
Bart Van Assche Jan. 27, 2017, 5:03 p.m. UTC | #3
On Fri, 2017-01-27 at 09:56 -0700, Jens Axboe wrote:
> I have no idea what this is, I haven't messed with life time or devices
> or queues at all in that branch.

The srp-test software passes with kernel v4.9. Something must have changed.
I'll see whether I can find some time to look further into this.

Bart.
Bart Van Assche Jan. 31, 2017, 1:12 a.m. UTC | #4
On Fri, 2017-01-27 at 09:56 -0700, Jens Axboe wrote:
> On 01/27/2017 09:52 AM, Bart Van Assche wrote:
> > [  215.724452] general protection fault: 0000 [#1] SMP
> > [  215.725060] Call Trace:
> > [  215.725086]  scsi_disk_put+0x2d/0x40
> > [  215.725110]  sd_release+0x3d/0xb0
> > [  215.725137]  __blkdev_put+0x29e/0x360
> > [  215.725163]  blkdev_put+0x49/0x170
> > [  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
> > [  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
> > [  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
> > [  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
> > [  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
> > [  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
> > [  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
> > [  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
> > [  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
> > [  215.725515]  do_vfs_ioctl+0x8f/0x700
> > [  215.725589]  SyS_ioctl+0x3c/0x70
> > [  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
> > 
> 
> I have no idea what this is, I haven't messed with life time or devices
> or queues at all in that branch.

Hello Jens,

Running the srp-test software against kernel 4.9.6 and kernel 4.10-rc5
went fine. With your for-4.11/block branch (commit 400f73b23f457a), however,
I just ran into the following:

[  214.555527] ------------[ cut here ]------------
[  214.555565] WARNING: CPU: 5 PID: 13201 at kernel/locking/lockdep.c:3514 lock_release+0x346/0x480
[  214.555588] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  214.555824] CPU: 5 PID: 13201 Comm: fio Not tainted 4.10.0-rc3-dbg+ #1
[  214.555846] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
[  214.555867] Call Trace:
[  214.555889]  dump_stack+0x68/0x93
[  214.555911]  __warn+0xc6/0xe0
[  214.555953]  warn_slowpath_fmt+0x4a/0x50
[  214.555973]  lock_release+0x346/0x480
[  214.556021]  aio_write+0x106/0x140
[  214.556067]  do_io_submit+0x37d/0x900
[  214.556108]  SyS_io_submit+0xb/0x10
[  214.556131]  entry_SYSCALL_64_fastpath+0x18/0xad

I will continue to try to figure out what is causing this behavior.

Bart.
Jens Axboe Jan. 31, 2017, 1:38 a.m. UTC | #5
> On Jan 30, 2017, at 5:12 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
>> On Fri, 2017-01-27 at 09:56 -0700, Jens Axboe wrote:
>>> On 01/27/2017 09:52 AM, Bart Van Assche wrote:
>>> [  215.724452] general protection fault: 0000 [#1] SMP
>>> [  215.725060] Call Trace:
>>> [  215.725086]  scsi_disk_put+0x2d/0x40
>>> [  215.725110]  sd_release+0x3d/0xb0
>>> [  215.725137]  __blkdev_put+0x29e/0x360
>>> [  215.725163]  blkdev_put+0x49/0x170
>>> [  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
>>> [  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
>>> [  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
>>> [  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
>>> [  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
>>> [  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
>>> [  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
>>> [  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
>>> [  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
>>> [  215.725515]  do_vfs_ioctl+0x8f/0x700
>>> [  215.725589]  SyS_ioctl+0x3c/0x70
>>> [  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
>>> 
>> 
>> I have no idea what this is, I haven't messed with life time or devices
>> or queues at all in that branch.
> 
> Hello Jens,
> 
> Running the srp-test software against kernel 4.9.6 and kernel 4.10-rc5
> went fine. With your for-4.11/block branch (commit 400f73b23f457a) however
> I just ran into the following:
> 
> [  214.555527] ------------[ cut here ]------------
> [  214.555565] WARNING: CPU: 5 PID: 13201 at kernel/locking/lockdep.c:3514 lock_release+0x346/0x480
> [  214.555588] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [  214.555824] CPU: 5 PID: 13201 Comm: fio Not tainted 4.10.0-rc3-dbg+ #1
> [  214.555846] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
> [  214.555867] Call Trace:
> [  214.555889]  dump_stack+0x68/0x93
> [  214.555911]  __warn+0xc6/0xe0
> [  214.555953]  warn_slowpath_fmt+0x4a/0x50
> [  214.555973]  lock_release+0x346/0x480
> [  214.556021]  aio_write+0x106/0x140
> [  214.556067]  do_io_submit+0x37d/0x900
> [  214.556108]  SyS_io_submit+0xb/0x10
> [  214.556131]  entry_SYSCALL_64_fastpath+0x18/0xad
> 
> I will continue to try to figure out what is causing this behavior.

That's a known bug in mainline. Pull it into 4.10-rc6,
or use my for-next where everything is already merged. 


Jens Axboe Jan. 31, 2017, 4:13 a.m. UTC | #6
On 01/30/2017 05:38 PM, Jens Axboe wrote:
> 
> 
>> On Jan 30, 2017, at 5:12 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
>>
>>> On Fri, 2017-01-27 at 09:56 -0700, Jens Axboe wrote:
>>>> On 01/27/2017 09:52 AM, Bart Van Assche wrote:
>>>> [  215.724452] general protection fault: 0000 [#1] SMP
>>>> [  215.725060] Call Trace:
>>>> [  215.725086]  scsi_disk_put+0x2d/0x40
>>>> [  215.725110]  sd_release+0x3d/0xb0
>>>> [  215.725137]  __blkdev_put+0x29e/0x360
>>>> [  215.725163]  blkdev_put+0x49/0x170
>>>> [  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
>>>> [  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
>>>> [  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
>>>> [  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
>>>> [  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
>>>> [  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
>>>> [  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
>>>> [  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
>>>> [  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
>>>> [  215.725515]  do_vfs_ioctl+0x8f/0x700
>>>> [  215.725589]  SyS_ioctl+0x3c/0x70
>>>> [  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
>>>>
>>>
>>> I have no idea what this is, I haven't messed with life time or devices
>>> or queues at all in that branch.
>>
>> Hello Jens,
>>
>> Running the srp-test software against kernel 4.9.6 and kernel 4.10-rc5
>> went fine. With your for-4.11/block branch (commit 400f73b23f457a) however
>> I just ran into the following:
>>
>> [  214.555527] ------------[ cut here ]------------
>> [  214.555565] WARNING: CPU: 5 PID: 13201 at kernel/locking/lockdep.c:3514 lock_release+0x346/0x480
>> [  214.555588] DEBUG_LOCKS_WARN_ON(depth <= 0)
>> [  214.555824] CPU: 5 PID: 13201 Comm: fio Not tainted 4.10.0-rc3-dbg+ #1
>> [  214.555846] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
>> [  214.555867] Call Trace:
>> [  214.555889]  dump_stack+0x68/0x93
>> [  214.555911]  __warn+0xc6/0xe0
>> [  214.555953]  warn_slowpath_fmt+0x4a/0x50
>> [  214.555973]  lock_release+0x346/0x480
>> [  214.556021]  aio_write+0x106/0x140
>> [  214.556067]  do_io_submit+0x37d/0x900
>> [  214.556108]  SyS_io_submit+0xb/0x10
>> [  214.556131]  entry_SYSCALL_64_fastpath+0x18/0xad
>>
>> I will continue to try to figure out what is causing this behavior.
> 
> That's a known bug in mainline. Pull it into 4.10-rc6,
> or use my for-next where everything is already merged.

Since I'm not on the phone anymore, this is the commit that was
merged after my for-4.11/block was forked, which fixes this issue:

commit a12f1ae61c489076a9aeb90bddca7722bf330df3
Author: Shaohua Li <shli@fb.com>
Date:   Tue Dec 13 12:09:56 2016 -0800

    aio: fix lock dep warning

So you can just pull that in, if you want, or do what I suggested above.
Bart Van Assche Jan. 31, 2017, 9:35 p.m. UTC | #7
On Mon, 2017-01-30 at 17:38 -0800, Jens Axboe wrote:
> That's a known bug in mainline. Pull it into 4.10-rc6,
> or use my for-next where everything is already merged. 

Hello Jens,

With your for-next branch (commit c2e60b3a2602) I haven't hit any block
layer crashes so far. The only issue I encountered that is new is a
memory leak triggered by the SG-IO code. These memory leak reports
started to appear after I started testing the mq-deadline scheduler.
kmemleak reported the following call stack multiple times after my tests
had finished:

unreferenced object 0xffff88041119e528 (size 192):
  comm "multipathd", pid 2353, jiffies 4295128020 (age 1332.440s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 12 01 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8165e3b5>] kmemleak_alloc+0x45/0xa0
    [<ffffffff811cc23d>] __kmalloc+0x15d/0x2f0
    [<ffffffff81310e35>] bio_alloc_bioset+0x185/0x1f0
    [<ffffffff813117f4>] bio_map_user_iov+0x124/0x400
    [<ffffffff81320b7a>] blk_rq_map_user_iov+0x11a/0x210
    [<ffffffff81320cbd>] blk_rq_map_user+0x4d/0x60
    [<ffffffff81336694>] sg_io+0x3d4/0x410
    [<ffffffff813369d0>] scsi_cmd_ioctl+0x300/0x490
    [<ffffffff81336b9d>] scsi_cmd_blk_ioctl+0x3d/0x50
    [<ffffffff814b4360>] sd_ioctl+0x80/0x100
    [<ffffffff8132ddde>] blkdev_ioctl+0x51e/0x9f0
    [<ffffffff8122f388>] block_ioctl+0x38/0x40
    [<ffffffff8120097f>] do_vfs_ioctl+0x8f/0x700
    [<ffffffff8120102c>] SyS_ioctl+0x3c/0x70
    [<ffffffff8166c4aa>] entry_SYSCALL_64_fastpath+0x18/0xad

Bart.
Bart Van Assche Jan. 31, 2017, 9:55 p.m. UTC | #8
On Tue, 2017-01-31 at 13:34 -0800, Bart Van Assche wrote:
> On Mon, 2017-01-30 at 17:38 -0800, Jens Axboe wrote:
> > That's a known bug in mainline. Pull it into 4.10-rc6,
> > or use my for-next where everything is already merged. 
> 
> Hello Jens,
> 
> With your for-next branch (commit c2e60b3a2602) I haven't hit any block
> layer crashes so far. The only issue I encountered that is new is a
> memory leak triggered by the SG-IO code. These memory leak reports
> started to appear after I started testing the mq-deadline scheduler.
> kmemleak reported the following call stack multiple times after my tests
> had finished:
> 
> unreferenced object 0xffff88041119e528 (size 192):
>   comm "multipathd", pid 2353, jiffies 4295128020 (age 1332.440s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 12 01 00 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff8165e3b5>] kmemleak_alloc+0x45/0xa0
>     [<ffffffff811cc23d>] __kmalloc+0x15d/0x2f0
>     [<ffffffff81310e35>] bio_alloc_bioset+0x185/0x1f0
>     [<ffffffff813117f4>] bio_map_user_iov+0x124/0x400
>     [<ffffffff81320b7a>] blk_rq_map_user_iov+0x11a/0x210
>     [<ffffffff81320cbd>] blk_rq_map_user+0x4d/0x60
>     [<ffffffff81336694>] sg_io+0x3d4/0x410
>     [<ffffffff813369d0>] scsi_cmd_ioctl+0x300/0x490
>     [<ffffffff81336b9d>] scsi_cmd_blk_ioctl+0x3d/0x50
>     [<ffffffff814b4360>] sd_ioctl+0x80/0x100
>     [<ffffffff8132ddde>] blkdev_ioctl+0x51e/0x9f0
>     [<ffffffff8122f388>] block_ioctl+0x38/0x40
>     [<ffffffff8120097f>] do_vfs_ioctl+0x8f/0x700
>     [<ffffffff8120102c>] SyS_ioctl+0x3c/0x70
>     [<ffffffff8166c4aa>] entry_SYSCALL_64_fastpath+0x18/0xad

After I repeated my test, the above findings were confirmed: no memory leaks
were reported by kmemleak after a test with I/O scheduler "none" and the
above call stack was reported 44 times by kmemleak after a test with I/O
scheduler "mq-deadline".

Bart.
Jens Axboe Jan. 31, 2017, 9:58 p.m. UTC | #9
On 01/31/2017 01:55 PM, Bart Van Assche wrote:
> On Tue, 2017-01-31 at 13:34 -0800, Bart Van Assche wrote:
>> On Mon, 2017-01-30 at 17:38 -0800, Jens Axboe wrote:
>>> That's a known bug in mainline. Pull it into 4.10-rc6,
>>> or use my for-next where everything is already merged. 
>>
>> Hello Jens,
>>
>> With your for-next branch (commit c2e60b3a2602) I haven't hit any block
>> layer crashes so far. The only issue I encountered that is new is a
>> memory leak triggered by the SG-IO code. These memory leak reports
>> started to appear after I started testing the mq-deadline scheduler.
>> kmemleak reported the following call stack multiple times after my tests
>> had finished:
>>
>> unreferenced object 0xffff88041119e528 (size 192):
>>   comm "multipathd", pid 2353, jiffies 4295128020 (age 1332.440s)
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 12 01 00 00 00 00 00 00  ................
>>   backtrace:
>>     [<ffffffff8165e3b5>] kmemleak_alloc+0x45/0xa0
>>     [<ffffffff811cc23d>] __kmalloc+0x15d/0x2f0
>>     [<ffffffff81310e35>] bio_alloc_bioset+0x185/0x1f0
>>     [<ffffffff813117f4>] bio_map_user_iov+0x124/0x400
>>     [<ffffffff81320b7a>] blk_rq_map_user_iov+0x11a/0x210
>>     [<ffffffff81320cbd>] blk_rq_map_user+0x4d/0x60
>>     [<ffffffff81336694>] sg_io+0x3d4/0x410
>>     [<ffffffff813369d0>] scsi_cmd_ioctl+0x300/0x490
>>     [<ffffffff81336b9d>] scsi_cmd_blk_ioctl+0x3d/0x50
>>     [<ffffffff814b4360>] sd_ioctl+0x80/0x100
>>     [<ffffffff8132ddde>] blkdev_ioctl+0x51e/0x9f0
>>     [<ffffffff8122f388>] block_ioctl+0x38/0x40
>>     [<ffffffff8120097f>] do_vfs_ioctl+0x8f/0x700
>>     [<ffffffff8120102c>] SyS_ioctl+0x3c/0x70
>>     [<ffffffff8166c4aa>] entry_SYSCALL_64_fastpath+0x18/0xad
> 
> After I repeated my test the above findings were confirmed: no memory leaks
> were reported by kmemleak after a test with I/O scheduler "none" and the
> above call stack was reported 44 times by kmemleak after a test with I/O
> scheduler "mq-deadline".

Interesting, I'll check this. Doesn't make any sense why the scheduler
would be implicated in that, given how we run completions now. But if
it complains, then something must be up.
Bart Van Assche Feb. 1, 2017, 1:01 a.m. UTC | #10
On Tue, 2017-01-31 at 13:58 -0800, Jens Axboe wrote:
> Interesting, I'll check this. Doesn't make any sense why the scheduler
> would be implicated in that, given how we run completions now. But if
> it complains, then something must be up.

(reduced CC-list)

There is another issue that needs further attention and that I have not yet
had the time to root-cause: several processes became unkillable
while I was running srp-test (with scheduler "none"). Apparently the dm
queues got stuck in the BLK_MQ_S_SCHED_RESTART state:

# grep -aH '' dm*/*/0/{state,dispatch,*/rq_list}
dm-0/mq/0/state:0x4
dm-1/mq/0/state:0x4
dm-0/mq/0/dispatch:ffff880386d1ca00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1637, .internal_tag=-1}
dm-1/mq/0/dispatch:ffff880388ee5400 {.cmd_type=1, .cmd_flags=0x4000, .rq_flags=0x2000, .tag=42, .internal_tag=-1}
dm-1/mq/0/dispatch:ffff880388ee5600 {.cmd_type=1, .cmd_flags=0x4000, .rq_flags=0x2000, .tag=43, .internal_tag=-1}
dm-1/mq/0/dispatch:ffff880388ee5800 {.cmd_type=1, .cmd_flags=0x4000, .rq_flags=0x2000, .tag=44, .internal_tag=-1}
dm-1/mq/0/dispatch:ffff880388ee5a00 {.cmd_type=1, .cmd_flags=0x0, .rq_flags=0x2000, .tag=45, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1cc00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1638, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ce00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1639, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1d000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1640, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1d200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1641, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1b600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1627, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1b800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1628, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1d400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1642, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1bc00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1630, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ae00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1623, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1c200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1633, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1d600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1643, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1d800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1644, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1c400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1634, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1a400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1618, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1c600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1635, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1c800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1636, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1da00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1645, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1dc00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1646, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1de00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1647, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1e000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1648, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1e200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1649, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1e400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1650, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1e600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1651, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1e800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1652, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ea00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1653, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ec00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1654, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ee00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1655, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1f000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1656, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1f200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1657, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1f400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1658, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1f600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1659, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1f800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1660, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1fa00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1661, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1fc00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1662, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1fe00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1663, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1600, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1601, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1602, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1603, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1604, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18a00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1605, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18c00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1606, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d18e00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1607, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1608, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1609, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1610, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1611, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1612, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19a00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1613, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19c00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1614, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d19e00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1615, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1a000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1616, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1a200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1617, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1a600 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1619, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1a800 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1620, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1aa00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1621, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ac00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1622, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1b000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1624, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1b200 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1625, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1b400 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1626, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1ba00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1629, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1be00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1631, .internal_tag=-1}
dm-0/mq/0/cpu1/rq_list:ffff880386d1c000 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1632, .internal_tag=-1}
dm-0/mq/0/cpu5/rq_list:ffff880385b97e00 {.cmd_type=1, .cmd_flags=0x0, .rq_flags=0x2000, .tag=447, .internal_tag=-1}
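As an aside for anyone replaying this dump: each dispatch/rq_list entry above follows a simple key=value layout, so it can be summarized with a small throwaway parser (the format is inferred purely from the lines shown here):

```python
import re

# Parse debugfs lines of the form:
# "dm-0/mq/0/cpu1/rq_list:ffff880386d1cc00 {.cmd_type=1, .cmd_flags=0x8801, ...}"
LINE_RE = re.compile(r'(?P<file>[^:]+):(?P<addr>[0-9a-f]+)\s*\{(?P<fields>[^}]*)\}')

def parse_rq_line(line):
    m = LINE_RE.match(line)
    if not m:
        return None
    fields = {}
    for kv in m.group('fields').split(','):
        key, _, val = kv.strip().lstrip('.').partition('=')
        fields[key] = int(val, 0)  # base 0 handles both "-1" and "0x8801"
    return m.group('file'), m.group('addr'), fields

line = ('dm-0/mq/0/cpu1/rq_list:ffff880386d1cc00 '
        '{.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, '
        '.tag=1638, .internal_tag=-1}')
f, addr, fields = parse_rq_line(line)
print(f, fields['tag'], hex(fields['cmd_flags']))
```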

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe Feb. 1, 2017, 6:38 a.m. UTC | #11
On 01/31/2017 05:01 PM, Bart Van Assche wrote:
> On Tue, 2017-01-31 at 13:58 -0800, Jens Axboe wrote:
>> Interesting, I'll check this. Doesn't make any sense why the scheduler
>> would be implicated in that, given how we run completions now. But if
>> it complains, then something must be up.
> 
> (reduced CC-list)
> 
> There is another issue that needs further attention and that I have not
> yet had time to root-cause: several processes became unkillable
> while I was running srp-test (with scheduler "none"). Apparently the dm
> queues got stuck in state BLK_MQ_S_SCHED_RESTART:
> 
> # grep -aH '' dm*/*/0/{state,dispatch,*/rq_list}
> dm-0/mq/0/state:0x4
> dm-1/mq/0/state:0x4
> dm-0/mq/0/dispatch:ffff880386d1ca00 {.cmd_type=1, .cmd_flags=0x8801, .rq_flags=0x2000, .tag=1637, .internal_tag=-1}
> dm-1/mq/0/dispatch:ffff880388ee5400 {.cmd_type=1, .cmd_flags=0x4000, .rq_flags=0x2000, .tag=42, .internal_tag=-1}
> dm-1/mq/0/dispatch:ffff880388ee5600 {.cmd_type=1, .cmd_flags=0x4000, .rq_flags=0x2000, .tag=43, .internal_tag=-1}

I think this patch:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.11/block&id=12d70958a2e8d587acaa51dafd5d6620e00b7543

should fix it for you. I just ran into the same thing tonight, testing
an unrelated thing. It's the only reason that state should be 0x4 for
you, so it has the same fingerprint.
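For reference, the state value decodes as follows, assuming the hctx state bit layout of the 4.10-era headers (BLK_MQ_S_SCHED_RESTART as bit 2 is an assumption here, mirrored in plain Python rather than taken from the kernel tree):

```python
# Assumed mirror of the blk-mq hctx state bits (block/blk-mq.h, ~v4.10):
BLK_MQ_S_STOPPED       = 1 << 0  # 0x1
BLK_MQ_S_TAG_ACTIVE    = 1 << 1  # 0x2
BLK_MQ_S_SCHED_RESTART = 1 << 2  # 0x4

def decode_hctx_state(state):
    names = {BLK_MQ_S_STOPPED: 'STOPPED',
             BLK_MQ_S_TAG_ACTIVE: 'TAG_ACTIVE',
             BLK_MQ_S_SCHED_RESTART: 'SCHED_RESTART'}
    return [name for bit, name in names.items() if state & bit]

# state:0x4 as reported for dm-0 and dm-1 above
print(decode_hctx_state(0x4))
```

Under that assumption, 0x4 is exactly SCHED_RESTART and nothing else, which matches the diagnosis above.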

The patch has been merged into for-next.
Bart Van Assche Feb. 1, 2017, 4:46 p.m. UTC | #12
On Tue, 2017-01-31 at 22:38 -0800, Jens Axboe wrote:
> I think this patch:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.11/block&id=12d70958a2e8d587acaa51dafd5d6620e00b7543
> 
> should fix it for you. I just ran into the same thing tonight, testing
> an unrelated thing. It's the only reason that state should be 0x4 for
> you, so it has the same fingerprint.
> 
> The patch has been merged into for-next.

Hello Jens,

Thanks for having looked into this. However, after having pulled the latest
block for-next tree (dbb85b06229f) another lockup was triggered soon (02-sq
is the name of a shell script of the srp-test suite):

[  243.021265] sysrq: SysRq : Show Blocked State
[  243.021301]   task                        PC stack   pid father
[  243.022909] 02-sq           D    0 10864  10509 0x00000000
[  243.022933] Call Trace:
[  243.022956]  __schedule+0x2da/0xb00
[  243.022979]  schedule+0x38/0x90
[  243.023002]  blk_mq_freeze_queue_wait+0x51/0xa0
[  243.023025]  ? remove_wait_queue+0x70/0x70
[  243.023047]  blk_mq_freeze_queue+0x15/0x20
[  243.023070]  elevator_switch+0x24/0x220
[  243.023093]  __elevator_change+0xd3/0x110
[  243.023115]  elv_iosched_store+0x21/0x60
[  243.023140]  queue_attr_store+0x54/0x90
[  243.023164]  sysfs_kf_write+0x40/0x50
[  243.023188]  kernfs_fop_write+0x137/0x1c0
[  243.023214]  __vfs_write+0x23/0x140
[  243.023242]  ? rcu_read_lock_sched_held+0x45/0x80
[  243.023265]  ? rcu_sync_lockdep_assert+0x2a/0x50
[  243.023287]  ? __sb_start_write+0xde/0x200
[  243.023308]  ? vfs_write+0x190/0x1e0
[  243.023329]  vfs_write+0xc3/0x1e0
[  243.023351]  SyS_write+0x44/0xa0
[  243.023373]  entry_SYSCALL_64_fastpath+0x18/0xad

My attempt to query the state of the blk-mq queues triggered the
following hang:

[  243.023555] grep            D    0 11010  11008 0x00000000
[  243.023578] Call Trace:
[  243.023599]  __schedule+0x2da/0xb00
[  243.023619]  schedule+0x38/0x90
[  243.023640]  schedule_preempt_disabled+0x10/0x20
[  243.023662]  mutex_lock_nested+0x23a/0x650
[  243.023683]  ? hctx_tags_show+0x2c/0x60
[  243.023703]  hctx_tags_show+0x2c/0x60
[  243.023725]  seq_read+0xf2/0x3d0
[  243.023746]  ? full_proxy_poll+0xb0/0xb0
[  243.023776]  full_proxy_read+0x83/0xb0
[  243.023798]  ? full_proxy_poll+0xb0/0xb0
[  243.023821]  __vfs_read+0x23/0x130
[  243.023843]  vfs_read+0xa3/0x170
[  243.023865]  SyS_read+0x44/0xa0
[  243.023888]  entry_SYSCALL_64_fastpath+0x18/0xad

Bart.
Jens Axboe Feb. 1, 2017, 5:13 p.m. UTC | #13
On 02/01/2017 08:46 AM, Bart Van Assche wrote:
> On Tue, 2017-01-31 at 22:38 -0800, Jens Axboe wrote:
>> I think this patch:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.11/block&id=12d70958a2e8d587acaa51dafd5d6620e00b7543
>>
>> should fix it for you. I just ran into the same thing tonight, testing
>> an unrelated thing. It's the only reason that state should be 0x4 for
>> you, so it has the same fingerprint.
>>
>> The patch has been merged into for-next.
> 
> Hello Jens,
> 
> Thanks for having looked into this. However, after having pulled the latest
> block for-next tree (dbb85b06229f) another lockup was triggered soon (02-sq
> is the name of a shell script of the srp-test suite):
> 
> [  243.021265] sysrq: SysRq : Show Blocked State
> [  243.021301]   task                        PC stack   pid father
> [  243.022909] 02-sq           D    0 10864  10509 0x00000000
> [  243.022933] Call Trace:
> [  243.022956]  __schedule+0x2da/0xb00
> [  243.022979]  schedule+0x38/0x90
> [  243.023002]  blk_mq_freeze_queue_wait+0x51/0xa0
> [  243.023025]  ? remove_wait_queue+0x70/0x70
> [  243.023047]  blk_mq_freeze_queue+0x15/0x20
> [  243.023070]  elevator_switch+0x24/0x220
> [  243.023093]  __elevator_change+0xd3/0x110
> [  243.023115]  elv_iosched_store+0x21/0x60
> [  243.023140]  queue_attr_store+0x54/0x90
> [  243.023164]  sysfs_kf_write+0x40/0x50
> [  243.023188]  kernfs_fop_write+0x137/0x1c0
> [  243.023214]  __vfs_write+0x23/0x140
> [  243.023242]  ? rcu_read_lock_sched_held+0x45/0x80
> [  243.023265]  ? rcu_sync_lockdep_assert+0x2a/0x50
> [  243.023287]  ? __sb_start_write+0xde/0x200
> [  243.023308]  ? vfs_write+0x190/0x1e0
> [  243.023329]  vfs_write+0xc3/0x1e0
> [  243.023351]  SyS_write+0x44/0xa0
> [  243.023373]  entry_SYSCALL_64_fastpath+0x18/0xad

So that's changing the elevator - did this happen while heavy IO was
going to the drive, or was it idle?

> My attempt to query the state of the blk-mq queues triggered the
> following hang:
> 
> [  243.023555] grep            D    0 11010  11008 0x00000000
> [  243.023578] Call Trace:
> [  243.023599]  __schedule+0x2da/0xb00
> [  243.023619]  schedule+0x38/0x90
> [  243.023640]  schedule_preempt_disabled+0x10/0x20
> [  243.023662]  mutex_lock_nested+0x23a/0x650
> [  243.023683]  ? hctx_tags_show+0x2c/0x60
> [  243.023703]  hctx_tags_show+0x2c/0x60
> [  243.023725]  seq_read+0xf2/0x3d0
> [  243.023746]  ? full_proxy_poll+0xb0/0xb0
> [  243.023776]  full_proxy_read+0x83/0xb0
> [  243.023798]  ? full_proxy_poll+0xb0/0xb0
> [  243.023821]  __vfs_read+0x23/0x130
> [  243.023843]  vfs_read+0xa3/0x170
> [  243.023865]  SyS_read+0x44/0xa0
> [  243.023888]  entry_SYSCALL_64_fastpath+0x18/0xad

That's because the previous elevator switch is stalled in sysfs, and we
grab the queue sysfs lock for any of the show/store functions.  So if
one hangs, all of them will...
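A toy model (plain Python, not kernel code) of that serialization: every show/store handler takes the same per-queue lock, so once one store blocks while holding it, every later show blocks on the same mutex:

```python
import threading

# Model of the queue sysfs lock behaviour described above.
sysfs_lock = threading.Lock()

# The elevator switch (store handler) takes the lock and then stalls
# in blk_mq_freeze_queue_wait(), never releasing it:
sysfs_lock.acquire()

# hctx_tags_show() would do mutex_lock(); model the reader's fate with
# a non-blocking attempt so this example terminates:
got_lock = sysfs_lock.acquire(blocking=False)
print('show handler acquired lock:', got_lock)
```

With a blocking acquire the reader would simply hang, which is exactly the second "grep" backtrace shown above.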
Bart Van Assche Feb. 1, 2017, 5:28 p.m. UTC | #14
On Wed, 2017-02-01 at 09:13 -0800, Jens Axboe wrote:
> On 02/01/2017 08:46 AM, Bart Van Assche wrote:
> > Thanks for having looked into this. However, after having pulled the latest
> > block for-next tree (dbb85b06229f) another lockup was triggered soon (02-sq
> > is the name of a shell script of the srp-test suite):
> > 
> > [  243.021265] sysrq: SysRq : Show Blocked State
> > [  243.021301]   task                        PC stack   pid father
> > [  243.022909] 02-sq           D    0 10864  10509 0x00000000
> > [  243.022933] Call Trace:
> > [  243.022956]  __schedule+0x2da/0xb00
> > [  243.022979]  schedule+0x38/0x90
> > [  243.023002]  blk_mq_freeze_queue_wait+0x51/0xa0
> > [  243.023025]  ? remove_wait_queue+0x70/0x70
> > [  243.023047]  blk_mq_freeze_queue+0x15/0x20
> > [  243.023070]  elevator_switch+0x24/0x220
> > [  243.023093]  __elevator_change+0xd3/0x110
> > [  243.023115]  elv_iosched_store+0x21/0x60
> > [  243.023140]  queue_attr_store+0x54/0x90
> > [  243.023164]  sysfs_kf_write+0x40/0x50
> > [  243.023188]  kernfs_fop_write+0x137/0x1c0
> > [  243.023214]  __vfs_write+0x23/0x140
> > [  243.023242]  ? rcu_read_lock_sched_held+0x45/0x80
> > [  243.023265]  ? rcu_sync_lockdep_assert+0x2a/0x50
> > [  243.023287]  ? __sb_start_write+0xde/0x200
> > [  243.023308]  ? vfs_write+0x190/0x1e0
> > [  243.023329]  vfs_write+0xc3/0x1e0
> > [  243.023351]  SyS_write+0x44/0xa0
> > [  243.023373]  entry_SYSCALL_64_fastpath+0x18/0xad
> 
> So that's changing the elevator - did this happen while heavy IO was
> going to the drive, or was it idle?

Hello Jens,

The shell command that was used to set the elevator is the following
($realdev is a dm device):

echo none > /sys/class/block/$(basename "$realdev")/queue/scheduler

I'm not sure whether any I/O was ongoing while the scheduler was being
changed from "none" to "none". There are two other processes that got
stuck but running lsof against these processes did not reveal what block
device these two processes were trying to examine:

[  243.021672] systemd-udevd   D    0 10585    504 0x00000000
[  243.021700] Call Trace:
[  243.021726]  __schedule+0x2da/0xb00
[  243.021749]  schedule+0x38/0x90
[  243.021771]  schedule_timeout+0x2fe/0x640
[  243.021882]  io_schedule_timeout+0x9f/0x110
[  243.021930]  wait_on_page_bit_common+0x121/0x1e0
[  243.021977]  generic_file_read_iter+0x17c/0x790
[  243.022030]  blkdev_read_iter+0x30/0x40
[  243.022053]  __vfs_read+0xbb/0x130
[  243.022075]  vfs_read+0xa3/0x170
[  243.022098]  SyS_read+0x44/0xa0
[  243.022120]  entry_SYSCALL_64_fastpath+0x18/0xad
[  243.022298] systemd-udevd   D    0 10612    504 0x00000000
[  243.022320] Call Trace:
[  243.022341]  __schedule+0x2da/0xb00
[  243.022363]  schedule+0x38/0x90
[  243.022383]  schedule_timeout+0x2fe/0x640
[  243.022490]  io_schedule_timeout+0x9f/0x110
[  243.022543]  wait_on_page_bit_common+0x121/0x1e0
[  243.022595]  generic_file_read_iter+0x17c/0x790
[  243.022640]  blkdev_read_iter+0x30/0x40
[  243.022663]  __vfs_read+0xbb/0x130
[  243.022685]  vfs_read+0xa3/0x170
[  243.022707]  SyS_read+0x44/0xa0
[  243.022729]  entry_SYSCALL_64_fastpath+0x18/0xad

# lsof -p10585
COMMAND     PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
systemd-u 10585 root  cwd   unknown                      /proc/10585/cwd (readlink: No such file or directory)
systemd-u 10585 root  rtd   unknown                      /proc/10585/root (readlink: No such file or directory)
systemd-u 10585 root  txt   unknown                      /proc/10585/exe
# lsof -p10612
COMMAND     PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
systemd-u 10612 root  cwd   unknown                      /proc/10612/cwd (readlink: No such file or directory)
systemd-u 10612 root  rtd   unknown                      /proc/10612/root (readlink: No such file or directory)
systemd-u 10612 root  txt   unknown                      /proc/10612/exe

Bart.
Bart Van Assche Feb. 1, 2017, 7:21 p.m. UTC | #15
On Wed, 2017-02-01 at 09:13 -0800, Jens Axboe wrote:
> So that's changing the elevator - did this happen while heavy IO was
> going to the drive, or was it idle?

I just ran into an I/O hang while running test 02-sq on top of kernel
v4.9.6. I will have a closer look at the dm code to see whether I can
find anything that is wrong in the dm code.

Bart.
Bart Van Assche Feb. 1, 2017, 10:01 p.m. UTC | #16
On Wed, 2017-02-01 at 09:13 -0800, Jens Axboe wrote:
> So that's changing the elevator - did this happen while heavy IO was
> going to the drive, or was it idle?

Hello Jens,

I think I figured out what was going on:
* Test 02-mq created scsi-mq SRP paths and multipathd created dm-mq device
  nodes on top of these SRP paths.
* Test 02-sq started with removing the SRP paths and with switching the
  SCSI and DM cores from mq to sq but did not remove the dm-mq device
  nodes. When that test script next performed an SRP login, multipathd
  failed to add the single-queue SCSI devices to an existing dm-mq
  device node. In other words, the dm-mq devices had queue_if_no_path
  set but did not have any paths. Hence the lockup for requests sent
  to these dm devices.
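A minimal model of the dispatch rule that produces this lockup (illustrative only, not the dm-mpath code): with queue_if_no_path set and zero usable paths, requests are requeued indefinitely instead of being failed back:

```python
from collections import deque

def dispatch(paths, queue_if_no_path, requests):
    """Model of dm-mpath request routing for the no-path case."""
    queued, completed, failed = deque(), [], []
    for rq in requests:
        if paths:
            completed.append(rq)       # normal case: send down a path
        elif queue_if_no_path:
            queued.append(rq)          # held forever if no path ever returns
        else:
            failed.append(rq)          # -EIO back to the submitter
    return queued, completed, failed

# The situation described above: queue_if_no_path set, no paths left.
queued, completed, failed = dispatch(paths=[], queue_if_no_path=True,
                                     requests=['rq1', 'rq2'])
print(len(queued), len(completed), len(failed))
```

The processes blocked on those requests then become unkillable, since the I/O can only complete once a path is restored or queue_if_no_path is cleared.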

I have modified my test scripts so that the dm device nodes from a
previous test are removed before a new test starts. Since I made that
change I haven't seen any I/O lockup. However, a new issue shows up
sporadically, one that I had not seen during any test with
a kernel tree from Linus:

[  227.613440] general protection fault: 0000 [#1] SMP
[  227.613495] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib msr rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc tg3 aesni_intel iTCO_wdt mlx4_core ptp iTCO_vendor_support dcdbas aes_x86_64 crypto_simd glue_helper pps_core cryptd pcspkr devlink ipmi_si libphy ipmi_devintf fjes ipmi_msghandler tpm_tis tpm_tis_core lpc_ich mei_me mfd_core mei shpchp wmi tpm button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sr_mod sysimgblt fb_sys_fops cdrom ttm drm ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
[  227.613774] CPU: 3 PID: 28 Comm: ksoftirqd/3 Not tainted 4.10.0-rc5-dbg+ #1
[  227.613840] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
[  227.613893] task: ffff880172a249c0 task.stack: ffffc90001aa8000
[  227.613932] RIP: 0010:rq_completed+0x12/0x90 [dm_mod]
[  227.613965] RSP: 0018:ffffc90001aabda8 EFLAGS: 00010246
[  227.614006] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
[  227.614043] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b6b
[  227.614074] RBP: ffffc90001aabdc0 R08: ffff8803825f4c38 R09: 0000000000000000
[  227.614105] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  227.614137] R13: 0000000000000000 R14: ffffffff81c05120 R15: 0000000000000004
[  227.614170] FS:  0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
[  227.614209] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  227.614239] CR2: 0000557e28bc20d0 CR3: 000000038594e000 CR4: 00000000001406e0
[  227.614268] Call Trace:
[  227.614301]  dm_softirq_done+0xe6/0x1e0 [dm_mod]
[  227.614337]  blk_done_softirq+0x88/0xa0
[  227.614369]  __do_softirq+0xba/0x4c0
[  227.614470]  run_ksoftirqd+0x1a/0x50
[  227.614499]  smpboot_thread_fn+0x123/0x1e0
[  227.614529]  kthread+0x107/0x140
[  227.614624]  ret_from_fork+0x2e/0x40
[  227.614648] Code: ff ff 31 f6 48 89 c7 e8 cd 0e 2f e1 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 63 f6 48 89 e5 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b a7 88 02 00 00 f0 ff 8c b7 50 03 00 00 e8 ba 43 ff ff 85 
[  227.614738] RIP: rq_completed+0x12/0x90 [dm_mod] RSP: ffffc90001aabda8

(gdb) list *(rq_completed+0x12)
0xdd12 is in rq_completed (drivers/md/dm-rq.c:187).
182      * the md may be freed in dm_put() at the end of this function.
183      * Or do dm_get() before calling this function and dm_put() later.
184      */
185     static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
186     {
187             struct request_queue *q = md->queue;
188             unsigned long flags;
189
190             atomic_dec(&md->pending[rw]);
191

(gdb) disas rq_completed
Dump of assembler code for function rq_completed:
   0x000000000000dd00 <+0>:     push   %rbp
   0x000000000000dd01 <+1>:     movslq %esi,%rsi
   0x000000000000dd04 <+4>:     mov    %rsp,%rbp
   0x000000000000dd07 <+7>:     push   %r13
   0x000000000000dd09 <+9>:     mov    %edx,%r13d
   0x000000000000dd0c <+12>:    push   %r12
   0x000000000000dd0e <+14>:    push   %rbx
   0x000000000000dd0f <+15>:    mov    %rdi,%rbx
   0x000000000000dd12 <+18>:    mov    0x288(%rdi),%r12
   0x000000000000dd19 <+25>:    lock decl 0x350(%rdi,%rsi,4)

So this was caused by an attempt to dereference %rdi = 0x6b6b6b6b6b6b6b6b.
Hence this is probably a use-after-free of struct mapped_device.
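For reference, 0x6b is POISON_FREE from include/linux/poison.h: the byte that slab poisoning writes over an object when it is freed. A quick check for that signature (a sketch, not kernel code) makes the use-after-free reading explicit:

```python
# A pointer register filled with the POISON_FREE byte is the classic
# slab use-after-free signature.
POISON_FREE = 0x6b

def looks_like_freed_object(ptr, width=8):
    """True if every byte of a width-byte pointer equals POISON_FREE."""
    return all(((ptr >> (8 * i)) & 0xff) == POISON_FREE
               for i in range(width))

print(looks_like_freed_object(0x6b6b6b6b6b6b6b6b))  # %rdi/%rbx from the oops
print(looks_like_freed_object(0xffff880172a249c0))  # an ordinary kernel pointer
```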

Bart.
Bart Van Assche Feb. 2, 2017, 5:27 p.m. UTC | #17
On Wed, 2017-02-01 at 22:01 +0000, Bart Van Assche wrote:
> However, a new issue shows up sporadically, an issue that I had not yet seen
> during any test with a kernel tree from Linus:
>
> [  227.613440] general protection fault: 0000 [#1] SMP
> [  227.613495] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib msr rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc tg3 aesni_intel iTCO_wdt mlx4_core ptp iTCO_vendor_support dcdbas aes_x86_64 crypto_simd glue_helper pps_core cryptd pcspkr devlink ipmi_si libphy ipmi_devintf fjes ipmi_msghandler tpm_tis tpm_tis_core lpc_ich mei_me mfd_core mei shpchp wmi tpm button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sr_mod sysimgblt fb_sys_fops cdrom ttm drm ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
> [  227.613774] CPU: 3 PID: 28 Comm: ksoftirqd/3 Not tainted 4.10.0-rc5-dbg+ #1
> [  227.613840] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
> [  227.613893] task: ffff880172a249c0 task.stack: ffffc90001aa8000
> [  227.613932] RIP: 0010:rq_completed+0x12/0x90 [dm_mod]
> [  227.613965] RSP: 0018:ffffc90001aabda8 EFLAGS: 00010246
> [  227.614006] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
> [  227.614043] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b6b
> [  227.614074] RBP: ffffc90001aabdc0 R08: ffff8803825f4c38 R09: 0000000000000000
> [  227.614105] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  227.614137] R13: 0000000000000000 R14: ffffffff81c05120 R15: 0000000000000004
> [  227.614170] FS:  0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
> [  227.614209] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  227.614239] CR2: 0000557e28bc20d0 CR3: 000000038594e000 CR4: 00000000001406e0
> [  227.614268] Call Trace:
> [  227.614301]  dm_softirq_done+0xe6/0x1e0 [dm_mod]
> [  227.614337]  blk_done_softirq+0x88/0xa0
> [  227.614369]  __do_softirq+0xba/0x4c0
> [  227.614470]  run_ksoftirqd+0x1a/0x50
> [  227.614499]  smpboot_thread_fn+0x123/0x1e0
> [  227.614529]  kthread+0x107/0x140
> [  227.614624]  ret_from_fork+0x2e/0x40
> [  227.614648] Code: ff ff 31 f6 48 89 c7 e8 cd 0e 2f e1 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 63 f6 48 89 e5 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b a7 88 02 00 00 f0 ff 8c b7 50 03 00 00 e8 ba 43 ff ff 85 
> [  227.614738] RIP: rq_completed+0x12/0x90 [dm_mod] RSP: ffffc90001aabda8
> 
> (gdb) list *(rq_completed+0x12)
> 0xdd12 is in rq_completed (drivers/md/dm-rq.c:187).
> 182      * the md may be freed in dm_put() at the end of this function.
> 183      * Or do dm_get() before calling this function and dm_put() later.
> 184      */
> 185     static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> 186     {
> 187             struct request_queue *q = md->queue;
> 188             unsigned long flags;
> 189
> 190             atomic_dec(&md->pending[rw]);
> 191
> 
> (gdb) disas rq_completed
> Dump of assembler code for function rq_completed:
>    0x000000000000dd00 <+0>:     push   %rbp
>    0x000000000000dd01 <+1>:     movslq %esi,%rsi
>    0x000000000000dd04 <+4>:     mov    %rsp,%rbp
>    0x000000000000dd07 <+7>:     push   %r13
>    0x000000000000dd09 <+9>:     mov    %edx,%r13d
>    0x000000000000dd0c <+12>:    push   %r12
>    0x000000000000dd0e <+14>:    push   %rbx
>    0x000000000000dd0f <+15>:    mov    %rdi,%rbx
>    0x000000000000dd12 <+18>:    mov    0x288(%rdi),%r12
>    0x000000000000dd19 <+25>:    lock decl 0x350(%rdi,%rsi,4)
> 
> So this was caused by an attempt to dereference %rdi = 0x6b6b6b6b6b6b6b6b.
> Hence this is probably a use-after-free of struct mapped_device.

Hello Christoph and Mike,

The above crash occurs with Jens' for-next branch but not with Jens'
for-4.11/block branch. Sorry but I think this means that the SCSI
passthrough refactoring code is not yet ready for prime time.

Bart.
Mike Snitzer Feb. 2, 2017, 6:33 p.m. UTC | #18
On Thu, Feb 02 2017 at 12:27pm -0500,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Wed, 2017-02-01 at 22:01 +0000, Bart Van Assche wrote:
> > However, a new issue shows up sporadically, an issue that I had not yet seen
> > during any test with a kernel tree from Linus:
> >
> > [  227.613440] general protection fault: 0000 [#1] SMP
> > [  227.613495] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib msr rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc tg3 aesni_intel iTCO_wdt mlx4_core ptp iTCO_vendor_support dcdbas aes_x86_64 crypto_simd glue_helper pps_core cryptd pcspkr devlink ipmi_si libphy ipmi_devintf fjes ipmi_msghandler tpm_tis tpm_tis_core lpc_ich mei_me mfd_core mei shpchp wmi tpm button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sr_mod sysimgblt fb_sys_fops cdrom ttm drm ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
> > [  227.613774] CPU: 3 PID: 28 Comm: ksoftirqd/3 Not tainted 4.10.0-rc5-dbg+ #1
> > [  227.613840] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
> > [  227.613893] task: ffff880172a249c0 task.stack: ffffc90001aa8000
> > [  227.613932] RIP: 0010:rq_completed+0x12/0x90 [dm_mod]
> > [  227.613965] RSP: 0018:ffffc90001aabda8 EFLAGS: 00010246
> > [  227.614006] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
> > [  227.614043] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b6b
> > [  227.614074] RBP: ffffc90001aabdc0 R08: ffff8803825f4c38 R09: 0000000000000000
> > [  227.614105] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [  227.614137] R13: 0000000000000000 R14: ffffffff81c05120 R15: 0000000000000004
> > [  227.614170] FS:  0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
> > [  227.614209] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  227.614239] CR2: 0000557e28bc20d0 CR3: 000000038594e000 CR4: 00000000001406e0
> > [  227.614268] Call Trace:
> > [  227.614301]  dm_softirq_done+0xe6/0x1e0 [dm_mod]
> > [  227.614337]  blk_done_softirq+0x88/0xa0
> > [  227.614369]  __do_softirq+0xba/0x4c0
> > [  227.614470]  run_ksoftirqd+0x1a/0x50
> > [  227.614499]  smpboot_thread_fn+0x123/0x1e0
> > [  227.614529]  kthread+0x107/0x140
> > [  227.614624]  ret_from_fork+0x2e/0x40
> > [  227.614648] Code: ff ff 31 f6 48 89 c7 e8 cd 0e 2f e1 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 63 f6 48 89 e5 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b a7 88 02 00 00 f0 ff 8c b7 50 03 00 00 e8 ba 43 ff ff 85 
> > [  227.614738] RIP: rq_completed+0x12/0x90 [dm_mod] RSP: ffffc90001aabda8
> > 
> > (gdb) list *(rq_completed+0x12)
> > 0xdd12 is in rq_completed (drivers/md/dm-rq.c:187).
> > 182      * the md may be freed in dm_put() at the end of this function.
> > 183      * Or do dm_get() before calling this function and dm_put() later.
> > 184      */
> > 185     static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> > 186     {
> > 187             struct request_queue *q = md->queue;
> > 188             unsigned long flags;
> > 189
> > 190             atomic_dec(&md->pending[rw]);
> > 191
> > 
> > (gdb) disas rq_completed
> > Dump of assembler code for function rq_completed:
> >    0x000000000000dd00 <+0>:     push   %rbp
> >    0x000000000000dd01 <+1>:     movslq %esi,%rsi
> >    0x000000000000dd04 <+4>:     mov    %rsp,%rbp
> >    0x000000000000dd07 <+7>:     push   %r13
> >    0x000000000000dd09 <+9>:     mov    %edx,%r13d
> >    0x000000000000dd0c <+12>:    push   %r12
> >    0x000000000000dd0e <+14>:    push   %rbx
> >    0x000000000000dd0f <+15>:    mov    %rdi,%rbx
> >    0x000000000000dd12 <+18>:    mov    0x288(%rdi),%r12
> >    0x000000000000dd19 <+25>:    lock decl 0x350(%rdi,%rsi,4)
> > 
> > So this was caused by an attempt to dereference %rdi = 0x6b6b6b6b6b6b6b6b.
> > Hence this is probably a use-after-free of struct mapped_device.
> 
> Hello Christoph and Mike,
> 
> The above crash occurs with Jens' for-next branch but not with Jens'
> for-4.11/block branch. Sorry but I think this means that the SCSI
> passthrough refactoring code is not yet ready for prime time.

I somehow missed your original report from yesterday.

I do agree that this would appear to be a use-after-free; certainly not
a fun bug to sort out.

I'll go back over hch's changes to see if I can spot anything.  But is
this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
dm-multipath?

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Feb. 2, 2017, 6:43 p.m. UTC | #19
On Thu, 2017-02-02 at 13:33 -0500, Mike Snitzer wrote:
> I'll go back over hch's changes to see if I can spot anything.  But is
> this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
> dm-multipath?

Hello Mike,

The srp-test software tests multiple configurations: dm-mq on scsi-mq, dm-sq
on scsi-mq and dm-sq on scsi-sq. I have not yet checked which of these
three configurations triggers the kernel crash.

Bart.
Mike Snitzer Feb. 2, 2017, 7:13 p.m. UTC | #20
On Thu, Feb 02 2017 at  1:43pm -0500,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Thu, 2017-02-02 at 13:33 -0500, Mike Snitzer wrote:
> > I'll go back over hch's changes to see if I can spot anything.  But is
> > this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
> > dm-multipath?
> 
> Hello Mike,
> 
> The srp-test software tests multiple configurations: dm-mq on scsi-mq, dm-sq
> on scsi-mq and dm-sq on scsi-sq. I have not yet checked which of these
> three configurations triggers the kernel crash.

OK, such info is important to provide for crashes like this.  Please let
me know once you do.

Thanks,
Mike
Bart Van Assche Feb. 2, 2017, 7:46 p.m. UTC | #21
On Thu, 2017-02-02 at 14:13 -0500, Mike Snitzer wrote:
> On Thu, Feb 02 2017 at  1:43pm -0500, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> > On Thu, 2017-02-02 at 13:33 -0500, Mike Snitzer wrote:
> > > I'll go back over hch's changes to see if I can spot anything.  But is
> > > this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
> > > dm-multipath?
> > 
> > The srp-test software tests multiple configurations: dm-mq on scsi-mq, dm-sq
> > on scsi-mq and dm-sq on scsi-sq. I have not yet checked which of these
> > three configurations triggers the kernel crash.
> 
> OK, such info is important to provide for crashes like this.  Please let
> me know once you do.

Hello Mike,

Apparently it's the large I/O test (using dm-mq on scsi-mq) that triggers the
crash:

# ~bart/software/infiniband/srp-test/run_tests -r 10
[ ... ]
Test /home/bart/software/infiniband/srp-test/tests/02-sq-on-mq succeeded
Running test /home/bart/software/infiniband/srp-test/tests/03 ...
Test large transfer sizes with cmd_sg_entries=255
removing /dev/mapper/mpatht: [ CRASH ]

The source code of the test I ran is available at
https://github.com/bvanassche/srp-test.

Bart.
Mike Snitzer Feb. 2, 2017, 9:04 p.m. UTC | #22
On Thu, Feb 02 2017 at  2:46pm -0500,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Thu, 2017-02-02 at 14:13 -0500, Mike Snitzer wrote:
> > On Thu, Feb 02 2017 at  1:43pm -0500, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> > > On Thu, 2017-02-02 at 13:33 -0500, Mike Snitzer wrote:
> > > > I'll go back over hch's changes to see if I can spot anything.  But is
> > > > this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
> > > > dm-multipath?
> > > 
> > > The srp-test software tests multiple configurations: dm-mq on scsi-mq, dm-sq
> > > on scsi-mq and dm-sq on scsi-sq. I have not yet checked which of these
> > > three configurations triggers the kernel crash.
> > 
> > OK, such info is important to provide for crashes like this.  Please let
> > me know once you do.
> 
> Hello Mike,
> 
> Apparently it's the large I/O test (using dm-mq on scsi-mq) that triggers the
> crash:

I've gone over Christoph's "dm: always defer request allocation to the
owner of the request_queue" commit yet again.  Most of that commit's
changes are just mechanical.  I didn't see any problems.

In general, dm_start_request() calls dm_get(md) to take a reference on
the mapped_device.  And rq_completed() calls dm_put(md) to drop the
reference.  The DM device's request_queue (md->queue) should _not_ ever
be torn down before all references on the md have been dropped. But I'll
have to look closer on how/if that is enforced anywhere by coordinating
with block core.

In any case, the crash you reported was that the mapped_device was being
dereferenced after it was freed (at line 187's md->queue).  Which seems
to imply a dm_get/dm_put reference count regression.  But I'm not seeing
where at this point.

> # ~bart/software/infiniband/srp-test/run_tests -r 10
> [ ... ]
> Test /home/bart/software/infiniband/srp-test/tests/02-sq-on-mq succeeded
> Running test /home/bart/software/infiniband/srp-test/tests/03 ...
> Test large transfer sizes with cmd_sg_entries=255
> removing /dev/mapper/mpatht: [ CRASH ]
> 
> The source code of the test I ran is available at
> https://github.com/bvanassche/srp-test.

Any progress on getting this to work without requiring infiniband HW?
Mike Snitzer Feb. 2, 2017, 9:10 p.m. UTC | #23
On Thu, Feb 02 2017 at  4:04pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Feb 02 2017 at  2:46pm -0500,
> Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> 
> > On Thu, 2017-02-02 at 14:13 -0500, Mike Snitzer wrote:
> > > On Thu, Feb 02 2017 at  1:43pm -0500, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> > > > On Thu, 2017-02-02 at 13:33 -0500, Mike Snitzer wrote:
> > > > > I'll go back over hch's changes to see if I can spot anything.  But is
> > > > > this testing using dm_mod.use_blk_mq=Y or are you testing old .request_fn
> > > > > dm-multipath?
> > > > 
> > > > The srp-test software tests multiple configurations: dm-mq on scsi-mq, dm-sq
> > > > on scsi-mq and dm-sq on scsi-sq. I have not yet checked which of these
> > > > three configurations triggers the kernel crash.
> > > 
> > > OK, such info is important to provide for crashes like this.  Please let
> > > me know once you do.
> > 
> > Hello Mike,
> > 
> > Apparently it's the large I/O test (using dm-mq on scsi-mq) that triggers the
> > crash:
> 
> I've gone over Christoph's "dm: always defer request allocation to the
> owner of the request_queue" commit yet again.  Most of that commit's
> changes are just mechanical.  I didn't see any problems.
> 
> In general, dm_start_request() calls dm_get(md) to take a reference on
> the mapped_device.  And rq_completed() calls dm_put(md) to drop the
> reference.  The DM device's request_queue (md->queue) should _not_ ever
> be torn down before all references on the md have been dropped. But I'll
> have to look closer on how/if that is enforced anywhere by coordinating
> with block core.
> 
> In any case, the crash you reported was that the mapped_device was being
> dereferenced after it was freed (at line 187's md->queue).  Which seems
> to imply a dm_get/dm_put reference count regression.  But I'm not seeing
> where at this point.

Maybe it isn't a regression but something about Christoph's changes
causes a race to present itself?

Care to try moving the dm_get(md) at the end of dm_start_request() to
the beginning of dm_start_request() and report back on whether it helps
at all?

Thanks,
Mike
Bart Van Assche Feb. 2, 2017, 10:38 p.m. UTC | #24
On Thu, 2017-02-02 at 16:04 -0500, Mike Snitzer wrote:
> Any progress on getting this to work without requiring infiniband HW?

Hello Mike,

Instructions for running these tests over SoftRoCE have been added to
the README.md file in https://github.com/bvanassche/srp-test. However,
I'm not sure the SoftRoCE driver is already stable enough to run these
tests on top of that driver.

Bart.
Bart Van Assche Feb. 3, 2017, 12:20 a.m. UTC | #25
On Thu, 2017-02-02 at 16:10 -0500, Mike Snitzer wrote:
> Care to try moving the dm_get(md) at the end of dm_start_request() to
> the beginning of dm_start_request() and report back on whether it helps
> at all?

Hello Mike,

Sorry but I don't see how that could make a difference. While we are at it:
since dm_start_request() calls dm_get() and since rq_completed() calls
dm_put(), calls to these two functions should always be paired. There is
only one dm_start_request() call in the dm-mq code, namely the one in
dm_mq_queue_rq(). However, if map_request() returns DM_MAPIO_REQUEUE then
rq_completed() is called twice: a first time by
dm_requeue_original_request() and a second time by the dm_mq_queue_rq().
Do you agree with this?

Thanks,

Bart.
Mike Snitzer Feb. 3, 2017, 12:42 a.m. UTC | #26
On Thu, Feb 02 2017 at  7:20pm -0500,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Thu, 2017-02-02 at 16:10 -0500, Mike Snitzer wrote:
> > Care to try moving the dm_get(md) at the end of dm_start_request() to
> > the beginning of dm_start_request() and report back on whether it helps
> > at all?
> 
> Hello Mike,
> 
> Sorry but I don't see how that could make a difference.

Yeah, I thought about it further after I suggested it and agree that it
shouldn't make a difference (request isn't actually issued during
dm_start_request).

> While we are at it:
> since dm_start_request() calls dm_get() and since rq_completed() calls
> dm_put(), calls to these two functions should always be paired. There is
> only one dm_start_request() call in the dm-mq code, namely the one in
> dm_mq_queue_rq(). However, if map_request() returns DM_MAPIO_REQUEUE then
> rq_completed() is called twice: a first time by
> dm_requeue_original_request() and a second time by the dm_mq_queue_rq().
> Do you agree with this?

Not seeing it.

DM_MAPIO_DELAY_REQUEUE will call dm_requeue_original_request(), but
dm_mq_queue_rq() won't call rq_completed() in that case.

And for DM_MAPIO_REQUEUE dm_mq_queue_rq() will call rq_completed().
diff mbox

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index a61f1407f4f6..78daf5b6d7cb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2129,7 +2129,7 @@  int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_sched_insert_request(rq, false, true, false);
+		blk_mq_sched_insert_request(rq, false, true, false, false);
 		return 0;
 	}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 86656fdfa637..ed1f10165268 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -66,7 +66,7 @@  void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * be reused after dying flag is set
 	 */
 	if (q->mq_ops) {
-		blk_mq_sched_insert_request(rq, at_head, true, false);
+		blk_mq_sched_insert_request(rq, at_head, true, false, false);
 		return;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index d7de34ee39c2..4427896641ac 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -456,7 +456,7 @@  void blk_insert_flush(struct request *rq)
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
 		if (q->mq_ops)
-			blk_mq_sched_insert_request(rq, false, true, false);
+			blk_mq_sched_insert_request(rq, false, true, false, false);
 		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c27613de80c5..5e91743e193a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -336,6 +336,64 @@  void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
 	}
 }
 
+/*
+ * Add flush/fua to the queue. If we fail getting a driver tag, then
+ * punt to the requeue list. Requeue will re-invoke us from a context
+ * that's safe to block from.
+ */
+static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
+				      struct request *rq, bool can_block)
+{
+	if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
+		blk_insert_flush(rq);
+		blk_mq_run_hw_queue(hctx, true);
+	} else
+		blk_mq_add_to_requeue_list(rq, true, true);
+}
+
+void blk_mq_sched_insert_request(struct request *rq, bool at_head,
+				 bool run_queue, bool async, bool can_block)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+	if (rq->tag == -1 && (rq->cmd_flags & (REQ_PREFLUSH | REQ_FUA))) {
+		blk_mq_sched_insert_flush(hctx, rq, can_block);
+		return;
+	}
+
+	if (e && e->type->ops.mq.insert_requests) {
+		LIST_HEAD(list);
+
+		list_add(&rq->queuelist, &list);
+		e->type->ops.mq.insert_requests(hctx, &list, at_head);
+	} else {
+		spin_lock(&ctx->lock);
+		__blk_mq_insert_request(hctx, rq, at_head);
+		spin_unlock(&ctx->lock);
+	}
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
+void blk_mq_sched_insert_requests(struct request_queue *q,
+				  struct blk_mq_ctx *ctx,
+				  struct list_head *list, bool run_queue_async)
+{
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (e && e->type->ops.mq.insert_requests)
+		e->type->ops.mq.insert_requests(hctx, list, false);
+	else
+		blk_mq_insert_requests(hctx, ctx, list);
+
+	blk_mq_run_hw_queue(hctx, run_queue_async);
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index becbc7840364..9478aaeb48c5 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -21,6 +21,12 @@  bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
+void blk_mq_sched_insert_request(struct request *rq, bool at_head,
+				 bool run_queue, bool async, bool can_block);
+void blk_mq_sched_insert_requests(struct request_queue *q,
+				  struct blk_mq_ctx *ctx,
+				  struct list_head *list, bool run_queue_async);
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
@@ -62,45 +68,6 @@  static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
 		e->type->ops.mq.put_rq_priv(q, rq);
 }
 
-static inline void
-blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
-			    bool async)
-{
-	struct request_queue *q = rq->q;
-	struct elevator_queue *e = q->elevator;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
-	if (e && e->type->ops.mq.insert_requests) {
-		LIST_HEAD(list);
-
-		list_add(&rq->queuelist, &list);
-		e->type->ops.mq.insert_requests(hctx, &list, at_head);
-	} else {
-		spin_lock(&ctx->lock);
-		__blk_mq_insert_request(hctx, rq, at_head);
-		spin_unlock(&ctx->lock);
-	}
-
-	if (run_queue)
-		blk_mq_run_hw_queue(hctx, async);
-}
-
-static inline void
-blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
-			     struct list_head *list, bool run_queue_async)
-{
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	struct elevator_queue *e = hctx->queue->elevator;
-
-	if (e && e->type->ops.mq.insert_requests)
-		e->type->ops.mq.insert_requests(hctx, list, false);
-	else
-		blk_mq_insert_requests(hctx, ctx, list);
-
-	blk_mq_run_hw_queue(hctx, run_queue_async);
-}
-
 static inline bool
 blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 			 struct bio *bio)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1b156ca79af6..78bbacd129c9 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -106,6 +106,7 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	struct sbq_wait_state *ws;
 	DEFINE_WAIT(wait);
 	unsigned int tag_offset;
+	bool drop_ctx;
 	int tag;
 
 	if (data->flags & BLK_MQ_REQ_RESERVED) {
@@ -128,6 +129,7 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		return BLK_MQ_TAG_FAIL;
 
 	ws = bt_wait_ptr(bt, data->hctx);
+	drop_ctx = data->ctx == NULL;
 	do {
 		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
 
@@ -150,7 +152,8 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != -1)
 			break;
 
-		blk_mq_put_ctx(data->ctx);
+		if (data->ctx)
+			blk_mq_put_ctx(data->ctx);
 
 		io_schedule();
 
@@ -166,6 +169,9 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		ws = bt_wait_ptr(bt, data->hctx);
 	} while (1);
 
+	if (drop_ctx && data->ctx)
+		blk_mq_put_ctx(data->ctx);
+
 	finish_wait(&ws->wait, &wait);
 
 found_tag:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4df397910251..888868b62018 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,13 +568,13 @@  static void blk_mq_requeue_work(struct work_struct *work)
 
 		rq->rq_flags &= ~RQF_SOFTBARRIER;
 		list_del_init(&rq->queuelist);
-		blk_mq_sched_insert_request(rq, true, false, false);
+		blk_mq_sched_insert_request(rq, true, false, false, true);
 	}
 
 	while (!list_empty(&rq_list)) {
 		rq = list_entry(rq_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		blk_mq_sched_insert_request(rq, false, false, false);
+		blk_mq_sched_insert_request(rq, false, false, false, true);
 	}
 
 	blk_mq_run_hw_queues(q, false);
@@ -847,12 +847,11 @@  static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq,
-				  struct blk_mq_hw_ctx **hctx, bool wait)
+bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
+			   bool wait)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
-		.ctx = rq->mq_ctx,
 		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
@@ -1395,7 +1394,7 @@  static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, true, true);
+	blk_mq_sched_insert_request(rq, false, true, true, false);
 }
 
 /*
@@ -1446,10 +1445,12 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	cookie = request_to_qc_t(data.hctx, rq);
 
 	if (unlikely(is_flush_fua)) {
+		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_get_driver_tag(rq, NULL, true);
 		blk_insert_flush(rq);
-		goto run_queue;
+		blk_mq_run_hw_queue(data.hctx, true);
+		goto done;
 	}
 
 	plug = current->plug;
@@ -1502,7 +1503,7 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_sched_insert_request(rq, false, true,
-						!is_sync || is_flush_fua);
+						!is_sync || is_flush_fua, true);
 		goto done;
 	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1512,7 +1513,6 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * latter allows for merging opportunities and more efficient
 		 * dispatching.
 		 */
-run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
 	blk_mq_put_ctx(data.ctx);
@@ -1568,10 +1568,12 @@  static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	cookie = request_to_qc_t(data.hctx, rq);
 
 	if (unlikely(is_flush_fua)) {
+		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_get_driver_tag(rq, NULL, true);
 		blk_insert_flush(rq);
-		goto run_queue;
+		blk_mq_run_hw_queue(data.hctx, true);
+		goto done;
 	}
 
 	/*
@@ -1612,7 +1614,7 @@  static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_sched_insert_request(rq, false, true,
-						!is_sync || is_flush_fua);
+						!is_sync || is_flush_fua, true);
 		goto done;
 	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1622,7 +1624,6 @@  static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		 * latter allows for merging opportunities and more efficient
 		 * dispatching.
 		 */
-run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d19b0e75a129..d34929968071 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -34,6 +34,8 @@  void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
+bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
+				bool wait);
 
 /*
  * Internal helpers for allocating/freeing the request map