diff mbox

Time to make dynamically allocated devt the default for scsi disks?

Message ID 1471047451.2407.95.camel@HansenPartnership.com (mailing list archive)
State New, archived
Headers show

Commit Message

James Bottomley Aug. 13, 2016, 12:17 a.m. UTC
On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
> Before spending effort trying to flush the destruction of old bdi
> instances before new ones are registered, is it rather time to
> complete the conversion of sd to only use dynamically allocated devt?

Do we have to go that far?  Surely your fix is extensible: the only
reason it doesn't work for us is that the gendisk holds the parent
without a reference, so we can free the SCSI device before its child
gendisk (good job no-one actually uses gendisk->parent after we've
released it ...).  If we fix that it would mean SCSI can't release the
sdev until after the queue is dead and the bdi namespace released, so
isn't something like this the easy fix?

James

---

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dan Williams Aug. 13, 2016, 12:29 a.m. UTC | #1
On Fri, Aug 12, 2016 at 5:17 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
>> Before spending effort trying to flush the destruction of old bdi
>> instances before new ones are registered, is it rather time to
>> complete the conversion of sd to only use dynamically allocated devt?
>
> Do we have to go that far?  Surely your fix is extensible: the only
> reason it doesn't work for us is that the gendisk holds the parent
> without a reference, so we can free the SCSI device before its child
> gendisk (good job no-one actually uses gendisk->parent after we've
> released it ...).  If we fix that it would mean SCSI can't release the
> sdev until after the queue is dead and the bdi namespace released, so
> isn't something like this the easy fix?
>
> James
>
> ---
>
> diff --git a/block/genhd.c b/block/genhd.c
> index fcd6d4f..54ae4ae 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -514,7 +514,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
>         struct hd_struct *part;
>         int err;
>
> -       ddev->parent = parent;
> +       ddev->parent = get_device(parent);
>
>         dev_set_name(ddev, "%s", disk->disk_name);
>
> @@ -1144,6 +1144,7 @@ static void disk_release(struct device *dev)
>         hd_free_part(&disk->part0);
>         if (disk->queue)
>                 blk_put_queue(disk->queue);
> +       put_device(dev->parent);
>         kfree(disk);
>  }
>  struct class block_class = {

Looks ok at first glance to me.

We do hold a reference on the parent device, but it gets dropped at
device_unregister() time and this moves it out to the final put.
However, this does leave static devt block-device-drivers that
register a disk without a parent device susceptible to the race... I
think those exist given all the drivers still using add_disk() after
commit 52c44d93c26f "block: remove ->driverfs_dev".
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Aug. 13, 2016, 4:57 a.m. UTC | #2
On Fri, Aug 12, 2016 at 5:29 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Aug 12, 2016 at 5:17 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>> On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
>>> Before spending effort trying to flush the destruction of old bdi
>>> instances before new ones are registered, is it rather time to
>>> complete the conversion of sd to only use dynamically allocated devt?
>>
>> Do we have to go that far?  Surely your fix is extensible: the only
>> reason it doesn't work for us is that the gendisk holds the parent
>> without a reference, so we can free the SCSI device before its child
>> gendisk (good job no-one actually uses gendisk->parent after we've
>> released it ...).  If we fix that it would mean SCSI can't release the
>> sdev until after the queue is dead and the bdi namespace released, so
>> isn't something like this the easy fix?
>>
>> James
>>
>> ---
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index fcd6d4f..54ae4ae 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -514,7 +514,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
>>         struct hd_struct *part;
>>         int err;
>>
>> -       ddev->parent = parent;
>> +       ddev->parent = get_device(parent);
>>
>>         dev_set_name(ddev, "%s", disk->disk_name);
>>
>> @@ -1144,6 +1144,7 @@ static void disk_release(struct device *dev)
>>         hd_free_part(&disk->part0);
>>         if (disk->queue)
>>                 blk_put_queue(disk->queue);
>> +       put_device(dev->parent);
>>         kfree(disk);
>>  }
>>  struct class block_class = {
>
> Looks ok at first glance to me.
>
> We do hold a reference on the parent device, but it gets dropped at
> device_unregister() time and this moves it out to the final put.
> However, this does leave static devt block-device-drivers that
> register a disk without a parent device susceptible to the race... I
> think those exist given all the drivers still using add_disk() after
> commit 52c44d93c26f "block: remove ->driverfs_dev".

So I tried the attached and it makes the libnvdimm unit tests start
crashing.  A couple crash logs attached.  Not yet sure what assumption
is getting violated, but how about that conversion of scsi to use
dynamic devt? ;-)
[   30.149817] =========================
[   30.150729] [ BUG: held lock freed! ]
[   30.151644] 4.8.0-rc1+ #19 Tainted: G           O   
[   30.152717] -------------------------
[   30.153635] lt-libndctl/863 is freeing memory ffff880310b7e128-ffff880310b7e927, with a lock still held there!
[   30.155719]  (&dev->mutex){......}, at: [<ffffffff815f931d>] device_release_driver+0x1d/0x40
[   30.157927] 6 locks held by lt-libndctl/863:
[   30.158917]  #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff8128c537>] __sb_start_write+0xb7/0xf0
[   30.161267]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8131c191>] kernfs_fop_write+0x101/0x1c0
[   30.163545]  #2:  (s_active#3){++++.+}, at: [<ffffffff8131c199>] kernfs_fop_write+0x109/0x1c0
[   30.165893]  #3:  (&dev->mutex){......}, at: [<ffffffff815f7a3f>] unbind_store+0xff/0x160
[   30.168124]  #4:  (&dev->mutex){......}, at: [<ffffffff815f931d>] device_release_driver+0x1d/0x40
[   30.170435]  #5:  (&dev->mutex){......}, at: [<ffffffff815f931d>] device_release_driver+0x1d/0x40
[   30.172755] 
[   30.172755] stack backtrace:
[   30.174156] CPU: 0 PID: 863 Comm: lt-libndctl Tainted: G           O    4.8.0-rc1+ #19
[   30.175988] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[   30.178260]  0000000000000086 0000000099c8ecbb ffff8803394538d0 ffffffff814b4f23
[   30.180325]  ffff8803380c4a90 ffff880310b7e128 ffff880339453908 ffffffff81107653
[   30.182389]  ffff880310b7e128 0000000000000282 ffff88033b002940 ffffea000c42de00
[   30.184456] Call Trace:
[   30.185232]  [<ffffffff814b4f23>] dump_stack+0x85/0xc2
[   30.186328]  [<ffffffff81107653>] debug_check_no_locks_freed+0x153/0x160
[   30.187616]  [<ffffffffa003fb1e>] ? namespace_pmem_release+0x2e/0x40 [libnvdimm]
[   30.189391]  [<ffffffff8125e039>] kfree+0xc9/0x270
[   30.190445]  [<ffffffffa003fb1e>] namespace_pmem_release+0x2e/0x40 [libnvdimm]
[   30.192197]  [<ffffffff815f4042>] device_release+0x32/0x90
[   30.193333]  [<ffffffff814b7a1a>] kobject_release+0x6a/0x170
[   30.194494]  [<ffffffff814b78e7>] kobject_put+0x27/0x50
[   30.195602]  [<ffffffff815f4377>] put_device+0x17/0x20
[   30.196696]  [<ffffffff81493290>] disk_release+0xc0/0x100
[   30.197828]  [<ffffffff815f4042>] device_release+0x32/0x90
[   30.198966]  [<ffffffff814b7a1a>] kobject_release+0x6a/0x170
[   30.200126]  [<ffffffff814b78e7>] kobject_put+0x27/0x50
[   30.201230]  [<ffffffff815f4377>] put_device+0x17/0x20
[   30.202328]  [<ffffffff81214e0a>] bdi_unregister+0x20a/0x290
[   30.203482]  [<ffffffff81107289>] ? trace_hardirqs_on_caller+0x129/0x1b0
[   30.204769]  [<ffffffff8147e2ac>] blk_cleanup_queue+0x18c/0x280
[   30.205957]  [<ffffffffa00a117e>] pmem_release_queue+0xe/0x10 [nd_pmem]
[   30.207228]  [<ffffffff815fd49f>] devm_action_release+0xf/0x20
[   30.208407]  [<ffffffff815fdf79>] release_nodes+0x129/0x230
[   30.209556]  [<ffffffff815fe2ec>] devres_release_all+0x3c/0x60
[   30.210736]  [<ffffffff815f9249>] __device_release_driver+0xa9/0x160
[   30.211979]  [<ffffffff815f9325>] device_release_driver+0x25/0x40
[   30.213187]  [<ffffffff815f88a4>] bus_remove_device+0x114/0x190
[   30.214377]  [<ffffffff815f4978>] device_del+0x148/0x270
[   30.215490]  [<ffffffff815f4380>] ? put_device+0x20/0x20
[   30.216609]  [<ffffffffa003f770>] ? nd_region_remove+0x80/0x80 [libnvdimm]
[   30.217914]  [<ffffffff815f4aba>] device_unregister+0x1a/0x60
[   30.219085]  [<ffffffffa003b828>] nd_device_unregister+0x48/0x50 [libnvdimm]
[   30.220412]  [<ffffffffa003f780>] child_unregister+0x10/0x20 [libnvdimm]
[   30.221696]  [<ffffffff815f4470>] device_for_each_child+0x50/0x90
[   30.222910]  [<ffffffffa003f718>] nd_region_remove+0x28/0x80 [libnvdimm]
[   30.224195]  [<ffffffffa003bbe1>] nvdimm_bus_remove+0x41/0xa0 [libnvdimm]
[   30.225488]  [<ffffffff815f9241>] __device_release_driver+0xa1/0x160
[   30.226727]  [<ffffffff815f9325>] device_release_driver+0x25/0x40
[   30.227947]  [<ffffffff815f7a4f>] unbind_store+0x10f/0x160
[   30.229081]  [<ffffffff815f6fa5>] drv_attr_store+0x25/0x30
[   30.230215]  [<ffffffff8131cee5>] sysfs_kf_write+0x45/0x60
[   30.231352]  [<ffffffff8131c1c5>] kernfs_fop_write+0x135/0x1c0
[   30.232530]  [<ffffffff81288977>] __vfs_write+0x37/0x160
[   30.233643]  [<ffffffff81102f47>] ? update_fast_ctr+0x17/0x30
[   30.234814]  [<ffffffff81102fd9>] ? percpu_down_read+0x49/0x90
[   30.235988]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   30.237169]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   30.238344]  [<ffffffff812890d8>] vfs_write+0xb8/0x1b0
[   30.239444]  [<ffffffff8128a568>] SyS_write+0x58/0xc0
[   30.240534]  [<ffffffff8199837c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[   30.693773] BUG: unable to handle kernel paging request at 0000000040070088

[..]


[   30.695296] IP: [<ffffffff819886ae>] klist_put+0xe/0x90
[   30.696501] PGD 3383e2067 PUD 0 
[   30.697532] Oops: 0000 [#1] SMP
[   30.698403] Dumping ftrace buffer:
[   30.699306]    (ftrace buffer empty)
[   30.700228] Modules linked in: nd_blk(O) nfit_test(O) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_mangle iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_security ebtable_filter ebtables ip6table_filter ip6_tables dax_pmem(O) nd_pmem(O) nd_btt(O) dax(O) nd_e820(O) nfit(O) tpm_tis libnvdimm(O) tpm_tis_core tpm serio_raw nfit_test_iomap(O)
[   30.712449] CPU: 0 PID: 863 Comm: lt-libndctl Tainted: G           O    4.8.0-rc1+ #19
[   30.714325] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[   30.716650] task: ffff8803380c4240 task.stack: ffff880339450000
[   30.717868] RIP: 0010:[<ffffffff819886ae>]  [<ffffffff819886ae>] klist_put+0xe/0x90
[   30.719774] RSP: 0018:ffff880339453b40  EFLAGS: 00010246
[   30.720917] RAX: 0000000000000000 RBX: ffff8803380c4240 RCX: ffff8803380c4240
[   30.722277] RDX: ffffffff819887dd RSI: 0000000000000001 RDI: 0000000040070088
[   30.723647] RBP: ffff880339453b60 R08: ffffffff81f4eba0 R09: 0000000000000000
[   30.725026] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000040070088
[   30.726395] R13: ffffffffa004d3e0 R14: 0000000000000001 R15: fffffffffffffff2
[   30.727763] FS:  00007f253e15f840(0000) GS:ffff88033fc00000(0000) knlGS:0000000000000000
[   30.729641] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   30.730842] CR2: 0000000040070088 CR3: 00000003395d3000 CR4: 00000000000006f0
[   30.732212] Stack:
[   30.732941]  ffff8803380c4240 0000000040070088 ffffffffa004d3e0 ffff880310b79050
[   30.735047]  ffff880339453bc8 ffffffff819887ea ffff8803104ac038 ffff880324ba2848
[   30.737156]  ffffffff81f4eba0 ffffffff81f4eba0 0000000040070088 ffff8803380c4240
[   30.739249] Call Trace:
[   30.740029]  [<ffffffff819887ea>] klist_remove+0x7a/0xe0
[   30.741166]  [<ffffffff815f9291>] __device_release_driver+0xf1/0x160
[   30.742434]  [<ffffffff815f9325>] device_release_driver+0x25/0x40
[   30.743668]  [<ffffffff815f88a4>] bus_remove_device+0x114/0x190
[   30.744892]  [<ffffffff815f4978>] device_del+0x148/0x270
[   30.746038]  [<ffffffff815f4380>] ? put_device+0x20/0x20
[   30.747185]  [<ffffffffa003f770>] ? nd_region_remove+0x80/0x80 [libnvdimm]
[   30.748520]  [<ffffffff815f4aba>] device_unregister+0x1a/0x60
[   30.749715]  [<ffffffffa003b828>] nd_device_unregister+0x48/0x50 [libnvdimm]
[   30.751088]  [<ffffffffa003f780>] child_unregister+0x10/0x20 [libnvdimm]
[   30.752401]  [<ffffffff815f4470>] device_for_each_child+0x50/0x90
[   30.753650]  [<ffffffffa003f718>] nd_region_remove+0x28/0x80 [libnvdimm]
[   30.754979]  [<ffffffffa003bbe1>] nvdimm_bus_remove+0x41/0xa0 [libnvdimm]
[   30.756307]  [<ffffffff815f9241>] __device_release_driver+0xa1/0x160
[   30.757583]  [<ffffffff815f9325>] device_release_driver+0x25/0x40
[   30.758828]  [<ffffffff815f7a4f>] unbind_store+0x10f/0x160
[   30.759986]  [<ffffffff815f6fa5>] drv_attr_store+0x25/0x30
[   30.761145]  [<ffffffff8131cee5>] sysfs_kf_write+0x45/0x60
[   30.762299]  [<ffffffff8131c1c5>] kernfs_fop_write+0x135/0x1c0
[   30.763502]  [<ffffffff81288977>] __vfs_write+0x37/0x160
[   30.764641]  [<ffffffff81102f47>] ? update_fast_ctr+0x17/0x30
[   30.765845]  [<ffffffff81102fd9>] ? percpu_down_read+0x49/0x90
[   30.767055]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   30.768265]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   30.769477]  [<ffffffff812890d8>] vfs_write+0xb8/0x1b0
[   30.770592]  [<ffffffff8128a568>] SyS_write+0x58/0xc0
[   30.771697]  [<ffffffff8199837c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[   30.772979] Code: f5 55 5d 00 01 e8 b3 6e 72 ff e9 63 ff ff ff e8 69 f5 00 00 e9 72 ff ff ff 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 41 89 f6 <4c> 8b 27 48 89 fb 49 83 e4 fe 4c 89 e7 4d 8b 6c 24 50 e8 3b f2 
[   30.781430] RIP  [<ffffffff819886ae>] klist_put+0xe/0x90
[   30.782642]  RSP <ffff880339453b40>
[   30.783550] CR2: 0000000040070088
[   30.784495] ---[ end trace 0dcefe91828fba52 ]---
[   30.786917] Kernel panic - not syncing: Fatal exception
[   30.788400] Dumping ftrace buffer:
[   30.789305]    (ftrace buffer empty)
[   30.790225] Kernel Offset: disabled
[   30.791137] Rebooting in 5 seconds..
[   34.255986] general protection fault: 0000 [#1] SMP
[   34.257026] Dumping ftrace buffer:
[   34.257859]    (ftrace buffer empty)
[   34.258713] Modules linked in: nd_blk(O) nfit_test(O) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables nd_pmem(O) dax_pmem(O) dax(O) nd_btt(O) serio_raw nd_e820(O) nfit(O) libnvdimm(O) tpm_tis tpm_tis_core tpm nfit_test_iomap(O)
[   34.269688] CPU: 1 PID: 871 Comm: lt-libndctl Tainted: G           O    4.8.0-rc1+ #19
[   34.271364] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[   34.273450] task: ffff880338b64240 task.stack: ffff880338a80000
[   34.274564] RIP: 0010:[<ffffffff819888ae>]  [<ffffffff819888ae>] klist_next+0x5e/0x120
[   34.276327] RSP: 0018:ffff880338a83c90  EFLAGS: 00010217
[   34.277371] RAX: ffff880313384278 RBX: 6b6b6b6b6b6b6b63 RCX: c2482a36b45bcd4d
[   34.278614] RDX: 0000000000000001 RSI: 0000000082f7649c RDI: ffff880313384248
[   34.279864] RBP: ffff880338a83cb8 R08: 0000000000000000 R09: 0000000000000000
[   34.281114] R10: 0000000000000000 R11: 0000000000003436 R12: ffff880338a83cc8
[   34.282356] R13: 0000000000000000 R14: ffff880313384a80 R15: 0000000000000000
[   34.283593] FS:  00007fa1a7e06840(0000) GS:ffff88033fd00000(0000) knlGS:0000000000000000
[   34.285306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.286404] CR2: 0000561d5afc9208 CR3: 000000033891c000 CR4: 00000000000006e0
[   34.287654] Stack:
[   34.288339]  0000000000000000 0000000000000000 ffffffffa002e770 ffffffffa017d100
[   34.290258]  fffffffffffffff2 ffff880338a83cf8 ffffffff815f447b ffff880313384248
[   34.301588]  0000000000000000 0000000070757190 ffff880312c49868 ffff880312c49868
[   34.303488] Call Trace:
[   34.304208]  [<ffffffffa002e770>] ? nd_region_remove+0x80/0x80 [libnvdimm]
[   34.305428]  [<ffffffff815f447b>] device_for_each_child+0x5b/0x90
[   34.306566]  [<ffffffffa002e718>] nd_region_remove+0x28/0x80 [libnvdimm]
[   34.307776]  [<ffffffffa002abe1>] nvdimm_bus_remove+0x41/0xa0 [libnvdimm]
[   34.308983]  [<ffffffff815f9241>] __device_release_driver+0xa1/0x160
[   34.310148]  [<ffffffff815f9325>] device_release_driver+0x25/0x40
[   34.311273]  [<ffffffff815f7a4f>] unbind_store+0x10f/0x160
[   34.312330]  [<ffffffff815f6fa5>] drv_attr_store+0x25/0x30
[   34.313391]  [<ffffffff8131cee5>] sysfs_kf_write+0x45/0x60
[   34.314451]  [<ffffffff8131c1c5>] kernfs_fop_write+0x135/0x1c0
[   34.315557]  [<ffffffff81288977>] __vfs_write+0x37/0x160
[   34.316610]  [<ffffffff81102f47>] ? update_fast_ctr+0x17/0x30
[   34.317704]  [<ffffffff81102fd9>] ? percpu_down_read+0x49/0x90
[   34.318805]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   34.319906]  [<ffffffff8128c537>] ? __sb_start_write+0xb7/0xf0
[   34.321010]  [<ffffffff812890d8>] vfs_write+0xb8/0x1b0
[   34.322029]  [<ffffffff8128a568>] SyS_write+0x58/0xc0
[   34.323039]  [<ffffffff8199837c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[   34.324208] Code: 8d 58 f8 f0 41 83 6e 18 01 74 60 49 8b 3c 24 45 31 ed 48 8d 47 30 45 31 ff 49 c7 44 24 08 00 00 00 00 48 39 c3 0f 84 b4 00 00 00 <f6> 03 01 75 70 b8 01 00 00 00 f0 0f c1 43 18 83 c0 01 83 f8 01 
[   34.331684] RIP  [<ffffffff819888ae>] klist_next+0x5e/0x120
[   34.332798]  RSP <ffff880338a83c90>
[   34.333659] ---[ end trace db64022aaca42534 ]---
James Bottomley Aug. 13, 2016, 3:23 p.m. UTC | #3
On Fri, 2016-08-12 at 21:57 -0700, Dan Williams wrote:
> On Fri, Aug 12, 2016 at 5:29 PM, Dan Williams <
> dan.j.williams@intel.com> wrote:
> > On Fri, Aug 12, 2016 at 5:17 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > > On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
> > > > Before spending effort trying to flush the destruction of old
> > > > bdi
> > > > instances before new ones are registered, is it rather time to
> > > > complete the conversion of sd to only use dynamically allocated
> > > > devt?
> > > 
> > > Do we have to go that far?  Surely your fix is extensible: the
> > > only
> > > reason it doesn't work for us is that the gendisk holds the
> > > parent
> > > without a reference, so we can free the SCSI device before its
> > > child
> > > gendisk (good job no-one actually uses gendisk->parent after
> > > we've
> > > released it ...).  If we fix that it would mean SCSI can't
> > > release the
> > > sdev until after the queue is dead and the bdi namespace
> > > released, so
> > > isn't something like this the easy fix?
> > > 
> > > James
> > > 
> > > ---
> > > 
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index fcd6d4f..54ae4ae 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -514,7 +514,7 @@ static void register_disk(struct device
> > > *parent, struct gendisk *disk)
> > >         struct hd_struct *part;
> > >         int err;
> > > 
> > > -       ddev->parent = parent;
> > > +       ddev->parent = get_device(parent);
> > > 
> > >         dev_set_name(ddev, "%s", disk->disk_name);
> > > 
> > > @@ -1144,6 +1144,7 @@ static void disk_release(struct device
> > > *dev)
> > >         hd_free_part(&disk->part0);
> > >         if (disk->queue)
> > >                 blk_put_queue(disk->queue);
> > > +       put_device(dev->parent);
> > >         kfree(disk);
> > >  }
> > >  struct class block_class = {
> > 
> > Looks ok at first glance to me.
> > 
> > We do hold a reference on the parent device, but it gets dropped at
> > device_unregister() time and this moves it out to the final put.

We do?  Where?

> > However, this does leave static devt block-device-drivers that
> > register a disk without a parent device susceptible to the race... 
> > I think those exist given all the drivers still using add_disk()
> > after commit 52c44d93c26f "block: remove ->driverfs_dev".

It does?  The race is the fact that the parent can be removed before
the child meaning if the parent name is re-registered before the child
dies we get a duplicate name in bdi space.

> So I tried the attached and it makes the libnvdimm unit tests start
> crashing.

Well, the attached is clearly buggy, isn't it?  You're trying to do a
get on the parent before the parent is actually set.  Why don't you
just try the incremental patch I sent instead of trying to rework it?

>   A couple crash logs attached.  Not yet sure what assumption
> is getting violated, but how about that conversion of scsi to use
> dynamic devt? ;-)

It's completely orthogonal.  The problem is in hierarchy lifetimes:
switching from static to dynamic allocation won't change that at all. 
 You don't see this problem in nvme because the parent control device's
lifetime belongs to the controller not the disk.  In SCSI the parent is
our representation of the SCSI device whose lifetime is governed at the
SCSI level and effectively represents the disk.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Aug. 13, 2016, 4:29 p.m. UTC | #4
On Sat, Aug 13, 2016 at 8:23 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Fri, 2016-08-12 at 21:57 -0700, Dan Williams wrote:
>> On Fri, Aug 12, 2016 at 5:29 PM, Dan Williams <
>> dan.j.williams@intel.com> wrote:
>> > On Fri, Aug 12, 2016 at 5:17 PM, James Bottomley
>> > <James.Bottomley@hansenpartnership.com> wrote:
>> > > On Fri, 2016-08-12 at 14:29 -0700, Dan Williams wrote:
>> > > > Before spending effort trying to flush the destruction of old
>> > > > bdi
>> > > > instances before new ones are registered, is it rather time to
>> > > > complete the conversion of sd to only use dynamically allocated
>> > > > devt?
>> > >
>> > > Do we have to go that far?  Surely your fix is extensible: the
>> > > only
>> > > reason it doesn't work for us is that the gendisk holds the
>> > > parent
>> > > without a reference, so we can free the SCSI device before its
>> > > child
>> > > gendisk (good job no-one actually uses gendisk->parent after
>> > > we've
>> > > released it ...).  If we fix that it would mean SCSI can't
>> > > release the
>> > > sdev until after the queue is dead and the bdi namespace
>> > > released, so
>> > > isn't something like this the easy fix?
>> > >
>> > > James
>> > >
>> > > ---
>> > >
>> > > diff --git a/block/genhd.c b/block/genhd.c
>> > > index fcd6d4f..54ae4ae 100644
>> > > --- a/block/genhd.c
>> > > +++ b/block/genhd.c
>> > > @@ -514,7 +514,7 @@ static void register_disk(struct device
>> > > *parent, struct gendisk *disk)
>> > >         struct hd_struct *part;
>> > >         int err;
>> > >
>> > > -       ddev->parent = parent;
>> > > +       ddev->parent = get_device(parent);
>> > >
>> > >         dev_set_name(ddev, "%s", disk->disk_name);
>> > >
>> > > @@ -1144,6 +1144,7 @@ static void disk_release(struct device
>> > > *dev)
>> > >         hd_free_part(&disk->part0);
>> > >         if (disk->queue)
>> > >                 blk_put_queue(disk->queue);
>> > > +       put_device(dev->parent);
>> > >         kfree(disk);
>> > >  }
>> > >  struct class block_class = {
>> >
>> > Looks ok at first glance to me.
>> >
>> > We do hold a reference on the parent device, but it gets dropped at
>> > device_unregister() time and this moves it out to the final put.
>
> We do?  Where?

Yes, register_disk() does "ddev->parent = parent" and then
"device_add(ddev)".  device_add() takes the parent reference.

>
>> > However, this does leave static devt block-device-drivers that
>> > register a disk without a parent device susceptible to the race...
>> > I think those exist given all the drivers still using add_disk()
>> > after commit 52c44d93c26f "block: remove ->driverfs_dev".
>
> It does?  The race is the fact that the parent can be removed before
> the child meaning if the parent name is re-registered before the child
> dies we get a duplicate name in bdi space.

No, the race is that the *name* of the parent isn't released until the
child is both unregistered and put.  The device core is already
ensuring that the parent is not released until all descendants have
been removed.

>
>> So I tried the attached and it makes the libnvdimm unit tests start
>> crashing.
>
> Well, the attached is clearly buggy, isn't it?  You're trying to do a
> get on the parent before the parent is actually set.

Ah, yes, thank you.  Fixed up v2 attached that passes my tests.

> Why don't you
> just try the incremental patch I sent instead of trying to rework it?

I reworked it because it is the bdi that holds this extra dependency
on the disk's parent, not the disk itself.

>
>>   A couple crash logs attached.  Not yet sure what assumption
>> is getting violated, but how about that conversion of scsi to use
>> dynamic devt? ;-)
>
> It's completely orthogonal.  The problem is in hierarchy lifetimes:
> switching from static to dynamic allocation won't change that at all.
>  You don't see this problem in nvme because the parent control device's
> lifetime belongs to the controller not the disk.  In SCSI the parent is
> our representation of the SCSI device whose lifetime is governed at the
> SCSI level and effectively represents the disk.
>

No, it's only the name.  We could achieve the same by teaching the
block core to manage the "sd_index_ida" instead of the sd driver
itself, but the v2-patch attached works and does not introduce that
layering violation.
James Bottomley Aug. 13, 2016, 5:43 p.m. UTC | #5
On Sat, 2016-08-13 at 09:29 -0700, Dan Williams wrote:
> On Sat, Aug 13, 2016 at 8:23 AM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > It does?  The race is the fact that the parent can be removed 
> > before the child meaning if the parent name is re-registered before 
> > the child dies we get a duplicate name in bdi space.
> 
> No, the race is that the *name* of the parent isn't released until 
> the child is both unregistered and put.  The device core is already
> ensuring that the parent is not released until all descendants have
> been removed.

We're both saying the same thing: the problem is that, with
df08c32ce3be the bdi name lifetime is tied to the lifetime of the
gendisk.  However, the parent of the gendisk currently is only tied to
the visibility lifetime of the gendisk, not the final put lifetime, so
it doesn't see this.

> > 
> > > So I tried the attached and it makes the libnvdimm unit tests 
> > > start crashing.
> > 
> > Well, the attached is clearly buggy, isn't it?  You're trying to do 
> > a get on the parent before the parent is actually set.
> 
> Ah, yes, thank you.  Fixed up v2 attached that passes my tests.
> 
> > Why don't you just try the incremental patch I sent instead of 
> > trying to rework it?
> 
> I reworked it because it is the bdi that holds this extra dependency
> on the disk's parent, not the disk itself.

Philosophically I don't like this approach.  The dependency goes 

bdi->gendisk->parent

Making the bdi manage the parent lifetime is an unusual pattern. 
 Making the parent stay around until the last reference to gendisk is
put is the usual one.

> > >   A couple crash logs attached.  Not yet sure what assumption
> > > is getting violated, but how about that conversion of scsi to use
> > > dynamic devt? ;-)
> > 
> > It's completely orthogonal.  The problem is in hierarchy lifetimes:
> > switching from static to dynamic allocation won't change that at 
> > all.  You don't see this problem in nvme because the parent control
> > device's lifetime belongs to the controller not the disk.  In SCSI 
> > the parent is our representation of the SCSI device whose lifetime 
> > is governed at the SCSI level and effectively represents the disk.
> > 
> 
> No, it's only the name.  We could achieve the same by teaching the
> block core to manage the "sd_index_ida" instead of the sd driver
> itself, but the v2-patch attached works and does not introduce that
> layering violation.

Um, so this patch doesn't fix the problem. It merely makes the lifetime
rules correct so the problem can then be fixed at the scsi level.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Aug. 13, 2016, 6:27 p.m. UTC | #6
On Sat, Aug 13, 2016 at 10:43 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Sat, 2016-08-13 at 09:29 -0700, Dan Williams wrote:
>> On Sat, Aug 13, 2016 at 8:23 AM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > It does?  The race is the fact that the parent can be removed
>> > before the child meaning if the parent name is re-registered before
>> > the child dies we get a duplicate name in bdi space.
>>
>> No, the race is that the *name* of the parent isn't released until
>> the child is both unregistered and put.  The device core is already
>> ensuring that the parent is not released until all descendants have
>> been removed.
>
> We're both saying the same thing: the problem is that, with
> df08c32ce3be the bdi name lifetime is tied to the lifetime of the
> gendisk.  However, the parent of the gendisk currently is only tied to
> the visibility lifetime of the gendisk, not the final put lifetime, so
> it doesn't see this.
>
>> >
>> > > So I tried the attached and it makes the libnvdimm unit tests
>> > > start crashing.
>> >
>> > Well, the attached is clearly buggy, isn't it?  You're trying to do
>> > a get on the parent before the parent is actually set.
>>
>> Ah, yes, thank you.  Fixed up v2 attached that passes my tests.
>>
>> > Why don't you just try the incremental patch I sent instead of
>> > trying to rework it?
>>
>> I reworked it because it is the bdi that holds this extra dependency
>> on the disk's parent, not the disk itself.
>
> Philosophically I don't like this approach.  The dependency goes
>
> bdi->gendisk->parent

I'm arguing that there is no bdi->gendisk dependency.

The dependency is:

bdi->devt

It just so happens that block-dynamic devt is released in disk_release().

> Making the bdi manage the parent lifetime is an unusual pattern.
>  Making the parent stay around until the last reference to gendisk is
> put is the usual one.

What's unusual is the bdi's dependency on the allocated name, not the
gendisk itself.

>> > >   A couple crash logs attached.  Not yet sure what assumption
>> > > is getting violated, but how about that conversion of scsi to use
>> > > dynamic devt? ;-)
>> >
>> > It's completely orthogonal.  The problem is in hierarchy lifetimes:
>> > switching from static to dynamic allocation won't change that at
>> > all.  You don't see this problem in nvme because the parent control
>> > device's lifetime belongs to the controller not the disk.  In SCSI
>> > the parent is our representation of the SCSI device whose lifetime
>> > is governed at the SCSI level and effectively represents the disk.
>> >
>>
>> No, it's only the name.  We could achieve the same by teaching the
>> block core to manage the "sd_index_ida" instead of the sd driver
>> itself, but the v2-patch attached works and does not introduce that
>> layering violation.
>
> Um, so this patch doesn't fix the problem. It merely makes the lifetime
> rules correct so the problem can then be fixed at the scsi level.

You're right that this patch does not fix the problem, I missed that
the scsi_disk is not the parent of the gendisk, so this patch does
nothing to delay scsi_disk_release.  What I think is the real fix is
to make the devt properly reference counted and prevent
ida_remove(&sd_index_ida, sdkp->index); from being called until all
objects derived from that allocation are done with it.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Aug. 13, 2016, 8:38 p.m. UTC | #7
On Sat, Aug 13, 2016 at 11:27 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Aug 13, 2016 at 10:43 AM, James Bottomley
[..]
>> Um, so this patch doesn't fix the problem. It merely makes the lifetime
>> rules correct so the problem can then be fixed at the scsi level.
>
> You're right that this patch does not fix the problem, I missed that
> the scsi_disk is not the parent of the gendisk, so this patch does
> nothing to delay scsi_disk_release.  What I think is the real fix is
> to make the devt properly reference counted and prevent
> ida_remove(&sd_index_ida, sdkp->index); from being called until all
> objects derived from that allocation are done with it.

So, here's an RFC along these lines (only compile tested).   This
approach provides a generic way to coordinate the lifetime of the bdi
name versus the devt allocated by the block device driver.
diff mbox

Patch

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4f..54ae4ae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -514,7 +514,7 @@  static void register_disk(struct device *parent, struct gendisk *disk)
 	struct hd_struct *part;
 	int err;
 
-	ddev->parent = parent;
+	ddev->parent = get_device(parent);
 
 	dev_set_name(ddev, "%s", disk->disk_name);
 
@@ -1144,6 +1144,7 @@  static void disk_release(struct device *dev)
 	hd_free_part(&disk->part0);
 	if (disk->queue)
 		blk_put_queue(disk->queue);
+	put_device(dev->parent);
 	kfree(disk);
 }
 struct class block_class = {