Message ID | 20171023145126.2471-7-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as a
> user-facing block device. This will be useful for the NVMe multipath
> implementation.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
>  include/linux/genhd.h |  1 +
>  2 files changed, 40 insertions(+), 18 deletions(-)
>
Can we have some information in sysfs to figure out if a gendisk is
hidden? I'd hate having to do an inverse lookup in /proc/partitions;
it's always hard to prove that something is _not_ present.
And we already present various information (like disk_removable_show()),
so it's not without precedent.
And it would make integration with systemd/udev easier.

Cheers,

Hannes
On Mon, Oct 23 2017 at 10:51am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as a
> user-facing block device. This will be useful for the NVMe multipath
> implementation.

Having the NVMe driver go to such lengths to hide its resources from
upper layers is certainly the work of an evil genius experiencing some
serious territorial issues. Not sugar-coating it.. you wouldn't.

I kept meaning to reply to your earlier iterations of this series to
ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
that the NVMe driver doesn't implicitly consume (and hide) all
per-controller devices?

Ah well. There is only one correct way to do NVMe multipathing after
all, right?
On Tue, Oct 24, 2017 at 09:18:47AM +0200, Hannes Reinecke wrote:
> Can we have some information in sysfs to figure out if a gendisk is
> hidden? I'd hate having to do an inverse lookup in /proc/partitions;
> it's always hard to prove that something is _not_ present.
> And we already present various information (like disk_removable_show()),
> so it's not without precedent.
> And it would make integration with systemd/udev easier.

Sure, I'll add a hidden flag sysfs file.
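For reference, such an attribute could look roughly like the sketch below,
modeled on the existing disk_removable_show() in block/genhd.c that Hannes
mentions; the attribute name "hidden" and the exact formatting are
assumptions here, not the eventual follow-up patch.

/* Sketch only: expose GENHD_FL_HIDDEN as /sys/block/<disk>/hidden. */
static ssize_t disk_hidden_show(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	struct gendisk *disk = dev_to_disk(dev);

	return sprintf(buf, "%d\n",
		       (disk->flags & GENHD_FL_HIDDEN ? 1 : 0));
}

static DEVICE_ATTR(hidden, 0444, disk_hidden_show, NULL);

udev and systemd could then test the attribute directly (for example,
ATTR{hidden}=="0" in a rule) instead of proving absence via
/proc/partitions.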
On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> Having the NVMe driver go to such lengths to hide its resources from
> upper layers is certainly the work of an evil genius experiencing some
> serious territorial issues. Not sugar-coating it.. you wouldn't.

I'm pretty sure Hannes will appreciate being called an evil genius :)

> I kept meaning to reply to your earlier iterations of this series to
> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> that the NVMe driver doesn't implicitly consume (and hide) all
> per-controller devices?

I thought about adding it, but mostly for a different reason: it's
quite a bit of code, and we now start to see NVMe in deeply embedded
contexts, e.g. the latest CompactFlash spec is based on NVMe, so it
might be a good idea to give people a chance to avoid the overhead.

> Ah well. There is only one correct way to do NVMe multipathing after
> all, right?

I don't think you'll get very useful results, even if you try. But I
guess we'll just have to tell people to use SuSE if they want NVMe
multipathing to work then :)
Hi Christoph,

On 2017/10/28 14:38, Christoph Hellwig wrote:
> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
>> Having the NVMe driver go to such lengths to hide its resources from
>> upper layers is certainly the work of an evil genius experiencing some
>> serious territorial issues. Not sugar-coating it.. you wouldn't.
>
> I'm pretty sure Hannes will appreciate being called an evil genius :)
>
>> I kept meaning to reply to your earlier iterations of this series to
>> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
>> that the NVMe driver doesn't implicitly consume (and hide) all
>> per-controller devices?
>
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest CompactFlash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.
>

Think of some of the current advanced features of DM-Multipath combined
with multipath-tools, such as path-latency priority grouping,
intermittent I/O error accounting for path degradation, and delayed,
immediate or follow-over failback. Those features, which are significant
in some scenarios, need the per-controller block devices. Therefore, I
think it is worth adding a CONFIG_NVME_MULTIPATHING knob to hide or show
the per-controller block devices.

How about letting me add this CONFIG_NVME_MULTIPATHING knob?

Regards
Guan

>> Ah well. There is only one correct way to do NVMe multipathing after
>> all, right?
>
> I don't think you'll get very useful results, even if you try. But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)
On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
> Think of some of the current advanced features of DM-Multipath combined
> with multipath-tools, such as path-latency priority grouping,
> intermittent I/O error accounting for path degradation, and delayed,
> immediate or follow-over failback. Those features, which are significant
> in some scenarios, need the per-controller block devices. Therefore, I
> think it is worth adding a CONFIG_NVME_MULTIPATHING knob to hide or show
> the per-controller block devices.
>
> How about letting me add this CONFIG_NVME_MULTIPATHING knob?

There is definitely not going to be a run-time knob, sorry. You've
spent this work on dm-multipath after the nvme group pretty clearly
said that this is not what we are going to support, and we have no
interest in supporting it. Especially as proper path discovery,
asymmetric namespace states and similar can only be supported properly
with the in-kernel code.
On 2017/10/28 15:42, Christoph Hellwig wrote:
> On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
>> Think of some of the current advanced features of DM-Multipath combined
>> with multipath-tools, such as path-latency priority grouping,
>> intermittent I/O error accounting for path degradation, and delayed,
>> immediate or follow-over failback. Those features, which are significant
>> in some scenarios, need the per-controller block devices. Therefore, I
>> think it is worth adding a CONFIG_NVME_MULTIPATHING knob to hide or show
>> the per-controller block devices.
>>
>> How about letting me add this CONFIG_NVME_MULTIPATHING knob?
>
> There is definitely not going to be a run-time knob, sorry. You've
> spent this work on dm-multipath after the nvme group pretty clearly
> said that this is not what we are going to support, and we have no
> interest in supporting it. Especially as proper path discovery,
> asymmetric namespace states and similar can only be supported properly
> with the in-kernel code.
>

Does that mean some extra work, such as:
1) showing the path topology of an NVMe multipath device
2) a daemon to implement immediate and delayed failback
3) detecting sub-healthy paths due to a shaky link
4) grouping paths besides ANA
... and so on, needs to be integrated into a user-space tool such as
nvme-cli or a newly invented user-space tool named "nvme-mpath"?
Which do you prefer? The kernel also needs to export more setting and
getting methods.

Regards
Guan
On Sat, Oct 28 2017 at 2:38am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> > Having the NVMe driver go to such lengths to hide its resources from
> > upper layers is certainly the work of an evil genius experiencing some
> > serious territorial issues. Not sugar-coating it.. you wouldn't.
>
> I'm pretty sure Hannes will appreciate being called an evil genius :)

Well, to be fair.. it doesn't take that much brain power to arrive at
isolationism. And pretty sure Hannes had a better idea with using the
blkdev_get (holders/claim) interface, but you quickly dismissed that.
Despite it being the best _existing_ way to ensure mutual exclusion with
the rest of the block layer.

> > I kept meaning to reply to your earlier iterations of this series to
> > ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> > that the NVMe driver doesn't implicitly consume (and hide) all
> > per-controller devices?
>
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest CompactFlash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.

OK, so please add it.

> > Ah well. There is only one correct way to do NVMe multipathing after
> > all, right?
>
> I don't think you'll get very useful results, even if you try. But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)

Don't do that. Don't assume you know it all. Don't fabricate vendor
wars in your head and then project them out to the rest of the
community. We're all in this together.

Fact is, Hannes and I have exchanged private mail (in response to this
very thread) and we agree that your approach is currently not suitable
for enterprise deployment. Hannes also needs to deal with the coming
duality of Linux multipathing, and you aren't making it easy. Just
because you're able to be myopic doesn't mean the rest of us, with way
more direct customers behind us, can be.

You're finding your way with this new multipath model and I'm happy to
see that happen. But what I'm not happy about is the steps you're
taking to be actively disruptive. There were so many ways this all
could've gone, and sadly you've elected to stand up an architecture that
doesn't even allow the prospect of reuse. And for what? Because doing
so takes 10% more work? Well, we can backfill that work if you'll grant
us an open mind.

I'm really not against you. I just need very basic controls put into
the NVMe multipathing code that allow it to be disabled (yet reused).
Not that I have immediate plans to actually _use_ it. My hope is I can
delegate NVMe multipathing to NVMe! But I need the latitude to find my
way with the requirements I am, or will be, needing to consider.

Please don't paint me and others in my position into a corner.

Mike
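For readers unfamiliar with the interface Mike refers to, the existing
exclusive-open ("claim") path in the block layer looked roughly like the
sketch below at the time; the device path and holder cookie are
placeholders for illustration, not code from this series.

#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Sketch of the blkdev_get holders/claim mechanism: an in-kernel user
 * opens the device with FMODE_EXCL and a holder cookie, and any other
 * exclusive opener fails with -EBUSY until blkdev_put() drops the claim.
 */
static struct block_device *claim_example(void *holder)
{
	return blkdev_get_by_path("/dev/nvme0n1",	/* placeholder path */
				  FMODE_READ | FMODE_WRITE | FMODE_EXCL,
				  holder);
}

static void release_example(struct block_device *bdev)
{
	blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
}

This is the same claiming scheme dm and md use to keep other users off
their component devices, which is why it came up as an alternative to
hiding the gendisk entirely.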
On Saturday, October 28, 2017 3:10 AM, Guan Junxiong wrote:

> Does that mean some extra work, such as:
> 1) showing the path topology of an NVMe multipath device

Isn't the connection to the target supposed to guide the host to the
shortest and fastest available path, the so-called most optimized path?
A sysfs entry can easily store that, as we store path-related info
there.

> 2) a daemon to implement immediate and delayed failback

This is also based on the target: whenever the target is ready for
immediate or delayed failback, it will let the connect from the host
succeed. Until then the host stays in a reconnecting state across all
paths, until it finds an optimized or non-optimized path. Why is this
extra daemon needed?

> 4) grouping paths besides ANA

Why can't we use NS Groups here? It would be a good idea to move away
from the legacy way of doing things for NVMe devices.

The NVMe multipath implementation by Christoph is very simple. How about
not making it super complicated?

Best regards,
Anish
On Sat, Oct 28, 2017 at 06:09:46PM +0800, Guan Junxiong wrote:
> Does that mean some extra work, such as:
> 1) showing the path topology of an NVMe multipath device

It's in sysfs. And Johannes volunteered to also add nvme-cli support.

> 2) a daemon to implement immediate and delayed failback

The whole point is to not have a daemon.

> 3) detecting sub-healthy paths due to a shaky link

We can do this in kernel space. It just needs someone to implement it.

> 4) grouping paths besides ANA

We don't want to do non-standard grouping. Please work with the NVMe
working group for your grouping ideas.
On 10/28/2017 04:17 PM, Mike Snitzer wrote:
> On Sat, Oct 28 2017 at 2:38am -0400,
> Christoph Hellwig <hch@lst.de> wrote:
>
>> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
[ .. ]
>>> Ah well. There is only one correct way to do NVMe multipathing after
>>> all, right?
>>
>> I don't think you'll get very useful results, even if you try. But I
>> guess we'll just have to tell people to use SuSE if they want NVMe
>> multipathing to work then :)
>
> Don't do that. Don't assume you know it all. Don't fabricate vendor
> wars in your head and then project them out to the rest of the
> community. We're all in this together.
>
> Fact is, Hannes and I have exchanged private mail (in response to this
> very thread) and we agree that your approach is currently not suitable
> for enterprise deployment. Hannes also needs to deal with the coming
> duality of Linux multipathing, and you aren't making it easy. Just
> because you're able to be myopic doesn't mean the rest of us, with way
> more direct customers behind us, can be.
>
> You're finding your way with this new multipath model and I'm happy to
> see that happen. But what I'm not happy about is the steps you're
> taking to be actively disruptive. There were so many ways this all
> could've gone, and sadly you've elected to stand up an architecture that
> doesn't even allow the prospect of reuse. And for what? Because doing
> so takes 10% more work? Well, we can backfill that work if you'll grant
> us an open mind.
>
> I'm really not against you. I just need very basic controls put into
> the NVMe multipathing code that allow it to be disabled (yet reused).
> Not that I have immediate plans to actually _use_ it. My hope is I can
> delegate NVMe multipathing to NVMe! But I need the latitude to find my
> way with the requirements I am, or will be, needing to consider.
>
> Please don't paint me and others in my position into a corner.
>
To add my two cents from the SUSE perspective:

We have _quite_ some deployments of dm-multipathing, and most of our
customers are quite accustomed to setting up deployments with it.
It will be impossible to ensure that all customers and installation
scripts will _not_ try starting up multipathing, and in some scenarios
they will even prefer to use dm-multipath just because their tooling is
geared up for that.

So we absolutely need peaceful coexistence between dm-multipathing and
NVMe multipathing. The precise level can be discussed (whether it is a
global on/off switch or some more fine-grained thingie), but we simply
cannot declare NVMe multipathing the one true way for NVMe. The
support calls alone will kill us.

Note: this has _nothing_ to do with performance. I'm perfectly fine
accepting that dm-multipath has sub-optimal performance. But that should
not imply that it cannot be used for NVMe.

After all, Linux is about choice, not about forcing users to do things
in one way only.

Cheers,

Hannes
diff --git a/block/genhd.c b/block/genhd.c
index 1174d24e405e..11a41cca3475 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -585,6 +585,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		dev_set_uevent_suppress(ddev, 0);
+		return;
+	}
+
 	disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
 
@@ -616,6 +621,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter)))
 		kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
 	disk_part_iter_exit(&piter);
+
+	err = sysfs_create_link(&ddev->kobj,
+				&disk->queue->backing_dev_info->dev->kobj,
+				"bdi");
+	WARN_ON(err);
 }
 
 /**
@@ -630,7 +640,6 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  */
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-	struct backing_dev_info *bdi;
 	dev_t devt;
 	int retval;
 
@@ -639,7 +648,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 * parameters make sense.
 	 */
 	WARN_ON(disk->minors && !(disk->major || disk->first_minor));
-	WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
+	WARN_ON(!disk->minors &&
+		!(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));
 
 	disk->flags |= GENHD_FL_UP;
 
@@ -648,18 +658,26 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 		WARN_ON(1);
 		return;
 	}
-	disk_to_dev(disk)->devt = devt;
 	disk->major = MAJOR(devt);
 	disk->first_minor = MINOR(devt);
 
 	disk_alloc_events(disk);
 
-	/* Register BDI before referencing it from bdev */
-	bdi = disk->queue->backing_dev_info;
-	bdi_register_owner(bdi, disk_to_dev(disk));
-
-	blk_register_region(disk_devt(disk), disk->minors, NULL,
-			    exact_match, exact_lock, disk);
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		/*
+		 * Don't let hidden disks show up in /proc/partitions,
+		 * and don't bother scanning for partitions either.
+		 */
+		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+		disk->flags |= GENHD_FL_NO_PART_SCAN;
+	} else {
+		/* Register BDI before referencing it from bdev */
+		disk_to_dev(disk)->devt = devt;
+		bdi_register_owner(disk->queue->backing_dev_info,
+				disk_to_dev(disk));
+		blk_register_region(disk_devt(disk), disk->minors, NULL,
+				    exact_match, exact_lock, disk);
+	}
 	register_disk(parent, disk);
 	blk_register_queue(disk);
 
@@ -669,10 +687,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 */
 	WARN_ON_ONCE(!blk_get_queue(disk->queue));
 
-	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
-				   "bdi");
-	WARN_ON(retval);
-
 	disk_add_events(disk);
 	blk_integrity_add(disk);
 }
@@ -701,7 +715,8 @@ void del_gendisk(struct gendisk *disk)
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
-	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	if (disk->queue) {
 		/*
 		 * Unregister bdi before releasing device numbers (as they can
@@ -712,13 +727,15 @@ void del_gendisk(struct gendisk *disk)
 	} else {
 		WARN_ON(1);
 	}
-	blk_unregister_region(disk_devt(disk), disk->minors);
+
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+		blk_unregister_region(disk_devt(disk), disk->minors);
+		kobject_put(disk->part0.holder_dir);
+		kobject_put(disk->slave_dir);
+	}
 
 	part_stat_set_all(&disk->part0, 0);
 	disk->part0.stamp = 0;
-
-	kobject_put(disk->part0.holder_dir);
-	kobject_put(disk->slave_dir);
 	if (!sysfs_deprecated)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
@@ -781,6 +798,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 		spin_unlock_bh(&ext_devt_lock);
 	}
 
+	if (unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+		put_disk(disk);
+		disk = NULL;
+	}
 	return disk;
 }
 EXPORT_SYMBOL(get_gendisk);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c0ed5db33c2..93aae3476f58 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -140,6 +140,7 @@ struct hd_struct {
 #define GENHD_FL_NATIVE_CAPACITY		128
 #define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE	256
 #define GENHD_FL_NO_PART_SCAN			512
+#define GENHD_FL_HIDDEN				1024
 
 enum {
 	DISK_EVENT_MEDIA_CHANGE			= 1 << 0, /* media changed */
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as a
user-facing block device. This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
 include/linux/genhd.h |  1 +
 2 files changed, 40 insertions(+), 18 deletions(-)
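To make the intended use concrete, a driver-side sketch of registering such
a hidden gendisk might look like the following; the helper name, the
zero-minors allocation and the CONFIG_NVME_MULTIPATH guard are illustrative
assumptions drawn from the discussion above, not code from this patch or
the NVMe series.

#include <linux/genhd.h>
#include <linux/blkdev.h>

/*
 * Sketch only: register a gendisk that can service in-kernel I/O but is
 * not exposed to userspace (no /proc/partitions entry and no block
 * region registered for its dev_t).
 */
static int example_add_hidden_disk(struct device *parent,
				   struct request_queue *q, int instance)
{
	struct gendisk *disk;

	disk = alloc_disk(0);		/* minors == 0, rely on extended devt */
	if (!disk)
		return -ENOMEM;

	disk->queue = q;
	disk->flags = GENHD_FL_EXT_DEVT;
#ifdef CONFIG_NVME_MULTIPATH		/* hypothetical compile-time knob */
	disk->flags |= GENHD_FL_HIDDEN;
#endif
	sprintf(disk->disk_name, "example%dn1", instance);

	device_add_disk(parent, disk);
	return 0;
}

With the flag set, the disk still has a request queue and can be used as an
I/O target inside the kernel, while userspace never sees a device node for
it; without the flag, the same code registers an ordinary visible disk.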