
[06/17] block: introduce GENHD_FL_HIDDEN

Message ID 20171023145126.2471-7-hch@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Oct. 23, 2017, 2:51 p.m. UTC
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as a
user-facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
 include/linux/genhd.h |  1 +
 2 files changed, 40 insertions(+), 18 deletions(-)
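
For context, a minimal sketch of how a driver might set the new flag before
registering its disk (the fops, queue and parent device below are illustrative
placeholders, not code from this patch):

	/*
	 * Sketch only: mark the gendisk as hidden before device_add_disk(),
	 * so no user-facing block device node or /proc/partitions entry is
	 * created, while in-kernel I/O submission keeps working.
	 */
	struct gendisk *disk = alloc_disk_node(0, NUMA_NO_NODE);

	if (disk) {
		disk->fops = &example_fops;	/* hypothetical block_device_operations */
		disk->queue = example_queue;	/* request queue set up elsewhere */
		disk->flags |= GENHD_FL_HIDDEN;
		sprintf(disk->disk_name, "examplehidden0");
		device_add_disk(example_parent_dev, disk);
	}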

Comments

Hannes Reinecke Oct. 24, 2017, 7:18 a.m. UTC | #1
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as a
> user-facing block device.  This will be useful for the NVMe multipath
> implementation.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
>  include/linux/genhd.h |  1 +
>  2 files changed, 40 insertions(+), 18 deletions(-)
> 
Can we have some information in sysfs to figure out if a gendisk is
hidden? I'd hate having to do an inverse lookup in /proc/partitions;
it's always hard to prove that something is _not_ present.
And we already present various information (like disk_removable_show()),
so it's not without precedent.
And it would make integration with systemd/udev easier.

Cheers,

Hannes
Mike Snitzer Oct. 24, 2017, 9:40 p.m. UTC | #2
On Mon, Oct 23 2017 at 10:51am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as a
> user-facing block device.  This will be useful for the NVMe multipath
> implementation.

Having the NVMe driver go to such lengths to hide its resources from
upper layers is certainly the work of an evil genius experiencing some
serious territorial issues.  Not sugar-coating it.. you wouldn't.

I kept meaning to reply to your earlier iterations on this series to
ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
that the NVMe driver doesn't implicitly consume (and hide) all
per-controller devices?

Ah well.  There is only one correct way to do NVMe multipathing after
all right?
Christoph Hellwig Oct. 28, 2017, 6:15 a.m. UTC | #3
On Tue, Oct 24, 2017 at 09:18:47AM +0200, Hannes Reinecke wrote:
> Can we have some information in sysfs to figure out if a gendisk is
> hidden? I'd hate having to do an inverse lookup in /proc/partitions;
> it's always hard to prove that something is _not_ present.
> And we already present various information (like disk_removable_show()),
> so it's not without precedent.
> And it would make integration with systemd/udev easier.

Sure, I'll add a hidden flag sysfs file.
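
Something along these lines would do it (a sketch only; the attribute name and
wiring are assumptions modelled on the existing "removable" attribute, not the
actual follow-up patch):

	static ssize_t disk_hidden_show(struct device *dev,
					struct device_attribute *attr, char *buf)
	{
		struct gendisk *disk = dev_to_disk(dev);

		/* report whether GENHD_FL_HIDDEN is set on this gendisk */
		return sprintf(buf, "%d\n", (disk->flags & GENHD_FL_HIDDEN) != 0);
	}

	static DEVICE_ATTR(hidden, S_IRUGO, disk_hidden_show, NULL);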
Christoph Hellwig Oct. 28, 2017, 6:38 a.m. UTC | #4
On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> Having the NVMe driver go to such lengths to hide its resources from
> upper layers is certainly the work of an evil genius experiencing some
> serious territorial issues.  Not sugar-coating it.. you wouldn't.

I'm pretty sure Hannes will appreciate being called an evil genius :)

> I kept meaning to reply to your earlier iterations on this series to
> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> that the NVMe driver doesn't implicitly consume (and hide) all
> per-controller devices?

I thought about adding it, but mostly for a different reason: it's
quite a bit of code, and we now start to see NVMe in deeply embedded
contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
might be a good idea to give people a chance to avoid the overhead.
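
As a rough sketch of what such a compile-time knob could look like on the
driver side (CONFIG_NVME_MULTIPATH and the helper below are assumptions for
illustration, not code from this series):

	/*
	 * Sketch only: hide the per-controller disk when the in-kernel
	 * multipath code is compiled in, otherwise register it normally so
	 * user space (e.g. dm-multipath) can still claim it.
	 */
	static void example_add_ns_disk(struct device *parent, struct gendisk *disk)
	{
		if (IS_ENABLED(CONFIG_NVME_MULTIPATH))
			disk->flags |= GENHD_FL_HIDDEN;

		device_add_disk(parent, disk);
	}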

> Ah well.  There is only one correct way to do NVMe multipathing after
> all right?

I don't think you'll get very useful results, even if you try.  But I
guess we'll just have to tell people to use SuSE if they want NVMe
multipathing to work then :)
Guan Junxiong Oct. 28, 2017, 7:20 a.m. UTC | #5
Hi Christoph,

On 2017/10/28 14:38, Christoph Hellwig wrote:
> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
>> Having the NVMe driver go to such lengths to hide its resources from
>> upper layers is certainly the work of an evil genius experiencing some
>> serious territorial issues.  Not sugar-coating it.. you wouldn't.
> 
> I'm pretty sure Hannes will appreciate being called an evil genius :)
> 
>> I kept meaning to reply to your earlier iterations on this series to
>> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
>> that the NVMe driver doesn't implicitly consume (and hide) all
>> per-controller devices?
> 
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.
> 

Think of some of the current advanced features of DM-Multipath combined with
multipath-tools, such as path-latency priority grouping, intermittent I/O error
accounting for path degradation, and delayed, immediate or follow-over failback.
Those features, which are significant in some scenarios, need per-controller
block devices.  Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING
knob to hide or show per-controller block devices.

How about letting me add this CONFIG_NVME_MULTIPATHING knob?

Regards
Guan



>> Ah well.  There is only one correct way to do NVMe multipathing after
>> all right?
> 
> I don't think you'll get very useful results, even if you try.  But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)
Christoph Hellwig Oct. 28, 2017, 7:42 a.m. UTC | #6
On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
> Think of some of the current advanced features of DM-Multipath combined with
> multipath-tools, such as path-latency priority grouping, intermittent I/O error
> accounting for path degradation, and delayed, immediate or follow-over failback.
> Those features, which are significant in some scenarios, need per-controller
> block devices.  Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING
> knob to hide or show per-controller block devices.
> 
> How about letting me add this CONFIG_NVME_MULTIPATHING knob?

There is definitely not going to be a run-time knob, sorry.  You've
spent this work on dm-multipath after the nvme group pretty clearly
said that this is not what we are going to support, and we have no
interest in supporting this.  Especially as proper path discovery,
asymmetric namespace states and similar can only be supported properly
with the in-kernel code.
Guan Junxiong Oct. 28, 2017, 10:09 a.m. UTC | #7
On 2017/10/28 15:42, Christoph Hellwig wrote:
> On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
>> Think of some of the current advanced features of DM-Multipath combined with
>> multipath-tools, such as path-latency priority grouping, intermittent I/O error
>> accounting for path degradation, and delayed, immediate or follow-over failback.
>> Those features, which are significant in some scenarios, need per-controller
>> block devices.  Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING
>> knob to hide or show per-controller block devices.
>>
>> How about letting me add this CONFIG_NVME_MULTIPATHING knob?
> 
> There is definitely not going to be a run-time knob, sorry.  You've
> spent this work on dm-multipath after the nvme group pretty clearly
> said that this is not what we are going to support, and we have no
> interest in supporting this.  Especially as proper path discovery,
> asymmetric namespace states and similar can only be supported properly
> with the in-kernel code.
> 
Does it mean some extra work such as:
1) showing the path topology of nvme multipath device
2) daemon to implement immediate and delayed failback
3) detecting sub-healthy path due to shaky link
4) grouping paths besides ANA
...
and so on,
needs to be integrated into a user-space tool such as nvme-cli
or into a newly invented user-space tool named "nvme-mpath"?
Which do you prefer?

And the kernel also needs to export more methods for setting and getting these.

Regards
Guan
Mike Snitzer Oct. 28, 2017, 2:17 p.m. UTC | #8
On Sat, Oct 28 2017 at  2:38am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> > Having the NVMe driver go to such lengths to hide its resources from
> > upper layers is certainly the work of an evil genius experiencing some
> > serious territorial issues.  Not sugar-coating it.. you wouldn't.
> 
> I'm pretty sure Hannes will appreciate being called an evil genius :)

Well, to be fair.. it doesn't take that much brain power to arrive at
isolationism.

And I'm pretty sure Hannes had a better idea with using the blkdev_get
(holders/claim) interface but you quickly dismissed that.  Despite it
being the best _existing_ way to ensure mutual exclusion with the rest
of the block layer.

> > I kept meaning to reply to your earlier iterations on this series to
> > ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> > that the NVMe driver doesn't implicitly consume (and hide) all
> > per-controller devices?
> 
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.

OK, so please add it.

> > Ah well.  There is only one correct way to do NVMe multipathing after
> > all right?
> 
> I don't think you'll get very useful results, even if you try.  But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)

Don't do that.  Don't assume you know it all.  Don't fabricate vendor
wars in your head and then project it out to the rest of the community.
We're all in this together.

Fact is Hannes and I have exchanged private mail (in response to this
very thread) and we agree that your approach is currently not suitable
for enterprise deployment.  Hannes needs to also deal with the coming
duality of Linux multipathing and you aren't making it easy.  Just
because you're able to be myopic doesn't mean the rest of us with way
more direct customers behind us can be.

You're finding your way with this new multipath model and I'm happy to
see that happen.  But what I'm not happy about is the steps you're
taking to be actively disruptive.  There were so many ways this all
could've gone and sadly you've elected to stand up an architecture that
doesn't even allow the prospect of reuse.  And for what?  Because doing
so takes 10% more work?  Well we can backfill that work if you'll grant
us an open mind.

I'm really not against you.  I just need very basic controls put into
the NVMe multipathing code that allows it to be disabled (yet reused).
Not that I have immediate plans to actually _use_ it.  My hope is I can
delegate NVMe multipathing to NVMe!  But I need the latitude to find my
way with the requirements I am, or will be, needing to consider.

Please don't paint me and others in my position into a corner.

Mike
Anish M Jhaveri Oct. 29, 2017, 8 a.m. UTC | #9
On Saturday, October 28, 2017 3:10 AM,  Guan Junxiong wrote:

>Does it mean some extra work such as:
>1) showing the path topology of nvme multipath device
Isn't the connection to the target supposed to guide the host to the shortest and fastest available path, the so-called most optimized path?  A sysfs entry can easily store that, as we store path-related info over there.

>2) daemon to implement immediate and delayed failback
This is also based on the target: whenever the target is ready for immediate or delayed failback, it will let the connect from the host succeed.  Until then the host stays in a reconnecting state across all paths until it finds an optimized or un-optimized path.  Why is this extra daemon needed?
 
>4) grouping paths besides ANA
Why can't we use NS Groups here?

It would be a good idea to move away from the legacy way of doing things for NVMe devices.  The NVMe multipath implementation by Christoph is very simple.  How about not making it super complicated?

Best regards,
Anish
Christoph Hellwig Oct. 29, 2017, 8:57 a.m. UTC | #10
On Sat, Oct 28, 2017 at 06:09:46PM +0800, Guan Junxiong wrote:
> Does it mean some extra work such as:
> 1) showing the path topology of nvme multipath device

It's in sysfs.  And Johannes volunteered to also add nvme-cli
support.

> 2) daemon to implement immediate and delayed failback

The whole point is to not have a daemon.

> 3) detecting sub-healthy path due to shaky link

We can do this in kernel space.  It just needs someone to implement it.

> 4) grouping paths besides ANA

We don't want to do non-standard grouping.  Please work with the
NVMe working group for your grouping ideas.
Hannes Reinecke Oct. 29, 2017, 10:01 a.m. UTC | #11
On 10/28/2017 04:17 PM, Mike Snitzer wrote:
> On Sat, Oct 28 2017 at  2:38am -0400,
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
[ .. ]
>>> Ah well.  There is only one correct way to do NVMe multipathing after
>>> all right?
>>
>> I don't think you'll get very useful results, even if you try.  But I
>> guess we'll just have to tell people to use SuSE if they want NVMe
>> multipathing to work then :)
> 
> Don't do that.  Don't assume you know it all.  Don't fabricate vendor
> wars in your head and then project it out to the rest of the community.
> We're all in this together.
> 
> Fact is Hannes and I have exchanged private mail (in response to this
> very thread) and we agree that your approach is currently not suitable
> for enterprise deployment.  Hannes needs to also deal with the coming
> duality of Linux multipathing and you aren't making it easy.  Just
> because you're able to be myopic doesn't mean the rest of us with way
> more direct customers behind us can be.
> 
> You're finding your way with this new multipath model and I'm happy to
> see that happen.  But what I'm not happy about is the steps you're
> taking to be actively disruptive.  There were so many ways this all
> could've gone and sadly you've elected to stand up an architecture that
> doesn't even allow the prospect of reuse.  And for what?  Because doing
> so takes 10% more work?  Well we can backfill that work if you'll grant
> us an open mind.
> 
> I'm really not against you.  I just need very basic controls put into
> the NVMe multipathing code that allows it to be disabled (yet reused).
> Not that I have immediate plans to actually _use_ it.  My hope is I can
> delegate NVMe multipathing to NVMe!  But I need the latitude to find my
> way with the requirements I am, or will be, needing to consider.
> 
> Please don't paint me and others in my position into a corner.
> 
To add my two cents from the SUSE perspective:
We have _quite_ some deployments on dm-multipathing, and most of our
customers are quite accustomed to setting up deployments with that.
It will be impossible to ensure that all customers and installation
scripts will _not_ try starting up multipathing, and in some scenarios
they will even prefer to use dm-multipath just because their tooling is
geared up for that.
So we absolutely need peaceful coexistence between dm-multipathing and
nvme multipathing. The precise level can be discussed (whether it be
a global on/off switch or some more fine-grained thingie), but we simply
cannot declare nvme multipathing to be the one true way for NVMe.
The support-calls alone will kill us.

Note: this has _nothing_ to do with performance. I'm perfectly fine with
accepting that dm-multipath has sub-optimal performance.
But that should not imply that it cannot be used for NVMe.

After all, Linux is about choice, not about forcing users to do things
in one way only.

Cheers,

Hannes

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 1174d24e405e..11a41cca3475 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -585,6 +585,11 @@  static void register_disk(struct device *parent, struct gendisk *disk)
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		dev_set_uevent_suppress(ddev, 0);
+		return;
+	}
+
 	disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
 
@@ -616,6 +621,11 @@  static void register_disk(struct device *parent, struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter)))
 		kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
 	disk_part_iter_exit(&piter);
+
+	err = sysfs_create_link(&ddev->kobj,
+				&disk->queue->backing_dev_info->dev->kobj,
+				"bdi");
+	WARN_ON(err);
 }
 
 /**
@@ -630,7 +640,6 @@  static void register_disk(struct device *parent, struct gendisk *disk)
  */
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-	struct backing_dev_info *bdi;
 	dev_t devt;
 	int retval;
 
@@ -639,7 +648,8 @@  void device_add_disk(struct device *parent, struct gendisk *disk)
 	 * parameters make sense.
 	 */
 	WARN_ON(disk->minors && !(disk->major || disk->first_minor));
-	WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
+	WARN_ON(!disk->minors &&
+		!(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));
 
 	disk->flags |= GENHD_FL_UP;
 
@@ -648,18 +658,26 @@  void device_add_disk(struct device *parent, struct gendisk *disk)
 		WARN_ON(1);
 		return;
 	}
-	disk_to_dev(disk)->devt = devt;
 	disk->major = MAJOR(devt);
 	disk->first_minor = MINOR(devt);
 
 	disk_alloc_events(disk);
 
-	/* Register BDI before referencing it from bdev */
-	bdi = disk->queue->backing_dev_info;
-	bdi_register_owner(bdi, disk_to_dev(disk));
-
-	blk_register_region(disk_devt(disk), disk->minors, NULL,
-			    exact_match, exact_lock, disk);
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		/*
+		 * Don't let hidden disks show up in /proc/partitions,
+		 * and don't bother scanning for partitions either.
+		 */
+		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+		disk->flags |= GENHD_FL_NO_PART_SCAN;
+	} else {
+		/* Register BDI before referencing it from bdev */
+		disk_to_dev(disk)->devt = devt;
+		bdi_register_owner(disk->queue->backing_dev_info,
+				disk_to_dev(disk));
+		blk_register_region(disk_devt(disk), disk->minors, NULL,
+				    exact_match, exact_lock, disk);
+	}
 	register_disk(parent, disk);
 	blk_register_queue(disk);
 
@@ -669,10 +687,6 @@  void device_add_disk(struct device *parent, struct gendisk *disk)
 	 */
 	WARN_ON_ONCE(!blk_get_queue(disk->queue));
 
-	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
-				   "bdi");
-	WARN_ON(retval);
-
 	disk_add_events(disk);
 	blk_integrity_add(disk);
 }
@@ -701,7 +715,8 @@  void del_gendisk(struct gendisk *disk)
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
-	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	if (disk->queue) {
 		/*
 		 * Unregister bdi before releasing device numbers (as they can
@@ -712,13 +727,15 @@  void del_gendisk(struct gendisk *disk)
 	} else {
 		WARN_ON(1);
 	}
-	blk_unregister_region(disk_devt(disk), disk->minors);
+
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+		blk_unregister_region(disk_devt(disk), disk->minors);
+		kobject_put(disk->part0.holder_dir);
+		kobject_put(disk->slave_dir);
+	}
 
 	part_stat_set_all(&disk->part0, 0);
 	disk->part0.stamp = 0;
-
-	kobject_put(disk->part0.holder_dir);
-	kobject_put(disk->slave_dir);
 	if (!sysfs_deprecated)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
@@ -781,6 +798,10 @@  struct gendisk *get_gendisk(dev_t devt, int *partno)
 		spin_unlock_bh(&ext_devt_lock);
 	}
 
+	if (unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+		put_disk(disk);
+		disk = NULL;
+	}
 	return disk;
 }
 EXPORT_SYMBOL(get_gendisk);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c0ed5db33c2..93aae3476f58 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -140,6 +140,7 @@  struct hd_struct {
 #define GENHD_FL_NATIVE_CAPACITY		128
 #define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE	256
 #define GENHD_FL_NO_PART_SCAN			512
+#define GENHD_FL_HIDDEN				1024
 
 enum {
 	DISK_EVENT_MEDIA_CHANGE			= 1 << 0, /* media changed */