[4/4] block: expose devt for GENHD_FL_HIDDEN disks

Message ID 20181206164812.30925-5-cascardo@canonical.com (mailing list archive)
State New, archived
Series nvme multipath: expose slaves/holders

Commit Message

Thadeu Lima de Souza Cascardo Dec. 6, 2018, 4:48 p.m. UTC
Without this exposure, lsblk will fail as it tries to find out the
device's dev_t numbers. This causes a real problem for nvme multipath
devices, as their slaves are hidden.

Exposing them fixes the problem, even though trying to open the devices
returns an error in the case of nvme multipath. So, right now, it's the
driver's responsibility to return a failure to open hidden devices.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 block/genhd.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
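
To make "the driver's responsibility" concrete, here is a minimal kernel-side
sketch of the open check the previous patch in the series relies on (shape
assumed, not the literal hunk from that patch):

/*
 * Hedged sketch: once a devt is exposed for hidden disks, opens can reach
 * the driver, so direct opens of a path claimed by a multipath head must
 * be rejected there.
 */
static int nvme_open(struct block_device *bdev, fmode_t mode)
{
	struct nvme_ns *ns = bdev->bd_disk->private_data;

#ifdef CONFIG_NVME_MULTIPATH
	/* Namespaces owned by a multipath head are not directly openable. */
	if (ns->head && ns->head->disk)
		return -ENXIO;
#endif
	/* ... normal open path (reference counting etc.) continues here ... */
	return 0;
}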

Comments

Christoph Hellwig Dec. 6, 2018, 8:22 p.m. UTC | #1
On Thu, Dec 06, 2018 at 02:48:12PM -0200, Thadeu Lima de Souza Cascardo wrote:
> Without this exposure, lsblk will fail as it tries to find out the
> device's dev_t numbers. This causes a real problem for nvme multipath
> devices, as their slaves are hidden.
> 
> Exposing them fixes the problem, even though trying to open the devices
> returns an error in the case of nvme multipath. So, right now, it's the
> driver's responsibility to return a failure to open hidden devices.

So the problem with this is that it will cause udev to actually create
the /dev/nvmeXcYnZ nodes, which due to the previous patch will always
fail to open, which is a bit confusing.  I guess we could live with this
if we add udev rules to suppress the creation or something, but in general
it is a bit ugly.
Christoph Hellwig Dec. 12, 2018, 8:32 a.m. UTC | #2
On Thu, Dec 06, 2018 at 09:22:00PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 06, 2018 at 02:48:12PM -0200, Thadeu Lima de Souza Cascardo wrote:
> > Without this exposure, lsblk will fail as it tries to find out the
> > device's dev_t numbers. This causes a real problem for nvme multipath
> > devices, as their slaves are hidden.
> > 
> > Exposing them fixes the problem, even though trying to open the devices
> > returns an error in the case of nvme multipath. So, right now, it's the
> > driver's responsibility to return a failure to open hidden devices.
> 
> So the problem with this is that it will cause udev to actually create
> the /dev/nvmeXcYnZ nodes, which due to the previous patch will always
> fail to open, which is a bit confusing.  I guess we could live with this
> if we add udev rules to suppress the creation or something, but in general
> it is a bit ugly.

Thadeu (or other distro folks), any comments on this issue?  If not
I'd like to just move forward with the first two patches for now,
and it would be good to have some reviews for them.
Thadeu Lima de Souza Cascardo Dec. 12, 2018, 12:39 p.m. UTC | #3
On Thu, Dec 06, 2018 at 09:22:00PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 06, 2018 at 02:48:12PM -0200, Thadeu Lima de Souza Cascardo wrote:
> > Without this exposure, lsblk will fail as it tries to find out the
> > device's dev_t numbers. This causes a real problem for nvme multipath
> > devices, as their slaves are hidden.
> > 
> > Exposing them fixes the problem, even though trying to open the devices
> > returns an error in the case of nvme multipath. So, right now, it's the
> > driver's responsibility to return a failure to open hidden devices.
> 
> So the problem with this is that it will cause udev to actually create
> the /dev/nvmeXcYnZ nodes, which due to the previous patch will always
> fail to open, which is a bit confusing.  I guess we could live with this
> > if we add udev rules to suppress the creation or something, but in general
> it is a bit ugly.

Well, udev could just look at the hidden attribute. At least to some, the fact
that lsblk would fail was reason enough to revert the patches, so I would
rather apply the entire set.

We could add a message to the log when users try to open them, but given that
some programs (udev is one) would open them without any user interaction, maybe
we shouldn't.
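
For illustration, a hedged userspace sketch of such a check (sysfs layout
assumed; a udev rule could match on the same 'hidden' attribute via ATTR):

#include <limits.h>
#include <stdio.h>

/* Sketch: read the 'hidden' attribute under a disk's sysfs directory. */
static int sysfs_disk_is_hidden(const char *sysfs_dir)
{
	char path[PATH_MAX];
	char buf[4] = "0";
	FILE *f;

	snprintf(path, sizeof(path), "%s/hidden", sysfs_dir);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* attribute absent: treat as not hidden */
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '0';
	fclose(f);
	return buf[0] == '1';
}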

Cascardo.
Hannes Reinecke Dec. 13, 2018, 9:18 a.m. UTC | #4
On 12/6/18 5:48 PM, Thadeu Lima de Souza Cascardo wrote:
> Without this exposure, lsblk will fail as it tries to find out the
> device's dev_t numbers. This causes a real problem for nvme multipath
> devices, as their slaves are hidden.
> 
> Exposing them fixes the problem, even though trying to open the devices
> returns an error in the case of nvme multipath. So, right now, it's the
> driver's responsibility to return a failure to open hidden devices.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> ---
>   block/genhd.c | 9 ++++-----
>   1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 7674fce32fca..65a7fa664188 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>   
>   	disk_alloc_events(disk);
>   
> +	disk_to_dev(disk)->devt = devt;
>   	if (disk->flags & GENHD_FL_HIDDEN) {
>   		/*
>   		 * Don't let hidden disks show up in /proc/partitions,
> @@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>   		int ret;
>   
>   		/* Register BDI before referencing it from bdev */
> -		disk_to_dev(disk)->devt = devt;
>   		ret = bdi_register_owner(disk->queue->backing_dev_info,
>   						disk_to_dev(disk));
>   		WARN_ON(ret);
> -		blk_register_region(disk_devt(disk), disk->minors, NULL,
> -				    exact_match, exact_lock, disk);
>   	}
> +	blk_register_region(disk_devt(disk), disk->minors, NULL,
> +			    exact_match, exact_lock, disk);
>   	register_disk(parent, disk, groups);
>   	if (register_queue)
>   		blk_register_queue(disk);
> @@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk)
>   		WARN_ON(1);
>   	}
>   
> -	if (!(disk->flags & GENHD_FL_HIDDEN))
> -		blk_unregister_region(disk_devt(disk), disk->minors);
> +	blk_unregister_region(disk_devt(disk), disk->minors);
>   
>   	kobject_put(disk->part0.holder_dir);
>   	kobject_put(disk->slave_dir);
> 
Well ... this is not just 'lsblk', but more importantly this will force
udev to create _block_ device nodes for the hidden devices, essentially 
'unhide' them.

Is this what we want?
Christoph?
I thought the entire _point_ of having hidden devices is that they are
... well ... hidden ...

Cheers,

Hannes
Thadeu Lima de Souza Cascardo Dec. 13, 2018, 11:41 a.m. UTC | #5
On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
> On 12/6/18 5:48 PM, Thadeu Lima de Souza Cascardo wrote:
> > Without this exposure, lsblk will fail as it tries to find out the
> > device's dev_t numbers. This causes a real problem for nvme multipath
> > devices, as their slaves are hidden.
> > 
> > Exposing them fixes the problem, even though trying to open the devices
> > returns an error in the case of nvme multipath. So, right now, it's the
> > driver's responsibility to return a failure to open hidden devices.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> > ---
> >   block/genhd.c | 9 ++++-----
> >   1 file changed, 4 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 7674fce32fca..65a7fa664188 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
> >   	disk_alloc_events(disk);
> > +	disk_to_dev(disk)->devt = devt;
> >   	if (disk->flags & GENHD_FL_HIDDEN) {
> >   		/*
> >   		 * Don't let hidden disks show up in /proc/partitions,
> > @@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
> >   		int ret;
> >   		/* Register BDI before referencing it from bdev */
> > -		disk_to_dev(disk)->devt = devt;
> >   		ret = bdi_register_owner(disk->queue->backing_dev_info,
> >   						disk_to_dev(disk));
> >   		WARN_ON(ret);
> > -		blk_register_region(disk_devt(disk), disk->minors, NULL,
> > -				    exact_match, exact_lock, disk);
> >   	}
> > +	blk_register_region(disk_devt(disk), disk->minors, NULL,
> > +			    exact_match, exact_lock, disk);
> >   	register_disk(parent, disk, groups);
> >   	if (register_queue)
> >   		blk_register_queue(disk);
> > @@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk)
> >   		WARN_ON(1);
> >   	}
> > -	if (!(disk->flags & GENHD_FL_HIDDEN))
> > -		blk_unregister_region(disk_devt(disk), disk->minors);
> > +	blk_unregister_region(disk_devt(disk), disk->minors);
> >   	kobject_put(disk->part0.holder_dir);
> >   	kobject_put(disk->slave_dir);
> > 
> Well ... this is not just 'lsblk', but more importantly this will force
> udev to create _block_ device nodes for the hidden devices, essentially
> 'unhide' them.
> 
> Is this what we want?
> Christoph?
> I thought the entire _point_ of having hidden devices is that they are ...
> well ... hidden ...
> 

I knew this would be the most controversial patch. And I had other solutions as
well, but preferred this one. So, the other alternative would be to just not use
GENHD_FL_HIDDEN for the nvme devices, which would leave that flag without a
single user in the kernel. That would still fix the two problems
(initramfs-tools and lsblk), and not create any other problem I know of. That
patch would still fail to open the underlying devices when there is a
head/multipath associated with it.

So, the only thing GENHD_FL_HIDDEN gives us is saving some resources, like bdi,
for example. And we could also use it to fail open when blkdev_get* is called.
Of course, that would still imply that its name should be changed, but as we
already have an attribute named after that, I find it hard to suggest such a
change.
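
A minimal sketch of that alternative, assuming the block core itself enforced
the policy in the blkdev_get* path (an assumption; it is not what this series
implements):

/*
 * Hedged sketch: let the block core fail opens of hidden disks centrally,
 * e.g. early in __blkdev_get(), instead of in every driver's open method.
 */
static int blkdev_check_hidden(struct gendisk *disk)
{
	if (disk->flags & GENHD_FL_HIDDEN)
		return -ENXIO;	/* disk exists, but may not be opened */
	return 0;
}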

Cascardo.

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
Hannes Reinecke Dec. 13, 2018, 12:19 p.m. UTC | #6
On 12/13/18 12:41 PM, Thadeu Lima de Souza Cascardo wrote:
> On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
>> On 12/6/18 5:48 PM, Thadeu Lima de Souza Cascardo wrote:
>>> Without this exposure, lsblk will fail as it tries to find out the
>>> device's dev_t numbers. This causes a real problem for nvme multipath
>>> devices, as their slaves are hidden.
>>>
>>> Exposing them fixes the problem, even though trying to open the devices
>>> returns an error in the case of nvme multipath. So, right now, it's the
>>> driver's responsibility to return a failure to open hidden devices.
>>>
>>> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
>>> ---
>>>    block/genhd.c | 9 ++++-----
>>>    1 file changed, 4 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 7674fce32fca..65a7fa664188 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>>>    	disk_alloc_events(disk);
>>> +	disk_to_dev(disk)->devt = devt;
>>>    	if (disk->flags & GENHD_FL_HIDDEN) {
>>>    		/*
>>>    		 * Don't let hidden disks show up in /proc/partitions,
>>> @@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
>>>    		int ret;
>>>    		/* Register BDI before referencing it from bdev */
>>> -		disk_to_dev(disk)->devt = devt;
>>>    		ret = bdi_register_owner(disk->queue->backing_dev_info,
>>>    						disk_to_dev(disk));
>>>    		WARN_ON(ret);
>>> -		blk_register_region(disk_devt(disk), disk->minors, NULL,
>>> -				    exact_match, exact_lock, disk);
>>>    	}
>>> +	blk_register_region(disk_devt(disk), disk->minors, NULL,
>>> +			    exact_match, exact_lock, disk);
>>>    	register_disk(parent, disk, groups);
>>>    	if (register_queue)
>>>    		blk_register_queue(disk);
>>> @@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk)
>>>    		WARN_ON(1);
>>>    	}
>>> -	if (!(disk->flags & GENHD_FL_HIDDEN))
>>> -		blk_unregister_region(disk_devt(disk), disk->minors);
>>> +	blk_unregister_region(disk_devt(disk), disk->minors);
>>>    	kobject_put(disk->part0.holder_dir);
>>>    	kobject_put(disk->slave_dir);
>>>
>> Well ... this is not just 'lsblk', but more importantly this will force
>> udev to create _block_ device nodes for the hidden devices, essentially
>> 'unhide' them.
>>
>> Is this what we want?
>> Christoph?
>> I thought the entire _point_ of having hidden devices is that they are ...
>> well ... hidden ...
>>
> 
> I knew this would be the most controversial patch. And I had other solutions as
> well, but preferred this one. So, the other alternative would be to just not use
> GENHD_FL_HIDDEN for the nvme devices, which would leave that flag without a
> single user in the kernel. That would still fix the two problems
> (initramfs-tools and lsblk), and not create any other problem I know of. That
> patch would still fail to open the underlying devices when there is a
> head/multipath associated with it.
> 
> So, the only thing GENHD_FL_HIDDEN gives us is saving some resources, like bdi,
> for example. And we could also use it to fail open when blkdev_get* is called.
> Of course, that would still imply that its name should be changed, but as we
> already have an attribute named after that, I find it hard to suggest such a
> change.
> 
The whole point of the native nvme multipath implementation was that the 
block devices behind the multipathed device are _NOT_ visible to the OS.

This patch reverts the whole scheme, making it behave just like the
classical device-mapper multipathing did.

Why can't we just update lsblk to present the correct output?

nvme-cli (and multipath, for that matter) work happily with the current 
setup, _and_ provide the correct topology:

So why can't we do it with lsblk?

Cheers,

Hannes
Christoph Hellwig Dec. 13, 2018, 2:32 p.m. UTC | #7
On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
> Well ... this is not just 'lsblk', but more importantly this will force
> udev to create _block_ device nodes for the hidden devices, essentially 
> 'unhide' them.
>
> Is this what we want?
> Christoph?
> I thought the entire _point_ of having hidden devices is that they are ...
> well ... hidden ...

Yes, that is why I really don't like the last two patches.

And I've checked back - lsblk actually works just fine at the moment.
But it turns out once we create the slave links it stops working,
which is a really good argument against the first two patches, which
would otherwise seem nice..
Thadeu Lima de Souza Cascardo Dec. 13, 2018, 3:25 p.m. UTC | #8
On Thu, Dec 13, 2018 at 03:32:18PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
> > Well ... this is not just 'lsblk', but more importantly this will force
> > udev to create _block_ device nodes for the hidden devices, essentially 
> > 'unhide' them.
> >
> > Is this what we want?
> > Christoph?
> > I thought the entire _point_ of having hidden devices is that they are ...
> > well ... hidden ...
> 
> Yes, that is why I really don't like the last two patches.
> 
> And I've checked back - lsblk actually works just fine at the moment.
> But it turns out once we create the slave links it stops working,
> which is a really good argument against the first two patches, which
> would otherwise seem nice..

Which is why I have sent the "paths/" patchset in the first place. Because I
did some homework and read the previous discussion about this, and how lsblk's
failure to behave with slave links led to the revert of the slaves/holders
patch by Dr. Hannes.

Cascardo.
Thadeu Lima de Souza Cascardo Dec. 13, 2018, 4:08 p.m. UTC | #9
On Thu, Dec 13, 2018 at 01:19:25PM +0100, Hannes Reinecke wrote:
> On 12/13/18 12:41 PM, Thadeu Lima de Souza Cascardo wrote:
> > On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
> > > On 12/6/18 5:48 PM, Thadeu Lima de Souza Cascardo wrote:
> > > > Without this exposure, lsblk will fail as it tries to find out the
> > > > device's dev_t numbers. This causes a real problem for nvme multipath
> > > > devices, as their slaves are hidden.
> > > > 
> > > > Exposing them fixes the problem, even though trying to open the devices
> > > > returns an error in the case of nvme multipath. So, right now, it's the
> > > > driver's responsibility to return a failure to open hidden devices.
> > > > 
> > > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> > > > ---
> > > >    block/genhd.c | 9 ++++-----
> > > >    1 file changed, 4 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > index 7674fce32fca..65a7fa664188 100644
> > > > --- a/block/genhd.c
> > > > +++ b/block/genhd.c
> > > > @@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
> > > >    	disk_alloc_events(disk);
> > > > +	disk_to_dev(disk)->devt = devt;
> > > >    	if (disk->flags & GENHD_FL_HIDDEN) {
> > > >    		/*
> > > >    		 * Don't let hidden disks show up in /proc/partitions,
> > > > @@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
> > > >    		int ret;
> > > >    		/* Register BDI before referencing it from bdev */
> > > > -		disk_to_dev(disk)->devt = devt;
> > > >    		ret = bdi_register_owner(disk->queue->backing_dev_info,
> > > >    						disk_to_dev(disk));
> > > >    		WARN_ON(ret);
> > > > -		blk_register_region(disk_devt(disk), disk->minors, NULL,
> > > > -				    exact_match, exact_lock, disk);
> > > >    	}
> > > > +	blk_register_region(disk_devt(disk), disk->minors, NULL,
> > > > +			    exact_match, exact_lock, disk);
> > > >    	register_disk(parent, disk, groups);
> > > >    	if (register_queue)
> > > >    		blk_register_queue(disk);
> > > > @@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk)
> > > >    		WARN_ON(1);
> > > >    	}
> > > > -	if (!(disk->flags & GENHD_FL_HIDDEN))
> > > > -		blk_unregister_region(disk_devt(disk), disk->minors);
> > > > +	blk_unregister_region(disk_devt(disk), disk->minors);
> > > >    	kobject_put(disk->part0.holder_dir);
> > > >    	kobject_put(disk->slave_dir);
> > > > 
> > > Well ... this is not just 'lsblk', but more importantly this will force
> > > udev to create _block_ device nodes for the hidden devices, essentially
> > > 'unhide' them.
> > > 
> > > Is this what we want?
> > > Christoph?
> > > I thought the entire _point_ of having hidden devices is that they are ...
> > > well ... hidden ...
> > > 
> > 
> > I knew this would be the most controversial patch. And I had other solutions as
> > well, but preferred this one. So, the other alternative would be to just not use
> > GENHD_FL_HIDDEN for the nvme devices, which would leave that flag without a
> > single user in the kernel. That would still fix the two problems
> > (initramfs-tools and lsblk), and not create any other problem I know of. That
> > patch would still fail to open the underlying devices when there is a
> > head/multipath associated with it.
> > 
> > So, the only thing GENHD_FL_HIDDEN gives us is saving some resources, like bdi,
> > for example. And we could also use it to fail open when blkdev_get* is called.
> > Of course, that would still imply that its name should be changed, but as we
> > already have an attribute named after that, I find it hard to suggest such a
> > change.
> > 
> The whole point of the native nvme multipath implementation was that the
> block devices behind the multipathed device are _NOT_ visible to the OS.
> 

They are still partially visible. And the fact that their relationship to
the block device is not visible makes initramfs-tools break, that is, it
can't find the PCI device and its necessary drivers to get to the root
block device.

Also, when we make those relationships clear, lsblk finds those same
"hidden" devices but breaks because it can't find its devno.

> This patch reverts the whole scheme, making it behave just like the
> classical device-mapper multipathing did.
> 

As you are the expert in multipath, would you say that the behavior is the
same? Do we really only care about showing relationships between devices
and hidden block devices as the only difference between those two
implementations?  If so, I don't think it's worth it. From the little I
know, they are two different multipath implementations, and hiding devices
is orthogonal to their differences.

> Why can't we just update lsblk to present the correct output?
> 
> nvme-cli (and multipath, for that matter) work happily with the current
> setup, _and_ provide the correct topology:
> 
> So why can't we do it with lsblk?

Because old userspace is broken by the patches that introduce slaves and
holders. That is, users upgrade their kernels, and, suddenly, lsblk is
broken, and distros need to offer upgrades.

It's not the same with initramfs-tools. It's broken already. And I am
trying to provide a kernel fix that will fix it without requiring any
updates. And if we go the "paths/" path, when a user upgrades their kernel,
they would still need an updated initramfs-tools, but an outdated one does
not regress because of the kernel upgrade. It has already been broken.

Cascardo.

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
Christoph Hellwig Dec. 13, 2018, 8:20 p.m. UTC | #10
On Thu, Dec 13, 2018 at 01:25:33PM -0200, Thadeu Lima de Souza Cascardo wrote:
> > And I've checked back - lsblk actually works just fine at the moment.
> > But it turns out once we create the slave links it stops working,
> > which is a really good argument against the first two patches, which
> > would otherwise seem nice..
> 
> Which is why I have sent the "paths/" patchset in the first place. Because I
> did some homework and read the previous discussion about this, and how lsblk's
> failure to behave with slave links led to the revert of the slaves/holders
> patch by Dr. Hannes.

Sorry, I did not actually notice that Hannes' patch manually created
the same slaves/holders link we otherwise create using the block layer
APIs.  Had I realized those actually were the same, it would have saved
me some work.

So I guess the v2 paths/ link patch from you is the least of all evils.
Hannes, can you look over that one?
Thadeu Lima de Souza Cascardo Dec. 13, 2018, 9 p.m. UTC | #11
On Thu, Dec 13, 2018 at 09:20:16PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 13, 2018 at 01:25:33PM -0200, Thadeu Lima de Souza Cascardo wrote:
> > > And I've checked back - lsblk actually works just fine at the moment.
> > > But it turns out once we create the slave links it stops working,
> > > which is a really good argument against the first two patches, which
> > > would otherwise seem nice..
> > 
> > Which is why I have sent the "paths/" patchset in the first place. Because I
> > did some homework and read the previous discussion about this, and how lsblk's
> > failure to behave with slave links led to the revert of the slaves/holders
> > patch by Dr. Hannes.
> 
> Sorry, I did not actually notice that Hannes patch manually created
> the same slaves/holders link we otherwise create using the block layer
> APIs.  Had I realized those actually were the same that had saved
> me some work.
> 
> So I guess the v2 paths/ link patch from you is the least of all evils.
> Hannes, can you look over that one?

No hard feelings. Sorry we had to go through all these messages to make that
clear.

This should be Message-Id: <20181101232955.19329-1-cascardo@canonical.com> and
Hannes was on Cc as well.

Regards.
Cascardo.
Hannes Reinecke Dec. 14, 2018, 7:47 a.m. UTC | #12
On 12/13/18 4:25 PM, Thadeu Lima de Souza Cascardo wrote:
> On Thu, Dec 13, 2018 at 03:32:18PM +0100, Christoph Hellwig wrote:
>> On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
>>> Well ... this is not just 'lsblk', but more importantly this will force
>>> udev to create _block_ device nodes for the hidden devices, essentially
>>> 'unhide' them.
>>>
>>> Is this what we want?
>>> Christoph?
>>> I thought the entire _point_ of having hidden devices is that they are ...
>>> well ... hidden ...
>>
>> Yes, that is why I really don't like the last two patches.
>>
>> And I've checked back - lsblk actually works just fine at the moment.
>> But it turns out once we create the slave links it stops working,
>> which is a really good argument against the first two patches, which
>> would otherwise seem nice..
> 
> Which is why I have sent the "paths/" patchset in the first place. Because I
> did some homework and read the previous discussion about this, and how lsblk's
> failure to behave with slave links led to the revert of the slaves/holders
> patch by Dr. Hannes.
> 
But you haven't answered my question:

Why can't we patch 'lsblk' to provide the required information even with 
the current sysfs layout?

Cheers,

Hannes
Thadeu Lima de Souza Cascardo Dec. 14, 2018, 8:56 a.m. UTC | #13
On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
> On 12/13/18 4:25 PM, Thadeu Lima de Souza Cascardo wrote:
> > On Thu, Dec 13, 2018 at 03:32:18PM +0100, Christoph Hellwig wrote:
> > > On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
> > > > Well ... this is not just 'lsblk', but more importantly this will force
> > > > udev to create _block_ device nodes for the hidden devices, essentially
> > > > 'unhide' them.
> > > > 
> > > > Is this what we want?
> > > > Christoph?
> > > > I thought the entire _point_ of having hidden devices is that they are ...
> > > > well ... hidden ...
> > > 
> > > Yes, that is why I really don't like the last two patches.
> > > 
> > > And I've checked back - lsblk actually works just fine at the moment.
> > > But it turns out once we create the slave links it stops working,
> > > which is a really good argument against the first two patches, which
> > > would otherwise seem nice..
> > 
> > Which is why I have sent the "paths/" patchset in the first place. Because I
> > did some homework and read the previous discussion about this, and how lsblk's
> > failure to behave with slave links led to the revert of the slaves/holders
> > patch by Dr. Hannes.
> > 
> But you haven't answered my question:
> 
> Why can't we patch 'lsblk' to provide the required information even with the
> current sysfs layout?
> 

I think we could, but with my Ubuntu hat on, after the kernel fix for
initramfs-tools, that is, slaves/holders links, the user will get an updated
kernel that breaks the current lsblk on Bionic (Ubuntu 18.04). That will
require that we backport the lsblk fix, which is not only more work, but there
may be users who only update from -security, which is where kernel updates
regularly end up, but not this lsblk fix.

And that kernel update is a regression against that old lsblk version.

Cascardo.

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
Thadeu Lima de Souza Cascardo Dec. 14, 2018, 9:06 a.m. UTC | #14
On Fri, Dec 14, 2018 at 06:56:06AM -0200, Thadeu Lima de Souza Cascardo wrote:
> On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
> > But you haven't answered my question:
> > 
> > Why can't we patch 'lsblk' to provide the required information even with the
> > current sysfs layout?
> > 
> 

Just to be clear here. If with 'current sysfs layout' you mean without any of
the patches we have been talking about, lsblk is not broken. It just works with
nvme multipath enabled. It will show the multipath paths and simply ignore the
underlying/hidden ones. If we hid them, we meant for them to be hidden, right?

What I am trying to fix here is how to find out which PCI device/driver is
needed to get to the block device holding the root filesystem, which is what
initramfs needs. And the nvme multipath device is a virtual device, pointing to
no driver at all, with no relation to the underlying devices it needs to
work.
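
For reference, a sketch of what initramfs-tools ultimately needs (sysfs paths
assumed, error handling trimmed): follow the 'device' chain under the block
device and read the 'driver' symlink, which a virtual multipath head lacks:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: find the driver behind a block device via its sysfs links. */
static void print_driver(const char *sysfs_dir)
{
	char path[PATH_MAX], target[PATH_MAX];
	const char *base;
	ssize_t n;

	snprintf(path, sizeof(path), "%s/device/driver", sysfs_dir);
	n = readlink(path, target, sizeof(target) - 1);
	if (n < 0) {
		/* a virtual multipath head ends up here: nothing to load */
		fprintf(stderr, "%s: no driver link\n", sysfs_dir);
		return;
	}
	target[n] = '\0';
	base = strrchr(target, '/');
	printf("driver: %s\n", base ? base + 1 : target);
}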

Cascardo.

> I think we could, but with my Ubuntu hat on, after the kernel fix for
> initramfs-tools, that is, slaves/holders links, the user will get an updated
> kernel that breaks the current lsblk on Bionic (Ubuntu 18.04). That will
> require that we backport the lsblk fix, which is not only more work, but there
> may be users who only update from -security, which is where kernel updates end
> regularly, but not this lsblk fix.
> 
> And that kernel update is a regression against that old lsblk version.
> 
> Cascardo.
> 
> > Cheers,
> > 
> > Hannes
> > -- 
> > Dr. Hannes Reinecke		   Teamlead Storage & Networking
> > hare@suse.de			               +49 911 74053 688
> > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> > HRB 21284 (AG Nürnberg)
Hannes Reinecke Dec. 14, 2018, 9:44 a.m. UTC | #15
On 12/14/18 9:56 AM, Thadeu Lima de Souza Cascardo wrote:
> On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
>> On 12/13/18 4:25 PM, Thadeu Lima de Souza Cascardo wrote:
>>> On Thu, Dec 13, 2018 at 03:32:18PM +0100, Christoph Hellwig wrote:
>>>> On Thu, Dec 13, 2018 at 10:18:40AM +0100, Hannes Reinecke wrote:
>>>>> Well ... this is not just 'lsblk', but more importantly this will force
>>>>> udev to create _block_ device nodes for the hidden devices, essentially
>>>>> 'unhide' them.
>>>>>
>>>>> Is this what we want?
>>>>> Christoph?
>>>>> I thought the entire _point_ of having hidden devices is that they are ...
>>>>> well ... hidden ...
>>>>
>>>> Yes, that is why I really don't like the last two patches.
>>>>
>>>> And I've checked back - lsblk actually works just fine at the moment.
>>>> But it turns out once we create the slave links it stops working,
>>>> which is a really good argument against the first two patches, which
>>>> would otherwise seem nice..
>>>
>>> Which is why I have sent the "paths/" patchset in the first place. Because I
>>> did some homework and read the previous discussion about this, and how lsblk's
>>> failure to behave with slave links led to the revert of the slaves/holders
>>> patch by Dr. Hannes.
>>>
>> But you haven't answered my question:
>>
>> Why can't we patch 'lsblk' to provide the required information even with the
>> current sysfs layout?
>>
> 
> I think we could, but with my Ubuntu hat on, after the kernel fix for
> initramfs-tools, that is, slaves/holders links, the user will get an updated
> kernel that breaks the current lsblk on Bionic (Ubuntu 18.04). That will
> require that we backport the lsblk fix, which is not only more work, but there
> may be users who only update from -security, which is where kernel updates
> regularly end up, but not this lsblk fix.
> 
> And that kernel update is a regression against that old lsblk version.
> 
Now you've got me confused.
Which kernel update are you referring to?
We do _not_ provide any 'slave' link with the current upstream, so I 
somewhat fail to see which breakage you are referring to ...

Confused,

Hannes
Hannes Reinecke Dec. 14, 2018, 9:54 a.m. UTC | #16
On 12/14/18 10:06 AM, Thadeu Lima de Souza Cascardo wrote:
> On Fri, Dec 14, 2018 at 06:56:06AM -0200, Thadeu Lima de Souza Cascardo wrote:
>> On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
>>> But you haven't answered my question:
>>>
>>> Why can't we patch 'lsblk' to provide the required information even with the
>>> current sysfs layout?
>>>
>>
> 
> Just to be clear here. If with 'current sysfs layout' you mean without any of
> the patches we have been talking about, lsblk is not broken. It just works with
> nvme multipath enabled. It will show the multipath paths and simply ignore the
> underlying/hidden ones. If we hid them, we meant for them to be hidden, right?
> 
> What I am trying to fix here is how to find out which PCI device/driver is
> needed to get to the block device holding the root filesystem, which is what
> initramfs needs. And the nvme multipath device is a virtual device, pointing to
> no driver at all, with no relation to the underlying devices it needs to
> work.
> 

Well ...
But this is an entirely different proposition.
The 'slaves'/'holders' trick just allows mapping the relationship between
_block_ devices, which arguably is a bit pointless here seeing that we 
don't actually have block devices for the underlying devices.
But even if we _were_ implementing that, you would still fail to get to
the PCI device providing the block devices as there is no link pointing 
from one to another.

With the current layout we have this hierarchy:

NVMe namespace (/dev/nvmeXnY) -> NVMe-subsys -> NVMe controller

and the NVMe controller is missing a link pointing to the device 
presenting the controller:

# ls -l /sys/devices/virtual/nvme-fabrics/ctl/nvme2
total 0
-r--r--r-- 1 root root 4096 Dec 13 13:18 address
-r--r--r-- 1 root root 4096 Dec 13 13:18 cntlid
--w------- 1 root root 4096 Dec 13 13:18 delete_controller
-r--r--r-- 1 root root 4096 Dec 13 13:18 dev
lrwxrwxrwx 1 root root    0 Dec 13 13:18 device -> ../../ctl
-r--r--r-- 1 root root 4096 Dec 13 13:18 firmware_rev
-r--r--r-- 1 root root 4096 Dec 13 13:18 model
drwxr-xr-x 9 root root    0 Dec  3 13:55 nvme2c64n1
drwxr-xr-x 2 root root    0 Dec 13 13:18 power
--w------- 1 root root 4096 Dec 13 13:18 rescan_controller
--w------- 1 root root 4096 Dec 13 13:18 reset_controller
-r--r--r-- 1 root root 4096 Dec 13 13:18 serial
-r--r--r-- 1 root root 4096 Dec 13 13:18 state
-r--r--r-- 1 root root 4096 Dec 13 13:18 subsysnqn
lrwxrwxrwx 1 root root    0 Dec  3 13:44 subsystem -> ../../../../../class/nvme
-r--r--r-- 1 root root 4096 Dec 13 13:18 transport
-rw-r--r-- 1 root root 4096 Dec 13 13:18 uevent

So what we need to do is to update the 'device' link to point to the PCI 
device providing the controller.
(Actually, we would need the 'device' link to point to the entity providing
the transport address, but I guess we don't have that for now.)

And _that's_ what we need to fix; the slaves/holders stuff doesn't solve 
the underlying problem, and really shouldn't be merged at all.

Cheers,

Hannes
Hannes Reinecke Dec. 14, 2018, 10:18 a.m. UTC | #17
On 12/14/18 10:54 AM, Hannes Reinecke wrote:
> On 12/14/18 10:06 AM, Thadeu Lima de Souza Cascardo wrote:
>> On Fri, Dec 14, 2018 at 06:56:06AM -0200, Thadeu Lima de Souza Cascardo wrote:
>>> On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
>>>> But you haven't answered my question:
>>>>
>>>> Why can't we patch 'lsblk' to provide the required information even
>>>> with the current sysfs layout?
>>>>
>>>
>>
>> Just to be clear here. If with 'current sysfs layout' you mean without
>> any of the patches we have been talking about, lsblk is not broken. It
>> just works with nvme multipath enabled. It will show the multipath paths
>> and simply ignore the underlying/hidden ones. If we hid them, we meant
>> for them to be hidden, right?
>>
>> What I am trying to fix here is how to find out which PCI device/driver
>> is needed to get to the block device holding the root filesystem, which
>> is what initramfs needs. And the nvme multipath device is a virtual
>> device, pointing to no driver at all, with no relation to the underlying
>> devices it needs to work.
>>
> 
> Well ...
> But this is an entirely different proposition.
> The 'slaves'/'holders' trick just allows mapping the relationship between
> _block_ devices, which arguably is a bit pointless here seeing that we 
> don't actually have block devices for the underlying devices.
> But even if we _were_ implementing that, you would still fail to get to
> the PCI device providing the block devices as there is no link pointing 
> from one to another.
> 
> With the current layout we have this hierarchy:
> 
> NVMe namespace (/dev/nvmeXnY) -> NVMe-subsys -> NVMe controller
> 
> and the NVMe controller is missing a link pointing to the device 
> presenting the controller:
> 
> # ls -l /sys/devices/virtual/nvme-fabrics/ctl/nvme2
> total 0
> -r--r--r-- 1 root root 4096 Dec 13 13:18 address
> -r--r--r-- 1 root root 4096 Dec 13 13:18 cntlid
> --w------- 1 root root 4096 Dec 13 13:18 delete_controller
> -r--r--r-- 1 root root 4096 Dec 13 13:18 dev
> lrwxrwxrwx 1 root root    0 Dec 13 13:18 device -> ../../ctl
> -r--r--r-- 1 root root 4096 Dec 13 13:18 firmware_rev
> -r--r--r-- 1 root root 4096 Dec 13 13:18 model
> drwxr-xr-x 9 root root    0 Dec  3 13:55 nvme2c64n1
> drwxr-xr-x 2 root root    0 Dec 13 13:18 power
> --w------- 1 root root 4096 Dec 13 13:18 rescan_controller
> --w------- 1 root root 4096 Dec 13 13:18 reset_controller
> -r--r--r-- 1 root root 4096 Dec 13 13:18 serial
> -r--r--r-- 1 root root 4096 Dec 13 13:18 state
> -r--r--r-- 1 root root 4096 Dec 13 13:18 subsysnqn
> lrwxrwxrwx 1 root root    0 Dec  3 13:44 subsystem -> 
> ../../../../../class/nvme
> -r--r--r-- 1 root root 4096 Dec 13 13:18 transport
> -rw-r--r-- 1 root root 4096 Dec 13 13:18 uevent
> 
> So what we need to do is to update the 'device' link to point to the PCI 
> device providing the controller.
> (Actually, we would need to point the 'device' link to point to the 
> entity providing the transport address, but I guess we don't have that 
> for now.)
> 
> And _that's_ what we need to fix; the slaves/holders stuff doesn't solve 
> the underlying problem, and really shouldn't be merged at all.
> 
Mind you, it _does_ work for PCI-NVMe:

# ls -l /sys/class/nvme/nvme0
total 0
-r--r--r--  1 root root 4096 Dec 14 11:14 cntlid
-r--r--r--  1 root root 4096 Dec 14 11:14 dev
lrwxrwxrwx  1 root root    0 Dec 14 11:14 device -> ../../../0000:45:00.0
-r--r--r--  1 root root 4096 Dec 14 11:14 firmware_rev
-r--r--r--  1 root root 4096 Dec 14 11:14 model
drwxr-xr-x 12 root root    0 Dec  3 13:43 nvme1n1
drwxr-xr-x  2 root root    0 Dec 14 11:14 power
--w-------  1 root root 4096 Dec 14 11:14 rescan_controller
--w-------  1 root root 4096 Dec 14 11:14 reset_controller
-r--r--r--  1 root root 4096 Dec 14 11:14 serial
-r--r--r--  1 root root 4096 Dec 14 11:14 state
-r--r--r--  1 root root 4096 Dec 14 11:14 subsysnqn
lrwxrwxrwx  1 root root    0 Dec  3 13:43 subsystem -> ../../../../../../class/nvme
-r--r--r--  1 root root 4096 Dec 14 11:14 transport
-rw-r--r--  1 root root 4096 Dec 14 11:14 uevent

So it might be as simple as this patch:

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index feb86b59170e..1ecdec6b8b4a 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3117,7 +3117,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
          * Defer this to the connect path.
          */

-       ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_fc_ctrl_ops, 0);
+       ret = nvme_init_ctrl(&ctrl->ctrl, ctrl->dev, &nvme_fc_ctrl_ops, 0);
         if (ret)
                 goto out_cleanup_admin_q;


As for RDMA / TCP, we're running on a network address which really isn't
tied to a specific device, so we wouldn't have any device to hook onto
without some trickery.

Cheers,

Hannes
Thadeu Lima de Souza Cascardo Dec. 14, 2018, 11:09 a.m. UTC | #18
On Fri, Dec 14, 2018 at 10:54:04AM +0100, Hannes Reinecke wrote:
> On 12/14/18 10:06 AM, Thadeu Lima de Souza Cascardo wrote:
> > On Fri, Dec 14, 2018 at 06:56:06AM -0200, Thadeu Lima de Souza Cascardo wrote:
> > > On Fri, Dec 14, 2018 at 08:47:20AM +0100, Hannes Reinecke wrote:
> > > > But you haven't answered my question:
> > > > 
> > > > Why can't we patch 'lsblk' to provide the required information even with the
> > > > current sysfs layout?
> > > > 
> > > 
> > 
> > Just to be clear here. If with 'current sysfs layout' you mean without any of
> > the patches we have been talking about, lsblk is not broken. It just works with
> > nvme multipath enabled. It will show the multipath paths and simply ignore the
> > underlying/hidden ones. If we hid them, we meant for them to be hidden, right?
> > 
> > What I am trying to fix here is how to find out which PCI device/driver is
> > needed to get to the block device holding the root filesystem, which is what
> > initramfs needs. And the nvme multipath device is a virtual device, pointing to
> > no driver at all, with no relation to the underlying devices it needs to
> > work.
> > 
> 
> Well ...
> But this is an entirely different proposition.
> The 'slaves'/'holders' trick just allows mapping the relationship between
> _block_ devices, which arguably is a bit pointless here seeing that we don't
> actually have block devices for the underlying devices.
> But even if we _were_ implementing that, you would still fail to get to the
> PCI device providing the block devices as there is no link pointing from one
> to another.

Well, initramfs-tools does traverse the slaves because your root filesystem
could be on top of a device-mapped block device. I will try your other patch in
any case and see if it fixes the problem I have at hand.

Thanks.
Cascardo.

> 
> With the current layout we have this hierarchy:
> 
> NVMe namespace (/dev/nvmeXnY) -> NVMe-subsys -> NVMe controller
> 
> and the NVMe controller is missing a link pointing to the device presenting
> the controller:
> 
> # ls -l /sys/devices/virtual/nvme-fabrics/ctl/nvme2
> total 0
> -r--r--r-- 1 root root 4096 Dec 13 13:18 address
> -r--r--r-- 1 root root 4096 Dec 13 13:18 cntlid
> --w------- 1 root root 4096 Dec 13 13:18 delete_controller
> -r--r--r-- 1 root root 4096 Dec 13 13:18 dev
> lrwxrwxrwx 1 root root    0 Dec 13 13:18 device -> ../../ctl
> -r--r--r-- 1 root root 4096 Dec 13 13:18 firmware_rev
> -r--r--r-- 1 root root 4096 Dec 13 13:18 model
> drwxr-xr-x 9 root root    0 Dec  3 13:55 nvme2c64n1
> drwxr-xr-x 2 root root    0 Dec 13 13:18 power
> --w------- 1 root root 4096 Dec 13 13:18 rescan_controller
> --w------- 1 root root 4096 Dec 13 13:18 reset_controller
> -r--r--r-- 1 root root 4096 Dec 13 13:18 serial
> -r--r--r-- 1 root root 4096 Dec 13 13:18 state
> -r--r--r-- 1 root root 4096 Dec 13 13:18 subsysnqn
> lrwxrwxrwx 1 root root    0 Dec  3 13:44 subsystem -> ../../../../../class/nvme
> -r--r--r-- 1 root root 4096 Dec 13 13:18 transport
> -rw-r--r-- 1 root root 4096 Dec 13 13:18 uevent
> 
> So what we need to do is to update the 'device' link to point to the PCI
> device providing the controller.
> (Actually, we would need the 'device' link to point to the entity
> providing the transport address, but I guess we don't have that for now.)
> 
> And _that's_ what we need to fix; the slaves/holders stuff doesn't solve the
> underlying problem, and really shouldn't be merged at all.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 7674fce32fca..65a7fa664188 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -687,6 +687,7 @@  static void __device_add_disk(struct device *parent, struct gendisk *disk,
 
 	disk_alloc_events(disk);
 
+	disk_to_dev(disk)->devt = devt;
 	if (disk->flags & GENHD_FL_HIDDEN) {
 		/*
 		 * Don't let hidden disks show up in /proc/partitions,
@@ -698,13 +699,12 @@  static void __device_add_disk(struct device *parent, struct gendisk *disk,
 		int ret;
 
 		/* Register BDI before referencing it from bdev */
-		disk_to_dev(disk)->devt = devt;
 		ret = bdi_register_owner(disk->queue->backing_dev_info,
 						disk_to_dev(disk));
 		WARN_ON(ret);
-		blk_register_region(disk_devt(disk), disk->minors, NULL,
-				    exact_match, exact_lock, disk);
 	}
+	blk_register_region(disk_devt(disk), disk->minors, NULL,
+			    exact_match, exact_lock, disk);
 	register_disk(parent, disk, groups);
 	if (register_queue)
 		blk_register_queue(disk);
@@ -776,8 +776,7 @@  void del_gendisk(struct gendisk *disk)
 		WARN_ON(1);
 	}
 
-	if (!(disk->flags & GENHD_FL_HIDDEN))
-		blk_unregister_region(disk_devt(disk), disk->minors);
+	blk_unregister_region(disk_devt(disk), disk->minors);
 
 	kobject_put(disk->part0.holder_dir);
 	kobject_put(disk->slave_dir);