diff mbox series

nvme: Use first ctrl->instance id as subsystem id

Message ID 20190814142836.2322-1-gpiccoli@canonical.com (mailing list archive)
State New, archived
Headers show
Series nvme: Use first ctrl->instance id as subsystem id | expand

Commit Message

Guilherme Piccoli Aug. 14, 2019, 2:28 p.m. UTC
Since the introduction of NVMe multipath, we have a struct to track
subsystems and, more importantly, the nvme block device name is now
bound to the subsystem id instead of ctrl->instance as before.
This is not a big problem: users can even fall back to the old behavior
with the module parameter "nvme_core.multipath=N" if they don't use
multipath and wish to have a consistent mapping between the char
device nvmeX and the block device nvmeXnY.

That said, we noticed that the nvme subsystem id is generated by its
own ID allocator, and that ctrl->instance has its own ID allocator too.
The controller instance is allocated during probe, in nvme_init_ctrl(),
which always executes before nvme_init_subsystem().
Since, according to the spec, the relation between subsystems and
controllers is 1:N (one subsystem may contain multiple controllers,
but not the reverse), why not use the ctrl->instance id as the
subsystem id?

This patch does exactly this: removes the ID allocator for subsystems
and use the unique controller instance id as subsystem id. The patch
was tested in multiple scenarios, like:

* multiple controllers (each one tied to its own subsystem);
* 2 controllers in a single subsystem (using emulated nvme controllers
from qemu). In this case, subsystem gets the id of its 1st controller;
* NVMEoF/TCP with this patch in both target and host kernels.

All test cases worked as expected. With this patch, the "coincidence"
of the char device number matching the block device number is
guaranteed for single-controller subsystems (the most usual scenario),
even without disabling multipath. This is useful in scenarios that mix
multipath'ed controllers with solo controllers.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 drivers/nvme/host/core.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

Comments

Keith Busch Aug. 14, 2019, 4:06 p.m. UTC | #1
On Wed, Aug 14, 2019 at 07:28:36AM -0700, Guilherme G. Piccoli wrote:
> [...]
> Since, according to the spec, the relation between subsystems and
> controllers is 1:N (one subsystem may contain multiple controllers,
> but not the reverse), why not use the ctrl->instance id as the
> subsystem id?

The subsystem lifetime is not tied to a single controller's. Disconnect
the "first" controller in a multipathed subsystem with this patch, then
connect another controller from a different subsystem, and now you will
create naming collisions.
Guilherme Piccoli Aug. 14, 2019, 4:18 p.m. UTC | #2
On 14/08/2019 13:06, Keith Busch wrote:
> On Wed, Aug 14, 2019 at 07:28:36AM -0700, Guilherme G. Piccoli wrote:
>>[...]
> 
> The subsystem lifetime is not tied to a single controller's. Disconnect
> the "first" controller in a multipathed subsystem with this patch, then
> connect another controller from a different subsystem, and now you will
> create naming collisions.
> 

Hi Keith, thanks for the clarification. Isn't the controller id unique?
Could a newly connected controller from a different subsystem get the
same id? I'd appreciate a rough example.

But given the above, what do you think of trying ctrl->instance first
and, in case of a duplicate, falling back to the subsystem ID
allocator?

Since the creation of subsystems is not a critical path, this small
extra burden shouldn't be a big penalty, and it would help a lot with
the many reports of "confusion" since the introduction of nvme
multipathing. It also helps the case I mentioned in the description:
some multipath'ed controllers, some single ones.

Cheers,


Guilherme
Keith Busch Aug. 14, 2019, 4:27 p.m. UTC | #3
On Wed, Aug 14, 2019 at 09:18:22AM -0700, Guilherme G. Piccoli wrote:
> On 14/08/2019 13:06, Keith Busch wrote:
> > On Wed, Aug 14, 2019 at 07:28:36AM -0700, Guilherme G. Piccoli wrote:
> >>[...]
> > 
> > The subsystem lifetime is not tied to a single controller's. Disconnect
> > the "first" controller in a multipathed subsystem with this patch, then
> > connect another controller from a different subsystem, and now you will
> > create naming collisions.
> > 
> 
> Hi Keith, thanks for the clarification. Isn't the controller id unique?
> Could a newly connected controller from a different subsystem get the
> same id? I'd appreciate a rough example.

Sure, start with nvme subsystem A, with the host connected to two
controllers, X and Y.

 ctrl X gets instance 0, which you assign to the newly discovered
 subsystem

 ctrl Y gets instance 1

 disconnect ctrl X, which releases instance 0 back to the allocator

 connect to a new ctrl Z in new subsystem B: ctrl Z gets the first
 available instance, which is now 0, and you assign that name to the new
 subsystem, colliding with the sysfs nvme-subsys entries we've created
 for subsys A, as well as any namespaces.
 
> But given the above, what do you think of trying ctrl->instance first
> and, in case of a duplicate, falling back to the subsystem ID
> allocator?

At the point we assign the subsystem identifier, we're locked into using
that for the namespace names, which may be discovered long before we're
aware the host has multiple connections to the same subsystem.

I think it'd be better to just completely disassociate any notion of
relationships based on names. The following patch enforces that way of
thinking:

  http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html
Guilherme Piccoli Aug. 14, 2019, 6:29 p.m. UTC | #4
On 14/08/2019 13:27, Keith Busch wrote:
> On Wed, Aug 14, 2019 at 09:18:22AM -0700, Guilherme G. Piccoli wrote:
>> [...]If you can give a rough example I appreciate.
> 
> Sure, start with nvme subsystem A, with the host connected to two
> controllers, X and Y.
> 
>  ctrl X gets instance 0, which you assign to the newly discovered
>  subsystem
> 
>  ctrl Y gets instance 1
> 
>  disconnect ctrl X, which releases instance 0 back to the allocator
> 

Thanks a lot for the example, Keith! This is the point I missed: the id
being returned to the allocator. If the kernel had an IDA variant
without this reuse property, my idea could work!

>  connect to a new ctrl Z in new subsystem B: ctrl Z gets the first
>  available instance, which is now 0, and you assign that name to the new
>  subsystem, colliding with the sysfs nvme-subsys entries we've created
>  for subsys A, as well as any namespaces.
>  
>> But given the above, what do you think of trying ctrl->instance first
>> and, in case of a duplicate, falling back to the subsystem ID
>> allocator?
> 
> At the point we assign the subsystem identifier, we're locked into using
> that for the namespace names, which may be discovered long before we're
> aware the host has multiple connections to the same subsystem.
> 
> I think it'd be better to just completely disassociate any notion of
> relationships based on names. The following patch enforces that way of
> thinking:
> 
>   http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html
> 

Interesting thread, thanks for the pointer. I think no matter what we
do on this front (disassociating the nvme numbering relation), there
will always be confusion, especially since we broke assumptions that
had been valid since nvme day-0. Instead of reusing the numbering
positions in the naming, would it be possible to add more fields?
It is a suggestion from my colleague Dan (CCed here), something like:
for non-multipath nvme, keep nvmeC and nvmeCnN (C=controller ida,
N=namespace); for multipath nvme, use nvmeScCnN (S=subsystem ida).

Let me know your thoughts, and thanks again for the prompt response!
Cheers,


Guilherme
Keith Busch Aug. 14, 2019, 7:08 p.m. UTC | #5
On Wed, Aug 14, 2019 at 11:29:17AM -0700, Guilherme G. Piccoli wrote:
> It is a suggestion from my colleague Dan (CCed here), something like:
> for non-multipath nvme, keep nvmeC and nvmeCnN (C=controller ida,
> N=namespace); for multipath nvme, use nvmeScCnN (S=subsystem ida).

This will inevitably lead to collisions. The existing naming scheme was
selected specifically to avoid that problem.
Guilherme Piccoli Aug. 15, 2019, 1:01 p.m. UTC | #6
On 14/08/2019 16:08, Keith Busch wrote:
> On Wed, Aug 14, 2019 at 11:29:17AM -0700, Guilherme G. Piccoli wrote:
>> It is a suggestion from my colleague Dan (CCed here), something like:
>> for non-multipath nvme, keep nvmeC and nvmeCnN (C=controller ida,
>> N=namespace); for multipath nvme, use nvmeScCnN (S=subsystem ida).
> 
> This will inevitably lead to collisions. The existing naming scheme was
> selected specifically to avoid that problem.
> 

Thanks again for the discussion, Keith! So I guess the way to go is
really the kernel parameter, in case users want to fall back to the old
numbering (and only if they're not using multipath).

Cheers,


Guilherme

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cca270836892..60631e856b41 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@  EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2316,7 +2315,6 @@  static void nvme_release_subsystem(struct device *dev)
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
 
-	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
 	kfree(subsys);
 }
 
@@ -2445,12 +2443,8 @@  static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
-	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-	if (ret < 0) {
-		kfree(subsys);
-		return ret;
-	}
-	subsys->instance = ret;
+
+	subsys->instance = ctrl->instance;
 	mutex_init(&subsys->lock);
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
@@ -4051,7 +4045,6 @@  static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);