[driver-core,v6,9/9] libnvdimm: Schedule device registration on node local to the device

Message ID 154170044652.12967.17419321472770956712.stgit@ahduyck-desk1.jf.intel.com (mailing list archive)
State Superseded, archived
Series Add NUMA aware async_schedule calls

Commit Message

Alexander Duyck Nov. 8, 2018, 6:07 p.m. UTC
Force the device registration for nvdimm devices to be closer to the actual
device. This is achieved by using either the NUMA node ID of the region, or
of the parent. By doing this we can have everything above the region based
on the region, and everything below the region based on the nvdimm bus.

By guaranteeing NUMA locality I see an improvement of as high as 25% for
per-node init of a system with 12TB of persistent memory.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/nvdimm/bus.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
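
For reference, async_schedule_dev_domain() is one of the wrappers added
earlier in this series. Roughly (paraphrasing the include/linux/async.h
additions; exact details may differ), it forwards to the node-aware
scheduling primitive with the device's NUMA node:

    static inline async_cookie_t
    async_schedule_dev_domain(async_func_t func, struct device *dev,
                              struct async_domain *domain)
    {
            /* Pass the device as the callback data and queue the work
             * on the device's NUMA node. */
            return async_schedule_node_domain(func, dev, dev_to_node(dev),
                                              domain);
    }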

Comments

Dan Williams Nov. 27, 2018, 2:21 a.m. UTC | #1
On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> Force the device registration for nvdimm devices to be closer to the actual
> device. This is achieved by using either the NUMA node ID of the region, or
> of the parent. By doing this we can have everything above the region based
> on the region, and everything below the region based on the nvdimm bus.
>
> By guaranteeing NUMA locality I see an improvement of as high as 25% for
> per-node init of a system with 12TB of persistent memory.
>

It seems the speed-up is achieved with just patches 1, 2, and 9 from
this series, correct? I wouldn't want to hold up that benefit while
the driver-core bits are debated.

You can add:

    Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...if the series needs to be kept together, but as far as I can see
the workqueue changes enable 2 sub-topics of development and it might
make sense for Tejun to take those first 2 and then Greg and I can
base any follow-up topics on that stable baseline.
Alexander Duyck Nov. 27, 2018, 6:04 p.m. UTC | #2
On Mon, 2018-11-26 at 18:21 -0800, Dan Williams wrote:
> On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > Force the device registration for nvdimm devices to be closer to the actual
> > device. This is achieved by using either the NUMA node ID of the region, or
> > of the parent. By doing this we can have everything above the region based
> > on the region, and everything below the region based on the nvdimm bus.
> > 
> > By guaranteeing NUMA locality I see an improvement of as high as 25% for
> > per-node init of a system with 12TB of persistent memory.
> > 
> 
> It seems the speed-up is achieved with just patches 1, 2, and 9 from
> this series, correct? I wouldn't want to hold up that benefit while
> the driver-core bits are debated.

Actually patch 6 ends up impacting things for persistent memory as
well. The problem is that all the async calls to add interfaces only do
anything if the driver is already loaded. So there are cases such as
the X86_PMEM_LEGACY_DEVICE case where the memory regions end up still
being serialized because the devices are added before the driver.
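
To illustrate the serialization (a rough sketch of the pre-series driver
core behavior, reconstructed from memory rather than taken from this
patch set): when a driver that allows async probing registers after its
devices, a single async task walks all of them, so the individual probes
still run back to back:

    /* names approximate; not part of this series */
    static void driver_attach_async(void *_drv, async_cookie_t cookie)
    {
            struct device_driver *drv = _drv;

            /* driver_attach() iterates every matching device in order,
             * so probes of pre-registered devices all run serially
             * inside this one async context. */
            driver_attach(drv);
    }

    /* in bus_add_driver(): one async task for the whole bus walk */
    if (driver_allows_async_probing(drv))
            async_schedule(driver_attach_async, drv);
    else
            error = driver_attach(drv);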

> You can add:
> 
>     Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> ...if the series needs to be kept together, but as far as I can see
> the workqueue changes enable 2 sub-topics of development and it might
> make sense for Tejun to take those first 2 and then Greg and I can
> base any follow-up topics on that stable baseline.

I had originally put this out there for Tejun to apply, but he and
Greg talked and Greg agreed to apply the set. If it works for you, I
would prefer to keep it together for now, as I don't believe there
will be many more revisions of this needed.
Dan Williams Nov. 27, 2018, 7:34 p.m. UTC | #3
On Tue, Nov 27, 2018 at 10:04 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> On Mon, 2018-11-26 at 18:21 -0800, Dan Williams wrote:
> > On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> > >
> > > Force the device registration for nvdimm devices to be closer to the actual
> > > device. This is achieved by using either the NUMA node ID of the region, or
> > > of the parent. By doing this we can have everything above the region based
> > > on the region, and everything below the region based on the nvdimm bus.
> > >
> > > By guaranteeing NUMA locality I see an improvement of as high as 25% for
> > > per-node init of a system with 12TB of persistent memory.
> > >
> >
> > It seems the speed-up is achieved with just patches 1, 2, and 9 from
> > this series, correct? I wouldn't want to hold up that benefit while
> > the driver-core bits are debated.
>
> Actually patch 6 ends up impacting things for persistent memory as
> well. The problem is that all the async calls to add interfaces only do
> anything if the driver is already loaded. So there are cases such as
> the X86_PMEM_LEGACY_DEVICE case where the memory regions end up still
> being serialized because the devices are added before the driver.

Ok, but is the patch 6 change generally useful outside of the
libnvdimm case? Yes, local hacks like MODULE_SOFTDEP are terrible for
global problems, but what I'm trying to tease out is whether this
change benefits other async probing subsystems outside of libnvdimm,
SCSI perhaps? Bart, can you chime in with the benefits you see so it's
clear to Greg that the driver-core changes are a generic improvement?

> > You can add:
> >
> >     Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...if the series needs to be kept together, but as far as I can see
> > the workqueue changes enable 2 sub-topics of development and it might
> > make sense for Tejun to take those first 2 and then Greg and I can
> > base any follow-up topics on that stable baseline.
>
> I had originally put this out there for Tejun to apply, but he and
> Greg talked and Greg agreed to apply the set. If it works for you, I
> would prefer to keep it together for now, as I don't believe there
> will be many more revisions of this needed.
>

That works for me.
Bart Van Assche Nov. 27, 2018, 8:33 p.m. UTC | #4
On Tue, 2018-11-27 at 11:34 -0800, Dan Williams wrote:
> On Tue, Nov 27, 2018 at 10:04 AM Alexander Duyck
> <alexander.h.duyck@linux.intel.com> wrote:
> > 
> > On Mon, 2018-11-26 at 18:21 -0800, Dan Williams wrote:
> > > On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
> > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > 
> > > > Force the device registration for nvdimm devices to be closer to the actual
> > > > device. This is achieved by using either the NUMA node ID of the region, or
> > > > of the parent. By doing this we can have everything above the region based
> > > > on the region, and everything below the region based on the nvdimm bus.
> > > > 
> > > > By guaranteeing NUMA locality I see an improvement of as high as 25% for
> > > > per-node init of a system with 12TB of persistent memory.
> > > > 
> > > 
> > > It seems the speed-up is achieved with just patches 1, 2, and 9 from
> > > this series, correct? I wouldn't want to hold up that benefit while
> > > the driver-core bits are debated.
> > 
> > Actually patch 6 ends up impacting things for persistent memory as
> > well. The problem is that all the async calls to add interfaces only do
> > anything if the driver is already loaded. So there are cases such as
> > the X86_PMEM_LEGACY_DEVICE case where the memory regions end up still
> > being serialized because the devices are added before the driver.
> 
> Ok, but is the patch 6 change generally useful outside of the
> libnvdimm case? Yes, local hacks like MODULE_SOFTDEP are terrible for
> global problems, but what I'm trying to tease out is whether this
> change benefits other async probing subsystems outside of libnvdimm,
> SCSI perhaps? Bart, can you chime in with the benefits you see so it's
> clear to Greg that the driver-core changes are a generic improvement?

Hi Dan,

For SCSI, asynchronous probing is really important because when scanning SAN
LUNs there is plenty of potential for concurrency due to the network delay.

I think the following quote provides the information you are looking for:

"This patch reduces the time needed for loading the scsi_debug kernel
module with parameters delay=0 and max_luns=256 from 0.7s to 0.1s. In
other words, this specific test runs about seven times faster."

Source: https://www.spinics.net/lists/linux-scsi/msg124457.html

Best regards,

Bart.
Dan Williams Nov. 27, 2018, 8:50 p.m. UTC | #5
On Tue, Nov 27, 2018 at 12:33 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Tue, 2018-11-27 at 11:34 -0800, Dan Williams wrote:
> > On Tue, Nov 27, 2018 at 10:04 AM Alexander Duyck
> > <alexander.h.duyck@linux.intel.com> wrote:
> > >
> > > On Mon, 2018-11-26 at 18:21 -0800, Dan Williams wrote:
> > > > On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
> > > > <alexander.h.duyck@linux.intel.com> wrote:
> > > > >
> > > > > Force the device registration for nvdimm devices to be closer to the actual
> > > > > device. This is achieved by using either the NUMA node ID of the region, or
> > > > > of the parent. By doing this we can have everything above the region based
> > > > > on the region, and everything below the region based on the nvdimm bus.
> > > > >
> > > > > By guaranteeing NUMA locality I see an improvement of as high as 25% for
> > > > > per-node init of a system with 12TB of persistent memory.
> > > > >
> > > >
> > > > It seems the speed-up is achieved with just patches 1, 2, and 9 from
> > > > this series, correct? I wouldn't want to hold up that benefit while
> > > > the driver-core bits are debated.
> > >
> > > Actually patch 6 ends up impacting things for persistent memory as
> > > well. The problem is that all the async calls to add interfaces only do
> > > anything if the driver is already loaded. So there are cases such as
> > > the X86_PMEM_LEGACY_DEVICE case where the memory regions end up still
> > > being serialized because the devices are added before the driver.
> >
> > Ok, but is the patch 6 change generally useful outside of the
> > libnvdimm case? Yes, local hacks like MODULE_SOFTDEP are terrible for
> > global problems, but what I'm trying to tease out is whether this
> > change benefits other async probing subsystems outside of libnvdimm,
> > SCSI perhaps? Bart, can you chime in with the benefits you see so it's
> > clear to Greg that the driver-core changes are a generic improvement?
>
> Hi Dan,
>
> For SCSI, asynchronous probing is really important because when scanning SAN
> LUNs there is plenty of potential for concurrency due to the network delay.
>
> I think the following quote provides the information you are looking for:
>
> "This patch reduces the time needed for loading the scsi_debug kernel
> module with parameters delay=0 and max_luns=256 from 0.7s to 0.1s. In
> other words, this specific test runs about seven times faster."
>
> Source: https://www.spinics.net/lists/linux-scsi/msg124457.html

Thanks Bart, so tying this back to Alex's patches, does the ordering
problem that Alex's patches solve impact the SCSI case? I'm looking
for something like "SCSI depends on asynchronous probing and without
'driver core: Establish clear order of operations for deferred probe
and remove' probing is often needlessly serialized". I.e. does it
suffer from the same platform problem that libnvdimm ran into where
its local async probing implementation was hindered by the driver
core?
Bart Van Assche Nov. 27, 2018, 9:22 p.m. UTC | #6
On Tue, 2018-11-27 at 12:50 -0800, Dan Williams wrote:
> Thanks Bart, so tying this back to Alex's patches, does the ordering
> problem that Alex's patches solve impact the SCSI case? I'm looking
> for something like "SCSI depends on asynchronous probing and without
> 'driver core: Establish clear order of operations for deferred probe
> and remove' probing is often needlessly serialized". I.e. does it
> suffer from the same platform problem that libnvdimm ran into where
> its local async probing implementation was hindered by the driver
> core?

(+Martin)

Hi Dan,

Patch 6/9 reduces the time needed to scan SCSI LUNs significantly. The only
way to realize that speedup is by enabling more concurrency. That's why I
think that patch 6/9 is a significant driver core improvement.

Bart.
Dan Williams Nov. 27, 2018, 10:34 p.m. UTC | #7
On Tue, Nov 27, 2018 at 1:22 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Tue, 2018-11-27 at 12:50 -0800, Dan Williams wrote:
> > Thanks Bart, so tying this back to Alex's patches, does the ordering
> > problem that Alex's patches solve impact the SCSI case? I'm looking
> > for something like "SCSI depends on asynchronous probing and without
> > 'driver core: Establish clear order of operations for deferred probe
> > and remove' probing is often needlessly serialized". I.e. does it
> > suffer from the same platform problem that libnvdimm ran into where
> > it's local async probing implementation was hindered by the driver
> > core?
>
> (+Martin)
>
> Hi Dan,
>
> Patch 6/9 reduces the time needed to scan SCSI LUNs significantly. The only
> way to realize that speedup is by enabling more concurrency. That's why I
> think that patch 6/9 is a significant driver core improvement.

Perfect. Alex, with that added to the 6/9 changelog and the move to
device_private for the async state tracking, you can add my
Reviewed-by.

Patch

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index f1fb39921236..b1e193541874 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -23,6 +23,7 @@ 
 #include <linux/ndctl.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/cpu.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
@@ -513,11 +514,15 @@  void __nd_device_register(struct device *dev)
 		set_dev_node(dev, to_nd_region(dev)->numa_node);
 
 	dev->bus = &nvdimm_bus_type;
-	if (dev->parent)
+	if (dev->parent) {
 		get_device(dev->parent);
+		if (dev_to_node(dev) == NUMA_NO_NODE)
+			set_dev_node(dev, dev_to_node(dev->parent));
+	}
 	get_device(dev);
-	async_schedule_domain(nd_async_device_register, dev,
-			&nd_async_domain);
+
+	async_schedule_dev_domain(nd_async_device_register, dev,
+				  &nd_async_domain);
 }
 
 void nd_device_register(struct device *dev)
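
For readers tracing the node selection: after this patch the effective
order is (1) region devices take the region's numa_node, (2) a device
still at NUMA_NO_NODE inherits its parent's node, and (3)
async_schedule_dev_domain() queues the registration work on the
resulting node. A hypothetical helper condensing that logic
(nd_registration_node() does not exist in the tree; it only summarizes
__nd_device_register() above):

    static int nd_registration_node(struct device *dev)
    {
            /* Region devices carry their own NUMA node. */
            if (is_nd_region(dev))
                    return to_nd_region(dev)->numa_node;
            /* Everything below the region falls back to its parent. */
            if (dev_to_node(dev) == NUMA_NO_NODE && dev->parent)
                    return dev_to_node(dev->parent);
            return dev_to_node(dev);
    }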