
[RFC,workqueue/driver-core,1/5] workqueue: Provide queue_work_near to queue work near a given NUMA node

Message ID 20180926215138.13512.33146.stgit@localhost.localdomain (mailing list archive)
State Superseded
Series Add NUMA aware async_schedule calls

Commit Message

Alexander Duyck Sept. 26, 2018, 9:51 p.m. UTC
This patch provides a new function, queue_work_near, which is meant to
schedule work on the unbound CPU nearest to the requested NUMA node. The
main motivation for this is to help asynchronous init improve boot times
for devices that are local to a specific node.
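
For example (illustrative use only, not part of this patch), a driver
could queue work on an unbound CPU near its device with:

	queue_work_near(dev_to_node(dev), system_unbound_wq, &work);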

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |  129 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 129 insertions(+), 2 deletions(-)

Comments

Tejun Heo Sept. 26, 2018, 9:53 p.m. UTC | #1
Hello,

On Wed, Sep 26, 2018 at 02:51:38PM -0700, Alexander Duyck wrote:
> This patch provides a new function, queue_work_near, which is meant to
> schedule work on the unbound CPU nearest to the requested NUMA node. The
> main motivation for this is to help asynchronous init improve boot times
> for devices that are local to a specific node.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Why not just use unbound workqueues, which are NUMA-affine by default?
Are there big enough advantages?

Thanks.
Alexander Duyck Sept. 26, 2018, 10:05 p.m. UTC | #2
On 9/26/2018 2:53 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 02:51:38PM -0700, Alexander Duyck wrote:
>> This patch provides a new function, queue_work_near, which is meant to
>> schedule work on the unbound CPU nearest to the requested NUMA node. The
>> main motivation for this is to help asynchronous init improve boot times
>> for devices that are local to a specific node.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Why not just use unbound workqueues, which are NUMA-affine by default?
> Are there big enough advantages?
> 
> Thanks.

I am using unbound workqueues. However there isn't an interface that 
exposes the NUMA bits of them directly. All I am doing with this patch 
is adding "queue_work_near" which takes a NUMA node as an argument and 
then copies the logic of "queue_work_on" with the exception that I am 
doing a check to verify that there is an intersection between 
wq_unbound_cpumask and the cpumask of the node, and then passing a CPU 
from that intersection into "__queue_work".
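
Condensed down, the selection amounts to something like this (a
simplified sketch with an illustrative name; the version in the patch
below also handles offline nodes and falls back to the nearest node
with usable CPUs):

	/* Simplified illustration of the node-to-CPU selection. */
	static int select_unbound_cpu_near(int node)
	{
		const struct cpumask *mask = cpumask_of_node(node);

		/* No usable CPU on this node, let __queue_work pick one. */
		if (!cpumask_intersects(mask, wq_unbound_cpumask))
			return WORK_CPU_UNBOUND;

		return cpumask_any_and(mask, wq_unbound_cpumask);
	}

The CPU that comes out of that is then passed into "__queue_work", the
same way "queue_work_on" does it.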

Thanks.

- Alex
Tejun Heo Sept. 26, 2018, 10:09 p.m. UTC | #3
Hello,

On Wed, Sep 26, 2018 at 03:05:17PM -0700, Alexander Duyck wrote:
> I am using unbound workqueues. However there isn't an interface that
> exposes the NUMA bits of them directly. All I am doing with this
> patch is adding "queue_work_near" which takes a NUMA node as an
> argument and then copies the logic of "queue_work_on" with the
> exception that I am doing a check to verify that there is an
> intersection between wq_unbound_cpumask and the cpumask of the node,
> and then passing a CPU from that intersection into "__queue_work".

Can it just take a cpu id and not feed that to __queue_work()?  That
looks like a lot of extra logic.

Thanks.
Alexander Duyck Sept. 26, 2018, 10:19 p.m. UTC | #4
On 9/26/2018 3:09 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 03:05:17PM -0700, Alexander Duyck wrote:
>> I am using unbound workqueues. However there isn't an interface that
>> exposes the NUMA bits of them directly. All I am doing with this
>> patch is adding "queue_work_near" which takes a NUMA node as an
>> argument and then copies the logic of "queue_work_on" with the
>> exception that I am doing a check to verify that there is an
>> intersection between wq_unbound_cpumask and the cpumask of the node,
>> and then passing a CPU from that intersection into "__queue_work".
> 
> Can it just take a cpu id and not feed that to __queue_work()?  That
> looks like a lot of extra logic.
> 
> Thanks.

I could just use queue_work_on probably, but is there any issue if I am 
passing CPU values that are not in the wq_unbound_cpumask? That was 
mostly my concern. Also for an unbound queue do I need to worry about 
the hotplug lock? I wasn't sure if that was the case or not as I know it 
is called out as something to be concerned with using queue_work_on, but 
in __queue_work the value is just used to determine which node to grab a 
work queue from.

I forgot to address your question about the advantages. They are pretty 
significant. The test system I was working with was initializing 3TB of 
nvdimm memory per node. If the work runs on a CPU local to the node it 
takes something like 24 seconds, whereas a remote core can take 36 
seconds or more.

Thanks.

- Alex
Tejun Heo Oct. 1, 2018, 4:01 p.m. UTC | #5
Hello,

On Wed, Sep 26, 2018 at 03:19:21PM -0700, Alexander Duyck wrote:
> On 9/26/2018 3:09 PM, Tejun Heo wrote:
> I could just use queue_work_on probably, but is there any issue if I
> am passing CPU values that are not in the wq_unbound_cpumask? That

That should be fine.  If it can't find any available cpu, it'll fall
back to round-robin.  We probably can improve it so that it can
consider the numa distance when falling back.

> was mostly my concern. Also for an unbound queue do I need to worry
> about the hotplug lock? I wasn't sure if that was the case or not as

Issuers don't need to worry about them.

> I know it is called out as something to be concerned with using
> queue_work_on, but in __queue_work the value is just used to
> determine which node to grab a work queue from.

It might be better to leave queue_work_on() to be used for per-cpu
workqueues and introduce queue_work_near() as you suggested.  I just
don't want it to duplicate the node selection code in it.  Would that
work?

> I forgot to address your question about the advantages. They are
> pretty significant. The test system I was working with was
> initializing 3TB of nvdimm memory per node. If the work runs on a CPU
> local to the node it takes something like 24 seconds, whereas a
> remote core can take 36 seconds or more.

Oh yeah, sure, numa affinity matters quite a bit on memory heavy
workloads.  I had mistakenly thought you were adding numa affinity
to per-cpu workqueues.

Thanks.
Alexander Duyck Oct. 1, 2018, 9:54 p.m. UTC | #6
On 10/1/2018 9:01 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 26, 2018 at 03:19:21PM -0700, Alexander Duyck wrote:
>> On 9/26/2018 3:09 PM, Tejun Heo wrote:
>> I could just use queue_work_on probably, but is there any issue if I
>> am passing CPU values that are not in the wq_unbound_cpumask? That
> 
> That should be fine.  If it can't find any available cpu, it'll fall
> back to round-robin.  We probably can improve it so that it can
> consider the numa distance when falling back.
> 
>> was mostly my concern. Also for an unbound queue do I need to worry
>> about the hotplug lock? I wasn't sure if that was the case or not as
> 
> Issuers don't need to worry about them.
> 
>> I know it is called out as something to be concerned with using
>> queue_work_on, but in __queue_work the value is just used to
>> determine which node to grab a work queue from.
> 
> It might be better to leave queue_work_on() to be used for per-cpu
> workqueues and introduce queue_work_near() as you suggested.  I just
> don't want it to duplicate the node selection code in it.  Would that
> work?

So if I understand what you are saying correctly, we default to 
round-robin when a given node has no CPUs attached to it. I could 
probably work with that if that is the default behavior instead of 
adding much of the complexity I already have.

The question I have then is what should I do about workqueues that 
aren't WQ_UNBOUND if they attempt to use queue_work_near? In that case I 
should be looking for some way to go from a node to a CPU, shouldn't I? 
If so, should I look at doing something like wq_select_unbound_cpu that 
uses the node cpumask instead of the wq_unbound_cpumask?

- Alex
Tejun Heo Oct. 2, 2018, 5:41 p.m. UTC | #7
Hello,

On Mon, Oct 01, 2018 at 02:54:39PM -0700, Alexander Duyck wrote:
> >It might be better to leave queue_work_on() to be used for per-cpu
> >workqueues and introduce queue_work_near() as you suggested.  I just
> >don't want it to duplicate the node selection code in it.  Would that
> >work?
> 
> So if I understand what you are saying correctly, we default to
> round-robin when a given node has no CPUs attached to it. I could
> probably work with that if that is the default behavior instead of
> adding much of the complexity I already have.

Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
round-robin.  We can probably do better there and find the nearest
node considering topology.

> The question I have then is what should I do about workqueues that
> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that

Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
out later and users could already do that anyway.

Thanks.
Alexander Duyck Oct. 2, 2018, 6:23 p.m. UTC | #8
On 10/2/2018 10:41 AM, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 01, 2018 at 02:54:39PM -0700, Alexander Duyck wrote:
>>> It might be better to leave queue_work_on() to be used for per-cpu
>>> workqueues and introduce queue_work_near() as you suggested.  I just
>>> don't want it to duplicate the node selection code in it.  Would that
>>> work?
>>
>> So if I understand what you are saying correctly, we default to
>> round-robin when a given node has no CPUs attached to it. I could
>> probably work with that if that is the default behavior instead of
>> adding much of the complexity I already have.
> 
> Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
> requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
> round-robin.  We can probably do better there and find the nearest
> node considering topology.

Well if we could get wq_select_unbound_cpu doing the right thing based 
on node topology, that would be most of my work solved right there. 
Basically I could just pass WORK_CPU_UNBOUND with the correct node and 
it would take care of getting to the right CPU.

>> The question I have then is what should I do about workqueues that
>> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
> 
> Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
> out later and users could already do that anyway.
> 
> Thanks.

So are you saying I should just return an error for now if somebody 
tries to use something other than an unbound workqueue with 
queue_work_near, and expect everyone else to just use queue_work_on for 
the other workqueue types?

Thanks.

- Alex
Tejun Heo Oct. 2, 2018, 6:41 p.m. UTC | #9
Hello,

On Tue, Oct 02, 2018 at 11:23:26AM -0700, Alexander Duyck wrote:
> >Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
> >requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
> >round-robin.  We can probably do better there and find the nearest
> >node considering topology.
> 
> Well if we could get wq_select_unbound_cpu doing the right thing
> based on node topology, that would be most of my work solved right
> there. Basically I could just pass WORK_CPU_UNBOUND with the correct
> node and it would take care of getting to the right CPU.

Yeah, sth like that.  It might be better to keep the function to take
cpu for consistency as everything else passes around cpu.

> >>The question I have then is what should I do about workqueues that
> >>aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
> >
> >Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
> >out later and users could already do that anyway.
> 
> So are you saying I should just return an error for now if somebody
> tries to use something other than an unbound workqueue with
> queue_work_near, and expect everyone else to just use queue_work_on
> for the other workqueue types?

Oh, I meant that let's not add a new interface for now and just use
queue_work_on() for your use case too.

Thanks.
Alexander Duyck Oct. 2, 2018, 8:49 p.m. UTC | #10
On 10/2/2018 11:41 AM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Oct 02, 2018 at 11:23:26AM -0700, Alexander Duyck wrote:
>>> Yeah, it's all in wq_select_unbound_cpu().  Right now, if the
>>> requested cpu isn't in wq_unbound_cpumask, it falls back to dumb
>>> round-robin.  We can probably do better there and find the nearest
>>> node considering topology.
>>
>> Well if we could get wq_select_unbound_cpu doing the right thing
>> based on node topology, that would be most of my work solved right
>> there. Basically I could just pass WORK_CPU_UNBOUND with the correct
>> node and it would take care of getting to the right CPU.
> 
> Yeah, sth like that.  It might be better to keep the function to take
> cpu for consistency as everything else passes around cpu.
> 
>>>> The question I have then is what should I do about workqueues that
>>>> aren't WQ_UNBOUND if they attempt to use queue_work_near? In that
>>>
>>> Hmm... yeah, let's just use queue_work_on() for now.  We can sort it
>>> out later and users could already do that anyway.
>>
>> So are you saying I should just return an error for now if somebody
>> tries to use something other than an unbound workqueue with
>> queue_work_near, and expect everyone else to just use queue_work_on
>> for the other workqueue types?
> 
> Oh, I meant that let's not add a new interface for now and just use
> queue_work_on() for your use case too.
> 
> Thanks.

So the only issue is that I was hoping to get away with not having to 
add additional preemption disabling. That was the motivation behind 
doing queue_work_near, as I could just wrap it all in the same 
local_irq_save so that I don't have to worry about the CPU I am on 
changing.

What I may look at doing is greatly reducing the 
workqueue_select_unbound_cpu_near function so that it essentially just 
performs a few tests and then uses the result of a cpumask_any_and of 
the cpumask_of_node and the cpu_online_mask. I'll probably rename it 
while I am at it, since I am probably going to be getting away from the 
"unbound" checks in the logic.

- Alex

Patch

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 60d673e15632..1f9f0a65437b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -463,6 +463,8 @@  int apply_workqueue_attrs(struct workqueue_struct *wq,
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
+extern bool queue_work_near(int node, struct workqueue_struct *wq,
+			    struct work_struct *work);
 extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e..a971d3c4096e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/nmi.h>
+#include <linux/device.h>
 
 #include "workqueue_internal.h"
 
@@ -1332,8 +1333,9 @@  static bool is_chained_work(struct workqueue_struct *wq)
  * by wq_unbound_cpumask.  Otherwise, round robin among the allowed ones to
  * avoid perturbing sensitive tasks.
  */
-static int wq_select_unbound_cpu(int cpu)
+static int wq_select_unbound_cpu(void)
 {
+	int cpu = raw_smp_processor_id();
 	static bool printed_dbg_warning;
 	int new_cpu;
 
@@ -1385,7 +1387,7 @@  static void __queue_work(int cpu, struct workqueue_struct *wq,
 		return;
 retry:
 	if (req_cpu == WORK_CPU_UNBOUND)
-		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
+		cpu = wq_select_unbound_cpu();
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))
@@ -1492,6 +1494,129 @@  bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
+/**
+ * workqueue_select_unbound_cpu_near - Select an unbound CPU based on NUMA node
+ * @node: NUMA node ID that we want to bind a CPU from
+ *
+ * This function will attempt to find a "random" cpu available to the unbound
+ * workqueues on a given node. If there are no CPUs available on the given
+ * node it will return WORK_CPU_UNBOUND indicating that we should just
+ * schedule to any available CPU if we need to schedule this work.
+ */
+static int workqueue_select_unbound_cpu_near(int node)
+{
+	const struct cpumask *wq_cpumask, *node_cpumask;
+	int cpu;
+
+	/* No point in doing this if NUMA isn't enabled for workqueues */
+	if (!wq_numa_enabled)
+		return WORK_CPU_UNBOUND;
+
+	/* delay binding to CPU if node is not valid or online */
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		return WORK_CPU_UNBOUND;
+
+	/* If wq_unbound_cpumask is empty then just use cpu_online_mask */
+	wq_cpumask = cpumask_empty(wq_unbound_cpumask) ? cpu_online_mask :
+							 wq_unbound_cpumask;
+
+	/*
+	 * If node has no CPUs, or no CPUs in the unbound cpumask then we
+	 * need to try and find the nearest node that does have CPUs in the
+	 * unbound cpumask.
+	 */
+	if (!nr_cpus_node(node) ||
+	    !cpumask_intersects(cpumask_of_node(node), wq_cpumask)) {
+		int min_val = INT_MAX, best_node = NUMA_NO_NODE;
+		int this_node, val;
+
+		for_each_online_node(this_node) {
+			if (this_node == node)
+				continue;
+
+			val = node_distance(node, this_node);
+			if (min_val < val)
+				continue;
+
+			if (!nr_cpus_node(this_node) ||
+			    !cpumask_intersects(cpumask_of_node(this_node),
+						wq_cpumask))
+				continue;
+
+			best_node = this_node;
+			min_val = val;
+		}
+
+		/* If we failed to find a close node just defer */
+		if (best_node == NUMA_NO_NODE)
+			return WORK_CPU_UNBOUND;
+
+		/* update node to reflect optimal value */
+		node = best_node;
+	}
+
+
+	/* Use local node/cpu if we are already there */
+	cpu = raw_smp_processor_id();
+	if (node == cpu_to_node(cpu) &&
+	    cpumask_test_cpu(cpu, wq_unbound_cpumask))
+		return cpu;
+
+	/*
+	 * Reuse the same value as wq_select_unbound_cpu above to prevent
+	 * us from mapping the same CPU each time. The impact to
+	 * wq_select_unbound_cpu should be minimal since the above function
+	 * only uses it when it has to load balance on remote CPUs similar
+	 * to what I am doing here.
+	 */
+	cpu = __this_cpu_read(wq_rr_cpu_last);
+	node_cpumask = cpumask_of_node(node);
+	cpu = cpumask_next_and(cpu, wq_cpumask, node_cpumask);
+	if (unlikely(cpu >= nr_cpu_ids)) {
+		cpu = cpumask_first_and(wq_cpumask, node_cpumask);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return WORK_CPU_UNBOUND;
+	}
+	__this_cpu_write(wq_rr_cpu_last, cpu);
+
+	return cpu;
+}
+
+/**
+ * queue_work_near - queue work on the nearest unbound cpu to a given NUMA node
+ * @node: NUMA node that we are targeting the work for
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * We queue the work to a specific CPU based on a given NUMA node, the
+ * caller must ensure it can't go away.
+ *
+ * This function will only make a best effort attempt at getting this onto
+ * the right NUMA node. If no node is requested or the requested node is
+ * offline then we just fall back to standard queue_work behavior.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_work_near(int node, struct workqueue_struct *wq,
+		     struct work_struct *work)
+{
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		int cpu = workqueue_select_unbound_cpu_near(node);
+
+		__queue_work(cpu, wq, work);
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_near);
+
 void delayed_work_timer_fn(struct timer_list *t)
 {
 	struct delayed_work *dwork = from_timer(dwork, t, timer);