
[PATCH-cgroup,1/4] workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask

Message ID 20231013181122.3518610-2-longman@redhat.com (mailing list archive)
State New
Series cgroup/cpuset: Improve CPU isolation in isolated partitions

Commit Message

Waiman Long Oct. 13, 2023, 6:11 p.m. UTC
When the "isolcpus" boot command line option is used to add a set
of isolated CPUs, those CPUs will be excluded automatically from
wq_unbound_cpumask to avoid running work functions from unbound
workqueues.

Recently, cpuset has been extended to allow the dynamic creation of
partitions of isolated CPUs. To make it closer to "isolcpus"
in functionality, the CPUs in those isolated cpuset partitions should be
excluded from wq_unbound_cpumask as well. This can currently be done by
explicitly writing to the workqueue's cpumask sysfs file after creating
the isolated partitions. However, this process can be error-prone.
Ideally, the cpuset code should be allowed to request that the workqueue
code exclude those isolated CPUs from wq_unbound_cpumask so that this
operation can be done automatically and the isolated CPUs will be returned
to wq_unbound_cpumask after the destruction of the isolated
cpuset partitions.

This patch adds a new workqueue_unbound_exclude_cpumask() function to
enable that. This new function excludes the specified isolated CPUs
from wq_unbound_cpumask. To be able to restore those isolated CPUs
after the destruction of the isolated cpuset partitions, a new
wq_user_unbound_cpumask is added to store the user-provided unbound
cpumask, either from the boot command line options or from writing to
the cpumask sysfs file. This new cpumask provides the basis for CPU
exclusion.
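
For illustration, the intended call pattern from the cpuset side would be
along these lines (a sketch only; the actual cpuset hook-up comes later in
this series, and the cpumask variable names here are hypothetical):

	/*
	 * When an isolated cpuset partition is created, take its CPUs
	 * away from unbound workqueues.
	 */
	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);

	/*
	 * When it is destroyed, pass the remaining isolated set (possibly
	 * empty) so the freed CPUs are restored from
	 * wq_user_unbound_cpumask.
	 */
	ret = workqueue_unbound_exclude_cpumask(remaining_isolated_cpus);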

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/workqueue.h |  2 +-
 kernel/workqueue.c        | 42 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 42 insertions(+), 2 deletions(-)

Comments

Tejun Heo Oct. 18, 2023, 9:24 a.m. UTC | #1
Hello,

On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
> When the "isolcpus" boot command line option is used to add a set
> of isolated CPUs, those CPUs will be excluded automatically from
> wq_unbound_cpumask to avoid running work functions from unbound
> workqueues.
> 
> Recently, cpuset has been extended to allow the dynamic creation of
> partitions of isolated CPUs. To make it closer to "isolcpus"
> in functionality, the CPUs in those isolated cpuset partitions should be
> excluded from wq_unbound_cpumask as well. This can currently be done by
> explicitly writing to the workqueue's cpumask sysfs file after creating
> the isolated partitions. However, this process can be error-prone.
> Ideally, the cpuset code should be allowed to request that the workqueue
> code exclude those isolated CPUs from wq_unbound_cpumask so that this
> operation can be done automatically and the isolated CPUs will be returned
> to wq_unbound_cpumask after the destruction of the isolated
> cpuset partitions.
> 
> This patch adds a new workqueue_unbound_exclude_cpumask() function to
> enable that. This new function excludes the specified isolated CPUs
> from wq_unbound_cpumask. To be able to restore those isolated CPUs
> after the destruction of the isolated cpuset partitions, a new
> wq_user_unbound_cpumask is added to store the user-provided unbound
> cpumask, either from the boot command line options or from writing to
> the cpumask sysfs file. This new cpumask provides the basis for CPU
> exclusion.

The behavior around wq_unbound_cpumask is getting pretty inconsistent:

1. Housekeeping excludes isolated CPUs on boot but allows the user to
   override it to include isolated CPUs afterwards.

2. If an unbound wq's cpumask doesn't have any intersection with
   wq_unbound_cpumask, we ignore the per-wq cpumask and fall back to
   wq_unbound_cpumask.

3. You're adding a masking layer on top with exclude which fails to set if
   the intersection is empty.

Can we do the following for consistency?

1. The user's requested_unbound_cpumask is stored separately (as in this patch).

2. The effective wq_unbound_cpumask is determined by requested_unbound_cpumask
   & housekeeping_cpumask & cpuset_allowed_cpumask. The operation order
   matters. When an & operation yields an empty cpumask, the cpumask from
   the previous step is the effective one (see the sketch after this list).

3. Expose these cpumasks in sysfs so that what's happening is obvious.
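
As an illustration of point 2's cascading-AND rule, a minimal sketch
(the helper name compute_effective_unbound_cpumask() is hypothetical;
cpumask_copy(), cpumask_intersects() and cpumask_and() are existing
kernel APIs):

static void compute_effective_unbound_cpumask(struct cpumask *effective,
		const struct cpumask *requested,
		const struct cpumask *housekeeping,
		const struct cpumask *cpuset_allowed)
{
	/* Start from the user's requested mask. */
	cpumask_copy(effective, requested);

	/* Apply each constraint only if the intersection is non-empty. */
	if (cpumask_intersects(effective, housekeeping))
		cpumask_and(effective, effective, housekeeping);
	if (cpumask_intersects(effective, cpuset_allowed))
		cpumask_and(effective, effective, cpuset_allowed);
}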

> +int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
> +{
> +	cpumask_var_t cpumask;
> +	int ret = 0;
> +
> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	/*
> +	 * The caller of this function may have called cpus_read_lock(),
> +	 * use cpus_read_trylock() to avoid potential deadlock.
> +	 */
> +	if (!cpus_read_trylock())
> +		return -EBUSY;

This means that a completely unrelated cpus_write_lock() can fail this
operation and thus cpuset config writes. Let's please not do this. Can't we
just make sure that the caller holds the lock?
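
A minimal sketch of that suggestion, assuming the existing
lockdep_assert_cpus_held() helper (a fragment, not a complete function):

	/* Callers must hold cpus_read_lock() or cpus_write_lock(). */
	lockdep_assert_cpus_held();
	mutex_lock(&wq_pool_mutex);
	/* ... update wq_unbound_cpumask ... */
	mutex_unlock(&wq_pool_mutex);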

> +	mutex_lock(&wq_pool_mutex);
> +
> +	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
> +		ret = -EINVAL;	/* The new cpumask can't be empty */

For better or worse, the usual mode-of-failure for "no usable CPU" is just
falling back to something which works rather than failing the operation.
Let's follow that.

Thanks.
Waiman Long Oct. 18, 2023, 1:41 p.m. UTC | #2
On 10/18/23 05:24, Tejun Heo wrote:
> Hello,
>
> On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
>> When the "isolcpus" boot command line option is used to add a set
>> of isolated CPUs, those CPUs will be excluded automatically from
>> wq_unbound_cpumask to avoid running work functions from unbound
>> workqueues.
>>
>> Recently cpuset has been extended to allow the creation of partitions
>> of isolated CPUs dynamically. To make it closer to the "isolcpus"
>> in functionality, the CPUs in those isolated cpuset partitions should be
>> excluded from wq_unbound_cpumask as well. This can be done currently by
>> explicitly writing to the workqueue's cpumask sysfs file after creating
>> the isolated partitions. However, this process can be error prone.
>> Ideally, the cpuset code should be allowed to request the workqueue code
>> to exclude those isolated CPUs from wq_unbound_cpumask so that this
>> operation can be done automatically and the isolated CPUs will be returned
>> back to wq_unbound_cpumask after the destructions of the isolated
>> cpuset partitions.
>>
>> This patch adds a new workqueue_unbound_exclude_cpumask() to enable
>> that. This new function will exclude the specified isolated CPUs
>> from wq_unbound_cpumask. To be able to restore those isolated CPUs
>> back after the destruction of isolated cpuset partitions, a new
>> wq_user_unbound_cpumask is added to store the user provided unbound
>> cpumask either from the boot command line options or from writing to
>> the cpumask sysfs file. This new cpumask provides the basis for CPU
>> exclusion.
> The behavior around wq_unbound_cpumask is getting pretty inconsistent:
>
> 1. Housekeeping excludes isolated CPUs on boot but allows the user to
>     override it to include isolated CPUs afterwards.
>
> 2. If an unbound wq's cpumask doesn't have any intersection with
>     wq_unbound_cpumask, we ignore the per-wq cpumask and fall back to
>     wq_unbound_cpumask.
>
> 3. You're adding a masking layer on top with exclude which fails to set if
>     the intersection is empty.
>
> Can we do the following for consistency?
>
> 1. The user's requested_unbound_cpumask is stored separately (as in this patch).
>
> 2. The effective wq_unbound_cpumask is determined by requested_unbound_cpumask
>     & housekeeping_cpumask & cpuset_allowed_cpumask. The operation order
>     matters. When an & operation yields an empty cpumask, the cpumask from
>     the previous step is the effective one.
Sure. I will do that.
>
> 3. Expose these cpumasks in sysfs so that what's happening is obvious.

I can expose the requested_unbound_cpumask. As for the isolated CPUs, 
see my other reply.

>> +int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
>> +{
>> +	cpumask_var_t cpumask;
>> +	int ret = 0;
>> +
>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * The caller of this function may have called cpus_read_lock(),
>> +	 * use cpus_read_trylock() to avoid potential deadlock.
>> +	 */
>> +	if (!cpus_read_trylock())
>> +		return -EBUSY;
> This means that a completely unrelated cpus_write_lock() can fail this
> operation and thus cpuset config writes. Let's please not do this. Can't we
> just make sure that the caller holds the lock?
This condition is actually triggered by a few hotplug tests in
test_cpuset_prs.sh. I will make sure that either the CPU read or write
lock is held before calling this function and eliminate the
cpus_read_trylock() call here.
>
>> +	mutex_lock(&wq_pool_mutex);
>> +
>> +	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
>> +		ret = -EINVAL;	/* The new cpumask can't be empty */
> For better or worse, the usual mode-of-failure for "no usable CPU" is just
> falling back to something which works rather than failing the operation.
> Let's follow that.

In this case, it just leaves the current unbound cpumask unchanged.
I will follow the precedence order discussed above to make sure that
there is a graceful fallback.

Cheers,
Longman
Waiman Long Oct. 18, 2023, 7:18 p.m. UTC | #3
On 10/18/23 09:41, Waiman Long wrote:
> On 10/18/23 05:24, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Oct 13, 2023 at 02:11:19PM -0400, Waiman Long wrote:
>>> When the "isolcpus" boot command line option is used to add a set
>>> of isolated CPUs, those CPUs will be excluded automatically from
>>> wq_unbound_cpumask to avoid running work functions from unbound
>>> workqueues.
>>>
>>> Recently, cpuset has been extended to allow the dynamic creation of
>>> partitions of isolated CPUs. To make it closer to "isolcpus"
>>> in functionality, the CPUs in those isolated cpuset partitions should be
>>> excluded from wq_unbound_cpumask as well. This can currently be done by
>>> explicitly writing to the workqueue's cpumask sysfs file after creating
>>> the isolated partitions. However, this process can be error-prone.
>>> Ideally, the cpuset code should be allowed to request that the workqueue
>>> code exclude those isolated CPUs from wq_unbound_cpumask so that this
>>> operation can be done automatically and the isolated CPUs will be returned
>>> to wq_unbound_cpumask after the destruction of the isolated
>>> cpuset partitions.
>>>
>>> This patch adds a new workqueue_unbound_exclude_cpumask() function to
>>> enable that. This new function excludes the specified isolated CPUs
>>> from wq_unbound_cpumask. To be able to restore those isolated CPUs
>>> after the destruction of the isolated cpuset partitions, a new
>>> wq_user_unbound_cpumask is added to store the user-provided unbound
>>> cpumask, either from the boot command line options or from writing to
>>> the cpumask sysfs file. This new cpumask provides the basis for CPU
>>> exclusion.
>> The behavior around wq_unbound_cpumask is getting pretty inconsistent:
>>
>> 1. Housekeeping excludes isolated CPUs on boot but allows the user to
>>     override it to include isolated CPUs afterwards.
>>
>> 2. If an unbound wq's cpumask doesn't have any intersection with
>>     wq_unbound_cpumask, we ignore the per-wq cpumask and fall back to
>>     wq_unbound_cpumask.
>>
>> 3. You're adding a masking layer on top with exclude which fails to set if
>>     the intersection is empty.
>>
>> Can we do the following for consistency?
>>
>> 1. The user's requested_unbound_cpumask is stored separately (as in this
>> patch).
>>
>> 2. The effective wq_unbound_cpumask is determined by
>> requested_unbound_cpumask & housekeeping_cpumask & cpuset_allowed_cpumask.
>> The operation order matters. When an & operation yields an empty cpumask,
>> the cpumask from the previous step is the effective one.
> Sure. I will do that.

I have second thoughts after taking a further look at this. First of
all, cpuset_allowed_mask isn't relevant here and the mask can certainly
contain offline CPUs. So cpu_possible_mask is the proper fallback.

With the current patch, wq_user_unbound_cpumask is set up initially as the
(HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) housekeeping mask and rewritten by any
subsequent write to the workqueue/cpumask sysfs file. So using
wq_user_unbound_cpumask has the implied precedence of the user-sysfs
written mask, the command line isolcpus or nohz_full option mask, and
cpu_possible_mask. I think just falling back to wq_user_unbound_cpumask
if the operation fails should be enough.
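
In code, a sketch of that fallback inside
workqueue_unbound_exclude_cpumask() (illustrative, not the posted patch):

	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
		/* Empty result: fall back to the user-requested mask. */
		cpumask_copy(cpumask, wq_user_unbound_cpumask);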

Cheers,
Longman
Tejun Heo Oct. 24, 2023, 3:28 a.m. UTC | #4
Hello,

On Wed, Oct 18, 2023 at 03:18:52PM -0400, Waiman Long wrote:
> I have second thoughts after taking a further look at this. First of all,
> cpuset_allowed_mask isn't relevant here and the mask can certainly contain
> offline CPUs. So cpu_possible_mask is the proper fallback.
> 
> With the current patch, wq_user_unbound_cpumask is set up initially as the
> (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) housekeeping mask and rewritten by any
> subsequent write to the workqueue/cpumask sysfs file. So using

The current behavior is not something which was carefully planned. It's more
accidental than anything. If we can come up with a more intuitive and
consistent behavior, that should be fine.

> wq_user_unbound_cpumask has the implied precedence of the user-sysfs written
> mask, the command line isolcpus or nohz_full option mask, and
> cpu_possible_mask. I think just falling back to wq_user_unbound_cpumask if
> the operation fails should be enough.

But yeah, that sounds acceptable.

Thanks.
Waiman Long Oct. 25, 2023, 6:47 p.m. UTC | #5
On 10/23/23 23:28, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 18, 2023 at 03:18:52PM -0400, Waiman Long wrote:
>> I have second thoughts after taking a further look at this. First of all,
>> cpuset_allowed_mask isn't relevant here and the mask can certainly contain
>> offline CPUs. So cpu_possible_mask is the proper fallback.
>>
>> With the current patch, wq_user_unbound_cpumask is set up initially as the
>> (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) housekeeping mask and rewritten by any
>> subsequent write to the workqueue/cpumask sysfs file. So using
> The current behavior is not something which was carefully planned. It's more
> accidental than anything. If we can come up with a more intuitive and
> consistent behavior, that should be fine.
>
>> wq_user_unbound_cpumask has the implied precedence of the user-sysfs written
>> mask, the command line isolcpus or nohz_full option mask, and
>> cpu_possible_mask. I think just falling back to wq_user_unbound_cpumask if
>> the operation fails should be enough.
> But yeah, that sounds acceptable.

I have implemented the fallback to the user-requested cpumask in the
failure case.

Cheers,
Longman
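
For reference, folding the review feedback together, a sketch of what the
revised function could look like (an illustration of the agreed direction
-- caller holds the CPU hotplug lock, and we fall back to the
user-requested mask instead of failing -- not the actual follow-up patch):

int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
{
	cpumask_var_t cpumask;
	int ret = 0;

	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
		return -ENOMEM;

	/* Callers are expected to hold cpus_read_lock() or cpus_write_lock(). */
	lockdep_assert_cpus_held();
	mutex_lock(&wq_pool_mutex);

	/*
	 * If excluding @exclude_cpumask would empty the mask, fall back to
	 * the user-requested unbound cpumask instead of returning an error.
	 */
	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
		cpumask_copy(cpumask, wq_user_unbound_cpumask);

	if (!cpumask_equal(cpumask, wq_unbound_cpumask))
		ret = workqueue_apply_unbound_cpumask(cpumask);

	mutex_unlock(&wq_pool_mutex);
	free_cpumask_var(cpumask);
	return ret;
}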

Patch

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 1c1d06804d45..a936460ccc7e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -483,7 +483,7 @@  struct workqueue_attrs *alloc_workqueue_attrs(void);
 void free_workqueue_attrs(struct workqueue_attrs *attrs);
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs);
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 010b674b02a7..19d403aa41b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -381,6 +381,9 @@  static bool workqueue_freezing;		/* PL: have wqs started freezing? */
 /* PL&A: allowable cpus for unbound wqs and work items */
 static cpumask_var_t wq_unbound_cpumask;
 
+/* PL: user-provided unbound cpumask via sysfs */
+static cpumask_var_t wq_user_unbound_cpumask;
+
 /* for further constrain wq_unbound_cpumask by cmdline parameter*/
 static struct cpumask wq_cmdline_cpumask __initdata;
 
@@ -5825,7 +5828,7 @@  static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
  *  		-EINVAL	- Invalid @cpumask
  *  		-ENOMEM	- Failed to allocate memory for attrs or pwqs.
  */
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
+static int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 {
 	int ret = -EINVAL;
 
@@ -5836,6 +5839,7 @@  int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	cpumask_and(cpumask, cpumask, cpu_possible_mask);
 	if (!cpumask_empty(cpumask)) {
 		apply_wqattrs_lock();
+		cpumask_copy(wq_user_unbound_cpumask, cpumask);
 		if (cpumask_equal(cpumask, wq_unbound_cpumask)) {
 			ret = 0;
 			goto out_unlock;
@@ -5850,6 +5854,40 @@  int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
 	return ret;
 }
 
+/**
+ * workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
+ * @exclude_cpumask: the cpumask to be excluded from wq_unbound_cpumask
+ *
+ * This function can be called from cpuset code to provide a set of isolated
+ * CPUs that should be excluded from wq_unbound_cpumask.
+ */
+int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+{
+	cpumask_var_t cpumask;
+	int ret = 0;
+
+	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	/*
+	 * The caller of this function may have called cpus_read_lock(),
+	 * use cpus_read_trylock() to avoid potential deadlock.
+	 */
+	if (!cpus_read_trylock())
+		return -EBUSY;
+	mutex_lock(&wq_pool_mutex);
+
+	if (!cpumask_andnot(cpumask, wq_user_unbound_cpumask, exclude_cpumask))
+		ret = -EINVAL;	/* The new cpumask can't be empty */
+	else if (!cpumask_equal(cpumask, wq_unbound_cpumask))
+		ret = workqueue_apply_unbound_cpumask(cpumask);
+
+	mutex_unlock(&wq_pool_mutex);
+	cpus_read_unlock();
+	free_cpumask_var(cpumask);
+	return ret;
+}
+
 static int parse_affn_scope(const char *val)
 {
 	int i;
@@ -6520,11 +6558,13 @@  void __init workqueue_init_early(void)
 	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
 	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&wq_user_unbound_cpumask, GFP_KERNEL));
 	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
 	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	if (!cpumask_empty(&wq_cmdline_cpumask))
 		cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, &wq_cmdline_cpumask);
+	cpumask_copy(wq_user_unbound_cpumask, wq_unbound_cpumask);
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);