diff mbox series

[01/10] cgroup/cpuset: Fix race between newly created partition and dying one

Message ID 20250330215248.3620801-2-longman@redhat.com (mailing list archive)
State New
Headers show
Series cgroup/cpuset: Miscellaneous partition bug fixes and enhancements | expand

Commit Message

Waiman Long March 30, 2025, 9:52 p.m. UTC
There is a possible race between removing a cgroup diectory that is
a partition root and the creation of a new partition.  The partition
to be removed can be dying but still online, it doesn't not currently
participate in checking for exclusive CPUs conflict, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when cpuset_css_offline() function is called after an RCU delay.

As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an
incorrect CPU configuration.

This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.

One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.

Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa48178 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.

Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.

Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h |  1 +
 include/linux/cgroup.h      |  2 +-
 kernel/cgroup/cgroup.c      |  6 ++++++
 kernel/cgroup/cpuset.c      | 20 +++++++++++++++++---
 4 files changed, 25 insertions(+), 4 deletions(-)

Comments

Tejun Heo March 31, 2025, 11:13 p.m. UTC | #1
Hello,

On Sun, Mar 30, 2025 at 05:52:39PM -0400, Waiman Long wrote:
...
> One possible way to fix this is to iterate the dying cpusets as well and
> avoid using the exclusive CPUs in those dying cpusets. However, this
> can still cause random partition creation failures or other anomalies
> due to racing. A better way to fix this race is to reset the partition
> state at the moment when a cpuset is being killed.

I'm not a big fan of adding another method call in the destruction path.
css_offline() is where the kill can be seen from all CPUs and notified to
the controller and I'm not sure why bringing it sooner would be necessary to
close the race window. Can't the creation side drain the cgroups that are
going down if the asynchronous part is a problem? e.g. We already have
cgroup_lock_and_drain_offline() which isn't the most scalable thing but
partition operations aren't very frequent, right? And if that's a problem,
there should be a way to make it reasonably quicker.

Thanks.
Waiman Long April 1, 2025, 3:12 a.m. UTC | #2
On 3/31/25 7:13 PM, Tejun Heo wrote:
> Hello,
>
> On Sun, Mar 30, 2025 at 05:52:39PM -0400, Waiman Long wrote:
> ...
>> One possible way to fix this is to iterate the dying cpusets as well and
>> avoid using the exclusive CPUs in those dying cpusets. However, this
>> can still cause random partition creation failures or other anomalies
>> due to racing. A better way to fix this race is to reset the partition
>> state at the moment when a cpuset is being killed.
> I'm not a big fan of adding another method call in the destruction path.
> css_offline() is where the kill can be seen from all CPUs and notified to
> the controller and I'm not sure why bringing it sooner would be necessary to
> close the race window. Can't the creation side drain the cgroups that are
> going down if the asynchronous part is a problem? e.g. We already have
> cgroup_lock_and_drain_offline() which isn't the most scalable thing but
> partition operations aren't very frequent, right? And if that's a problem,
> there should be a way to make it reasonably quicker.

The problem is the RCU delay between the time a cgroup is killed and is 
in a dying state and when the partition is deactivated when 
cpuset_css_offline() is called. That delay can be rather lengthy 
depending on the current workload.

Another alternative that I can think of is to scan the remote partition 
list for remote partition and sibling cpusets for local partition 
whenever some kind of conflicts are detected when enabling a partition. 
When a dying cpuset partition is detected, deactivate it immediately to 
resolve the conflict. Otherwise, the dying partition will still be 
deactivated at cpuset_css_offline() time.

That will be a bit more complex and I think can still get the problem 
solved without adding a new method. What do you think? If you are OK 
with that, I will send out a new patch later this week.

Thanks,
Longman
Tejun Heo April 1, 2025, 7:59 p.m. UTC | #3
Hello, Waiman.

On Mon, Mar 31, 2025 at 11:12:06PM -0400, Waiman Long wrote:
> The problem is the RCU delay between the time a cgroup is killed and is in a
> dying state and when the partition is deactivated when cpuset_css_offline()
> is called. That delay can be rather lengthy depending on the current
> workload.

If we don't have to do it too often, synchronize_rcu_expedited() may be
workable too. What do you think?

> Another alternative that I can think of is to scan the remote partition list
> for remote partition and sibling cpusets for local partition whenever some
> kind of conflicts are detected when enabling a partition. When a dying
> cpuset partition is detected, deactivate it immediately to resolve the
> conflict. Otherwise, the dying partition will still be deactivated at
> cpuset_css_offline() time.
> 
> That will be a bit more complex and I think can still get the problem solved
> without adding a new method. What do you think? If you are OK with that, I
> will send out a new patch later this week.

If synchronize_rcu_expedited() won't do, let's go with the original patch.
The operation does make general sense in that it's for a distinctive step in
the destruction process although I'm a bit curious why it's called before
DYING is set.

Thanks.
Waiman Long April 1, 2025, 8:41 p.m. UTC | #4
On 4/1/25 3:59 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Mar 31, 2025 at 11:12:06PM -0400, Waiman Long wrote:
>> The problem is the RCU delay between the time a cgroup is killed and is in a
>> dying state and when the partition is deactivated when cpuset_css_offline()
>> is called. That delay can be rather lengthy depending on the current
>> workload.
> If we don't have to do it too often, synchronize_rcu_expedited() may be
> workable too. What do you think?

I don't think we ever call synchronize_rcu() in the cgroup code except 
for rstat flush. In fact, we didn't use to have an easy way to know if 
there were dying cpusets hanging around. Now we can probably use the 
root cgroup's nr_dying_subsys[cpuset_cgrp_id] to know if we need to use 
synchronize_rcu*() call to wait for it. However, I still need to check 
if there is any racing window that will cause us to miss it.

>
>> Another alternative that I can think of is to scan the remote partition list
>> for remote partition and sibling cpusets for local partition whenever some
>> kind of conflicts are detected when enabling a partition. When a dying
>> cpuset partition is detected, deactivate it immediately to resolve the
>> conflict. Otherwise, the dying partition will still be deactivated at
>> cpuset_css_offline() time.
>>
>> That will be a bit more complex and I think can still get the problem solved
>> without adding a new method. What do you think? If you are OK with that, I
>> will send out a new patch later this week.
> If synchronize_rcu_expedited() won't do, let's go with the original patch.
> The operation does make general sense in that it's for a distinctive step in
> the destruction process although I'm a bit curious why it's called before
> DYING is set.

Again, we have to synchronize between the css_is_dying() call in 
is_cpuset_online() which is used by cpuset_for_each_child() against the 
calling of cpuset_css_killed(). Since setting of the CSS_DYING flag is 
protected by cgroup_mutex() while most of the cpuset code is protected 
by cpuset_mutex. The two operations can be asynchronous with each other. 
So I have to make sure that by the time CSS_DYING is set, the 
cpuset_css_killed() call has been invoked. I need to do similar check if 
we decide to use synchronize_rcu*() to wait for the completion of 
cpuset_css_offline() call.

As I am also dealing with a lot of locking related issues, I am more 
attuned to this kind of racing conditions to make sure nothing bad will 
happen.

Cheers,
Longman
Waiman Long April 1, 2025, 8:56 p.m. UTC | #5
On 4/1/25 4:41 PM, Waiman Long wrote:
>
> On 4/1/25 3:59 PM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Mon, Mar 31, 2025 at 11:12:06PM -0400, Waiman Long wrote:
>>> The problem is the RCU delay between the time a cgroup is killed and 
>>> is in a
>>> dying state and when the partition is deactivated when 
>>> cpuset_css_offline()
>>> is called. That delay can be rather lengthy depending on the current
>>> workload.
>> If we don't have to do it too often, synchronize_rcu_expedited() may be
>> workable too. What do you think?
>
> I don't think we ever call synchronize_rcu() in the cgroup code except 
> for rstat flush. In fact, we didn't use to have an easy way to know if 
> there were dying cpusets hanging around. Now we can probably use the 
> root cgroup's nr_dying_subsys[cpuset_cgrp_id] to know if we need to 
> use synchronize_rcu*() call to wait for it. However, I still need to 
> check if there is any racing window that will cause us to miss it.

Sorry, I don't think I can use synchronize_rcu_expedited() as the use 
cases that I am seeing most often is the creation of isolated partitions 
running latency sensitive applications like DPDK. Using 
synchronize_rcu_expedited() will send IPIs to all the CPUs which may 
break the required latency guarantee for those applications. Just using 
synchronize_rcu(), however, will have unpredictable latency impacting 
user experience.

>
>>
>>> Another alternative that I can think of is to scan the remote 
>>> partition list
>>> for remote partition and sibling cpusets for local partition 
>>> whenever some
>>> kind of conflicts are detected when enabling a partition. When a dying
>>> cpuset partition is detected, deactivate it immediately to resolve the
>>> conflict. Otherwise, the dying partition will still be deactivated at
>>> cpuset_css_offline() time.
>>>
>>> That will be a bit more complex and I think can still get the 
>>> problem solved
>>> without adding a new method. What do you think? If you are OK with 
>>> that, I
>>> will send out a new patch later this week.
>> If synchronize_rcu_expedited() won't do, let's go with the original 
>> patch.
>> The operation does make general sense in that it's for a distinctive 
>> step in
>> the destruction process although I'm a bit curious why it's called 
>> before
>> DYING is set.
>
Because of the above, I still prefer either using the original patch or 
scanning for dying cpuset partitions in case a conflict is detected. 
Please let me know what you think about it.

Thanks,
Longman
diff mbox series

Patch

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 485b651869d9..5bc8f55c8cca 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -710,6 +710,7 @@  struct cgroup_subsys {
 	void (*css_released)(struct cgroup_subsys_state *css);
 	void (*css_free)(struct cgroup_subsys_state *css);
 	void (*css_reset)(struct cgroup_subsys_state *css);
+	void (*css_killed)(struct cgroup_subsys_state *css);
 	void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
 	int (*css_extra_stat_show)(struct seq_file *seq,
 				   struct cgroup_subsys_state *css);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 28e999f2c642..e7da3c3b098b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -344,7 +344,7 @@  static inline u64 cgroup_id(const struct cgroup *cgrp)
  */
 static inline bool css_is_dying(struct cgroup_subsys_state *css)
 {
-	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
+	return css->flags & CSS_DYING;
 }
 
 static inline void cgroup_get(struct cgroup *cgrp)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f231fe3a0744..49d622205997 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5909,6 +5909,12 @@  static void kill_css(struct cgroup_subsys_state *css)
 	if (css->flags & CSS_DYING)
 		return;
 
+	/*
+	 * Call css_killed(), if defined, before setting the CSS_DYING flag
+	 */
+	if (css->ss->css_killed)
+		css->ss->css_killed(css);
+
 	css->flags |= CSS_DYING;
 
 	/*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 39c1fc643d77..749994312d47 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3475,9 +3475,6 @@  static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 
-	if (is_partition_valid(cs))
-		update_prstate(cs, 0);
-
 	if (!cpuset_v2() && is_sched_load_balance(cs))
 		cpuset_update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
@@ -3488,6 +3485,22 @@  static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpus_read_unlock();
 }
 
+static void cpuset_css_killed(struct cgroup_subsys_state *css)
+{
+	struct cpuset *cs = css_cs(css);
+
+	cpus_read_lock();
+	mutex_lock(&cpuset_mutex);
+
+	/* Reset valid partition back to member */
+	if (is_partition_valid(cs))
+		update_prstate(cs, PRS_MEMBER);
+
+	mutex_unlock(&cpuset_mutex);
+	cpus_read_unlock();
+
+}
+
 static void cpuset_css_free(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
@@ -3609,6 +3622,7 @@  struct cgroup_subsys cpuset_cgrp_subsys = {
 	.css_alloc	= cpuset_css_alloc,
 	.css_online	= cpuset_css_online,
 	.css_offline	= cpuset_css_offline,
+	.css_killed	= cpuset_css_killed,
 	.css_free	= cpuset_css_free,
 	.can_attach	= cpuset_can_attach,
 	.cancel_attach	= cpuset_cancel_attach,