
[v6,5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Message ID 20210814205743.3039-6-longman@redhat.com
State New
Series cgroup/cpuset: Add new cpuset partition type & empty effective cpus

Commit Message

Waiman Long Aug. 14, 2021, 8:57 p.m. UTC
Update Documentation/admin-guide/cgroup-v2.rst to cover the newly
introduced "isolated" cpuset partition type as well as the ability to
create a non-top cpuset partition with no cpu allocated to it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 116 +++++++++++++++---------
 1 file changed, 71 insertions(+), 45 deletions(-)
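
For concreteness, a minimal shell sketch of the new "isolated" type under
the semantics this patch documents (the /sys/fs/cgroup mount point and the
choice of CPUs 2-3 are assumptions):

  cd /sys/fs/cgroup
  echo "+cpuset" > cgroup.subtree_control
  mkdir iso
  echo "2-3" > iso/cpuset.cpus
  echo "isolated" > iso/cpuset.cpus.partition

  # CPUs 2-3 are now exempt from scheduler load balancing; tasks placed
  # here must be bound to individual CPUs explicitly (e.g. via taskset).
  cat iso/cpuset.cpus.partition    # shows "isolated"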

Comments

Tejun Heo Aug. 16, 2021, 5:08 p.m. UTC | #1
On Sat, Aug 14, 2021 at 04:57:42PM -0400, Waiman Long wrote:
> +	A parent partition may distribute all its CPUs to its child
> +	partitions as long as it is not the root cgroup and there is no
> +	task directly associated with that parent partition.  Otherwise,

"there is not task directly associated with the parent partition" isn't
necessary, right? That's already enforced by the cgroup hierarchy itself.

> +	there must be at least one cpu left in the parent partition.
> +	A new task cannot be moved to a partition root with no effective
> +	cpu.
> +
> +	Once becoming a partition root, changes to "cpuset.cpus"
> +	are generally allowed as long as the first condition above
> +	(cpu exclusivity rule) is true.

All the above ultimately says is that "a new task cannot be moved to a
partition root with no effective cpu", but I don't understand why this would
be a separate rule. Shouldn't the partition just stop being a partition when
it doesn't have any exclusive cpu? What's the benefit of giving it its
own failure mode?

> +	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
> +	the state of the partition root to become invalid when the
> +	other constraints of partition root are violated.  Therefore,
> +	it is recommended that users should always set "cpuset.cpus"
> +	to the proper value first before enabling partition.  In case
> +	"cpuset.cpus" has to be modified after partition is enabled,
> +	users should check the state of "cpuset.cpus.partition" after
> +	making changes to it to make sure that the partition is still
> +	valid.

So, idk why this doesn't cover the one above it. Also, this really
should be worded a lot stronger. It's not just recommended - confirming and
monitoring the transitions is an integral and essential part of using
cpuset.
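
A minimal shell sketch of that check-after-write discipline, under the
semantics this patch describes (the cgroup2 mount point and CPU numbers
are assumptions):

  cd /sys/fs/cgroup
  echo "+cpuset" > cgroup.subtree_control
  mkdir rt
  # Set the cpu list first, then enable the partition.
  echo "2-3" > rt/cpuset.cpus
  echo "root" > rt/cpuset.cpus.partition

  # Confirm the transition took effect; an invalid configuration shows
  # up here rather than as a write error.
  cat rt/cpuset.cpus.partition    # expect "root", not "root invalid (...)"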

...
> +	An invalid partition is not a real partition even though the
> +	restriction of the cpu exclusivity rule will still apply.

Is there a reason we can't bring this in line with other failure behaviors?

> +	In the special case of a parent partition competing with a child
> +	partition for the only CPU left, the parent partition wins and
> +	the child partition becomes invalid.

Given that parent partitions are *always* empty, this rule doesn't seem to
make sense.

So, I think this definitely is a step in the right direction but still seems
to be neither here nor there. Before, we pretended that we could police the
input when we couldn't. Now, we're changing the interface so that it
includes configuration failures as an integral part; however, we're still
policing some particular inputs while letting other inputs pass through and
trigger failures, and why one is handled one way while the other is handled
differently seems rather arbitrary.

Thanks.
Waiman Long Aug. 24, 2021, 5:35 a.m. UTC | #2
On 8/16/21 1:08 PM, Tejun Heo wrote:
> On Sat, Aug 14, 2021 at 04:57:42PM -0400, Waiman Long wrote:
>> +	A parent partition may distribute all its CPUs to its child
>> +	partitions as long as it is not the root cgroup and there is no
>> +	task directly associated with that parent partition.  Otherwise,
> "there is not task directly associated with the parent partition" isn't
> necessary, right? That's already enforced by the cgroup hierarchy itself.

Sorry for the late reply as I was on vacation last week.

Yes, that is true. I should have de-emphasized the fact that the parent
partition must have no task.

>
>> +	there must be at least one cpu left in the parent partition.
>> +	A new task cannot be moved to a partition root with no effective
>> +	cpu.
>> +
>> +	Once becoming a partition root, changes to "cpuset.cpus"
>> +	are generally allowed as long as the first condition above
>> +	(cpu exclusivity rule) is true.
> All the above ultimately says is that "a new task cannot be moved to a
> partition root with no effective cpu", but I don't understand why this would
> be a separate rule. Shouldn't the partition just stop being a partition when
> it doesn't have any exclusive cpu? What's the benefit of giving it its
> own failure mode?
A partition with 0 cpu can be considered a special partition type for
spawning child partitions. This can be temporary, as the cpus will be
given back when a child partition is destroyed.
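
A minimal shell sketch of such a zero-cpu parent partition under the
semantics of this series (mount point and CPU numbers are assumptions):
the parent gives both of its CPUs to a child partition and reclaims them
when the child is removed.

  cd /sys/fs/cgroup
  echo "+cpuset" > cgroup.subtree_control
  mkdir parent
  echo "2-3" > parent/cpuset.cpus
  echo "root" > parent/cpuset.cpus.partition

  mkdir parent/child
  echo "+cpuset" > parent/cgroup.subtree_control
  echo "2-3" > parent/child/cpuset.cpus
  echo "root" > parent/child/cpuset.cpus.partition

  # The parent partition now has no effective cpu of its own ...
  cat parent/cpuset.cpus.effective    # empty
  # ... and gets the cpus back once the child partition is destroyed.
  rmdir parent/child
  cat parent/cpuset.cpus.effective    # "2-3" again
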
>
>> +	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
>> +	the state of the partition root to become invalid when the
>> +	other constraints of partition root are violated.  Therefore,
>> +	it is recommended that users should always set "cpuset.cpus"
>> +	to the proper value first before enabling partition.  In case
>> +	"cpuset.cpus" has to be modified after partition is enabled,
>> +	users should check the state of "cpuset.cpus.partition" after
>> +	making changes to it to make sure that the partition is still
>> +	valid.
> So, idk why this doesn't cover the one above it. Also, this really
> should be worded a lot stronger. It's not just recommended - confirming and
> monitoring the transitions is an integral and essential part of using
> cpuset.
Sure, I will reword it to remove any mention of recommendation.
> ...
>> +	An invalid partition is not a real partition even though the
>> +	restriction of the cpu exclusivity rule will still apply.
> Is there a reason we can't bring this in line with other failure behaviors?
The internal flags are kept so that we can easily recover and become a
valid partition again when the cpus become available. Otherwise, we
cannot guarantee that the partition status can be restored when the
cpus become available.
>
>> +	In the special case of a parent partition competing with a child
>> +	partition for the only CPU left, the parent partition wins and
>> +	the child partition becomes invalid.
> Given that parent partitions are *always* empty, this rule doesn't seem to
> make sense.
You are right. I will update the wording.
>
> So, I think this definitely is a step in the right direction but still seems
> to be neither here nor there. Before, we pretended that we could police the
> input when we couldn't. Now, we're changing the interface so that it
> includes configuration failures as an integral part; however, we're still
> policing some particular inputs while letting other inputs pass through and
> trigger failures, and why one is handled one way while the other is handled
> differently seems rather arbitrary.
>
The cpu_exclusive and load_balance flags are attributes associated
directly with the partition type. They are not affected by cpu
availability or changes to the cpu list. That is why they are kept even
when the partition becomes invalid. If we have to remove them, it will
be equivalent to changing the partition back to a member and we may not
need an invalid partition type at all. Also, we will not be able to
revert back to a partition again when the cpus become available.

Cheers,
Longman
Tejun Heo Aug. 24, 2021, 7:04 p.m. UTC | #3
Hello,

On Tue, Aug 24, 2021 at 01:35:33AM -0400, Waiman Long wrote:
> Sorry for the late reply as I was on vacation last week.

No worries. Hope you enjoyed the vacation. :)

> > All the above ultimately says is that "a new task cannot be moved to a
> > partition root with no effective cpu", but I don't understand why this would
> > be a separate rule. Shouldn't the partition just stop being a partition when
> > it doesn't have any exclusive cpu? What's the benefit of giving it its
> > own failure mode?
>
> A partition with 0 cpu can be considered a special partition type for
> spawning child partitions. This can be temporary, as the cpus will be
> given back when a child partition is destroyed.

But it can also happen by cpus going offline while the partition is
populated, right? Am I correct in thinking that a partition without cpu is
valid if its subtree contains cpus and invalid otherwise? If that's the
case, it looks like the rules can be made significantly simpler. The parent
cgroups never have processes anyway, so a partition is valid if its subtree
contains cpus, invalid otherwise.

> > So, I think this definitely is a step in the right direction but still seems
> > to be neither here nor there. Before, we pretended that we could police the
> > input when we couldn't. Now, we're changing the interface so that it
> > includes configuration failures as an integral part; however, we're still
> > policing some particular inputs while letting other inputs pass through and
> > trigger failures, and why one is handled one way while the other is handled
> > differently seems rather arbitrary.
> > 
> The cpu_exclusive and load_balance flags are attributes associated directly
> with the partition type. They are not affected by cpu availability or
> changes to the cpu list. That is why they are kept even when the partition
> becomes invalid. If we have to remove them, it will be equivalent to changing
> the partition back to a member and we may not need an invalid partition type
> at all. Also, we will not be able to revert back to a partition again when
> the cpus become available.

Oh, yeah, I'm not saying to lose those states. What I'm trying to say is
that the rules and failure modes seem a lot more complicated than they need
to be. If the configuration becomes invalid for whatever reason, transition
the partition into invalid state and report why. If the situation resolves
for whatever reason, transition it back to valid state. Shouldn't that work?

Thanks.
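
A shell sketch of the transitions Tejun describes, driven by cpu hotplug
(the "rt" partition from the earlier sketch, the CPU numbers, and the
exact "<reason>" strings are assumptions):

  # Take away every cpu the partition requested ...
  echo 0 > /sys/devices/system/cpu/cpu2/online
  echo 0 > /sys/devices/system/cpu/cpu3/online
  cat /sys/fs/cgroup/rt/cpuset.cpus.partition   # "root invalid (<reason>)"

  # ... and it recovers on its own when one of them comes back.
  echo 1 > /sys/devices/system/cpu/cpu2/online
  cat /sys/fs/cgroup/rt/cpuset.cpus.partition   # "root"
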
Waiman Long Aug. 25, 2021, 7:21 p.m. UTC | #4
On 8/24/21 3:04 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 24, 2021 at 01:35:33AM -0400, Waiman Long wrote:
>> Sorry for the late reply as I was on vacation last week.
> No worries. Hope you enjoyed the vacation. :)
>
>>> All the above ultimately says is that "a new task cannot be moved to a
>>> partition root with no effective cpu", but I don't understand why this would
>>> be a separate rule. Shouldn't the partition just stop being a partition when
>>> it doesn't have any exclusive cpu? What's the benefit of giving it its
>>> own failure mode?
>> A partition with 0 cpu can be considered a special partition type for
>> spawning child partitions. This can be temporary, as the cpus will be
>> given back when a child partition is destroyed.
> But it can also happen by cpus going offline while the partition is
> populated, right? Am I correct in thinking that a partition without cpu is
> valid if its subtree contains cpus and invalid otherwise? If that's the
> case, it looks like the rules can be made significantly simpler. The parent
> cgroups never have processes anyway, so a partition is valid if its subtree
> contains cpus, invalid otherwise.
Yes, that is true. Thanks for the simplification.
>
>>> So, I think this definitely is a step in the right direction but still seems
>>> to be neither here nor there. Before, we pretended that we could police the
>>> input when we couldn't. Now, we're changing the interface so that it
>>> includes configuration failures as an integral part; however, we're still
>>> policing some particular inputs while letting other inputs pass through and
>>> trigger failures, and why one is handled one way while the other is handled
>>> differently seems rather arbitrary.
>>>
>> The cpu_exclusive and load_balance flags are attributes associated directly
>> with the partition type. They are not affected by cpu availability or
>> changes to the cpu list. That is why they are kept even when the partition
>> becomes invalid. If we have to remove them, it will be equivalent to changing
>> the partition back to a member and we may not need an invalid partition type
>> at all. Also, we will not be able to revert back to a partition again when
>> the cpus become available.
> Oh, yeah, I'm not saying to lose those states. What I'm trying to say is
> that the rules and failure modes seem a lot more complicated than they need
> to be. If the configuration becomes invalid for whatever reason, transition
> the partition into invalid state and report why. If the situation resolves
> for whatever reason, transition it back to valid state. Shouldn't that work?

I agree that the current description is probably more complicated than 
it should be. I will try to fix that.

Thanks,
Longman
Tejun Heo Aug. 25, 2021, 7:24 p.m. UTC | #5
Hello,

On Wed, Aug 25, 2021 at 03:21:59PM -0400, Waiman Long wrote:
> I agree that the current description is probably more complicated than it
> should be. I will try to fix that.

To avoid repeating back-and-forth with all the code changes, would it help
if you first described the intended behaviors, whether in the form of a
doc patch or just an informal description?

Thank you.

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index babbe04c8d37..9ad52f74fb12 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2091,8 +2091,9 @@  Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
 	When set to be a partition root, the current cgroup is the
@@ -2101,64 +2102,89 @@  Cpuset Interface Files
 	partition roots themselves and their descendants.  The root
 	cgroup is always a partition root.
 
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks in such a partition must be explicitly bound
+	to each individual CPU.
+
+	There are constraints on where a partition root can be set
+	("root" or "isolated").  It can only be set in a cgroup if all
+	the following conditions are true.
 
 	1) The "cpuset.cpus" is not empty and the list of CPUs are
 	   exclusive, i.e. they are not shared by any of its siblings.
 	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
+	3) The "cpuset.cpus" is a subset of the parent's
 	   "cpuset.cpus.effective".
 	4) There is no child cgroups with cpuset enabled.  This is for
 	   eliminating corner cases that have to be handled if such a
 	   condition is allowed.
 
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
+	Setting it to a partition root will take the CPUs away from
+	the effective CPUs of the parent cgroup.  Once it is set, this
 	file cannot be reverted back to "member" if there are any child
 	cgroups with cpuset enabled.
 
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
+	A parent partition may distribute all its CPUs to its child
+	partitions as long as it is not the root cgroup and there is no
+	task directly associated with that parent partition.  Otherwise,
+	there must be at least one cpu left in the parent partition.
+	A new task cannot be moved to a partition root with no effective
+	cpu.
+
+	Once becoming a partition root, changes to "cpuset.cpus"
+	are generally allowed as long as the first condition above
+	(cpu exclusivity rule) is true.
+
+	Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause
+	the state of the partition root to become invalid when the
+	other constraints of partition root are violated.  Therefore,
+	it is recommended that users should always set "cpuset.cpus"
+	to the proper value first before enabling partition.  In case
+	"cpuset.cpus" has to be modified after partition is enabled,
+	users should check the state of "cpuset.cpus.partition" after
+	making changes to it to make sure that the partition is still
+	valid.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ==========================	======================================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  ==========================	======================================
+
+	A partition root becomes invalid if all the CPUs requested in
+	"cpuset.cpus" become unavailable.  This can happen if all the
+	CPUs have been offlined, or the state of an ancestor partition
+	root becomes invalid. "<reason>" is a string that describes why
+	the partition becomes invalid.
+
+	An invalid partition is not a real partition even though the
+	restriction of the cpu exclusivity rule will still apply.
 	The cpu affinity of all the tasks in the cgroup will then be
 	associated with CPUs in the nearest ancestor partition.
 
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	In the special case of a parent partition competing with a child
+	partition for the only CPU left, the parent partition wins and
+	the child partition becomes invalid.
+
+	An invalid partition root can be transitioned back to a real
+	partition root if at least one of the requested CPUs becomes
+	available again.  In this case, the cpu affinity of all the tasks
+	in the formerly invalid partition will be associated with the CPUs
+	of the newly formed partition.  Changing the partition state of
+	an invalid partition root to "member" is always allowed even if
+	child cpusets are present.  However, changing a partition root back
+	to "member" will not be allowed if child partitions are present.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused
+	by a write to "cpuset.cpus.partition" and by cpu hotplug.  This
+	allows a user-space agent to monitor changes caused by hotplug
+	events.
 
 
 Device controller
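
As the patch's closing paragraph notes, poll and inotify events fire
whenever "cpuset.cpus.partition" changes; a minimal user-space watcher
might look like this (inotifywait from inotify-tools and the cgroup path
are assumptions):

  # Log every transition of a partition's state, including those
  # triggered by cpu hotplug.
  PART=/sys/fs/cgroup/rt/cpuset.cpus.partition
  while inotifywait -qq -e modify "$PART"; do
      echo "$(date +%T) partition state: $(cat "$PART")"
  done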