diff mbox series

[v11,7/8] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Message ID 20220510153413.400020-8-longman@redhat.com (mailing list archive)
State New
Headers show
Series [v11,1/8] cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask() | expand

Commit Message

Waiman Long May 10, 2022, 3:34 p.m. UTC
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 149 +++++++++++++-----------
 1 file changed, 83 insertions(+), 66 deletions(-)

Comments

Tejun Heo June 12, 2022, 5:49 p.m. UTC | #1
Hello,

On Tue, May 10, 2022 at 11:34:12AM -0400, Waiman Long wrote:
> +	External events like hotplug or changes to "cpuset.cpus" can
> +	cause a valid partition root to become invalid and vice versa.
> +	Note that a task cannot be moved to a cgroup with empty
> +	"cpuset.cpus.effective".
> +
> +	For a valid partition root or an invalid partition root with
> +	the exclusivity rule enabled, changes made to "cpuset.cpus"
> +	that violate the exclusivity rule will not be allowed.

My memory is failing but this is the same thing that we were discussing
before, right? The point was that the different behaviors re. system events
and config actions seemed unncessary and IIRC Michal was of the same opinion
(please correct me if I'm misremembering).

> +	A valid non-root parent partition may distribute out all its CPUs
> +	to its child partitions when there is no task associated with it.

I'm probably forgetting something. Was this necessary because of threaded
cgroup support because otherwise the above condition is superflous?

Thanks.
Waiman Long June 13, 2022, 3:02 a.m. UTC | #2
On 6/12/22 13:49, Tejun Heo wrote:
> Hello,
>
> On Tue, May 10, 2022 at 11:34:12AM -0400, Waiman Long wrote:
>> +	External events like hotplug or changes to "cpuset.cpus" can
>> +	cause a valid partition root to become invalid and vice versa.
>> +	Note that a task cannot be moved to a cgroup with empty
>> +	"cpuset.cpus.effective".
>> +
>> +	For a valid partition root or an invalid partition root with
>> +	the exclusivity rule enabled, changes made to "cpuset.cpus"
>> +	that violate the exclusivity rule will not be allowed.
> My memory is failing but this is the same thing that we were discussing
> before, right? The point was that the different behaviors re. system events
> and config actions seemed unncessary and IIRC Michal was of the same opinion
> (please correct me if I'm misremembering).

That is the behavior enforced by setting the CPU_EXCLUSIVE bit in cgroup 
v1. I haven't explicitly change it to make it different in cgroup v2. 
The major reason is that I don't want change to one cpuset to affect a 
sibling partition as it may make the code more complicate to validate if 
a partition is valid.

>
>> +	A valid non-root parent partition may distribute out all its CPUs
>> +	to its child partitions when there is no task associated with it.
> I'm probably forgetting something. Was this necessary because of threaded
> cgroup support because otherwise the above condition is superflous?

The top cpuset cannot have empty cpus.effective whereas the non-root 
partition roots can. Maybe I should reword it to make it more clear.

Thanks,
Longman
Tejun Heo June 13, 2022, 3:12 a.m. UTC | #3
Hello,

On Sun, Jun 12, 2022 at 11:02:38PM -0400, Waiman Long wrote:
> That is the behavior enforced by setting the CPU_EXCLUSIVE bit in cgroup v1.
> I haven't explicitly change it to make it different in cgroup v2. The major
> reason is that I don't want change to one cpuset to affect a sibling
> partition as it may make the code more complicate to validate if a partition
> is valid.

If at all possible, I'd really like to avoid situations where a parent can't
withdraw resources due to something that a descendant does.

Thanks.
Waiman Long June 13, 2022, 1:18 p.m. UTC | #4
On 6/12/22 23:12, Tejun Heo wrote:
> Hello,
>
> On Sun, Jun 12, 2022 at 11:02:38PM -0400, Waiman Long wrote:
>> That is the behavior enforced by setting the CPU_EXCLUSIVE bit in cgroup v1.
>> I haven't explicitly change it to make it different in cgroup v2. The major
>> reason is that I don't want change to one cpuset to affect a sibling
>> partition as it may make the code more complicate to validate if a partition
>> is valid.
> If at all possible, I'd really like to avoid situations where a parent can't
> withdraw resources due to something that a descendant does.

No, it doesn't affect parent at all. It just limit whats the siblings 
can do due to their mutual constraint. If this is what the confusion is 
about, I will try to reword the doc text.

Cheers,
Longman
Michal Koutný June 13, 2022, 2:24 p.m. UTC | #5
On Sun, Jun 12, 2022 at 05:12:51PM -1000, Tejun Heo <tj@kernel.org> wrote:
> Hello,
> 
> On Sun, Jun 12, 2022 at 11:02:38PM -0400, Waiman Long wrote:
> > That is the behavior enforced by setting the CPU_EXCLUSIVE bit in cgroup v1.
> > I haven't explicitly change it to make it different in cgroup v2. The major
> > reason is that I don't want change to one cpuset to affect a sibling
> > partition as it may make the code more complicate to validate if a partition
> > is valid.
> 
> If at all possible, I'd really like to avoid situations where a parent can't
> withdraw resources due to something that a descendant does.

My understanding of the discussed paragraph is that the changes are only
disallowed only among siblings on one level (due to exclusivity rule,
checked in validate_change()). A change in parent won't affect
(non)exclusivity of (valid) children so it's simply allowed.

So the docs (and implementation by a quick look) is sensible.

Michal
Waiman Long June 13, 2022, 5:06 p.m. UTC | #6
On 6/13/22 09:18, Waiman Long wrote:
> On 6/12/22 23:12, Tejun Heo wrote:
>> Hello,
>>
>> On Sun, Jun 12, 2022 at 11:02:38PM -0400, Waiman Long wrote:
>>> That is the behavior enforced by setting the CPU_EXCLUSIVE bit in 
>>> cgroup v1.
>>> I haven't explicitly change it to make it different in cgroup v2. 
>>> The major
>>> reason is that I don't want change to one cpuset to affect a sibling
>>> partition as it may make the code more complicate to validate if a 
>>> partition
>>> is valid.
>> If at all possible, I'd really like to avoid situations where a 
>> parent can't
>> withdraw resources due to something that a descendant does.
>
> No, it doesn't affect parent at all. It just limit whats the siblings 
> can do due to their mutual constraint. If this is what the confusion 
> is about, I will try to reword the doc text.

I am planning to make the following change to the documentation patch. 
Please let me know if that can clarify the confusion, if any.

Thanks,
Longman

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guid>
index 9184a09e0fc9..9cbfa25dab97 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2176,7 +2175,8 @@ Cpuset Interface Files

         For a valid partition root or an invalid partition root with
         the exclusivity rule enabled, changes made to "cpuset.cpus"
-       that violate the exclusivity rule will not be allowed.
+       that violate the exclusivity rule with its siblings will not
+       be allowed.

         A valid non-root parent partition may distribute out all its CPUs
         to its child partitions when there is no task associated with it.
Tejun Heo June 13, 2022, 5:28 p.m. UTC | #7
Hello,

On Mon, Jun 13, 2022 at 04:24:52PM +0200, Michal Koutný wrote:
> On Sun, Jun 12, 2022 at 05:12:51PM -1000, Tejun Heo <tj@kernel.org> wrote:
> > On Sun, Jun 12, 2022 at 11:02:38PM -0400, Waiman Long wrote:
> > > That is the behavior enforced by setting the CPU_EXCLUSIVE bit in cgroup v1.
> > > I haven't explicitly change it to make it different in cgroup v2. The major
> > > reason is that I don't want change to one cpuset to affect a sibling
> > > partition as it may make the code more complicate to validate if a partition
> > > is valid.
> > 
> > If at all possible, I'd really like to avoid situations where a parent can't
> > withdraw resources due to something that a descendant does.
> 
> My understanding of the discussed paragraph is that the changes are only
> disallowed only among siblings on one level (due to exclusivity rule,
> checked in validate_change()). A change in parent won't affect
> (non)exclusivity of (valid) children so it's simply allowed.
> 
> So the docs (and implementation by a quick look) is sensible.

I see. Is this part even necessary? All the .cpus files of the siblings are
owned by the parent who's responsible for configuring both the mode that the
cgroup subtree is gonna be in and their cpumasks. Given that all the other
errors it can make are notified through "invalid (REASON)" in the mode file,
wouldn't it fit better to notify cpus configuration error the same way too?

Thanks.
Michal Koutný June 13, 2022, 5:55 p.m. UTC | #8
On Mon, Jun 13, 2022 at 07:28:25AM -1000, Tejun Heo <tj@kernel.org> wrote:
> I see. Is this part even necessary? All the .cpus files of the siblings are
> owned by the parent who's responsible for configuring both the mode that the
> cgroup subtree is gonna be in and their cpumasks. 

Do you mean such an example:

    parent	cpuset.cpus=SET (root)	cpuset.cpus.partition=isolated 
    `- child_1	cpuset.cpus=partition_of(SET) (root)	cpuset.cpus.partition=isolated
    `- ...
    `- child_n	cpuset.cpus=partition_of(SET) (root)	cpuset.cpus.partition=isolated
?

I don't think child_*/cpuset.cpus must be owned by root.
Actually, the root would only configure the parent, i.e.
parent/cpuset.cpus (whose changes would be disallowed to the
unprivileged tasks) and the distribution among siblings would up to the
whatever runs below.

> Given that all the other errors it can make are notified through
> "invalid (REASON)" in the mode file, wouldn't it fit better to notify
> cpus configuration error the same way too?

Do you suggest that a write into child_*/cpuset.cpus that'd not be
exclusive wrt a sibling would result in an error string in
parent/cpuset.cpus.partition?

Thanks,
Michal
Tejun Heo June 13, 2022, 6 p.m. UTC | #9
Hello,

On Mon, Jun 13, 2022 at 07:55:49PM +0200, Michal Koutný wrote:
> On Mon, Jun 13, 2022 at 07:28:25AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > I see. Is this part even necessary? All the .cpus files of the siblings are
> > owned by the parent who's responsible for configuring both the mode that the
> > cgroup subtree is gonna be in and their cpumasks. 
> 
> Do you mean such an example:
> 
>     parent	cpuset.cpus=SET (root)	cpuset.cpus.partition=isolated 
>     `- child_1	cpuset.cpus=partition_of(SET) (root)	cpuset.cpus.partition=isolated
>     `- ...
>     `- child_n	cpuset.cpus=partition_of(SET) (root)	cpuset.cpus.partition=isolated
> ?
> 
> I don't think child_*/cpuset.cpus must be owned by root.

I meant the parent.

> Actually, the root would only configure the parent, i.e.
> parent/cpuset.cpus (whose changes would be disallowed to the
> unprivileged tasks) and the distribution among siblings would up to the
> whatever runs below.
> 
> > Given that all the other errors it can make are notified through
> > "invalid (REASON)" in the mode file, wouldn't it fit better to notify
> > cpus configuration error the same way too?
> 
> Do you suggest that a write into child_*/cpuset.cpus that'd not be
> exclusive wrt a sibling would result in an error string in
> parent/cpuset.cpus.partition?

Yeah, I don't know why this part is different from any other errors that the
parent can make.

Thanks.
Michal Koutný June 14, 2022, 11:53 a.m. UTC | #10
On Mon, Jun 13, 2022 at 08:00:56AM -1000, Tejun Heo <tj@kernel.org> wrote:
> Yeah, I don't know why this part is different from any other errors that the
> parent can make.

It's different because a write to parent's cpuset.cpus is independent of
whether cpuset.cpus of its children are exclusive or not.
In an extreme case the children may be non-exclusive

    parent	cpuset.cpus=0-3 //   valid partition
    `- child_1	cpuset.cpus=0-1	// invalid partition
    `- child_2	cpuset.cpus=1-2 // invalid partition

but the parent can still be a valid partition (thanks to cpu no. 3 in
the example above).

Do I miss anything?

Thanks,
Michal
Tejun Heo June 27, 2022, 7:10 p.m. UTC | #11
Hello,

On Tue, Jun 14, 2022 at 01:53:45PM +0200, Michal Koutný wrote:
> On Mon, Jun 13, 2022 at 08:00:56AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > Yeah, I don't know why this part is different from any other errors that the
> > parent can make.
> 
> It's different because a write to parent's cpuset.cpus is independent of
> whether cpuset.cpus of its children are exclusive or not.
> In an extreme case the children may be non-exclusive
> 
>     parent	cpuset.cpus=0-3 //   valid partition
>     `- child_1	cpuset.cpus=0-1	// invalid partition
>     `- child_2	cpuset.cpus=1-2 // invalid partition
> 
> but the parent can still be a valid partition (thanks to cpu no. 3 in
> the example above).
> 
> Do I miss anything?

What I'm trying to say is that cpuset.cpus of child_1 and child_2 are
owned by the parent, so a feature which blocks siblings from
intersecting each other doesn't make whole lot of sense because all
those files are under the control of the parent who would have the
power to enable or disable the restrition anyway.

The partition mode file is owned by the parent too, right? So, all
these are to be configured by the same entity and the errors can be
reported the same way, no?

Thanks.
Michal Koutný June 30, 2022, 2:32 p.m. UTC | #12
On Tue, Jun 28, 2022 at 04:10:29AM +0900, Tejun Heo <tj@kernel.org> wrote:
> What I'm trying to say is that cpuset.cpus of child_1 and child_2 are
> owned by the parent,

Cf

On Mon, Jun 13, 2022 at 08:00:56AM -1000, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Jun 13, 2022 at 07:55:49PM +0200, Michal Koutný wrote:
> > I don't think child_*/cpuset.cpus must be owned by root.
> 
> I meant the parent.

I'm slightly confused.

> so a feature which blocks siblings from intersecting each other
> doesn't make whole lot of sense because all those files are under the
> control of the parent who would have the power to enable or disable
> the restrition anyway.

file				owner
parent/				user (mkdir)
`- cpuset.cpus			root
`- cpuset.cpus.partition	root	(P)
`- child_1/			user
  ` cpuset.cpus			user	(*)
`- child_2/			user
  ` cpuset.cpus			user	(*)

The writes to child cpuset.cpus may/may not invalidate parent's (P)
partition validity (whether a cpu is left to it to host possible tasks).
child_1 vs child_2 overlap affects only whether the children cgroups are
a valid partition.

I think you mean: writes to children cpuset.cpus should be allowed,
possible exclusivity violation should be reported in
parent/cpuset.cpus.partition.

What I thought was OK: prevent (fail) writes to children cpuset.cpus
that'd violate the exclusivity (or would take the last cpu from parent
if it's necessary to host a task).
IMO, it's similar to failed writes to parent/cgroup.subtree_control in a
delegated subtree if the parent still has some tasks (that'd violate
internal node constraint).

What I think might still be OK: allow writes to children cpuset.cpus
that violate exclusivity and report that in children's
cpuset.cpus.partition. Writes that'd take last cpu from parent should
still fail (similar to the failing subtree_control writes above).

Hope that clarifies,
Michal
Tejun Heo June 30, 2022, 10:53 p.m. UTC | #13
Hello,

On Thu, Jun 30, 2022 at 04:32:11PM +0200, Michal Koutný wrote:
> file				owner
> parent/				user (mkdir)
> `- cpuset.cpus			root
> `- cpuset.cpus.partition	root	(P)
> `- child_1/			user
>   ` cpuset.cpus			user	(*)
> `- child_2/			user
>   ` cpuset.cpus			user	(*)
> 
> The writes to child cpuset.cpus may/may not invalidate parent's (P)
> partition validity (whether a cpu is left to it to host possible tasks).
> child_1 vs child_2 overlap affects only whether the children cgroups are
> a valid partition.
> 
> I think you mean: writes to children cpuset.cpus should be allowed,
> possible exclusivity violation should be reported in
> parent/cpuset.cpus.partition.

I see.

> What I thought was OK: prevent (fail) writes to children cpuset.cpus
> that'd violate the exclusivity (or would take the last cpu from parent
> if it's necessary to host a task).
> IMO, it's similar to failed writes to parent/cgroup.subtree_control in a
> delegated subtree if the parent still has some tasks (that'd violate
> internal node constraint).
> 
> What I think might still be OK: allow writes to children cpuset.cpus
> that violate exclusivity and report that in children's
> cpuset.cpus.partition. Writes that'd take last cpu from parent should
> still fail (similar to the failing subtree_control writes above).

Yeah, this one.

So, here, one important question is who owns cpuset.cpus.partition
file - is it a konb which is owned by the parent like other resource
control knobs including cpuset.cpus or is it a knob which is owned by
the cgroup itself for selecting its own operation like cgroup.procs or
cgroup.subtree_control.

In the former case, the parent being able to say that "my children
can't overlap" makes sense although I'm not a big fan of the current
interface (again, who owns that knob?). In the latter case, it doesn't
really make sense cuz it'd be declaring "I can't make my children
overlap" - well, then, don't.

Thanks.
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..9184a09e0fc9 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2110,74 +2110,91 @@  Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
-	When set to be a partition root, the current cgroup is the
-	root of a new partition or scheduling domain that comprises
-	itself and all its descendants except those that are separate
-	partition roots themselves and their descendants.  The root
-	cgroup is always a partition root.
-
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
-
-	1) The "cpuset.cpus" is not empty and the list of CPUs are
-	   exclusive, i.e. they are not shared by any of its siblings.
-	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	4) There is no child cgroups with cpuset enabled.  This is for
-	   eliminating corner cases that have to be handled if such a
-	   condition is allowed.
-
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
-	cgroups with cpuset enabled.
-
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	The root cgroup is always a partition root and its state
+	cannot be changed.  All other non-root cgroups start out as
+	"member".
+
+	When set to "root", the current cgroup is the root of a new
+	partition or scheduling domain that comprises itself and all
+	its descendants except those that are separate partition roots
+	themselves and their descendants.
+
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks placed in such a partition with multiple
+	CPUs should be carefully distributed and bound to each of the
+	individual CPUs for optimal performance.
+
+	The value shown in "cpuset.cpus.effective" of a partition root
+	is the CPUs that the partition root can dedicate to a potential
+	new child partition root. The new child subtracts available
+	CPUs from its parent "cpuset.cpus.effective".
+
+	A partition root ("root" or "isolated") can be in one of the
+	two possible states - valid or invalid.  An invalid partition
+	root is in a degraded state where some state information may
+	be retained, but behaves more like a "member".
+
+	All possible state transitions among "member", "root" and
+	"isolated" are allowed.
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  "isolated invalid (<reason>)"	Invalid isolated partition root
+	  ======================	==============================
+
+	In the case of an invalid partition root, a descriptive string on
+	why the partition is invalid is included within parentheses.
+
+	For a partition root to become valid, the following conditions
+	must be met.
+
+	1) The "cpuset.cpus" is exclusive, i.e. they are not shared by
+	   any of its siblings (exclusivity rule).
+	2) The parent cgroup is a valid partition root.
+	3) The "cpuset.cpus" is not empty and must contain at least
+	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+	4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
+	   and cannot be empty unless there is no task associated with
+	   this partition.
+
+	External events like hotplug or changes to "cpuset.cpus" can
+	cause a valid partition root to become invalid and vice versa.
+	Note that a task cannot be moved to a cgroup with empty
+	"cpuset.cpus.effective".
+
+	For a valid partition root or an invalid partition root with
+	the exclusivity rule enabled, changes made to "cpuset.cpus"
+	that violate the exclusivity rule will not be allowed.
+
+	A valid non-root parent partition may distribute out all its CPUs
+	to its child partitions when there is no task associated with it.
+
+	Care must be taken to change a valid partition root to
+	"member" as all its child partitions, if present, will become
+	invalid causing disruption to tasks running in those child
+	partitions. These inactivated partitions could be recovered if
+	their parent is switched back to a partition root with a proper
+	set of "cpuset.cpus".
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused
+	by write to "cpuset.cpus.partition", cpu hotplug or other
+	changes that modify the validity status of the partition.
+	This will allow user space agents to monitor unexpected changes
+	to "cpuset.cpus.partition" without the need to do continuous
+	polling.
 
 
 Device controller