diff mbox series

[v9,6/7] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Message ID 20211205183220.818872-7-longman@redhat.com (mailing list archive)
State New
Headers show
Series cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus | expand

Commit Message

Waiman Long Dec. 5, 2021, 6:32 p.m. UTC
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 168 ++++++++++++++----------
 1 file changed, 102 insertions(+), 66 deletions(-)

Comments

Tejun Heo Dec. 13, 2021, 9 p.m. UTC | #1
Hello,

On Sun, Dec 05, 2021 at 01:32:19PM -0500, Waiman Long wrote:
> +	In the case of an invalid partition root, a descriptive string on
> +	why the partition is invalid is included within parentheses.
> +
> +	Almost all possible state transitions among "member", valid
> +	and invalid partition roots are allowed except from "member"
> +	to invalid partition root.

So, this part still bothers me for the following two reasons that I brought
up earlier:

* When a valid partition turns invalid, now we have a reliable way of
  discovering what exactly caused the transition. However, when a user now
  fails to turn a member into partition, all they get is -EINVAL and there's
  no way to discover why it failed and the failure conditions that -EINVAL
  represents aren't simple.

* In an automated configuration scenarios, this operation mode may be
  difficult to make reliable and lead to sporadic failures which can be
  tricky to track down. The core problem is that whether a given operation
  succeeds or not may depend on external states (CPU on/offline) which may
  change asynchronously in a way that the configuring entity doesn't have
  any control over.

It's true that both are existing problems with the current partition
interface and given that this is a pretty spcialized feature, this can be
okay. Michal, what are your thoughts?

Thanks.
Michal Koutný Dec. 15, 2021, 2:44 p.m. UTC | #2
On Mon, Dec 13, 2021 at 11:00:17AM -1000, Tejun Heo <tj@kernel.org> wrote:
> * When a valid partition turns invalid, now we have a reliable way of
>   discovering what exactly caused the transition. However, when a user now
>   fails to turn a member into partition, all they get is -EINVAL and there's
>   no way to discover why it failed and the failure conditions that -EINVAL
>   represents aren't simple.
> 
> * In an automated configuration scenarios, this operation mode may be
>   difficult to make reliable and lead to sporadic failures which can be
>   tricky to track down. The core problem is that whether a given operation
>   succeeds or not may depend on external states (CPU on/offline) which may
>   change asynchronously in a way that the configuring entity doesn't have
>   any control over.
> 
> It's true that both are existing problems with the current partition
> interface and given that this is a pretty spcialized feature, this can be
> okay. Michal, what are your thoughts?

Because of asynchronous changes, the return value should not be that
important and the user should watch cpuset.partitions for the result
(end state) anyway.
Furthermore, the reasons should be IMO just informative (i.e. I like
they're not explicitly documented) and not API.

But I see there could be a distinction between -EINVAL (the supplied
input makes no sense) and -EAGAIN(?) denoting that the switch to
partition root could not happen (due to outer constraints).

You seem to propose to replace the -EAGAIN above with a success code and
allow the switch to an invalid root.
The action of the configuring entity would be different: retry (when?)
vs wait till transition happens (notification) (although the immediate
effect (the change did not happen) is same).
I considered the two variants equal but the clear information about when
the change can happen I'd favor the variant allowing the switch to
invalid root now.


Michal
Waiman Long Dec. 15, 2021, 6:16 p.m. UTC | #3
On 12/15/21 09:44, Michal Koutný wrote:
> On Mon, Dec 13, 2021 at 11:00:17AM -1000, Tejun Heo <tj@kernel.org> wrote:
>> * When a valid partition turns invalid, now we have a reliable way of
>>    discovering what exactly caused the transition. However, when a user now
>>    fails to turn a member into partition, all they get is -EINVAL and there's
>>    no way to discover why it failed and the failure conditions that -EINVAL
>>    represents aren't simple.
>>
>> * In an automated configuration scenarios, this operation mode may be
>>    difficult to make reliable and lead to sporadic failures which can be
>>    tricky to track down. The core problem is that whether a given operation
>>    succeeds or not may depend on external states (CPU on/offline) which may
>>    change asynchronously in a way that the configuring entity doesn't have
>>    any control over.
>>
>> It's true that both are existing problems with the current partition
>> interface and given that this is a pretty spcialized feature, this can be
>> okay. Michal, what are your thoughts?
> Because of asynchronous changes, the return value should not be that
> important and the user should watch cpuset.partitions for the result
> (end state) anyway.
> Furthermore, the reasons should be IMO just informative (i.e. I like
> they're not explicitly documented) and not API.
>
> But I see there could be a distinction between -EINVAL (the supplied
> input makes no sense) and -EAGAIN(?) denoting that the switch to
> partition root could not happen (due to outer constraints).
>
> You seem to propose to replace the -EAGAIN above with a success code and
> allow the switch to an invalid root.
> The action of the configuring entity would be different: retry (when?)
> vs wait till transition happens (notification) (although the immediate
> effect (the change did not happen) is same).
> I considered the two variants equal but the clear information about when
> the change can happen I'd favor the variant allowing the switch to
> invalid root now.

Allowing direct transition from member to invalid partition doesn't feel 
right for me. A casual user may assume a partition is correctly formed 
without double checking the "cpuset.partition" value. Returning an error 
will prevent this kind of issue. If returning more information about the 
failure is the main reason for allowing the invalid partition 
transition, we can extend the "cpuset.partition" read syntax to also 
show the reason for the previous failure.

Cheers,
Longman
Tejun Heo Dec. 15, 2021, 6:35 p.m. UTC | #4
Hello, Waiman.

On Wed, Dec 15, 2021 at 01:16:43PM -0500, Waiman Long wrote:
> Allowing direct transition from member to invalid partition doesn't feel
> right for me. A casual user may assume a partition is correctly formed
> without double checking the "cpuset.partition" value. Returning an error
> will prevent this kind of issue. If returning more information about the
> failure is the main reason for allowing the invalid partition transition, we
> can extend the "cpuset.partition" read syntax to also show the reason for
> the previous failure.

I don't think it's a good idea to display error messages without a way to
link the error to the one who triggered it. This is the same problem we had
with resettable counters. It only works for scenarios where one guy is
sitting in front of the computer but gets nastry for more complex scnearios
and automation.

I understand that allowing transitions to invalid state can feel jarring.
There are pros and cons to both approaches. It's similar dynamics tho.
Erroring out may be more intuitive for a casual user but makes it harder for
more complex scenarios because whether a given operation errors or not is
dependent on external asynchronous states, there's no good way of reporting
the exact nature of the error or detecting when the operation would succeed
in the future, and the error conditions are rather arbitrary.

Thanks.
Waiman Long Dec. 15, 2021, 6:55 p.m. UTC | #5
On 12/15/21 13:35, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Dec 15, 2021 at 01:16:43PM -0500, Waiman Long wrote:
>> Allowing direct transition from member to invalid partition doesn't feel
>> right for me. A casual user may assume a partition is correctly formed
>> without double checking the "cpuset.partition" value. Returning an error
>> will prevent this kind of issue. If returning more information about the
>> failure is the main reason for allowing the invalid partition transition, we
>> can extend the "cpuset.partition" read syntax to also show the reason for
>> the previous failure.
> I don't think it's a good idea to display error messages without a way to
> link the error to the one who triggered it. This is the same problem we had
> with resettable counters. It only works for scenarios where one guy is
> sitting in front of the computer but gets nastry for more complex scnearios
> and automation.
Yes, I agree it is not a good way to handle this issue.
>
> I understand that allowing transitions to invalid state can feel jarring.
> There are pros and cons to both approaches. It's similar dynamics tho.
> Erroring out may be more intuitive for a casual user but makes it harder for
> more complex scenarios because whether a given operation errors or not is
> dependent on external asynchronous states, there's no good way of reporting
> the exact nature of the error or detecting when the operation would succeed
> in the future, and the error conditions are rather arbitrary.

Thanks for the explanation. Yes, there are always pros and cons for 
different approach to a problem. I am not totally against allowing 
member to invalid partition transition. In that case, reading back 
"cpuset.partition" is a must to verify that it is really a success.

How about we allow transition to an invalid partition state but still 
return an error?

Regards,
Longman
Tejun Heo Jan. 12, 2022, 9:21 p.m. UTC | #6
Hello,

On Wed, Dec 15, 2021 at 01:55:05PM -0500, Waiman Long wrote:
> How about we allow transition to an invalid partition state but still return
> an error?

Like -EAGAIN as Michal suggested? I don't know. I understand the motivation
but one problem is that error return usually means that the operation didn't
change the state of the system and that holds even for -EAGAIN. So, we'd be
trading one locally jarring thing (this thing is asynchrnous and the actual
state transitions should be monitored separately) to another one which is
jarring in a wider context (this thing returned error but the system state
changed anyway). To me, the latter seems noticeably worse given how common
the assumption that an error return indicate that nothing actually happened.

We have other cases where we split operation submissions and completions
(aios being the obvious one) but I don't think we have any where -EAGAIN
indicates successful initiation of an operation. At least I hope not.

Thanks.
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 2aeb7ae8b393..9612319b353f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2099,74 +2099,110 @@  Cpuset Interface Files
 	It accepts only the following input values when written to.
 
 	  ========	================================
-	  "root"	a partition root
-	  "member"	a non-root member of a partition
+	  "member"	Non-root member of a partition
+	  "root"	Partition root
+	  "isolated"	Partition root without load balancing
 	  ========	================================
 
-	When set to be a partition root, the current cgroup is the
-	root of a new partition or scheduling domain that comprises
-	itself and all its descendants except those that are separate
-	partition roots themselves and their descendants.  The root
-	cgroup is always a partition root.
-
-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
-
-	1) The "cpuset.cpus" is not empty and the list of CPUs are
-	   exclusive, i.e. they are not shared by any of its siblings.
-	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	4) There is no child cgroups with cpuset enabled.  This is for
-	   eliminating corner cases that have to be handled if such a
-	   condition is allowed.
-
-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup.  Once it is set, this
-	file cannot be reverted back to "member" if there are any child
-	cgroups with cpuset enabled.
-
-	A parent partition cannot distribute all its CPUs to its
-	child partitions.  There must be at least one cpu left in the
-	parent partition.
-
-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
-
-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change.  On read, the "cpuset.sched.partition" file
-	can show the following values.
-
-	  ==============	==============================
-	  "member"		Non-root member of a partition
-	  "root"		Partition root
-	  "root invalid"	Invalid partition root
-	  ==============	==============================
-
-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
-
-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself.  In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
-
-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent.  In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	The root cgroup is always a partition root and its state
+	cannot be changed.  All other non-root cgroups start out as
+	"member".
+
+	When set to "root", the current cgroup is the root of a new
+	partition or scheduling domain that comprises itself and
+	all its descendants except those that are separate partition
+	roots themselves and their descendants.
+
+	The value shown in "cpuset.cpus.effective" of a partition root is
+	the CPUs that the parent partition root can dedicate to the new
+	partition root.  They are subtracted from "cpuset.cpus.effective"
+	of the parent and may be different from "cpuset.cpus"
+
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler.  Tasks placed in such a partition with multiple
+	CPUs should be carefully distributed and bound to each of the
+	individual CPUs for optimal performance.
+
+	A partition root ("root" or "isolated") can be in one of the
+	two possible states - valid or invalid.  An invalid partition
+	root is in a degraded state where some state information are
+	retained, but behaves more like a "member".
+
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.
+
+	  ======================	==============================
+	  "member"			Non-root member of a partition
+	  "root"			Partition root
+	  "isolated"			Partition root without load balancing
+	  "root invalid (<reason>)"	Invalid partition root
+	  "isolated invalid (<reason>)"	Invalid isolated partition root
+	  ======================	==============================
+
+	In the case of an invalid partition root, a descriptive string on
+	why the partition is invalid is included within parentheses.
+
+	Almost all possible state transitions among "member", valid
+	and invalid partition roots are allowed except from "member"
+	to invalid partition root.
+
+	Before the "member" to partition root transition can happen,
+	the following conditions must be met or the transition will
+	not be allowed.
+
+	1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
+	   not shared by any of its siblings.
+	2) The parent cgroup is a valid partition root.
+	3) The "cpuset.cpus" must contain at least one of the CPUs from
+	   parent's "cpuset.cpus", i.e. they overlap.
+	4) There is no child cgroups with cpuset enabled.  This avoids
+	   cpu migrations of multiple cgroups simultaneously which can
+	   be problematic.
+
+	Once becoming a partition root, the only rule restricting
+	changes made to "cpuset.cpus" is the exclusivity rule where
+	none of the siblings of a partition root can share CPUs with
+	it.
+
+	External events like hotplug or inappropriate changes to
+	"cpuset.cpus" can cause a valid partition root to become invalid.
+	Besides the exclusivity rule listed above, the other conditions
+	required to maintain the validity of a partition root are
+	as follows:
+
+	1) The parent cgroup is a valid partition root.
+	2) If "cpuset.cpus.effective" is empty, the partition must have
+	   no task associated with it. Otherwise, the partition becomes
+	   invalid and "cpuset.cpus.effective" will fall back to that
+	   of the nearest non-empty ancestor.
+
+	A corollary of a valid partition root is that
+	"cpuset.cpus.effective" is always a subset of "cpuset.cpus".
+	Note that a task cannot be moved to a cgroup with empty
+	"cpuset.cpus.effective".
+
+	A valid non-root parent partition may distribute out all its CPUs
+	to its child partitions when there is no task associated with it.
+
+	An invalid partition root will be reverted back to a valid
+	one if none of the validity constraints of a valid partition
+	root are violated due to hotplug events or proper changes to
+	"cpuset.cpus" files.
+
+	Changing a partition root (valid or invalid) to "member" is
+	always allowed.  If there are child partition roots underneath
+	it, they will become invalid and unrecoverable.  So care must
+	be taken to double check for this condition before disabling
+	a partition root.
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes.  That includes changes caused
+	by write to "cpuset.cpus.partition", cpu hotplug or other
+	changes that modify the validity status of the partition.
+	This will allow user space agents to monitor unexpected changes
+	to "cpuset.cpus.partition" without the need to do continuous
+	polling.
 
 
 Device controller