
[v3,6/9] cgroup/cpuset: Add a new isolated cpus.partition type

Message ID 20210720141834.10624-7-longman@redhat.com (mailing list archive)
State New
Series cgroup/cpuset: Add new cpuset partition type & empty effective cpus

Commit Message

Waiman Long July 20, 2021, 2:18 p.m. UTC
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD

commit 994fb794cb252edd124a46ca0994e37a4726a100
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 19 Jun 2021 13:28:19 -0400

    cgroup/cpuset: Add a new isolated cpus.partition type

    Cpuset v1 uses the sched_load_balance control file to determine if load
    balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
    as its use may require disabling load balancing at cgroup root.

    For workloads that require very low latency like DPDK, the latency
    jitters caused by periodic load balancing may exceed the desired
    latency limit.

    When cpuset v2 is in use, the only way to avoid this latency cost is to
    use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
    the kernel boot, however, there is no way to add or remove CPUs from
    this isolated set. For workloads that are more dynamic in nature, that
    means users have to provision enough CPUs for the worst case situation
    resulting in excess idle CPUs.

    To address this issue for cpuset v2, a new cpuset.cpus.partition type
    "isolated" is added which allows the creation of a cpuset partition
    without load balancing. This will allow system administrators to
    dynamically adjust the size of the isolated partition to the current need
    of the workload without rebooting the system.

    Signed-off-by: Waiman Long <longman@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)
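
A minimal userspace sketch of how the new partition type could be exercised once
this patch is applied. It assumes cgroup v2 is mounted at /sys/fs/cgroup with the
cpuset controller available; the cgroup name "isol" and the CPU list "2-3" are
illustrative only and must not overlap a sibling partition.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a string to a cgroup control file, reporting any failure. */
static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		if (fd >= 0)
			close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	char buf[64] = "";
	int fd;

	/* Let children of the root cgroup use the cpuset controller. */
	if (write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset"))
		return 1;

	/* Create the cgroup that will become the isolated partition. */
	if (mkdir("/sys/fs/cgroup/isol", 0755) && errno != EEXIST)
		return 1;

	/* Assign it CPUs, then switch the partition type to "isolated". */
	if (write_file("/sys/fs/cgroup/isol/cpuset.cpus", "2-3") ||
	    write_file("/sys/fs/cgroup/isol/cpuset.cpus.partition", "isolated"))
		return 1;

	/* Read back the resulting partition state. */
	fd = open("/sys/fs/cgroup/isol/cpuset.cpus.partition", O_RDONLY);
	if (fd >= 0 && read(fd, buf, sizeof(buf) - 1) > 0)
		printf("partition state: %s", buf);
	if (fd >= 0)
		close(fd);
	return 0;
}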

Comments

Frederic Weisbecker July 27, 2021, 11:42 a.m. UTC | #1
On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> 
> commit 994fb794cb252edd124a46ca0994e37a4726a100
> Author: Waiman Long <longman@redhat.com>
> Date:   Sat, 19 Jun 2021 13:28:19 -0400
> 
>     cgroup/cpuset: Add a new isolated cpus.partition type
> 
>     Cpuset v1 uses the sched_load_balance control file to determine if load
>     balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>     as its use may require disabling load balancing at cgroup root.
> 
>     For workloads that require very low latency like DPDK, the latency
>     jitters caused by periodic load balancing may exceed the desired
>     latency limit.
> 
>     When cpuset v2 is in use, the only way to avoid this latency cost is to
>     use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>     the kernel boot, however, there is no way to add or remove CPUs from
>     this isolated set. For workloads that are more dynamic in nature, that
>     means users have to provision enough CPUs for the worst case situation
>     resulting in excess idle CPUs.
> 
>     To address this issue for cpuset v2, a new cpuset.cpus.partition type
>     "isolated" is added which allows the creation of a cpuset partition
>     without load balancing. This will allow system administrators to
>     dynamically adjust the size of the isolated partition to the current need
>     of the workload without rebooting the system.
> 
>     Signed-off-by: Waiman Long <longman@redhat.com>
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Nice! And while we are adding a new ABI, can we take advantage of that and
add a specific semantic that if a new isolated partition matches a subset of
"isolcpus=", it automatically maps to it. This means that any further
modification to that isolated partition will also modify the associated
isolcpus= subset.

Or to summarize, when we create a new isolated partition, remove the associated
CPUs from isolcpus= ?

Thanks.
Waiman Long July 27, 2021, 3:56 p.m. UTC | #2
On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
>> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>>
>> commit 994fb794cb252edd124a46ca0994e37a4726a100
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Sat, 19 Jun 2021 13:28:19 -0400
>>
>>      cgroup/cpuset: Add a new isolated cpus.partition type
>>
>>      Cpuset v1 uses the sched_load_balance control file to determine if load
>>      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
>>      as its use may require disabling load balancing at cgroup root.
>>
>>      For workloads that require very low latency like DPDK, the latency
>>      jitters caused by periodic load balancing may exceed the desired
>>      latency limit.
>>
>>      When cpuset v2 is in use, the only way to avoid this latency cost is to
>>      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>>      the kernel boot, however, there is no way to add or remove CPUs from
>>      this isolated set. For workloads that are more dynamic in nature, that
>>      means users have to provision enough CPUs for the worst case situation
>>      resulting in excess idle CPUs.
>>
>>      To address this issue for cpuset v2, a new cpuset.cpus.partition type
>>      "isolated" is added which allows the creation of a cpuset partition
>>      without load balancing. This will allow system administrators to
>>      dynamically adjust the size of the isolated partition to the current need
>>      of the workload without rebooting the system.
>>
>>      Signed-off-by: Waiman Long <longman@redhat.com>
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Nice! And while we are adding a new ABI, can we take advantage of that and
> add a specific semantic that if a new isolated partition matches a subset of
> "isolcpus=", it automatically maps to it. This means that any further
> modification to that isolated partition will also modify the associated
> isolcpus= subset.
>
> Or to summarize, when we create a new isolated partition, remove the associated
> CPUs from isolcpus= ?

We can certainly do that as a follow-on. Another idea that I have been 
thinking about is to automatically generate an isolated partition under
root to match the given isolcpus parameter when the v2 filesystem is 
mounted. That needs more experimentation and testing to verify that it 
can work.

Cheers,
Longman
Michal Koutný July 28, 2021, 4:09 p.m. UTC | #3
Hello Waiman.

On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> [...]
> +	} else if (old_prs && new_prs) {

If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...

> +		/*
> +		 * A change in load balance state only, no change in cpumasks.
> +		 */
> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));

...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.

IOW, if there's an isolated partition that becomes invalid and later
valid again (a cpu is (re)added), it will be a normal root partition
without the requested isolation, which is IMO undesired.

I may have overlooked something in broader context but it seems to me
the invalidity should be saved independently of the root/isolated type.

Regards,
Michal
Waiman Long July 28, 2021, 4:27 p.m. UTC | #4
On 7/28/21 12:09 PM, Michal Koutný wrote:
> Hello Waiman.
>
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
>> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>> [...]
>> +	} else if (old_prs && new_prs) {
> If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...
>
>> +		/*
>> +		 * A change in load balance state only, no change in cpumasks.
>> +		 */
>> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
> ...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.
>
> IOW, if there's an isolated partition that becomes invalid and later
> valid again (a cpu is (re)added), it will be a normal root partition
> without the requested isolation, which is IMO undesired.
>
> I may have overlooked something in broader context but it seems to me
> the invalidity should be saved independently of the root/isolated type.

PRS_ERROR cannot be passed to update_prstate(). For this patchset, 
PRS_ERROR can only be set by changes in hotplug. The current design will 
maintain the set flag (CS_SCHED_LOAD_BALANCE) and use it to decide whether to
switch back to PRS_ENABLED or PRS_ISOLATED when the cpus are available
again.

Cheers,
Longman
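
The recovery decision described above is, in essence, the PRS_ERROR branch added
to update_parent_subparts_cpumask() in the patch below, paraphrased here with
extra comments:

	case PRS_ERROR:
		if (part_error)
			break;
		/* The preserved load-balance flag picks the state to restore. */
		if (is_sched_load_balance(cpuset))
			new_prs = PRS_ENABLED;	/* back to "root" */
		else
			new_prs = PRS_ISOLATED;	/* back to "isolated" */
		break;
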
Michal Koutný July 28, 2021, 5:25 p.m. UTC | #5
On Wed, Jul 28, 2021 at 12:27:58PM -0400, Waiman Long <llong@redhat.com> wrote:
> PRS_ERROR cannot be passed to update_prstate(). For this patchset, PRS_ERROR
> can only be set by changes in hotplug. The current design will maintain the
> set flag (CS_SCHED_LOAD_BALANCE) and use it to decide whether to switch back to
> PRS_ENABLED or PRS_ISOLATED when the cpus are available again.

I see it now, thanks. (I still find it a bit weird that the "isolated"
partition will be shown as "root invalid" when it's lacking cpus
(instead of "isolated invalid" and returning to "isolated") but I can
understand the approach of having just one "root invalid" for all.)

This patch can have
Reviewed-by: Michal Koutný <mkoutny@suse.com>
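
The "root invalid" state discussed above is what a read of cpuset.cpus.partition
returns while the partition lacks usable CPUs, so a management agent can watch
that file to notice the transition. A minimal sketch, with the cgroup path purely
illustrative:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char state[64];
	FILE *f = fopen("/sys/fs/cgroup/isol/cpuset.cpus.partition", "r");

	if (!f) {
		perror("cpuset.cpus.partition");
		return 1;
	}
	if (!fgets(state, sizeof(state), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/*
	 * With this series, both a broken root and a broken isolated
	 * partition report "root invalid"; once CPUs come back, a re-read
	 * shows the restored "root" or "isolated" state.
	 */
	if (strstr(state, "invalid"))
		fprintf(stderr, "partition lost its CPUs: %s", state);
	else
		printf("partition state: %s", state);
	return 0;
}
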
Frederic Weisbecker July 29, 2021, 11:03 a.m. UTC | #6
On Tue, Jul 27, 2021 at 11:56:25AM -0400, Waiman Long wrote:
> On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> > On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> > > 
> > > commit 994fb794cb252edd124a46ca0994e37a4726a100
> > > Author: Waiman Long <longman@redhat.com>
> > > Date:   Sat, 19 Jun 2021 13:28:19 -0400
> > > 
> > >      cgroup/cpuset: Add a new isolated cpus.partition type
> > > 
> > >      Cpuset v1 uses the sched_load_balance control file to determine if load
> > >      balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
> > >      as its use may require disabling load balancing at cgroup root.
> > > 
> > >      For workloads that require very low latency like DPDK, the latency
> > >      jitters caused by periodic load balancing may exceed the desired
> > >      latency limit.
> > > 
> > >      When cpuset v2 is in use, the only way to avoid this latency cost is to
> > >      use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> > >      the kernel boot, however, there is no way to add or remove CPUs from
> > >      this isolated set. For workloads that are more dynamic in nature, that
> > >      means users have to provision enough CPUs for the worst case situation
> > >      resulting in excess idle CPUs.
> > > 
> > >      To address this issue for cpuset v2, a new cpuset.cpus.partition type
> > >      "isolated" is added which allows the creation of a cpuset partition
> > >      without load balancing. This will allow system administrators to
> > >      dynamically adjust the size of the isolated partition to the current need
> > >      of the workload without rebooting the system.
> > > 
> > >      Signed-off-by: Waiman Long <longman@redhat.com>
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > Nice! And while we are adding a new ABI, can we take advantage of that and
> > add a specific semantic that if a new isolated partition matches a subset of
> > "isolcpus=", it automatically maps to it. This means that any further
> > modification to that isolated partition will also modify the associated
> > isolcpus= subset.
> > 
> > Or to summarize, when we create a new isolated partition, remove the associated
> > CPUs from isolcpus= ?
> 
> We can certainly do that as a follow-on.

I'm just concerned that this feature gets merged before we add that new
isolcpus= implicit mapping, which technically is a new ABI. Well I guess I
should hurry up and try to propose a patchset quickly once I'm back from
vacation :-)



> Another idea that I have been
> thinking about is to automatically generate an isolated partition under
> root to match the given isolcpus parameter when the v2 filesystem is
> mounted. That needs more experimentation and testing to verify that it can
> work.

I thought about that too, mounting an "isolcpus" subdirectory within the top
cpuset, but I was worried it could break userspace that wouldn't expect that new
thing to show up.

Thanks.

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c16060b703cc..60562346ecc1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -172,6 +172,8 @@  struct cpuset {
  *
  *   1 - partition root
  *
+ *   2 - partition root without load balancing (isolated)
+ *
  *  -1 - invalid partition root
  *       None of the cpus in cpus_allowed can be put into the parent's
  *       subparts_cpus. In this case, the cpuset is not a real partition
@@ -183,6 +185,7 @@  struct cpuset {
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ISOLATED		2
 #define PRS_ERROR		-1
 
 /*
@@ -1285,17 +1288,22 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		int prev_prs = cpuset->partition_root_state;
 
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ERROR and
+		 * PRS_ENABLED/PRS_ISOLATED.
 		 */
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (part_error)
 				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
-			if (!part_error)
+			if (part_error)
+				break;
+			if (is_sched_load_balance(cpuset))
 				new_prs = PRS_ENABLED;
+			else
+				new_prs = PRS_ISOLATED;
 			break;
 		}
 		/*
@@ -1434,6 +1442,7 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 				break;
 
 			case PRS_ENABLED:
+			case PRS_ISOLATED:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
 				break;
@@ -1453,7 +1462,7 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
 			/*
 			 * Put all active subparts_cpus back to effective_cpus.
 			 */
@@ -1992,6 +2001,7 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 	int err, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
+	bool sched_domain_rebuilt = false;
 
 	if (old_prs == new_prs)
 		return 0;
@@ -2026,6 +2036,22 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disabling the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		err = 0;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switch back to member is always allowed if PRS_ERROR.
@@ -2050,6 +2076,12 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 reset_flag:
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	/*
@@ -2062,7 +2094,8 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	if (!err) {
 		spin_lock_irq(&callback_lock);
@@ -2564,6 +2597,9 @@  static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ENABLED:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_DISABLED:
 		seq_puts(seq, "member\n");
 		break;
@@ -2590,6 +2626,8 @@  static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ENABLED;
 	else if (!strcmp(buf, "member"))
 		val = PRS_DISABLED;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;