Message ID | 20210720141834.10624-7-longman@redhat.com (mailing list archive) |
---|---|
State | New |
Series | cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus |
On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>
> commit 994fb794cb252edd124a46ca0994e37a4726a100
> Author: Waiman Long <longman@redhat.com>
> Date: Sat, 19 Jun 2021 13:28:19 -0400
>
> cgroup/cpuset: Add a new isolated cpus.partition type
>
> Cpuset v1 uses the sched_load_balance control file to determine if load
> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
> as its use may require disabling load balancing at cgroup root.
>
> For workloads that require very low latency like DPDK, the latency
> jitters caused by periodic load balancing may exceed the desired
> latency limit.
>
> When cpuset v2 is in use, the only way to avoid this latency cost is to
> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> the kernel boot, however, there is no way to add or remove CPUs from
> this isolated set. For workloads that are more dynamic in nature, that
> means users have to provision enough CPUs for the worst case situation
> resulting in excess idle CPUs.
>
> To address this issue for cpuset v2, a new cpuset.cpus.partition type
> "isolated" is added which allows the creation of a cpuset partition
> without load balancing. This will allow system administrators to
> dynamically adjust the size of isolated partition to the current need
> of the workload without rebooting the system.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Nice! And while we are adding a new ABI, can we take advantage of that and
add a specific semantic that if a new isolated partition matches a subset of
"isolcpus=", it automatically maps to it. This means that any further
modification to that isolated partition will also modify the associated
isolcpus= subset.

Or to summarize, when we create a new isolated partition, remove the associated
CPUs from isolcpus= ?

Thanks.
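For readers unfamiliar with the cgroup v2 cpuset interface, the following is a minimal user-space sketch (not part of the patch) of how an administrator could create such an isolated partition at runtime once this patch is applied. The cgroup2 mount point, the "dpdk" cgroup name and the CPU range "2-5" are assumptions for illustration only; the chosen CPUs must also be available in the parent's cpuset.cpus.

/*
 * Sketch: create a child cpuset and switch it to the proposed
 * "isolated" partition type at runtime.  Mount point, cgroup name
 * and CPU list are assumed values, not taken from the patch.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		if (fd >= 0)
			close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	/* Make the cpuset controller available to children of the root. */
	if (write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset"))
		return 1;
	/* Hypothetical cgroup for a latency-sensitive (e.g. DPDK) workload. */
	if (mkdir("/sys/fs/cgroup/dpdk", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}
	/* Give the partition its own exclusive CPUs first ... */
	if (write_file("/sys/fs/cgroup/dpdk/cpuset.cpus", "2-5"))
		return 1;
	/* ... then request a partition without load balancing. */
	return write_file("/sys/fs/cgroup/dpdk/cpuset.cpus.partition",
			  "isolated") ? 1 : 0;
}

Unlike "isolcpus=", the CPU range can later be grown or shrunk by rewriting cpuset.cpus, which is the dynamic adjustment the commit message is after.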
On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
>> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
>>
>> commit 994fb794cb252edd124a46ca0994e37a4726a100
>> Author: Waiman Long <longman@redhat.com>
>> Date: Sat, 19 Jun 2021 13:28:19 -0400
>>
>> cgroup/cpuset: Add a new isolated cpus.partition type
>>
>> Cpuset v1 uses the sched_load_balance control file to determine if load
>> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
>> as its use may require disabling load balancing at cgroup root.
>>
>> For workloads that require very low latency like DPDK, the latency
>> jitters caused by periodic load balancing may exceed the desired
>> latency limit.
>>
>> When cpuset v2 is in use, the only way to avoid this latency cost is to
>> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>> the kernel boot, however, there is no way to add or remove CPUs from
>> this isolated set. For workloads that are more dynamic in nature, that
>> means users have to provision enough CPUs for the worst case situation
>> resulting in excess idle CPUs.
>>
>> To address this issue for cpuset v2, a new cpuset.cpus.partition type
>> "isolated" is added which allows the creation of a cpuset partition
>> without load balancing. This will allow system administrators to
>> dynamically adjust the size of isolated partition to the current need
>> of the workload without rebooting the system.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Nice! And while we are adding a new ABI, can we take advantage of that and
> add a specific semantic that if a new isolated partition matches a subset of
> "isolcpus=", it automatically maps to it. This means that any further
> modification to that isolated partition will also modify the associated
> isolcpus= subset.
>
> Or to summarize, when we create a new isolated partition, remove the associated
> CPUs from isolcpus= ?

We can certainly do that as a follow-on. Another idea that I have been
thinking about is to automatically generate an isolated partition under
root to match the given isolcpus parameter when the v2 filesystem is
mounted. That needs more experimentation and testing to verify that it can
work.

Cheers,
Longman
Hello Waiman.

On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> [...]
> +	} else if (old_prs && new_prs) {

If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...

> +		/*
> +		 * A change in load balance state only, no change in cpumasks.
> +		 */
> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));

...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.

IOW, if there's an isolated partition that becomes invalid and later
valid again (a cpu is (re)added), it will be a normal root partition
without the requested isolation, which is IMO undesired.

I may have overlooked something in broader context but it seems to me
the invalidity should be saved independently of the root/isolated type.

Regards,
Michal
On 7/28/21 12:09 PM, Michal Koutný wrote:
> Hello Waiman.
>
> On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long <longman@redhat.com> wrote:
>> @@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>> [...]
>> +	} else if (old_prs && new_prs) {
> If an isolated root partition becomes invalid (new_prs == PRS_ERROR)...
>
>> +		/*
>> +		 * A change in load balance state only, no change in cpumasks.
>> +		 */
>> +		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
> ...this seems to erase information about CS_SCHED_LOAD_BALANCE zeroness.
>
> IOW, if there's an isolated partition that becomes invalid and later
> valid again (a cpu is (re)added), it will be a normal root partition
> without the requested isolation, which is IMO undesired.
>
> I may have overlooked something in broader context but it seems to me
> the invalidity should be saved independently of the root/isolated type.

PRS_ERROR cannot be passed to update_prstate(). For this patchset, PRS_ERROR
can only be set by changes in hotplug. The current design will maintain the
set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to switch back to
PRS_ENABLED or PRS_ISOLATED when the cpus are available again.

Cheers,
Longman
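To make the retained-flag behaviour concrete, here is a small stand-alone model (plain user-space C, not kernel code) of the recovery rule described above, mirroring the PRS_ERROR case added to update_parent_subparts_cpumask() in this patch: the saved CS_SCHED_LOAD_BALANCE setting picks which valid state the partition returns to.

#include <stdbool.h>
#include <stdio.h>

/* Partition states as defined by the patch. */
#define PRS_DISABLED	0
#define PRS_ENABLED	1
#define PRS_ISOLATED	2
#define PRS_ERROR	-1

/*
 * Recovery rule: a partition that went invalid (PRS_ERROR) keeps its
 * load-balance flag, and that flag decides which valid state it
 * returns to once CPUs become available again.
 */
static int recovered_prs(bool sched_load_balance)
{
	return sched_load_balance ? PRS_ENABLED : PRS_ISOLATED;
}

int main(void)
{
	/* An invalid "isolated" partition regains CPUs: back to isolated. */
	printf("isolated -> %d (PRS_ISOLATED)\n", recovered_prs(false));
	/* An invalid "root" partition regains CPUs: back to root. */
	printf("root     -> %d (PRS_ENABLED)\n", recovered_prs(true));
	return 0;
}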
On Wed, Jul 28, 2021 at 12:27:58PM -0400, Waiman Long <llong@redhat.com> wrote:
> PRS_ERROR cannot be passed to update_prstate(). For this patchset, PRS_ERROR
> can only be set by changes in hotplug. The current design will maintain the
> set flag (CS_SCHED_LOAD_BALANCE) and use it to decide to switch back to
> PRS_ENABLED or PRS_ISOLATED when the cpus are available again.

I see it now, thanks.

(I still find it a bit weird that the "isolated" partition will be shown as
"root invalid" when it's lacking cpus (instead of "isolated invalid" and
returning to "isolated") but I can understand the approach of having just
one "root invalid" for all.)

This patch can have
Reviewed-by: Michal Koutný <mkoutny@suse.com>
On Tue, Jul 27, 2021 at 11:56:25AM -0400, Waiman Long wrote:
> On 7/27/21 7:42 AM, Frederic Weisbecker wrote:
> > On Tue, Jul 20, 2021 at 10:18:31AM -0400, Waiman Long wrote:
> > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
> > >
> > > commit 994fb794cb252edd124a46ca0994e37a4726a100
> > > Author: Waiman Long <longman@redhat.com>
> > > Date: Sat, 19 Jun 2021 13:28:19 -0400
> > >
> > > cgroup/cpuset: Add a new isolated cpus.partition type
> > >
> > > Cpuset v1 uses the sched_load_balance control file to determine if load
> > > balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
> > > as its use may require disabling load balancing at cgroup root.
> > >
> > > For workloads that require very low latency like DPDK, the latency
> > > jitters caused by periodic load balancing may exceed the desired
> > > latency limit.
> > >
> > > When cpuset v2 is in use, the only way to avoid this latency cost is to
> > > use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> > > the kernel boot, however, there is no way to add or remove CPUs from
> > > this isolated set. For workloads that are more dynamic in nature, that
> > > means users have to provision enough CPUs for the worst case situation
> > > resulting in excess idle CPUs.
> > >
> > > To address this issue for cpuset v2, a new cpuset.cpus.partition type
> > > "isolated" is added which allows the creation of a cpuset partition
> > > without load balancing. This will allow system administrators to
> > > dynamically adjust the size of isolated partition to the current need
> > > of the workload without rebooting the system.
> > >
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > >
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > Nice! And while we are adding a new ABI, can we take advantage of that and
> > add a specific semantic that if a new isolated partition matches a subset of
> > "isolcpus=", it automatically maps to it. This means that any further
> > modification to that isolated partition will also modify the associated
> > isolcpus= subset.
> >
> > Or to summarize, when we create a new isolated partition, remove the associated
> > CPUs from isolcpus= ?
>
> We can certainly do that as a follow-on.

I'm just concerned that this feature gets merged before we add that new
isolcpus= implicit mapping, which technically is a new ABI. Well I guess
I should hurry up and try to propose a patchset quickly once I'm back
from vacation :-)

> Another idea that I have been
> thinking about is to automatically generate an isolated partition under
> root to match the given isolcpus parameter when the v2 filesystem is
> mounted. That needs more experimentation and testing to verify that it can
> work.

I thought about that too, mounting an "isolcpus" subdirectory within the
top cpuset, but I was worried it could break userspace that wouldn't
expect that new thing to show up.

Thanks.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c16060b703cc..60562346ecc1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -172,6 +172,8 @@ struct cpuset {
 	 *
 	 *   1 - partition root
 	 *
+	 *   2 - partition root without load balancing (isolated)
+	 *
 	 *  -1 - invalid partition root
 	 *       None of the cpus in cpus_allowed can be put into the parent's
 	 *       subparts_cpus. In this case, the cpuset is not a real partition
@@ -183,6 +185,7 @@ struct cpuset {
  */
 #define PRS_DISABLED		0
 #define PRS_ENABLED		1
+#define PRS_ISOLATED		2
 #define PRS_ERROR		-1
 
 /*
@@ -1285,17 +1288,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		int prev_prs = cpuset->partition_root_state;
 
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between PRS_ERROR and
+		 * PRS_ENABLED/PRS_ISOLATED.
 		 */
 		switch (cpuset->partition_root_state) {
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (part_error)
 				new_prs = PRS_ERROR;
 			break;
 		case PRS_ERROR:
-			if (!part_error)
+			if (part_error)
+				break;
+			if (is_sched_load_balance(cpuset))
 				new_prs = PRS_ENABLED;
+			else
+				new_prs = PRS_ISOLATED;
 			break;
 		}
 		/*
@@ -1434,6 +1442,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			break;
 
 		case PRS_ENABLED:
+		case PRS_ISOLATED:
 			if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 				update_tasks_cpumask(parent);
 			break;
@@ -1453,7 +1462,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		spin_lock_irq(&callback_lock);
 
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+		if (cp->nr_subparts_cpus && (new_prs <= 0)) {
 			/*
 			 * Put all active subparts_cpus back to effective_cpus.
 			 */
@@ -1992,6 +2001,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	int err, old_prs = cs->partition_root_state;
 	struct cpuset *parent = parent_cs(cs);
 	struct tmpmasks tmpmask;
+	bool sched_domain_rebuilt = false;
 
 	if (old_prs == new_prs)
 		return 0;
@@ -2026,6 +2036,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
 			goto out;
 		}
+
+		if (new_prs == PRS_ISOLATED) {
+			/*
+			 * Disable the load balance flag should not return an
+			 * error unless the system is running out of memory.
+			 */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+			sched_domain_rebuilt = true;
+		}
+	} else if (old_prs && new_prs) {
+		/*
+		 * A change in load balance state only, no change in cpumasks.
+		 */
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+		err = 0;
+		goto out;	/* Sched domain is rebuilt in update_flag() */
 	} else {
 		/*
 		 * Switch back to member is always allowed if PRS_ERROR.
@@ -2050,6 +2076,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 reset_flag:
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+		if (!is_sched_load_balance(cs)) {
+			/* Make sure load balance is on */
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+			sched_domain_rebuilt = true;
+		}
 	}
 
 	/*
@@ -2062,7 +2094,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	if (parent->child_ecpus_count)
 		update_sibling_cpumasks(parent, cs, &tmpmask);
 
-	rebuild_sched_domains_locked();
+	if (!sched_domain_rebuilt)
+		rebuild_sched_domains_locked();
 out:
 	if (!err) {
 		spin_lock_irq(&callback_lock);
@@ -2564,6 +2597,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
 	case PRS_ENABLED:
 		seq_puts(seq, "root\n");
 		break;
+	case PRS_ISOLATED:
+		seq_puts(seq, "isolated\n");
+		break;
 	case PRS_DISABLED:
 		seq_puts(seq, "member\n");
 		break;
@@ -2590,6 +2626,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 		val = PRS_ENABLED;
 	else if (!strcmp(buf, "member"))
 		val = PRS_DISABLED;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;
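As a closing illustration, a hedged sketch of the read side: after this patch, cpuset.cpus.partition reports "member", "root", "isolated", or "root invalid" when a partition has lost its CPUs. The cgroup path below is an assumption carried over from the earlier example.

/*
 * Sketch: read back cpuset.cpus.partition to confirm the requested
 * partition state.  The cgroup path is assumed for illustration.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char state[32] = "";
	FILE *f = fopen("/sys/fs/cgroup/dpdk/cpuset.cpus.partition", "r");

	if (!f) {
		perror("cpuset.cpus.partition");
		return 1;
	}
	if (!fgets(state, sizeof(state), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);
	state[strcspn(state, "\n")] = '\0';

	if (!strcmp(state, "isolated"))
		puts("partition is isolated: its CPUs are not load balanced");
	else
		printf("partition state: %s\n", state);
	return 0;
}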