From patchwork Mon Oct 18 14:36:14 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566687 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E1A4C433F5 for ; Mon, 18 Oct 2021 14:37:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7A4B261002 for ; Mon, 18 Oct 2021 14:37:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231955AbhJROkA (ORCPT ); Mon, 18 Oct 2021 10:40:00 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:23902 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231934AbhJROkA (ORCPT ); Mon, 18 Oct 2021 10:40:00 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567869; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=f5EDtU2sKbTUXtW/gai9ES9K6QHtWaXCAxynj7eu4Mk=; b=gvnUpoz6Op5XhV7v4gZiDn019HKCxW5WvqwEffF1C6+43x8eYGI6kwjvmUpOiMBYBHbLBz QxMYPhfTP4b+ScbWH7D+gpUc5MKdsGD+pzAwL5KgcAui7jysOlF+hKW9jdBCAszt42kYDn OXuo2FsKATVQGQeY4d//QMm4wNVkOv0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-339-X41Wdn2qP-GnEcpyv964gw-1; Mon, 18 Oct 2021 10:37:45 -0400 X-MC-Unique: X41Wdn2qP-GnEcpyv964gw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id C6AA9EC1A1; Mon, 18 Oct 2021 14:37:43 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id 6B4A570F58; Mon, 18 Oct 2021 14:37:07 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 1/6] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective Date: Mon, 18 Oct 2021 10:36:14 -0400 Message-Id: <20211018143619.205065-2-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Currently, a partition root cannot have empty "cpuset.cpus.effective". As a result, a parent partition root cannot distribute out all its CPUs to child partitions with no CPUs left. However in most cases, there shouldn't be any tasks associated with intermediate nodes of the default hierarchy. So the current rule is too restrictive and can waste valuable CPU resource. To address this issue, we are now allowing a partition to have empty "cpuset.cpus.effective" as long as it has no task. Therefore, a parent partition with no task can now have all its CPUs distributed out to its child partitions. The top cpuset always have some house-keeping tasks running and so its list of effective cpu can't never be empty. Once a partition with empty "cpuset.cpus.effective" is formed, no new task can be moved into it until "cpuset.cpus.effective" becomes non-empty. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 113 +++++++++++++++++++++++++++++++---------- 1 file changed, 85 insertions(+), 28 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 2a9695ccb65f..e3a6609bc919 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -386,6 +386,41 @@ static inline bool is_in_v2_mode(void) (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE); } +/** + * partition_is_populated - check if partition has tasks + * @cs: partition root to be checked + * @excluded_child: a child cpuset to be excluded in task checking + * Return: true if there are tasks, false otherwise + * + * It is assumed that @cs is a valid partition root. @excluded_child should + * be non-NULL when this cpuset is going to become a partition itself. + */ +static inline bool partition_is_populated(struct cpuset *cs, + struct cpuset *excluded_child) +{ + struct cgroup_subsys_state *css; + struct cpuset *child; + + if (cs->css.cgroup->nr_populated_csets) + return true; + if (!excluded_child && !cs->nr_subparts_cpus) + return cgroup_is_populated(cs->css.cgroup); + + rcu_read_lock(); + cpuset_for_each_child(child, css, cs) { + if (child == excluded_child) + continue; + if (is_partition_root(child)) + continue; + if (cgroup_is_populated(child->css.cgroup)) { + rcu_read_unlock(); + return true; + } + } + rcu_read_unlock(); + return false; +} + /* * Return in pmask the portion of a task's cpusets's cpus_allowed that * are online and are capable of running the task. If none are found, @@ -1198,22 +1233,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) return -EBUSY; - /* - * Enabling partition root is not allowed if not all the CPUs - * can be granted from parent's effective_cpus or at least one - * CPU will be left after that. - */ - if ((cmd == partcmd_enable) && - (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || - cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))) - return -EINVAL; - - /* - * A cpumask update cannot make parent's effective_cpus become empty. - */ adding = deleting = false; old_prs = new_prs = cpuset->partition_root_state; if (cmd == partcmd_enable) { + /* + * Enabling partition root is not allowed if not all the CPUs + * can be granted from parent's effective_cpus. + */ + if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus)) + return -EINVAL; + + /* + * A parent can be left with no CPU as long as there is no + * task directly associated with the parent partition. For + * such a parent, no new task can be moved into it. + */ + if (partition_is_populated(parent, cpuset) && + cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)) + return -EINVAL; + cpumask_copy(tmp->addmask, cpuset->cpus_allowed); adding = true; } else if (cmd == partcmd_disable) { @@ -1235,9 +1273,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, adding = cpumask_andnot(tmp->addmask, tmp->addmask, parent->subparts_cpus); /* - * Return error if the new effective_cpus could become empty. + * Return error if the new effective_cpus could become empty + * and there are tasks in the parent. */ - if (adding && + if (adding && partition_is_populated(parent, cpuset) && cpumask_equal(parent->effective_cpus, tmp->addmask)) { if (!deleting) return -EINVAL; @@ -1263,8 +1302,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, */ adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, parent->effective_cpus); - part_error = cpumask_equal(tmp->addmask, - parent->effective_cpus); + part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) && + partition_is_populated(parent, cpuset); } if (cmd == partcmd_update) { @@ -1366,9 +1405,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) /* * If it becomes empty, inherit the effective mask of the - * parent, which is guaranteed to have some CPUs. + * parent, which is guaranteed to have some CPUs unless + * it is a partition root that has explicitly distributed + * out all its CPUs. */ if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) { + if (is_partition_root(cp) && + cpumask_equal(cp->cpus_allowed, cp->subparts_cpus)) + goto update_parent_subparts; + cpumask_copy(tmp->new_cpus, parent->effective_cpus); if (!cp->use_parent_ecpus) { cp->use_parent_ecpus = true; @@ -1390,6 +1435,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) continue; } +update_parent_subparts: /* * update_parent_subparts_cpumask() should have been called * for cs already in update_cpumask(). We should also call @@ -2189,6 +2235,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) goto out_unlock; + /* + * On default hierarchy, task cannot be moved to a cpuset with empty + * effective cpus. + */ + if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus)) + goto out_unlock; + cgroup_taskset_for_each(task, css, tset) { ret = task_can_attach(task, cs->cpus_allowed); if (ret) @@ -3053,7 +3106,8 @@ hotplug_update_tasks(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated) { - if (cpumask_empty(new_cpus)) + /* A partition root is allowed to have empty effective cpus */ + if (cpumask_empty(new_cpus) && !is_partition_root(cs)) cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus); if (nodes_empty(*new_mems)) *new_mems = parent_cs(cs)->effective_mems; @@ -3122,11 +3176,12 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) /* * In the unlikely event that a partition root has empty - * effective_cpus or its parent becomes erroneous, we have to - * transition it to the erroneous state. + * effective_cpus with tasks or its parent becomes erroneous, we + * have to transition it to the erroneous state. */ - if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || - (parent->partition_root_state == PRS_ERROR))) { + if (is_partition_root(cs) && + ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) || + (parent->partition_root_state == PRS_ERROR))) { if (cs->nr_subparts_cpus) { spin_lock_irq(&callback_lock); cs->nr_subparts_cpus = 0; @@ -3136,13 +3191,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) } /* - * If the effective_cpus is empty because the child - * partitions take away all the CPUs, we can keep - * the current partition and let the child partitions - * fight for available CPUs. + * Force the partition to become invalid if either one of + * the following conditions hold: + * 1) empty effective cpus but not valid empty partition. + * 2) parent is invalid or doesn't grant any cpus to child + * partitions. */ if ((parent->partition_root_state == PRS_ERROR) || - cpumask_empty(&new_cpus)) { + (cpumask_empty(&new_cpus) && + partition_is_populated(cs, NULL))) { int old_prs; update_parent_subparts_cpumask(cs, partcmd_disable, From patchwork Mon Oct 18 14:36:15 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566689 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 747C5C433F5 for ; Mon, 18 Oct 2021 14:37:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5E37260FD9 for ; Mon, 18 Oct 2021 14:37:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232094AbhJROkE (ORCPT ); Mon, 18 Oct 2021 10:40:04 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:43514 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232085AbhJROkD (ORCPT ); Mon, 18 Oct 2021 10:40:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567871; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=l+eM4Zdlfz183Ohg51FBPD/XoJZrhXylT7f3NCJvHwU=; b=N1jl6ftMos5MgnjzYN0UKv5ygJK71adJEYn8+Eanq1sxyi8U10Ap1t6782onISsoAR24GZ g9WqFjTNhwSyJNvsEAXQ5WZzpIr8qLRNziYbSZxrPVcOibNMtsjwdPr0tdODe0GU920UPe YICFWmys8k5sgAEv18RYhsq73zq9PDg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-228-5mjYeHwPNnWdLkzimpdHmw-1; Mon, 18 Oct 2021 10:37:48 -0400 X-MC-Unique: 5mjYeHwPNnWdLkzimpdHmw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id E5B4918414A1; Mon, 18 Oct 2021 14:37:45 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id EDBEB6A914; Mon, 18 Oct 2021 14:37:43 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 2/6] cgroup/cpuset: Refining features and constraints of a partition Date: Mon, 18 Oct 2021 10:36:15 -0400 Message-Id: <20211018143619.205065-3-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org In order to plan for proper partitioning of a system, a valid partition must be using only CPUs designated in "cpuset.cpus" of the partition root. It can be a subset, but it can't use CPUs outside of its designed list. If none of the CPUs in the desginated list can be granted to the partition, its "cpuset.cpus.effective" becomes empty. This is allowed as long as there is no task in the partition. To ease implementation, there are additional contraints in enabling a partition root. 1) The "cpuset.cpus" is non-empty and exclusive. 2) The parent cgroup is a valid partition root. 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". 4) There is no child cgroups with cpuset enabled. This allows offlined cpus in parent's "cpuset.cpus" to be granted to a child partition which can have empty "cpuset.cpus.effective" when it has no task. The cpuset's subparts_cpus keeps track of CPUs (including offline CPUs) that are allocated to child partitions. It does not change during hotplug operations. Once a partition root has been enabled, changes to "cpuset.cpus" is generally allowed as long as the cpu list is exclusive, non-empty and is a superset of children's cpu lists. These existing rules are enforced by validate_change(). A partition will become invalid when one or more of the following constraints are violated: 1) The parent cgroup is a valid partition root. 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus" 3) "cpuset.cpus.effective" is non-empty when there are tasks in the partition. Disabling a partition root is always allowed even if there are child partitions underneath it. In this case, all the child partitions are also disabled (switch to "member"). So care must be taken to double check if there are child partitions underneath it before disabling a partition. This patch makes the necessary change to support the above features and constraints. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 228 +++++++++++++++++++++-------------------- 1 file changed, 117 insertions(+), 111 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e3a6609bc919..e2c01345353c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1227,29 +1227,27 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, return -EINVAL; /* - * Enabling/disabling partition root is not allowed if there are - * online children. + * Enabling partition root is not allowed if there are online children. */ - if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) + if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css)) return -EBUSY; adding = deleting = false; old_prs = new_prs = cpuset->partition_root_state; if (cmd == partcmd_enable) { /* - * Enabling partition root is not allowed if not all the CPUs - * can be granted from parent's effective_cpus. + * Enabling partition root is not allowed if cpus_allowed isn't + * a subset of parent's cpus_allowed. */ - if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus)) + if (!cpumask_subset(cpuset->cpus_allowed, parent->cpus_allowed)) return -EINVAL; /* * A parent can be left with no CPU as long as there is no - * task directly associated with the parent partition. For - * such a parent, no new task can be moved into it. + * task directly associated with the parent partition. */ if (partition_is_populated(parent, cpuset) && - cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)) + cpumask_subset(parent->effective_cpus, cpuset->cpus_allowed)) return -EINVAL; cpumask_copy(tmp->addmask, cpuset->cpus_allowed); @@ -1261,54 +1259,52 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, /* * partcmd_update with newmask: * + * Compute add/delete mask to/from subparts_cpus + * * delmask = cpus_allowed & ~newmask & parent->subparts_cpus - * addmask = newmask & parent->effective_cpus + * addmask = newmask & parent->cpus_allowed * & ~parent->subparts_cpus */ cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask); deleting = cpumask_and(tmp->delmask, tmp->delmask, parent->subparts_cpus); - cpumask_and(tmp->addmask, newmask, parent->effective_cpus); + cpumask_and(tmp->addmask, newmask, parent->cpus_allowed); adding = cpumask_andnot(tmp->addmask, tmp->addmask, parent->subparts_cpus); /* - * Return error if the new effective_cpus could become empty - * and there are tasks in the parent. + * Make partition invalid if parent's effective_cpus could + * become empty and there are tasks in the parent. */ - if (adding && partition_is_populated(parent, cpuset) && - cpumask_equal(parent->effective_cpus, tmp->addmask)) { - if (!deleting) - return -EINVAL; - /* - * As some of the CPUs in subparts_cpus might have - * been offlined, we need to compute the real delmask - * to confirm that. - */ - if (!cpumask_and(tmp->addmask, tmp->delmask, - cpu_active_mask)) - return -EINVAL; - cpumask_copy(tmp->addmask, parent->effective_cpus); - } + part_error = partition_is_populated(parent, cpuset) && + cpumask_subset(parent->effective_cpus, tmp->addmask) && + !cpumask_intersects(tmp->delmask, cpu_active_mask); } else { /* * partcmd_update w/o newmask: * * addmask = cpus_allowed & parent->effective_cpus * - * Note that parent's subparts_cpus may have been - * pre-shrunk in case there is a change in the cpu list. - * So no deletion is needed. + * This gets invoked either due to a hotplug event or + * from update_cpumasks_hier() where we can't return an + * error. This can cause a partition root to become invalid + * in the case of a hotplug. + * + * A partition error happens when: + * 1) Cpuset is valid partition, but parent does not distribute + * out any CPUs. + * 2) Parent has tasks and all its effective CPUs will have + * to be distributed out. */ adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, parent->effective_cpus); - part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) && - partition_is_populated(parent, cpuset); + part_error = (is_partition_root(cpuset) && + !parent->nr_subparts_cpus) || + (cpumask_equal(parent->effective_cpus, tmp->addmask) && + partition_is_populated(parent, cpuset)); } if (cmd == partcmd_update) { - int prev_prs = cpuset->partition_root_state; - /* * Check for possible transition between PRS_ENABLED * and PRS_ERROR. @@ -1323,13 +1319,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, new_prs = PRS_ENABLED; break; } - /* - * Set part_error if previously in invalid state. - */ - part_error = (prev_prs == PRS_ERROR); } - if (!part_error && (new_prs == PRS_ERROR)) + if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR)) return 0; /* Nothing need to be done */ if (new_prs == PRS_ERROR) { @@ -1382,6 +1374,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree * @cs: the cpuset to consider * @tmp: temp variables for calculating effective_cpus & partition setup + * @force: don't skip any descendant cpusets if set * * When configured cpumask is changed, the effective cpumasks of this cpuset * and all its descendants need to be updated. @@ -1390,7 +1383,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, * * Called with cpuset_rwsem held */ -static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) +static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, + bool force) { struct cpuset *cp; struct cgroup_subsys_state *pos_css; @@ -1400,6 +1394,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent = parent_cs(cp); + bool update_parent = false; compute_effective_cpumask(tmp->new_cpus, cp, parent); @@ -1427,9 +1422,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) /* * Skip the whole subtree if the cpumask remains the same - * and has no partition root state. + * with no partition root state and force flag not set. */ - if (!cp->partition_root_state && + if (!cp->partition_root_state && !force && cpumask_equal(tmp->new_cpus, cp->effective_cpus)) { pos_css = css_rightmost_descendant(pos_css); continue; @@ -1466,8 +1461,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) break; case PRS_ENABLED: - if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) - update_tasks_cpumask(parent); + update_parent = true; break; case PRS_ERROR: @@ -1483,40 +1477,41 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) continue; rcu_read_unlock(); + if (update_parent) { + if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) + update_tasks_cpumask(parent); + /* + * The cpuset partition_root_state may be changed + * to PRS_ERROR. Capture it. + */ + new_prs = cp->partition_root_state; + } + spin_lock_irq(&callback_lock); - cpumask_copy(cp->effective_cpus, tmp->new_cpus); - if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) { + if (cp->nr_subparts_cpus && (new_prs <= 0)) { + /* + * Put all active subparts_cpus back to effective_cpus. + */ + cpumask_or(tmp->new_cpus, tmp->new_cpus, + cp->subparts_cpus); + cpumask_and(tmp->new_cpus, tmp->new_cpus, + cpu_active_mask); cp->nr_subparts_cpus = 0; cpumask_clear(cp->subparts_cpus); - } else if (cp->nr_subparts_cpus) { + } + + cpumask_copy(cp->effective_cpus, tmp->new_cpus); + if (cp->nr_subparts_cpus) { /* * Make sure that effective_cpus & subparts_cpus * are mutually exclusive. - * - * In the unlikely event that effective_cpus - * becomes empty. we clear cp->nr_subparts_cpus and - * let its child partition roots to compete for - * CPUs again. */ cpumask_andnot(cp->effective_cpus, cp->effective_cpus, cp->subparts_cpus); - if (cpumask_empty(cp->effective_cpus)) { - cpumask_copy(cp->effective_cpus, tmp->new_cpus); - cpumask_clear(cp->subparts_cpus); - cp->nr_subparts_cpus = 0; - } else if (!cpumask_subset(cp->subparts_cpus, - tmp->new_cpus)) { - cpumask_andnot(cp->subparts_cpus, - cp->subparts_cpus, tmp->new_cpus); - cp->nr_subparts_cpus - = cpumask_weight(cp->subparts_cpus); - } } - if (new_prs != old_prs) - cp->partition_root_state = new_prs; - + cp->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); notify_partition_change(cp, old_prs, new_prs); @@ -1570,7 +1565,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs, if (!sibling->use_parent_ecpus) continue; - update_cpumasks_hier(sibling, tmp); + update_cpumasks_hier(sibling, tmp, false); } rcu_read_unlock(); } @@ -1643,13 +1638,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, * Make sure that subparts_cpus is a subset of cpus_allowed. */ if (cs->nr_subparts_cpus) { - cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus, - cs->cpus_allowed); + cpumask_and(cs->subparts_cpus, cs->subparts_cpus, + cs->cpus_allowed); cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); } spin_unlock_irq(&callback_lock); - update_cpumasks_hier(cs, &tmp); + update_cpumasks_hier(cs, &tmp, false); if (cs->partition_root_state) { struct cpuset *parent = parent_cs(cs); @@ -2071,20 +2066,23 @@ static int update_prstate(struct cpuset *cs, int new_prs) } } else { /* - * Turning off partition root will clear the - * CS_CPU_EXCLUSIVE bit. + * Switch back to member is always allowed even if it + * disables child partitions. */ - if (old_prs == PRS_ERROR) { - update_flag(CS_CPU_EXCLUSIVE, cs, 0); - err = 0; - goto out; + err = 0; + update_parent_subparts_cpumask(cs, partcmd_disable, NULL, + &tmpmask); + /* + * If there are child partitions, we have to disable them. + */ + if (unlikely(cs->nr_subparts_cpus)) { + spin_lock_irq(&callback_lock); + cs->nr_subparts_cpus = 0; + cpumask_clear(cs->subparts_cpus); + compute_effective_cpumask(cs->effective_cpus, cs, parent); + spin_unlock_irq(&callback_lock); } - err = update_parent_subparts_cpumask(cs, partcmd_disable, - NULL, &tmpmask); - if (err) - goto out; - /* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); } @@ -2105,6 +2103,11 @@ static int update_prstate(struct cpuset *cs, int new_prs) spin_lock_irq(&callback_lock); cs->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); + /* + * Update child cpusets when disabling partition. + */ + if (new_prs == PRS_DISABLED && !list_empty(&cs->css.children)) + update_cpumasks_hier(cs, &tmpmask, true); notify_partition_change(cs, old_prs, new_prs); } @@ -3176,12 +3179,32 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) /* * In the unlikely event that a partition root has empty - * effective_cpus with tasks or its parent becomes erroneous, we - * have to transition it to the erroneous state. + * effective_cpus with tasks, we will have to invalidate child + * partitions, if present, by setting nr_subparts_cpus to 0 to + * reclaim their cpus. + */ + if (is_partition_root(cs) && cpumask_empty(&new_cpus) && + cs->nr_subparts_cpus && partition_is_populated(cs, NULL)) { + spin_lock_irq(&callback_lock); + cs->nr_subparts_cpus = 0; + cpumask_clear(cs->subparts_cpus); + spin_unlock_irq(&callback_lock); + compute_effective_cpumask(&new_cpus, cs, parent); + } + + /* + * Force the partition to become invalid if either one of + * the following conditions hold: + * 1) empty effective cpus with tasks in partition + * 2) parent is invalid or doesn't grant any cpus to child partitions. */ if (is_partition_root(cs) && ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) || - (parent->partition_root_state == PRS_ERROR))) { + !parent->nr_subparts_cpus)) { + int old_prs; + + update_parent_subparts_cpumask(cs, partcmd_disable, + NULL, tmp); if (cs->nr_subparts_cpus) { spin_lock_irq(&callback_lock); cs->nr_subparts_cpus = 0; @@ -3190,40 +3213,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) compute_effective_cpumask(&new_cpus, cs, parent); } - /* - * Force the partition to become invalid if either one of - * the following conditions hold: - * 1) empty effective cpus but not valid empty partition. - * 2) parent is invalid or doesn't grant any cpus to child - * partitions. - */ - if ((parent->partition_root_state == PRS_ERROR) || - (cpumask_empty(&new_cpus) && - partition_is_populated(cs, NULL))) { - int old_prs; - - update_parent_subparts_cpumask(cs, partcmd_disable, - NULL, tmp); - old_prs = cs->partition_root_state; - if (old_prs != PRS_ERROR) { - spin_lock_irq(&callback_lock); - cs->partition_root_state = PRS_ERROR; - spin_unlock_irq(&callback_lock); - notify_partition_change(cs, old_prs, PRS_ERROR); - } + old_prs = cs->partition_root_state; + if (old_prs != PRS_ERROR) { + spin_lock_irq(&callback_lock); + cs->partition_root_state = PRS_ERROR; + spin_unlock_irq(&callback_lock); + notify_partition_change(cs, old_prs, PRS_ERROR); } cpuset_force_rebuild(); } /* * On the other hand, an erroneous partition root may be transitioned - * back to a regular one or a partition root with no CPU allocated - * from the parent may change to erroneous. + * back to a regular one. */ - if (is_partition_root(parent) && - ((cs->partition_root_state == PRS_ERROR) || - !cpumask_intersects(&new_cpus, parent->subparts_cpus)) && - update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) + else if (is_partition_root(parent) && + (cs->partition_root_state == PRS_ERROR) && + update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) cpuset_force_rebuild(); update_tasks: From patchwork Mon Oct 18 14:36:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566691 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4A39BC433EF for ; Mon, 18 Oct 2021 14:37:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3407760FC3 for ; Mon, 18 Oct 2021 14:37:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232120AbhJROkJ (ORCPT ); Mon, 18 Oct 2021 10:40:09 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:44039 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232170AbhJROkH (ORCPT ); Mon, 18 Oct 2021 10:40:07 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567875; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DgD89ffo3d0V7AdPQW/l9aZ1vU22bPCLKKNskpIYKIQ=; b=YIKGfwXERN2cfosHSwN6ArTPoUO0n2jrxMYDZrcimBO7ohKVXyQ+SJ/LbqCEkFLaSBMfmt nfM4y8z5UPqTNddSA9qdRT78XE9WeYRGFRCoC50hIm0FydKislMZSAKJ2+NGCPBSU7ZFuV sE37QEa8emGugncdwi+V5zj/jiTQztM= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-384-mSRi0j5uPRyzmj-MGHvn_g-1; Mon, 18 Oct 2021 10:37:50 -0400 X-MC-Unique: mSRi0j5uPRyzmj-MGHvn_g-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 179DD18414A0; Mon, 18 Oct 2021 14:37:48 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1742D62A44; Mon, 18 Oct 2021 14:37:46 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 3/6] cgroup/cpuset: Add a new isolated cpus.partition type Date: Mon, 18 Oct 2021 10:36:16 -0400 Message-Id: <20211018143619.205065-4-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org cgroup/cpuset: Add a new isolated cpus.partition type Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 46 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e2c01345353c..fab31921728c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -172,6 +172,8 @@ struct cpuset { * * 1 - partition root * + * 2 - partition root without load balancing (isolated) + * * -1 - invalid partition root * None of the cpus in cpus_allowed can be put into the parent's * subparts_cpus. In this case, the cpuset is not a real partition @@ -181,6 +183,7 @@ struct cpuset { */ #define PRS_DISABLED 0 #define PRS_ENABLED 1 +#define PRS_ISOLATED 2 #define PRS_ERROR -1 /* @@ -1306,17 +1309,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, if (cmd == partcmd_update) { /* - * Check for possible transition between PRS_ENABLED - * and PRS_ERROR. + * Check for possible transition between PRS_ERROR and + * PRS_ENABLED/PRS_ISOLATED. */ switch (cpuset->partition_root_state) { case PRS_ENABLED: + case PRS_ISOLATED: if (part_error) new_prs = PRS_ERROR; break; case PRS_ERROR: - if (!part_error) + if (part_error) + break; + if (is_sched_load_balance(cpuset)) new_prs = PRS_ENABLED; + else + new_prs = PRS_ISOLATED; break; } } @@ -1461,6 +1469,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, break; case PRS_ENABLED: + case PRS_ISOLATED: update_parent = true; break; @@ -2028,6 +2037,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, static int update_prstate(struct cpuset *cs, int new_prs) { int err, old_prs = cs->partition_root_state; + bool sched_domain_rebuilt = false; struct cpuset *parent = parent_cs(cs); struct tmpmasks tmpmask; @@ -2064,6 +2074,22 @@ static int update_prstate(struct cpuset *cs, int new_prs) update_flag(CS_CPU_EXCLUSIVE, cs, 0); goto out; } + + if (new_prs == PRS_ISOLATED) { + /* + * Disable the load balance flag should not return an + * error unless the system is running out of memory. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); + sched_domain_rebuilt = true; + } + } else if (old_prs && new_prs) { + /* + * A change in load balance state only, no change in cpumasks. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED)); + err = 0; + goto out; /* Sched domain is rebuilt in update_flag() */ } else { /* * Switch back to member is always allowed even if it @@ -2085,6 +2111,12 @@ static int update_prstate(struct cpuset *cs, int new_prs) /* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); + + if (!is_sched_load_balance(cs)) { + /* Make sure load balance is on */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); + sched_domain_rebuilt = true; + } } /* @@ -2097,7 +2129,8 @@ static int update_prstate(struct cpuset *cs, int new_prs) if (parent->child_ecpus_count) update_sibling_cpumasks(parent, cs, &tmpmask); - rebuild_sched_domains_locked(); + if (!sched_domain_rebuilt) + rebuild_sched_domains_locked(); out: if (!err) { spin_lock_irq(&callback_lock); @@ -2609,6 +2642,9 @@ static int sched_partition_show(struct seq_file *seq, void *v) case PRS_ENABLED: seq_puts(seq, "root\n"); break; + case PRS_ISOLATED: + seq_puts(seq, "isolated\n"); + break; case PRS_DISABLED: seq_puts(seq, "member\n"); break; @@ -2635,6 +2671,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, val = PRS_ENABLED; else if (!strcmp(buf, "member")) val = PRS_DISABLED; + else if (!strcmp(buf, "isolated")) + val = PRS_ISOLATED; else return -EINVAL; From patchwork Mon Oct 18 14:36:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566721 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 698FAC433EF for ; Mon, 18 Oct 2021 14:38:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 536216103C for ; Mon, 18 Oct 2021 14:38:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232200AbhJROlG (ORCPT ); Mon, 18 Oct 2021 10:41:06 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:59040 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231736AbhJROlF (ORCPT ); Mon, 18 Oct 2021 10:41:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567933; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=474JE8KDGexTjFdGx4bBirOGDjMm7XbgTSGx/41x5Xw=; b=fd5jgqASSKOm0svok9zioXn6SogEWcqI1gqi3rYwSpCiidsECsuifD/YHGM7+ZPuK39fhM TOEsN5EfzKZCMW9ABp5HXG/cRWaxPab20fcvHNTo/+gLFde+kx1dacUkkmMjP9IC+JmjRz ktCzEBZtvnRJAeP3dVIYM8se9Nfd2Z4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-137-DOmqaj5tPsaZCl6u_vQnag-1; Mon, 18 Oct 2021 10:38:50 -0400 X-MC-Unique: DOmqaj5tPsaZCl6u_vQnag-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 9186A80668C; Mon, 18 Oct 2021 14:38:48 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3CD126A8E5; Mon, 18 Oct 2021 14:37:48 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 4/6] cgroup/cpuset: Show invalid partition reason string Date: Mon, 18 Oct 2021 10:36:17 -0400 Message-Id: <20211018143619.205065-5-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help user to get a better understanding of the underlying reason, The cpuset.cpus.partition control file, when read, will now report the reason why a partition become invalid. When a partition does become invalid, reading the control file will show "root invalid ()" where is a string that describes why the partition is invalid. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++++--- 1 file changed, 45 insertions(+), 3 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index fab31921728c..9c5ecc8d100c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -78,6 +78,24 @@ struct fmeter { spinlock_t lock; /* guards read or write of above */ }; +/* + * Invalid partition error code + */ +enum prs_errcode { + PERR_NONE = 0, + PERR_INVCPUS, + PERR_NOCPUS, + PERR_PARENT, + PERR_HOTPLUG, +}; + +static const char * const perr_strings[] = { + [PERR_INVCPUS] = "Invalid change to cpuset.cpus", + [PERR_PARENT] = "Parent is no longer a valid partition root", + [PERR_NOCPUS] = "Parent unable to distribute cpu downstream", + [PERR_HOTPLUG] = "No cpu available due to hotplug", +}; + struct cpuset { struct cgroup_subsys_state css; @@ -163,6 +181,9 @@ struct cpuset { /* Handle for cpuset.cpus.partition */ struct cgroup_file partition_file; + + /* Invalid partition error code, not lock protected */ + enum prs_errcode prs_err; }; /* @@ -275,8 +296,13 @@ static inline int is_partition_root(const struct cpuset *cs) static inline void notify_partition_change(struct cpuset *cs, int old_prs, int new_prs) { - if (old_prs != new_prs) - cgroup_file_notify(&cs->partition_file); + if (old_prs == new_prs) + return; + cgroup_file_notify(&cs->partition_file); + + /* Reset prs_err if not invalid */ + if (new_prs != PRS_ERROR) + WRITE_ONCE(cs->prs_err, PERR_NONE); } static struct cpuset top_cpuset = { @@ -1282,6 +1308,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, part_error = partition_is_populated(parent, cpuset) && cpumask_subset(parent->effective_cpus, tmp->addmask) && !cpumask_intersects(tmp->delmask, cpu_active_mask); + + if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error) + WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS); } else { /* * partcmd_update w/o newmask: @@ -1305,6 +1334,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, !parent->nr_subparts_cpus) || (cpumask_equal(parent->effective_cpus, tmp->addmask) && partition_is_populated(parent, cpuset)); + + if (is_partition_root(cpuset) && part_error) + WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS); } if (cmd == partcmd_update) { @@ -1478,6 +1510,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, * When parent is invalid, it has to be too. */ new_prs = PRS_ERROR; + WRITE_ONCE(cp->prs_err, PERR_PARENT); break; } } @@ -2637,6 +2670,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) static int sched_partition_show(struct seq_file *seq, void *v) { struct cpuset *cs = css_cs(seq_css(seq)); + const char *err; switch (cs->partition_root_state) { case PRS_ENABLED: @@ -2649,7 +2683,11 @@ static int sched_partition_show(struct seq_file *seq, void *v) seq_puts(seq, "member\n"); break; case PRS_ERROR: - seq_puts(seq, "root invalid\n"); + err = perr_strings[READ_ONCE(cs->prs_err)]; + if (err) + seq_printf(seq, "root invalid (%s)\n", err); + else + seq_puts(seq, "root invalid\n"); break; } return 0; @@ -3256,6 +3294,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) spin_lock_irq(&callback_lock); cs->partition_root_state = PRS_ERROR; spin_unlock_irq(&callback_lock); + if (parent->partition_root_state == PRS_ERROR) + WRITE_ONCE(cs->prs_err, PERR_PARENT); + else + WRITE_ONCE(cs->prs_err, PERR_HOTPLUG); notify_partition_change(cs, old_prs, PRS_ERROR); } cpuset_force_rebuild(); From patchwork Mon Oct 18 14:36:18 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566723 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E44A5C433EF for ; Mon, 18 Oct 2021 14:39:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CA62D60FC3 for ; Mon, 18 Oct 2021 14:39:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232164AbhJROlt (ORCPT ); Mon, 18 Oct 2021 10:41:49 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:51901 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232051AbhJROls (ORCPT ); Mon, 18 Oct 2021 10:41:48 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567976; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Fgzu0/Idcc1GRywORkD/Fdq2hd78xuyZG5CB+CLyV8c=; b=dnmBjmpqgMgBTyRV4CC8PDlAR3IojC7C5DmJLfks05dwBgKOkUVdlRZTgbzeF7n8dzWT9p dI0fQhTEfZwO1kGTP8jXDMWg74xBEQoLt3p7qZmt9uXHMTHR/aCe4eSfN7X2eXwvUkxTx/ 9XBwksm3kbyw2AjpRomJlvLaOLEOpKM= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-212-g9_jaAOuPvmVx6pCScO9zQ-1; Mon, 18 Oct 2021 10:39:30 -0400 X-MC-Unique: g9_jaAOuPvmVx6pCScO9zQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 967FC101B4BB; Mon, 18 Oct 2021 14:39:28 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id B97156A8E5; Mon, 18 Oct 2021 14:38:48 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst Date: Mon, 18 Oct 2021 10:36:18 -0400 Message-Id: <20211018143619.205065-6-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced "isolated" cpuset partition type as well as other changes made in other cpuset patches. Signed-off-by: Waiman Long --- Documentation/admin-guide/cgroup-v2.rst | 153 ++++++++++++++---------- 1 file changed, 93 insertions(+), 60 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 4d8c27eca96b..40d39562a8dd 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -2091,8 +2091,9 @@ Cpuset Interface Files It accepts only the following input values when written to. ======== ================================ - "root" a partition root - "member" a non-root member of a partition + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing ======== ================================ When set to be a partition root, the current cgroup is the @@ -2101,64 +2102,96 @@ Cpuset Interface Files partition roots themselves and their descendants. The root cgroup is always a partition root. - There are constraints on where a partition root can be set. - It can only be set in a cgroup if all the following conditions - are true. - - 1) The "cpuset.cpus" is not empty and the list of CPUs are - exclusive, i.e. they are not shared by any of its siblings. - 2) The parent cgroup is a partition root. - 3) The "cpuset.cpus" is also a proper subset of the parent's - "cpuset.cpus.effective". - 4) There is no child cgroups with cpuset enabled. This is for - eliminating corner cases that have to be handled if such a - condition is allowed. - - Setting it to partition root will take the CPUs away from the - effective CPUs of the parent cgroup. Once it is set, this - file cannot be reverted back to "member" if there are any child - cgroups with cpuset enabled. - - A parent partition cannot distribute all its CPUs to its - child partitions. There must be at least one cpu left in the - parent partition. - - Once becoming a partition root, changes to "cpuset.cpus" is - generally allowed as long as the first condition above is true, - the change will not take away all the CPUs from the parent - partition and the new "cpuset.cpus" value is a superset of its - children's "cpuset.cpus" values. - - Sometimes, external factors like changes to ancestors' - "cpuset.cpus" or cpu hotplug can cause the state of the partition - root to change. On read, the "cpuset.sched.partition" file - can show the following values. - - ============== ============================== - "member" Non-root member of a partition - "root" Partition root - "root invalid" Invalid partition root - ============== ============================== - - It is a partition root if the first 2 partition root conditions - above are true and at least one CPU from "cpuset.cpus" is - granted by the parent cgroup. - - A partition root can become invalid if none of CPUs requested - in "cpuset.cpus" can be granted by the parent cgroup or the - parent cgroup is no longer a partition root itself. In this - case, it is not a real partition even though the restriction - of the first partition root condition above will still apply. - The cpu affinity of all the tasks in the cgroup will then be - associated with CPUs in the nearest ancestor partition. - - An invalid partition root can be transitioned back to a - real partition root if at least one of the requested CPUs - can now be granted by its parent. In this case, the cpu - affinity of all the tasks in the formerly invalid partition - will be associated to the CPUs of the newly formed partition. - Changing the partition state of an invalid partition root to - "member" is always allowed even if child cpusets are present. + When set to "isolated", the CPUs in that partition root will + be in an isolated state without any load balancing from the + scheduler. Tasks in such a partition must be explicitly bound + to each individual CPU. + + "cpuset.cpus" must always be set up first before enabling + partition. Unlike "member" whose "cpuset.cpus.effective" can + contain CPUs not in "cpuset.cpus", this can never happen with a + valid partition root. In other words, "cpuset.cpus.effective" + is always a subset of "cpuset.cpus" for a valid partition root. + + When a parent partition root cannot exclusively grant any of + the CPUs specified in "cpuset.cpus", "cpuset.cpus.effective" + becomes empty. If there are tasks in the partition root, the + partition root becomes invalid and "cpuset.cpus.effective" + is reset to that of the nearest non-empty ancestor. + + Note that a task cannot be moved to a cgroup with empty + "cpuset.cpus.effective". + + There are additional constraints on where a partition root can + be enabled ("root" or "isolated"). It can only be enabled in + a cgroup if all the following conditions are met. + + 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are + not shared by any of its siblings. + 2) The parent cgroup is a valid partition root. + 3) The "cpuset.cpus" is a subset of parent's "cpuset.cpus". + 4) There is no child cgroups with cpuset enabled. This avoids + cpu migrations of multiple cgroups simultaneously which can + be problematic. + + On read, the "cpuset.cpus.partition" file can show the following + values. + + ====================== ============================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + "root invalid ()" Invalid partition root + ====================== ============================== + + In the case of an invalid partition root, a descriptive string on + why the partition is invalid is included within parentheses. + + Once becoming a partition root, changes to "cpuset.cpus" + is generally allowed as long as the cpu list is exclusive, + non-empty and is a superset of children's cpu lists. + + The constraints of a valid partition root are as follows: + + 1) The parent cgroup is a valid partition root. + 2) "cpuset.cpus.effective" is a subset of "cpuset.cpus" + 3) "cpuset.cpus.effective" is non-empty when there are tasks + in the partition. + + Changes to "cpuset.cpus" or cpu hotplug may cause the state + of a valid partition root to become invalid when one or more + constraints of a valid partition root are violated. Therefore, + user space agents that manage partition roots should avoid + unnecessary changes to "cpuset.cpus" and always check the state + of "cpuset.cpus.partition" after making changes to make sure + that the partitions are functioning properly as expected. + + Changing a partition root to "member" is always allowed. + If there are child partition roots underneath it, however, + they will be forced to be switched back to "member" too and + lose their partitions. So care must be taken to double check + for this condition before disabling a partition root. + + Setting a cgroup to a valid partition root will take the CPUs + away from the effective CPUs of the parent partition. + + A valid parent partition may distribute out all its CPUs to + its child partitions as long as it is not the root cgroup as + we need some house-keeping CPUs in the root cgroup. + + An invalid partition is not a real partition even though some + internal states may still be kept. + + An invalid partition root can be reverted back to a real + partition root if none of the constraints of a valid partition + root are violated. + + Poll and inotify events are triggered whenever the state of + "cpuset.cpus.partition" changes. That includes changes caused by + write to "cpuset.cpus.partition", cpu hotplug and other changes + that make the partition invalid. This will allow user space + agents to monitor unexpected changes to "cpuset.cpus.partition" + without the need to do continuous polling. Device controller From patchwork Mon Oct 18 14:36:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 12566725 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 70BFFC433F5 for ; Mon, 18 Oct 2021 14:39:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4FDA060232 for ; Mon, 18 Oct 2021 14:39:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232221AbhJROlz (ORCPT ); Mon, 18 Oct 2021 10:41:55 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:41794 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231848AbhJROly (ORCPT ); Mon, 18 Oct 2021 10:41:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1634567983; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DQok48wS5H1Q2IJPIxVqWUgbDjVgWWD6F3sEIUDBSVI=; b=UjrhkG/O6cWz42XHYh7+r2pjNgzTsZM1/EyPofoRg9LWzoKPtX1Lqlr8cW6rRO+Z226lPI 87DB4O7NW46WIK12B5pohR/saPX87e17f3khYztjB2y74korQGChZ+sD9w3QSJvPCv/Jiw gNVkxioYPP5RMzvIppMDYU/rhwZGnHI= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-88-blTikHdeOt6GBC0zZOkWVA-1; Mon, 18 Oct 2021 10:39:39 -0400 X-MC-Unique: blTikHdeOt6GBC0zZOkWVA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id EC38E101B4A2; Mon, 18 Oct 2021 14:39:37 +0000 (UTC) Received: from llong.com (unknown [10.22.16.224]) by smtp.corp.redhat.com (Postfix) with ESMTP id B82DC75120; Mon, 18 Oct 2021 14:39:28 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?utf-8?q?Michal_Koutn=C3=BD?= , Waiman Long Subject: [PATCH v8 6/6] kselftest/cgroup: Add cpuset v2 partition root state test Date: Mon, 18 Oct 2021 10:36:19 -0400 Message-Id: <20211018143619.205065-7-longman@redhat.com> In-Reply-To: <20211018143619.205065-1-longman@redhat.com> References: <20211018143619.205065-1-longman@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Add a test script test_cpuset_prs.sh with a helper program wait_inotify for exercising the cpuset v2 partition root state code. Signed-off-by: Waiman Long --- tools/testing/selftests/cgroup/Makefile | 5 +- .../selftests/cgroup/test_cpuset_prs.sh | 664 ++++++++++++++++++ tools/testing/selftests/cgroup/wait_inotify.c | 87 +++ 3 files changed, 754 insertions(+), 2 deletions(-) create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index 59e222460581..3f1fd3f93f41 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -1,10 +1,11 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS += -Wall -pthread -all: +all: ${HELPER_PROGS} TEST_FILES := with_stress.sh -TEST_PROGS := test_stress.sh +TEST_PROGS := test_stress.sh test_cpuset_prs.sh +TEST_GEN_FILES := wait_inotify TEST_GEN_PROGS = test_memcontrol TEST_GEN_PROGS += test_kmem TEST_GEN_PROGS += test_core diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh new file mode 100755 index 000000000000..0a5d3bbad5cd --- /dev/null +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -0,0 +1,664 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test for cpuset v2 partition root state (PRS) +# +# The sched verbose flag is set, if available, so that the console log +# can be examined for the correct setting of scheduling domain. +# + +skip_test() { + echo "$1" + echo "Test SKIPPED" + exit 0 +} + +[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!" + +# Set sched verbose flag, if available +[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose + +# Get wait_inotify location +WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify + +# Find cgroup v2 mount point +CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}') +[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!" + +CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//") +[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!" + +# Set verbose flag and delay factor +PROG=$1 +VERBOSE= +DELAY_FACTOR=1 +while [[ "$1" = -* ]] +do + case "$1" in + -v) VERBOSE=1 + break + ;; + -d) DELAY_FACTOR=$2 + shift + break + ;; + *) echo "Usage: $PROG [-v] [-d " + exit + ;; + esac + shift +done + +cd $CGROUP2 +echo +cpuset > cgroup.subtree_control +[[ -d test ]] || mkdir test +cd test + +# Pause in ms +pause() +{ + DELAY=$1 + LOOP=0 + while [[ $LOOP -lt $DELAY_FACTOR ]] + do + sleep $DELAY + ((LOOP++)) + done + return 0 +} + +console_msg() +{ + MSG=$1 + echo "$MSG" + echo "" > /dev/console + echo "$MSG" > /dev/console + pause 0.01 +} + +test_partition() +{ + EXPECTED_VAL=$1 + echo $EXPECTED_VAL > cpuset.cpus.partition + [[ $? -eq 0 ]] || exit 1 + ACTUAL_VAL=$(cat cpuset.cpus.partition) + [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && { + echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL" + echo "Test FAILED" + exit 1 + } +} + +test_effective_cpus() +{ + EXPECTED_VAL=$1 + ACTUAL_VAL=$(cat cpuset.cpus.effective) + [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && { + echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'" + echo "Test FAILED" + exit 1 + } +} + +# Adding current process to cgroup.procs as a test +test_add_proc() +{ + OUTSTR="$1" + ERRMSG=$((echo $$ > cgroup.procs) |& cat) + echo $ERRMSG | grep -q "$OUTSTR" + [[ $? -ne 0 ]] && { + echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'" + echo "Test FAILED" + exit 1 + } + echo $$ > $CGROUP2/cgroup.procs # Move out the task +} + +# +# Testing the new "isolated" partition root type +# +test_isolated() +{ + echo 2-3 > cpuset.cpus + TYPE=$(cat cpuset.cpus.partition) + [[ $TYPE = member ]] || echo member > cpuset.cpus.partition + + console_msg "Change from member to root" + test_partition root + + console_msg "Change from root to isolated" + test_partition isolated + + console_msg "Change from isolated to member" + test_partition member + + console_msg "Change from member to isolated" + test_partition isolated + + console_msg "Change from isolated to root" + test_partition root + + console_msg "Change from root to member" + test_partition member + + # + # Testing partition root with no cpu + # + console_msg "Distribute all cpus to child partition" + echo +cpuset > cgroup.subtree_control + test_partition root + + mkdir A1 + cd A1 + echo 2-3 > cpuset.cpus + test_partition root + test_effective_cpus 2-3 + cd .. + test_effective_cpus "" + + console_msg "Moving task to partition test" + test_add_proc "No space left" + cd A1 + test_add_proc "" + cd .. + + console_msg "Shrink and expand child partition" + cd A1 + echo 2 > cpuset.cpus + cd .. + test_effective_cpus 3 + cd A1 + echo 2-3 > cpuset.cpus + cd .. + test_effective_cpus "" + + # Cleaning up + console_msg "Cleaning up" + echo $$ > $CGROUP2/cgroup.procs + [[ -d A1 ]] && rmdir A1 +} + +# +# Cpuset controller state transition test matrix. +# +# Cgroup test hierarchy +# +# test -- A1 -- A2 -- A3 +# \- B1 +# +# P = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid) +# C = add cpu-list +# S

= use prefix in subtree_control +# T = put a task into cgroup +# O- = Write to CPU online file of +# +SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1" +TEST_MATRIX=( + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1" + " S+ C0-1 . . C2-3 P1 . . . 0 " + " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 " + " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 " + " S+ C0-1:S+ . . C2-3 . . . P1 0 " + " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 " + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 " + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 " + " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5" + " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5" + " S+ C0-1 . . C2-3:P1 . . . C2 0 " + " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5" + " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3" + " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0" + " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2" + " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1" + " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + + # CPU offlining cases: + " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3" + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3" + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3" + " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" + " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" + " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1" + " S+ C2-3:P1:S+ C3:P1 . . . T:O3-0 . . 0 A1:2,A2:2 A1:P1,A2:P-1" + " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + # + # Incorrect change to cpuset.cpus invalidates partition root + # + # Adding CPUs to partition root that are not in parent's + # cpuset.cpus is allowed, but those extra CPUs are ignored. + " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:,A2:2-3 A1:P1,A2:P1" + + # Taking away all CPUs from parent or itself if there are tasks + # will make the partition invalid. + " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1" + " S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + + # Changing a partition root to member disables child partitions + " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P0" + " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P0" + + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + # Failure cases: + + # To become a partition root, cpuset.cpus must be a subset of + # parent's cpuset.cpus. + " S+ C0-1 . . C2-3 S+ C4-5:P1 . . 1 " + + # A cpuset cannot become a partition root if it has child cpusets + # with non-empty cpuset.cpus. + " S+ C0-1:S+ C1 . C2-3 P1 . . . 1 " + + # Any change to cpuset.cpus of a partition root must be exclusive. + " S+ C0-1:P1 . . C2-3 C0-2 . . . 1 " + " S+ C0-1 . . C2-3:P1 . . . C1 1 " + " S+ C2-3:P1:S+ C2:P1 . C1 C1-3 . . . 1 " + + # Deletion of CPUs distributed to child cgroup is not allowed. + " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 1 " + " S+ C0-3:P1:S+ C2-3:P1 . . C0-2 . . . 1 " + + # A task cannot be added to a partition with no cpu + " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1" +) + +# +# Write to the cpu online file +# $1 - - where = cpu number, value to be written +# +write_cpu_online() +{ + CPU=${1%-*} + VAL=${1#*-} + CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online + if [[ $VAL -eq 0 ]] + then + OFFLINE_CPUS="$OFFLINE_CPUS $CPU" + else + [[ -n "$OFFLINE_CPUS" ]] && { + OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\ + sort | uniq -u) + } + fi + echo $VAL > $CPUFILE + pause 0.01 +} + +# +# Set controller state +# $1 - cgroup directory +# $2 - state +# $3 - showerr +# +# The presence of ":" in state means transition from one to the next. +# +set_ctrl_state() +{ + TMPMSG=/tmp/.msg_$$ + CGRP=$1 + STATE=$2 + SHOWERR=${3}${VERBOSE} + CTRL=${CTRL:=$CONTROLLER} + HASERR=0 + REDIRECT="2> $TMPMSG" + [[ -z "$STATE" || "$STATE" = '.' ]] && return 0 + + rm -f $TMPMSG + for CMD in $(echo $STATE | sed -e "s/:/ /g") + do + TFILE=$CGRP/cgroup.procs + SFILE=$CGRP/cgroup.subtree_control + PFILE=$CGRP/cpuset.cpus.partition + CFILE=$CGRP/cpuset.cpus + S=$(expr substr $CMD 1 1) + if [[ $S = S ]] + then + PREFIX=${CMD#?} + COMM="echo ${PREFIX}${CTRL} > $SFILE" + eval $COMM $REDIRECT + elif [[ $S = C ]] + then + CPUS=${CMD#?} + COMM="echo $CPUS > $CFILE" + eval $COMM $REDIRECT + elif [[ $S = P ]] + then + VAL=${CMD#?} + case $VAL in + 0) VAL=member + ;; + 1) VAL=root + ;; + 2) VAL=isolated + ;; + *) + echo "Invalid partition state - $VAL" + exit 1 + ;; + esac + COMM="echo $VAL > $PFILE" + eval $COMM $REDIRECT + elif [[ $S = O ]] + then + VAL=${CMD#?} + write_cpu_online $VAL + elif [[ $S = T ]] + then + COMM="echo 0 > $TFILE" + eval $COMM $REDIRECT + fi + RET=$? + [[ $RET -ne 0 ]] && { + [[ -n "$SHOWERR" ]] && { + echo "$COMM" + cat $TMPMSG + } + HASERR=1 + } + pause 0.01 + rm -f $TMPMSG + done + return $HASERR +} + +set_ctrl_state_noerr() +{ + CGRP=$1 + STATE=$2 + [[ -d $CGRP ]] || mkdir $CGRP + set_ctrl_state $CGRP $STATE 1 + [[ $? -ne 0 ]] && { + echo "ERROR: Failed to set $2 to cgroup $1!" + exit 1 + } +} + +online_cpus() +{ + [[ -n "OFFLINE_CPUS" ]] && { + for C in $OFFLINE_CPUS + do + write_cpu_online ${C}-1 + done + } +} + +# +# Return 1 if the list of effective cpus isn't the same as the initial list. +# +reset_cgroup_states() +{ + echo 0 > $CGROUP2/cgroup.procs + online_cpus + rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1 + set_ctrl_state . S- + pause 0.01 +} + +dump_states() +{ + for DIR in A1 A1/A2 A1/A2/A3 B1 + do + ECPUS=$DIR/cpuset.cpus.effective + PRS=$DIR/cpuset.cpus.partition + [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)" + [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)" + done +} + +# +# Check effective cpus +# $1 - check string, format: :[,:]* +# +check_effective_cpus() +{ + CHK_STR=$1 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") + do + set -- $(echo $CHK | sed -e "s/:/ /g") + CGRP=$1 + CPUS=$2 + [[ $CGRP = A2 ]] && CGRP=A1/A2 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 + FILE=$CGRP/cpuset.cpus.effective + [[ -e $FILE ]] || return 1 + [[ $CPUS = $(cat $FILE) ]] || return 1 + done +} + +# +# Check cgroup states +# $1 - check string, format: :[,:]* +# +check_cgroup_states() +{ + CHK_STR=$1 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") + do + set -- $(echo $CHK | sed -e "s/:/ /g") + CGRP=$1 + STATE=$2 + FILE= + EVAL=$(expr substr $STATE 2 2) + [[ $CGRP = A2 ]] && CGRP=A1/A2 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 + + case $STATE in + P*) FILE=$CGRP/cpuset.cpus.partition + ;; + *) echo "Unknown state: $STATE!" + exit 1 + ;; + esac + VAL=$(cat $FILE) + + case "$VAL" in + member) VAL=0 + ;; + root) VAL=1 + ;; + isolated) + VAL=2 + ;; + "root invalid"*) + VAL=-1 + ;; + esac + [[ $EVAL != $VAL ]] && return 1 + done + return 0 +} + +# +# Run cpuset state transition test +# $1 - test matrix name +# +# This test is somewhat fragile as delays (sleep x) are added in various +# places to make sure state changes are fully propagated before the next +# action. These delays may need to be adjusted if running in a slower machine. +# +run_state_test() +{ + TEST=$1 + CONTROLLER=cpuset + CPULIST=0-6 + I=0 + eval CNT="\${#$TEST[@]}" + + reset_cgroup_states + echo $CPULIST > cpuset.cpus + echo root > cpuset.cpus.partition + console_msg "Running state transition test ..." + + while [[ $I -lt $CNT ]] + do + echo "Running test $I ..." > /dev/console + eval set -- "\${$TEST[$I]}" + ROOT=$1 + OLD_A1=$2 + OLD_A2=$3 + OLD_A3=$4 + OLD_B1=$5 + NEW_A1=$6 + NEW_A2=$7 + NEW_A3=$8 + NEW_B1=$9 + RESULT=${10} + ECPUS=${11} + STATES=${12} + + set_ctrl_state_noerr . $ROOT + set_ctrl_state_noerr A1 $OLD_A1 + set_ctrl_state_noerr A1/A2 $OLD_A2 + set_ctrl_state_noerr A1/A2/A3 $OLD_A3 + set_ctrl_state_noerr B1 $OLD_B1 + RETVAL=0 + set_ctrl_state A1 $NEW_A1; ((RETVAL += $?)) + set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?)) + set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?)) + set_ctrl_state B1 $NEW_B1; ((RETVAL += $?)) + + [[ $RETVAL -ne $RESULT ]] && { + echo "Test $TEST[$I] failed result check!" + eval echo \"\${$TEST[$I]}\" + dump_states + online_cpus + exit 1 + } + + [[ -n "$ECPUS" && "$ECPUS" != . ]] && { + check_effective_cpus $ECPUS + [[ $? -ne 0 ]] && { + echo "Test $TEST[$I] failed effective CPU check!" + eval echo \"\${$TEST[$I]}\" + echo + dump_states + online_cpus + exit 1 + } + } + + [[ -n "$STATES" ]] && { + check_cgroup_states $STATES + [[ $? -ne 0 ]] && { + echo "FAILED: Test $TEST[$I] failed states check!" + eval echo \"\${$TEST[$I]}\" + echo + dump_states + online_cpus + exit 1 + } + } + + reset_cgroup_states + # + # Check to see if effective cpu list changes + # + pause 0.05 + NEWLIST=$(cat cpuset.cpus.effective) + [[ $NEWLIST != $CPULIST ]] && { + echo "Effective cpus changed to $NEWLIST after test $I!" + exit 1 + } + [[ -n "$VERBOSE" ]] && echo "Test $I done." + ((I++)) + done + echo "All $I tests of $TEST PASSED." + + echo member > cpuset.cpus.partition +} + +# +# Wait for inotify event for the given file and read it +# $1: cgroup file to wait for +# $2: file to store the read result +# +wait_inotify() +{ + CGROUP_FILE=$1 + OUTPUT_FILE=$2 + + $WAIT_INOTIFY $CGROUP_FILE + cat $CGROUP_FILE > $OUTPUT_FILE +} + +# +# Test if inotify events are properly generated when going into and out of +# invalid partition state. +# +test_inotify() +{ + ERR=0 + PRS=/tmp/.prs_$$ + [[ -f $WAIT_INOTIFY ]] || { + echo "wait_inotify not found, inotify test SKIPPED." + return + } + + pause 0.01 + echo 1 > cpuset.cpus + echo 0 > cgroup.procs + echo root > cpuset.cpus.partition + pause 0.01 + rm -f $PRS + wait_inotify $PWD/cpuset.cpus.partition $PRS & + pause 0.01 + set_ctrl_state . "O1-0" + pause 0.01 + check_cgroup_states ".:P-1" + if [[ $? -ne 0 ]] + then + echo "FAILED: Inotify test - partition not invalid" + ERR=1 + elif [[ ! -f $PRS ]] + then + echo "FAILED: Inotify test - event not generated" + ERR=1 + kill %1 + elif [[ $(cat $PRS) != "root invalid"* ]] + then + echo "FAILED: Inotify test - incorrect state" + cat $PRS + ERR=1 + fi + online_cpus + echo member > cpuset.cpus.partition + echo 0 > ../cgroup.procs + if [[ $ERR -ne 0 ]] + then + exit 1 + else + echo "Inotify test PASSED" + fi +} + +run_state_test TEST_MATRIX +test_isolated +test_inotify +echo "All tests PASSED." +cd .. +rmdir test diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c new file mode 100644 index 000000000000..e11b431e1b62 --- /dev/null +++ b/tools/testing/selftests/cgroup/wait_inotify.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Wait until an inotify event on the given cgroup file. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static const char usage[] = "Usage: %s [-v] \n"; +static char *file; +static int verbose; + +static inline void fail_message(char *msg) +{ + fprintf(stderr, msg, file); + exit(1); +} + +int main(int argc, char *argv[]) +{ + char *cmd = argv[0]; + int c, fd; + struct pollfd fds = { .events = POLLIN, }; + + while ((c = getopt(argc, argv, "v")) != -1) { + switch (c) { + case 'v': + verbose++; + break; + } + argv++, argc--; + } + + if (argc != 2) { + fprintf(stderr, usage, cmd); + return -1; + } + file = argv[1]; + fd = open(file, O_RDONLY); + if (fd < 0) + fail_message("Cgroup file %s not found!\n"); + close(fd); + + fd = inotify_init(); + if (fd < 0) + fail_message("inotify_init() fails on %s!\n"); + if (inotify_add_watch(fd, file, IN_MODIFY) < 0) + fail_message("inotify_add_watch() fails on %s!\n"); + fds.fd = fd; + + /* + * poll waiting loop + */ + for (;;) { + int ret = poll(&fds, 1, 10000); + + if (ret < 0) { + if (errno == EINTR) + continue; + perror("poll"); + exit(1); + } + if ((ret > 0) && (fds.revents & POLLIN)) + break; + } + if (verbose) { + struct inotify_event events[10]; + long len; + + usleep(1000); + len = read(fd, events, sizeof(events)); + printf("Number of events read = %ld\n", + len/sizeof(struct inotify_event)); + } + close(fd); + return 0; +}