
[v5,1/6] cgroup/cpuset: Properly transition to invalid partition

Message ID 20210814173848.11540-2-longman@redhat.com (mailing list archive)
State New
Series cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus

Commit Message

Waiman Long Aug. 14, 2021, 5:38 p.m. UTC
For cpuset partition, the special state of PRS_ERROR (invalid partition
root) was originally designed to handle hotplug events.  In this state,
CPUs allocated to the partition root are released back to the parent
but the cpuset exclusive flags remain unchanged.

Changing a cpuset into a partition root is strictly controlled. The
following constraints must be satisfied in order to make the transition
possible:

 - The "cpuset.cpus" is not empty and the list of CPUs is exclusive,
   i.e. they are not shared by any of its siblings.
 - The parent cgroup is a partition root.
 - The "cpuset.cpus" is a subset of the parent's "cpuset.cpus.effective".
 - There are no child cgroups with cpuset enabled.
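
The four constraints above can be sketched as a user-space predicate. This
is only an illustration, not kernel code: the struct cs_model type, its
fields and can_become_partition_root() are hypothetical names, and CPU
lists are reduced to a 64-bit mask with one bit per CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not a kernel structure: one bit per CPU. */
struct cs_model {
	uint64_t cpus;             /* models this cgroup's "cpuset.cpus" */
	uint64_t siblings_cpus;    /* union of all siblings' "cpuset.cpus" */
	uint64_t parent_effective; /* models parent's "cpuset.cpus.effective" */
	bool parent_is_root;       /* parent cgroup is a partition root */
	int nr_cpuset_children;    /* child cgroups with cpuset enabled */
};

/* Returns true when all four constraints listed above hold. */
static bool can_become_partition_root(const struct cs_model *cs)
{
	assert(cs != NULL);

	if (cs->cpus == 0)                      /* "cpuset.cpus" is empty */
		return false;
	if (cs->cpus & cs->siblings_cpus)       /* shared with a sibling */
		return false;
	if (!cs->parent_is_root)                /* parent is not a partition root */
		return false;
	if (cs->cpus & ~cs->parent_effective)   /* not a subset of parent's CPUs */
		return false;
	if (cs->nr_cpuset_children != 0)        /* has cpuset-enabled children */
		return false;
	return true;
}
```

For example, a cgroup with cpus = 0x0c, siblings using 0x03 and a parent
partition root whose effective CPUs are 0x0f passes all checks, while the
same cgroup with cpus = 0x30 fails the subset check.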

Changing a partition root back to a member is always allowed, though care
must be taken to make sure that this change won't break child cpusets,
if present.

Since a partition root sets the CPU_EXCLUSIVE flag, cpuset.cpus changes
that break the cpu exclusivity rule will not be allowed. However,
other changes to cpuset.cpus on a partition root may still cause it to
become invalid. So users must always check the partition root state of
"cpuset.cpus.partition" after making changes to cpuset.cpus to make sure
that the partition root is still valid.
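
A minimal sketch of such a check from user space follows. It assumes that
reading "cpuset.cpus.partition" yields one of the strings "member", "root"
or "root invalid", as I understand the cgroup v2 documentation for this
kernel generation. The helper names and the idea of passing the cgroup
directory as a parameter are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns true only when the state text is exactly "root" (a trailing
 * newline is ignored). "member" and "root invalid" both return false.
 */
static bool partition_state_is_valid_root(const char *state)
{
	size_t len;

	assert(state != NULL);
	len = strcspn(state, "\n");
	return len == 4 && strncmp(state, "root", 4) == 0;
}

/*
 * Reads the first line of <cgroup_dir>/cpuset.cpus.partition into buf.
 * Returns 0 on success and -1 on any failure. cgroup_dir is whatever
 * directory the caller's cgroup lives in; no path is assumed here.
 */
static int read_partition_state(const char *cgroup_dir, char *buf,
				size_t buflen)
{
	char path[4096];
	FILE *f;
	int n;

	if (buflen == 0)
		return -1;
	n = snprintf(path, sizeof(path), "%s/cpuset.cpus.partition",
		     cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)buflen, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}
```

A caller would write the new CPU list to cpuset.cpus, then call
read_partition_state() on the same cgroup directory and pass the result to
partition_state_is_valid_root().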

For a partition root tree with parent and child partition roots, there
are two cases where the child partitions can become invalid. Firstly,
changing the parent's partition state to "member" will force its child
partitions to become invalid.

Secondly, if some CPUs are taken away from the parent partition root
so that its cpuset.cpus.effective becomes empty, it will try to pull
CPUs away from the child partitions and force them to become invalid,
which may allow the parent partition to remain valid.
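
Both cases can be pictured with a small user-space model. This is a sketch
under simplifying assumptions, not the kernel logic: the enum, struct and
function names are hypothetical, CPU sets are 64-bit masks, and the case
where the parent itself ends up with no CPUs at all is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states mirroring "member", "root" and "root invalid". */
enum model_prs { MODEL_MEMBER, MODEL_ROOT, MODEL_ROOT_INVALID };

struct model_child {
	enum model_prs state;
	uint64_t cpus;	/* CPUs this child partition wants from the parent */
};

/*
 * Computes the parent's effective CPUs given the CPUs available to it.
 * If the parent is no longer a partition root (case 1), or handing CPUs
 * to its valid child partitions would leave it with none (case 2), all
 * valid child partitions are marked invalid and their CPUs go back to
 * the parent.
 */
static uint64_t model_parent_effective(enum model_prs parent_state,
				       uint64_t parent_cpus,
				       struct model_child *children, size_t n)
{
	uint64_t granted = 0;
	uint64_t effective;
	size_t i;

	assert(children != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (children[i].state == MODEL_ROOT)
			granted |= children[i].cpus & parent_cpus;

	effective = parent_cpus & ~granted;
	if (parent_state == MODEL_ROOT && effective != 0)
		return effective;	/* children keep their CPUs */

	for (i = 0; i < n; i++)
		if (children[i].state == MODEL_ROOT)
			children[i].state = MODEL_ROOT_INVALID;
	return parent_cpus;
}
```

With a parent owning CPUs 0-3 and one child partition on CPUs 0-1, the
parent keeps CPUs 2-3. If the parent shrinks to CPUs 0-1, or is switched to
"member", the child is marked invalid and the parent keeps everything it
has.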

This patch makes sure that partitions are properly changed to invalid
when some of the valid partition constraints are violated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 177 +++++++++++++++++++++++------------------
 1 file changed, 100 insertions(+), 77 deletions(-)

Comments

kernel test robot Aug. 14, 2021, 8:21 p.m. UTC | #1
Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on cgroup/for-next]
[also build test WARNING on next-20210813]
[cannot apply to kselftest/next v5.14-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/56ec7dd271c77e3cc92f0df6fd766004a7a0aa88
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333
        git checkout 56ec7dd271c77e3cc92f0df6fd766004a7a0aa88
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/cgroup/cpuset.c: In function 'update_prstate':
>> kernel/cgroup/cpuset.c:2068:1: warning: the frame size of 3072 bytes is larger than 2048 bytes [-Wframe-larger-than=]
    2068 | }
         | ^


vim +2068 kernel/cgroup/cpuset.c

^1da177e4c3f41 kernel/cpuset.c        Linus Torvalds 2005-04-16  1966  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1967  /*
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1968   * update_prstate - update partititon_root_state
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1969   * cs: the cpuset to update
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1970   * new_prs: new partition root state
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1971   *
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1972   * Call with cpuset_mutex held.
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1973   */
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1974  static int update_prstate(struct cpuset *cs, int new_prs)
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1975  {
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1976  	int err, old_prs = cs->partition_root_state;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1977  	struct cpuset *parent = parent_cs(cs);
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1978  	struct tmpmasks tmpmask;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1979  
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1980  	if (old_prs == new_prs)
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1981  		return 0;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1982  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1983  	/*
3881b86128d0be kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1984  	 * Cannot force a partial or invalid partition root to a full
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1985  	 * partition root.
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1986  	 */
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1987  	if (new_prs && (old_prs == PRS_ERROR))
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1988  		return -EINVAL;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1989  
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1990  	if (alloc_cpumasks(NULL, &tmpmask))
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1991  		return -ENOMEM;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1992  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1993  	err = -EINVAL;
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  1994  	if (!old_prs) {
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1995  		/*
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1996  		 * Turning on partition root requires setting the
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1997  		 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1998  		 * cannot be NULL.
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  1999  		 */
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2000  		if (cpumask_empty(cs->cpus_allowed))
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2001  			goto out;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2002  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2003  		err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2004  		if (err)
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2005  			goto out;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2006  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2007  		err = update_parent_subparts_cpumask(cs, partcmd_enable,
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2008  						     NULL, &tmpmask);
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2009  		if (err) {
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2010  			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2011  			goto out;
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2012  		}
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2013  	} else {
3881b86128d0be kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2014  		/*
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2015  		 * Switch back to member is always allowed even if it
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2016  		 * causes child partitions to become invalid.
3881b86128d0be kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2017  		 */
3881b86128d0be kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2018  		err = 0;
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2019  		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2020  					       &tmpmask);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2021  		/*
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2022  		 * If there are child partitions, we have to make them invalid.
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2023  		 */
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2024  		if (unlikely(cs->nr_subparts_cpus)) {
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2025  			struct tmpmasks tmp;
3881b86128d0be kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2026  
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2027  			spin_lock_irq(&callback_lock);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2028  			cs->nr_subparts_cpus = 0;
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2029  			cpumask_clear(cs->subparts_cpus);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2030  			compute_effective_cpumask(cs->effective_cpus, cs, parent);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2031  			spin_unlock_irq(&callback_lock);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2032  
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2033  			/*
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2034  			 * If alloc_cpumasks() fails, we are running out
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2035  			 * of memory and there isn't much we can do.
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2036  			 */
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2037  			if (!alloc_cpumasks(NULL, &tmp)) {
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2038  				update_cpumasks_hier(cs, &tmp);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2039  				free_cpumasks(NULL, &tmp);
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2040  			}
56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long    2021-08-14  2041  		}
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2042  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2043  		/* Turning off CS_CPU_EXCLUSIVE will not return error */
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2044  		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2045  	}
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2046  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2047  	/*
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2048  	 * Update cpumask of parent's tasks except when it is the top
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2049  	 * cpuset as some system daemons cannot be mapped to other CPUs.
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2050  	 */
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2051  	if (parent != &top_cpuset)
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2052  		update_tasks_cpumask(parent);
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2053  
4716909cc5c566 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2054  	if (parent->child_ecpus_count)
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2055  		update_sibling_cpumasks(parent, cs, &tmpmask);
4716909cc5c566 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2056  
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2057  	rebuild_sched_domains_locked();
ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long    2018-11-08  2058  out:
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2059  	if (!err) {
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2060  		spin_lock_irq(&callback_lock);
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2061  		cs->partition_root_state = new_prs;
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2062  		spin_unlock_irq(&callback_lock);
e7cc9888dc5792 kernel/cgroup/cpuset.c Waiman Long    2021-08-10  2063  		notify_partition_change(cs, old_prs, new_prs);
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2064  	}
6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2065  
0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long    2021-07-20  2066  	free_cpumasks(NULL, &tmpmask);
645fcc9d2f6946 kernel/cpuset.c        Li Zefan       2009-01-07  2067  	return err;
^1da177e4c3f41 kernel/cpuset.c        Linus Torvalds 2005-04-16 @2068  }
^1da177e4c3f41 kernel/cpuset.c        Linus Torvalds 2005-04-16  2069  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Waiman Long Aug. 14, 2021, 8:54 p.m. UTC | #2
On 8/14/21 4:21 PM, kernel test robot wrote:
> Hi Waiman,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on cgroup/for-next]
> [also build test WARNING on next-20210813]
> [cannot apply to kselftest/next v5.14-rc5]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/0day-ci/linux/commits/Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
> config: ia64-defconfig (attached as .config)
> compiler: ia64-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # https://github.com/0day-ci/linux/commit/56ec7dd271c77e3cc92f0df6fd766004a7a0aa88
>          git remote add linux-review https://github.com/0day-ci/linux
>          git fetch --no-tags linux-review Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333
>          git checkout 56ec7dd271c77e3cc92f0df6fd766004a7a0aa88
>          # save the attached .config to linux build tree
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
>     kernel/cgroup/cpuset.c: In function 'update_prstate':

Oh, it was caused by a duplicated tmpmask in update_prstate() which 
isn't really necessary. Will send out a new version to fix that.

Thanks,
Longman
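
The arithmetic behind the warning can be reproduced in user space. The
sketch below rests on two assumptions that the thread does not state: that
the cpumasks in struct tmpmasks live on the stack (CONFIG_CPUMASK_OFFSTACK
disabled) and that the ia64 defconfig uses NR_CPUS = 4096. The model_*
names are hypothetical; only the three field names (addmask, delmask,
new_cpus) come from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_NR_CPUS		4096	/* assumption, see above */
#define MODEL_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define MODEL_MASK_LONGS \
	((MODEL_NR_CPUS + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* One bit per possible CPU, stored inline as with an on-stack cpumask. */
struct model_cpumask {
	unsigned long bits[MODEL_MASK_LONGS];
};

/* Same three fields that the patch accesses through tmp-> and tmpmask. */
struct model_tmpmasks {
	struct model_cpumask addmask;
	struct model_cpumask delmask;
	struct model_cpumask new_cpus;
};

/* Stack bytes consumed by the given number of tmpmasks locals. */
static size_t stack_bytes_for_tmpmasks(size_t copies)
{
	assert(copies > 0);
	return copies * sizeof(struct model_tmpmasks);
}
```

Under these assumptions one instance takes 3 * 512 = 1536 bytes, so the two
locals in update_prstate() (tmpmask and the nested tmp) account for 3072
bytes, which would match the reported frame size. Reusing the outer
tmpmask brings the figure back under the 2048-byte limit.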

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 44d234b0df5e..7705950ad70b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1177,10 +1177,9 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		return -EINVAL;
 
 	/*
-	 * Enabling/disabling partition root is not allowed if there are
-	 * online children.
+	 * Enabling partition root is not allowed if there are online children.
 	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+	if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
 	/*
@@ -1208,6 +1207,14 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Make partition invalid if newmask isn't a subset of
+		 * (cpus_allowed | parent->effective_cpus).
+		 */
+		cpumask_or(tmp->addmask, cpuset->cpus_allowed,
+					 parent->effective_cpus);
+		part_error = !cpumask_subset(newmask, tmp->addmask);
+
+		/*
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
 		 * addmask = newmask & parent->effective_cpus
 		 *		     & ~parent->subparts_cpus
@@ -1220,20 +1227,21 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Make partition invalid if parent's effective_cpus could
+		 * become empty.
 		 */
 		if (adding &&
 		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
 			if (!deleting)
-				return -EINVAL;
+				part_error = true;
 			/*
 			 * As some of the CPUs in subparts_cpus might have
 			 * been offlined, we need to compute the real delmask
 			 * to confirm that.
 			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
+			else if (!cpumask_and(tmp->addmask, tmp->delmask,
+					      cpu_active_mask))
+				part_error = true;
 			cpumask_copy(tmp->addmask, parent->effective_cpus);
 		}
 	} else {
@@ -1242,19 +1250,23 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 		 *
 		 * addmask = cpus_allowed & parent->effective_cpus
 		 *
+		 * This gets invoked either due to a hotplug event or
+		 * from update_cpumasks_hier() where we can't return an
+		 * error. This can cause a partition root to become invalid
+		 * in the case of a hotplug.
+		 *
 		 * Note that parent's subparts_cpus may have been
 		 * pre-shrunk in case there is a change in the cpu list.
 		 * So no deletion is needed.
 		 */
 		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
 				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		part_error = (is_partition_root(cpuset) &&
+			      !parent->nr_subparts_cpus) ||
+			     cpumask_equal(tmp->addmask, parent->effective_cpus);
 	}
 
 	if (cmd == partcmd_update) {
-		int prev_prs = cpuset->partition_root_state;
-
 		/*
 		 * Check for possible transition between PRS_ENABLED
 		 * and PRS_ERROR.
@@ -1269,13 +1281,9 @@  static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 				new_prs = PRS_ENABLED;
 			break;
 		}
-		/*
-		 * Set part_error if previously in invalid state.
-		 */
-		part_error = (prev_prs == PRS_ERROR);
 	}
 
-	if (!part_error && (new_prs == PRS_ERROR))
+	if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR))
 		return 0;	/* Nothing need to be done */
 
 	if (new_prs == PRS_ERROR) {
@@ -1407,6 +1415,11 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			case PRS_ENABLED:
 				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
 					update_tasks_cpumask(parent);
+				/*
+				 * The cpuset partition_root_state may be
+				 * changed to PRS_ERROR. Capture it.
+				 */
+				new_prs = cp->partition_root_state;
 				break;
 
 			case PRS_ERROR:
@@ -1424,33 +1437,27 @@  static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 
 		spin_lock_irq(&callback_lock);
 
-		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
 			cp->nr_subparts_cpus = 0;
 			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		}
+
+		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
-			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
+			 * of a partition root are mutually exclusive.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
+			WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
 		}
 
 		if (new_prs != old_prs)
@@ -1582,8 +1589,8 @@  static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 * Make sure that subparts_cpus is a subset of cpus_allowed.
 	 */
 	if (cs->nr_subparts_cpus) {
-		cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
-			       cs->cpus_allowed);
+		cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+			    cs->cpus_allowed);
 		cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
 	}
 	spin_unlock_irq(&callback_lock);
@@ -2005,19 +2012,33 @@  static int update_prstate(struct cpuset *cs, int new_prs)
 		}
 	} else {
 		/*
-		 * Turning off partition root will clear the
-		 * CS_CPU_EXCLUSIVE bit.
+		 * Switch back to member is always allowed even if it
+		 * causes child partitions to become invalid.
 		 */
-		if (old_prs == PRS_ERROR) {
-			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
-			err = 0;
-			goto out;
-		}
+		err = 0;
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+					       &tmpmask);
+		/*
+		 * If there are child partitions, we have to make them invalid.
+		 */
+		if (unlikely(cs->nr_subparts_cpus)) {
+			struct tmpmasks tmp;
 
-		err = update_parent_subparts_cpumask(cs, partcmd_disable,
-						     NULL, &tmpmask);
-		if (err)
-			goto out;
+			spin_lock_irq(&callback_lock);
+			cs->nr_subparts_cpus = 0;
+			cpumask_clear(cs->subparts_cpus);
+			compute_effective_cpumask(cs->effective_cpus, cs, parent);
+			spin_unlock_irq(&callback_lock);
+
+			/*
+			 * If alloc_cpumasks() fails, we are running out
+			 * of memory and there isn't much we can do.
+			 */
+			if (!alloc_cpumasks(NULL, &tmp)) {
+				update_cpumasks_hier(cs, &tmp);
+				free_cpumasks(NULL, &tmp);
+			}
+		}
 
 		/* Turning off CS_CPU_EXCLUSIVE will not return error */
 		update_flag(CS_CPU_EXCLUSIVE, cs, 0);
@@ -3104,11 +3125,28 @@  static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 
 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus, we will have to force any child partitions,
+	 * if present, to become invalid by setting nr_subparts_cpus to 0
+	 * without causing itself to become invalid.
+	 */
+	if (is_partition_root(cs) && cs->nr_subparts_cpus &&
+	    cpumask_empty(&new_cpus)) {
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * If empty effective_cpus or zero nr_subparts_cpus or its parent
+	 * becomes erroneous, we have to transition it to the erroneous state.
 	 */
 	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	    (parent->partition_root_state == PRS_ERROR) ||
+	    !parent->nr_subparts_cpus)) {
+		int old_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable,
+					       NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
@@ -3117,38 +3155,23 @@  static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}
 
-		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
-				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
-			}
+		old_prs = cs->partition_root_state;
+		if (old_prs != PRS_ERROR) {
+			spin_lock_irq(&callback_lock);
+			cs->partition_root_state = PRS_ERROR;
+			spin_unlock_irq(&callback_lock);
+			notify_partition_change(cs, old_prs, PRS_ERROR);
 		}
 		cpuset_force_rebuild();
 	}
 
 	/*
 	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	   ((cs->partition_root_state == PRS_ERROR) ||
-	    !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	     update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+	else if (is_partition_root(parent) &&
+		(cs->partition_root_state == PRS_ERROR) &&
+		 update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
 		cpuset_force_rebuild();
 
 update_tasks: