
[1/2] cgroup: make sure a parent css isn't offlined before its children

Message ID 20160121212416.GL6357@twins.programming.kicks-ass.net (mailing list archive)
State New, archived

Commit Message

Peter Zijlstra Jan. 21, 2016, 9:24 p.m. UTC
On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
> There are three subsystem callbacks in css shutdown path -
> css_offline(), css_released() and css_free().  Except for
> css_released(), cgroup core didn't use to guarantee the order of
> invocation.  css_offline() or css_free() could be called on a parent
> css before its children.  This behavior is unexpected and led to
> use-after-free in cpu controller.
> 
> This patch updates offline path so that a parent css is never offlined
> before its children.  Each css keeps online_cnt which reaches zero iff
> itself and all its children are offline and offline_css() is invoked
> only after online_cnt reaches zero.
> 
> This fixes the reported cpu controller malfunction.  The next patch
> will update css_free() handling.

No, I need to fix the cpu controller too, because the offending code
sits off of css_free() (the next patch), but also does a call_rcu() in
between, which also doesn't guarantee order.

So your patch and the one below would both be required to fix this, I think.

And then I should look at removing the call_rcu() from the css_free() path at a later date; I think it's superfluous, but I need to double-check that.
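
As an aside, the online_cnt scheme described above amounts to a
hierarchical refcount: a css counts itself plus each online child, and
offline_css() only runs once that count drains to zero. Below is a
minimal userspace sketch of that rule, illustrative only and not the
cgroup core code; only the names css, online_cnt and offline_css are
taken from the commit message.

#include <stdio.h>

struct css {
	struct css *parent;
	int online_cnt;		/* self + number of online children */
	const char *name;
};

static void css_online(struct css *c)
{
	c->online_cnt = 1;			/* count ourselves */
	if (c->parent)
		c->parent->online_cnt++;	/* pin the parent online */
}

static void offline_css(struct css *c)
{
	printf("offline %s\n", c->name);
}

static void css_put_online(struct css *c)
{
	if (--c->online_cnt)
		return;				/* children still online */
	offline_css(c);
	if (c->parent)
		css_put_online(c->parent);	/* release our pin on the parent */
}

int main(void)
{
	struct css root  = { .name = "root" };
	struct css child = { .parent = &root, .name = "child" };

	css_online(&root);
	css_online(&child);

	css_put_online(&root);	/* 2 -> 1: root stays online */
	css_put_online(&child);	/* offlines child, then root */
	return 0;
}

Built and run, this prints "offline child" before "offline root" no
matter which group is killed first, which is the ordering the cpu
controller needs.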


---
Subject: sched: Fix cgroup entity load tracking tear-down

When a cgroup's cpu runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so is unsuited because it is far too late and
unordered against other cgroup removal (css_free() will be ordered, but
we are also in an RCU callback).

Put it in the css_offline callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The css_offline callbacks are called in hierarchical order.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/fair.c  | 35 ++++++++++++++++++++---------------
 kernel/sched/sched.h |  2 +-
 3 files changed, 22 insertions(+), 19 deletions(-)
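
For context, the cgroup callbacks this patch works around look roughly
like the following in kernels of this vintage (a paraphrase of
kernel/sched/core.c, not verbatim source; the exact bodies may differ):

static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	sched_offline_group(tg);	/* with this patch: also drops fair-group load */
}

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct task_group *tg = css_tg(css);

	sched_destroy_group(tg);	/* frees the group via call_rcu() */
}

Moving the load removal out of the RCU-deferred free path and into the
css_offline() leg means it now runs in the child-before-parent order
that patch 1/2 guarantees.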


Comments

Tejun Heo Jan. 21, 2016, 9:28 p.m. UTC | #1
On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
> > There are three subsystem callbacks in css shutdown path -
> > css_offline(), css_released() and css_free().  Except for
> > css_released(), cgroup core didn't use to guarantee the order of
> > invocation.  css_offline() or css_free() could be called on a parent
> > css before its children.  This behavior is unexpected and led to
> > use-after-free in cpu controller.
> > 
> > This patch updates offline path so that a parent css is never offlined
> > before its children.  Each css keeps online_cnt which reaches zero iff
> > itself and all its children are offline and offline_css() is invoked
> > only after online_cnt reaches zero.
> > 
> > This fixes the reported cpu controller malfunction.  The next patch
> > will update css_free() handling.
> 
> No, I need to fix the cpu controller too, because the offending code
> sits off of css_free() (the next patch), but also does a call_rcu() in
> between, which also doesn't guarantee order.

Ah, I see.  Christian, can you please apply all three patches and see
whether the problem gets fixed?  Once verified, I'll update the patch
description and repost.

Thanks.
Christian Borntraeger Jan. 22, 2016, 8:18 a.m. UTC | #2
On 01/21/2016 10:28 PM, Tejun Heo wrote:
> On Thu, Jan 21, 2016 at 10:24:16PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 21, 2016 at 03:31:11PM -0500, Tejun Heo wrote:
>>> There are three subsystem callbacks in css shutdown path -
>>> css_offline(), css_released() and css_free().  Except for
>>> css_released(), cgroup core didn't use to guarantee the order of
>>> invocation.  css_offline() or css_free() could be called on a parent
>>> css before its children.  This behavior is unexpected and led to
>>> use-after-free in cpu controller.
>>>
>>> This patch updates offline path so that a parent css is never offlined
>>> before its children.  Each css keeps online_cnt which reaches zero iff
>>> itself and all its children are offline and offline_css() is invoked
>>> only after online_cnt reaches zero.
>>>
>>> This fixes the reported cpu controller malfunction.  The next patch
>>> will update css_free() handling.
>>
>> No, I need to fix the cpu controller too, because the offending code
>> sits off of css_free() (the next patch), but also does a call_rcu() in
>> between, which also doesn't guarantee order.
> 
> Ah, I see.  Christian, can you please apply all three patches and see
> whether the problem gets fixed?  Once verified, I'll update the patch
> description and repost.

With these 3 patches I always run into the dio/scsi problem, but never
into the css issue. So I cannot test for a full day or so, but it looks
like the problem is gone. At least it worked multiple times for 30
minutes or so until my system was killed by the I/O issue.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>


Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8bd352dc63f..d589a140fe0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7865,11 +7865,9 @@ void sched_destroy_group(struct task_group *tg)
 void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;
-	int i;
 
 	/* end participation in shares distribution */
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
+	unregister_fair_sched_group(tg);
 
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f60da0f0fd7..aff660b70bf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8244,11 +8244,8 @@ void free_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
-		if (tg->se) {
-			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+		if (tg->se)
 			kfree(tg->se[i]);
-		}
 	}
 
 	kfree(tg->cfs_rq);
@@ -8296,21 +8293,29 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 	return 0;
 }
 
-void unregister_fair_sched_group(struct task_group *tg, int cpu)
+void unregister_fair_sched_group(struct task_group *tg)
 {
-	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
+	struct rq *rq;
+	int cpu;
 
-	/*
-	* Only empty task groups can be destroyed; so we can speculatively
-	* check on_list without danger of it being re-added.
-	*/
-	if (!tg->cfs_rq[cpu]->on_list)
-		return;
+	for_each_possible_cpu(cpu) {
+		if (tg->se[cpu])
+			remove_entity_load_avg(tg->se[cpu]);
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+		/*
+		 * Only empty task groups can be destroyed; so we can speculatively
+		 * check on_list without danger of it being re-added.
+		 */
+		if (!tg->cfs_rq[cpu]->on_list)
+			continue;
+
+		rq = cpu_rq(cpu);
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 837bcd383cda..492478bb717c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,7 +313,7 @@ extern int tg_nop(struct task_group *tg, void *data);
 
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
-extern void unregister_fair_sched_group(struct task_group *tg, int cpu);
+extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);