diff mbox

[v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

Message ID 5c99ef07-1133-a45a-9a09-7c29e8139f5b@linux.alibaba.com (mailing list archive)
State New, archived
Headers show

Commit Message

Joseph Qi March 14, 2018, 6:18 a.m. UTC
We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker               cgroup_rmdir
                                  cgroup_destroy_locked
                                    kill_css
                                      css_killed_ref_fn
                                        css_killed_work_fn
                                          offline_css
                                            blkcg_css_offline
  blkcg_bio_issue_check
    rcu_read_lock
    blkg_lookup
                                              spin_trylock(q->queue_lock)
                                              blkg_destroy
                                              spin_unlock(q->queue_lock)
    blk_throtl_bio
    spin_lock_irq(q->queue_lock)
    ...
    spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu can only prevent blkg from releasing when it is being used,
the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule
blkg release.
Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING.
And then the corresponding blkg_put() will schedule blkg release again,
which result in double free.
This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it will
lookup first and then try to lookup/create again with queue_lock. Since
revive this logic is a bit drastic, so fix it by only offlining pd during
blkcg_css_offline(), and move the rest destruction (especially
blkg_put()) into blkcg_css_free(), which should be the right way as
discussed.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: stable@vger.kernel.org #4.3+
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
 block/blk-cgroup.c         | 78 ++++++++++++++++++++++++++++++++++++----------
 include/linux/blk-cgroup.h |  1 +
 2 files changed, 63 insertions(+), 16 deletions(-)

Comments

Tejun Heo March 14, 2018, 2:09 p.m. UTC | #1
Hello,

On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
> Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
> Cc: stable@vger.kernel.org #4.3+

I'm a bit nervous about tagging it for -stable.  Given the low rate of
this actually occurring, I'm not sure the benefits outweigh the risks.
Let's at least cook it for a couple releases before sending it to
-stable.

> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 69bea82..dccd102 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>  	/* the blkg and policy id this per-policy data belongs to */
>  	struct blkcg_gq			*blkg;
>  	int				plid;
> +	bool				offlined;
>  };

This is pure bike-shedding but offlined reads kinda weird to me, maybe
just offline would read better?  Other than that,

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks a lot for seeing this through.
Joseph Qi March 15, 2018, 1:18 a.m. UTC | #2
Hello Tejun,

Thanks for your quick response.

On 18/3/14 22:09, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
>> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
>> Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
>> Cc: stable@vger.kernel.org #4.3+
> 
> I'm a bit nervous about tagging it for -stable.  Given the low rate of
> this actually occurring, I'm not sure the benefits outweigh the risks.
> Let's at least cook it for a couple releases before sending it to
> -stable.
> 
>> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
>> index 69bea82..dccd102 100644
>> --- a/include/linux/blk-cgroup.h
>> +++ b/include/linux/blk-cgroup.h
>> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>>  	/* the blkg and policy id this per-policy data belongs to */
>>  	struct blkcg_gq			*blkg;
>>  	int				plid;
>> +	bool				offlined;
>>  };
> 
> This is pure bike-shedding but offlined reads kinda weird to me, maybe
> just offline would read better?  Other than that,
> 
Do I need to resend a new version for this?

Thanks,
Joseph

>  Acked-by: Tejun Heo <tj@kernel.org>
> 
> Thanks a lot for seeing this through.
>
Tejun Heo March 19, 2018, 2:45 p.m. UTC | #3
Hello,

On Thu, Mar 15, 2018 at 09:18:55AM +0800, Joseph Qi wrote:
> > This is pure bike-shedding but offlined reads kinda weird to me, maybe
> > just offline would read better?  Other than that,
> > 
> Do I need to resend a new version for this?

No idea, Jens's call.  He can fix it up if he wants to while applying
but there's no harm in sending an updated version either.

Thanks.
Jens Axboe March 19, 2018, 2:52 p.m. UTC | #4
On 3/19/18 8:45 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 15, 2018 at 09:18:55AM +0800, Joseph Qi wrote:
>>> This is pure bike-shedding but offlined reads kinda weird to me, maybe
>>> just offline would read better?  Other than that,
>>>
>> Do I need to resend a new version for this?
> 
> No idea, Jens's call.  He can fix it up if he wants to while applying
> but there's no harm in sending an updated version either.

This got queued up last week:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.17/block&id=4c6994806f708559c2812b73501406e21ae5dcd0

for 4.17.
diff mbox

Patch

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c2033a2..92112f4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -307,11 +307,28 @@  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 	}
 }
 
+static void blkg_pd_offline(struct blkcg_gq *blkg)
+{
+	int i;
+
+	lockdep_assert_held(blkg->q->queue_lock);
+	lockdep_assert_held(&blkg->blkcg->lock);
+
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && !blkg->pd[i]->offlined &&
+		    pol->pd_offline_fn) {
+			pol->pd_offline_fn(blkg->pd[i]);
+			blkg->pd[i]->offlined = true;
+		}
+	}
+}
+
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
 	struct blkcg *blkcg = blkg->blkcg;
 	struct blkcg_gq *parent = blkg->parent;
-	int i;
 
 	lockdep_assert_held(blkg->q->queue_lock);
 	lockdep_assert_held(&blkcg->lock);
@@ -320,13 +337,6 @@  static void blkg_destroy(struct blkcg_gq *blkg)
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 
-	for (i = 0; i < BLKCG_MAX_POLS; i++) {
-		struct blkcg_policy *pol = blkcg_policy[i];
-
-		if (blkg->pd[i] && pol->pd_offline_fn)
-			pol->pd_offline_fn(blkg->pd[i]);
-	}
-
 	if (parent) {
 		blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
 		blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
@@ -369,6 +379,7 @@  static void blkg_destroy_all(struct request_queue *q)
 		struct blkcg *blkcg = blkg->blkcg;
 
 		spin_lock(&blkcg->lock);
+		blkg_pd_offline(blkg);
 		blkg_destroy(blkg);
 		spin_unlock(&blkcg->lock);
 	}
@@ -995,25 +1006,25 @@  static int blkcg_print_stat(struct seq_file *sf, void *v)
  * @css: css of interest
  *
  * This function is called when @css is about to go away and responsible
- * for shooting down all blkgs associated with @css.  blkgs should be
- * removed while holding both q and blkcg locks.  As blkcg lock is nested
- * inside q lock, this function performs reverse double lock dancing.
+ * for offlining all blkgs pd and killing all wbs associated with @css.
+ * blkgs pd offline should be done while holding both q and blkcg locks.
+ * As blkcg lock is nested inside q lock, this function performs reverse
+ * double lock dancing.
  *
  * This is the blkcg counterpart of ioc_release_fn().
  */
 static void blkcg_css_offline(struct cgroup_subsys_state *css)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
+	struct blkcg_gq *blkg;
 
 	spin_lock_irq(&blkcg->lock);
 
-	while (!hlist_empty(&blkcg->blkg_list)) {
-		struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
-						struct blkcg_gq, blkcg_node);
+	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
 		struct request_queue *q = blkg->q;
 
 		if (spin_trylock(q->queue_lock)) {
-			blkg_destroy(blkg);
+			blkg_pd_offline(blkg);
 			spin_unlock(q->queue_lock);
 		} else {
 			spin_unlock_irq(&blkcg->lock);
@@ -1027,11 +1038,43 @@  static void blkcg_css_offline(struct cgroup_subsys_state *css)
 	wb_blkcg_offline(blkcg);
 }
 
+/**
+ * blkcg_destroy_all_blkgs - destroy all blkgs associated with a blkcg
+ * @blkcg: blkcg of interest
+ *
+ * This function is called when blkcg css is about to free and responsible for
+ * destroying all blkgs associated with @blkcg.
+ * blkgs should be removed while holding both q and blkcg locks. As blkcg lock
+ * is nested inside q lock, this function performs reverse double lock dancing.
+ */
+static void blkcg_destroy_all_blkgs(struct blkcg *blkcg)
+{
+	spin_lock_irq(&blkcg->lock);
+	while (!hlist_empty(&blkcg->blkg_list)) {
+		struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
+						    struct blkcg_gq,
+						    blkcg_node);
+		struct request_queue *q = blkg->q;
+
+		if (spin_trylock(q->queue_lock)) {
+			blkg_destroy(blkg);
+			spin_unlock(q->queue_lock);
+		} else {
+			spin_unlock_irq(&blkcg->lock);
+			cpu_relax();
+			spin_lock_irq(&blkcg->lock);
+		}
+	}
+	spin_unlock_irq(&blkcg->lock);
+}
+
 static void blkcg_css_free(struct cgroup_subsys_state *css)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
 	int i;
 
+	blkcg_destroy_all_blkgs(blkcg);
+
 	mutex_lock(&blkcg_pol_mutex);
 
 	list_del(&blkcg->all_blkcgs_node);
@@ -1371,8 +1414,11 @@  void blkcg_deactivate_policy(struct request_queue *q,
 		spin_lock(&blkg->blkcg->lock);
 
 		if (blkg->pd[pol->plid]) {
-			if (pol->pd_offline_fn)
+			if (!blkg->pd[pol->plid]->offlined &&
+			    pol->pd_offline_fn) {
 				pol->pd_offline_fn(blkg->pd[pol->plid]);
+				blkg->pd[pol->plid]->offlined = true;
+			}
 			pol->pd_free_fn(blkg->pd[pol->plid]);
 			blkg->pd[pol->plid] = NULL;
 		}
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 69bea82..dccd102 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -88,6 +88,7 @@  struct blkg_policy_data {
 	/* the blkg and policy id this per-policy data belongs to */
 	struct blkcg_gq			*blkg;
 	int				plid;
+	bool				offlined;
 };
 
 /*