block: fix q->blkg_list corruption during disk rebind

Message ID	20240407125910.4053377-1-ming.lei@redhat.com (mailing list archive)
State	New
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A525F25745 for <linux-block@vger.kernel.org>; Sun, 7 Apr 2024 12:59:25 +0000 (UTC) From: Ming Lei <ming.lei@redhat.com> To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org Cc: Ming Lei <ming.lei@redhat.com>, Yu Kuai <yukuai3@huawei.com>, Tejun Heo <tj@kernel.org> Subject: [PATCH] block: fix q->blkg_list corruption during disk rebind Date: Sun, 7 Apr 2024 20:59:10 +0800 Message-ID: <20240407125910.4053377-1-ming.lei@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	block: fix q->blkg_list corruption during disk rebind \| expand block: fix q->blkg_list corruption during disk rebind

Ming Lei April 7, 2024, 12:59 p.m. UTC

Multiple gendisk instances can allocated/added for single request queue
in case of disk rebind. blkg may still stay in q->blkg_list when calling
blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.

Fix the list corruption issue by:

- add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only
- move calling blkg_init_queue() into blk_alloc_queue()

The list corruption should be started since commit f1c006f1c685 ("blk-cgroup:
synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
which delays removing blkg from q->blkg_list into blkg_free_workfn().

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
blktests:
	https://lore.kernel.org/linux-block/20240407125717.4052964-1-ming.lei@redhat.com/

 block/blk-cgroup.c | 9 ++++++---
 block/blk-cgroup.h | 2 ++
 block/blk-core.c   | 2 ++
 3 files changed, 10 insertions(+), 3 deletions(-)

Yu Kuai April 7, 2024, 1:40 p.m. UTC | #1

Hi,

在 2024/04/07 20:59, Ming Lei 写道:
> Multiple gendisk instances can allocated/added for single request queue
> in case of disk rebind. blkg may still stay in q->blkg_list when calling
> blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.
> 
> Fix the list corruption issue by:
> 
> - add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only
> - move calling blkg_init_queue() into blk_alloc_queue()
> 
> The list corruption should be started since commit f1c006f1c685 ("blk-cgroup:
> synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> which delays removing blkg from q->blkg_list into blkg_free_workfn().

I'm not familiar with how bind/unbind works yet, however, the patch
itself looks reasonable to me, the initialization of fields related to
queue should not be delayed to disk allocation.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>

BTW, it looks like the whole blkcg_init_disk() can go away:
  - init of ioprio and blk-throttle can be delayed to the first user
configuration;
  - root_blkg allocation doesn't rely on disk at all;

Or is there any plan to move the blkcg related field or code path to
gendisk instead of queue? I might missing some previous discussions.

Thanks,
Kuai

> 
> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> blktests:
> 	https://lore.kernel.org/linux-block/20240407125717.4052964-1-ming.lei@redhat.com/
> 
>   block/blk-cgroup.c | 9 ++++++---
>   block/blk-cgroup.h | 2 ++
>   block/blk-core.c   | 2 ++
>   3 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bdbb557feb5a..059467086b13 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1409,6 +1409,12 @@ static int blkcg_css_online(struct cgroup_subsys_state *css)
>   	return 0;
>   }
>   
> +void blkg_init_queue(struct request_queue *q)
> +{
> +	INIT_LIST_HEAD(&q->blkg_list);
> +	mutex_init(&q->blkcg_mutex);
> +}
> +
>   int blkcg_init_disk(struct gendisk *disk)
>   {
>   	struct request_queue *q = disk->queue;
> @@ -1416,9 +1422,6 @@ int blkcg_init_disk(struct gendisk *disk)
>   	bool preloaded;
>   	int ret;
>   
> -	INIT_LIST_HEAD(&q->blkg_list);
> -	mutex_init(&q->blkcg_mutex);
> -
>   	new_blkg = blkg_alloc(&blkcg_root, disk, GFP_KERNEL);
>   	if (!new_blkg)
>   		return -ENOMEM;
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 78b74106bf10..90b3959d88cf 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -189,6 +189,7 @@ struct blkcg_policy {
>   extern struct blkcg blkcg_root;
>   extern bool blkcg_debug_stats;
>   
> +void blkg_init_queue(struct request_queue *q);
>   int blkcg_init_disk(struct gendisk *disk);
>   void blkcg_exit_disk(struct gendisk *disk);
>   
> @@ -482,6 +483,7 @@ struct blkcg {
>   };
>   
>   static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
> +static inline void blkg_init_queue(struct request_queue *q) { }
>   static inline int blkcg_init_disk(struct gendisk *disk) { return 0; }
>   static inline void blkcg_exit_disk(struct gendisk *disk) { }
>   static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a16b5abdbbf5..3a6f5603fb44 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -442,6 +442,8 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
>   	init_waitqueue_head(&q->mq_freeze_wq);
>   	mutex_init(&q->mq_freeze_lock);
>   
> +	blkg_init_queue(q);
> +
>   	/*
>   	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
>   	 * See blk_register_queue() for details.
>

Jens Axboe April 7, 2024, 9:50 p.m. UTC | #2

On Sun, 07 Apr 2024 20:59:10 +0800, Ming Lei wrote:
> Multiple gendisk instances can allocated/added for single request queue
> in case of disk rebind. blkg may still stay in q->blkg_list when calling
> blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.
> 
> Fix the list corruption issue by:
> 
> - add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only
> - move calling blkg_init_queue() into blk_alloc_queue()
> 
> [...]

Applied, thanks!

[1/1] block: fix q->blkg_list corruption during disk rebind
      commit: 8b8ace080319a866f5dfe9da8e665ae51d971c54

Best regards,

Ming Lei April 8, 2024, 3:05 a.m. UTC | #3

On Sun, Apr 07, 2024 at 09:40:52PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/04/07 20:59, Ming Lei 写道:
> > Multiple gendisk instances can allocated/added for single request queue
> > in case of disk rebind. blkg may still stay in q->blkg_list when calling
> > blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.
> > 
> > Fix the list corruption issue by:
> > 
> > - add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only
> > - move calling blkg_init_queue() into blk_alloc_queue()
> > 
> > The list corruption should be started since commit f1c006f1c685 ("blk-cgroup:
> > synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
> > which delays removing blkg from q->blkg_list into blkg_free_workfn().
> 
> I'm not familiar with how bind/unbind works yet, however, the patch
> itself looks reasonable to me, the initialization of fields related to
> queue should not be delayed to disk allocation.
> 
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>

Thanks!

> 
> BTW, it looks like the whole blkcg_init_disk() can go away:
>  - init of ioprio and blk-throttle can be delayed to the first user
> configuration;
>  - root_blkg allocation doesn't rely on disk at all;
> 
> Or is there any plan to move the blkcg related field or code path to
> gendisk instead of queue? I might missing some previous discussions.

Christoph worked towards this direction, but not succeed:

a06377c5d01e Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
9a9c261e6b55 Revert "blk-cgroup: pass a gendisk to blkg_lookup"
1231039db31c Revert "blk-cgroup: move the cgroup information to struct gendisk"
dcb522014351 Revert "blk-cgroup: simplify blkg freeing from initialization failure paths"



Thanks, 
Ming

Christoph Hellwig April 8, 2024, 5:58 a.m. UTC | #4

On Sun, Apr 07, 2024 at 08:59:10PM +0800, Ming Lei wrote:
> Multiple gendisk instances can allocated/added for single request queue
> in case of disk rebind. blkg may still stay in q->blkg_list when calling
> blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.
> 
> Fix the list corruption issue by:

The right fix is to move the blkgs to the gendisk as there is no use
for them on a request_queue without a gendisk.

Ming Lei April 8, 2024, 7:42 a.m. UTC | #5

On Sun, Apr 07, 2024 at 10:58:29PM -0700, Christoph Hellwig wrote:
> On Sun, Apr 07, 2024 at 08:59:10PM +0800, Ming Lei wrote:
> > Multiple gendisk instances can allocated/added for single request queue
> > in case of disk rebind. blkg may still stay in q->blkg_list when calling
> > blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.
> > 
> > Fix the list corruption issue by:
> 
> The right fix is to move the blkgs to the gendisk as there is no use
> for them on a request_queue without a gendisk.

The fix needs to be backported to stable, so it isn't good to fix in
that aggressive way, especially last time your attempt failed, please see
the revert commits in my last reply to Yu Kuai.

Thanks,
Ming

block: fix q->blkg_list corruption during disk rebind

Commit Message

Comments

Patch