Message ID | 20221130132156.2836184-9-linan122@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iocost bugfix | expand |
On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote: > Remove block device when iocost is initializing may cause > null-pointer dereference: > > CPU1 CPU2 > ioc_qos_write > blkcg_conf_open_bdev > blkdev_get_no_open > kobject_get_unless_zero > blk_iocost_init > rq_qos_add > del_gendisk > rq_qos_exit > q->rq_qos = rqos->next > //iocost is removed from q->roqs > blkcg_activate_policy > pd_init_fn > ioc_pd_init > ioc = q_to_ioc(blkg->q) > //cant find iocost and return null > > Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get > bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be > actived until iocost initialization is complited. I think it'd be better to make del_gendisk wait for these in-flight operations because this might fix the above particular issue but now all the policies are exposed to request_queue in a state it never expected before. del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can piggyback on that and update blkcg_conf_open_bdev() to provide such exclusion? Thanks.
Hi, 在 2022/12/01 4:50, Tejun Heo 写道: > On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote: >> Remove block device when iocost is initializing may cause >> null-pointer dereference: >> >> CPU1 CPU2 >> ioc_qos_write >> blkcg_conf_open_bdev >> blkdev_get_no_open >> kobject_get_unless_zero >> blk_iocost_init >> rq_qos_add >> del_gendisk >> rq_qos_exit >> q->rq_qos = rqos->next >> //iocost is removed from q->roqs >> blkcg_activate_policy >> pd_init_fn >> ioc_pd_init >> ioc = q_to_ioc(blkg->q) >> //cant find iocost and return null >> >> Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get >> bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be >> actived until iocost initialization is complited. > > I think it'd be better to make del_gendisk wait for these in-flight > operations because this might fix the above particular issue but now all the > policies are exposed to request_queue in a state it never expected before. > > del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can > piggyback on that and update blkcg_conf_open_bdev() to provide such > exclusion? Let del_gendisk waiting for that sounds good, but I'm litter confused how to do that. Following are what I think about: 1) By mentioning that "del_gendisk() is quiescing the queue", do you suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init requires memory allocation. 2) Hold gendisk open mutex 3) Use q_usage_counter, and in the meantime, rq_qos_add() and blkcg_activate_policy() will need refactoring to factor out freeze queue. 4) Define a new metux Thanks, Kuai > > Thanks. >
On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote: > 1) By mentioning that "del_gendisk() is quiescing the queue", do you > suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init > requires memory allocation. Quiescing uses SRCU so that should be fine but I'm not sure whether this is the right one to piggyback on. Jens should have a better idea. Thanks.
Hi, 在 2022/12/01 18:11, Tejun Heo 写道: > On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote: >> 1) By mentioning that "del_gendisk() is quiescing the queue", do you >> suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init >> requires memory allocation. > > Quiescing uses SRCU so that should be fine but I'm not sure whether this is > the right one to piggyback on. Jens should have a better idea. > > Thanks. > Currently SRCU is used if BLK_MQ_F_BLOCKING set, otherwise RCU is used. dispatch: #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ do { \ if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) { \ int srcu_idx; \ \ might_sleep_if(check_sleep); \ srcu_idx = srcu_read_lock((q)->tag_set->srcu); \ (dispatch_ops); \ srcu_read_unlock((q)->tag_set->srcu, srcu_idx); \ } else { \ rcu_read_lock(); \ (dispatch_ops); \ rcu_read_unlock(); \ } \ } while (0) quiesce: void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set) { if (set->flags & BLK_MQ_F_BLOCKING) synchronize_srcu(set->srcu); else synchronize_rcu(); } Thanks, Kuai
On Thu, Dec 01, 2022 at 06:23:16PM +0800, Yu Kuai wrote: > Hi, > > 在 2022/12/01 18:11, Tejun Heo 写道: > > On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote: > > > 1) By mentioning that "del_gendisk() is quiescing the queue", do you > > > suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init > > > requires memory allocation. > > > > Quiescing uses SRCU so that should be fine but I'm not sure whether this is > > the right one to piggyback on. Jens should have a better idea. > > > > Thanks. > > > > Currently SRCU is used if BLK_MQ_F_BLOCKING set, otherwise RCU is used. > > dispatch: > #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ > do { \ > if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) { \ > int srcu_idx; \ > \ > might_sleep_if(check_sleep); \ > srcu_idx = srcu_read_lock((q)->tag_set->srcu); \ > (dispatch_ops); \ > srcu_read_unlock((q)->tag_set->srcu, srcu_idx); \ > } else { \ > rcu_read_lock(); \ > (dispatch_ops); \ > rcu_read_unlock(); \ > } \ > } while (0) > > quiesce: > void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set) > { > if (set->flags & BLK_MQ_F_BLOCKING) > synchronize_srcu(set->srcu); > else > synchronize_rcu(); > } Oh I see. Unfortunately, I don't know what to do off the top of my head. Thanks.
Hi, Tejun! While reviewing rq_qos code, I found that there are some other possible defects: 1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), whlie it's not held to protect rq_qos_exit(), which is absolutely not safe because they can be called concurrently by configuring iocost and removing device. I'm thinking about holding the lock to fetch the list and reset q->rq_qos first: diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c index 88f0fe7dcf54..271ad65eebd9 100644 --- a/block/blk-rq-qos.c +++ b/block/blk-rq-qos.c @@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data, void rq_qos_exit(struct request_queue *q) { - while (q->rq_qos) { - struct rq_qos *rqos = q->rq_qos; - q->rq_qos = rqos->next; + struct rq_qos *rqos; + + spin_lock_irq(&q->queue_lock); + rqos = q->rq_qos; + q->rq_qos = NULL; + spin_unlock_irq(&q->queue_lock); + + while (rqos) { rqos->ops->exit(rqos); + rqos = rqos->next; } } 2) rq_qos_add() can still succeed after rq_qos_exit() is done, which will cause memory leak. Hence a checking is required beforing adding to q->rq_qos. I'm thinking about flag QUEUE_FLAG_DYING first, but the flag will not set if disk state is not marked GD_OWNS_QUEUE. Since blk_unregister_queue() is called before rq_qos_exit(), use the queue flag QUEUE_FLAG_REGISTERED should be OK. For the current problem that device can be removed while initializing , I'm thinking about some possible solutions: Since bfq is initialized in elevator initialization, and others are in queue initialization, such problem is only possible in iocost, hence it make sense to fix it in iocost: 1) use open mutex to prevet concurrency, however, this will cause that configuring iocost will block some other operations that is relied on open_mutex. @@ -2889,7 +2889,15 @@ static int blk_iocost_init(struct gendisk *disk) if (ret) goto err_free_ioc; + mutex_lock(&disk->open_mutex); + if (!disk_live(disk)) { + mutex_unlock(&disk->open_mutex); + ret = -ENODEV; + goto err_del_qos; + } + ret = blkcg_activate_policy(q, &blkcg_policy_iocost); + mutex_unlock(&disk->open_mutex); if (ret) goto err_del_qos; 2) like 1), the difference is that define a new mutex just in iocst. 3) Or is it better to fix it in the higher level? For example: add a new restriction that blkcg_deactivate_policy() should be called with blkcg_activate_policy() in pairs, and blkcg_deactivate_policy() will wait for blkcg_activate_policy() to finish. Something like: diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index ef4fef1af909..6266f702157f 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q, struct blkcg_gq *blkg, *pinned_blkg = NULL; int ret; - if (blkcg_policy_enabled(q, pol)) + if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol))) return 0; if (queue_is_mq(q)) @@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q, blkg_put(pinned_blkg); if (pd_prealloc) pol->pd_free_fn(pd_prealloc); + if (!ret) + wake_up(q->policy_waitq); return ret; enomem: @@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q, struct blkcg_gq *blkg; if (!blkcg_policy_enabled(q, pol)) - return; + wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol)); wait_event(q->xxx, blkcg_policy_enabled(q, pol));
Hello, On Mon, Dec 05, 2022 at 05:32:17PM +0800, Yu Kuai wrote: > 1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), whlie > it's not held to protect rq_qos_exit(), which is absolutely not safe > because they can be called concurrently by configuring iocost and > removing device. > I'm thinking about holding the lock to fetch the list and reset > q->rq_qos first: > > diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c > index 88f0fe7dcf54..271ad65eebd9 100644 > --- a/block/blk-rq-qos.c > +++ b/block/blk-rq-qos.c > @@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void > *private_data, > > void rq_qos_exit(struct request_queue *q) > { > - while (q->rq_qos) { > - struct rq_qos *rqos = q->rq_qos; > - q->rq_qos = rqos->next; > + struct rq_qos *rqos; > + > + spin_lock_irq(&q->queue_lock); > + rqos = q->rq_qos; > + q->rq_qos = NULL; > + spin_unlock_irq(&q->queue_lock); > + > + while (rqos) { > rqos->ops->exit(rqos); > + rqos = rqos->next; > } > } > > 2) rq_qos_add() can still succeed after rq_qos_exit() is done, which > will cause memory leak. Hence a checking is required beforing adding > to q->rq_qos. I'm thinking about flag QUEUE_FLAG_DYING first, but the > flag will not set if disk state is not marked GD_OWNS_QUEUE. Since > blk_unregister_queue() is called before rq_qos_exit(), use the queue > flag QUEUE_FLAG_REGISTERED should be OK. > > For the current problem that device can be removed while initializing > , I'm thinking about some possible solutions: > > Since bfq is initialized in elevator initialization, and others are > in queue initialization, such problem is only possible in iocost, hence > it make sense to fix it in iocost: So, iolatency is likely to switch to similar lazy init scheme, so we better fix it in the rq_qos / core block layer. ... > 3) Or is it better to fix it in the higher level? For example: > add a new restriction that blkcg_deactivate_policy() should be called > with blkcg_activate_policy() in pairs, and blkcg_deactivate_policy() > will wait for blkcg_activate_policy() to finish. Something like: > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index ef4fef1af909..6266f702157f 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q, > struct blkcg_gq *blkg, *pinned_blkg = NULL; > int ret; > > - if (blkcg_policy_enabled(q, pol)) > + if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol))) > return 0; > > if (queue_is_mq(q)) > @@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q, > blkg_put(pinned_blkg); > if (pd_prealloc) > pol->pd_free_fn(pd_prealloc); > + if (!ret) > + wake_up(q->policy_waitq); > return ret; > > enomem: > @@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q, > struct blkcg_gq *blkg; > > if (!blkcg_policy_enabled(q, pol)) > - return; > + wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol)); > wait_event(q->xxx, blkcg_policy_enabled(q, pol)); Yeah, along this line but hopefully something simpler like a mutex. Thanks.
diff --git a/block/genhd.c b/block/genhd.c index dcf200bcbd3e..0db440bbfefb 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -656,7 +656,6 @@ void del_gendisk(struct gendisk *disk) elevator_exit(q); mutex_unlock(&q->sysfs_lock); } - rq_qos_exit(q); blk_mq_unquiesce_queue(q); /* @@ -1168,6 +1167,7 @@ static void disk_release(struct device *dev) !test_bit(GD_ADDED, &disk->state)) blk_mq_exit_queue(disk->queue); + rq_qos_exit(disk->queue); blkcg_exit_disk(disk); bioset_exit(&disk->bio_split);
Remove block device when iocost is initializing may cause null-pointer dereference: CPU1 CPU2 ioc_qos_write blkcg_conf_open_bdev blkdev_get_no_open kobject_get_unless_zero blk_iocost_init rq_qos_add del_gendisk rq_qos_exit q->rq_qos = rqos->next //iocost is removed from q->roqs blkcg_activate_policy pd_init_fn ioc_pd_init ioc = q_to_ioc(blkg->q) //cant find iocost and return null Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be actived until iocost initialization is complited. Signed-off-by: Li Nan <linan122@huawei.com> --- block/genhd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)