Message ID: 20221227125502.541931-2-yukuai1@huaweicloud.com (mailing list archive)
State:      New, archived
Series:     blk-iocost: add refcounting for iocg and ioc
On Tue, Dec 27, 2022 at 08:55:01PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> iocost requires that a child iocg exit before its parent iocg;
> otherwise the kernel might crash in ioc_timer_fn(). However, iocg is
> currently exited in pd_free_fn(), which can't guarantee that order:
>
> 1) removing a cgroup can run concurrently with deactivating the policy;
> 2) blkg_free() triggered by cgroup removal runs asynchronously, so
>    removing a child cgroup can run concurrently with removing the
>    parent cgroup;
>
> Fix the problem by adding refcounting for iocg: a child iocg grabs a
> reference on its parent iocg, so the parent iocg waits for all child
> iocgs to exit first.

Wouldn't it be better to do this refcnting in the blk-cgroup core code
rather than in blk-iocost?

Thanks.
Hi,

On 2023/01/05 5:44, Tejun Heo wrote:
> On Tue, Dec 27, 2022 at 08:55:01PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> [...]
>
> Wouldn't it be better to do this refcnting in the blk-cgroup core code
> rather than in blk-iocost?

The problem is that I can't find a proper way to fix the race where
pd_free_fn() can be called from two different contexts:

1) from blkg_free(), which is called asynchronously when a cgroup is
   removed;
2) from blkcg_deactivate_policy(), which is called when a device is
   removed;

1) is related to the blkg while 2) is not, hence refcounting on the
blkg can't fix the problem. Refcounting on blkcg_policy_data should be
OK, but I see that bfq already has similar refcounting, while the other
policies don't require it.

Any suggestions?

Thanks,
Kuai
On Thu, Jan 05, 2023 at 09:14:07AM +0800, Yu Kuai wrote:
> 1) is related to the blkg while 2) is not, hence refcounting on the
> blkg can't fix the problem. Refcounting on blkcg_policy_data should be
> OK, but I see that bfq already has similar refcounting, while the
> other policies don't require it.

Hmm... taking a step back, wouldn't this be solved by moving the first
part of ioc_pd_free() to pd_offline_fn()? The ordering is strictly
defined there, right?

Thanks.

--
tejun
Hi,

On 2023/01/06 2:32, Tejun Heo wrote:
> Hmm... taking a step back, wouldn't this be solved by moving the first
> part of ioc_pd_free() to pd_offline_fn()? The ordering is strictly
> defined there, right?

Moving the first part to pd_offline_fn() has some prerequisites, like
what I did in the other thread:

an iocg can be activated again after pd_offline_fn(), which is possible
because bios can still be dispatched while a cgroup is being removed. I
tried to avoid that by:

1) dispatching all throttled bios in ioc_pd_offline();
2) not throttling bios after ioc_pd_offline().

However, you already disagreed with that.
Hello,

On Mon, Jan 09, 2023 at 09:32:46AM +0800, Yu Kuai wrote:
>> 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
>> d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")
>
> These two commits have been applied for three years. I haven't checked
> the details yet, but they don't seem to guarantee that no IO will be
> handled by rq_qos_throttle() after pd_offline_fn(), because I just
> reproduced this in another problem:
>
> f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")
>
> A user thread can issue async IO, the IO can be throttled by
> blk-throttle (not writeback), and then the user thread can exit and the
> cgroup can be removed before that IO is dispatched to rq_qos_throttle.

I see.

>> After the above two commits, ->pd_offline_fn() is called only after
>> all possible writebacks are complete, so it shouldn't allow mass
>> escapes to root. With writebacks out of the picture, it might be that
>> there can be no further IOs once ->pd_offline_fn() is called too, as
>> there can be no tasks left in it and no dirty pages, but best to
>> confirm that.
>>
>> So, yeah, the original approach you took should work, although I'm
>> not sure the patches that you added to make offline blkgs bypass are
>> necessary (that also contributed to my assumption that there will be
>> more IOs on those blkgs). Have you seen more IOs coming down the
>> pipeline after offline? If so, can you dump some backtraces and see
>> where they're coming from?
>
> Currently I'm sure such IOs can come from blk-throttle, and I'm not
> sure yet, but I also suspect io_uring can do this.

Yeah, that's unfortunate. There are several options here:

1. Do what you originally suggested - bypass to root after offline. I
   feel uneasy about this. Both iolatency and throtl clear their
   configs on offline, but that's punting to the parent. For iocost
   it'd be bypassing all controls, which can actually be exploited.

2. Make all possible IO issuers use blkcg_[un]pin_online() and shift
   the iocost shutdown to pd_offline_fn(). This likely is the most
   canonical solution given the current situation, but it's kinda nasty
   to add another layer of refcounting all over the place.

3. Order blkg free so that parents are never freed before children. You
   did this by adding refcnts in iocost, but shouldn't it be possible
   to simply shift blkg_put(blkg->parent) in __blkg_release() to
   blkg_free_workfn()?

#3 seems the most logical to me. What do you think?

Thanks.
Hi,

On 2023/01/10 2:23, Tejun Heo wrote:
> Yeah, that's unfortunate. There are several options here:
>
> 1. Do what you originally suggested - bypass to root after offline. I
>    feel uneasy about this. Both iolatency and throtl clear their
>    configs on offline, but that's punting to the parent. For iocost
>    it'd be bypassing all controls, which can actually be exploited.
>
> 2. Make all possible IO issuers use blkcg_[un]pin_online() and shift
>    the iocost shutdown to pd_offline_fn(). This likely is the most
>    canonical solution given the current situation, but it's kinda
>    nasty to add another layer of refcounting all over the place.
>
> 3. Order blkg free so that parents are never freed before children.
>    You did this by adding refcnts in iocost, but shouldn't it be
>    possible to simply shift blkg_put(blkg->parent) in
>    __blkg_release() to blkg_free_workfn()?

As I tried to explain before, we can make sure blkg_free() is called in
order, but blkg_free() from cgroup removal can run concurrently with
policy deactivation, and we can't guarantee the order of ioc_pd_free(),
which is called both from blkg_free() and from
blkcg_deactivate_policy(). Hence I don't think #3 is possible.

I personally prefer #1; I don't see any real use case for the defect
you described, and in cgroup v1 blk-throttle is bypassed to no limit in
the same way.

I'm not sure about #2; that sounds like a possible solution, but I'm
not quite familiar with the implementation here.

Considering that bfq already has such refcounting for bfqg, perhaps
similar refcounting is acceptable?

Thanks,
Kuai
Hello,

On Tue, Jan 10, 2023 at 09:39:44AM +0800, Yu Kuai wrote:
> As I tried to explain before, we can make sure blkg_free() is called
> in order, but blkg_free() from cgroup removal can run concurrently
> with policy deactivation, and we can't guarantee the order of
> ioc_pd_free(), which is called both from blkg_free() and from
> blkcg_deactivate_policy(). Hence I don't think #3 is possible.

Hahaha, sorry that I keep forgetting that. This doesn't really feel
like that important or difficult a part of the problem, tho. Can't it
be solved by synchronizing the blkg free work item against the
deactivate path with a mutex?

Thanks.
Hi,

On 2023/01/11 2:36, Tejun Heo wrote:
> Hahaha, sorry that I keep forgetting that. This doesn't really feel
> like that important or difficult a part of the problem, tho. Can't it
> be solved by synchronizing the blkg free work item against the
> deactivate path with a mutex?

I'm not sure. Of course this can fix the problem, but two spinlocks,
'blkcg->lock' and 'q->queue_lock', are currently used to protect
blkg_destroy(), and adding a mutex (disk level?) requires a refactor,
which seems complex to me.

Thanks,
Kuai
Hello,

On Wed, Jan 11, 2023 at 09:36:25AM +0800, Yu Kuai wrote:
> I'm not sure. Of course this can fix the problem, but two spinlocks,
> 'blkcg->lock' and 'q->queue_lock', are currently used to protect
> blkg_destroy(), and adding a mutex (disk level?) requires a refactor,
> which seems complex to me.

The fact that the two paths can race each other already seems buggy.
e.g. What prevents them from running pd_free on the same pd twice? So,
it needs to be fixed anyway, and the intention has always been that
these callbacks are called in the correct traversal order.

Thanks.
Hi,

On 2023/01/12 1:07, Tejun Heo wrote:
> The fact that the two paths can race each other already seems buggy.
> e.g. What prevents them from running pd_free on the same pd twice? So,
> it needs to be fixed anyway, and the intention has always been that
> these callbacks are called in the correct traversal order.

I think the root cause is that a blkg is tracked on two different
lists: blkcg->blkg_list at the cgroup level and q->blkg_list at the
disk level. And pd_free_fn() is called both from blkg_destroy() and
from deactivating the policy for a disk.

I just thought of another solution:

Remove blkcg_deactivate_policy() from rq_qos_exit() when deleting the
device, and delay the policy cleanup and free to blkg_destroy_all().
Then the policies (other than bfq) can only call pd_free_fn() from
blkg_destroy(), and it's easy to guarantee the order. bfq can stay the
same, since bfq has refcounting itself.

Then, for the problem that the ioc can be freed in pd_free_fn(), we can
fix it by freeing the ioc in ioc_pd_free() for the root blkg instead of
in rq_qos_exit().

What do you think?

Thanks,
Kuai
Hello,

On Thu, Jan 12, 2023 at 02:18:15PM +0800, Yu Kuai wrote:
> Remove blkcg_deactivate_policy() from rq_qos_exit() when deleting the
> device, and delay the policy cleanup and free to blkg_destroy_all().
> Then the policies (other than bfq) can only call pd_free_fn() from
> blkg_destroy(), and it's easy to guarantee the order. bfq can stay the
> same, since bfq has refcounting itself.
>
> Then, for the problem that the ioc can be freed in pd_free_fn(), we
> can fix it by freeing the ioc in ioc_pd_free() for the root blkg
> instead of in rq_qos_exit().
>
> What do you think?

That would remove the ability to dynamically remove an rq_qos policy,
right? We don't currently do it, but given that having an rq_qos
registered comes with perf overhead, it's something we might want to do
in the future - e.g. only activate the policy when the controller is
actually enabled. So, idk. What's wrong with synchronizing the two
removal paths? blkcg policies are combinations of cgroup and block
device configurations, so having exit paths from both sides is kinda
natural.

Thanks.
Hi,

On 2023/01/13 8:53, Tejun Heo wrote:
> That would remove the ability to dynamically remove an rq_qos policy,
> right? We don't currently do it, but given that having an rq_qos
> registered comes with perf overhead, it's something we might want to
> do in the future - e.g.

Yes, that makes sense - removing the ioc and other policies
dynamically.

> only activate the policy when the controller is actually enabled. So,
> idk. What's wrong with synchronizing the two removal paths? blkcg
> policies are combinations of cgroup and block device configurations,
> so having exit paths from both sides is kinda natural.

I still can't figure out how to synchronize them with a mutex. Maybe
I'm being foolish...

Thanks,
Kuai
Hello,

On Fri, Jan 13, 2023 at 09:10:25AM +0800, Yu Kuai wrote:
> I still can't figure out how to synchronize them with a mutex. Maybe
> I'm being foolish...

Hmm... can't you just use, e.g., a per-bdev mutex that is grabbed by
both blkg_free_workfn() and blkcg_deactivate_policy()?

Thanks.
Hi,

On 2023/01/13 9:15, Tejun Heo wrote:
> Hmm... can't you just use, e.g., a per-bdev mutex that is grabbed by
> both blkg_free_workfn() and blkcg_deactivate_policy()?

I think holding the lock in blkg_free_workfn() is too late;
pd_free_fn() for the parent can be called first from
blkcg_deactivate_policy():

t1: remove cgroups t1/t2
  blkcg_destroy_blkgs
    blkg_destroy
      percpu_ref_kill(&blkg->refcnt)
        blkg_release
          blkg_free
            schedule_work(&blkg->free_work)
            // t1 is done

t2: handle t1 from removing the device
  blkcg_deactivate_policy
    pd_free_fn
    // free parent

t3: from t1
  blkg_free_workfn
    pd_free_fn
    // free child

Thanks,
Kuai
Hello,

On Fri, Jan 13, 2023 at 09:25:11AM +0800, Yu Kuai wrote:
> I think holding the lock in blkg_free_workfn() is too late;
> pd_free_fn() for the parent can be called first from
> blkcg_deactivate_policy():
>
> [...]

As we discussed before, you'd have to order the actual freeing by
shifting the ref puts into the free_work. If you move
blkg_put(blkg->parent) and list_del_init(&blkg->q_node) to
blkg_free_workfn() (this will require adjustments, as these things are
used from other places too), the free work items will be ordered and
the blkg will remain iterable - IOW, deactivate_policy would be able to
see it, allowing the two paths to synchronize, right?

Thanks.
Hi,

On 2023/01/14 1:16, Tejun Heo wrote:
> As we discussed before, you'd have to order the actual freeing by
> shifting the ref puts into the free_work. If you move
> blkg_put(blkg->parent) and list_del_init(&blkg->q_node) to
> blkg_free_workfn() (this will require adjustments, as these things are
> used from other places too), the free work items will be ordered and
> the blkg will remain iterable - IOW, deactivate_policy would be able
> to see it, allowing the two paths to synchronize, right?

That sounds reasonable - only remove the blkg from the queue list once
pd_free_fn() is done. That way deactivate_policy will be able to see
it; if deactivate_policy is called first, pd_free_fn() can be called
there, and blkg_free_workfn() should later skip pd_free_fn().

I'm glad we've finally come up with a suitable solution.
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 7a0d754b9eb2..525e93e1175a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -461,6 +461,8 @@ struct ioc_gq {
 	struct blkg_policy_data pd;
 	struct ioc *ioc;
 
+	refcount_t ref;
+
 	/*
 	 * A iocg can get its weight from two sources - an explicit
 	 * per-device-cgroup configuration or the default weight of the
@@ -2943,9 +2945,53 @@ static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q,
 		return NULL;
 	}
 
+	refcount_set(&iocg->ref, 1);
 	return &iocg->pd;
 }
 
+static void iocg_get(struct ioc_gq *iocg)
+{
+	refcount_inc(&iocg->ref);
+}
+
+static void iocg_put(struct ioc_gq *iocg)
+{
+	struct ioc *ioc = iocg->ioc;
+	unsigned long flags;
+	struct ioc_gq *parent = NULL;
+
+	if (!refcount_dec_and_test(&iocg->ref))
+		return;
+
+	if (iocg->level > 0)
+		parent = iocg->ancestors[iocg->level - 1];
+
+	if (ioc) {
+		spin_lock_irqsave(&ioc->lock, flags);
+
+		if (!list_empty(&iocg->active_list)) {
+			struct ioc_now now;
+
+			ioc_now(ioc, &now);
+			propagate_weights(iocg, 0, 0, false, &now);
+			list_del_init(&iocg->active_list);
+		}
+
+		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
+
+		spin_unlock_irqrestore(&ioc->lock, flags);
+
+		hrtimer_cancel(&iocg->waitq_timer);
+	}
+
+	free_percpu(iocg->pcpu_stat);
+	kfree(iocg);
+
+	if (parent)
+		iocg_put(parent);
+}
+
 static void ioc_pd_init(struct blkg_policy_data *pd)
 {
 	struct ioc_gq *iocg = pd_to_iocg(pd);
@@ -2973,6 +3019,9 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 	iocg->level = blkg->blkcg->css.cgroup->level;
 
+	if (blkg->parent)
+		iocg_get(blkg_to_iocg(blkg->parent));
+
 	for (tblkg = blkg; tblkg; tblkg = tblkg->parent) {
 		struct ioc_gq *tiocg = blkg_to_iocg(tblkg);
 
 		iocg->ancestors[tiocg->level] = tiocg;
@@ -2985,30 +3034,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 static void ioc_pd_free(struct blkg_policy_data *pd)
 {
-	struct ioc_gq *iocg = pd_to_iocg(pd);
-	struct ioc *ioc = iocg->ioc;
-	unsigned long flags;
-
-	if (ioc) {
-		spin_lock_irqsave(&ioc->lock, flags);
-
-		if (!list_empty(&iocg->active_list)) {
-			struct ioc_now now;
-
-			ioc_now(ioc, &now);
-			propagate_weights(iocg, 0, 0, false, &now);
-			list_del_init(&iocg->active_list);
-		}
-
-		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
-		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
-
-		spin_unlock_irqrestore(&ioc->lock, flags);
-
-		hrtimer_cancel(&iocg->waitq_timer);
-	}
-	free_percpu(iocg->pcpu_stat);
-	kfree(iocg);
+	iocg_put(pd_to_iocg(pd));
 }
 
 static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)