Message ID | 20221222143353.598042-6-shikemeng@huaweicloud.com (mailing list archive)
---|---
State | New, archived
Series | A few bugfix and cleanup patches for sbitmap
On Thu 22-12-22 22:33:53, Kemeng Shi wrote:
> Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
> mentioned that in case of shared tags, there could be just one real
> active hctx(queue) because of lazy detection of tag idle. Then driver tag
> allocation may wait forever on this real active hctx(queue) if wake_batch
> is > hctx_max_depth, where hctx_max_depth is the available tag depth for
> the active hctx(queue). However, the condition wake_batch > hctx_max_depth
> is not strong enough to avoid IO hangs, as sbitmap_queue_wake_up will only
> wake up one wait queue for each wake_batch, even if there is only one
> waiter in the woken wait queue. After this, there is only one tag to free
> and wake_batch may never be reached again. Commit 180dccb0dba4f ("blk-mq:
> fix tag_get wait task can't be awakened") mentioned that driver tag
> allocation may wait forever. Actually, the inactive hctx(queue) will be
> truly idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
> to wake one waiter per wait queue to break the hang. But an IO hang of
> 30 seconds is also not acceptable. Fix this potential IO hang by setting
> the batch size small enough that the depth of the shared hctx(queue) is
> enough to wake up all of the queues, as sbq_calc_wake_batch does.
>
> Although hctx_max_depth will be clamped to at least 4 while the
> wake_batch recalculation does not do the clamp, the wake_batch will
> always be recalculated to 1 when hctx_max_depth <= 4.
>
> Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

So the condition in sbitmap_queue_recalculate_wake_batch() also seemed
strange to me, and the changelogs of commits 180dccb0dba4 and 10825410b95
("blk-mq: Fix wrong wakeup batch configuration which will cause hang")
didn't add much confidence about the magic batch setting of 4. Let me add
the original author of this code to CC in case he has any thoughts on why
a wake batch of 4 is safe for cards with, say, 32 tags when active_users
is currently 32, because I don't see why that is correct either.

								Honza
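For concreteness, the situation Jan describes can be reproduced with the
pre-patch formula in a few lines of standalone C. This is only a
userspace sketch, assuming SBQ_WAIT_QUEUES == 8 and SBQ_WAKE_BATCH == 8
(the values in include/linux/sbitmap.h at the time), not kernel code:

/* Standalone model of sbitmap_queue_recalculate_wake_batch() before the
 * patch, showing the case Jan raises: 32 tags shared by 32 active users.
 * The constants mirror include/linux/sbitmap.h (assumed here). */
#include <stdio.h>

#define SBQ_WAIT_QUEUES	8
#define SBQ_WAKE_BATCH	8

static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

static unsigned int old_recalc(unsigned int sb_depth, unsigned int users)
{
	unsigned int depth = (sb_depth + users - 1) / users; /* per-hctx share */
	unsigned int min_batch = sb_depth >= 4 * SBQ_WAIT_QUEUES ? 4 : 1;

	return clamp_val(depth / SBQ_WAIT_QUEUES, min_batch, SBQ_WAKE_BATCH);
}

int main(void)
{
	/* 32-tag card, 32 active users: each hctx may only see 1 tag, yet
	 * the old formula still returns a wake batch of 4. */
	printf("wake_batch = %u, per-hctx depth = %u\n",
	       old_recalc(32, 32), (32 + 32 - 1) / 32);
	return 0;
}

With 32 tags and 32 active users the per-hctx share is a single tag, yet
the old min_batch clamp still yields a wake batch of 4 — exactly the
wake_batch > hctx_max_depth case the changelog warns about.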
Hi,

On 2022/12/22 21:41, Jan Kara wrote:
> So the condition in sbitmap_queue_recalculate_wake_batch() also seemed
> strange to me, and the changelogs of commits 180dccb0dba4 and
> 10825410b95 ("blk-mq: Fix wrong wakeup batch configuration which will
> cause hang") didn't add much confidence about the magic batch setting
> of 4. Let me add the original author of this code to CC in case he has
> any thoughts on why a wake batch of 4 is safe for cards with, say, 32
> tags when active_users is currently 32, because I don't see why that is
> correct either.

If I remember correctly, the reason to use 4 here in the first place was
to avoid performance degradation. As for why it is safe: because
4 * 8 = 32. Someone waiting for a tag means all 32 tags are grabbed, and
a wake batch of 4 will make sure at least 8 wait queues are woken. It's
true that some wait queues might only have one waiter, but I don't think
this will cause an IO hang.

Thanks,
Kuai
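Kuai's arithmetic can be checked with a toy model of the batched
round-robin wakeup: 32 freed tags with a wake batch of 4 sweep all 8 wait
queues once. The sketch below is a deliberate simplification (the real
accounting in lib/sbitmap.c uses per-wait-state atomic counters); the
assumption illustrated is simply that the wake index advances by one per
completed batch:

/* Toy model of the batched round-robin wakeups described above: with 32
 * freed tags and wake_batch == 4, every one of the 8 wait queues gets
 * exactly one wakeup. Illustration only, not the lib/sbitmap.c logic. */
#include <stdio.h>

#define SBQ_WAIT_QUEUES	8

int main(void)
{
	unsigned int wake_batch = 4, tags_freed = 32;
	unsigned int wake_index = 0, completed = 0;
	unsigned int woken[SBQ_WAIT_QUEUES] = { 0 };

	for (unsigned int i = 0; i < tags_freed; i++) {
		if (++completed == wake_batch) {	/* batch threshold hit */
			woken[wake_index]++;		/* wake one wait queue */
			wake_index = (wake_index + 1) % SBQ_WAIT_QUEUES;
			completed = 0;
		}
	}
	for (int q = 0; q < SBQ_WAIT_QUEUES; q++)
		printf("waitqueue %d woken %u time(s)\n", q, woken[q]);
	return 0;
}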
On 2022/12/26 15:50, Yu Kuai wrote:
>> why a wake batch of 4 is safe for cards with, say, 32 tags when
>> active_users is currently 32, because I don't see why that is correct
>> either.

I see, you guys are worried about the window where some hctx has
completed all its requests but has not yet been detected as idle. It's
true that a waiter might have to wait for another hctx to become idle
before being woken in that case. However, I'm not sure which way is
better.

Ming, do you have any idea?

Thanks,
Kuai
Friendly ping...

On 12/26/2022 4:57 PM, Yu Kuai wrote:
> On 2022/12/26 15:50, Yu Kuai wrote:
>
>>>> why a wake batch of 4 is safe for cards with, say, 32 tags when
>>>> active_users is currently 32, because I don't see why that is
>>>> correct either.
>
> I see, you guys are worried about the window where some hctx has
> completed all its requests but has not yet been detected as idle. It's
> true that a waiter might have to wait for another hctx to become idle
> before being woken in that case. However, I'm not sure which way is
> better.
>
> Ming, do you have any idea?
>
> Thanks,
> Kuai
On 1/3/2023 10:12 AM, Kemeng Shi wrote:
> Friendly ping...
>
> On 12/26/2022 4:57 PM, Yu Kuai wrote:
>> On 2022/12/26 15:50, Yu Kuai wrote:
>>
>>>> why a wake batch of 4 is safe for cards with, say, 32 tags when
>>>> active_users is currently 32, because I don't see why that is
>>>> correct either.
>>
>> I see, you guys are worried about the window where some hctx has
>> completed all its requests but has not yet been detected as idle. It's
>> true that a waiter might have to wait for another hctx to become idle
>> before being woken in that case. However, I'm not sure which way is
>> better.
>>
>> Ming, do you have any idea?
>>
>> Thanks,
>> Kuai

Hi Jan, the magic batch of 4 seems to have been chosen just for
performance initially, without full consideration, and no better solution
has been proposed since. Do you have any suggestions on how I can make
further progress?
On Mon 16-01-23 10:15:08, Kemeng Shi wrote:
> Hi Jan, the magic batch of 4 seems to have been chosen just for
> performance initially, without full consideration, and no better
> solution has been proposed since. Do you have any suggestions on how I
> can make further progress?

Yeah, since there was no good reasoning behind the magic batch size of 4,
feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

to this patch.

								Honza
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index b6d3bb1c3675..804fe99783e4 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -458,13 +458,10 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
 					    unsigned int users)
 {
 	unsigned int wake_batch;
-	unsigned int min_batch;
 	unsigned int depth = (sbq->sb.depth + users - 1) / users;
 
-	min_batch = sbq->sb.depth >= (4 * SBQ_WAIT_QUEUES) ? 4 : 1;
-
 	wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
-			       min_batch, SBQ_WAKE_BATCH);
+			       1, SBQ_WAKE_BATCH);
 
 	WRITE_ONCE(sbq->wake_batch, wake_batch);
 }
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth, where hctx_max_depth is the available tag depth for
the active hctx(queue). However, the condition wake_batch > hctx_max_depth
is not strong enough to avoid IO hangs, as sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch, even if there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may never be reached again. Commit 180dccb0dba4f ("blk-mq:
fix tag_get wait task can't be awakened") mentioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truly idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hang. But an IO hang of 30
seconds is also not acceptable. Fix this potential IO hang by setting the
batch size small enough that the depth of the shared hctx(queue) is
enough to wake up all of the queues, as sbq_calc_wake_batch does.

Although hctx_max_depth will be clamped to at least 4 while the
wake_batch recalculation does not do the clamp, the wake_batch will
always be recalculated to 1 when hctx_max_depth <= 4.

Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 lib/sbitmap.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
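To make the effect of the patch visible, the old and new recalculation
can be compared side by side in a standalone userspace sketch
(SBQ_WAIT_QUEUES == 8 and SBQ_WAKE_BATCH == 8 are assumed from
include/linux/sbitmap.h; this is an illustration, not the kernel code):

/* Compare the old and new wake_batch recalculation for a 32-tag queue
 * across different active-user counts. */
#include <stdio.h>

#define SBQ_WAIT_QUEUES	8
#define SBQ_WAKE_BATCH	8

static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

static unsigned int recalc(unsigned int sb_depth, unsigned int users, int old)
{
	unsigned int depth = (sb_depth + users - 1) / users;
	unsigned int min_batch = 1;

	if (old)	/* pre-patch clamp floor of 4 for deep queues */
		min_batch = sb_depth >= 4 * SBQ_WAIT_QUEUES ? 4 : 1;
	return clamp_val(depth / SBQ_WAIT_QUEUES, min_batch, SBQ_WAKE_BATCH);
}

int main(void)
{
	unsigned int users_list[] = { 1, 2, 4, 8, 16, 32 };

	for (int i = 0; i < 6; i++) {
		unsigned int u = users_list[i];
		printf("users=%2u: old wake_batch=%u, new wake_batch=%u\n",
		       u, recalc(32, u, 1), recalc(32, u, 0));
	}
	return 0;
}

For a 32-tag queue the old formula pins wake_batch at 4 for every users
count, while the new one scales down with the per-hctx share, reaching 1
once each hctx's share is 8 tags or fewer.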