Message ID | 20230721095715.232728-1-ming.lei@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] sbitmap: fix batching wakeup
On Fri, Jul 21, 2023 at 05:57:15PM +0800, Ming Lei wrote:
> From: David Jeffery <djeffery@redhat.com>

...

> Cc: David Jeffery <djeffery@redhat.com>
> Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> Cc: Chengming Zhou <zhouchengming@bytedance.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Shouldn't the author include their sign off? Or is this supposed to be
from you?
On Fri, Jul 21, 2023 at 12:40:19PM +0200, Keith Busch wrote:
> On Fri, Jul 21, 2023 at 05:57:15PM +0800, Ming Lei wrote:
> > From: David Jeffery <djeffery@redhat.com>
>
> ...
>
> > Cc: David Jeffery <djeffery@redhat.com>
> > Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> > Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> > Cc: Chengming Zhou <zhouchengming@bytedance.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Shouldn't the author include their sign off? Or is this supposed to be
> from you?

I understand signed-off-by needs to be explicit from David, and let's
focus on patch itself. And I believe David can reply with his
signed-off-by when the patch is ready to go.


Thanks,
Ming
On Fri, Jul 21, 2023 at 05:57:15PM +0800, Ming Lei wrote:
>  static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  {
> -	int i, wake_index;
> +	int i, wake_index, woken;
> 
>  	if (!atomic_read(&sbq->ws_active))
>  		return;
> @@ -567,13 +567,12 @@ static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  		 */
>  		wake_index = sbq_index_inc(wake_index);
> 
> -		/*
> -		 * It is sufficient to wake up at least one waiter to
> -		 * guarantee forward progress.
> -		 */
> -		if (waitqueue_active(&ws->wait) &&
> -		    wake_up_nr(&ws->wait, nr))
> -			break;
> +		if (waitqueue_active(&ws->wait)) {
> +			woken = wake_up_nr(&ws->wait, nr);
> +			if (woken == nr)
> +				break;
> +			nr -= woken;
> +		}
>  	}

This looks good. I had something similar at one point, but after all the
churn this file had gone through, I somehow convinced myself it wasn't
necessary anymore. Your analysis and fix look correct to me!

Reviewed-by: Keith Busch <kbusch@kernel.org>
Ming Lei <ming.lei@redhat.com> writes:

> From: David Jeffery <djeffery@redhat.com>
>
> Current code supposes that it is enough to provide forward progress by just
> waking up one wait queue after one completion batch is done.
>
> Unfortunately this way isn't enough, cause waiter can be added to
> wait queue just after it is woken up.
>
> Follows one example(64 depth, wake_batch is 8)
>
> 1) all 64 tags are active
>
> 2) in each wait queue, there is only one single waiter
>
> 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> queue, then immediately one new sleeper is added to this wait queue
>
> 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> wait queue
>
> 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> can't be waken up anymore.
>
> Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> single batch.

yes, I think this makes sense. When working on this algorithm I remember
I considered it (thus wake_up_nr being ready), but ended up believing it
wasn't needed. please take:

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>

I wonder how likely it is to reach it. Did you get a bug report?

Thanks,
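The lost wakeup in step 5) of the quoted example can be reproduced outside the
kernel. Below is a small user-space toy model (not kernel code: wake_queue() is
only a stand-in for wake_up_nr(), and the waiter counts are a deliberately
simplified state in the spirit of step 5). It follows the pre-patch policy of
breaking out of the loop as soon as any wait queue reports at least one woken
task, so the whole batch budget is spent on a queue holding a single waiter:

#include <stdio.h>

#define NR_WAIT_QUEUES 8

/* Stand-in for wake_up_nr(): wake at most 'nr' waiters, return how many woke. */
static int wake_queue(int *waiters, int nr)
{
	int woken = (*waiters < nr) ? *waiters : nr;

	*waiters -= woken;
	return woken;
}

int main(void)
{
	/* 1 waiter in queue 0, 7 waiters in queue 1, one batch of 8 wakeups. */
	int waiters[NR_WAIT_QUEUES] = { 1, 7 };
	int nr = 8;

	/* Old policy: stop at the first queue where anything was woken. */
	for (int i = 0; i < NR_WAIT_QUEUES; i++) {
		if (waiters[i] && wake_queue(&waiters[i], nr))
			break;
	}

	for (int i = 0; i < NR_WAIT_QUEUES; i++)
		if (waiters[i])
			printf("queue %d still has %d waiter(s) nobody will wake\n",
			       i, waiters[i]);
	return 0;
}

Run, this reports that queue 1 still holds 7 waiters that nothing will ever
wake: the batch of 8 completions produced a single wakeup.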
On 7/21/23 3:57 AM, Ming Lei wrote:
> From: David Jeffery <djeffery@redhat.com>
>
> Current code supposes that it is enough to provide forward progress by just
> waking up one wait queue after one completion batch is done.
>
> Unfortunately this way isn't enough, cause waiter can be added to
> wait queue just after it is woken up.
>
> Follows one example(64 depth, wake_batch is 8)
>
> 1) all 64 tags are active
>
> 2) in each wait queue, there is only one single waiter
>
> 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> queue, then immediately one new sleeper is added to this wait queue
>
> 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> wait queue
>
> 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> can't be waken up anymore.
>
> Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> single batch.

Change looks good, but commit message needs:

1) Signed-off-by from author
2) To be wrapped at ~72 chars, it's got lines that are way too long.

I can edit the commit message, but David does need to reply with his SOB
before it can get applied.
On Fri, Jul 21, 2023 at 6:50 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, Jul 21, 2023 at 12:40:19PM +0200, Keith Busch wrote:
> > On Fri, Jul 21, 2023 at 05:57:15PM +0800, Ming Lei wrote:
> > > From: David Jeffery <djeffery@redhat.com>
> >
> > ...
> >
> > > Cc: David Jeffery <djeffery@redhat.com>
> > > Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> > > Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> > > Cc: Chengming Zhou <zhouchengming@bytedance.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >
> > Shouldn't the author include their sign off? Or is this supposed to be
> > from you?
>
> I understand signed-off-by needs to be explicit from David, and let's
> focus on patch itself. And I believe David can reply with his
> signed-off-by when the patch is ready to go.
>
>
> Thanks,
> Ming
>

Thank you for the initial checking of the patch, Ming. I appreciate your
patience and wisdom looking over my ideas.

Please add my signed-off-by to the patch.

Signed-off-by: David Jeffery <djeffery@redhat.com>

Regards,

David Jeffery
On Fri, 21 Jul 2023 17:57:15 +0800, Ming Lei wrote:
> Current code supposes that it is enough to provide forward progress by just
> waking up one wait queue after one completion batch is done.
>
> Unfortunately this way isn't enough, cause waiter can be added to
> wait queue just after it is woken up.
>
> Follows one example(64 depth, wake_batch is 8)
>
> [...]

Applied, thanks!

[1/1] sbitmap: fix batching wakeup
      commit: 106397376c0369fcc01c58dd189ff925a2724a57

Best regards,
On Fri, Jul 21, 2023 at 12:35:43PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
>
> > From: David Jeffery <djeffery@redhat.com>
> >
> > Current code supposes that it is enough to provide forward progress by just
> > waking up one wait queue after one completion batch is done.
> >
> > Unfortunately this way isn't enough, cause waiter can be added to
> > wait queue just after it is woken up.
> >
> > Follows one example(64 depth, wake_batch is 8)
> >
> > 1) all 64 tags are active
> >
> > 2) in each wait queue, there is only one single waiter
> >
> > 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> > queue, then immediately one new sleeper is added to this wait queue
> >
> > 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> > wait queue
> >
> > 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> > can't be waken up anymore.
> >
> > Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> > single batch.
>
> yes, I think this makes sense. When working on this algorithm I remember
> I considered it (thus wake_up_nr being ready), but ended up believing it
> wasn't needed. please take:
>
> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
>
> I wonder how likely it is to reach it. Did you get a bug report?

It was reported from one RH customer, and I am also hit once when running
dbench on loop(bfq) over scsi_debug in my routine test.

David figured out the idea, and we discussed other solutions too, but turns
out others can't work, and the above (extreme) example seems easier to
follow, from me.

Per David's early analysis, it should be easier to trigger since commit
26edb30dd1c0 ("sbitmap: Try each queue to wake up at least one waiter")
because 'wake_index' isn't updated before calling wake_up_nr(), then
multiple completion batch may only wakeup one same wait queue, meantime
multiple sleepers are added to same wait queue.

Thanks,
Ming
On Fri 21-07-23 17:57:15, Ming Lei wrote:
> From: David Jeffery <djeffery@redhat.com>
>
> Current code supposes that it is enough to provide forward progress by just
> waking up one wait queue after one completion batch is done.
>
> Unfortunately this way isn't enough, cause waiter can be added to
> wait queue just after it is woken up.
>
> Follows one example(64 depth, wake_batch is 8)
>
> 1) all 64 tags are active
>
> 2) in each wait queue, there is only one single waiter
>
> 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> queue, then immediately one new sleeper is added to this wait queue
>
> 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> wait queue
>
> 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> can't be waken up anymore.
>
> Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> single batch.
>
> Cc: David Jeffery <djeffery@redhat.com>
> Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> Cc: Chengming Zhou <zhouchengming@bytedance.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

I'm sorry for the delay - I was on vacation. I can see the patch got
already merged and I'm not strictly against that (although I think Gabriel
was experimenting with this exact wakeup scheme and as far as I remember
the more eager waking up was causing performance decrease for some
configurations). But let me challenge the analysis above a bit. For the
sleeper to be added to a waitqueue in step 3), blk_mq_mark_tag_wait() must
fail the blk_mq_get_driver_tag() call. Which means that all tags were used
at that moment. To summarize, anytime we add any new waiter to the
waitqueue, all tags are used and thus we should eventually receive enough
wakeups to wake all of them. What am I missing?

								Honza

> ---
>  lib/sbitmap.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index eff4e42c425a..d0a5081dfd12 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -550,7 +550,7 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
> 
>  static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  {
> -	int i, wake_index;
> +	int i, wake_index, woken;
> 
>  	if (!atomic_read(&sbq->ws_active))
>  		return;
> @@ -567,13 +567,12 @@ static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  		 */
>  		wake_index = sbq_index_inc(wake_index);
> 
> -		/*
> -		 * It is sufficient to wake up at least one waiter to
> -		 * guarantee forward progress.
> -		 */
> -		if (waitqueue_active(&ws->wait) &&
> -		    wake_up_nr(&ws->wait, nr))
> -			break;
> +		if (waitqueue_active(&ws->wait)) {
> +			woken = wake_up_nr(&ws->wait, nr);
> +			if (woken == nr)
> +				break;
> +			nr -= woken;
> +		}
>  	}
> 
>  	if (wake_index != atomic_read(&sbq->wake_index))
> --
> 2.40.1
>
On Wed, Aug 02, 2023 at 06:05:53PM +0200, Jan Kara wrote:
> On Fri 21-07-23 17:57:15, Ming Lei wrote:
> > From: David Jeffery <djeffery@redhat.com>
> >
> > Current code supposes that it is enough to provide forward progress by just
> > waking up one wait queue after one completion batch is done.
> >
> > Unfortunately this way isn't enough, cause waiter can be added to
> > wait queue just after it is woken up.
> >
> > Follows one example(64 depth, wake_batch is 8)
> >
> > 1) all 64 tags are active
> >
> > 2) in each wait queue, there is only one single waiter
> >
> > 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> > queue, then immediately one new sleeper is added to this wait queue
> >
> > 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> > wait queue
> >
> > 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> > can't be waken up anymore.
> >
> > Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> > single batch.
> >
> > Cc: David Jeffery <djeffery@redhat.com>
> > Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> > Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> > Cc: Chengming Zhou <zhouchengming@bytedance.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> I'm sorry for the delay - I was on vacation. I can see the patch got
> already merged and I'm not strictly against that (although I think Gabriel
> was experimenting with this exact wakeup scheme and as far as I remember
> the more eager waking up was causing performance decrease for some
> configurations). But let me challenge the analysis above a bit. For the
> sleeper to be added to a waitqueue in step 3), blk_mq_mark_tag_wait() must
> fail the blk_mq_get_driver_tag() call. Which means that all tags were used

Here only allocating request by blk_mq_get_tag() is involved, and
getting driver tag isn't involved.

> at that moment. To summarize, anytime we add any new waiter to the
> waitqueue, all tags are used and thus we should eventually receive enough
> wakeups to wake all of them. What am I missing?

When running the final retry(__blk_mq_get_tag) before
sleeping(io_schedule()) in blk_mq_get_tag(), the sleeper has been added to
wait queue.

So when two completion batch comes, the two may wake up same wq because
same ->wake_index can be observed from two completion path, and both two
wake_up_nr() can return > 0 because adding sleeper into wq and wake_up_nr()
can be interleaved, then 16 completions just wakeup 2 sleepers added to
same wq.

If the story happens on one wq with >= 8 sleepers, io hang will be
triggered, if there are another two pending wq.

Thanks,
Ming
On Tue 08-08-23 16:18:50, Ming Lei wrote:
> On Wed, Aug 02, 2023 at 06:05:53PM +0200, Jan Kara wrote:
> > On Fri 21-07-23 17:57:15, Ming Lei wrote:
> > > From: David Jeffery <djeffery@redhat.com>
> > >
> > > Current code supposes that it is enough to provide forward progress by just
> > > waking up one wait queue after one completion batch is done.
> > >
> > > Unfortunately this way isn't enough, cause waiter can be added to
> > > wait queue just after it is woken up.
> > >
> > > Follows one example(64 depth, wake_batch is 8)
> > >
> > > 1) all 64 tags are active
> > >
> > > 2) in each wait queue, there is only one single waiter
> > >
> > > 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> > > queue, then immediately one new sleeper is added to this wait queue
> > >
> > > 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> > > wait queue
> > >
> > > 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> > > can't be waken up anymore.
> > >
> > > Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> > > single batch.
> > >
> > > Cc: David Jeffery <djeffery@redhat.com>
> > > Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> > > Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> > > Cc: Chengming Zhou <zhouchengming@bytedance.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >
> > I'm sorry for the delay - I was on vacation. I can see the patch got
> > already merged and I'm not strictly against that (although I think Gabriel
> > was experimenting with this exact wakeup scheme and as far as I remember
> > the more eager waking up was causing performance decrease for some
> > configurations). But let me challenge the analysis above a bit. For the
> > sleeper to be added to a waitqueue in step 3), blk_mq_mark_tag_wait() must
> > fail the blk_mq_get_driver_tag() call. Which means that all tags were used
>
> Here only allocating request by blk_mq_get_tag() is involved, and
> getting driver tag isn't involved.
>
> > at that moment. To summarize, anytime we add any new waiter to the
> > waitqueue, all tags are used and thus we should eventually receive enough
> > wakeups to wake all of them. What am I missing?
>
> When running the final retry(__blk_mq_get_tag) before
> sleeping(io_schedule()) in blk_mq_get_tag(), the sleeper has been added to
> wait queue.
>
> So when two completion batch comes, the two may wake up same wq because
> same ->wake_index can be observed from two completion path, and both two
> wake_up_nr() can return > 0 because adding sleeper into wq and wake_up_nr()
> can be interleaved, then 16 completions just wakeup 2 sleepers added to
> same wq.
>
> If the story happens on one wq with >= 8 sleepers, io hang will be
> triggered, if there are another two pending wq.

OK, thanks for explanation! I think I see the problem now.

								Honza
on 7/21/2023 5:57 PM, Ming Lei wrote:
> From: David Jeffery <djeffery@redhat.com>
>
> Current code supposes that it is enough to provide forward progress by just
> waking up one wait queue after one completion batch is done.
>
> Unfortunately this way isn't enough, cause waiter can be added to
> wait queue just after it is woken up.
>
Sorry for missing this. The change looks good and makes the code simpler.
But the issue is not supposed to happen in the real world.
> Follows one example(64 depth, wake_batch is 8)
>
> 1) all 64 tags are active
>
> 2) in each wait queue, there is only one single waiter
>
> 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> queue, then immediately one new sleeper is added to this wait queue
>
> 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> wait queue
>
As we only wait when all tags are consumed (after sbitmap_prepare_to_wait,
we will try sbitmap_queue_get before sleep), there will be 64 active tags
again when any new sleeper is added. Then the mentioned issue should not
exist. If there was no gain with the old way to wake up, this looks good;
otherwise we may need to reconsider this. Wish this helps, thanks!
> 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> can't be waken up anymore.
>
> Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> single batch.
>
> Cc: David Jeffery <djeffery@redhat.com>
> Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: Gabriel Krisman Bertazi <krisman@suse.de>
> Cc: Chengming Zhou <zhouchengming@bytedance.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  lib/sbitmap.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index eff4e42c425a..d0a5081dfd12 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -550,7 +550,7 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
> 
>  static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  {
> -	int i, wake_index;
> +	int i, wake_index, woken;
> 
>  	if (!atomic_read(&sbq->ws_active))
>  		return;
> @@ -567,13 +567,12 @@ static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  		 */
>  		wake_index = sbq_index_inc(wake_index);
> 
> -		/*
> -		 * It is sufficient to wake up at least one waiter to
> -		 * guarantee forward progress.
> -		 */
> -		if (waitqueue_active(&ws->wait) &&
> -		    wake_up_nr(&ws->wait, nr))
> -			break;
> +		if (waitqueue_active(&ws->wait)) {
> +			woken = wake_up_nr(&ws->wait, nr);
> +			if (woken == nr)
> +				break;
> +			nr -= woken;
> +		}
>  	}
> 
>  	if (wake_index != atomic_read(&sbq->wake_index))
>
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index eff4e42c425a..d0a5081dfd12 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -550,7 +550,7 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_min_shallow_depth);
 
 static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
 {
-	int i, wake_index;
+	int i, wake_index, woken;
 
 	if (!atomic_read(&sbq->ws_active))
 		return;
@@ -567,13 +567,12 @@ static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
 		 */
 		wake_index = sbq_index_inc(wake_index);
 
-		/*
-		 * It is sufficient to wake up at least one waiter to
-		 * guarantee forward progress.
-		 */
-		if (waitqueue_active(&ws->wait) &&
-		    wake_up_nr(&ws->wait, nr))
-			break;
+		if (waitqueue_active(&ws->wait)) {
+			woken = wake_up_nr(&ws->wait, nr);
+			if (woken == nr)
+				break;
+			nr -= woken;
+		}
 	}
 
 	if (wake_index != atomic_read(&sbq->wake_index))
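For comparison, here is the same toy model driven by the patched loop: the
unused part of the batch budget nr is carried over to later wait queues until
all of it has been spent. Again this is only a user-space sketch that mirrors
the logic of the new __sbitmap_queue_wake_up() loop, with wake_queue() standing
in for wake_up_nr():

#include <stdio.h>

#define NR_WAIT_QUEUES 8

/* Stand-in for wake_up_nr(): wake at most 'nr' waiters, return how many woke. */
static int wake_queue(int *waiters, int nr)
{
	int woken = (*waiters < nr) ? *waiters : nr;

	*waiters -= woken;
	return woken;
}

/* Mirrors the patched loop: distribute 'nr' wakeups across the queues. */
static void wake_up_batch(int waiters[], int wake_index, int nr)
{
	for (int i = 0; i < NR_WAIT_QUEUES; i++) {
		int idx = (wake_index + i) % NR_WAIT_QUEUES;

		if (waiters[idx]) {
			int woken = wake_queue(&waiters[idx], nr);

			if (woken == nr)
				break;
			nr -= woken;	/* spend the remainder on later queues */
		}
	}
}

int main(void)
{
	/* Same state as the earlier example: 1 waiter in queue 0, 7 in queue 1. */
	int waiters[NR_WAIT_QUEUES] = { 1, 7 };

	wake_up_batch(waiters, 0, 8);	/* one completion batch of 8 */

	for (int i = 0; i < NR_WAIT_QUEUES; i++)
		printf("queue %d: %d waiter(s) left\n", i, waiters[i]);
	return 0;
}

Every queue ends with zero waiters: the single batch wakes the lone waiter in
queue 0 and then the remaining 7 in queue 1, instead of stopping after the
first successful wake_up_nr() call.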