Message ID | 20221103013937.603626-1-khazhy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] bfq: fix waker_bfqq inconsistency crash | expand |
Hi, 在 2022/11/03 9:39, Khazhismel Kumykov 写道: > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, > but woken_list_node still being hashed. This would happen when > bfq_init_rq() expects a brand new allocated queue to be returned from From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused here... > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq > without resetting woken_list_node. Since we can always return oom_bfqq > when attempting to allocate, we cannot assume waker_bfqq starts as NULL. > We must either reset woken_list_node, or avoid setting woken_list at all > for oom_bfqq - opt to do the former. Once oom_bfqq is used, I think the io is treated as issued from root group. Hence I don't think it's necessary to set woken_list or waker_bfqq for oom_bfqq. Thanks, Kuai > > Crashes would have a stacktrace like: > [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec > [160595.661142] bfq_add_request+0x6bc/0x980 > [160595.666602] bfq_insert_request+0x8ec/0x1240 > [160595.671762] bfq_insert_requests+0x58/0x9c > [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 > [160595.682107] blk_mq_submit_bio+0x270/0x62c > [160595.686759] __submit_bio_noacct_mq+0xec/0x178 > [160595.691926] submit_bio+0x120/0x184 > [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 > [160595.701026] ext4_readpage+0x60/0xb0 > [160595.705158] filemap_read_page+0x54/0x114 > [160595.711961] filemap_fault+0x228/0x5f4 > [160595.716272] do_read_fault+0xe0/0x1f0 > [160595.720487] do_fault+0x40/0x1c8 > > Tested by injecting random failures into bfq_get_queue, crashes go away > completely. > > Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers") > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > --- > RFC mainly because it's not clear to me the best policy here - but the > patch is tested and fixes a real crash we started seeing in 5.15 > > This is following up my ramble over at > https://lore.kernel.org/lkml/CACGdZYLMnfcqwbAXDx+x9vUOMn2cz55oc+8WySBS3J2Xd_q7Lg@mail.gmail.com/ > > block/bfq-iosched.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index 7ea427817f7f..5d2861119d20 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -6793,7 +6793,12 @@ static struct bfq_queue *bfq_init_rq(struct request *rq) > * reset. So insert new_bfqq into the > * woken_list of the waker. See > * bfq_check_waker for details. > + * > + * Also, if we got oom_bfqq, we must check if > + * it's already in a woken_list > */ > + if (unlikely(!hlist_unhashed(&bfqq->woken_list_node))) > + hlist_del_init(&bfqq->woken_list_node); > if (bfqq->waker_bfqq) > hlist_add_head(&bfqq->woken_list_node, > &bfqq->waker_bfqq->woken_list); >
On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > Hi, > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道: > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, > > but woken_list_node still being hashed. This would happen when > > bfq_init_rq() expects a brand new allocated queue to be returned from > > From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused > here... There's two calls for bfq_get_bfqq_handle_split in this function - the second one is after the check you mentioned, and is the problematic one. > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq > > without resetting woken_list_node. Since we can always return oom_bfqq > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL. > > We must either reset woken_list_node, or avoid setting woken_list at all > > for oom_bfqq - opt to do the former. > > Once oom_bfqq is used, I think the io is treated as issued from root > group. Hence I don't think it's necessary to set woken_list or > waker_bfqq for oom_bfqq. Ack, I was wondering what's right here since, evidently, *someone* had already set oom_bfqq->waker_bfqq to *something* (although... maybe it was an earlier init_rq). But maybe it's better to do nothing if we *know* it's oom_bfqq. Is it a correct interpretation here that setting waker_bfqq won't accomplish anything anyways on oom_bfqq, so better off not? > > Thanks, > Kuai
Hi, 在 2022/11/03 11:05, Khazhy Kumykov 写道: > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: >> >> Hi, >> >> 在 2022/11/03 9:39, Khazhismel Kumykov 写道: >>> This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, >>> but woken_list_node still being hashed. This would happen when >>> bfq_init_rq() expects a brand new allocated queue to be returned from >> >> From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if >> 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' >> from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused >> here... > There's two calls for bfq_get_bfqq_handle_split in this function - the > second one is after the check you mentioned, and is the problematic > one. Yes, thanks for the explanation. Now I understand how the problem triggers. >> >>> bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq >>> without resetting woken_list_node. Since we can always return oom_bfqq >>> when attempting to allocate, we cannot assume waker_bfqq starts as NULL. >>> We must either reset woken_list_node, or avoid setting woken_list at all >>> for oom_bfqq - opt to do the former. >> >> Once oom_bfqq is used, I think the io is treated as issued from root >> group. Hence I don't think it's necessary to set woken_list or >> waker_bfqq for oom_bfqq. > Ack, I was wondering what's right here since, evidently, *someone* had > already set oom_bfqq->waker_bfqq to *something* (although... maybe it > was an earlier init_rq). But maybe it's better to do nothing if we > *know* it's oom_bfqq. I need to have a check how oom_bfqq get involved with waker_bfqq, and then see if it's reasonable. Probably Jan and Paolo will have better view on this. Thanks, Kuai > > Is it a correct interpretation here that setting waker_bfqq won't > accomplish anything anyways on oom_bfqq, so better off not?
On Thu 03-11-22 11:51:15, Yu Kuai wrote: > Hi, > > 在 2022/11/03 11:05, Khazhy Kumykov 写道: > > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > > > > > Hi, > > > > > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道: > > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, > > > > but woken_list_node still being hashed. This would happen when > > > > bfq_init_rq() expects a brand new allocated queue to be returned from > > > > > > From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if > > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' > > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused > > > here... > > There's two calls for bfq_get_bfqq_handle_split in this function - the > > second one is after the check you mentioned, and is the problematic > > one. > Yes, thanks for the explanation. Now I understand how the problem > triggers. > > > > > > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq > > > > without resetting woken_list_node. Since we can always return oom_bfqq > > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL. > > > > We must either reset woken_list_node, or avoid setting woken_list at all > > > > for oom_bfqq - opt to do the former. > > > > > > Once oom_bfqq is used, I think the io is treated as issued from root > > > group. Hence I don't think it's necessary to set woken_list or > > > waker_bfqq for oom_bfqq. > > Ack, I was wondering what's right here since, evidently, *someone* had > > already set oom_bfqq->waker_bfqq to *something* (although... maybe it > > was an earlier init_rq). But maybe it's better to do nothing if we > > *know* it's oom_bfqq. > > I need to have a check how oom_bfqq get involved with waker_bfqq, and > then see if it's reasonable. > > Probably Jan and Paolo will have better view on this. Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The oom_bfqq is just a fallback bfqq and as such it should be extempted from all special handling like waker detection etc. All this stuff is just for optimizing performance and when we are OOM, we have far larger troubles than to optimize performance. So how I think we should really fix this is that we extempt oom_bfqq from waker detection in bfq_check_waker() by adding: bfqq == bfqd->oom_bfqq || bfqd->last_completed_rq_bfq == bfqd->oom_bfqq) to the initial check and then also if bfq_get_bfqq_handle_split() returns oom_bfqq we should just skip carrying over the waker information. Honza
On Thu, Nov 3, 2022 at 1:47 AM Jan Kara <jack@suse.cz> wrote: > > On Thu 03-11-22 11:51:15, Yu Kuai wrote: > > Hi, > > > > 在 2022/11/03 11:05, Khazhy Kumykov 写道: > > > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > > > > > > > Hi, > > > > > > > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道: > > > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, > > > > > but woken_list_node still being hashed. This would happen when > > > > > bfq_init_rq() expects a brand new allocated queue to be returned from > > > > > > > > From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if > > > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' > > > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused > > > > here... > > > There's two calls for bfq_get_bfqq_handle_split in this function - the > > > second one is after the check you mentioned, and is the problematic > > > one. > > Yes, thanks for the explanation. Now I understand how the problem > > triggers. > > > > > > > > > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq > > > > > without resetting woken_list_node. Since we can always return oom_bfqq > > > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL. > > > > > We must either reset woken_list_node, or avoid setting woken_list at all > > > > > for oom_bfqq - opt to do the former. > > > > > > > > Once oom_bfqq is used, I think the io is treated as issued from root > > > > group. Hence I don't think it's necessary to set woken_list or > > > > waker_bfqq for oom_bfqq. > > > Ack, I was wondering what's right here since, evidently, *someone* had > > > already set oom_bfqq->waker_bfqq to *something* (although... maybe it > > > was an earlier init_rq). But maybe it's better to do nothing if we > > > *know* it's oom_bfqq. > > > > I need to have a check how oom_bfqq get involved with waker_bfqq, and > > then see if it's reasonable. > > > > Probably Jan and Paolo will have better view on this. > > Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The > oom_bfqq is just a fallback bfqq and as such it should be extempted from > all special handling like waker detection etc. All this stuff is just for > optimizing performance and when we are OOM, we have far larger troubles > than to optimize performance. > > So how I think we should really fix this is that we extempt oom_bfqq from > waker detection in bfq_check_waker() by adding: > > bfqq == bfqd->oom_bfqq || > bfqd->last_completed_rq_bfq == bfqd->oom_bfqq) > > to the initial check and then also if bfq_get_bfqq_handle_split() returns > oom_bfqq we should just skip carrying over the waker information. Thanks for the tip! I'll send a followup, including your suggestions. I do have some other questions in this area, if you could help me understand. Looking again at bfq_init_rq, inside of the !new_queue section - we call bfq_split_bfqq() to "split" our bfqq, then in the next line bfq_get_bfqq_handle_split inspects bic_to_bfqq(bic, is_sync), and if it's NULL, allocates a new queue. However, if we have an async rq, this call will return the pre-existing async bfqq, as the call to bfq_split_bfqq() only clears the sync bfqq, ever. The best understanding I have now is: "bic->bfqq[aync] is never NULL (and we don't merge async queues) so we'll never reach this !new_queue section anyways if it's async". Is that accurate? Thanks, Khazhy > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Fri 04-11-22 14:25:32, Khazhy Kumykov wrote: > On Thu, Nov 3, 2022 at 1:47 AM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 03-11-22 11:51:15, Yu Kuai wrote: > > > Hi, > > > > > > 在 2022/11/03 11:05, Khazhy Kumykov 写道: > > > > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道: > > > > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, > > > > > > but woken_list_node still being hashed. This would happen when > > > > > > bfq_init_rq() expects a brand new allocated queue to be returned from > > > > > > > > > > From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if > > > > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq' > > > > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused > > > > > here... > > > > There's two calls for bfq_get_bfqq_handle_split in this function - the > > > > second one is after the check you mentioned, and is the problematic > > > > one. > > > Yes, thanks for the explanation. Now I understand how the problem > > > triggers. > > > > > > > > > > > > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq > > > > > > without resetting woken_list_node. Since we can always return oom_bfqq > > > > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL. > > > > > > We must either reset woken_list_node, or avoid setting woken_list at all > > > > > > for oom_bfqq - opt to do the former. > > > > > > > > > > Once oom_bfqq is used, I think the io is treated as issued from root > > > > > group. Hence I don't think it's necessary to set woken_list or > > > > > waker_bfqq for oom_bfqq. > > > > Ack, I was wondering what's right here since, evidently, *someone* had > > > > already set oom_bfqq->waker_bfqq to *something* (although... maybe it > > > > was an earlier init_rq). But maybe it's better to do nothing if we > > > > *know* it's oom_bfqq. > > > > > > I need to have a check how oom_bfqq get involved with waker_bfqq, and > > > then see if it's reasonable. > > > > > > Probably Jan and Paolo will have better view on this. > > > > Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The > > oom_bfqq is just a fallback bfqq and as such it should be extempted from > > all special handling like waker detection etc. All this stuff is just for > > optimizing performance and when we are OOM, we have far larger troubles > > than to optimize performance. > > > > So how I think we should really fix this is that we extempt oom_bfqq from > > waker detection in bfq_check_waker() by adding: > > > > bfqq == bfqd->oom_bfqq || > > bfqd->last_completed_rq_bfq == bfqd->oom_bfqq) > > > > to the initial check and then also if bfq_get_bfqq_handle_split() returns > > oom_bfqq we should just skip carrying over the waker information. > Thanks for the tip! I'll send a followup, including your suggestions. > > I do have some other questions in this area, if you could help me > understand. Looking again at bfq_init_rq, inside of the !new_queue > section - we call bfq_split_bfqq() to "split" our bfqq, then in the > next line bfq_get_bfqq_handle_split inspects bic_to_bfqq(bic, > is_sync), and if it's NULL, allocates a new queue. However, if we have > an async rq, this call will return the pre-existing async bfqq, as the > call to bfq_split_bfqq() only clears the sync bfqq, ever. The best > understanding I have now is: "bic->bfqq[aync] is never NULL (and we > don't merge async queues) so we'll never reach this !new_queue section > anyways if it's async". Is that accurate? So you are right that async queues are never merged or split. In fact, if you have a look at bfq_get_queue(), you'll notice that async queue is common for all processes with the same ioprio & blkcg. So all these games with splitting, merging, waker detection etc. impact only sync queues. Honza
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 7ea427817f7f..5d2861119d20 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -6793,7 +6793,12 @@ static struct bfq_queue *bfq_init_rq(struct request *rq) * reset. So insert new_bfqq into the * woken_list of the waker. See * bfq_check_waker for details. + * + * Also, if we got oom_bfqq, we must check if + * it's already in a woken_list */ + if (unlikely(!hlist_unhashed(&bfqq->woken_list_node))) + hlist_del_init(&bfqq->woken_list_node); if (bfqq->waker_bfqq) hlist_add_head(&bfqq->woken_list_node, &bfqq->waker_bfqq->woken_list);
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. We must either reset woken_list_node, or avoid setting woken_list at all for oom_bfqq - opt to do the former. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> --- RFC mainly because it's not clear to me the best policy here - but the patch is tested and fixes a real crash we started seeing in 5.15 This is following up my ramble over at https://lore.kernel.org/lkml/CACGdZYLMnfcqwbAXDx+x9vUOMn2cz55oc+8WySBS3J2Xd_q7Lg@mail.gmail.com/ block/bfq-iosched.c | 5 +++++ 1 file changed, 5 insertions(+)