[RFC] bfq: fix waker_bfqq inconsistency crash

Message ID	20221103013937.603626-1-khazhy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Khazhismel Kumykov <khazhy@chromium.org> To: Paolo Valente <paolo.valente@linaro.org>, Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Khazhismel Kumykov <khazhy@google.com> Subject: [RFC PATCH] bfq: fix waker_bfqq inconsistency crash Date: Wed, 2 Nov 2022 18:39:37 -0700 Message-Id: <20221103013937.603626-1-khazhy@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[RFC] bfq: fix waker_bfqq inconsistency crash \| expand [RFC] bfq: fix waker_bfqq inconsistency crash

Khazhy Kumykov Nov. 3, 2022, 1:39 a.m. UTC

This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
but woken_list_node still being hashed. This would happen when
bfq_init_rq() expects a brand new allocated queue to be returned from
bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
without resetting woken_list_node. Since we can always return oom_bfqq
when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
We must either reset woken_list_node, or avoid setting woken_list at all
for oom_bfqq - opt to do the former.

Crashes would have a stacktrace like:
[160595.656560]  bfq_add_bfqq_busy+0x110/0x1ec
[160595.661142]  bfq_add_request+0x6bc/0x980
[160595.666602]  bfq_insert_request+0x8ec/0x1240
[160595.671762]  bfq_insert_requests+0x58/0x9c
[160595.676420]  blk_mq_sched_insert_request+0x11c/0x198
[160595.682107]  blk_mq_submit_bio+0x270/0x62c
[160595.686759]  __submit_bio_noacct_mq+0xec/0x178
[160595.691926]  submit_bio+0x120/0x184
[160595.695990]  ext4_mpage_readpages+0x77c/0x7c8
[160595.701026]  ext4_readpage+0x60/0xb0
[160595.705158]  filemap_read_page+0x54/0x114
[160595.711961]  filemap_fault+0x228/0x5f4
[160595.716272]  do_read_fault+0xe0/0x1f0
[160595.720487]  do_fault+0x40/0x1c8

Tested by injecting random failures into bfq_get_queue, crashes go away
completely.

Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
---
RFC mainly because it's not clear to me the best policy here - but the
patch is tested and fixes a real crash we started seeing in 5.15

This is following up my ramble over at
https://lore.kernel.org/lkml/CACGdZYLMnfcqwbAXDx+x9vUOMn2cz55oc+8WySBS3J2Xd_q7Lg@mail.gmail.com/

 block/bfq-iosched.c | 5 +++++
 1 file changed, 5 insertions(+)

Yu Kuai Nov. 3, 2022, 2:55 a.m. UTC | #1

Hi,

在 2022/11/03 9:39, Khazhismel Kumykov 写道:
> This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
> but woken_list_node still being hashed. This would happen when
> bfq_init_rq() expects a brand new allocated queue to be returned from

 From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
here...

> bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
> without resetting woken_list_node. Since we can always return oom_bfqq
> when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
> We must either reset woken_list_node, or avoid setting woken_list at all
> for oom_bfqq - opt to do the former.

Once oom_bfqq is used, I think the io is treated as issued from root
group. Hence I don't think it's necessary to set woken_list or
waker_bfqq for oom_bfqq.

Thanks,
Kuai
> 
> Crashes would have a stacktrace like:
> [160595.656560]  bfq_add_bfqq_busy+0x110/0x1ec
> [160595.661142]  bfq_add_request+0x6bc/0x980
> [160595.666602]  bfq_insert_request+0x8ec/0x1240
> [160595.671762]  bfq_insert_requests+0x58/0x9c
> [160595.676420]  blk_mq_sched_insert_request+0x11c/0x198
> [160595.682107]  blk_mq_submit_bio+0x270/0x62c
> [160595.686759]  __submit_bio_noacct_mq+0xec/0x178
> [160595.691926]  submit_bio+0x120/0x184
> [160595.695990]  ext4_mpage_readpages+0x77c/0x7c8
> [160595.701026]  ext4_readpage+0x60/0xb0
> [160595.705158]  filemap_read_page+0x54/0x114
> [160595.711961]  filemap_fault+0x228/0x5f4
> [160595.716272]  do_read_fault+0xe0/0x1f0
> [160595.720487]  do_fault+0x40/0x1c8
> 
> Tested by injecting random failures into bfq_get_queue, crashes go away
> completely.
> 
> Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers")
> Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> ---
> RFC mainly because it's not clear to me the best policy here - but the
> patch is tested and fixes a real crash we started seeing in 5.15
> 
> This is following up my ramble over at
> https://lore.kernel.org/lkml/CACGdZYLMnfcqwbAXDx+x9vUOMn2cz55oc+8WySBS3J2Xd_q7Lg@mail.gmail.com/
> 
>   block/bfq-iosched.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 7ea427817f7f..5d2861119d20 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -6793,7 +6793,12 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
>   				 * reset. So insert new_bfqq into the
>   				 * woken_list of the waker. See
>   				 * bfq_check_waker for details.
> +				 *
> +				 * Also, if we got oom_bfqq, we must check if
> +				 * it's already in a woken_list
>   				 */
> +				if (unlikely(!hlist_unhashed(&bfqq->woken_list_node)))
> +					hlist_del_init(&bfqq->woken_list_node);
>   				if (bfqq->waker_bfqq)
>   					hlist_add_head(&bfqq->woken_list_node,
>   						       &bfqq->waker_bfqq->woken_list);
>

Khazhy Kumykov Nov. 3, 2022, 3:05 a.m. UTC | #2

On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2022/11/03 9:39, Khazhismel Kumykov 写道:
> > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
> > but woken_list_node still being hashed. This would happen when
> > bfq_init_rq() expects a brand new allocated queue to be returned from
>
>  From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
> 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
> from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
> here...
There's two calls for bfq_get_bfqq_handle_split in this function - the
second one is after the check you mentioned, and is the problematic
one.
>
> > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
> > without resetting woken_list_node. Since we can always return oom_bfqq
> > when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
> > We must either reset woken_list_node, or avoid setting woken_list at all
> > for oom_bfqq - opt to do the former.
>
> Once oom_bfqq is used, I think the io is treated as issued from root
> group. Hence I don't think it's necessary to set woken_list or
> waker_bfqq for oom_bfqq.
Ack, I was wondering what's right here since, evidently, *someone* had
already set oom_bfqq->waker_bfqq to *something* (although... maybe it
was an earlier init_rq). But maybe it's better to do nothing if we
*know* it's oom_bfqq.

Is it a correct interpretation here that setting waker_bfqq won't
accomplish anything anyways on oom_bfqq, so better off not?

>
> Thanks,
> Kuai

Yu Kuai Nov. 3, 2022, 3:51 a.m. UTC | #3

Hi,

在 2022/11/03 11:05, Khazhy Kumykov 写道:
> On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2022/11/03 9:39, Khazhismel Kumykov 写道:
>>> This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
>>> but woken_list_node still being hashed. This would happen when
>>> bfq_init_rq() expects a brand new allocated queue to be returned from
>>
>>   From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
>> 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
>> from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
>> here...
> There's two calls for bfq_get_bfqq_handle_split in this function - the
> second one is after the check you mentioned, and is the problematic
> one.
Yes, thanks for the explanation. Now I understand how the problem
triggers.

>>
>>> bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
>>> without resetting woken_list_node. Since we can always return oom_bfqq
>>> when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
>>> We must either reset woken_list_node, or avoid setting woken_list at all
>>> for oom_bfqq - opt to do the former.
>>
>> Once oom_bfqq is used, I think the io is treated as issued from root
>> group. Hence I don't think it's necessary to set woken_list or
>> waker_bfqq for oom_bfqq.
> Ack, I was wondering what's right here since, evidently, *someone* had
> already set oom_bfqq->waker_bfqq to *something* (although... maybe it
> was an earlier init_rq). But maybe it's better to do nothing if we
> *know* it's oom_bfqq.

I need to have a check how oom_bfqq get involved with waker_bfqq, and
then see if it's reasonable.

Probably Jan and Paolo will have better view on this.

Thanks,
Kuai
> 
> Is it a correct interpretation here that setting waker_bfqq won't
> accomplish anything anyways on oom_bfqq, so better off not?

Jan Kara Nov. 3, 2022, 8:47 a.m. UTC | #4

On Thu 03-11-22 11:51:15, Yu Kuai wrote:
> Hi,
> 
> 在 2022/11/03 11:05, Khazhy Kumykov 写道:
> > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > > 
> > > Hi,
> > > 
> > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道:
> > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
> > > > but woken_list_node still being hashed. This would happen when
> > > > bfq_init_rq() expects a brand new allocated queue to be returned from
> > > 
> > >   From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
> > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
> > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
> > > here...
> > There's two calls for bfq_get_bfqq_handle_split in this function - the
> > second one is after the check you mentioned, and is the problematic
> > one.
> Yes, thanks for the explanation. Now I understand how the problem
> triggers.
> 
> > > 
> > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
> > > > without resetting woken_list_node. Since we can always return oom_bfqq
> > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
> > > > We must either reset woken_list_node, or avoid setting woken_list at all
> > > > for oom_bfqq - opt to do the former.
> > > 
> > > Once oom_bfqq is used, I think the io is treated as issued from root
> > > group. Hence I don't think it's necessary to set woken_list or
> > > waker_bfqq for oom_bfqq.
> > Ack, I was wondering what's right here since, evidently, *someone* had
> > already set oom_bfqq->waker_bfqq to *something* (although... maybe it
> > was an earlier init_rq). But maybe it's better to do nothing if we
> > *know* it's oom_bfqq.
> 
> I need to have a check how oom_bfqq get involved with waker_bfqq, and
> then see if it's reasonable.
> 
> Probably Jan and Paolo will have better view on this.

Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The
oom_bfqq is just a fallback bfqq and as such it should be extempted from
all special handling like waker detection etc. All this stuff is just for
optimizing performance and when we are OOM, we have far larger troubles
than to optimize performance.

So how I think we should really fix this is that we extempt oom_bfqq from
waker detection in bfq_check_waker() by adding:

	bfqq == bfqd->oom_bfqq ||
 	bfqd->last_completed_rq_bfq == bfqd->oom_bfqq)

to the initial check and then also if bfq_get_bfqq_handle_split() returns
oom_bfqq we should just skip carrying over the waker information.

								Honza

Khazhy Kumykov Nov. 4, 2022, 9:25 p.m. UTC | #5

On Thu, Nov 3, 2022 at 1:47 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 03-11-22 11:51:15, Yu Kuai wrote:
> > Hi,
> >
> > 在 2022/11/03 11:05, Khazhy Kumykov 写道:
> > > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道:
> > > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
> > > > > but woken_list_node still being hashed. This would happen when
> > > > > bfq_init_rq() expects a brand new allocated queue to be returned from
> > > >
> > > >   From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
> > > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
> > > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
> > > > here...
> > > There's two calls for bfq_get_bfqq_handle_split in this function - the
> > > second one is after the check you mentioned, and is the problematic
> > > one.
> > Yes, thanks for the explanation. Now I understand how the problem
> > triggers.
> >
> > > >
> > > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
> > > > > without resetting woken_list_node. Since we can always return oom_bfqq
> > > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
> > > > > We must either reset woken_list_node, or avoid setting woken_list at all
> > > > > for oom_bfqq - opt to do the former.
> > > >
> > > > Once oom_bfqq is used, I think the io is treated as issued from root
> > > > group. Hence I don't think it's necessary to set woken_list or
> > > > waker_bfqq for oom_bfqq.
> > > Ack, I was wondering what's right here since, evidently, *someone* had
> > > already set oom_bfqq->waker_bfqq to *something* (although... maybe it
> > > was an earlier init_rq). But maybe it's better to do nothing if we
> > > *know* it's oom_bfqq.
> >
> > I need to have a check how oom_bfqq get involved with waker_bfqq, and
> > then see if it's reasonable.
> >
> > Probably Jan and Paolo will have better view on this.
>
> Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The
> oom_bfqq is just a fallback bfqq and as such it should be extempted from
> all special handling like waker detection etc. All this stuff is just for
> optimizing performance and when we are OOM, we have far larger troubles
> than to optimize performance.
>
> So how I think we should really fix this is that we extempt oom_bfqq from
> waker detection in bfq_check_waker() by adding:
>
>         bfqq == bfqd->oom_bfqq ||
>         bfqd->last_completed_rq_bfq == bfqd->oom_bfqq)
>
> to the initial check and then also if bfq_get_bfqq_handle_split() returns
> oom_bfqq we should just skip carrying over the waker information.
Thanks for the tip! I'll send a followup, including your suggestions.

I do have some other questions in this area, if you could help me
understand. Looking again at bfq_init_rq, inside of the !new_queue
section - we call bfq_split_bfqq() to "split" our bfqq, then in the
next line bfq_get_bfqq_handle_split inspects bic_to_bfqq(bic,
is_sync), and if it's NULL, allocates a new queue. However, if we have
an async rq, this call will return the pre-existing async bfqq, as the
call to bfq_split_bfqq() only clears the sync bfqq, ever. The best
understanding I have now is: "bic->bfqq[aync] is never NULL (and we
don't merge async queues) so we'll never reach this !new_queue section
anyways if it's async". Is that accurate?

Thanks,
Khazhy


>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Jan Kara Nov. 7, 2022, 1:26 p.m. UTC | #6

On Fri 04-11-22 14:25:32, Khazhy Kumykov wrote:
> On Thu, Nov 3, 2022 at 1:47 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 03-11-22 11:51:15, Yu Kuai wrote:
> > > Hi,
> > >
> > > 在 2022/11/03 11:05, Khazhy Kumykov 写道:
> > > > On Wed, Nov 2, 2022 at 7:56 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > 在 2022/11/03 9:39, Khazhismel Kumykov 写道:
> > > > > > This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
> > > > > > but woken_list_node still being hashed. This would happen when
> > > > > > bfq_init_rq() expects a brand new allocated queue to be returned from
> > > > >
> > > > >   From what I see, bfqq->waker_bfqq is updated in bfq_init_rq() only if
> > > > > 'new_queue' is false, but if 'new_queue' is false, the returned 'bfqq'
> > > > > from bfq_get_bfqq_handle_split() will never be oom_bfqq, so I'm confused
> > > > > here...
> > > > There's two calls for bfq_get_bfqq_handle_split in this function - the
> > > > second one is after the check you mentioned, and is the problematic
> > > > one.
> > > Yes, thanks for the explanation. Now I understand how the problem
> > > triggers.
> > >
> > > > >
> > > > > > bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
> > > > > > without resetting woken_list_node. Since we can always return oom_bfqq
> > > > > > when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
> > > > > > We must either reset woken_list_node, or avoid setting woken_list at all
> > > > > > for oom_bfqq - opt to do the former.
> > > > >
> > > > > Once oom_bfqq is used, I think the io is treated as issued from root
> > > > > group. Hence I don't think it's necessary to set woken_list or
> > > > > waker_bfqq for oom_bfqq.
> > > > Ack, I was wondering what's right here since, evidently, *someone* had
> > > > already set oom_bfqq->waker_bfqq to *something* (although... maybe it
> > > > was an earlier init_rq). But maybe it's better to do nothing if we
> > > > *know* it's oom_bfqq.
> > >
> > > I need to have a check how oom_bfqq get involved with waker_bfqq, and
> > > then see if it's reasonable.
> > >
> > > Probably Jan and Paolo will have better view on this.
> >
> > Thanks for the CC Kuai and thanks to Khazy for spotting the bug. The
> > oom_bfqq is just a fallback bfqq and as such it should be extempted from
> > all special handling like waker detection etc. All this stuff is just for
> > optimizing performance and when we are OOM, we have far larger troubles
> > than to optimize performance.
> >
> > So how I think we should really fix this is that we extempt oom_bfqq from
> > waker detection in bfq_check_waker() by adding:
> >
> >         bfqq == bfqd->oom_bfqq ||
> >         bfqd->last_completed_rq_bfq == bfqd->oom_bfqq)
> >
> > to the initial check and then also if bfq_get_bfqq_handle_split() returns
> > oom_bfqq we should just skip carrying over the waker information.
> Thanks for the tip! I'll send a followup, including your suggestions.
> 
> I do have some other questions in this area, if you could help me
> understand. Looking again at bfq_init_rq, inside of the !new_queue
> section - we call bfq_split_bfqq() to "split" our bfqq, then in the
> next line bfq_get_bfqq_handle_split inspects bic_to_bfqq(bic,
> is_sync), and if it's NULL, allocates a new queue. However, if we have
> an async rq, this call will return the pre-existing async bfqq, as the
> call to bfq_split_bfqq() only clears the sync bfqq, ever. The best
> understanding I have now is: "bic->bfqq[aync] is never NULL (and we
> don't merge async queues) so we'll never reach this !new_queue section
> anyways if it's async". Is that accurate?

So you are right that async queues are never merged or split. In fact, if
you have a look at bfq_get_queue(), you'll notice that async queue is
common for all processes with the same ioprio & blkcg. So all these games
with splitting, merging, waker detection etc. impact only sync queues.

								Honza

[RFC] bfq: fix waker_bfqq inconsistency crash

Commit Message

Comments

Patch