[RFC] blk-mq: fix potential I/O hang caused by batch wakeup

Message ID: 20240520033847.13533-1-yang.yang@vivo.com (mailing list archive)
State: New

Headers:

From: Yang Yang <yang.yang@vivo.com>
To: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: Yang Yang <yang.yang@vivo.com>
Subject: [RFC PATCH] blk-mq: fix potential I/O hang caused by batch wakeup
Date: Mon, 20 May 2024 11:38:46 +0800
Message-Id: <20240520033847.13533-1-yang.yang@vivo.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
Precedence: bulk
MIME-Version: 1.0

Series: [RFC] blk-mq: fix potential I/O hang caused by batch wakeup

Commit Message

YangYang May 20, 2024, 3:38 a.m. UTC

The depth is 62, and the wake_batch is 8. In the following situation,
the task would hang forever.

  t1:                 t2:                          t3:
  blk_mq_get_tag      .                            .
  io_schedule         .                            .
                      elevator_switch              .
                      blk_mq_freeze_queue          .
                      blk_freeze_queue_start       .
                      blk_mq_freeze_queue_wait     .
                                                   blk_mq_submit_bio
                                                   __bio_queue_enter

Fix this issue by waking up all the waiters sleeping on tags after
freezing the queue.

Signed-off-by: Yang Yang <yang.yang@vivo.com>
---
 block/blk-core.c | 2 --
 block/blk-mq.c   | 4 +++-
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Bart Van Assche May 20, 2024, 6:11 p.m. UTC | #1

On 5/19/24 20:38, Yang Yang wrote:
> The depth is 62, and the wake_batch is 8. In the following situation,
> the task would hang forever.
> 
>    t1:                 t2:                          t3:
>    blk_mq_get_tag      .                            .
>    io_schedule         .                            .
>                        elevator_switch              .
>                        blk_mq_freeze_queue          .
>                        blk_freeze_queue_start       .
>                        blk_mq_freeze_queue_wait     .
>                                                     blk_mq_submit_bio
>                                                     __bio_queue_enter
> 
> Fix this issue by waking up all the waiters sleeping on tags after
> freezing the queue.

Shouldn't blk_mq_alloc_request() be mentioned in t1 since that is the function
that calls blk_queue_enter()?

> diff --git a/block/blk-core.c b/block/blk-core.c
> index a16b5abdbbf5..e1eacfad6e5b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -298,8 +298,6 @@ void blk_queue_start_drain(struct request_queue *q)
>   	 * prevent I/O from crossing blk_queue_enter().
>   	 */
>   	blk_freeze_queue_start(q);
> -	if (queue_is_mq(q))
> -		blk_mq_wake_waiters(q);
>   	/* Make blk_queue_enter() reexamine the DYING flag. */
>   	wake_up_all(&q->mq_freeze_wq);
>   }

Why has blk_queue_start_drain() been modified? I don't see any reference
in the patch description to blk_queue_start_drain(). Am I perhaps missing
something?

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4ecb9db62337..9eb3139e713a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -125,8 +125,10 @@ void blk_freeze_queue_start(struct request_queue *q)
>   	if (++q->mq_freeze_depth == 1) {
>   		percpu_ref_kill(&q->q_usage_counter);
>   		mutex_unlock(&q->mq_freeze_lock);
> -		if (queue_is_mq(q))
> +		if (queue_is_mq(q)) {
> +			blk_mq_wake_waiters(q);
>   			blk_mq_run_hw_queues(q, false);
> +		}
>   	} else {
>   		mutex_unlock(&q->mq_freeze_lock);
>   	}

Why would the above change be necessary? If the blk_queue_enter() call
by blk_mq_alloc_request() succeeds and blk_mq_get_tag() calls
io_schedule(), io_schedule() will be woken up indirectly by the
blk_mq_run_hw_queues() call because that call will free one of the tags
that the io_schedule() call is waiting for.

Thanks,

Bart.

YangYang May 21, 2024, 11:25 a.m. UTC | #2

On 2024/5/21 2:11, Bart Van Assche wrote:
> On 5/19/24 20:38, Yang Yang wrote:
>> The depth is 62, and the wake_batch is 8. In the following situation,
>> the task would hang forever.
>>
>>    t1:                 t2:                          t3:
>>    blk_mq_get_tag      .                            .
>>    io_schedule         .                            .
>>                        elevator_switch              .
>>                        blk_mq_freeze_queue          .
>>                        blk_freeze_queue_start       .
>>                        blk_mq_freeze_queue_wait     .
>>                                                     blk_mq_submit_bio
>>                                                     __bio_queue_enter
>>
>> Fix this issue by waking up all the waiters sleeping on tags after
>> freezing the queue.
> 
> Shouldn't blk_mq_alloc_request() be mentioned in t1 since that is the function
> that calls blk_queue_enter()?

  t1:                      t2:                          t3:
  blk_mq_submit_bio        .                            .
  __blk_mq_alloc_requests  .                            .
  blk_mq_get_tag           .                            .
  io_schedule              .                            .
                           elevator_switch              .
                           blk_mq_freeze_queue          .
                           blk_freeze_queue_start       .
                           q->mq_freeze_depth=1         .
                           blk_mq_freeze_queue_wait     .
                                                        blk_mq_submit_bio
                                                        __bio_queue_enter
                                                        wait_event(!q->mq_freeze_depth)

> 
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index a16b5abdbbf5..e1eacfad6e5b 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -298,8 +298,6 @@ void blk_queue_start_drain(struct request_queue *q)
>>        * prevent I/O from crossing blk_queue_enter().
>>        */
>>       blk_freeze_queue_start(q);
>> -    if (queue_is_mq(q))
>> -        blk_mq_wake_waiters(q);
>>       /* Make blk_queue_enter() reexamine the DYING flag. */
>>       wake_up_all(&q->mq_freeze_wq);
>>   }
> 
> Why has blk_queue_start_drain() been modified? I don't see any reference
> in the patch description to blk_queue_start_drain(). Am I perhaps missing
> something?

With this patch, blk_mq_wake_waiters() is already called from
blk_freeze_queue_start(), so I thought the call could be removed from
blk_queue_start_drain().

> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 4ecb9db62337..9eb3139e713a 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -125,8 +125,10 @@ void blk_freeze_queue_start(struct request_queue *q)
>>       if (++q->mq_freeze_depth == 1) {
>>           percpu_ref_kill(&q->q_usage_counter);
>>           mutex_unlock(&q->mq_freeze_lock);
>> -        if (queue_is_mq(q))
>> +        if (queue_is_mq(q)) {
>> +            blk_mq_wake_waiters(q);
>>               blk_mq_run_hw_queues(q, false);
>> +        }
>>       } else {
>>           mutex_unlock(&q->mq_freeze_lock);
>>       }
> 
> Why would the above change be necessary? If the blk_queue_enter() call
> by blk_mq_alloc_request() succeeds and blk_mq_get_tag() calls
> io_schedule(), io_schedule() will be woken up indirectly by the
> blk_mq_run_hw_queues() call because that call will free one of the tags
> that the io_schedule() call is waiting for.

This patch is a workaround. I think the hang is caused by a lost wakeup,
so t1 is still waiting for a tag even after blk_mq_run_hw_queues() has
been called.

     bt = 0xFFFFFF802F9C6790 -> (
     sb = (
       depth = 62,
       shift = 6,
       map_nr = 1,
       round_robin = FALSE,
       map = 0xFFFFFF803BF97000,
       alloc_hint = 0x00000049119B3F4C),
     wake_batch = 6,
     wake_index = (counter = 0),
     ws = 0xFFFFFF803BEBCA00,
     ws_active = (counter = 1),
     min_shallow_depth = 48,
     completion_cnt = (counter = 1),
     wakeup_cnt = (counter = 0))

While analyzing the coredump, I noticed that sbq->completion_cnt is 1, and
I can't figure out why.
blk_mq_put_tag() -> sbitmap_queue_clear() -> sbitmap_queue_wake_up(sbq, 1)
should have been called multiple times, so given that sbq->ws_active is 1,
sbq->completion_cnt should be greater than 1.
I am looking forward to advice from block layer experts.

Thanks.

> 
> Thanks,
> 
> Bart.

diff --git a/block/blk-core.c b/block/blk-core.c
index a16b5abdbbf5..e1eacfad6e5b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -298,8 +298,6 @@  void blk_queue_start_drain(struct request_queue *q)
 	 * prevent I/O from crossing blk_queue_enter().
 	 */
 	blk_freeze_queue_start(q);
-	if (queue_is_mq(q))
-		blk_mq_wake_waiters(q);
 	/* Make blk_queue_enter() reexamine the DYING flag. */
 	wake_up_all(&q->mq_freeze_wq);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4ecb9db62337..9eb3139e713a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,8 +125,10 @@  void blk_freeze_queue_start(struct request_queue *q)
 	if (++q->mq_freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
 		mutex_unlock(&q->mq_freeze_lock);
-		if (queue_is_mq(q))
+		if (queue_is_mq(q)) {
+			blk_mq_wake_waiters(q);
 			blk_mq_run_hw_queues(q, false);
+		}
 	} else {
 		mutex_unlock(&q->mq_freeze_lock);
 	}
