[v2,0/3] Fix some starvation problems in block layer


Message ID

20240903081653.65613-1-songmuchun@bytedance.com (mailing list archive)

Headers

From: Muchun Song <songmuchun@bytedance.com>
To: axboe@kernel.dk,
	ming.lei@redhat.com,
	yukuai1@huaweicloud.com
Cc: linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	muchun.song@linux.dev,
	Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH v2 0/3] Fix some starvation problems in block layer
Date: Tue,  3 Sep 2024 16:16:50 +0800
Message-Id: <20240903081653.65613-1-songmuchun@bytedance.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Fix some starvation problems in block layer

Message

Muchun Song Sept. 3, 2024, 8:16 a.m. UTC

We encountered a problem on our servers where hundreds of UNINTERRUPTIBLE
processes were all waiting in the WBT wait queue, and the IO hung detector
logged many messages about "blocked for more than 122 seconds". The
call trace is as follows:

    Call Trace:
        __schedule+0x959/0xee0
        schedule+0x40/0xb0
        io_schedule+0x12/0x40
        rq_qos_wait+0xaf/0x140
        wbt_wait+0x92/0xc0
        __rq_qos_throttle+0x20/0x30
        blk_mq_make_request+0x12a/0x5c0
        generic_make_request_nocheck+0x172/0x3f0
        submit_bio+0x42/0x1c0
        ...

The WBT module throttles buffered writeback: it blocks any buffered
writeback IO request until the previous inflight IOs have completed. So I
checked the inflight IO counter. It was one, meaning that one IO request
had been submitted to a downstream layer such as the block core layer or
the device driver (virtio_blk in our case). We needed to figure out why
that inflight IO had not completed in time. I confirmed that all the
virtio ring buffers of virtio_blk were empty and that the hardware dispatch
list held one IO request, so the root cause is not related to the block
device or the virtio_blk driver: the driver never received that IO
request.
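
The stall described above can be pictured with a small model. This is only
a hedged sketch, not the kernel's WBT code: the struct and function names
are hypothetical, and the limit of one inflight request is an assumption
chosen to mirror the counter value observed here.

```c
#include <stdbool.h>

/* Hypothetical toy model of a writeback throttle; not kernel code. */
struct toy_throttle {
	unsigned int inflight; /* requests issued but not yet completed */
	unsigned int limit;    /* assumed cap on inflight requests */
	unsigned int waiters;  /* writers parked in the wait queue */
};

/* Returns true if the writer may proceed, false if it has to wait. */
static bool toy_throttle_try_submit(struct toy_throttle *t)
{
	if (t->inflight < t->limit) {
		t->inflight++;
		return true;
	}
	t->waiters++;
	return false;
}

/*
 * Completion is the only path that releases a waiter in this model.
 * Returns true if a parked writer was woken and took the freed slot.
 */
static bool toy_throttle_complete(struct toy_throttle *t)
{
	t->inflight--;
	if (t->waiters > 0 && t->inflight < t->limit) {
		t->waiters--;
		t->inflight++;
		return true;
	}
	return false;
}
```

In this model, if the single inflight request never reaches
toy_throttle_complete(), every later writer stays parked, which matches
the symptom of many blocked processes behind one undispatched request.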

We know that the block core layer can submit IO requests to the driver
through a kworker (the callback function is blk_mq_run_work_fn). I thought
the kworker might be blocked on some other resource, preventing the
callback from being invoked in time. So I checked all the kworkers and
workqueues and confirmed that there was no pending work on any of them.

Putting all the findings together, the problem should be that the block
core layer missed a chance to submit that IO request. After investigating
the code, I found some scenarios that could cause the problem.

Changes in v2:
  - Collect RB tag from Ming Lei.
  - Use the barrier-less approach suggested by Ming Lei to fix the
    QUEUE_FLAG_QUIESCED ordering problem.
  - Apply new approach to fix BLK_MQ_S_STOPPED ordering for easier maintenance.
  - Add Fixes tag to each patch.
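
To illustrate the kind of ordering problem mentioned in the list above,
here is a hedged userspace sketch using C11 atomics. It is not the code
from this series: every name is hypothetical, and the tiny spinlock only
stands in for whatever lock the real fix relies on. The idea shown is
that a submitter which sees a stale "quiesced" flag re-checks it under
the same lock the unquiesce path holds while clearing the flag, so no
explicit memory barrier is needed on the fast path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct toy_queue {
	atomic_flag lock;      /* tiny spinlock standing in for a queue lock */
	atomic_bool quiesced;  /* stands in for a "quiesced" queue flag */
	atomic_int pending;    /* requests waiting on the dispatch list */
	atomic_int dispatched; /* requests handed to the "driver" */
};

static void toy_queue_init(struct toy_queue *q, bool quiesced)
{
	atomic_flag_clear(&q->lock);
	atomic_init(&q->quiesced, quiesced);
	atomic_init(&q->pending, 0);
	atomic_init(&q->dispatched, 0);
}

static void toy_lock(struct toy_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the lock is free */
}

static void toy_unlock(struct toy_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Relaxed loads on purpose: ordering comes from the lock, not from here. */
static bool toy_need_run(struct toy_queue *q)
{
	return !atomic_load_explicit(&q->quiesced, memory_order_relaxed) &&
	       atomic_load_explicit(&q->pending, memory_order_relaxed) > 0;
}

/* Move everything that is pending over to the "driver". */
static void toy_dispatch(struct toy_queue *q)
{
	int n = atomic_exchange_explicit(&q->pending, 0, memory_order_relaxed);

	atomic_fetch_add_explicit(&q->dispatched, n, memory_order_relaxed);
}

/* Submitter: add a request, then try to run the queue. */
static void toy_add_and_run(struct toy_queue *q)
{
	bool need_run;

	atomic_fetch_add_explicit(&q->pending, 1, memory_order_relaxed);
	need_run = toy_need_run(q);
	if (!need_run) {
		/* Slow path: re-check under the lock used by toy_unquiesce(). */
		toy_lock(q);
		need_run = toy_need_run(q);
		toy_unlock(q);
	}
	if (need_run)
		toy_dispatch(q);
}

/* Clear the flag under the lock, then run anything already pending. */
static void toy_unquiesce(struct toy_queue *q)
{
	toy_lock(q);
	atomic_store_explicit(&q->quiesced, false, memory_order_relaxed);
	toy_unlock(q);
	if (toy_need_run(q))
		toy_dispatch(q);
}
```

In this model, either the submitter's locked re-check runs after the flag
has been cleared, so the submitter dispatches, or it runs before, in which
case toy_unquiesce() acquires the lock afterwards and is guaranteed to see
the pending request. Either way the request is not left behind.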

Muchun Song (3):
  block: fix missing dispatching request when queue is started or
    unquiesced
  block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding
    requests
  block: fix ordering between checking BLK_MQ_S_STOPPED and adding
    requests

 block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h | 13 ++++++++++++
 2 files changed, 55 insertions(+), 13 deletions(-)

Comments

Muchun Song Sept. 10, 2024, 2:49 a.m. UTC | #1

Hi Jens Axboe,

May I ask if you have any suggestions for those fixes? Or if they could
be merged?

Thanks,
Muchun
