From patchwork Wed Nov 8 22:48:51 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10049597 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id A93BC603FF for ; Wed, 8 Nov 2017 22:48:57 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9A67C2A732 for ; Wed, 8 Nov 2017 22:48:57 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8F7AB2A733; Wed, 8 Nov 2017 22:48:57 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.4 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5BB8F2A717 for ; Wed, 8 Nov 2017 22:48:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753018AbdKHWsz (ORCPT ); Wed, 8 Nov 2017 17:48:55 -0500 Received: from mail-pg0-f68.google.com ([74.125.83.68]:56364 "EHLO mail-pg0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753009AbdKHWsy (ORCPT ); Wed, 8 Nov 2017 17:48:54 -0500 Received: by mail-pg0-f68.google.com with SMTP id m18so3065000pgd.13 for ; Wed, 08 Nov 2017 14:48:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=to:from:subject:cc:message-id:date:user-agent:mime-version :content-language:content-transfer-encoding; bh=1lKuQYCNHNfo2YwkmWdGaXx8noaAoqGchfyZt6M+lJc=; b=KLmruup/bve8ooozVAPwuDjSHnx7TpWlGdzhPOriL4DhsdWnMOw59J6sAA3PCQ1cVT CPsebafXx06/HYwz9aIfFgaCrpA7+Ts1xvr6fZ1V+5n+dHPkL5T3Xkbp5xr778OoetzQ rKes3hif/mi5D6nkRxuth9em6fr4udWgO5RhJEnkOv59qP3OhL6nwyB4gDTzn4tZiASu WQPAp3H6bNpAWie3qph8pjK8BMZybz8vYhZtPyMx4WCFWtSSuZCuPGCRHgw0/nyh5ylY 0g32DyLdsXLhJnOnx7qIClJlwKnWwyqEE5puBuzoZcgL9PJn5+QSBpjoV1U5hFeZd5YN t46g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:from:subject:cc:message-id:date:user-agent :mime-version:content-language:content-transfer-encoding; bh=1lKuQYCNHNfo2YwkmWdGaXx8noaAoqGchfyZt6M+lJc=; b=A1C/K4KObDjVQ/0KH18+ZMGVT4ngripcrMaV+DiA4DGXRh/+xISGj5dIUO4tiCCMyc WzeqZSH35HuDaE2ABtZqERDPDZseDJ8xyy7xEu0Mo5ghqww4JZUNQKRyJudsJBGTTqkO oRrZlHuzwTpcIPkilLNuiLY3MfxBTOBa0wfnUl6P7su/LRYuCQoQp6i8N0BYMxFuZujg OZ2ALBWAYkmr5XQYt0/t4n6E6GD0c0zRnXTOiSLzUyniW+QdADoR7LoHunseuFwCfwD0 +EgqviqOzbVzG+AO395tgeRYtGplVUepgDf3ySY7S0rjnd6cE+4yqBSEpx+xaHfLdA8m dpuA== X-Gm-Message-State: AJaThX4cAaJisHPX5Q4odej/UQErP60lWLXwv0tTDdSsnBJBSKCRGfmP P7UFl/kdFDI0HM7RlfSaog2qAQ== X-Google-Smtp-Source: ABhQp+SSu7HgE9jgXeyns4DDKyYKc7FNfsYNwow8sRQRB5OQ804mG8+DYw1C1lQjl1Wt6OU7bz8xGw== X-Received: by 10.84.143.195 with SMTP id 61mr1784750plz.357.1510181333399; Wed, 08 Nov 2017 14:48:53 -0800 (PST) Received: from [192.168.1.154] ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id 204sm9990101pfu.8.2017.11.08.14.48.52 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 08 Nov 2017 14:48:52 -0800 (PST) To: "linux-block@vger.kernel.org" From: Jens Axboe Subject: [PATCH] blk-mq: fix issue with shared tag queue re-running Cc: Omar Sandoval Message-ID: <98418e6d-2981-0fb7-dcdd-79b635955fcf@kernel.dk> Date: Wed, 8 Nov 2017 15:48:51 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 Content-Language: en-US Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This patch attempts to make the case of hctx re-running on driver tag failure more robust. Without this patch, it's pretty easy to trigger a stall condition with shared tags. An example is using null_blk like this: modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1 which sets up 4 devices, sharing the same tag set with a depth of 1. Running a fio job ala: [global] bs=4k rw=randread norandommap direct=1 ioengine=libaio iodepth=4 [nullb0] filename=/dev/nullb0 [nullb1] filename=/dev/nullb1 [nullb2] filename=/dev/nullb2 [nullb3] filename=/dev/nullb3 will inevitably end with one or more threads being stuck waiting for a scheduler tag. That IO is then stuck forever, until someone else triggers a run of the queue. Ensure that we always re-run the hardware queue, if the driver tag we were waiting for got freed before we added our leftover request entries back on the dispatch list. Signed-off-by: Jens Axboe Reviewed-by: Bart Van Assche Tested-by: Bart Van Assche Reviewed-by: Ming Lei Reviewed-by: Omar Sandoval diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 7f4a1ba532af..bb7f08415203 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = { HCTX_STATE_NAME(STOPPED), HCTX_STATE_NAME(TAG_ACTIVE), HCTX_STATE_NAME(SCHED_RESTART), - HCTX_STATE_NAME(TAG_WAITING), HCTX_STATE_NAME(START_ON_RUN), }; #undef HCTX_STATE_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 3d759bb8a5bb..8dc5db40df9d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, return rq->tag != -1; } -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags, - void *key) +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, + int flags, void *key) { struct blk_mq_hw_ctx *hctx; hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait); - list_del(&wait->entry); - clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state); + list_del_init(&wait->entry); blk_mq_run_hw_queue(hctx, true); return 1; } -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx) +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx, + struct request *rq) { + struct blk_mq_hw_ctx *this_hctx = *hctx; + wait_queue_entry_t *wait = &this_hctx->dispatch_wait; struct sbq_wait_state *ws; + if (!list_empty_careful(&wait->entry)) + return false; + + spin_lock(&this_hctx->lock); + if (!list_empty(&wait->entry)) { + spin_unlock(&this_hctx->lock); + return false; + } + + ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx); + add_wait_queue(&ws->wait, wait); + /* - * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait. - * The thread which wins the race to grab this bit adds the hardware - * queue to the wait queue. + * It's possible that a tag was freed in the window between the + * allocation failure and adding the hardware queue to the wait + * queue. */ - if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) || - test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state)) + if (!blk_mq_get_driver_tag(rq, hctx, false)) { + spin_unlock(&this_hctx->lock); return false; - - init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); - ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx); + } /* - * As soon as this returns, it's no longer safe to fiddle with - * hctx->dispatch_wait, since a completion can wake up the wait queue - * and unlock the bit. + * We got a tag, remove outselves from the wait queue to ensure + * someone else gets the wakeup. */ - add_wait_queue(&ws->wait, &hctx->dispatch_wait); + spin_lock_irq(&ws->wait.lock); + list_del_init(&wait->entry); + spin_unlock_irq(&ws->wait.lock); + spin_unlock(&this_hctx->lock); return true; } bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, - bool got_budget) + bool got_budget) { struct blk_mq_hw_ctx *hctx; struct request *rq, *nxt; + bool no_tag = false; int errors, queued; if (list_empty(list)) @@ -1060,22 +1075,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, if (!blk_mq_get_driver_tag(rq, &hctx, false)) { /* * The initial allocation attempt failed, so we need to - * rerun the hardware queue when a tag is freed. + * rerun the hardware queue when a tag is freed. The + * waitqueue takes care of that. If the queue is run + * before we add this entry back on the dispatch list, + * we'll re-run it below. */ - if (!blk_mq_dispatch_wait_add(hctx)) { - if (got_budget) - blk_mq_put_dispatch_budget(hctx); - break; - } - - /* - * It's possible that a tag was freed in the window - * between the allocation failure and adding the - * hardware queue to the wait queue. - */ - if (!blk_mq_get_driver_tag(rq, &hctx, false)) { + if (!blk_mq_dispatch_wait_add(&hctx, rq)) { if (got_budget) blk_mq_put_dispatch_budget(hctx); + no_tag = true; break; } } @@ -1140,10 +1148,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * it is no longer set that means that it was cleared by another * thread and hence that a queue rerun is needed. * - * If TAG_WAITING is set that means that an I/O scheduler has - * been configured and another thread is waiting for a driver - * tag. To guarantee fairness, do not rerun this hardware queue - * but let the other thread grab the driver tag. + * If 'no_tag' is set, that means that we failed getting + * a driver tag with an I/O scheduler attached. If our dispatch + * waitqueue is no longer active, ensure that we run the queue + * AFTER adding our entries back to the list. * * If no I/O scheduler has been configured it is possible that * the hardware queue got stopped and restarted before requests @@ -1156,7 +1164,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * and dm-rq. */ if (!blk_mq_sched_needs_restart(hctx) && - !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) + (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) blk_mq_run_hw_queue(hctx, true); } @@ -2020,6 +2028,9 @@ static int blk_mq_init_hctx(struct request_queue *q, hctx->nr_ctx = 0; + init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); + INIT_LIST_HEAD(&hctx->dispatch_wait.entry); + if (set->ops->init_hctx && set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) goto free_bitmap; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 674641527da7..4ae987c2352c 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -35,7 +35,7 @@ struct blk_mq_hw_ctx { struct blk_mq_ctx **ctxs; unsigned int nr_ctx; - wait_queue_entry_t dispatch_wait; + wait_queue_entry_t dispatch_wait; atomic_t wait_index; struct blk_mq_tags *tags; @@ -181,8 +181,7 @@ enum { BLK_MQ_S_STOPPED = 0, BLK_MQ_S_TAG_ACTIVE = 1, BLK_MQ_S_SCHED_RESTART = 2, - BLK_MQ_S_TAG_WAITING = 3, - BLK_MQ_S_START_ON_RUN = 4, + BLK_MQ_S_START_ON_RUN = 3, BLK_MQ_MAX_DEPTH = 10240,