From patchwork Mon Oct 30 22:42:04 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Bart Van Assche X-Patchwork-Id: 10033383 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 196606039A for ; Mon, 30 Oct 2017 22:42:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0AC7F2898D for ; Mon, 30 Oct 2017 22:42:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F3A552898F; Mon, 30 Oct 2017 22:42:23 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B2F8728991 for ; Mon, 30 Oct 2017 22:42:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752989AbdJ3WmU (ORCPT ); Mon, 30 Oct 2017 18:42:20 -0400 Received: from esa4.hgst.iphmx.com ([216.71.154.42]:50774 "EHLO esa4.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752748AbdJ3WmL (ORCPT ); Mon, 30 Oct 2017 18:42:11 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1509403332; x=1540939332; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=RKhS6vD9T5OA8OkuSKkp2Lp5ppMX2AX/1cUt2prRFDQ=; b=CRUzUVlo31+ackEEiUpi4E0SAGUX0QvmgmKBCI1A8e69W3BcGTcdlPTe CJ7EoOL8BIbs9bG5uoCkz6Cmi6hQ8HCbCKd4GrLtYFICOuVoHsmxadDxV Id23dKAWU0H4TveXBxRBXxbudBZcNoq1wZTccSfDko2SoM41mS0cqVMoG F5CiDAeSeulT5dV/J9xE3XMJ/61NdZECaTGGp2b+H5LKMC5P++HWjaPp+ hmBi+NSZb5+xVbBMJK4CGR88P4z8c0imzZX/FoKJxMS0t7YQ4XMH7RnVL JAOhHh1/Hpe5lkTM7HDCBnYp51RG+2DiAovvv8bn2U+3pixLoZwH8PPIr Q==; X-IronPort-AV: E=Sophos;i="5.44,321,1505750400"; d="scan'208";a="60181984" Received: from h199-255-45-14.hgst.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 31 Oct 2017 06:42:09 +0800 Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP; 30 Oct 2017 15:40:17 -0700 Received: from unknown (HELO MILHUBIP04.sdcorp.global.sandisk.com) ([10.177.9.97]) by uls-op-cesaip02.wdc.com with ESMTP; 30 Oct 2017 15:42:08 -0700 Received: from milsmgip11.sandisk.com (10.177.8.100) by MILHUBIP04.sdcorp.global.sandisk.com (10.177.9.97) with Microsoft SMTP Server id 14.3.319.2; Mon, 30 Oct 2017 15:42:07 -0700 X-AuditID: 0ab10959-41c5898000002c97-96-59f7aabf1d70 Received: from thinkpad-bart.int.fusionio.com ( [10.177.8.100]) by (Symantec Messaging Gateway) with SMTP id 43.BD.11415.FBAA7F95; Mon, 30 Oct 2017 15:42:07 -0700 (PDT) From: Bart Van Assche To: Jens Axboe CC: , , "Christoph Hellwig" , "Martin K . Petersen" , Oleksandr Natalenko , Ming Lei , Martin Steigerwald , "Bart Van Assche" , Johannes Thumshirn Subject: [PATCH v11 6/7] block, scsi: Make SCSI quiesce and resume work reliably Date: Mon, 30 Oct 2017 15:42:04 -0700 Message-ID: <20171030224205.25212-7-bart.vanassche@wdc.com> X-Mailer: git-send-email 2.14.2 In-Reply-To: <20171030224205.25212-1-bart.vanassche@wdc.com> References: <20171030224205.25212-1-bart.vanassche@wdc.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrHJMWRmVeSWpSXmKPExsXCtZEjRXf/qu+RBhNX6VmsvtvPZnHpzxdG i5WrjzJZPO6awWqx95a2Rff1HWwWy4//Y7JYMquZyeLQZCCx8MUmZgcuj8tnSz0mrufz2H2z gc3j5SUOj49Pb7F4vN93lc1j8+lqj8+b5DzaD3QzBXBGcdmkpOZklqUW6dslcGXMPH6frWB+ YMWlq4fZGxjfOHUxcnJICJhILPz5j6mLkYtDSGA1o8Til48ZQRJsAnoSp+btYwKxRQQUJHp+ r2QDsZkF3jBJPPwmCWILCwRI/Hz9DqyGRUBV4sfbOWA1vALWEvOu/2CGWCAv8X7BfbCZnAI2 Eos2fWcHsYWAanZNnM40gZF7ASPDKkax3Myc4tz0zAJDQ73ixLyUzOJsveT83E2M4BDjjNzB +HSi+SFGJg5OqQbGlG9B60Nm78uR2GWv/PhR4ry1d060sFgzX+if92b9Gd9X5xhu6K56MMOg SfFwbEf9x5Nz/AS37Vt8w+aW22lHHjd2Ae6HpbcvTJWWSv3IO+Fp3MOCH1cvBIRs2PIy2PyK g2hGt5qUwNwyNZfUxlXf/QWZjc6Gvdvwo/Nfbf23Uu9vN2c/vv+8WImlOCPRUIu5qDgRAJCP oXPhAQAA MIME-Version: 1.0 Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The contexts from which a SCSI device can be quiesced or resumed are: * Writing into /sys/class/scsi_device/*/device/state. * SCSI parallel (SPI) domain validation. * The SCSI device power management methods. See also scsi_bus_pm_ops. It is essential during suspend and resume that neither the filesystem state nor the filesystem metadata in RAM changes. This is why while the hibernation image is being written or restored that SCSI devices are quiesced. The SCSI core quiesces devices through scsi_device_quiesce() and scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt requests is deferred. This is realized by returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full queue prevents power management requests to be submitted by deferring allocation of non-preempt requests for devices in the quiesced state. This patch has been tested by running the following commands and by verifying that after each resume the fio job was still running: for ((i=0; i<10; i++)); do ( cd /sys/block/md0/md && while true; do [ "$( sync_action sleep 1 done ) & pids=($!) for d in /sys/class/block/sd*[a-z]; do bdev=${d#/sys/class/block/} hcil=$(readlink "$d/device") hcil=${hcil#../../../} echo 4 > "$d/queue/nr_requests" echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth" fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \ --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16 \ --iodepth_batch=1 --thread --loops=$((2**31)) & pids+=($!) done sleep 1 echo "$(date) Hibernating ..." >>hibernate-test-log.txt systemctl hibernate sleep 10 kill "${pids[@]}" echo idle > /sys/block/md0/md/sync_action wait echo "$(date) Done." >>hibernate-test-log.txt done Reported-by: Oleksandr Natalenko References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348). Signed-off-by: Bart Van Assche Tested-by: Martin Steigerwald Reviewed-by: Hannes Reinecke Cc: Martin K. Petersen Cc: Ming Lei Cc: Christoph Hellwig Cc: Johannes Thumshirn --- block/blk-core.c | 43 ++++++++++++++++++++++++++++++++++++------- block/blk-mq.c | 4 ++-- drivers/scsi/scsi_lib.c | 42 ++++++++++++++++++++++++++++++------------ fs/block_dev.c | 4 ++-- include/linux/blkdev.h | 2 +- include/scsi/scsi_device.h | 1 + 6 files changed, 72 insertions(+), 24 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 16ddd52e6408..d4dc10bb01e3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -373,6 +373,7 @@ void blk_clear_preempt_only(struct request_queue *q) spin_lock_irqsave(q->queue_lock, flags); queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + wake_up_all(&q->mq_freeze_wq); spin_unlock_irqrestore(q->queue_lock, flags); } EXPORT_SYMBOL_GPL(blk_clear_preempt_only); @@ -794,15 +795,41 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); -int blk_queue_enter(struct request_queue *q, bool nowait) +/** + * blk_queue_enter() - try to increase q->q_usage_counter + * @q: request queue pointer + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT + */ +int blk_queue_enter(struct request_queue *q, unsigned int flags) { + const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + while (true) { + bool success = false; int ret; - if (percpu_ref_tryget_live(&q->q_usage_counter)) + rcu_read_lock_sched(); + if (percpu_ref_tryget_live(&q->q_usage_counter)) { + /* + * The code that sets the PREEMPT_ONLY flag is + * responsible for ensuring that that flag is globally + * visible before the queue is unfrozen. + */ + if (preempt || !blk_queue_preempt_only(q)) { + success = true; + } else { + percpu_ref_put(&q->q_usage_counter); + WARN_ONCE(true, + "%s: Attempt to allocate non-preempt request in preempt-only mode.\n", + kobject_name(q->kobj.parent)); + } + } + rcu_read_unlock_sched(); + + if (success) return 0; - if (nowait) + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; /* @@ -815,7 +842,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait) smp_rmb(); ret = wait_event_interruptible(q->mq_freeze_wq, - !atomic_read(&q->mq_freeze_depth) || + (atomic_read(&q->mq_freeze_depth) == 0 && + (preempt || !blk_queue_preempt_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; @@ -1444,8 +1472,7 @@ static struct request *blk_old_get_request(struct request_queue *q, /* create ioc upfront */ create_io_context(gfp_mask, q->node); - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) || - (op & REQ_NOWAIT)); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); @@ -2266,8 +2293,10 @@ blk_qc_t generic_make_request(struct bio *bio) current->bio_list = bio_list_on_stack; do { struct request_queue *q = bio->bi_disk->queue; + unsigned int flags = bio->bi_opf & REQ_NOWAIT ? + BLK_MQ_REQ_NOWAIT : 0; - if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) { + if (likely(blk_queue_enter(q, flags) == 0)) { struct bio_list lower, same; /* Create a fresh bio_list for all subordinate requests */ diff --git a/block/blk-mq.c b/block/blk-mq.c index 6a025b17caac..c6bff60e6b8b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -386,7 +386,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *rq; int ret; - ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); @@ -425,7 +425,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, if (hctx_idx >= q->nr_hw_queues) return ERR_PTR(-EIO); - ret = blk_queue_enter(q, true); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 7c119696402c..d85b7941b988 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2955,21 +2955,37 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev) int scsi_device_quiesce(struct scsi_device *sdev) { + struct request_queue *q = sdev->request_queue; int err; + /* + * It is allowed to call scsi_device_quiesce() multiple times from + * the same context but concurrent scsi_device_quiesce() calls are + * not allowed. + */ + WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current); + + blk_set_preempt_only(q); + + blk_mq_freeze_queue(q); + /* + * Ensure that the effect of blk_set_preempt_only() will be visible + * for percpu_ref_tryget() callers that occur after the queue + * unfreeze even if the queue was already frozen before this function + * was called. See also https://lwn.net/Articles/573497/. + */ + synchronize_rcu(); + blk_mq_unfreeze_queue(q); + mutex_lock(&sdev->state_mutex); err = scsi_device_set_state(sdev, SDEV_QUIESCE); + if (err == 0) + sdev->quiesced_by = current; + else + blk_clear_preempt_only(q); mutex_unlock(&sdev->state_mutex); - if (err) - return err; - - scsi_run_queue(sdev->request_queue); - while (atomic_read(&sdev->device_busy)) { - msleep_interruptible(200); - scsi_run_queue(sdev->request_queue); - } - return 0; + return err; } EXPORT_SYMBOL(scsi_device_quiesce); @@ -2989,9 +3005,11 @@ void scsi_device_resume(struct scsi_device *sdev) * device deleted during suspend) */ mutex_lock(&sdev->state_mutex); - if (sdev->sdev_state == SDEV_QUIESCE && - scsi_device_set_state(sdev, SDEV_RUNNING) == 0) - scsi_run_queue(sdev->request_queue); + WARN_ON_ONCE(!sdev->quiesced_by); + sdev->quiesced_by = NULL; + blk_clear_preempt_only(sdev->request_queue); + if (sdev->sdev_state == SDEV_QUIESCE) + scsi_device_set_state(sdev, SDEV_RUNNING); mutex_unlock(&sdev->state_mutex); } EXPORT_SYMBOL(scsi_device_resume); diff --git a/fs/block_dev.c b/fs/block_dev.c index 07ddccd17801..c5363186618b 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -662,7 +662,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return result; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false); @@ -698,7 +698,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return -EOPNOTSUPP; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 864ad2e4a58c..4f91c6462752 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -956,7 +956,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, struct scsi_ioctl_command __user *); -extern int blk_queue_enter(struct request_queue *q, bool nowait); +extern int blk_queue_enter(struct request_queue *q, unsigned int flags); extern void blk_queue_exit(struct request_queue *q); extern void blk_start_queue(struct request_queue *q); extern void blk_start_queue_async(struct request_queue *q); diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 82e93ee94708..6f0f1e242e23 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -219,6 +219,7 @@ struct scsi_device { unsigned char access_state; struct mutex state_mutex; enum scsi_device_state sdev_state; + struct task_struct *quiesced_by; unsigned long sdev_data[0]; } __attribute__((aligned(sizeof(unsigned long))));