From patchwork Fri Sep 23 06:14:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986158 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE4B8C6FA82 for ; Fri, 23 Sep 2022 06:15:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229624AbiIWGP5 (ORCPT ); Fri, 23 Sep 2022 02:15:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229641AbiIWGPy (ORCPT ); Fri, 23 Sep 2022 02:15:54 -0400 Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 05D91F9611; Thu, 22 Sep 2022 23:15:52 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046059;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS0d6_1663913745; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS0d6_1663913745) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:15:50 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 1/7] ublk_drv: check 'current' instead of 'ubq_daemon' Date: Fri, 23 Sep 2022 14:14:59 +0800 Message-Id: <20220923061505.52007-2-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This check is not atomic. So with recovery feature, ubq_daemon may be modified simultaneously by recovery task. Instead, check 'current' is safe here because 'current' never changes. Also add comment explaining this check, which is really important for understanding recovery feature. Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 6a4a94b4cdf4..c39b67d7133d 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req) struct ublk_device *ub = ubq->dev; int tag = req->tag; struct ublk_io *io = &ubq->ios[tag]; - bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq); unsigned int mapped_bytes; pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, ublk_get_iod(ubq, req->tag)->addr); - if (unlikely(task_exiting)) { + /* + * Task is exiting if either: + * + * (1) current != ubq_daemon. + * io_uring_cmd_complete_in_task() tries to run task_work + * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. + * + * (2) current->flags & PF_EXITING. 
+ */ + if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { blk_mq_end_request(req, BLK_STS_IOERR); mod_delayed_work(system_wq, &ub->monitor_work, 0); return; From patchwork Fri Sep 23 06:15:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986159 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 23401ECAAD8 for ; Fri, 23 Sep 2022 06:16:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230007AbiIWGQL (ORCPT ); Fri, 23 Sep 2022 02:16:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57220 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230033AbiIWGQJ (ORCPT ); Fri, 23 Sep 2022 02:16:09 -0400 Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A614A12387A; Thu, 22 Sep 2022 23:16:03 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R191e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046059;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS0gF_1663913750; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS0gF_1663913750) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:16:00 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 2/7] ublk_drv: define macros for recovery feature and check them Date: Fri, 23 Sep 2022 14:15:00 +0800 Message-Id: <20220923061505.52007-3-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Define some macros for recovery feature. UBLK_S_DEV_QUIESCED implies that ublk_device is quiesced and is ready for recovery. This state can be observed by userspace. UBLK_F_USER_RECOVERY implies that: (1) ublk_drv enables recovery feature. It won't let monitor_work to automatically abort rqs and release the device. (2) With a dying ubq_daemon, ublk_drv ends(aborts) rqs issued to userspace(ublksrv) before crash. (3) With a dying ubq_daemon, in task work and ublk_queue_rq(), ublk_drv requeues rqs. 
Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 18 +++++++++++++++++- include/uapi/linux/ublk_cmd.h | 3 +++ 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index c39b67d7133d..05bfbaa49696 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -49,7 +49,8 @@ /* All UBLK_F_* have to be included into UBLK_F_ALL */ #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \ | UBLK_F_URING_CMD_COMP_IN_TASK \ - | UBLK_F_NEED_GET_DATA) + | UBLK_F_NEED_GET_DATA \ + | UBLK_F_USER_RECOVERY) /* All UBLK_PARAM_TYPE_* should be included here */ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) @@ -323,6 +324,21 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id) PAGE_SIZE); } +static inline bool ublk_queue_can_use_recovery( + struct ublk_queue *ubq) +{ + if (ubq->flags & UBLK_F_USER_RECOVERY) + return true; + return false; +} + +static inline bool ublk_can_use_recovery(struct ublk_device *ub) +{ + if (ub->dev_info.flags & UBLK_F_USER_RECOVERY) + return true; + return false; +} + static void ublk_free_disk(struct gendisk *disk) { struct ublk_device *ub = disk->private_data; diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 677edaab2b66..340ff14bde49 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -74,9 +74,12 @@ */ #define UBLK_F_NEED_GET_DATA (1UL << 2) +#define UBLK_F_USER_RECOVERY (1UL << 3) + /* device state */ #define UBLK_S_DEV_DEAD 0 #define UBLK_S_DEV_LIVE 1 +#define UBLK_S_DEV_QUIESCED 2 /* shipped via sqe->cmd of io_uring command */ struct ublksrv_ctrl_cmd { From patchwork Fri Sep 23 06:15:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986160 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 471A8ECAAD8 for ; Fri, 23 Sep 2022 06:16:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230043AbiIWGQf (ORCPT ); Fri, 23 Sep 2022 02:16:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58092 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230049AbiIWGQY (ORCPT ); Fri, 23 Sep 2022 02:16:24 -0400 Received: from out30-57.freemail.mail.aliyun.com (out30-57.freemail.mail.aliyun.com [115.124.30.57]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A73212578A; Thu, 22 Sep 2022 23:16:17 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R201e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS0lX_1663913761; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS0lX_1663913761) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:16:15 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 3/7] ublk_drv: requeue rqs with recovery feature enabled Date: Fri, 23 Sep 2022 14:15:01 +0800 Message-Id: <20220923061505.52007-4-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: 
<20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org With recovery feature enabled, in ublk_queue_rq or task work (in exit_task_work or fallback wq), we requeue rqs instead of ending(aborting) them. Besides, No matter recovery feature is enabled or disabled, we schedule monitor_work immediately. Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 05bfbaa49696..3ae13e46ece6 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -655,6 +655,16 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res) #define UBLK_REQUEUE_DELAY_MS 3 +static inline void __ublk_abort_rq(struct ublk_queue *ubq, + struct request *rq) +{ + /* We cannot process this rq so just requeue it. */ + if (ublk_queue_can_use_recovery(ubq)) + blk_mq_requeue_request(rq, false); + else + blk_mq_end_request(rq, BLK_STS_IOERR); +} + static inline void __ublk_rq_task_work(struct request *req) { struct ublk_queue *ubq = req->mq_hctx->driver_data; @@ -677,7 +687,7 @@ static inline void __ublk_rq_task_work(struct request *req) * (2) current->flags & PF_EXITING. */ if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { - blk_mq_end_request(req, BLK_STS_IOERR); + __ublk_abort_rq(ubq, req); mod_delayed_work(system_wq, &ub->monitor_work, 0); return; } @@ -769,7 +779,8 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(ubq_daemon_is_dying(ubq))) { fail: mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0); - return BLK_STS_IOERR; + __ublk_abort_rq(ubq, rq); + return BLK_STS_OK; } if (ublk_can_use_task_work(ubq)) { From patchwork Fri Sep 23 06:15:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986161 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 20E06ECAAD8 for ; Fri, 23 Sep 2022 06:16:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229612AbiIWGQu (ORCPT ); Fri, 23 Sep 2022 02:16:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58322 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230041AbiIWGQj (ORCPT ); Fri, 23 Sep 2022 02:16:39 -0400 Received: from out30-131.freemail.mail.aliyun.com (out30-131.freemail.mail.aliyun.com [115.124.30.131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 70739125783; Thu, 22 Sep 2022 23:16:36 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R441e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS0sp_1663913775; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS0sp_1663913775) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:16:34 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 4/7] ublk_drv: consider 
recovery feature in aborting mechanism Date: Fri, 23 Sep 2022 14:15:02 +0800 Message-Id: <20220923061505.52007-5-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org With the USER_RECOVERY feature enabled, the monitor_work schedules quiesce_work after finding a dying ubq_daemon. The monitor_work should also abort all rqs issued to userspace before the ubq_daemon dies. The quiesce_work's job is to: (1) quiesce the request queue. (2) check if there is any INFLIGHT rq. If so, we retry until all these rqs are requeued and become IDLE. These rqs should be requeued by ublk_queue_rq(), task work, io_uring fallback wq or monitor_work. (3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to do so because no ioucmd can be referenced now. (4) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for recovery. This state is exposed to userspace by GET_DEV_INFO. The driver can always handle STOP_DEV and clean up everything no matter whether ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED, the user can recover with a new process. Note: we do not change the default behavior with the recovery feature disabled. monitor_work still schedules stop_work and aborts inflight rqs. And finally ublk_device is released. Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 116 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 110 insertions(+), 6 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 3ae13e46ece6..2a4891c3e5fd 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -120,7 +120,7 @@ struct ublk_queue { unsigned long io_addr; /* mapped vm address */ unsigned int max_io_sz; - bool abort_work_pending; + bool force_abort; unsigned short nr_io_ready; /* how many ios setup */ struct ublk_device *dev; struct ublk_io ios[0]; @@ -162,6 +162,7 @@ struct ublk_device { * monitor each queue's daemon periodically */ struct delayed_work monitor_work; + struct work_struct quiesce_work; struct work_struct stop_work; }; @@ -773,6 +774,17 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, res = ublk_setup_iod(ubq, rq); if (unlikely(res != BLK_STS_OK)) return BLK_STS_IOERR; + /* With recovery feature enabled, force_abort is set in + * ublk_stop_dev() before calling del_gendisk(). We have to + * abort all requeued and new rqs here to let del_gendisk() + * move on. Besides, we must not call io_uring_cmd_complete_in_task() + * to avoid UAF on io_uring ctx. + * + * Note: force_abort is guaranteed to be seen because it is set + * before request queue is unquiesced. + */ + if (ublk_queue_can_use_recovery(ubq) && unlikely(ubq->force_abort)) + return BLK_STS_IOERR; blk_mq_start_request(bd->rq); @@ -967,7 +979,10 @@ static void ublk_daemon_monitor_work(struct work_struct *work) struct ublk_queue *ubq = ublk_get_queue(ub, i); if (ubq_daemon_is_dying(ubq)) { - schedule_work(&ub->stop_work); + if (ublk_queue_can_use_recovery(ubq)) + schedule_work(&ub->quiesce_work); + else + schedule_work(&ub->stop_work); /* abort queue is for making forward progress */ ublk_abort_queue(ub, ubq); @@ -975,12 +990,13 @@ static void ublk_daemon_monitor_work(struct work_struct *work) } /* - * We can't schedule monitor work after ublk_remove() is started.
+ * We can't schedule monitor work after ub's state is not UBLK_S_DEV_LIVE. + * after ublk_remove() or __ublk_quiesce_dev() is started. * * No need ub->mutex, monitor work are canceled after state is marked - * as DEAD, so DEAD state is observed reliably. + * as not LIVE, so new state is observed reliably. */ - if (ub->dev_info.state != UBLK_S_DEV_DEAD) + if (ub->dev_info.state == UBLK_S_DEV_LIVE) schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD); } @@ -1017,12 +1033,97 @@ static void ublk_cancel_dev(struct ublk_device *ub) ublk_cancel_queue(ublk_get_queue(ub, i)); } -static void ublk_stop_dev(struct ublk_device *ub) +static bool ublk_check_inflight_rq(struct request *rq, void *data) +{ + bool *idle = data; + + if (blk_mq_request_started(rq)) { + *idle = false; + return false; + } + return true; +} + +static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub) +{ + bool idle; + + WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue)); + while (true) { + idle = true; + blk_mq_tagset_busy_iter(&ub->tag_set, + ublk_check_inflight_rq, &idle); + if (idle) + break; + msleep(UBLK_REQUEUE_DELAY_MS); + } +} + +static void __ublk_quiesce_dev(struct ublk_device *ub) { + pr_devel("%s: quiesce ub: dev_id %d state %s\n", + __func__, ub->dev_info.dev_id, + ub->dev_info.state == UBLK_S_DEV_LIVE ? + "LIVE" : "QUIESCED"); + blk_mq_quiesce_queue(ub->ub_disk->queue); + ublk_wait_tagset_rqs_idle(ub); + ub->dev_info.state = UBLK_S_DEV_QUIESCED; + ublk_cancel_dev(ub); + /* we are going to release task_struct of ubq_daemon and resets + * ->ubq_daemon to NULL. So in monitor_work, check on ubq_daemon causes UAF. + * Besides, monitor_work is not necessary in QUIESCED state since we have + * already scheduled quiesce_work and quiesced all ubqs. + * + * Do not let monitor_work schedule itself if state it QUIESCED. And we cancel + * it here and re-schedule it in END_USER_RECOVERY to avoid UAF. + */ + cancel_delayed_work_sync(&ub->monitor_work); +} + +static void ublk_quiesce_work_fn(struct work_struct *work) +{ + struct ublk_device *ub = + container_of(work, struct ublk_device, quiesce_work); + mutex_lock(&ub->mutex); if (ub->dev_info.state != UBLK_S_DEV_LIVE) goto unlock; + __ublk_quiesce_dev(ub); + unlock: + mutex_unlock(&ub->mutex); +} +static void ublk_unquiesce_dev(struct ublk_device *ub) +{ + int i; + + pr_devel("%s: unquiesce ub: dev_id %d state %s\n", + __func__, ub->dev_info.dev_id, + ub->dev_info.state == UBLK_S_DEV_LIVE ? + "LIVE" : "QUIESCED"); + /* quiesce_work has run. We let requeued rqs be aborted + * before running fallback_wq. "force_abort" must be seen + * after request queue is unqiuesced. Then del_gendisk() + * can move on. 
+ */ + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) + ublk_get_queue(ub, i)->force_abort = true; + + blk_mq_unquiesce_queue(ub->ub_disk->queue); + /* We may have requeued some rqs in ublk_quiesce_queue() */ + blk_mq_kick_requeue_list(ub->ub_disk->queue); +} + +static void ublk_stop_dev(struct ublk_device *ub) +{ + mutex_lock(&ub->mutex); + if (ub->dev_info.state == UBLK_S_DEV_DEAD) + goto unlock; + if (ublk_can_use_recovery(ub)) { + if (ub->dev_info.state == UBLK_S_DEV_LIVE) + __ublk_quiesce_dev(ub); + ublk_unquiesce_dev(ub); + } del_gendisk(ub->ub_disk); ub->dev_info.state = UBLK_S_DEV_DEAD; ub->dev_info.ublksrv_pid = -1; @@ -1346,6 +1447,7 @@ static void ublk_remove(struct ublk_device *ub) { ublk_stop_dev(ub); cancel_work_sync(&ub->stop_work); + cancel_work_sync(&ub->quiesce_work); cdev_device_del(&ub->cdev, &ub->cdev_dev); put_device(&ub->cdev_dev); } @@ -1522,6 +1624,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd) goto out_unlock; mutex_init(&ub->mutex); spin_lock_init(&ub->mm_lock); + INIT_WORK(&ub->quiesce_work, ublk_quiesce_work_fn); INIT_WORK(&ub->stop_work, ublk_stop_work_fn); INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work); @@ -1642,6 +1745,7 @@ static int ublk_ctrl_stop_dev(struct io_uring_cmd *cmd) ublk_stop_dev(ub); cancel_work_sync(&ub->stop_work); + cancel_work_sync(&ub->quiesce_work); ublk_put_device(ub); return 0; From patchwork Fri Sep 23 06:15:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986162 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5101CC6FA82 for ; Fri, 23 Sep 2022 06:17:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230160AbiIWGRF (ORCPT ); Fri, 23 Sep 2022 02:17:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58048 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230037AbiIWGQv (ORCPT ); Fri, 23 Sep 2022 02:16:51 -0400 Received: from out30-43.freemail.mail.aliyun.com (out30-43.freemail.mail.aliyun.com [115.124.30.43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3885B125795; Thu, 22 Sep 2022 23:16:44 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R881e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046051;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS1.N_1663913795; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS1.N_1663913795) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:16:42 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 5/7] ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE Date: Fri, 23 Sep 2022 14:15:03 +0800 Message-Id: <20220923061505.52007-6-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org UBLK_F_USER_RECOVERY_REISSUE implies that: With a dying ubq_daemon, ublk_drv let monitor_work requeues rq issued to 
userspace(ublksrv) before the ubq_daemon is dying. UBLK_F_USER_RECOVERY_REISSUE is designed for backends which: (1) tolerate double-write since ublk_drv may issue the same rq twice. (2) does not let frontend users get I/O error, such as read-only FS and VM backend. Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 22 ++++++++++++++++++---- include/uapi/linux/ublk_cmd.h | 2 ++ 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 2a4891c3e5fd..3d7b067e0b99 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -50,7 +50,8 @@ #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \ | UBLK_F_URING_CMD_COMP_IN_TASK \ | UBLK_F_NEED_GET_DATA \ - | UBLK_F_USER_RECOVERY) + | UBLK_F_USER_RECOVERY \ + | UBLK_F_USER_RECOVERY_REISSUE) /* All UBLK_PARAM_TYPE_* should be included here */ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) @@ -325,6 +326,15 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id) PAGE_SIZE); } +static inline bool ublk_queue_can_use_recovery_reissue( + struct ublk_queue *ubq) +{ + if ((ubq->flags & UBLK_F_USER_RECOVERY) && + (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE)) + return true; + return false; +} + static inline bool ublk_queue_can_use_recovery( struct ublk_queue *ubq) { @@ -629,13 +639,17 @@ static void ublk_complete_rq(struct request *req) * Also aborting may not be started yet, keep in mind that one failed * request may be issued by block layer again. */ -static void __ublk_fail_req(struct ublk_io *io, struct request *req) +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io, + struct request *req) { WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE); if (!(io->flags & UBLK_IO_FLAG_ABORTED)) { io->flags |= UBLK_IO_FLAG_ABORTED; - blk_mq_end_request(req, BLK_STS_IOERR); + if (ublk_queue_can_use_recovery_reissue(ubq)) + blk_mq_requeue_request(req, false); + else + blk_mq_end_request(req, BLK_STS_IOERR); } } @@ -963,7 +977,7 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq) */ rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i); if (rq) - __ublk_fail_req(io, rq); + __ublk_fail_req(ubq, io, rq); } } ublk_put_device(ub); diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 340ff14bde49..332370628757 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -76,6 +76,8 @@ #define UBLK_F_USER_RECOVERY (1UL << 3) +#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4) + /* device state */ #define UBLK_S_DEV_DEAD 0 #define UBLK_S_DEV_LIVE 1 From patchwork Fri Sep 23 06:15:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986163 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 84CF2C6FA86 for ; Fri, 23 Sep 2022 06:17:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229825AbiIWGRn (ORCPT ); Fri, 23 Sep 2022 02:17:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60850 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229636AbiIWGRm (ORCPT ); Fri, 23 Sep 2022 02:17:42 -0400 Received: from out30-56.freemail.mail.aliyun.com (out30-56.freemail.mail.aliyun.com 
[115.124.30.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 440541176CA; Thu, 22 Sep 2022 23:17:41 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R291e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS12A_1663913802; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS12A_1663913802) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:17:38 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 6/7] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support Date: Fri, 23 Sep 2022 14:15:04 +0800 Message-Id: <20220923061505.52007-7-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org START_USER_RECOVERY and END_USER_RECOVERY are two new control commands to support the user recovery feature. After a crash, the user should send START_USER_RECOVERY, which will: (1) check that (a) the current ublk_device is UBLK_S_DEV_QUIESCED, which was set by quiesce_work, and (b) the chardev is released (2) reinit all ubqs, including: (a) put the task_struct and reset ->ubq_daemon to NULL. (b) reset all ublk_io. (3) reset ub->mm to NULL. Then the user should start a new process and send FETCH_REQ on each ubq_daemon. Finally, the user should send END_USER_RECOVERY, which will: (1) wait until all new ubq_daemons are ready. (2) update ublksrv_pid (3) unquiesce the request queue and expect incoming ublk_queue_rq() (4) convert ub's state to UBLK_S_DEV_LIVE Note: we can handle STOP_DEV between START_USER_RECOVERY and END_USER_RECOVERY. This is helpful to users who cannot start a new process after sending the START_USER_RECOVERY ctrl-cmd.
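For illustration only, a minimal userspace sketch of the sequence described above follows. It shows just the command ordering; ublk_ctrl_cmd() and ublk_queue_init() are hypothetical placeholder helpers (one would send a ublksrv_ctrl_cmd via IORING_OP_URING_CMD on /dev/ublk-control, the other would re-open /dev/ublkcN, mmap the io cmd buffer and issue FETCH_REQ on every tag), not APIs added by this series:

#include <unistd.h>
#include <linux/types.h>
#include <linux/ublk_cmd.h>

/* Hypothetical helpers, not part of this series or of libublksrv:
 * ublk_ctrl_cmd():   send one control command (cmd_op, dev_id, data[0])
 *                    through io_uring on /dev/ublk-control.
 * ublk_queue_init(): open /dev/ublkcN, mmap the io cmd buffer and issue
 *                    UBLK_IO_FETCH_REQ on every tag of queue q_id.
 */
extern int ublk_ctrl_cmd(int ctrl_fd, __u32 cmd_op, __u32 dev_id, __u64 data0);
extern int ublk_queue_init(__u32 dev_id, __u16 q_id);

static int ublk_user_recover(int ctrl_fd, __u32 dev_id, __u16 nr_queues)
{
        __u16 q;
        int ret;

        /* old daemon has exited, /dev/ublkcN is closed, device is QUIESCED */
        ret = ublk_ctrl_cmd(ctrl_fd, UBLK_CMD_START_USER_RECOVERY, dev_id, 0);
        if (ret)
                return ret;

        /* new daemons fetch every io slot again */
        for (q = 0; q < nr_queues; q++) {
                ret = ublk_queue_init(dev_id, q);
                if (ret)
                        return ret;
        }

        /* data[0] carries the new ublksrv_pid; device goes back to LIVE */
        return ublk_ctrl_cmd(ctrl_fd, UBLK_CMD_END_USER_RECOVERY, dev_id, getpid());
}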
Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 116 ++++++++++++++++++++++++++++++++++ include/uapi/linux/ublk_cmd.h | 3 +- 2 files changed, 118 insertions(+), 1 deletion(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 3d7b067e0b99..f31d80f66fd6 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -1862,6 +1862,116 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd) return ret; } +static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) +{ + int i; + + WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))); + /* All old ioucmds have to be completed */ + WARN_ON_ONCE(ubq->nr_io_ready); + /* old daemon is PF_EXITING, put it now */ + put_task_struct(ubq->ubq_daemon); + /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */ + ubq->ubq_daemon = NULL; + + for (i = 0; i < ubq->q_depth; i++) { + struct ublk_io *io = &ubq->ios[i]; + + /* forget everything now and be ready for new FETCH_REQ */ + io->flags = 0; + io->cmd = NULL; + io->addr = 0; + } +} + +static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd) +{ + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd; + struct ublk_device *ub; + int ret = -EINVAL; + int i; + + ub = ublk_get_device_from_id(header->dev_id); + if (!ub) + return ret; + + mutex_lock(&ub->mutex); + if (!ublk_can_use_recovery(ub)) + goto out_unlock; + /* + * START_RECOVERY is only allowed after: + * + * (1) UB_STATE_OPEN is not set, which means the dying process has exited + * and the related io_uring ctx is freed, so the file struct of /dev/ublkcX is + * released. + * + * (2) UBLK_S_DEV_QUIESCED is set, which means the quiesce_work: + * (a) has quiesced the request queue + * (b) has requeued every inflight rq whose io_flags is ACTIVE + * (c) has requeued/aborted every inflight rq whose io_flags is NOT ACTIVE + * (d) has completed/canceled all ioucmds owned by the dying process + */ + if (test_bit(UB_STATE_OPEN, &ub->state) || + ub->dev_info.state != UBLK_S_DEV_QUIESCED) { + ret = -EBUSY; + goto out_unlock; + } + pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id); + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) + ublk_queue_reinit(ub, ublk_get_queue(ub, i)); + /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ + ub->mm = NULL; + ub->nr_queues_ready = 0; + init_completion(&ub->completion); + ret = 0; + out_unlock: + mutex_unlock(&ub->mutex); + ublk_put_device(ub); + return ret; +} + +static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd) +{ + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd; + int ublksrv_pid = (int)header->data[0]; + struct ublk_device *ub; + int ret = -EINVAL; + + ub = ublk_get_device_from_id(header->dev_id); + if (!ub) + return ret; + + pr_devel("%s: Waiting for new ubq_daemons(nr: %d) to be ready, dev id %d...\n", + __func__, ub->dev_info.nr_hw_queues, header->dev_id); + /* wait until new ubq_daemons send all FETCH_REQs */ + wait_for_completion_interruptible(&ub->completion); + pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n", + __func__, ub->dev_info.nr_hw_queues, header->dev_id); + + mutex_lock(&ub->mutex); + if (!ublk_can_use_recovery(ub)) + goto out_unlock; + + if (ub->dev_info.state != UBLK_S_DEV_QUIESCED) { + ret = -EBUSY; + goto out_unlock; + } + ub->dev_info.ublksrv_pid = ublksrv_pid; + pr_devel("%s: new ublksrv_pid %d, dev id %d\n", + __func__, ublksrv_pid, header->dev_id); + blk_mq_unquiesce_queue(ub->ub_disk->queue); +
pr_devel("%s: queue unquiesced, dev id %d.\n", + __func__, header->dev_id); + blk_mq_kick_requeue_list(ub->ub_disk->queue); + ub->dev_info.state = UBLK_S_DEV_LIVE; + schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD); + ret = 0; + out_unlock: + mutex_unlock(&ub->mutex); + ublk_put_device(ub); + return ret; +} + static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) { @@ -1903,6 +2013,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd, case UBLK_CMD_SET_PARAMS: ret = ublk_ctrl_set_params(cmd); break; + case UBLK_CMD_START_USER_RECOVERY: + ret = ublk_ctrl_start_recovery(cmd); + break; + case UBLK_CMD_END_USER_RECOVERY: + ret = ublk_ctrl_end_recovery(cmd); + break; default: break; } diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 332370628757..8f88e3a29998 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -17,7 +17,8 @@ #define UBLK_CMD_STOP_DEV 0x07 #define UBLK_CMD_SET_PARAMS 0x08 #define UBLK_CMD_GET_PARAMS 0x09 - +#define UBLK_CMD_START_USER_RECOVERY 0x10 +#define UBLK_CMD_END_USER_RECOVERY 0x11 /* * IO commands, issued by ublk server, and handled by ublk driver. * From patchwork Fri Sep 23 06:15:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12986164 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A58BECAAD8 for ; Fri, 23 Sep 2022 06:17:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230099AbiIWGRw (ORCPT ); Fri, 23 Sep 2022 02:17:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60976 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230081AbiIWGRs (ORCPT ); Fri, 23 Sep 2022 02:17:48 -0400 Received: from out30-56.freemail.mail.aliyun.com (out30-56.freemail.mail.aliyun.com [115.124.30.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B01C412387A; Thu, 22 Sep 2022 23:17:47 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R101e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VQVS1Vn_1663913859; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VQVS1Vn_1663913859) by smtp.aliyun-inc.com; Fri, 23 Sep 2022 14:17:45 +0800 From: ZiyangZhang To: ming.lei@redhat.com Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RESEND PATCH V5 7/7] Documentation: document ublk user recovery feature Date: Fri, 23 Sep 2022 14:15:05 +0800 Message-Id: <20220923061505.52007-8-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> References: <20220923061505.52007-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add documentation for user recovery feature of ublk subsystem. 
Signed-off-by: ZiyangZhang --- Documentation/block/ublk.rst | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst index 2122d1a4a541..c3dde087e601 100644 --- a/Documentation/block/ublk.rst +++ b/Documentation/block/ublk.rst @@ -144,6 +144,38 @@ managing and controlling ublk devices with help of several control commands: For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's responsibility to save IO target specific info in userspace. +- ``UBLK_CMD_START_USER_RECOVERY`` + + This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. This + command is accepted after the old process has exited, the ublk device is quiesced + and ``/dev/ublkc*`` is closed. The user should send this command before starting + a new process which opens ``/dev/ublkc*``. When this command returns, the + ublk device is ready for the new process. + +- ``UBLK_CMD_END_USER_RECOVERY`` + + This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. This + command is accepted after a new process has opened ``/dev/ublkc*`` and + all ublk queues are ready. When this command returns, the ublk device is + unquiesced and new I/O requests are passed to the new process. + +- user recovery feature description + + Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and + ``UBLK_F_USER_RECOVERY_REISSUE``. + + With ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon (ublksrv io handler) is + dying, ublk does not release ``/dev/ublkc*`` or ``/dev/ublkb*`` but requeues all + inflight requests which have not been issued to userspace. Requests which have + been issued to userspace are aborted. + + With ``UBLK_F_USER_RECOVERY_REISSUE`` set, after one ubq_daemon (ublksrv io + handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``, requests which have been + issued to userspace are requeued and will be re-issued to the new process after + handling ``UBLK_CMD_END_USER_RECOVERY``. ``UBLK_F_USER_RECOVERY_REISSUE`` is + designed for backends that tolerate double writes since the driver may issue the + same I/O request twice. It might be useful to a read-only FS or a VM backend. + Data plane ----------
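As a usage note for the flags documented above: a backend opts into recovery when it creates the device, by setting the flags in ``ublksrv_ctrl_dev_info`` passed with ``UBLK_CMD_ADD_DEV``. The sketch below is illustrative only; ublk_ctrl_add_dev() is a hypothetical helper (it would carry the dev_info buffer via ublksrv_ctrl_cmd.addr/len on /dev/ublk-control), not an API defined by ublk or this series, and dev_id = -1 follows the existing ublk convention of letting the driver pick a device id.

#include <stdbool.h>
#include <string.h>
#include <linux/ublk_cmd.h>

/* Hypothetical helper, not a real ublk API: sends UBLK_CMD_ADD_DEV with
 * 'info' referenced by ublksrv_ctrl_cmd.addr/len on /dev/ublk-control.
 */
extern int ublk_ctrl_add_dev(int ctrl_fd, struct ublksrv_ctrl_dev_info *info);

static int add_recoverable_dev(int ctrl_fd, __u16 nr_queues, __u16 depth,
                               bool reissue)
{
        struct ublksrv_ctrl_dev_info info;

        memset(&info, 0, sizeof(info));
        info.dev_id = (__u32)-1;           /* ask the driver to allocate an id */
        info.nr_hw_queues = nr_queues;
        info.queue_depth = depth;
        info.flags = UBLK_F_USER_RECOVERY; /* keep the device after a daemon crash */
        if (reissue)                       /* backend tolerates double writes */
                info.flags |= UBLK_F_USER_RECOVERY_REISSUE;

        return ublk_ctrl_add_dev(ctrl_fd, &info);
}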