From patchwork Sun Dec 8 22:57:55 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13898711 X-Patchwork-Delegate: snitzer@redhat.com Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A570028691; Sun, 8 Dec 2024 22:58:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698704; cv=none; b=n8uV9LfuHeh+IukZJzF0Al5Nlne8bRM+NAyYNAVdhGAh4oEPYoYBFq+//vSibA1zGFqocUej2BgDeBp3/NekS/jJh+zcmy6ODnYet3e4uMeb3GCoV++CH3up/RJgZU6BHXEWi0EOCEfvHvKwVVSFRE46LhtUHSLStB4+j4uzY1Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698704; c=relaxed/simple; bh=YOCj3HMPox14+s+PUTOQS2H9Jof2wHqCsCSTKnotjZM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Q4oxqE5CRYtaa2rFIlQNvGhDE8lT9ILYvKZYt6lClTicd7JxfFjJVONXBDTD9Uxd3hfStYr+A20ykm9No1e3i2p+avdq+mmmtmoxA/ZgO63TCxcFNPoHdOFpZhSDga/6MzrK9o8WxD7R670SlknlNs1ShVBvg338n7oCc7y018E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=P/x5BVwq; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="P/x5BVwq" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1017FC4CEE1; Sun, 8 Dec 2024 22:58:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1733698704; bh=YOCj3HMPox14+s+PUTOQS2H9Jof2wHqCsCSTKnotjZM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=P/x5BVwqKb8AK7QpA3JNEYu9ggTWKcijjovz6O+97/99Ciz3bseK/F3ICkS5OwOQe BL705XH8PIJFnaacO1hc1gv3L5O5VImizsOVECRsV2H6WpWPYHtKhc+H4koxUVXBjA mQiZaFFUzqtlaBMYjbWxtybADc1jkiyvbwlhkm08EYyhQsKf6k5GauqTmPTz/f7TSp cQ5wQPzgbgZ263uu7F05sv8I27yrUPIwF3Gfb7Asxz2jdBnu0hsDOCyaJoOOzXNEW8 ybd0JcsOloRi9nMxhlv3XDF60afecjTWIKV4u16Mg7YgKXf3ytm6djk/TBHHw5xupj 6WklohlYwXTcg== From: Damien Le Moal To: linux-block@vger.kernel.org, Jens Axboe , Mike Snitzer , Mikulas Patocka , dm-devel@lists.linux.dev Cc: Christoph Hellwig , Bart Van Assche Subject: [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Date: Mon, 9 Dec 2024 07:57:55 +0900 Message-ID: <20241208225758.219228-2-dlemoal@kernel.org> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20241208225758.219228-1-dlemoal@kernel.org> References: <20241208225758.219228-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 For zoned block devices, a write BIO issued to a zone that has no on-going writes will be prepared for execution and allowed to execute immediately by blk_zone_wplug_handle_write() (called from blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the allocation of a request for its execution in blk_mq_submit_bio() may fail after blk_zone_plug_bio() completed, marking the target zone of the BIO as plugged. When this BIO is retried later on, it will be blocked as the zone write plug of the target zone is in a plugged state without any on-going write operation (completion of write operations trigger unplugging of the next write BIOs for a zone). This leads to a BIO that is stuck in a zone write plug and never completes, which results in various issues such as hung tasks. Avoid this problem by always executing REQ_NOWAIT write BIOs using the BIO work of a zone write plug. This ensure that we never block the BIO issuer and can thus safely ignore the REQ_NOWAIT flag when executing the BIO from the zone write plug BIO work. Since such BIO may be the first write BIO issued to a zone with no on-going write, modify disk_zone_wplug_add_bio() to schedule the zone write plug BIO work if the write plug is not already marked with the BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary as the completion of the on-going write for the zone will schedule the execution of the next plugged BIOs. blk_zone_wplug_handle_write() is also fixed to better handle zone write plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO using bio_wouldblock_error() instead of bio_io_error(). Reported-by: Bart Van Assche Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig --- block/blk-zoned.c | 62 ++++++++++++++++++++++++++++++++--------------- 1 file changed, 42 insertions(+), 20 deletions(-) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 263e28b72053..7982b9494d63 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -746,9 +746,25 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio) return false; } -static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, - struct bio *bio, unsigned int nr_segs) +static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk, + struct blk_zone_wplug *zwplug) +{ + /* + * Take a reference on the zone write plug and schedule the submission + * of the next plugged BIO. blk_zone_wplug_bio_work() will release the + * reference we take here. + */ + WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)); + refcount_inc(&zwplug->ref); + queue_work(disk->zone_wplugs_wq, &zwplug->bio_work); +} + +static inline void disk_zone_wplug_add_bio(struct gendisk *disk, + struct blk_zone_wplug *zwplug, + struct bio *bio, unsigned int nr_segs) { + bool schedule_bio_work = false; + /* * Grab an extra reference on the BIO request queue usage counter. * This reference will be reused to submit a request for the BIO for @@ -764,6 +780,16 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, */ bio_clear_polled(bio); + /* + * REQ_NOWAIT BIOs are always handled using the zone write plug BIO + * work, which can block. So clear the REQ_NOWAIT flag and schedule the + * work if this is the first BIO we are plugging. + */ + if (bio->bi_opf & REQ_NOWAIT) { + schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED); + bio->bi_opf &= ~REQ_NOWAIT; + } + /* * Reuse the poll cookie field to store the number of segments when * split to the hardware limits. @@ -777,6 +803,11 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, * at the tail of the list to preserve the sequential write order. */ bio_list_add(&zwplug->bio_list, bio); + + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED; + + if (schedule_bio_work) + disk_zone_wplug_schedule_bio_work(disk, zwplug); } /* @@ -970,7 +1001,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) zwplug = disk_get_and_lock_zone_wplug(disk, sector, gfp_mask, &flags); if (!zwplug) { - bio_io_error(bio); + if (bio->bi_opf & REQ_NOWAIT) + bio_wouldblock_error(bio); + else + bio_io_error(bio); return true; } @@ -979,9 +1013,11 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) /* * If the zone is already plugged or has a pending error, add the BIO - * to the plug BIO list. Otherwise, plug and let the BIO execute. + * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that + * we will not see a BLK_STS_AGAIN failure if we let the BIO execute. + * Otherwise, plug and let the BIO execute. */ - if (zwplug->flags & BLK_ZONE_WPLUG_BUSY) + if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT)) goto plug; /* @@ -998,8 +1034,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) return false; plug: - zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED; - blk_zone_wplug_add_bio(zwplug, bio, nr_segs); + disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs); spin_unlock_irqrestore(&zwplug->lock, flags); @@ -1083,19 +1118,6 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs) } EXPORT_SYMBOL_GPL(blk_zone_plug_bio); -static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk, - struct blk_zone_wplug *zwplug) -{ - /* - * Take a reference on the zone write plug and schedule the submission - * of the next plugged BIO. blk_zone_wplug_bio_work() will release the - * reference we take here. - */ - WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)); - refcount_inc(&zwplug->ref); - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work); -} - static void disk_zone_wplug_unplug_bio(struct gendisk *disk, struct blk_zone_wplug *zwplug) { From patchwork Sun Dec 8 22:57:56 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13898712 X-Patchwork-Delegate: snitzer@redhat.com Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2188C28691; Sun, 8 Dec 2024 22:58:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698706; cv=none; b=H0fojRxsaIWEze9BPL366irlLotN6+XKZHWgDRXRQU73kzuEmz7zN1jksIJ9aVvlKEp8TZm8fAPP+Zl0eruIv9wMiJLq/3UFZqyMpcZb5f9/BZw1cidKJpnEf2wZXLUI6+Yp4R4dRUGYOLVjoz7eQsTrmcEIUr02TBUTJmz35o8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698706; c=relaxed/simple; bh=wZBrbIYegrSFuSGYBJZyXvrCOOml4B0ujsdF5S6ivd8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=IGKQciJOcYkQoPLHmTzrj20mI2QFGsFqLN6IBD85b7mXQQjJilS6MZgfg5tSf5YP+me3cwWXCNxbwiAgH0KUyr4bku0nM8nkk9fGX6VE0/+S+F3TJ1wrj1+0INgqmNMx8ZF1IcuRcuSr1nbaJPNTwm35h5gm+q9GOuUQt0WShqs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=pXam0PZr; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="pXam0PZr" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 89BB3C4CED2; Sun, 8 Dec 2024 22:58:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1733698705; bh=wZBrbIYegrSFuSGYBJZyXvrCOOml4B0ujsdF5S6ivd8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=pXam0PZrjNXFbAQuNpxzs8XbxVDdyNfK5PpjdXP03qz9lvSWsSV15ZtYaQjfx3J40 WHrb+jwB8IKI/2HEiHhiqAGPjjvfBSYecKhdRKrlaCcPDGqBhHaMNXcatEmTT2RrNJ Trvz2ggd/ButHFrKv3T7GPjmmNdmWkZ/OAO8DxwFHyuSYLXqZ6XHnRoGmuFv3A2rVI YtYyKZyK9Pnqthawj0xXNz9HfdpJesAc0LkHHTmXqQGBQhNekFr4kZoYwcJ/Pzysjy uz6VqFgTURrUXiPhnNOipiWZthE4vdUTbRX1wLhykLvt8TWMRC6qn8djC2/A4VjSpX nlNlxzLPSjtQw== From: Damien Le Moal To: linux-block@vger.kernel.org, Jens Axboe , Mike Snitzer , Mikulas Patocka , dm-devel@lists.linux.dev Cc: Christoph Hellwig , Bart Van Assche Subject: [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Date: Mon, 9 Dec 2024 07:57:56 +0900 Message-ID: <20241208225758.219228-3-dlemoal@kernel.org> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20241208225758.219228-1-dlemoal@kernel.org> References: <20241208225758.219228-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 There are currently any issuer of REQ_OP_ZONE_RESET and REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot handle this flag correctly due to the potential request allocation failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio() has handled the zone write plug write pointer updates for the targeted zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this flag is set and ignore it. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig --- block/blk-zoned.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 7982b9494d63..ee9c67121c6c 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -707,6 +707,15 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio, return true; } + /* + * No-wait reset or finish BIOs do not make much sense as the callers + * issue these as blocking operations in most cases. To avoid issues + * the BIO execution potentially failing with BLK_STS_AGAIN, warn about + * REQ_NOWAIT being set and ignore that flag. + */ + if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT)) + bio->bi_opf &= ~REQ_NOWAIT; + /* * If we have a zone write plug, set its write pointer offset to 0 * (reset case) or to the zone size (finish case). This will abort all From patchwork Sun Dec 8 22:57:57 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13898713 X-Patchwork-Delegate: snitzer@redhat.com Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A5262198A0D; Sun, 8 Dec 2024 22:58:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698707; cv=none; b=pSy1WuC2t8KPVpsDAf/4MHYBSDPh7iDrr57S/mCkHHc2qdLTCEnYApQznYr12C1tw5qiAotuEpFBR4NEmCaD4O2A9u6KgBeedcL8Dgic4qcwrag5hpP1lszpc4QTHXmIyLAsizyDopZXUOVrDHqstP0T4pu/nPXjDGk1zKQ14g8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698707; c=relaxed/simple; bh=1ySXHo69YNq/h789ZoACn0JBEcMaRqUpUBFLBwx4n40=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=bQzpl6ex8/nX6i9T2kQGbcUx1YFJjejB/Bm8xgawdxL23HL8yewHfhvDF2LmDr8LTAQgwdtm8uti96eqh8nf7QPmTOJ2m5R26BzY0fE4UjOKlv7LW8Qqe6+ln9/TQCP9i0RPfbS7TPUQ4nQ4lTabU+eIUiWMnTZaJoxk2+JiGOg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=FYYUZJgy; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="FYYUZJgy" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 08538C4CEE1; Sun, 8 Dec 2024 22:58:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1733698707; bh=1ySXHo69YNq/h789ZoACn0JBEcMaRqUpUBFLBwx4n40=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=FYYUZJgyO70w/AgGSORyvYo+BLmBJVa0o3GanUvB46yaCi9QW78aLeQ7vFa6Gkp4e /QmGSlBZAm9rLPtTec3KKpi1hhT1B/h+Y1J53+VVb7fQaUVTgIyGP6ag/dbPYG52HC Krp7QsHWEpCZxGdPgY9OBm+ZtPTfTkGj/kL+X/YqNbsMFsDFecYfCRqfGA9DJyTNhf Cni0bjL/DaHtM8S4fJ0yfxzVeguhabn84XWFHbnPGajgUZW3PAUJdUxfWLn9qxRSGg 2adhgLdfUG/gNCpNxr2ka++Xyc9fSaKlUu6z7oASePy60tP+6PoZhZPJtxcAMdwBPF e31AkeQD/IvNg== From: Damien Le Moal To: linux-block@vger.kernel.org, Jens Axboe , Mike Snitzer , Mikulas Patocka , dm-devel@lists.linux.dev Cc: Christoph Hellwig , Bart Van Assche Subject: [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery Date: Mon, 9 Dec 2024 07:57:57 +0900 Message-ID: <20241208225758.219228-4-dlemoal@kernel.org> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20241208225758.219228-1-dlemoal@kernel.org> References: <20241208225758.219228-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Zone write plugging for handling writes to zones of a zoned block device always execute a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct to ensure that BIOs can be checked on submission and that we can always correctly emulate zone append operations using regular write BIOs. However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operation fails. In such case, the disk zone write plug error recovery work is scheduled and executes a report zone. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. However, with the device queue freeze already started, this allocation will block, preventing the report zone execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock. Avoid this problem by completely getting rid of the need for executing a report zone from within the zone write plugging code, instead relying on the user either executing a report zones, resetting the zone or finishing the zone of a failed write. This is not an unresannable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible. The changes to switch to this recovery method are as follows: - Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error(). - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. - Modify the function disk_zone_wplug_set_wp_offset() to clear this flag, thus implementing recovery of a correct write pointer for the reset zone and finish zone operations. - Introduce the function disk_zone_wplug_sync_wp_offset() which calls disk_zone_wplug_set_wp_offset() if the zone write plug is marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. disk_zone_wplug_sync_wp_offset() is called from the new, always used, disk_report_zones_cb() report zones callback function for blkdev_report_zones(). This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user. - Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal --- block/blk-zoned.c | 388 +++++++++++++---------------------------- include/linux/blkdev.h | 2 - 2 files changed, 121 insertions(+), 269 deletions(-) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index ee9c67121c6c..d709f12d5935 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -41,7 +41,6 @@ static const char *const zone_cond_name[] = { /* * Per-zone write plug. * @node: hlist_node structure for managing the plug using a hash table. - * @link: To list the plug in the zone write plug error list of the disk. * @ref: Zone write plug reference counter. A zone write plug reference is * always at least 1 when the plug is hashed in the disk plug hash table. * The reference is incremented whenever a new BIO needing plugging is @@ -63,7 +62,6 @@ static const char *const zone_cond_name[] = { */ struct blk_zone_wplug { struct hlist_node node; - struct list_head link; refcount_t ref; spinlock_t lock; unsigned int flags; @@ -80,8 +78,8 @@ struct blk_zone_wplug { * - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged, * that is, that write BIOs are being throttled due to a write BIO already * being executed or the zone write plug bio list is not empty. - * - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened which will be - * recovered with a report zone to update the zone write pointer offset. + * - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone + * write pointer offset and need to update it. * - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed * from the disk hash table and that the initial reference to the zone * write plug set when the plug was first added to the hash table has been @@ -91,11 +89,9 @@ struct blk_zone_wplug { * freed once all remaining references from BIOs or functions are dropped. */ #define BLK_ZONE_WPLUG_PLUGGED (1U << 0) -#define BLK_ZONE_WPLUG_ERROR (1U << 1) +#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1) #define BLK_ZONE_WPLUG_UNHASHED (1U << 2) -#define BLK_ZONE_WPLUG_BUSY (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR) - /** * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX. * @zone_cond: BLK_ZONE_COND_XXX. @@ -115,6 +111,27 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond) } EXPORT_SYMBOL_GPL(blk_zone_cond_str); +struct disk_report_zones_cb_args { + struct gendisk *disk; + report_zones_cb user_cb; + void *user_data; +}; + +static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk, + struct blk_zone *zone); + +static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx, + void *data) +{ + struct disk_report_zones_cb_args *args = data; + struct gendisk *disk = args->disk; + + if (disk->zone_wplugs_hash) + disk_zone_wplug_sync_wp_offset(disk, zone); + + return args->user_cb(zone, idx, args->user_data); +} + /** * blkdev_report_zones - Get zones information * @bdev: Target block device @@ -139,6 +156,11 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector, { struct gendisk *disk = bdev->bd_disk; sector_t capacity = get_capacity(disk); + struct disk_report_zones_cb_args args = { + .disk = disk, + .user_cb = cb, + .user_data = data, + }; if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones)) return -EOPNOTSUPP; @@ -146,7 +168,8 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector, if (!nr_zones || sector >= capacity) return 0; - return disk->fops->report_zones(disk, sector, nr_zones, cb, data); + return disk->fops->report_zones(disk, sector, nr_zones, + disk_report_zones_cb, &args); } EXPORT_SYMBOL_GPL(blkdev_report_zones); @@ -427,7 +450,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug) { if (refcount_dec_and_test(&zwplug->ref)) { WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list)); - WARN_ON_ONCE(!list_empty(&zwplug->link)); + WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED); WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)); call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu); @@ -441,8 +464,8 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk, if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) return false; - /* If the zone write plug is still busy, it cannot be removed. */ - if (zwplug->flags & BLK_ZONE_WPLUG_BUSY) + /* If the zone write plug is still plugged, it cannot be removed. */ + if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) return false; /* @@ -525,7 +548,6 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk, return NULL; INIT_HLIST_NODE(&zwplug->node); - INIT_LIST_HEAD(&zwplug->link); refcount_set(&zwplug->ref, 2); spin_lock_init(&zwplug->lock); zwplug->flags = 0; @@ -574,115 +596,22 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug) } /* - * Abort (fail) all plugged BIOs of a zone write plug that are not aligned - * with the assumed write pointer location of the zone when the BIO will - * be unplugged. - */ -static void disk_zone_wplug_abort_unaligned(struct gendisk *disk, - struct blk_zone_wplug *zwplug) -{ - unsigned int wp_offset = zwplug->wp_offset; - struct bio_list bl = BIO_EMPTY_LIST; - struct bio *bio; - - while ((bio = bio_list_pop(&zwplug->bio_list))) { - if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) || - (bio_op(bio) != REQ_OP_ZONE_APPEND && - bio_offset_from_zone_start(bio) != wp_offset)) { - blk_zone_wplug_bio_io_error(zwplug, bio); - continue; - } - - wp_offset += bio_sectors(bio); - bio_list_add(&bl, bio); - } - - bio_list_merge(&zwplug->bio_list, &bl); -} - -static inline void disk_zone_wplug_set_error(struct gendisk *disk, - struct blk_zone_wplug *zwplug) -{ - unsigned long flags; - - if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) - return; - - /* - * At this point, we already have a reference on the zone write plug. - * However, since we are going to add the plug to the disk zone write - * plugs work list, increase its reference count. This reference will - * be dropped in disk_zone_wplugs_work() once the error state is - * handled, or in disk_zone_wplug_clear_error() if the zone is reset or - * finished. - */ - zwplug->flags |= BLK_ZONE_WPLUG_ERROR; - refcount_inc(&zwplug->ref); - - spin_lock_irqsave(&disk->zone_wplugs_lock, flags); - list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list); - spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags); -} - -static inline void disk_zone_wplug_clear_error(struct gendisk *disk, - struct blk_zone_wplug *zwplug) -{ - unsigned long flags; - - if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) - return; - - /* - * We are racing with the error handling work which drops the reference - * on the zone write plug after handling the error state. So remove the - * plug from the error list and drop its reference count only if the - * error handling has not yet started, that is, if the zone write plug - * is still listed. - */ - spin_lock_irqsave(&disk->zone_wplugs_lock, flags); - if (!list_empty(&zwplug->link)) { - list_del_init(&zwplug->link); - zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR; - disk_put_zone_wplug(zwplug); - } - spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags); -} - -/* - * Set a zone write plug write pointer offset to either 0 (zone reset case) - * or to the zone size (zone finish case). This aborts all plugged BIOs, which - * is fine to do as doing a zone reset or zone finish while writes are in-flight - * is a mistake from the user which will most likely cause all plugged BIOs to - * fail anyway. + * Set a zone write plug write pointer offset to the specified value. + * This aborts all plugged BIOs, which is fine as this function is called for + * a zone reset operation, a zone finish operation or if the zone needs a wp + * update from a report zone after a write error. */ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk, struct blk_zone_wplug *zwplug, unsigned int wp_offset) { - unsigned long flags; - - spin_lock_irqsave(&zwplug->lock, flags); - - /* - * Make sure that a BIO completion or another zone reset or finish - * operation has not already removed the plug from the hash table. - */ - if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) { - spin_unlock_irqrestore(&zwplug->lock, flags); - return; - } + lockdep_assert_held(&zwplug->lock); /* Update the zone write pointer and abort all plugged BIOs. */ + zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE; zwplug->wp_offset = wp_offset; disk_zone_wplug_abort(zwplug); - /* - * Updating the write pointer offset puts back the zone - * in a good state. So clear the error flag and decrement the - * error count if we were in error state. - */ - disk_zone_wplug_clear_error(disk, zwplug); - /* * The zone write plug now has no BIO plugged: remove it from the * hash table so that it cannot be seen. The plug will be freed @@ -690,8 +619,48 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk, */ if (disk_should_remove_zone_wplug(disk, zwplug)) disk_remove_zone_wplug(disk, zwplug); +} +static unsigned int blk_zone_wp_offset(struct blk_zone *zone) +{ + switch (zone->cond) { + case BLK_ZONE_COND_IMP_OPEN: + case BLK_ZONE_COND_EXP_OPEN: + case BLK_ZONE_COND_CLOSED: + return zone->wp - zone->start; + case BLK_ZONE_COND_FULL: + return zone->len; + case BLK_ZONE_COND_EMPTY: + return 0; + case BLK_ZONE_COND_NOT_WP: + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + default: + /* + * Conventional, offline and read-only zones do not have a valid + * write pointer. + */ + return UINT_MAX; + } +} + +static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk, + struct blk_zone *zone) +{ + struct blk_zone_wplug *zwplug; + unsigned long flags; + + zwplug = disk_get_zone_wplug(disk, zone->start); + if (!zwplug) + return; + + spin_lock_irqsave(&zwplug->lock, flags); + if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE) + disk_zone_wplug_set_wp_offset(disk, zwplug, + blk_zone_wp_offset(zone)); spin_unlock_irqrestore(&zwplug->lock, flags); + + disk_put_zone_wplug(zwplug); } static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio, @@ -700,6 +669,7 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio, struct gendisk *disk = bio->bi_bdev->bd_disk; sector_t sector = bio->bi_iter.bi_sector; struct blk_zone_wplug *zwplug; + unsigned long flags; /* Conventional zones cannot be reset nor finished. */ if (!bdev_zone_is_seq(bio->bi_bdev, sector)) { @@ -725,7 +695,9 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio, */ zwplug = disk_get_zone_wplug(disk, sector); if (zwplug) { + spin_lock_irqsave(&zwplug->lock, flags); disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset); + spin_unlock_irqrestore(&zwplug->lock, flags); disk_put_zone_wplug(zwplug); } @@ -736,6 +708,7 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio) { struct gendisk *disk = bio->bi_bdev->bd_disk; struct blk_zone_wplug *zwplug; + unsigned long flags; sector_t sector; /* @@ -747,7 +720,9 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio) sector += disk->queue->limits.chunk_sectors) { zwplug = disk_get_zone_wplug(disk, sector); if (zwplug) { + spin_lock_irqsave(&zwplug->lock, flags); disk_zone_wplug_set_wp_offset(disk, zwplug, 0); + spin_unlock_irqrestore(&zwplug->lock, flags); disk_put_zone_wplug(zwplug); } } @@ -929,13 +904,23 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug, { struct gendisk *disk = bio->bi_bdev->bd_disk; + /* + * If we lost track of the zone write pointer due to a write error, + * the user must either execute a report zones, reset the zone or finish + * the to recover a reliable write pointer position. Fail BIOs if the + * user did not do that as we cannot handle emulated zone append + * otherwise. + */ + if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE) + return false; + /* * Check that the user is not attempting to write to a full zone. * We know such BIO will fail, and that would potentially overflow our * write pointer offset beyond the end of the zone. */ if (disk_zone_wplug_is_full(disk, zwplug)) - goto err; + return false; if (bio_op(bio) == REQ_OP_ZONE_APPEND) { /* @@ -954,24 +939,18 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug, bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND); } else { /* - * Check for non-sequential writes early because we avoid a - * whole lot of error handling trouble if we don't send it off - * to the driver. + * Check for non-sequential writes early as we know that BIOs + * with a start sector not unaligned to the zone write pointer + * will fail. */ if (bio_offset_from_zone_start(bio) != zwplug->wp_offset) - goto err; + return false; } /* Advance the zone write pointer offset. */ zwplug->wp_offset += bio_sectors(bio); return true; - -err: - /* We detected an invalid write BIO: schedule error recovery. */ - disk_zone_wplug_set_error(disk, zwplug); - kblockd_schedule_work(&disk->zone_wplugs_work); - return false; } static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) @@ -1021,20 +1000,20 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs) bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING); /* - * If the zone is already plugged or has a pending error, add the BIO - * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that - * we will not see a BLK_STS_AGAIN failure if we let the BIO execute. + * If the zone is already plugged, add the BIO to the plug BIO list. + * Do the same for REQ_NOWAIT BIOs to ensure that we will not see a + * BLK_STS_AGAIN failure if we let the BIO execute. * Otherwise, plug and let the BIO execute. */ - if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT)) + if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) || + (bio->bi_opf & REQ_NOWAIT)) goto plug; - /* - * If an error is detected when preparing the BIO, add it to the BIO - * list so that error recovery can deal with it. - */ - if (!blk_zone_wplug_prepare_bio(zwplug, bio)) - goto plug; + if (!blk_zone_wplug_prepare_bio(zwplug, bio)) { + spin_unlock_irqrestore(&zwplug->lock, flags); + bio_io_error(bio); + return true; + } zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED; @@ -1134,16 +1113,6 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk, spin_lock_irqsave(&zwplug->lock, flags); - /* - * If we had an error, schedule error recovery. The recovery work - * will restart submission of plugged BIOs. - */ - if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) { - spin_unlock_irqrestore(&zwplug->lock, flags); - kblockd_schedule_work(&disk->zone_wplugs_work); - return; - } - /* Schedule submission of the next plugged BIO if we have one. */ if (!bio_list_empty(&zwplug->bio_list)) { disk_zone_wplug_schedule_bio_work(disk, zwplug); @@ -1186,12 +1155,13 @@ void blk_zone_write_plug_bio_endio(struct bio *bio) } /* - * If the BIO failed, mark the plug as having an error to trigger - * recovery. + * If the BIO failed, abort all plugged BIOs and mark the plug as + * needing a write pointer update. */ if (bio->bi_status != BLK_STS_OK) { spin_lock_irqsave(&zwplug->lock, flags); - disk_zone_wplug_set_error(disk, zwplug); + disk_zone_wplug_abort(zwplug); + zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE; spin_unlock_irqrestore(&zwplug->lock, flags); } @@ -1247,6 +1217,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) */ spin_lock_irqsave(&zwplug->lock, flags); +again: bio = bio_list_pop(&zwplug->bio_list); if (!bio) { zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED; @@ -1255,10 +1226,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) } if (!blk_zone_wplug_prepare_bio(zwplug, bio)) { - /* Error recovery will decide what to do with the BIO. */ - bio_list_add_head(&zwplug->bio_list, bio); - spin_unlock_irqrestore(&zwplug->lock, flags); - goto put_zwplug; + blk_zone_wplug_bio_io_error(zwplug, bio); + goto again; } spin_unlock_irqrestore(&zwplug->lock, flags); @@ -1280,120 +1249,6 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) disk_put_zone_wplug(zwplug); } -static unsigned int blk_zone_wp_offset(struct blk_zone *zone) -{ - switch (zone->cond) { - case BLK_ZONE_COND_IMP_OPEN: - case BLK_ZONE_COND_EXP_OPEN: - case BLK_ZONE_COND_CLOSED: - return zone->wp - zone->start; - case BLK_ZONE_COND_FULL: - return zone->len; - case BLK_ZONE_COND_EMPTY: - return 0; - case BLK_ZONE_COND_NOT_WP: - case BLK_ZONE_COND_OFFLINE: - case BLK_ZONE_COND_READONLY: - default: - /* - * Conventional, offline and read-only zones do not have a valid - * write pointer. - */ - return UINT_MAX; - } -} - -static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone, - unsigned int idx, void *data) -{ - struct blk_zone *zonep = data; - - *zonep = *zone; - return 0; -} - -static void disk_zone_wplug_handle_error(struct gendisk *disk, - struct blk_zone_wplug *zwplug) -{ - sector_t zone_start_sector = - bdev_zone_sectors(disk->part0) * zwplug->zone_no; - unsigned int noio_flag; - struct blk_zone zone; - unsigned long flags; - int ret; - - /* Get the current zone information from the device. */ - noio_flag = memalloc_noio_save(); - ret = disk->fops->report_zones(disk, zone_start_sector, 1, - blk_zone_wplug_report_zone_cb, &zone); - memalloc_noio_restore(noio_flag); - - spin_lock_irqsave(&zwplug->lock, flags); - - /* - * A zone reset or finish may have cleared the error already. In such - * case, do nothing as the report zones may have seen the "old" write - * pointer value before the reset/finish operation completed. - */ - if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) - goto unlock; - - zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR; - - if (ret != 1) { - /* - * We failed to get the zone information, meaning that something - * is likely really wrong with the device. Abort all remaining - * plugged BIOs as otherwise we could endup waiting forever on - * plugged BIOs to complete if there is a queue freeze on-going. - */ - disk_zone_wplug_abort(zwplug); - goto unplug; - } - - /* Update the zone write pointer offset. */ - zwplug->wp_offset = blk_zone_wp_offset(&zone); - disk_zone_wplug_abort_unaligned(disk, zwplug); - - /* Restart BIO submission if we still have any BIO left. */ - if (!bio_list_empty(&zwplug->bio_list)) { - disk_zone_wplug_schedule_bio_work(disk, zwplug); - goto unlock; - } - -unplug: - zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED; - if (disk_should_remove_zone_wplug(disk, zwplug)) - disk_remove_zone_wplug(disk, zwplug); - -unlock: - spin_unlock_irqrestore(&zwplug->lock, flags); -} - -static void disk_zone_wplugs_work(struct work_struct *work) -{ - struct gendisk *disk = - container_of(work, struct gendisk, zone_wplugs_work); - struct blk_zone_wplug *zwplug; - unsigned long flags; - - spin_lock_irqsave(&disk->zone_wplugs_lock, flags); - - while (!list_empty(&disk->zone_wplugs_err_list)) { - zwplug = list_first_entry(&disk->zone_wplugs_err_list, - struct blk_zone_wplug, link); - list_del_init(&zwplug->link); - spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags); - - disk_zone_wplug_handle_error(disk, zwplug); - disk_put_zone_wplug(zwplug); - - spin_lock_irqsave(&disk->zone_wplugs_lock, flags); - } - - spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags); -} - static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk) { return 1U << disk->zone_wplugs_hash_bits; @@ -1402,8 +1257,6 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk) void disk_init_zone_resources(struct gendisk *disk) { spin_lock_init(&disk->zone_wplugs_lock); - INIT_LIST_HEAD(&disk->zone_wplugs_err_list); - INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work); } /* @@ -1502,8 +1355,6 @@ void disk_free_zone_resources(struct gendisk *disk) if (!disk->zone_wplugs_pool) return; - cancel_work_sync(&disk->zone_wplugs_work); - if (disk->zone_wplugs_wq) { destroy_workqueue(disk->zone_wplugs_wq); disk->zone_wplugs_wq = NULL; @@ -1700,6 +1551,8 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx, if (!disk->zone_wplugs_hash) return 0; + disk_zone_wplug_sync_wp_offset(disk, zone); + wp_offset = blk_zone_wp_offset(zone); if (!wp_offset || wp_offset >= zone->capacity) return 0; @@ -1830,6 +1683,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk) memalloc_noio_restore(noio_flag); return ret; } + ret = disk->fops->report_zones(disk, 0, UINT_MAX, blk_revalidate_zone_cb, &args); if (!ret) { diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 08a727b40816..15e7dfc013b7 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -200,8 +200,6 @@ struct gendisk { spinlock_t zone_wplugs_lock; struct mempool_s *zone_wplugs_pool; struct hlist_head *zone_wplugs_hash; - struct list_head zone_wplugs_err_list; - struct work_struct zone_wplugs_work; struct workqueue_struct *zone_wplugs_wq; #endif /* CONFIG_BLK_DEV_ZONED */ From patchwork Sun Dec 8 22:57:58 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 13898714 X-Patchwork-Delegate: snitzer@redhat.com Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 22D20198E75; Sun, 8 Dec 2024 22:58:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698709; cv=none; b=hN78hdru0J4A54heWP0jjQ3IXpwbgrgJdkYQMCUwxASe7qnSUU7+b9X1j++TTgnt5bioWpIfh5cflmIWOar19DgU2aSpqZ6slKaNJd7w8Gj5eFVgrMXl+r/OC3njGMeo1kEldeVfQpbE50Ir1q8EIaNSOxpqY5XjwsznmNjYEek= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733698709; c=relaxed/simple; bh=naaHmVa1xXF1w9vzqh7Bb52GrZQLJ2MlhUpqffZX1As=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=eqp1Uxm4COfjrOSxqswDgLQGsCFB0IupVk/MtgNfsJ0dHLjFmzUn6vfoZqy4JQDamLhc1jiOiVqFV4LeYf3sg2KQsqdF8+RWfGhTuaWs459gm24wIs0pP1UlxGjPiLBfM2+busZLqGQcrFz3sGFAgWo9fJBTbtStjgtIPHlxnq8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=unpkp3qV; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="unpkp3qV" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9DA4EC4CEE0; Sun, 8 Dec 2024 22:58:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1733698708; bh=naaHmVa1xXF1w9vzqh7Bb52GrZQLJ2MlhUpqffZX1As=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=unpkp3qVh9ZE00vnt8vFgDiZyIHtgGkOSdZZf5oR4BkQqLXIAEMB9IyD2d3ul22fe MJJ83no1vXIo7pnYzf+zjXA3ZkOUTIxNhov7ieATJwT5a6KgPh6PDIGqnrGLND8iEH EgLgLRa9xKgRAcyEwtgDtpJQNMf1T0oV/bXezqHp0ZNosj0H2jAejWDk33lRsXMECs c4iNcqw6pDe75tIC5CHB6QaJkwvaNhLBAFN7DzMXf9E+oXpMig9GTYoG2cuUarz139 So5G2PwN/m8nPCbjTOZmyArMCgvluFvu2EkK1rdlPjsjpLoq1Cf2ruugfWBPMiMS/z tr0EhH467D2Hw== From: Damien Le Moal To: linux-block@vger.kernel.org, Jens Axboe , Mike Snitzer , Mikulas Patocka , dm-devel@lists.linux.dev Cc: Christoph Hellwig , Bart Van Assche Subject: [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Date: Mon, 9 Dec 2024 07:57:58 +0900 Message-ID: <20241208225758.219228-5-dlemoal@kernel.org> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20241208225758.219228-1-dlemoal@kernel.org> References: <20241208225758.219228-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: dm-devel@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming zones, to write valid data blocks at the correct zone start relative offset. However, the first call to blkdev_issue_zeroout() will try to use hardware offload through a REQ_OP_WRITE_ZEROES operation, if the device reports a non-zero max zero sectors limit. If this operation fails, blkdev_issue_zeroout() falls back to using a regular write operation with zeror-pages. With the removal of the zone write plug automatic write pointer recovery after a write error, the first attempt using a REQ_OP_WRITE_ZEROES leaves the target zone write pointer out of sync with the drive current value as the zone writ eplugging code advanced the zone write plug write pointer offset on submission of the REQ_OP_WRITE_ZEROES operation. The target zone is marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE, which causes the fallback regular write operation to also fail. blkdev_issue_zeroout() callers such as dmz_reclaim_align_wp() can recover from such situation by executing a report zones and retrying the call to blkdev_issue_zeroout() to handle this recoverable error situation. Given that such pattern will be common to all users of blkdev_issue_zeroout(), introduce the function blkdev_issue_zone_zeroout() to automatically handle such recovery. This function calls blkdev_issue_zeroout() with the BLKDEV_ZERO_NOFALLBACK flag to intercept failures on the first execution which attempt to use the device hardware offload with a REQ_OP_WRITE_ZEROES. If this attempt fails, blkdev_report_zones() is executed to recover the target zone to a good state and execute again blkdev_issue_zeroout() without the BLKDEV_ZERO_NOFALLBACK flag. Replacing the call to blkdev_issue_zeroout() with a call to blkdev_issue_zone_zeroout() in dmz_reclaim_align_wp() thus solves irrecoverable write errors triggered by the removal of the zone write plugging automatic recovery (commit "block: Prevent potential deadlocks in zone write plug error recovery"). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal --- block/blk-zoned.c | 55 +++++++++++++++++++++++++++++++++++ drivers/md/dm-zoned-reclaim.c | 4 +-- include/linux/blkdev.h | 3 ++ 3 files changed, 60 insertions(+), 2 deletions(-) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index d709f12d5935..c5400792de13 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -129,6 +129,9 @@ static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx, if (disk->zone_wplugs_hash) disk_zone_wplug_sync_wp_offset(disk, zone); + if (!args->user_cb) + return 0; + return args->user_cb(zone, idx, args->user_data); } @@ -663,6 +666,16 @@ static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk, disk_put_zone_wplug(zwplug); } +static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector) +{ + struct disk_report_zones_cb_args args = { + .disk = disk, + }; + + return disk->fops->report_zones(disk, sector, 1, + disk_report_zones_cb, &args); +} + static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio, unsigned int wp_offset) { @@ -1720,6 +1733,48 @@ int blk_revalidate_disk_zones(struct gendisk *disk) } EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones); +/** + * blkdev_issue_zone_zeroout - zero-fill a block range in a zone + * @bdev: blockdev to write + * @sector: start sector + * @nr_sects: number of sectors to write + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + * Zero-fill a block range in a zone (@sector must be equal to the zone write + * pointer), handling potential errors due to the (initially unknown) lack of + * hardware offload (See blkdev_issue_zeroout()). + */ +int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector, + sector_t nr_sects, gfp_t gfp_mask) +{ + int ret; + + if (WARN_ON_ONCE(!bdev_is_zoned(bdev))) + return -EIO; + + ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, + BLKDEV_ZERO_NOFALLBACK); + if (ret != -EOPNOTSUPP) + return ret; + + /* + * The failed call to blkdev_issue_zeroout() advanced the zone write + * pointer. Undo this using a report zone to update the zone write + * pointer to the correct current value. + */ + ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector); + if (ret != 1) + return ret < 0 ? ret : -EIO; + + /* + * Retry without BLKDEV_ZERO_NOFALLBACK to force the fallback to a + * regular write with zero-pages. + */ + return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0); +} +EXPORT_SYMBOL(blkdev_issue_zone_zeroout); + #ifdef CONFIG_BLK_DEBUG_FS int queue_zone_wplugs_show(void *data, struct seq_file *m) diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c index d58db9a27e6c..b085a929e64e 100644 --- a/drivers/md/dm-zoned-reclaim.c +++ b/drivers/md/dm-zoned-reclaim.c @@ -76,9 +76,9 @@ static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone, * pointer and the requested position. */ nr_blocks = block - wp_block; - ret = blkdev_issue_zeroout(dev->bdev, + ret = blkdev_issue_zone_zeroout(dev->bdev, dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block), - dmz_blk2sect(nr_blocks), GFP_NOIO, 0); + dmz_blk2sect(nr_blocks), GFP_NOIO); if (ret) { dmz_dev_err(dev, "Align zone %u wp %llu to %llu (wp+%u) blocks failed %d", diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 15e7dfc013b7..696fdafcfe91 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1419,6 +1419,9 @@ static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector) return is_seq; } +int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector, + sector_t nr_sects, gfp_t gfp_mask); + static inline unsigned int queue_dma_alignment(const struct request_queue *q) { return q->limits.dma_alignment;