From patchwork Mon Jul 13 12:35:10 2020
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 11659573
From: Coly Li
To: axboe@kernel.dk, linux-block@vger.kernel.org
Cc: Coly Li, Damien Le Moal, Chaitanya Kulkarni, Christoph Hellwig,
 Hannes Reinecke, Jens Axboe, Johannes Thumshirn, Keith Busch,
 Shaun Tancheff
Subject: [PATCH 1/2] block: change REQ_OP_ZONE_RESET and
 REQ_OP_ZONE_RESET_ALL to be odd numbers
Date: Mon, 13 Jul 2020 20:35:10 +0800
Message-Id: <20200713123511.19441-2-colyli@suse.de>
In-Reply-To: <20200713123511.19441-1-colyli@suse.de>
References: <20200713123511.19441-1-colyli@suse.de>
X-Mailing-List: linux-block@vger.kernel.org

Currently REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are defined as the
even numbers 6 and 8, so zone reset bios are treated as READ bios by
bio_data_dir(), which is obviously misleading.

The macro bio_data_dir() is defined in include/linux/bio.h as,

	#define bio_data_dir(bio) \
		(op_is_write(bio_op(bio)) ? WRITE : READ)

And op_is_write() is defined in include/linux/blk_types.h as,

	static inline bool op_is_write(unsigned int op)
	{
		return (op & 1);
	}

The convention behind op_is_write() is that op codes which transfer data
onto media are odd numbers and are treated as write ops. bio_data_dir()
reports a bio as READ when op_is_write() returns false, and as WRITE when
it returns true.

Because REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are even numbers,
bio_data_dir() reports such bios as READ, which is misleading and may be
wrong: although these two commands do not transfer data, they reset the
write pointers of the target zones, and all content behind the reset
write pointer becomes invalid and inaccessible, so they are not READ bios
by any means.

This patch changes REQ_OP_ZONE_RESET from 6 to 15, and
REQ_OP_ZONE_RESET_ALL from 8 to 17. Now bios with these two op codes are
treated as WRITE by bio_data_dir(). Although they do not transfer data,
this keeps them consistent with REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES,
with the intuition that they change on-media content and therefore should
be WRITE requests.
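For illustration, below is a standalone userspace sketch (not kernel
code; it only mirrors the op_is_write() logic quoted above, using the old
and new op code values) showing how the even-to-odd change flips the
direction bio_data_dir() would report:

	#include <stdbool.h>
	#include <stdio.h>

	/* Mirrors op_is_write(): write-type op codes are odd numbers. */
	static bool op_is_write(unsigned int op)
	{
		return (op & 1);
	}

	static const char *dir_name(unsigned int op)
	{
		return op_is_write(op) ? "WRITE" : "READ";
	}

	int main(void)
	{
		/* Old even values: misreported as READ. */
		printf("REQ_OP_ZONE_RESET     =  6 -> %s\n", dir_name(6));
		printf("REQ_OP_ZONE_RESET_ALL =  8 -> %s\n", dir_name(8));
		/* New odd values: reported as WRITE, like REQ_OP_DISCARD (3). */
		printf("REQ_OP_ZONE_RESET     = 15 -> %s\n", dir_name(15));
		printf("REQ_OP_ZONE_RESET_ALL = 17 -> %s\n", dir_name(17));
		return 0;
	}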
Signed-off-by: Coly Li
Reviewed-by: Damien Le Moal
Reviewed-by: Chaitanya Kulkarni
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Jens Axboe
Cc: Johannes Thumshirn
Cc: Keith Busch
Cc: Shaun Tancheff
---
 include/linux/blk_types.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ccb895f911b1..447b46a0accf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -300,12 +300,8 @@ enum req_opf {
 	REQ_OP_DISCARD		= 3,
 	/* securely erase sectors */
 	REQ_OP_SECURE_ERASE	= 5,
-	/* reset a zone write pointer */
-	REQ_OP_ZONE_RESET	= 6,
 	/* write the same sector many times */
 	REQ_OP_WRITE_SAME	= 7,
-	/* reset all the zone present on the device */
-	REQ_OP_ZONE_RESET_ALL	= 8,
 	/* write the zero filled sector many times */
 	REQ_OP_WRITE_ZEROES	= 9,
 	/* Open a zone */
@@ -316,6 +312,10 @@ enum req_opf {
 	REQ_OP_ZONE_FINISH	= 12,
 	/* write data at the current zone write pointer */
 	REQ_OP_ZONE_APPEND	= 13,
+	/* reset a zone write pointer */
+	REQ_OP_ZONE_RESET	= 15,
+	/* reset all the zones present on the device */
+	REQ_OP_ZONE_RESET_ALL	= 17,
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,

From patchwork Mon Jul 13 12:35:11 2020
X-Patchwork-Submitter: Coly Li
X-Patchwork-Id: 11659575
From: Coly Li
To: axboe@kernel.dk, linux-block@vger.kernel.org
Cc: Coly Li, Acshai Manoj, Hannes Reinecke, Ming Lei, Xiao Ni,
 Bart Van Assche, Christoph Hellwig, Enzo Matsumiya
Subject: [PATCH 2/2] block: improve discard bio alignment in
 __blkdev_issue_discard()
Date: Mon, 13 Jul 2020 20:35:11 +0800
Message-Id: <20200713123511.19441-3-colyli@suse.de>
In-Reply-To: <20200713123511.19441-1-colyli@suse.de>
References: <20200713123511.19441-1-colyli@suse.de>
X-Mailing-List: linux-block@vger.kernel.org

This patch improves discard bio splitting for address and size alignment
in __blkdev_issue_discard(). Aligned discard bios may help the underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragmentation.

The current discard bio split algorithm in __blkdev_issue_discard() may
leave non-discarded fragments on the device even when the discard bio's
LBA and size are both aligned to the device's discard granularity.

Here are the steps to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  in thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
	# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
	# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device,
	# sg_get_lba_status /dev/sdb -m 1048 --lba=0
	descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
	descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
	descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
	descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
	descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
	descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
	descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
	descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
	descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
	descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
	descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
	descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
	descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
	descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and its size (50<<30 bytes) is
perfectly aligned to the discard granularity, the above list shows that
many unexpected 1MB (2048-sector) internal fragments are still mapped.

The problem is in __blkdev_issue_discard(): an improper split algorithm
produces bio sizes which are not aligned to the discard granularity.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26		sector_t nr_sects, gfp_t gfp_mask, int flags,
 27		struct bio **biop)
 28 {
 29	struct request_queue *q = bdev_get_queue(bdev);
    [snipped]
 56
 57	while (nr_sects) {
 58		sector_t req_sects = min_t(sector_t, nr_sects,
 59			bio_allowed_max_sectors(q));
 60
 61		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63		bio = blk_next_bio(bio, 0, gfp_mask);
 64		bio->bi_iter.bi_sector = sector;
 65		bio_set_dev(bio, bdev);
 66		bio_set_op_attrs(bio, op, 0);
 67
 68		bio->bi_iter.bi_size = req_sects << 9;
 69		sector += req_sects;
 70		nr_sects -= req_sects;
    [snipped]
 79	}
 80
 81	*biop = bio;
 82	return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At lines 58-59, to discard a 50GB range, req_sects is set to the return
value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the
above case the discard granularity is 2048 sectors; although the start
LBA and discard length are aligned to the discard granularity, req_sects
(8388607 is not a multiple of 2048) never has a chance to be aligned to
the discard granularity.
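To make the misalignment concrete, here is a quick userspace arithmetic
sketch (not kernel code; the 512-byte logical block size and the
2048-sector discard granularity are values taken from the example above)
that redoes the bio_allowed_max_sectors() computation:

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int lbs = 512;          /* queue_logical_block_size(q) */
		unsigned int granularity = 2048; /* discard granularity, sectors */

		/* round_down(UINT_MAX, lbs) >> 9, as bio_allowed_max_sectors() */
		unsigned int max_sects = (UINT_MAX - UINT_MAX % lbs) >> 9;

		printf("max sectors per split bio: %u\n", max_sects); /* 8388607 */
		printf("leftover vs. granularity : %u\n",
		       max_sects % granularity);                      /* 2047 */
		return 0;
	}

Each split boundary thus lands 2047 sectors past a 1MB chunk boundary, so
the chunk containing the boundary is only partially covered by the
discard bios on either side and the device cannot deallocate it.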
This is why there are still-mapped 2048-sector fragments in every 4 or
8 GB range.

If req_sects at line 58 is set to a value that is aligned to
discard_granularity and close to UINT_MAX, then all subsequent split bios
inside the device driver are (almost all) aligned to the
discard_granularity of the device queue, and the still-mapped 2048-sector
fragments disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
value which is aligned to q->limits.discard_granularity and closest to
UINT_MAX, and replaces bio_allowed_max_sectors() with this new routine to
decide a more proper split bio length.

We still need to handle the situation where the discard start LBA is not
aligned to q->limits.discard_granularity; otherwise, even if the length
is aligned, the code may still leave a 2048-sector fragment around every
4GB range. Therefore, to calculate req_sects, the start LBA of the
discard range is checked first; if it is not aligned to the discard
granularity, the first split location is chosen so that the following bio
starts with bi_sector aligned to the discard granularity. Then there is
no still-mapped fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). With this patch, after discarding with the same
command line mentioned previously, sg_get_lba_status returns,

	descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
	descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We can see there is no 2048-sector fragment anymore; everything is clean.

Reported-and-tested-by: Acshai Manoj
Signed-off-by: Coly Li
Reviewed-by: Hannes Reinecke
Reviewed-by: Ming Lei
Reviewed-by: Xiao Ni
Cc: Bart Van Assche
Cc: Christoph Hellwig
Cc: Enzo Matsumiya
Cc: Jens Axboe
---
 block/blk-lib.c | 25 +++++++++++++++++++++++--
 block/blk.h     | 14 ++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..7bffdee63a20 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -EINVAL;
 
 	while (nr_sects) {
-		sector_t req_sects = min_t(sector_t, nr_sects,
-				bio_allowed_max_sectors(q));
+		sector_t granularity_aligned_lba;
+		sector_t req_sects;
+
+		granularity_aligned_lba = round_up(sector,
+				q->limits.discard_granularity >> SECTOR_SHIFT);
+
+		/*
+		 * Check whether the discard bio starts at a discard_granularity
+		 * aligned LBA,
+		 * - If no: set (granularity_aligned_lba - sector) to bi_size of
+		 *   the first split bio, then the second bio will start at a
+		 *   discard_granularity aligned LBA.
+		 * - If yes: use bio_aligned_discard_max_sectors() as the max
+		 *   possible bi_size of the first split bio. Then when this bio
+		 *   is split in the device driver, the split ones are very
+		 *   probably aligned to discard_granularity of the device's queue.
+		 */
+		if (granularity_aligned_lba == sector)
+			req_sects = min_t(sector_t, nr_sects,
+					  bio_aligned_discard_max_sectors(q));
+		else
+			req_sects = min_t(sector_t, nr_sects,
+					  granularity_aligned_lba - sector);
 
 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 
diff --git a/block/blk.h b/block/blk.h
index b5d1f0fc6547..a80738581f84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,6 +281,20 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
 	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
 }
 
+/*
+ * The max bio size which is aligned to q->limits.discard_granularity. This
+ * is a hint to split a large discard bio in the generic block layer; then
+ * if the device driver needs to split the discard bio into smaller ones,
+ * their bi_size can very probably and easily be aligned to the
+ * discard_granularity of the device's queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+					struct request_queue *q)
+{
+	return round_down(UINT_MAX, q->limits.discard_granularity) >>
+			SECTOR_SHIFT;
+}
+
 /*
  * Internal io_context interface
  */
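For illustration, below is a userspace simulation of the split arithmetic
introduced by this patch (not kernel code: the helpers only mimic
round_up() and bio_aligned_discard_max_sectors(), and the unaligned start
LBA of 1000 is an arbitrary assumed value). It shows that at most the
first split bio starts unaligned; every later one starts at a
discard_granularity aligned LBA:

	#include <limits.h>
	#include <stdio.h>

	typedef unsigned long long sector_t;

	/* Mimics round_up() for sector counts. */
	static sector_t round_up_to(sector_t v, sector_t gran)
	{
		return (v + gran - 1) / gran * gran;
	}

	int main(void)
	{
		sector_t granularity = 2048;	/* 1MB in 512-byte sectors */
		/* round_down(UINT_MAX, granularity in bytes) >> SECTOR_SHIFT,
		 * as in bio_aligned_discard_max_sectors(): 8386560 sectors */
		sector_t aligned_max = (sector_t)UINT_MAX /
				       (granularity << 9) * granularity;
		sector_t sector = 1000;		/* assumed unaligned start LBA */
		sector_t nr_sects = 104857600;	/* 50GB range, in sectors */

		while (nr_sects) {
			sector_t aligned_lba = round_up_to(sector, granularity);
			sector_t req_sects;

			if (aligned_lba == sector)	/* already aligned */
				req_sects = nr_sects < aligned_max ?
					    nr_sects : aligned_max;
			else				/* trim up to alignment */
				req_sects = nr_sects < aligned_lba - sector ?
					    nr_sects : aligned_lba - sector;

			printf("bio: sector=%llu len=%llu (%s start)\n",
			       sector, req_sects,
			       sector % granularity ? "unaligned" : "aligned");
			sector += req_sects;
			nr_sects -= req_sects;
		}
		return 0;
	}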