Message ID | 20231110051950.21972-1-ed.tsai@mediatek.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] block: limit the extract size to align queue limit |
On Fri, Nov 10, 2023 at 01:19:49PM +0800, ed.tsai@mediatek.com wrote:
> +	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
> 		extraction_flags |= ITER_ALLOW_P2PDMA;

As pointed out in reply to Ming, you really need to first figure out if
we can assume we have a valid bdev or not, and if not pass all the
relevant information separately.

> +	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
> +		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));

The higher level code must not look at queue_max_bytes; that is only
used for splitting and might not even be initialized.
Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on hch-configfs/for-next linus/master v6.6 next-20231110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ed-tsai-mediatek-com/block-limit-the-extract-size-to-align-queue-limit/20231110-142205
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20231110051950.21972-1-ed.tsai%40mediatek.com
patch subject: [PATCH v2] block: limit the extract size to align queue limit
config: arc-randconfig-002-20231110 (https://download.01.org/0day-ci/archive/20231110/202311101853.9N398fyj-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311101853.9N398fyj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311101853.9N398fyj-lkp@intel.com/

All warnings (new ones prefixed by >>):

   block/bio.c: In function '__bio_iov_iter_get_pages':
>> block/bio.c:1261:29: warning: suggest parentheses around '-' in operand of '&' [-Wparentheses]
    1261 |                         max - bio->bi_iter.bi_size & (max - 1) : max;
         |                         ~~~~^~~~~~~~~~~~~~~~~~~~~~


vim +1261 block/bio.c

  1214	
  1215	/**
  1216	 * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
  1217	 * @bio: bio to add pages to
  1218	 * @iter: iov iterator describing the region to be mapped
  1219	 *
  1220	 * Extracts pages from *iter and appends them to @bio's bvec array. The pages
  1221	 * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
  1222	 * For a multi-segment *iter, this function only adds pages from the next
  1223	 * non-empty segment of the iov iterator.
  1224	 */
  1225	static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  1226	{
  1227		iov_iter_extraction_t extraction_flags = 0;
  1228		unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
  1229		unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
  1230		struct block_device *bdev = bio->bi_bdev;
  1231		struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
  1232		struct page **pages = (struct page **)bv;
  1233		ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
  1234		ssize_t size, left;
  1235		unsigned len, i = 0;
  1236		size_t offset;
  1237		int ret = 0;
  1238	
  1239		/*
  1240		 * Move page array up in the allocated memory for the bio vecs as far as
  1241		 * possible so that we can start filling biovecs from the beginning
  1242		 * without overwriting the temporary page array.
  1243		 */
  1244		BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
  1245		pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
  1246	
  1247		if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
  1248			extraction_flags |= ITER_ALLOW_P2PDMA;
  1249	
  1250		/*
  1251		 * Each segment in the iov is required to be a block size multiple.
  1252		 * However, we may not be able to get the entire segment if it spans
  1253		 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
  1254		 * result to ensure the bio's total size is correct. The remainder of
  1255		 * the iov data will be picked up in the next bio iteration.
  1256		 */
  1257		if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
  1258			unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
  1259	
  1260			max_extract = bio->bi_iter.bi_size ?
> 1261				max - bio->bi_iter.bi_size & (max - 1) : max;
  1262		}
  1263		size = iov_iter_extract_pages(iter, &pages, max_extract,
  1264				nr_pages, extraction_flags, &offset);
  1265		if (unlikely(size <= 0))
  1266			return size ? size : -EFAULT;
  1267	
  1268		nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
  1269	
  1270		if (bdev) {
  1271			size_t trim = size & (bdev_logical_block_size(bdev) - 1);
  1272			iov_iter_revert(iter, trim);
  1273			size -= trim;
  1274		}
  1275	
  1276		if (unlikely(!size)) {
  1277			ret = -EFAULT;
  1278			goto out;
  1279		}
  1280	
  1281		for (left = size, i = 0; left > 0; left -= len, i++) {
  1282			struct page *page = pages[i];
  1283	
  1284			len = min_t(size_t, PAGE_SIZE - offset, left);
  1285			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
  1286				ret = bio_iov_add_zone_append_page(bio, page, len,
  1287						offset);
  1288				if (ret)
  1289					break;
  1290			} else
  1291				bio_iov_add_page(bio, page, len, offset);
  1292	
  1293			offset = 0;
  1294		}
  1295	
  1296		iov_iter_revert(iter, left);
  1297	out:
  1298		while (i < nr_pages)
  1299			bio_release_page(bio, pages[i++]);
  1300	
  1301		return ret;
  1302	}
  1303	
On Fri, Nov 10, 2023 at 01:19:49PM +0800, ed.tsai@mediatek.com wrote:
> From: Ed Tsai <ed.tsai@mediatek.com>
>
> When an application performs a large IO, it fills and submits multiple
> full bios to the block layer. Referring to commit 07173c3ec276
> ("block: enable multipage bvecs"), the full bio size is no longer fixed
> at 1MB but can vary based on the physical memory layout.
>
> The size of the full bio no longer aligns with the maximum IO size of
> the queue. Therefore, in a 64MB read, you may see many unaligned bios
> being submitted.
>
> Executing the command to perform a 64MB read:
>
> 	dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct
>
> It demonstrates the submission of numerous unaligned bios:
>
> 	block_bio_queue: 254,52 R 2933336 + 2136
> 	block_bio_queue: 254,52 R 2935472 + 2152
> 	block_bio_queue: 254,52 R 2937624 + 2128
> 	block_bio_queue: 254,52 R 2939752 + 2160
>
> This patch limits the number of extracted pages to ensure that we
> submit the bio once we fill enough pages, preventing the block layer
> from splitting small I/Os in between.
>
> I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> resulted in a significant improvement in the Sequential test:
>
> Sequential Read (average of 5 rounds):
> Original: 3033.7 MB/sec
> Patched: 3520.9 MB/sec
>
> Sequential Write (average of 5 rounds):
> Original: 2225.4 MB/sec
> Patched: 2800.3 MB/sec
>
> Link: https://lore.kernel.org/linux-arm-kernel/20231025092255.27930-1-ed.tsai@mediatek.com/
> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
>
> ---
>  block/bio.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..8d3a112e68da 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,8 +1227,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_extraction_t extraction_flags = 0;
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> +	struct block_device *bdev = bio->bi_bdev;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
> +	ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
>  	ssize_t size, left;
>  	unsigned len, i = 0;
>  	size_t offset;
> @@ -1242,7 +1244,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> -	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
> +	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
>  		extraction_flags |= ITER_ALLOW_P2PDMA;
>
>  	/*
> @@ -1252,16 +1254,21 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * result to ensure the bio's total size is correct. The remainder of
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
> -	size = iov_iter_extract_pages(iter, &pages,
> -			UINT_MAX - bio->bi_iter.bi_size,
> +	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
> +		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
> +
> +		max_extract = bio->bi_iter.bi_size ?
> +			max - bio->bi_iter.bi_size & (max - 1) : max;
> +	}
> +	size = iov_iter_extract_pages(iter, &pages, max_extract,
>  			nr_pages, extraction_flags, &offset);

The above is just what I did in the 'slow path' of my patch v2 [1], and it
can't work well: applying the limit on every page extraction is usually
slow, and extracting pages should always be batched. For example:

1) build one ublk disk (suppose it is /dev/ublkb0) with max sectors of 32k:

	rublk add null --io-buf-size=16384 -q 2 [2]

2) run 64KB IO:

	fio --direct=1 --size=230G --bsrange=64k-64k --runtime=20 --numjobs=2 \
		--ioengine=libaio --iodepth=64 --iodepth_batch_submit=64 \
		--iodepth_batch_complete_min=64 --group_reporting=1 \
		--filename=/dev/ublkb0 --name=/dev/ublkb0-test-randread \
		--rw=randread

In my local VM, read bandwidth drops from 20GB/s to 3709MB/s in the above
fio test with this patch.

The point is that:

1) bio size alignment is only needed in case of multiple bios

2) bio size alignment is needed only when the current bio is about to
   become full

3) with multiple bvecs, it is hard to know beforehand how many pages can
   be held in the bvecs

In short, applying the alignment on every extraction is much less
efficient.

[1] https://lore.kernel.org/linux-block/202311100354.HYfqOQ7o-lkp@intel.com/T/#u
[2] install rublk via `cargo install --version=^0.1 rublk`;
    CONFIG_BLK_DEV_UBLK is required

Thanks,
Ming
diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..8d3a112e68da 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,8 +1227,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct block_device *bdev = bio->bi_bdev;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
+	ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset;
@@ -1242,7 +1244,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	/*
@@ -1252,16 +1254,21 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_extract_pages(iter, &pages,
-			UINT_MAX - bio->bi_iter.bi_size,
+	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
+		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
+
+		max_extract = bio->bi_iter.bi_size ?
+			max - bio->bi_iter.bi_size & (max - 1) : max;
+	}
+	size = iov_iter_extract_pages(iter, &pages, max_extract,
 			nr_pages, extraction_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	if (bdev) {
+		size_t trim = size & (bdev_logical_block_size(bdev) - 1);
 		iov_iter_revert(iter, trim);
 		size -= trim;
 	}