diff mbox series

[RFC] fs: block_dev: compute nr_vecs hint for improving writeback bvecs allocation

Message ID 20210105132647.3818503-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] fs: block_dev: compute nr_vecs hint for improving writeback bvecs allocation | expand

Commit Message

Ming Lei Jan. 5, 2021, 1:26 p.m. UTC
Writeback code always allocates bio with max allowed bvecs(BIO_MAX_PAGES)
since it is hard to know in advance how many pages will be written in
single bio. And BIO_MAX_PAGES bvecs takes one 4K page.

On the other hand, for workloads of random IO, most of writeback bios just
uses <= 4 bvecs; for workloads of sequential IO, attributes to multipage
bvec, quite a lot of pages are contiguous because of space/time locality
for pages in sequential IO workloads, then either nr_bvecs is small enough
or size of bio is very big. So in reality, it is quite often to see most
of writeback bios just uses <=4 bvecs, which can be allocated inline. This
can be observed in normal tests(kernel build, git clone kernel, dbench,
...)

So improve bvec allocation by using Exponential Weighted Moving Average(
EWMA) to compute nr_bvecs hint for writeback bio. With this simple
approach, it is observed[1] that most of writeback bios just need <=4
nr_bvecs in normal workloads, then bvecs can be allocated inline with bio
together, meantime one extra 4k allocation is avoided.

[1] bpftrace script for observing writeback .bi_max_vcnt & .bi_vcnt
histogram
http://people.redhat.com/minlei/tests/tools/wb_vcnt.bt

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/block_dev.c            |  1 +
 fs/iomap/buffered-io.c    | 13 +++++++++----
 include/linux/bio.h       |  2 --
 include/linux/blk_types.h | 31 +++++++++++++++++++++++++++++++
 4 files changed, 41 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig Jan. 5, 2021, 6:39 p.m. UTC | #1
At least for iomap I think this is the wrong approach.  Between the
iomap and writeback_control we know the maximum size of the writeback
request and can just use that.  Take a look at what iomap_readpage_actor
does on the read size, something similar taking the wbc into account
can be done on the writeback side.
Ming Lei Jan. 6, 2021, 8:45 a.m. UTC | #2
On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> At least for iomap I think this is the wrong approach.  Between the
> iomap and writeback_control we know the maximum size of the writeback
> request and can just use that.

I think writeback_control can tell us nothing about max pages in single
bio:

- wbc->nr_to_write controls how many pages to writeback, this pages
  usually don't belong to same bio. Also this number is often much
  bigger than BIO_MAX_PAGES.

- wbc->range_start/range_end is similar too, which is often much more
  bigger than BIO_MAX_PAGES.

Also page/blocks_in_page can be mapped to different extent too, which is
only available when wpc->ops->map_blocks() is returned, which looks not
different with mpage_writepages(), in which bio is allocated with
BIO_MAX_PAGES vecs too.

Or you mean we can use iomap->length for this purpose? But iomap->length
still is still too big in case of xfs.
Dave Chinner Jan. 6, 2021, 10:21 p.m. UTC | #3
On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > At least for iomap I think this is the wrong approach.  Between the
> > iomap and writeback_control we know the maximum size of the writeback
> > request and can just use that.
> 
> I think writeback_control can tell us nothing about max pages in single
> bio:

By definition, the iomap tells us exactly how big the IO is going to
be. i.e. an iomap spans a single contiguous range that we are going
to issue IO on. Hence we can use that to size the bio exactly
right for direct IO.

> - wbc->nr_to_write controls how many pages to writeback, this pages
>   usually don't belong to same bio. Also this number is often much
>   bigger than BIO_MAX_PAGES.
> 
> - wbc->range_start/range_end is similar too, which is often much more
>   bigger than BIO_MAX_PAGES.
> 
> Also page/blocks_in_page can be mapped to different extent too, which is
> only available when wpc->ops->map_blocks() is returned,

We only allocate the bio -after- calling ->map_blocks() to obtain
the iomap for the given writeback range request. Hence we
already know how large the BIO could be before we allocate it.

> which looks not
> different with mpage_writepages(), in which bio is allocated with
> BIO_MAX_PAGES vecs too.

__mpage_writepage() only maps a page at a time, so it can't tell
ahead of time how big the bio is going to need to be as it doesn't
return/cache a contiguous extent range. So it's actually very
different to the iomap writeback code, and effectively does require
a BIO_MAX_PAGES vecs allocation all the time...

> Or you mean we can use iomap->length for this purpose? But iomap->length
> still is still too big in case of xfs.

if we are doing small random writeback into large extents (i.e.
iomap->length is large), then it is trivial to detect that we are
doing random writes rather than sequential writes by checking if the
current page is sequential to the last sector in the current bio.
We already do this non-sequential IO checking to determine if a new
bio needs to be allocated in iomap_can_add_to_ioend(), and we also
know how large the current contiguous range mapped into the current
bio chain is (ioend->io_size). Hence we've got everything we need to
determine whether we should do a large or small bio vec allocation
in the iomap writeback path...

Cheers,

Dave.
Ming Lei Jan. 8, 2021, 7:59 a.m. UTC | #4
On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote:
> On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > > At least for iomap I think this is the wrong approach.  Between the
> > > iomap and writeback_control we know the maximum size of the writeback
> > > request and can just use that.
> > 
> > I think writeback_control can tell us nothing about max pages in single
> > bio:
> 
> By definition, the iomap tells us exactly how big the IO is going to
> be. i.e. an iomap spans a single contiguous range that we are going
> to issue IO on. Hence we can use that to size the bio exactly
> right for direct IO.

When I trace wpc->iomap.length in iomap_add_to_ioend() on the following fio
randwrite/write, the length is 1GB most of times, maybe because it is
one fresh XFS.

fio --size=1G --bsrange=4k-4k --runtime=30 --numjobs=2 --ioengine=psync --iodepth=32 \
	--directory=$DIR --group_reporting=1 --unlink=0 --direct=0 --fsync=0 --name=f1 \
	--stonewall --rw=$RW
sync

Another reason is that pages in the range may be contiguous physically,
so lots of pages may share one single bvec.

> 
> > - wbc->nr_to_write controls how many pages to writeback, this pages
> >   usually don't belong to same bio. Also this number is often much
> >   bigger than BIO_MAX_PAGES.
> > 
> > - wbc->range_start/range_end is similar too, which is often much more
> >   bigger than BIO_MAX_PAGES.
> > 
> > Also page/blocks_in_page can be mapped to different extent too, which is
> > only available when wpc->ops->map_blocks() is returned,
> 
> We only allocate the bio -after- calling ->map_blocks() to obtain
> the iomap for the given writeback range request. Hence we
> already know how large the BIO could be before we allocate it.
> 
> > which looks not
> > different with mpage_writepages(), in which bio is allocated with
> > BIO_MAX_PAGES vecs too.
> 
> __mpage_writepage() only maps a page at a time, so it can't tell
> ahead of time how big the bio is going to need to be as it doesn't
> return/cache a contiguous extent range. So it's actually very
> different to the iomap writeback code, and effectively does require
> a BIO_MAX_PAGES vecs allocation all the time...
> 
> > Or you mean we can use iomap->length for this purpose? But iomap->length
> > still is still too big in case of xfs.
> 
> if we are doing small random writeback into large extents (i.e.
> iomap->length is large), then it is trivial to detect that we are
> doing random writes rather than sequential writes by checking if the
> current page is sequential to the last sector in the current bio.
> We already do this non-sequential IO checking to determine if a new
> bio needs to be allocated in iomap_can_add_to_ioend(), and we also
> know how large the current contiguous range mapped into the current
> bio chain is (ioend->io_size). Hence we've got everything we need to
> determine whether we should do a large or small bio vec allocation
> in the iomap writeback path...

page->index should tell us if the workload is random or sequential, however
still not easy to decide how many pages there will be in the next bio
when iomap->length is large.


Thanks,
Ming
Dave Chinner Jan. 8, 2021, 9 p.m. UTC | #5
On Fri, Jan 08, 2021 at 03:59:22PM +0800, Ming Lei wrote:
> On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> > > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > > > At least for iomap I think this is the wrong approach.  Between the
> > > > iomap and writeback_control we know the maximum size of the writeback
> > > > request and can just use that.
> > > 
> > > I think writeback_control can tell us nothing about max pages in single
> > > bio:
> > 
> > By definition, the iomap tells us exactly how big the IO is going to
> > be. i.e. an iomap spans a single contiguous range that we are going
> > to issue IO on. Hence we can use that to size the bio exactly
> > right for direct IO.
> 
> When I trace wpc->iomap.length in iomap_add_to_ioend() on the following fio
> randwrite/write, the length is 1GB most of times, maybe because it is
> one fresh XFS.

Yes, that's exactly what I said it would do.

> Another reason is that pages in the range may be contiguous physically,
> so lots of pages may share one single bvec.

The iomap layer does not care about this, and there's no way this
can be detected ahead of time, anyway, because we are only passed a
single page at a time. When we get large pages from the page cache,
we'll still only get one page at a time, but we'll get physically
contiguous pages and so it will still be a 1 page : 1 bvec
relationship at the iomap layer.

> > > - wbc->nr_to_write controls how many pages to writeback, this pages
> > >   usually don't belong to same bio. Also this number is often much
> > >   bigger than BIO_MAX_PAGES.
> > > 
> > > - wbc->range_start/range_end is similar too, which is often much more
> > >   bigger than BIO_MAX_PAGES.
> > > 
> > > Also page/blocks_in_page can be mapped to different extent too, which is
> > > only available when wpc->ops->map_blocks() is returned,
> > 
> > We only allocate the bio -after- calling ->map_blocks() to obtain
> > the iomap for the given writeback range request. Hence we
> > already know how large the BIO could be before we allocate it.
> > 
> > > which looks not
> > > different with mpage_writepages(), in which bio is allocated with
> > > BIO_MAX_PAGES vecs too.
> > 
> > __mpage_writepage() only maps a page at a time, so it can't tell
> > ahead of time how big the bio is going to need to be as it doesn't
> > return/cache a contiguous extent range. So it's actually very
> > different to the iomap writeback code, and effectively does require
> > a BIO_MAX_PAGES vecs allocation all the time...
> > 
> > > Or you mean we can use iomap->length for this purpose? But iomap->length
> > > still is still too big in case of xfs.
> > 
> > if we are doing small random writeback into large extents (i.e.
> > iomap->length is large), then it is trivial to detect that we are
> > doing random writes rather than sequential writes by checking if the
> > current page is sequential to the last sector in the current bio.
> > We already do this non-sequential IO checking to determine if a new
> > bio needs to be allocated in iomap_can_add_to_ioend(), and we also
> > know how large the current contiguous range mapped into the current
> > bio chain is (ioend->io_size). Hence we've got everything we need to
> > determine whether we should do a large or small bio vec allocation
> > in the iomap writeback path...
> 
> page->index should tell us if the workload is random or sequential, however
> still not easy to decide how many pages there will be in the next bio
> when iomap->length is large.

page->index doesn't tell us anything about what type of IO is being
done - it just tells us where in the file we need to map to find the
physical block we need to write it to. OTOH, the iomap writeback
context contains all the information about current IO being build -
offset, size, current bio, etc - and the page->index gets compared
against the state in the iomap writepage context.

So, if the wpc->iomap.length is large, current page->index does not
map sequentially to the end of wpc->ioend->io_bio (or
wpc->io_end->io_offset + wpc->ioend->io_size) and
wpc->io_end->io_size == page_size(page) for the currently held bio,
then we are clearly doing random single page writeback into a large
allocated extent. Hence in that case we can do small bvec
allocations for the new bio.

Sure, the first bio in a ->writepages invocation doesn't have this
information, so we've going to have to assume BIO_MAX_PAGES for the
first bio. But for every bio after that in the ->writepages
invocation we have the state of the previous contiguous writeback
range held in the wpc structure and can use that info to optimise
the thousands of random single pages that are written after then
first one...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..0490f16a96db 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -880,6 +880,7 @@  struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 #ifdef CONFIG_SYSFS
 	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+	bdev->bd_wb_nr_vecs_hint = BIO_MAX_PAGES / 2;
 	bdev->bd_stats = alloc_percpu(struct disk_stats);
 	if (!bdev->bd_stats) {
 		iput(inode);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16a1e82e3aeb..1e031af182db 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1210,6 +1210,7 @@  iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend,
 		return error;
 	}
 
+	bdev_update_wb_nr_vecs_hint(wpc->iomap.bdev, ioend->io_bio);
 	submit_bio(ioend->io_bio);
 	return 0;
 }
@@ -1221,7 +1222,8 @@  iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 	struct iomap_ioend *ioend;
 	struct bio *bio;
 
-	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &iomap_ioend_bioset);
+	bio = bio_alloc_bioset(GFP_NOFS, bdev_wb_nr_vecs_hint(wpc->iomap.bdev),
+			&iomap_ioend_bioset);
 	bio_set_dev(bio, wpc->iomap.bdev);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
@@ -1248,11 +1250,13 @@  iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
  * traversal in iomap_finish_ioend().
  */
 static struct bio *
-iomap_chain_bio(struct bio *prev)
+iomap_chain_bio(struct bio *prev, struct block_device *bdev)
 {
 	struct bio *new;
 
-	new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
+	bdev_update_wb_nr_vecs_hint(bdev, prev);
+
+	new = bio_alloc(GFP_NOFS, bdev_wb_nr_vecs_hint(bdev));
 	bio_copy_dev(new, prev);/* also copies over blkcg information */
 	new->bi_iter.bi_sector = bio_end_sector(prev);
 	new->bi_opf = prev->bi_opf;
@@ -1308,7 +1312,8 @@  iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 	if (!merged) {
 		if (bio_full(wpc->ioend->io_bio, len)) {
 			wpc->ioend->io_bio =
-				iomap_chain_bio(wpc->ioend->io_bio);
+				iomap_chain_bio(wpc->ioend->io_bio,
+						wpc->iomap.bdev);
 		}
 		bio_add_page(wpc->ioend->io_bio, page, len, poff);
 	}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 70914dd6a70d..ad0cd1a2abbe 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -19,8 +19,6 @@ 
 #define BIO_BUG_ON
 #endif
 
-#define BIO_MAX_PAGES		256
-
 #define bio_prio(bio)			(bio)->bi_ioprio
 #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..078e212f5e1f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -24,6 +24,7 @@  struct block_device {
 	sector_t		bd_start_sect;
 	struct disk_stats __percpu *bd_stats;
 	unsigned long		bd_stamp;
+	unsigned short		bd_wb_nr_vecs_hint;
 	bool			bd_read_only;	/* read-only policy */
 	dev_t			bd_dev;
 	int			bd_openers;
@@ -307,6 +308,8 @@  enum {
 	BIO_FLAG_LAST
 };
 
+#define BIO_MAX_PAGES		256
+
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*
@@ -565,4 +568,32 @@  struct blk_rq_stat {
 	u64 batch;
 };
 
+/* called before submitting writeback bios */
+static inline void bdev_update_wb_nr_vecs_hint(struct block_device *bdev,
+		struct bio *bio)
+{
+	unsigned short nr_vecs;
+
+	if (!bdev)
+		return;
+
+	nr_vecs = bdev->bd_wb_nr_vecs_hint;
+	/*
+	 * If this bio is full, double current nr_vecs_hint for fast convergence.
+	 * Otherwise use Exponential Weighted Moving Average to figure out the
+	 * hint
+	 */
+	if (bio->bi_vcnt >= bio->bi_max_vecs)
+		nr_vecs *= 2;
+	else
+		nr_vecs = (nr_vecs * 3 + bio->bi_vcnt + 3) / 4;
+
+	bdev->bd_wb_nr_vecs_hint = clamp_val(nr_vecs, 1, BIO_MAX_PAGES);
+}
+
+static inline unsigned short bdev_wb_nr_vecs_hint(struct block_device *bdev)
+{
+	return bdev->bd_wb_nr_vecs_hint;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */