
[04/11] block: Add bio_for_each_thp_segment_all

Message ID 20200824151700.16097-5-willy@infradead.org (mailing list archive)
State New, archived
Series iomap/fs/block patches for 5.11

Commit Message

Matthew Wilcox Aug. 24, 2020, 3:16 p.m. UTC
Iterate once for each THP instead of once for each base page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/bio.h  | 13 +++++++++++++
 include/linux/bvec.h | 27 +++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

Comments

Christoph Hellwig Aug. 27, 2020, 8:44 a.m. UTC | #1
On Mon, Aug 24, 2020 at 04:16:53PM +0100, Matthew Wilcox (Oracle) wrote:
> Iterate once for each THP instead of once for each base page.

FYI, I've always been wondering if bio_for_each_segment_all is the
right interface for the I/O completions, because we generally don't
need the fake bvecs for each page.  Only the first page can have an
offset, and only the last page can end before the full page size.

It would seem way more efficient to just have a helper that extracts
the offset and end, and just use that in a loop that does the way
cheaper iteration over the physical addresses only.  This might (or
might not) be a good time to switch to that model for iomap.
Matthew Wilcox Aug. 31, 2020, 7:48 p.m. UTC | #2
On Thu, Aug 27, 2020 at 09:44:31AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 24, 2020 at 04:16:53PM +0100, Matthew Wilcox (Oracle) wrote:
> > Iterate once for each THP instead of once for each base page.
> 
> FYI, I've always been wondering if bio_for_each_segment_all is the
> right interface for the I/O completions, because we generally don't
> need the fake bvecs for each page.  Only the first page can have an
> offset, and only the last page can end before the full page size.
> 
> It would seem way more efficient to just have a helper that extracts
> the offset and end, and just use that in a loop that does the way
> cheaper iteration over the physical addresses only.  This might (or
> might not) be a good time to switch to that model for iomap.

Something like this?

static void iomap_read_end_io(struct bio *bio)
{
        int i, error = blk_status_to_errno(bio->bi_status);

        for (i = 0; i < bio->bi_vcnt; i++) {
                struct bio_vec *bvec = &bio->bi_io_vec[i];
                size_t offset = bvec->bv_offset;
                size_t length = bvec->bv_len;
                struct page *page = bvec->bv_page;

                while (length > 0) { 
                        size_t count = thp_size(page) - offset;
                        
                        if (count > length)
                                count = length;
                        iomap_read_page_end_io(page, offset, count, error);
                        page += (offset + count) / PAGE_SIZE;
                        length -= count;
                        offset = 0;
                }
        }

        bio_put(bio);
}

Maybe I'm missing something important here, but it's significantly
simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
(256 bytes less!). iomap_read_page_end_io is inlined into it both before
and after.

There is some weirdness going on with regards to bv_offset that I don't
quite understand.  In the original bvec_advance:

                bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
                bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;

which I cargo-culted into bvec_thp_advance as:

                bv->bv_page = thp_head(bvec->bv_page +
                                (bvec->bv_offset >> PAGE_SHIFT));
                page_size = thp_size(bv->bv_page);
                bv->bv_offset = bvec->bv_offset -
                                (bv->bv_page - bvec->bv_page) * PAGE_SIZE;

Is it possible to have a bvec with an offset that is larger than the
size of bv_page?  That doesn't seem like a useful thing to do, but
if that needs to be supported, then the code up top doesn't handle it.
We maybe gain a little bit by counting length down to 0 instead of
counting it up to bv_len.  I dunno; reading the code over now, it
doesn't seem like that much of a difference.
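
To make that concrete, here is a quick standalone sketch of the
normalisation, assuming 4kB pages and plain integer page indices in
place of struct page pointers (the thp_head() part is left out):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	/* A bvec whose offset spills past its first page. */
	unsigned long bv_page = 100;		/* index of bvec->bv_page */
	unsigned long bv_offset = 9000;		/* deliberately > PAGE_SIZE */

	/* The same arithmetic as the "else" branch quoted above. */
	unsigned long page = bv_page + (bv_offset >> PAGE_SHIFT);
	unsigned long offset = bv_offset - (page - bv_page) * PAGE_SIZE;

	/* Prints "page 102, offset 808". */
	printf("page %lu, offset %lu\n", page, offset);
	return 0;
}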

Maybe you meant something different?
Christoph Hellwig Sept. 1, 2020, 5:34 a.m. UTC | #3
On Mon, Aug 31, 2020 at 08:48:37PM +0100, Matthew Wilcox wrote:
> static void iomap_read_end_io(struct bio *bio)
> {
>         int i, error = blk_status_to_errno(bio->bi_status);
> 
>         for (i = 0; i < bio->bi_vcnt; i++) {
>                 struct bio_vec *bvec = &bio->bi_io_vec[i];

This should probably use bio_for_each_bvec_all instead of directly
poking into the bio.  I'd also be tempted to move the loop body into
a separate helper, but that's just a slight stylistic preference.

>                 size_t offset = bvec->bv_offset;
>                 size_t length = bvec->bv_len;
>                 struct page *page = bvec->bv_page;
> 
>                 while (length > 0) { 
>                         size_t count = thp_size(page) - offset;
>                         
>                         if (count > length)
>                                 count = length;
>                         iomap_read_page_end_io(page, offset, count, error);
>                         page += (offset + count) / PAGE_SIZE;

Shouldn't the page_size here be thp_size?

> Maybe I'm missing something important here, but it's significantly
> simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
> (256 bytes less!). iomap_read_page_end_io is inlined into it both before
> and after.

Yes, that's exactly why I think avoiding bio_for_each_segment_all is
a good idea in general.

> There is some weirdness going on with regards to bv_offset that I don't
> quite understand.  In the original bvec_advance:
> 
>                 bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
>                 bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;
> 
> which I cargo-culted into bvec_thp_advance as:
> 
>                 bv->bv_page = thp_head(bvec->bv_page +
>                                 (bvec->bv_offset >> PAGE_SHIFT));
>                 page_size = thp_size(bv->bv_page);
>                 bv->bv_offset = bvec->bv_offset -
>                                 (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
> 
> Is it possible to have a bvec with an offset that is larger than the
> size of bv_page?  That doesn't seem like a useful thing to do, but
> if that needs to be supported, then the code up top doesn't handle it.
> We maybe gain a little bit by counting length down to 0 instead of
> counting it up to bv_len.  I dunno; reading the code over now, it
> doesn't seem like that much of a difference.

Drivers can absolutely see a bv_offset that is larger due to bio
splitting.  However the submitting file system should never see one
unless it creates one, which would be stupid.

And yes, eventually bv_page and bv_offset should be replaced with a

	phys_addr_t		bv_phys;

and life would become simpler in many places (and the bvec would
shrink for most common setups as well).

For now I'd end up with something like:

static void iomap_read_end_bvec(struct page *page, size_t offset,
		size_t length, int error)
{
	while (length > 0) {
		size_t page_size = thp_size(page);
		size_t count = min(page_size - offset, length);

		iomap_read_page_end_io(page, offset, count, error);

		page += (offset + count) / page_size;
		length -= count;
		offset = 0;
	}
}

static void iomap_read_end_io(struct bio *bio)
{
	int i, error = blk_status_to_errno(bio->bi_status);
	struct bio_vec *bvec;

	bio_for_each_bvec_all(bvec, bio, i)
		iomap_read_end_bvec(bvec->bv_page, bvec->bv_offset,
				    bvec->bv_len, error);
	bio_put(bio);
}

and maybe even merge iomap_read_page_end_io into iomap_read_end_bvec.
Matthew Wilcox Sept. 1, 2020, 1:05 p.m. UTC | #4
On Tue, Sep 01, 2020 at 06:34:26AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 31, 2020 at 08:48:37PM +0100, Matthew Wilcox wrote:
> > static void iomap_read_end_io(struct bio *bio)
> > {
> >         int i, error = blk_status_to_errno(bio->bi_status);
> > 
> >         for (i = 0; i < bio->bi_vcnt; i++) {
> >                 struct bio_vec *bvec = &bio->bi_io_vec[i];
> 
> This should probably use bio_for_each_bvec_all instead of directly
> poking into the bio.  I'd also be tempted to move the loop body into
> a separate helper, but that's just a slight stylistic preference.

Ah, got it.

> >                 size_t offset = bvec->bv_offset;
> >                 size_t length = bvec->bv_len;
> >                 struct page *page = bvec->bv_page;
> > 
> >                 while (length > 0) { 
> >                         size_t count = thp_size(page) - offset;
> >                         
> >                         if (count > length)
> >                                 count = length;
> >                         iomap_read_page_end_io(page, offset, count, error);
> >                         page += (offset + count) / PAGE_SIZE;
> 
> Shouldn't the page_size here be thp_size?

No.  Let's suppose we have a 20kB I/O which starts on a page boundary and
the first page is order-2.  To get from the first head page to the second
page, we need to add 4, which is 16kB / 4kB, not 16kB / 16kB.
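
Worked through as a standalone sketch, with plain integers standing in
for struct page pointers and the two head-page sizes (16kB, then 4kB)
hard-coded for this hypothetical layout:

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	/* 20kB read: an order-2 (16kB) head page followed by a 4kB page. */
	unsigned long thp_size[5] = { 16384, 0, 0, 0, 4096 };
	unsigned long page = 0, offset = 0, length = 20480;

	while (length > 0) {
		unsigned long count = thp_size[page] - offset;

		if (count > length)
			count = length;
		printf("page %lu offset %lu count %lu\n", page, offset, count);
		/* Advance in units of base pages: 16384 / 4096 == 4. */
		page += (offset + count) / PAGE_SIZE;
		length -= count;
		offset = 0;
	}
	/* Prints "page 0 offset 0 count 16384", then "page 4 offset 0 count 4096". */
	return 0;
}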

> > Maybe I'm missing something important here, but it's significantly
> > simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
> > (256 bytes less!). iomap_read_page_end_io is inlined into it both before
> > and after.
> 
> Yes, that's exactly why I think avoiding bio_for_each_segment_all is
> a good idea in general.

I took out all the attempts to cope with insane bv_offset to compare like
with like and I got the bio_for_each_thp_segment_all() version down to
656 bytes:

@@ -166,21 +166,15 @@ static inline void bvec_thp_advance(const struct bio_vec *bvec,
                                struct bvec_iter_all *iter_all)
 {
        struct bio_vec *bv = &iter_all->bv;
-       unsigned int page_size;
 
        if (iter_all->done) {
                bv->bv_page += thp_nr_pages(bv->bv_page);
-               page_size = thp_size(bv->bv_page);
                bv->bv_offset = 0;
        } else {
-               bv->bv_page = thp_head(bvec->bv_page +
-                               (bvec->bv_offset >> PAGE_SHIFT));
-               page_size = thp_size(bv->bv_page);
-               bv->bv_offset = bvec->bv_offset -
-                               (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
-               BUG_ON(bv->bv_offset >= page_size);
+               bv->bv_page = bvec->bv_page;
+               bv->bv_offset = bvec->bv_offset;
        }
-       bv->bv_len = min(page_size - bv->bv_offset,
+       bv->bv_len = min_t(unsigned int, thp_size(bv->bv_page) - bv->bv_offset,
                         bvec->bv_len - iter_all->done);
        iter_all->done += bv->bv_len;

> And yes, eventually bv_page and bv_offset should be replaced with a
> 
> 	phys_addr_t		bv_phys;
> 
> and life would become simpler in many places (and the bvec would
> shrink for most common setups as well).

I'd very much like to see that.  It causes quite considerable pain for
our virtualisation people that we need a struct page.  They'd like the
hypervisor to not have struct pages for the guest's memory, but if they
don't have them, they can't do I/O to them.  Perhaps I'll try getting
one of them to work on this.

I'm not entirely sure the bvec would shrink.  On 64-bit systems, it's
currently 8 bytes for the struct page, 4 bytes for the len and 4 bytes
for the offset.  Sure, we can get rid of the offset, but the compiler
will just pad the struct from 12 bytes back to 16.  On 32-bit systems
with 32-bit phys_addr_t, we go from 12 bytes down to 8, but most 32-bit
systems have a 64-bit phys_addr_t these days, don't they?
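
A quick sizeof sketch of the two layouts -- these are just stand-in
structs to show the padding, not the real definitions:

#include <stdio.h>

/* Roughly today's bio_vec: pointer + len + offset. */
struct bvec_page {
	void		*bv_page;	/* stand-in for struct page * */
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

/* The proposed variant: physical address + len, offset folded in. */
struct bvec_phys {
	unsigned long long	bv_phys;	/* stand-in for phys_addr_t */
	unsigned int		bv_len;
};

int main(void)
{
	/*
	 * On LP64 both print 16; if bv_phys were a 32-bit phys_addr_t,
	 * the second layout would be 8 bytes where the first is 12.
	 */
	printf("page/offset: %zu, phys: %zu\n",
	       sizeof(struct bvec_page), sizeof(struct bvec_phys));
	return 0;
}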

> For now I'd end up with something like:
> 
> static void iomap_read_end_bvec(struct page *page, size_t offset,
> 		size_t length, int error)
> {
> 	while (length > 0) {
> 		size_t page_size = thp_size(page);
> 		size_t count = min(page_size - offset, length);
> 
> 		iomap_read_page_end_io(page, offset, count, error);
> 
> 		page += (offset + count) / page_size;
> 		length -= count;
> 		offset = 0;
> 	}
> }
> 
> static void iomap_read_end_io(struct bio *bio)
> {
> 	int i, error = blk_status_to_errno(bio->bi_status);
> 	struct bio_vec *bvec;
> 
> 	bio_for_each_bvec_all(bvec, bio, i)
> 		iomap_read_end_bvec(bvec->bv_page, bvec->bv_offset,
> 				    bvec->bv_len, error);
> 	bio_put(bio);
> }
> 
> and maybe even merge iomap_read_page_end_io into iomap_read_end_bvec.

The lines start to get a bit long.  Here's what I currently have on
the write side:

@@ -1104,6 +1117,20 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
                end_page_writeback(page);
 }
 
+static void iomap_finish_bvec_write(struct inode *inode, struct page *page,
+               size_t offset, size_t length, int error)
+{
+       while (length > 0) {
+               size_t count = min(thp_size(page) - offset, length);
+
+               iomap_finish_page_writeback(inode, page, error, count);
+
+               page += (offset + count) / PAGE_SIZE;
+               offset = 0;
+               length -= count;
+       }
+}
+
 /*
  * We're now finished for good with this ioend structure.  Update the page
  * state, release holds on bios, and finally free up memory.  Do not use the
@@ -1121,7 +1148,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 
        for (bio = &ioend->io_inline_bio; bio; bio = next) {
                struct bio_vec *bv;
-               struct bvec_iter_all iter_all;
+               int i;
 
                /*
                 * For the last bio, bi_private points to the ioend, so we
@@ -1133,9 +1160,9 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
                        next = bio->bi_private;
 
                /* walk each page on bio, ending page IO on them */
-               bio_for_each_thp_segment_all(bv, bio, iter_all)
-                       iomap_finish_page_writeback(inode, bv->bv_page, error,
-                                       bv->bv_len);
+               bio_for_each_bvec_all(bv, bio, i)
+                       iomap_finish_bvec_write(inode, bv->bv_page,
+                                       bv->bv_offset, bv->bv_len, error);
                bio_put(bio);
        }
        /* The ioend has been freed by bio_put() */

That's a bit more boilerplate than I'd like, but if bio_vec is going to
lose its bv_page then I don't see a better way.  Unless we come up with
a different page/offset/length struct that bio_vecs are decomposed into.
Christoph Hellwig Sept. 1, 2020, 2:50 p.m. UTC | #5
On Tue, Sep 01, 2020 at 02:05:25PM +0100, Matthew Wilcox wrote:
> > >                 struct page *page = bvec->bv_page;
> > > 
> > >                 while (length > 0) { 
> > >                         size_t count = thp_size(page) - offset;
> > >                         
> > >                         if (count > length)
> > >                                 count = length;
> > >                         iomap_read_page_end_io(page, offset, count, error);
> > >                         page += (offset + count) / PAGE_SIZE;
> > 
> > Shouldn't the page_size here be thp_size?
> 
> No.  Let's suppose we have a 20kB I/O which starts on a page boundary and
> the first page is order-2.  To get from the first head page to the second
> page, we need to add 4, which is 16kB / 4kB, not 16kB / 16kB.

True.

> I'm not entirely sure the bvec would shrink.  On 64-bit systems, it's
> currently 8 bytes for the struct page, 4 bytes for the len and 4 bytes
> for the offset.  Sure, we can get rid of the offset, but the compiler
> will just pad the struct from 12 bytes back to 16.  On 32-bit systems
> with 32-bit phys_addr_t, we go from 12 bytes down to 8, but most 32-bit
> systems have a 64-bit phys_addr_t these days, don't they?

Actually, on those systems that are still 32-bit because they are so
tiny, I'd very much still expect a 32-bit phys_addr_t.  E.g. arm
without LPAE or 32-bit RISC-V.

But yeah, point taken on the alignment for the 64-bit ones.

> That's a bit more boilerplate than I'd like, but if bio_vec is going to
> lose its bv_page then I don't see a better way.  Unless we come up with
> a different page/offset/length struct that bio_vecs are decomposed into.

I'm not sure it is going to lose bv_page any time soon.  I'd sure like
to, but last time something like that came up Linus wasn't entirely
in favor.  Things might have changed now, though, and I think it is about
time to give it another try.

Patch

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..a0e104910097 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -129,12 +129,25 @@  static inline bool bio_next_segment(const struct bio *bio,
 	return true;
 }
 
+static inline bool bio_next_thp_segment(const struct bio *bio,
+				    struct bvec_iter_all *iter)
+{
+	if (iter->idx >= bio->bi_vcnt)
+		return false;
+
+	bvec_thp_advance(&bio->bi_io_vec[iter->idx], iter);
+	return true;
+}
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
 #define bio_for_each_segment_all(bvl, bio, iter) \
 	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_thp_segment_all(bvl, bio, iter) \
+	for (bvl = bvec_init_iter_all(&iter); \
+	     bio_next_thp_segment((bio), &iter); )
 
 static inline void bio_advance_iter(const struct bio *bio,
 				    struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ac0c7299d5b8..ea8a37a7515b 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -162,4 +162,31 @@  static inline void bvec_advance(const struct bio_vec *bvec,
 	}
 }
 
+static inline void bvec_thp_advance(const struct bio_vec *bvec,
+				struct bvec_iter_all *iter_all)
+{
+	struct bio_vec *bv = &iter_all->bv;
+	unsigned int page_size;
+
+	if (iter_all->done) {
+		bv->bv_page += thp_nr_pages(bv->bv_page);
+		page_size = thp_size(bv->bv_page);
+		bv->bv_offset = 0;
+	} else {
+		bv->bv_page = thp_head(bvec->bv_page +
+				(bvec->bv_offset >> PAGE_SHIFT));
+		page_size = thp_size(bv->bv_page);
+		bv->bv_offset = bvec->bv_offset -
+				(bv->bv_page - bvec->bv_page) * PAGE_SIZE;
+		BUG_ON(bv->bv_offset >= page_size);
+	}
+	bv->bv_len = min(page_size - bv->bv_offset,
+			 bvec->bv_len - iter_all->done);
+	iter_all->done += bv->bv_len;
+
+	if (iter_all->done == bvec->bv_len) {
+		iter_all->idx++;
+		iter_all->done = 0;
+	}
+}
 #endif /* __LINUX_BVEC_ITER_H */
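
For illustration only, a sketch of how a read completion handler might
walk a bio with the new iterator; handle_thp_segment() is a made-up
per-THP callback, not something defined in this series:

static void example_read_end_io(struct bio *bio)
{
	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;
	int error = blk_status_to_errno(bio->bi_status);

	/* One iteration per THP rather than per base page. */
	bio_for_each_thp_segment_all(bvec, bio, iter_all)
		handle_thp_segment(bvec->bv_page, bvec->bv_offset,
				   bvec->bv_len, error);

	bio_put(bio);
}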