[v2,1/2] iov_iter: optimise iov_iter_npages for bvec

Message ID ab04202d0f8c1424da47251085657c436d762785.1605827965.git.asml.silence@gmail.com (mailing list archive)
State New, archived

Commit Message

Pavel Begunkov Nov. 19, 2020, 11:24 p.m. UTC
The block layer spends quite a while in iov_iter_npages(), but for the
bvec case the number of pages is already known and stored in
iter->nr_segs, so it can be returned immediately as an optimisation

Perf for an io_uring benchmark with registered buffers (i.e. bvec) shows
~1.5-2.0% total cycle count spent in iov_iter_npages(), that's dropped
by this patch to ~0.2%.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 lib/iov_iter.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox Nov. 20, 2020, 1:20 a.m. UTC | #1
On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> The block layer spends quite a while in iov_iter_npages(), but for the
> bvec case the number of pages is already known and stored in
> iter->nr_segs, so it can be returned immediately as an optimisation

Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
store up to 4GB of contiguous physical memory.
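
To make the objection concrete, here is a minimal userspace sketch (not kernel code). `struct seg`, `MODEL_PAGE_SIZE`, `seg_pages()` and `total_pages()` are hypothetical stand-ins for `struct bio_vec`, `PAGE_SIZE` and the page-counting logic, and a 4 KiB page size is assumed. It only illustrates that the number of segments and the number of pages are different quantities.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PAGE_SIZE; 4 KiB is an assumption. */
#define MODEL_PAGE_SIZE 4096u

/* Hypothetical, simplified model of one bvec segment. */
struct seg {
	unsigned int offset;	/* byte offset into the first page */
	unsigned int len;	/* length of the segment in bytes */
};

/* Number of pages touched by a single segment. */
unsigned int seg_pages(const struct seg *s)
{
	unsigned int first, last;

	if (s->len == 0)
		return 0;
	first = s->offset / MODEL_PAGE_SIZE;
	last = (s->offset + s->len - 1) / MODEL_PAGE_SIZE;
	return last - first + 1;
}

/* Total pages across all segments; in general this is not nr_segs. */
unsigned int total_pages(const struct seg *segs, size_t nr_segs)
{
	unsigned int n = 0;
	size_t idx;

	assert(segs != NULL || nr_segs == 0);
	for (idx = 0; idx < nr_segs; idx++)
		n += seg_pages(&segs[idx]);
	return n;
}
```

With this model, a single 64 KiB segment counts as one segment but sixteen pages, so returning the segment count would undercount.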
Pavel Begunkov Nov. 20, 2020, 1:39 a.m. UTC | #2
On 20/11/2020 01:20, Matthew Wilcox wrote:
> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>> The block layer spends quite a while in iov_iter_npages(), but for the
>> bvec case the number of pages is already known and stored in
>> iter->nr_segs, so it can be returned immediately as an optimisation
> 
> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
> store up to 4GB of contiguous physical memory.

Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
stupid statement. Thanks!

Are there many users of that? All these iterators are a huge burden,
just to count one 4KB page in bvec it takes 2% of CPU time for me.
Matthew Wilcox Nov. 20, 2020, 1:49 a.m. UTC | #3
On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >> The block layer spends quite a while in iov_iter_npages(), but for the
> >> bvec case the number of pages is already known and stored in
> >> iter->nr_segs, so it can be returned immediately as an optimisation
> > 
> > Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
> > store up to 4GB of contiguous physical memory.
> 
> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> stupid statement. Thanks!
> 
> Are there many users of that? All these iterators are a huge burden,
> just to count one 4KB page in bvec it takes 2% of CPU time for me.

__bio_try_merge_page() will create multipage BIOs, and that's
called from a number of places including
bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()

so ... yeah, it's used a lot.
Pavel Begunkov Nov. 20, 2020, 1:56 a.m. UTC | #4
On 20/11/2020 01:49, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>> bvec case the number of pages is already known and stored in
>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>
>>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
>>> store up to 4GB of contiguous physical memory.
>>
>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>> stupid statement. Thanks!
>>
>> Are there many users of that? All these iterators are a huge burden,
>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
> 
> __bio_try_merge_page() will create multipage BIOs, and that's
> called from a number of places including
> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()

I get that there are a lot of call sites; what's more interesting is how
often it's actually triggered and whether that's performance-critical for
anybody. Not that I'm going to change it, I'm just curious, but bvec.h
could be nicely optimised without it.

> 
> so ... yeah, it's used a lot.
>
Matthew Wilcox Nov. 20, 2020, 2:06 a.m. UTC | #5
On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:49, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> >> On 20/11/2020 01:20, Matthew Wilcox wrote:
> >>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >>>> The block layer spends quite a while in iov_iter_npages(), but for the
> >>>> bvec case the number of pages is already known and stored in
> >>>> iter->nr_segs, so it can be returned immediately as an optimisation
> >>>
> >>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
> >>> store up to 4GB of contiguous physical memory.
> >>
> >> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> >> stupid statement. Thanks!
> >>
> >> Are there many users of that? All these iterators are a huge burden,
> >> just to count one 4KB page in bvec it takes 2% of CPU time for me.
> > 
> > __bio_try_merge_page() will create multipage BIOs, and that's
> > called from a number of places including
> > bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
> 
> I get it that there are a lot of places, more interesting how often
> it's actually triggered and if that's performance critical for anybody.
> Not like I'm going to change it, just out of curiosity, but bvec.h
> can be nicely optimised without it.

Typically when you're allocating pages for the page cache, they'll get
allocated in order and then you'll read or write them in order, so yes,
it ends up triggering quite a lot.  There was once a bug in the page
allocator which caused them to get allocated in reverse order and it
was a noticeable performance hit (this was 15-20 years ago).
Pavel Begunkov Nov. 20, 2020, 2:08 a.m. UTC | #6
On 20/11/2020 02:06, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>> bvec case the number of pages is already known and stored in
>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>
>>>>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
>>>>> store up to 4GB of contiguous physical memory.
>>>>
>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>> stupid statement. Thanks!
>>>>
>>>> Are there many users of that? All these iterators are a huge burden,
>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>
>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>> called from a number of places including
>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>
>> I get it that there are a lot of places, more interesting how often
>> it's actually triggered and if that's performance critical for anybody.
>> Not like I'm going to change it, just out of curiosity, but bvec.h
>> can be nicely optimised without it.
> 
> Typically when you're allocating pages for the page cache, they'll get
> allocated in order and then you'll read or write them in order, so yes,
> it ends up triggering quite a lot.  There was once a bug in the page
> allocator which caused them to get allocated in reverse order and it
> was a noticeable performance hit (this was 15-20 years ago).

I see, thanks for a bit of insight
Ming Lei Nov. 20, 2020, 2:22 a.m. UTC | #7
On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> >> The block layer spends quite a while in iov_iter_npages(), but for the
> >> bvec case the number of pages is already known and stored in
> >> iter->nr_segs, so it can be returned immediately as an optimisation
> > 
> > Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
> > store up to 4GB of contiguous physical memory.
> 
> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> stupid statement. Thanks!
> 

iov_iter_npages() for the bvec case can still be improved a bit in the
following way:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..d85ed7acce05 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1608,17 +1608,23 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
 		if (npages >= maxpages)
 			return maxpages;
+	} else if (iov_iter_is_bvec(i)) {
+		unsigned idx, offset = i->iov_offset;
+
+		for (idx = 0; idx < i->nr_segs; idx++) {
+			npages += DIV_ROUND_UP(i->bvec[idx].bv_len - offset,
+					PAGE_SIZE);
+			offset = 0;
+		}
+		if (npages >= maxpages)
+			return maxpages;
 	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
 		if (npages >= maxpages)
 			return maxpages;
-	0;}),({
-		npages++;
-		if (npages >= maxpages)
-			return maxpages;
-	}),({
+	0;}),0,({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
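
The bvec branch in the diff above can be tried outside the kernel with a small userspace sketch. Everything prefixed `model_`/`MODEL_` is a hypothetical stand-in (for `struct bio_vec`, `struct iov_iter`, `PAGE_SIZE` and `DIV_ROUND_UP`), and a 4 KiB page size is assumed. Like the posted hunk, it rounds up only the remaining length of each segment, so a segment that starts part-way into a page and crosses a page boundary would be counted as one page fewer than it touches; the sketch mirrors that behaviour rather than correcting it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; 4 KiB pages are an assumption. */
#define MODEL_PAGE_SIZE 4096u
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct model_bvec {
	unsigned int bv_len;	/* segment length in bytes */
	unsigned int bv_offset;	/* unused here, as in the posted hunk */
};

struct model_iter {
	const struct model_bvec *bvec;
	size_t nr_segs;
	unsigned int iov_offset;	/* bytes already consumed from segment 0 */
};

/* Walk the segment array directly instead of iterating page by page. */
int model_bvec_npages(const struct model_iter *i, int maxpages)
{
	unsigned int offset = i->iov_offset;
	int npages = 0;
	size_t idx;

	assert(maxpages >= 0);
	for (idx = 0; idx < i->nr_segs; idx++) {
		npages += MODEL_DIV_ROUND_UP(i->bvec[idx].bv_len - offset,
					     MODEL_PAGE_SIZE);
		offset = 0;	/* only the first segment is partially consumed */
	}
	return npages >= maxpages ? maxpages : npages;
}
```

The cost is one division per segment rather than one iteration per page, which is where any saving for multi-page segments would come from.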
Ming Lei Nov. 20, 2020, 2:24 a.m. UTC | #8
On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
> > On 20/11/2020 01:49, Matthew Wilcox wrote:
> > > On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
> > >> On 20/11/2020 01:20, Matthew Wilcox wrote:
> > >>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
> > >>>> The block layer spends quite a while in iov_iter_npages(), but for the
> > >>>> bvec case the number of pages is already known and stored in
> > >>>> iter->nr_segs, so it can be returned immediately as an optimisation
> > >>>
> > >>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
> > >>> store up to 4GB of contiguous physical memory.
> > >>
> > >> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
> > >> stupid statement. Thanks!
> > >>
> > >> Are there many users of that? All these iterators are a huge burden,
> > >> just to count one 4KB page in bvec it takes 2% of CPU time for me.
> > > 
> > > __bio_try_merge_page() will create multipage BIOs, and that's
> > > called from a number of places including
> > > bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
> > 
> > I get it that there are a lot of places, more interesting how often
> > it's actually triggered and if that's performance critical for anybody.
> > Not like I'm going to change it, just out of curiosity, but bvec.h
> > can be nicely optimised without it.
> 
> Typically when you're allocating pages for the page cache, they'll get
> allocated in order and then you'll read or write them in order, so yes,
> it ends up triggering quite a lot.  There was once a bug in the page
> allocator which caused them to get allocated in reverse order and it
> > was a noticeable performance hit (this was 15-20 years ago).

Hugepage use cases can benefit a lot from this too.


Thanks,
Ming
Pavel Begunkov Nov. 20, 2020, 2:25 a.m. UTC | #9
On 20/11/2020 02:22, Ming Lei wrote:
> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>> bvec case the number of pages is already known and stored in
>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>
>>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
>>> store up to 4GB of contiguous physical memory.
>>
>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>> stupid statement. Thanks!
>>
> 
> iov_iter_npages(bvec) still can be improved a bit by the following way:

Yep, was doing exactly that, +a couple of other places that are in my way.

> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1635111c5bd2..d85ed7acce05 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1608,17 +1608,23 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>  		npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
>  		if (npages >= maxpages)
>  			return maxpages;
> +	} else if (iov_iter_is_bvec(i)) {
> +		unsigned idx, offset = i->iov_offset;
> +
> +		for (idx = 0; idx < i->nr_segs; idx++) {
> +			npages += DIV_ROUND_UP(i->bvec[idx].bv_len - offset,
> +					PAGE_SIZE);
> +			offset = 0;
> +		}
> +		if (npages >= maxpages)
> +			return maxpages;
>  	} else iterate_all_kinds(i, size, v, ({
>  		unsigned long p = (unsigned long)v.iov_base;
>  		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
>  			- p / PAGE_SIZE;
>  		if (npages >= maxpages)
>  			return maxpages;
> -	0;}),({
> -		npages++;
> -		if (npages >= maxpages)
> -			return maxpages;
> -	}),({
> +	0;}),0,({
>  		unsigned long p = (unsigned long)v.iov_base;
>  		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
>  			- p / PAGE_SIZE;
>
Matthew Wilcox Nov. 20, 2020, 2:54 a.m. UTC | #10
On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> On 20/11/2020 02:22, Ming Lei wrote:
> > iov_iter_npages(bvec) still can be improved a bit by the following way:
> 
> Yep, was doing exactly that, +a couple of other places that are in my way.

Are you optimising the right thing here?  Assuming you're looking at
the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
out how to copy the bvecs directly from the iov_iter into the bio
rather than calling dio_bio_add_page() for each page?
Christoph Hellwig Nov. 20, 2020, 8:14 a.m. UTC | #11
On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> > On 20/11/2020 02:22, Ming Lei wrote:
> > > iov_iter_npages(bvec) still can be improved a bit by the following way:
> > 
> > Yep, was doing exactly that, +a couple of other places that are in my way.
> 
> Are you optimising the right thing here?  Assuming you're looking at
> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> out how to copy the bvecs directly from the iov_iter into the bio
> rather than calling dio_bio_add_page() for each page?

Which is most effectively done by no longer using *blockdev_direct_IO
and switching to iomap instead :)
Pavel Begunkov Nov. 20, 2020, 9:57 a.m. UTC | #12
On 20/11/2020 02:54, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
>> On 20/11/2020 02:22, Ming Lei wrote:
>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
>>
>> Yep, was doing exactly that, +a couple of other places that are in my way.
> 
> Are you optimising the right thing here?  Assuming you're looking at
> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> out how to copy the bvecs directly from the iov_iter into the bio
> rather than calling dio_bio_add_page() for each page?

Ha, you got me, *add_page() was that "couple of others". It shows up much
more, but iov_iter_npages() just looked simple enough to do first.
Matthew Wilcox Nov. 20, 2020, 12:39 p.m. UTC | #13
On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> > > On 20/11/2020 02:22, Ming Lei wrote:
> > > > iov_iter_npages(bvec) still can be improved a bit by the following way:
> > > 
> > > Yep, was doing exactly that, +a couple of other places that are in my way.
> > 
> > Are you optimising the right thing here?  Assuming you're looking at
> > the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> > out how to copy the bvecs directly from the iov_iter into the bio
> > rather than calling dio_bio_add_page() for each page?
> 
> Which is most effectively done by no longer using *blockdev_direct_IO
> and switching to iomap instead :)

But iomap still calls iov_iter_npages().  So maybe we need something like
this ...

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..1c5a802a45d9 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -250,7 +250,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	orig_count = iov_iter_count(dio->submit.iter);
 	iov_iter_truncate(dio->submit.iter, length);
 
-	nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
+	nr_pages = bio_iov_iter_npages(dio->submit.iter);
 	if (nr_pages <= 0) {
 		ret = nr_pages;
 		goto out;
@@ -308,7 +308,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->size += n;
 		copied += n;
 
-		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
+		nr_pages = bio_iov_iter_npages(dio->submit.iter);
 		iomap_dio_submit_bio(dio, iomap, bio, pos);
 		pos += n;
 	} while (nr_pages);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..86cc74f84b30 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -10,6 +10,7 @@
 #include <linux/ioprio.h>
 /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
 #include <linux/blk_types.h>
+#include <linux/uio.h>
 
 #define BIO_DEBUG
 
@@ -447,6 +448,16 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+
+static inline int bio_iov_iter_npages(const struct iov_iter *i)
+{
+	if (!iov_iter_count(i))
+		return 0;
+	if (iov_iter_is_bvec(i))
+		return 1;
+	return iov_iter_npages(i, BIO_MAX_PAGES);
+}
+
 void bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
Pavel Begunkov Nov. 20, 2020, 1 p.m. UTC | #14
On 20/11/2020 12:39, Matthew Wilcox wrote:
> On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
>> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 02:22, Ming Lei wrote:
>>>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
>>>>
>>>> Yep, was doing exactly that, +a couple of other places that are in my way.
>>>
>>> Are you optimising the right thing here?  Assuming you're looking at
>>> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
>>> out how to copy the bvecs directly from the iov_iter into the bio
>>> rather than calling dio_bio_add_page() for each page?
>>
>> Which is most effectively done by no longer using *blockdev_direct_IO
>> and switching to iomap instead :)
> 
> But iomap still calls iov_iter_npages().  So maybe we need something like
> this ...

Yep, these optimisations are not mutually exclusive.
Why `return 1`? The value seems to be used later in bio_alloc(nr_pages).

> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..1c5a802a45d9 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -250,7 +250,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	orig_count = iov_iter_count(dio->submit.iter);
>  	iov_iter_truncate(dio->submit.iter, length);
>  
> -	nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> +	nr_pages = bio_iov_iter_npages(dio->submit.iter);
>  	if (nr_pages <= 0) {
>  		ret = nr_pages;
>  		goto out;
> @@ -308,7 +308,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		dio->size += n;
>  		copied += n;
>  
> -		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> +		nr_pages = bio_iov_iter_npages(dio->submit.iter);
>  		iomap_dio_submit_bio(dio, iomap, bio, pos);
>  		pos += n;
>  	} while (nr_pages);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c6d765382926..86cc74f84b30 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioprio.h>
>  /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
>  #include <linux/blk_types.h>
> +#include <linux/uio.h>
>  
>  #define BIO_DEBUG
>  
> @@ -447,6 +448,16 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  void __bio_add_page(struct bio *bio, struct page *page,
>  		unsigned int len, unsigned int off);
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
> +
> +static inline int bio_iov_iter_npages(const struct iov_iter *i)
> +{
> +	if (!iov_iter_count(i))
> +		return 0;
> +	if (iov_iter_is_bvec(i))
> +		return 1;
> +	return iov_iter_npages(i, BIO_MAX_PAGES);
> +}
> +
>  void bio_release_pages(struct bio *bio, bool mark_dirty);
>  extern void bio_set_pages_dirty(struct bio *bio);
>  extern void bio_check_pages_dirty(struct bio *bio);
>
Matthew Wilcox Nov. 20, 2020, 1:13 p.m. UTC | #15
On Fri, Nov 20, 2020 at 01:00:37PM +0000, Pavel Begunkov wrote:
> On 20/11/2020 12:39, Matthew Wilcox wrote:
> > On Fri, Nov 20, 2020 at 08:14:29AM +0000, Christoph Hellwig wrote:
> >> On Fri, Nov 20, 2020 at 02:54:57AM +0000, Matthew Wilcox wrote:
> >>> On Fri, Nov 20, 2020 at 02:25:08AM +0000, Pavel Begunkov wrote:
> >>>> On 20/11/2020 02:22, Ming Lei wrote:
> >>>>> iov_iter_npages(bvec) still can be improved a bit by the following way:
> >>>>
> >>>> Yep, was doing exactly that, +a couple of other places that are in my way.
> >>>
> >>> Are you optimising the right thing here?  Assuming you're looking at
> >>> the one in do_blockdev_direct_IO(), wouldn't we be better off figuring
> >>> out how to copy the bvecs directly from the iov_iter into the bio
> >>> rather than calling dio_bio_add_page() for each page?
> >>
> >> Which is most effectively done by no longer using *blockdev_direct_IO
> >> and switching to iomap instead :)
> > 
> > But iomap still calls iov_iter_npages().  So maybe we need something like
> > this ...
> 
> Yep, all that are not mutually exclusive optimisations.
> Why `return 1`? It seems to be used later in bio_alloc(nr_pages)

because 0 means "no pages".  It does no harm to allocate one biovec
that we then don't use.

> > -	nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> > +	nr_pages = bio_iov_iter_npages(dio->submit.iter);
> >  	if (nr_pages <= 0) {
            ^^^^^^^^^^^^^

> > -		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
> > +		nr_pages = bio_iov_iter_npages(dio->submit.iter);
> >  		iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  		pos += n;
> >  	} while (nr_pages);
                 ^^^^^^^^
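
The two marked checks only distinguish zero from non-zero, which a userspace sketch can show. All `model_`/`MODEL_` names below are hypothetical; `MODEL_BIO_MAX_PAGES` is assumed to be 256 and pages are assumed to be 4 KiB, and the non-bvec branch assumes a page-aligned buffer for simplicity. This is an illustration of the control flow only, not the iomap code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; both values are assumptions. */
#define MODEL_PAGE_SIZE 4096u
#define MODEL_BIO_MAX_PAGES 256

enum model_iter_type { MODEL_ITER_IOVEC, MODEL_ITER_BVEC };

struct model_iter {
	enum model_iter_type type;
	size_t count;	/* bytes remaining in the iterator */
};

/* 0 means "no data left"; any non-zero value means "keep going". */
int model_bio_iov_iter_npages(const struct model_iter *i)
{
	size_t n;

	if (i->count == 0)
		return 0;
	if (i->type == MODEL_ITER_BVEC)
		return 1;	/* placeholder: only non-zero-ness is used */
	n = (i->count + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	return n > MODEL_BIO_MAX_PAGES ? MODEL_BIO_MAX_PAGES : (int)n;
}

/* Loop shaped like the do { } while (nr_pages) quoted above. */
int model_submit_all(struct model_iter *i, size_t bytes_per_bio)
{
	int nr_bios = 0;
	int nr_pages = model_bio_iov_iter_npages(i);

	assert(bytes_per_bio > 0);
	if (nr_pages <= 0)
		return 0;
	do {
		size_t n = i->count < bytes_per_bio ? i->count : bytes_per_bio;

		i->count -= n;	/* pretend a bio carrying n bytes was submitted */
		nr_bios++;
		nr_pages = model_bio_iov_iter_npages(i);
	} while (nr_pages);
	return nr_bios;
}
```

In this model the loop terminates exactly when the iterator is drained, regardless of whether the non-zero return is 1 or a real page count.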
David Laight Nov. 20, 2020, 1:29 p.m. UTC | #16
From: Pavel Begunkov
> Sent: 19 November 2020 23:25
>
> The block layer spends quite a while in iov_iter_npages(), but for the
> bvec case the number of pages is already known and stored in
> iter->nr_segs, so it can be returned immediately as an optimisation
> 
> Perf for an io_uring benchmark with registered buffers (i.e. bvec) shows
> ~1.5-2.0% total cycle count spent in iov_iter_npages(), that's dropped
> by this patch to ~0.2%.
> 
> Reviewed-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  lib/iov_iter.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1635111c5bd2..0fa7ac330acf 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1594,6 +1594,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>  		return 0;
>  	if (unlikely(iov_iter_is_discard(i)))
>  		return 0;
> +	if (unlikely(iov_iter_is_bvec(i)))
> +		return min_t(int, i->nr_segs, maxpages);
> 
>  	if (unlikely(iov_iter_is_pipe(i))) {

Is it worth putting an extra condition around these three 'unlikely' cases?
ie:
	if (unlikely(iov_iter_type(i) & (ITER_DISCARD | ITER_BVEC | ITER_PIPE))) {
		if (iov_iter_is_discard(i))
			return 0;
		if (iov_iter_is_bvec(i))
			return min_t(int, i->nr_segs, maxpages);
		/* Must be ITER_PIPE */

	David
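
A compilable userspace sketch of the grouping idea follows. The flag values, `struct model_iter` and `model_unlikely()` are hypothetical stand-ins, and the per-type page counts are precomputed fields rather than real iteration. The bvec branch deliberately mirrors the quoted v2 line, which the earlier replies show is not a correct page count for multi-page bvecs; only the branch structure is the point here.

```c
#include <assert.h>

/* Hypothetical flag values; the sketch only needs them to be distinct bits. */
#define MODEL_ITER_IOVEC	(1u << 0)
#define MODEL_ITER_BVEC		(1u << 1)
#define MODEL_ITER_PIPE		(1u << 2)
#define MODEL_ITER_DISCARD	(1u << 3)

/* Branch hint where the compiler supports it, plain expression otherwise. */
#if defined(__GNUC__)
#define model_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define model_unlikely(x) (x)
#endif

struct model_iter {
	unsigned int type;
	int nr_segs;		/* used for the bvec case */
	int pipe_pages;		/* precomputed for the sketch */
	int iovec_pages;	/* precomputed for the sketch */
};

int model_npages(const struct model_iter *i, int maxpages)
{
	int npages;

	assert(maxpages >= 0);
	/* One test on the common path; the rare types are sorted out inside. */
	if (model_unlikely(i->type & (MODEL_ITER_DISCARD | MODEL_ITER_BVEC |
				      MODEL_ITER_PIPE))) {
		if (i->type & MODEL_ITER_DISCARD)
			return 0;
		if (i->type & MODEL_ITER_BVEC)	/* mirrors the v2 patch line */
			return i->nr_segs < maxpages ? i->nr_segs : maxpages;
		npages = i->pipe_pages;		/* must be the pipe case */
	} else {
		npages = i->iovec_pages;
	}
	return npages < maxpages ? npages : maxpages;
}
```

Whether one combined test beats three separate hinted tests would need measuring on the benchmark in question.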

Pavel Begunkov Nov. 20, 2020, 5:22 p.m. UTC | #17
On 20/11/2020 02:24, Ming Lei wrote:
> On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
>> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>>> bvec case the number of pages is already known and stored in
>>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>>
>>>>>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
>>>>>> store up to 4GB of contiguous physical memory.
>>>>>
>>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>>> stupid statement. Thanks!
>>>>>
>>>>> Are there many users of that? All these iterators are a huge burden,
>>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>>
>>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>>> called from a number of places including
>>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>>
>>> I get it that there are a lot of places, more interesting how often
>>> it's actually triggered and if that's performance critical for anybody.
>>> Not like I'm going to change it, just out of curiosity, but bvec.h
>>> can be nicely optimised without it.
>>
>> Typically when you're allocating pages for the page cache, they'll get
>> allocated in order and then you'll read or write them in order, so yes,
>> it ends up triggering quite a lot.  There was once a bug in the page
>> allocator which caused them to get allocated in reverse order and it
>> was a noticeable performance hit (this was 15-20 years ago).
> 
> hugepage use cases can benefit much from this way too.

This didn't yield any considerable boost for me though. 1.5% -> 1.3%
for 1 page reads. I'll send it anyway though because there are cases
that can benefit, e.g. as Ming mentioned.

Ming would you want to send the patch yourself? After all you did post
it first.
Pavel Begunkov Nov. 20, 2020, 5:23 p.m. UTC | #18
On 20/11/2020 17:22, Pavel Begunkov wrote:
> On 20/11/2020 02:24, Ming Lei wrote:
>> On Fri, Nov 20, 2020 at 02:06:10AM +0000, Matthew Wilcox wrote:
>>> On Fri, Nov 20, 2020 at 01:56:22AM +0000, Pavel Begunkov wrote:
>>>> On 20/11/2020 01:49, Matthew Wilcox wrote:
>>>>> On Fri, Nov 20, 2020 at 01:39:05AM +0000, Pavel Begunkov wrote:
>>>>>> On 20/11/2020 01:20, Matthew Wilcox wrote:
>>>>>>> On Thu, Nov 19, 2020 at 11:24:38PM +0000, Pavel Begunkov wrote:
>>>>>>>> The block layer spends quite a while in iov_iter_npages(), but for the
>>>>>>>> bvec case the number of pages is already known and stored in
>>>>>>>> iter->nr_segs, so it can be returned immediately as an optimisation
>>>>>>>
>>>>>>> Er ... no, it doesn't.  nr_segs is the number of bvecs.  Each bvec can
>>>>>>> store up to 4GB of contiguous physical memory.
>>>>>>
>>>>>> Ah, really, missed min() with PAGE_SIZE in bvec_iter_len(), then it's a
>>>>>> stupid statement. Thanks!
>>>>>>
>>>>>> Are there many users of that? All these iterators are a huge burden,
>>>>>> just to count one 4KB page in bvec it takes 2% of CPU time for me.
>>>>>
>>>>> __bio_try_merge_page() will create multipage BIOs, and that's
>>>>> called from a number of places including
>>>>> bio_try_merge_hw_seg(), bio_add_page(), and __bio_iov_iter_get_pages()
>>>>
>>>> I get it that there are a lot of places, more interesting how often
>>>> it's actually triggered and if that's performance critical for anybody.
>>>> Not like I'm going to change it, just out of curiosity, but bvec.h
>>>> can be nicely optimised without it.
>>>
>>> Typically when you're allocating pages for the page cache, they'll get
>>> allocated in order and then you'll read or write them in order, so yes,
>>> it ends up triggering quite a lot.  There was once a bug in the page
>>> allocator which caused them to get allocated in reverse order and it
>>> was a noticeable performance hit (this was 15-20 years ago).
>>
>> hugepage use cases can benefit much from this way too.
> 
> This didn't yield any considerable boost for me though. 1.5% -> 1.3%
> for 1 page reads. I'll send it anyway though because there are cases
> that can benefit, e.g. as Ming mentioned.

And yeah, it just shifts my attention for optimisation to its callers,
e.g. blkdev_direct_IO.

> Ming would you want to send the patch yourself? After all you did post
> it first.
>

Patch

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..0fa7ac330acf 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1594,6 +1594,8 @@  int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return 0;
 	if (unlikely(iov_iter_is_discard(i)))
 		return 0;
+	if (unlikely(iov_iter_is_bvec(i)))
+		return min_t(int, i->nr_segs, maxpages);
 
 	if (unlikely(iov_iter_is_pipe(i))) {
 		struct pipe_inode_info *pipe = i->pipe;
@@ -1614,11 +1616,9 @@  int iov_iter_npages(const struct iov_iter *i, int maxpages)
 			- p / PAGE_SIZE;
 		if (npages >= maxpages)
 			return maxpages;
-	0;}),({
-		npages++;
-		if (npages >= maxpages)
-			return maxpages;
-	}),({
+	0;}),
+		0 /* bvecs are handled above */
+	,({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;