
[v3] Btrfs: improve multi-thread buffer read

Message ID 1342461925-22495-1-git-send-email-liubo2009@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

liubo July 16, 2012, 6:05 p.m. UTC
While testing with my buffered read fio jobs[1], I found that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", which start buffered reads on the same file.
All of them race on add_to_page_cache_lru(), and whichever thread successfully
puts its page into the page cache takes responsibility for reading that page's
data.

What's more, reading a page takes some time to finish, during which the other
threads can slide in and pick up the remaining pages:

     t1          t2          t3          t4
   add Page1
   read Page1  add Page2
     |         read Page2  add Page3
     |            |        read Page3  add Page4
     |            |           |        read Page4
-----|------------|-----------|-----------|--------
     v            v           v           v
    bio          bio         bio         bio

Now we have four bios, each of which holds only one page, since we need to keep
the pages in a bio consecutive.  Thus, we can end up with far more bios than we
need.

Here we're going to
a) delay the actual read-page step and
b) try to put more pages into the page cache first.

With that, we can make each bio hold more pages and reduce the number of bios
we need.
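
In outline, the read loop in extent_readpages() then looks like this (a
cleaned-up rendering of the v3 diff shown under "Patch" below, with comments
added; the declarations are the same as in that diff):

	pagevec_init(&pvec, 0);
	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		struct page *page = list_entry(pages->prev, struct page, lru);

		list_del(&page->lru);
		if (!add_to_page_cache_lru(page, mapping,
					page->index, GFP_NOFS)) {
			/*
			 * We won the race: queue the page instead of
			 * reading it right away.
			 */
			page_cache_get(page);
			if (pagevec_add(&pvec, page) == 0) {
				/* pagevec is full: read the whole batch */
				for (i = 0; i < pagevec_count(&pvec); i++)
					__extent_read_full_page(tree,
							pvec.pages[i],
							get_extent, &bio,
							0, &bio_flags);
				pagevec_release(&pvec);
			}
		}
		page_cache_release(page);
	}
	/* read whatever is still queued */
	for (i = 0; i < pagevec_count(&pvec); i++)
		__extent_read_full_page(tree, pvec.pages[i], get_extent,
					&bio, 0, &bio_flags);
	pagevec_release(&pvec);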

Here are some numbers taken from the fio results:

         w/o patch                w/ patch
       -------------  --------  -------------
READ:    745MB/s        +19%       887MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1
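
(For reference, the job file above is run directly with fio; assuming it is
saved as bufread.fio -- the file name is just an example -- and /mnt/btrfs is
a mounted btrfs filesystem:

    $ fio bufread.fio
)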

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
v2->v3: adopt kernel native pagevec instead of kmalloc.
v1->v2: if we fail to make an allocation, just fall back to the old way to
        read the page.

 fs/btrfs/extent_io.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

Comments

David Sterba July 18, 2012, 11:57 a.m. UTC | #1
On Mon, Jul 16, 2012 at 02:05:25PM -0400, Liu Bo wrote:
> v2->v3: adopt kernel native pagevec instead of kmalloc.

Do we really use the pagevec features here? It looks more like a fancy
way to employ a simple array ...

And with a simple array we could use 16 page pointers covering 16 * 4k =
64k bytes which looks more friendly than 14 * 4K = 56K (available in a
pagevec). This may have a negative effect on the throughput, but I
haven't measured that. Can you please benchmark it against pagevec?


david
liubo July 19, 2012, 1:11 a.m. UTC | #2
On 07/18/2012 07:57 PM, David Sterba wrote:

> On Mon, Jul 16, 2012 at 02:05:25PM -0400, Liu Bo wrote:
>> v2->v3: adopt kernel native pagevec instead of kmalloc.
> 
> Do we really use the pagevec features here? It looks more like a fancy
> way to employ a simple array ...
> 
> And with a simple array we could use 16 page pointers covering 16 * 4k =
> 64k bytes which looks more friendly than 14 * 4K = 56K (available in a
> pagevec). This may have a negative effect on the throughput, but I
> haven't measured that. Can you please benchmark it against pagevec?
> 


Thanks for the advice.

Yes, it is a simple array that is needed.

Indeed I even tried a 128-page array, which gives the biggest improvement on my box.
Basically, in the range [1, 128], the more pages in the array, the bigger the improvement we get.

But as Chris suggested, my test is really a race case in practical use, and half of the
improvement is good enough, so we turned to the pagevec struct because it is closer to
how we solve similar problems in other parts of the kernel.

thanks,
liubo

> 
> david


David Sterba July 19, 2012, 2:05 a.m. UTC | #3
On Thu, Jul 19, 2012 at 09:11:06AM +0800, Liu Bo wrote:
> On 07/18/2012 07:57 PM, David Sterba wrote:
> 
> > On Mon, Jul 16, 2012 at 02:05:25PM -0400, Liu Bo wrote:
> >> v2->v3: adopt kernel native pagevec instead of kmalloc.
> > 
> > Do we really use the pagevec features here? It looks more like a fancy
> > way to employ a simple array ...
> > 
> > And with a simple array we could use 16 page pointers covering 16 * 4k =
> > 64k bytes which looks more friendly than 14 * 4K = 56K (available in a
> > pagevec). This may have a negative effect on the throughput, but I
> > haven't measured that. Can you please benchmark it against pagevec?
> > 
> 
> 
> Thanks for the advice.
> 
> Yes, it is a simple array that is needed.
> 
> Indeed I even tried a 128-page array, which gives the biggest improvement on my box.
> Basically, in the range [1, 128], the more pages in the array, the bigger the improvement we get.

128 is too much, this would snip 128 * 8 = 1K off the stack.

> But as Chris suggested, my test is really a race case in practical use, and half of the
> improvement is good enough, so we turned to the pagevec struct because it is closer to
> how we solve similar problems in other parts of the kernel.

Yes it's an optimization, nice and simple one, but I don't see the
use of pagevec justified. By the other parts of kernel is probably meant
memory management, and pagevec's are used along with lookups to inode
mappings, plus there are other side effects on pagecache (like calling
lru_add_drain() from pagevec_release, as can be seen in your code).

Filesystems can use pagevec_lookup instead of find_get_pages,
like ext4 does, but btrfs uses simple arrays of 16 pages, in
lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
extent_clear_unlock_delalloc (in connection with find_get_pages).
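
(For illustration, the batching pattern referred to here looks roughly like
the following -- a simplified sketch built around the helpers named above,
with illustrative variable names, not code quoted from btrfs:)

	struct page *pages[16];
	pgoff_t index = start_index;		/* illustrative */
	unsigned long left = nr_pages;		/* illustrative */

	while (left > 0) {
		unsigned int i, found;

		found = find_get_pages(mapping, index,
				       min_t(unsigned long, left,
					     ARRAY_SIZE(pages)), pages);
		if (!found)
			break;
		for (i = 0; i < found; i++) {
			/* ... unlock / clear bits on pages[i] ... */
			page_cache_release(pages[i]);
		}
		index = pages[found - 1]->index + 1;
		left -= found;
	}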

I was specifically interested in benchmarking pagevec used as in V3
against simple array with 16 elements, but now that I looked around
while writing this mail, I think that pagevec is not the way to go.


david
liubo July 19, 2012, 2:31 a.m. UTC | #4
On 07/19/2012 10:05 AM, David Sterba wrote:

> On Thu, Jul 19, 2012 at 09:11:06AM +0800, Liu Bo wrote:
>> On 07/18/2012 07:57 PM, David Sterba wrote:
>>
>>> On Mon, Jul 16, 2012 at 02:05:25PM -0400, Liu Bo wrote:
>>>> v2->v3: adopt kernel native pagevec instead of kmalloc.
>>> Do we really use the pagevec features here? It looks more like a fancy
>>> way to employ a simple array ...
>>>
>>> And with a simple array we could use 16 page pointers covering 16 * 4k =
>>> 64k bytes which looks more friendly than 14 * 4K = 56K (available in a
>>> pagevec). This may have a negative effect on the throughput, but I
>>> haven't measured that. Can you please benchmark it against pagevec?
>>>
>>
>> Thanks for the advice.
>>
>> Yes, it is a simple array that is needed.
>>
>> Indeed I even tried a 128-page array, which gives the biggest improvement on my box.
>> Basically, in the range [1, 128], the more pages in the array, the bigger the improvement we get.
> 
> 128 is too much, this would snip 128 * 8 = 1K off the stack.
> 


That's why I give up 128. :)

>> But as Chris suggested, my test is really a race case in practical use, and half of the
>> improvement is good enough, so we turned to the pagevec struct because it is closer to
>> how we solve similar problems in other parts of the kernel.
> 
> Yes it's an optimization, nice and simple one, but I don't see the
> use of pagevec justified. By the other parts of kernel is probably meant
> memory management, and pagevec's are used along with lookups to inode
> mappings, plus there are other side effects on pagecache (like calling
> lru_add_drain() from pagevec_release, as can be seen in your code).
> 
> Filesystems can use pagevec_lookup instead of find_get_pages,
> like ext4 does, but btrfs uses simple arrays of 16 pages, in
> lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
> extent_clear_unlock_delalloc (in connection with find_get_pages).
> 
> I was specifically interested in benchmarking pagevec used as in V3
> against simple array with 16 elements, but now that I looked around
> while writing this mail, I think that pagevec is not the way to go.
> 


Sorry, I see no difference between 16 pages array and pagevec(14 pages), and
I have no idea why ext4 use 16 pages array(maybe historical reasons), but IMO
it is proper and natural to use pagevec to manage pages.

Anyway, either will be ok, but we'd better get Chris's confirmation.

So Chris?

thanks,
liubo

> 
> david


David Sterba July 20, 2012, 3:36 a.m. UTC | #5
On Thu, Jul 19, 2012 at 10:31:05AM +0800, Liu Bo wrote:
> > 128 is too much, this would snip 128 * 8 = 1K off the stack.
> 
> That's why I give up 128. :)

It's good as a reference point, nobody says it should stay at 128.

> >> But as Chris suggested, my test is really a race case in practical use, and half of the
> >> improvement is good enough, so we turned to the pagevec struct because it is closer to
> >> how we solve similar problems in other parts of the kernel.
> > 
> > Yes it's an optimization, nice and simple one, but I don't see the
> > use of pagevec justified. By the other parts of kernel is probably meant
> > memory management, and pagevec's are used along with lookups to inode
> > mappings, plus there are other side effects on pagecache (like calling
> > lru_add_drain() from pagevec_release, as can be seen in your code).
> > 
> > Filesystems can use pagevec_lookup instead of find_get_pages,
> > like ext4 does, but btrfs uses simple arrays of 16 pages, in
> > lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
> > extent_clear_unlock_delalloc (in connection with find_get_pages).
> > 
> > I was specifically interested in benchmarking pagevec used as in V3
> > against simple array with 16 elements, but now that I looked around
> > while writing this mail, I think that pagevec is not the way to go.
> > 
> 
> Sorry, I see no difference between 16 pages array and pagevec(14 pages),

The difference is 2 pages, at least. Besides [quoting patch from the
first post for reference]

> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3557,7 +3557,10 @@ int extent_readpages(struct extent_io_tree *tree,
>         struct bio *bio = NULL;
>         unsigned page_idx;
>         unsigned long bio_flags = 0;
> +       struct pagevec pvec;
> +       int i = 0;
> 
> +       pagevec_init(&pvec, 0);
>         for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>                 struct page *page = list_entry(pages->prev, struct page, lru);
> 
> @@ -3565,11 +3568,22 @@ int extent_readpages(struct extent_io_tree *tree,
>                 list_del(&page->lru);
>                 if (!add_to_page_cache_lru(page, mapping,
>                                         page->index, GFP_NOFS)) {
> -                       __extent_read_full_page(tree, page, get_extent,
> +                       page_cache_get(page);
> +                       if (pagevec_add(&pvec, page) == 0) {
> +                               for (i = 0; i < pagevec_count(&pvec); i++)
> +                                       __extent_read_full_page(tree,
> +                                               pvec.pages[i], get_extent,
>                                                 &bio, 0, &bio_flags);
> +                               pagevec_release(&pvec);
                                  ^^^^^^^^^^^^^^^^^^^^^^
here

> +                       }
>                 }
>                 page_cache_release(page);
>         }
> +       for (i = 0; i < pagevec_count(&pvec); i++)
> +               __extent_read_full_page(tree, pvec.pages[i], get_extent,
> +                                       &bio, 0, &bio_flags);
> +       pagevec_release(&pvec);
          ^^^^^^^^^^^^^^^^^^^^^^
and here

> +
>         BUG_ON(!list_empty(pages));
>         if (bio)
>                 return submit_one_bio(READ, bio, 0, bio_flags);

you actually call pagevec_release. And I pointed out that this is not a
simple operation like the other pagevec_* functions (which just do some
arithmetic) -- it calls lru_add_drain(), which does lots of things with
the pagecache and LRU lists; follow the call chain from there if you
don't believe me.
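
(By contrast, dropping the references held in a plain page array is just a
put loop with no extra page-cache work -- an illustrative fragment with made-up
variable names, not taken from the patch:)

	for (i = 0; i < nr; i++)
		page_cache_release(pages[i]);	/* no lru_add_drain() behind this */
	nr = 0;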

> and I have no idea why ext4 use 16 pages array(maybe historical
> reasons),

sigh, I didn't say that ext4 uses 16 pointer array, quite the opposite:

> > like ext4 does, but btrfs uses simple arrays of 16 pages, in
> > lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
> > extent_clear_unlock_delalloc (in connection with find_get_pages).

> but IMO it is proper and natural to use pagevec to manage pages.

As you've benchmarked, the more pages one can batch here at once, the
better, and I don't see why we should miss the opportunity for another
2 pages just because it's shorter/nicer to write it via pagevecs.


david
liubo July 20, 2012, 3:51 a.m. UTC | #6
On 07/20/2012 11:36 AM, David Sterba wrote:

> On Thu, Jul 19, 2012 at 10:31:05AM +0800, Liu Bo wrote:
>>> 128 is too much, this would snip 128 * 8 = 1K off the stack.
>> That's why I give up 128. :)
> 
> It's good as a reference point, nobody says it should stay at 128.
> 
>>>> But as Chris suggested, my test is really a race case in practical use, and half of the
>>>> improvement is good enough, so we turned to the pagevec struct because it is closer to
>>>> how we solve similar problems in other parts of the kernel.
>>> Yes it's an optimization, nice and simple one, but I don't see the
>>> use of pagevec justified. By the other parts of kernel is probably meant
>>> memory management, and pagevec's are used along with lookups to inode
>>> mappings, plus there are other side effects on pagecache (like calling
>>> lru_add_drain() from pagevec_release, as can be seen in your code).
>>>
>>> Filesystems can use pagevec_lookup instead of find_get_pages,
>>> like ext4 does, but btrfs uses simple arrays of 16 pages, in
>>> lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
>>> extent_clear_unlock_delalloc (in connection with find_get_pages).
>>>
>>> I was specifically interested in benchmarking pagevec used as in V3
>>> against simple array with 16 elements, but now that I looked around
>>> while writing this mail, I think that pagevec is not the way to go.
>>>
>> Sorry, I see no difference between 16 pages array and pagevec(14 pages),
> 
> The difference is 2 pages, at least. Besides [quoting patch from the
> first post for reference]
> 
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3557,7 +3557,10 @@ int extent_readpages(struct extent_io_tree *tree,
>>         struct bio *bio = NULL;
>>         unsigned page_idx;
>>         unsigned long bio_flags = 0;
>> +       struct pagevec pvec;
>> +       int i = 0;
>>
>> +       pagevec_init(&pvec, 0);
>>         for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>>                 struct page *page = list_entry(pages->prev, struct page, lru);
>>
>> @@ -3565,11 +3568,22 @@ int extent_readpages(struct extent_io_tree *tree,
>>                 list_del(&page->lru);
>>                 if (!add_to_page_cache_lru(page, mapping,
>>                                         page->index, GFP_NOFS)) {
>> -                       __extent_read_full_page(tree, page, get_extent,
>> +                       page_cache_get(page);
>> +                       if (pagevec_add(&pvec, page) == 0) {
>> +                               for (i = 0; i < pagevec_count(&pvec); i++)
>> +                                       __extent_read_full_page(tree,
>> +                                               pvec.pages[i], get_extent,
>>                                                 &bio, 0, &bio_flags);
>> +                               pagevec_release(&pvec);
>                                   ^^^^^^^^^^^^^^^^^^^^^^
> here
> 
>> +                       }
>>                 }
>>                 page_cache_release(page);
>>         }
>> +       for (i = 0; i < pagevec_count(&pvec); i++)
>> +               __extent_read_full_page(tree, pvec.pages[i], get_extent,
>> +                                       &bio, 0, &bio_flags);
>> +       pagevec_release(&pvec);
>           ^^^^^^^^^^^^^^^^^^^^^^
> and here
> 
>> +
>>         BUG_ON(!list_empty(pages));
>>         if (bio)
>>                 return submit_one_bio(READ, bio, 0, bio_flags);
> 
> you actually call pagevec_release. And I pointed out that this is not a
> simple operation like the other pagevec_* functions (which just do some
> arithmetic) -- it calls lru_add_drain(), which does lots of things with
> the pagecache and LRU lists; follow the call chain from there if you
> don't believe me.
> 


Well, you're totally right.  It does have some side effects.

>> and I have no idea why ext4 use 16 pages array(maybe historical
>> reasons),
> 
> sigh, I didn't say that ext4 uses 16 pointer array, quite the opposite:
> 


Oh, sorry, I owe you an apology.

>>> like ext4 does, but btrfs uses simple arrays of 16 pages, in
>>> lock_delalloc_pages, end_compressed_writeback, __unlock_for_delalloc and
>>> extent_clear_unlock_delalloc (in connection with find_get_pages).
> 
>> but IMO it is proper and natural to use pagevec to manage pages.
> 
> As you've benchmarked, the more pages one can batch here at once, the
> better, and I don't see why we should miss the opportunity for another
> 2 pages just because it's shorter/nicer to write it via pagevecs.
> 


Thanks for your explanation and review; I will give it a shot ASAP.

thanks,
liubo

> 
> david


Chris Mason July 20, 2012, 12:40 p.m. UTC | #7
On Wed, Jul 18, 2012 at 08:31:05PM -0600, Liu Bo wrote:
> On 07/19/2012 10:05 AM, David Sterba wrote:

[ pagevec vs array ]

> Sorry, I see no difference between 16 pages array and pagevec(14 pages), and
> I have no idea why ext4 use 16 pages array(maybe historical reasons), but IMO
> it is proper and natural to use pagevec to manage pages.
> 
> Anyway, either will be ok, but we'd better get Chris's confirmation.
> 
> So Chris?

Yeah, the array does sound better, and I would go with 16.  It's a nice
round number. Thanks Dave!

-chris
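
(For reference, a follow-up along the lines agreed on here would swap the
pagevec for a plain 16-entry page array. Below is a hedged sketch of how the
loop in extent_readpages() might then look -- not the actual follow-up patch;
the array name "pagepool" is illustrative, and the remaining declarations are
as in the existing function:)

	struct page *pagepool[16];
	int nr = 0, i;

	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		struct page *page = list_entry(pages->prev, struct page, lru);

		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					page->index, GFP_NOFS)) {
			/* lost the race; just drop our reference */
			page_cache_release(page);
			continue;
		}
		/* keep our reference and defer the actual read */
		pagepool[nr++] = page;
		if (nr < ARRAY_SIZE(pagepool))
			continue;
		for (i = 0; i < nr; i++) {
			__extent_read_full_page(tree, pagepool[i], get_extent,
						&bio, 0, &bio_flags);
			page_cache_release(pagepool[i]);
		}
		nr = 0;
	}
	/* flush whatever is left in the pool */
	for (i = 0; i < nr; i++) {
		__extent_read_full_page(tree, pagepool[i], get_extent,
					&bio, 0, &bio_flags);
		page_cache_release(pagepool[i]);
	}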


Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 01c21b6..2fcbcac 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3557,7 +3557,10 @@  int extent_readpages(struct extent_io_tree *tree,
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	unsigned long bio_flags = 0;
+	struct pagevec pvec;
+	int i = 0;
 
+	pagevec_init(&pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
@@ -3565,11 +3568,22 @@  int extent_readpages(struct extent_io_tree *tree,
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_NOFS)) {
-			__extent_read_full_page(tree, page, get_extent,
+			page_cache_get(page);
+			if (pagevec_add(&pvec, page) == 0) {
+				for (i = 0; i < pagevec_count(&pvec); i++)
+					__extent_read_full_page(tree,
+						pvec.pages[i], get_extent,
 						&bio, 0, &bio_flags);
+				pagevec_release(&pvec);
+			}
 		}
 		page_cache_release(page);
 	}
+	for (i = 0; i < pagevec_count(&pvec); i++)
+		__extent_read_full_page(tree, pvec.pages[i], get_extent,
+					&bio, 0, &bio_flags);
+	pagevec_release(&pvec);
+
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		return submit_one_bio(READ, bio, 0, bio_flags);