
[01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()

Message ID 20201103133108.148112-2-wqu@suse.com (mailing list archive)
State New, archived
Series [01/32] btrfs: extent_io: remove the extent_start/extent_len for end_bio_extent_readpage()

Commit Message

Qu Wenruo Nov. 3, 2020, 1:30 p.m. UTC
In end_bio_extent_readpage() we have a strange dance around
extent_start/extent_len.

What hides behind the strange dance is that it's just calling
endio_readpage_release_extent() on each bvec range.

Here is an example to explain the original workflow:
  Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)

  end_bio_extent_readpage() entered
  |- extent_start = 0;
  |- extent_len = 0;
  |- bio_for_each_segment_all() {
  |  |- /* Got the 1st bvec */
  |  |- start = SZ_1M;
  |  |- end = SZ_1M + SZ_4K - 1;
  |  |- update = 1;
  |  |- if (extent_len == 0) {
  |  |  |- extent_start = start; /* SZ_1M */
  |  |  |- extent_len = end + 1 - start; /* SZ_4K */
  |  |  }
  |  |
  |  |- /* Got the 2nd bvec */
  |  |- start = SZ_1M + SZ_4K;
  |  |- end = SZ_1M + SZ_8K - 1;
  |  |- update = 1;
  |  |- if (extent_start + extent_len == start) {
  |  |  |- extent_len += end + 1 - start; /* SZ_8K */
  |  |  }
  |  } /* All bio vecs iterated */
  |
  |- if (extent_len) {
     |- endio_readpage_release_extent(tree, extent_start, extent_len,
				      update);
	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */

As the above flow shows, the existing code in end_bio_extent_readpage()
just accumulates extent_start/extent_len, and when the contiguous range
breaks, calls endio_readpage_release_extent() for the accumulated range.

The contiguous range breaks at two locations:
- The final else {} branch
  This means we hit a page in the bio that is not contiguous with the
  accumulated range.
  Currently this branch is never triggered, as all our bios are
  submitted with contiguous pages.

- After the bio_for_each_segment_all() loop ends
  This is the normal call site: we have iterated all bvecs of the bio,
  and since all pages should be contiguous, we can call
  endio_readpage_release_extent() on the full range.

The original code also handles the (!uptodate) case, so only the
uptodate ranges are marked with EXTENT_UPTODATE.

So this patch removes the extent_start/extent_len dance and replaces
it with a plain endio_readpage_release_extent() call on each bvec.

This brings one behavior change:
- Temporary memory usage increase
  Unlike the old code, which only modified the extent io tree once per
  contiguous range, we now update the extent io tree once per bvec.

  Although the end result is the same, we may need more extent state
  splits/allocations, and thus more temporary memory, during the bvec
  iteration.

But considering how streamlined the new code is, the temporary memory
usage increase should be acceptable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

Comments

Nikolay Borisov Nov. 5, 2020, 9:46 a.m. UTC | #1
On 3.11.20 15:30, Qu Wenruo wrote:
> In end_bio_extent_readpage() we have a strange dance around
> extent_start/extent_len.
> 
> What hides behind the strange dance is that it's just calling
> endio_readpage_release_extent() on each bvec range.
> 
> Here is an example to explain the original workflow:
>   Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
> 
>   end_bio_extent_readpage() entered
>   |- extent_start = 0;
>   |- extent_len = 0;
>   |- bio_for_each_segment_all() {
>   |  |- /* Got the 1st bvec */
>   |  |- start = SZ_1M;
>   |  |- end = SZ_1M + SZ_4K - 1;
>   |  |- update = 1;
>   |  |- if (extent_len == 0) {
>   |  |  |- extent_start = start; /* SZ_1M */
>   |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>   |  |  }
>   |  |
>   |  |- /* Got the 2nd bvec */
>   |  |- start = SZ_1M + SZ_4K;
>   |  |- end = SZ_1M + SZ_8K - 1;
>   |  |- update = 1;
>   |  |- if (extent_start + extent_len == start) {
>   |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>   |  |  }
>   |  } /* All bio vecs iterated */
>   |
>   |- if (extent_len) {
>      |- endio_readpage_release_extent(tree, extent_start, extent_len,
> 				      update);
> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
> 
> As the above flow shows, the existing code in end_bio_extent_readpage()
> just accumulates extent_start/extent_len, and when the contiguous range
> breaks, calls endio_readpage_release_extent() for the accumulated range.
> 
> The contiguous range breaks at two locations:
> - The final else {} branch
>   This means we hit a page in the bio that is not contiguous with the
>   accumulated range.
>   Currently this branch is never triggered, as all our bios are
>   submitted with contiguous pages.
> 

The endio routine cares about logical file contiguity, as evidenced by
the fact that it uses page_offset() to calculate 'start'. However, our
recent discussion on IRC about contiguity in csum bios clearly showed
that we can have bios which contain pages that are contiguous in their
disk byte nr but not in their logical offset. In fact, Josef even
mentioned on Slack that a single bio can contain pages for different
inodes, so long as their disk byte nrs are contiguous. I think this is
not an issue in this case, because you are doing the unlock at bvec
granularity, but the above statement is somewhat misleading.

> - After the bio_for_each_segment_all() loop ends
>   This is the normal call site: we have iterated all bvecs of the bio,
>   and since all pages should be contiguous, we can call
>   endio_readpage_release_extent() on the full range.
> 
> The original code also handles the (!uptodate) case, so only the
> uptodate ranges are marked with EXTENT_UPTODATE.
> 
> So this patch removes the extent_start/extent_len dance and replaces
> it with a plain endio_readpage_release_extent() call on each bvec.
> 
> This brings one behavior change:
> - Temporary memory usage increase
>   Unlike the old code, which only modified the extent io tree once per
>   contiguous range, we now update the extent io tree once per bvec.

I suspect that for large bios with a lot of bvecs this would likely
increase latency, because we will now invoke
endio_readpage_release_extent() proportionally to the number of bvecs.

> 
>   Although the end result is the same, we may need more extent state
>   splits/allocations, and thus more temporary memory, during the bvec
>   iteration.

Also bear in mind that this happens in a critical endio context, which
uses GFP_ATOMIC allocations, so if we get ENOSPC it would be rather bad.

> 
> But considering how streamline the new code is, the temporary memory
> usage increase should be acceptable.

I definitely like the new code; however, without quantifying the
increase in the number of calls to endio_readpage_release_extent(),
I'd rather not merge it.

On a different note, one minor cleanup that could be done is to replace
all those "end + 1 - start" expressions with simply "len", as that is
effectively the length of the current bvec.
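
For example, something like this (a sketch only; the loop already has
'len' at hand):

	if (!extent_len) {
		extent_start = start;
		extent_len = len;	/* was: end + 1 - start */
	} else if (extent_start + extent_len == start) {
		extent_len += len;	/* was: end + 1 - start */
	}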

<snip>
Qu Wenruo Nov. 5, 2020, 10:15 a.m. UTC | #2
On 2020/11/5 5:46 PM, Nikolay Borisov wrote:
> 
> 
> On 3.11.20 15:30, Qu Wenruo wrote:
>> In end_bio_extent_readpage() we have a strange dance around
>> extent_start/extent_len.
>>
>> What hides behind the strange dance is that it's just calling
>> endio_readpage_release_extent() on each bvec range.
>>
>> Here is an example to explain the original workflow:
>>   Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
>>
>>   end_bio_extent_readpage() entered
>>   |- extent_start = 0;
>>   |- extent_len = 0;
>>   |- bio_for_each_segment_all() {
>>   |  |- /* Got the 1st bvec */
>>   |  |- start = SZ_1M;
>>   |  |- end = SZ_1M + SZ_4K - 1;
>>   |  |- update = 1;
>>   |  |- if (extent_len == 0) {
>>   |  |  |- extent_start = start; /* SZ_1M */
>>   |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>>   |  |  }
>>   |  |
>>   |  |- /* Got the 2nd bvec */
>>   |  |- start = SZ_1M + SZ_4K;
>>   |  |- end = SZ_1M + SZ_8K - 1;
>>   |  |- update = 1;
>>   |  |- if (extent_start + extent_len == start) {
>>   |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>>   |  |  }
>>   |  } /* All bio vecs iterated */
>>   |
>>   |- if (extent_len) {
>>      |- endio_readpage_release_extent(tree, extent_start, extent_len,
>> 				      update);
>> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
>>
>> As the above flow shows, the existing code in end_bio_extent_readpage()
>> just accumulates extent_start/extent_len, and when the contiguous range
>> breaks, calls endio_readpage_release_extent() for the accumulated range.
>>
>> The contiguous range breaks at two locations:
>> - The final else {} branch
>>   This means we hit a page in the bio that is not contiguous with the
>>   accumulated range.
>>   Currently this branch is never triggered, as all our bios are
>>   submitted with contiguous pages.
>>
> 
> The endio routine cares about logical file contiguity, as evidenced by
> the fact that it uses page_offset() to calculate 'start'. However, our
> recent discussion on IRC about contiguity in csum bios clearly showed
> that we can have bios which contain pages that are contiguous in their
> disk byte nr but not in their logical offset. In fact, Josef even
> mentioned on Slack that a single bio can contain pages for different
> inodes, so long as their disk byte nrs are contiguous. I think this is
> not an issue in this case, because you are doing the unlock at bvec
> granularity, but the above statement is somewhat misleading.

Right, I forgot the recently discovered bvec contiguity problem.

Still, the contiguity check condition itself is correct; just the
commit message needs some update.

Another off-topic question: should we allow such on-disk-bytenr-only
merging (to improve IO throughput), or disallow it to keep the endio
function simpler?

> 
>> - After the bio_for_each_segment_all() loop ends
>>   This is the normal call site: we have iterated all bvecs of the bio,
>>   and since all pages should be contiguous, we can call
>>   endio_readpage_release_extent() on the full range.
>>
>> The original code also handles the (!uptodate) case, so only the
>> uptodate ranges are marked with EXTENT_UPTODATE.
>>
>> So this patch removes the extent_start/extent_len dance and replaces
>> it with a plain endio_readpage_release_extent() call on each bvec.
>>
>> This brings one behavior change:
>> - Temporary memory usage increase
>>   Unlike the old code, which only modified the extent io tree once per
>>   contiguous range, we now update the extent io tree once per bvec.
> 
> I suspect that for large bios with a lot of bvecs this would likely
> increase latency, because we will now invoke
> endio_readpage_release_extent() proportionally to the number of bvecs.

I expect the same.

Now we need to do extent io tree operations for each bvec.
We could slightly reduce the overhead if we had something like a
globally cached extent_state.

Your comment indeed implies we should build a better extent contiguity
cache, rather than completely relying on the extent io tree.

Maybe I can find some good way to improve the situation while still
avoiding the existing dance.

> 
>>
>>   Although the end result is the same, we may need more extent state
>>   splits/allocations, and thus more temporary memory, during the bvec
>>   iteration.
> 
> Also bear in mind that this happens in a critical endio context, which
> uses GFP_ATOMIC allocations, so if we get ENOSPC it would be rather bad.

I know you mean -ENOMEM.

But the truth is, except for the leading/trailing sectors of the
extent, we shouldn't really cause extra splits/allocations.

Each remaining range should only enlarge the previously modified
extent_state, so the end result of the extent io tree operations
should not change much.

> 
>>
>> But considering how streamlined the new code is, the temporary memory
>> usage increase should be acceptable.
> 
> I definitely like the new code; however, without quantifying the
> increase in the number of calls to endio_readpage_release_extent(),
> I'd rather not merge it.

Your point stands.

I could add a new wrapper to do the same thing, with a small helper
structure that records the inode/extent_start/extent_len internally.

The end result would be the same in the endio function, but much easier
to read. (The complex part would definitely get more comments.)

What about this solution?
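
For illustration, a rough sketch of what I have in mind (all names
here are hypothetical, nothing final):

	/* Accumulates the current contiguous range across bvecs */
	struct processed_extent {
		struct btrfs_inode *inode;
		u64 start;	/* start of the range */
		u64 end;	/* inclusive end of the range */
		bool uptodate;	/* whether the whole range is uptodate */
	};

	/*
	 * Merge the bvec range into @processed if it's contiguous and has
	 * the same inode/uptodate status; otherwise release the accumulated
	 * range and start a new one. The caller zero-initializes @processed
	 * and releases the final range after the loop.
	 */
	static void endio_readpage_release_extent(
			struct processed_extent *processed,
			struct btrfs_inode *inode,
			u64 start, u64 end, bool uptodate)
	{
		if (processed->inode == inode &&
		    processed->uptodate == uptodate &&
		    processed->end + 1 == start) {
			processed->end = end;
			return;
		}
		if (processed->inode)
			release_processed_range(processed); /* hypothetical */
		processed->inode = inode;
		processed->start = start;
		processed->end = end;
		processed->uptodate = uptodate;
	}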

Thanks,
Qu
> 
> On a different note, one minor cleanup that could be done is to replace
> all those "end + 1 - start" expressions with simply "len", as that is
> effectively the length of the current bvec.
> 
> <snip>
>
Nikolay Borisov Nov. 5, 2020, 10:32 a.m. UTC | #3
On 5.11.20 12:15, Qu Wenruo wrote:
> 
> 

<snip>

>>
>> The endio routine cares about logical file contiguity, as evidenced by
>> the fact that it uses page_offset() to calculate 'start'. However, our
>> recent discussion on IRC about contiguity in csum bios clearly showed
>> that we can have bios which contain pages that are contiguous in their
>> disk byte nr but not in their logical offset. In fact, Josef even
>> mentioned on Slack that a single bio can contain pages for different
>> inodes, so long as their disk byte nrs are contiguous. I think this is
>> not an issue in this case, because you are doing the unlock at bvec
>> granularity, but the above statement is somewhat misleading.
> 
> Right, I forgot the recently discovered bvec contiguity problem.
> 
> Still, the contiguity check condition itself is correct; just the
> commit message needs some update.
> 
> Another off-topic question: should we allow such on-disk-bytenr-only
> merging (to improve IO throughput), or disallow it to keep the endio
> function simpler?

Can't answer that without quantifying the performance impact, so we
can properly judge the complexity/performance trade-off!

<snip>

>> I suspect that for large bios with a lot of bvecs this would likely
>> increase latency, because we will now invoke
>> endio_readpage_release_extent() proportionally to the number of bvecs.
> 
> I expect the same.
> 
> Now we need to do extent io tree operations for each bvec.
> We could slightly reduce the overhead if we had something like a
> globally cached extent_state.
> 
> Your comment indeed implies we should build a better extent contiguity
> cache, rather than completely relying on the extent io tree.
> 
> Maybe I can find some good way to improve the situation while still
> avoiding the existing dance.

First I'd like to have numbers showing what the overhead is, otherwise
it will be impossible to tell whether whatever approach you choose
brings any improvement.

<snip>

>> Also bear in mind that this happens in a critical endio context, which
>> uses GFP_ATOMIC allocations, so if we get ENOSPC it would be rather bad.
> 
> I know you mean -ENOMEM.

Yep, my bad.

> 
>> But the truth is, except for the leading/trailing sectors of the
>> extent, we shouldn't really cause extra splits/allocations.

That's something you assume, so the real behavior might be different;
again, I think we need to experiment to better understand the behavior.

<snip>

>> I definitely like the new code; however, without quantifying the
>> increase in the number of calls to endio_readpage_release_extent(),
>> I'd rather not merge it.
> 
> Your point stands.
> 
> I could add a new wrapper to do the same thing, with a small helper
> structure that records the inode/extent_start/extent_len internally.
> 
> The end result would be the same in the endio function, but much easier
> to read. (The complex part would definitely get more comments.)
> 
> What about this solution?

IMO the best course of action is to measure the length of the extents
being unlocked in the old version of the code and the number of bvecs
in a bio. That way you would be able to extrapolate how many more
extent unlock calls the new version of the code would have made. This
will tell you how effective this optimisation really is and whether
it's worth keeping around.

<snip>
Josef Bacik Nov. 5, 2020, 7:40 p.m. UTC | #4
On 11/3/20 8:30 AM, Qu Wenruo wrote:
> In end_bio_extent_readpage() we have a strange dance around
> extent_start/extent_len.
> 
> What hides behind the strange dance is that it's just calling
> endio_readpage_release_extent() on each bvec range.
> 
> Here is an example to explain the original workflow:
>    Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
> 
>    end_bio_extent_readpage() entered
>    |- extent_start = 0;
>    |- extent_len = 0;
>    |- bio_for_each_segment_all() {
>    |  |- /* Got the 1st bvec */
>    |  |- start = SZ_1M;
>    |  |- end = SZ_1M + SZ_4K - 1;
>    |  |- update = 1;
>    |  |- if (extent_len == 0) {
>    |  |  |- extent_start = start; /* SZ_1M */
>    |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>    |  |  }
>    |  |
>    |  |- /* Got the 2nd bvec */
>    |  |- start = SZ_1M + SZ_4K;
>    |  |- end = SZ_1M + SZ_8K - 1;
>    |  |- update = 1;
>    |  |- if (extent_start + extent_len == start) {
>    |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>    |  |  }
>    |  } /* All bio vecs iterated */
>    |
>    |- if (extent_len) {
>       |- endio_readpage_release_extent(tree, extent_start, extent_len,
> 				      update);
> 	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
> 
> As the above flow shows, the existing code in end_bio_extent_readpage()
> just accumulates extent_start/extent_len, and when the contiguous range
> breaks, calls endio_readpage_release_extent() for the accumulated range.
> 
> The contiguous range breaks at two locations:
> - The final else {} branch
>    This means we hit a page in the bio that is not contiguous with the
>    accumulated range.
>    Currently this branch is never triggered, as all our bios are
>    submitted with contiguous pages.
> 
> - After the bio_for_each_segment_all() loop ends
>    This is the normal call site: we have iterated all bvecs of the bio,
>    and since all pages should be contiguous, we can call
>    endio_readpage_release_extent() on the full range.
> 
> The original code also handles the (!uptodate) case, so only the
> uptodate ranges are marked with EXTENT_UPTODATE.
> 
> So this patch removes the extent_start/extent_len dance and replaces
> it with a plain endio_readpage_release_extent() call on each bvec.
> 
> This brings one behavior change:
> - Temporary memory usage increase
>    Unlike the old code, which only modified the extent io tree once per
>    contiguous range, we now update the extent io tree once per bvec.
> 
>    Although the end result is the same, we may need more extent state
>    splits/allocations, and thus more temporary memory, during the bvec
>    iteration.
> 
> But considering how streamlined the new code is, the temporary memory
> usage increase should be acceptable.

It's not just temporary memory usage, it's a point of latency for every
memory operation.  We have a lot of memory usage on our servers; every
trip into the slab allocator is going to be a new chance to induce
latency, because we get caught by some cgroup limit and forced into
reclaim.  The fact that these allocations could be GFP_ATOMIC makes it
even worse, because now we'll have this random knock-on effect for
heavy read workloads.

Then to top it all off, we could have several megs worth of IO per bio,
which means we're doing this allocation hundreds of times per bio!  This
is a hard no for me.  Thanks,

Josef
Qu Wenruo Nov. 6, 2020, 1:52 a.m. UTC | #5
On 2020/11/6 3:40 AM, Josef Bacik wrote:
> On 11/3/20 8:30 AM, Qu Wenruo wrote:
>> In end_bio_extent_readpage() we have a strange dance around
>> extent_start/extent_len.
>>
>> What hides behind the strange dance is that it's just calling
>> endio_readpage_release_extent() on each bvec range.
>>
>> Here is an example to explain the original workflow:
>>    Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
>>
>>    end_bio_extent_readpage() entered
>>    |- extent_start = 0;
>>    |- extent_len = 0;
>>    |- bio_for_each_segment_all() {
>>    |  |- /* Got the 1st bvec */
>>    |  |- start = SZ_1M;
>>    |  |- end = SZ_1M + SZ_4K - 1;
>>    |  |- update = 1;
>>    |  |- if (extent_len == 0) {
>>    |  |  |- extent_start = start; /* SZ_1M */
>>    |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>>    |  |  }
>>    |  |
>>    |  |- /* Got the 2nd bvec */
>>    |  |- start = SZ_1M + SZ_4K;
>>    |  |- end = SZ_1M + SZ_8K - 1;
>>    |  |- update = 1;
>>    |  |- if (extent_start + extent_len == start) {
>>    |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>>    |  |  }
>>    |  } /* All bio vecs iterated */
>>    |
>>    |- if (extent_len) {
>>       |- endio_readpage_release_extent(tree, extent_start, extent_len,
>>                       update);
>>     /* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
>>
>> As the above flow shows, the existing code in end_bio_extent_readpage()
>> just accumulates extent_start/extent_len, and when the contiguous range
>> breaks, calls endio_readpage_release_extent() for the accumulated range.
>>
>> The contiguous range breaks at two locations:
>> - The final else {} branch
>>    This means we hit a page in the bio that is not contiguous with the
>>    accumulated range.
>>    Currently this branch is never triggered, as all our bios are
>>    submitted with contiguous pages.
>>
>> - After the bio_for_each_segment_all() loop ends
>>    This is the normal call site: we have iterated all bvecs of the bio,
>>    and since all pages should be contiguous, we can call
>>    endio_readpage_release_extent() on the full range.
>>
>> The original code also handles the (!uptodate) case, so only the
>> uptodate ranges are marked with EXTENT_UPTODATE.
>>
>> So this patch removes the extent_start/extent_len dance and replaces
>> it with a plain endio_readpage_release_extent() call on each bvec.
>>
>> This brings one behavior change:
>> - Temporary memory usage increase
>>    Unlike the old code, which only modified the extent io tree once per
>>    contiguous range, we now update the extent io tree once per bvec.
>>
>>    Although the end result is the same, we may need more extent state
>>    splits/allocations, and thus more temporary memory, during the bvec
>>    iteration.
>>
>> But considering how streamlined the new code is, the temporary memory
>> usage increase should be acceptable.
> 
> It's not just temporary memory usage, it's a point of latency for every
> memory operation.

The latency comes from two parts:
- extent_state search
  Even though it's a log(n) operation, we're calling it for each bvec,
  so it definitely causes more latency. I'll post the test results
  soon, but the initial result is already pretty poor.

- extent_state preallocation
  This is the tricky one.

  In theory, since we're in the read path, we could call it with
  GFP_KERNEL, but the truth is, the extent io tree uses the gfp_mask to
  determine whether it can do memory allocation, and if possible it
  will always try to prealloc some memory, which is not always ideal.

  This means that even though we could use GFP_KERNEL here, we
  shouldn't.

  So, ironically, we should call with GFP_ATOMIC to reduce the memory
  allocation attempts. But that could cause false ENOMEM alerts.

  As for the extent io tree operations, except for the first bvec we
  should always just enlarge the previously inserted extent_state, so
  memory usage isn't really a problem.

This again shows the hidden sins of the extent io tree, and further
proves that we need more interface rework for it.

The best situation would be to allocate one extent_state as a cache and
allow the extent io tree to use that cache, rather than doing the
hidden preallocation internally.

We would then only re-allocate the precached extent_state after the
extent io tree has really consumed it.
For endio call sites, the possibility of needing a new allocation is
low, as a contiguous range should only need one extent_state allocated.
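
For illustration, the interface could look roughly like this (a sketch
only; set_extent_uptodate_cached() and the @prealloc handling are
hypothetical, not an existing API):

	struct extent_state *prealloc = alloc_extent_state(GFP_ATOMIC);

	bio_for_each_segment_all(bvec, bio, iter_all) {
		/* ... compute start/end for this bvec ... */

		/*
		 * The tree consumes @prealloc only when it really has to
		 * insert or split a state; for a contiguous range that
		 * should happen only once.
		 */
		set_extent_uptodate_cached(tree, start, end, &prealloc);
		if (!prealloc)
			prealloc = alloc_extent_state(GFP_ATOMIC);
	}
	if (prealloc)
		free_extent_state(prealloc);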

For now, I want to just keep the old behavior, with slightly better
comments, and leave the larger extent io tree rework for the future.

Thanks,
Qu

>  We have a lot of memory usage on our servers; every
> trip into the slab allocator is going to be a new chance to induce
> latency, because we get caught by some cgroup limit and forced into
> reclaim.  The fact that these allocations could be GFP_ATOMIC makes it
> even worse, because now we'll have this random knock-on effect for
> heavy read workloads.
> 
> Then to top it all off, we could have several megs worth of IO per bio,
> which means we're doing this allocation hundreds of times per bio!  This
> is a hard no for me.  Thanks,
> 
> Josef
Qu Wenruo Nov. 6, 2020, 2:01 a.m. UTC | #6
...
>
> Can't answer that without quantifying the performance impact, so we
> can properly judge the complexity/performance trade-off!
> First I'd like to have numbers showing what the overhead is, otherwise
> it will be impossible to tell whether whatever approach you choose
> brings any improvement.

You and Josef are right. The many extra extent io tree calls are
greatly hammering the performance of the endio function.

Just a very basic average execution time measurement around that part
shows an obvious performance drop (reading a 1GiB file):

BEFORE: (execution time between unlock_page() and the end of the loop)
total_time=117795112ns nr_events=262144
avg=449.35ns

AFTER: (execution time for the two functions at the end of the loop)
total_time=3058216317ns nr_events=262144
avg=11666.17ns
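
(To gather such per-iteration timings, instrumentation along these
lines would do; total_time/nr_events are ad-hoc counters, not part of
the code:)

	u64 probe_ns = ktime_get_ns();

	endio_readpage_release_extent(tree, start, end, uptodate);

	total_time += ktime_get_ns() - probe_ns;
	nr_events++;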

So, definitely NACK.

I'll switch back to the old behavior, but still try to enhance its
readability.

Thanks,
Qu

>
> <snip>
>
>>> Also bear in mind that this happens in a critical endio context, which
>>> uses GFP_ATOMIC allocations, so if we get ENOSPC it would be rather bad.
>>
>> I know you mean -ENOMEM.
>
> Yep, my bad.
>
>>
>> But the truth is, except for the leading/trailing sectors of the
>> extent, we shouldn't really cause extra splits/allocations.
>
> That's something you assume, so the real behavior might be different;
> again, I think we need to experiment to better understand the behavior.
>
> <snip>
>
>>> I definitely like the new code; however, without quantifying the
>>> increase in the number of calls to endio_readpage_release_extent(),
>>> I'd rather not merge it.
>>
>> Your point stands.
>>
>> I could add a new wrapper to do the same thing, with a small helper
>> structure that records the inode/extent_start/extent_len internally.
>>
>> The end result would be the same in the endio function, but much easier
>> to read. (The complex part would definitely get more comments.)
>>
>> What about this solution?
>
> IMO the best course of action is to measure the length of the extents
> being unlocked in the old version of the code and the number of bvecs
> in a bio. That way you would be able to extrapolate how many more
> extent unlock calls the new version of the code would have made. This
> will tell you how effective this optimisation really is and whether
> it's worth keeping around.
>
> <snip>
>
Qu Wenruo Nov. 6, 2020, 7:19 a.m. UTC | #7
On 2020/11/6 10:01 AM, Qu Wenruo wrote:
> ...
>>
>> Can't answer that without quantifying the performance impact, so we
>> can properly judge the complexity/performance trade-off!
>> First I'd like to have numbers showing what the overhead is, otherwise
>> it will be impossible to tell whether whatever approach you choose
>> brings any improvement.
>
> You and Josef are right. The many extra extent io tree calls are
> greatly hammering the performance of the endio function.
>
> Just a very basic average execution time measurement around that part
> shows an obvious performance drop (reading a 1GiB file):
>
> BEFORE: (execution time between unlock_page() and the end of the loop)
> total_time=117795112ns nr_events=262144
> avg=449.35ns
>
> AFTER: (execution time for the two functions at the end of the loop)
> total_time=3058216317ns nr_events=262144
> avg=11666.17ns
>
> So, definitely NACK.
>
> I'll switch back to the old behavior, but still try to enhance its
> readability.
>

Based on this result, a lot of the subpage work needs to be redone.

This shows that, although extent io tree operations can functionally
replace some of the page status bits, the performance is still way
worse than page bits.

I guess that's also why iomap uses a bitmap for each subpage, not some
btree-based global status tracking.
This means later new EXTENT_* bits for data are going to cause the same
performance impact.

Now I have to change the plan for how to attack the data read/write
path. Since a bitmap is unavoidable, it looks like waiting for iomap
integration is no longer a feasible solution for subpage support.

A similar bitmap needs to be implemented inside btrfs first, allowing
full subpage read/write ability; we can then switch to iomap afterwards.
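
For illustration, such a per-page structure could look roughly like
this (a sketch only; the name and layout are hypothetical, assuming
64K pages with 4K sectorsize, so 16 bits cover every block in a page):

	/* Hypothetical subpage state, attached via page->private */
	struct btrfs_subpage {
		spinlock_t lock;
		/* One bit per sectorsize block inside the page */
		u16 uptodate_bitmap;
		u16 error_bitmap;
	};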

The next update may be completely different, then.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f3515d3c1321..58dc55e1429d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,12 +2779,10 @@  static void end_bio_extent_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
-			      int uptodate)
+static void endio_readpage_release_extent(struct extent_io_tree *tree, u64 start,
+					  u64 end, int uptodate)
 {
 	struct extent_state *cached = NULL;
-	u64 end = start + len - 1;
 
 	if (uptodate && tree->track_uptodate)
 		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
@@ -2812,8 +2810,6 @@  static void end_bio_extent_readpage(struct bio *bio)
 	u64 start;
 	u64 end;
 	u64 len;
-	u64 extent_start = 0;
-	u64 extent_len = 0;
 	int mirror;
 	int ret;
 	struct bvec_iter_all iter_all;
@@ -2922,32 +2918,9 @@  static void end_bio_extent_readpage(struct bio *bio)
 		unlock_page(page);
 		offset += len;
 
-		if (unlikely(!uptodate)) {
-			if (extent_len) {
-				endio_readpage_release_extent(tree,
-							      extent_start,
-							      extent_len, 1);
-				extent_start = 0;
-				extent_len = 0;
-			}
-			endio_readpage_release_extent(tree, start,
-						      end - start + 1, 0);
-		} else if (!extent_len) {
-			extent_start = start;
-			extent_len = end + 1 - start;
-		} else if (extent_start + extent_len == start) {
-			extent_len += end + 1 - start;
-		} else {
-			endio_readpage_release_extent(tree, extent_start,
-						      extent_len, uptodate);
-			extent_start = start;
-			extent_len = end + 1 - start;
-		}
+		endio_readpage_release_extent(tree, start, end, uptodate);
 	}
 
-	if (extent_len)
-		endio_readpage_release_extent(tree, extent_start, extent_len,
-					      uptodate);
 	btrfs_io_bio_free_csum(io_bio);
 	bio_put(bio);
 }