mbox series

[PoC,00/11] btrfs: scrub: rework to get rid of the complex bio formshaping

Message ID cover.1670314744.git.wqu@suse.com (mailing list archive)
Headers show
Series btrfs: scrub: rework to get rid of the complex bio formshaping | expand

Message

Qu Wenruo Dec. 6, 2022, 8:23 a.m. UTC
This is a proof-of-concept patchset, so please don't merge it.

This series is mostly a full rework of low level scrub code.

Although I have implemented full support for all profiles, only the
raid56 code is partially cleaned up so far (already over 1K lines
removed).  The full cleanup is estimated to remove another 1~2K lines
eventually.

The series is sent out for feedback, as the full patchset can be very
large (mostly removing old code).

[PROBLEMS OF SCRUB]

TL;DR
The current scrub code is way too complex for future expansion.

Current scrub code has a complex system to manage its in-flight bios.

This behavior is designed to improve scrub performance, but it has a lot
of disadvantages too:

- Too many indirect calls/jumps

  To scrub one extent in a simple mirror (like SINGLE), the call chain
  involves the following functions:

  /* Before entering scrub_simple_mirror() */
  scrub_ctx()
  |- INIT_WORK(&sbio->work, scrub_bio_end_io_worker);

  /* In scrub_simple_mirror() */
  scrub_extent()
  |- scrub_sectors()
     |- scrub_add_sector_to_rd_bio()
        |- sbio->bio->bi_end_io = scrub_bio_end_io;

  /* Now it jumps to the endio function */

  scrub_bio_end_io()
  |- queue_work()

  /* Now it jumps to workqueue, which is setup in scrub_ctx() */
  scrub_bio_end_io_worker()
  |- scrub_block_complete()
     |- scrub_handle_errored_block() /* For corruption case */
     |  |- scrub_recheck_block()
     |     |- scrub_repair_block_from_good_copy()
     |- scrub_checksum()
     |- scrub_write_block_to_dev_replace()

  All these jumps and delayed calls are really a mess to read.
  It makes me wonder whether the original code was really designed for
  humans to read.

- Complex bio form-shaping
  We have scrub_bio to manage the in-flight bios.

  It has at most 64 bios for one device scrub, and each bio can be as
  large as 128K.

  This is definitely a big performance enhancement.

  But I'm not convinced that scrub performance should be the first thing
  to consider.

- No usage of btrfs_submit_bio()
  This means we're reimplementing a lot of things that
  btrfs_submit_bio() can already handle, which results in quite a lot
  of duplicated code.

[ENHANCEMENT]

This patchset will introduce a new infrastructure, scrub2_stripe.

The "scrub2" prefix is to avoid naming conflicts.

The overall idea is that we always scrub in units of BTRFS_STRIPE_LEN,
hence the "scrub2_stripe".

Furthermore, all work will be done in a submit-and-wait fashion,
reducing the delayed calls/jumps to a minimum.

The new scrub entrance in scrub_simple_mirror() would look like this:

  scrub_simple_mirror()
  |  /* Find a stripe with at least one sector used */
  |- scrub2_find_fill_first_stripe()
  |
  |  /* Submit a read for the whole 64KiB */
  |- scrub2_read_and_wait_stripe()
  |  |- btrfs_submit_bio()
  |  |  /* do the verification for all sectors */
  |  |- scrub2_verify_one_stripe()
  |
  |- scrub2_repair_one_stripe()
  |  |- scrub2_repair_from_mirror()
  |     |  /* reuse the existing read path */
  |     |- scrub2_read_and_wait_stripe()
  |
  |- scrub2_report_errors()
  |
  |  /*
  |   * For dev-replace and regular scrub repair, the write range
  |   * will be different.
  |   * (replace will writeback all good sectors, repair only writes
  |   *  back repaired sectors).
  |   */
  |- scrub2_writeback_sectors()
  
Everything is done in a submit-and-wait fashion.

Although this will reduce concurrency, the readability will be greatly
improved.

Furthermore, since we're already submitting reads in 64KiB units, the
scheme is less IOPS-intensive, so reads are already somewhat optimized.

Even if the performance turns out to be not that good, it can be
enhanced later by using scrub2_stripe_group to submit multiple stripes
in one go (i.e., enlarging the effective block size), without damaging
readability.

[CURRENT FEATURES]

- Working scrub/replace for all profiles
  Currently only non-raid56 is tested.

[TODO]

There are a lot of todos:

- Completely remove scrub_bio/scrub_block/scrub_sector
  I estimate that would remove a further 1~2K lines.

  Unfortunately, this huge cleanup will take way more patches than
  usual.

- Add proper support for zoned devices
  The zoned code is not touched yet, and since the existing zoned code
  relies on scrub_bio, it won't work at all.

- Proper performance benchmarking


Qu Wenruo (11):
  btrfs: scrub: introduce the structure for new BTRFS_STRIPE_LEN based
    interface
  btrfs: scrub: introduce a helper to find and fill the sector info for
    a scrub2_stripe
  btrfs: scrub: introduce a helper to verify one scrub2_stripe
  btrfs: scrub: add the repair function for scrub2_stripe
  btrfs: scrub: add a writeback helper for scrub2_stripe
  btrfs: scrub: add the error reporting for scrub2_stripe
  btrfs: scrub: add raid56 P/Q scrubbing support
  btrfs: scrub: use dedicated super block verification function to scrub
    one super block
  btrfs: scrub: switch to the new scrub2_stripe based infrastructure
  btrfs: scrub: cleanup the old scrub_parity infrastructure
  btrfs: scrub: cleanup scrub_extent() and its related functions

 fs/btrfs/disk-io.c   |   10 +-
 fs/btrfs/disk-io.h   |    2 +
 fs/btrfs/file-item.c |    8 +-
 fs/btrfs/file-item.h |    3 +-
 fs/btrfs/raid56.c    |    2 +-
 fs/btrfs/scrub.c     | 2263 +++++++++++++++++++++++-------------------
 6 files changed, 1238 insertions(+), 1050 deletions(-)

Comments

Christoph Hellwig Dec. 6, 2022, 8:40 a.m. UTC | #1
On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote:
> TL;DR
> The current scrub code is way too complex for future expansion.
> 
> Current scrub code has a complex system to manage its in-flight bios.

From my own ventures into that code I have to agree.

> This behavior is designed to improve scrub performance, but has a lot of
> disadvantage too:

Just curious:  any idea how it was supposed to improve performance?
Because the code does not actually look particularly optimized in terms
of I/O patterns.

> Furthermore, all work will done in a submit-and-wait fashion, reducing
> the delayed calls/jumps to minimal.

I think even with this overall scheme we could do a bit of async
state machine if needed.  But then again scrub is not the main
I/O fast path, so if in doubt we can just throw more threads at the
problem if that becomes too complicated.
Qu Wenruo Dec. 6, 2022, 9:29 a.m. UTC | #2
On 2022/12/6 16:40, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote:
>> TL;DR
>> The current scrub code is way too complex for future expansion.
>>
>> Current scrub code has a complex system to manage its in-flight bios.
> 
>  From my own ventures into that code I have to agree.
> 
>> This behavior is designed to improve scrub performance, but has a lot of
>> disadvantage too:
> 
> Just curious:  any idea how it was supposed to improve performance?
> Because the code does not actually look particularly optimized in terms
> of I/O patterns.

I'm not 100% sure, but my educated guess is that it's just about merging bios.

- Merge physically adjacent reads/writes into one scrub_bio
   This mostly helps RAID0/RAID10/RAID5 scrubbing.
   Although each logical stripe is only 64K, this allows scrub to
   assemble reads/writes into a larger bio instead of always splitting
   them at the stripe boundary.

   However, the effectiveness should not be that large, as the
   scrub_bio size is only slightly increased, from 64K to 128K.

I'd argue that if the extents are all interleaved, then my 
always-read-64K behavior would be better.

> 
>> Furthermore, all work will done in a submit-and-wait fashion, reducing
>> the delayed calls/jumps to minimal.
> 
> I think even with this overall scheme we could do a bit of async
> state machine if needed.

Yes, the infrastructure is already here.

For raid56, we need to scrub the data stripes first.
For that case, scrub2_stripe_group is introduced, which uses one
workqueue item per stripe to scrub them.

So if we later determine that the current read-64K-then-move-on
behavior is not fast enough, we can use scrub2_stripe_group to run
several stripes at once.

>  But then again scrub is not the main
> I/O fast path, so if in doubt we can just throw more threads at the
> problem if that becomes too complicated.

Exactly.

I'll do some benchmarks later to show the difference.

Thanks,
Qu
Josef Bacik Dec. 13, 2022, 10:08 p.m. UTC | #3
On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote:
> [ full cover letter snipped ]

I looked through everything and I don't see any glaring problems.  I would
definitely like to see a decent comment at the top of scrub.c detailing the
behavior, similar to delalloc-space.c or space-info.c.

I do not love the scrub2 thing, but if the entire patchset ended with
s/scrub2/scrub/g then I suppose that would be ok.  Ditto for exporting functions
that aren't prefixed with btrfs_.  If you're going to eventually come through
and clean that up then by all means do this, I just would want to see it be
properly cleaned at the end.

I didn't pay too close attention to the code; the missing parts like zoned
support are enough that I don't want to devote too much attention to code that
is likely to change between now and its final form.  Your design makes sense.
I don't care about scrub performance in general; as long as it's not unusably
slow I'd happily trade performance for readability and better maintainability.
But I definitely want the performance changes described, so we know what we're
paying for.  Thanks,

Josef