
[v2,00/12] btrfs: scrub: use a more reader friendly code to implement scrub_simple_mirror()

Message ID cover.1678777941.git.wqu@suse.com

Message

Qu Wenruo March 14, 2023, 7:34 a.m. UTC
This series can be found in my github repo:

https://github.com/adam900710/linux/tree/scrub_stripe

It's recommended to fetch from the repo, as our misc-next seems to
change pretty rapidly.
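
For example, fetching it could look like this (a minimal sketch; the
remote name "adam" is arbitrary):

  # Add the repo and check out the scrub_stripe branch
  git remote add adam https://github.com/adam900710/linux.git
  git fetch adam scrub_stripe
  git checkout -b scrub_stripe adam/scrub_stripe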

[Changelog]
v2:
- Use batched scrub_stripe submission
  This allows much better performance compared to the old scrub code

- Add scrub specific bio layer helpers
  This makes the scrub code rely completely on logical bytenr +
  mirror_num.

[PROBLEMS OF OLD SCRUB]

- Too many delayed jumps, making it hard to read
  Even starting from scrub_simple_mirror(), we go through the following
  chain of functions:

  scrub_extent()
       |
       v
  scrub_sectors()
       |
       v
  scrub_add_sector_to_rd_bio()
       | endio function
       v
  scrub_bio_end_io()
       | delayed work
       v
  scrub_bio_end_io_worker()
       |
       v
  scrub_block_complete()
       |
       v
  scrub_handle_errored_blocks()
       |
       v
  scrub_recheck_block()
       |
       v
  scrub_repair_sector_from_good_copy()

  Not to mention the hidden jumps in certain branches.

- IOPS inefficient for fragmented extents

  The real block size of a scrub read is between 4K and 128K.
  If the extents are not adjacent, the block size drops to 4K, which is
  an IOPS disaster (at a fixed IOPS limit, 4K reads deliver only 1/32 of
  the bandwidth of full 128K reads).

- All the logical -> physical mapping is hardcoded inside scrub itself
  It makes no use of the existing bio facilities, and even implements
  its own RAID56 recovery wrapper.

[NEW SCRUB_STRIPE BASED SOLUTION]

- Overall streamlined code base

  queue_scrub_stripe()
     |
     v
  scrub_find_fill_first_stripe()
     |
     v
  done

  Or

  queue_scrub_stripe()
     |
     v
  flush_scrub_stripes()
     |
     v
  scrub_submit_initial_read()
     | endio function
     v
  scrub_read_endio()
     | delayed work
     v
  scrub_stripe_read_repair_worker()
     |
     v
  scrub_verify_one_stripe()
     |
     v
  scrub_stripe_submit_repair_read()
     |
     v
  scrub_write_sectors()
     |
     v
  scrub_stripe_report_errors()

  Only one endio function and one delayed work; all other work is done
  in a sequential workflow.

- Always read in 64KiB block size
  The real read block size starts at 64KiB and goes up to 512KiB.
  This already results in better performance, even for the worst case:

  With patchset:	404.81MiB/s
  Without patchset:	369.30MiB/s

  Around 10% performance improvement on a SATA SSD.

- All logical bytenr/mirror_num based read and write

  With the new single stripe fast path in btrfs_submit_bio(), scrub can
  reuse most of the bio layer code, resulting in much simpler scrub code.

[TODO]

- More testing on zoned devices
  The patchset can already pass all scrub/replace test groups with
  regular devices.

- More cleanup on RAID56 path
  RAID56 still uses some of the old facilities, so things like
  scrub_sector and scrub_bio cannot be fully cleaned up yet.

Qu Wenruo (12):
  btrfs: scrub: use dedicated super block verification function to scrub
    one super block
  btrfs: introduce a new helper to submit bio for scrub
  btrfs: introduce a new helper to submit write bio for scrub
  btrfs: scrub: introduce the structure for new BTRFS_STRIPE_LEN based
    interface
  btrfs: scrub: introduce a helper to find and fill the sector info for
    a scrub_stripe
  btrfs: scrub: introduce a helper to verify one metadata
  btrfs: scrub: introduce a helper to verify one scrub_stripe
  btrfs: scrub: introduce the main read repair worker for scrub_stripe
  btrfs: scrub: introduce a writeback helper for scrub_stripe
  btrfs: scrub: introduce error reporting functionality for scrub_stripe
  btrfs: scrub: introduce the helper to queue a stripe for scrub
  btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe
    infrastructure

 fs/btrfs/bio.c       |  142 +++-
 fs/btrfs/bio.h       |   22 +-
 fs/btrfs/file-item.c |    9 +-
 fs/btrfs/file-item.h |    3 +-
 fs/btrfs/raid56.c    |    2 +-
 fs/btrfs/scrub.c     | 1656 ++++++++++++++++++++++++++++++------------
 6 files changed, 1348 insertions(+), 486 deletions(-)

Comments

Johannes Thumshirn March 14, 2023, 10:58 a.m. UTC | #1
On 14.03.23 08:36, Qu Wenruo wrote:
> - More testing on zoned devices
>   Now the patchset can already pass all scrub/replace groups with
>   regular devices.

While probably not the ultimate solution for you here, you can use qemu
to emulate ZNS drives [1].

The TL;DR is:

qemu-system-x86_64 -device nvme,id=nvme0,serial=01234        \
	-drive file=${znsimg},id=nvmezns0,format=raw,if=none \
	-device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true

[1] https://zonedstorage.io/docs/getting-started/zbd-emulation#nvme-zoned-namespace-device-emulation-with-qemu
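
For reference, a minimal sketch of the rest of the setup (the image
size and the device name inside the guest are assumptions):

  # On the host: create a sparse backing file for the emulated ZNS namespace
  truncate -s 32G "${znsimg}"

  # Inside the guest: verify the namespace shows up as a zoned block device
  cat /sys/block/nvme0n1/queue/zoned     # should print "host-managed"
  blkzone report /dev/nvme0n1 | head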
Qu Wenruo March 14, 2023, 11:06 a.m. UTC | #2
On 2023/3/14 18:58, Johannes Thumshirn wrote:
> On 14.03.23 08:36, Qu Wenruo wrote:
>> - More testing on zoned devices
>>    Now the patchset can already pass all scrub/replace groups with
>>    regular devices.
> 
> While probably not being the ultimate solution for you here, but
> you can use qemu to emulate ZNS drives [1].
> 
> The TL;DR is:
> 
> qemu-system-x86_64 -device nvme,id=nvme0,serial=01234        \
> 	-drive file=${znsimg},id=nvmezns0,format=raw,if=none \
> 	-device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true
> 
> [1] https://zonedstorage.io/docs/getting-started/zbd-emulation#nvme-zoned-namespace-device-emulation-with-qemu
> 

Is there a libvirt XML binding for it?
I'd still prefer libvirt XML based emulation, which would benefit all my
daily runs a lot.

Thanks,
Qu
Johannes Thumshirn March 14, 2023, 11:10 a.m. UTC | #3
On 14.03.23 12:06, Qu Wenruo wrote:
> 
> 
> On 2023/3/14 18:58, Johannes Thumshirn wrote:
>> On 14.03.23 08:36, Qu Wenruo wrote:
>>> - More testing on zoned devices
>>>    Now the patchset can already pass all scrub/replace groups with
>>>    regular devices.
>>
>> While probably not being the ultimate solution for you here, but
>> you can use qemu to emulate ZNS drives [1].
>>
>> The TL;DR is:
>>
>> qemu-system-x86_64 -device nvme,id=nvme0,serial=01234        \
>> 	-drive file=${znsimg},id=nvmezns0,format=raw,if=none \
>> 	-device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true
>>
>> [1] https://zonedstorage.io/docs/getting-started/zbd-emulation#nvme-zoned-namespace-device-emulation-with-qemu
>>
> 
> Is there a libvirt xml binding for it?
> Still prefer libvirt xml based emulation, which would benefit a lot for 
> all my daily runs.

Not that I know of, but you can add qemu commandline stuff into libvirt.

Something like this (untested):

<qemu:commandline>
  <qemu:arg value='-device'/>
  <qemu:arg value='nvme,id=nvme0,serial=01234'/>
  <qemu:arg value='-drive'/>
  <qemu:arg value='file=/path/to/nvme.img,format=raw,if=none,id=nvmezns0'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true'/>
</qemu:commandline>
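
A minimal sketch of applying it (the domain name "testvm" is an
assumption); note that libvirt only honors <qemu:commandline> when the
qemu XML namespace is declared on the <domain> element, i.e.
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>:

  # Edit the domain XML and add the <qemu:commandline> block above
  virsh edit testvm
  # Stop and start the guest so the extra qemu arguments take effect
  virsh destroy testvm && virsh start testvm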
Qu Wenruo March 14, 2023, 11:18 a.m. UTC | #4
On 2023/3/14 19:10, Johannes Thumshirn wrote:
> On 14.03.23 12:06, Qu Wenruo wrote:
>>
>>
>> On 2023/3/14 18:58, Johannes Thumshirn wrote:
>>> On 14.03.23 08:36, Qu Wenruo wrote:
>>>> - More testing on zoned devices
>>>>     Now the patchset can already pass all scrub/replace groups with
>>>>     regular devices.
>>>
>>> While probably not being the ultimate solution for you here, but
>>> you can use qemu to emulate ZNS drives [1].
>>>
>>> The TL;DR is:
>>>
>>> qemu-system-x86_64 -device nvme,id=nvme0,serial=01234        \
>>> 	-drive file=${znsimg},id=nvmezns0,format=raw,if=none \
>>> 	-device nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true
>>>
>>> [1] https://zonedstorage.io/docs/getting-started/zbd-emulation#nvme-zoned-namespace-device-emulation-with-qemu
>>>
>>
>> Is there a libvirt xml binding for it?
>> Still prefer libvirt xml based emulation, which would benefit a lot for
>> all my daily runs.
> 
> Not that I know of, but you can add qemu commandline stuff into libvirt.
> 
> Something like that (untested):
> 
> <qemu:commandline>
>    <qemu:arg value='-drive'/>
>    <qemu:arg value='file=/path/to/nvme.img,format=raw,if=none,id=nvmezns0'/>
>    <qemu:arg value='-device'/>
>    <qemu:arg value='nvme-ns,drive=nvmezns0,bus=nvme0,nsid=1,zoned=true'/>
> </qemu:commandline>

I guess that's the only way to go.

I'll report back after I get it running.

If everything goes well, we will see some super niche combinations, like
64K page size aarch64 with ZNS, running for tests...

Thanks,
Qu
David Sterba March 14, 2023, 6:48 p.m. UTC | #5
On Tue, Mar 14, 2023 at 03:34:55PM +0800, Qu Wenruo wrote:
> This series can be found in my github repo:
> 
> https://github.com/adam900710/linux/tree/scrub_stripe
> 
> It's recommended to fetch from the repo, as our misc-next seems to
> change pretty rapidly.

There's a cleanup series that changed the return value type of
btrfs_bio_alloc, which is used in 3 patches, and it fails to compile. The
bio pointer needs to be switched to btrfs_bio, and that touches the logic,
so it's not a trivial change I'd otherwise do.

There's also some trailing whitespace in some patches and 'git am'
refuses to apply them. Pulling the series unfortunately won't help,
please refresh the series.
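
For what it's worth, a minimal sketch of catching this before sending
(the base branch and output directory are assumptions):

  # Run checkpatch on the generated patches; trailing whitespace is
  # reported as an error
  git format-patch misc-next.. -o outgoing/
  ./scripts/checkpatch.pl --terse outgoing/*.patch

  # Or fix it up while applying on the receiving side
  git am --whitespace=fix outgoing/*.patch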

Regarding misc-next updates: it gets rebased each Monday after a -rc is
released so the potentially duplicated patches merged in the last week
disappear from misc-next. Otherwise I don't rebase it, only append
patches and occasionally update tags or some trivial bits in the code.

If you work on a patchset for a long time it may become a chasing game
for the stable base, in that case we need to coordinate and/or postpone
some series.
Qu Wenruo March 14, 2023, 10:42 p.m. UTC | #6
On 2023/3/15 02:48, David Sterba wrote:
> On Tue, Mar 14, 2023 at 03:34:55PM +0800, Qu Wenruo wrote:
>> This series can be found in my github repo:
>>
>> https://github.com/adam900710/linux/tree/scrub_stripe
>>
>> It's recommended to fetch from the repo, as our misc-next seems to
>> change pretty rapidly.
> 
> There's a cleanup series that changed return value type of
> btrfs_bio_alloc which is used in 3 patches and it fails to compile. The
> bio pointer needs to be switched to btrfs_bio and touches the logic so
> it's not a trivial change I'd otherwise do.

I'm aware of that type-safety patchset and that's expected; I'm already
prepared to rebase.

The reason I sent the series as-is is to get some more feedback,
especially considering how large the series is and that it touches the
core functionality of scrub.

> 
> There's also some trailing whitespace in some patches and 'git am'
> refuses to apply them. Pulling the series won't unfortunatelly help,
> please refresh the series.

That's a total surprise to me.

Shouldn't the btrfs workflow catch such problems?
Or is the hook not triggered on a rebase?

Anyway thanks for catching this new problem.

Thanks,
Qu

> 
> Regarding misc-next updates: it gets rebased each Monday after a -rc is
> released so the potentially duplicated patches merged in the last week
> disappear from misc-next. Otherwise I don't rebase it, only append
> patches and occasionally update tags or some trivial bits in the code.
> 
> If you work on a patchset for a long time it may become a chasing game
> for the stable base, in that case we need to coordinate and/or postpone
> some series.
Christoph Hellwig March 15, 2023, 7:52 a.m. UTC | #7
On Tue, Mar 14, 2023 at 03:34:55PM +0800, Qu Wenruo wrote:
> - More cleanup on RAID56 path
>   Now RAID56 still uses some old facility, resulting things like
>   scrub_sector and scrub_bio can not be fully cleaned up.

I think converting the raid path is something that should be done
before merging the series, instead of leaving the parallel
infrastructure in.
Qu Wenruo March 15, 2023, 8:11 a.m. UTC | #8
On 2023/3/15 15:52, Christoph Hellwig wrote:
> On Tue, Mar 14, 2023 at 03:34:55PM +0800, Qu Wenruo wrote:
>> - More cleanup on RAID56 path
>>    Now RAID56 still uses some old facility, resulting things like
>>    scrub_sector and scrub_bio can not be fully cleaned up.
> 
> I think converting the raid path is something that should be done
> before merging the series, instead of leaving the parallel
> infrastructure in.

The RAID56 scrub path is indeed a little messy, but I'm not sure what
the better way to clean it up would be.

For now, if it's a data stripe, it's already using scrub_simple_mirror(),
so that's not a big deal.

The problem is in scrub_raid56_data_stripes_for_parity(), which does the
same data stripe scrubbing, but with slightly different behavior
dedicated to parity.

My current plan is to go with scrub_stripe for the low-hanging fruit
(everything except P/Q scrub; perf and readability improvements).

Then convert the remaining RAID56 code to the scrub_stripe facility.

The conversion itself may be even larger than this patchset, although
most of the change would just be dropping the old facilities.


So for now, I prefer to perfect scrub_stripe for the low-hanging fruit
first, leaving the days of the old facilities numbered, then do a proper
final conversion.

Thanks,
Qu
Qu Wenruo March 28, 2023, 9:34 a.m. UTC | #9
On 2023/3/15 15:52, Christoph Hellwig wrote:
> On Tue, Mar 14, 2023 at 03:34:55PM +0800, Qu Wenruo wrote:
>> - More cleanup on RAID56 path
>>    Now RAID56 still uses some old facility, resulting things like
>>    scrub_sector and scrub_bio can not be fully cleaned up.
> 
> I think converting the raid path is something that should be done
> before merging the series, instead of leaving the parallel
> infrastructure in.

BTW, I finally have a local branch with the remaining path converted to
the new infrastructure.

The real problem is not converting the raid path.
The patch doing the conversion is pretty small; mostly thanks to the new
scrub_stripe infrastructure, we can get rid of the complex scrub_parity
and scrub_recover related code.

  fs/btrfs/scrub.c | 168 ++++++++++++++++++++++++++++++++++++++++++++---
  fs/btrfs/scrub.h |   4 ++
  2 files changed, 162 insertions(+), 10 deletions(-)

The problem is how I clean up the existing scrub infrastructure (the
scrub_sector, scrub_block, scrub_recover and scrub_parity structures and
the calls involved).

Currently I just do a single cleanup patch, and the result is super awful:

  fs/btrfs/scrub.c | 2513 +---------------------------------------------
  fs/btrfs/scrub.h |    4 -
  2 files changed, 5 insertions(+), 2512 deletions(-)

To be honest, I'm more concerned about how to split the cleanup patch.

Thanks,
Qu
Christoph Hellwig March 28, 2023, 11:37 p.m. UTC | #10
On Tue, Mar 28, 2023 at 05:34:03PM +0800, Qu Wenruo wrote:
> Currently I just do a single patch to cleanup, the result is super aweful:
> 
>  fs/btrfs/scrub.c | 2513 +---------------------------------------------
>  fs/btrfs/scrub.h |    4 -
>  2 files changed, 5 insertions(+), 2512 deletions(-)

To me this looks perfect.  A patch that just removes a lot of code is
always good.
Qu Wenruo March 28, 2023, 11:44 p.m. UTC | #11
On 2023/3/29 07:37, Christoph Hellwig wrote:
> On Tue, Mar 28, 2023 at 05:34:03PM +0800, Qu Wenruo wrote:
>> Currently I just do a single patch to cleanup, the result is super aweful:
>>
>>   fs/btrfs/scrub.c | 2513 +---------------------------------------------
>>   fs/btrfs/scrub.h |    4 -
>>   2 files changed, 5 insertions(+), 2512 deletions(-)
> 
> To me this looks perfect.  A patch that just removes a lot of code is
> always good.

Isn't the patch split also important?

Anyway I will try some different methods to split that.

Thanks,
Qu
Christoph Hellwig March 28, 2023, 11:50 p.m. UTC | #12
On Wed, Mar 29, 2023 at 07:44:59AM +0800, Qu Wenruo wrote:
> Isn't the patch split also important?
> 
> Anyway I will try some different methods to split that.

I'll have to defer to David as btrfs is sometimes a little different
from the rest of the kernel in its requirements, but everywhere
else a patch that just removes code can be as big as it gets.