Message ID | cover.1670314744.git.wqu@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | btrfs: scrub: rework to get rid of the complex bio formshaping | expand |
On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote: > TL;DR > The current scrub code is way too complex for future expansion. > > Current scrub code has a complex system to manage its in-flight bios. From my own ventures into that code I have to agree. > This behavior is designed to improve scrub performance, but has a lot of > disadvantage too: Just curious: any idea how it was supposed to improve performance? Because the code does not actually look particularly optimized in terms of I/O patterns. > Furthermore, all work will done in a submit-and-wait fashion, reducing > the delayed calls/jumps to minimal. I think even with this overall scheme we could do a bit of async state machine if needed. But then again scrube is not the main I/O fast path, so in doubt we can just throw more threads at the problem if that becomes too complicated.
On 2022/12/6 16:40, Christoph Hellwig wrote: > On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote: >> TL;DR >> The current scrub code is way too complex for future expansion. >> >> Current scrub code has a complex system to manage its in-flight bios. > > From my own ventures into that code I have to agree. > >> This behavior is designed to improve scrub performance, but has a lot of >> disadvantage too: > > Just curious: any idea how it was supposed to improve performance? > Because the code does not actually look particularly optimized in terms > of I/O patterns. I'm not 100% sure, but my educated guess is, it's just merging bios. - Merge physically adjacent read/writes into one scrub_bio This will mostly help RAID0/RAID10/RAID5 scrubbing. As although logically each stripe is only 64K, this allows scrub to assemble all read/writes into a larger bio instead always split them at stripe boundary. However the effectiveness should not be that huge, as the scrub_bio size is only slightly increased from 64K to 128K. I'd argue that if the extents are all interleaved, then my always-read-64K behavior would be better. > >> Furthermore, all work will done in a submit-and-wait fashion, reducing >> the delayed calls/jumps to minimal. > > I think even with this overall scheme we could do a bit of async > state machine if needed. Yes, the infrastructure is already here. In raid56, we need to scrub the data stripes first. In that case, scrub2_stripe_group is introduced, and use workqueue for each stripe to scrub. So if later we determine the current read-64K-then-next behavior is not fast enough, we can use scrub2_stripe_group to run several stripes at once. > But then again scrube is not the main > I/O fast path, so in doubt we can just throw more threads at the > problem if that becomes too complicated. Exactly. I'd do some benchmark later to show the difference. Thanks, Qu
On Tue, Dec 06, 2022 at 04:23:27PM +0800, Qu Wenruo wrote: > This is a proof-of-concept patchset, thus don't merge. > > This series is mostly a full rework of low level scrub code. > > Although I implemented the full support for all profiles, only raid56 > code is partially cleaned up (which is already over 1K lines removed). > The estimated full cleanup will be around 1~2K more lines removed > eventually. > > The series is sent out for feedback, as the full patchset can be very > large (mostly to remove old codes). > > [PROBLEMS OF SCRUB] > > TL;DR > The current scrub code is way too complex for future expansion. > > Current scrub code has a complex system to manage its in-flight bios. > > This behavior is designed to improve scrub performance, but has a lot of > disadvantage too: > > - Too many indirect calls/jumps > > To scrub one extent in a simple mirror (like SINGLE), the call chain > involves the following functions: > > /* Before entering scrub_simple_mirror() */ > scrub_ctx() > |- INIT_WORK(&sbio->work, scrub_bio_end_io_worker); > > /* In scrub_simple_mirror() */ > scrub_extent() > |- scrub_sectors() > |- scrub_add_sector_to_rd_bio() > |- sbio->bio->bi_end_io = scrub_bio_end_io; > > /* Now it jumps to the endio function */ > > scrub_bio_end_io() > |- queue_work() > > /* Now it jumps to workqueue, which is setup in scrub_ctx() */ > scrub_bio_end_io_worker() > |- scrub_block_complete() > |- scrub_handle_errored_block() /* For corruption case */ > | |- scrub_recheck_block() > | |- scrub_repair_block_from_good_copy() > |- scrub_checksum() > |- scrub_write_block_to_dev_replace() > > The whole jumps/delayed calls are really a mess to read. > This makes me wonder if the original code is really designed for human > to read. > > - Complex bio form-shaping > We have scrub_bio to manage the in-flight bios. > > It has at most 64 bios for one device scrub, and each bio can be as > large as 128K. > > This is definitely a big performance enhancement. > > But I'm not convinced that scrub performance should be the first thing > to consider. > > - No usage of btrfs_submit_bio() > This means we're doing a lot of things which can already be handled by > btrfs_submit_bio(). > > This means quite some duplicated code. > > [ENHANCEMENT] > > This patchset will introduce a new infrasturcture, scrub2_stripe. > > The "scrub2" prefix is to avoid naming conflicts. > > The overall idea is, we always do scrub in the unit of BTRFS_STRIPE_LEN, > hence the "scrub2_stripe". > > Furthermore, all work will done in a submit-and-wait fashion, reducing > the delayed calls/jumps to minimal. > > The new scrub entrance in scrub_simple_mirror() would looks like this: > > scrub_simple_mirror() > | /* Find a stripe with at least one sector used */ > |- scrub2_find_fill_first_stripe() > | > | /* Submit a read for the whole 64KiB */ > |- scrub2_read_and_wait_stripe() > | |- btrfs_submit_bio() > | | /* do the verification for all sectors */ > | |- scrub2_verify_one_stripe() > | > |- scrub2_reapair_one_stripe() > | |- scrub2_repair_from_mirror() > | | /* reuse the existing read path */ > | |- scrub2_read_and_wait_stripe() > | > |- scrub2_report_errors() > | > | /* > | * For dev-replace and regular scrub repair, the write range > | * will be different. > | * (replace will writeback all good sectors, repair only writes > | * back repaired sectors). > | */ > |- scrub2_writeback_sectors() > > Everything is done in a submit-and-wait fashion. > > Although this will reduce concurrency, the readability will be greatly > improved. > > Furthermore since we're already submitting read in 64KiB size, it's less > IOPS intense, thus it's already doing optimization for read. > > Even if the performance is not that good, it can be enhanced later by > using scrub2_stripe_group to submit multiple stripes in one go (aka, > enlarge the block size) to improve performance, without damaging > readability. > > [CURRENT FEATURES] > > - Working scrub/replace for all profiles > Currently only non-raid56 is tested. > > [TODO] > > There are a lot of todos: > > - Completely remove scrub_bio/scrub_block/scrub_sector > I estimate that would result further 1~2K lines removed. > > Unfortunately, thus huge cleanup will take way more patches than > usual. > > - Add proper support for zoned devices > Currently zoned code is not touched, but existing zoned code relies on > scrub_bio. > Thus they won't work at all. > > - Proper performance benchmarking I looked through everything, I don't see any glaring problems. I definitely would like to see a decent comment at the top of scrub.c detailing the behavior, similar to delalloc-space.c or space-info.c. I do not love the scrub2 thing, but if the entire patchset ended with s/scrub2/scrub/g then I suppose that would be ok. Ditto for exporting functions that aren't prefix'ed with btrfs_. If you're going to eventually come through and clean that up then by all means do this, I just would want to see it be properly cleaned at the end. I didn't pay too close attention to the code, the missing parts like zoned and stuff are enough that I don't want to devote too much attention to code that is likely to change between now and it's final form. Your design makes sense, I don't care about scrub performance in general, as long as it's not unusably slow I'd happily trade performance for readability and better maintainability. But I definitely want the performance changes described, so we know what we're paying for. Thanks, Josef