Message ID | 6c1ffe48e93fee9aa975ecc22dc2e7a1f3d7a0de.1688539673.git.wqu@suse.com (mailing list archive)
State      | New, archived
Series     | btrfs: speedup scrub csum verification
On Wed, Jul 05, 2023 at 02:48:48PM +0800, Qu Wenruo wrote:
> [REGRESSION]
> There is a report that scrub is much slower on the v6.4 kernel on fast
> NVMe devices.
>
> The system has an NVMe device which can reach over 3GB/s, but scrub
> speed is below 1GB/s.
>
> [CAUSE]
> Since commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
> to scrub_stripe infrastructure") scrub uses a completely new
> implementation.
>
> There is a behavior change: previously scrub did csum verification one
> thread per block, but the new code works one thread per stripe.
>
> This means in the worst case the new code has only one thread verifying
> a whole 64K stripe filled with data, while the old code used 16 threads
> to handle the same stripe.
>
> Considering the reporter's CPU can only do CRC32C at around 2GB/s,
> while the NVMe drive can do 3GB/s, the difference can be big:
>
>   1 thread:  1 / (1/3 + 1/2)     = 1.2 GB/s
>   8 threads: 1 / (1/3 + 1/(8*2)) = 2.5 GB/s
>
> [FIX]
> To fix the performance regression, this patch introduces multi-threaded
> csum verification by:
>
> - Introduce a new workqueue for scrub csum verification
>   The new workqueue is needed because we cannot queue the csum work
>   into the main scrub workqueue, where we are already waiting for that
>   csum work to finish; otherwise this can deadlock if no more workers
>   can be allocated.
>
> - Add extra members to scrub_sector_verification
>   This allows a work item to be queued for a specific sector, at the
>   cost of 20 bytes of overhead per sector.
>
> - Queue sector verification work into scrub_csum_workers
>   This allows multiple threads to handle the csum verification
>   workload.
>
> - Do not reset stripe->sectors during scrub_find_fill_first_stripe()
>   Since the sectors now carry extra state, we must not wipe those
>   members.
>
> Reported-by: Bernd Lentes <bernd.lentes@helmholtz-muenchen.de>
> Link: https://lore.kernel.org/linux-btrfs/CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Added to misc-next, thanks.
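For reference, the throughput estimate quoted above follows a simple pipeline
model: reads proceed at 3 GB/s and CRC32C at 2 GB/s, and only the checksumming
step is spread over N threads, so the combined rate is

\[
  T(N) = \frac{1}{\frac{1}{3} + \frac{1}{2N}}\ \text{GB/s},
  \qquad T(1) = 1.2, \qquad T(8) = \frac{48}{19} \approx 2.5
\]

i.e. fully parallelizing the csum step alone would roughly double scrub
throughput on this hardware.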
On 2023/7/12 05:01, David Sterba wrote:
> On Wed, Jul 05, 2023 at 02:48:48PM +0800, Qu Wenruo wrote:
>> [REGRESSION]
>> There is a report that scrub is much slower on the v6.4 kernel on fast
>> NVMe devices.
>> [...]
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Added to misc-next, thanks.

Please drop this. I got feedback from a real-world tester, and it
doesn't help at all (although it does show that CPU usage is indeed
lower than before).

Thanks,
Qu
On Wed, Jul 12, 2023 at 06:24:41AM +0800, Qu Wenruo wrote:
> On 2023/7/12 05:01, David Sterba wrote:
>> On Wed, Jul 05, 2023 at 02:48:48PM +0800, Qu Wenruo wrote:
>>> [...]
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> Added to misc-next, thanks.
>
> Please drop this. I got feedback from a real-world tester, and it
> doesn't help at all (although it does show that CPU usage is indeed
> lower than before).

Ok, no problem. Parallelizing the checksums is likely the way to go,
but I don't see how else it could be made faster.
On 2023/7/12 06:57, David Sterba wrote:
> On Wed, Jul 12, 2023 at 06:24:41AM +0800, Qu Wenruo wrote:
>> On 2023/7/12 05:01, David Sterba wrote:
>>> Added to misc-next, thanks.
>>
>> Please drop this. I got feedback from a real-world tester, and it
>> doesn't help at all (although it does show that CPU usage is indeed
>> lower than before).
>
> Ok, no problem. Parallelizing the checksums is likely the way to go,
> but I don't see how else it could be made faster.

The problem with the patch seems to be how the csum verification is
distributed. It hands out the work on a per-sector basis, which appears
to hurt CPU cache locality a lot.

Another thing: for the reported regression, the tester's workload has a
lot of large files, and I'm assuming the block size of the old scrub
could grow much larger than the new 8 * 64K.

I'll get a physical machine to do more testing, and hopefully we can
come up with a better solution to this performance regression.

Thanks,
Qu
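A minimal sketch of the batched distribution Qu describes, assuming a
hypothetical per-range work item on top of the patch's workqueue; struct
scrub_verify_batch and scrub_verify_batch_work() below are illustrative and
not part of the posted patch:

/*
 * Hypothetical alternative to per-sector work items: verify a run of
 * consecutive sectors per work item, so each worker walks a contiguous
 * region of the stripe instead of bouncing between single sectors.
 */
struct scrub_verify_batch {
	struct work_struct work;
	struct scrub_stripe *stripe;
	unsigned int first_sector;
	unsigned int nr_sectors;	/* e.g. a quarter of the stripe */
};

static void scrub_verify_batch_work(struct work_struct *work)
{
	struct scrub_verify_batch *batch =
		container_of(work, struct scrub_verify_batch, work);

	/*
	 * Assumes the pre-patch scrub_verify_one_sector(), which does
	 * not touch stripe->pending_io itself.
	 */
	for (unsigned int i = 0; i < batch->nr_sectors; i++)
		scrub_verify_one_sector(batch->stripe,
					batch->first_sector + i);

	/* One completion per batch rather than one per sector. */
	if (atomic_dec_and_test(&batch->stripe->pending_io))
		wake_up(&batch->stripe->io_wait);
}

The right batch granularity (how many sectors per item) would presumably need
the physical-machine testing mentioned above.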
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 203d2a267828..0c0cb5b0e471 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -643,6 +643,7 @@ struct btrfs_fs_info {
 	 */
 	refcount_t scrub_workers_refcnt;
 	struct workqueue_struct *scrub_workers;
+	struct workqueue_struct *scrub_csum_workers;
 
 	struct btrfs_subpage_info *subpage_info;
 	struct btrfs_discard_ctl discard_ctl;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 38c103f13fd5..f4df3b8852a6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -72,6 +72,11 @@ struct scrub_sector_verification {
 		 */
 		u64 generation;
 	};
+
+	/* For multi-thread verification. */
+	struct scrub_stripe *stripe;
+	struct work_struct work;
+	unsigned int sector_nr;
 };
 
 enum scrub_stripe_flags {
@@ -258,6 +263,12 @@ static int init_scrub_stripe(struct btrfs_fs_info *fs_info,
 				  GFP_KERNEL);
 	if (!stripe->sectors)
 		goto error;
+	for (int i = 0; i < stripe->nr_sectors; i++) {
+		struct scrub_sector_verification *sector = &stripe->sectors[i];
+
+		sector->stripe = stripe;
+		sector->sector_nr = i;
+	}
 
 	stripe->csums = kcalloc(BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits,
 				fs_info->csum_size, GFP_KERNEL);
@@ -680,11 +691,11 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 
 	/* Sector not utilized, skip it. */
 	if (!test_bit(sector_nr, &stripe->extent_sector_bitmap))
-		return;
+		goto out;
 
 	/* IO error, no need to check. */
 	if (test_bit(sector_nr, &stripe->io_error_bitmap))
-		return;
+		goto out;
 
 	/* Metadata, verify the full tree block. */
 	if (sector->is_metadata) {
@@ -702,10 +713,10 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 				  stripe->logical +
 				  (sector_nr << fs_info->sectorsize_bits),
 				  stripe->logical);
-			return;
+			goto out;
 		}
 		scrub_verify_one_metadata(stripe, sector_nr);
-		return;
+		goto out;
 	}
 
 	/*
@@ -714,7 +725,7 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 	 */
 	if (!sector->csum) {
 		clear_bit(sector_nr, &stripe->error_bitmap);
-		return;
+		goto out;
 	}
 
 	ret = btrfs_check_sector_csum(fs_info, page, pgoff, csum_buf, sector->csum);
@@ -725,6 +736,17 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 		clear_bit(sector_nr, &stripe->csum_error_bitmap);
 		clear_bit(sector_nr, &stripe->error_bitmap);
 	}
+out:
+	atomic_dec(&stripe->pending_io);
+	wake_up(&stripe->io_wait);
+}
+
+static void scrub_verify_work(struct work_struct *work)
+{
+	struct scrub_sector_verification *sector = container_of(work,
+			struct scrub_sector_verification, work);
+
+	scrub_verify_one_sector(sector->stripe, sector->sector_nr);
 }
 
 /* Verify specified sectors of a stripe. */
@@ -734,11 +756,24 @@ static void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap)
 	const u32 sectors_per_tree = fs_info->nodesize >> fs_info->sectorsize_bits;
 	int sector_nr;
 
+	/* All IO should have finished, and we will reuse pending_io soon. */
+	ASSERT(atomic_read(&stripe->pending_io) == 0);
+
 	for_each_set_bit(sector_nr, &bitmap, stripe->nr_sectors) {
-		scrub_verify_one_sector(stripe, sector_nr);
+		struct scrub_sector_verification *sector = &stripe->sectors[sector_nr];
+
+		/* The sector should have been initialized. */
+		ASSERT(sector->sector_nr == sector_nr);
+		ASSERT(sector->stripe == stripe);
+
+		atomic_inc(&stripe->pending_io);
+		INIT_WORK(&sector->work, scrub_verify_work);
+		queue_work(fs_info->scrub_csum_workers, &sector->work);
+
 		if (stripe->sectors[sector_nr].is_metadata)
 			sector_nr += sectors_per_tree - 1;
 	}
+	wait_scrub_stripe_io(stripe);
 }
 
 static int calc_sector_number(struct scrub_stripe *stripe, struct bio_vec *first_bvec)
@@ -1484,8 +1519,6 @@ static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
 	u64 extent_gen;
 	int ret;
 
-	memset(stripe->sectors, 0, sizeof(struct scrub_sector_verification) *
-				   stripe->nr_sectors);
 	scrub_stripe_reset_bitmaps(stripe);
 
 	/* The range must be inside the bg. */
@@ -2692,12 +2725,17 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info)
 	if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt,
 					&fs_info->scrub_lock)) {
 		struct workqueue_struct *scrub_workers = fs_info->scrub_workers;
+		struct workqueue_struct *scrub_csum_workers =
+			fs_info->scrub_csum_workers;
 
 		fs_info->scrub_workers = NULL;
+		fs_info->scrub_csum_workers = NULL;
 		mutex_unlock(&fs_info->scrub_lock);
 
 		if (scrub_workers)
 			destroy_workqueue(scrub_workers);
+		if (scrub_csum_workers)
+			destroy_workqueue(scrub_csum_workers);
 	}
 }
 
@@ -2708,6 +2746,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 						int is_dev_replace)
 {
 	struct workqueue_struct *scrub_workers = NULL;
+	struct workqueue_struct *scrub_csum_workers = NULL;
 	unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
 	int max_active = fs_info->thread_pool_size;
 	int ret = -ENOMEM;
@@ -2722,10 +2761,18 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	if (!scrub_workers)
 		return -ENOMEM;
 
+	scrub_csum_workers = alloc_workqueue("btrfs-scrub-csum", flags, max_active);
+	if (!scrub_csum_workers) {
+		destroy_workqueue(scrub_workers);
+		return -ENOMEM;
+	}
+
 	mutex_lock(&fs_info->scrub_lock);
 	if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
-		ASSERT(fs_info->scrub_workers == NULL);
+		ASSERT(fs_info->scrub_workers == NULL &&
+		       fs_info->scrub_csum_workers == NULL);
 		fs_info->scrub_workers = scrub_workers;
+		fs_info->scrub_csum_workers = scrub_csum_workers;
 		refcount_set(&fs_info->scrub_workers_refcnt, 1);
 		mutex_unlock(&fs_info->scrub_lock);
 		return 0;
@@ -2737,6 +2784,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	ret = 0;
 
 	destroy_workqueue(scrub_workers);
+	destroy_workqueue(scrub_csum_workers);
 
 	return ret;
 }
[REGRESSION]
There is a report that scrub is much slower on the v6.4 kernel on fast
NVMe devices.

The system has an NVMe device which can reach over 3GB/s, but scrub
speed is below 1GB/s.

[CAUSE]
Since commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure") scrub uses a completely new
implementation.

There is a behavior change: previously scrub did csum verification one
thread per block, but the new code works one thread per stripe.

This means in the worst case the new code has only one thread verifying
a whole 64K stripe filled with data, while the old code used 16 threads
to handle the same stripe.

Considering the reporter's CPU can only do CRC32C at around 2GB/s,
while the NVMe drive can do 3GB/s, the difference can be big:

  1 thread:  1 / (1/3 + 1/2)     = 1.2 GB/s
  8 threads: 1 / (1/3 + 1/(8*2)) = 2.5 GB/s

[FIX]
To fix the performance regression, this patch introduces multi-threaded
csum verification by:

- Introduce a new workqueue for scrub csum verification
  The new workqueue is needed because we cannot queue the csum work
  into the main scrub workqueue, where we are already waiting for that
  csum work to finish; otherwise this can deadlock if no more workers
  can be allocated.

- Add extra members to scrub_sector_verification
  This allows a work item to be queued for a specific sector, at the
  cost of 20 bytes of overhead per sector.

- Queue sector verification work into scrub_csum_workers
  This allows multiple threads to handle the csum verification
  workload.

- Do not reset stripe->sectors during scrub_find_fill_first_stripe()
  Since the sectors now carry extra state, we must not wipe those
  members.

Reported-by: Bernd Lentes <bernd.lentes@helmholtz-muenchen.de>
Link: https://lore.kernel.org/linux-btrfs/CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/fs.h    |  1 +
 fs/btrfs/scrub.c | 66 +++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 58 insertions(+), 9 deletions(-)
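To make the deadlock argument in the [FIX] section concrete, here is a
simplified sketch of the pattern the dedicated queue avoids; the function
scrub_stripe_worker() is illustrative rather than actual scrub code, though
the field names follow the patch:

/*
 * If a stripe work item, itself running on scrub_workers, queued its
 * per-sector csum work back onto scrub_workers and then blocked
 * waiting, every available worker could end up blocked in the wait
 * while the csum items they depend on sit unrunnable behind them.
 */
static void scrub_stripe_worker(struct work_struct *work)
{
	struct scrub_stripe *stripe =
		container_of(work, struct scrub_stripe, work);
	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;

	for (int i = 0; i < stripe->nr_sectors; i++) {
		atomic_inc(&stripe->pending_io);
		/*
		 * Queuing to fs_info->scrub_workers here instead could
		 * deadlock once all its workers are waiting below.
		 */
		queue_work(fs_info->scrub_csum_workers,
			   &stripe->sectors[i].work);
	}

	/* Safe only because the csum work runs on its own queue. */
	wait_scrub_stripe_io(stripe);
}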