Message ID: cover.1690542141.git.wqu@suse.com (mailing list archive)
Series: btrfs: scrub: improve the scrub performance
Qu Wenruo - 28.07.23, 13:14:03 CEST:
> The first 3 patches would greatly improve the scrub read performance,
> but unfortunately it's still not as fast as the pre-6.4 kernels
> (2.2GiB/s vs 3.0GiB/s), but still much better than the 6.4 kernels
> (2.2GiB/s vs 1.0GiB/s).

Thanks for the patch set.

What is the reason for not going back to the performance of the pre-6.4
kernel? Isn't it possible with the new scrubbing method? In that case,
what improvements does the new scrubbing code have that warrant a lower
performance?

Just like to understand the background of this a bit more. I do not mind
a bit lower performance too much, especially in case it is outweighed by
other benefits.
On Fri, Jul 28, 2023 at 02:38:35PM +0200, Martin Steigerwald wrote:
> Qu Wenruo - 28.07.23, 13:14:03 CEST:
> > The first 3 patches would greatly improve the scrub read performance,
> > but unfortunately it's still not as fast as the pre-6.4 kernels
> > (2.2GiB/s vs 3.0GiB/s), but still much better than the 6.4 kernels
> > (2.2GiB/s vs 1.0GiB/s).
>
> Thanks for the patch set.
>
> What is the reason for not going back to the performance of the pre-6.4
> kernel? Isn't it possible with the new scrubbing method? In that case,
> what improvements does the new scrubbing code have that warrant a lower
> performance?

Lower performance was not expected and needs to be brought back. A minor
decrease would be tolerable, but that's something around 5%, not 60%.

> Just like to understand the background of this a bit more. I do not mind
> a bit lower performance too much, especially in case it is outweighed by
> other benefits.

The code in scrub dates from the 3.0 era, and new features have been
implemented since then. Extending the code became hard over time, so a
bigger update was done restructuring how the IO is done.
David Sterba - 28.07.23, 18:50:37 CEST:
> On Fri, Jul 28, 2023 at 02:38:35PM +0200, Martin Steigerwald wrote:
> > Qu Wenruo - 28.07.23, 13:14:03 CEST:
> > > The first 3 patches would greatly improve the scrub read
> > > performance, but unfortunately it's still not as fast as the
> > > pre-6.4 kernels (2.2GiB/s vs 3.0GiB/s), but still much better
> > > than the 6.4 kernels (2.2GiB/s vs 1.0GiB/s).
> >
> > Thanks for the patch set.
> >
> > What is the reason for not going back to the performance of the
> > pre-6.4 kernel? Isn't it possible with the new scrubbing method? In
> > that case, what improvements does the new scrubbing code have that
> > warrant a lower performance?
>
> Lower performance was not expected and needs to be brought back. A
> minor decrease would be tolerable but that's something around 5%, not
> 60%.

Okay. Best of success with improving performance again.

> > Just like to understand the background of this a bit more. I do not
> > mind a bit lower performance too much, especially in case it is
> > outweighed by other benefits.
>
> The code in scrub was from 3.0 times and since then new features have
> been implemented, extending the code became hard over time so a bigger
> update was done restructuring how the IO is done.

Okay, thanks for explaining.
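[Editor's note: for context, a quick back-of-the-envelope check of the throughput figures quoted in this thread. The ~60% regression David mentions corresponds to the unpatched 6.4 numbers; the patched series recovers most, but not all, of it:]

```python
# Throughput figures quoted in this thread (GiB/s).
pre_64 = 3.0   # pre-6.4 kernels
v64 = 1.0      # 6.4 kernels, before this series
patched = 2.2  # 6.4 with the first 3 patches applied

def drop(old: float, new: float) -> float:
    """Relative throughput drop, as a percentage."""
    return (old - new) / old * 100

print(f"6.4 vs pre-6.4:     {drop(pre_64, v64):.0f}% drop")      # ~67%
print(f"patched vs pre-6.4: {drop(pre_64, patched):.0f}% drop")  # ~27%
```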
Hello, I did some testing with 4 x 320GB HDDs: metadata RAID1C4 and data
RAID5.

Kernel 6.3.12:

btrfs scrub start -B /dev/sdb

scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 04:00:39 2023
Status:           finished
Duration:         2:37:35
Total to scrub:   149.58GiB
Rate:             16.20MiB/s
Error summary:    no errors found

Kernel 6.5.0-rc3:

btrfs scrub start -B /dev/sdb

scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 08:41:12 2023
Status:           finished
Duration:         1:31:03
Total to scrub:   299.16GiB
Rate:             56.08MiB/s
Error summary:    no errors found

So much better speed, but the "Total to scrub" reporting seems strange.

df -h /dev/sdb
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        1,2T  599G  292G  68% /mnt

Looks like the old kernel did about 1/4 of the total data, which seems
right because I have 4 drives. The new one did about 1/2 of the total
data, which seems wrong.

And if I do a scrub against the mount point:

btrfs scrub start -B /mnt/
scrub done for 6691cb53-271b-4abd-b2ab-143c41027924
Scrub started:    Tue Aug  1 11:03:56 2023
Status:           finished
Duration:         10:02:44
Total to scrub:   1.17TiB
Rate:             33.89MiB/s
Error summary:    no errors found

Then performance goes down the toilet, and now the "Total to scrub"
reporting is about 2/1.

btrfs version
btrfs-progs v6.3.3

Is it a btrfs-progs issue with the reporting? What about RAID5 scrub
performance, why is it so bad?

About the disks: they are old WD Blue drives that can do about 100MB/s
read/write on average.

On 28/07/2023 14.14, Qu Wenruo wrote:
> [REPO]
> https://github.com/adam900710/linux/tree/scrub_testing
>
> [CHANGELOG]
> v1:
> - Rebased to latest misc-next
>
> - Rework the read IO grouping patch
>   David has found some crashes mostly related to the scrub performance
>   fixes; meanwhile the original grouping patch has one extra flag,
>   SCRUB_FLAG_READ_SUBMITTED, to avoid double submitting.
>
>   But this flag can be avoided, as we can easily avoid double
>   submitting just by properly checking the sctx->nr_stripe variable.
>
>   This reworked grouping read IO patch should be safer compared to the
>   initial version, with better code structure.
>
>   Unfortunately, the final performance is worse than the initial
>   version (2.2GiB/s vs 2.5GiB/s), but it should be less racy and thus
>   safer.
>
> - Re-order the patches
>   The first 3 patches are the main fixes, and I put safer patches
>   first, so even if David still finds a crash at a certain patch, the
>   remaining ones can be dropped if needed.
>
> There is a huge scrub performance drop introduced by the v6.4 kernel:
> scrub performance is only around 1/3 for large data extents.
>
> There are several causes:
>
> - Missing blk plug
>   This means read requests won't be merged by the block layer, which
>   can hugely reduce the read performance.
>
> - Extra time spent on extent/csum tree search
>   This includes extra path allocation/freeing and tree searches.
>   It is especially obvious for large data extents, as previously we
>   only did one csum search per 512K, but now we do one csum search per
>   64K, an 8 times increase in csum tree searches.
>
> - Less concurrency
>   Mostly due to the fact that we're doing submit-and-wait, thus a much
>   lower queue depth, affecting things like NVMe which benefit a lot
>   from high concurrency.
>
> The first 3 patches greatly improve the scrub read performance, but
> unfortunately it's still not as fast as the pre-6.4 kernels
> (2.2GiB/s vs 3.0GiB/s), though still much better than the 6.4 kernels
> (2.2GiB/s vs 1.0GiB/s).
>
> Qu Wenruo (5):
>   btrfs: scrub: avoid unnecessary extent tree search preparing stripes
>   btrfs: scrub: avoid unnecessary csum tree search preparing stripes
>   btrfs: scrub: fix grouping of read IO
>   btrfs: scrub: don't go ordered workqueue for dev-replace
>   btrfs: scrub: move write back of repaired sectors into
>     scrub_stripe_read_repair_worker()
>
>  fs/btrfs/file-item.c |  33 +++---
>  fs/btrfs/file-item.h |   6 +-
>  fs/btrfs/raid56.c    |   4 +-
>  fs/btrfs/scrub.c     | 234 ++++++++++++++++++++++++++-----------------
>  4 files changed, 169 insertions(+), 108 deletions(-)
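[Editor's note: the "8 times increase in csum tree search" from the quoted cover letter can be quantified with simple arithmetic — lookups needed per large data extent at the old 512K versus the new 64K granularity:]

```python
def csum_searches(extent_bytes: int, granularity: int) -> int:
    """One csum tree search per `granularity` bytes of data
    (ceiling division, so a partial chunk still costs a search)."""
    return -(-extent_bytes // granularity)

extent = 128 * 1024 * 1024  # e.g. a 128 MiB data extent

old = csum_searches(extent, 512 * 1024)  # pre-6.4: one per 512K
new = csum_searches(extent, 64 * 1024)   # 6.4: one per 64K
print(old, new, new // old)              # 256 2048 8
```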
On 2023/8/2 04:14, Jani Partanen wrote:
> Hello, I did some testing with 4 x 320GB HDDs: metadata RAID1C4 and
> data RAID5.

RAID5 has other problems related to scrub performance, unfortunately.

> Kernel 6.3.12:
>
> btrfs scrub start -B /dev/sdb
> [...]
> Total to scrub:   149.58GiB
> Rate:             16.20MiB/s
>
> Kernel 6.5.0-rc3:
>
> btrfs scrub start -B /dev/sdb
> [...]
> Total to scrub:   299.16GiB
> Rate:             56.08MiB/s
>
> So much better speed, but the "Total to scrub" reporting seems strange.
>
> df -h /dev/sdb
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        1,2T  599G  292G  68% /mnt
>
> Looks like the old kernel did about 1/4 of the total data, which seems
> right because I have 4 drives. The new one did about 1/2 of the total
> data, which seems wrong.

I checked the kernel part of the progress reporting: for a single-device
scrub on RAID56, a data stripe contributes to the scrubbed bytes, but a
P/Q stripe should not contribute to the value. Thus 1/4 should be the
correct value.

However, there is another factor in btrfs-progs, which determines how to
report the numbers. There is a fix for it already merged in v6.3.2, but
it seems to have other problems involved.

> And if I do a scrub against the mount point:
>
> btrfs scrub start -B /mnt/
> [...]
> Total to scrub:   1.17TiB
> Rate:             33.89MiB/s
>
> Then performance goes down the toilet, and now the "Total to scrub"
> reporting is about 2/1.
>
> btrfs version
> btrfs-progs v6.3.3
>
> Is it a btrfs-progs issue with the reporting?

Can you try with the -BdR option?

It shows the raw numbers, which is the easiest way to determine whether
it's a bug in btrfs-progs or in the kernel.

> What about RAID5 scrub performance, why is it so bad?

It's explained in this cover letter:
https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/

In short, RAID56 full-fs scrub causes too many duplicated reads, and the
root cause is that per-device scrub is never a good idea for RAID56.

That's why I'm trying to introduce the new scrub flag for that.

Thanks,
Qu

> About the disks: they are old WD Blue drives that can do about 100MB/s
> read/write on average.
>
> On 28/07/2023 14.14, Qu Wenruo wrote:
>> [REPO]
>> https://github.com/adam900710/linux/tree/scrub_testing
>> [...]
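[Editor's note: a rough model of why per-device scrub hurts RAID56 — a sketch of the duplicated-reads argument from the linked cover letter, not the kernel's actual IO accounting. Verifying one device's parity stripes requires re-reading the matching data stripes on the peer devices, so running N independent per-device scrubs reads the same data repeatedly:]

```python
def full_fs_reads(num_devs: int, chunk_per_dev: float) -> float:
    """Total bytes read when scrubbing each RAID5 device independently.

    Sketch model: one device's scrub reads its own stripes
    (chunk_per_dev); its parity stripes (1/num_devs of the chunk)
    additionally require the matching data stripes from the other
    num_devs - 1 devices.
    """
    parity = chunk_per_dev / num_devs
    return num_devs * (chunk_per_dev + parity * (num_devs - 1))

ideal = 4 * 209.0                 # read every stripe exactly once (GiB)
per_dev = full_fs_reads(4, 209.0)
print(per_dev / ideal)            # ~1.75x read amplification here
```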
On 02/08/2023 1.06, Qu Wenruo wrote:
>
> Can you try with the -BdR option?
>
> It shows the raw numbers, which is the easiest way to determine whether
> it's a bug in btrfs-progs or in the kernel.

Here is the single-device result:

btrfs scrub start -BdR /dev/sdb

Scrub device /dev/sdb (id 1) done
Scrub started:    Wed Aug  2 01:33:21 2023
Status:           finished
Duration:         0:44:29
    data_extents_scrubbed: 4902956
    tree_extents_scrubbed: 60494
    data_bytes_scrubbed: 321301020672
    tree_bytes_scrubbed: 991133696
    read_errors: 0
    csum_errors: 0
    verify_errors: 0
    no_csum: 22015840
    csum_discards: 0
    super_errors: 0
    malloc_errors: 0
    uncorrectable_errors: 0
    unverified_errors: 0
    corrected_errors: 0
    last_physical: 256679870464

I'll do it against the mountpoint when I go to sleep, because it's gonna
take long.

>> What about RAID5 scrub performance, why is it so bad?
>
> It's explained in this cover letter:
> https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/
>
> In short, RAID56 full-fs scrub causes too many duplicated reads, and
> the root cause is that per-device scrub is never a good idea for
> RAID56.
>
> That's why I'm trying to introduce the new scrub flag for that.

Ah, so there is a different patchset for RAID5 scrub, good to know. I'm
gonna build that branch and test it. Also let me know if I could help
somehow with that stress testing. These drives are dedicated for
testing. I am running a VM under Hyper-V and the disks are passed
through directly to the VM.
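[Editor's note: converting the raw counters above to GiB shows the ~2x figure comes from the kernel counters themselves, not from btrfs-progs formatting — data_bytes_scrubbed already matches the ~299GiB "Total to scrub" reported earlier:]

```python
GiB = 1024**3

# Raw counters from `btrfs scrub start -BdR /dev/sdb` above.
data_bytes_scrubbed = 321_301_020_672
tree_bytes_scrubbed = 991_133_696

print(data_bytes_scrubbed / GiB)  # ~299.2 GiB -- roughly twice the
                                  # ~150 GiB one would expect per device
print(tree_bytes_scrubbed / GiB)  # ~0.92 GiB of metadata
```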
On 2023/8/2 07:48, Jani Partanen wrote:
>
> On 02/08/2023 1.06, Qu Wenruo wrote:
>>
>> Can you try with the -BdR option?
>>
>> It shows the raw numbers, which is the easiest way to determine
>> whether it's a bug in btrfs-progs or in the kernel.
>
> Here is the single-device result:
>
> btrfs scrub start -BdR /dev/sdb
>
> Scrub device /dev/sdb (id 1) done
> Scrub started:    Wed Aug  2 01:33:21 2023
> Status:           finished
> Duration:         0:44:29
>     data_extents_scrubbed: 4902956
>     tree_extents_scrubbed: 60494
>     data_bytes_scrubbed: 321301020672

So the btrfs scrub report is doing the correct report using the values
from the kernel.

And considering the used space is around 600G, divided by 4 disks (aka
3 data stripes + 1 parity stripe), it's not that weird, as we would get
around 200G per device (parity doesn't contribute to the scrubbed
bytes).

Especially considering your metadata is RAID1C4, we should have a bit
more than 200G. Instead it's the old report of less than 200G that
doesn't seem correct.

Mind providing the output of "btrfs fi usage <mnt>" to verify my
assumption?

>     tree_bytes_scrubbed: 991133696
>     [...]
>     last_physical: 256679870464
>
> I'll do it against the mountpoint when I go to sleep, because it's
> gonna take long.
>
>>> What about RAID5 scrub performance, why is it so bad?
>>
>> It's explained in this cover letter:
>> https://lore.kernel.org/linux-btrfs/cover.1688368617.git.wqu@suse.com/
>>
>> In short, RAID56 full-fs scrub causes too many duplicated reads, and
>> the root cause is that per-device scrub is never a good idea for
>> RAID56.
>>
>> That's why I'm trying to introduce the new scrub flag for that.
>
> Ah, so there is a different patchset for RAID5 scrub, good to know. I'm
> gonna build that branch and test it.

Although it's not recommended to test it for now, as we're still
handling the performance drop, so the patchset may not apply cleanly.

> Also let me know if I could help somehow with that stress testing.
> These drives are dedicated for testing. I am running a VM under Hyper-V
> and the disks are passed through directly to the VM.

Sure, I'll CC you when refreshing the patchset; extra tests are always
appreciated.

Thanks,
Qu
On 02/08/2023 4.56, Qu Wenruo wrote:
>
> So the btrfs scrub report is doing the correct report using the values
> from the kernel.
>
> And considering the used space is around 600G, divided by 4 disks (aka
> 3 data stripes + 1 parity stripe), it's not that weird, as we would get
> around 200G per device (parity doesn't contribute to the scrubbed
> bytes).
>
> Especially considering your metadata is RAID1C4, we should have a bit
> more than 200G. Instead it's the old report of less than 200G that
> doesn't seem correct.
>
> Mind providing the output of "btrfs fi usage <mnt>" to verify my
> assumption?

btrfs fi usage /mnt/
Overall:
    Device size:                   1.16TiB
    Device allocated:            844.25GiB
    Device unallocated:          348.11GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                        799.86GiB
    Free (estimated):            289.58GiB   (min: 115.52GiB)
    Free (statfs, df):           289.55GiB
    Data ratio:                       1.33
    Metadata ratio:                   4.00
    Global reserve:              471.80MiB   (used: 0.00B)
    Multiple profiles:                  no

Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
   /dev/sdb      209.00GiB
   /dev/sdc      209.00GiB
   /dev/sdd      209.00GiB
   /dev/sde      209.00GiB

Metadata,RAID1C4: Size:2.00GiB, Used:472.56MiB (23.07%)
   /dev/sdb        2.00GiB
   /dev/sdc        2.00GiB
   /dev/sdd        2.00GiB
   /dev/sde        2.00GiB

System,RAID1C4: Size:64.00MiB, Used:64.00KiB (0.10%)
   /dev/sdb       64.00MiB
   /dev/sdc       64.00MiB
   /dev/sdd       64.00MiB
   /dev/sde       64.00MiB

Unallocated:
   /dev/sdb       87.03GiB
   /dev/sdc       87.03GiB
   /dev/sdd       87.03GiB
   /dev/sde       87.03GiB

There is 1 extra 2GB file now, so that's why it shows a little more
usage now.

> Sure, I'll CC you when refreshing the patchset; extra tests are always
> appreciated.

Sounds good, thanks!
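[Editor's note: a sanity check of the per-device expectation against this `fi usage` output, assuming parity stripes don't count toward scrubbed bytes — not the kernel's exact accounting. Each device carries 209GiB of RAID5 chunks, 3/4 of which is data:]

```python
GiB = 1024**3

per_dev_alloc = 209.0             # GiB of RAID5 chunks on /dev/sdb
used_fraction = 598.51 / 627.00   # how full the data chunks are
data_fraction = 3 / 4             # 3 data stripes + 1 parity per 4

expected = per_dev_alloc * used_fraction * data_fraction
print(expected)                   # ~149.6 GiB expected per device

reported = 321_301_020_672 / GiB  # data_bytes_scrubbed from -BdR
print(reported / expected)        # ~2.0: twice the expected value
```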
On 2023/8/2 10:15, Jani Partanen wrote:
>
> On 02/08/2023 4.56, Qu Wenruo wrote:
>> Mind providing the output of "btrfs fi usage <mnt>" to verify my
>> assumption?
>
> btrfs fi usage /mnt/
> Overall:
>     Device size:                   1.16TiB
>     Device allocated:            844.25GiB
>     [...]
>
> Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
>    /dev/sdb      209.00GiB
>    /dev/sdc      209.00GiB
>    /dev/sdd      209.00GiB
>    /dev/sde      209.00GiB

OK, my previous calculation was incorrect...

For each device there should be 209GiB used by RAID5 chunks, and only
3/4 of that contributes to the scrubbed data bytes.

Thus there seems to be some double accounting. Definitely needs extra
digging for this situation.

Thanks,
Qu

> Metadata,RAID1C4: Size:2.00GiB, Used:472.56MiB (23.07%)
> [...]
>
> Unallocated:
>    /dev/sdb       87.03GiB
>    /dev/sdc       87.03GiB
>    /dev/sdd       87.03GiB
>    /dev/sde       87.03GiB
>
> There is 1 extra 2GB file now, so that's why it shows a little more
> usage now.
>
>> Sure, I'll CC you when refreshing the patchset; extra tests are always
>> appreciated.
>
> Sounds good, thanks!
On 2023/8/2 10:20, Qu Wenruo wrote:
>
> On 2023/8/2 10:15, Jani Partanen wrote:
>> btrfs fi usage /mnt/
>> [...]
>>
>> Data,RAID5: Size:627.00GiB, Used:598.51GiB (95.46%)
>>    /dev/sdb      209.00GiB
>>    /dev/sdc      209.00GiB
>>    /dev/sdd      209.00GiB
>>    /dev/sde      209.00GiB
>
> OK, my previous calculation was incorrect...
>
> For each device there should be 209GiB used by RAID5 chunks, and only
> 3/4 of that contributes to the scrubbed data bytes.
>
> Thus there seems to be some double accounting. Definitely needs extra
> digging for this situation.

Well, this turns out to be something related to the patchset. If you
don't apply the patchset, the reporting is correct.

The problem is in the last patch, which calls
scrub_stripe_report_errors() twice, thus double accounting the values.

I'll fix it soon. Thanks for spotting this one!

Qu
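[Editor's note: the general shape of the bug described above — a per-stripe reporting helper invoked from two code paths, adding its byte counts into the global stats twice — and one common guard against it, can be sketched as follows. This is hypothetical Python, not the actual fs/btrfs/scrub.c code; the `Stripe`/`ScrubCtx` names are invented:]

```python
class Stripe:
    def __init__(self, data_bytes: int):
        self.data_bytes = data_bytes
        self.reported = False  # guard flag against double reporting

class ScrubCtx:
    def __init__(self):
        self.data_bytes_scrubbed = 0

    def report_stripe(self, stripe: Stripe) -> None:
        # Report each stripe exactly once, no matter how many code
        # paths call this; without the guard, a second caller would
        # add the same bytes again (the bug class seen above).
        if stripe.reported:
            return
        stripe.reported = True
        self.data_bytes_scrubbed += stripe.data_bytes

ctx = ScrubCtx()
s = Stripe(64 * 1024)
ctx.report_stripe(s)
ctx.report_stripe(s)            # second call from another path: no-op
print(ctx.data_bytes_scrubbed)  # 65536, not 131072
```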