
btrfs: scrub: speed up on NVME devices

Message ID ef3951fa130f9b61fe097e8d5f6e425525165a28.1689330324.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: scrub: speed up on NVME devices

Commit Message

Qu Wenruo July 14, 2023, 10:26 a.m. UTC
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.

On a PCIe 3.0 device, the old v6.3 kernel can sustain a 3GB/s scrub speed,
but v6.4 can only reach 1GB/s, an obvious 66% performance drop.

[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:

Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79   1.60 100.00

The upper one is v6.3 while the lower one is v6.4.

There are several obvious differences:

- Very few read merges
  This turns out to be a behavior change: we no longer do bio
  plug/unplug around the read submissions.

- Very low aqu-sz
  This is due to the submit-and-wait behavior of flush_scrub_stripes().

Neither behavior is that obvious on SATA SSDs, as SATA SSDs have NCQ to
merge the reads, and they cannot handle a high queue depth well anyway.

[FIX]
For now this patch focuses on fixing the read speed. Dev-replace
speed needs extra work.

For the read part, we take two directions to fix the problems:

- Re-introduce blk plug/unplug to merge read requests
  This is pretty simple, and the behavior is pretty easy to observe.

  This would enlarge the average read request size to 512K.

- Introduce multi-group reads and no longer wait on each group
  Instead of the old behavior, which submits 8 stripes and waits for
  them, enlarge the total stripe count to 16 * 8.
  That is 8M per device, the same limit as the old scrub in-flight
  bios size limit.

  Now every time we fill a group (8 stripes), we submit it and
  continue to the next stripes.

  Only when the full 16 * 8 stripes are all filled do we submit the
  remaining ones (the last group) and wait for all groups to finish.
  Then we submit the repair writes and dev-replace writes.

  This should enlarge the queue depth (see the sketch below).
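
A minimal sketch of the grouped submission (simplified from the
queue_scrub_stripe() hunk in the patch below; the helper name is made up,
and throttling and error handling are omitted):

static void scrub_submit_one_group(struct scrub_ctx *sctx)
{
        struct blk_plug plug;
        int first = sctx->cur_stripe - SCRUB_STRIPES_PER_GROUP;

        /* Plug so the 8 * 64K stripe reads can be merged into larger requests. */
        blk_start_plug(&plug);
        for (int i = first; i < sctx->cur_stripe; i++)
                scrub_submit_initial_read(sctx, &sctx->stripes[i]);
        blk_finish_plug(&plug);
}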

Even with all these optimizations, unfortunately we can only improve the
scrub performance to around 1.9GiB/s, as the queue depth is still very
low.

Now the new iostat results look like this:

Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
nvme0n1p3  4030.00 1995904.00 27257.00  87.12    0.37   495.26   1.50 100.00

It still shows a very low queue depth.

The current bottleneck seems to be in flush_scrub_stripes(), which
still does submit-and-wait for read-repair and dev-replace
synchronization.

To fully regain the performance, we need to get rid of the
submit-and-wait and move to a workqueue solution to fully utilize the
high queue depth capability of NVMe devices.

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
For the proper fix, I'm afraid we have to utilize the btrfs workqueue:
let the initial read and repair be done in an async manner, and let the
writeback (repair and dev-replace) happen in a synchronized manner.

This would allow us to have a very high queue depth, to reclaim the
remaining 1GiB/s of performance.

But I'm also not sure if we should go that far, as we may still have
SATA SSDs/HDDs, which won't benefit at all from a high queue depth.

The only good news is that this patch is small enough for backporting,
without huge structural changes.
---
 fs/btrfs/scrub.c | 76 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 17 deletions(-)

Comments

Linux regression tracking (Thorsten Leemhuis) July 14, 2023, 12:24 p.m. UTC | #1
Hi, Thorsten here, the Linux kernel's regression tracker.

On 14.07.23 12:26, Qu Wenruo wrote:
> [REGRESSION]
> There are several regression reports about the scrub performance with
> v6.4 kernel.

Many thx for taking care of this. Could you please consider adding Link:
or Closes: tags to the reports, as explained in
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst?

For more details why I ask for this, see here:
https://linux-regtracking.leemhuis.info/about/#linktag

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

#regzbot ^backmonitor:
https://lore.kernel.org/linux-btrfs/CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
David Sterba July 18, 2023, 12:59 a.m. UTC | #2
On Fri, Jul 14, 2023 at 06:26:40PM +0800, Qu Wenruo wrote:
> [REGRESSION]
> There are several regression reports about the scrub performance with
> v6.4 kernel.
> 
> On a PCIE3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
> v6.4 can only go 1GB/s, an obvious 66% performance drop.
> 
> [CAUSE]
> Iostat shows a very different behavior between v6.3 and v6.4 kernel:
> 
> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
> nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
> nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79   1.60 100.00
> 
> The upper one is v6.3 while the lower one is v6.4.
> 
> There are several obvious differences:
> 
> - Very few read merges
>   This turns out to be a behavior change that we no longer go bio
>   plug/unplug.
> 
> - Very low aqu-sz
>   This is due to the submit-and-wait behavior of flush_scrub_stripes().
> 
> Both behavior is not that obvious on SATA SSDs, as SATA SSDs has NCQ to
> merge the reads, while SATA SSDs can not handle high queue depth well
> either.
> 
> [FIX]
> For now this patch focus on the read speed fix. Dev-replace replace
> speed needs extra work.
> 
> For the read part, we go two directions to fix the problems:
> 
> - Re-introduce blk plug/unplug to merge read requests
>   This is pretty simple, and the behavior is pretty easy to observe.
> 
>   This would enlarge the average read request size to 512K.
> 
> - Introduce multi-group reads and no longer waits for each group
>   Instead of the old behavior, which submit 8 stripes and wait for
>   them, here we would enlarge the total stripes to 16 * 8.
>   Which is 8M per device, the same limits as the old scrub flying
>   bios size limit.
> 
>   Now every time we fill a group (8 stripes), we submit them and
>   continue to next stripes.

Have you tried other values for the stripe count to see if this makes
any difference? Getting back to the previous value (8M) is closer to
the amount of data sent at once, but the code around it has changed, so
it's possible that increasing it further could also improve things (or not).

>   Only when the full 16 * 8 stripes are all filled, we submit the
>   remaining ones (the last group), and wait for all groups to finish.
>   Then submit the repair writes and dev-replace writes.
> 
>   This should enlarge the queue depth.
> 
> Even with all these optimization, unfortunately we can only improve the
> scrub performance to around 1.9GiB/s, as the queue depth is still very
> low.

So the request grouping gained about 700MB/s; I was expecting more, but
it's still a noticeable improvement.

> Now the new iostat results looks like this:
> 
> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
> nvme0n1p3  4030.00 1995904.00 27257.00  87.12    0.37   495.26   1.50 100.00
> 
> Which still have a very low queue depth.
> 
> The current bottleneck seems to be in flush_scrub_stripes(), which is
> still doing submit-and-wait, for read-repair and dev-replace
> synchronization.
> 
> To fully re-gain the performance, we need to get rid of the
> submit-and-wait, and go workqueue solution to fully utilize the
> high queue depth capability of NVME devices.
> 
> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> For the proper fix, I'm afraid we have to utilize btrfs workqueue, let
> the initial read and repair done in an async manner, and let the
> writeback (repaired and dev-replace) happen in a synchronized manner.
> 
> This can allow us to have a very high queue depth, to claim the
> remaining 1GiB/s performance.
> 
> But I'm also not sure if we should go that hard, as we may still have
> SATA SSD/HDDs, which won't benefit at all from high queue depth.

I don't understand what you mean here; there are all sorts of devices
that people use and will use. A lower queue depth will lead to lower
throughput, which would be expected; high speed devices should utilize
the full bandwidth if we have the infrastructure for that.

To some level we can affect the IO, but it's still up to the block
layer. We've removed code that did its own priority, batching and
similar things, which is IMHO good, and we can rely on the block layer
capabilities. So I'm assuming the fixes should go more in that
direction. If this means using a work queue then sure.

> The only good news is, this patch is small enough for backport, without
> huge structure changes.

Yeah that's useful, thanks. Patch added to misc-next.
Qu Wenruo July 18, 2023, 1:41 a.m. UTC | #3
On 2023/7/18 08:59, David Sterba wrote:
> On Fri, Jul 14, 2023 at 06:26:40PM +0800, Qu Wenruo wrote:
>> [REGRESSION]
>> There are several regression reports about the scrub performance with
>> v6.4 kernel.
>>
>> On a PCIE3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
>> v6.4 can only go 1GB/s, an obvious 66% performance drop.
>>
>> [CAUSE]
>> Iostat shows a very different behavior between v6.3 and v6.4 kernel:
>>
>> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
>> nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
>> nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79   1.60 100.00
>>
>> The upper one is v6.3 while the lower one is v6.4.
>>
>> There are several obvious differences:
>>
>> - Very few read merges
>>    This turns out to be a behavior change that we no longer go bio
>>    plug/unplug.
>>
>> - Very low aqu-sz
>>    This is due to the submit-and-wait behavior of flush_scrub_stripes().
>>
>> Both behavior is not that obvious on SATA SSDs, as SATA SSDs has NCQ to
>> merge the reads, while SATA SSDs can not handle high queue depth well
>> either.
>>
>> [FIX]
>> For now this patch focus on the read speed fix. Dev-replace replace
>> speed needs extra work.
>>
>> For the read part, we go two directions to fix the problems:
>>
>> - Re-introduce blk plug/unplug to merge read requests
>>    This is pretty simple, and the behavior is pretty easy to observe.
>>
>>    This would enlarge the average read request size to 512K.
>>
>> - Introduce multi-group reads and no longer waits for each group
>>    Instead of the old behavior, which submit 8 stripes and wait for
>>    them, here we would enlarge the total stripes to 16 * 8.
>>    Which is 8M per device, the same limits as the old scrub flying
>>    bios size limit.
>>
>>    Now every time we fill a group (8 stripes), we submit them and
>>    continue to next stripes.
>
> Have you tried other values for the stripe count if this makes any
> difference? Getting back to the previous value (8M) is closer to the
> amount of data sent at once but the code around has chnaged, so it's
> possible that increasing that could also improve it (or not).

I tried smaller ones, like group number 8 (4M), with the plugging added
back; it's not that good, around 1.6GB/s ~ 1.7GB/s.
And a larger number, 32 (16M), doesn't increase much over the 16 (8M) value.

The bottleneck on the queue depth seems to be the wait.

Especially for zoned writeback, we need to ensure the writes into the
dev-replace target device are ordered.

This patch chooses to wait for all existing stripes to finish their reads,
then submit all the writes, to ensure the write sequence.

But this is still a read-and-wait behavior, thus leading to the low queue
depth.


>
>>    Only when the full 16 * 8 stripes are all filled, we submit the
>>    remaining ones (the last group), and wait for all groups to finish.
>>    Then submit the repair writes and dev-replace writes.
>>
>>    This should enlarge the queue depth.
>>
>> Even with all these optimization, unfortunately we can only improve the
>> scrub performance to around 1.9GiB/s, as the queue depth is still very
>> low.
>
> So the request grouping gained about 700MB/s, I was expecting more but
> it's a noticeable improvement still.

I think the queue depth is the final missing piece, but that's not
something we can easily fix, especially taking zoned support into
consideration.

>
>> Now the new iostat results looks like this:
>>
>> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
>> nvme0n1p3  4030.00 1995904.00 27257.00  87.12    0.37   495.26   1.50 100.00
>>
>> Which still have a very low queue depth.
>>
>> The current bottleneck seems to be in flush_scrub_stripes(), which is
>> still doing submit-and-wait, for read-repair and dev-replace
>> synchronization.
>>
>> To fully re-gain the performance, we need to get rid of the
>> submit-and-wait, and go workqueue solution to fully utilize the
>> high queue depth capability of NVME devices.
>>
>> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> For the proper fix, I'm afraid we have to utilize btrfs workqueue, let
>> the initial read and repair done in an async manner, and let the
>> writeback (repaired and dev-replace) happen in a synchronized manner.
>>
>> This can allow us to have a very high queue depth, to claim the
>> remaining 1GiB/s performance.
>>
>> But I'm also not sure if we should go that hard, as we may still have
>> SATA SSD/HDDs, which won't benefit at all from high queue depth.
>
> I don't understand what you mean here, there are all sorts of devices
> that people use and will use. Lower queue depth will lead to lower
> throughput which would be expected, high speed devices should utilize
> the full bandwidth if we have infrastructure for that.
>
> To some level we can affect the IO but it's still up to block layer.
> We've removed code that did own priority, batching and similar things,
> which is IMHO good and we can rely on block layer capabilities. So I'm
> assuming the fixes should go more in that direction. If this means to
> use a work queue then sure.

OK, I'll go with the workqueue solution, making sctx->stripes[] behave
like a ring buffer.

So we will still batch all the reads of the same group, then let the
repair part happen in the async part of the btrfs workqueue, and the
write to the dev-replace target happen in the ordered part.

Hopefully by this we can enlarge the queue depth and gain back the
performance.
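
A rough sketch of that split, with made-up structure and helper names, and
using plain kernel workqueues in place of the btrfs workqueue for
illustration: verify/repair runs with full concurrency, while the writes are
queued in logical order on an ordered workqueue so zoned dev-replace stays
sequential.

struct stripe_work {
        struct work_struct verify_work; /* unordered: verify + repair after the read */
        struct work_struct write_work;  /* ordered: repaired / dev-replace writeback */
        struct completion verified;
        struct scrub_stripe *stripe;
};

static struct workqueue_struct *verify_wq;      /* alloc_workqueue(), high concurrency */
static struct workqueue_struct *write_wq;       /* alloc_ordered_workqueue(), one at a time */

static void stripe_verify_fn(struct work_struct *w)
{
        struct stripe_work *sw = container_of(w, struct stripe_work, verify_work);

        scrub_verify_and_repair(sw->stripe);    /* hypothetical helper */
        complete(&sw->verified);
}

static void stripe_write_fn(struct work_struct *w)
{
        struct stripe_work *sw = container_of(w, struct stripe_work, write_work);

        /* The ordered workqueue runs these in queueing (i.e. logical) order. */
        wait_for_completion(&sw->verified);
        scrub_write_stripe(sw->stripe);         /* hypothetical helper */
}

/* Queue both works at submit time; write works are queued in logical order. */
static void queue_stripe_work(struct stripe_work *sw)
{
        init_completion(&sw->verified);
        INIT_WORK(&sw->verify_work, stripe_verify_fn);
        INIT_WORK(&sw->write_work, stripe_write_fn);
        queue_work(verify_wq, &sw->verify_work);
        queue_work(write_wq, &sw->write_work);
}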

Thanks,
Qu

>
>> The only good news is, this patch is small enough for backport, without
>> huge structure changes.
>
> Yeah that's useful, thanks. Patch added to misc-next.
Qu Wenruo July 19, 2023, 5:22 a.m. UTC | #4
On 2023/7/18 09:41, Qu Wenruo wrote:
>
>
> On 2023/7/18 08:59, David Sterba wrote:
>> On Fri, Jul 14, 2023 at 06:26:40PM +0800, Qu Wenruo wrote:
>>> [REGRESSION]
>>> There are several regression reports about the scrub performance with
>>> v6.4 kernel.
>>>
>>> On a PCIE3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
>>> v6.4 can only go 1GB/s, an obvious 66% performance drop.
>>>
>>> [CAUSE]
>>> Iostat shows a very different behavior between v6.3 and v6.4 kernel:
>>>
>>> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz
>>> aqu-sz  %util
>>> nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18
>>> 100.00
>>> nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79   1.60
>>> 100.00
>>>
>>> The upper one is v6.3 while the lower one is v6.4.
>>>
>>> There are several obvious differences:
>>>
>>> - Very few read merges
>>>    This turns out to be a behavior change that we no longer go bio
>>>    plug/unplug.
>>>
>>> - Very low aqu-sz
>>>    This is due to the submit-and-wait behavior of flush_scrub_stripes().
>>>
>>> Both behavior is not that obvious on SATA SSDs, as SATA SSDs has NCQ to
>>> merge the reads, while SATA SSDs can not handle high queue depth well
>>> either.
>>>
>>> [FIX]
>>> For now this patch focus on the read speed fix. Dev-replace replace
>>> speed needs extra work.
>>>
>>> For the read part, we go two directions to fix the problems:
>>>
>>> - Re-introduce blk plug/unplug to merge read requests
>>>    This is pretty simple, and the behavior is pretty easy to observe.
>>>
>>>    This would enlarge the average read request size to 512K.
>>>
>>> - Introduce multi-group reads and no longer waits for each group
>>>    Instead of the old behavior, which submit 8 stripes and wait for
>>>    them, here we would enlarge the total stripes to 16 * 8.
>>>    Which is 8M per device, the same limits as the old scrub flying
>>>    bios size limit.
>>>
>>>    Now every time we fill a group (8 stripes), we submit them and
>>>    continue to next stripes.
>>
>> Have you tried other values for the stripe count if this makes any
>> difference? Getting back to the previous value (8M) is closer to the
>> amount of data sent at once but the code around has chnaged, so it's
>> possible that increasing that could also improve it (or not).
>
> I tried smaller ones, like grounp number 8 (4M), with the plugging added
> back, it's not that good, around 1.6GB/s ~ 1.7GB/s.
> And larger number 32 (16M) doesn't increase much over the 16 (8M) value.
>
> The bottleneck on the queue depth seems to be the wait.
>
> Especially for zoned write back, we need to ensure the write into
> dev-replace target device ordered.
>
> This patch choose to wait for all existing stripes to finish their read,
> then submit all the writes, to ensure the write sequence.
>
> But this is still a read-and-wait behavior, thus leads to the low queue
> depth.

I tried making sctx->stripes[] a ring buffer, so that we only wait for
the current slot if it's still under scrub, with no more dedicated waits
at the last slot.

The branch can be found at my github repo:
https://github.com/adam900710/linux/tree/scrub_testing

But unfortunately this brings no change to scrub performance at all, and
the average queue depth is still pretty shallow (slightly better, it can
reach 1.8~2.0).

If needed, I can send out the patchset (still pretty small, +249/-246)
for review, but I don't think it's worth it as there is no improvement
at all...

For now, I'm afraid the latency introduced by plugging is already enough
to reduce the throughput, thus we have to go back to the original
behavior and manually merge the read requests, rather than relying on
the block layer plug...

Thanks,
Qu
Martin Steigerwald July 19, 2023, 7:01 a.m. UTC | #5
Hi Qu.

I bet this patch is intended to go into 6.4.

Qu Wenruo - 14.07.23, 12:26:40 CEST:
> For the proper fix, I'm afraid we have to utilize btrfs workqueue, let
> the initial read and repair done in an async manner, and let the
> writeback (repaired and dev-replace) happen in a synchronized manner.
> 
> This can allow us to have a very high queue depth, to claim the
> remaining 1GiB/s performance.
> 
> But I'm also not sure if we should go that hard, as we may still have
> SATA SSD/HDDs, which won't benefit at all from high queue depth.

Could workqueue be tuned to use a lower queue depth for SATA SSD/HDDs?

Best,
Martin Steigerwald July 19, 2023, 7:27 a.m. UTC | #6
Hi Qu, hi Jens.

@Qu: I wonder whether Jens, as the maintainer of the block layer, can shed
some light on the request-merging and plugging-latency aspects of this bug.

@Jens: There has been a regression of scrubbing speeds in kernel 6.4 
mainly for NVME SSDs with a drop of speed from above 2 GiB/s to 
sometimes even lower than 1 GiB/s. I bet Qu can explain better to you 
what is actually causing this. For reference here the initial thread:

Scrub of my nvme SSD has slowed by about 2/3
https://lore.kernel.org/linux-btrfs/CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/

And here is another attempt by Qu to fix this, which did not work out as
well as he hoped:

btrfs: scrub: make sctx->stripes[] a ring buffer 

https://lore.kernel.org/linux-btrfs/cover.1689744163.git.wqu@suse.com/

Maybe Jens you can share some insight on how to best achieve higher 
queue depth and good merge behavior for NVME SSDs while not making the 
queue depth too high for SATA SSDs/HDDs?

Qu Wenruo - 19.07.23, 07:22:05 CEST:
> On 2023/7/18 09:41, Qu Wenruo wrote:
> > On 2023/7/18 08:59, David Sterba wrote:
> >> On Fri, Jul 14, 2023 at 06:26:40PM +0800, Qu Wenruo wrote:
> >>> [REGRESSION]
> >>> There are several regression reports about the scrub performance
> >>> with
> >>> v6.4 kernel.
> >>> 
> >>> On a PCIE3.0 device, the old v6.3 kernel can go 3GB/s scrub speed,
> >>> but v6.4 can only go 1GB/s, an obvious 66% performance drop.
> >>> 
> >>> [CAUSE]
> >>> Iostat shows a very different behavior between v6.3 and v6.4
> >>> kernel:
> >>> 
> >>> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz
> >>> aqu-sz  %util
> >>> nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02 
> >>> 21.18
> >>> 100.00
> >>> nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79  
> >>> 1.60
> >>> 100.00
> >>> 
> >>> The upper one is v6.3 while the lower one is v6.4.
> >>> 
> >>> There are several obvious differences:
> >>> 
> >>> - Very few read merges
> >>>    This turns out to be a behavior change that we no longer go bio
> >>>    plug/unplug.
> >>> 
> >>> - Very low aqu-sz
> >>>    This is due to the submit-and-wait behavior of
> >>> flush_scrub_stripes().
> >>> 
> >>> Both behavior is not that obvious on SATA SSDs, as SATA SSDs has
> >>> NCQ to merge the reads, while SATA SSDs can not handle high queue
> >>> depth well either.
> >>> 
> >>> [FIX]
> >>> For now this patch focus on the read speed fix. Dev-replace
> >>> replace
> >>> speed needs extra work.
> >>> 
> >>> For the read part, we go two directions to fix the problems:
> >>> 
> >>> - Re-introduce blk plug/unplug to merge read requests
> >>>    This is pretty simple, and the behavior is pretty easy to
> >>> observe.
> >>> 
> >>>    This would enlarge the average read request size to 512K.
> >>> 
> >>> - Introduce multi-group reads and no longer waits for each group
> >>>    Instead of the old behavior, which submit 8 stripes and wait
> >>> for
> >>>    them, here we would enlarge the total stripes to 16 * 8.
> >>>    Which is 8M per device, the same limits as the old scrub flying
> >>>    bios size limit.
> >>> 
> >>>    Now every time we fill a group (8 stripes), we submit them and
> >>>    continue to next stripes.
> >> 
> >> Have you tried other values for the stripe count if this makes any
> >> difference? Getting back to the previous value (8M) is closer to
> >> the
> >> amount of data sent at once but the code around has chnaged, so
> >> it's
> >> possible that increasing that could also improve it (or not).
> > 
> > I tried smaller ones, like grounp number 8 (4M), with the plugging
> > added back, it's not that good, around 1.6GB/s ~ 1.7GB/s.
> > And larger number 32 (16M) doesn't increase much over the 16 (8M)
> > value.
> > 
> > The bottleneck on the queue depth seems to be the wait.
> > 
> > Especially for zoned write back, we need to ensure the write into
> > dev-replace target device ordered.
> > 
> > This patch choose to wait for all existing stripes to finish their
> > read, then submit all the writes, to ensure the write sequence.
> > 
> > But this is still a read-and-wait behavior, thus leads to the low
> > queue depth.
> 
> I tried making sctx->stripes[] a ring buffer, so that we would only
> wait the current slot if it's still under scrub, no more dedicated
> waits at the last slot.
> 
> The branch can be found at my github repo:
> https://github.com/adam900710/linux/tree/scrub_testing
> 
> 
> But unfortunately this brings no change to scrub performance at all,
> and the average queue depth is still pretty shallow (slightly better,
> can reach 1.8~2.0).
> 
> If needed, I can send out the patchset (still pretty small, +249/-246)
> for review, but I don't think it worthy as no improvement at all...
> 
> For now, I'm afraid the latency introduced by plug is already enough
> to reduce the throughput, thus we have to go back the original
> behavior, to manually merge the read requests, other than rely on
> block layer plug...
Qu Wenruo July 19, 2023, 8:51 a.m. UTC | #7
On 2023/7/19 13:22, Qu Wenruo wrote:
>
>
> On 2023/7/18 09:41, Qu Wenruo wrote:
>>
>>
>> On 2023/7/18 08:59, David Sterba wrote:
>>> On Fri, Jul 14, 2023 at 06:26:40PM +0800, Qu Wenruo wrote:
>>>> [REGRESSION]
>>>> There are several regression reports about the scrub performance with
>>>> v6.4 kernel.
>>>>
>>>> On a PCIE3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
>>>> v6.4 can only go 1GB/s, an obvious 66% performance drop.
>>>>
>>>> [CAUSE]
>>>> Iostat shows a very different behavior between v6.3 and v6.4 kernel:
>>>>
>>>> Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz
>>>> aqu-sz  %util
>>>> nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18
>>>> 100.00
>>>> nvme0n1p3 20853.00 1330288.00     0.00   0.00    0.08    63.79   1.60
>>>> 100.00
>>>>
>>>> The upper one is v6.3 while the lower one is v6.4.
>>>>
>>>> There are several obvious differences:
>>>>
>>>> - Very few read merges
>>>>    This turns out to be a behavior change that we no longer go bio
>>>>    plug/unplug.
>>>>
>>>> - Very low aqu-sz
>>>>    This is due to the submit-and-wait behavior of
>>>> flush_scrub_stripes().
>>>>
>>>> Both behavior is not that obvious on SATA SSDs, as SATA SSDs has NCQ to
>>>> merge the reads, while SATA SSDs can not handle high queue depth well
>>>> either.
>>>>
>>>> [FIX]
>>>> For now this patch focus on the read speed fix. Dev-replace replace
>>>> speed needs extra work.
>>>>
>>>> For the read part, we go two directions to fix the problems:
>>>>
>>>> - Re-introduce blk plug/unplug to merge read requests
>>>>    This is pretty simple, and the behavior is pretty easy to observe.
>>>>
>>>>    This would enlarge the average read request size to 512K.
>>>>
>>>> - Introduce multi-group reads and no longer waits for each group
>>>>    Instead of the old behavior, which submit 8 stripes and wait for
>>>>    them, here we would enlarge the total stripes to 16 * 8.
>>>>    Which is 8M per device, the same limits as the old scrub flying
>>>>    bios size limit.
>>>>
>>>>    Now every time we fill a group (8 stripes), we submit them and
>>>>    continue to next stripes.
>>>
>>> Have you tried other values for the stripe count if this makes any
>>> difference? Getting back to the previous value (8M) is closer to the
>>> amount of data sent at once but the code around has chnaged, so it's
>>> possible that increasing that could also improve it (or not).
>>
>> I tried smaller ones, like grounp number 8 (4M), with the plugging added
>> back, it's not that good, around 1.6GB/s ~ 1.7GB/s.
>> And larger number 32 (16M) doesn't increase much over the 16 (8M) value.
>>
>> The bottleneck on the queue depth seems to be the wait.
>>
>> Especially for zoned write back, we need to ensure the write into
>> dev-replace target device ordered.
>>
>> This patch choose to wait for all existing stripes to finish their read,
>> then submit all the writes, to ensure the write sequence.
>>
>> But this is still a read-and-wait behavior, thus leads to the low queue
>> depth.
>
> I tried making sctx->stripes[] a ring buffer, so that we would only wait
> the current slot if it's still under scrub, no more dedicated waits at
> the last slot.
>
> The branch can be found at my github repo:
> https://github.com/adam900710/linux/tree/scrub_testing
>
>
> But unfortunately this brings no change to scrub performance at all, and
> the average queue depth is still pretty shallow (slightly better, can
> reach 1.8~2.0).
>
> If needed, I can send out the patchset (still pretty small, +249/-246)
> for review, but I don't think it worthy as no improvement at all...
>
> For now, I'm afraid the latency introduced by plug is already enough to
> reduce the throughput, thus we have to go back the original behavior, to
> manually merge the read requests, other than rely on block layer plug...

Well, I made some progress on a completely different path of scrub.

Since I had no better ideas, I began skipping certain workloads to see
what is really slowing down the read.

Skipping data verification made no difference (a big surprise to me).

Skipping data csum population, the read speed jumped to 2.7GiB/s, which
is already pretty close to the old speed.

This already shows some big progress: the csum and extent tree search is
now a pretty big part, and we're doing the search again and again just
to process one 64KiB stripe.

Thus the core fix should be reducing the extent/csum tree searches,
rather than chasing the read pattern. This is also a big wake-up call
for me, as I simply assumed the cached extent buffers for the
extent/csum trees would make the search lightning fast, but they don't.

With this new light bulb lit (and with extra benchmark results backing
it up), it should not be that hard to get back the performance, at least
beyond 2.5GiB/s.
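
A sketch of the direction this finding points at, with entirely hypothetical
helper names (this is not the current btrfs API): look the csums up once for
a whole stripe group and hand each 64K stripe its slice, instead of
searching the csum tree per stripe.

static int populate_group_csums(struct scrub_ctx *sctx, u64 group_start,
                                u32 group_len)
{
        struct csum_cache cache;        /* hypothetical: csums for the whole group */
        int ret;

        /* One csum tree search for the 8M group instead of one per 64K stripe. */
        ret = lookup_csums_cached(sctx->fs_info, group_start, group_len, &cache);
        if (ret < 0)
                return ret;

        for (int i = 0; i < sctx->cur_stripe; i++) {
                /* Copy the csums covering this stripe out of the cached range. */
                fill_stripe_csums(&sctx->stripes[i], &cache);   /* hypothetical */
        }
        csum_cache_release(&cache);                             /* hypothetical */
        return 0;
}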

Thanks,
Qu
>
> Thanks,
> Qu
Jens Axboe July 19, 2023, 2:59 p.m. UTC | #8
On 7/19/23 1:27?AM, Martin Steigerwald wrote:
> Hi Qu, hi Jens.
> 
> @Qu: I wonder whether Jens as a maintainer for the block layer can shed 
> some insight on the merging of requests in block layer or the latency of 
> plugging aspect of this bug.
> 
> @Jens: There has been a regression of scrubbing speeds in kernel 6.4 
> mainly for NVME SSDs with a drop of speed from above 2 GiB/s to 
> sometimes even lower than 1 GiB/s. I bet Qu can explain better to you 
> what is actually causing this. For reference here the initial thread:
> 
> Scrub of my nvme SSD has slowed by about 2/3
> https://lore.kernel.org/linux-btrfs/
> CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
> 
> And here another attempt of Qu to fix this which did not work out as well 
> as he hoped:
> 
> btrfs: scrub: make sctx->stripes[] a ring buffer 
> 
> https://lore.kernel.org/linux-btrfs/cover.1689744163.git.wqu@suse.com/
> 
> Maybe Jens you can share some insight on how to best achieve higher 
> queue depth and good merge behavior for NVME SSDs while not making the 
> queue depth too high for SATA SSDs/HDDs?

Didn't read the whole thread, but a quick glance at the 6.3..6.4 btrfs
changes, it's dropping plugging around issuing the scrub IO? Using a
plug around issuing IO is _the_ way to ensure that merging gets done,
and it'll also take care of starting issue if the plug depth becomes too
large.

So is this a case of just not doing plugging anymore, and this is why
the merge rate (and performance) drops?
Qu Wenruo July 19, 2023, 10:30 p.m. UTC | #9
On 2023/7/19 22:59, Jens Axboe wrote:
> On 7/19/23 1:27?AM, Martin Steigerwald wrote:
>> Hi Qu, hi Jens.
>>
>> @Qu: I wonder whether Jens as a maintainer for the block layer can shed
>> some insight on the merging of requests in block layer or the latency of
>> plugging aspect of this bug.
>>
>> @Jens: There has been a regression of scrubbing speeds in kernel 6.4
>> mainly for NVME SSDs with a drop of speed from above 2 GiB/s to
>> sometimes even lower than 1 GiB/s. I bet Qu can explain better to you
>> what is actually causing this. For reference here the initial thread:
>>
>> Scrub of my nvme SSD has slowed by about 2/3
>> https://lore.kernel.org/linux-btrfs/
>> CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
>>
>> And here another attempt of Qu to fix this which did not work out as well
>> as he hoped:
>>
>> btrfs: scrub: make sctx->stripes[] a ring buffer
>>
>> https://lore.kernel.org/linux-btrfs/cover.1689744163.git.wqu@suse.com/
>>
>> Maybe Jens you can share some insight on how to best achieve higher
>> queue depth and good merge behavior for NVME SSDs while not making the
>> queue depth too high for SATA SSDs/HDDs?
>
> Didn't read the whole thread, but a quick glance at the 6.3..6.4 btrfs
> changes, it's dropping plugging around issuing the scrub IO? Using a
> plug around issuing IO is _the_ way to ensure that merging gets done,
> and it'll also take care of starting issue if the plug depth becomes too
> large.

Thanks, I learnt the lesson the hard way.

Just another question related to plugs: the whole plug is a per-thread
thing, so what if we're submitting multiple bios across several kthreads?

Is there any global/device/bio based way to tell the block layer to try
its best to merge, even if it may mean higher latency?

>
> So is this a case of just not doing plugging anymore, and this is why
> the merge rate (and performance) drops?
>

That's part of the reason. Another problem is that, due to the new fixed
block size, we spend much more time preparing the needed info before
submitting the bio, thus reducing the throughput.

I'm working on reducing that overhead to properly fix the regression.

Thanks,
Qu
Jens Axboe July 19, 2023, 10:36 p.m. UTC | #10
On 7/19/23 4:30?PM, Qu Wenruo wrote:
> 
> 
> On 2023/7/19 22:59, Jens Axboe wrote:
>> On 7/19/23 1:27?AM, Martin Steigerwald wrote:
>>> Hi Qu, hi Jens.
>>>
>>> @Qu: I wonder whether Jens as a maintainer for the block layer can shed
>>> some insight on the merging of requests in block layer or the latency of
>>> plugging aspect of this bug.
>>>
>>> @Jens: There has been a regression of scrubbing speeds in kernel 6.4
>>> mainly for NVME SSDs with a drop of speed from above 2 GiB/s to
>>> sometimes even lower than 1 GiB/s. I bet Qu can explain better to you
>>> what is actually causing this. For reference here the initial thread:
>>>
>>> Scrub of my nvme SSD has slowed by about 2/3
>>> https://lore.kernel.org/linux-btrfs/
>>> CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
>>>
>>> And here another attempt of Qu to fix this which did not work out as well
>>> as he hoped:
>>>
>>> btrfs: scrub: make sctx->stripes[] a ring buffer
>>>
>>> https://lore.kernel.org/linux-btrfs/cover.1689744163.git.wqu@suse.com/
>>>
>>> Maybe Jens you can share some insight on how to best achieve higher
>>> queue depth and good merge behavior for NVME SSDs while not making the
>>> queue depth too high for SATA SSDs/HDDs?
>>
>> Didn't read the whole thread, but a quick glance at the 6.3..6.4 btrfs
>> changes, it's dropping plugging around issuing the scrub IO? Using a
>> plug around issuing IO is _the_ way to ensure that merging gets done,
>> and it'll also take care of starting issue if the plug depth becomes too
>> large.
> 
> Thanks, I learnt the lesson by the harder way.
> 
> Just another question related to plugs, the whole plug is per-thread
> thing, what if we're submitting multiple bios across several kthreads?
> 
> Is there any global/device/bio based way to tell block layer to try its
> best to merge? Even it may means higher latency.

Right, the plug is on-stack for the submitting process. It's generally
pretty rare to have cross-thread merges, at least for normal workloads.
But anything beyond the plug for merging happens in the io scheduler,
though these days with devices having high queue depth, you'd have to be
pretty lucky to hit that. There's no generic way to hold off.

Do you have multiple threads doing the scrubbing? If yes, then my first
question would be why they would be working in areas that overlap to
such an extent that cross thread merging makes sense? And if there are
good reasons for that, and that's an IFF, then you could queue pending
IO locally to a shared list, and have whatever thread grabs the list
first actually perform all the submissions. That'd be more efficient
further down too.
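
A minimal sketch of that "shared list, single submitter" idea, with made-up
structure and helper names (not taken from the btrfs code):

struct pending_reads {
        spinlock_t lock;
        struct bio_list bios;
};

static void queue_read(struct pending_reads *p, struct bio *bio)
{
        spin_lock(&p->lock);
        bio_list_add(&p->bios, bio);
        spin_unlock(&p->lock);
}

/* Whichever thread gets here first drains and submits everything queued. */
static void submit_pending(struct pending_reads *p)
{
        struct bio_list local;
        struct blk_plug plug;
        struct bio *bio;

        spin_lock(&p->lock);
        local = p->bios;
        bio_list_init(&p->bios);
        spin_unlock(&p->lock);

        /* Plugging lets the block layer merge adjacent reads before issuing. */
        blk_start_plug(&plug);
        while ((bio = bio_list_pop(&local)))
                submit_bio(bio);
        blk_finish_plug(&plug);
}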
Qu Wenruo July 19, 2023, 10:50 p.m. UTC | #11
On 2023/7/20 06:36, Jens Axboe wrote:
> On 7/19/23 4:30?PM, Qu Wenruo wrote:
>>
>>
>> On 2023/7/19 22:59, Jens Axboe wrote:
>>> On 7/19/23 1:27?AM, Martin Steigerwald wrote:
>>>> Hi Qu, hi Jens.
>>>>
>>>> @Qu: I wonder whether Jens as a maintainer for the block layer can shed
>>>> some insight on the merging of requests in block layer or the latency of
>>>> plugging aspect of this bug.
>>>>
>>>> @Jens: There has been a regression of scrubbing speeds in kernel 6.4
>>>> mainly for NVME SSDs with a drop of speed from above 2 GiB/s to
>>>> sometimes even lower than 1 GiB/s. I bet Qu can explain better to you
>>>> what is actually causing this. For reference here the initial thread:
>>>>
>>>> Scrub of my nvme SSD has slowed by about 2/3
>>>> https://lore.kernel.org/linux-btrfs/
>>>> CAAKzf7=yS9vnf5zNid1CyvN19wyAgPz5o9sJP0vBqN6LReqXVg@mail.gmail.com/
>>>>
>>>> And here another attempt of Qu to fix this which did not work out as well
>>>> as he hoped:
>>>>
>>>> btrfs: scrub: make sctx->stripes[] a ring buffer
>>>>
>>>> https://lore.kernel.org/linux-btrfs/cover.1689744163.git.wqu@suse.com/
>>>>
>>>> Maybe Jens you can share some insight on how to best achieve higher
>>>> queue depth and good merge behavior for NVME SSDs while not making the
>>>> queue depth too high for SATA SSDs/HDDs?
>>>
>>> Didn't read the whole thread, but a quick glance at the 6.3..6.4 btrfs
>>> changes, it's dropping plugging around issuing the scrub IO? Using a
>>> plug around issuing IO is _the_ way to ensure that merging gets done,
>>> and it'll also take care of starting issue if the plug depth becomes too
>>> large.
>>
>> Thanks, I learnt the lesson by the harder way.
>>
>> Just another question related to plugs, the whole plug is per-thread
>> thing, what if we're submitting multiple bios across several kthreads?
>>
>> Is there any global/device/bio based way to tell block layer to try its
>> best to merge? Even it may means higher latency.
>
> Right, the plug is on-stack for the submitting process. It's generally
> pretty rare to have cross-thread merges, at least for normal workloads.
> But anything beyond the plug for merging happens in the io scheduler,
> though these days with devices having high queue depth, you'd have to be
> pretty lucky to hit that. There's no generic way to hold off.

Thanks for clarifying this.

>
> Do you have multiple threads doing the scrubbing?

No, initially I had a multi-threaded (and over-engineered) implementation
that made the initial read submission happen in the main thread, while all
later work (verification, and writeback for repair/replace) happened in
different workers.

But later I finally found out that it's not the read part causing the
low throughput, but the preparation part before read submission.

Thus this was just to satisfy my curiosity.

Thanks very much for the explanation!
Qu

> If yes, then my first
> question would be why they would be working in areas that overlap to
> such an extent that cross thread merging makes sense? And if there are
> good reasons for that, and that's an IFF, then you could queue pending
> IO locally to a shared list, and have whatever thread grabs the list
> first actually perform all the submissions. That'd be more efficient
> further down too.
>

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 38c103f13fd5..adb7d8c921d4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -43,16 +43,28 @@  struct scrub_ctx;
 /*
  * The following value only influences the performance.
  *
- * This determines the batch size for stripe submitted in one go.
+ * This determines how many stripes would be submitted in one go,
+ * which would be 512KiB (BTRFS_STRIPE_LEN * SCRUB_STRIPES_PER_GROUP).
  */
-#define SCRUB_STRIPES_PER_SCTX	8	/* That would be 8 64K stripe per-device. */
+#define SCRUB_STRIPES_PER_GROUP		8
 
+/*
+ * How many groups we have for each sctx.
+ *
+ * This would be 8M per device, the same value as the old scrub
+ * flying bios size limit.
+ */
+#define SCRUB_STRIPE_GROUPS_PER_SCTX	16
+
+#define SCRUB_STRIPES_PER_SCTX		(SCRUB_STRIPES_PER_GROUP * \
+					 SCRUB_STRIPE_GROUPS_PER_SCTX)
 /*
  * The following value times PAGE_SIZE needs to be large enough to match the
  * largest node/leaf/sector size that shall be supported.
  */
 #define SCRUB_MAX_SECTORS_PER_BLOCK	(BTRFS_MAX_METADATA_BLOCKSIZE / SZ_4K)
 
+
 /* Represent one sector and its needed info to verify the content. */
 struct scrub_sector_verification {
 	bool is_metadata;
@@ -78,6 +90,9 @@  enum scrub_stripe_flags {
 	/* Set when @mirror_num, @dev, @physical and @logical are set. */
 	SCRUB_STRIPE_FLAG_INITIALIZED,
 
+	/* Set when the initial read has been submitted. */
+	SCRUB_STRIPE_FLAG_READ_SUBMITTED,
+
 	/* Set when the read-repair is finished. */
 	SCRUB_STRIPE_FLAG_REPAIR_DONE,
 
@@ -1604,6 +1619,9 @@  static void scrub_submit_initial_read(struct scrub_ctx *sctx,
 	ASSERT(stripe->mirror_num > 0);
 	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
 
+	if (test_and_set_bit(SCRUB_STRIPE_FLAG_READ_SUBMITTED, &stripe->state))
+		return;
+
 	bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
 			       scrub_read_endio, stripe);
 
@@ -1657,6 +1675,7 @@  static int flush_scrub_stripes(struct scrub_ctx *sctx)
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct scrub_stripe *stripe;
 	const int nr_stripes = sctx->cur_stripe;
+	struct blk_plug plug;
 	int ret = 0;
 
 	if (!nr_stripes)
@@ -1664,12 +1683,17 @@  static int flush_scrub_stripes(struct scrub_ctx *sctx)
 
 	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
 
+	/* We should only have at most one group to submit. */
 	scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
-			      btrfs_stripe_nr_to_offset(nr_stripes));
+			      btrfs_stripe_nr_to_offset(
+				      nr_stripes % SCRUB_STRIPES_PER_GROUP ?:
+				      SCRUB_STRIPES_PER_GROUP));
+	blk_start_plug(&plug);
 	for (int i = 0; i < nr_stripes; i++) {
 		stripe = &sctx->stripes[i];
 		scrub_submit_initial_read(sctx, stripe);
 	}
+	blk_finish_plug(&plug);
 
 	for (int i = 0; i < nr_stripes; i++) {
 		stripe = &sctx->stripes[i];
@@ -1748,28 +1772,47 @@  static void raid56_scrub_wait_endio(struct bio *bio)
 
 static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *bg,
 			      struct btrfs_device *dev, int mirror_num,
-			      u64 logical, u32 length, u64 physical)
+			      u64 logical, u32 length, u64 physical,
+			      u64 *found_stripe_start_ret)
 {
 	struct scrub_stripe *stripe;
 	int ret;
 
-	/* No available slot, submit all stripes and wait for them. */
-	if (sctx->cur_stripe >= SCRUB_STRIPES_PER_SCTX) {
-		ret = flush_scrub_stripes(sctx);
-		if (ret < 0)
-			return ret;
-	}
-
+	/*
+	 * We should always have at least one slot, when full the last one
+	 * who queued a slot should handle the flush.
+	 */
+	ASSERT(sctx->cur_stripe < SCRUB_STRIPES_PER_SCTX);
 	stripe = &sctx->stripes[sctx->cur_stripe];
-
-	/* We can queue one stripe using the remaining slot. */
 	scrub_reset_stripe(stripe);
 	ret = scrub_find_fill_first_stripe(bg, dev, physical, mirror_num,
 					   logical, length, stripe);
 	/* Either >0 as no more extents or <0 for error. */
 	if (ret)
 		return ret;
+	*found_stripe_start_ret = stripe->logical;
+
 	sctx->cur_stripe++;
+
+	/* Last slot used, flush them all. */
+	if (sctx->cur_stripe == SCRUB_STRIPES_PER_SCTX)
+		return flush_scrub_stripes(sctx);
+
+	/* We have filled one group, submit them now. */
+	if (sctx->cur_stripe % SCRUB_STRIPES_PER_GROUP == 0) {
+		struct blk_plug plug;
+
+		scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
+			      btrfs_stripe_nr_to_offset(SCRUB_STRIPES_PER_GROUP));
+
+		blk_start_plug(&plug);
+		for (int i = sctx->cur_stripe - SCRUB_STRIPES_PER_GROUP;
+		     i < sctx->cur_stripe; i++) {
+			stripe = &sctx->stripes[i];
+			scrub_submit_initial_read(sctx, stripe);
+		}
+		blk_finish_plug(&plug);
+	}
 	return 0;
 }
 
@@ -1965,6 +2008,7 @@  static int scrub_simple_mirror(struct scrub_ctx *sctx,
 	/* Go through each extent items inside the logical range */
 	while (cur_logical < logical_end) {
 		u64 cur_physical = physical + cur_logical - logical_start;
+		u64 found_logical;
 
 		/* Canceled? */
 		if (atomic_read(&fs_info->scrub_cancel_req) ||
@@ -1988,7 +2032,7 @@  static int scrub_simple_mirror(struct scrub_ctx *sctx,
 
 		ret = queue_scrub_stripe(sctx, bg, device, mirror_num,
 					 cur_logical, logical_end - cur_logical,
-					 cur_physical);
+					 cur_physical, &found_logical);
 		if (ret > 0) {
 			/* No more extent, just update the accounting */
 			sctx->stat.last_physical = physical + logical_length;
@@ -1998,9 +2042,7 @@  static int scrub_simple_mirror(struct scrub_ctx *sctx,
 		if (ret < 0)
 			break;
 
-		ASSERT(sctx->cur_stripe > 0);
-		cur_logical = sctx->stripes[sctx->cur_stripe - 1].logical
-			      + BTRFS_STRIPE_LEN;
+		cur_logical = found_logical + BTRFS_STRIPE_LEN;
 
 		/* Don't hold CPU for too long time */
 		cond_resched();