Message ID: cover.1705568050.git.naohiro.aota@wdc.com (mailing list archive)
Series: btrfs: disable inline checksum for multi-dev striped FS
On Thu, 18 Jan 2024 17:54:49 +0900
Naohiro Aota <naohiro.aota@wdc.com> wrote:

> There was a report of write performance regression on 6.5-rc4 on RAID0
> (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> and doing the checksum inline can be bad for performance on RAID0
> setup [2].
>
> [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
>
> While inlining the fast checksum is good for single (or two) device,
> but it is not fast enough for multi-device striped writing.

Personal opinion, it is a very awkward criteria to enable or disable the
inline mode. There can be a RAID0 of SATA HDDs/SSDs that will be slower than a
single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.

Secondly, less often, there can be a hardware RAID which presents itself to the
OS as a single device, but is also very fast.

Sure, basing such decision on anything else, such as benchmark of the
actual block device may not be as feasible.

> So, this series first introduces fs_devices->inline_csum_mode and its
> sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> it disables inline checksum when it find a block group striped writing
> into multiple devices.

Has it been determined what improvement enabling the inline mode brings at all,
and in which setups? Maybe just disable it by default and provide this tweak
option?

> Naohiro Aota (2):
>   btrfs: introduce inline_csum_mode to tweak inline checksum behavior
>   btrfs: detect multi-dev stripe and disable automatic inline checksum
>
>  fs/btrfs/bio.c     | 14 ++++++++++++--
>  fs/btrfs/sysfs.c   | 39 +++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c | 20 ++++++++++++++++++++
>  fs/btrfs/volumes.h | 21 +++++++++++++++++++++
>  4 files changed, 92 insertions(+), 2 deletions(-)
>
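To make the proposed decision easier to picture, here is a minimal sketch of
how an auto/on/off mode could plug into should_async_write(); the actual
patches are not reproduced on this page, so the enum values and the
striped-writes flag below are placeholder names, not the series' real
identifiers, and the function's other existing checks are omitted.

/*
 * Sketch only: BTRFS_INLINE_CSUM_* and striped_csum_unfriendly are
 * placeholder names standing in for whatever the series introduces.
 */
enum btrfs_inline_csum_mode {
        BTRFS_INLINE_CSUM_AUTO = 0,
        BTRFS_INLINE_CSUM_ON,
        BTRFS_INLINE_CSUM_OFF,
};

static bool should_async_write(struct btrfs_bio *bbio)
{
        struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;

        /* Forced one way or the other by the proposed sysfs knob. */
        if (fs_devices->inline_csum_mode == BTRFS_INLINE_CSUM_ON)
                return false;   /* checksum in the submitting task */
        if (fs_devices->inline_csum_mode == BTRFS_INLINE_CSUM_OFF)
                return true;    /* always offload to the workqueue */

        /*
         * Auto: keep the fast-CRC rule, but offload once a block group
         * striping writes over multiple devices has been seen
         * (placeholder flag that patch 2/2 would set).
         */
        if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags) &&
            !fs_devices->striped_csum_unfriendly)
                return false;

        return true;
}

The open question in the replies below is whether any such static rule can
pick the right side on both a slow striped array and a fast single device.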
Apart from the style nit in 2/2:
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On Thu, Jan 18, 2024 at 02:12:31PM +0500, Roman Mamedov wrote:
> On Thu, 18 Jan 2024 17:54:49 +0900
> Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
> >
> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for single (or two) device,
> > but it is not fast enough for multi-device striped writing.
>
> Personal opinion, it is a very awkward criteria to enable or disable the
> inline mode. There can be a RAID0 of SATA HDDs/SSDs that will be slower than a
> single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
> 4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.
>
> Secondly, less often, there can be a hardware RAID which presents itself to the
> OS as a single device, but is also very fast.

Yeah I find the decision logic not adapting well to various types of
underlying hardware. While the multi-device and striped sounds like a
good heuristic it can still lead to worse performance.

> Sure, basing such decision on anything else, such as benchmark of the
> actual block device may not be as feasible.

In an ideal case it adapts to current load or device capabilities, which
needs a feedback loop and tracking the status for the offloading
decision.
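Purely as an illustration of that last point: an adaptive scheme needs some
state to feed back on. A made-up sketch (none of these names exist in btrfs)
could track the average cost of submitter-side checksumming and offload once
it exceeds a budget.

/*
 * Invented illustration, not btrfs code: track the average time a
 * submitter spends checksumming synchronously and offload once that
 * cost per bio exceeds a budget. All names are made up.
 */
struct csum_feedback {
        atomic64_t sync_csum_ns;        /* ns spent checksumming in the submitter */
        atomic64_t sync_csum_bios;      /* bios checksummed synchronously */
};

static void csum_feedback_account(struct csum_feedback *fb, u64 ns)
{
        atomic64_add(ns, &fb->sync_csum_ns);
        atomic64_inc(&fb->sync_csum_bios);
}

static bool csum_feedback_should_offload(struct csum_feedback *fb,
                                         u64 budget_ns_per_bio)
{
        u64 bios = atomic64_read(&fb->sync_csum_bios);
        u64 ns = atomic64_read(&fb->sync_csum_ns);

        /* Offload once synchronous checksumming costs more than the budget. */
        return bios && ns / bios > budget_ns_per_bio;
}

Whether the right metric is time, queue depth, or device bandwidth is exactly
the open question in this subthread.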
On Thu, Jan 18, 2024 at 05:54:49PM +0900, Naohiro Aota wrote:
> There was a report of write performance regression on 6.5-rc4 on RAID0
> (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> and doing the checksum inline can be bad for performance on RAID0
> setup [2].

First, please don't name it 'inline checksum', it's so confusing because
we have 'inline' as inline files and also the inline checksums stored in
the b-tree nodes.

> [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
>
> While inlining the fast checksum is good for single (or two) device,
> but it is not fast enough for multi-device striped writing.
>
> So, this series first introduces fs_devices->inline_csum_mode and its
> sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> it disables inline checksum when it find a block group striped writing
> into multiple devices.

How is one supposed to know if and how the sysfs knob should be set?
This depends on the device speed(s), profiles and number of devices, can
the same decision logic be replicated inside btrfs? Such tuning should
be done automatically (similar things are done in other subsystems like
memory management).

With such type of setting we'll get people randomly flipping it on/off
and see if it fixes performance, without actually looking if it's
relevant or not. We've seen this with random advice circling around
internet how to fix enospc problems, it's next to impossible to stop
that so I really don't want to allow that for performance.
On Thu, Jan 18, 2024 at 02:12:31PM +0500, Roman Mamedov wrote:
> On Thu, 18 Jan 2024 17:54:49 +0900
> Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
> >
> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for single (or two) device,
> > but it is not fast enough for multi-device striped writing.
>
> Personal opinion, it is a very awkward criteria to enable or disable the
> inline mode. There can be a RAID0 of SATA HDDs/SSDs that will be slower than a
> single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
> 4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.
>
> Secondly, less often, there can be a hardware RAID which presents itself to the
> OS as a single device, but is also very fast.
>
> Sure, basing such decision on anything else, such as benchmark of the
> actual block device may not be as feasible.
>
> > So, this series first introduces fs_devices->inline_csum_mode and its
> > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > it disables inline checksum when it find a block group striped writing
> > into multiple devices.
>
> Has it been determined what improvement enabling the inline mode brings at all,
> and in which setups? Maybe just disable it by default and provide this tweak
> option?

Note: as mentioned by David, I'm going to say "sync checksum" instead of
"inline checksum".

I'm going to list the benchmark results here. The sync checksum was
introduced with this patch.
https://lore.kernel.org/linux-btrfs/20230503070615.1029820-2-hch@lst.de/

The benchmark described in the patch originates from this email by Chris.
https://lore.kernel.org/linux-btrfs/eb544c31-7a74-d503-83f0-4dc226917d1a@meta.com/

* Device: Intel Optane
* workqueue checksum (Unpatched):
  write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
* Sync checksum (synchronous CRCs):
  write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

Christoph also did the same on kvm on consumer drives and got a similar
result. Furthermore, even with "non-accelerated crc32 code", "the workqueue
offload only looked better for really large writes, and then only
marginally."
https://lore.kernel.org/linux-btrfs/20230325081341.GB7353@lst.de/

Then, Wang Yugui reported a regression both on a SINGLE setup and a RAID0
setup.
https://lore.kernel.org/linux-btrfs/20230811222321.2AD2.409509F4@e16-tech.com/

* CPU: E5 2680 v2, two NUMA nodes
* RAM: 192G
* Device: NVMe SSD PCIe3 x4
* Btrfs profile: data=raid0, metadata=raid1
  - all PCIe3 NVMe SSDs are connected to one NVMe HBA/one numa node.

* workqueue checksum, RAID0:
  WRITE: bw=3858MiB/s (4045MB/s)
  WRITE: bw=3781MiB/s (3965MB/s)
* sync checksum, RAID0:
  WRITE: bw=1311MiB/s (1375MB/s)
  WRITE: bw=1435MiB/s (1504MB/s)
* workqueue checksum, SINGLE:
  WRITE: bw=3004MiB/s (3149MB/s)
  WRITE: bw=2851MiB/s (2990MB/s)
* sync checksum, SINGLE:
  WRITE: bw=1337MiB/s (1402MB/s)
  WRITE: bw=1413MiB/s (1481MB/s)

So, the workqueue (old) method is way better on his machine.

After a while, I reported that workqueue checksum is better than sync
checksum in a 6 SSDs RAID0 case.
https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/

* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G

On 6 SSDs RAID0:
* workqueue checksum:
  WRITE: bw=2106MiB/s (2208MB/s), 2106MiB/s-2106MiB/s (2208MB/s-2208MB/s),
  io=760GiB (816GB), run=369705-369705msec
* sync checksum:
  WRITE: bw=1391MiB/s (1458MB/s), 1391MiB/s-1391MiB/s (1458MB/s-1458MB/s),
  io=499GiB (536GB), run=367312-367312msec

Or, even with a 1 SSD setup (still RAID0):
* workqueue checksum:
  WRITE: bw=437MiB/s (459MB/s), 437MiB/s-437MiB/s (459MB/s-459MB/s),
  io=299GiB (321GB), run=699787-699787msec
* sync checksum:
  WRITE: bw=442MiB/s (464MB/s), 442MiB/s-442MiB/s (464MB/s-464MB/s),
  io=302GiB (324GB), run=698553-698553msec

The same as Wang Yugui, I got better performance with workqueue checksum.

I also tested it on an emulated fast device (null_blk with irqmode=0)
today. The device is formatted with the default profile.

* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G
* Device: null_blk with irqmode=0, use_per_node_hctx=1, memory_backed=1,
  size=512000 (512GB)
* Btrfs profile: data=single, metadata=dup

I ran this fio command with this series applied to tweak the checksum mode.

fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
    --rw=write \
    --direct=0 --ioengine=psync \
    --filesize=${filesize} \
    --blocksize=1m \
    --end_fsync=1 \
    --directory=/mnt \
    --name=writer --numjobs=${numjobs}

I tried several setups, but I could not get better performance with sync
checksum. Examples are shown below.

With numjobs=96, filesize=2GB:
* workqueue checksum (writing "off" to the newly added sysfs file):
  WRITE: bw=1776MiB/s (1862MB/s), 1776MiB/s-1776MiB/s (1862MB/s-1862MB/s),
  io=192GiB (206GB), run=110733-110733msec
* sync checksum (writing "on" to the sysfs file):
  WRITE: bw=1037MiB/s (1088MB/s), 1037MiB/s-1037MiB/s (1088MB/s-1088MB/s),
  io=192GiB (206GB), run=189550-189550msec

With numjobs=368, filesize=512MB:
* workqueue checksum:
  WRITE: bw=1726MiB/s (1810MB/s), 1726MiB/s-1726MiB/s (1810MB/s-1810MB/s),
  io=192GiB (206GB), run=113902-113902msec
* sync checksum:
  WRITE: bw=1221MiB/s (1280MB/s), 1221MiB/s-1221MiB/s (1280MB/s-1280MB/s),
  io=192GiB (206GB), run=161060-161060msec

Also, I ran a similar experiment on a different machine, which has 32 CPUs
and 128 GB RAM. Since it has less RAM, filesize is also smaller than above.
And, again, workqueue checksum is slightly better.

* workqueue checksum:
  WRITE: bw=298MiB/s (313MB/s), 298MiB/s-298MiB/s (313MB/s-313MB/s),
  io=32.0GiB (34.4GB), run=109883-109883msec
* sync checksum:
  WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s),
  io=32.0GiB (34.4GB), run=119169-119169msec

When I started writing this reply, I thought the proper criterion might be
the number of CPUs, or some balance of the number of CPUs vs disks. But
now, as I could not get "sync checksum" to be better on any setup, I'm
getting puzzled. Is "sync checksum" really still effective now? Maybe it is
good on a machine with fewer CPUs (~4?) and a single device?

In addition, we are also going to have a change on the workqueue side,
which changes the max number of running work items and is especially
effective for a NUMA machine.
https://lore.kernel.org/all/20240113002911.406791-1-tj@kernel.org/

Anyway, we need more benchmark results to see the effect of "sync checksum"
and "workqueue checksum".

>
> > Naohiro Aota (2):
> >   btrfs: introduce inline_csum_mode to tweak inline checksum behavior
> >   btrfs: detect multi-dev stripe and disable automatic inline checksum
> >
> >  fs/btrfs/bio.c     | 14 ++++++++++++--
> >  fs/btrfs/sysfs.c   | 39 +++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/volumes.c | 20 ++++++++++++++++++++
> >  fs/btrfs/volumes.h | 21 +++++++++++++++++++++
> >  4 files changed, 92 insertions(+), 2 deletions(-)
> >
>
>
> --
> With respect,
> Roman
On Fri, Jan 19, 2024 at 05:01:01PM +0100, David Sterba wrote:
> On Thu, Jan 18, 2024 at 05:54:49PM +0900, Naohiro Aota wrote:
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
>
> First, please don't name it 'inline checksum', it's so confusing because
> we have 'inline' as inline files and also the inline checksums stored in
> the b-tree nodes.

Sure. Sorry for the confusing naming. Is it OK to call it "sync checksum"?
and "workqueue checksum" for the opposite?

> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for single (or two) device,
> > but it is not fast enough for multi-device striped writing.
> >
> > So, this series first introduces fs_devices->inline_csum_mode and its
> > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > it disables inline checksum when it find a block group striped writing
> > into multiple devices.
>
> How is one supposed to know if and how the sysfs knob should be set?
> This depends on the device speed(s), profiles and number of devices, can
> the same decision logic be replicated inside btrfs? Such tuning should
> be done automatically (similar things are done in other subsystems like
> memory management).

Yeah, I first thought it was OK to turn sync checksum off automatically on
e.g, RAID0 case. But, as reported in [1], it becomes difficult. It might
depend also on CPUs.

[1] https://lore.kernel.org/linux-btrfs/irc2v7zqrpbkeehhysq7fccwmguujnkrktknl3d23t2ecwope6@o62qzd4yyxt2/T/#u

> With such type of setting we'll get people randomly flipping it on/off
> and see if it fixes performance, without actually looking if it's
> relevant or not. We've seen this with random advice circling around
> internet how to fix enospc problems, it's next to impossible to stop
> that so I really don't want to allow that for performance.

Yes, I agree it's nasty to have a random switch.

But, in [1], I can't find a setup that has a better performance on sync
checksum (even for SINGLE setup). So, I think, we need to rethink and
examine the effectiveness of sync checksum vs workqueue checksum. So, for
the evaluation, I'd like to leave the sysfs knob under CONFIG_BTRFS_DEBUG.
On Fri, Jan 19, 2024 at 04:49:27PM +0100, David Sterba wrote:
> On Thu, Jan 18, 2024 at 02:12:31PM +0500, Roman Mamedov wrote:
> > On Thu, 18 Jan 2024 17:54:49 +0900
> > Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > > There was a report of write performance regression on 6.5-rc4 on RAID0
> > > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > > and doing the checksum inline can be bad for performance on RAID0
> > > setup [2].
> > >
> > > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> > >
> > > While inlining the fast checksum is good for single (or two) device,
> > > but it is not fast enough for multi-device striped writing.
> >
> > Personal opinion, it is a very awkward criteria to enable or disable the
> > inline mode. There can be a RAID0 of SATA HDDs/SSDs that will be slower than a
> > single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
> > 4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.
> >
> > Secondly, less often, there can be a hardware RAID which presents itself to the
> > OS as a single device, but is also very fast.
>
> Yeah I find the decision logic not adapting well to various types of
> underlying hardware. While the multi-device and striped sounds like a
> good heuristic it can still lead to worse performance.
>
> > Sure, basing such decision on anything else, such as benchmark of the
> > actual block device may not be as feasible.
>
> In an ideal case it adapts to current load or device capabilities, which
> needs a feedback loop and tracking the status for the offloading
> decision.

Yeah, it would be best if we can implement it properly...
On Mon, Jan 22, 2024 at 03:12:27PM +0000, Naohiro Aota wrote:
> On Fri, Jan 19, 2024 at 05:01:01PM +0100, David Sterba wrote:
> > On Thu, Jan 18, 2024 at 05:54:49PM +0900, Naohiro Aota wrote:
> > > There was a report of write performance regression on 6.5-rc4 on RAID0
> > > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > > and doing the checksum inline can be bad for performance on RAID0
> > > setup [2].
> >
> > First, please don't name it 'inline checksum', it's so confusing because
> > we have 'inline' as inline files and also the inline checksums stored in
> > the b-tree nodes.
>
> Sure. Sorry for the confusing naming. Is it OK to call it "sync checksum"?
> and "workqueue checksum" for the opposite?

Well 'sync' also already has a meaning :) we could call it 'checksum
offloading'.

> > > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> > >
> > > While inlining the fast checksum is good for single (or two) device,
> > > but it is not fast enough for multi-device striped writing.
> > >
> > > So, this series first introduces fs_devices->inline_csum_mode and its
> > > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > > it disables inline checksum when it find a block group striped writing
> > > into multiple devices.
> >
> > How is one supposed to know if and how the sysfs knob should be set?
> > This depends on the device speed(s), profiles and number of devices, can
> > the same decision logic be replicated inside btrfs? Such tuning should
> > be done automatically (similar things are done in other subsystems like
> > memory management).
>
> Yeah, I first thought it was OK to turn sync checksum off automatically on
> e.g, RAID0 case. But, as reported in [1], it becomes difficult. It might
> depend also on CPUs.
>
> [1] https://lore.kernel.org/linux-btrfs/irc2v7zqrpbkeehhysq7fccwmguujnkrktknl3d23t2ecwope6@o62qzd4yyxt2/T/#u
>
> > With such type of setting we'll get people randomly flipping it on/off
> > and see if it fixes performance, without actually looking if it's
> > relevant or not. We've seen this with random advice circling around
> > internet how to fix enospc problems, it's next to impossible to stop
> > that so I really don't want to allow that for performance.
>
> Yes, I agree it's nasty to have a random switch.
>
> But, in [1], I can't find a setup that has a better performance on sync
> checksum (even for SINGLE setup). So, I think, we need to rethink and
> examine the effectiveness of sync checksum vs workqueue checksum. So, for
> the evaluation, I'd like to leave the sysfs knob under CONFIG_BTRFS_DEBUG.

Ok to have it for debugging and evaluation. Thanks for the results [1],
clearly the switch to sync checksums was not backed by enough benchmarks.
As a fallback we can revert the change at the cost of pessimizing the one
case where it helped compared to several others where it makes things
worse. We might find a heuristic based on device type and turn it on again
selectively, e.g. for NVMe-like devices.
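One conceivable shape for such a device-type heuristic, assuming it only
distinguishes rotational from non-rotational devices and counts them; this
is an illustration only, the cut-off value is arbitrary and nothing in the
thread proposes this exact check (locking around the device list is also
omitted).

/*
 * Illustration only: keep synchronous checksumming when every device is
 * non-rotational and there are only a few of them. The cut-off of 4 is
 * an arbitrary placeholder.
 */
static bool csum_in_submitter_suitable(struct btrfs_fs_devices *fs_devices)
{
        struct btrfs_device *device;
        int nr = 0;

        list_for_each_entry(device, &fs_devices->devices, dev_list) {
                if (!device->bdev || !bdev_nonrot(device->bdev))
                        return false;   /* missing or rotational device */
                nr++;
        }

        /* Past a few fast devices the submitting task becomes the bottleneck. */
        return nr <= 4;
}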
Hi,

> There was a report of write performance regression on 6.5-rc4 on RAID0
> (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> and doing the checksum inline can be bad for performance on RAID0
> setup [2].
>
> [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
>
> While inlining the fast checksum is good for single (or two) device,
> but it is not fast enough for multi-device striped writing.
>
> So, this series first introduces fs_devices->inline_csum_mode and its
> sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> it disables inline checksum when it find a block group striped writing
> into multiple devices.

We have struct btrfs_inode | sync_writers in kernel 6.1.y, but it was
dropped in recent kernels.

Is btrfs_inode | sync_writers not implemented very well?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/01/24
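For context, sync_writers in those older kernels counted tasks doing
fsync()/O_SYNC-style writes on the inode, and data checksums were only
computed in the submitting task while such a waiter existed; otherwise they
were offloaded. Roughly (reconstructed from memory, not the literal 6.1
code):

/*
 * Approximate reconstruction of the pre-6.3 data-write decision,
 * simplified; not the literal 6.1 code.
 */
static bool old_should_offload_data_csum(struct btrfs_inode *inode)
{
        /*
         * A non-zero sync_writers meant an fsync()/O_SYNC writer was
         * waiting on this I/O, so the workqueue hop would only add
         * latency: checksum in the submitter then, offload otherwise.
         */
        return atomic_read(&inode->sync_writers) == 0;
}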
Hi,

> Hi,
>
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
> >
> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for single (or two) device,
> > but it is not fast enough for multi-device striped writing.
> >
> > So, this series first introduces fs_devices->inline_csum_mode and its
> > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > it disables inline checksum when it find a block group striped writing
> > into multiple devices.
>
> We have struct btrfs_inode | sync_writers in kernel 6.1.y, but it was
> dropped in recent kernels.
>
> Is btrfs_inode | sync_writers not implemented very well?

I tried the logic below, something like btrfs_inode | sync_writers:
- checksum of metadata always sync
- checksum of data async only when depth over 1,
  to reduce task switch when low load.
  to use more cpu core when high load.

performance test result is not good
2GiB/s (checksum of data always async) -> 2.1GiB/s when low load.
4GiB/s (checksum of data always async) -> 2788MiB/s when high load.

but the info may be useful, so I post it here.


- checksum of metadata always sync

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 12b12443efaa..8ef968f0957d 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -598,7 +598,7 @@ static void run_one_async_free(struct btrfs_work *work)
 static bool should_async_write(struct btrfs_bio *bbio)
 {
         /* Submit synchronously if the checksum implementation is fast. */
-        if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
+        if ((bbio->bio.bi_opf & REQ_META) && test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
                 return false;

         /*


- checksum of data async only when depth over 1, to reduce task switch.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index efb894967f55..f90b6e8cf53c 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -626,6 +626,9 @@ static bool should_async_write(struct btrfs_bio *bbio)
         if ((bbio->bio.bi_opf & REQ_META) && btrfs_is_zoned(bbio->fs_info))
                 return false;

+        if (!(bbio->bio.bi_opf & REQ_META) && atomic_read(&bbio->fs_info->depth_checksum_data) == 0)
+                return false;
+
         return true;
 }

@@ -725,11 +728,21 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
         if (inode && !(inode->flags & BTRFS_INODE_NODATASUM) &&
             !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) &&
             !btrfs_is_data_reloc_root(inode->root)) {
-                if (should_async_write(bbio) &&
-                    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
-                        goto done;
-
+                if (should_async_write(bbio)) {
+                        if (!(bbio->bio.bi_opf & REQ_META))
+                                atomic_inc(&bbio->fs_info->depth_checksum_data);
+                        ret = btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num);
+                        if (!(bbio->bio.bi_opf & REQ_META))
+                                atomic_dec(&bbio->fs_info->depth_checksum_data);
+                        if (ret)
+                                goto done;
+                }
+
+                if (!(bbio->bio.bi_opf & REQ_META))
+                        atomic_inc(&bbio->fs_info->depth_checksum_data);
                 ret = btrfs_bio_csum(bbio);
+                if (!(bbio->bio.bi_opf & REQ_META))
+                        atomic_dec(&bbio->fs_info->depth_checksum_data);
                 if (ret)
                         goto fail_put_bio;
         } else if (use_append) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d7b127443c9a..3fd89be7610a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2776,6 +2776,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)

         fs_info->thread_pool_size = min_t(unsigned long,
                                           num_online_cpus() + 2, 8);
+        atomic_set(&fs_info->depth_checksum_data, 0);

         INIT_LIST_HEAD(&fs_info->ordered_roots);
         spin_lock_init(&fs_info->ordered_root_lock);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 7443bf014639..123cc8fa9be1 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -596,6 +596,7 @@ struct btrfs_fs_info {
         struct task_struct *transaction_kthread;
         struct task_struct *cleaner_kthread;
         u32 thread_pool_size;
+        atomic_t depth_checksum_data;

         struct kobject *space_info_kobj;
         struct kobject *qgroups_kobj;

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/01/25
On Mon, Jan 29, 2024 at 08:56:22PM +0800, Wang Yugui wrote:
> Hi,
>
> > Hi,
> >
> > > There was a report of write performance regression on 6.5-rc4 on RAID0
> > > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > > and doing the checksum inline can be bad for performance on RAID0
> > > setup [2].
> > >
> > > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> > >
> > > While inlining the fast checksum is good for single (or two) device,
> > > but it is not fast enough for multi-device striped writing.
> > >
> > > So, this series first introduces fs_devices->inline_csum_mode and its
> > > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > > it disables inline checksum when it find a block group striped writing
> > > into multiple devices.
> >
> > We have struct btrfs_inode | sync_writers in kernel 6.1.y, but it was
> > dropped in recent kernels.
> >
> > Is btrfs_inode | sync_writers not implemented very well?
>
> I tried the logic below, something like btrfs_inode | sync_writers:
> - checksum of metadata always sync
> - checksum of data async only when depth over 1,
>   to reduce task switch when low load.
>   to use more cpu core when high load.
>
> performance test result is not good
> 2GiB/s (checksum of data always async) -> 2.1GiB/s when low load.
> 4GiB/s (checksum of data always async) -> 2788MiB/s when high load.
>
> but the info may be useful, so I post it here.
>
>
> - checksum of metadata always sync
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 12b12443efaa..8ef968f0957d 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -598,7 +598,7 @@ static void run_one_async_free(struct btrfs_work *work)
>  static bool should_async_write(struct btrfs_bio *bbio)
>  {
>          /* Submit synchronously if the checksum implementation is fast. */
> -        if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
> +        if ((bbio->bio.bi_opf & REQ_META) && test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
>                  return false;
>
>          /*
>
>
> - checksum of data async only when depth over 1, to reduce task switch.
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index efb894967f55..f90b6e8cf53c 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -626,6 +626,9 @@ static bool should_async_write(struct btrfs_bio *bbio)
>          if ((bbio->bio.bi_opf & REQ_META) && btrfs_is_zoned(bbio->fs_info))
>                  return false;
>
> +        if (!(bbio->bio.bi_opf & REQ_META) && atomic_read(&bbio->fs_info->depth_checksum_data) == 0)
> +                return false;
> +
>          return true;
>  }
>
> @@ -725,11 +728,21 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
>          if (inode && !(inode->flags & BTRFS_INODE_NODATASUM) &&
>              !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state) &&
>              !btrfs_is_data_reloc_root(inode->root)) {
> -                if (should_async_write(bbio) &&
> -                    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
> -                        goto done;
> -
> +                if (should_async_write(bbio)) {
> +                        if (!(bbio->bio.bi_opf & REQ_META))
> +                                atomic_inc(&bbio->fs_info->depth_checksum_data);
> +                        ret = btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num);
> +                        if (!(bbio->bio.bi_opf & REQ_META))
> +                                atomic_dec(&bbio->fs_info->depth_checksum_data);

This does not look well implemented. btrfs_wq_submit_bio() returns soon
after it queues the checksum work to the workqueue, without processing any
actual work.

> +                        if (ret)
> +                                goto done;
> +                }
> +
> +                if (!(bbio->bio.bi_opf & REQ_META))
> +                        atomic_inc(&bbio->fs_info->depth_checksum_data);
>                  ret = btrfs_bio_csum(bbio);
> +                if (!(bbio->bio.bi_opf & REQ_META))
> +                        atomic_dec(&bbio->fs_info->depth_checksum_data);

These lines seem fine. But, as the btrfs_wq_submit_bio() side is not well
implemented, it will work like this:

- The first data bio will go to sync checksum: btrfs_bio_csum() because
  depth_checksum_data == 0.
- While at it, the second one comes and it goes to btrfs_wq_submit_bio()
  because depth_checksum_data == 1.
- Then, depth_checksum_data soon decrements to 1. The same happens for
  other parallel IOs.
- Once the checksum of the first bio finishes, depth_checksum_data == 0,
  accepting sync checksum. But, at this time, the workqueue may have a lot
  of work queued.

>                  if (ret)
>                          goto fail_put_bio;
>          } else if (use_append) {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index d7b127443c9a..3fd89be7610a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2776,6 +2776,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>
>          fs_info->thread_pool_size = min_t(unsigned long,
>                                            num_online_cpus() + 2, 8);
> +        atomic_set(&fs_info->depth_checksum_data, 0);
>
>          INIT_LIST_HEAD(&fs_info->ordered_roots);
>          spin_lock_init(&fs_info->ordered_root_lock);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 7443bf014639..123cc8fa9be1 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -596,6 +596,7 @@ struct btrfs_fs_info {
>          struct task_struct *transaction_kthread;
>          struct task_struct *cleaner_kthread;
>          u32 thread_pool_size;
> +        atomic_t depth_checksum_data;
>
>          struct kobject *space_info_kobj;
>          struct kobject *qgroups_kobj;
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2024/01/25
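To make that point concrete, the decrement would have to happen only after
the offloaded checksum has actually run, e.g. at the end of the async work
function, so that depth_checksum_data means "checksum work still pending"
rather than "currently inside btrfs_wq_submit_bio()". A sketch follows; the
field names come from Wang's patch, while the async_submit_bio details and
the exact hook point are approximated, not taken from the thread.

/*
 * Sketch: decrement only after the offloaded checksum has run, so the
 * counter tracks pending checksum work.
 */
static void run_one_async_start(struct btrfs_work *work)
{
        struct async_submit_bio *async =
                container_of(work, struct async_submit_bio, work);
        blk_status_t ret;

        ret = btrfs_bio_csum(async->bbio);
        if (ret)
                async->bbio->bio.bi_status = ret;

        /* Only now is the queued checksum work actually done. */
        if (!(async->bbio->bio.bi_opf & REQ_META))
                atomic_dec(&async->bbio->fs_info->depth_checksum_data);
}

Even with that change the counter drops back to zero between bursts, so the
heuristic would likely still need some hysteresis before switching back to
synchronous checksumming.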