[v3,00/10] btrfs: zoned: write-time activation of metadata block group

Message-ID: <cover.1691424260.git.naohiro.aota@wdc.com>
From: Naohiro Aota <naohiro.aota@wdc.com>
To: linux-btrfs@vger.kernel.org
Cc: hch@infradead.org, josef@toxicpanda.com, dsterba@suse.cz, Naohiro Aota <naohiro.aota@wdc.com>
Subject: [PATCH v3 00/10] btrfs: zoned: write-time activation of metadata block group
Date: Tue, 8 Aug 2023 01:12:30 +0900

Naohiro Aota Aug. 7, 2023, 4:12 p.m. UTC

In the current implementation, block groups are activated at
reservation time to ensure that all reserved bytes can be written to
an active metadata block group. However, this approach has proven to
be less efficient, as it activates block groups more frequently than
necessary, putting pressure on the active zone resource and leading to
potential issues such as early ENOSPC or hung_task.

Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block
group allocations, resulting in decreased overall performance.

Actually, we don't need so many active metadata block groups because
there is only one sequential metadata write stream.

So, this series introduces write-time activation of metadata and
system block groups. This involves reserving at least one active block
group specifically for metadata and system block groups. When a
write goes into a new block group, all the regions in the current
active block group should already have been allocated. So, we can wait
for the in-flight IOs to fill that space, and then switch to the new
block group.

Switching to write-time activation solves the above issues and will
lead to better performance.
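
As an illustration only, the switching rule described above can be modelled
in a few lines of Python. This is a toy sketch, not the kernel code: the
names BlockGroup, ToyZonedFs, allocate() and write() are hypothetical, each
block group is assumed to map to a single zone, and "waiting for IO" is
modelled by write() returning False so that the caller retries later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockGroup:
    """Toy metadata block group backed by one zone (hypothetical model)."""
    capacity: int            # writable units in the zone
    alloc_offset: int = 0    # units already handed out to writers
    write_pointer: int = 0   # units actually written so far
    active: bool = False

    def allocate(self, n: int) -> bool:
        """Reserve n units; allocation alone never activates the zone."""
        if self.alloc_offset + n > self.capacity:
            return False
        self.alloc_offset += n
        return True


class ToyZonedFs:
    """Tracks the active-zone budget and the current write target."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self.active_count = 0
        self.current: Optional[BlockGroup] = None

    def write(self, bg: BlockGroup, n: int) -> bool:
        """Write n allocated units to bg, activating bg at write time.

        Returns False when the caller has to wait and retry.
        """
        if bg.write_pointer + n > bg.alloc_offset:
            raise ValueError("writing past the allocated region")
        cur = self.current
        if cur is not None and cur is not bg:
            # Everything allocated in the current block group must be
            # written before the write stream moves on.
            if cur.write_pointer < cur.alloc_offset:
                return False
            cur.active = False          # finish the old zone
            self.active_count -= 1
            self.current = None
        if not bg.active:
            if self.active_count >= self.max_active:
                return False
            bg.active = True            # activation happens here, at write time
            self.active_count += 1
        self.current = bg
        bg.write_pointer += n
        return True


fs = ToyZonedFs(max_active=1)
a, b = BlockGroup(capacity=4), BlockGroup(capacity=4)
a.allocate(4)             # A is fully allocated, nothing written yet
b.allocate(1)             # B stays inactive after allocation
print(fs.write(a, 2))     # True: A is activated by its first write
print(fs.write(b, 1))     # False: A still has allocated, unwritten space
print(fs.write(a, 2))     # True: A is now completely written
print(fs.write(b, 1))     # True: A is finished, B is activated
```

With a budget of a single active zone the model still makes progress,
because the second block group is only activated once the first one has
been filled.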

* Performance

A buffered-write workload without sync shows a significant difference,
because this series re-enables metadata over-commit.

before the patch:  741.00 MB/sec
after the patch:  1430.27 MB/sec (+ 93%)
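
For reference, the quoted gain follows directly from the two throughput
figures above. A small Python check (the variable names are illustrative):

```python
# Throughput figures quoted in the cover letter, in MB/sec.
before_mb_s = 741.00
after_mb_s = 1430.27

# Relative improvement of "after" over "before", in percent.
gain_pct = (after_mb_s / before_mb_s - 1.0) * 100.0
print(f"+{gain_pct:.1f}%")  # prints +93.0%
```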

* Organization

Patches 1-5 are preparation patches involving the meta_write_pointer check.

Patches 6 and 7 are the main part of this series, implementing the
write-time activation.

Patches 8-10 remove code that was only needed for reservation-time
activation: counting a fresh block group as zone_unusable, activating a
block group on allocation, and disabling metadata over-commit.

* Changes

- v3
  - Rework the reservation patch to fix the over-reservation problem
    https://lore.kernel.org/all/xpb5wdmxx5wops26ihulo73oluc64dt4zpxqc7cirp2wvxl3qy@hv7lsvma5hxf/
  - Rename btrfs_eb_write_context's block_group to zoned_bg.
    
- v2
  - Introduce a struct to consolidate extent buffer write context
    (btrfs_eb_write_context)
  - Change return type of btrfs_check_meta_write_pointer to int
  - Calculate the reservation count only when it sees DUP BG
  - Drop unnecessary BG lock

Naohiro Aota (10):
  btrfs: introduce struct to consolidate extent buffer write context
  btrfs: zoned: introduce block group context to btrfs_eb_write_context
  btrfs: zoned: return int from btrfs_check_meta_write_pointer
  btrfs: zoned: defer advancing meta_write_pointer
  btrfs: zoned: update meta_write_pointer on zone finish
  btrfs: zoned: reserve zones for an active metadata/system block group
  btrfs: zoned: activate metadata block group on write time
  btrfs: zoned: no longer count fresh BG region as zone unusable
  btrfs: zoned: don't activate non-DATA BG on allocation
  btrfs: zoned: re-enable metadata over-commit for zoned mode

 fs/btrfs/block-group.c      |  13 +-
 fs/btrfs/disk-io.c          |   2 +
 fs/btrfs/extent-tree.c      |   8 +-
 fs/btrfs/extent_io.c        |  44 +++---
 fs/btrfs/extent_io.h        |   7 +
 fs/btrfs/free-space-cache.c |   8 +-
 fs/btrfs/fs.h               |   3 +
 fs/btrfs/space-info.c       |  34 +----
 fs/btrfs/zoned.c            | 259 ++++++++++++++++++++++++++++--------
 fs/btrfs/zoned.h            |  29 ++--
 10 files changed, 273 insertions(+), 134 deletions(-)

David Sterba Aug. 9, 2023, 6:02 p.m. UTC | #1

Added to misc-next, thanks. We need it in order to enable zoned tests in
the CI so this goes in now, any fixups or more review tags will be done
in the commits.

Josef Bacik Aug. 10, 2023, 12:59 p.m. UTC | #2

Hey Naohiro,

This enabled me to turn on the zoned vm for the GitHub CI, we're only failing 7
tests now, so great job!

However all the !zoned vms panic immediately

https://paste.centos.org/view/54d11384

Can you fix that up?  Also you can submit a PR against the 'ci' branch of our
linux repo in the btrfs GitHub project to run through the CI yourself to make
sure you didn't mess anything up.  Thanks,

Josef

Josef Bacik Aug. 10, 2023, 1:34 p.m. UTC | #3

Additionally you had these failures in the CI setup

btrfs/220 btrfs/237 btrfs/239 btrfs/273 btrfs/295 generic/551 generic/574

I've excluded them so we can catch regressions, but everything except btrfs/220
seems like legitimate failures.  btrfs/220 needs to be updated since zoned
doesn't do discard=async, but you can do that whenever, I'm less worried about
that.  The rest should be investigated at some point, though not as a
prerequisite for merging this series.  Thanks,

Josef

Naohiro Aota Aug. 10, 2023, 2:13 p.m. UTC | #4

On Thu, Aug 10, 2023 at 08:59:37AM -0400, Josef Bacik wrote:
> Hey Naohiro,
> 
> This enabled me to turn on the zoned vm for the GitHub CI, we're only failing 7
> tests now, so great job!

Thanks! The GitHub CI setup is really interesting. I tried to figure out
how it sets up the zoned devices. Are they QEMU-emulated ZNS devices?

> However all the !zoned vms panic immediately
> 
> https://paste.centos.org/view/54d11384
> 
> Can you fix that up?  Also you can submit a PR against the 'ci' branch of our
> linux repo in the btrfs GitHub project to run through the CI yourself to make
> sure you didn't mess anything up.  Thanks,

I sent a candidate fix as a PR. I hope it works well.

> 
> Josef

Naohiro Aota Aug. 10, 2023, 2:34 p.m. UTC | #5

On Thu, Aug 10, 2023 at 09:34:58AM -0400, Josef Bacik wrote:
> Additionally you had these failures in the CI setup
> 
> btrfs/220 btrfs/237 btrfs/239 btrfs/273 btrfs/295 generic/551 generic/574
> 
> I've excluded them so we can catch regressions, but everything except btrfs/220
> seems like legitimate failures.  btrfs/220 needs to be updated since zoned
> doesn't do discard=async, but you can do that whenever, I'm less worried about
> that.  The rest should be investigated at some point, though not as a
> prerequisite for merging this series.  Thanks,

I checked the CI log. Yes, btrfs/220 is due to discard=async.

* known to fail
- btrfs/237: we need to tweak the test for ZNS (zone capacity != zone size)
- btrfs/239: somehow, tree-log behaves differently in zoned mode... I
             have no idea why it fails. But, I think it is still a valid status...

* need to modify test?
- btrfs/295: overwriting a zoned device won't work. So, this test should be skipped.
- generic/574: not sure fsverity works with zoned mode. Need to check.

So, btrfs/273 and generic/551 are suspicious. btrfs/273 prints some WARN
dmesg and generic/551 killed an AIO_TEST program... Are there details
available?

> 
> Josef

David Sterba Aug. 10, 2023, 2:36 p.m. UTC | #6

On Thu, Aug 10, 2023 at 02:34:11PM +0000, Naohiro Aota wrote:
> > seems like legitimate failures.  btrfs/220 needs to be updated since zoned
> > doesn't do discard=async, but you can do that whenever, I'm less worried about
> > that.  The rest should be investigated at some point, though not as a
> > prerequisite for merging this series.  Thanks,
> 
> I checked the CI log. Yes, btrfs/220 is due to discard=async.
> 
> * known to fail
> - btrfs/237: we need to tweak the test for ZNS (zone capacity != zone size)
> - btrfs/239: somehow, tree-log behaves differently in zoned mode... I
>              have no idea why it fails. But, I think it is still a valid status...
> 
> * need to modify test?
> - generic/574: not sure fsverity works with zoned mode. Need to check.

The compatibility matrix at https://btrfs.readthedocs.io/en/latest/Status.html#zoned-mode
does not mention fsverity, so somebody has to test it and add the entry.
