
[00/18] btrfs: make send scale and perform better with shared extents

Message ID: cover.1667315100.git.fdmanana@suse.com

Message

Filipe Manana Nov. 1, 2022, 4:15 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

There are two problems with send regarding cloned extents:

1) Sometimes it ends up not cloning whole extents, but only a section of
   the extents, resulting in less extent sharing at the receiver and extra
   IO on the send side (reading data, issuing write commands) and on the
   receiver side too (writing more data). This is not only suboptimal,
   but it also surprises users and often gets reported (such as in the
   thread referenced in patch 09/18);

2) When we find that a data extent is directly shared more than 64 times,
   we don't attempt to clone it, because that requires backref walking to
   determine which inode and range we should clone from, and for
   extents with many backreferences that can be too slow, especially if
   we have many thousands of extents with a huge amount of sharing each.

This patchset solves the first problem completely (patch 09/18). The second
problem is not fully eliminated, but it is significantly improved.
In a test scenario with 50 000 files where each file is reflinked 50 times,
there's a performance improvement of ~70% to ~75% for both full and
incremental send operations. This test and results are in the changelog
of patch 17/18.

After this we can bump the limit from a maximum of 64 references to 1024,
which is still a conservative value, but the goal is to get rid of the limit
in the future (some more work required for that, but we're getting there).
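
The limit described above amounts to a threshold check made before any
backref walking is attempted. A minimal sketch, assuming hypothetical names
(SEND_MAX_EXTENT_REFS_OLD, SEND_MAX_EXTENT_REFS_NEW and should_try_clone()
are illustrative only and are not taken from the patches):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants mirroring the two limits named in the cover letter. */
#define SEND_MAX_EXTENT_REFS_OLD 64
#define SEND_MAX_EXTENT_REFS_NEW 1024

/*
 * Decide whether send should attempt to clone an extent. An extent shared
 * more than max_refs times is not cloned, because finding a clone source
 * for it would need backref walking that may be too slow.
 */
static bool should_try_clone(uint64_t extent_refs, uint64_t max_refs)
{
	return extent_refs <= max_refs;
}
```

Under these assumptions, an extent with 65 references is skipped with the
old limit and becomes a clone candidate with the new one.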

There's also a nice and simple performance optimization when processing
extents that are not shared and we are using only one clone source (the
send root itself, very common), with gains varying between ~9% and ~18%
in some small scale tests where there are no shared extents or the majority
of the extents are not shared. That's patch 08/18.
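
The idea behind this optimization can be sketched as a cheap check made
before any backref lookup. This is only an illustration: struct clone_ctx
and can_skip_backref_walk() are hypothetical names, and the exact condition
used by patch 08/18 may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the state needed for the decision. */
struct clone_ctx {
	int clone_sources;     /* number of clone sources, send root included */
	uint64_t extent_refs;  /* reference count of the data extent */
};

/*
 * Return true when backref walking can be skipped: the extent is not
 * shared and the only clone source is the send root itself, so there is
 * no other reference that could serve as a clone source.
 */
static bool can_skip_backref_walk(const struct clone_ctx *ctx)
{
	return ctx->clone_sources == 1 && ctx->extent_refs == 1;
}
```

The check costs two comparisons, which is consistent with the gains showing
up mostly when few or no extents are shared.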

The rest is just refactoring and cleanups in preparation for the optimization
work for send, and a few bug fixes for error paths in the backref walking
code and qgroup self tests. In particular the error paths for backref walking
are important because with the latest patches they are triggered not just in
case an error happens but also when the backref walking callbacks tell the
backref walking code to stop early.
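
To illustrate why those exit paths matter, here is a minimal sketch of a
walker whose callback can request an early stop. Everything in it
(WALK_STOP, walk_items(), stop_at(), live_allocs) is hypothetical and much
simpler than the real backref walking code; it only shows that an early
stop must go through the same cleanup as an error.

```c
#include <stdlib.h>

/* Hypothetical return value a callback uses to request an early stop. */
#define WALK_STOP 1

typedef int (*walk_cb)(int item, void *ctx);

/* Counts outstanding allocations so a leak is observable in this sketch. */
static int live_allocs;

/*
 * Walk count items, calling cb for each one. A negative return from cb is
 * an error, WALK_STOP asks for an early stop. Both leave through the same
 * label, so the scratch buffer is released on every exit.
 */
static int walk_items(const int *items, int count, walk_cb cb, void *ctx)
{
	int ret = 0;
	int *scratch = malloc(sizeof(*scratch) * (count > 0 ? count : 1));

	if (!scratch)
		return -1;
	live_allocs++;

	for (int i = 0; i < count; i++) {
		scratch[i] = items[i];
		ret = cb(scratch[i], ctx);
		if (ret)	/* error or early stop: same cleanup path */
			goto out;
	}
out:
	free(scratch);
	live_allocs--;
	/* An early stop is not an error for the caller. */
	return ret < 0 ? ret : 0;
}

/* Example callback: stop as soon as the wanted value is seen. */
static int stop_at(int item, void *ctx)
{
	int *wanted = ctx;

	return item == *wanted ? WALK_STOP : 0;
}
```

If the early-stop case returned directly from inside the loop instead of
jumping to the label, the buffer would leak on every stop, which is the
kind of problem the fixes at the start of the series address.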

More details in the changelogs of the patches.

I've also left this in a git tree at:

  https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=send_clone_performance_scalability

Filipe Manana (18):
  btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
  btrfs: fix inode list leak during backref walking at find_parent_nodes()
  btrfs: fix ulist leaks in error paths of qgroup self tests
  btrfs: remove pointless and double ulist frees in error paths of qgroup tests
  btrfs: send: avoid unnecessary path allocations when finding extent clone
  btrfs: send: update comment at find_extent_clone()
  btrfs: send: drop unnecessary backref context field initializations
  btrfs: send: avoid unnecessary backref lookups when finding clone source
  btrfs: send: optimize clone detection to increase extent sharing
  btrfs: use a single argument for extent offset in backref walking functions
  btrfs: use a structure to pass arguments to backref walking functions
  btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
  btrfs: constify ulist parameter of ulist_next()
  btrfs: send: cache leaf to roots mapping during backref walking
  btrfs: send: skip unnecessary backref iterations
  btrfs: send: avoid double extent tree search when finding clone source
  btrfs: send: skip resolution of our own backref when finding clone source
  btrfs: send: bump the extent reference count limit for backref walking

 fs/btrfs/backref.c            | 596 ++++++++++++++++++++--------------
 fs/btrfs/backref.h            | 137 +++++++-
 fs/btrfs/qgroup.c             |  38 ++-
 fs/btrfs/relocation.c         |  19 +-
 fs/btrfs/scrub.c              |  18 +-
 fs/btrfs/send.c               | 467 +++++++++++++++++++-------
 fs/btrfs/tests/qgroup-tests.c |  86 +++--
 fs/btrfs/ulist.c              |   2 +-
 fs/btrfs/ulist.h              |   2 +-
 9 files changed, 928 insertions(+), 437 deletions(-)

Comments

David Sterba Nov. 2, 2022, 4:01 p.m. UTC | #1
On Tue, Nov 01, 2022 at 04:15:36PM +0000, fdmanana@kernel.org wrote:
> [...]

Thanks a lot, the improvements look great. Added to misc-next.