Btrfs: send, skip backreference walking for extents with many references

Message ID 20191030122301.25270-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Series Btrfs: send, skip backreference walking for extents with many references

Commit Message

Filipe Manana Oct. 30, 2019, 12:23 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use too
much memory when extents have many references. This change simply skips
backreference walking when an extent has more than 64 references, in which
case we fall back to a write operation instead of a clone operation. This
limit is conservative and in practice I observed no significant slowdown
with up to 100 references and still low memory usage up to that limit.

This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.

Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Comments

Qu Wenruo Oct. 30, 2019, 12:47 p.m. UTC | #1
On 2019/10/30 8:23 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Backreference walking, which is used by send to figure if it can issue
> clone operations instead of write operations, can be very slow and use too
> much memory when extents have many references. This change simply skips
> backreference walking when an extent has more than 64 references, in which
> case we fall back to a write operation instead of a clone operation. This
> limit is conservative and in practice I observed no significant slowdown
> with up to 100 references and still low memory usage up to that limit.
> 
> This is a temporary workaround until there are speedups in the backref
> walking code, and as such it does not attempt to add extra interfaces or
> knobs to tweak the threshold.
> 
> Reported-by: Atemu <atemu.main@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

The workaround is much better than the old
completely-disable-reflink-detection one.

Thanks,
Qu

> ---
>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 123ac54af071..518ec1265a0c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -25,6 +25,14 @@
>  #include "compression.h"
>  
>  /*
> + * Maximum number of references an extent can have in order for us to attempt to
> + * issue clone operations instead of write operations. This currently exists to
> + * avoid hitting limitations of the backreference walking code (taking a lot of
> + * time and using too much memory for extents with large number of references).
> + */
> +#define SEND_MAX_EXTENT_REFS	64
> +
> +/*
>   * A fs_path is a helper to dynamically build path names with unknown size.
>   * It reallocates the internal buffer on demand.
>   * It allows fast adding of path elements on the right side (normal path) and
> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	struct clone_root *cur_clone_root;
>  	struct btrfs_key found_key;
>  	struct btrfs_path *tmp_path;
> +	struct btrfs_extent_item *ei;
>  	int compressed;
>  	u32 i;
>  
> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>  				  &found_key, &flags);
>  	up_read(&fs_info->commit_root_sem);
> -	btrfs_release_path(tmp_path);
>  
>  	if (ret < 0)
>  		goto out;
> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>  		goto out;
>  	}
>  
> +	ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> +			    struct btrfs_extent_item);
> +	/*
> +	 * Backreference walking (iterate_extent_inodes() below) is currently
> +	 * too expensive when an extent has a large number of references, both
> +	 * in time spent and used memory. So for now just fallback to write
> +	 * operations instead of clone operations when an extent has more than
> +	 * a certain amount of references.
> +	 */
> +	if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	btrfs_release_path(tmp_path);
> +
>  	/*
>  	 * Setup the clone roots.
>  	 */
>
David Sterba Nov. 1, 2019, 2:50 p.m. UTC | #2
On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Backreference walking, which is used by send to figure if it can issue
> clone operations instead of write operations, can be very slow and use too
> much memory when extents have many references. This change simply skips
> backreference walking when an extent has more than 64 references, in which
> case we fall back to a write operation instead of a clone operation. This
> limit is conservative and in practice I observed no significant slowdown
> with up to 100 references and still low memory usage up to that limit.

So this could lead to a larger stream due to writes instead of clones, and
thus fewer cloned ranges on the receive side. This is observable and
could potentially raise questions about why it is happening. Limiting the
number makes sense, based on the report, though I'm curious whether we can
make it higher, e.g. 128, or the 100 that you have measured.

I'll tag the patch for stable as it can be considered a usability bug fix,
so I'm interested in any possible fallout.

> This is a temporary workaround until there are speedups in the backref
> walking code, and as such it does not attempt to add extra interfaces or
> knobs to tweak the threshold.
> 
> Reported-by: Atemu <atemu.main@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks. The above is open for discussion and the number
can be tuned later.
Zygo Blaxell Nov. 23, 2019, 5:27 a.m. UTC | #3
On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Backreference walking, which is used by send to figure if it can issue
> clone operations instead of write operations, can be very slow and use too
> much memory when extents have many references. This change simply skips
> backreference walking when an extent has more than 64 references, in which
> case we fall back to a write operation instead of a clone operation. This
> limit is conservative and in practice I observed no significant slowdown
> with up to 100 references and still low memory usage up to that limit.
> 
> This is a temporary workaround until there are speedups in the backref
> walking code, and as such it does not attempt to add extra interfaces or
> knobs to tweak the threshold.
> 
> Reported-by: Atemu <atemu.main@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 123ac54af071..518ec1265a0c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -25,6 +25,14 @@
>  #include "compression.h"
>  
>  /*
> + * Maximum number of references an extent can have in order for us to attempt to
> + * issue clone operations instead of write operations. This currently exists to
> + * avoid hitting limitations of the backreference walking code (taking a lot of
> + * time and using too much memory for extents with large number of references).
> + */
> +#define SEND_MAX_EXTENT_REFS	64
> +
> +/*
>   * A fs_path is a helper to dynamically build path names with unknown size.
>   * It reallocates the internal buffer on demand.
>   * It allows fast adding of path elements on the right side (normal path) and
> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	struct clone_root *cur_clone_root;
>  	struct btrfs_key found_key;
>  	struct btrfs_path *tmp_path;
> +	struct btrfs_extent_item *ei;
>  	int compressed;
>  	u32 i;
>  
> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>  				  &found_key, &flags);
>  	up_read(&fs_info->commit_root_sem);
> -	btrfs_release_path(tmp_path);
>  
>  	if (ret < 0)
>  		goto out;
> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>  		goto out;
>  	}
>  
> +	ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> +			    struct btrfs_extent_item);
> +	/*
> +	 * Backreference walking (iterate_extent_inodes() below) is currently
> +	 * too expensive when an extent has a large number of references, both
> +	 * in time spent and used memory. So for now just fallback to write
> +	 * operations instead of clone operations when an extent has more than
> +	 * a certain amount of references.
> +	 */
> +	if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {

So...does this...work?

I ask because I looked at this a few years ago as a way to spend less
time doing LOGICAL_INO calls during dedupe (and especially avoid the
8-orders-of-magnitude performance degradation in the bad cases), but
I found that it was useless: it wasn't proportional to the number of
extents, nor was it an upper or lower bound, nor could extents be sorted
by number of references using extent.refs as a key, nor could it predict
the amount of time it would take for iterate_extent_inodes() to run.
I guess I'm surprised you got a different result.

If the 'refs' field is 1, the extent might have somewhere between 1
and 3500 root/inode/offset references.  But it's not a lower bound,
e.g. here is an extent on a random filesystem where extent_refs is a
big number:

        item 87 key (1344172032 EXTENT_ITEM 4096) itemoff 11671 itemsize 24
                refs 897 gen 2668358 flags DATA

LOGICAL_INO_V2 only finds 170 extent references:

        # btrfs ins log 1344172032 -P . | wc -l
        170

Is there a good primer somewhere on how the reference counting works in
btrfs?

> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	btrfs_release_path(tmp_path);
> +
>  	/*
>  	 * Setup the clone roots.
>  	 */
> -- 
> 2.11.0
>
Filipe Manana Nov. 23, 2019, 1:33 p.m. UTC | #4
On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Backreference walking, which is used by send to figure if it can issue
> > clone operations instead of write operations, can be very slow and use too
> > much memory when extents have many references. This change simply skips
> > backreference walking when an extent has more than 64 references, in which
> > case we fall back to a write operation instead of a clone operation. This
> > limit is conservative and in practice I observed no significant slowdown
> > with up to 100 references and still low memory usage up to that limit.
> >
> > This is a temporary workaround until there are speedups in the backref
> > walking code, and as such it does not attempt to add extra interfaces or
> > knobs to tweak the threshold.
> >
> > Reported-by: Atemu <atemu.main@gmail.com>
> > Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > index 123ac54af071..518ec1265a0c 100644
> > --- a/fs/btrfs/send.c
> > +++ b/fs/btrfs/send.c
> > @@ -25,6 +25,14 @@
> >  #include "compression.h"
> >
> >  /*
> > + * Maximum number of references an extent can have in order for us to attempt to
> > + * issue clone operations instead of write operations. This currently exists to
> > + * avoid hitting limitations of the backreference walking code (taking a lot of
> > + * time and using too much memory for extents with large number of references).
> > + */
> > +#define SEND_MAX_EXTENT_REFS 64
> > +
> > +/*
> >   * A fs_path is a helper to dynamically build path names with unknown size.
> >   * It reallocates the internal buffer on demand.
> >   * It allows fast adding of path elements on the right side (normal path) and
> > @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
> >       struct clone_root *cur_clone_root;
> >       struct btrfs_key found_key;
> >       struct btrfs_path *tmp_path;
> > +     struct btrfs_extent_item *ei;
> >       int compressed;
> >       u32 i;
> >
> > @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
> >       ret = extent_from_logical(fs_info, disk_byte, tmp_path,
> >                                 &found_key, &flags);
> >       up_read(&fs_info->commit_root_sem);
> > -     btrfs_release_path(tmp_path);
> >
> >       if (ret < 0)
> >               goto out;
> > @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
> >               goto out;
> >       }
> >
> > +     ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> > +                         struct btrfs_extent_item);
> > +     /*
> > +      * Backreference walking (iterate_extent_inodes() below) is currently
> > +      * too expensive when an extent has a large number of references, both
> > +      * in time spent and used memory. So for now just fallback to write
> > +      * operations instead of clone operations when an extent has more than
> > +      * a certain amount of references.
> > +      */
> > +     if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
>
> So...does this...work?

I wouldn't send it if it didn't work, at least for the case reported.

There's even a test case for this, btrfs/130 from fstests:

https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/btrfs/130

It used to take several hours to complete on a test vm I use
frequently, now it completes in 1 or 2 seconds.

While that is about an extent shared only within the same file, the
same happens if the extent is reflinked to many other files (either
all part of the same subvolume or belonging to different subvolumes).

>
> I ask because I looked at this a few years ago as a way to spend less
> time doing LOGICAL_INO calls during dedupe (and especially avoid the

To give some context to those not familiar with you, you are
mentioning a user space dedupe tool that doesn't call the btrfs dedup
ioctl, and calls the clone ioctl directly, while also using the
LOGICAL_INO ioctl.

> 8-orders-of-magnitude performance degradation in the bad cases), but
> I found that it was useless: it wasn't proportional to the number of
> extents, nor was it an upper or lower bound, nor could extents be sorted
> by number of references using extent.refs as a key, nor could it predict
> the amount of time it would take for iterate_extent_inodes() to run.
> I guess I'm surprised you got a different result.
>
> If the 'refs' field is 1, the extent might have somewhere between 1
> and 3500 root/inode/offset references.

Yes, that's due to snapshotting and the way it works.
Creating a snapshot consists of COWing the root node of a tree only,
which will not increment extent references (for data extents,
referenced in leafs) if the tree has a height of 2 or more.

Example:

1) You have a subvolume tree with 2 levels (root at level 1, leafs at level 0);

2) You have a file there that refers to extent A, not shared with
anyone else, and has a reference count of 1 in the extent item at the
extent tree;

3) After you snapshot the subvolume, extent A will still have 1
reference count in its item at the extent tree.

If you do anything that changes the leaf that refers to extent A
(resulting in it being COWed), in either tree, btrfs will take care of
adding a new backreference to the extent and bump the reference count
to 2 for extent A.

If, on the other hand, the source subvolume was really small,
consisting of a single node/leaf (root at level 0), then you would get
2 references for extent A immediately after making the snapshot.
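For anyone who wants to observe this directly, the refs field of the
extent item can be read from userspace with the TREE_SEARCH_V2 ioctl.
Below is a rough, untested sketch (not part of the patch); it assumes the
uapi definitions from linux/btrfs.h and linux/btrfs_tree.h, looks up the
EXTENT_ITEM of a data extent by bytenr, and does only minimal error
handling:

#include <endian.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>

/* Usage: ./extent-refs <path-on-fs> <extent bytenr> */
int main(int argc, char **argv)
{
        int fd = open(argv[1], O_RDONLY);
        unsigned long long bytenr = strtoull(argv[2], NULL, 0);
        size_t buf_size = 64 * 1024;
        struct btrfs_ioctl_search_args_v2 *args;
        struct btrfs_ioctl_search_header *sh;
        struct btrfs_extent_item *ei;

        args = calloc(1, sizeof(*args) + buf_size);
        if (fd < 0 || !args) {
                perror("setup");
                return 1;
        }
        args->buf_size = buf_size;
        /* Look up the EXTENT_ITEM for this bytenr in the extent tree. */
        args->key.tree_id = BTRFS_EXTENT_TREE_OBJECTID;
        args->key.min_objectid = bytenr;
        args->key.max_objectid = bytenr;
        args->key.min_type = BTRFS_EXTENT_ITEM_KEY;
        args->key.max_type = BTRFS_EXTENT_ITEM_KEY;
        args->key.max_offset = (__u64)-1;
        args->key.max_transid = (__u64)-1;
        args->key.nr_items = 1;

        if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args) < 0) {
                perror("BTRFS_IOC_TREE_SEARCH_V2");
                return 1;
        }
        if (args->key.nr_items == 0) {
                fprintf(stderr, "no EXTENT_ITEM at %llu\n", bytenr);
                return 1;
        }
        /* Each returned item is a search header followed by item data. */
        sh = (struct btrfs_ioctl_search_header *)args->buf;
        ei = (struct btrfs_extent_item *)(sh + 1);
        printf("bytenr %llu len %llu refs %llu\n",
               (unsigned long long)sh->objectid,
               (unsigned long long)sh->offset,
               (unsigned long long)le64toh(ei->refs));
        close(fd);
        return 0;
}

Running it against extent A before and after taking the snapshot, and
again after touching the file in either tree, should show refs going
from 1 to 2 exactly as described above.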

> But it's not a lower bound,
> e.g. here is an extent on a random filesystem where extent_refs is a
> big number:
>
>         item 87 key (1344172032 EXTENT_ITEM 4096) itemoff 11671 itemsize 24
>                 refs 897 gen 2668358 flags DATA
>
> LOGICAL_INO_V2 only finds 170 extent references:
>
>         # btrfs ins log 1344172032 -P . | wc -l
>         170

Pass "-s 65536", the default buffer is likely not large enough to all
inode/root numbers.
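For completeness, this is roughly what the resolution looks like when
calling the ioctl directly instead of going through btrfs-progs; an
untested sketch that hands LOGICAL_INO_V2 a 16 MiB result buffer (struct
and ioctl names are the uapi ones from linux/btrfs.h, the buffer size and
error handling are arbitrary):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Usage: ./logical-ino <path-on-fs> <logical bytenr> */
int main(int argc, char **argv)
{
        int fd = open(argv[1], O_RDONLY);
        unsigned long long logical = strtoull(argv[2], NULL, 0);
        size_t buf_size = 16 * 1024 * 1024;
        struct btrfs_data_container *inodes = calloc(1, buf_size);
        struct btrfs_ioctl_logical_ino_args args;
        unsigned long i;

        if (fd < 0 || !inodes) {
                perror("setup");
                return 1;
        }
        memset(&args, 0, sizeof(args));
        args.logical = logical;
        args.size = buf_size;               /* honoured by the v2 ioctl */
        args.inodes = (__u64)(uintptr_t)inodes;

        if (ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args) < 0) {
                perror("BTRFS_IOC_LOGICAL_INO_V2");
                return 1;
        }
        /* Results come back as (inode, offset, root) u64 triplets. */
        for (i = 0; i < inodes->elem_cnt; i += 3)
                printf("inode %llu offset %llu root %llu\n",
                       (unsigned long long)inodes->val[i],
                       (unsigned long long)inodes->val[i + 1],
                       (unsigned long long)inodes->val[i + 2]);
        if (inodes->elem_missed)
                fprintf(stderr, "%u references did not fit in the buffer\n",
                        inodes->elem_missed);
        close(fd);
        free(inodes);
        return 0;
}

If I remember correctly, the old LOGICAL_INO (v1) ignores the size field
and always uses a 64K buffer, which is where the 2730-reference limit
mentioned later in this thread comes from.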

>
> Is there a good primer somewhere on how the reference counting works in
> btrfs?

Dunno. Source code, and an old paper about btrfs:

https://domino.research.ibm.com/library/cyberdig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

Maybe the wiki has more references.

>
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +     btrfs_release_path(tmp_path);
> > +
> >       /*
> >        * Setup the clone roots.
> >        */
> > --
> > 2.11.0
> >
Zygo Blaxell Nov. 24, 2019, 1:33 a.m. UTC | #5
On Sat, Nov 23, 2019 at 01:33:33PM +0000, Filipe Manana wrote:
> On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Backreference walking, which is used by send to figure if it can issue
> > > clone operations instead of write operations, can be very slow and use too
> > > much memory when extents have many references. This change simply skips
> > > backreference walking when an extent has more than 64 references, in which
> > > case we fall back to a write operation instead of a clone operation. This
> > > limit is conservative and in practice I observed no significant slowdown
> > > with up to 100 references and still low memory usage up to that limit.
> > >
> > > This is a temporary workaround until there are speedups in the backref
> > > walking code, and as such it does not attempt to add extra interfaces or
> > > knobs to tweak the threshold.
> > >
> > > Reported-by: Atemu <atemu.main@gmail.com>
> > > Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
> > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > > index 123ac54af071..518ec1265a0c 100644
> > > --- a/fs/btrfs/send.c
> > > +++ b/fs/btrfs/send.c
> > > @@ -25,6 +25,14 @@
> > >  #include "compression.h"
> > >
> > >  /*
> > > + * Maximum number of references an extent can have in order for us to attempt to
> > > + * issue clone operations instead of write operations. This currently exists to
> > > + * avoid hitting limitations of the backreference walking code (taking a lot of
> > > + * time and using too much memory for extents with large number of references).
> > > + */
> > > +#define SEND_MAX_EXTENT_REFS 64
> > > +
> > > +/*
> > >   * A fs_path is a helper to dynamically build path names with unknown size.
> > >   * It reallocates the internal buffer on demand.
> > >   * It allows fast adding of path elements on the right side (normal path) and
> > > @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
> > >       struct clone_root *cur_clone_root;
> > >       struct btrfs_key found_key;
> > >       struct btrfs_path *tmp_path;
> > > +     struct btrfs_extent_item *ei;
> > >       int compressed;
> > >       u32 i;
> > >
> > > @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
> > >       ret = extent_from_logical(fs_info, disk_byte, tmp_path,
> > >                                 &found_key, &flags);
> > >       up_read(&fs_info->commit_root_sem);
> > > -     btrfs_release_path(tmp_path);
> > >
> > >       if (ret < 0)
> > >               goto out;
> > > @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
> > >               goto out;
> > >       }
> > >
> > > +     ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> > > +                         struct btrfs_extent_item);
> > > +     /*
> > > +      * Backreference walking (iterate_extent_inodes() below) is currently
> > > +      * too expensive when an extent has a large number of references, both
> > > +      * in time spent and used memory. So for now just fallback to write
> > > +      * operations instead of clone operations when an extent has more than
> > > +      * a certain amount of references.
> > > +      */
> > > +     if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
> >
> > So...does this...work?
> 
> I wouldn't send it if it didn't work, at least for the case reported.
> 
> There's even a test case for this, btrfs/130 from fstests:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/btrfs/130
> 
> It used to take several hours to complete on a test vm I use
> frequently, now it completes in 1 or 2 seconds.

To be clear, I'm not doubting that the patch fixes a test case.
It's more like I regularly encounter a lot of performance problems from
iterate_extent_inodes() and friends, and only a tiny fraction of them
look like that test case.

TL;DR it looks like LOGICAL_INO has more bugs than I knew about before
today, and that throws a bunch of my previous work on this topic into
doubt.  I have been viewing btrfs through a LOGICAL_INO-shaped lens.

> While that is about an extent shared only within the same file, the
> same happens if the extent is reflinked to many other files (either
> all part of the same subvolume or belonging to different subvolumes).

OK, then this might be a different bug than the ones I'm seeing, or
maybe I am hitting a variety of different bugs at once.

> > I ask because I looked at this a few years ago as a way to spend less
> > time doing LOGICAL_INO calls during dedupe (and especially avoid the
> 
> To give some context to those not familiar with you, you are
> mentioning a user space dedupe tool that doesn't call the btrfs dedup
> ioctl, and calls the clone ioctl directly, while also using the
> LOGICAL_INO ioctl.

Uhhh...bees uses the dedup ioctl, and does not call the clone ioctl.

Here is some more correct context:

bees uses LOGICAL_INO_V2 for two purposes:  the obvious one is to
discover the set of references to an extent so we can pass them all to
the dedupe ioctl.  The other purpose is to estimate the kernel's future
performance when manipulating that extent using iterate_extent_inodes()
and related functions.
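For the first purpose, the root/inode/offset results get turned into open
file descriptors and offsets and passed as destination ranges of a single
dedupe call, roughly like this untested sketch (the struct is the VFS
FIDEDUPERANGE form of the btrfs EXTENT_SAME ioctl, from linux/fs.h; the
helper and its arguments are made up for illustration):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Dedupe one extent-sized range of src_fd against n destination ranges
 * in a single FIDEDUPERANGE call.  The destination fds/offsets would be
 * built from the triplets that LOGICAL_INO_V2 returned. */
static int dedupe_against(int src_fd, __u64 src_off, __u64 len,
                          const int *dest_fds, const __u64 *dest_offs,
                          int n)
{
        struct file_dedupe_range *range;
        int i, ret;

        range = calloc(1, sizeof(*range) + n * sizeof(range->info[0]));
        if (!range)
                return -1;
        range->src_offset = src_off;
        range->src_length = len;
        range->dest_count = n;
        for (i = 0; i < n; i++) {
                range->info[i].dest_fd = dest_fds[i];
                range->info[i].dest_offset = dest_offs[i];
        }
        ret = ioctl(src_fd, FIDEDUPERANGE, range);
        /* Per-destination outcome is in info[i].status / bytes_deduped. */
        free(range);
        return ret;
}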

The performance estimate function is:  run the LOGICAL_INO(_V2) ioctl
and measure the CPU time the kernel thread spent.  If the kernel spends
more than 100 ms on the extent, then bees puts the extent on a blacklist
and will never touch the extent again.
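In code the check is roughly the following untested sketch (it
approximates the kernel-side cost with the calling thread's system CPU
time; the helper name and threshold parameter are just for illustration;
RUSAGE_THREAD is Linux-specific and needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <linux/btrfs.h>

/* Returns true if resolving all references to @logical took more than
 * @limit_ms of kernel CPU time in this thread, i.e. the extent should be
 * blacklisted and never touched again. */
static bool logical_ino_too_slow(int fd, __u64 logical,
                                 struct btrfs_data_container *buf,
                                 size_t buf_size, long limit_ms)
{
        struct btrfs_ioctl_logical_ino_args args;
        struct rusage before, after;
        long ms;

        memset(&args, 0, sizeof(args));
        args.logical = logical;
        args.size = buf_size;
        args.inodes = (__u64)(uintptr_t)buf;

        getrusage(RUSAGE_THREAD, &before);
        ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args);  /* result not used here */
        getrusage(RUSAGE_THREAD, &after);

        /* System (kernel) CPU time this thread spent inside the call. */
        ms = (after.ru_stime.tv_sec - before.ru_stime.tv_sec) * 1000 +
             (after.ru_stime.tv_usec - before.ru_stime.tv_usec) / 1000;
        return ms > limit_ms;
}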

Obviously this is wrong, but the test is cheap, the false positive
rate is low, and the cost of deduping a bad extent is extremely high.
A bad extent can take 100 million times more kernel CPU than an average
extent (8 orders of magnitude slower).  Only 0.001% of extents in a
typical filesystem fail the test, but if we deduped those naively,
each one can take hours of kernel CPU time in many btrfs operations
(send, balance, scrub error path, even overwrite and delete seem to
use iterate_extent_inodes() at least in some cases).  I've seen this
phenomenon occur when using other dedupe tools like duperemove as well.

The vexing thing is that I haven't found any simple way to predict the
bad behavior, other than to run the existing code and measure how it
behaves while the performance is still tolerable.  Variables like file
size, extent inline reference count, snapshot reference count, extent
size, extent item data members all have poor correlations to the kernel
CPU time of LOGICAL_INO:  it seems to happen with equal probability on
large and small files, and extents with high and low reference counts.

One of the things on my todo list is to dive into the kernel code to
figure out what's going on, but the issue is easy to work around, so
I have been looking at the higher priority crashing/hanging/corrupting
bugs instead.

> > 8-orders-of-magnitude performance degradation in the bad cases), but
> > I found that it was useless: it wasn't proportional to the number of
> > extents, nor was it an upper or lower bound, nor could extents be sorted
> > by number of references using extent.refs as a key, nor could it predict
> > the amount of time it would take for iterate_extent_inodes() to run.
> > I guess I'm surprised you got a different result.
> >
> > If the 'refs' field is 1, the extent might have somewhere between 1
> > and 3500 root/inode/offset references.
> 
> Yes, that's due to snapshotting and the way it works.
> Creating a snapshot consists of COWing the root node of a tree only,
> which will not increment extent references (for data extents,
> referenced in leafs) if the tree has a height of 2 or more.
> 
> Example:
> 
> 1) You have a subvolume tree with 2 levels (root at level 1, leafs at level 0);
> 
> 2) You have a file there that refers to extent A, not shared with
> anyone else, and has a reference count of 1 in the extent item at the
> extent tree;
> 
> 3) After you snapshot the subvolume, extent A will still have 1
> reference count in its item at the extent tree.
> 
> If you do anything that changes the leaf that refers to extent A
> (resulting in it being COWed), in either tree, btrfs will take care of
> adding a new backreference to the extent and bump the reference count
> to 2 for extent A.
> 
> If, on the other hand, the source subvolume was really small,
> consisting of a single node/leaf (root at level 0), then you would get
> 2 references for extent A immediately after making the snapshot.
> 
> > But it's not a lower bound,
> > e.g. here is an extent on a random filesystem where extent_refs is a
> > big number:
> >
> >         item 87 key (1344172032 EXTENT_ITEM 4096) itemoff 11671 itemsize 24
> >                 refs 897 gen 2668358 flags DATA
> >
> > LOGICAL_INO_V2 only finds 170 extent references:
> >
> >         # btrfs ins log 1344172032 -P . | wc -l
> >         170
> 
> Pass "-s 65536", the default buffer is likely not large enough to all
> inode/root numbers.

The default buffer is SZ_64K.  When I hacked up 'btrfs ins log' to use
LOGICAL_INO_V2 I also increased the size to SZ_16M.  Even so, 64K is
enough for 2730 references (the previous limit with LOGICAL_INO v1),
and 2730 is more than 170 (or 897).

So I'm kind of stumped there.  If it works as you describe above
(and the description above is similar to my understanding of how it
works), then extent_item.refs should be a *lower* bound on the number
of references.  Snapshots will make the real number of references larger
than extent_item.refs, but cannot make it smaller (if an EXTENT_ITEM is
deleted in one of two subvols, the reference count in the EXTENT_DATA
would remain the same--incremented by the CoW, then decremented by
the delete).  Yet here is a case where that doesn't hold.

Or...another theory is that LOGICAL_INO is broken, or trying to be
"helpful" by doing unnecessary work to get the wrong answer (like
FIEMAP does).  If I look at this extent again, I see that

	btrfs ins dump-tree -t 12805 /dev/testfs | grep 'extent data disk byte 1344172032 ' | nl

is reporting 897 lines of output.  Interesting!  Up until this moment, I
had been assuming LOGICAL_INO was more or less correct, but maybe I should
reevaluate my previous experimental results with the opposite assumption.

e.g. the first lines of LOGICAL_INO output, sorted in inode/offset order:

	inode 25801552 offset 386469888 root 12805
	inode 25801552 offset 386469888 root 12801
	inode 29497105 offset 69632 root 12805
	inode 29497104 offset 69632 root 12805
	[...and 166 more...]

brute-force searching btrfs ins dump-tree shows 3 more references in inode
25801552, plus some bonus inodes that don't show up in the LOGICAL_INO
output at all:

        item 53 key (146777 EXTENT_DATA 22319104) itemoff 13342 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 3 key (25801552 EXTENT_DATA 128630784) itemoff 16071 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 110 key (25801552 EXTENT_DATA 384958464) itemoff 10400 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 82 key (25801552 EXTENT_DATA 385462272) itemoff 11884 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 175 key (25801552 EXTENT_DATA 385966080) itemoff 6955 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 77 key (25801552 EXTENT_DATA 386469888) itemoff 12149 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 60 key (25855558 EXTENT_DATA 13611008) itemoff 12820 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 53 key (25855558 EXTENT_DATA 14114816) itemoff 13421 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 106 key (29496169 EXTENT_DATA 69632) itemoff 10426 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 140 key (29496170 EXTENT_DATA 69632) itemoff 8531 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 175 key (29496171 EXTENT_DATA 69632) itemoff 6583 itemsize 53
                generation 2668358 type 1 (regular)
                extent data disk byte 1344172032 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
	[...etc, there are 897 of these.]

and...these are distinct, not logically or physically contiguous,
no compression, no data offset...no reason why LOGICAL_INO (v1 or v2)
should not report them.

> > Is there a good primer somewhere on how the reference counting works in
> > btrfs?
> 
> Dunno. Source code, and an old paper about btrfs:
> 
> https://domino.research.ibm.com/library/cyberdig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf
> 
> Maybe the wiki has more references.

I've read those.  I guess I misstated my question:  how is it *supposed*
to work, so I can compare that to what the source code (particularly
the LOGICAL_INO implementation) is *actually* doing, and possibly
correct it?

If it ever gets up to the top of my priority list I'm fully aware I
might end up writing that document.

> > > +             ret = -ENOENT;
> > > +             goto out;
> > > +     }
> > > +     btrfs_release_path(tmp_path);
> > > +
> > >       /*
> > >        * Setup the clone roots.
> > >        */
> > > --
> > > 2.11.0
> > >
>
Filipe Manana Nov. 25, 2019, 3:31 p.m. UTC | #6
On Sun, Nov 24, 2019 at 1:33 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sat, Nov 23, 2019 at 01:33:33PM +0000, Filipe Manana wrote:
> > On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Backreference walking, which is used by send to figure if it can issue
> > > > clone operations instead of write operations, can be very slow and use too
> > > > much memory when extents have many references. This change simply skips
> > > > backreference walking when an extent has more than 64 references, in which
> > > > case we fall back to a write operation instead of a clone operation. This
> > > > limit is conservative and in practice I observed no significant slowdown
> > > > with up to 100 references and still low memory usage up to that limit.
> > > >
> > > > This is a temporary workaround until there are speedups in the backref
> > > > walking code, and as such it does not attempt to add extra interfaces or
> > > > knobs to tweak the threshold.
> > > >
> > > > Reported-by: Atemu <atemu.main@gmail.com>
> > > > Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
> > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > > > index 123ac54af071..518ec1265a0c 100644
> > > > --- a/fs/btrfs/send.c
> > > > +++ b/fs/btrfs/send.c
> > > > @@ -25,6 +25,14 @@
> > > >  #include "compression.h"
> > > >
> > > >  /*
> > > > + * Maximum number of references an extent can have in order for us to attempt to
> > > > + * issue clone operations instead of write operations. This currently exists to
> > > > + * avoid hitting limitations of the backreference walking code (taking a lot of
> > > > + * time and using too much memory for extents with large number of references).
> > > > + */
> > > > +#define SEND_MAX_EXTENT_REFS 64
> > > > +
> > > > +/*
> > > >   * A fs_path is a helper to dynamically build path names with unknown size.
> > > >   * It reallocates the internal buffer on demand.
> > > >   * It allows fast adding of path elements on the right side (normal path) and
> > > > @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
> > > >       struct clone_root *cur_clone_root;
> > > >       struct btrfs_key found_key;
> > > >       struct btrfs_path *tmp_path;
> > > > +     struct btrfs_extent_item *ei;
> > > >       int compressed;
> > > >       u32 i;
> > > >
> > > > @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
> > > >       ret = extent_from_logical(fs_info, disk_byte, tmp_path,
> > > >                                 &found_key, &flags);
> > > >       up_read(&fs_info->commit_root_sem);
> > > > -     btrfs_release_path(tmp_path);
> > > >
> > > >       if (ret < 0)
> > > >               goto out;
> > > > @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
> > > >               goto out;
> > > >       }
> > > >
> > > > +     ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> > > > +                         struct btrfs_extent_item);
> > > > +     /*
> > > > +      * Backreference walking (iterate_extent_inodes() below) is currently
> > > > +      * too expensive when an extent has a large number of references, both
> > > > +      * in time spent and used memory. So for now just fallback to write
> > > > +      * operations instead of clone operations when an extent has more than
> > > > +      * a certain amount of references.
> > > > +      */
> > > > +     if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
> > >
> > > So...does this...work?
> >
> > I wouldn't send it if it didn't work, at least for the case reported.
> >
> > There's even a test case for this, btrfs/130 from fstests:
> >
> > https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/btrfs/130
> >
> > It used to take several hours to complete on a test vm I use
> > frequently, now it completes in 1 or 2 seconds.
>
> To be clear, I'm not doubting that the patch fixes a test case.
> It's more like I regularly encounter a lot of performance problems from
> iterate_extent_inodes() and friends, and only a tiny fraction of them
> look like that test case.
>
> TL;DR it looks like LOGICAL_INO has more bugs than I knew about before
> today,

Which are? Where were they reported?

> and that throws a bunch of my previous work on this topic into
> doubt.  I have been viewing btrfs through a LOGICAL_INO-shaped lens.
>
> > While that is about an extent shared only within the same file, the
> > same happens if the extent is reflinked to many other files (either
> > all part of the same subvolume or belonging to different subvolumes).
>
> OK, then this might be a different bug than the ones I'm seeing, or
> maybe I am hitting a variety of different bugs at once.
>
> > > I ask because I looked at this a few years ago as a way to spend less
> > > time doing LOGICAL_INO calls during dedupe (and especially avoid the
> >
> > To give some context to those not familiar with you, you are
> > mentioning a user space dedupe tool that doesn't call the btrfs dedup
> > ioctl, and calls the clone ioctl directly, while also using the
> > LOGICAL_INO ioctl.
>
> Uhhh...bees uses the dedup ioctl, and does not call the clone ioctl.
>
> Here is some more correct context:
>
> bees uses LOGICAL_INO_V2 for two purposes:  the obvious one is to
> discover the set of references to an extent so we can pass them all to
> the dedupe ioctl.  The other purpose is to estimate the kernel's future
> performance when manipulating that extent using iterate_extent_inodes()
> and related functions.
>
> The performance estimate function is:  run the LOGICAL_INO(_V2) ioctl
> and measure the CPU time the kernel thread spent.  If the kernel spends
> more than 100 ms on the extent, then bees puts the extent on a blacklist
> and will never touch the extent again.
>
> Obviously this is wrong, but the test is cheap, the false positive
> rate is low, and the cost of deduping a bad extent is extremely high.
> A bad extent can take 100 million times more kernel CPU than an average
> extent (8 orders of magnitude slower).  Only 0.001% of extents in a
> typical filesystem fail the test, but if we deduped those naively,
> each one can take hours of kernel CPU time in many btrfs operations
> (send, balance, scrub error path, even overwrite and delete seem to
> use iterate_extent_inodes() at least in some cases).  I've seen this
> phenomenon occur when using other dedupe tools like duperemove as well.
>
> The vexing thing is that I haven't found any simple way to predict the
> bad behavior, other than to run the existing code and measure how it
> behaves while the performance is still tolerable.  Variables like file
> size, extent inline reference count, snapshot reference count, extent
> size, extent item data members all have poor correlations to the kernel
> CPU time of LOGICAL_INO:  it seems to happen with equal probability on
> large and small files, and extents with high and low reference counts.
>
> One of the things on my todo list is to dive into the kernel code to
> figure out what's going on, but the issue is easy to work around, so
> I have been looking at the higher priority crashing/hanging/corrupting
> bugs instead.
>
> > > 8-orders-of-magnitude performance degradation in the bad cases), but
> > > I found that it was useless: it wasn't proportional to the number of
> > > extents, nor was it an upper or lower bound, nor could extents be sorted
> > > by number of references using extent.refs as a key, nor could it predict
> > > the amount of time it would take for iterate_extent_inodes() to run.
> > > I guess I'm surprised you got a different result.
> > >
> > > If the 'refs' field is 1, the extent might have somewhere between 1
> > > and 3500 root/inode/offset references.
> >
> > Yes, that's due to snapshotting and the way it works.
> > Creating a snapshot consists of COWing the root node of a tree only,
> > which will not increment extent references (for data extents,
> > referenced in leafs) if the tree has a height of 2 or more.
> >
> > Example:
> >
> > 1) You have a subvolume tree with 2 levels (root at level 1, leafs at level 0);
> >
> > 2) You have a file there that refers to extent A, not shared with
> > anyone else, and has a reference count of 1 in the extent item at the
> > extent tree;
> >
> > 3) After you snapshot the subvolume, extent A will still have 1
> > reference count in its item at the extent tree.
> >
> > If you do anything that changes the leaf that refers to extent A
> > (resulting in it being COWed), in either tree, btrfs will take care of
> > adding a new backreference to the extent and bump the reference count
> > to 2 for extent A.
> >
> > If, on the other hand, the source subvolume was really small,
> > consisting of a single node/leaf (root at level 0), then you would get
> > 2 references for extent A immediately after making the snapshot.
> >
> > > But it's not a lower bound,
> > > e.g. here is an extent on a random filesystem where extent_refs is a
> > > big number:
> > >
> > >         item 87 key (1344172032 EXTENT_ITEM 4096) itemoff 11671 itemsize 24
> > >                 refs 897 gen 2668358 flags DATA
> > >
> > > LOGICAL_INO_V2 only finds 170 extent references:
> > >
> > >         # btrfs ins log 1344172032 -P . | wc -l
> > >         170
> >
> > Pass "-s 65536", the default buffer is likely not large enough to all
> > inode/root numbers.
>
> The default buffer is SZ_64K.  When I hacked up 'btrfs ins log' to use

SZ_4K: https://github.com/kdave/btrfs-progs/blob/master/cmds/inspect.c#L151

> LOGICAL_INO_V2 I also increased the size to SZ_16M.  Even so, 64K is
> enough for 2730 references (the previous limit with LOGICAL_INO v1),
> and 2730 is more than 170 (or 897).

So I just made a test that creates a large tree where only one file has
an extent, and then snapshots the default subvolume 1000 times:

https://friendpaste.com/4s4yt94l0Z2utwiPHFABbE

When calling:

btrfs inspect-internal logical-resolve -P $extent_bytenr $mnt | wc -l

It gave 170 lines, exactly what you are getting. After passing "-s
65536", it gives 1001 (1000 snapshots + 1 for the subvolume).
Seems like too much coincidence?

>
> So I'm kind of stumped there.  If it works as you describe above
> (and the description above is similar to my understanding of how it
> works), then extent_item.refs should be a *lower* bound on the number
> of references.  Snapshots will make the real number of references larger
> than extent_item.refs, but cannot make it smaller (if an EXTENT_ITEM is
> deleted in one of two subvols, the reference count in the EXTENT_DATA
> would remain the same--incremented by the CoW, then decremented by
> the delete).  Yet here is a case where that doesn't hold.
>
> Or...another theory is that LOGICAL_INO is broken, or trying to be
> "helpful" by doing unnecessary work to get the wrong answer (like
> FIEMAP does).  If I look at this extent again, I see that

Hum, what wrong answers does fiemap give? Was this reported before somewhere?

I don't recall seeing logical_ino (either v1 or v2) bug reports for a
long time, and I haven't spent time looking at it in a long time either,
but sure, it's entirely possible something is buggy there.

>
>         btrfs ins dump-tree -t 12805 /dev/testfs | grep 'extent data disk byte 1344172032 ' | nl
>
> is reporting 897 lines of output.  Interesting!  Up until this moment, I
> had been assuming LOGICAL_INO was more or less correct, but maybe I should
> reevaluate my previous experimental results with the opposite assumption.
>
> e.g. the first lines of LOGICAL_INO output, sorted in inode/offset order:
>
>         inode 25801552 offset 386469888 root 12805
>         inode 25801552 offset 386469888 root 12801
>         inode 29497105 offset 69632 root 12805
>         inode 29497104 offset 69632 root 12805
>         [...and 166 more...]
>
> brute-force searching btrfs ins dump-tree shows 3 more references in inode
> 25801552, plus some bonus inodes that don't show up in the LOGICAL_INO
> output at all:
>
>         item 53 key (146777 EXTENT_DATA 22319104) itemoff 13342 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 3 key (25801552 EXTENT_DATA 128630784) itemoff 16071 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 110 key (25801552 EXTENT_DATA 384958464) itemoff 10400 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 82 key (25801552 EXTENT_DATA 385462272) itemoff 11884 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 175 key (25801552 EXTENT_DATA 385966080) itemoff 6955 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 77 key (25801552 EXTENT_DATA 386469888) itemoff 12149 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 60 key (25855558 EXTENT_DATA 13611008) itemoff 12820 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 53 key (25855558 EXTENT_DATA 14114816) itemoff 13421 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 106 key (29496169 EXTENT_DATA 69632) itemoff 10426 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 140 key (29496170 EXTENT_DATA 69632) itemoff 8531 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         item 175 key (29496171 EXTENT_DATA 69632) itemoff 6583 itemsize 53
>                 generation 2668358 type 1 (regular)
>                 extent data disk byte 1344172032 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>         [...etc, there are 897 of these.]
>
> and...these are distinct, not logically or physically contiguous,
> no compression, no data offset...no reason why LOGICAL_INO (v1 or v2)
> should not report them.
>
> > > Is there a good primer somewhere on how the reference counting works in
> > > btrfs?
> >
> > Dunno. Source code, and an old paper about btrfs:
> >
> > https://domino.research.ibm.com/library/cyberdig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf
> >
> > Maybe the wiki has more references.
>
> I've read those.  I guess I misstated my question:  how is it *supposed*
> to work, so I can compare that to what the source code (particularly
> the LOGICAL_INO implementation) is *actually* doing, and possibly
> correct it?
>
> If it ever gets up to the top of my priority list I'm fully aware I
> might end up writing that document.
>
> > > > +             ret = -ENOENT;
> > > > +             goto out;
> > > > +     }
> > > > +     btrfs_release_path(tmp_path);
> > > > +
> > > >       /*
> > > >        * Setup the clone roots.
> > > >        */
> > > > --
> > > > 2.11.0
> > > >
> >
Martin Raiber Nov. 26, 2019, 5:52 p.m. UTC | #7
On 30.10.2019 13:23 fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Backreference walking, which is used by send to figure if it can issue
> clone operations instead of write operations, can be very slow and use too
> much memory when extents have many references. This change simply skips
> backreference walking when an extent has more than 64 references, in which
> case we fall back to a write operation instead of a clone operation. This
> limit is conservative and in practice I observed no significant slowdown
> with up to 100 references and still low memory usage up to that limit.
>
> This is a temporary workaround until there are speedups in the backref
> walking code, and as such it does not attempt to add extra interfaces or
> knobs to tweak the threshold.

Thanks for the patch!

I did some further deliberation on the problem, and for me the best
short-term solution (apart from your patch) for the send clone source
detection would be to offload it to userspace, i.e. a flag like
"--no-data"/BTRFS_SEND_FLAG_NO_FILE_DATA that disables clone source
detection.
Userspace can then take each BTRFS_SEND_C_UPDATE_EXTENT and check whether
it is in the clone sources/on the send target. If it is a dedup tool it
can use its dedup database, a cache, or BTRFS_IOC_LOGICAL_INO. Another
advantage is that user space can do this in multiple threads.
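For illustration, issuing such a metadata-only send from user space could
look like the rough, untested sketch below.  BTRFS_SEND_FLAG_NO_FILE_DATA
already exists in linux/btrfs.h and makes send emit
BTRFS_SEND_C_UPDATE_EXTENT commands instead of file data; parsing the
resulting stream on the consumer side is not shown:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Usage: ./send-nodata <read-only snapshot dir> <output file> */
int main(int argc, char **argv)
{
        int subvol_fd = open(argv[1], O_RDONLY | O_DIRECTORY);
        int out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct btrfs_ioctl_send_args args;

        if (subvol_fd < 0 || out_fd < 0) {
                perror("open");
                return 1;
        }
        memset(&args, 0, sizeof(args));
        args.send_fd = out_fd;
        args.flags = BTRFS_SEND_FLAG_NO_FILE_DATA;
        /* No parent_root and no clone_sources: a full stream where file
         * contents are described by UPDATE_EXTENT commands only. */

        if (ioctl(subvol_fd, BTRFS_IOC_SEND, &args) < 0) {
                perror("BTRFS_IOC_SEND");
                return 1;
        }
        close(out_fd);
        close(subvol_fd);
        return 0;
}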

The only problem I can think of is that send prevents subvol removal and
dedup while it is running, but user space may not be finished with the
actual send stream once send has finished outputting the metadata. So
there may be some issues there, but not if one has control over when
removal/dedup happens.

>
> Reported-by: Atemu <atemu.main@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 123ac54af071..518ec1265a0c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -25,6 +25,14 @@
>  #include "compression.h"
>  
>  /*
> + * Maximum number of references an extent can have in order for us to attempt to
> + * issue clone operations instead of write operations. This currently exists to
> + * avoid hitting limitations of the backreference walking code (taking a lot of
> + * time and using too much memory for extents with large number of references).
> + */
> +#define SEND_MAX_EXTENT_REFS	64
> +
> +/*
>   * A fs_path is a helper to dynamically build path names with unknown size.
>   * It reallocates the internal buffer on demand.
>   * It allows fast adding of path elements on the right side (normal path) and
> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	struct clone_root *cur_clone_root;
>  	struct btrfs_key found_key;
>  	struct btrfs_path *tmp_path;
> +	struct btrfs_extent_item *ei;
>  	int compressed;
>  	u32 i;
>  
> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>  	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>  				  &found_key, &flags);
>  	up_read(&fs_info->commit_root_sem);
> -	btrfs_release_path(tmp_path);
>  
>  	if (ret < 0)
>  		goto out;
> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>  		goto out;
>  	}
>  
> +	ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> +			    struct btrfs_extent_item);
> +	/*
> +	 * Backreference walking (iterate_extent_inodes() below) is currently
> +	 * too expensive when an extent has a large number of references, both
> +	 * in time spent and used memory. So for now just fallback to write
> +	 * operations instead of clone operations when an extent has more than
> +	 * a certain amount of references.
> +	 */
> +	if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	btrfs_release_path(tmp_path);
> +
>  	/*
>  	 * Setup the clone roots.
>  	 */
Zygo Blaxell Nov. 26, 2019, 11:02 p.m. UTC | #8
On Mon, Nov 25, 2019 at 03:31:59PM +0000, Filipe Manana wrote:
> On Sun, Nov 24, 2019 at 1:33 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Sat, Nov 23, 2019 at 01:33:33PM +0000, Filipe Manana wrote:
> > > On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > To be clear, I'm not doubting that the patch fixes a test case.
> > It's more like I regularly encounter a lot of performance problems from
> > iterate_extent_inodes() and friends, and only a tiny fraction of them
> > look like that test case.
> >
> > TL;DR it looks like LOGICAL_INO has more bugs than I knew about before
> > today,
> 
> Which are? Where were they reported?

The first and least important bug is that LOGICAL_INO is slow.  This is
well known, although maybe the magnitude of the problem is not.  In one
dedupe test corpus I can produce (with bees or duperemove) a filesystem
with extents that have only a few thousand references, but take 45
kernel CPU minutes to process (each).  On this filesystem image I can
find backrefs by brute force processing 'btrfs inspect dump-tree' output
with a Perl script in only 8 minutes, and the problematic extents only
have a few hundred to a few thousand references.  Other extents have half
a million references and LOGICAL_INO can find them all in a few seconds
(including both CPU and IO time).  I don't know what the kernel's doing,
but whatever it is, it's *far* beyond any reasonable amount of processing
for the metadata.

The second bug is a bit more subtle.

I've been mining the last year of crash and corruption data and
tracking about 7 distinct frequent failure behaviors in 5.0..5.3 kernels.
It turns out that LOGICAL_INO is implicated in half of these as well.
I've been running some tests to confirm this and exclude other factors
for the past month.  I was intending to post something about it at the
conclusion of testing in a few weeks, but I have some actionable
information today, so here we are...

The data so far says that a host in the wild running LOGICAL_INO and
balance at the same time is 150x more likely to crash compared to hosts
that do not run LOGICAL_INO at all, or hosts that run either LOGICAL_INO
or balance but not at the same time.

Simultaneous LOGICAL_INO + balance correlates with these three failure
modes:

	https://www.spinics.net/lists/linux-btrfs/msg89425.html

	https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88892.html

	[metadata corruption on 5.2 that looks the same as the previously
	reported corruption on 5.1, except 5.2 has a write time corruption
	detector that aborts the transaction instead of damaging the
	filesystem]

A host in our fleet that rarely or never runs LOGICAL_INO has few or no
problems on any kernel from 5.0 to 5.3.  When crashes do occur, they are
mostly random non-btrfs bugs (CIFS, wifi drivers, GPU, hardware failure,
power failure, etc).  We did have a few 5.2.x kernels hang in testing
due to the writeback bug fixed in 5.2.15.

The specific application we are running that uses LOGICAL_INO is bees.
bees spends more time running LOGICAL_INO than any other single activity
(other than sleeping).  Other dedupe tools don't seem to have an impact
on system uptime, which rules out EXTENT_SAME.  If there were a bug in a
POSIX API like open() or read(), we would all know about it.  That leaves
LOGICAL_INO, TREE_SEARCH_V2, and INO_PATHS as the 3 ioctls that bees calls
with unusual frequency compared to other software, but TREE_SEARCH_V2
and INO_PATHS don't show up in BUG_ON stack traces while LOGICAL_INO does.

I don't have a nice minimal reproducer yet, but "run bees and btrfs
balance start on a big, busy filesystem" triggers some kind of crash
(BUG_ON or transaction abort) for me in less than a day, 3x a day on
some hosts.  I think that a good reproducer would be a simple program
that walks the extent tree and feeds every extent bytenr in random order
to LOGICAL_INO in multiple threads while a balance is running, to see
if that makes the crashes happen faster.  I have not tried this yet.
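
For concreteness, the reproducer I have in mind is roughly the untested
sketch below, written against the LOGICAL_INO interface declared in
linux/btrfs.h.  It skips the extent tree walk and just reads one extent
bytenr per line on stdin (e.g. piped in from 'btrfs inspect-internal
dump-tree -t extent' output, shuffled beforehand for the random-order
part), then hammers LOGICAL_INO from several threads while a balance is
started separately:

/* logical_ino_stress.c: untested sketch of the reproducer described above.
 * Reads one extent bytenr per line on stdin and calls LOGICAL_INO on it
 * from several threads; run a balance on the same filesystem in parallel.
 * The v1 ioctl caps the result buffer at 64K; LOGICAL_INO_V2 allows more.
 * Build: gcc -O2 -pthread -o logical_ino_stress logical_ino_stress.c
 * Run (needs CAP_SYS_ADMIN): ./logical_ino_stress /mnt/btrfs < bytenrs.txt
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

#define NTHREADS 8
#define CONTAINER_SIZE 65536

static int mnt_fd;
static pthread_mutex_t stdin_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	struct btrfs_data_container *inodes = calloc(1, CONTAINER_SIZE);
	struct btrfs_ioctl_logical_ino_args loi;
	char line[64];
	int got;

	(void)arg;
	for (;;) {
		pthread_mutex_lock(&stdin_lock);
		got = fgets(line, sizeof(line), stdin) != NULL;
		pthread_mutex_unlock(&stdin_lock);
		if (!got)
			break;

		memset(&loi, 0, sizeof(loi));
		loi.logical = strtoull(line, NULL, 0);
		loi.size = CONTAINER_SIZE;
		loi.inodes = (uintptr_t)inodes;
		/* Errors (e.g. ENOENT for bytenrs relocated by the balance)
		 * are expected and ignored; we only care about the load. */
		ioctl(mnt_fd, BTRFS_IOC_LOGICAL_INO, &loi);
	}
	free(inodes);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <btrfs mount point> < bytenrs\n", argv[0]);
		return 1;
	}
	mnt_fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (mnt_fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}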

I don't know how the 5.1 and 5.2 metadata corruption is connected,
other than the corruption immediately stops happening when we stop
running bees on the test host, or arrange for bees and balance to run
at separate times.

The bug is possibly as old as 4.12--based on earlier crash data, it
might have been the 4th most frequent crash case in kernels prior to 5.0.
The first three most frequent cases in 4.14 were flushoncommit deadlocks,
a reclaim/commit deadlock, and the rename/dedupe deadlock, all fixed
by 5.0.  LOGICAL_INO-related failures are now the most common crash case
in my data up to 5.2.

I'm running tests on 5.3 kernels, which are crashing in new, different,
and possibly unrelated ways.  LOGICAL_INO + balance is not the fastest
way to crash a 5.3 kernel, but I don't know yet whether or not it is
still a *possible* way to crash a 5.3 kernel.

That's all I really know about this bug so far:  there seems to be a
bug, it's not a new bug, and simultaneous LOGICAL_INO + balance seems
to trigger it.

> > > > LOGICAL_INO_V2 only finds 170 extent references:
> > > >
> > > >         # btrfs ins log 1344172032 -P . | wc -l
> > > >         170
> > >
> > > Pass "-s 65536", the default buffer is likely not large enough to all
> > > inode/root numbers.
> >
> > The default buffer is SZ_64K.  When I hacked up 'btrfs ins log' to use
> 
> SZ_4K: https://github.com/kdave/btrfs-progs/blob/master/cmds/inspect.c#L151

...which is not the SZ_64K I was referring to.  Oops!

OK, my bad.  I forgot to double check this against the hand-rolled tools
I usually use.

With the right buffer size, `btrfs ins log` produces 1794 references,
which is the right number given the extent appears in 2 snapshots.

> > So I'm kind of stumped there.  If it works as you describe above
> > (and the description above is similar to my understanding of how it
> > works), then extent_item.refs should be a *lower* bound on the number
> > of references.  Snapshots will make the real number of references larger
> > than extent_item.refs, but cannot make it smaller (if an EXTENT_ITEM is
> > deleted in one of two subvols, the reference count in the EXTENT_DATA
> > would remain the same--incremented by the CoW, then decremented by
> > the delete).  Yet here is a case where that doesn't hold.
> >
> > Or...another theory is that LOGICAL_INO is broken, or trying to be
> > "helpful" by doing unnecessary work to get the wrong answer (like
> > FIEMAP does).  If I look at this extent again, I see that
> 
> Hum, what wrong answers does fiemap gives? Was this reported before somewhere?

FIEMAP coalesces physically adjacent extents as opposed to reporting
btrfs extent item boundaries, and it adjusts the first extent record to
match the beginning of the range parameter given to FIEMAP instead of
reporting the beginning of the extent.  This does interesting things
when the extents are compressed or if you try to iterate over them in
reverse order.
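
To make that concrete, the behavior shows up with nothing more than a
plain FS_IOC_FIEMAP loop.  The following is a minimal, untested sketch
(generic fiemap interface from linux/fiemap.h, nothing btrfs specific)
that prints whatever the kernel reports for a given start offset, so the
coalescing and the adjusted first extent become visible by varying that
offset:

/* fiemap_dump.c: minimal untested sketch.  Prints the extents FIEMAP reports
 * for a file starting at a given offset; vary the offset (or compare against
 * the extent items in the btree) to see the merging and the adjusted first
 * extent described above.
 * Build: gcc -O2 -o fiemap_dump fiemap_dump.c
 * Run:   ./fiemap_dump <file> [start-offset]
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 256

int main(int argc, char **argv)
{
	size_t sz = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);
	unsigned int i;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file> [start-offset]\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fm->fm_start = argc > 2 ? strtoull(argv[2], NULL, 0) : 0;
	fm->fm_length = ~0ULL;			/* to the end of the file */
	fm->fm_extent_count = MAX_EXTENTS;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu physical %llu len %llu flags 0x%x%s%s%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length, fe->fe_flags,
		       (fe->fe_flags & FIEMAP_EXTENT_ENCODED) ? " encoded" : "",
		       (fe->fe_flags & FIEMAP_EXTENT_SHARED) ? " shared" : "",
		       (fe->fe_flags & FIEMAP_EXTENT_LAST) ? " last" : "");
	}
	close(fd);
	return 0;
}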

I presume that behavior is either by design, or just an awful compromise
to make the output conform to the limited capabilities of the FIEMAP
result format.  This would mean FIEMAP's behavior is not really a bug,
and I haven't reported it as one; however, it does mean FIEMAP is not an
appropriate tool for applications that need accurate information about
physical structure on btrfs.

Computing the 'shared' flag for FIEMAP is prohibitively expensive
at scale, and there's no way to skip that work with the current
FIEMAP interface if the application is not interested.

I switched to TREE_SEARCH_V2 to get extent item data:  it's faster, more
accurate, and more useful than FIEMAP, especially for compressed extents.
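
For reference, the shape of that is roughly the untested sketch below:
it walks the EXTENT_DATA items of a single inode with
BTRFS_IOC_TREE_SEARCH_V2 (root only, single search call, little-endian
host assumed) and prints the raw on-disk fields, including the
compressed on-disk size and compression type that FIEMAP does not
report:

/* extent_items.c: untested sketch.  Dumps the EXTENT_DATA items of one file
 * through BTRFS_IOC_TREE_SEARCH_V2 (needs root).  Single search call only;
 * a real tool would loop, advancing min_offset past the last returned item.
 * Item fields are little-endian on disk, so a little-endian host is assumed.
 * Build: gcc -O2 -o extent_items extent_items.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>

#define RESULT_BUF_SIZE (1024 * 1024)

int main(int argc, char **argv)
{
	struct btrfs_ioctl_search_args_v2 *args =
		calloc(1, sizeof(*args) + RESULT_BUF_SIZE);
	struct btrfs_ioctl_search_key *sk = &args->key;
	struct stat st;
	unsigned int i;
	char *p;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s <file on btrfs>\n", argv[0]);
		return 1;
	}
	fstat(fd, &st);

	args->buf_size = RESULT_BUF_SIZE;
	sk->tree_id = 0;			/* 0 = the subvolume tree of this fd */
	sk->min_objectid = sk->max_objectid = st.st_ino;
	sk->min_type = sk->max_type = BTRFS_EXTENT_DATA_KEY;
	sk->max_offset = (__u64)-1;
	sk->max_transid = (__u64)-1;
	sk->nr_items = (__u32)-1;

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args) < 0) {
		perror("BTRFS_IOC_TREE_SEARCH_V2");
		return 1;
	}

	p = (char *)args->buf;
	for (i = 0; i < sk->nr_items; i++) {
		struct btrfs_ioctl_search_header sh;
		struct btrfs_file_extent_item fi;

		memcpy(&sh, p, sizeof(sh));
		p += sizeof(sh);
		memset(&fi, 0, sizeof(fi));
		memcpy(&fi, p, sh.len < sizeof(fi) ? sh.len : sizeof(fi));
		p += sh.len;

		if (fi.type == BTRFS_FILE_EXTENT_INLINE)
			printf("file offset %llu: inline extent, item len %u\n",
			       (unsigned long long)sh.offset, sh.len);
		else
			printf("file offset %llu: disk_bytenr %llu disk_num_bytes %llu "
			       "extent offset %llu num_bytes %llu compression %u\n",
			       (unsigned long long)sh.offset,
			       (unsigned long long)fi.disk_bytenr,
			       (unsigned long long)fi.disk_num_bytes,
			       (unsigned long long)fi.offset,
			       (unsigned long long)fi.num_bytes,
			       fi.compression);
	}
	return 0;
}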

If anyone's interested in improving the generic FIEMAP interface to
handle btrfs quirks better, I can write up a proposal--it could be handy
to get this data without having to be root as TREE_SEARCH_V2 requires.
Alternatively, btrfs could have an unprivileged form of TREE_SEARCH_V2
that limits the min/max root/objectid ranges to the inode of the file
descriptor that is passed to the ioctl.
Filipe Manana Nov. 27, 2019, 11:42 a.m. UTC | #9
On Tue, Nov 26, 2019 at 11:03 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Nov 25, 2019 at 03:31:59PM +0000, Filipe Manana wrote:
> > On Sun, Nov 24, 2019 at 1:33 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Sat, Nov 23, 2019 at 01:33:33PM +0000, Filipe Manana wrote:
> > > > On Sat, Nov 23, 2019 at 5:29 AM Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > >
> > > > > On Wed, Oct 30, 2019 at 12:23:01PM +0000, fdmanana@kernel.org wrote:
> > > To be clear, I'm not doubting that the patch fixes a test case.
> > > It's more like I regularly encounter a lot of performance problems from
> > > iterate_extent_inodes() and friends, and only a tiny fraction of them
> > > look like that test case.
> > >
> > > TL;DR it looks like LOGICAL_INO has more bugs than I knew about before
> > > today,
> >
> > Which are? Where were they reported?
>
> The first and least important bug is that LOGICAL_INO is slow.  This is
> well known, although maybe the magnitude of the problem is not.  In one
> dedupe test corpus I can produce (with bees or duperemove) a filesystem
> with extents that have only a few thousand references, but take 45

Walking the backreferences was always slow, and having a few hundred
references for an extent is enough to notice a significant slowdown
already.
That affects the logical_ino ioctls, send, etc.

Worse still, for logical_ino, we do the backreference walking while
either holding a transaction (if one is currently running) or
holding a semaphore (commit_root_sem) to prevent transaction commits
from happening (to ensure consistency and that things don't disappear
while we're using them).
So a slow backreference walk can hang a lot of other
tasks that are trying to commit a transaction.

There was someone working on improving backreference walking (Mark),
but I think he's not working on btrfs anymore, and I don't know what
happened to that work (if it was ever finished, etc).

> kernel CPU minutes to process (each).  On this filesystem image I can
> find backrefs by brute force processing 'btrfs inspect dump-tree' output
> with a Perl script in only 8 minutes, and the problematic extents only
> have a few hundred to a few thousand references.  Other extents have half
> a million references and LOGICAL_INO can find them all in a few seconds
> (including both CPU and IO time).  I don't know what the kernel's doing,
> but whatever it is, it's *far* beyond any reasonable amount of processing
> for the metadata.
>
> The second bug is a bit more subtle.
>
> I've been mining the last year of crash and corruption data and
> tracking about 7 distict frequent failure behaviors in 5.0..5.3 kernels.
> It turns out that LOGICAL_INO is implicated in half of these as well.
> I've been running some tests to confirm this and exclude other factors
> for the past month.  I was intending to post something about it at the
> conclusion of testing a few weeks, but I have some actionable information
> today, so here we are...
>
> The data so far says that a host in the wild running LOGICAL_INO and
> balance at the same time is 150x more likely to crash compared to hosts
> that do not run LOGICAL_INO at all, or hosts that run either LOGICAL_INO
> or balance but not at the same time.
>
> Simultaneous LOGICAL_INO + balance correlates with these three failure
> modes:
>
>         https://www.spinics.net/lists/linux-btrfs/msg89425.html
>
>         https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88892.html

For the bug involving btrfs_search_old_slot(), try the following fix:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efad8a853ad2057f96664328a0d327a05ce39c76

Both reports were made before that fix was posted/merged.

>
>         [metadata corruption on 5.2 that looks the same as the previously
>         reported corruption on 5.1, except 5.2 has a write time corruption
>         detector that aborts the transaction instead of damaging the
>         filesystem]
>
> A host in our fleet that rarely or never runs LOGICAL_INO has few or no
> problems on any kernel from 5.0 to 5.3.  When crashes do occur, they are
> mostly random non-btrfs bugs (CIFS, wifi drivers, GPU, hardware failure,
> power failure, etc).  We did have a few 5.2.x kernels hang in testing
> due to the writeback bug fixed in 5.2.15.
>
> The specific application we are running that uses LOGICAL_INO is bees.
> bees spends more time running LOGICAL_INO than any other single activity
> (other than sleeping).  Other dedupe tools don't seem to have an impact
> on system uptime, which rules out EXTENT_SAME.  If there was a bug in
> POSIX API like open() or read() we would all know about it.  That leaves
> LOGICAL_INO, TREE_SEARCH_V2, and INO_PATHS as the 3 ioctls that bees calls
> with unusual frequency compared to other software, but TREE_SEARCH_V2
> and INO_PATHS don't show up in BUG_ON stack traces while LOGICAL_INO does.
>
> I don't have a nice minimal reproducer yet, but "run bees and btrfs
> balance start on a big, busy filesystem" triggers some kind of crash
> (BUG_ON or transaction abort) for me in less than a day, 3x a day on
> some hosts.  I think that a good reproducer would be a simple program
> that walks the extent tree and feeds every extent bytenr in random order
> to LOGICAL_INO in multiple threads while a balance is running, to see
> if that makes the crashes happen faster.  I have not tried this yet.
>
> I don't know how the 5.1 and 5.2 metadata corruption is connected,
> other than the corruption immediately stops happening when we stop
> running bees on the test host, or arrange for bees and balance to run
> at separate times.
>
> The bug is possibly as old as 4.12--based on earlier crash data, it
> might have been the 4th most frequent crash case in kernels prior to 5.0.
> The first three most frequent cases in 4.14 were flushoncommit deadlocks,
> a reclaim/commit deadlock, and the rename/dedupe deadlock, all fixed
> by 5.0.  LOGICAL_INO-related failures are now the most common crash case
> in my data up to 5.2.
>
> I'm running tests on 5.3 kernels, which are crashing in new, different,
> and possibly unrelated ways.  LOGICAL_INO + balance is not the fastest
> way to crash a 5.3 kernel, but I don't know yet whether or not it is
> still a *possible* way to crash a 5.3 kernel.
>
> That's all I really know about this bug so far:  there seems to be a
> bug, it's not a new bug, and simultaneous LOGICAL_INO + balance seems
> to trigger it.
>
> > > > > LOGICAL_INO_V2 only finds 170 extent references:
> > > > >
> > > > >         # btrfs ins log 1344172032 -P . | wc -l
> > > > >         170
> > > >
> > > > Pass "-s 65536", the default buffer is likely not large enough to all
> > > > inode/root numbers.
> > >
> > > The default buffer is SZ_64K.  When I hacked up 'btrfs ins log' to use
> >
> > SZ_4K: https://github.com/kdave/btrfs-progs/blob/master/cmds/inspect.c#L151
>
> ...which is not the SZ_64K I was referring to.  Oops!
>
> OK, my bad.  I forgot to double check this against the hand-rolled tools
> I usually use.
>
> With the right buffer size, `btrfs ins log` produces 1794 references,
> which is the right number given the extent appears in 2 snapshots.

Ok, one less bug.

>
> > > So I'm kind of stumped there.  If it works as you describe above
> > > (and the description above is similar to my understanding of how it
> > > works), then extent_item.refs should be a *lower* bound on the number
> > > of references.  Snapshots will make the real number of references larger
> > > than extent_item.refs, but cannot make it smaller (if an EXTENT_ITEM is
> > > deleted in one of two subvols, the reference count in the EXTENT_DATA
> > > would remain the same--incremented by the CoW, then decremented by
> > > the delete).  Yet here is a case where that doesn't hold.
> > >
> > > Or...another theory is that LOGICAL_INO is broken, or trying to be
> > > "helpful" by doing unnecessary work to get the wrong answer (like
> > > FIEMAP does).  If I look at this extent again, I see that
> >
> > Hum, what wrong answers does fiemap gives? Was this reported before somewhere?
>
> FIEMAP coalesces physically adjacent extents as opposed to reporting
> btrfs extent item boundaries, and it adjusts the first extent record to
> match the beginning of the range parameter given to FIEMAP instead of
> reporting the beginning of the extent.  This does interesting things
> when the extents are compressed or if you try to iterate over them in
> reverse order.
>
> I presume that behavior is either by design, or just an awful compromise
> to make the output conform to the limited capabilities of the FIEMAP
> result format.  This would mean FIEMAP's behavior is not really a bug,
> and I haven't reported it as one; however, it does mean FIEMAP is not an
> appropriate tool for applications that need accurate information about
> physical structure on btrfs.

Not sure if it's a bug either, but reading [1] suggests the
implementation can return merged extents as long as they're flagged
with FIEMAP_EXTENT_MERGED (on the other hand, that flag exists for
filesystems that don't support extents and are entirely block based).

We try to report extents based on the in-memory state, the extent maps,
rather than iterating the extent items in the btrees, mostly for
performance reasons, but also because the in-memory state reflects the
current state while stuff in the btrees may be outdated (due to
delalloc/buffered writes).
But extent maps can be merged in order to save memory. That's probably
what you are experiencing.
The thing is that by the time we are at fiemap we don't know whether
extent maps were merged, but that can be fixed easily so that we start
setting the FIEMAP_EXTENT_MERGED flag. Anyway, right now I'm more
inclined to think the merging is not correct.

[1] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

>
> Computing the 'shared' flag for FIEMAP is prohibitively expensive
> at scale, and there's no way to skip that work with the current
> FIEMAP interface if the application is not interested.

Yep, same thing, backreference walking, can slow things down a lot.

>
> I switched to TREE_SEARCH_V2 to get extent item data:  it's faster, more
> accurate, and more useful than FIEMAP, especially for compressed extents.
>
> If anyone's interested in improving the generic FIEMAP interface to
> handle btrfs quirks better, I can write up a proposal--it could be handy
> to get this data without having to be root as TREE_SEARCH_V2 requires.
> Alternatively, btrfs could have an unprivileged form of TREE_SEARCH_V2
> that limits the min/max root/objectid ranges to the inode of the file
> descriptor that is passed to the ioctl.
Zygo Blaxell Nov. 27, 2019, 6:09 p.m. UTC | #10
On Wed, Nov 27, 2019 at 11:42:21AM +0000, Filipe Manana wrote:
> Walking the backreferences was always slow, and having a few hundred
> references for an extent is enough to notice a significant slowdown
> already.
> That affects the logical_ino ioctls, send, etc.
> 
> Worst, for logical_ino, is that we do the backreference walking while
> either holding a transaction (if one is currently running) or
> holding a semaphore (commit_root_sem) to prevent transaction commits
> from happening (to ensure consistency and that things don't disappear
> while we're using them).
> So for a slow backreference walking, that can hang a lot of other
> tasks that are trying to commit a transaction.
> 
> There was someone working on improving backreference walking (Mark),
> but I think he's not working on btrfs anymore, and I don't know what
> happened to that work (if it was ever finished, etc).

Writing a userspace backref walker to replace LOGICAL_INO has been on
my TODO list for a while, but I had to go after the kernel crashing
bugs first.  Now the top of the kernel crashing bug list is LOGICAL_INO
too, so I guess there is no escape.

I'd love to help with the kernel implementation of backref walking,
but it's used by everything in btrfs, and it's full of wonderful things
like log unwinding and negative reference counting that I know I don't
understand well enough to mess with.

> > Simultaneous LOGICAL_INO + balance correlates with these three failure
> > modes:
> >
> >         https://www.spinics.net/lists/linux-btrfs/msg89425.html
> >
> >         https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88892.html
> 
> For the bug involving btrfs_search_old_slot(), try the following fix:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efad8a853ad2057f96664328a0d327a05ce39c76
> 
> Both reports where made before that fix was posted/merged.

5.3.13 already contains that commit (backported as
f59d80e2f12b695d38e646f00de1e46e123dfcc9 in 5.3.4), and:

	[58446.332688][ T4514] ------------[ cut here ]------------
	[58446.336367][ T4514] kernel BUG at fs/btrfs/ctree.c:1222!
	[58446.339993][ T4514] invalid opcode: 0000 [#1] SMP PTI
	[58446.343408][ T4514] CPU: 3 PID: 4514 Comm: crawl_35277 Not tainted 5.3.13-zb64-c0eaf2b82f8c+ #1
	[58446.348872][ T4514] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	[58446.354272][ T4514] RIP: 0010:__tree_mod_log_rewind+0x235/0x240
	[58446.356393][ T4514] Code: 45 31 c0 4c 89 e7 48 89 d6 48 c1 e6 05 48 8d 74 32 65 ba 19 00 00 00 e8 59 d1 04 00 e9 89 fe ff ff 49 63 57 2c e9 18 ff ff ff <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 83 c2 01 48 63
	[58446.361072][ T4514] RSP: 0000:ffffa34340e1b830 EFLAGS: 00010293
	[58446.364072][ T4514] RAX: 000003ff696b4000 RBX: 000000000000000f RCX: ffff981b2bf7db00
	[58446.366818][ T4514] RDX: 0000000000000000 RSI: 000000000000007e RDI: ffff981b2bf7d300
	[58446.368673][ T4514] RBP: ffffa34340e1b860 R08: ffffa34340e1b7d8 R09: ffffa34340e1b7e0
	[58446.371392][ T4514] R10: 000000000008b74a R11: 0000000000000000 R12: ffff981c62b2e4a0
	[58446.375829][ T4514] R13: ffff981b2bf7d300 R14: 0000000000000006 R15: ffff981b2bf7db00
	[58446.378642][ T4514] FS:  00007fccbd589700(0000) GS:ffff981cf6800000(0000) knlGS:0000000000000000
	[58446.380289][ T4514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[58446.381567][ T4514] CR2: 00007fcc447d5000 CR3: 0000000205248003 CR4: 00000000001606e0
	[58446.383172][ T4514] Call Trace:
	[58446.383835][ T4514]  btrfs_search_old_slot+0x14a/0x7e0
	[58446.384850][ T4514]  resolve_indirect_refs+0x1f0/0x9a0
	[58446.385864][ T4514]  find_parent_nodes+0x427/0x1330
	[58446.386849][ T4514]  btrfs_find_all_roots_safe+0xc1/0x130
	[58446.387924][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[58446.389055][ T4514]  iterate_extent_inodes+0x134/0x390
	[58446.390078][ T4514]  ? _raw_spin_unlock+0x27/0x40
	[58446.391008][ T4514]  iterate_inodes_from_logical+0x9c/0xd0
	[58446.392090][ T4514]  ? iterate_inodes_from_logical+0x9c/0xd0
	[58446.393245][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[58446.394409][ T4514]  btrfs_ioctl_logical_to_ino+0x10e/0x1b0
	[58446.395554][ T4514]  btrfs_ioctl+0x1417/0x2c30
	[58446.396459][ T4514]  ? kvm_sched_clock_read+0x18/0x30
	[58446.397462][ T4514]  ? __lock_acquire+0x4b2/0x1c00
	[58446.398414][ T4514]  ? retint_kernel+0x10/0x10
	[58446.399329][ T4514]  ? kvm_sched_clock_read+0x18/0x30
	[58446.400357][ T4514]  ? sched_clock+0x9/0x10
	[58446.401219][ T4514]  do_vfs_ioctl+0xa9/0x6f0
	[58446.402099][ T4514]  ? do_vfs_ioctl+0xa9/0x6f0
	[58446.403025][ T4514]  ? __fget+0x11e/0x200
	[58446.403863][ T4514]  ksys_ioctl+0x67/0x90
	[58446.404704][ T4514]  __x64_sys_ioctl+0x1a/0x20
	[58446.405619][ T4514]  do_syscall_64+0x65/0x1b0
	[58446.406529][ T4514]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

I was finally able to hit this bug on a 5.3.x kernel last night.
Zygo Blaxell Nov. 27, 2019, 6:20 p.m. UTC | #11
On Wed, Nov 27, 2019 at 01:09:55PM -0500, Zygo Blaxell wrote:
> On Wed, Nov 27, 2019 at 11:42:21AM +0000, Filipe Manana wrote:
> > Walking the backreferences was always slow, and having a few hundred
> > references for an extent is enough to notice a significant slowdown
> > already.
> > That affects the logical_ino ioctls, send, etc.
> > 
> > Worst, for logical_ino, is that we do the backreference walking while
> > either holding a transaction (if one is currently running) or
> > holding a semaphore (commit_root_sem) to prevent transaction commits
> > from happening (to ensure consistency and that things don't disappear
> > while we're using them).
> > So for a slow backreference walking, that can hang a lot of other
> > tasks that are trying to commit a transaction.
> > 
> > There was someone working on improving backreference walking (Mark),
> > but I think he's not working on btrfs anymore, and I don't know what
> > happened to that work (if it was ever finished, etc).
> 
> Writing a userspace backref walker to replace LOGICAL_INO has been on
> my TODO list for a while, but I had to go after the kernel crashing
> bugs first.  Now the top of the kernel crashing bug list is LOGICAL_INO
> too, so I guess there is no escape.
> 
> I'd love to help with the kernel implementation of backref walking,
> but it's used by everything in btrfs, and it's full of wonderful things
> like log unwinding and negative reference counting that I know I don't
> understand well enough to mess with.
> 
> > > Simultaneous LOGICAL_INO + balance correlates with these three failure
> > > modes:
> > >
> > >         https://www.spinics.net/lists/linux-btrfs/msg89425.html
> > >
> > >         https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88892.html
> > 
> > For the bug involving btrfs_search_old_slot(), try the following fix:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efad8a853ad2057f96664328a0d327a05ce39c76
> > 
> > Both reports where made before that fix was posted/merged.
> 
> 5.3.13 already contains that commit (backported as
> f59d80e2f12b695d38e646f00de1e46e123dfcc9 in 5.3.4), and:
> 
> 	[58446.332688][ T4514] ------------[ cut here ]------------
> 	[58446.336367][ T4514] kernel BUG at fs/btrfs/ctree.c:1222!

I note this is not exactly the same bug as the one I reported before:
it's BUG_ON(tm->slot < n) in case MOD_LOG_KEY_REMOVE_WHILE_FREEING,
while the previously reported one (and the one that is much more common
and apparently much harder to hit on 5.3.x) is BUG_ON(tm->slot >= n)
in case MOD_LOG_KEY_REPLACE.

> 	[58446.339993][ T4514] invalid opcode: 0000 [#1] SMP PTI
> 	[58446.343408][ T4514] CPU: 3 PID: 4514 Comm: crawl_35277 Not tainted 5.3.13-zb64-c0eaf2b82f8c+ #1
> 	[58446.348872][ T4514] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> 	[58446.354272][ T4514] RIP: 0010:__tree_mod_log_rewind+0x235/0x240
> 	[58446.356393][ T4514] Code: 45 31 c0 4c 89 e7 48 89 d6 48 c1 e6 05 48 8d 74 32 65 ba 19 00 00 00 e8 59 d1 04 00 e9 89 fe ff ff 49 63 57 2c e9 18 ff ff ff <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 83 c2 01 48 63
> 	[58446.361072][ T4514] RSP: 0000:ffffa34340e1b830 EFLAGS: 00010293
> 	[58446.364072][ T4514] RAX: 000003ff696b4000 RBX: 000000000000000f RCX: ffff981b2bf7db00
> 	[58446.366818][ T4514] RDX: 0000000000000000 RSI: 000000000000007e RDI: ffff981b2bf7d300
> 	[58446.368673][ T4514] RBP: ffffa34340e1b860 R08: ffffa34340e1b7d8 R09: ffffa34340e1b7e0
> 	[58446.371392][ T4514] R10: 000000000008b74a R11: 0000000000000000 R12: ffff981c62b2e4a0
> 	[58446.375829][ T4514] R13: ffff981b2bf7d300 R14: 0000000000000006 R15: ffff981b2bf7db00
> 	[58446.378642][ T4514] FS:  00007fccbd589700(0000) GS:ffff981cf6800000(0000) knlGS:0000000000000000
> 	[58446.380289][ T4514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	[58446.381567][ T4514] CR2: 00007fcc447d5000 CR3: 0000000205248003 CR4: 00000000001606e0
> 	[58446.383172][ T4514] Call Trace:
> 	[58446.383835][ T4514]  btrfs_search_old_slot+0x14a/0x7e0
> 	[58446.384850][ T4514]  resolve_indirect_refs+0x1f0/0x9a0
> 	[58446.385864][ T4514]  find_parent_nodes+0x427/0x1330
> 	[58446.386849][ T4514]  btrfs_find_all_roots_safe+0xc1/0x130
> 	[58446.387924][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
> 	[58446.389055][ T4514]  iterate_extent_inodes+0x134/0x390
> 	[58446.390078][ T4514]  ? _raw_spin_unlock+0x27/0x40
> 	[58446.391008][ T4514]  iterate_inodes_from_logical+0x9c/0xd0
> 	[58446.392090][ T4514]  ? iterate_inodes_from_logical+0x9c/0xd0
> 	[58446.393245][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
> 	[58446.394409][ T4514]  btrfs_ioctl_logical_to_ino+0x10e/0x1b0
> 	[58446.395554][ T4514]  btrfs_ioctl+0x1417/0x2c30
> 	[58446.396459][ T4514]  ? kvm_sched_clock_read+0x18/0x30
> 	[58446.397462][ T4514]  ? __lock_acquire+0x4b2/0x1c00
> 	[58446.398414][ T4514]  ? retint_kernel+0x10/0x10
> 	[58446.399329][ T4514]  ? kvm_sched_clock_read+0x18/0x30
> 	[58446.400357][ T4514]  ? sched_clock+0x9/0x10
> 	[58446.401219][ T4514]  do_vfs_ioctl+0xa9/0x6f0
> 	[58446.402099][ T4514]  ? do_vfs_ioctl+0xa9/0x6f0
> 	[58446.403025][ T4514]  ? __fget+0x11e/0x200
> 	[58446.403863][ T4514]  ksys_ioctl+0x67/0x90
> 	[58446.404704][ T4514]  __x64_sys_ioctl+0x1a/0x20
> 	[58446.405619][ T4514]  do_syscall_64+0x65/0x1b0
> 	[58446.406529][ T4514]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> I was finally able to hit this bug on a 5.3.x kernel last night.
Filipe Manana Nov. 29, 2019, 3:25 p.m. UTC | #12
On Wed, Nov 27, 2019 at 6:20 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, Nov 27, 2019 at 01:09:55PM -0500, Zygo Blaxell wrote:
> > On Wed, Nov 27, 2019 at 11:42:21AM +0000, Filipe Manana wrote:
> > > Walking the backreferences was always slow, and having a few hundred
> > > references for an extent is enough to notice a significant slowdown
> > > already.
> > > That affects the logical_ino ioctls, send, etc.
> > >
> > > Worst, for logical_ino, is that we do the backreference walking while
> > > either holding a transaction (if one is currently running) or
> > > holding a semaphore (commit_root_sem) to prevent transaction commits
> > > from happening (to ensure consistency and that things don't disappear
> > > while we're using them).
> > > So for a slow backreference walking, that can hang a lot of other
> > > tasks that are trying to commit a transaction.
> > >
> > > There was someone working on improving backreference walking (Mark),
> > > but I think he's not working on btrfs anymore, and I don't know what
> > > happened to that work (if it was ever finished, etc).
> >
> > Writing a userspace backref walker to replace LOGICAL_INO has been on
> > my TODO list for a while, but I had to go after the kernel crashing
> > bugs first.  Now the top of the kernel crashing bug list is LOGICAL_INO
> > too, so I guess there is no escape.
> >
> > I'd love to help with the kernel implementation of backref walking,
> > but it's used by everything in btrfs, and it's full of wonderful things
> > like log unwinding and negative reference counting that I know I don't
> > understand well enough to mess with.
> >
> > > > Simultaneous LOGICAL_INO + balance correlates with these three failure
> > > > modes:
> > > >
> > > >         https://www.spinics.net/lists/linux-btrfs/msg89425.html
> > > >
> > > >         https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88892.html
> > >
> > > For the bug involving btrfs_search_old_slot(), try the following fix:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efad8a853ad2057f96664328a0d327a05ce39c76
> > >
> > > Both reports where made before that fix was posted/merged.
> >
> > 5.3.13 already contains that commit (backported as
> > f59d80e2f12b695d38e646f00de1e46e123dfcc9 in 5.3.4), and:
> >
> >       [58446.332688][ T4514] ------------[ cut here ]------------
> >       [58446.336367][ T4514] kernel BUG at fs/btrfs/ctree.c:1222!
>
> I note this is not exactly the same bug as the one I reported before:
> it's BUG_ON(tm->slot < n) in case MOD_LOG_KEY_REMOVE_WHILE_FREEING,
> while the previously reported one (and the one that is much more common
> and apparently much harder to hit on 5.3.x) is BUG_ON(tm->slot >= n)
> in case MOD_LOG_KEY_REPLACE.

Interesting. Eventually I'll have to look at it.

Thanks.

>
> >       [58446.339993][ T4514] invalid opcode: 0000 [#1] SMP PTI
> >       [58446.343408][ T4514] CPU: 3 PID: 4514 Comm: crawl_35277 Not tainted 5.3.13-zb64-c0eaf2b82f8c+ #1
> >       [58446.348872][ T4514] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> >       [58446.354272][ T4514] RIP: 0010:__tree_mod_log_rewind+0x235/0x240
> >       [58446.356393][ T4514] Code: 45 31 c0 4c 89 e7 48 89 d6 48 c1 e6 05 48 8d 74 32 65 ba 19 00 00 00 e8 59 d1 04 00 e9 89 fe ff ff 49 63 57 2c e9 18 ff ff ff <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 83 c2 01 48 63
> >       [58446.361072][ T4514] RSP: 0000:ffffa34340e1b830 EFLAGS: 00010293
> >       [58446.364072][ T4514] RAX: 000003ff696b4000 RBX: 000000000000000f RCX: ffff981b2bf7db00
> >       [58446.366818][ T4514] RDX: 0000000000000000 RSI: 000000000000007e RDI: ffff981b2bf7d300
> >       [58446.368673][ T4514] RBP: ffffa34340e1b860 R08: ffffa34340e1b7d8 R09: ffffa34340e1b7e0
> >       [58446.371392][ T4514] R10: 000000000008b74a R11: 0000000000000000 R12: ffff981c62b2e4a0
> >       [58446.375829][ T4514] R13: ffff981b2bf7d300 R14: 0000000000000006 R15: ffff981b2bf7db00
> >       [58446.378642][ T4514] FS:  00007fccbd589700(0000) GS:ffff981cf6800000(0000) knlGS:0000000000000000
> >       [58446.380289][ T4514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >       [58446.381567][ T4514] CR2: 00007fcc447d5000 CR3: 0000000205248003 CR4: 00000000001606e0
> >       [58446.383172][ T4514] Call Trace:
> >       [58446.383835][ T4514]  btrfs_search_old_slot+0x14a/0x7e0
> >       [58446.384850][ T4514]  resolve_indirect_refs+0x1f0/0x9a0
> >       [58446.385864][ T4514]  find_parent_nodes+0x427/0x1330
> >       [58446.386849][ T4514]  btrfs_find_all_roots_safe+0xc1/0x130
> >       [58446.387924][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
> >       [58446.389055][ T4514]  iterate_extent_inodes+0x134/0x390
> >       [58446.390078][ T4514]  ? _raw_spin_unlock+0x27/0x40
> >       [58446.391008][ T4514]  iterate_inodes_from_logical+0x9c/0xd0
> >       [58446.392090][ T4514]  ? iterate_inodes_from_logical+0x9c/0xd0
> >       [58446.393245][ T4514]  ? btrfs_inode_flags_to_xflags+0x50/0x50
> >       [58446.394409][ T4514]  btrfs_ioctl_logical_to_ino+0x10e/0x1b0
> >       [58446.395554][ T4514]  btrfs_ioctl+0x1417/0x2c30
> >       [58446.396459][ T4514]  ? kvm_sched_clock_read+0x18/0x30
> >       [58446.397462][ T4514]  ? __lock_acquire+0x4b2/0x1c00
> >       [58446.398414][ T4514]  ? retint_kernel+0x10/0x10
> >       [58446.399329][ T4514]  ? kvm_sched_clock_read+0x18/0x30
> >       [58446.400357][ T4514]  ? sched_clock+0x9/0x10
> >       [58446.401219][ T4514]  do_vfs_ioctl+0xa9/0x6f0
> >       [58446.402099][ T4514]  ? do_vfs_ioctl+0xa9/0x6f0
> >       [58446.403025][ T4514]  ? __fget+0x11e/0x200
> >       [58446.403863][ T4514]  ksys_ioctl+0x67/0x90
> >       [58446.404704][ T4514]  __x64_sys_ioctl+0x1a/0x20
> >       [58446.405619][ T4514]  do_syscall_64+0x65/0x1b0
> >       [58446.406529][ T4514]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > I was finally able to hit this bug on a 5.3.x kernel last night.
>
>
Martin Raiber Jan. 18, 2020, 6:41 p.m. UTC | #13
On 26.11.2019 18:52 Martin Raiber wrote:
> On 30.10.2019 13:23 fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Backreference walking, which is used by send to figure if it can issue
>> clone operations instead of write operations, can be very slow and use too
>> much memory when extents have many references. This change simply skips
>> backreference walking when an extent has more than 64 references, in which
>> case we fallback to a write operation instead of a clone operation. This
>> limit is conservative and in practice I observed no signicant slowdown
>> with up to 100 references and still low memory usage up to that limit.
>>
>> This is a temporary workaround until there are speedups in the backref
>> walking code, and as such it does not attempt to add extra interfaces or
>> knobs to tweak the threshold.
> Thanks for the patch!
>
> Did some further deliberation on  the problem, and for me the best short
> term solution (apart from your patch) for the send clone source
> detection would be to offload it to userspace.
> I.e. a flag like "--no-data"/BTRFS_SEND_FLAG_NO_FILE_DATA that disables
> clone source detection.
> Userspace can then take each BTRFS_SEND_C_UPDATE_EXTENT and look if it
> is in the clone sources/on the send target. If it is a dedup tool it can
> use its dedup database, it can use a cache or it can use
> BTRFS_IOC_LOGICAL_INO. Another advantage is that user space can do this
> in multiple threads.
>
> Only problem I can think of is that send prevents subvol removal and
> dedup during send, but user space may not be finished with the actual
> send stream once send has finished outputting the metadata. So there may
> be some issues there, but not if one has control over when removal/dedup
> happens.

I was still having performance issues even with this patch. Shouldn't be
too hard to reproduce. E.g., I have a ~10GB sqlite db with 1236459
extents (as dbs tend to have on btrfs). Sending that (even with only two
snapshots) causes it to be CPU-limited.

So I went ahead and continued to use the patch that disables backref
walking in send (for the db above the backref walking doesn't do
anything anyway), but used it in combination with "--no-data" to read
file data and then dedup in user space (only whole-file dedup for now).
This is all a bit application-specific. Is anyone interested in the code?

Regards,
Martin Raiber

>> Reported-by: Atemu <atemu.main@gmail.com>
>> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>> index 123ac54af071..518ec1265a0c 100644
>> --- a/fs/btrfs/send.c
>> +++ b/fs/btrfs/send.c
>> @@ -25,6 +25,14 @@
>>  #include "compression.h"
>>  
>>  /*
>> + * Maximum number of references an extent can have in order for us to attempt to
>> + * issue clone operations instead of write operations. This currently exists to
>> + * avoid hitting limitations of the backreference walking code (taking a lot of
>> + * time and using too much memory for extents with large number of references).
>> + */
>> +#define SEND_MAX_EXTENT_REFS	64
>> +
>> +/*
>>   * A fs_path is a helper to dynamically build path names with unknown size.
>>   * It reallocates the internal buffer on demand.
>>   * It allows fast adding of path elements on the right side (normal path) and
>> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>>  	struct clone_root *cur_clone_root;
>>  	struct btrfs_key found_key;
>>  	struct btrfs_path *tmp_path;
>> +	struct btrfs_extent_item *ei;
>>  	int compressed;
>>  	u32 i;
>>  
>> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>>  	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>>  				  &found_key, &flags);
>>  	up_read(&fs_info->commit_root_sem);
>> -	btrfs_release_path(tmp_path);
>>  
>>  	if (ret < 0)
>>  		goto out;
>> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>>  		goto out;
>>  	}
>>  
>> +	ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
>> +			    struct btrfs_extent_item);
>> +	/*
>> +	 * Backreference walking (iterate_extent_inodes() below) is currently
>> +	 * too expensive when an extent has a large number of references, both
>> +	 * in time spent and used memory. So for now just fallback to write
>> +	 * operations instead of clone operations when an extent has more than
>> +	 * a certain amount of references.
>> +	 */
>> +	if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +	btrfs_release_path(tmp_path);
>> +
>>  	/*
>>  	 * Setup the clone roots.
>>  	 */
>
Filipe Manana Jan. 20, 2020, 10:57 a.m. UTC | #14
On Sat, Jan 18, 2020 at 6:42 PM Martin Raiber <martin@urbackup.org> wrote:
>
> On 26.11.2019 18:52 Martin Raiber wrote:
> > On 30.10.2019 13:23 fdmanana@kernel.org wrote:
> >> From: Filipe Manana <fdmanana@suse.com>
> >>
> >> Backreference walking, which is used by send to figure if it can issue
> >> clone operations instead of write operations, can be very slow and use too
> >> much memory when extents have many references. This change simply skips
> >> backreference walking when an extent has more than 64 references, in which
> >> case we fallback to a write operation instead of a clone operation. This
> >> limit is conservative and in practice I observed no signicant slowdown
> >> with up to 100 references and still low memory usage up to that limit.
> >>
> >> This is a temporary workaround until there are speedups in the backref
> >> walking code, and as such it does not attempt to add extra interfaces or
> >> knobs to tweak the threshold.
> > Thanks for the patch!
> >
> > Did some further deliberation on  the problem, and for me the best short
> > term solution (apart from your patch) for the send clone source
> > detection would be to offload it to userspace.
> > I.e. a flag like "--no-data"/BTRFS_SEND_FLAG_NO_FILE_DATA that disables
> > clone source detection.
> > Userspace can then take each BTRFS_SEND_C_UPDATE_EXTENT and look if it
> > is in the clone sources/on the send target. If it is a dedup tool it can
> > use its dedup database, it can use a cache or it can use
> > BTRFS_IOC_LOGICAL_INO. Another advantage is that user space can do this
> > in multiple threads.
> >
> > Only problem I can think of is that send prevents subvol removal and
> > dedup during send, but user space may not be finished with the actual
> > send stream once send has finished outputting the metadata. So there may
> > be some issues there, but not if one has control over when removal/dedup
> > happens.
>
> I was still having performance issues even with this patch. Shouldn't be
> too hard to reproduce. E.g., I have a ~10GB sqlite db with 1236459
> extents (as dbs tend to have on btrfs). Sending that (even with only two
> snapshots) causes it to be cpu limited.
>
> So I went ahead and continued to use the patch that disables backref
> walking in send (for the db above the backref walking doesn't do
> anything anyway), but use it in combination with "--no-data" to read
> file data and then dedup in user space (only whole file dedup for now).
> This is all a bit application specific. Is anyone interested in the code?

You should only see a performance improvement if you have extents that
have more than 64 references.
You mention two snapshots, but you don't mention if there are extents
that are shared through reflinks (cloning and deduplication).
Also, is compression enabled? That can have a very significant impact,
especially for zstd.

Thanks.


>
> Regards,
> Martin Raiber
>
> >> Reported-by: Atemu <atemu.main@gmail.com>
> >> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
> >>  1 file changed, 24 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> >> index 123ac54af071..518ec1265a0c 100644
> >> --- a/fs/btrfs/send.c
> >> +++ b/fs/btrfs/send.c
> >> @@ -25,6 +25,14 @@
> >>  #include "compression.h"
> >>
> >>  /*
> >> + * Maximum number of references an extent can have in order for us to attempt to
> >> + * issue clone operations instead of write operations. This currently exists to
> >> + * avoid hitting limitations of the backreference walking code (taking a lot of
> >> + * time and using too much memory for extents with large number of references).
> >> + */
> >> +#define SEND_MAX_EXTENT_REFS        64
> >> +
> >> +/*
> >>   * A fs_path is a helper to dynamically build path names with unknown size.
> >>   * It reallocates the internal buffer on demand.
> >>   * It allows fast adding of path elements on the right side (normal path) and
> >> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
> >>      struct clone_root *cur_clone_root;
> >>      struct btrfs_key found_key;
> >>      struct btrfs_path *tmp_path;
> >> +    struct btrfs_extent_item *ei;
> >>      int compressed;
> >>      u32 i;
> >>
> >> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
> >>      ret = extent_from_logical(fs_info, disk_byte, tmp_path,
> >>                                &found_key, &flags);
> >>      up_read(&fs_info->commit_root_sem);
> >> -    btrfs_release_path(tmp_path);
> >>
> >>      if (ret < 0)
> >>              goto out;
> >> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
> >>              goto out;
> >>      }
> >>
> >> +    ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
> >> +                        struct btrfs_extent_item);
> >> +    /*
> >> +     * Backreference walking (iterate_extent_inodes() below) is currently
> >> +     * too expensive when an extent has a large number of references, both
> >> +     * in time spent and used memory. So for now just fallback to write
> >> +     * operations instead of clone operations when an extent has more than
> >> +     * a certain amount of references.
> >> +     */
> >> +    if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
> >> +            ret = -ENOENT;
> >> +            goto out;
> >> +    }
> >> +    btrfs_release_path(tmp_path);
> >> +
> >>      /*
> >>       * Setup the clone roots.
> >>       */
> >
>
Martin Raiber Jan. 20, 2020, 11:30 a.m. UTC | #15
On 20.01.2020 11:57 Filipe Manana wrote:
> On Sat, Jan 18, 2020 at 6:42 PM Martin Raiber <martin@urbackup.org> wrote:
>> On 26.11.2019 18:52 Martin Raiber wrote:
>>> On 30.10.2019 13:23 fdmanana@kernel.org wrote:
>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>
>>>> Backreference walking, which is used by send to figure if it can issue
>>>> clone operations instead of write operations, can be very slow and use too
>>>> much memory when extents have many references. This change simply skips
>>>> backreference walking when an extent has more than 64 references, in which
>>>> case we fallback to a write operation instead of a clone operation. This
>>>> limit is conservative and in practice I observed no signicant slowdown
>>>> with up to 100 references and still low memory usage up to that limit.
>>>>
>>>> This is a temporary workaround until there are speedups in the backref
>>>> walking code, and as such it does not attempt to add extra interfaces or
>>>> knobs to tweak the threshold.
>>> Thanks for the patch!
>>>
>>> Did some further deliberation on  the problem, and for me the best short
>>> term solution (apart from your patch) for the send clone source
>>> detection would be to offload it to userspace.
>>> I.e. a flag like "--no-data"/BTRFS_SEND_FLAG_NO_FILE_DATA that disables
>>> clone source detection.
>>> Userspace can then take each BTRFS_SEND_C_UPDATE_EXTENT and look if it
>>> is in the clone sources/on the send target. If it is a dedup tool it can
>>> use its dedup database, it can use a cache or it can use
>>> BTRFS_IOC_LOGICAL_INO. Another advantage is that user space can do this
>>> in multiple threads.
>>>
>>> Only problem I can think of is that send prevents subvol removal and
>>> dedup during send, but user space may not be finished with the actual
>>> send stream once send has finished outputting the metadata. So there may
>>> be some issues there, but not if one has control over when removal/dedup
>>> happens.
>> I was still having performance issues even with this patch. Shouldn't be
>> too hard to reproduce. E.g., I have a ~10GB sqlite db with 1236459
>> extents (as dbs tend to have on btrfs). Sending that (even with only two
>> snapshots) causes it to be cpu limited.
>>
>> So I went ahead and continued to use the patch that disables backref
>> walking in send (for the db above the backref walking doesn't do
>> anything anyway), but use it in combination with "--no-data" to read
>> file data and then dedup in user space (only whole file dedup for now).
>> This is all a bit application specific. Is anyone interested in the code?
> You should only see performance improvement if you have extents that
> have more than 64 references.
> You mention two snapshots, but you don't mention if there are extents
> that are shared through reflinks (cloning and deduplication).
> Also, is compression enabled? That can have a very significant impact,
> specially for zstd.
>
> Thanks.

Yeah, sure. It's a database file. I can't see how there would be
reflinks in it (initially during creation, but afterwards not). I back
it up by creating a read-only snapshot and sending/receiving that to a
separate backup volume. That snapshot and the last snapshot for
incremental send are the only ones.

There is another volume with a lot of files that are reflinked on the
same machine and those get send/received as well, and it should work for
both volumes.

Compression on the database file is enabled (lzo). I would guess that
zstd should have the fastest decompression, though (lzo seems to have a
little bit better compression speed, but not decompression speed)? The
sqlite page size is 4096 bytes. That might be too low and cause too much
fragmentation...

>
>
>> Regards,
>> Martin Raiber
>>
>>>> Reported-by: Atemu <atemu.main@gmail.com>
>>>> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>> ---
>>>>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>> index 123ac54af071..518ec1265a0c 100644
>>>> --- a/fs/btrfs/send.c
>>>> +++ b/fs/btrfs/send.c
>>>> @@ -25,6 +25,14 @@
>>>>  #include "compression.h"
>>>>
>>>>  /*
>>>> + * Maximum number of references an extent can have in order for us to attempt to
>>>> + * issue clone operations instead of write operations. This currently exists to
>>>> + * avoid hitting limitations of the backreference walking code (taking a lot of
>>>> + * time and using too much memory for extents with large number of references).
>>>> + */
>>>> +#define SEND_MAX_EXTENT_REFS        64
>>>> +
>>>> +/*
>>>>   * A fs_path is a helper to dynamically build path names with unknown size.
>>>>   * It reallocates the internal buffer on demand.
>>>>   * It allows fast adding of path elements on the right side (normal path) and
>>>> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>      struct clone_root *cur_clone_root;
>>>>      struct btrfs_key found_key;
>>>>      struct btrfs_path *tmp_path;
>>>> +    struct btrfs_extent_item *ei;
>>>>      int compressed;
>>>>      u32 i;
>>>>
>>>> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>      ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>>>>                                &found_key, &flags);
>>>>      up_read(&fs_info->commit_root_sem);
>>>> -    btrfs_release_path(tmp_path);
>>>>
>>>>      if (ret < 0)
>>>>              goto out;
>>>> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>              goto out;
>>>>      }
>>>>
>>>> +    ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
>>>> +                        struct btrfs_extent_item);
>>>> +    /*
>>>> +     * Backreference walking (iterate_extent_inodes() below) is currently
>>>> +     * too expensive when an extent has a large number of references, both
>>>> +     * in time spent and used memory. So for now just fallback to write
>>>> +     * operations instead of clone operations when an extent has more than
>>>> +     * a certain amount of references.
>>>> +     */
>>>> +    if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
>>>> +            ret = -ENOENT;
>>>> +            goto out;
>>>> +    }
>>>> +    btrfs_release_path(tmp_path);
>>>> +
>>>>      /*
>>>>       * Setup the clone roots.
>>>>       */
>
Martin Raiber Jan. 22, 2020, 1:22 a.m. UTC | #16
On 20.01.2020 12:30 Martin Raiber wrote:
> On 20.01.2020 11:57 Filipe Manana wrote:
>> On Sat, Jan 18, 2020 at 6:42 PM Martin Raiber <martin@urbackup.org> wrote:
>>> On 26.11.2019 18:52 Martin Raiber wrote:
>>>> On 30.10.2019 13:23 fdmanana@kernel.org wrote:
>>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>>
>>>>> Backreference walking, which is used by send to figure if it can issue
>>>>> clone operations instead of write operations, can be very slow and use too
>>>>> much memory when extents have many references. This change simply skips
>>>>> backreference walking when an extent has more than 64 references, in which
>>>>> case we fallback to a write operation instead of a clone operation. This
>>>>> limit is conservative and in practice I observed no signicant slowdown
>>>>> with up to 100 references and still low memory usage up to that limit.
>>>>>
>>>>> This is a temporary workaround until there are speedups in the backref
>>>>> walking code, and as such it does not attempt to add extra interfaces or
>>>>> knobs to tweak the threshold.
>>>> Thanks for the patch!
>>>>
>>>> Did some further deliberation on  the problem, and for me the best short
>>>> term solution (apart from your patch) for the send clone source
>>>> detection would be to offload it to userspace.
>>>> I.e. a flag like "--no-data"/BTRFS_SEND_FLAG_NO_FILE_DATA that disables
>>>> clone source detection.
>>>> Userspace can then take each BTRFS_SEND_C_UPDATE_EXTENT and look if it
>>>> is in the clone sources/on the send target. If it is a dedup tool it can
>>>> use its dedup database, it can use a cache or it can use
>>>> BTRFS_IOC_LOGICAL_INO. Another advantage is that user space can do this
>>>> in multiple threads.
>>>>
>>>> Only problem I can think of is that send prevents subvol removal and
>>>> dedup during send, but user space may not be finished with the actual
>>>> send stream once send has finished outputting the metadata. So there may
>>>> be some issues there, but not if one has control over when removal/dedup
>>>> happens.
>>> I was still having performance issues even with this patch. Shouldn't be
>>> too hard to reproduce. E.g., I have a ~10GB sqlite db with 1236459
>>> extents (as dbs tend to have on btrfs). Sending that (even with only two
>>> snapshots) causes it to be cpu limited.
>>>
>>> So I went ahead and continued to use the patch that disables backref
>>> walking in send (for the db above the backref walking doesn't do
>>> anything anyway), but use it in combination with "--no-data" to read
>>> file data and then dedup in user space (only whole file dedup for now).
>>> This is all a bit application specific. Is anyone interested in the code?
>> You should only see performance improvement if you have extents that
>> have more than 64 references.
>> You mention two snapshots, but you don't mention if there are extents
>> that are shared through reflinks (cloning and deduplication).
>> Also, is compression enabled? That can have a very significant impact,
>> specially for zstd.
>>
>> Thanks.
> Yeah, sure. It's a database file. I can't see how there would be
> reflinks in it (initially during creation, but afterwards not). I back
> it up by creating a read-only snapshot and sending/receiving that to a
> separate backup volume. That snapshot and the last snapshot for
> incremental send are the only ones.
>
> There is another volume with a lot of files that are reflinked on the
> same machine and those get send/received as well, and it should work for
> both volumes.
>
> Compression on the database file is enabled (lzo). I would guess that
> zstd should have the fastest decompression, though (lzo seems to have a
> little bit better compression speed, but not decompression speed)? The
> sqlite page size is 4096 bytes. That might be too low and cause too much
> fragmentation...

Created a script to reproduce this (albeit with a smaller 1.7GiB file).

Test results:
https://gist.github.com/uroni/5b601877ce146bb58657b4821ea15fca#file-btrfs_send_test_results

The script:
https://gist.github.com/uroni/4671b3b5215e230babcd7561c6619a82#file-btrfs_send_test-sh

Send, lzo compression, no fragmentation: 0m5.025s
Send, lzo compression, 114231 extents: 6m58.613s

Send, no compression, no fragmentation: 0m2.741s
Send, no compression, 103676 extents: 3m0.290s

Multiply those numbers by 10 and a send of 17GiB of data takes over an
hour. And yeah, having no compression seems to double the speed.

>
>>
>>> Regards,
>>> Martin Raiber
>>>
>>>>> Reported-by: Atemu <atemu.main@gmail.com>
>>>>> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
>>>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>>> ---
>>>>>  fs/btrfs/send.c | 25 ++++++++++++++++++++++++-
>>>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>>> index 123ac54af071..518ec1265a0c 100644
>>>>> --- a/fs/btrfs/send.c
>>>>> +++ b/fs/btrfs/send.c
>>>>> @@ -25,6 +25,14 @@
>>>>>  #include "compression.h"
>>>>>
>>>>>  /*
>>>>> + * Maximum number of references an extent can have in order for us to attempt to
>>>>> + * issue clone operations instead of write operations. This currently exists to
>>>>> + * avoid hitting limitations of the backreference walking code (taking a lot of
>>>>> + * time and using too much memory for extents with large number of references).
>>>>> + */
>>>>> +#define SEND_MAX_EXTENT_REFS        64
>>>>> +
>>>>> +/*
>>>>>   * A fs_path is a helper to dynamically build path names with unknown size.
>>>>>   * It reallocates the internal buffer on demand.
>>>>>   * It allows fast adding of path elements on the right side (normal path) and
>>>>> @@ -1302,6 +1310,7 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>>      struct clone_root *cur_clone_root;
>>>>>      struct btrfs_key found_key;
>>>>>      struct btrfs_path *tmp_path;
>>>>> +    struct btrfs_extent_item *ei;
>>>>>      int compressed;
>>>>>      u32 i;
>>>>>
>>>>> @@ -1349,7 +1358,6 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>>      ret = extent_from_logical(fs_info, disk_byte, tmp_path,
>>>>>                                &found_key, &flags);
>>>>>      up_read(&fs_info->commit_root_sem);
>>>>> -    btrfs_release_path(tmp_path);
>>>>>
>>>>>      if (ret < 0)
>>>>>              goto out;
>>>>> @@ -1358,6 +1366,21 @@ static int find_extent_clone(struct send_ctx *sctx,
>>>>>              goto out;
>>>>>      }
>>>>>
>>>>> +    ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
>>>>> +                        struct btrfs_extent_item);
>>>>> +    /*
>>>>> +     * Backreference walking (iterate_extent_inodes() below) is currently
>>>>> +     * too expensive when an extent has a large number of references, both
>>>>> +     * in time spent and used memory. So for now just fallback to write
>>>>> +     * operations instead of clone operations when an extent has more than
>>>>> +     * a certain amount of references.
>>>>> +     */
>>>>> +    if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
>>>>> +            ret = -ENOENT;
>>>>> +            goto out;
>>>>> +    }
>>>>> +    btrfs_release_path(tmp_path);
>>>>> +
>>>>>      /*
>>>>>       * Setup the clone roots.
>>>>>       */
diff mbox series

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 123ac54af071..518ec1265a0c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -25,6 +25,14 @@ 
 #include "compression.h"
 
 /*
+ * Maximum number of references an extent can have in order for us to attempt to
+ * issue clone operations instead of write operations. This currently exists to
+ * avoid hitting limitations of the backreference walking code (taking a lot of
+ * time and using too much memory for extents with large number of references).
+ */
+#define SEND_MAX_EXTENT_REFS	64
+
+/*
  * A fs_path is a helper to dynamically build path names with unknown size.
  * It reallocates the internal buffer on demand.
  * It allows fast adding of path elements on the right side (normal path) and
@@ -1302,6 +1310,7 @@  static int find_extent_clone(struct send_ctx *sctx,
 	struct clone_root *cur_clone_root;
 	struct btrfs_key found_key;
 	struct btrfs_path *tmp_path;
+	struct btrfs_extent_item *ei;
 	int compressed;
 	u32 i;
 
@@ -1349,7 +1358,6 @@  static int find_extent_clone(struct send_ctx *sctx,
 	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
 				  &found_key, &flags);
 	up_read(&fs_info->commit_root_sem);
-	btrfs_release_path(tmp_path);
 
 	if (ret < 0)
 		goto out;
@@ -1358,6 +1366,21 @@  static int find_extent_clone(struct send_ctx *sctx,
 		goto out;
 	}
 
+	ei = btrfs_item_ptr(tmp_path->nodes[0], tmp_path->slots[0],
+			    struct btrfs_extent_item);
+	/*
+	 * Backreference walking (iterate_extent_inodes() below) is currently
+	 * too expensive when an extent has a large number of references, both
+	 * in time spent and used memory. So for now just fallback to write
+	 * operations instead of clone operations when an extent has more than
+	 * a certain amount of references.
+	 */
+	if (btrfs_extent_refs(tmp_path->nodes[0], ei) > SEND_MAX_EXTENT_REFS) {
+		ret = -ENOENT;
+		goto out;
+	}
+	btrfs_release_path(tmp_path);
+
 	/*
 	 * Setup the clone roots.
 	 */