file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
diff mbox series

Message ID 20180820010932.GV2234@dastard
State New
Headers show
Series
  • file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
Related show

Commit Message

Dave Chinner Aug. 20, 2018, 1:09 a.m. UTC
[cc linux-fsdevel now, too]

On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> [cc linux-xfs@vger.kernel.org]
> 
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > Test that deduplication of an entire file that has a size that is not
> > aligned to the filesystem's block size into a different file does not
> > corrupt the destination's file data.

Ok, I've looked at this now. My first question is where did all the
magic offsets in this test come from? i.e. how was this bug
found and who is it affecting?

> > This test is motivated by a bug found in Btrfs which is fixed by the
> > following patch for the linux kernel:
> > 
> >   "Btrfs: fix data corruption when deduplicating between different files"
> > 
> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> > with the same corruption as in Btrfs - some bytes of a block get replaced
> > with zeroes after the deduplication.
> 
> Filipe, in future can please report XFS bugs you find to the XFS
> list the moment you find them. We shouldn't ever find out about a
> data corruption bug we need to fix via a "oh, by the way" comment in
> a commit message for a regression test....

This becomes much more relevant because of what I've just found....

.....

> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
> > +# a multiple of the block size.
> > +$XFS_IO_PROG -f \
> > +	-c "pwrite -S 0x6b 0 2518890" \
> > +	-c "pwrite -S 0xae 2518890 102398" \
> > +	$SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Create a second file with a length not aligned to the block size, whose bytes
> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
> > +# first file.
> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> > +
> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> > +echo "File content before deduplication:"
> > +od -t x1 $SCRATCH_MNT/foo

Please use "od -Ad -t x1 <file>" so the file offsets reported by od
match the offsets used in the test (i.e. in decimal, not octal).

> > +
> > +# Now deduplicate the entire second file into a range of the first file that
> > +# also has all bytes with the value 0x6b. The destination range's end offset
> > +# must not be aligned to the block size and must be less then the offset of
> > +# the first byte with the value 0xae (byte at offset 2518890).
> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> > +	| _filter_xfs_io

Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
this check:

        /* Only reflink if we're aligned to block boundaries */
        if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
            !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
                return -EINVAL;

And it's pretty clear that a length of 557771 is not block aligned
(being an odd number).

So why was this dedupe request even accepted by the kernel? Well,
I think there's a bug in the check just above this:

        /* If we're linking to EOF, continue to the block boundary. */
        if (pos_in + *len == isize)
                blen = ALIGN(isize, bs) - pos_in;
        else
                blen = *len;

basically, when the "in" file dedupe/reflink range is to EOF, it
expands the range to include /all/ of the block that contains the
EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
EOF/. But when we go to compare the data in these ranges, it
truncates the comparison to the length that the user asked for
(i.e. *len) and not the extended block length.

IOWs, it doesn't compare the bytes beyond EOF in the source block to
the data in the destination block it would replace, and so doesn't
fail the compare like it should.

And, well, btrfs has the same bug. extent_same_check_offsets()
extends the range for alignment so it passes alignment checks, but
then /explicitly/ uses the original length for the data compare
and dedupe. i.e:

       /* pass original length for comparison so we stay within i_size */
        ret = btrfs_cmp_data(olen, cmp);
        if (ret == 0)
                ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);

This is what we should see if someone tried to dedupe the EOF block
of a file:

generic/505     - output mismatch (see /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad)
    --- tests/generic/505.out   2018-08-20 09:36:58.449015709 +1000
    +++ /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad    2018-08-20 10:45:47.245482160 +1000
    @@ -13,8 +13,7 @@
     *
     2621280 ae ae ae ae ae ae ae ae
     2621288
    -deduped 557771/557771 bytes at offset 1957888
    -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
     File content after deduplication and before unmounting:
    ...
    (Run 'diff -u tests/generic/505.out /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad'  to see the entire diff)

i.e. the dedupe request should fail as we cannot dedupe the EOF
block.

The patch below does this for the VFS dedupe code. it's not a final
patch, it's just a demonstration of how this should never have got
past the range checks. Open questions:

	- is documenting rejection on request alignment grounds
	  (i.e. EINVAL) in the man page sufficient for app
	  developers to understand what is going on here?
	- should we just round down the EOF dedupe request to the
	  block before EOF so dedupe still succeeds?
	- should file cloning (i.e. reflink) be allow allowing the
	  EOF block to be shared somewhere inside EOF in the
	  destination file? That's stale data exposure, yes?
	- should clone only allow sharing of the EOF block if it's a
	  either a full file clone or a matching range clone and
	  isize is the same for both src and dest files?

Discuss.

Cheers,

Dave.

Comments

Darrick J. Wong Aug. 20, 2018, 3:33 p.m. UTC | #1
On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> [cc linux-fsdevel now, too]
> 
> On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> > [cc linux-xfs@vger.kernel.org]
> > 
> > On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > Test that deduplication of an entire file that has a size that is not
> > > aligned to the filesystem's block size into a different file does not
> > > corrupt the destination's file data.
> 
> Ok, I've looked at this now. My first question is where did all the
> magic offsets in this test come from? i.e. how was this bug
> found and who is it affecting?
> 
> > > This test is motivated by a bug found in Btrfs which is fixed by the
> > > following patch for the linux kernel:
> > > 
> > >   "Btrfs: fix data corruption when deduplicating between different files"
> > > 
> > > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> > > with the same corruption as in Btrfs - some bytes of a block get replaced
> > > with zeroes after the deduplication.
> > 
> > Filipe, in future can please report XFS bugs you find to the XFS
> > list the moment you find them. We shouldn't ever find out about a
> > data corruption bug we need to fix via a "oh, by the way" comment in
> > a commit message for a regression test....
> 
> This becomes much more relevant because of what I've just found....
> 
> .....
> 
> > > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
> > > +# a multiple of the block size.
> > > +$XFS_IO_PROG -f \
> > > +	-c "pwrite -S 0x6b 0 2518890" \
> > > +	-c "pwrite -S 0xae 2518890 102398" \
> > > +	$SCRATCH_MNT/foo | _filter_xfs_io
> > > +
> > > +# Create a second file with a length not aligned to the block size, whose bytes
> > > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
> > > +# first file.
> > > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> > > +
> > > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
> > > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> > > +echo "File content before deduplication:"
> > > +od -t x1 $SCRATCH_MNT/foo
> 
> Please use "od -Ad -t x1 <file>" so the file offsets reported by od
> match the offsets used in the test (i.e. in decimal, not octal).
> 
> > > +
> > > +# Now deduplicate the entire second file into a range of the first file that
> > > +# also has all bytes with the value 0x6b. The destination range's end offset
> > > +# must not be aligned to the block size and must be less then the offset of
> > > +# the first byte with the value 0xae (byte at offset 2518890).
> > > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> > > +	| _filter_xfs_io
> 
> Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
> to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
> this check:
> 
>         /* Only reflink if we're aligned to block boundaries */
>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
>             !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
>                 return -EINVAL;
> 
> And it's pretty clear that a length of 557771 is not block aligned
> (being an odd number).
> 
> So why was this dedupe request even accepted by the kernel? Well,
> I think there's a bug in the check just above this:
> 
>         /* If we're linking to EOF, continue to the block boundary. */
>         if (pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;
> 
> basically, when the "in" file dedupe/reflink range is to EOF, it
> expands the range to include /all/ of the block that contains the
> EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> EOF/. But when we go to compare the data in these ranges, it
> truncates the comparison to the length that the user asked for
> (i.e. *len) and not the extended block length.
> 
> IOWs, it doesn't compare the bytes beyond EOF in the source block to
> the data in the destination block it would replace, and so doesn't
> fail the compare like it should.
> 
> And, well, btrfs has the same bug. extent_same_check_offsets()
> extends the range for alignment so it passes alignment checks, but
> then /explicitly/ uses the original length for the data compare
> and dedupe. i.e:
> 
>        /* pass original length for comparison so we stay within i_size */
>         ret = btrfs_cmp_data(olen, cmp);
>         if (ret == 0)
>                 ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);
> 
> This is what we should see if someone tried to dedupe the EOF block
> of a file:
> 
> generic/505     - output mismatch (see /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad)
>     --- tests/generic/505.out   2018-08-20 09:36:58.449015709 +1000
>     +++ /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad    2018-08-20 10:45:47.245482160 +1000
>     @@ -13,8 +13,7 @@
>      *
>      2621280 ae ae ae ae ae ae ae ae
>      2621288
>     -deduped 557771/557771 bytes at offset 1957888
>     -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
>      File content after deduplication and before unmounting:
>     ...
>     (Run 'diff -u tests/generic/505.out /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad'  to see the entire diff)
> 
> i.e. the dedupe request should fail as we cannot dedupe the EOF
> block.
> 
> The patch below does this for the VFS dedupe code. it's not a final
> patch, it's just a demonstration of how this should never have got
> past the range checks. Open questions:

(Here's my five minutes of XFS that I'm allowed per day... :/)

> 	- is documenting rejection on request alignment grounds
> 	  (i.e. EINVAL) in the man page sufficient for app
> 	  developers to understand what is going on here?

I think so.  The manpage says: "The filesystem does not support
reflinking the ranges of the given files", which (to my mind) covers
this case of not supporting dedupe of EOF blocks.

> 	- should we just round down the EOF dedupe request to the
> 	  block before EOF so dedupe still succeeds?

I've often wondered if the interface should (have) be(en) that we start
at src_off/dst_off and share as many common blocks as possible until we
find a mismatch, then tell userspace where we stopped... instead of like
now where we compare the entire extent and fail if any part of it
doesn't match.

> 	- should file cloning (i.e. reflink) be allow allowing the
> 	  EOF block to be shared somewhere inside EOF in the
> 	  destination file?

I don't think we should.

> That's stale data exposure, yes?

Haven't tested that, but seems likely.

> 	- should clone only allow sharing of the EOF block if it's a
> 	  either a full file clone or a matching range clone and
> 	  isize is the same for both src and dest files?

The above sound reasonable for clone, though I would also allow clone to
share the EOF block if we extend isize of the dest file.

The above also nearly sound reasonable for dedupe too.  If we encounter
EOF at the same places in the src range and the dest range, should we
allow sharing there?  The post-eof bytes are undefined by definition;
does undefined == undefined?

> 
> Discuss.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Exposed by fstests generic/505 on XFS, caused by extending the
> bblock match range to include the partial EOF block, but then
> allowing unknown data beyond EOF to be considered a "match" to data
> in the destination file because the comparison is only made to the
> end of the source file. This corrupts the destination file when the
> source extent is shared with it.
> 
> Open questions:
> 
> 	- is documenting rejection on request alignment grounds
> 	  (i.e. EINVAL) in the man page sufficient for app
> 	  developers to understand what is going on here?
> 	- should we just silently round down the EOF dedupe request
> 	  to the block before EOF so dedupe still succeeds?
> 	- should file cloning (i.e. reflink) be allow allowing the
> 	  EOF block to be shared somewhere inside EOF in the
> 	  destination file? That's stale data exposure, yes?
> 	- should clone only allow sharing of the EOF block if it's a
> 	  either a full file clone or a matching range clone and
> 	  isize is the same for both src and dest files?
> 
> Btrfs also has the same bug in extent_same_check_offsets() and
> btrfs_extent_same_range() that is not addressed by this patch.

<nod> (xfs/ocfs2) clone ioctls tend to be bug-for-bug compatible with
btrfs more often than we probably ought to... :/

(I also sorta wonder if btrfs should be ported to the common vfs
routines for clone prep and dedupe comparison...?)

--D

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 153f8f690490..3844359a4597 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>  			return -EINVAL;
>  	}
>  
> -	/* If we're linking to EOF, continue to the block boundary. */
> -	if (pos_in + *len == isize)
> +	/* Reflink allows linking to EOF, so round the length up to allow us to
> +	 * shared the last block in the file. We don't care what is beyond the
> +	 * EOF point in the block that gets shared, as we can't access it and
> +	 * attempts to extent the file will break the sharing.
> +	 *
> +	 * For dedupe, sharing the EOF block is illegal because the bytes beyond
> +	 * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> +	 * we do not extend/align the length and instead let the alignment
> +	 * checks below reject it.
> +	 */
> +	if (!is_dedupe && pos_in + *len == isize)
>  		blen = ALIGN(isize, bs) - pos_in;
>  	else
>  		blen = *len;
Dave Chinner Aug. 21, 2018, 12:49 a.m. UTC | #2
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
....
> > So why was this dedupe request even accepted by the kernel? Well,
> > I think there's a bug in the check just above this:
> > 
> >         /* If we're linking to EOF, continue to the block boundary. */
> >         if (pos_in + *len == isize)
> >                 blen = ALIGN(isize, bs) - pos_in;
> >         else
> >                 blen = *len;
> > 
> > basically, when the "in" file dedupe/reflink range is to EOF, it
> > expands the range to include /all/ of the block that contains the
> > EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> > EOF/. But when we go to compare the data in these ranges, it
> > truncates the comparison to the length that the user asked for
> > (i.e. *len) and not the extended block length.
> > 
> > IOWs, it doesn't compare the bytes beyond EOF in the source block to
> > the data in the destination block it would replace, and so doesn't
> > fail the compare like it should.
......
> > i.e. the dedupe request should fail as we cannot dedupe the EOF
> > block.
> > 
> > The patch below does this for the VFS dedupe code. it's not a final
> > patch, it's just a demonstration of how this should never have got
> > past the range checks. Open questions:
> 
> (Here's my five minutes of XFS that I'm allowed per day... :/)
> 
> > 	- is documenting rejection on request alignment grounds
> > 	  (i.e. EINVAL) in the man page sufficient for app
> > 	  developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Ok.

> > 	- should we just round down the EOF dedupe request to the
> > 	  block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

I think that the ioctl_fideduperange() man page description gives us
the flexibility to do this. That is, this text:

	Upon successful completion of this ioctl, the number of
	bytes successfully deduplicated is returned in bytes_deduped
	and a status code for the deduplication operation is
	returned in status.  If even a single byte in the range does
	not match, the deduplication request will be ignored  and
	status set  to FILE_DEDUPE_RANGE_DIFFERS.

This implies we can dedupe less than the entire range as long as the
entire range matches. If the entire range does not match, we have
to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
so we can pick and choose how much we deduplicate. How much we
dedupe is then returned as a byte count. In this case, it will be a
few bytes short of the entire length requested because we aligned
the dedupe inwards....

Does that sound reasonable?

> > 	- should file cloning (i.e. reflink) be allow allowing the
> > 	  EOF block to be shared somewhere inside EOF in the
> > 	  destination file?
> 
> I don't think we should.
> 
> > That's stale data exposure, yes?
> 
> Haven't tested that, but seems likely.

Yeah, I haven't tested it either, but it I can't see how the current
behaviour ends well.

> > 	- should clone only allow sharing of the EOF block if it's a
> > 	  either a full file clone or a matching range clone and
> > 	  isize is the same for both src and dest files?
> 
> The above sound reasonable for clone, though I would also allow clone to
> share the EOF block if we extend isize of the dest file.

Hmmm. That covers the only valid rule for sharing the EOF block,
right? i.e. That the source EOF lands exactly at or beyond the
destination EOF, so it remains the EOF block in both files?

FWIW, I just noticed that the ioctl_ficlonerange man page doesn't
document the return value on success of a clone.

> The above also nearly sound reasonable for dedupe too.  If we encounter
> EOF at the same places in the src range and the dest range, should we
> allow sharing there?  The post-eof bytes are undefined by definition;
> does undefined == undefined?

It's undefined, espescially as different fs implementations may be
doing different things with partial blocks (e.g. tail packing).

From an XFS perspective, I don't think we should physically share
partial EOF blocks on dedupe because it means extending either file
becomes a COW operation and then a new allocation instead of just a
new allocation (i.e. a fragmentation vector). So seems better to me
to just leave the partial EOF block unshared in this case.

> > [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Exposed by fstests generic/505 on XFS, caused by extending the
> > bblock match range to include the partial EOF block, but then
> > allowing unknown data beyond EOF to be considered a "match" to data
> > in the destination file because the comparison is only made to the
> > end of the source file. This corrupts the destination file when the
> > source extent is shared with it.
> > 
> > Open questions:
> > 
> > 	- is documenting rejection on request alignment grounds
> > 	  (i.e. EINVAL) in the man page sufficient for app
> > 	  developers to understand what is going on here?
> > 	- should we just silently round down the EOF dedupe request
> > 	  to the block before EOF so dedupe still succeeds?
> > 	- should file cloning (i.e. reflink) be allow allowing the
> > 	  EOF block to be shared somewhere inside EOF in the
> > 	  destination file? That's stale data exposure, yes?
> > 	- should clone only allow sharing of the EOF block if it's a
> > 	  either a full file clone or a matching range clone and
> > 	  isize is the same for both src and dest files?
> > 
> > Btrfs also has the same bug in extent_same_check_offsets() and
> > btrfs_extent_same_range() that is not addressed by this patch.
> 
> <nod> (xfs/ocfs2) clone ioctls tend to be bug-for-bug compatible with
> btrfs more often than we probably ought to... :/

*nod*

> (I also sorta wonder if btrfs should be ported to the common vfs
> routines for clone prep and dedupe comparison...?)

If someone feels like trying to tame that dragon, then I'm not going
to stop them. I'm not going to try, though.

Cheers,

Dave.
Eric Sandeen Aug. 21, 2018, 1:17 a.m. UTC | #3
On 8/20/18 7:49 PM, Dave Chinner wrote:
> 	Upon successful completion of this ioctl, the number of
> 	bytes successfully deduplicated is returned in bytes_deduped
> 	and a status code for the deduplication operation is
> 	returned in status.  If even a single byte in the range does
> 	not match, the deduplication request will be ignored  and
> 	status set  to FILE_DEDUPE_RANGE_DIFFERS.
> 
> This implies we can dedupe less than the entire range as long as the
> entire range matches. If the entire range does not match, we have
> to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
> so we can pick and choose how much we deduplicate. How much we
> dedupe is then returned as a byte count. In this case, it will be a
> few bytes short of the entire length requested because we aligned
> the dedupe inwards....
> 
> Does that sound reasonable?

I had hoped that dedupe was advisory as Darrick wished for, but TBH my
reading of that is no, if you ask for a range to be deduped and any of
it differs, "even a single byte," you fail it all.

Why else would that last part be present, if the interface is free to
ignore later parts that don't match and truncate the range to the
matching portion?

-Eric
Dave Chinner Aug. 21, 2018, 4:49 a.m. UTC | #4
On Mon, Aug 20, 2018 at 08:17:18PM -0500, Eric Sandeen wrote:
> 
> 
> On 8/20/18 7:49 PM, Dave Chinner wrote:
> > 	Upon successful completion of this ioctl, the number of
> > 	bytes successfully deduplicated is returned in bytes_deduped
> > 	and a status code for the deduplication operation is
> > 	returned in status.  If even a single byte in the range does
> > 	not match, the deduplication request will be ignored  and
> > 	status set  to FILE_DEDUPE_RANGE_DIFFERS.
> > 
> > This implies we can dedupe less than the entire range as long as the
> > entire range matches. If the entire range does not match, we have
> > to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
> > so we can pick and choose how much we deduplicate. How much we
> > dedupe is then returned as a byte count. In this case, it will be a
> > few bytes short of the entire length requested because we aligned
> > the dedupe inwards....
> > 
> > Does that sound reasonable?
> 
> I had hoped that dedupe was advisory as Darrick wished for, but TBH my
> reading of that is no, if you ask for a range to be deduped and any of
> it differs, "even a single byte," you fail it all.

Yes, if the data differs, then it fails.  But that's not what I'm
questioning nor what I care about in this case. I'm asking whether
the filesystem gets to choose how much of the range of same data is
deduped when the file data is apparently the same.

> Why else would that last part be present, if the interface is free to
> ignore later parts that don't match and truncate the range to the
> matching portion?

I think there is a clear differentiation in the man page text
between "all the bytes are the same" and "how much of that range the
filesystem deduped". i.e. the man page does not say that the
filesystem *must* dedupe the entire range if all the data is the
same - it says the filesystem will return /how many bytes it
successfully deduplicated/. IOWs, "do nothing and return zero" is a
valid ->dedupe_file_range implementation.

AFAIC, it's the fact that we do data comparison from the page cache
before calling into the filesystem to check on-disk formats that is
the problem here. Having identical data in the page cache is not the
same thing as "the blocks on disk containing the user data are
identical". i.e. Filesystems deduplicate disk blocks, but they often
perform transformations on the data as it passes between the page
cache and disk blocks or have constraints on the format of the data
on disk. For example:

	- encrypted files. unencrypted data in the page cache is the
	  same, therefore we can't return FILE_DEDUPE_RANGE_DIFFERS
	  because the user will be able to see that they are the
	  same. But they will different on disk after encryption
	  (e.g. because different nonce/seed or completely different
	  keys) and so the filesystem should not dedupe them.

	- the two files differ in on-disk format e.g. compressed
	  vs encrypted.

	- data might be inline with metadata

	- tail packing

	- dedupe request might be block aligned at file offsets, but
	  file offsets are not aligned to disk blocks due to, say,
	  multi-block data compression.

	- on disk extent tracking granularity might be larger than
	  filesystem block size (e.g. ext4's bigalloc mode) so can't
	  be deduped on filesystem block size alignment.

	- the source or destination inode is marked "NODATACOW" so
	  can't contain shared extents

I'm sure there's others, but I think this is enough to demonstrate
that "user visible file data is the same" does not mean the
filesystem can dedupe it. The wording in the man page appears to
understand these limitations and hence it allows us to silently
ignore the partial EOF block when it comes to dedupe....

Cheers,

Dave.
Filipe Manana Aug. 21, 2018, 3:55 p.m. UTC | #5
On Mon, Aug 20, 2018 at 2:09 AM, Dave Chinner <david@fromorbit.com> wrote:
> [cc linux-fsdevel now, too]
>
> On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
>> [cc linux-xfs@vger.kernel.org]
>>
>> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdmanana@kernel.org wrote:
>> > From: Filipe Manana <fdmanana@suse.com>
>> >
>> > Test that deduplication of an entire file that has a size that is not
>> > aligned to the filesystem's block size into a different file does not
>> > corrupt the destination's file data.
>
> Ok, I've looked at this now. My first question is where did all the
> magic offsets in this test come from? i.e. how was this bug
> found and who is it affecting?

I found it myself. I'm not aware of any users or applications affected by it.

>
>> > This test is motivated by a bug found in Btrfs which is fixed by the
>> > following patch for the linux kernel:
>> >
>> >   "Btrfs: fix data corruption when deduplicating between different files"
>> >
>> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> > with the same corruption as in Btrfs - some bytes of a block get replaced
>> > with zeroes after the deduplication.
>>
>> Filipe, in future can please report XFS bugs you find to the XFS
>> list the moment you find them. We shouldn't ever find out about a
>> data corruption bug we need to fix via a "oh, by the way" comment in
>> a commit message for a regression test....
>
> This becomes much more relevant because of what I've just found....
>
> .....
>
>> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
>> > +# a multiple of the block size.
>> > +$XFS_IO_PROG -f \
>> > +   -c "pwrite -S 0x6b 0 2518890" \
>> > +   -c "pwrite -S 0xae 2518890 102398" \
>> > +   $SCRATCH_MNT/foo | _filter_xfs_io
>> > +
>> > +# Create a second file with a length not aligned to the block size, whose bytes
>> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
>> > +# first file.
>> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
>> > +
>> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
>> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> > +echo "File content before deduplication:"
>> > +od -t x1 $SCRATCH_MNT/foo
>
> Please use "od -Ad -t x1 <file>" so the file offsets reported by od
> match the offsets used in the test (i.e. in decimal, not octal).

Will do, in the next test version after agreement on the fix.

>
>> > +
>> > +# Now deduplicate the entire second file into a range of the first file that
>> > +# also has all bytes with the value 0x6b. The destination range's end offset
>> > +# must not be aligned to the block size and must be less then the offset of
>> > +# the first byte with the value 0xae (byte at offset 2518890).
>> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
>> > +   | _filter_xfs_io
>
> Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
> to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
> this check:
>
>         /* Only reflink if we're aligned to block boundaries */
>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
>             !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
>                 return -EINVAL;
>
> And it's pretty clear that a length of 557771 is not block aligned
> (being an odd number).
>
> So why was this dedupe request even accepted by the kernel? Well,
> I think there's a bug in the check just above this:
>
>         /* If we're linking to EOF, continue to the block boundary. */
>         if (pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;

Yes, btrfs, for the dedupe call, also has its own place where it does
the same thing,
at fs/btrfs/ioctl.c:extent_same_check_offsets().
And that's precisely what made me suspicious about it, together with
what you note below about the call to btrfs_cmp_data() using the
original, unaligned, length.

However, I just ran the same test using reflink and not dedupe and the
same problem happens. In earlier versions of the test/debugging I
either did not notice
or made some mistake because I hadn't seen the same problem for the
reflink/clone operation, and since we only do that extra rounding in
the btrfs dedupe code path,
and not on the clone one, I took the conclusion that only dedupe was
affected, but that is also done in the VFS as you just pointed.
I only noticied XFS also failed at the last moment, when I had the
test case complete.

>
> basically, when the "in" file dedupe/reflink range is to EOF, it
> expands the range to include /all/ of the block that contains the
> EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> EOF/. But when we go to compare the data in these ranges, it
> truncates the comparison to the length that the user asked for
> (i.e. *len) and not the extended block length.
>
> IOWs, it doesn't compare the bytes beyond EOF in the source block to
> the data in the destination block it would replace, and so doesn't
> fail the compare like it should.
>
> And, well, btrfs has the same bug. extent_same_check_offsets()
> extends the range for alignment so it passes alignment checks, but
> then /explicitly/ uses the original length for the data compare
> and dedupe. i.e:
>
>        /* pass original length for comparison so we stay within i_size */
>         ret = btrfs_cmp_data(olen, cmp);
>         if (ret == 0)
>                 ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);
>
> This is what we should see if someone tried to dedupe the EOF block
> of a file:
>
> generic/505     - output mismatch (see /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad)
>     --- tests/generic/505.out   2018-08-20 09:36:58.449015709 +1000
>     +++ /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad    2018-08-20 10:45:47.245482160 +1000
>     @@ -13,8 +13,7 @@
>      *
>      2621280 ae ae ae ae ae ae ae ae
>      2621288
>     -deduped 557771/557771 bytes at offset 1957888
>     -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
>      File content after deduplication and before unmounting:
>     ...
>     (Run 'diff -u tests/generic/505.out /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad'  to see the entire diff)
>
> i.e. the dedupe request should fail as we cannot dedupe the EOF
> block.
>
> The patch below does this for the VFS dedupe code. it's not a final
> patch, it's just a demonstration of how this should never have got
> past the range checks. Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?

Maybe. No idea if that would "break" existing applications and use cases.

>         - should we just round down the EOF dedupe request to the
>           block before EOF so dedupe still succeeds?

That was my idea because dedupe is allowed to deduplicate less then
what is requested.

>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?

No, but for cloning I'm not sure which approach is better, to round
down or reject with -EINVAL.

>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?

I think it should.

cheers

>
> Discuss.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
>
> [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Exposed by fstests generic/505 on XFS, caused by extending the
> bblock match range to include the partial EOF block, but then
> allowing unknown data beyond EOF to be considered a "match" to data
> in the destination file because the comparison is only made to the
> end of the source file. This corrupts the destination file when the
> source extent is shared with it.
>
> Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?
>         - should we just silently round down the EOF dedupe request
>           to the block before EOF so dedupe still succeeds?
>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?
>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?
>
> Btrfs also has the same bug in extent_same_check_offsets() and
> btrfs_extent_same_range() that is not addressed by this patch.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 153f8f690490..3844359a4597 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>                         return -EINVAL;
>         }
>
> -       /* If we're linking to EOF, continue to the block boundary. */
> -       if (pos_in + *len == isize)
> +       /* Reflink allows linking to EOF, so round the length up to allow us to
> +        * shared the last block in the file. We don't care what is beyond the
> +        * EOF point in the block that gets shared, as we can't access it and
> +        * attempts to extent the file will break the sharing.
> +        *
> +        * For dedupe, sharing the EOF block is illegal because the bytes beyond
> +        * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> +        * we do not extend/align the length and instead let the alignment
> +        * checks below reject it.
> +        */
> +       if (!is_dedupe && pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;
Zygo Blaxell Aug. 23, 2018, 12:58 p.m. UTC | #6
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > 	- is documenting rejection on request alignment grounds
> > 	  (i.e. EINVAL) in the man page sufficient for app
> > 	  developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
this; however, on btrfs, not supporting dedupe of EOF blocks means small
files (one extent) cannot be deduped at all, because the EOF block holds
a reference to the entire dst extent.  If a dedupe app doesn't go all the
way to EOF on btrfs, then it should not attempt to dedupe any part of the
last extent of the file as the benefit would be zero or slightly negative.

The app developer would need to be aware that such a restriction could
exist on some filesystems, and be able to distinguish this from other
cases that could lead to EINVAL.  Portable code would have to try a dedupe
up to EOF, then if that failed, round down and retry, and if that failed
too, the app would have to figure out which filesystem it's running on
to know what to do next.  Performance demands the app know what the FS
will do in advance, and avoid a whole class of behavior.

btrfs dedupe reports success if the src extent is inline and the same
size as the dst extent (i.e. file is smaller than one page).  No dedupe
can occur in such cases--a clone results in a simple copy, so the best
a dedupe could do would be a no-op.  Returning EINVAL there would break
a few popular tools like "cp --reflink".  Returning OK but doing nothing
seems to be the best option in that case.

> > 	- should we just round down the EOF dedupe request to the
> > 	  block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

The usefulness or harmfulness of that approach depends a lot on what
the application expects the filesystem to do.

In btrfs, the dedupe operation acts on references to data, not the
underlying data blocks.  If there are 1000 slightly overlapping references
to a single contiguous range of data blocks in dst on disk, each dedupe
operation acts on only one of those, leaving the other 999 untouched.
If the app then submits 999 other dedupe requests, no references to the
dst blocks remain and the underlying data blocks can be deleted.

In a parallel universe (or a better filesystem, or a userspace emulation
built out of dedupe and other ioctls), dedupe could work at the extent
data (physical) level.  The app points at src and dst extent references
(inode/offset/length tuples), and the filesystem figures out which
physical blocks these point to, then adjusts all the references to the
dst blocks at once, dealing with partial overlaps and snapshots and
nodatacow and whatever other exotic features might be lurking in the
filesystem, ending with every reference to every part of dst replaced
by the longest possible contiguous reference(s) to src.

Problems arise if the length deduped is not exactly the length requested.
If the search continues until a mismatch is found, where does the search
for a mismatch lead?  Does the search follow physically contiguous
blocks on disk, or would dedupe follow logically contiguous blocks in
the src and dst files?  Or the intersection of those, i.e. physically
contiguous blocks that are logically contiguous in _any_ two files,
not limited to src and dst.

There is also the problem where the files could have been previously
deduped and then partially overwritten with identical data.  If the
application cannot control where the dedupe search for identical data
ends, it can end up accidentally creating new references to extents
while it is trying to eliminate those extents.  The kernel might do a
lot of extra work from looking ahead that the application has to undo
immediately (e.g. after the first few blocks of dst, the app wants to
do another dedupe with a better src extent elsewhere on the filesystem,
but the kernel goes ahead and dedupes with an inferior src beyond the
end of what the app asked for).

bees tries to determine exactly the set of dedupe requests required to
remove all references to duplicate extents (and maybe someday do defrag
as well).  If the kernel deviates from the requested sizes (e.g. because
the data changed on the filesystem between dedup requests), the final
extent layout after the dedupe requests are finished won't match what
bees expected it to be, so bees has to reexamine the filesystem and
either retry with a fresh set of exact dedupe requests, or give up and
leave duplicate extents on disk.  The retry loop is normal and ends
quickly if the dedupe fails because data changed on disk, but if the
kernel starts messing with the dedupe lengths then bees has to detect
this and escape before it gets stuck in a nasty feedback loop.

If we want to design a new interface, it should allow the app to specify
maximum and minimum length, so that the kernel knows how much flexibility
it is allowed by the application.  Maximum length lets one app say
"dedupe as much as you can find, up to EOF", while minimum length lets
another app say "don't bother if the match is less than 12K, the space
saving is not worth the write iops", and setting them equal lets the
third app say "I have a plan that requires you to do precisely what I
tell you or do nothing at all."

> > 	- should file cloning (i.e. reflink) be allow allowing the
> > 	  EOF block to be shared somewhere inside EOF in the
> > 	  destination file?
> 
> I don't think we should.
> 
> > That's stale data exposure, yes?
> 
> Haven't tested that, but seems likely.

It doesn't sound like a good idea, especially if mmap is involved.

> > 	- should clone only allow sharing of the EOF block if it's a
> > 	  either a full file clone or a matching range clone and
> > 	  isize is the same for both src and dest files?
> 
> The above sound reasonable for clone, though I would also allow clone to
> share the EOF block if we extend isize of the dest file.
> 
> The above also nearly sound reasonable for dedupe too.  If we encounter
> EOF at the same places in the src range and the dest range, should we
> allow sharing there?  The post-eof bytes are undefined by definition;
> does undefined == undefined?

> > 
> > Discuss.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > 
> > [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Exposed by fstests generic/505 on XFS, caused by extending the
> > bblock match range to include the partial EOF block, but then
> > allowing unknown data beyond EOF to be considered a "match" to data
> > in the destination file because the comparison is only made to the
> > end of the source file. This corrupts the destination file when the
> > source extent is shared with it.
> > 
> > Open questions:
> > 
> > 	- is documenting rejection on request alignment grounds
> > 	  (i.e. EINVAL) in the man page sufficient for app
> > 	  developers to understand what is going on here?
> > 	- should we just silently round down the EOF dedupe request
> > 	  to the block before EOF so dedupe still succeeds?
> > 	- should file cloning (i.e. reflink) be allow allowing the
> > 	  EOF block to be shared somewhere inside EOF in the
> > 	  destination file? That's stale data exposure, yes?
> > 	- should clone only allow sharing of the EOF block if it's a
> > 	  either a full file clone or a matching range clone and
> > 	  isize is the same for both src and dest files?
> > 
> > Btrfs also has the same bug in extent_same_check_offsets() and
> > btrfs_extent_same_range() that is not addressed by this patch.
> 
> <nod> (xfs/ocfs2) clone ioctls tend to be bug-for-bug compatible with
> btrfs more often than we probably ought to... :/
> 
> (I also sorta wonder if btrfs should be ported to the common vfs
> routines for clone prep and dedupe comparison...?)
> 
> --D
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/read_write.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 153f8f690490..3844359a4597 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
> >  			return -EINVAL;
> >  	}
> >  
> > -	/* If we're linking to EOF, continue to the block boundary. */
> > -	if (pos_in + *len == isize)
> > +	/* Reflink allows linking to EOF, so round the length up to allow us to
> > +	 * shared the last block in the file. We don't care what is beyond the
> > +	 * EOF point in the block that gets shared, as we can't access it and
> > +	 * attempts to extent the file will break the sharing.
> > +	 *
> > +	 * For dedupe, sharing the EOF block is illegal because the bytes beyond
> > +	 * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> > +	 * we do not extend/align the length and instead let the alignment
> > +	 * checks below reject it.
> > +	 */
> > +	if (!is_dedupe && pos_in + *len == isize)
> >  		blen = ALIGN(isize, bs) - pos_in;
> >  	else
> >  		blen = *len;
Zygo Blaxell Aug. 24, 2018, 2:19 a.m. UTC | #7
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - should we just round down the EOF dedupe request to the
> > >     block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.

Here are some concrete examples.

In the following, letters are 4K disk blocks and also inode offsets
(i.e. "A" means a block containing 4096 x "A" located at inode offset 0,
"B" contains "B" located at inode offset 1, etc).  "|" indicates
a physical discontinuity of the blocks on disk.  Lowercase "a" has
identical content to uppercase "A", but they are located in different
physical blocks on disk.

Suppose you have two identical files with different write histories,
so they have different on-disk layouts:

        Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

        Inode 2:  a|b|c|d|e|f|g|hijklmnopqrstuvwxyz

A naive dedupe app might pick src and dst at random, and do this:

        // dedupe(length, src_ino, src_off, dst_ino, dst_off)

        dedupe(length 26, Inode 1, Offset 0, Inode 2, Offset 0)

with the result having 11 fragments in each file, all from the
original Inode 1:

        Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

        Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

A smarter dedupe app might choose src and dst based on logical proximity
and/or physical seek distance, or the app might choose dst with the
smallest number of existing references in the filesystem, or the app might
simply choose the longest available src extents to minimize fragmentation:

        dedupe(length 7, Inode 1, Offset 0, Inode 2, Offset 0)

        dedupe(length 19, Inode 2, Offset 7, Inode 1, Offset 7)

with the result having 2 fragments in each file, each chosen
from a different original inode:

        Inode 1:  ABCDEFG|hijklmnopqrstuvwxyz

        Inode 2:  ABCDEFG|hijklmnopqrstuvwxyz

If the kernel continued past the 'length 7' size specified in the first
dedupe, then the 'hijklmnopqrstuvwxyz' would be *lost*, and the second
dedupe would be an expensive no-op because both Inode 1 and Inode 2
refer to the same physical blocks:

        Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

                  [-------] - app asked for this
        Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
    kernel does this too - [-------------------------]
    and "hijklmnopqrstuvwxyz" no longer exists for second dedupe

A dedupe app willing to spend more on IO can create its own better src
with only one fragment:

        open(with O_TMPFILE) -> Inode 3

        copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)
        
        copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)

        dedupe(length 26, Inode 3, Offset 0, Inode 1, Offset 0)

        dedupe(length 26, Inode 3, Offset 0, Inode 2, Offset 0)

        close(Inode 3)

Now there is just one fragment referenced from two places:

        Inode 1:  αβξδεφγηιςκλμνοπθρστυвшχψζ

        Inode 2:  αβξδεφγηιςκλμνοπθρστυвшχψζ

[If encoding goes horribly wrong, the above are a-z transcoded as a mix
of Greek and Cyrillic Unicode characters.]

Real filesystems sometimes present thousands of possible dedupe
src/dst permutations to choose from.  The kernel shouldn't be trying to
second-guess an application that may have access to external information
to make better decisions (e.g. the full set of src extents available,
or knowledge of other calls the app will issue in the future).

> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.
> 
> In a parallel universe (or a better filesystem, or a userspace emulation
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks these point to, then adjusts all the references to the
> dst blocks at once, dealing with partial overlaps and snapshots and
> nodatacow and whatever other exotic features might be lurking in the
> filesystem, ending with every reference to every part of dst replaced
> by the longest possible contiguous reference(s) to src.
> 
> Problems arise if the length deduped is not exactly the length requested.
> If the search continues until a mismatch is found, where does the search
> for a mismatch lead?  Does the search follow physically contiguous
> blocks on disk, or would dedupe follow logically contiguous blocks in
> the src and dst files?  Or the intersection of those, i.e. physically
> contiguous blocks that are logically contiguous in _any_ two files,
> not limited to src and dst.
> 
> There is also the problem where the files could have been previously
> deduped and then partially overwritten with identical data.  If the
> application cannot control where the dedupe search for identical data
> ends, it can end up accidentally creating new references to extents
> while it is trying to eliminate those extents.  The kernel might do a
> lot of extra work from looking ahead that the application has to undo
> immediately (e.g. after the first few blocks of dst, the app wants to
> do another dedupe with a better src extent elsewhere on the filesystem,
> but the kernel goes ahead and dedupes with an inferior src beyond the
> end of what the app asked for).
> 
> bees tries to determine exactly the set of dedupe requests required to
> remove all references to duplicate extents (and maybe someday do defrag
> as well).  If the kernel deviates from the requested sizes (e.g. because
> the data changed on the filesystem between dedup requests), the final
> extent layout after the dedupe requests are finished won't match what
> bees expected it to be, so bees has to reexamine the filesystem and
> either retry with a fresh set of exact dedupe requests, or give up and
> leave duplicate extents on disk.  The retry loop is normal and ends
> quickly if the dedupe fails because data changed on disk, but if the
> kernel starts messing with the dedupe lengths then bees has to detect
> this and escape before it gets stuck in a nasty feedback loop.
> 
> If we want to design a new interface, it should allow the app to specify
> maximum and minimum length, so that the kernel knows how much flexibility
> it is allowed by the application.  Maximum length lets one app say
> "dedupe as much as you can find, up to EOF", while minimum length lets
> another app say "don't bother if the match is less than 12K, the space
> saving is not worth the write iops", and setting them equal lets the
> third app say "I have a plan that requires you to do precisely what I
> tell you or do nothing at all."
> 
> > > 	- should file cloning (i.e. reflink) be allow allowing the
> > > 	  EOF block to be shared somewhere inside EOF in the
> > > 	  destination file?
> > 
> > I don't think we should.
> > 
> > > That's stale data exposure, yes?
> > 
> > Haven't tested that, but seems likely.
> 
> It doesn't sound like a good idea, especially if mmap is involved.
> 
> > > 	- should clone only allow sharing of the EOF block if it's a
> > > 	  either a full file clone or a matching range clone and
> > > 	  isize is the same for both src and dest files?
> > 
> > The above sound reasonable for clone, though I would also allow clone to
> > share the EOF block if we extend isize of the dest file.
> > 
> > The above also nearly sound reasonable for dedupe too.  If we encounter
> > EOF at the same places in the src range and the dest range, should we
> > allow sharing there?  The post-eof bytes are undefined by definition;
> > does undefined == undefined?
> 
> > > 
> > > Discuss.
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > > 
> > > [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Exposed by fstests generic/505 on XFS, caused by extending the
> > > bblock match range to include the partial EOF block, but then
> > > allowing unknown data beyond EOF to be considered a "match" to data
> > > in the destination file because the comparison is only made to the
> > > end of the source file. This corrupts the destination file when the
> > > source extent is shared with it.
> > > 
> > > Open questions:
> > > 
> > > 	- is documenting rejection on request alignment grounds
> > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > 	  developers to understand what is going on here?
> > > 	- should we just silently round down the EOF dedupe request
> > > 	  to the block before EOF so dedupe still succeeds?
> > > 	- should file cloning (i.e. reflink) be allow allowing the
> > > 	  EOF block to be shared somewhere inside EOF in the
> > > 	  destination file? That's stale data exposure, yes?
> > > 	- should clone only allow sharing of the EOF block if it's a
> > > 	  either a full file clone or a matching range clone and
> > > 	  isize is the same for both src and dest files?
> > > 
> > > Btrfs also has the same bug in extent_same_check_offsets() and
> > > btrfs_extent_same_range() that is not addressed by this patch.
> > 
> > <nod> (xfs/ocfs2) clone ioctls tend to be bug-for-bug compatible with
> > btrfs more often than we probably ought to... :/
> > 
> > (I also sorta wonder if btrfs should be ported to the common vfs
> > routines for clone prep and dedupe comparison...?)
> > 
> > --D
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/read_write.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 153f8f690490..3844359a4597 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
> > >  			return -EINVAL;
> > >  	}
> > >  
> > > -	/* If we're linking to EOF, continue to the block boundary. */
> > > -	if (pos_in + *len == isize)
> > > +	/* Reflink allows linking to EOF, so round the length up to allow us to
> > > +	 * shared the last block in the file. We don't care what is beyond the
> > > +	 * EOF point in the block that gets shared, as we can't access it and
> > > +	 * attempts to extent the file will break the sharing.
> > > +	 *
> > > +	 * For dedupe, sharing the EOF block is illegal because the bytes beyond
> > > +	 * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> > > +	 * we do not extend/align the length and instead let the alignment
> > > +	 * checks below reject it.
> > > +	 */
> > > +	if (!is_dedupe && pos_in + *len == isize)
> > >  		blen = ALIGN(isize, bs) - pos_in;
> > >  	else
> > >  		blen = *len;
Dave Chinner Aug. 30, 2018, 6:27 a.m. UTC | #8
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > 	- is documenting rejection on request alignment grounds
> > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > 	  developers to understand what is going on here?
> > 
> > I think so.  The manpage says: "The filesystem does not support
> > reflinking the ranges of the given files", which (to my mind) covers
> > this case of not supporting dedupe of EOF blocks.
> 
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> this; however, on btrfs, not supporting dedupe of EOF blocks means small
> files (one extent) cannot be deduped at all, because the EOF block holds
> a reference to the entire dst extent.  If a dedupe app doesn't go all the
> way to EOF on btrfs, then it should not attempt to dedupe any part of the
> last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application
issue.

> The app developer would need to be aware that such a restriction could
> exist on some filesystems, and be able to distinguish this from other
> cases that could lead to EINVAL.  Portable code would have to try a dedupe
> up to EOF, then if that failed, round down and retry, and if that failed
> too, the app would have to figure out which filesystem it's running on
> to know what to do next.  Performance demands the app know what the FS
> will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man
page first, and work out what the common subset of functionality is
and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments
to be aligned to the fundamental block size."

IOWs, code compatible with starts with supporting the general case.
i.e. a range rounded to filesystem block boundaries (it's already
run fstat() on the files it wants to dedupe to find their size,
yes?), hence ignoring the partial EOF block. Will just work on
everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
fstatvfs to determine what fs it's operating on and applies the
necessary quirks. For btrfs it can extend the range to include the
partial EOF block, and hence will handle the implementation quirks
btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing
about with offsets and lengths to avoid unexpected EINVAL errors.

> btrfs dedupe reports success if the src extent is inline and the same
> size as the dst extent (i.e. file is smaller than one page).  No dedupe
> can occur in such cases--a clone results in a simple copy, so the best
> a dedupe could do would be a no-op.  Returning EINVAL there would break
> a few popular tools like "cp --reflink".  Returning OK but doing nothing
> seems to be the best option in that case.

Again, those are a filesystem implementation issues, not problems
with the API itself.

> > > 	- should we just round down the EOF dedupe request to the
> > > 	  block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.
> 
> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate
references across the same set of data blocks on disk, then that data is
already heavily deduplicated.  Trying to optimise that further
seems.... misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks these point to, then adjusts all the references to the
> dst blocks at once,

That's XFS dedupe in a nutshell. And we aren't in a parallel
universe here. :P

> dealing with partial overlaps and snapshots and nodatacow
> and whatever other exotic features might be lurking in the
> filesystem, ending with every reference to every part of dst replaced
> by the longest possible contiguous reference(s) to src.

XFS doesn't have partial overlaps, we don't have nodatacow hacks,
and the subvol snapshot stuff I'm working on just uses shared data
extents so it's 100% compatible with dedupe.

[snip btrfs dedupe issues]

Again, it just seems to me like the problems you are describing are
complexity problems that arise from the filesystem implementation
and all the hoops you have to jump through to work around them. It
doesn't seem to have anything to do with problems in the dedupe
API...

> If we want to design a new interface, it should allow the app to specify
> maximum and minimum length, so that the kernel knows how much flexibility
> it is allowed by the application.  Maximum length lets one app say
> "dedupe as much as you can find, up to EOF", while minimum length lets
> another app say "don't bother if the match is less than 12K, the space
> saving is not worth the write iops", and setting them equal lets the
> third app say "I have a plan that requires you to do precisely what I
> tell you or do nothing at all."

OK, we didn't need a "btrfs is insane" story to justify this
proposal - it's an entirely reasonable set of control requests.
IOWs, you want the current API (do exactly what I say),
Darricks proposed API (do as much as you can) and a new behaviour
(do at least this much) all rolled into one interface.

So, cribbing from copy_file_range(), a syscall like:

ssize_t dedupe_file_range(int fd_src, loff_t *off_src,
			 int fd_dst, loff_t *off_dst,
			 size_t len, unsigned int flags, u64 optval);

With the control flags:

#define DDFR_F_TRY		(0)	/* default: as much as possible */
#define DDFR_F_EXACT		(1<<0)	/* exactly what is asked or fail */
#define DDFR_F_MINLEN		(1<<1)	/* at least as much as optval says or fail */

And all return the number of bytes deduped in the range that was
specified.

Perhaps you'd like to write a man page describing how it should all
work?

Cheers,

Dave.
Zygo Blaxell Aug. 31, 2018, 5:10 a.m. UTC | #9
On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > 	- is documenting rejection on request alignment grounds
> > > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > > 	  developers to understand what is going on here?
> > > 
> > > I think so.  The manpage says: "The filesystem does not support
> > > reflinking the ranges of the given files", which (to my mind) covers
> > > this case of not supporting dedupe of EOF blocks.
> > 
> > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > files (one extent) cannot be deduped at all, because the EOF block holds
> > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > last extent of the file as the benefit would be zero or slightly negative.
> 
> That's a filesystem implementation issue, not an API or application
> issue.

The API and application issue remains even if btrfs is not considered.
btrfs is just the worst case outcome.  Other filesystems still have
fragmentation issues, and applications have efficiency-vs-capability
tradeoffs to make if they can't rely on dedupe-to-EOF being available.

Tools like 'cp --reflink=auto' work by trying the best case, then falling
back to a second choice if the first choice returns an error.  If the
second choice fails too, the surprising behavior can make inattentive
users lose data.

> > The app developer would need to be aware that such a restriction could
> > exist on some filesystems, and be able to distinguish this from other
> > cases that could lead to EINVAL.  Portable code would have to try a dedupe
> > up to EOF, then if that failed, round down and retry, and if that failed
> > too, the app would have to figure out which filesystem it's running on
> > to know what to do next.  Performance demands the app know what the FS
> > will do in advance, and avoid a whole class of behavior.
> 
> Nobody writes "portable" applications like that. 

As an app developer, and having studied other applications' revision
histories, and having followed IRC and mailing list conversations
involving other developers writing these applications, I can assure
you that is _exactly_ how portable applications get written around
the dedupe function.

Usually people start from experience with tools that use hardlinks to
implement dedupe, so the developer's mental model starts with deduping
entire files.  Their first attempt does this:

	stat(fd, &st);
	dedupe( ..., src_offset = 0, dst_offset = 0, length = st.st_size);

then subsequent revisions of their code cope with limits on length,
and then deal with EINVAL on odd lengths, because those are the problems
that are encountered as the code runs for the first time on an expanding
set of filesystems.  After that, they deal with implementation-specific
performance issues.

Other app developers start by ignoring incomplete blocks, then compare
their free-space-vs-time graphs with other dedupe apps on the same
filesystem, then either adapt to handle EOF properly, or just accept
being uncompetitive.

> They read the man
> page first, and work out what the common subset of functionality is
> and then code from that. 

> Man page says:
> 
> "Disk filesystems generally require the offset and length arguments
> to be aligned to the fundamental block size."

> IOWs, code compatible with starts with supporting the general case.
> i.e. a range rounded to filesystem block boundaries (it's already
> run fstat() on the files it wants to dedupe to find their size,
> yes?), hence ignoring the partial EOF block. Will just work on
> everything.

Will cause a significant time/space performance hit too.  EOFs are
everywhere, and they have a higher-than-average duplication rate
for their size.  If an application assumes EOF can't be deduped on
every filesystem, then it leaves a non-trivial amount of free space
unrecovered on filesystems that can dedupe EOF.  It also necessarily
increases fragmentation unless the filesystem implements file tails
(where it keeps fragmentation constant as the tail won't be stored
contiguously in any case).

> Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
> fstatvfs to determine what fs it's operating on and applies the
> necessary quirks. For btrfs it can extend the range to include the
> partial EOF block, and hence will handle the implementation quirks
> btrfs has with single extent dedupe.
> 
> Simple, reliable, and doesn't require any sort of flailing
> about with offsets and lengths to avoid unexpected EINVAL errors.

It would be better if the filesystem was given the EOF block by the
application if it is the application's intent to dedupe/clone the end
of the file.  If the filesystem isn't able to handle the EOF block, then
it can simply round the length down.  If the filesystem does handle the
EOF block specially then it simply does so.  This handles all the broken
btrfs behaviors as well as the behavior of better filesystems.

That is simple, reliable, doesn't require any sort of flailing about
with offsets and lengths to avoid unexpected EINVAL errors, and *also*
handles EOF consistently on all filesystems.

> > > > 	- should we just round down the EOF dedupe request to the
> > > > 	  block before EOF so dedupe still succeeds?
> > > 
> > > I've often wondered if the interface should (have) be(en) that we start
> > > at src_off/dst_off and share as many common blocks as possible until we
> > > find a mismatch, then tell userspace where we stopped... instead of like
> > > now where we compare the entire extent and fail if any part of it
> > > doesn't match.
> > 
> > The usefulness or harmfulness of that approach depends a lot on what
> > the application expects the filesystem to do.
> > 
> > In btrfs, the dedupe operation acts on references to data, not the
> > underlying data blocks.  If there are 1000 slightly overlapping references
> > to a single contiguous range of data blocks in dst on disk, each dedupe
> > operation acts on only one of those, leaving the other 999 untouched.
> > If the app then submits 999 other dedupe requests, no references to the
> > dst blocks remain and the underlying data blocks can be deleted.
> 
> Assuming your strawman is valid, if you have a thousand separate
> references across the same set of data blocks on disk, then that data is
> already heavily deduplicated.  Trying to optimise that further
> seems.... misguided, way down the curve of diminishing returns.

It is something that naive dedupe apps will do.  "naive" here meaning
"does not dive deeply into the filesystem's physical structure (or survey
the entire filesystem with FIEMAP) to determine that the thousand-refs
situation does not exist at dst prior to invoking the dedupe() call."

A lot of dedupe tools are just file tree walkers.  duperemove, bees, and
bedup are exceptions that look deeper into the filesystem structure, but
even duperemove can be configured to run as a simple file tree walker
to save time and memory (well, to spend time and memory
differently--nothing is ever free...).

File-tree-walking dedupe tools are typically given a collection of files
and told to locate duplicate data blocks inside the tree and use dedupe to
remove them.  These files may have already been deduped before by previous
runs of the tool, or by different tools, or by other block-sharing
filesystem primitives like clones or snapshots.  If a user does this
by accident:

	dedupe_tree_walker ... A/1/

	dedupe_tree_walker ... A/2/

	dedupe_tree_walker ... A

then the user not only hits the 'thousands of dst refs' case, but may
hit it thousands of times.  A/1 and A/2 will have lots of duplicate
blocks, but because they were deduped in separate runs, each tree
doesn't have references to the other tree's shared blocks.  The third
dedupe_tree_walker at the end will then present the kernel with src/dst
pairs that have thousands of refs to both src and dst.

If you plot the frequency of block contents on a histogram, there are
usually a few blocks that can appear thousands or even *millions* of
times per TB.

> XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> and the subvol snapshot stuff I'm working on just uses shared data
> extents so it's 100% compatible with dedupe.

If you allow this sequence of operations, you get partial overlaps:

	dedupe(fd1, 0, fd2, 0, 1048576);

	dedupe(fd2, 16384, fd3, 0, 65536);

If you don't allow that, then you have more significant limitations
than I know what to do with from a dedupe application.

> [snip btrfs dedupe issues]
> 
> Again, it just seems to me like the problems you are describing are
> complexity problems that arise from the filesystem implementation
> and all the hoops you have to jump through to work around them. It
> doesn't seem to have anything to do with problems in the dedupe
> API...

The current API man page doesn't explicitly say length can or cannot
include non-block-aligned EOF.  The original question was whether requests
to dedupe/clone unaligned EOF blocks should be allowed to return EINVAL,
and my answer is that they should not, because of accumulated experience
using historical versions of dedupe_file_range before and after
that change in behavior.

Unaligned EOF should always be accepted for dedupe, but a filesystem may
silently round down the length if the filesystem can't share a partial
block at EOF.  Note the dedupe man page already allows filesystems
to reduce the length without notice ("The maximum size of src_length
is filesystem dependent and is typically [i.e. not always] 16 MiB.
This limit will be enforced silently by the filesystem.").

If an app is relying on the "data differs" result to change app behavior
(merely printing the result doesn't count), then the app should already be
checking the length.  As far as I know I'm the only person who has written
such an app, and I didn't bother to publish it (it wasn't as fast as other
apps that invoke dedupe_file_range() only when they expect it to succeed).

I have no opinion on what should happen if offset + length != st_size
for one or both files.  Any of DATA_DIFFERS, EINVAL, or dedupe success
with a reduced length is OK.  A typical dedupe app will avoid making
such a call intentionally, i.e. this case will only come up as a result
of losing a race with a change in file size.

For the clone case (specifically copy_file_range) it seems that from the
man page there is no expectation of EINVAL based on length alignment,
so a fallback to copy would be required if the filesystem can't share
blocks at EOF.

> > If we want to design a new interface, it should allow the app to specify
> > maximum and minimum length, so that the kernel knows how much flexibility
> > it is allowed by the application.  Maximum length lets one app say
> > "dedupe as much as you can find, up to EOF", while minimum length lets
> > another app say "don't bother if the match is less than 12K, the space
> > saving is not worth the write iops", and setting them equal lets the
> > third app say "I have a plan that requires you to do precisely what I
> > tell you or do nothing at all."
> 
> OK, we didn't need a "btrfs is insane" story to justify this
> proposal 

btrfs has a long list of issues, but many of the dedupe issues are
really problems that arise from manipulating shared references to blocks
in general.  Different filesystems can move the costs around (i.e. doing
more work inside the dedupe ioctl so it doesn't have to be done later by
other userspace or the kernel) but there are some pretty fundamental costs
common to any filesystem that implements CoW block sharing (e.g. data
fragmentation performance loss and/or extra iops for defragmentation if
dedupe is used too aggressively).

- it's an entirely reasonable set of control requests.
> IOWs, you want the current API (do exactly what I say),
> Darricks proposed API (do as much as you can) and a new behaviour
> (do at least this much) all rolled into one interface.

> So, cribbing from copy_file_range(), a syscall like:
> 
> ssize_t dedupe_file_range(int fd_src, loff_t *off_src,
> 			 int fd_dst, loff_t *off_dst,
> 			 size_t len, unsigned int flags, u64 optval);
> 
> With the control flags:
> 
> #define DDFR_F_TRY		(0)	/* default: as much as possible */
> #define DDFR_F_EXACT		(1<<0)	/* exactly what is asked or fail */
> #define DDFR_F_MINLEN		(1<<1)	/* at least as much as optval says or fail */
> 
> And all return the number of bytes deduped in the range that was
> specified.

I think it would be a bad idea to make DDFR_F_TRY the default behavior--on
any filesystem.  It's not what any existing block-oriented dedupe app
design expects, or would even find useful if it was available.  It's a
special niche case that only works when you know in advance that you
have exactly two files (or big contiguous file offset ranges) and you
already know or suspect they are identical.

My unpublished app that used the "data_differs" response could have used
this behavior for this special case.  The original app was designed for
the special case of completely duplicate files, and looks like this:

	off_t find_next_same_block(int fd1, int fd2, off_t different_block, off_t file_size);

	for (pos = 0 .. file_size) {
		length = min(file_size - pos, SZ_16M);
		if (0 < (same_length = dedupe_file_range(fd1, &pos, fd2, &pos, length, DDFR_F_EXACT, 0))) {
			pos += same_length;
		} else {
			// error handling deleted
			// skip over different blocks, or just exit here if we don't care
			pos = find_next_same_block(fd1, fd2, pos, file_size);
		}
	}

With DDFR_F_TRY, it would be changed to look like this:

	for (pos = 0 .. file_size) {
		length = file_size - pos;
		if (0 < (same_length = dedupe_file_range(fd1, &pos, fd2, &pos, length, DDFR_F_TRY, 0))) {
			// same_length is possibly different from (file_size - pos)
			// if we hit a different block somewhere, so keep looping
			pos += same_length;
		} else {
			// error handling deleted
			// skip over different blocks, or just exit here if we don't care
			pos = find_next_same_block(fd1, fd2, pos, file_size);
		}
	}

In the case where the files are identical, the gain from eliding a few
loop iterations is negligible compared to the current API's DDFR_F_EXACT
behavior (microseconds per terabyte).

When there are differences between two files, the app may have to skip
over the different data, and that eats all the gains from making fewer
kernel calls for the duplicate data.  It's faster to do this with reads
than by calling the dedupe ioctl (with a DATA_DIFFERS result) because
reads require fewer and less expensive locks.

I've left out some other details in my simple two-file dedupe app: the
app looks at the structure of the file with FIEMAP, and occasionally
swaps src and dst fd's, or skips existing deduped ranges.  That behavior
doesn't work with DDFR_F_TRY at all.

Arguably we could just add more flags and behavior requirements
(DDFR_F_SWAP to swap src and dst FD automatically, and automatically
skip parts of src and dst that are already deduped) so that the kernel
does all this if requested.  In this case we would be building the
behavior of an entire simple file-oriented dedupe app into the kernel.
That seems like a lot of policy knobs given that these capabilities
1) already exist in userspace and 2) were abandoned by their author
for not being particularly performant or capable.

> Perhaps you'd like to write a man page describing how it should all
> work?

The expected behavior at EOF for the existing dedupe API should be
clearer.  I can contribute text for that.

The thing is, I like the existing dedupe as is.  It is useful precisely
because what it does is so tightly bounded.  Proposals like DDFR_F_MINLEN
are just responses to limit damage from other change proposals like
DDFR_F_TRY.  I see no pressing need for any of them.

I don't want to see what happened to the btrfs defrag ioctl happening to
the dedupe ioctl--there, the kernel accumulated so many quirky heuristics
that the ioctl became useless for any purpose other than implementing
'btrfs fi defrag', and I ended up having to emulate the defrag ioctl in
userspace to be able to combine it with dedupe.

For future development I've abandoned the entire dedupe_file_range
approach.  I need to be able to read and dedupe the data blocks of
the filesystem directly without having to deal with details like which
files those blocks belong to, especially on filesystems with lots of
existing deduped blocks and snapshots.  The file structure is frankly
just noise for dedupe on large filesystems.  I'm building a translation
layer for bees that does this--i.e. the main dedupe loop works only with
raw data blocks, and the translation layer maps read(blocknr, length)
and dedupe(block1, block2, length) requests onto the existing kernel
read(fd, offset, length) and dedupe(fd1, off1, fd2, off2, length) calls.
If the end result of that development work meets my performance
targets--I've built over a dozen dedupe apps, and only two were good
enough to release so far--then I'd propose moving those primitives into
the kernel.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Sept. 6, 2018, 8:38 a.m. UTC | #10
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > > 	- is documenting rejection on request alignment grounds
> > > > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > > > 	  developers to understand what is going on here?
> > > > 
> > > > I think so.  The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > > 
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> > 
> > That's a filesystem implementation issue, not an API or application
> > issue.
> 
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome.  Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
> 
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to
user expectations.  That's not a kernel API issue - that's just an
implementation of an *application requirement*.  Indeed, this is
identical to the behaviour of rename() in "mv" - if rename fails
with -EXDEV, mv needs to fall back to a manual copy because the user
expects the file to be moved.

IOWS, these application level requirements you talk about are just
not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do.  "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap
....
DESCRIPTION
       This ioctl(2) operation retrieves physical extent mappings
       for a filesystem.  This information can be used to discover
       which files are mapped to a physical block, examine free
       space, or find known bad blocks, among other things.
.....

I don't really care about "enabling" naive, inefficient
applications. I care about applications that scale to huge
filesystems and can get the job done quickly and efficiently.

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
> 
> If you allow this sequence of operations, you get partial overlaps:
> 
> 	dedupe(fd1, 0, fd2, 0, 1048576);
> 
> 	dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were refering to sharing partial
filesystem blocks (like at EOF) because that's what this discussion
was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach.  I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots. 

IOWs, your desired OOB dedupe algorithm is:

	a) ask the filesystem where all it's file data is
	b) read that used space to build a data hash index
	c) on all the data hash collisions find the owners of the
	   colliding blocks
	d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the
obvious solution to an expert in the field.

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
the SGI^WHPE HSM scanning tools, etc all avoid the directory
structure because it's so slow. XFS's bulkstat interface, OTOH, can
scan for target inodes at a over a million inodes/sec if you've got
the IO and CPU to throw at it....

> I'm building a translation
> layer for bees that does this--i.e. the main dedupe loop works only with
> raw data blocks, and the translation layer maps read(blocknr, length)
> and dedupe(block1, block2, length) requests onto the existing kernel
> read(fd, offset, length) and dedupe(fd1, off1, fd2, off2, length)i

That's FS_IOC_GETFSMAP. :P

FYI, in 2012 I came up with a plan for dedupe in XFS:

	a) GETFSMAP to query reverse map tree to find file
	   data blocks (don't try to dedupe unused space)
	b) read the underlying block device in ascending block
	   order and hash each block to build a collision tree.
	   Trivially parallelised.
	c) GETFSMAP to query rmap for the owners of colliding
	   blocks
	d) For all file data owner maps of a colliding block
		- bulkstat the inode numbers returned by GETFSMAP
		  and open_by_handle to get fd's
		- run dedupe syscall

We've finally got all the infrastructure we need to write this app
for XFS - we just need to write it and integrate it into xfs_fsr....

> calls.
> If the end result of that development work meets my performance
> targets--I've built over a dozen dedupe apps, and only two were good
> enough to release so far--then I'd propose moving those primitives into
> the kernel.

dedupe has been around for a long time and there hasn't been any
breakthroughs in IO efficient OOB dedupe scanning algorithms for
years.  OOB dedupe is not actually that difficult to do efficiently,
it's just that most people trying to do it lack the knowledge and/or
insight to make the jump that it requires reverse block mapping, not
directory traversal.

You've taken a dozen or so failures to realise what was obvious to
me at the outset - your translation layer will essentially hold the
same information that XFS already maintains on disk in it's rmap
btrees. IIRC btrfs has semi-equivalent reverse mapping functionality
on disk (back pointers?), so maybe your best bet would be to
implement FSMAP for btrfs and then you have a dedupe app that works
efficiently on both XFS and btrfs...

Cheers,

Dave.
Zygo Blaxell Sept. 7, 2018, 3:53 a.m. UTC | #11
On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > > > 	- is documenting rejection on request alignment grounds
> > > > > > 	  (i.e. EINVAL) in the man page sufficient for app
> > > > > > 	  developers to understand what is going on here?
> > > > > 
> > > > > I think so.  The manpage says: "The filesystem does not support
> > > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > > this case of not supporting dedupe of EOF blocks.

[snip]

> tl;dr we don't need a new clone or dedupe API

Well, I wouldn't say _that_ yet.

It seems less likely that the API we need will be a minor revision of the
existing clone or dedupe and more likely that it will be a complementary
API for OOB dedupe engines, or a dedupe API that takes (device,
physical, length) tuples instead of (fd, offset, length) tuples.

> > For future development I've abandoned the entire dedupe_file_range
> > approach.  I need to be able to read and dedupe the data blocks of
> > the filesystem directly without having to deal with details like which
> > files those blocks belong to, especially on filesystems with lots of
> > existing deduped blocks and snapshots. 
> 
> IOWs, your desired OOB dedupe algorithm is:
> 
> 	a) ask the filesystem where all it's file data is

Actually, it's "ask the filesystem where all the *new* file data is"
since we don't want to read any unique data twice on subsequent runs.

> 	b) read that used space to build a data hash index
> 	c) on all the data hash collisions find the owners of the
> 	   colliding blocks
> 	d) if the block data is the same dedupe it
> 
> I agree - that's a simple and effective algorithm. It's also the
> obvious solution to an expert in the field.

It's obvious, but if you literally implement that, you end up with a
filesystem shattered into billions of fragments.

At step c) I need to figure out _which_ colliding block I want to keep.
That's where the algorithm needs to inspect the higher-level file
structure to determine the physical and logical situation for each block.
That leads to one of:

	a) store the hashes and block locations in the hash table
	b) relocate the blocks (defrag)
	c) replace some blocks with a reference to other blocks (dedupe)

This sounds a bit like what xfs_fsr does (or will do?).

Bees also operates under a constant-RAM constraint, so it doesn't operate
in two distinct "collect data" and "act on data collected" passes,
and cannot spend memory to store data about more than a few extents at
any time.  bees dedupes contiguous blocks pairwise as the second duplicate
block is discovered--sometimes keeping the second block, if the selection
algorithm prefers new blocks over blocks previously encountered (e.g. if
a defragmented version of an extent comes along, and we don't want to
undo the defrag operation by deduping non-defragged extents over it).

This does not produce the fastest possible OOB dedupe result--that comes
from an algorithm that does operate in two passes and sorts duplicate
extents by length between the passes.  The fast algorithm produces a
free space vs time curve that starts with a long flat interval followed
by a steep slope that gradually flattens over time, and is suitable
for filesystems that are small enough that the entire filesystem can
be processed in a batch run.  The bees algorithm produces a continuous
straight-ish line that is more suitable for continuous incremental
operation using O(1) memory.

> > The file structure is frankly
> > just noise for dedupe on large filesystems.
> 
> We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
> the SGI^WHPE HSM scanning tools, etc all avoid the directory
> structure because it's so slow. XFS's bulkstat interface, OTOH, can
> scan for target inodes at a over a million inodes/sec if you've got
> the IO and CPU to throw at it....

A million inodes/sec?  btrfs TREE_SEARCH_V2 barely manages 2000 inodes/sec
on a 2.2GHz AMD A6--and that's just in the kernel, not counting delivery
of the data to userspace.  I'm clearly using the wrong filesystem here.

> > I'm building a translation
> > layer for bees that does this--i.e. the main dedupe loop works only with
> > raw data blocks, and the translation layer maps read(blocknr, length)
> > and dedupe(block1, block2, length) requests onto the existing kernel
> > read(fd, offset, length) and dedupe(fd1, off1, fd2, off2, length)i
> 
> That's FS_IOC_GETFSMAP. :P

No, that's the layer that _uses_ FS_IOC_GETFSMAP.  ;)

> FYI, in 2012 I came up with a plan for dedupe in XFS:
> 
> 	a) GETFSMAP to query reverse map tree to find file
> 	   data blocks (don't try to dedupe unused space)
> 	b) read the underlying block device in ascending block
> 	   order and hash each block to build a collision tree.
> 	   Trivially parallelised.
> 	c) GETFSMAP to query rmap for the owners of colliding
> 	   blocks
> 	d) For all file data owner maps of a colliding block
> 		- bulkstat the inode numbers returned by GETFSMAP
> 		  and open_by_handle to get fd's
> 		- run dedupe syscall

I think steps c and d show room for improvement.

I'd like the filesystem to receive a pair of contiguous colliding block
ranges, and have the filesystem locate and update all the owners at once
after performing the comparison (skip updating any owners that have
changed after the comparison).  That effectively merges c and d into
a single syscall.  There are a few advantages to doing it this way,
especially when it comes to requirements for locking (e.g. no need to
lock files against modification during the block compare).

With the existing dedupe syscall, the app has to dig up from the (device,
physical) tuple through the filesystem to get (fd, offset) tuples to pass
to dedupe, which will dig down through the filesystem to turn them back
into (device, physical) tuples again for comparison and then dig back
up again to modify the (inode, offset) tuples for any other references.
It seems like the filesystem could skip one leg of this journey if it
starts with the block address.  If the data changed underneath so the
comparison fails, none of the other work is necessary, and most of the
other work requires seeks.

> You've taken a dozen or so failures to realise what was obvious to
> me at the outset

It was obvious to me too; however, I have to deliver a dedupe engine that
doesn't blow up the system it's running on, and calling LOGICAL_INO on
btrfs more than absolutely necessary has historically been a bad idea.

I tried many alternatives to avoid this (basically ordering the dedupe
syscalls to try to converge on the correct result eventually), but some
use cases (especially snapshots) cannot be done sanely any other way,
so I'm pressing ahead with pure physical block addressing and hoping
that the filesystems catch up.

> - your translation layer will essentially hold the
> same information that XFS already maintains on disk in it's rmap
> btrees. IIRC btrfs has semi-equivalent reverse mapping functionality
> on disk (back pointers?), so maybe your best bet would be to
> implement FSMAP for btrfs and then you have a dedupe app that works
> efficiently on both XFS and btrfs...

btrfs has LOGICAL_INO_V2 and TREE_SEARCH_V2 ioctls which provide the
same sort of data as GETFSMAP in a differently structured address space.
INO_PATHS lets me implement an equivalent of open_by_handle.

The existing translation layer in bees uses the information in the
existing filesystem structures to map logical to physical and back
as required.  The new translation layer operates _only_ in the physical
space at the top, so as far as the core dedupe/defrag engine is concerned,
the filesystem underneath is just an iterable collection of contiguous
data extents that can be read and deduped directly.

The only data bees stores outside of the filesystem is the dedupe block
hash table, the current value of the extent iterator (aka the low key for
the next GETFSMAP call), and the transid of the filesystem the last time
the extent iterator was reset.  btrfs embeds the transid (an indicator
of age) into every data extent record and into metadata page headers, so
bees can rapidly skip old extents in SEARCH_V2 for fast incremental scans.

The translation layer can work as long as every physical block has a
unique address in a data type that has addition, subtraction, less-than
and bisection operators, and the filesystem can map a position in physical
address space to a position in logical address space and back.  I only
need bisection for cases that require backward iteration from a known
key (btrfs csums require this, and the new translation layer enables the
dedupe code to use the existing csums instead of reading and hashing the
data blocks).

On btrfs I open the files to do reading instead of scraping the
block device directly--IMHO a reasonable decision, given that btrfs
implements RAID and compression to make raw block read from a device
very complicated.  Profiling so far has never shown open() to have
significant overhead compared to the other operations; however, if
btrfs had a "I am root, read this arbitrary data block for me" ioctl,
I'd eliminate even that little overhead.

It looks like "open device, seek to physical, read block" is sufficient
for XFS (no compression or other encoding to worry about).

btrfs uses a 64-bit virtual address space (compatible with FIEMAP),
GETFSMAP uses a 32-bit device ID and a 64-bit offset.  This means all
bees DDT entries are immediately 25% bigger on XFS, unless I enumerate the
device IDs and recycle some zero bits in .physical as a device ID index.

I wonder if a future btrfs implementation of GETFSMAP will use the
device ID and physical offset fields, or just return a constant device
ID with .physical carrying the 64-bit virtual address that the rest of
the filesystem uses.

I wonder how compressed data would work with GETFSMAP?  In btrfs the
LOGICAL_INO and FIEMAP ioctls treat the entire compressed extent as a
single entity with no distinct blocks inside.  Bees needs a unique address
for every block, even the compressed ones, so some of that is handled
by emulation using TREE_SEARCH_V2 in the translation layer (the physical
data type is 64-bit extent base plus an uncompressed block offset).

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Sept. 10, 2018, 9:06 a.m. UTC | #12
On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > For future development I've abandoned the entire dedupe_file_range
> > > approach.  I need to be able to read and dedupe the data blocks of
> > > the filesystem directly without having to deal with details like which
> > > files those blocks belong to, especially on filesystems with lots of
> > > existing deduped blocks and snapshots. 
> > 
> > IOWs, your desired OOB dedupe algorithm is:
> > 
> > 	a) ask the filesystem where all it's file data is
> 
> Actually, it's "ask the filesystem where all the *new* file data is"
> since we don't want to read any unique data twice on subsequent runs.

Sorry, how do you read "unique data" twice? By definition, unique
data only occurs once....

Oh, and you still need to hash the old data so you can find
collisions with the new data that got written. Unless, of course,
you are keeping your hash tree in a persistent database and can work
out how to prune stale entries out of it efficiently....

And, well, see my comments about media scrubbing below.

> > 	b) read that used space to build a data hash index
> > 	c) on all the data hash collisions find the owners of the
> > 	   colliding blocks
> > 	d) if the block data is the same dedupe it
> > 
> > I agree - that's a simple and effective algorithm. It's also the
> > obvious solution to an expert in the field.
> 
> It's obvious, but if you literally implement that, you end up with a
> filesystem shattered into billions of fragments.

Well, yes, that's obvious. I thought that "details omitted for
reasons of brevity" would be understood, not require omitted details
to be explained to me.

> At step c) I need to figure out _which_ colliding block I want to keep.
> That's where the algorithm needs to inspect the higher-level file
> structure to determine the physical and logical situation for each block.
> That leads to one of:
> 
> 	a) store the hashes and block locations in the hash table
> 	b) relocate the blocks (defrag)
> 	c) replace some blocks with a reference to other blocks (dedupe)
> 
> This sounds a bit like what xfs_fsr does (or will do?).

xfs_fsr currently does the defrag bit w/ the swapext ioctl(*). It'll
get extended to also do dedupe scans eventually, as defrag and
dedupe obviously need to be integrated to prevent them from
duelling.

(*) swapext - in case you might not know about it - atomically
swaps a range of extents between two files without copying data.  So
you remember those dedupe selection problems you were talking about
(details omitted for brevity):

inode 1:	ABCDEFG|H|I|JK|L|M|N|O|P
inode 2:	a|b|c|d|e|f|g|hijklmop

The XFS solution will be to first defrag the master file (inode 1)
with the larger extents from inode 2:

swapext(H-P, h-p)

inode 1:	ABCDEFG|hijklmop
inode 2:	a|b|c|d|e|f|g|H|I|JK|L|M|N|O|P

And the we dedupe to the defragged file:

dedupe(inode 1, inode 2)

inode 1:	ABCDEFG|hijklmop
inode 2:	ABCDEFG|hijklmop

> Bees also operates under a constant-RAM constraint, so it doesn't operate
> in two distinct "collect data" and "act on data collected" passes,
> and cannot spend memory to store data about more than a few extents at
> any time.

I suspect that I'm thinking at a completely different scale to you.
I don't really care for highly constrained or optimal dedupe
algorithms  because those last few dedupe percentages really don't
matter that much to me. I care much more about using all the
resources we can and running as fast as we possibly can, then
providing the admin with means to throttle performance back to what
they need.

i.e. I'm concerned about how to effectively scan and dedupe PBs of
data, where scan rates may need to be measured in tens of GB/s.  In
these cases the limiting factor is either suboptimal IO patterns
(low IO throughput), or not having enough CPU and memory bandwidth
for hashing the data the IO engine is streaming through memory.
Once we get past a certain point, we'll gain far more by
parallelising the dedupe scan over lots of CPU than we ever will by
optimising an algorithm that cannot be parallelised. Hence we need
to start with a parallelisable algorithm, not an artificially
constrained one.

e.g. a soft requirement is that we need to scan the entire fs at
least once a month. That will need to be broken up into nightly
work, with a full 1PB filesystem needing 33+TB a night to be
scanned. We typically get 1 hour a night to do this sort of thing,
meaning the dedupe scan rate needs to sustain 1GB/s for that entire
hour.  Doing this with a single threaded, few extents at a time
algorithm is going to be a real stretch, so we need RAM and
concurrency at both the IO and CPU level to acheive this.

A simple piece-wise per-AG scanning algorithm (like we use in
xfs_repair) could easily work within a 3GB RAM per AG constraint and
would scale very well. We'd only need to scan 30-40 AGs in the hour,
and a single AG at 1GB/s will only take 2 minutes to scan. We can
then do the processing while the next AG gets scanned. If we've got
10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
then we can scan 5-10AGs at once to keep the IO rate up, and process
them in bulk as we scan more.

Then consider that we can use that slowly iterating nightly dedupe
block scan to replace the online media scrubber. i.e. We get a
storage bitrot detecting scan for free with such a OOB dedupe
algorithm.  That's a massive win for enterprise filesystems - we
can cram all our maintenance tasks (defrag, scrubbing and dedupe)
into a single nightly scan, then it's highly predictable and
repeatable behaviour that admins can plan for and save themselves a
lot of unnecessary pain.

> > > The file structure is frankly
> > > just noise for dedupe on large filesystems.
> > 
> > We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
> > the SGI^WHPE HSM scanning tools, etc all avoid the directory
> > structure because it's so slow. XFS's bulkstat interface, OTOH, can
> > scan for target inodes at a over a million inodes/sec if you've got
> > the IO and CPU to throw at it....
> 
> A million inodes/sec?

Yeah, 2.1GHZ Xeons can scan ~200,000 inodes/s each when CPU bound,
and the above number comes from bulkstating 8 different AGs in
parallel on a 100 million inode, 500TB filesystem. It's limited by
lock contention on the sb inode list lock, so fixing that will allow
it to go faster.

> btrfs TREE_SEARCH_V2 barely manages 2000 inodes/sec on a 2.2GHz
> AMD A6--and that's just in the kernel, not counting delivery of
> the data to userspace.  I'm clearly using the wrong filesystem
> here.

Perhaps so.

> > FYI, in 2012 I came up with a plan for dedupe in XFS:
> > 
> > 	a) GETFSMAP to query reverse map tree to find file
> > 	   data blocks (don't try to dedupe unused space)
> > 	b) read the underlying block device in ascending block
> > 	   order and hash each block to build a collision tree.
> > 	   Trivially parallelised.
> > 	c) GETFSMAP to query rmap for the owners of colliding
> > 	   blocks
> > 	d) For all file data owner maps of a colliding block
> > 		- bulkstat the inode numbers returned by GETFSMAP
> > 		  and open_by_handle to get fd's
> > 		- run dedupe syscall
> 
> I think steps c and d show room for improvement.
> 
> I'd like the filesystem to receive a pair of contiguous colliding block
> ranges, and have the filesystem locate and update all the owners at once
> after performing the comparison (skip updating any owners that have
> changed after the comparison).

I'd like a pony, too.

However, I think that such bottom up operations are nearly
impossible to do atomically because they invert the locking orders
of normal operations and are completely open-ended in scope in what
they may need to update.  syscalls are better designed as simple, self
contained operations that you can build into bigger operations, not
provide massive, complex, open-ended operations themselves.

> I wonder if a future btrfs implementation of GETFSMAP will use the
> device ID and physical offset fields, or just return a constant device
> ID with .physical carrying the 64-bit virtual address that the rest of
> the filesystem uses.
> 
> I wonder how compressed data would work with GETFSMAP?  In btrfs the
> LOGICAL_INO and FIEMAP ioctls treat the entire compressed extent as a
> single entity with no distinct blocks inside.  Bees needs a unique address
> for every block, even the compressed ones, so some of that is handled
> by emulation using TREE_SEARCH_V2 in the translation layer (the physical
> data type is 64-bit extent base plus an uncompressed block offset).

Whoever implements GETFSMAP for btrfs will have to work through
those complexities. If btrfs treats something as a undividable
complete object, then GETFSMAP will probably have to report it as
such and not be able to report things like offsets into the object.
It will just have to be documented in the man page in the "btrfs
specific objects" section.

Cheers,

Dave,
Zygo Blaxell Sept. 19, 2018, 4:12 a.m. UTC | #13
On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > For future development I've abandoned the entire dedupe_file_range
> > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > the filesystem directly without having to deal with details like which
> > > > files those blocks belong to, especially on filesystems with lots of
> > > > existing deduped blocks and snapshots. 
> > > 
> > > IOWs, your desired OOB dedupe algorithm is:
> > > 
> > > 	a) ask the filesystem where all it's file data is
> > 
> > Actually, it's "ask the filesystem where all the *new* file data is"
> > since we don't want to read any unique data twice on subsequent runs.
> 
> Sorry, how do you read "unique data" twice? By definition, unique
> data only occurs once....

...but once it has been read, we don't want to read it again.  Ever.
Even better would be to read unique data less than 1.0 times on average.

> Oh, and you still need to hash the old data so you can find
> collisions with the new data that got written. Unless, of course,
> you are keeping your hash tree in a persistent database 

I do that.

> and can work out how to prune stale entries out of it efficiently....

I did that first.

Well, more like I found that even a bad algorithm can still find
most of the duplicate data in a typical filesystem, and there's a
steep diminishing returns curve the closer you get to 100% efficiency.
So I just used a bad algorithm (random drop with a bias toward keeping
hashes that matched duplicate blocks).  There's room to improve that,
but the possible gains are small, so it's at least #5 on the performance
whack-a-mole list and probably lower.

The randomness means each full-filesystem sweep finds a different subset
of duplicates, so I can arbitrarily cut hash table size in half and get
almost all of the match rate back by doing two full scans.  Or I cut
the filesystem up into a few large pieces and feed the pieces through in
different orders on different scan runs, so different subsets of data in
the hash table meet different subsets of data on disk during each scan.
An early prototype of bees worked that way, but single-digit efficiency
gains were not worth doubling iops, so I stopped.

> [...]I thought that "details omitted for
> reasons of brevity" would be understood, not require omitted details
> to be explained to me.

Sorry.  I don't know what you already know.

> > Bees also operates under a constant-RAM constraint, so it doesn't operate
> > in two distinct "collect data" and "act on data collected" passes,
> > and cannot spend memory to store data about more than a few extents at
> > any time.
> 
> I suspect that I'm thinking at a completely different scale to you.
> I don't really care for highly constrained or optimal dedupe
> algorithms  because those last few dedupe percentages really don't
> matter that much to me. 

At large scales RAM is always constrained.  It's the dedupe triangle of
RAM, iops, and match hit rate--any improvement in one comes at the cost
of the others.  Any dedupe can go faster or use less RAM by raising the
block size or partitioning the input data set to make it smaller.

bees RAM usage is a bit more explicitly controlled--the admin tells bees
how much RAM to use, and bees scales the other parameters to fit that.
Other dedupe engines make the admin do math to set parameters to avoid
overflowing RAM with dynamic memory allocations, or leave the admin to
discover what their RAM constraint is the hard way.

One big difference I am noticing in our approaches is latency.  ZFS (and
in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
data occupies disk space for zero time as it is never written to disk
at all) but it requires more RAM for a given dedupe hit rate than any
other dedupe implementation I've seen.  What you've written tells me
XFS saves RAM by partitioning the data and relying on an existing but
very large source of iops (sharing scrub reads with dedupe), but then
the dedupe latency is the same as the scrub interval (the worst so far).
bees aims to have latency of a few minutes (ideally scanning data while
it's still dirty in cache, but there's no good userspace API for that)
though it's obviously not there yet.

> I care much more about using all the
> resources we can and running as fast as we possibly can, then
> providing the admin with means to throttle performance back to what
> they need.
> 
> i.e. I'm concerned about how to effectively scan and dedupe PBs of
> data, where scan rates may need to be measured in tens of GB/s.  

My targets are only one order of magnitude smaller--in the limit, I expect
bees will eventually reach a point where it has matched the btrfs scaling
limits and can go no further (I've bumped into some of them already).

I do have a very different latency goal and different optimization
opportunities because of different capabilities in the filesystems.

> In these cases the limiting factor is either suboptimal IO patterns
> (low IO throughput), or not having enough CPU and memory bandwidth
> for hashing the data the IO engine is streaming through memory.
> Once we get past a certain point, we'll gain far more by
> parallelising the dedupe scan over lots of CPU than we ever will by
> optimising an algorithm that cannot be parallelised. 
> Hence we need to start with a parallelisable algorithm, not an artificially
> constrained one.

It's possible to parallelize reads on btrfs, it's just much more work
to get it done than on XFS.  I'll get around to that eventually, but
right now the gains are relatively small relative to the work required
to implement it.

> e.g. a soft requirement is that we need to scan the entire fs at
> least once a month. 

I have to scan and dedupe multiple times per hour.  OK, the first-ever
scan of a non-empty filesystem is allowed to take much longer, but after
that, if you have enough spare iops for continuous autodefrag you should
also have spare iops for continuous dedupe.

btrfs has age information embedded in the metadata so I can skip scanning
blocks (or entire subtrees) that have not changed since an earlier scan.
This is the "read unique data only once" feature I'm working on.
That can be batched up and run during maintenance windows (one of bees'
most-requested features).

A lot of duplicate data tends to occur close together in time, so
sorting the data by age can also stretch the efficiency of a self-pruning
hash table.

> That will need to be broken up into nightly
> work, with a full 1PB filesystem needing 33+TB a night to be
> scanned. We typically get 1 hour a night to do this sort of thing,
> meaning the dedupe scan rate needs to sustain 1GB/s for that entire
> hour.  Doing this with a single threaded, few extents at a time
> algorithm is going to be a real stretch, so we need RAM and
> concurrency at both the IO and CPU level to acheive this.

On btrfs I can read the block checksums instead of the block data,
eliminating the need to read trivially unique data during scanning (i.e. I
only have to read a data block to calculate a strong hash after at least
one checksum collision).  This is the "read unique data less than once"
feature.

Parallel reading is around #4 on my performance whack-a-mole list.  #1 and
#2 are "read unique data less than once" and #3 is fixing sequential
reads (and somewhere in there is optimizing the scheduling of dedupe vs
scan so they don't slow each other down by seeking and probably half a
dozen btrfs workarounds).  The current read code is just a placeholder
from the first prototype version of bees.  It burns egregious amounts
of CPU to work, but even that code is not the top of the performance
whack-a-mole list today.

Eventually bees will need fast raw parallel reading and maybe divide
big filesystems into smaller chunks for the first read (where bees
can't optimize by ignoring unchanged data).  Maybe btrfs will catch
up on big-filesystem performance, e.g. so that mounting a half-full PB
filesystem takes less than an hour and write amplification is below 1000%.
Maybe by then XFS will have checksums of data (maybe even with a lower
collision rate than crc32) and be able to efficiently select extents by
age so it doesn't have to burn petabytes of iops every month.  If the
feature sets converge, at that point XFS dedupe will probably be faster
just because it doesn't have to work around btrfs...  :-P

> A simple piece-wise per-AG scanning algorithm (like we use in
> xfs_repair) could easily work within a 3GB RAM per AG constraint and
> would scale very well. We'd only need to scan 30-40 AGs in the hour,
> and a single AG at 1GB/s will only take 2 minutes to scan. We can
> then do the processing while the next AG gets scanned. If we've got
> 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> then we can scan 5-10AGs at once to keep the IO rate up, and process
> them in bulk as we scan more.

How do you match dupe blocks from different AGs if you only keep RAM for
the duration of one AG scan?  Do you not dedupe across AG boundaries?

I ask because one of the things I am deduping is multi-terabyte VM images,
and an AG sounds like it's smaller than one of those.

The algorithm could dump the sorted hash table (or just the blocks known
to be duplicates which is usually a much smaller list) to a file after
each AG, and merge the sorted hash tables to dedupe across AGs...unless
there's a gotcha there that makes dedupe across AGs undesirable or
impossible?

> Then consider that we can use that slowly iterating nightly dedupe
> block scan to replace the online media scrubber. i.e. We get a
> storage bitrot detecting scan for free with such a OOB dedupe
> algorithm.  That's a massive win for enterprise filesystems - we
> can cram all our maintenance tasks (defrag, scrubbing and dedupe)
> into a single nightly scan, then it's highly predictable and
> repeatable behaviour that admins can plan for and save themselves a
> lot of unnecessary pain.

That is a compelling argument for always doing the full-filesystem
dedupe scan even if you don't have to.

That gives me an idea to use later:  btrfs scrub is already parallelized
across devices.  Maybe btrfs scrub could be adapted to share the data
with userspace (as opposed to userspace trying to duplicate all the RAID
repair things that btrfs scrub does).  Then dedupe just follows the scrub
through the filesystem and resolves hash collisions as it goes (or queues
them up for later processing in sorted batches to minimize seeking).

> Yeah, 2.1GHZ Xeons can scan ~200,000 inodes/s each when CPU bound,
> and the above number comes from bulkstating 8 different AGs in
> parallel on a 100 million inode, 500TB filesystem. It's limited by
> lock contention on the sb inode list lock, so fixing that will allow
> it to go faster.
> 
> > btrfs TREE_SEARCH_V2 barely manages 2000 inodes/sec on a 2.2GHz
> > AMD A6--and that's just in the kernel, not counting delivery of
> > the data to userspace.  I'm clearly using the wrong filesystem
> > here.
> 
> Perhaps so.

Indeed.  Implement compression, online shrinking, data checksums, and
self-repair, and we can talk migrations.  ;)

[...]
> > I'd like the filesystem to receive a pair of contiguous colliding block
> > ranges, and have the filesystem locate and update all the owners at once
> > after performing the comparison (skip updating any owners that have
> > changed after the comparison).
> 
> I'd like a pony, too.

I have friends with an assortment of horses.  What kind of pony would
you like?  ;)

> However, I think that such bottom up operations are nearly
> impossible to do atomically because they invert the locking orders
> of normal operations and are completely open-ended in scope in what
> they may need to update.  

I don't think it needs to be atomic at all.  dedupe only needs a way
to know if it has lost a race and continue gracefully, and that can be
as simple as looping over each dst owner object in turn and atomically
asking "do you still own the data I compared?  If so, I update you to own
the other copy; otherwise, I skip you and move on with the next owner."

If the filesystem has CoW semantics (i.e. data blocks are read-only until
deleted) this is a matter of comparing block addresses in metadata to
see if they changed after the block read/compare operation.  There would
need to be a lock on the data extents to prevent deletion during the
dedupe call, or a generation number on the data extent that could be
compared to detect modification after the fact, but there _wouldn't_ be
deadlocks on inodes that arise from the current top-down implementation
(e.g. dedupe and rename on the same two inodes).

If any owner changes are missed--which can only happen if there happens
to be a concurrent data update on that specific data--the next dedupe scan
can deal with it.  In fact it probably has to, since there is new data.

What you've described so far means the scope isn't limited anyway.  If the
call is used to dedupe two heavily-reflinked extents together (e.g.
both duplicate copies are each shared by thousands of snapshots that
have been created during the month-long period between dedupe runs),
it could always be stuck doing a lot of work updating dst owners.
Was there an omitted detail there?

> Cheers,
> 
> Dave,
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 21, 2018, 2:59 a.m. UTC | #14
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots. 
> > > > 
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > > 
> > > > 	a) ask the filesystem where all it's file data is
> > > 
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> > 
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once....
> 
> ...but once it has been read, we don't want to read it again.  Ever.
> Even better would be to read unique data less than 1.0 times on average.
> 
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database 
> 
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation
that keep a userspace database to keep track of what is happening in
the filesystem. They do this to make better decisions when moving
less frequently accessed data out to a slower teirs of storage to
keep space in the primary storage available as it fills up. You're
approaching dedupe in essentially the same way, for very similar
reasons.

> One big difference I am noticing in our approaches is latency.  ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen.  What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can
run this way, but it can also be run as a complete immediate scan,
or it can be pointed at a defined set of files to run on. We can
(and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a
file changed" problem that HSMs face. That's one of the things DMAPI
was used for - XFS has (now unused) dmapi event notification fields in
it's inode structure that were used by DMAPI to configure the
notifications the filesystem was supposed to send userspace....

With no DMAPI in the future, people with custom HSM-like interfaces
based on dmapi are starting to turn to fanotify and friends to
provide them with the change notifications they require....

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month. 
> 
> I have to scan and dedupe multiple times per hour.  OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.

Yup, but using notifications avoids the for even these scans - you'd
know exactly what data has changed, when it changed, and know
exactly that you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10AGs at once to keep the IO rate up, and process
> > them in bulk as we scan more.
> 
> How do you match dupe blocks from different AGs if you only keep RAM for
> the duration of one AG scan?  Do you not dedupe across AG boundaries?

We could, but do we need too? There's a heap of runtime considerations
at the filesystem level we need to take into consideration here, and
there's every chance that too much consolidation creates
unpredictable bottlenecks in overwrite workloads that need to break
the sharing (i.e. COW operations).

e.g. An AG contains up to 1TB of data which is more than enough to
get decent AG-internal dedupe rates. If we've got 1PB of data spread
across 1000AGs, deduping a million copies of a common data pattern
spread across the entire filesystem down to one per AG (i.e. 10^6
copies down to 10^3) still gives a massive space saving.

However, still having multiple copies means that when an overwrite
occurs we don't have a single global block that everyone contends on
trying to COW it. i.e. dedupe within an AG means we can run both
dedupe and COW operations wholly within a single AG. That means
ENOSPC operation is much more reliable, we have less complex
locking, we don't do as much work, we only need to manage one set of
freelist blocks, etc.

> > I'd like a pony, too.
> 
> I have friends with an assortment of horses.  What kind of pony would
> you like?  ;)

Rainbow, please :P

> > However, I think that such bottom up operations are nearly
> > impossible to do atomically because they invert the locking orders
> > of normal operations and are completely open-ended in scope in what
> > they may need to update.  
> 
> I don't think it needs to be atomic at all.  dedupe only needs a way
> to know if it has lost a race and continue gracefully, and that can be

Detecting races reliably implies locking and/or making operational
state changes appear atomic to external observers.

> What you've described so far means the scope isn't limited anyway.  If the
> call is used to dedupe two heavily-reflinked extents together (e.g.
> both duplicate copies are each shared by thousands of snapshots that
> have been created during the month-long period between dedupe runs),
> it could always be stuck doing a lot of work updating dst owners.
> Was there an omitted detail there?

As I said early in the discussion - if both copies of identical data
are already shared hundreds or thousands of times each, then it
makes no sense to dedupe them again. All that does is create huge
amounts of work updating metadata for very little additional gain.

Cheers,

Dave.
Zygo Blaxell Sept. 21, 2018, 4:40 a.m. UTC | #15
On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
[...]
> With no DMAPI in the future, people with custom HSM-like interfaces
> based on dmapi are starting to turn to fanotify and friends to
> provide them with the change notifications they require....

I had a fanotify-based scanner once, before I noticed btrfs effectively
had timestamps all over its metadata.

fanotify won't tell me which parts of a file were modified (unless it
got that feature in the last few years?).  fanotify was pretty useless
when the only file on the system that was being modified was a 13TB
VM image.  Or even a little 16GB one.  Has to scan the whole file to
find the one new byte.  Even on desktops the poor thing spends most of
its time looping over /var/log/messages.  It was sad.

If fanotify gave me (inode, offset, length) tuples of dirty pages in
cache, I could look them up and use a dedupe_file_range call to replace
the dirty pages with a reference to an existing disk block.  If my
listener can do that fast enough, it's in-band dedupe; if it doesn't,
the data gets flushed to disk as normal, and I fall back to a scan of
the filesystem to clean it up later.

> > > e.g. a soft requirement is that we need to scan the entire fs at
> > > least once a month. 
> > 
> > I have to scan and dedupe multiple times per hour.  OK, the first-ever
> > scan of a non-empty filesystem is allowed to take much longer, but after
> > that, if you have enough spare iops for continuous autodefrag you should
> > also have spare iops for continuous dedupe.
> 
> Yup, but using notifications avoids the for even these scans - you'd
> know exactly what data has changed, when it changed, and know
> exactly that you needed to read to calculate the new hashes.

...if the scanner can keep up with the notifications; otherwise, the
notification receiver has to log them somewhere for the scanner to
catch up.  If there are missed or dropped notifications--or 23 hours a
day we're not listening for notifications because we only have an hour
a day maintenance window--some kind of filesystem scan has to be done
after the fact anyway.

> > > A simple piece-wise per-AG scanning algorithm (like we use in
> > > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > > then do the processing while the next AG gets scanned. If we've got
> > > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > > then we can scan 5-10AGs at once to keep the IO rate up, and process
> > > them in bulk as we scan more.
> > 
> > How do you match dupe blocks from different AGs if you only keep RAM for
> > the duration of one AG scan?  Do you not dedupe across AG boundaries?
> 
> We could, but do we need too? There's a heap of runtime considerations
> at the filesystem level we need to take into consideration here, and
> there's every chance that too much consolidation creates
> unpredictable bottlenecks in overwrite workloads that need to break
> the sharing (i.e. COW operations).

I'm well aware of that.  I have a bunch of hacks in bees to not be too
efficient lest it push the btrfs reflink bottlenecks too far.

> e.g. An AG contains up to 1TB of data which is more than enough to
> get decent AG-internal dedupe rates. If we've got 1PB of data spread
> across 1000AGs, deduping a million copies of a common data pattern
> spread across the entire filesystem down to one per AG (i.e. 10^6
> copies down to 10^3) still gives a massive space saving.

That's true for 1000+ AG filesystems, but it's a bigger problem for
filesystems of 2-5 AGs, where each AG holds one copy of 20-50% of the
duplicates on the filesystem.

OTOH, a filesystem that small could just be done in one pass with a
larger but still reasonable amount of RAM.

> > What you've described so far means the scope isn't limited anyway.  If the
> > call is used to dedupe two heavily-reflinked extents together (e.g.
> > both duplicate copies are each shared by thousands of snapshots that
> > have been created during the month-long period between dedupe runs),
> > it could always be stuck doing a lot of work updating dst owners.
> > Was there an omitted detail there?
> 
> As I said early in the discussion - if both copies of identical data
> are already shared hundreds or thousands of times each, then it
> makes no sense to dedupe them again. All that does is create huge
> amounts of work updating metadata for very little additional gain.

I've had a user complain about the existing 2560-reflink limit in bees,
because they were starting with 3000 snapshots of their data before they
ran dedupe for the first time, so almost all their data started above
the reflink limit before dedupe, and no dedupes occurred because of that.

Build servers end up with a 3-4 digit number of reflinks to every file
after dedupe, then they make snapshots of a subvol of a million such files
to back it up--instantly but temporarily doubling every reflink count.
Billions of reflink updates in only 10 TB of space.

Updating a thousand reflinks to an extent sounds like a stupid amount of
work, but in configurations like these it is just the price of deduping
anything.

Still, there has to be a limit somewhere--millions of refs to a block
might be a reasonable absurdity cutoff.


> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Andreas Dilger Oct. 1, 2018, 8:34 p.m. UTC | #16
On Sep 20, 2018, at 10:40 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> 
> On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
>> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> [...]
>> With no DMAPI in the future, people with custom HSM-like interfaces
>> based on dmapi are starting to turn to fanotify and friends to
>> provide them with the change notifications they require....
> 
> I had a fanotify-based scanner once, before I noticed btrfs effectively
> had timestamps all over its metadata.
> 
> fanotify won't tell me which parts of a file were modified (unless it
> got that feature in the last few years?).  fanotify was pretty useless
> when the only file on the system that was being modified was a 13TB
> VM image.  Or even a little 16GB one.  Has to scan the whole file to
> find the one new byte.  Even on desktops the poor thing spends most of
> its time looping over /var/log/messages.  It was sad.
> 
> If fanotify gave me (inode, offset, length) tuples of dirty pages in
> cache, I could look them up and use a dedupe_file_range call to replace
> the dirty pages with a reference to an existing disk block.  If my
> listener can do that fast enough, it's in-band dedupe; if it doesn't,
> the data gets flushed to disk as normal, and I fall back to a scan of
> the filesystem to clean it up later.
> 
>>>> e.g. a soft requirement is that we need to scan the entire fs at
>>>> least once a month.
>>> 
>>> I have to scan and dedupe multiple times per hour.  OK, the first-ever
>>> scan of a non-empty filesystem is allowed to take much longer, but after
>>> that, if you have enough spare iops for continuous autodefrag you should
>>> also have spare iops for continuous dedupe.
>> 
>> Yup, but using notifications avoids the for even these scans - you'd
>> know exactly what data has changed, when it changed, and know
>> exactly that you needed to read to calculate the new hashes.
> 
> ...if the scanner can keep up with the notifications; otherwise, the
> notification receiver has to log them somewhere for the scanner to
> catch up.  If there are missed or dropped notifications--or 23 hours a
> day we're not listening for notifications because we only have an hour
> a day maintenance window--some kind of filesystem scan has to be done
> after the fact anyway.

It is worthwhile to mention that Lustre has a persistent Changelog record
that is generated atomically with the filesystem transaction that the event
happened in.

Once there is a Changelog consumer that registers itself with the filesystem,
along with a mask of the event types that it is interested in, the Changelog
begins recording all such events to disk (e.g. create, mkdir, setattr, etc.).
The Changelog consumer periodically notifies the filesystem when it has
processed events up to X, so that it can purge old events from the log.  It
is possible to have multiple consumers registered, and the log is only purged
up to the slowest consumer.

If a consumer hasn't processed logs in some (relatively long) time (e.g. many
days or weeks), or if the filesystem is otherwise going to run out of space,
then the consumer is deregistered and the old log records are cleaned up.  This
also notifies the consumer that it is is no longer active, and it has to do a
full scan to update its state for the events that it missed.

Having a persistent changelog is useful for all kinds of event processing,
and avoids the need to do real-time processing.  If the userspace daemon fails,
or the system is restarted, etc. then there is no need to rescan the whole
filesystem, which is important when there are many billions of files therein.


Cheers, Andreas

Patch
diff mbox series

diff --git a/fs/read_write.c b/fs/read_write.c
index 153f8f690490..3844359a4597 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1768,8 +1768,17 @@  int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 			return -EINVAL;
 	}
 
-	/* If we're linking to EOF, continue to the block boundary. */
-	if (pos_in + *len == isize)
+	/* Reflink allows linking to EOF, so round the length up to allow us to
+	 * shared the last block in the file. We don't care what is beyond the
+	 * EOF point in the block that gets shared, as we can't access it and
+	 * attempts to extent the file will break the sharing.
+	 *
+	 * For dedupe, sharing the EOF block is illegal because the bytes beyond
+	 * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
+	 * we do not extend/align the length and instead let the alignment
+	 * checks below reject it.
+	 */
+	if (!is_dedupe && pos_in + *len == isize)
 		blen = ALIGN(isize, bs) - pos_in;
 	else
 		blen = *len;