diff mbox series

[05/25] vfs: avoid problematic remapping requests into partial EOF block

Message ID 153923117420.5546.13317703807467393934.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series fs: fixes for serious clone/dedupe problems | expand

Commit Message

Darrick J. Wong Oct. 11, 2018, 4:12 a.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

A deduplication data corruption is exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts the
destination file when the source extent is shared with it.

The VFS remapping prep functions  only support whole block dedupe, but
we still need to appear to support whole file dedupe correctly.  Hence
if the dedupe request includes the last block of the souce file, don't
include it in the actual dedupe operation. If the rest of the range
dedupes successfully, then reject the entire request.  A subsequent
patch will enable us to shorten dedupe requests correctly.

When reflinking sub-file ranges, a data corruption can occur when the
source file range includes a partial EOF block. This shares the unknown
data beyond EOF into the second file at a position inside EOF, exposing
stale data in the second file.

If the reflink request includes the last block of the souce file, only
proceed with the reflink operation if it lands at or past the
destination file's current EOF. If it lands within the destination file
EOF, reject the entire request with -EINVAL and make the caller go the
hard way.  A subsequent patch will enable us to shorten reflink requests
correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Dave Chinner Oct. 12, 2018, 12:16 a.m. UTC | #1
On Wed, Oct 10, 2018 at 09:12:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> A deduplication data corruption is exposed by fstests generic/505 on
> XFS. It is caused by extending the block match range to include the
> partial EOF block, but then allowing unknown data beyond EOF to be
> considered a "match" to data in the destination file because the
> comparison is only made to the end of the source file. This corrupts the
> destination file when the source extent is shared with it.
> 
> The VFS remapping prep functions only support whole block dedupe, but
> we still need to appear to support whole file dedupe correctly.  Hence
> if the dedupe request includes the last block of the souce file, don't
> include it in the actual dedupe operation. If the rest of the range
> dedupes successfully, then reject the entire request.  A subsequent
> patch will enable us to shorten dedupe requests correctly.

Ok, so this patch rejects whole file dedupe requests, and then a
later patch adds support back in for it?

Doesn't that leave a bisect landmine behind? Why separate the
functionality like this?

Cheers,

Dave.
Darrick J. Wong Oct. 12, 2018, 4:07 p.m. UTC | #2
On Fri, Oct 12, 2018 at 11:16:16AM +1100, Dave Chinner wrote:
> On Wed, Oct 10, 2018 at 09:12:54PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > A deduplication data corruption is exposed by fstests generic/505 on
> > XFS. It is caused by extending the block match range to include the
> > partial EOF block, but then allowing unknown data beyond EOF to be
> > considered a "match" to data in the destination file because the
> > comparison is only made to the end of the source file. This corrupts the
> > destination file when the source extent is shared with it.
> > 
> > The VFS remapping prep functions only support whole block dedupe, but
> > we still need to appear to support whole file dedupe correctly.  Hence
> > if the dedupe request includes the last block of the souce file, don't
> > include it in the actual dedupe operation. If the rest of the range
> > dedupes successfully, then reject the entire request.  A subsequent
> > patch will enable us to shorten dedupe requests correctly.
> 
> Ok, so this patch rejects whole file dedupe requests, and then a
> later patch adds support back in for it?
> 
> Doesn't that leave a bisect landmine behind? Why separate the
> functionality like this?

Heh, it's a leftover from when I was trying to undo the behavior that
bytes_deduped == len even if we rounded down.  I gave up on that, so
this can match the xfs patch.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Filipe Manana Oct. 12, 2018, 8:22 p.m. UTC | #3
On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> A deduplication data corruption is exposed by fstests generic/505 on
> XFS.

(and btrfs)

Btw, the generic test I wrote was indeed numbered 505, however it was
never committed and there's now a generic/505 which has nothing to do
with deduplication.
So you should update the changelog to avoid confusion.

thanks

> It is caused by extending the block match range to include the
> partial EOF block, but then allowing unknown data beyond EOF to be
> considered a "match" to data in the destination file because the
> comparison is only made to the end of the source file. This corrupts the
> destination file when the source extent is shared with it.
>
> The VFS remapping prep functions  only support whole block dedupe, but
> we still need to appear to support whole file dedupe correctly.  Hence
> if the dedupe request includes the last block of the souce file, don't
> include it in the actual dedupe operation. If the rest of the range
> dedupes successfully, then reject the entire request.  A subsequent
> patch will enable us to shorten dedupe requests correctly.
>
> When reflinking sub-file ranges, a data corruption can occur when the
> source file range includes a partial EOF block. This shares the unknown
> data beyond EOF into the second file at a position inside EOF, exposing
> stale data in the second file.
>
> If the reflink request includes the last block of the souce file, only
> proceed with the reflink operation if it lands at or past the
> destination file's current EOF. If it lands within the destination file
> EOF, reject the entire request with -EINVAL and make the caller go the
> hard way.  A subsequent patch will enable us to shorten reflink requests
> correctly.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/read_write.c |   22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index d6e8e242a15f..8498991e2f33 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1723,6 +1723,7 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
>  {
>         struct inode *inode_in = file_inode(file_in);
>         struct inode *inode_out = file_inode(file_out);
> +       u64 blkmask = i_blocksize(inode_in) - 1;
>         bool same_inode = (inode_in == inode_out);
>         int ret;
>
> @@ -1785,6 +1786,27 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
>                         return -EBADE;
>         }
>
> +       /* Are we doing a partial EOF block remapping of some kind? */
> +       if (*len & blkmask) {
> +               /*
> +                * If the dedupe data matches, don't try to dedupe the partial
> +                * EOF block.
> +                *
> +                * If the user is attempting to remap a partial EOF block and
> +                * it's inside the destination EOF then reject it.
> +                *
> +                * We don't support shortening requests, so we can only reject
> +                * them.
> +                */
> +               if (is_dedupe)
> +                       ret = -EBADE;
> +               else if (pos_out + *len < i_size_read(inode_out))
> +                       ret = -EINVAL;
> +
> +               if (ret)
> +                       return ret;
> +       }
> +
>         return 1;
>  }
>  EXPORT_SYMBOL(vfs_clone_file_prep);
>
Dave Chinner Oct. 15, 2018, 12:31 a.m. UTC | #4
On Fri, Oct 12, 2018 at 09:22:18PM +0100, Filipe Manana wrote:
> On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > A deduplication data corruption is exposed by fstests generic/505 on
> > XFS.
> 
> (and btrfs)
> 
> Btw, the generic test I wrote was indeed numbered 505, however it was
> never committed and there's now a generic/505 which has nothing to do
> with deduplication.
> So you should update the changelog to avoid confusion.

What test is it now? And if it hasn't been committed, are you going
to update it and repost as it clearly had value....

Cheers,

Dave.
Filipe Manana Nov. 2, 2018, 12:04 p.m. UTC | #5
On Mon, Oct 15, 2018 at 1:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Oct 12, 2018 at 09:22:18PM +0100, Filipe Manana wrote:
> > On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > > A deduplication data corruption is exposed by fstests generic/505 on
> > > XFS.
> >
> > (and btrfs)
> >
> > Btw, the generic test I wrote was indeed numbered 505, however it was
> > never committed and there's now a generic/505 which has nothing to do
> > with deduplication.
> > So you should update the changelog to avoid confusion.
>
> What test is it now? And if it hasn't been committed, are you going
> to update it and repost as it clearly had value....

Sorry, I lost track of this.

So what was the conclusion of the thread where discussion about this
problem started?
It wasn't clear to me if a consensus was reached and got lost on that
long user space dedupe tools discussion between you and Zygo.

The test assumed a fix of rounding down the range and deduping less
bytes then requested (which ended up included in 4.19 for btrfs).

From this vfs patch it seems it was decided to return errno -EDADE instead.
Is this the final decision?

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Nov. 2, 2018, 5:42 p.m. UTC | #6
On Fri, Nov 02, 2018 at 12:04:39PM +0000, Filipe Manana wrote:
> On Mon, Oct 15, 2018 at 1:31 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Oct 12, 2018 at 09:22:18PM +0100, Filipe Manana wrote:
> > > On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > >
> > > > A deduplication data corruption is exposed by fstests generic/505 on
> > > > XFS.
> > >
> > > (and btrfs)
> > >
> > > Btw, the generic test I wrote was indeed numbered 505, however it was
> > > never committed and there's now a generic/505 which has nothing to do
> > > with deduplication.
> > > So you should update the changelog to avoid confusion.
> >
> > What test is it now? And if it hasn't been committed, are you going
> > to update it and repost as it clearly had value....
> 
> Sorry, I lost track of this.
> 
> So what was the conclusion of the thread where discussion about this
> problem started?
> It wasn't clear to me if a consensus was reached and got lost on that
> long user space dedupe tools discussion between you and Zygo.
> 
> The test assumed a fix of rounding down the range and deduping less
> bytes then requested (which ended up included in 4.19 for btrfs).
> 
> From this vfs patch it seems it was decided to return errno -EDADE instead.
> Is this the final decision?

No, I reworked the whole mess to match btrfs-4.19 behavior of deduping
fewer bytes than requested.

--D

> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
Filipe Manana Nov. 2, 2018, 6:18 p.m. UTC | #7
On Fri, Nov 2, 2018 at 5:42 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Fri, Nov 02, 2018 at 12:04:39PM +0000, Filipe Manana wrote:
> > On Mon, Oct 15, 2018 at 1:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Oct 12, 2018 at 09:22:18PM +0100, Filipe Manana wrote:
> > > > On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > >
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > >
> > > > > A deduplication data corruption is exposed by fstests generic/505 on
> > > > > XFS.
> > > >
> > > > (and btrfs)
> > > >
> > > > Btw, the generic test I wrote was indeed numbered 505, however it was
> > > > never committed and there's now a generic/505 which has nothing to do
> > > > with deduplication.
> > > > So you should update the changelog to avoid confusion.
> > >
> > > What test is it now? And if it hasn't been committed, are you going
> > > to update it and repost as it clearly had value....
> >
> > Sorry, I lost track of this.
> >
> > So what was the conclusion of the thread where discussion about this
> > problem started?
> > It wasn't clear to me if a consensus was reached and got lost on that
> > long user space dedupe tools discussion between you and Zygo.
> >
> > The test assumed a fix of rounding down the range and deduping less
> > bytes then requested (which ended up included in 4.19 for btrfs).
> >
> > From this vfs patch it seems it was decided to return errno -EDADE instead.
> > Is this the final decision?
>
> No, I reworked the whole mess to match btrfs-4.19 behavior of deduping
> fewer bytes than requested.

What about cloning?
For cloning the issue is still not fixed in btrfs either.

So was that done in a later version of this patchset or somewhere else?

thanks

>
> --D
>
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
Filipe Manana Nov. 2, 2018, 7:05 p.m. UTC | #8
On Fri, Nov 2, 2018 at 6:18 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Fri, Nov 2, 2018 at 5:42 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:04:39PM +0000, Filipe Manana wrote:
> > > On Mon, Oct 15, 2018 at 1:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Fri, Oct 12, 2018 at 09:22:18PM +0100, Filipe Manana wrote:
> > > > > On Thu, Oct 11, 2018 at 5:13 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > > >
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > >
> > > > > > A deduplication data corruption is exposed by fstests generic/505 on
> > > > > > XFS.
> > > > >
> > > > > (and btrfs)
> > > > >
> > > > > Btw, the generic test I wrote was indeed numbered 505, however it was
> > > > > never committed and there's now a generic/505 which has nothing to do
> > > > > with deduplication.
> > > > > So you should update the changelog to avoid confusion.
> > > >
> > > > What test is it now? And if it hasn't been committed, are you going
> > > > to update it and repost as it clearly had value....
> > >
> > > Sorry, I lost track of this.
> > >
> > > So what was the conclusion of the thread where discussion about this
> > > problem started?
> > > It wasn't clear to me if a consensus was reached and got lost on that
> > > long user space dedupe tools discussion between you and Zygo.
> > >
> > > The test assumed a fix of rounding down the range and deduping less
> > > bytes then requested (which ended up included in 4.19 for btrfs).
> > >
> > > From this vfs patch it seems it was decided to return errno -EDADE instead.
> > > Is this the final decision?
> >
> > No, I reworked the whole mess to match btrfs-4.19 behavior of deduping
> > fewer bytes than requested.
>
> What about cloning?
> For cloning the issue is still not fixed in btrfs either.
>
> So was that done in a later version of this patchset or somewhere else?

Never mind, found it, it returns -EINVAL.

>
> thanks
>
> >
> > --D
> >
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@fromorbit.com
> > >
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
diff mbox series

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index d6e8e242a15f..8498991e2f33 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1723,6 +1723,7 @@  int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
+	u64 blkmask = i_blocksize(inode_in) - 1;
 	bool same_inode = (inode_in == inode_out);
 	int ret;
 
@@ -1785,6 +1786,27 @@  int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			return -EBADE;
 	}
 
+	/* Are we doing a partial EOF block remapping of some kind? */
+	if (*len & blkmask) {
+		/*
+		 * If the dedupe data matches, don't try to dedupe the partial
+		 * EOF block.
+		 *
+		 * If the user is attempting to remap a partial EOF block and
+		 * it's inside the destination EOF then reject it.
+		 *
+		 * We don't support shortening requests, so we can only reject
+		 * them.
+		 */
+		if (is_dedupe)
+			ret = -EBADE;
+		else if (pos_out + *len < i_size_read(inode_out))
+			ret = -EINVAL;
+
+		if (ret)
+			return ret;
+	}
+
 	return 1;
 }
 EXPORT_SYMBOL(vfs_clone_file_prep);