diff mbox series

Btrfs: fix data corruption due to cloning of eof block

Message ID 20181105111417.11808-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: fix data corruption due to cloning of eof block | expand

Commit Message

Filipe Manana Nov. 5, 2018, 11:14 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

We currently allow cloning a range from a file which includes the last
block of the file even if the file's size is not aligned to the block
size. This is fine and useful when the destination file has the same size,
but when it does not and the range ends somewhere in the middle of the
destination file, it leads to corruption because the bytes between the EOF
and the end of the block have undefined data (when there is support for
discard/trimming they have a value of 0x00).

Example:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ export foo_size=$((256 * 1024 + 100))
 $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar

 $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar

 $ od -A d -t x1 /mnt/bar
 0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
 *
 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 1048576

The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
(512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
of 0xb5.

This is similar to the problem we had for deduplication that got recently
fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when
deduplicating between different files").

Fix this by not allowing such operations to be performed and return the
errno -EINVAL to user space. This is what XFS is doing as well at the VFS
level. This change however now makes us return -EINVAL instead of
-EOPNOTSUPP for cases where the source range maps to an inline extent and
the destination range's end is smaller then the destination file's size,
since the detection of inline extents is done during the actual process of
dropping file extent items (at __btrfs_drop_extents()). Returning the
-EINVAL error is done early on and solely based on the input parameters
(offsets and length) and destination file's size. This makes us consistent
with XFS and anyone else supporting cloning since this case is now checked
at a higher level in the VFS and is where the -EINVAL will be returned
from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into
partial EOF block"). So this change is more geared towards stable kernels,
as it's unlikely the new VFS checks get removed intentionally.

A test case for fstests follows soon, as well as an update to filter
existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.

CC: <stable@vger.kernel.org> # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ioctl.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Josef Bacik Nov. 7, 2018, 8:49 p.m. UTC | #1
On Mon, Nov 05, 2018 at 11:14:17AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We currently allow cloning a range from a file which includes the last
> block of the file even if the file's size is not aligned to the block
> size. This is fine and useful when the destination file has the same size,
> but when it does not and the range ends somewhere in the middle of the
> destination file, it leads to corruption because the bytes between the EOF
> and the end of the block have undefined data (when there is support for
> discard/trimming they have a value of 0x00).
> 
> Example:
> 
>  $ mkfs.btrfs -f /dev/sdb
>  $ mount /dev/sdb /mnt
> 
>  $ export foo_size=$((256 * 1024 + 100))
>  $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
>  $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
> 
>  $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
> 
>  $ od -A d -t x1 /mnt/bar
>  0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
>  *
>  0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
>  *
>  0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
>  0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
>  *
>  1048576
> 
> The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
> (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
> of 0xb5.
> 
> This is similar to the problem we had for deduplication that got recently
> fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when
> deduplicating between different files").
> 
> Fix this by not allowing such operations to be performed and return the
> errno -EINVAL to user space. This is what XFS is doing as well at the VFS
> level. This change however now makes us return -EINVAL instead of
> -EOPNOTSUPP for cases where the source range maps to an inline extent and
> the destination range's end is smaller then the destination file's size,
> since the detection of inline extents is done during the actual process of
> dropping file extent items (at __btrfs_drop_extents()). Returning the
> -EINVAL error is done early on and solely based on the input parameters
> (offsets and length) and destination file's size. This makes us consistent
> with XFS and anyone else supporting cloning since this case is now checked
> at a higher level in the VFS and is where the -EINVAL will be returned
> from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
> by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into
> partial EOF block"). So this change is more geared towards stable kernels,
> as it's unlikely the new VFS checks get removed intentionally.
> 
> A test case for fstests follows soon, as well as an update to filter
> existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
> 
> CC: <stable@vger.kernel.org> # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f3134fc69880..30e098970063 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4277,9 +4277,17 @@  static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 		goto out_unlock;
 	if (len == 0)
 		olen = len = src->i_size - off;
-	/* if we extend to eof, continue to block boundary */
-	if (off + len == src->i_size)
+	/*
+	 * If we extend to eof, continue to block boundary if and only if the
+	 * destination end offset matches the destination file's size, otherwise
+	 * we would be corrupting data by placing the eof block into the middle
+	 * of a file.
+	 */
+	if (off + len == src->i_size) {
+		if (!IS_ALIGNED(len, bs) && destoff + len < inode->i_size)
+			goto out_unlock;
 		len = ALIGN(src->i_size, bs) - off;
+	}
 
 	if (len == 0) {
 		ret = 0;