diff mbox series

[v3] btrfs: new test for devt change between mounts

Message ID 9dde3b18f00a30cae78197c9069db503f720fe71.1709844612.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series [v3] btrfs: new test for devt change between mounts | expand

Commit Message

Boris Burkov March 8, 2024, 5:43 p.m. UTC
It is possible to confuse the btrfs device cache (fs_devices) by
starting with a multi-device filesystem, then removing and re-adding a
device in a way which changes its dev_t while the filesystem is
unmounted. After this procedure, if we remount, then we are in a funny
state where struct btrfs_device's "devt" field does not match the bd_dev
of the "bdev" field. I would say this is bad enough, as we have violated
a pretty clear invariant.

But for style points, we can then remove the extra device from the fs,
making it a single device fs, which enables the "temp_fsid" feature,
which permits multiple separate mounts of different devices with the
same fsid. Since btrfs is confused and *thinks* there are different
devices (based on device->devt), it allows a second redundant mount of
the same device (not a bind mount!). This then allows us to corrupt the
original mount by doing stuff to the one that should be a bind mount.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Changelog:
v3:
- fstests convention improvements (helpers, output, comments, etc...)
v2:
- fix numerous fundamental issues, v1 wasn't really ready

 common/config       |   1 +
 tests/btrfs/311     | 105 ++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/311.out |   2 +
 3 files changed, 108 insertions(+)
 create mode 100755 tests/btrfs/311
 create mode 100644 tests/btrfs/311.out

Comments

Filipe Manana March 10, 2024, 5:16 p.m. UTC | #1
On Fri, Mar 8, 2024 at 5:42 PM Boris Burkov <boris@bur.io> wrote:
>
> It is possible to confuse the btrfs device cache (fs_devices) by
> starting with a multi-device filesystem, then removing and re-adding a
> device in a way which changes its dev_t while the filesystem is
> unmounted. After this procedure, if we remount, then we are in a funny
> state where struct btrfs_device's "devt" field does not match the bd_dev
> of the "bdev" field. I would say this is bad enough, as we have violated
> a pretty clear invariant.
>
> But for style points, we can then remove the extra device from the fs,
> making it a single device fs, which enables the "temp_fsid" feature,
> which permits multiple separate mounts of different devices with the
> same fsid. Since btrfs is confused and *thinks* there are different
> devices (based on device->devt), it allows a second redundant mount of
> the same device (not a bind mount!). This then allows us to corrupt the
> original mount by doing stuff to the one that should be a bind mount.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

It fails on for-next (as expected) and it passes with the following
patch applied:

    "btrfs: validate device maj:min during open"

If the consensus is to use that patch as a fix, we can later add the
_fixed_by_kernel_commit call to the test.
Also, the test number 311 has been taken for a week now.

Anyway:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> ---
> Changelog:
> v3:
> - fstests convention improvements (helpers, output, comments, etc...)
> v2:
> - fix numerous fundamental issues, v1 wasn't really ready
>
>  common/config       |   1 +
>  tests/btrfs/311     | 105 ++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/311.out |   2 +
>  3 files changed, 108 insertions(+)
>  create mode 100755 tests/btrfs/311
>  create mode 100644 tests/btrfs/311.out
>
> diff --git a/common/config b/common/config
> index a3b15b96f..43b517fda 100644
> --- a/common/config
> +++ b/common/config
> @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
>  export GZIP_PROG="$(type -P gzip)"
>  export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
>  export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> +export PARTED_PROG="$(type -P parted)"
>
>  # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
>  # newer systems have udevadm command but older systems like RHEL5 don't.
> diff --git a/tests/btrfs/311 b/tests/btrfs/311
> new file mode 100755
> index 000000000..a7fa541c4
> --- /dev/null
> +++ b/tests/btrfs/311
> @@ -0,0 +1,105 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> +#
> +# FS QA Test 311
> +#
> +# Test an edge case of multi device volume management in btrfs.
> +# If a device changes devt between mounts of a multi device fs, we can trick
> +# btrfs into mounting the same device twice fully (not as a bind mount). From
> +# there, it is trivial to induce corruption.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick volume scrub
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_require_test
> +_require_command "$PARTED_PROG" parted
> +_require_batched_discard "$TEST_DIR"
> +
> +_cleanup() {
> +       cd /
> +       $UMOUNT_PROG $MNT
> +       $UMOUNT_PROG $BIND
> +       losetup -d $DEV0
> +       losetup -d $DEV1
> +       losetup -d $DEV2
> +       rm $IMG0
> +       rm $IMG1
> +       rm $IMG2
> +}
> +
> +IMG0=$TEST_DIR/$$.img0
> +IMG1=$TEST_DIR/$$.img1
> +IMG2=$TEST_DIR/$$.img2
> +truncate -s 1G $IMG0
> +truncate -s 1G $IMG1
> +truncate -s 1G $IMG2
> +DEV0=$(losetup -f $IMG0 --show)
> +DEV1=$(losetup -f $IMG1 --show)
> +DEV2=$(losetup -f $IMG2 --show)
> +D0P1=$DEV0"p1"
> +D1P1=$DEV1"p1"
> +MNT=$TEST_DIR/mnt
> +BIND=$TEST_DIR/bind
> +
> +# Setup partition table with one partition on each device.
> +$PARTED_PROG $DEV0 'mktable gpt' --script
> +$PARTED_PROG $DEV1 'mktable gpt' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +
> +# mkfs with two devices to avoid clearing devices on close
> +# single raid to allow removing DEV2.
> +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs"
> +
> +# Cycle mount the two device fs to populate both devices into the
> +# stale device cache.
> +mkdir -p $MNT
> +_mount $D0P1 $MNT
> +$UMOUNT_PROG $MNT
> +
> +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> +$PARTED_PROG $DEV0 'rm 1' --script
> +$PARTED_PROG $DEV1 'rm 1' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +
> +# Mount with mismatched dev_t!
> +_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> +
> +# Remove the extra device to bring temp-fsid back in the fray.
> +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> +
> +# Create the would be bind mount.
> +mkdir -p $BIND
> +_mount $D0P1 $BIND
> +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> +# If they're different, we are in trouble.
> +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> +
> +# Now really prove it by corrupting the first mount with the second.
> +for i in $(seq 20); do
> +       $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1
> +done
> +for i in $(seq 20); do
> +       $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1
> +done
> +
> +# sync so that we really write the large file data out to the shared device
> +sync
> +
> +# now delete from one view of the shared device
> +find $BIND -type f -delete
> +# sync so that fstrim definitely has deleted data to trim
> +sync
> +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> +$FSTRIM_PROG $BIND
> +# drop caches to improve the odds we read from the corrupted device while scrubbing.
> +echo 3 > /proc/sys/vm/drop_caches
> +$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:"
> +
> +status=0
> +exit
> diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> new file mode 100644
> index 000000000..70a6db809
> --- /dev/null
> +++ b/tests/btrfs/311.out
> @@ -0,0 +1,2 @@
> +QA output created by 311
> +Error summary:    no errors found
> --
> 2.43.0
>
>
Anand Jain March 11, 2024, 4:44 p.m. UTC | #2
On 3/8/24 23:13, Boris Burkov wrote:
> It is possible to confuse the btrfs device cache (fs_devices) by
> starting with a multi-device filesystem, then removing and re-adding a
> device in a way which changes its dev_t while the filesystem is
> unmounted. After this procedure, if we remount, then we are in a funny
> state where struct btrfs_device's "devt" field does not match the bd_dev
> of the "bdev" field. I would say this is bad enough, as we have violated
> a pretty clear invariant.
> 
> But for style points, we can then remove the extra device from the fs,
> making it a single device fs, which enables the "temp_fsid" feature,
> which permits multiple separate mounts of different devices with the
> same fsid. Since btrfs is confused and *thinks* there are different
> devices (based on device->devt), it allows a second redundant mount of
> the same device (not a bind mount!). This then allows us to corrupt the
> original mount by doing stuff to the one that should be a bind mount.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Changelog:
> v3:
> - fstests convention improvements (helpers, output, comments, etc...)
> v2:
> - fix numerous fundamental issues, v1 wasn't really ready
> 
>   common/config       |   1 +
>   tests/btrfs/311     | 105 ++++++++++++++++++++++++++++++++++++++++++++
>   tests/btrfs/311.out |   2 +
>   3 files changed, 108 insertions(+)
>   create mode 100755 tests/btrfs/311
>   create mode 100644 tests/btrfs/311.out
> 
> diff --git a/common/config b/common/config
> index a3b15b96f..43b517fda 100644
> --- a/common/config
> +++ b/common/config
> @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
>   export GZIP_PROG="$(type -P gzip)"
>   export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
>   export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> +export PARTED_PROG="$(type -P parted)"
>   
>   # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
>   # newer systems have udevadm command but older systems like RHEL5 don't.
> diff --git a/tests/btrfs/311 b/tests/btrfs/311
> new file mode 100755
> index 000000000..a7fa541c4
> --- /dev/null
> +++ b/tests/btrfs/311
> @@ -0,0 +1,105 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> +#
> +# FS QA Test 311
> +#
> +# Test an edge case of multi device volume management in btrfs.
> +# If a device changes devt between mounts of a multi device fs, we can trick
> +# btrfs into mounting the same device twice fully (not as a bind mount). From
> +# there, it is trivial to induce corruption.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick volume scrub

Please add tmpfsid as well, because this test is about when not to 
activate the tmpfsid.

> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_require_test
> +_require_command "$PARTED_PROG" parted
> +_require_batched_discard "$TEST_DIR"
> +

_fixed_by_kernel_commit XXXXXXXXXXXX \
		btrfs: validate device maj:min during open

> +_cleanup() {
> +	cd /
> +	$UMOUNT_PROG $MNT
> +	$UMOUNT_PROG $BIND
> +	losetup -d $DEV0
> +	losetup -d $DEV1
> +	losetup -d $DEV2
> +	rm $IMG0
> +	rm $IMG1
> +	rm $IMG2
> +}
> +
> +IMG0=$TEST_DIR/$$.img0
> +IMG1=$TEST_DIR/$$.img1
> +IMG2=$TEST_DIR/$$.img2
> +truncate -s 1G $IMG0
> +truncate -s 1G $IMG1
> +truncate -s 1G $IMG2
> +DEV0=$(losetup -f $IMG0 --show)
> +DEV1=$(losetup -f $IMG1 --show)
> +DEV2=$(losetup -f $IMG2 --show)
> +D0P1=$DEV0"p1"
> +D1P1=$DEV1"p1"
> +MNT=$TEST_DIR/mnt
> +BIND=$TEST_DIR/bind
> +
> +# Setup partition table with one partition on each device.
> +$PARTED_PROG $DEV0 'mktable gpt' --script
> +$PARTED_PROG $DEV1 'mktable gpt' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +
> +# mkfs with two devices to avoid clearing devices on close
> +# single raid to allow removing DEV2.
> +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs"

Error out is already sufficient here.

> +
> +# Cycle mount the two device fs to populate both devices into the
> +# stale device cache.
> +mkdir -p $MNT
> +_mount $D0P1 $MNT
> +$UMOUNT_PROG $MNT
> +
> +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> +$PARTED_PROG $DEV0 'rm 1' --script
> +$PARTED_PROG $DEV1 'rm 1' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +
> +# Mount with mismatched dev_t!
> +_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> +

On a system where the kernel bug is fixed, the mount is expected to pass.

On a system without the kernel bug fix, the mount is still expected to pass.

However, the failure message indicates that it is advisable to fail the 
mount in this scenario.

I believe this will be the case once the btrfs-progs patch below is 
integrated:

[PATCH 0/2] btrfs-progs: forget removed devices

So, we need to update the test case logic based on whether the above 
btrfs-progs patch is integrated.


> +# Remove the extra device to bring temp-fsid back in the fray.
> +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> +
> +# Create the would be bind mount.
> +mkdir -p $BIND
> +_mount $D0P1 $BIND
> +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> +# If they're different, we are in trouble.
> +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> +


> +# Now really prove it by corrupting the first mount with the second.
> +for i in $(seq 20); do
> +	$XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1
> +done
> +for i in $(seq 20); do
> +	$XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1
> +done
> +

> +# sync so that we really write the large file data out to the shared device
> +sync
> +
> +# now delete from one view of the shared device
> +find $BIND -type f -delete
> +# sync so that fstrim definitely has deleted data to trim
> +sync
> +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> +$FSTRIM_PROG $BIND
> +# drop caches to improve the odds we read from the corrupted device while scrubbing.
> +echo 3 > /proc/sys/vm/drop_caches
> +$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:"
> +


The rest appears to be fine.

Question: Why didn't you choose a cp --reflink=always across $MNT and 
$BIND to prove how kernel think about $MNT and $BIND.

Thanks, Anand

> +status=0
> +exit
> diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> new file mode 100644
> index 000000000..70a6db809
> --- /dev/null
> +++ b/tests/btrfs/311.out
> @@ -0,0 +1,2 @@
> +QA output created by 311
> +Error summary:    no errors found
Boris Burkov March 11, 2024, 7:06 p.m. UTC | #3
On Mon, Mar 11, 2024 at 10:14:15PM +0530, Anand Jain wrote:
> On 3/8/24 23:13, Boris Burkov wrote:
> > It is possible to confuse the btrfs device cache (fs_devices) by
> > starting with a multi-device filesystem, then removing and re-adding a
> > device in a way which changes its dev_t while the filesystem is
> > unmounted. After this procedure, if we remount, then we are in a funny
> > state where struct btrfs_device's "devt" field does not match the bd_dev
> > of the "bdev" field. I would say this is bad enough, as we have violated
> > a pretty clear invariant.
> > 
> > But for style points, we can then remove the extra device from the fs,
> > making it a single device fs, which enables the "temp_fsid" feature,
> > which permits multiple separate mounts of different devices with the
> > same fsid. Since btrfs is confused and *thinks* there are different
> > devices (based on device->devt), it allows a second redundant mount of
> > the same device (not a bind mount!). This then allows us to corrupt the
> > original mount by doing stuff to the one that should be a bind mount.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Changelog:
> > v3:
> > - fstests convention improvements (helpers, output, comments, etc...)
> > v2:
> > - fix numerous fundamental issues, v1 wasn't really ready
> > 
> >   common/config       |   1 +
> >   tests/btrfs/311     | 105 ++++++++++++++++++++++++++++++++++++++++++++
> >   tests/btrfs/311.out |   2 +
> >   3 files changed, 108 insertions(+)
> >   create mode 100755 tests/btrfs/311
> >   create mode 100644 tests/btrfs/311.out
> > 
> > diff --git a/common/config b/common/config
> > index a3b15b96f..43b517fda 100644
> > --- a/common/config
> > +++ b/common/config
> > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
> >   export GZIP_PROG="$(type -P gzip)"
> >   export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
> >   export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> > +export PARTED_PROG="$(type -P parted)"
> >   # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
> >   # newer systems have udevadm command but older systems like RHEL5 don't.
> > diff --git a/tests/btrfs/311 b/tests/btrfs/311
> > new file mode 100755
> > index 000000000..a7fa541c4
> > --- /dev/null
> > +++ b/tests/btrfs/311
> > @@ -0,0 +1,105 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> > +#
> > +# FS QA Test 311
> > +#
> > +# Test an edge case of multi device volume management in btrfs.
> > +# If a device changes devt between mounts of a multi device fs, we can trick
> > +# btrfs into mounting the same device twice fully (not as a bind mount). From
> > +# there, it is trivial to induce corruption.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick volume scrub
> 
> Please add tmpfsid as well, because this test is about when not to activate
> the tmpfsid.
> 
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_require_test
> > +_require_command "$PARTED_PROG" parted
> > +_require_batched_discard "$TEST_DIR"
> > +
> 
> _fixed_by_kernel_commit XXXXXXXXXXXX \
> 		btrfs: validate device maj:min during open
> 
> > +_cleanup() {
> > +	cd /
> > +	$UMOUNT_PROG $MNT
> > +	$UMOUNT_PROG $BIND
> > +	losetup -d $DEV0
> > +	losetup -d $DEV1
> > +	losetup -d $DEV2
> > +	rm $IMG0
> > +	rm $IMG1
> > +	rm $IMG2
> > +}
> > +
> > +IMG0=$TEST_DIR/$$.img0
> > +IMG1=$TEST_DIR/$$.img1
> > +IMG2=$TEST_DIR/$$.img2
> > +truncate -s 1G $IMG0
> > +truncate -s 1G $IMG1
> > +truncate -s 1G $IMG2
> > +DEV0=$(losetup -f $IMG0 --show)
> > +DEV1=$(losetup -f $IMG1 --show)
> > +DEV2=$(losetup -f $IMG2 --show)
> > +D0P1=$DEV0"p1"
> > +D1P1=$DEV1"p1"
> > +MNT=$TEST_DIR/mnt
> > +BIND=$TEST_DIR/bind
> > +
> > +# Setup partition table with one partition on each device.
> > +$PARTED_PROG $DEV0 'mktable gpt' --script
> > +$PARTED_PROG $DEV1 'mktable gpt' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +
> > +# mkfs with two devices to avoid clearing devices on close
> > +# single raid to allow removing DEV2.
> > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs"
> 
> Error out is already sufficient here.
> 

I don't understand, but I think Filipe asked for this one.

> > +
> > +# Cycle mount the two device fs to populate both devices into the
> > +# stale device cache.
> > +mkdir -p $MNT
> > +_mount $D0P1 $MNT
> > +$UMOUNT_PROG $MNT
> > +
> > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> > +$PARTED_PROG $DEV0 'rm 1' --script
> > +$PARTED_PROG $DEV1 'rm 1' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +
> > +# Mount with mismatched dev_t!
> > +_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> > +
> 
> On a system where the kernel bug is fixed, the mount is expected to pass.
> 
> On a system without the kernel bug fix, the mount is still expected to pass.
> 
> However, the failure message indicates that it is advisable to fail the
> mount in this scenario.
> 
> I believe this will be the case once the btrfs-progs patch below is
> integrated:
> 
> [PATCH 0/2] btrfs-progs: forget removed devices
> 
> So, we need to update the test case logic based on whether the above
> btrfs-progs patch is integrated.
> 
> 

The point of this was just to do our best to avoid doing a bunch of
writes/scrubbing/trimming on the fs under all the loopdevs and stuff in
an earlier version outside fstests. It would stink to run the script
without set -e and blow up your root drive or whatever.

With that said, this is running in TEST_DIR, so I think we can forget
about the precaution now.

> > +# Remove the extra device to bring temp-fsid back in the fray.
> > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> > +
> > +# Create the would be bind mount.
> > +mkdir -p $BIND
> > +_mount $D0P1 $BIND
> > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> > +# If they're different, we are in trouble.
> > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> > +
> 
> 
> > +# Now really prove it by corrupting the first mount with the second.
> > +for i in $(seq 20); do
> > +	$XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1
> > +done
> > +for i in $(seq 20); do
> > +	$XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1
> > +done
> > +
> 
> > +# sync so that we really write the large file data out to the shared device
> > +sync
> > +
> > +# now delete from one view of the shared device
> > +find $BIND -type f -delete
> > +# sync so that fstrim definitely has deleted data to trim
> > +sync
> > +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> > +$FSTRIM_PROG $BIND
> > +# drop caches to improve the odds we read from the corrupted device while scrubbing.
> > +echo 3 > /proc/sys/vm/drop_caches
> > +$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:"
> > +
> 
> 
> The rest appears to be fine.
> 
> Question: Why didn't you choose a cp --reflink=always across $MNT and $BIND
> to prove how kernel think about $MNT and $BIND.

Didn't occur to me. Who knows, maybe one day cross-fs reflink will work ;)

Demonstrating a corruption felt "elemental" to me.

> 
> Thanks, Anand
> 
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> > new file mode 100644
> > index 000000000..70a6db809
> > --- /dev/null
> > +++ b/tests/btrfs/311.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 311
> > +Error summary:    no errors found
>
diff mbox series

Patch

diff --git a/common/config b/common/config
index a3b15b96f..43b517fda 100644
--- a/common/config
+++ b/common/config
@@ -235,6 +235,7 @@  export BLKZONE_PROG="$(type -P blkzone)"
 export GZIP_PROG="$(type -P gzip)"
 export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
 export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
+export PARTED_PROG="$(type -P parted)"
 
 # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
 # newer systems have udevadm command but older systems like RHEL5 don't.
diff --git a/tests/btrfs/311 b/tests/btrfs/311
new file mode 100755
index 000000000..a7fa541c4
--- /dev/null
+++ b/tests/btrfs/311
@@ -0,0 +1,105 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
+#
+# FS QA Test 311
+#
+# Test an edge case of multi device volume management in btrfs.
+# If a device changes devt between mounts of a multi device fs, we can trick
+# btrfs into mounting the same device twice fully (not as a bind mount). From
+# there, it is trivial to induce corruption.
+#
+. ./common/preamble
+_begin_fstest auto quick volume scrub
+
+# real QA test starts here
+_supported_fs btrfs
+_require_test
+_require_command "$PARTED_PROG" parted
+_require_batched_discard "$TEST_DIR"
+
+_cleanup() {
+	cd /
+	$UMOUNT_PROG $MNT
+	$UMOUNT_PROG $BIND
+	losetup -d $DEV0
+	losetup -d $DEV1
+	losetup -d $DEV2
+	rm $IMG0
+	rm $IMG1
+	rm $IMG2
+}
+
+IMG0=$TEST_DIR/$$.img0
+IMG1=$TEST_DIR/$$.img1
+IMG2=$TEST_DIR/$$.img2
+truncate -s 1G $IMG0
+truncate -s 1G $IMG1
+truncate -s 1G $IMG2
+DEV0=$(losetup -f $IMG0 --show)
+DEV1=$(losetup -f $IMG1 --show)
+DEV2=$(losetup -f $IMG2 --show)
+D0P1=$DEV0"p1"
+D1P1=$DEV1"p1"
+MNT=$TEST_DIR/mnt
+BIND=$TEST_DIR/bind
+
+# Setup partition table with one partition on each device.
+$PARTED_PROG $DEV0 'mktable gpt' --script
+$PARTED_PROG $DEV1 'mktable gpt' --script
+$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
+$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
+
+# mkfs with two devices to avoid clearing devices on close
+# single raid to allow removing DEV2.
+$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs"
+
+# Cycle mount the two device fs to populate both devices into the
+# stale device cache.
+mkdir -p $MNT
+_mount $D0P1 $MNT
+$UMOUNT_PROG $MNT
+
+# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
+$PARTED_PROG $DEV0 'rm 1' --script
+$PARTED_PROG $DEV1 'rm 1' --script
+$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
+$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
+
+# Mount with mismatched dev_t!
+_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
+
+# Remove the extra device to bring temp-fsid back in the fray.
+$BTRFS_UTIL_PROG device remove $DEV2 $MNT
+
+# Create the would be bind mount.
+mkdir -p $BIND
+_mount $D0P1 $BIND
+mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
+bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
+# If they're different, we are in trouble.
+[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
+
+# Now really prove it by corrupting the first mount with the second.
+for i in $(seq 20); do
+	$XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1
+done
+for i in $(seq 20); do
+	$XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1
+done
+
+# sync so that we really write the large file data out to the shared device
+sync
+
+# now delete from one view of the shared device
+find $BIND -type f -delete
+# sync so that fstrim definitely has deleted data to trim
+sync
+# This should blow up both mounts, if the writes somehow didn't overlap at all.
+$FSTRIM_PROG $BIND
+# drop caches to improve the odds we read from the corrupted device while scrubbing.
+echo 3 > /proc/sys/vm/drop_caches
+$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:"
+
+status=0
+exit
diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
new file mode 100644
index 000000000..70a6db809
--- /dev/null
+++ b/tests/btrfs/311.out
@@ -0,0 +1,2 @@ 
+QA output created by 311
+Error summary:    no errors found