diff mbox series

[v2] btrfs/311: new test for devt change between mounts

Message ID f40e347d5a4b4b28201b1a088d38a3c75dd10ebd.1709251328.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series [v2] btrfs/311: new test for devt change between mounts | expand

Commit Message

Boris Burkov March 1, 2024, 12:03 a.m. UTC
It is possible to confuse the btrfs device cache (fs_devices) by
starting with a multi-device filesystem, then removing and re-adding a
device in a way which changes its dev_t while the filesystem is
unmounted. After this procedure, if we remount, then we are in a funny
state where struct btrfs_device's "devt" field does not match the bd_dev
of the "bdev" field. I would say this is bad enough, as we have violated
a pretty clear invariant.

But for style points, we can then remove the extra device from the fs,
making it a single device fs, which enables the "temp_fsid" feature,
which permits multiple separate mounts of different devices with the
same fsid. Since btrfs is confused and *thinks* there are different
devices (based on device->devt), it allows a second redundant mount of
the same device (not a bind mount!). This then allows us to corrupt the
original mount by doing stuff to the one that should be a bind mount.

This is fixed by the combination of the kernel patch:
btrfs: support device name lookup in forget
and the btrfs-progs patches:
btrfs-progs: allow btrfs device scan -u on dead dev
btrfs-progs: add udev rule to forget removed device

Signed-off-by: Boris Burkov <boris@bur.io>
---
 common/config       |   1 +
 common/rc           |   4 ++
 tests/btrfs/311     | 101 ++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/311.out |   2 +
 4 files changed, 108 insertions(+)
 create mode 100755 tests/btrfs/311
 create mode 100644 tests/btrfs/311.out

Comments

Filipe Manana March 5, 2024, 12:13 p.m. UTC | #1
On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote:
>
> It is possible to confuse the btrfs device cache (fs_devices) by
> starting with a multi-device filesystem, then removing and re-adding a
> device in a way which changes its dev_t while the filesystem is
> unmounted. After this procedure, if we remount, then we are in a funny
> state where struct btrfs_device's "devt" field does not match the bd_dev
> of the "bdev" field. I would say this is bad enough, as we have violated
> a pretty clear invariant.
>
> But for style points, we can then remove the extra device from the fs,
> making it a single device fs, which enables the "temp_fsid" feature,
> which permits multiple separate mounts of different devices with the
> same fsid. Since btrfs is confused and *thinks* there are different
> devices (based on device->devt), it allows a second redundant mount of
> the same device (not a bind mount!). This then allows us to corrupt the
> original mount by doing stuff to the one that should be a bind mount.
>
> This is fixed by the combination of the kernel patch:
> btrfs: support device name lookup in forget
> and the btrfs-progs patches:
> btrfs-progs: allow btrfs device scan -u on dead dev
> btrfs-progs: add udev rule to forget removed device

May I suggest to make this more readable, easier to the eye?
My inserting blank lines and indenting the lines with the patch
subjects, like for example:

"""
This is fixed by the combination of the kernel patch:

    btrfs: support device name lookup in forget

and the btrfs-progs patches:

   btrfs-progs: allow btrfs device scan -u on dead dev
   btrfs-progs: add udev rule to forget removed device
"""

And these should be placed in the test case itself with:

   _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow
btrfs device scan -u on dead dev"

For btrfs-progs commits, and for the kernel:

   _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name
lookup in forget"

I see however that there's discussion for those patches between you,
Anand and David, and it
seems there's a chance those patches won't be merged to fix this bug,
especially the kernel one
for which Anand submitted an alternative. If those are added to the
test case itself, can always be
updated later if needed.

Also avoid putting the test number in the subject - there's no
guarantee it will end up with that number when merged.

>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  common/config       |   1 +
>  common/rc           |   4 ++
>  tests/btrfs/311     | 101 ++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/311.out |   2 +
>  4 files changed, 108 insertions(+)
>  create mode 100755 tests/btrfs/311
>  create mode 100644 tests/btrfs/311.out
>
> diff --git a/common/config b/common/config
> index a3b15b96f..43b517fda 100644
> --- a/common/config
> +++ b/common/config
> @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
>  export GZIP_PROG="$(type -P gzip)"
>  export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
>  export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> +export PARTED_PROG="$(type -P parted)"
>
>  # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
>  # newer systems have udevadm command but older systems like RHEL5 don't.
> diff --git a/common/rc b/common/rc
> index 30c44dddd..8e009aca9 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -5375,6 +5375,10 @@ _require_unshare() {
>                 _notrun "unshare $*: command not found, should be in util-linux"
>  }
>
> +_require_parted() {

These three last functions from common/rc use that style, the { after
the function name in the same line,
but everywhere else the { is on a new line, and that's the most common
style in fstests.
I wish we had some consistency.

> +       $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found"

Why not just call:

_require_command "$PARTED_PROG" parted

Could even do that in the test and no need for a common function in
this file, as we do in many other tests
(grep for "'_require_command'" in tests/generic for example).

> +}
> +
>  # Return a random file in a directory. A directory is *not* followed
>  # recursively.
>  _random_file() {
> diff --git a/tests/btrfs/311 b/tests/btrfs/311
> new file mode 100755
> index 000000000..887c46ba0
> --- /dev/null
> +++ b/tests/btrfs/311
> @@ -0,0 +1,101 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> +#
> +# FS QA Test 311
> +#
> +# Test an edge case of multi device volume management in btrfs.
> +# If a device changes devt between mounts of a multi device fs, we can trick
> +# btrfs into mounting the same device twice fully (not as a bind mount). From
> +# there, it is trivial to induce corruption.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick volume

Missing 'scrub' group.

> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_require_test
> +_require_parted
> +
> +_cleanup() {
> +       cd /
> +       umount $MNT
> +       umount $BIND
> +       losetup -d $DEV0
> +       losetup -d $DEV1
> +       losetup -d $DEV2
> +       rm $IMG0
> +       rm $IMG1
> +       rm $IMG2
> +}
> +
> +IMG0=$TEST_DIR/$$.img0
> +IMG1=$TEST_DIR/$$.img1
> +IMG2=$TEST_DIR/$$.img2
> +truncate -s 1G $IMG0
> +truncate -s 1G $IMG1
> +truncate -s 1G $IMG2
> +DEV0=$(losetup -f $IMG0 --show)
> +DEV1=$(losetup -f $IMG1 --show)
> +DEV2=$(losetup -f $IMG2 --show)
> +D0P1=$DEV0"p1"
> +D1P1=$DEV1"p1"
> +MNT=$TEST_DIR/mnt
> +BIND=$TEST_DIR/bind
> +
> +# Setup partition table with one partition on each device.
> +$PARTED_PROG $DEV0 'mktable gpt' --script
> +$PARTED_PROG $DEV1 'mktable gpt' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +
> +# mkfs with two devices to avoid clearing devices on close
> +# single raid to allow removing DEV2.
> +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null

Please redirect to the $seqres.full instead, and _fail if mkfs fails.
That's what we do nowadays due to unpleasant surprises in the past.

> +
> +# Cycle mount the two device fs to populate both devices into the
> +# stale device cache.
> +mkdir -p $MNT
> +mount $D0P1 $MNT

Use the _mount() helper.

> +umount $MNT

We use $UMOUNT_PROG in fstests.

> +
> +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> +$PARTED_PROG $DEV0 'rm 1' --script
> +$PARTED_PROG $DEV1 'rm 1' --script
> +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> +
> +# Mount with mismatched dev_t!
> +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"

Same here.

> +
> +# Remove the extra device to bring temp-fsid back in the fray.
> +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> +
> +# Create the would be bind mount.
> +mkdir -p $BIND
> +mount $D0P1 $BIND

Same here.

> +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> +# If they're different, we are in trouble.
> +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> +
> +# Now really prove it by corrupting the first mount with the second.
> +for i in $(seq 20); do
> +       $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1

This is overriding the .full file, use >>

> +done
> +for i in $(seq 20); do
> +       $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1

Same here, this is overriding the .full file, use >>

> +done
> +sync

Can we please have a comment mentioning why the sync is needed?

> +find $BIND -type f -delete
> +sync

Same here.

> +
> +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> +$FSTRIM_PROG $BIND

Since it's using the fstrim program and needs trim to be supported,
the test misses a:

_require_batched_discard "$TEST_DIR"

at the top.

> +echo 3 > /proc/sys/vm/drop_caches

Please add a comment mentioning why we are dropping caches.

> +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1

The test passes whether scrub fails or succeeds, as it's redirecting
stdout and stderr to the .full file and ignoring the exit status of
the command.

The ideal fstests way is to put the expected output in the golden
output file and just call the command without redirecting anything.
If that's not doable for some reason, at the very least do a (...) ||
echo "Scrub failed, check $seqres.full for details."

Thanks.

> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> new file mode 100644
> index 000000000..62f253029
> --- /dev/null
> +++ b/tests/btrfs/311.out
> @@ -0,0 +1,2 @@
> +QA output created by 311
> +Silence is golden
> --
> 2.43.0
>
>
Boris Burkov March 7, 2024, 8:49 p.m. UTC | #2
On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote:
> On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote:
> >
> > It is possible to confuse the btrfs device cache (fs_devices) by
> > starting with a multi-device filesystem, then removing and re-adding a
> > device in a way which changes its dev_t while the filesystem is
> > unmounted. After this procedure, if we remount, then we are in a funny
> > state where struct btrfs_device's "devt" field does not match the bd_dev
> > of the "bdev" field. I would say this is bad enough, as we have violated
> > a pretty clear invariant.
> >
> > But for style points, we can then remove the extra device from the fs,
> > making it a single device fs, which enables the "temp_fsid" feature,
> > which permits multiple separate mounts of different devices with the
> > same fsid. Since btrfs is confused and *thinks* there are different
> > devices (based on device->devt), it allows a second redundant mount of
> > the same device (not a bind mount!). This then allows us to corrupt the
> > original mount by doing stuff to the one that should be a bind mount.
> >
> > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget
> > and the btrfs-progs patches:
> > btrfs-progs: allow btrfs device scan -u on dead dev
> > btrfs-progs: add udev rule to forget removed device
> 
> May I suggest to make this more readable, easier to the eye?
> My inserting blank lines and indenting the lines with the patch
> subjects, like for example:
> 
> """
> This is fixed by the combination of the kernel patch:
> 
>     btrfs: support device name lookup in forget
> 
> and the btrfs-progs patches:
> 
>    btrfs-progs: allow btrfs device scan -u on dead dev
>    btrfs-progs: add udev rule to forget removed device
> """
> 
> And these should be placed in the test case itself with:
> 
>    _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow
> btrfs device scan -u on dead dev"
> 
> For btrfs-progs commits, and for the kernel:
> 
>    _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name
> lookup in forget"

I'll do so going forward, thanks.

> 
> I see however that there's discussion for those patches between you,
> Anand and David, and it
> seems there's a chance those patches won't be merged to fix this bug,
> especially the kernel one
> for which Anand submitted an alternative. If those are added to the
> test case itself, can always be
> updated later if needed.

I'm leaning towards omitting this part from the commit message and we
can update the test case once we pick a winner. Is that OK with you?

> 
> Also avoid putting the test number in the subject - there's no
> guarantee it will end up with that number when merged.
> 
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  common/config       |   1 +
> >  common/rc           |   4 ++
> >  tests/btrfs/311     | 101 ++++++++++++++++++++++++++++++++++++++++++++
> >  tests/btrfs/311.out |   2 +
> >  4 files changed, 108 insertions(+)
> >  create mode 100755 tests/btrfs/311
> >  create mode 100644 tests/btrfs/311.out
> >
> > diff --git a/common/config b/common/config
> > index a3b15b96f..43b517fda 100644
> > --- a/common/config
> > +++ b/common/config
> > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
> >  export GZIP_PROG="$(type -P gzip)"
> >  export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
> >  export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> > +export PARTED_PROG="$(type -P parted)"
> >
> >  # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
> >  # newer systems have udevadm command but older systems like RHEL5 don't.
> > diff --git a/common/rc b/common/rc
> > index 30c44dddd..8e009aca9 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -5375,6 +5375,10 @@ _require_unshare() {
> >                 _notrun "unshare $*: command not found, should be in util-linux"
> >  }
> >
> > +_require_parted() {
> 
> These three last functions from common/rc use that style, the { after
> the function name in the same line,
> but everywhere else the { is on a new line, and that's the most common
> style in fstests.
> I wish we had some consistency.
> 
> > +       $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found"
> 
> Why not just call:
> 
> _require_command "$PARTED_PROG" parted
> 
> Could even do that in the test and no need for a common function in
> this file, as we do in many other tests
> (grep for "'_require_command'" in tests/generic for example).
> 
> > +}
> > +
> >  # Return a random file in a directory. A directory is *not* followed
> >  # recursively.
> >  _random_file() {
> > diff --git a/tests/btrfs/311 b/tests/btrfs/311
> > new file mode 100755
> > index 000000000..887c46ba0
> > --- /dev/null
> > +++ b/tests/btrfs/311
> > @@ -0,0 +1,101 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> > +#
> > +# FS QA Test 311
> > +#
> > +# Test an edge case of multi device volume management in btrfs.
> > +# If a device changes devt between mounts of a multi device fs, we can trick
> > +# btrfs into mounting the same device twice fully (not as a bind mount). From
> > +# there, it is trivial to induce corruption.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick volume
> 
> Missing 'scrub' group.
> 
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_require_test
> > +_require_parted
> > +
> > +_cleanup() {
> > +       cd /
> > +       umount $MNT
> > +       umount $BIND
> > +       losetup -d $DEV0
> > +       losetup -d $DEV1
> > +       losetup -d $DEV2
> > +       rm $IMG0
> > +       rm $IMG1
> > +       rm $IMG2
> > +}
> > +
> > +IMG0=$TEST_DIR/$$.img0
> > +IMG1=$TEST_DIR/$$.img1
> > +IMG2=$TEST_DIR/$$.img2
> > +truncate -s 1G $IMG0
> > +truncate -s 1G $IMG1
> > +truncate -s 1G $IMG2
> > +DEV0=$(losetup -f $IMG0 --show)
> > +DEV1=$(losetup -f $IMG1 --show)
> > +DEV2=$(losetup -f $IMG2 --show)
> > +D0P1=$DEV0"p1"
> > +D1P1=$DEV1"p1"
> > +MNT=$TEST_DIR/mnt
> > +BIND=$TEST_DIR/bind
> > +
> > +# Setup partition table with one partition on each device.
> > +$PARTED_PROG $DEV0 'mktable gpt' --script
> > +$PARTED_PROG $DEV1 'mktable gpt' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +
> > +# mkfs with two devices to avoid clearing devices on close
> > +# single raid to allow removing DEV2.
> > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null
> 
> Please redirect to the $seqres.full instead, and _fail if mkfs fails.
> That's what we do nowadays due to unpleasant surprises in the past.
> 
> > +
> > +# Cycle mount the two device fs to populate both devices into the
> > +# stale device cache.
> > +mkdir -p $MNT
> > +mount $D0P1 $MNT
> 
> Use the _mount() helper.
> 
> > +umount $MNT
> 
> We use $UMOUNT_PROG in fstests.
> 
> > +
> > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> > +$PARTED_PROG $DEV0 'rm 1' --script
> > +$PARTED_PROG $DEV1 'rm 1' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +
> > +# Mount with mismatched dev_t!
> > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> 
> Same here.
> 
> > +
> > +# Remove the extra device to bring temp-fsid back in the fray.
> > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> > +
> > +# Create the would be bind mount.
> > +mkdir -p $BIND
> > +mount $D0P1 $BIND
> 
> Same here.
> 
> > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> > +# If they're different, we are in trouble.
> > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> > +
> > +# Now really prove it by corrupting the first mount with the second.
> > +for i in $(seq 20); do
> > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1
> 
> This is overriding the .full file, use >>
> 
> > +done
> > +for i in $(seq 20); do
> > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1
> 
> Same here, this is overriding the .full file, use >>
> 
> > +done
> > +sync
> 
> Can we please have a comment mentioning why the sync is needed?
> 
> > +find $BIND -type f -delete
> > +sync
> 
> Same here.
> 
> > +
> > +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> > +$FSTRIM_PROG $BIND
> 
> Since it's using the fstrim program and needs trim to be supported,
> the test misses a:
> 
> _require_batched_discard "$TEST_DIR"
> 
> at the top.
> 
> > +echo 3 > /proc/sys/vm/drop_caches
> 
> Please add a comment mentioning why we are dropping caches.
> 
> > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1
> 
> The test passes whether scrub fails or succeeds, as it's redirecting
> stdout and stderr to the .full file and ignoring the exit status of
> the command.
> 
> The ideal fstests way is to put the expected output in the golden
> output file and just call the command without redirecting anything.
> If that's not doable for some reason, at the very least do a (...) ||
> echo "Scrub failed, check $seqres.full for details."
> 
> Thanks.
> 
> > +
> > +# success, all done
> > +echo "Silence is golden"
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> > new file mode 100644
> > index 000000000..62f253029
> > --- /dev/null
> > +++ b/tests/btrfs/311.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 311
> > +Silence is golden
> > --
> > 2.43.0
> >
> >
Filipe Manana March 8, 2024, 11:55 a.m. UTC | #3
On Thu, Mar 7, 2024 at 8:48 PM Boris Burkov <boris@bur.io> wrote:
>
> On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote:
> > On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > It is possible to confuse the btrfs device cache (fs_devices) by
> > > starting with a multi-device filesystem, then removing and re-adding a
> > > device in a way which changes its dev_t while the filesystem is
> > > unmounted. After this procedure, if we remount, then we are in a funny
> > > state where struct btrfs_device's "devt" field does not match the bd_dev
> > > of the "bdev" field. I would say this is bad enough, as we have violated
> > > a pretty clear invariant.
> > >
> > > But for style points, we can then remove the extra device from the fs,
> > > making it a single device fs, which enables the "temp_fsid" feature,
> > > which permits multiple separate mounts of different devices with the
> > > same fsid. Since btrfs is confused and *thinks* there are different
> > > devices (based on device->devt), it allows a second redundant mount of
> > > the same device (not a bind mount!). This then allows us to corrupt the
> > > original mount by doing stuff to the one that should be a bind mount.
> > >
> > > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget
> > > and the btrfs-progs patches:
> > > btrfs-progs: allow btrfs device scan -u on dead dev
> > > btrfs-progs: add udev rule to forget removed device
> >
> > May I suggest to make this more readable, easier to the eye?
> > My inserting blank lines and indenting the lines with the patch
> > subjects, like for example:
> >
> > """
> > This is fixed by the combination of the kernel patch:
> >
> >     btrfs: support device name lookup in forget
> >
> > and the btrfs-progs patches:
> >
> >    btrfs-progs: allow btrfs device scan -u on dead dev
> >    btrfs-progs: add udev rule to forget removed device
> > """
> >
> > And these should be placed in the test case itself with:
> >
> >    _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow
> > btrfs device scan -u on dead dev"
> >
> > For btrfs-progs commits, and for the kernel:
> >
> >    _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name
> > lookup in forget"
>
> I'll do so going forward, thanks.
>
> >
> > I see however that there's discussion for those patches between you,
> > Anand and David, and it
> > seems there's a chance those patches won't be merged to fix this bug,
> > especially the kernel one
> > for which Anand submitted an alternative. If those are added to the
> > test case itself, can always be
> > updated later if needed.
>
> I'm leaning towards omitting this part from the commit message and we
> can update the test case once we pick a winner. Is that OK with you?

Yes, thanks.

>
> >
> > Also avoid putting the test number in the subject - there's no
> > guarantee it will end up with that number when merged.
> >
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  common/config       |   1 +
> > >  common/rc           |   4 ++
> > >  tests/btrfs/311     | 101 ++++++++++++++++++++++++++++++++++++++++++++
> > >  tests/btrfs/311.out |   2 +
> > >  4 files changed, 108 insertions(+)
> > >  create mode 100755 tests/btrfs/311
> > >  create mode 100644 tests/btrfs/311.out
> > >
> > > diff --git a/common/config b/common/config
> > > index a3b15b96f..43b517fda 100644
> > > --- a/common/config
> > > +++ b/common/config
> > > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
> > >  export GZIP_PROG="$(type -P gzip)"
> > >  export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
> > >  export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> > > +export PARTED_PROG="$(type -P parted)"
> > >
> > >  # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
> > >  # newer systems have udevadm command but older systems like RHEL5 don't.
> > > diff --git a/common/rc b/common/rc
> > > index 30c44dddd..8e009aca9 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -5375,6 +5375,10 @@ _require_unshare() {
> > >                 _notrun "unshare $*: command not found, should be in util-linux"
> > >  }
> > >
> > > +_require_parted() {
> >
> > These three last functions from common/rc use that style, the { after
> > the function name in the same line,
> > but everywhere else the { is on a new line, and that's the most common
> > style in fstests.
> > I wish we had some consistency.
> >
> > > +       $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found"
> >
> > Why not just call:
> >
> > _require_command "$PARTED_PROG" parted
> >
> > Could even do that in the test and no need for a common function in
> > this file, as we do in many other tests
> > (grep for "'_require_command'" in tests/generic for example).
> >
> > > +}
> > > +
> > >  # Return a random file in a directory. A directory is *not* followed
> > >  # recursively.
> > >  _random_file() {
> > > diff --git a/tests/btrfs/311 b/tests/btrfs/311
> > > new file mode 100755
> > > index 000000000..887c46ba0
> > > --- /dev/null
> > > +++ b/tests/btrfs/311
> > > @@ -0,0 +1,101 @@
> > > +#! /bin/bash
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> > > +#
> > > +# FS QA Test 311
> > > +#
> > > +# Test an edge case of multi device volume management in btrfs.
> > > +# If a device changes devt between mounts of a multi device fs, we can trick
> > > +# btrfs into mounting the same device twice fully (not as a bind mount). From
> > > +# there, it is trivial to induce corruption.
> > > +#
> > > +. ./common/preamble
> > > +_begin_fstest auto quick volume
> >
> > Missing 'scrub' group.
> >
> > > +
> > > +# real QA test starts here
> > > +_supported_fs btrfs
> > > +_require_test
> > > +_require_parted
> > > +
> > > +_cleanup() {
> > > +       cd /
> > > +       umount $MNT
> > > +       umount $BIND
> > > +       losetup -d $DEV0
> > > +       losetup -d $DEV1
> > > +       losetup -d $DEV2
> > > +       rm $IMG0
> > > +       rm $IMG1
> > > +       rm $IMG2
> > > +}
> > > +
> > > +IMG0=$TEST_DIR/$$.img0
> > > +IMG1=$TEST_DIR/$$.img1
> > > +IMG2=$TEST_DIR/$$.img2
> > > +truncate -s 1G $IMG0
> > > +truncate -s 1G $IMG1
> > > +truncate -s 1G $IMG2
> > > +DEV0=$(losetup -f $IMG0 --show)
> > > +DEV1=$(losetup -f $IMG1 --show)
> > > +DEV2=$(losetup -f $IMG2 --show)
> > > +D0P1=$DEV0"p1"
> > > +D1P1=$DEV1"p1"
> > > +MNT=$TEST_DIR/mnt
> > > +BIND=$TEST_DIR/bind
> > > +
> > > +# Setup partition table with one partition on each device.
> > > +$PARTED_PROG $DEV0 'mktable gpt' --script
> > > +$PARTED_PROG $DEV1 'mktable gpt' --script
> > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > > +
> > > +# mkfs with two devices to avoid clearing devices on close
> > > +# single raid to allow removing DEV2.
> > > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null
> >
> > Please redirect to the $seqres.full instead, and _fail if mkfs fails.
> > That's what we do nowadays due to unpleasant surprises in the past.
> >
> > > +
> > > +# Cycle mount the two device fs to populate both devices into the
> > > +# stale device cache.
> > > +mkdir -p $MNT
> > > +mount $D0P1 $MNT
> >
> > Use the _mount() helper.
> >
> > > +umount $MNT
> >
> > We use $UMOUNT_PROG in fstests.
> >
> > > +
> > > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> > > +$PARTED_PROG $DEV0 'rm 1' --script
> > > +$PARTED_PROG $DEV1 'rm 1' --script
> > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > > +
> > > +# Mount with mismatched dev_t!
> > > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> >
> > Same here.
> >
> > > +
> > > +# Remove the extra device to bring temp-fsid back in the fray.
> > > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> > > +
> > > +# Create the would be bind mount.
> > > +mkdir -p $BIND
> > > +mount $D0P1 $BIND
> >
> > Same here.
> >
> > > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> > > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> > > +# If they're different, we are in trouble.
> > > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> > > +
> > > +# Now really prove it by corrupting the first mount with the second.
> > > +for i in $(seq 20); do
> > > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1
> >
> > This is overriding the .full file, use >>
> >
> > > +done
> > > +for i in $(seq 20); do
> > > +       $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1
> >
> > Same here, this is overriding the .full file, use >>
> >
> > > +done
> > > +sync
> >
> > Can we please have a comment mentioning why the sync is needed?
> >
> > > +find $BIND -type f -delete
> > > +sync
> >
> > Same here.
> >
> > > +
> > > +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> > > +$FSTRIM_PROG $BIND
> >
> > Since it's using the fstrim program and needs trim to be supported,
> > the test misses a:
> >
> > _require_batched_discard "$TEST_DIR"
> >
> > at the top.
> >
> > > +echo 3 > /proc/sys/vm/drop_caches
> >
> > Please add a comment mentioning why we are dropping caches.
> >
> > > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1
> >
> > The test passes whether scrub fails or succeeds, as it's redirecting
> > stdout and stderr to the .full file and ignoring the exit status of
> > the command.
> >
> > The ideal fstests way is to put the expected output in the golden
> > output file and just call the command without redirecting anything.
> > If that's not doable for some reason, at the very least do a (...) ||
> > echo "Scrub failed, check $seqres.full for details."
> >
> > Thanks.
> >
> > > +
> > > +# success, all done
> > > +echo "Silence is golden"
> > > +status=0
> > > +exit
> > > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> > > new file mode 100644
> > > index 000000000..62f253029
> > > --- /dev/null
> > > +++ b/tests/btrfs/311.out
> > > @@ -0,0 +1,2 @@
> > > +QA output created by 311
> > > +Silence is golden
> > > --
> > > 2.43.0
> > >
> > >
diff mbox series

Patch

diff --git a/common/config b/common/config
index a3b15b96f..43b517fda 100644
--- a/common/config
+++ b/common/config
@@ -235,6 +235,7 @@  export BLKZONE_PROG="$(type -P blkzone)"
 export GZIP_PROG="$(type -P gzip)"
 export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
 export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
+export PARTED_PROG="$(type -P parted)"
 
 # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
 # newer systems have udevadm command but older systems like RHEL5 don't.
diff --git a/common/rc b/common/rc
index 30c44dddd..8e009aca9 100644
--- a/common/rc
+++ b/common/rc
@@ -5375,6 +5375,10 @@  _require_unshare() {
 		_notrun "unshare $*: command not found, should be in util-linux"
 }
 
+_require_parted() {
+	$PARTED_PROG --list &>/dev/null || _notrun "parted: command not found"
+}
+
 # Return a random file in a directory. A directory is *not* followed
 # recursively.
 _random_file() {
diff --git a/tests/btrfs/311 b/tests/btrfs/311
new file mode 100755
index 000000000..887c46ba0
--- /dev/null
+++ b/tests/btrfs/311
@@ -0,0 +1,101 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
+#
+# FS QA Test 311
+#
+# Test an edge case of multi device volume management in btrfs.
+# If a device changes devt between mounts of a multi device fs, we can trick
+# btrfs into mounting the same device twice fully (not as a bind mount). From
+# there, it is trivial to induce corruption.
+#
+. ./common/preamble
+_begin_fstest auto quick volume
+
+# real QA test starts here
+_supported_fs btrfs
+_require_test
+_require_parted
+
+_cleanup() {
+	cd /
+	umount $MNT
+	umount $BIND
+	losetup -d $DEV0
+	losetup -d $DEV1
+	losetup -d $DEV2
+	rm $IMG0
+	rm $IMG1
+	rm $IMG2
+}
+
+IMG0=$TEST_DIR/$$.img0
+IMG1=$TEST_DIR/$$.img1
+IMG2=$TEST_DIR/$$.img2
+truncate -s 1G $IMG0
+truncate -s 1G $IMG1
+truncate -s 1G $IMG2
+DEV0=$(losetup -f $IMG0 --show)
+DEV1=$(losetup -f $IMG1 --show)
+DEV2=$(losetup -f $IMG2 --show)
+D0P1=$DEV0"p1"
+D1P1=$DEV1"p1"
+MNT=$TEST_DIR/mnt
+BIND=$TEST_DIR/bind
+
+# Setup partition table with one partition on each device.
+$PARTED_PROG $DEV0 'mktable gpt' --script
+$PARTED_PROG $DEV1 'mktable gpt' --script
+$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
+$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
+
+# mkfs with two devices to avoid clearing devices on close
+# single raid to allow removing DEV2.
+$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null
+
+# Cycle mount the two device fs to populate both devices into the
+# stale device cache.
+mkdir -p $MNT
+mount $D0P1 $MNT
+umount $MNT
+
+# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
+$PARTED_PROG $DEV0 'rm 1' --script
+$PARTED_PROG $DEV1 'rm 1' --script
+$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
+$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
+
+# Mount with mismatched dev_t!
+mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
+
+# Remove the extra device to bring temp-fsid back in the fray.
+$BTRFS_UTIL_PROG device remove $DEV2 $MNT
+
+# Create the would be bind mount.
+mkdir -p $BIND
+mount $D0P1 $BIND
+mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
+bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
+# If they're different, we are in trouble.
+[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
+
+# Now really prove it by corrupting the first mount with the second.
+for i in $(seq 20); do
+	$XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1
+done
+for i in $(seq 20); do
+	$XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1
+done
+sync
+find $BIND -type f -delete
+sync
+
+# This should blow up both mounts, if the writes somehow didn't overlap at all.
+$FSTRIM_PROG $BIND
+echo 3 > /proc/sys/vm/drop_caches
+$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
new file mode 100644
index 000000000..62f253029
--- /dev/null
+++ b/tests/btrfs/311.out
@@ -0,0 +1,2 @@ 
+QA output created by 311
+Silence is golden