
fstest: CrashMonkey tests ported to xfstest

Message ID CA+EzBbB5t5bSZ0hB6Y1jkxTgOLOv809Jbbcx7w5eK_XS_FPG7w@mail.gmail.com
State New, archived
Series fstest: CrashMonkey tests ported to xfstest

Commit Message

Jayashree Mohan Oct. 29, 2018, 4:50 p.m. UTC
Hi all,

As discussed previously, I have ported a few CrashMonkey tests to
xfstests for your review. I have written a patch for testing
simple "hard link" behaviour. This patch batches 37 test cases
generated by CrashMonkey into one xfstests test, all of them testing
hard-link behaviour. These 37 tests have been grouped based on their
similarities; there are comments in the patch describing what each
group tests. At a high level, we test the creation of hard links
between files in the same directory and files across two directories,
while fsyncing either the files involved, their parent directories, or
unrelated sibling files.

We aim to convert the other CrashMonkey tests in the same way, batched
by the type of file-system operation under test. This particular test
took 10 seconds to run on my VM (Ubuntu 16.04, 2GB RAM, single core)
with a 100MB scratch partition. I have added a per-test-case timer in
this patch, which shows each test case takes about 0.25 seconds (to
run, remount dm-flakey, check, and clean up). This test passes cleanly
on ext4, xfs and F2FS. It shows three failed cases on btrfs as of
kernel 4.16 (I don't think the fix has landed in 4.19, so you should
see the failures on the newest kernel as well).

Thinking more about it, none of the CrashMonkey test cases check for
resource exhaustion, so we don't need more than 100MB of scratch
device. If the per-test external overhead due to fsck/scrub etc.
depends on the scratch partition size, we could speed up the
CrashMonkey tests by allowing a new scratch-like device (much smaller
in size, about 100-200MB). Additionally, our tests do not need fsck to
be run at the end. Is there a way to skip fsck after each test (if
it's something configurable in the test file)?

If this sort of patch seems fine to you, I can go ahead and port the
other CrashMonkey tests in the same way. After batching, I expect
there would be around 10-12 test files like the one attached here.

Patch below:
---



Thanks,
Jayashree Mohan

Comments

Filipe Manana Oct. 30, 2018, 1:05 p.m. UTC | #1
On Mon, Oct 29, 2018 at 4:51 PM Jayashree Mohan <jayashree2912@gmail.com> wrote:
>
> Hi all,
>
> As discussed previously, I have ported a few CrashMonkey tests into
> xfstest test-case for your review. I have written a patch for testing
> simple "hard link" behaviour. This patch batches 37 test cases
> generated by CrashMonkey into one  xfstest test, all of them testing
> hard-link behaviour.  These 37 tests have been grouped based on the
> similarities; there are comments in the patch to
> describe what each group is testing for. On a high-level, we are
> testing the creation of hard links between files in same directory and
> files across two directories, while allowing fsync of either the files
> involved, their parent directories, or unrelated sibling files.
>
> We aim to convert the other CrashMonkey tests in the same way, batched
> by the type of file-system operation we test. This particular test
> took 10 seconds to run on my VM (Ubuntu 16.04, 2GB RAM, single core)
> on 100MB scratch partition. I have added per test case timer in this
> patch, which shows each test case takes about 0.25 seconds (to run,
> flakey_remount, test and cleanup). This test passes clean on ext4, xfs
> and F2FS. It will show three failed cases as of kernel 4.16 on btrfs
> (I think it's not yet patched on 4.19, so you should see the failure
> on the newest kernel as well).
>
> Thinking more about it, none of the CrashMonkey test cases check for
> resource exhaustion and hence we don't need more than 100MB of scratch
> device. If the per-test external overhead due to fsck/scrub etc
> depends on the scratch partition size, we can speed up the CrashMonkey
> tests by allowing a new device like scratch(much smaller in size -
> about 100-200MB). Additionally, our tests also do not need fsck to be
> run at the end. Is there a way to skip performing fsck after each test
> (if its something configurable in the test file) ?

Just use _require_scratch_nocheck() instead of _require_scratch().

For this type of test, though, I think it's a good idea to let fsck run.

Even if all of the links are persisted, the log/journal replay might
have caused metadata inconsistencies in the fs, for example - this was
true for many cases I fixed over the years in btrfs.
Even if fsck doesn't report any problem now, it's still good to run
it, to help prevent future regressions.

Plus, this test creates a very small fs; it's not like fsck will take
a significant time to run.
So for all these reasons I would unmount and fsck after each test.
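
A rough, untested sketch of what that per-test check could look like,
reusing the flakey helpers this patch already pulls in (note that
_check_scratch_fs accepts the device to check, so it can be pointed at
$FLAKEY_DEV):

	_unmount_flakey
	_check_scratch_fs $FLAKEY_DEV || _fail "fsck failed"
	_mount_flakey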

The test looks good to me, only some coding style comments below.

>
> If this sort of patch seems fine to you, I can go ahead and port the
> other CrashMonkey tests in the same way. After batching, I hope there
> would be around 10-12 test files like the one attached here.
>
> Patch below:
> ---
>
> diff --git a/tests/generic/517-link b/tests/generic/517-link
> new file mode 100755
> index 0000000..791b6c0
> --- /dev/null
> +++ b/tests/generic/517-link
> @@ -0,0 +1,162 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
> +#
> +# FS QA Test 517-link
> +#
> +# Test if we create a hard link to a file and persist either of the
> files, all the names persist.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1    # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    _cleanup_flakey
> +    cd /
> +    rm -f $tmp.*
> +}
> +
> +init_start_time=$(date +%s.%N)
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +
> +# initialize scratch device
> +_scratch_mkfs >>$seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +_mount_flakey
> +init_end_time=$(date +%s.%N)
> +init_time_elapsed=$(echo "$init_end_time-$init_start_time" | bc)
> +echo "Initialization time = $init_time_elapsed" >> $seqres.full
> +
> +stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
> +test_num=0
> +total_time=0
> +
> +cleanup()
> +{
> +    rm -rf $SCRATCH_MNT/*
> +    sync
> +}
> +
> +# create a hard link $2 to file $1, and fsync $3, followed by power-cut
> +test_link()
> +{
> +    start_time=$(date +%s.%N)
> +    test_num=$((test_num + 1))
> +    sibling=0
> +    before=""
> +    after=""

Several of these are local variables, so prefix them with the 'local' keyword.

> +    echo -ne "\n=== Test $test_num :  link $1 $2 " | tee -a $seqres.full

You need to use the _filter_scratch filter here and then adjust the
golden output (the .out file).
Otherwise the test will fail for anyone with a different mount path
for the scratch device.
Yours is /mnt/scratch but mine is /home/fdmanana/scratch_1, for example.
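
For example, something like this (sketch only - _filter_scratch
rewrites the scratch mount path into the literal string SCRATCH_MNT):

	echo -ne "\n=== Test $test_num :  link $1 $2 " | _filter_scratch

The golden output then carries lines such as
"=== Test 1 :  link SCRATCH_MNT/foo SCRATCH_MNT/bar", which match
regardless of where the scratch device is mounted.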

> +
> +    # now execute the workload
> +    mkdir -p "${1%/*}" && mkdir -p "${2%/*}" && touch "$1"
> +    ln $1 $2
> +
> +    if [ -z "$3" ]
> +    then

The fstests coding style is like:

if [ whatever ]; then
fi

This applies to all if statements below.

> +        echo -ne " with sync ===\n"
> +        sync
> +        before=`stat "$stat_opt" $1`
> +    else
> +        echo " with fsync $3 ==="
> +
> +        # If the file being persisted is a sibling, create it first
> +        if [ ! -f $3 ]
> +        then
> +            sibling=1
> +            touch $3
> +        fi
> +
> +        $XFS_IO_PROG -c "fsync" $3 | _filter_xfs_io

No need to use the filter here. The fsync command, unlike pwrite for
example, doesn't output rates or times.

> +
> +        if [ $sibling -ne 1 ]
> +        then
> +            before=`stat "$stat_opt" $1`
> +        fi
> +    fi
> +
> +    _flakey_drop_and_remount | tee -a $seqres.full
> +
> +    if [ -f $1 ]
> +    then
> +        after=`stat "$stat_opt" $1`
> +    fi
> +
> +    if [ "$before" != "$after" ] && [ $sibling -ne 1 ]
> +    then
> +        echo "Before: $before" | tee -a $seqres.full
> +        echo "After: $after" | tee -a $seqres.full
> +    fi
> +
> +    cleanup
> +    end_time=$(date +%s.%N)
> +    time_elapsed=$(echo "$end_time-$start_time" | bc)
> +    echo " Elapsed time : $time_elapsed" >> $seqres.full
> +    total_time=$(echo "$total_time+$time_elapsed" | bc)

Add one space before the + and another after, for readability and to
conform with the code base.

> +    echo " Total time : $total_time" >> $seqres.full
> +}
> +
> +# run the link test for different combinations
> +
> +test_start_time=$(date +%s.%N)
> +# Group 1: Both files within root directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar; do
> +    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> +
> +# Group 2: Create hard link in a sub directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar
> +
> +# Group 3: Create hard link in parent directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar
> +
> +# Group 4: Both files within a directory other than root
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/A $SCRATCH_MNT/A/bar
> $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar
> +
> +#Group 5: Exercise name reuse : Link file in sub-directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar
> +
> +#Group 6: Exercise name reuse : Link file in parent directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar
> +
> +test_end_time=$(date +%s.%N)
> +test_time_elapsed=$(echo "$test_end_time-$test_start_time" | bc)
> +echo "Test Elapsed time : $test_time_elapsed" >> $seqres.full
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
> new file mode 100644
> index 0000000..381b644
> --- /dev/null
> +++ b/tests/generic/517-link.out
> @@ -0,0 +1,75 @@
> +QA output created by 517-link
> +
> +=== Test 1 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 2 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 3 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 4 :  link /mnt/scratch/foo /mnt/scratch/bar  with sync ===
> +
> +=== Test 5 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 6 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 7 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 8 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 9 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 10 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 11 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 12 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 13 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 14 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 15 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 16 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 17 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 18 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with sync ===
> +
> +=== Test 19 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 20 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 21 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 22 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 23 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 24 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 25 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 26 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 27 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 28 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 29 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 30 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 31 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 32 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 33 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 34 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 35 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 36 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 37 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with sync ===
> diff --git a/tests/generic/group b/tests/generic/group
> index 47de978..dc04152 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -519,3 +519,4 @@
>  514 auto quick clone
>  515 auto quick clone
>  516 auto quick dedupe clone
> +517-link crash

It should also be in the 'quick' and 'log' groups.
I'm not sure it's worth adding a new 'crash' group; the test isn't
fundamentally different from other power-failure tests.

Thanks.

>
>
> Thanks,
> Jayashree Mohan
Jayashree Mohan Nov. 2, 2018, 8:39 p.m. UTC | #2
Hi Filipe,

Thanks for the feedback on the patch. Will fix the coding style as you
suggested.

> For this type of tests, I think it's a good idea to let fsck run.
>
> Even if all of the links are persisted, the log/journal replay might
> have caused metadata inconsistencies in the fs for example - this was
> true for many cases I fixed over the years in btrfs.
> Even if fsck doesn't report any problem now, it's still good to run
> it, to help prevent future regressions.
>
> Plus this test creates a very small fs, it's not like fsck will take a
> significant time to run.
> So for all these reasons I would unmount and fsck after each test.

Originally, there are around 300 crash-consistency tests generated by
CrashMonkey. To run fsck after each test, we would have to convert each
of these tests into an equivalent xfstests test case. We previously had
a discussion about this on the thread here
(https://www.spinics.net/lists/fstests/msg10718.html), and the
suggestion was to batch similar tests together to reduce the external
per-test overhead due to scrub/fsck.
Additionally, batching similar tests will result in around 15 new test
cases that could be added to the 'quick' group, as opposed to adding
300 new tests.

However, if you still recommend that fsck be run after each test, I
can submit patches for 300 individual test cases. Let me know which
is preferable.

Thanks,
Jayashree.
Eryu Guan Nov. 4, 2018, 4:27 p.m. UTC | #3
On Fri, Nov 02, 2018 at 03:39:51PM -0500, Jayashree Mohan wrote:
> Hi Filipe,
> 
> Thanks for the feedback on the patch. Will fix the coding style as you
> suggested.
> 
> > For this type of tests, I think it's a good idea to let fsck run.
> >
> > Even if all of the links are persisted, the log/journal replay might
> > have caused metadata inconsistencies in the fs for example - this was
> > true for many cases I fixed over the years in btrfs.
> > Even if fsck doesn't report any problem now, it's still good to run
> > it, to help prevent future regressions.
> >
> > Plus this test creates a very small fs, it's not like fsck will take a
> > significant time to run.
> > So for all these reasons I would unmount and fsck after each test.

This looks reasonable to me.

> 
> Originally, there are 300 odd crash-consistency tests generated by
> CrashMonkey. To run fsck after each test, we will have to convert each
> of these tests into an equivalent xfstest test-case. We previously had
> a discussion about this, on the thread here (
> https://www.spinics.net/lists/fstests/msg10718.html ) and the
> suggestion was to batch similar tests together to reduce the external
> per-test overhead due to scrub/fsck.

You could batch similar tests together but still run fsck after each
sub-test by calling _check_scratch_fs manually, and call
_require_scratch_nocheck to indicate there's no need to do fsck at the
end of the test.
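
Something along these lines (untested sketch):

	# in the setup section, instead of _require_scratch
	_require_scratch_nocheck

	# at the end of each sub-test
	_unmount_flakey
	_check_scratch_fs $FLAKEY_DEV
	_mount_flakey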

> Additionally, batching similar tests will result in around 15 new test
> cases that could be added to the 'quick group', as opposed to adding
> 300 new tests.

I think we could batch similar tests, create a relatively small fs
(e.g. 256M, as btrfs defaults to mixed mode if the fs is < 256M, and
the btrfs folks wanted to test non-mixed mode by default), and run
fsck after each sub-test first, then see how long the tests take.

Thanks,
Eryu

> 
> However, if you still recommend that fsck be run after each test, I
> can submit patches for 300 individual test cases. Let me know which
> one is preferable.
> 
> Thanks,
> Jayashree.
Eryu Guan Nov. 4, 2018, 4:38 p.m. UTC | #4
On Mon, Oct 29, 2018 at 11:50:39AM -0500, Jayashree Mohan wrote:
> Hi all,
> 
> As discussed previously, I have ported a few CrashMonkey tests into
> xfstest test-case for your review. I have written a patch for testing
> simple "hard link" behaviour. This patch batches 37 test cases
> generated by CrashMonkey into one  xfstest test, all of them testing
> hard-link behaviour.  These 37 tests have been grouped based on the
> similarities; there are comments in the patch to
> describe what each group is testing for. On a high-level, we are
> testing the creation of hard links between files in same directory and
> files across two directories, while allowing fsync of either the files
> involved, their parent directories, or unrelated sibling files.
> 
> We aim to convert the other CrashMonkey tests in the same way, batched
> by the type of file-system operation we test. This particular test
> took 10 seconds to run on my VM (Ubuntu 16.04, 2GB RAM, single core)
> on 100MB scratch partition. I have added per test case timer in this
> patch, which shows each test case takes about 0.25 seconds (to run,
> flakey_remount, test and cleanup). This test passes clean on ext4, xfs
> and F2FS. It will show three failed cases as of kernel 4.16 on btrfs
> (I think it's not yet patched on 4.19, so you should see the failure
> on the newest kernel as well).
> 
> Thinking more about it, none of the CrashMonkey test cases check for
> resource exhaustion and hence we don't need more than 100MB of scratch
> device. If the per-test external overhead due to fsck/scrub etc
> depends on the scratch partition size, we can speed up the CrashMonkey
> tests by allowing a new device like scratch(much smaller in size -
> about 100-200MB). Additionally, our tests also do not need fsck to be
> run at the end. Is there a way to skip performing fsck after each test
> (if its something configurable in the test file) ?
> 
> If this sort of patch seems fine to you, I can go ahead and port the
> other CrashMonkey tests in the same way. After batching, I hope there
> would be around 10-12 test files like the one attached here.
> 
> Patch below:
> ---
> 
> diff --git a/tests/generic/517-link b/tests/generic/517-link
> new file mode 100755
> index 0000000..791b6c0
> --- /dev/null
> +++ b/tests/generic/517-link
> @@ -0,0 +1,162 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
> +#
> +# FS QA Test 517-link
> +#
> +# Test if we create a hard link to a file and persist either of the
> files, all the names persist.

Your mail agent seems to wrap long lines, which makes the patch
impossible to apply. There are many other such places in the patch.
Could you please take a look?

> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1    # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    _cleanup_flakey

Please use a single tab for indentation, not 4 spaces.

> +    cd /
> +    rm -f $tmp.*
> +}
> +
> +init_start_time=$(date +%s.%N)
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +
> +# initialize scratch device
> +_scratch_mkfs >>$seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +_mount_flakey
> +init_end_time=$(date +%s.%N)
> +init_time_elapsed=$(echo "$init_end_time-$init_start_time" | bc)
> +echo "Initialization time = $init_time_elapsed" >> $seqres.full
> +
> +stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
> +test_num=0
> +total_time=0
> +
> +cleanup()

Hmm, this could confuse people, as we already have a _cleanup()
function. But I think it'd be better to re-create the fs for each test,
so every sub-test starts with a clean filesystem and there's no need to
do cleanup after each sub-test. (Yeah, this may consume more test time.)
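
A per-test reset could look roughly like this (untested sketch; I'm
assuming the flakey target has to be torn down before mkfs and set up
again afterwards):

	_unmount_flakey
	_cleanup_flakey
	_scratch_mkfs >> $seqres.full 2>&1
	_init_flakey
	_mount_flakey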

Thanks,
Eryu

> +{
> +    rm -rf $SCRATCH_MNT/*
> +    sync
> +}
> +
> +# create a hard link $2 to file $1, and fsync $3, followed by power-cut
> +test_link()
> +{
> +    start_time=$(date +%s.%N)
> +    test_num=$((test_num + 1))
> +    sibling=0
> +    before=""
> +    after=""
> +    echo -ne "\n=== Test $test_num :  link $1 $2 " | tee -a $seqres.full
> +
> +    # now execute the workload
> +    mkdir -p "${1%/*}" && mkdir -p "${2%/*}" && touch "$1"
> +    ln $1 $2
> +
> +    if [ -z "$3" ]
> +    then
> +        echo -ne " with sync ===\n"
> +        sync
> +        before=`stat "$stat_opt" $1`
> +    else
> +        echo " with fsync $3 ==="
> +
> +        # If the file being persisted is a sibling, create it first
> +        if [ ! -f $3 ]
> +        then
> +            sibling=1
> +            touch $3
> +        fi
> +
> +        $XFS_IO_PROG -c "fsync" $3 | _filter_xfs_io
> +
> +        if [ $sibling -ne 1 ]
> +        then
> +            before=`stat "$stat_opt" $1`
> +        fi
> +    fi
> +
> +    _flakey_drop_and_remount | tee -a $seqres.full
> +
> +    if [ -f $1 ]
> +    then
> +        after=`stat "$stat_opt" $1`
> +    fi
> +
> +    if [ "$before" != "$after" ] && [ $sibling -ne 1 ]
> +    then
> +        echo "Before: $before" | tee -a $seqres.full
> +        echo "After: $after" | tee -a $seqres.full
> +    fi
> +
> +    cleanup
> +    end_time=$(date +%s.%N)
> +    time_elapsed=$(echo "$end_time-$start_time" | bc)
> +    echo " Elapsed time : $time_elapsed" >> $seqres.full
> +    total_time=$(echo "$total_time+$time_elapsed" | bc)
> +    echo " Total time : $total_time" >> $seqres.full
> +}
> +
> +# run the link test for different combinations
> +
> +test_start_time=$(date +%s.%N)
> +# Group 1: Both files within root directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar; do
> +    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> +
> +# Group 2: Create hard link in a sub directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar
> +
> +# Group 3: Create hard link in parent directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar
> +
> +# Group 4: Both files within a directory other than root
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/A $SCRATCH_MNT/A/bar
> $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar
> +
> +#Group 5: Exercise name reuse : Link file in sub-directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar
> +
> +#Group 6: Exercise name reuse : Link file in parent directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +    test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar
> +
> +test_end_time=$(date +%s.%N)
> +test_time_elapsed=$(echo "$test_end_time-$test_start_time" | bc)
> +echo "Test Elapsed time : $test_time_elapsed" >> $seqres.full
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
> new file mode 100644
> index 0000000..381b644
> --- /dev/null
> +++ b/tests/generic/517-link.out
> @@ -0,0 +1,75 @@
> +QA output created by 517-link
> +
> +=== Test 1 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 2 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 3 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 4 :  link /mnt/scratch/foo /mnt/scratch/bar  with sync ===
> +
> +=== Test 5 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 6 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 7 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 8 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 9 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 10 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 11 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 12 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 13 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 14 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 15 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 16 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 17 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 18 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with sync ===
> +
> +=== Test 19 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 20 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 21 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 22 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 23 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 24 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 25 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 26 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 27 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 28 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 29 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 30 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with sync ===
> +
> +=== Test 31 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch ===
> +
> +=== Test 32 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/foo ===
> +
> +=== Test 33 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/bar ===
> +
> +=== Test 34 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A ===
> +
> +=== Test 35 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A/bar ===
> +
> +=== Test 36 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync
> /mnt/scratch/A/foo ===
> +
> +=== Test 37 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with sync ===
> diff --git a/tests/generic/group b/tests/generic/group
> index 47de978..dc04152 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -519,3 +519,4 @@
>  514 auto quick clone
>  515 auto quick clone
>  516 auto quick dedupe clone
> +517-link crash
> 
> 
> Thanks,
> Jayashree Mohan
Jayashree Mohan Nov. 4, 2018, 8:21 p.m. UTC | #5
Hi Eryu and Filipe,

I have incorporated the coding style you suggested and renamed the cleanup function. However, creating a clean fs image after each sub-test results in about 10-15 seconds of additional overhead overall. Instead, I clean up the working directory and unmount. The total runtime of the test varies between 12 and 15 seconds.

Please find the patch below

---

diff --git a/tests/generic/517-link b/tests/generic/517-link
new file mode 100755
index 0000000..ea5c5b7
--- /dev/null
+++ b/tests/generic/517-link
@@ -0,0 +1,164 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
+#
+# FS QA Test 517-link
+#
+# Test case created by CrashMonkey
+#
+# Test if we create a hard link to a file and persist either of the files, all the names persist.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+init_start_time=$(date +%s.%N)
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch_nocheck
+_require_dm_target flakey
+
+# initialize scratch device
+_scratch_mkfs >> $seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+init_end_time=$(date +%s.%N)
+init_time_elapsed=$(echo "$init_end_time - $init_start_time" | bc) 
+echo "Initialization time = $init_time_elapsed" >> $seqres.full
+
+stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
+test_num=0
+total_time=0
+
+_clean_dir()
+{
+	_mount_flakey
+	rm -rf $SCRATCH_MNT/*
+	sync
+	_unmount_flakey
+}
+
+# create a hard link $2 to file $1, and fsync $3, followed by power-cut
+test_link()
+{
+	test_num=$((test_num + 1))
+	local start_time=$(date +%s.%N)
+	local sibling=0
+	local before=""
+	local after=""
+	echo -ne "\n=== Test $test_num :  link $1 $2 " | _filter_scratch
+	_mount_flakey
+	
+	# now execute the workload
+	mkdir -p "${1%/*}" && mkdir -p "${2%/*}" && touch "$1"
+	ln $1 $2
+	
+	if [ -z "$3" ]; then
+		echo -ne " with sync ===\n"
+		sync
+		before=`stat "$stat_opt" $1`
+	else
+		echo " with fsync $3 ===" | _filter_scratch
+		
+		# If the file being persisted is a sibling, create it first
+		if [ ! -f $3 ]; then
+			sibling=1
+			touch $3
+		fi
+
+		$XFS_IO_PROG -c "fsync" $3
+		
+		if [ $sibling -ne 1 ]; then
+			before=`stat "$stat_opt" $1`
+		fi
+	fi
+
+	_flakey_drop_and_remount | tee -a $seqres.full
+	
+	if [ -f $1 ]; then
+		after=`stat "$stat_opt" $1`
+	fi
+
+	if [ "$before" != "$after" ] && [ $sibling -ne 1 ]; then
+		echo "Before: $before" | tee -a $seqres.full
+		echo "After: $after" | tee -a $seqres.full	
+	fi
+
+	_unmount_flakey
+	_check_scratch_fs $FLAKEY_DEV
+	[ $? -ne 0 ] && _fatal "fsck failed"
+	end_time=$(date +%s.%N)
+	time_elapsed=$(echo "$end_time - $start_time" | bc)
+	echo " Elapsed time : $time_elapsed" >> $seqres.full
+	total_time=$(echo "$total_time + $time_elapsed" | bc)
+	echo " Total time : $total_time" >> $seqres.full
+	_clean_dir
+}
+
+# run the link test for different combinations
+
+test_start_time=$(date +%s.%N)
+# Group 1: Both files within root directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar; do
+	test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar 
+
+# Group 2: Create hard link in a sub directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+	test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar 
+
+# Group 3: Create hard link in parent directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+	test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar 
+
+# Group 4: Both files within a directory other than root
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+	test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar 
+
+#Group 5: Exercise name reuse : Link file in sub-directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+	test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar 
+
+#Group 6: Exercise name reuse : Link file in parent directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+	test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar 
+
+test_end_time=$(date +%s.%N)
+test_time_elapsed=$(echo "$test_end_time-$test_start_time" | bc) 
+echo "Test Elapsed time : $test_time_elapsed" >> $seqres.full
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
new file mode 100644
index 0000000..9555860
--- /dev/null
+++ b/tests/generic/517-link.out
@@ -0,0 +1,75 @@
+QA output created by 517-link
+
+=== Test 1 :  link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== Test 2 :  link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== Test 3 :  link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== Test 4 :  link SCRATCH_MNT/foo SCRATCH_MNT/bar  with sync ===
+
+=== Test 5 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== Test 6 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
+
+=== Test 7 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
+
+=== Test 8 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== Test 9 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== Test 10 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== Test 11 :  link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with sync ===
+
+=== Test 12 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== Test 13 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== Test 14 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== Test 15 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
+
+=== Test 16 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== Test 17 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== Test 18 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with sync ===
+
+=== Test 19 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== Test 20 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== Test 21 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== Test 22 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== Test 23 :  link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with sync ===
+
+=== Test 24 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== Test 25 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
+
+=== Test 26 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
+
+=== Test 27 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== Test 28 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== Test 29 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== Test 30 :  link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with sync ===
+
+=== Test 31 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== Test 32 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== Test 33 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== Test 34 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
+
+=== Test 35 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== Test 36 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== Test 37 :  link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with sync ===
diff --git a/tests/generic/group b/tests/generic/group
index 47de978..67e9108 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -519,3 +519,4 @@
 514 auto quick clone
 515 auto quick clone
 516 auto quick dedupe clone
+517-link quick log


Dave Chinner Nov. 5, 2018, 5:22 a.m. UTC | #6
On Sun, Nov 04, 2018 at 02:21:21PM -0600, Jayashree Mohan wrote:
> Hi Eryu and Filipe,
> 
> I have incorporated the coding style suggested, and renamed the cleanup function. However, creating a clean fs image after each sub test is resulting in about 10-15s of additional overhead overall. I instead clean up the working directory and unmount. The time for the tests varies between 12-15 seconds.
> 
> Please find the patch below
> 
> —
> 
> diff --git a/tests/generic/517-link b/tests/generic/517-link
> new file mode 100755
> index 0000000..ea5c5b7
> --- /dev/null
> +++ b/tests/generic/517-link
> @@ -0,0 +1,164 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
> +#
> +# FS QA Test 517-link
> +#
> +# Test case created by CrashMonkey
> +#
> +# Test if we create a hard link to a file and persist either of the files, all the names persist.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +init_start_time=$(date +%s.%N)
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_nocheck
> +_require_dm_target flakey
> +
> +# initialize scratch device
> +_scratch_mkfs >> $seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +init_end_time=$(date +%s.%N)
> +init_time_elapsed=$(echo "$init_end_time - $init_start_time" | bc) 
> +echo "Initialization time = $init_time_elapsed" >> $seqres.full

No timing inside the test, please. The harness times the test
execution.

> +stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
> +test_num=0

Don't number tests. If we add an extra test, it will completely
break the golden output, as shown in the sketch below.
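
E.g. the banner could identify each case by its arguments alone,
something like this (sketch):

	echo -e "\n=== link $1 $2 with fsync $3 ===" | _filter_scratch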

> +total_time=0
> +
> +_clean_dir()
> +{
> +	_mount_flakey
> +	rm -rf $SCRATCH_MNT/*
> +	sync
> +	_unmount_flakey
> +}

Why not just _scratch_mkfs?

> +
> +# create a hard link $2 to file $1, and fsync $3, followed by power-cut
> +test_link()
> +{
> +	test_num=$((test_num + 1))
> +	local start_time=$(date +%s.%N)
> +	local sibling=0
> +	local before=""
> +	local after=""
> +	echo -ne "\n=== Test $test_num :  link $1 $2 " | _filter_scratch
> +	_mount_flakey
> +	

Still whitespace damaged.

> +	# now execute the workload
> +	mkdir -p "${1%/*}" && mkdir -p "${2%/*}" && touch "$1"

One line each. 

> +	ln $1 $2

No checks for failure?

> +	
> +	if [ -z "$3" ]; then
> +		echo -ne " with sync ===\n"
> +		sync
> +		before=`stat "$stat_opt" $1`
> +	else
> +		echo " with fsync $3 ===" | _filter_scratch
> +		
> +		# If the file being persisted is a sibling, create it first
> +		if [ ! -f $3 ]; then
> +			sibling=1
> +			touch $3
> +		fi
> +
> +		$XFS_IO_PROG -c "fsync" $3
> +		
> +		if [ $sibling -ne 1 ]; then
> +			before=`stat "$stat_opt" $1`
> +		fi
> +	fi

To tell the truth, I'd consider these two separate functions -
test_link_sync() and test_link_fsync(). More below.


> +
> +	_flakey_drop_and_remount | tee -a $seqres.full
> +	
> +	if [ -f $1 ]; then
> +		after=`stat "$stat_opt" $1`
> +	fi
> +
> +	if [ "$before" != "$after" ] && [ $sibling -ne 1 ]; then
> +		echo "Before: $before" | tee -a $seqres.full
> +		echo "After: $after" | tee -a $seqres.full	
> +	fi

Why tee all the output to $seqres.full? Once the timing info is
removed, it doesn't contain anything extra that isn't in the normal
output file...

> +
> +	_unmount_flakey
> +	_check_scratch_fs $FLAKEY_DEV
> +	[ $? -ne 0 ] && _fatal "fsck failed"
> +	end_time=$(date +%s.%N)
> +	time_elapsed=$(echo "$end_time - $start_time" | bc)
> +	echo " Elapsed time : $time_elapsed" >> $seqres.full
> +	total_time=$(echo "$total_time + $time_elapsed" | bc)
> +	echo " Total time : $total_time" >> $seqres.full
> +	_clean_dir
> +}
> +
> +# run the link test for different combinations
> +
> +test_start_time=$(date +%s.%N)
> +# Group 1: Both files within root directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar; do
> +	test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar 
> +
> +# Group 2: Create hard link in a sub directory
> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
> +	test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
> +done
> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar 

Lines too long.

Basically, this is a two-dimensional array of filenames.
Make test_link_fsync() do:

	src=$SCRATCH_MNT/$1
	dst=$SCRATCH_MNT/$2
	fsync=$SCRATCH_MNT/$3

And define your fsync filenames and link destination names like so
(can't remember bash array notation, so paraphrasing):

# group 1: Both files within root directory
fnames[1]="./ foo bar"
dnames[1]="foo bar"

# group 2: Create hard link in a sub directory
fnames[2]="./ foo bar A A/bar A/foo"
dnames[2]="foo A/bar"

#g3
fnames[3]="./ foo bar A A/bar A/foo"
dnames[3]="A/foo bar"

#g4
fnames[4]="./ A A/bar A/foo"
dnames[4]="A/foo A/bar"

#g5
fnames[5]="./ foo bar A A/bar A/foo"
dnames[5]="bar A/bar"

#g6
fnames[6]="./ foo bar A A/bar A/foo"
dnames[6]="A/bar bar"

for ((i=1; i<=6; i++)); do
	for f in ${fnames[$i]}; do
		test_link_fsync ${dnames[$i]} $f
	done
	test_link_sync ${dnames[$i]}
done

And all this shouty, shouty noise goes away. Much easier to read,
much simpler to maintain.

Cheers,

Dave.
Jayashree Mohan Nov. 5, 2018, 8:16 p.m. UTC | #7
Hi Dave,

Thanks for the comments. 

>> +init_end_time=$(date +%s.%N)
>> +init_time_elapsed=$(echo "$init_end_time - $init_start_time" | bc) 
>> +echo "Initialization time = $init_time_elapsed" >> $seqres.full
> 
> No timing inside the test, please. The harness times the test
> execution.

Sure, that was meant for the initial evaluation. I’ve removed it now.

> 
>> +total_time=0
>> +
>> +_clean_dir()
>> +{
>> +	_mount_flakey
>> +	rm -rf $SCRATCH_MNT/*
>> +	sync
>> +	_unmount_flakey
>> +}
> 
> Why not just _scratch_mkfs?

I believe that to _scratch_mkfs, I must first _cleanup dm_flakey. If I replace the above snippet by 
_cleanup
_scratch_mkfs
_init_flakey

The time taken for the test goes up by around 10 seconds (due to mkfs maybe). So I thought it was sufficient to remove the working directory.



> 
> Why tee all the output to $seqres.full? Once the timing info is
> removed, it doesn't contain anything extra over the normal output
> file…

Yeah, it was required for my timing info. Removed it now.


> 
>> +# Group 2: Create hard link in a sub directory
>> +for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
>> +	test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
>> +done
>> +test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar 
> 
> Lines too long.
> 
> Basically, this is a two dimensional array of filenames.
> Make test_link_fsync() do:
> 
> And all this shouty, shouty noise goes away. Much easier to read,
> much simpler to maintain.

Thanks! I think the code looks cleaner now.

—

diff --git a/tests/generic/517-link b/tests/generic/517-link
new file mode 100755
index 0000000..7108300
--- /dev/null
+++ b/tests/generic/517-link
@@ -0,0 +1,176 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
+#
+# FS QA Test 517-link
+#
+# Test case created by CrashMonkey
+#
+# Test that if we create a hard link to a file and persist either of the files, all the names persist.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch_nocheck
+_require_dm_target flakey
+
+# initialize scratch device
+_scratch_mkfs >> $seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+
+stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
+before=""
+after=""
+
+_clean_dir()
+{
+	_mount_flakey
+	rm -rf $SCRATCH_MNT/*
+	sync
+	_unmount_flakey
+}
+
+_check_consistency()
+{
+	_flakey_drop_and_remount | tee -a $seqres.full
+	
+	if [ -f $1 ]; then
+		after=`stat "$stat_opt" $1`
+	fi
+
+	if [ "$before" != "$after" ] && [ $2 -ne 1 ]; then
+		echo "Before: $before" 
+		echo "After: $after"	
+	fi
+
+	_unmount_flakey
+	_check_scratch_fs $FLAKEY_DEV
+	[ $? -ne 0 ] && _fatal "fsck failed"
+}
+
+# create a hard link $2 to file $1, and fsync $3, followed by power-cut
+test_link_fsync()
+{
+	local sibling=0
+	before=""
+	after=""
+	src=$SCRATCH_MNT/$1
+	dest=$SCRATCH_MNT/$2
+	
+	if [ "$3" == "./" ]; then
+		fsync=$SCRATCH_MNT
+	else
+		fsync=$SCRATCH_MNT/$3
+	fi
+
+	echo -ne "\n=== link $src $dest  with fsync $fsync ===\n" | _filter_scratch
+	_mount_flakey
+	
+	# now execute the workload
+	mkdir -p "${src%/*}"
+	mkdir -p "${dest%/*}" 
+	touch $src
+	ln $src $dest || _fail "Link creation failed"
+	
+	# If the file being persisted is a sibling, create it first
+	if [ ! -f $fsync ]; then
+		sibling=1
+		touch $fsync
+	fi
+
+	$XFS_IO_PROG -c "fsync" $fsync
+		
+	if [ $sibling -ne 1 ]; then
+		before=`stat "$stat_opt" $src`
+	fi
+
+	_check_consistency $src $sibling
+	_clean_dir
+}
+
+# create a hard link $2 to file $1, and sync, followed by power-cut
+test_link_sync()
+{
+	src=$SCRATCH_MNT/$1
+	dest=$SCRATCH_MNT/$2
+	before=""
+	after=""
+	echo -ne "\n=== link $src $dest  with sync ===\n" | _filter_scratch
+	_mount_flakey
+	
+	# now execute the workload
+	mkdir -p "${src%/*}"
+	mkdir -p "${dest%/*}" 
+	touch $src
+	ln $src $dest || _fail "Link creation failed"	
+	sync
+	before=`stat "$stat_opt" $src`
+	
+	_check_consistency $src 0
+	_clean_dir
+}
+
+
+# Create different combinations to run the link test
+# Group 0: Both files within root directory
+file_names[0]="foo bar"
+fsync_names[0]="./ foo bar"
+
+# Group 1: Create hard link in a sub directory
+file_names[1]="foo A/bar"
+fsync_names[1]="./ foo bar A A/bar A/foo"
+
+# Group 2: Create hard link in parent directory
+file_names[2]="A/foo bar"
+fsync_names[2]="./ foo bar A A/bar A/foo"
+
+# Group 3: Both files within a directory other than root
+file_names[3]="A/foo A/bar"
+fsync_names[3]="./ A A/bar A/foo"
+
+#Group 4: Exercise name reuse : Link file in sub-directory
+file_names[4]="bar A/bar"
+fsync_names[4]="./ foo bar A A/bar A/foo"
+
+#Group 5: Exercise name reuse : Link file in parent directory
+file_names[5]="A/bar bar"
+fsync_names[5]="./ foo bar A A/bar A/foo"
+
+for ((test_group=0; test_group<6; test_group++)); do
+	for file in ${fsync_names[$test_group]}; do
+		test_link_fsync ${file_names[$test_group]} $file
+	done
+	test_link_sync ${file_names[$test_group]}
+done 
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
new file mode 100644
index 0000000..1f80b27
--- /dev/null
+++ b/tests/generic/517-link.out
@@ -0,0 +1,75 @@
+QA output created by 517-link
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with sync ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with sync ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with sync ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with sync ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with sync ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
+
+=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with sync ===
diff --git a/tests/generic/group b/tests/generic/group
index 47de978..67e9108 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -519,3 +519,4 @@
 514 auto quick clone
 515 auto quick clone
 516 auto quick dedupe clone
+517-link quick log
Dave Chinner Nov. 6, 2018, 10:57 p.m. UTC | #8
On Mon, Nov 05, 2018 at 02:16:57PM -0600, Jayashree Mohan wrote:
> Hi Dave,
> 
> Thanks for the comments. 
> 
> >> +init_end_time=$(date +%s.%N)
> >> +init_time_elapsed=$(echo "$init_end_time - $init_start_time" | bc) 
> >> +echo "Initialization time = $init_time_elapsed" >> $seqres.full
> > 
> > No timing inside the test, please. The harness times the test
> > execution.
> 
> Sure, that was meant for the initial evaluation. I’ve removed it now.

Thanks!

> 
> > 
> >> +total_time=0
> >> +
> >> +_clean_dir()
> >> +{
> >> +	_mount_flakey
> >> +	rm -rf $SCRATCH_MNT/*
> >> +	sync
> >> +	_unmount_flakey
> >> +}
> > 
> > Why not just _scratch_mkfs?
> 
> I believe that to _scratch_mkfs, I must first _cleanup dm_flakey.
> If I replace the above snippet by 
> _cleanup
> _scratch_mkfs
> _init_flakey
> 
> The time taken for the test goes up by around 10 seconds (due to
> mkfs maybe). So I thought it was sufficient to remove the working
> directory.

Ok. Put that in a comment (comments are for explaining why, not how
or what). :)
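
Something like this, perhaps (the wording is just a sketch):

	# Reuse the fs between subtests rather than re-running
	# _scratch_mkfs - tearing down and re-initialising dm-flakey
	# around mkfs adds ~10s to the runtime of this test.
	_clean_dir()
	{
		_mount_flakey
		rm -rf $SCRATCH_MNT/*
		sync
		_unmount_flakey
	}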

Overall the new version looks a lot better. More comments
below.

> diff --git a/tests/generic/517-link b/tests/generic/517-link
> new file mode 100755
> index 0000000..7108300
> --- /dev/null
> +++ b/tests/generic/517-link

Please just use "517" as the test name. At some point it was though
that descriptive names for tests might be useful, but it's turned
out that's not the case and we have just continued to number them.

> @@ -0,0 +1,176 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
> +#
> +# FS QA Test 517-link
> +#
> +# Test case created by CrashMonkey
> +#
> +# Test that if we create a hard link to a file and persist either of the files, all the names persist.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_nocheck
> +_require_dm_target flakey
> +
> +# initialize scratch device
> +_scratch_mkfs >> $seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +
> +stat_opt='-c "blocks: %b size: %s inode: %i links: %h"'
> +before=""
> +after=""
> +
> +_clean_dir()
> +{
> +	_mount_flakey
> +	rm -rf $SCRATCH_MNT/*
> +	sync
> +	_unmount_flakey
> +}
> +
> +_check_consistency()
> +{
> +	_flakey_drop_and_remount | tee -a $seqres.full
> +	

Whitespace damage.

> +	if [ -f $1 ]; then
> +		after=`stat "$stat_opt" $1`
> +	fi
> +
> +	if [ "$before" != "$after" ] && [ $2 -ne 1 ]; then
> +		echo "Before: $before" 
> +		echo "After: $after"	

Whitespace.

Add this to your .vimrc:

" highlight whitespace damage
highlight RedundantSpaces ctermbg=red guibg=red
match RedundantSpaces /\s\+$\| \+\ze\t/


> +	fi
> +
> +	_unmount_flakey
> +	_check_scratch_fs $FLAKEY_DEV
> +	[ $? -ne 0 ] && _fatal "fsck failed"
> +}
> +
> +# create a hard link $2 to file $1, and fsync $3, followed by power-cut
> +test_link_fsync()
> +{
> +	local sibling=0
> +	before=""
> +	after=""
> +	src=$SCRATCH_MNT/$1
> +	dest=$SCRATCH_MNT/$2
> +	

Whitespace

> +	if [ "$3" == "./" ]; then
> +		fsync=$SCRATCH_MNT
> +	else
> +		fsync=$SCRATCH_MNT/$3
> +	fi

Why doesn't fsync of $SCRATCH_MNT/./ work?
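
i.e. would just doing this (untested) not work:

	# $SCRATCH_MNT/./ resolves to the scratch mount itself
	fsync=$SCRATCH_MNT/$3

(the "===" lines in the golden output would change to match, of
course)?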

> +
> +	echo -ne "\n=== link $src $dest  with fsync $fsync ===\n" | _filter_scratch
> +	_mount_flakey
> +	

Whitespace

> +	# now execute the workload
> +	mkdir -p "${src%/*}"

Please explain why this magic directory creation is needed - "now
execute the workload" isn't going to remind me what this does in a
couple of years' time.

> +	mkdir -p "${dest%/*}" 

Whitespace

> +	touch $src
> +	ln $src $dest || _fail "Link creation failed"

No need for _fail here. The error message will end up in the output
and the test will fail the golden output match. That way it will
still run all the other tests.

> +	

Whitespace

> +	# If the file being persisted is a sibling, create it first
> +	if [ ! -f $fsync ]; then
> +		sibling=1
> +		touch $fsync
> +	fi
> +
> +	$XFS_IO_PROG -c "fsync" $fsync
> +		

Whitespace

> +	if [ $sibling -ne 1 ]; then
> +		before=`stat "$stat_opt" $src`
> +	fi
> +
> +	_check_consistency $src $sibling
> +	_clean_dir
> +}
> +
> +# create a hard link $2 to file $1, and sync, followed by power-cut
> +test_link_sync()
> +{
> +	src=$SCRATCH_MNT/$1
> +	dest=$SCRATCH_MNT/$2
> +	before=""
> +	after=""
> +	echo -ne "\n=== link $src $dest  with sync ===\n" | _filter_scratch
> +	_mount_flakey
> +	

Whitespace

> +	# now execute the workload
> +	mkdir -p "${src%/*}"
> +	mkdir -p "${dest%/*}" 

Whitespace

> +	touch $src
> +	ln $src $dest || _fail "Link creation failed"	

Whitespace

> +	sync
> +	before=`stat "$stat_opt" $src`
> +	

Whitespace

> +	_check_consistency $src 0
> +	_clean_dir
> +}
> +
> +
> +# Create different combinations to run the link test
> +# Group 0: Both files within root directory
> +file_names[0]="foo bar"
> +fsync_names[0]="./ foo bar"
> +
> +# Group 1: Create hard link in a sub directory
> +file_names[1]="foo A/bar"
> +fsync_names[1]="./ foo bar A A/bar A/foo"
> +
> +# Group 2: Create hard link in parent directory
> +file_names[2]="A/foo bar"
> +fsync_names[2]="./ foo bar A A/bar A/foo"
> +
> +# Group 3: Both files within a directory other than root
> +file_names[3]="A/foo A/bar"
> +fsync_names[3]="./ A A/bar A/foo"
> +
> +#Group 4: Exercise name reuse : Link file in sub-directory
> +file_names[4]="bar A/bar"
> +fsync_names[4]="./ foo bar A A/bar A/foo"
> +
> +#Group 5: Exercise name reuse : Link file in parent directory
> +file_names[5]="A/bar bar"
> +fsync_names[5]="./ foo bar A A/bar A/foo"
> +
> +for ((test_group=0; test_group<6; test_group++)); do
> +	for file in ${fsync_names[$test_group]}; do
> +		test_link_fsync ${file_names[$test_group]} $file
> +	done
> +	test_link_sync ${file_names[$test_group]}
> +done 

Whitespace

> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
> new file mode 100644
> index 0000000..1f80b27
> --- /dev/null
> +++ b/tests/generic/517-link.out
> @@ -0,0 +1,75 @@
> +QA output created by 517-link
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/bar  with sync ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
> +
> +=== link SCRATCH_MNT/foo SCRATCH_MNT/A/bar  with sync ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/bar  with sync ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
> +
> +=== link SCRATCH_MNT/A/foo SCRATCH_MNT/A/bar  with sync ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/foo ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/bar ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/bar ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with fsync SCRATCH_MNT/A/foo ===
> +
> +=== link SCRATCH_MNT/bar SCRATCH_MNT/A/bar  with sync ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/foo ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/bar ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/bar ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with fsync SCRATCH_MNT/A/foo ===
> +
> +=== link SCRATCH_MNT/A/bar SCRATCH_MNT/bar  with sync ===
> diff --git a/tests/generic/group b/tests/generic/group
> index 47de978..67e9108 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -519,3 +519,4 @@
>  514 auto quick clone
>  515 auto quick clone
>  516 auto quick dedupe clone
> +517-link quick log
Theodore Ts'o Nov. 6, 2018, 11:15 p.m. UTC | #9
On Mon, Nov 05, 2018 at 02:16:57PM -0600, Jayashree Mohan wrote:
> 
> I believe that to _scratch_mkfs, I must first _cleanup dm_flakey. If I replace the above snippet by 
> _cleanup
> _scratch_mkfs
> _init_flakey
> 
> The time taken for the test goes up by around 10 seconds (due to mkfs maybe). So I thought it was sufficient to remove the working directory.

Can you try adding _check_scratch_fs after each test case?  Yes, it
will increase the test time, but it will make it easier for a
developer to figure out what might be going on.

Also, how big does the file system have to be?  I wonder if we can
speed things up if a ramdisk is used as the backing device for
dm-flakey.
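
Something like the following might do, assuming the brd module is
available (rd_size is in KiB, so this gives a 200MB /dev/ram0 to
point SCRATCH_DEV at; completely untested):

	modprobe brd rd_nr=1 rd_size=204800
	SCRATCH_DEV=/dev/ram0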

On the flip side, am I remembering correctly that the original
technique used by your research paper used a custom I/O stack, so you
could find potential problems not just in operations getting lost
after a power drop, no matter how unlikely, but in anything that
isn't guaranteed by the cache flush command?

One argument for not using a ramdisk to speed things up is that it
would make it much less likely that potential problems would be found.
But I wonder, given that we're not using dm-flakey, how likely it is
that we would notice regressions in the first place.

For example, given that we know which patches were needed to fix the
various problems found by your research.  Suppose we revert those
patches, or use a kernel that doesn't have those fixes.  Will the
xfstests script you've generated be able to trigger the failures with
an unfixed kernel?

Cheers,

					- Ted
Dave Chinner Nov. 6, 2018, 11:39 p.m. UTC | #10
On Tue, Nov 06, 2018 at 06:15:36PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Nov 05, 2018 at 02:16:57PM -0600, Jayashree Mohan wrote:
> > 
> > I believe that to _scratch_mkfs, I must first _cleanup dm_flakey. If I replace the above snippet by 
> > _cleanup
> > _scratch_mkfs
> > _init_flakey
> > 
> > The time taken for the test goes up by around 10 seconds (due to mkfs maybe). So I thought it was sufficient to remove the working directory.
> 
> Can you try adding _check_scratch_fs after each test case?  Yes, it

_check_scratch_fs now runs xfs_scrub on XFS as well as xfs_repair,
so it's actually quite expensive.

The whole point of aggregating all these tests into one fstest is to
avoid the overhead of running _check_scratch_fs after every single
test that is /extremely unlikely/ to fail on existing filesystems.

> Also, how big does the file system have to be?  I wonder if we can
> speed things up if a ramdisk is used as the backing device for
> dm-flakey.

I already run fstests on pmem devices for fast iteration of the
entire test suite. The slowest part of the test suite for quick
tests is still the filesystem checking between tests. So putting
dm-flakey on a ramdisk doesn't help anything here.

I also run fstests on memory constrained VMs, where there isn't ram
to add a ramdisk for random stuff like this. Then all the dm-flakey
tests either won't run or will fail due to OOM-kills.

Cheers,

Dave.
Jayashree Mohan Nov. 6, 2018, 11:54 p.m. UTC | #11
> > Can you try adding _check_scratch_fs after each test case?  Yes, it
>
> _check_scratch_fs now runs xfs_scrub on XFS as well as xfs_repair,
> so it's actually quite expensive.
>
> The whole point of aggregating all these tests into one fstest is to
> avoid the overhead of running _check_scratch_fs after every single
> test that is /extremely unlikely/ to fail on existing filesystems.


Filipe and Eryu suggest that we run _check_scratch_fs after each
subtest. Quoting Filipe,


>
> For this type of tests, I think it's a good idea to let fsck run.
>
> Even if all of the links are persisted, the log/journal replay might
> have caused metadata inconsistencies in the fs for example - this was
> true for many cases I fixed over the years in btrfs.
> Even if fsck doesn't report any problem now, it's still good to run
> it, to help prevent future regressions.
>
> Plus this test creates a very small fs, it's not like fsck will take a
> significant time to run.
> So for all these reasons I would unmount and fsck after each test.


For this reason, we currently run _check_scratch_fs after each
subtest in the _check_consistency method in my patch.
+       _unmount_flakey
+       _check_scratch_fs $FLAKEY_DEV
+       [ $? -ne 0 ] && _fatal "fsck failed"

Running on a 200MB partition, addition of this check added only around
3-4 seconds of delay in total for this patch consisting of 37 tests.
Currently this patch takes about 12-15 seconds to run to completion on
my 200MB partition.
Am I missing something - is this the check you are talking about?

Thanks,
Jayashree Mohan
Jayashree Mohan Nov. 7, 2018, 12:24 a.m. UTC | #12
Hi Ted,

> > I believe that to _scratch_mkfs, I must first _cleanup dm_flakey. If I replace the above snippet by
> > _cleanup
> > _scratch_mkfs
> > _init_flakey
> >
> > The time taken for the test goes up by around 10 seconds (due to mkfs maybe). So I thought it was sufficient to remove the working directory.
>
> Can you try adding _check_scratch_fs after each test case?  Yes, it
> will increase the test time, but it will make it easier for a
> developer to figure out what might be going on.

As per Filipe's and Eryu's suggestions, each sub test in the patch
unmounts the device and tests for consistency using
_check_scratch_fs.
+       _unmount_flakey
+       _check_scratch_fs $FLAKEY_DEV
+       [ $? -ne 0 ] && _fatal "fsck failed"
This added an additional 3-4 seconds of delay overall. I hope
this is what you're suggesting.


> Also, how big does the file system have to be?  I wonder if we can
> speed things up if a ramdisk is used as the backing device for
> dm-flakey.

The file system can be as small as 100MB. I would imagine that a
ramdisk would result in a speedup.


> On the flip side, am I remembering correctly that the original
> technique used by your research paper used a custom I/O stack so you
> could find potential problems even in the operations getting lost
> after a power drop, no matter how unlikely, but rather, for anything
> that isn't guaranteed by the cache flush command?

Are you talking about re-ordering of the block IOs? We don't use that
feature for these tests - we only replay the block IOs in order, just
like dm-flakey/ dm-logwrites would do.


> One argument for not using a ramdisk to speed things up is that it
> would make it much less likely that potential problems would be found.
> But I wonder, given that we're not using dm-flakey, how likely it is
> that we would notice regressions in the first place.

To clarify, the patches I would be sending out do not require
CrashMonkey in the loop for testing. We only use dm-flakey and the
in-order replay support it provides.


> For example, given that we know which patches were needed to fix the
> various problems found by your research.  Suppose we revert those
> patches, or use a kernel that doesn't have those fixes.  Will the
> xfstests script you've generated be able to trigger the failures with
> an unfixed kernel?

Yes, if you run these xfstests on an unpatched kernel, you can
reproduce the bugs our paper reports.
Dave Chinner Nov. 7, 2018, 2:09 a.m. UTC | #13
On Tue, Nov 06, 2018 at 05:50:28PM -0600, Jayashree Mohan wrote:
> On Tue, Nov 6, 2018 at 5:40 PM Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Tue, Nov 06, 2018 at 06:15:36PM -0500, Theodore Y. Ts'o wrote:
> > > On Mon, Nov 05, 2018 at 02:16:57PM -0600, Jayashree Mohan wrote:
> > > >
> > > > I believe that to _scratch_mkfs, I must first _cleanup dm_flakey. If I
> > replace the above snippet by
> > > > _cleanup
> > > > _scratch_mkfs
> > > > _init_flakey
> > > >
> > > > The time taken for the test goes up by around 10 seconds (due to mkfs
> > maybe). So I thought it was sufficient to remove the working directory.
> > >
> > > Can you try adding _check_scratch_fs after each test case?  Yes, it
> >
> > _check_scratch_fs now runs xfs_scrub on XFS as well as xfs_repair,
> > so it's actually quite expensive.
> >
> > The whole point of aggregating all these tests into one fstest is to
> > avoid the overhead of running _check_scratch_fs after every single
> > test that is /extremely unlikely/ to fail on existing filesystems.
> >
> 
> Filipe and Eryu suggest that we run _check_scratch_fs after each subtest.
> Quoting Filipe,

These tests are highly unlikely to fail on existing filesystems.
If they do fail, then the developer can narrow it down by cutting
the test down to the single case that fails and adding
_check_scratch_fs where necessary.

Run time matters. Expensive, fine-grained testing is only
useful if there's a high probability of the test finding ongoing
bugs. These tests have found bugs when they were first written, but
the ongoing probability of finding more bugs is extremely low.
Adding a huge amount of testing overhead for what appear to be very
marginal returns is not a good tradeoff.

> > Plus this test creates a very small fs, it's not like fsck will take a
> > significant time to run.
> > So for all these reasons I would unmount and fsck after each test.
> 
> For this reason, we currently run _check_scratch_fs after each
> subtest in the _check_consistency method in my patch.
> +       _unmount_flakey
> +       _check_scratch_fs $FLAKEY_DEV
> +       [ $? -ne 0 ] && _fatal "fsck failed"
> 
> Running on a 200MB partition, addition of this check added only around 3-4
> seconds of delay in total for this patch consisting of 37 tests. Currently
> this patch takes about 12-15 seconds to run to completion on my 200MB
> partition.

What filesystem, and what about 20GB scratch partitions (which are
common)?  i.e. Checking cost is different on different filesystems,
different capacity devices and even different userspace versions of
the same filesystem utilities. It is most definitely not free, and
in some cases can be prohibitively expensive.

----

I suspect we've lost sight of the fact that fstests was /primarily/
a filesystem developer test suite, not a distro regression test
suite. If the test suite becomes too cumbersome and slow for
developers to use effectively, then it will get used less during
development and that's a *really, really bad outcome*.

e.g. I don't use the auto group in my development workflow any more
- it's too slow to get decent coverage of changes these days. I only
use the auto group for integration testing now. A few years ago it
would take about an hour to run the auto group on a couple of
spindles. These days it takes closer to 5 hours on the same setup.
Even on my really fast pmem setup it take a couple of hours to run
the entire auto group. As I have 15 different configurations I
need to run through integration testing and limited resources,
runtime certainly matters.

It even takes half an hour to run the quick group on my fast
machine, which really isn't very quick anymore because of the sheer
number of tests in the quick group.  Half an hour is too slow for
effective change feedback - feedback within 5 minutes is
necessary, otherwise the developer will context switch to something
else while waiting and lose all focus on what they were doing. This
leads to highly inefficient developers.

The quick group is now full of short, fine-grained, targeted
regression tests. These are useful for distro release-to-release
testing, but the chances that they find new bugs are extremely low.
They really aren't that useful for developers who "fix the bug and
move on" and will never see that test fail ever again. If one of
these tests has a huge overhead or the sheer number of those tests
creates substantial overhead, then that is not a particularly good
use of the developer's time and resources.

IOWs, there's a cost to running each test that the really important
use case for fstests (i.e. the developers' test/dev cycle) cannot
really afford to pay. i.e. developers need wide test coverage /at
speed/, not deep, fine-grained, time consuming test coverage.

To put it in other words, developers need tests focussed on finding
bugs quickly, not regression tests that provide the core
requirements of integration and release testing. The development
testing phase is all about finding bugs fast and efficiently.

There's been little done in recent times to make fstests faster and
more efficient at finding new bugs for developers - the vast
majority of new tests have been for specific bugs that have already
been fixed.  Even these crashmonkey tests are not "find new bugs"
tests. The bugs they uncovered have been found and fixed, and so
they fall into this same finegrained integration regression test
category.

The only tests that I've seen discover new bugs recently are those
that run fsx, fsstress or some other semi-randomised workloads that
are combined with some other operation. These tests find the bugs
that fine-grained, targeted regression tests will never uncover,
and so in many cases running most of these integration/regression
tests doesn't provide any value to the developer.

The faster the tests find the bugs, the faster the developer fixes
them. The more dev/test cycles a developer can do in a day, the
more bugs they find and fix. So if we want fstests to remain useful
to developers, then we have to pay more attention to runtime and how
efficient the tests are at finding new bugs. More is not better.

Perhaps we need to recategorise the tests into new groups.

Perhaps we need to start working on reducing the huge amount
of technical debt we now have in fstests.

Perhaps we need to scale the fstests infrastructure to support
thousands of tests efficiently.

Perhaps we need to focus on tests that uncover new bugs (looking
forwards) rather than regression tests (looking backwards)

But, really, if we don't stop and think about what we are doing here
fstests will slowly drop out of the day-to-day developer workflow
because it is becoming less useful for finding new filesystem bugs
quickly...

Cheers,

Dave.
Theodore Ts'o Nov. 7, 2018, 4:04 a.m. UTC | #14
On Wed, Nov 07, 2018 at 01:09:22PM +1100, Dave Chinner wrote:
> > Running on a 200MB partition, addition of this check added only around 3-4
> > seconds of delay in total for this patch consisting of 37 tests. Currently
> > this patch takes about 12-15 seconds to run to completion on my 200MB
> > partition.
> 
> What filesystem, and what about 20GB scratch partitions (which are
> common)?  i.e. Checking cost is different on different filesystems,
> different capacity devices and even different userspace versions of
> the same filesystem utilities. It is most definitely not free, and
> in some cases can be prohibitively expensive.

For the CrashMonkey tests, one solution might be to force the use of a
small file system on the scratch disk.  (e.g., using _scratch_mkfs_sized).
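
Something like this in place of the plain _scratch_mkfs call ought to
do it (untested; _scratch_mkfs_sized takes the size in bytes):

	# keep the scratch fs small so per-test fsck stays cheap
	_scratch_mkfs_sized $((100 * 1024 * 1024)) >> $seqres.full 2>&1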

> I suspect we've lost sight of the fact that fstests was /primarily/
> a filesystem developer test suite, not a distro regression test
> suite. If the test suite becomes too cumbersome and slow for
> developers to use effectively, then it will get used less during
> development and that's a *really, really bad outcome*.

I agree with your concern.

> It even takes half an hour to run the quick group on my fast
> machine, which really isn't very quick anymore because of the sheer
> number of tests in the quick group.  Half an hour is too slow for
> effective change feedback - feedback within 5 minutes is
> necessary, otherwise the developer will context switch to something
> else while waiting and lose all focus on what they were doing. This
> leads to highly inefficient developers.

At Google we were willing to live with a 10 minute "fssmoke" subset,
but admittedly, that's grown to 15-20 minutes in recent years.  So
trying to create a "smoke" group that is only 5 minutes SGTM.

> The only tests that I've seen discover new bugs recently are those
> that run fsx, fsstress or some other semi-randomised workloads that
> are combined with some other operation. These tests find the bugs
> that fine-grained, targeted regression tests will never uncover,
> and so in many cases running most of these integration/regression
> tests doesn't provide any value to the developer.

Yeah, what I used to do is assume that if the test run survived past
generic/013 (which uses fsstress), it would pass the rest of the
tests, and I would move on to reviewing the next commit.
Unfortunately we've added so many ext4 specific tests (which run in
front of generic) that this trick no longer works.  I haven't gotten
annoyed enough to hack in some way to reorder the tests that get run
so the highest value tests run first, and then sending a "90+% chance
the commit is good, running the rest of the tests" message, but it has
occurred to me....

> Perhaps we need to recategorise the tests into new groups.

Agreed.  Either we need to change what tests we leave in "quick", or
we need to create a new group "smoke" where quick is an attribute of
the group as a whole, not an attribute of each test in the "quick"
group.

> Perhaps we need to scale the fstests infrastructure to support
> thousands of tests efficiently.

On my "when I find the round tuit, or when I can get a GSOC or intern
to work on it, whichever comes first" list is to enhance gce-xfstests
so it can shard the tests for a particular fs configuration so they
use a group of VMs, instead of just using a separate VM for each
config scenario (e.g., dax, blocksize < page size, bigalloc, ext3
compat, etc.)

It might mean using ~100 VM's instead of the current 10 that I use,
but if it means the tests complete in a tenth of the time, the total
cost for doing a full integration test won't change by that much.  The
bigger problem is that people might have to ask permission to increase
the GCE quotas from the defaults used on new accounts.

For those people who are trying to run xfstests on bare metal, I'm not
sure there's that much that can be done to improve things; did you
have some ideas?  Or were you assuming that step one would require
buying many more physical test machines in your test cluster?

(Maybe IBM will be willing to give you a much bigger test machine
budget?  :-)

							- Ted
Jayashree Mohan Nov. 8, 2018, 1:41 a.m. UTC | #15
Hi all,

We understand the concern about testing times. To choose a middle
ground, Ted's suggestion of using _scratch_mkfs_sized works best for
CrashMonkey specific tests. These tests involve very few files and it
suffices to have a 100MB file system. I tested the patch on ext4, xfs,
btrfs and f2fs on a partition of this size. The overhead due to
_check_scratch_fs  after each sub test is in the range of 3-5 seconds
for all  these file systems. If this is tolerable, we can force a
smaller file system size for all CrashMonkey tests. Does this sound
reasonable to you?

Thanks,
Jayashree Mohan
On Tue, Nov 6, 2018 at 10:04 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Nov 07, 2018 at 01:09:22PM +1100, Dave Chinner wrote:
> > > Running on a 200MB partition, addition of this check added only around 3-4
> > > seconds of delay in total for this patch consisting of 37 tests. Currently
> > > this patch takes about 12-15 seconds to run to completion on my 200MB
> > > partition.
> >
> > What filesystem, and what about 20GB scratch partitions (which are
> > common)?  i.e. Checking cost is different on different filesystems,
> > different capacity devices and even different userspace versions of
> > the same filesystem utilities. It is most definitely not free, and
> > in some cases can be prohibitively expensive.
>
> For the CrashMonkey tests, one solution might be to force the use of a
> small file system on the scratch disk.  (e.g., using _scratch_mkfs_sized).
>
> > I suspect we've lost sight of the fact that fstests was /primarily/
> > a filesystem developer test suite, not a distro regression test
> > suite. If the test suite becomes too cumbersome and slow for
> > developers to use effectively, then it will get used less during
> > development and that's a *really, really bad outcome*.
>
> I agree with your concern.
>
> > It even takes half an hour to run the quick group on my fast
> > machine, which really isn't very quick anymore because of the sheer
> > number of tests in the quick group.  Half an hour is too slow for
> > effective change feedback - feedback within 5 minutes is
> > necessary, otherwise the developer will context switch to something
> > else while waiting and lose all focus on what they were doing. This
> > leads to highly inefficient developers.
>
> At Google we were willing to live with a 10 minute "fssmoke" subset,
> but admittedly, that's grown to 15-20 minutes in recent years.  So
> trying to create a "smoke" group that is only 5 minutes SGTM.
>
> > The only tests that I've seen discover new bugs recently are those
> > that run fsx, fsstress or some other semi-randomised workloads that
> > are combined with some other operation. These tests find the bugs
> > that fine-grained, targeted regression tests will never uncover,
> > and so in many cases running most of these integration/regression
> > tests doesn't provide any value to the developer.
>
> Yeah, what I used to do is assume that if the test run survived past
> generic/013 (which uses fsstress), it would pass the rest of the
> tests, and I would move on to reviewing the next commit.
> Unfortunately we've added so many ext4 specific tests (which run in
> front of generic) that this trick no longer works.  I haven't gotten
> annoyed enough to hack in some way to reorder the tests that get run
> so the highest value tests run first, and then sending a "90+% chance
> the commit is good, running the rest of the tests" message, but it has
> occurred to me....
>
> > Perhaps we need to recategorise the tests into new groups.
>
> Agreed.  Either we need to change what tests we leave in "quick", or
> we need to create a new group "smoke" where quick is an attribute of
> the group as a whole, not an attribute of each test in the "quick"
> group.
>
> > Perhaps we need to scale the fstests infrastructure to support
> > thousands of tests efficiently.
>
> On my "when I find the round tuit, or when I can get a GSOC or intern
> to work on it, whichever comes first" list is to enhance gce-xfstests
> so it can shard the tests for a particular fs configuration so they
> > use a group of VMs, instead of just using a separate VM for each
> config scenario (e.g., dax, blocksize < page size, bigalloc, ext3
> compat, etc.)
>
> It might mean using ~100 VM's instead of the current 10 that I use,
> but if it means the tests complete in a tenth of the time, the total
> cost for doing a full integration test won't change by that much.  The
> bigger problem is that people might have to ask permission to increase
> the GCE quotas from the defaults used on new accounts.
>
> For those people who are trying to run xfstests on bare metal, I'm not
> sure there's that much that can be done to improve things; did you
> have some ideas?  Or were you assuming that step one would require
> buying many more physical test machines in your test cluster?
>
> (Maybe IBM will be willing to give you a much bigger test machine
> budget?  :-)
>
>                                                         - Ted
Dave Chinner Nov. 8, 2018, 9:10 a.m. UTC | #16
On Wed, Nov 07, 2018 at 07:41:50PM -0600, Jayashree Mohan wrote:
> Hi all,
> 
> We understand the concern about testing times. To choose a middle
> ground, Ted's suggestion of using _scratch_mkfs_sized works best for
> CrashMonkey specific tests. These tests involve very few files and it
> suffices to have a 100MB file system. I tested the patch on ext4, xfs,
> btrfs and f2fs on a partition of this size. The overhead due to
> _check_scratch_fs  after each sub test is in the range of 3-5 seconds

3-5 seconds per invocation of _check_scratch_fs?

> for all  these file systems. If this is tolerable, we can force a

Not if your numbers mean 300 x 3-5 seconds. That's 15-25
minutes of extra runtime.

Cheers,

Dave.
Dave Chinner Nov. 8, 2018, 9:40 a.m. UTC | #17
On Wed, Nov 07, 2018 at 01:09:22PM +1100, Dave Chinner wrote:
> To put it in other words, developers need tests focussed on finding
> bugs quickly, not regression tests that provide the core
> requirements of integration and release testing. The development
> testing phase is all about finding fast and effciently.

To emphasise my point about having tests and tools capable of finding
new bugs, I noticed yesterday that fsstress and fsx didn't support
copy_file_range, and fsx doesn't support clone/dedupe_file_range
either. Darrick added them overnight.

fsx as run by generic/263 takes *32* operations to find a data
corruption with copy_file_range on XFS. Even changing it to do
buffered IO instead of direct IO, it only takes ~600 operations to
fail with a different data corruption.

That's *at least* two previously unknown bugs exposed in under a
second of runtime.

That's the sort of tooling we need - we don't need hundreds of tests
that are scripted reproducers of fixed problems, we need tools that
exercise boundary conditions and corner cases in ways that are
likely to expose incorrect behaviour. Tools that do these things
quickly and in a reproducible manner are worth their weight in
gold...

IMO, Quality Engineering is not just about writing regression tests
to keep out known bugs - its most important function is developing
and refining new testing tools to find bugs that have escaped
detection with existing testing methods and tools. If test engineers
can find new bugs, software engineers can fix them. That's really
the ultimate goal here - to find bugs and fix them before users are
exposed to them...

Cheers,

Dave.
Jayashree Mohan Nov. 8, 2018, 2:46 p.m. UTC | #18
> > We understand the concern about testing times. To choose a middle
> > ground, Ted's suggestion of using _scratch_mkfs_sized works best for
> > CrashMonkey specific tests. These tests involve very few files and it
> > suffices to have a 100MB file system. I tested the patch on ext4, xfs,
> > btrfs and f2fs on a partition of this size. The overhead due to
> > _check_scratch_fs  after each sub test is in the range of 3-5 seconds
>
> 3-5 second per invocation of _check_scratch_fs?

No. This is the overall overhead for calling  _check_scratch_fs after
each of the 37 sub tests in my first patch. Without this check, it
took 10 seconds to run. With this addition, it takes about 12-15
seconds to run all the 37 sub tests.


> > for all  these file systems. If this is tolerable, we can force a
>
> Not if your numbers that means 300 x 3-5 seconds. That's 15-25
> minutes of extra runtime.

Going by the above calculation, the extra run time for the invocation
of check after each sub-test should be around 30-40 seconds in total
(for the set of 300 tests).
Vijay Chidambaram Nov. 8, 2018, 3:35 p.m. UTC | #19
On Thu, Nov 8, 2018 at 3:40 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Nov 07, 2018 at 01:09:22PM +1100, Dave Chinner wrote:
> > To put it in other words, developers need tests focussed on finding
> > bugs quickly, not regression tests that provide the core
> > requirements of integration and release testing. The development
> > testing phase is all about finding bugs fast and efficiently.
>
> To emphasise my point about having tests and tools capable of finding
> new bugs, I noticed yesterday that fsstress and fsx didn't support
> copy_file_range, and fsx doesn't support clone/dedupe_file_range
> either. Darrick added them overnight.
>
> fsx as run by generic/263 takes *32* operations to find a data
> corruption with copy_file_range on XFS. Even changing it to do
> buffered IO instead of direct IO, it only takes ~600 operations to
> fail with a different data corruption.
>
> That's *at least* two previously unknown bugs exposed in under a
> second of runtime.
>
> That's the sort of tooling we need - we don't need hundreds of tests
> that are scripted reproducers of fixed problems, we need tools that
> exercise boundary conditions and corner cases in ways that are
> likely to expose incorrect behaviour. Tools that do these things
> quickly and in a reproducable manner are worth their weight in
> gold...
>
> IMO, Quality Engineering is not just about writing regression tests
> to keep out known bugs - its most important function is developing
> and refining new testing tools to find bugs that have escaped
> detection with existing testing methods and tools. If test engineers
> can find new bugs, software engineers can fix them. That's really
> the ultimate goal here - to find bugs and fix them before users are
> exposed to them...

Dave, I think there is some confusion about what CrashMonkey does. I
think you'll find it's very close to what you want. Let me explain.

CrashMonkey does exactly the kind of systematic testing that you want.
Given a set of system calls, it generates tests for crash consistency
for different workloads comprising of these system calls. It does this
by testing each system call first, then each pair of system calls, and
so on. Both the workload (which system calls to test) and the check
(what should the file system look like after crash recovery) are
automatically generated, without any human effort in the loop.
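
In pseudo-shell, the workload generation loop is roughly the
following, where run_and_check is a stand-in for CrashMonkey's
record/replay/oracle machinery and the op list is illustrative:

	ops="creat link unlink rename fsync"
	for op1 in $ops; do
		run_and_check "$op1"			# seq-1 workloads
		for op2 in $ops; do
			run_and_check "$op1 $op2"	# seq-2 workloads
		done
	done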

CrashMonkey found 10 new bugs in btrfs and F2FS, so it's not just a
suite of regression tests.

When we studied previous crash-consistency bugs reported and fixed in
the kernel, we noticed most of them could be reproduced on a clean fs
image of small size (100 MB). We found that the arguments to the
system calls could also be constrained: we just needed to reuse a
small set of file names or file ranges. We used this to automatically
generate xfstests-style tests for each file system. We generated and
tested a total of 3.3M workloads on a research cluster at UT Austin.

We found that even testing a single system call revealed three new
bugs (which have not all been patched yet). To systematically test
single system calls, you need about 300 tests. This is what Jayashree
is trying to add to fstests. Since there are very few
crash-consistency tests currently in fstests, it might be good to add
more coverage with these tests.

The CrashMonkey github repo is available here:
https://github.com/utsaslab/crashmonkey
The link to the paper: https://www.cs.utexas.edu/~jaya/pdf/osdi18-B3.pdf
Talk slides: https://www.cs.utexas.edu/~jaya/slides/osdi18-B3-slides.pdf
Theodore Ts'o Nov. 8, 2018, 4:10 p.m. UTC | #20
On Thu, Nov 08, 2018 at 08:40:45PM +1100, Dave Chinner wrote:
> IMO, Quality Engineering is not just about writing regression tests
> to keep out known bugs - its most important function is developing
> and refining new testing tools to find bugs that have escaped
> detection with existing testing methods and tools. If test engineers
> can find new bugs, software engineers can fix them. That's really
> the ultimate goal here - to find bugs and fix them before users are
> exposed to them...

How about creating a new group, "regression"?  Regression tests are
not useless, but we might want to reserve them for integration
testing, periodic tests, and distro testing of "golden masters" where
test run is not of essence.

For smoke testing, what we want are general functional tests, and
"moderate" stress tests.  Regression tests are a distant third
priority.  So assigning regression tests to distinct group so we can
filter them out might be another way of controlling the runtime of a
smoke test config.
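
(A smoke run could then be as simple as "./check -g quick -x
regression".)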

					- Ted
Dave Chinner Nov. 9, 2018, 3:12 a.m. UTC | #21
On Thu, Nov 08, 2018 at 09:35:56AM -0600, Vijaychidambaram Velayudhan Pillai wrote:
> On Thu, Nov 8, 2018 at 3:40 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Nov 07, 2018 at 01:09:22PM +1100, Dave Chinner wrote:
> > > To put it in other words, developers need tests focussed on finding
> > > bugs quickly, not regression tests that provide the core
> > > requirements of integration and release testing. The development
> > > testing phase is all about finding bugs fast and efficiently.
> >
> > To emphasise my point about having tests and tools capable of finding
> > new bugs, I noticed yesterday that fsstress and fsx didn't support
> > copy_file_range, and fsx doesn't support clone/dedupe_file_range
> > either. Darrick added them overnight.
> >
> > fsx as run by generic/263 takes *32* operations to find a data
> > corruption with copy_file_range on XFS. Even changing it to do
> > buffered IO instead of direct IO, it only takes ~600 operations to
> > fail with a different data corruption.
> >
> > That's *at least* two previously unknown bugs exposed in under a
> > second of runtime.
> >
> > That's the sort of tooling we need - we don't need hundreds of tests
> > that are scripted reproducers of fixed problems, we need tools that
> > exercise boundary conditions and corner cases in ways that are
> > likely to expose incorrect behaviour. Tools that do these things
> > quickly and in a reproducible manner are worth their weight in
> > gold...
> >
> > IMO, Quality Engineering is not just about writing regression tests
> > to keep out known bugs - its most important function is developing
> > and refining new testing tools to find bugs that have escaped
> > detection with existing testing methods and tools. If test engineers
> > can find new bugs, software engineers can fix them. That's really
> > the ultimate goal here - to find bugs and fix them before users are
> > exposed to them...
> 
> Dave, I think there is some confusion about what CrashMonkey does. I
> think you'll find its very close to what you want. Let me explain.

I'm pretty sure I know what crashmonkey does, and I know what is
being proposed here /isn't crashmonkey/.

> CrashMonkey does exactly the kind of systematic testing that you want.
> Given a set of system calls, it generates tests for crash consistency
> for different workloads comprising of these system calls. It does this
> by testing each system call first, then each pair of system calls, and
> so on. Both the workload (which system calls to test) and the check
> (what should the file system look like after crash recovery) are
> automatically generated, without any human effort in the loop.
> 
> CrashMonkey found 10 new bugs in btrfs and F2FS, so its not just a
> suite of regression tests.

Sure, but that was during an initial exploratory phase of the
project on two immature filesystems. Now that those initial bugs
have been fixed, I'm not seeing new bugs being reported.

So what is being proposed for fstests here? It's not an exploratory
tool like crashmonkey, or an arbitrary boundary-case exerciser like
fsstress and fsx. What is being proposed is a set of fixed scripts
that walk a defined set of single operations with a known, fixed set
of initial conditions and check that each individual op behaves
as it should in that environment.

There's no variation here. The same tests with the same initial
conditions are run /every time/. When run on the same code, they will
exercise exactly the same code path. There is no variation at all,
and so if there are no bugs in the code path they exercise, they
will not find any new bugs.

That's my point: unlike crashmonkey in workload generation mode or
fsx/fsstress, the code they exercise does not vary.  Yes, when first
run on a new code base they exposed bugs, but now that those bugs
are fixed, they don't find any new bugs.  They can only detect bugs
in changes to the fixed code paths they exercise.  Ergo, they are
regression tests.

Don't get me wrong - we need both types of tests - but I care less
about a huge swath of railroad regression tests than I do about
tools that find new bugs over and over again without needing
modification.

> When we studied previous crash-consistency bugs reported and fixed in
> the kernel, we noticed most of them could be reproduced on a clean fs
> image of small size (100 MB). We found that the arguments to the
> system calls could also be constrained: we just needed to reuse a
> small set of file names or file ranges. We used this to automatically
> generate xfstests-style tests for each file system. We generated and
> tested a total of 3.3M workloads on a research cluster at UT Austin.

Which is great, especially as you found bugs in that exploration.
But exhaustive searches like this really are not practical for
day-to-day development. Developers don't have their own personal
clusters for testing their filesystem code. They might only have a
laptop.

This sort of massive exploratory regression testing is really the
domain of product release managers and their QE department (think of
the scale of testing that goes into a RHEL or SLES release).  It's
-their job- to find gnarly, weird regressions that are beyond the
capability of individual developers to uncover. This isn't the sort
of testing that is relevant to the day-to-day filesystem developer.

This comes back to my point about fstests being a tool for
developers as much as it is for distro QE departments. The balance
is falling too far towards the "massive regression test suite" side
and away from the "find new bugs really fast" focus we have
historically had. Adding hundreds more tests that fall on the
"massive regression test suite" side of the ledger just makes this
imbalance worse.

That's not something that crashmonkey can solve, but it's something
we, as fstests users and developers, have to be very aware of when
considering an addition of the size being proposed.

> We found that even testing a single system call revealed three new
> bugs (which have not all been patched yet). To systematically test
> single system calls, you need about 300 tests.

That's 300 tests per system call?  I think that's underestimating
the complexity of many syscalls (like open(), read(), etc) quite
substantially. Indeed, open(O_TMPFILE) is going to make linkat()
behave very differently, and there's a whole set of crash
consistency problems when O_TMPFILE is used with linkat() that the
proposed link behaviour tests do not cover.
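
For example, the pattern those tests never touch looks something
like the sketch below - illustrative only, using xfs_io's -T flag
(open with O_TMPFILE) and its flink command to link the anonymous
file into the namespace; the path and sizes are made up:

    # write an anonymous O_TMPFILE file, give it a name, fsync, crash
    $XFS_IO_PROG -T \
            -c "pwrite 0 64k" \
            -c "flink $SCRATCH_MNT/tmpfile_link" \
            -c "fsync" \
            $SCRATCH_MNT | _filter_xfs_io
    _flakey_drop_and_remount
    # after recovery, tmpfile_link should exist with 64k of intact data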

Maybe a better way to integrate this is to add a completely new
tests/ subdirectory and push all the crash consistency tests into
that directory. They don't get run by quick/auto, but instead by a
specific group that runs that directory. The tests don't get
intermingled with all the other generic tests, and you can set them
up to run fsck as often as you want because they don't get in the
way of existing testing.  Over time we can move more of the generic crash
consistency regression tests elsewhere in fstests (e.g. all those
fsync-on-btrfs-doesn't tests) over to that same subdir.
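
That way developers opt in explicitly with something like

    ./check -g crash

while the existing auto and quick runs stay as fast as they are
today ("crash" here is just a placeholder for whatever the group
ends up being called).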

I suspect we need to do more of this sort of "by type" breakup of
the "generic" directory, too, because it has become hard to manage
the 500-odd tests that are now in it....

Cheers,

Dave.
Vijay Chidambaram Nov. 9, 2018, 4:39 p.m. UTC | #22
On Thu, Nov 8, 2018 at 9:15 PM Dave Chinner <david@fromorbit.com> wrote:

> Which is great, especially as you found bugs in that exploration.
> But exhaustive searches like this really are not practical for
> day-to-day development. Developers don't have their own personal
> clusters for testing their filesystem code. They might only have a
> laptop.
>
> This sort of massive exploratory regression testing is really the
> domain of product release managers and their QE department (think of
> the scale of testing that goes into a RHEL or SLES release).  It's
> -their job- to find gnarly, weird regressions that are beyond the
> capability of individual developers to uncover. This isn't the sort
> of testing that is relevant to the day-to-day filesystem developer.
>
> This comes back to my point about fstests being a tool for
> developers as much as it is for distro QE departments. The balance
> is falling too far towards the "massive regression test suite" side
> and away from the "find new bugs really fast" focus we have
> historically had. Adding hundreds more tests that fall on the
> "massive regression test suite" side of the ledger just makes this
> imbalance worse.
>
> That's not something that crashmonkey can solve, but it's something
> we, as fstests users and developers, have to be very aware of when
> considering an addition of the size being proposed.
>
> > We found that even testing a single system call revealed three new
> > bugs (which have not all been patched yet). To systematically test
> > single system calls, you need about 300 tests.
>
> That's 300 tests per system call?  I think that's underestimating
> the complexity of many syscalls (like open(), read(), etc) quite
> substantially. Indeed, open(O_TMPFILE) is going to make linkat()
> behave very differently, and there's a whole set of crash
> consistency problems when O_TMPFILE is used with linkat() that the
> proposed link behaviour tests do not cover.

We have workloads that explore the interaction between different
system calls, which would capture effects like this. But that's a
bigger set, and we are not attempting to add them to fstests at this
point.

> Maybe a better way to integrate this is to add a completely new
> tests/ subdirectory and push all the crash consistency tests into
> that directory. They don't get run by quick/auto, but instead by a
> specific group that runs that directory. The tests don't get
> intermingled with all the other generic tests, and you can set them
> up to run fsck as often as you want because they don't get in the
> way of existing testing.  Over time we can move more of the generic crash
> consistency regression tests elsewhere in fstests (e.g. all those
> fsync-on-btrfs-doesn't tests) over to that same subdir.

I agree with not wanting developers to run 300 tests every time they
run fstests. Perhaps a different "crash" group is what we want to do
here? Then only developers who want to ensure crash consistency still
holds can run those, knowing it will take a bit of time to run the
set. We'll submit a patch with a new "crash" group and see what people
think -- let me know if you disagree.

I should note that although ext4 and xfs are super robust from a long
history of development and testing, many other file systems in the
Linux kernel are not, and I think the other file systems would
definitely benefit from having the CrashMonkey regression tests.

More generally, I'm not sure how to help/encourage file-system
developers to run CrashMonkey. Note that CrashMonkey can be run for
different amounts of time based on the computational budget of the
developers. It is similar to fsstress in that regard. I think having
fsstress in-kernel helps developers use it. Is adding the CrashMonkey
tool (user-space code) into the Linux kernel the right move here?
Theodore Ts'o Nov. 9, 2018, 7:17 p.m. UTC | #23
How about creating a group "regress" for regression tests, and
moving them into tests/regress for better categorization?

And for tests that attempt to find new problems, we already have a
group "fuzzers"; and perhaps there should be some standard guideline
for how long a single invocation of a fuzz test should run before
quitting.  I'd say something in the 5 to 15 minute range?  People can
always run the fuzz test multiple times, and if the fuzz test can
save state between runs, it shouldn't matter that it's broken up
into multiple test invocations.

					- Ted
Vijay Chidambaram Nov. 9, 2018, 8:47 p.m. UTC | #24
On Fri, Nov 9, 2018 at 1:35 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> How about creating a group "regress" for regression tests, and
> moving them into tests/regress for better categorization?

"regress" group seems good, we'll send out a patch with that.

> And for tests that attempt to find new problems, we already have a
> group "fuzzers"; and perhaps there should be some standard guideline
> for how long a single invocation of a fuzz test should run before
> quitting.  I'd say something in the 5 to 15 minute range?  People can
> always run the fuzz test multiple times, and if the fuzz test can
> save state between runs, it shouldn't matter that it's broken up
> into multiple test invocations.

CrashMonkey doesn't do random tests for every invocation, so its
behavior is a bit different from a fuzzer. But we can definitely
modify CrashMonkey to randomly pick workloads in the target state
space and run for a pre-defined amount of time.
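
In outline, such a mode could be a thin wrapper like the sketch
below -- the harness name, flags and file names here are
placeholders, not CrashMonkey's actual CLI:

    #! /bin/bash
    # hypothetical time-budgeted mode: run randomly chosen workloads
    # until a wall-clock budget (default 10 minutes) expires
    budget_secs=${1:-600}
    deadline=$(( $(date +%s) + budget_secs ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # pick one workload at random from the generated state space
        workload=$(shuf -n 1 workload_list.txt)
        ./crashmonkey_harness "$workload" || echo "FAIL: $workload"
    done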

Patch

diff --git a/tests/generic/517-link b/tests/generic/517-link
new file mode 100755
index 0000000..791b6c0
--- /dev/null
+++ b/tests/generic/517-link
@@ -0,0 +1,162 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 The University of Texas at Austin.  All Rights Reserved.
+#
+# FS QA Test 517-link
+#
+# Test that if we create a hard link to a file and persist either of
+# the files, all the names persist across a crash.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1    # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    _cleanup_flakey
+    cd /
+    rm -f $tmp.*
+}
+
+init_start_time=$(date +%s.%N)
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+# initialize scratch device
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+init_end_time=$(date +%s.%N)
+init_time_elapsed=$(echo "$init_end_time-$init_start_time" | bc)
+echo "Initialization time = $init_time_elapsed" >> $seqres.full
+
+stat_opt='blocks: %b size: %s inode: %i links: %h'
+test_num=0
+total_time=0
+
+cleanup()
+{
+    rm -rf $SCRATCH_MNT/*
+    sync
+}
+
+# create a hard link $2 to file $1, and fsync $3, followed by a simulated power-cut
+test_link()
+{
+    start_time=$(date +%s.%N)
+    test_num=$((test_num + 1))
+    sibling=0
+    before=""
+    after=""
+    echo -ne "\n=== Test $test_num :  link $1 $2 " | tee -a $seqres.full
+
+    # now execute the workload
+    mkdir -p "${1%/*}" && mkdir -p "${2%/*}" && touch "$1"
+    ln $1 $2
+
+    if [ -z "$3" ]
+    then
+        echo -ne " with sync ===\n"
+        sync
+        before=`stat -c "$stat_opt" $1`
+    else
+        echo " with fsync $3 ==="
+
+        # If the file being persisted is a sibling, create it first
+        if [ ! -f $3 ]
+        then
+            sibling=1
+            touch $3
+        fi
+
+        $XFS_IO_PROG -c "fsync" $3 | _filter_xfs_io
+
+        if [ $sibling -ne 1 ]
+        then
+            before=`stat -c "$stat_opt" $1`
+        fi
+    fi
+
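+    # simulate a power failure: drop all un-persisted writes, then remount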
+    _flakey_drop_and_remount | tee -a $seqres.full
+
+    if [ -f $1 ]
+    then
+        after=`stat -c "$stat_opt" $1`
+    fi
+
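+    # for non-sibling fsync targets, any metadata change after recovery is a bug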
+    if [ "$before" != "$after" ] && [ $sibling -ne 1 ]
+    then
+        echo "Before: $before" | tee -a $seqres.full
+        echo "After: $after" | tee -a $seqres.full
+    fi
+
+    cleanup
+    end_time=$(date +%s.%N)
+    time_elapsed=$(echo "$end_time-$start_time" | bc)
+    echo " Elapsed time : $time_elapsed" >> $seqres.full
+    total_time=$(echo "$total_time+$time_elapsed" | bc)
+    echo " Total time : $total_time" >> $seqres.full
+}
+
+# run the link test for different combinations
+
+test_start_time=$(date +%s.%N)
+# Group 1: Both files within root directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar; do
+    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/foo $SCRATCH_MNT/bar
+
+# Group 2: Create hard link in a sub directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+    test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/foo $SCRATCH_MNT/A/bar
+
+# Group 3: Create hard link in parent directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/bar
+
+# Group 4: Both files within a directory other than root
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+    test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/A/foo $SCRATCH_MNT/A/bar
+
+# Group 5: Exercise name reuse: link file in sub-directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+    test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar $file_name
+done
+test_link $SCRATCH_MNT/bar $SCRATCH_MNT/A/bar
+
+# Group 6: Exercise name reuse: link file in parent directory
+for file_name in $SCRATCH_MNT $SCRATCH_MNT/foo $SCRATCH_MNT/bar $SCRATCH_MNT/A $SCRATCH_MNT/A/bar $SCRATCH_MNT/A/foo; do
+    test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar $file_name
+done
+test_link $SCRATCH_MNT/A/bar $SCRATCH_MNT/bar
+
+test_end_time=$(date +%s.%N)
+test_time_elapsed=$(echo "$test_end_time-$test_start_time" | bc)
+echo "Test Elapsed time : $test_time_elapsed" >> $seqres.full
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/517-link.out b/tests/generic/517-link.out
new file mode 100644
index 0000000..381b644
--- /dev/null
+++ b/tests/generic/517-link.out
@@ -0,0 +1,75 @@ 
+QA output created by 517-link
+
+=== Test 1 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync /mnt/scratch ===
+
+=== Test 2 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync /mnt/scratch/foo ===
+
+=== Test 3 :  link /mnt/scratch/foo /mnt/scratch/bar  with fsync /mnt/scratch/bar ===
+
+=== Test 4 :  link /mnt/scratch/foo /mnt/scratch/bar  with sync ===
+
+=== Test 5 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch ===
+
+=== Test 6 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/foo ===
+
+=== Test 7 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/bar ===
+
+=== Test 8 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A ===
+
+=== Test 9 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A/bar ===
+
+=== Test 10 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A/foo ===
+
+=== Test 11 :  link /mnt/scratch/foo /mnt/scratch/A/bar  with sync ===
+
+=== Test 12 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch ===
+
+=== Test 13 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch/foo ===
+
+=== Test 14 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch/bar ===
+
+=== Test 15 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch/A ===
+
+=== Test 16 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch/A/bar ===
+
+=== Test 17 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with fsync /mnt/scratch/A/foo ===
+
+=== Test 18 :  link /mnt/scratch/A/foo /mnt/scratch/bar  with sync ===
+
+=== Test 19 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync /mnt/scratch ===
+
+=== Test 20 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A ===
+
+=== Test 21 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A/bar ===
+
+=== Test 22 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with fsync /mnt/scratch/A/foo ===
+
+=== Test 23 :  link /mnt/scratch/A/foo /mnt/scratch/A/bar  with sync ===
+
+=== Test 24 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch ===
+
+=== Test 25 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch/foo ===
+
+=== Test 26 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch/bar ===
+
+=== Test 27 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch/A ===
+
+=== Test 28 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch/A/bar ===
+
+=== Test 29 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with fsync /mnt/scratch/A/foo ===
+
+=== Test 30 :  link /mnt/scratch/bar /mnt/scratch/A/bar  with sync ===
+
+=== Test 31 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch ===
+
+=== Test 32 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch/foo ===
+
+=== Test 33 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch/bar ===
+
+=== Test 34 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch/A ===
+
+=== Test 35 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch/A/bar ===
+
+=== Test 36 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with fsync /mnt/scratch/A/foo ===
+
+=== Test 37 :  link /mnt/scratch/A/bar /mnt/scratch/bar  with sync ===
diff --git a/tests/generic/group b/tests/generic/group
index 47de978..dc04152 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -519,3 +519,4 @@ 
 514 auto quick clone
 515 auto quick clone
 516 auto quick dedupe clone
+517-link crash