diff mbox series

[03/14] xfs: document the testing plan for online fsck

Message ID 165989702236.2495930.5556030223682318775.stgit@magnolia (mailing list archive)
State Superseded
Headers show
Series xfs: design documentation for online fsck | expand

Commit Message

Darrick J. Wong Aug. 7, 2022, 6:30 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
 1 file changed, 187 insertions(+)

Comments

Dave Chinner Aug. 11, 2022, 12:09 a.m. UTC | #1
On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the third chapter of the online fsck design documentation.  This
> covers the testing plan to make sure that both online and offline fsck
> can detect arbitrary problems and correct them without making things
> worse.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
>  1 file changed, 187 insertions(+)


....
> +Stress Testing
> +--------------
> +
> +A unique requirement to online fsck is the ability to operate on a filesystem
> +concurrently with regular workloads.
> +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> +impact on the running system, the online repair code should never introduce
> +inconsistencies into the filesystem metadata, and regular workloads should
> +never notice resource starvation.
> +To verify that these conditions are being met, fstests has been enhanced in
> +the following ways:
> +
> +* For each scrub item type, create a test to exercise checking that item type
> +  while running ``fsstress``.
> +* For each scrub item type, create a test to exercise repairing that item type
> +  while running ``fsstress``.
> +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> +  filesystem doesn't cause problems.
> +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> +  force-repairing the whole filesystem doesn't cause problems.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  freezing and thawing the filesystem.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  remounting the filesystem read-only and read-write.
> +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

I had a thought when reading this that we want to ensure that online
repair handles concurrent grow/shrink operations so that doesn't
cause problems, as well as dealing with concurrent attempts to run
independent online repair processes.

Not sure that comes under stress testing, but it was the "test while
freeze/thaw" that triggered me to think of this, so that's where I'm
commenting about it. :)

Cheers,

Dave.
Darrick J. Wong Aug. 16, 2022, 2:18 a.m. UTC | #2
On Thu, Aug 11, 2022 at 10:09:45AM +1000, Dave Chinner wrote:
> On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the third chapter of the online fsck design documentation.  This
> > covers the testing plan to make sure that both online and offline fsck
> > can detect arbitrary problems and correct them without making things
> > worse.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
> >  1 file changed, 187 insertions(+)
> 
> 
> ....
> > +Stress Testing
> > +--------------
> > +
> > +A unique requirement to online fsck is the ability to operate on a filesystem
> > +concurrently with regular workloads.
> > +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> > +impact on the running system, the online repair code should never introduce
> > +inconsistencies into the filesystem metadata, and regular workloads should
> > +never notice resource starvation.
> > +To verify that these conditions are being met, fstests has been enhanced in
> > +the following ways:
> > +
> > +* For each scrub item type, create a test to exercise checking that item type
> > +  while running ``fsstress``.
> > +* For each scrub item type, create a test to exercise repairing that item type
> > +  while running ``fsstress``.
> > +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> > +  filesystem doesn't cause problems.
> > +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> > +  force-repairing the whole filesystem doesn't cause problems.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  freezing and thawing the filesystem.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  remounting the filesystem read-only and read-write.
> > +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
> 
> I had a thought when reading this that we want to ensure that online
> repair handles concurrent grow/shrink operations so that doesn't
> cause problems, as well as dealing with concurrent attempts to run
> independent online repair processes.
> 
> Not sure that comes under stress testing, but it was the "test while
> freeze/thaw" that triggered me to think of this, so that's where I'm
> commenting about it. :)

Hmm.  I hadn't really given that much thought.  Let me go add that to
the test suite and see how many daemons come pouring out...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a03a7b9f0250..d630b6bdbe4a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -563,3 +563,190 @@  functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exists, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair as compared to
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests were created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+A quick conversation with the other XFS developers revealed that the existing
+test infrastructure could be extended to provide a much more powerful
+facility: targeted fuzz testing of every metadata field of every metadata
+object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+  * For each record inside that metadata object...
+
+    * For each field inside that record...
+
+      * For each conceivable type of transformation that can be applied to a bit field...
+
+        1. Clear all bits
+        2. Set all bits
+        3. Toggle the most significant bit
+        4. Toggle the middle bit
+        5. Toggle the least significant bit
+        6. Add a small quantity
+        7. Subtract a small quantity
+        8. Randomize the contents
+
+        * ...test the reactions of:
+
+          1. The kernel verifiers to stop obviously bad metadata
+          2. Offline checking (``xfs_repair -n``)
+          3. Offline repair (``xfs_repair``)
+          4. Online checking (``xfs_scrub -n``)
+          5. Online repair (``xfs_scrub``)
+          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A unique requirement to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+  while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+  while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+  filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+  force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.