diff mbox series

[2/6] common: capture metadump output if xfs filesystem check fails

Message ID 161292579087.3504537.10519481439481869013.stgit@magnolia (mailing list archive)
State New, archived
Series fstests: various improvements to the test framework

Commit Message

Darrick J. Wong Feb. 10, 2021, 2:56 a.m. UTC
From: Darrick J. Wong <djwong@djwong.org>

Capture metadump output when various userspace repair and checker tools
fail or indicate corruption, to aid in debugging.  We don't bother to
annotate xfs_check because it's bitrotting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 README     |    2 ++
 common/xfs |   26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

Comments

Brian Foster Feb. 11, 2021, 1:59 p.m. UTC | #1
On Tue, Feb 09, 2021 at 06:56:30PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@djwong.org>
> 
> Capture metadump output when various userspace repair and checker tools
> fail or indicate corruption, to aid in debugging.  We don't bother to
> annotate xfs_check because it's bitrotting.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  README     |    2 ++
>  common/xfs |   26 ++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> 
> diff --git a/README b/README
> index 43bb0cee..36f72088 100644
> --- a/README
> +++ b/README
> @@ -109,6 +109,8 @@ Preparing system for tests:
>               - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
>                 it between test invocations.  This assumes that the name of
>                 the module is the same as FSTYP.
> +	     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
> +	       filesystems if the various stages of _check_xfs_filesystem fail.
>  
>          - or add a case to the switch in common/config assigning
>            these variables based on the hostname of your test
> diff --git a/common/xfs b/common/xfs
> index 2156749d..ad1eb6ee 100644
> --- a/common/xfs
> +++ b/common/xfs
> @@ -432,6 +432,21 @@ _supports_xfs_scrub()
>  	return 0
>  }
>  
> +# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
> +_snapshot_xfs() {

The term snapshot has a well known meaning. Can we just call this
_metadump_xfs()?

> +	local metadump="$1"
> +	local device="$2"
> +	local logdev="$3"
> +	local options="-a -o"
> +
> +	if [ "$logdev" != "none" ]; then
> +		options="$options -l $logdev"
> +	fi
> +
> +	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
> +	gzip -f "$metadump" >> "$seqres.full" 2>&1 &

Why compress in the background? I wonder if we should just skip the
compression step since this requires an option to enable in the first
place..

> +}
> +
>  # run xfs_check and friends on a FS.
>  _check_xfs_filesystem()
>  {
...
> @@ -540,6 +564,8 @@ _check_xfs_filesystem()
>  			cat $tmp.repair				>>$seqres.full
>  			echo "*** end xfs_repair output"	>>$seqres.full
>  
> +			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
> +				_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"

Why do we collect so many metadump images? Shouldn't all but the last
TEST_XFS_REPAIR_REBUILD thing not modify the fs? If so, it seems like we
should be able to collect one image (and perhaps just call it
"$seqres.$device.md") if any of the first several checks flag a problem.

Brian

>  			ok=0
>  		fi
>  		rm -f $tmp.repair
>
Darrick J. Wong Feb. 11, 2021, 6:12 p.m. UTC | #2
On Thu, Feb 11, 2021 at 08:59:58AM -0500, Brian Foster wrote:
> On Tue, Feb 09, 2021 at 06:56:30PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@djwong.org>
> > 
> > Capture metadump output when various userspace repair and checker tools
> > fail or indicate corruption, to aid in debugging.  We don't bother to
> > annotate xfs_check because it's bitrotting.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  README     |    2 ++
> >  common/xfs |   26 ++++++++++++++++++++++++++
> >  2 files changed, 28 insertions(+)
> > 
> > 
> > diff --git a/README b/README
> > index 43bb0cee..36f72088 100644
> > --- a/README
> > +++ b/README
> > @@ -109,6 +109,8 @@ Preparing system for tests:
> >               - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
> >                 it between test invocations.  This assumes that the name of
> >                 the module is the same as FSTYP.
> > +	     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
> > +	       filesystems if the various stages of _check_xfs_filesystem fail.
> >  
> >          - or add a case to the switch in common/config assigning
> >            these variables based on the hostname of your test
> > diff --git a/common/xfs b/common/xfs
> > index 2156749d..ad1eb6ee 100644
> > --- a/common/xfs
> > +++ b/common/xfs
> > @@ -432,6 +432,21 @@ _supports_xfs_scrub()
> >  	return 0
> >  }
> >  
> > +# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
> > +_snapshot_xfs() {
> 
> The term snapshot has a well known meaning. Can we just call this
> _metadump_xfs()?

Ok.

> 
> > +	local metadump="$1"
> > +	local device="$2"
> > +	local logdev="$3"
> > +	local options="-a -o"
> > +
> > +	if [ "$logdev" != "none" ]; then
> > +		options="$options -l $logdev"
> > +	fi
> > +
> > +	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
> > +	gzip -f "$metadump" >> "$seqres.full" 2>&1 &
> 
> Why compress in the background?

Sometimes the metadumps can become very large and I don't tend to have a
lot of space on the test appliances for storing blobs.

Also, I was under the impression that it was customary for people to
share compressed metadumps of crashes, so why not save everyone a step?

I do this in the background to avoid holding up the next fstest.

> I wonder if we should just skip the
> compression step since this requires an option to enable in the first
> place..

Seeing as it's optional, I think that's all the more reason to compress.

> 
> > +}
> > +
> >  # run xfs_check and friends on a FS.
> >  _check_xfs_filesystem()
> >  {
> ...
> > @@ -540,6 +564,8 @@ _check_xfs_filesystem()
> >  			cat $tmp.repair				>>$seqres.full
> >  			echo "*** end xfs_repair output"	>>$seqres.full
> >  
> > +			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
> > +				_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"
> 
> Why do we collect so many metadump images? Shouldn't all but the last
> TEST_XFS_REPAIR_REBUILD thing not modify the fs? If so, it seems like we
> should be able to collect one image (and perhaps just call it
> "$seqres.$device.md") if any of the first several checks flag a problem.

Yes, the number of metadumps collected can be reduced to two.  One if
scrub or logprint or repair -n fail, and a second one if the user set
TEST_XFS_REPAIR_REBUILD=1 and either the repair or the repair -n fail.

Will change that.
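
For illustration, the first of those two dumps could be taken once after all the read-only stages have run. This is a rough sketch, not the posted code: the _stage_* names are placeholders for the scrub, logprint and "repair -n" steps, the "$seqres.check.md" file name is made up, and _snapshot_xfs is the helper from this patch, assumed to be defined elsewhere.

```shell
# Sketch only. _stage_scrub, _stage_logprint and _stage_repair_n are
# hypothetical stand-ins for the existing read-only checks; they are not
# real fstests helpers and must be provided by the caller.
_check_readonly_stages_sketch() {
	local device="$1"
	local logdev="$2"
	local ok=1

	# Per the discussion above, these stages are not expected to modify
	# the filesystem, so a single image taken afterwards should cover
	# whichever of them complained.
	_stage_scrub "$device" || ok=0
	_stage_logprint "$device" || ok=0
	_stage_repair_n "$device" || ok=0

	# "$seqres.check.md" is an assumed name, chosen for this sketch.
	if [ "$ok" -ne 1 ] && [ "$SNAPSHOT_CORRUPT_XFS" = "1" ]; then
		_snapshot_xfs "$seqres.check.md" "$device" "$logdev"
	fi

	[ "$ok" -eq 1 ]
}
```

The second dump, for the TEST_XFS_REPAIR_REBUILD=1 case, would follow the same pattern around the rebuild and the subsequent repair -n.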

--D

> 
> Brian
> 
> >  			ok=0
> >  		fi
> >  		rm -f $tmp.repair
> > 
>
Brian Foster Feb. 11, 2021, 6:35 p.m. UTC | #3
On Thu, Feb 11, 2021 at 10:12:34AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 11, 2021 at 08:59:58AM -0500, Brian Foster wrote:
> > On Tue, Feb 09, 2021 at 06:56:30PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@djwong.org>
> > > 
> > > Capture metadump output when various userspace repair and checker tools
> > > fail or indicate corruption, to aid in debugging.  We don't bother to
> > > annotate xfs_check because it's bitrotting.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  README     |    2 ++
> > >  common/xfs |   26 ++++++++++++++++++++++++++
> > >  2 files changed, 28 insertions(+)
> > > 
> > > 
> > > diff --git a/README b/README
> > > index 43bb0cee..36f72088 100644
> > > --- a/README
> > > +++ b/README
> > > @@ -109,6 +109,8 @@ Preparing system for tests:
> > >               - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
> > >                 it between test invocations.  This assumes that the name of
> > >                 the module is the same as FSTYP.
> > > +	     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
> > > +	       filesystems if the various stages of _check_xfs_filesystem fail.
> > >  
> > >          - or add a case to the switch in common/config assigning
> > >            these variables based on the hostname of your test
> > > diff --git a/common/xfs b/common/xfs
> > > index 2156749d..ad1eb6ee 100644
> > > --- a/common/xfs
> > > +++ b/common/xfs
> > > @@ -432,6 +432,21 @@ _supports_xfs_scrub()
> > >  	return 0
> > >  }
> > >  
> > > +# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
> > > +_snapshot_xfs() {
> > 
> > The term snapshot has a well known meaning. Can we just call this
> > _metadump_xfs()?
> 
> Ok.
> 
> > 
> > > +	local metadump="$1"
> > > +	local device="$2"
> > > +	local logdev="$3"
> > > +	local options="-a -o"
> > > +
> > > +	if [ "$logdev" != "none" ]; then
> > > +		options="$options -l $logdev"
> > > +	fi
> > > +
> > > +	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
> > > +	gzip -f "$metadump" >> "$seqres.full" 2>&1 &
> > 
> > Why compress in the background?
> 
> Sometimes the metadumps can become very large and I don't tend to have a
> lot of space on the test appliances for storing blobs.
> 
> Also, I was under the impression that it was customary for people to
> share compressed metadumps of crashes, so why not save everyone a step?
> 
> I do this in the background to avoid holding up the next fstest.
> 
> > I wonder if we should just skip the
> > compression step since this requires an option to enable in the first
> > place..
> 
> Seeing as it's optional, I think that's all the more reason to compress.
> 

That's fair. It was more the background task that I was concerned about.
If the issue is that the compression takes too long, ISTM there's a
similar risk of the background compression conflicting with ongoing
tests. E.g., we have various tests that scale out I/O threads to extreme
levels and could delay the compression even longer (or vice versa), we
have no way to prevent multiple background compression tasks from
starting/competing as tests continue to run, etc.

What about allowing the user to specify an optional env var in the
config file to provide a compression command to use? If set, compress
the file in the foreground. Then the user can determine whether
compression is necessary at all, and if so, which compression tool might
provide a suitable enough time/space tradeoff for the test environment
(e.g., something like lz4 might be faster than gzip or bzip2 at the cost
of space).

Brian

> > 
> > > +}
> > > +
> > >  # run xfs_check and friends on a FS.
> > >  _check_xfs_filesystem()
> > >  {
> > ...
> > > @@ -540,6 +564,8 @@ _check_xfs_filesystem()
> > >  			cat $tmp.repair				>>$seqres.full
> > >  			echo "*** end xfs_repair output"	>>$seqres.full
> > >  
> > > +			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
> > > +				_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"
> > 
> > Why do we collect so many metadump images? Shouldn't all but the last
> > TEST_XFS_REPAIR_REBUILD thing not modify the fs? If so, it seems like we
> > should be able to collect one image (and perhaps just call it
> > "$seqres.$device.md") if any of the first several checks flag a problem.
> 
> Yes, the number of metadumps collected can be reduced to two.  One if
> scrub or logprint or repair -n fail, and a second one if the user set
> TEST_XFS_REPAIR_REBUILD=1 and either the repair or the repair -n fail.
> 
> Will change that.
> 
> --D
> 
> > 
> > Brian
> > 
> > >  			ok=0
> > >  		fi
> > >  		rm -f $tmp.repair
> > > 
> > 
>
Darrick J. Wong Feb. 11, 2021, 7:05 p.m. UTC | #4
On Thu, Feb 11, 2021 at 01:35:24PM -0500, Brian Foster wrote:
> On Thu, Feb 11, 2021 at 10:12:34AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 11, 2021 at 08:59:58AM -0500, Brian Foster wrote:
> > > On Tue, Feb 09, 2021 at 06:56:30PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@djwong.org>
> > > > 
> > > > Capture metadump output when various userspace repair and checker tools
> > > > fail or indicate corruption, to aid in debugging.  We don't bother to
> > > > annotate xfs_check because it's bitrotting.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  README     |    2 ++
> > > >  common/xfs |   26 ++++++++++++++++++++++++++
> > > >  2 files changed, 28 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/README b/README
> > > > index 43bb0cee..36f72088 100644
> > > > --- a/README
> > > > +++ b/README
> > > > @@ -109,6 +109,8 @@ Preparing system for tests:
> > > >               - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
> > > >                 it between test invocations.  This assumes that the name of
> > > >                 the module is the same as FSTYP.
> > > > +	     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
> > > > +	       filesystems if the various stages of _check_xfs_filesystem fail.
> > > >  
> > > >          - or add a case to the switch in common/config assigning
> > > >            these variables based on the hostname of your test
> > > > diff --git a/common/xfs b/common/xfs
> > > > index 2156749d..ad1eb6ee 100644
> > > > --- a/common/xfs
> > > > +++ b/common/xfs
> > > > @@ -432,6 +432,21 @@ _supports_xfs_scrub()
> > > >  	return 0
> > > >  }
> > > >  
> > > > +# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
> > > > +_snapshot_xfs() {
> > > 
> > > The term snapshot has a well known meaning. Can we just call this
> > > _metadump_xfs()?
> > 
> > Ok.
> > 
> > > 
> > > > +	local metadump="$1"
> > > > +	local device="$2"
> > > > +	local logdev="$3"
> > > > +	local options="-a -o"
> > > > +
> > > > +	if [ "$logdev" != "none" ]; then
> > > > +		options="$options -l $logdev"
> > > > +	fi
> > > > +
> > > > +	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
> > > > +	gzip -f "$metadump" >> "$seqres.full" 2>&1 &
> > > 
> > > Why compress in the background?
> > 
> > Sometimes the metadumps can become very large and I don't tend to have a
> > lot of space on the test appliances for storing blobs.
> > 
> > Also, I was under the impression that it was customary for people to
> > share compressed metadumps of crashes, so why not save everyone a step?
> > 
> > I do this in the background to avoid holding up the next fstest.
> > 
> > > I wonder if we should just skip the
> > > compression step since this requires an option to enable in the first
> > > place..
> > 
> > Seeing as it's optional, I think that's all the more reason to compress.
> > 
> 
> That's fair. It was more the background task that I was concerned about.
> If the issue is that the compression takes too long, ISTM there's a
> similar risk of the background compression conflicting with ongoing
> tests. E.g., we have various tests that scale out I/O threads to extreme
> levels and could delay the compression even longer (or vice versa), we
> have no way to prevent multiple background compression tasks from
> starting/competing as tests continue to run, etc.

<nod> Admittedly I chose gzip because it's decent in both the speed and
compression ratio traits; someone else might want xz --extreme.

> What about allowing the user to specify an optional env var in the
> config file to provide a compression command to use? If set, compress
> the file in the foreground. Then the user can determine whether
> compression is necessary at all, and if so, which compression tool might
> provide a suitable enough time/space tradeoff for the test environment
> (e.g., something like lz4 might be faster than gzip or bzip2 at the cost
> of space).

Good idea!  If the user sets SNAPSHOT_XFS_COMPRESSOR to the compressor
program of their choice (e.g. 'gzip -9') then we'll use that to compress
the metadump.

It also occurred to me that I could refactor _scratch_metadump to use
this new helper, so I think I'll implement some means for letting actual
tests disable compression unconditionally.
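
As a rough sketch of what that could look like (not the code that was posted, and the function name here is made up): compression runs in the foreground and only when SNAPSHOT_XFS_COMPRESSOR is set. It assumes $XFS_METADUMP_PROG and $seqres are provided by the fstests harness, as in the patch.

```shell
# Hypothetical rework of the helper from this patch; a sketch under the
# assumptions stated above, not the final implementation.
_metadump_xfs_sketch() {
	local metadump="$1"
	local device="$2"
	local logdev="$3"
	local options="-a -o"

	if [ "$logdev" != "none" ]; then
		options="$options -l $logdev"
	fi

	# Stop early if the dump itself fails; there is nothing to compress.
	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1 || return 1

	# Compress in the foreground, and only if the user asked for it.
	# The variable is left unquoted on purpose so that a value such as
	# 'gzip -9' splits into a command and its arguments.
	if [ -n "$SNAPSHOT_XFS_COMPRESSOR" ]; then
		$SNAPSHOT_XFS_COMPRESSOR "$metadump" >> "$seqres.full" 2>&1
	fi
}
```

With the variable unset the metadump is left uncompressed, which also gives tests a simple way to opt out.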

--D

> Brian
> 
> > > 
> > > > +}
> > > > +
> > > >  # run xfs_check and friends on a FS.
> > > >  _check_xfs_filesystem()
> > > >  {
> > > ...
> > > > @@ -540,6 +564,8 @@ _check_xfs_filesystem()
> > > >  			cat $tmp.repair				>>$seqres.full
> > > >  			echo "*** end xfs_repair output"	>>$seqres.full
> > > >  
> > > > +			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
> > > > +				_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"
> > > 
> > > Why do we collect so many metadump images? Shouldn't all but the last
> > > TEST_XFS_REPAIR_REBUILD thing not modify the fs? If so, it seems like we
> > > should be able to collect one image (and perhaps just call it
> > > "$seqres.$device.md") if any of the first several checks flag a problem.
> > 
> > Yes, the number of metadumps collected can be reduced to two.  One if
> > scrub or logprint or repair -n fail, and a second one if the user set
> > TEST_XFS_REPAIR_REBUILD=1 and either the repair or the repair -n fail.
> > 
> > Will change that.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  			ok=0
> > > >  		fi
> > > >  		rm -f $tmp.repair
> > > > 
> > > 
> > 
>
Patch

diff --git a/README b/README
index 43bb0cee..36f72088 100644
--- a/README
+++ b/README
@@ -109,6 +109,8 @@  Preparing system for tests:
              - Set TEST_FS_MODULE_RELOAD=1 to unload the module and reload
                it between test invocations.  This assumes that the name of
                the module is the same as FSTYP.
+	     - Set SNAPSHOT_CORRUPT_XFS=1 to record compressed metadumps of XFS
+	       filesystems if the various stages of _check_xfs_filesystem fail.
 
         - or add a case to the switch in common/config assigning
           these variables based on the hostname of your test
diff --git a/common/xfs b/common/xfs
index 2156749d..ad1eb6ee 100644
--- a/common/xfs
+++ b/common/xfs
@@ -432,6 +432,21 @@  _supports_xfs_scrub()
 	return 0
 }
 
+# Save a compressed snapshot of a corrupt xfs filesystem for later debugging.
+_snapshot_xfs() {
+	local metadump="$1"
+	local device="$2"
+	local logdev="$3"
+	local options="-a -o"
+
+	if [ "$logdev" != "none" ]; then
+		options="$options -l $logdev"
+	fi
+
+	$XFS_METADUMP_PROG $options "$device" "$metadump" >> "$seqres.full" 2>&1
+	gzip -f "$metadump" >> "$seqres.full" 2>&1 &
+}
+
 # run xfs_check and friends on a FS.
 _check_xfs_filesystem()
 {
@@ -482,6 +497,9 @@  _check_xfs_filesystem()
 		# mounted ...
 		mountpoint=`_umount_or_remount_ro $device`
 	fi
+	if [ "$ok" -ne 1 ] && [ "$SNAPSHOT_CORRUPT_XFS" = "1" ]; then
+		_snapshot_xfs "$seqres.scrub.md" "$device" "$2"
+	fi
 
 	$XFS_LOGPRINT_PROG -t $extra_log_options $device 2>&1 \
 		| tee $tmp.logprint | grep -q "<CLEAN>"
@@ -491,6 +509,8 @@  _check_xfs_filesystem()
 		cat $tmp.logprint			>>$seqres.full
 		echo "*** end xfs_logprint output"	>>$seqres.full
 
+		test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
+			_snapshot_xfs "$seqres.logprint.md" "$device" "$2"
 		ok=0
 	fi
 
@@ -516,6 +536,8 @@  _check_xfs_filesystem()
 		cat $tmp.repair				>>$seqres.full
 		echo "*** end xfs_repair output"	>>$seqres.full
 
+		test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
+			_snapshot_xfs "$seqres.repair.md" "$device" "$2"
 		ok=0
 	fi
 	rm -f $tmp.fs_check $tmp.logprint $tmp.repair
@@ -529,6 +551,8 @@  _check_xfs_filesystem()
 			cat $tmp.repair				>>$seqres.full
 			echo "*** end xfs_repair output"	>>$seqres.full
 
+			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
+				_snapshot_xfs "$seqres.rebuild.md" "$device" "$2"
 			ok=0
 		fi
 		rm -f $tmp.repair
@@ -540,6 +564,8 @@  _check_xfs_filesystem()
 			cat $tmp.repair				>>$seqres.full
 			echo "*** end xfs_repair output"	>>$seqres.full
 
+			test "$SNAPSHOT_CORRUPT_XFS" = "1" && \
+				_snapshot_xfs "$seqres.rebuildrepair.md" "$device" "$2"
 			ok=0
 		fi
 		rm -f $tmp.repair
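
Usage note (a sketch, assuming the knob is read from the environment or local.config like the neighbouring TEST_FS_MODULE_RELOAD entry in the README): opting in to the behaviour added by this patch would look roughly like this.

```shell
# Opt in to metadump capture when _check_xfs_filesystem reports a problem.
# The variable name is taken from the README hunk above.
export SNAPSHOT_CORRUPT_XFS=1
```

With the patch as posted, a failing stage would then leave a gzip-compressed image next to the test results, for example "$seqres.scrub.md.gz" or "$seqres.repair.md.gz".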