[08/11] xfs_repair: allow setting the needsrepair flag

Message ID	161308438691.3850286.3501696811159590596.stgit@magnolia (mailing list archive)
State	New, archived
Headers
	Return-Path: <linux-xfs-owner@kernel.org>
	Subject: [PATCH 08/11] xfs_repair: allow setting the needsrepair flag
	From: "Darrick J. Wong" <djwong@kernel.org>
	To: sandeen@sandeen.net, djwong@kernel.org
	Cc: Christoph Hellwig <hch@lst.de>, Brian Foster <bfoster@redhat.com>, linux-xfs@vger.kernel.org, bfoster@redhat.com
	Date: Thu, 11 Feb 2021 14:59:46 -0800
	Message-ID: <161308438691.3850286.3501696811159590596.stgit@magnolia>
	In-Reply-To: <161308434132.3850286.13801623440532587184.stgit@magnolia>
	References: <161308434132.3850286.13801623440532587184.stgit@magnolia>
	User-Agent: StGit/0.19
	MIME-Version: 1.0
	Content-Type: text/plain; charset="utf-8"
	Content-Transfer-Encoding: 7bit
	Precedence: bulk
Series	xfs: add the ability to flag a fs for repair
	[PATCHSET,v5,00/11] xfs: add the ability to flag a fs for repair
	[01/11] xfs_admin: clean up string quoting
	[02/11] xfs_admin: support filesystems with realtime devices
	[03/11] xfs_db: report the needsrepair flag in check and version commands
	[04/11] xfs_db: don't allow label/uuid setting if the needsrepair flag is set
	[05/11] xfs_repair: fix unmount error message to have a newline
	[06/11] xfs_repair: clear quota CHKD flags on the incore superblock too
	[07/11] xfs_repair: clear the needsrepair flag
	[08/11] xfs_repair: allow setting the needsrepair flag
	[09/11] xfs_repair: add a testing hook for NEEDSREPAIR
	[10/11] xfs_admin: support adding features to V5 filesystems
	[11/11] man: mark all deprecated V4 format options

Darrick J. Wong Feb. 11, 2021, 10:59 p.m. UTC

From: Darrick J. Wong <djwong@kernel.org>

Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
program start and (presumably) clear it by the end of the run.  This
code isn't terribly useful to users; it's mainly here so that fstests
can exercise the functionality.  We don't document this flag in the
manual pages at all because repair clears needsrepair at exit, which
means the knobs only exist for fstests to exercise the functionality.

Note that we can't do any of these upgrades until we've at least done a
preliminary scan of the primary super and the log.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 repair/globals.c    |    2 ++
 repair/globals.h    |    2 ++
 repair/phase2.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
 repair/xfs_repair.c |    9 +++++++
 4 files changed, 76 insertions(+)

Eric Sandeen Feb. 11, 2021, 11:29 p.m. UTC | #1

On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> program start and (presumably) clear it by the end of the run.  This
> code isn't terribly useful to users; it's mainly here so that fstests
> can exercise the functionality.  We don't document this flag in the
> manual pages at all because repair clears needsrepair at exit, which
> means the knobs only exist for fstests to exercise the functionality.
> 
> Note that we can't do any of these upgrades until we've at least done a
> preliminary scan of the primary super and the log.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>


I'm still a little on the fence about the cmdline option for crashing repair at a certain point from the POV that Brian kind of pointed out that this doesn't exactly scale as we need more hooks.

but

ehhhh it's a test-only undocumented option and I guess we could change it later if desired

we do have other debug options on the commandline already as well....


> ---
>  repair/globals.c    |    2 ++
>  repair/globals.h    |    2 ++
>  repair/phase2.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  repair/xfs_repair.c |    9 +++++++
>  4 files changed, 76 insertions(+)
> 
> 
> diff --git a/repair/globals.c b/repair/globals.c
> index 110d98b6..699a96ee 100644
> --- a/repair/globals.c
> +++ b/repair/globals.c
> @@ -49,6 +49,8 @@ int	rt_spec;		/* Realtime dev specified as option */
>  int	convert_lazy_count;	/* Convert lazy-count mode on/off */
>  int	lazy_count;		/* What to set if to if converting */
>  
> +bool	add_needsrepair;	/* forcibly set needsrepair while repairing */
> +
>  /* misc status variables */
>  
>  int	primary_sb_modified;
> diff --git a/repair/globals.h b/repair/globals.h
> index 1d397b35..043b3e8e 100644
> --- a/repair/globals.h
> +++ b/repair/globals.h
> @@ -90,6 +90,8 @@ extern int	rt_spec;		/* Realtime dev specified as option */
>  extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
>  extern int	lazy_count;		/* What to set if to if converting */
>  
> +extern bool	add_needsrepair;
> +
>  /* misc status variables */
>  
>  extern int		primary_sb_modified;
> diff --git a/repair/phase2.c b/repair/phase2.c
> index 952ac4a5..9a8d42e1 100644
> --- a/repair/phase2.c
> +++ b/repair/phase2.c
> @@ -131,6 +131,63 @@ zero_log(
>  		libxfs_max_lsn = log->l_last_sync_lsn;
>  }
>  
> +static bool
> +set_needsrepair(
> +	struct xfs_mount	*mp)
> +{
> +	if (!xfs_sb_version_hascrc(&mp->m_sb)) {
> +		printf(
> +	_("needsrepair flag only supported on V5 filesystems.\n"));
> +		exit(0);
> +	}
> +
> +	if (xfs_sb_version_needsrepair(&mp->m_sb)) {
> +		printf(_("Filesystem already marked as needing repair.\n"));
> +		exit(0);
> +	}
> +
> +	printf(_("Marking filesystem in need of repair.\n"));
> +	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
> +	return true;
> +}
> +
> +/* Perform the user's requested upgrades on filesystem. */
> +static void
> +upgrade_filesystem(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_buf		*bp;
> +	bool			dirty = false;
> +	int			error;
> +
> +	if (add_needsrepair)
> +		dirty |= set_needsrepair(mp);
> +
> +        if (no_modify || !dirty)
> +                return;
> +
> +        bp = libxfs_getsb(mp);
> +        if (!bp || bp->b_error) {
> +                do_error(
> +	_("couldn't get superblock for feature upgrade, err=%d\n"),
> +                                bp ? bp->b_error : ENOMEM);
> +        } else {
> +                libxfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> +
> +                /*
> +		 * Write the primary super to disk immediately so that
> +		 * needsrepair will be set if repair doesn't complete.
> +		 */
> +                error = -libxfs_bwrite(bp);
> +                if (error)
> +                        do_error(
> +	_("filesystem feature upgrade failed, err=%d\n"),
> +                                        error);
> +        }
> +        if (bp)
> +                libxfs_buf_relse(bp);
> +}
> +
>  /*
>   * ok, at this point, the fs is mounted but the root inode may be
>   * trashed and the ag headers haven't been checked.  So we have
> @@ -235,4 +292,10 @@ phase2(
>  				do_warn(_("would correct\n"));
>  		}
>  	}
> +
> +	/*
> +	 * Upgrade the filesystem now that we've done a preliminary check of
> +	 * the superblocks, the AGs, the log, and the metadata inodes.
> +	 */
> +	upgrade_filesystem(mp);
>  }
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index 90d1a95a..a613505f 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -65,11 +65,13 @@ static char *o_opts[] = {
>   */
>  enum c_opt_nums {
>  	CONVERT_LAZY_COUNT = 0,
> +	CONVERT_NEEDSREPAIR,
>  	C_MAX_OPTS,
>  };
>  
>  static char *c_opts[] = {
>  	[CONVERT_LAZY_COUNT]	= "lazycount",
> +	[CONVERT_NEEDSREPAIR]	= "needsrepair",
>  	[C_MAX_OPTS]		= NULL,
>  };
>  
> @@ -302,6 +304,13 @@ process_args(int argc, char **argv)
>  					lazy_count = (int)strtol(val, NULL, 0);
>  					convert_lazy_count = 1;
>  					break;
> +				case CONVERT_NEEDSREPAIR:
> +					if (!val)
> +						do_abort(
> +		_("-c needsrepair requires a parameter\n"));
> +					if (strtol(val, NULL, 0) == 1)
> +						add_needsrepair = true;
> +					break;
>  				default:
>  					unknown('c', val);
>  					break;
>
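
For anyone who wants to poke at the "-c needsrepair=1" handling quoted above without building xfsprogs, here is a minimal standalone sketch. It assumes the suboptions are split with POSIX getsubopt(); the names conv_opts, OPT_* and parse_convert_opts() are made up for illustration and are not taken from the patch. As in the patch, only the value 1 arms the flag, and a missing value is treated as an error.

```c
#define _XOPEN_SOURCE 700	/* for getsubopt() under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the c_opt_nums enum and c_opts[] table. */
enum { OPT_LAZYCOUNT = 0, OPT_NEEDSREPAIR, OPT_MAX };

static char *const conv_opts[] = {
	[OPT_LAZYCOUNT]		= "lazycount",
	[OPT_NEEDSREPAIR]	= "needsrepair",
	[OPT_MAX]		= NULL,
};

/*
 * Parse the argument of a "-c" option, e.g. "needsrepair=1".
 * getsubopt() modifies the string in place, so 'arg' must be writable.
 * Returns 0 on success, -1 on an unknown suboption or a missing value.
 */
static int
parse_convert_opts(
	char	*arg,
	bool	*add_needsrepair)
{
	char	*p = arg;
	char	*val;

	assert(arg != NULL && add_needsrepair != NULL);

	while (*p != '\0') {
		switch (getsubopt(&p, conv_opts, &val)) {
		case OPT_NEEDSREPAIR:
			if (!val)
				return -1;	/* "-c needsrepair" with no value */
			if (strtol(val, NULL, 0) == 1)
				*add_needsrepair = true;
			break;
		case OPT_LAZYCOUNT:
			/* not modelled in this sketch */
			break;
		default:
			return -1;		/* unknown suboption */
		}
	}
	return 0;
}
```

Calling parse_convert_opts() on a writable copy of "needsrepair=1" sets the flag; "needsrepair=0" parses successfully but leaves the flag alone, which mirrors the strtol(val, NULL, 0) == 1 test in the diff.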

Darrick J. Wong Feb. 12, 2021, 12:17 a.m. UTC | #2

On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
> On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> > program start and (presumably) clear it by the end of the run.  This
> > code isn't terribly useful to users; it's mainly here so that fstests
> > can exercise the functionality.  We don't document this flag in the
> > manual pages at all because repair clears needsrepair at exit, which
> > means the knobs only exist for fstests to exercise the functionality.
> > 
> > Note that we can't do any of these upgrades until we've at least done a
> > preliminary scan of the primary super and the log.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> 
> I'm still a little on the fence about the cmdline option for crashing
> repair at a certain point from the POV that Brian kind of pointed out
> that this doesn't exactly scale as we need more hooks.

(That's in the next patch.)

> but
> 
> ehhhh it's a test-only undocumented option and I guess we could change
> it later if desired
> 
> we do have other debug options on the commandline already as well....

I don't mind moving the debugging hooks to be seekrit environment
variables or something, but I don't think I've quite addressed some of
Brian's comments from last time:

[paste in stuff Brian said]

> But is it worth maintaining test specific debug logic in an
> application just to confirm that particular feature bit upgrades
> actually set the bit?

I argue that yes, this is important enough to burn a debugging knob.
The sequence that I think we should prevent through testing is the one
where we've set the new feature on the primary super but we haven't
finished generating whatever new metadata is needed to complete the
upgrade, the system crashes, and on remount the verifiers explode.

Chances are pretty good that we'll get an angry bug report on the
mailing list: "I upgraded my fs, the power went down, and the kernel
sprayed corruption everywhere!"  If we get a customer escalation like
this, I'd /much/ rather it be about not being able to mount right after
the reboot than a latent corruption that grows unseen until somebody's
filesystem loses data.

If a future patch to repair accidentally breaks the behavior where we
set NEEDSREPAIR at the same time as we set the new feature and flush the
super to disk, we cannot tell that there's been a regression in this
safety mechanism just by looking at the output of an otherwise
successful xfs_repair run...

> It seems sufficient to me to test that needsrepair functionality works
> as expected and that individual feature upgrade works as well.

...so in other words, we need some point to inject an error to make sure
that the upgrade interlock is correct.

> Given the discussion on patch 7, perhaps it makes more sense to at
> least defer this sort of injection mechanism until we have a scheme
> for generic needsrepair usage worked out for xfs_repair?

I'm in the midst of prototyping what I said in the last thread --
hooking the buffer cache so that repair can catch the first time we
actually write anything to the filesystem, and using that to set
NEEDSREPAIR.  I've not run it through full fstests yet, but AFAICT I can
keep using the same tests and the same injection knobs I already wrote.

> I am wondering if there's a way to make repair fail without requiring
> additional code, but if not and we do require some sort of injection
> mode, I suspect we might end up better served by something more
> generic (i.e. capable of failures at random points) rather than
> defining a command line option specifically for a particular fstest..

Probably yes, but ... uh I don't want this to drag on into building a
generic error injection framework for userspace.

I would /really/ like to get inobtcount/bigtime tests into the kernel
without a giant detour; they have nearly zero test coverage from the
wider community.

--D


Eric Sandeen Feb. 12, 2021, 12:20 a.m. UTC | #3

On 2/11/21 6:17 PM, Darrick J. Wong wrote:
> On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
>> On 2/11/21 4:59 PM, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
>>> program start and (presumably) clear it by the end of the run.  This
>>> code isn't terribly useful to users; it's mainly here so that fstests
>>> can exercise the functionality.  We don't document this flag in the
>>> manual pages at all because repair clears needsrepair at exit, which
>>> means the knobs only exist for fstests to exercise the functionality.
>>>
>>> Note that we can't do any of these upgrades until we've at least done a
>>> preliminary scan of the primary super and the log.
>>>
>>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>> Reviewed-by: Brian Foster <bfoster@redhat.com>
>>
>>
>> I'm still a little on the fence about the cmdline option for crashing
>> repair at a certain point from the POV that Brian kind of pointed out
>> that this doesn't exactly scale as we need more hooks.
> 
> (That's in the next patch.)

I. Am. Awesome.
 
...

> Probably yes, but ... uh I don't want this to drag on into building a
> generic error injection framework for userspace.
> 
> I would /really/ like to get inobtcount/bigtime tests into the kernel
> without a giant detour; they have nearly zero test coverage from the
> wider community.

Yeah, I don't want that either.

this (er, next patch) is s3kr1t and if we have something better later we
can change it.  I'll just merge stuff as-is and move forward.

-Eric

Darrick J. Wong Feb. 12, 2021, 1:26 a.m. UTC | #4

On Thu, Feb 11, 2021 at 06:20:23PM -0600, Eric Sandeen wrote:
> 
> 
> On 2/11/21 6:17 PM, Darrick J. Wong wrote:
> > On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
> >> On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> >>> program start and (presumably) clear it by the end of the run.  This
> >>> code isn't terribly useful to users; it's mainly here so that fstests
> >>> can exercise the functionality.  We don't document this flag in the
> >>> manual pages at all because repair clears needsrepair at exit, which
> >>> means the knobs only exist for fstests to exercise the functionality.
> >>>
> >>> Note that we can't do any of these upgrades until we've at least done a
> >>> preliminary scan of the primary super and the log.
> >>>
> >>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> >>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>> Reviewed-by: Brian Foster <bfoster@redhat.com>
> >>
> >>
> >> I'm still a little on the fence about the cmdline option for crashing
> >> repair at a certain point from the POV that Brian kind of pointed out
> >> that this doesn't exactly scale as we need more hooks.
> > 
> > (That's in the next patch.)
> 
> I. Am. Awesome.
>  
> ...
> 
> > Probably yes, but ... uh I don't want this to drag on into building a
> > generic error injection framework for userspace.
> > 
> > I would /really/ like to get inobtcount/bigtime tests into the kernel
> > without a giant detour; they have nearly zero test coverage from the
> > wider community.
> 
> Yeah, I don't want that either.
> 
> this (er, next patch) is s3kr1t and if we have something better later we
> can change it.  I'll just merge stuff as-is and move forward.

Er... TBH I actually /did/ want to hear Brian's response...

--D

> -Eric

Darrick J. Wong Feb. 12, 2021, 4:35 a.m. UTC | #5

On Thu, Feb 11, 2021 at 04:17:31PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
> > On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> > > program start and (presumably) clear it by the end of the run.  This
> > > code isn't terribly useful to users; it's mainly here so that fstests
> > > can exercise the functionality.  We don't document this flag in the
> > > manual pages at all because repair clears needsrepair at exit, which
> > > means the knobs only exist for fstests to exercise the functionality.
> > > 
> > > Note that we can't do any of these upgrades until we've at least done a
> > > preliminary scan of the primary super and the log.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > 
> > 
> > I'm still a little on the fence about the cmdline option for crashing
> > repair at a certain point from the POV that Brian kind of pointed out
> > that this doesn't exactly scale as we need more hooks.
> 
> (That's in the next patch.)
> 
> > but
> > 
> > ehhhh it's a test-only undocumented option and I guess we could change
> > it later if desired
> > 
> > we do have other debug options on the commandline already as well....
> 
> I don't mind moving the debugging hooks to be seekrit environment
> variables or something, but I don't think I've quite addressed some of
> Brian's comments from last time:
> 
> [paste in stuff Brian said]
> 
> > But is it worth maintaining test specific debug logic in an
> > application just to confirm that particular feature bit upgrades
> > actually set the bit?
> 
> I argue that yes, this is important enough to burn a debugging knob.
> The sequence that I think we should prevent through testing is the one
> where we've set the new feature on the primary super but we haven't
> finished generating whatever new metadata is needed to complete the
> upgrade, the system crashes, and on remount the verifiers explode.
> 
> Chances are pretty good that we'll get an angry bug report on the
> mailing list: "I upgraded my fs, the power went down, and the kernel
> sprayed corruption everywhere!"  If we get a customer escalation like
> this, I'd /much/ rather it be about not being able to mount right after
> the reboot than a latent corruption that grows unseen until somebody's
> filesystem loses data.
> 
> If a future patch to repair accidentally breaks the behavior where we
> set NEEDSREPAIR at the same time as we set the new feature and flush the
> super to disk, we cannot tell that there's been a regression in this
> safety mechanism just by looking at the output of an otherwise
> successful xfs_repair run...
> 
> > It seems sufficient to me to test that needsrepair functionality works
> > as expected and that individual feature upgrade works as well.
> 
> ...so in other words, we need some point to inject an error to make sure
> that the upgrade interlock is correct.
> 
> > Given the discussion on patch 7, perhaps it makes more sense to at
> > least defer this sort of injection mechanism until we have a scheme
> > for generic needsrepair usage worked out for xfs_repair?
> 
> I'm in the midst of prototyping what I said in the last thread --
> hooking the buffer cache so that repair can catch the first time we
> actually write anything to the filesystem, and using that to set
> NEEDSREPAIR.  I've not run it through full fstests yet, but AFAICT I can
> keep using the same tests and the same injection knobs I already wrote.

Ok, so, the buffer cache hook works now.

I split the needsrepair functionality tests into two parts -- one
corrupts a filesystem and runs repair with the error injection armed so
that repair sets NEEDSREPAIR and exits when we try to fix the damaged
metadata.  The second test exercises xfs_admin -O inobtcount=1 in
various scenarios with external devices, and then arms the error
injector to make sure that a partially completed upgrade has to go back
through repair.

I removed -c needsrepair because it's no longer necessary.

I replaced -o debug_needsrepair_force-abort with a magic environment
variable XFS_REPAIR_DEBUG_NEEDSREPAIR so that we don't clutter up the
CLI.

So that at least gets rid of the complaints about CLI clutter.
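
For illustration, checking a debug knob through the environment instead of the CLI could look like the sketch below. Only the variable name XFS_REPAIR_DEBUG_NEEDSREPAIR comes from this thread; the helper names and the rule that unset, empty, or "0" means disarmed are assumptions of the sketch, not necessarily what the reworked patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Assumed semantics for this sketch: a knob is armed when its value is
 * present, non-empty, and not the literal string "0".
 */
static bool
flag_value_armed(const char *value)
{
	return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Look up a debug knob by name in the process environment. */
static bool
env_flag_armed(const char *name)
{
	assert(name != NULL);
	return flag_value_armed(getenv(name));
}
```

A caller would test env_flag_armed("XFS_REPAIR_DEBUG_NEEDSREPAIR") once at startup. Nothing shows up in usage text or in the option tables, so the knob stays out of the way of ordinary users.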

--D



Brian Foster Feb. 12, 2021, 1:35 p.m. UTC | #6

On Thu, Feb 11, 2021 at 04:17:31PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
> > On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> > > program start and (presumably) clear it by the end of the run.  This
> > > code isn't terribly useful to users; it's mainly here so that fstests
> > > can exercise the functionality.  We don't document this flag in the
> > > manual pages at all because repair clears needsrepair at exit, which
> > > means the knobs only exist for fstests to exercise the functionality.
> > > 
> > > Note that we can't do any of these upgrades until we've at least done a
> > > preliminary scan of the primary super and the log.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > 
> > 
> > I'm still a little on the fence about the cmdline option for crashing
> > repair at a certain point from the POV that Brian kind of pointed out
> > that this doesn't exactly scale as we need more hooks.
> 
> (That's in the next patch.)
> 
> > but
> > 
> > ehhhh it's a test-only undocumented option and I guess we could change
> > it later if desired
> > 
> > we do have other debug options on the commandline already as well....
> 
> I don't mind moving the debugging hooks to be seekrit environment
> variables or something, but I don't think I've quite addressed some of
> Brian's comments from last time:
> 
> [paste in stuff Brian said]
> 
> > But is it worth maintaining test specific debug logic in an
> > application just to confirm that particular feature bit upgrades
> > actually set the bit?
> 
> I argue that yes, this is important enough to burn a debugging knob.
> The sequence that I think we should prevent through testing is the one
> where we've set the new feature on the primary super but we haven't
> finished generating whatever new metadata is needed to complete the
> upgrade, the system crashes, and on remount the verifiers explode.
> 
> Chances are pretty good that we'll get an angry bug report on the
> mailing list: "I upgraded my fs, the power went down, and the kernel
> sprayed corruption everywhere!"  If we get a customer escalation like
> this, I'd /much/ rather it be about not being able to mount right after
> the reboot than a latent corruption that grows unseen until somebody's
> filesystem loses data.
> 
> If a future patch to repair accidentally breaks the behavior where we
> set NEEDSREPAIR at the same time as we set the new feature and flush the
> super to disk, we cannot tell that there's been a regression in this
> safety mechanism just by looking at the output of an otherwise
> successful xfs_repair run...
> 

So I think what irks me most about this is how specific it is to the
particular test. IMO, it would be _nice_ to be able to induce xfs_repair
aborts at random purely via external mechanism, but I don't view that as
a hard requirement and so don't necessarily oppose an injection
mechanism in general. I also don't think this particular mechanism is as
robust as suggested because it tests for one very particular failure
scenario (i.e. failure to set the bit) over and over. If somebody was so
misguided as to rewrite the superblock sometime later in repair without
the bit set (somehow and for who knows what reason), this test wouldn't
catch it.

Here are some handwavy random thoughts on approaches for inducing
failures that I think would be preferable, yet wouldn't preclude
the specific test this mechanism intends to support:

- Define a custom signal handler to trigger a do_abort() and invoke it
  randomly via test (or just kill -9 randomly). Con: this might require
  a non-trivial test fs and some looping to provide adequate coverage.
- Rework the current hook into somewhere more generic that allows either
  a random or generally more configurable trigger:
	- I.e., randomly abort in the buffer I/O completion path based
	  on a percentage passed by the user.
	- Refactor the per-phase timestamp() calls into a helper and
	  wire in a per-phase injection point, then let the test produce
	  explicit failures at the end of each phase, 1-7. This is not
	  quite as random, but certainly more thorough than a single
	  specific failure point.
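
As a rough illustration of the signal-handler idea in the first bullet, here is a minimal sketch. It assumes a test harness that sends SIGUSR1 to a running repair; the names install_abort_hook(), abort_pending() and maybe_abort() are made up for this example and do not exist in xfsprogs, and plain abort() stands in for do_abort().

```c
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag, set asynchronously when the test harness signals us. */
static volatile sig_atomic_t abort_requested;

/* Signal handler: only record the request, nothing else is safe here. */
void
request_abort(int signo)
{
	(void)signo;
	abort_requested = 1;
}

/* Install the handler for SIGUSR1; returns 0 on success, -1 on failure. */
int
install_abort_hook(void)
{
	return signal(SIGUSR1, request_abort) == SIG_ERR ? -1 : 0;
}

/* True once the harness has asked us to die. */
bool
abort_pending(void)
{
	return abort_requested != 0;
}

/*
 * Call this at points where dying is interesting, for example after a
 * buffer write completes.  abort() is a stand-in for repair's own
 * do_abort() in this sketch.
 */
void
maybe_abort(void)
{
	if (abort_pending()) {
		fprintf(stderr, "debug: aborting on SIGUSR1\n");
		abort();
	}
}
```

Deferring the actual abort to maybe_abort() keeps the handler async-signal-safe and lets the caller decide which points in the run are allowed to fail.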

These would probably still require some command line option to enable,
but it becomes less of a "test that nobody screws up these few lines of
code we just added" regression test. IMO, those tests tend to fail more
rarely than the randomized stress/failure tests that have at least some
capability to produce unforeseen failure scenarios.

> > It seems sufficient to me to test that needsrepair functionality works
> > as expected and that individual feature upgrade works as well.
> 
> ...so in other words, we need some point to inject an error to make sure
> that the upgrade interlock is correct.
> 
> > Given the discussion on patch 7, perhaps it makes more sense to at
> > least defer this sort of injection mechanism until we have a scheme
> > for generic needsrepair usage worked out for xfs_repair?
> 
> I'm in the midst of prototyping what I said in the last thread --
> > hooking the buffer cache so that repair can catch the first time we
> actually write anything to the filesystem, and using that to set
> NEEDSREPAIR.  I've not run it through full fstests yet, but AFAICT I can
> keep using the same tests and the same injection knobs I already wrote.
> 
> > I am wondering if there's a way to make repair fail without requiring
> > additional code, but if not and we do require some sort of injection
> > mode, I suspect we might end up better served by something more
> > generic (i.e. capable of failures at random points) rather than
> > defining a command line option specifically for a particular fstest..
> 
> Probably yes, but ... uh I don't want this to drag on into building a
> generic error injection framework for userspace.
> 

That's certainly fair. That's partly why I suggested kicking this can
down the road just a bit. At the same time, I don't see the suggestions
above as necessarily more complex or more involved than this patch. It
may require around the same amount of code either way, just with a
somewhat more generic implementation.

Brian

> I would /really/ like to get inobtcount/bigtime tests into the kernel
> without a giant detour; they have nearly zero test coverage from the
> wider community.
> 
> --D
> 
> > 
> > > ---
> > >  repair/globals.c    |    2 ++
> > >  repair/globals.h    |    2 ++
> > >  repair/phase2.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  repair/xfs_repair.c |    9 +++++++
> > >  4 files changed, 76 insertions(+)
> > > 
> > > 
> > > diff --git a/repair/globals.c b/repair/globals.c
> > > index 110d98b6..699a96ee 100644
> > > --- a/repair/globals.c
> > > +++ b/repair/globals.c
> > > @@ -49,6 +49,8 @@ int	rt_spec;		/* Realtime dev specified as option */
> > >  int	convert_lazy_count;	/* Convert lazy-count mode on/off */
> > >  int	lazy_count;		/* What to set if to if converting */
> > >  
> > > +bool	add_needsrepair;	/* forcibly set needsrepair while repairing */
> > > +
> > >  /* misc status variables */
> > >  
> > >  int	primary_sb_modified;
> > > diff --git a/repair/globals.h b/repair/globals.h
> > > index 1d397b35..043b3e8e 100644
> > > --- a/repair/globals.h
> > > +++ b/repair/globals.h
> > > @@ -90,6 +90,8 @@ extern int	rt_spec;		/* Realtime dev specified as option */
> > >  extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
> > >  extern int	lazy_count;		/* What to set if to if converting */
> > >  
> > > +extern bool	add_needsrepair;
> > > +
> > >  /* misc status variables */
> > >  
> > >  extern int		primary_sb_modified;
> > > diff --git a/repair/phase2.c b/repair/phase2.c
> > > index 952ac4a5..9a8d42e1 100644
> > > --- a/repair/phase2.c
> > > +++ b/repair/phase2.c
> > > @@ -131,6 +131,63 @@ zero_log(
> > >  		libxfs_max_lsn = log->l_last_sync_lsn;
> > >  }
> > >  
> > > +static bool
> > > +set_needsrepair(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	if (!xfs_sb_version_hascrc(&mp->m_sb)) {
> > > +		printf(
> > > +	_("needsrepair flag only supported on V5 filesystems.\n"));
> > > +		exit(0);
> > > +	}
> > > +
> > > +	if (xfs_sb_version_needsrepair(&mp->m_sb)) {
> > > +		printf(_("Filesystem already marked as needing repair.\n"));
> > > +		exit(0);
> > > +	}
> > > +
> > > +	printf(_("Marking filesystem in need of repair.\n"));
> > > +	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
> > > +	return true;
> > > +}
> > > +
> > > +/* Perform the user's requested upgrades on filesystem. */
> > > +static void
> > > +upgrade_filesystem(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct xfs_buf		*bp;
> > > +	bool			dirty = false;
> > > +	int			error;
> > > +
> > > +	if (add_needsrepair)
> > > +		dirty |= set_needsrepair(mp);
> > > +
> > > +        if (no_modify || !dirty)
> > > +                return;
> > > +
> > > +        bp = libxfs_getsb(mp);
> > > +        if (!bp || bp->b_error) {
> > > +                do_error(
> > > +	_("couldn't get superblock for feature upgrade, err=%d\n"),
> > > +                                bp ? bp->b_error : ENOMEM);
> > > +        } else {
> > > +                libxfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> > > +
> > > +                /*
> > > +		 * Write the primary super to disk immediately so that
> > > +		 * needsrepair will be set if repair doesn't complete.
> > > +		 */
> > > +                error = -libxfs_bwrite(bp);
> > > +                if (error)
> > > +                        do_error(
> > > +	_("filesystem feature upgrade failed, err=%d\n"),
> > > +                                        error);
> > > +        }
> > > +        if (bp)
> > > +                libxfs_buf_relse(bp);
> > > +}
> > > +
> > >  /*
> > >   * ok, at this point, the fs is mounted but the root inode may be
> > >   * trashed and the ag headers haven't been checked.  So we have
> > > @@ -235,4 +292,10 @@ phase2(
> > >  				do_warn(_("would correct\n"));
> > >  		}
> > >  	}
> > > +
> > > +	/*
> > > +	 * Upgrade the filesystem now that we've done a preliminary check of
> > > +	 * the superblocks, the AGs, the log, and the metadata inodes.
> > > +	 */
> > > +	upgrade_filesystem(mp);
> > >  }
> > > diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> > > index 90d1a95a..a613505f 100644
> > > --- a/repair/xfs_repair.c
> > > +++ b/repair/xfs_repair.c
> > > @@ -65,11 +65,13 @@ static char *o_opts[] = {
> > >   */
> > >  enum c_opt_nums {
> > >  	CONVERT_LAZY_COUNT = 0,
> > > +	CONVERT_NEEDSREPAIR,
> > >  	C_MAX_OPTS,
> > >  };
> > >  
> > >  static char *c_opts[] = {
> > >  	[CONVERT_LAZY_COUNT]	= "lazycount",
> > > +	[CONVERT_NEEDSREPAIR]	= "needsrepair",
> > >  	[C_MAX_OPTS]		= NULL,
> > >  };
> > >  
> > > @@ -302,6 +304,13 @@ process_args(int argc, char **argv)
> > >  					lazy_count = (int)strtol(val, NULL, 0);
> > >  					convert_lazy_count = 1;
> > >  					break;
> > > +				case CONVERT_NEEDSREPAIR:
> > > +					if (!val)
> > > +						do_abort(
> > > +		_("-c needsrepair requires a parameter\n"));
> > > +					if (strtol(val, NULL, 0) == 1)
> > > +						add_needsrepair = true;
> > > +					break;
> > >  				default:
> > >  					unknown('c', val);
> > >  					break;
> > > 
>

Darrick J. Wong Feb. 12, 2021, 6:54 p.m. UTC | #7

On Fri, Feb 12, 2021 at 08:35:03AM -0500, Brian Foster wrote:
> On Thu, Feb 11, 2021 at 04:17:31PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 11, 2021 at 05:29:05PM -0600, Eric Sandeen wrote:
> > > On 2/11/21 4:59 PM, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Quietly set up the ability to tell xfs_repair to set NEEDSREPAIR at
> > > > program start and (presumably) clear it by the end of the run.  This
> > > > code isn't terribly useful to users; it's mainly here so that fstests
> > > > can exercise the functionality.  We don't document this flag in the
> > > > manual pages at all because repair clears needsrepair at exit, which
> > > > means the knobs only exist for fstests to exercise the functionality.
> > > > 
> > > > Note that we can't do any of these upgrades until we've at least done a
> > > > preliminary scan of the primary super and the log.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > 
> > > 
> > > I'm still a little on the fence about the cmdline option for crashing
> > > repair at a certain point from the POV that Brian kind of pointed out
> > > that this doesn't exactly scale as we need more hooks.
> > 
> > (That's in the next patch.)
> > 
> > > but
> > > 
> > > ehhhh it's a test-only undocumented option and I guess we could change
> > > it later if desired
> > > 
> > > we do have other debug options on the commandline already as well....
> > 
> > I don't mind moving the debugging hooks to be seekrit environment
> > variables or something, but I don't think I've quite addressed some of
> > Brian's comments from last time:
> > 
> > [paste in stuff Brian said]
> > 
> > > But is it worth maintaining test specific debug logic in an
> > > application just to confirm that particular feature bit upgrades
> > > actually set the bit?
> > 
> > I argue that yes, this is important enough to burn a debugging knob.
> > The sequence that I think we should prevent through testing is the one
> > where we've set the new feature on the primary super but we haven't
> > finished generating whatever new metadata is needed to complete the
> > upgrade, the system crashes, and on remount the verifiers explode.
> > 
> > Chances are pretty good that we'll get an angry bug report on the
> > mailing list: "I upgraded my fs, the power went down, and the kernel
> > sprayed corruption everywhere!"  If we get a customer escalation like
> > this, I'd /much/ rather it be about not being able to mount right after
> > the reboot than a latent corruption that grows unseen until somebody's
> > filesystem loses data.
> > 
> > If a future patch to repair accidentally breaks the behavior where we
> > set NEEDSREPAIR at the same time as we set the new feature and flush the
> > super to disk, we cannot tell that there's been a regression in this
> > safety mechanism just by looking at the output of an otherwise
> > successful xfs_repair run...
> > 
> 
> So I think what irks me most about this is how specific it is to the
> particular test. IMO, it would be _nice_ to be able to induce xfs_repair
> aborts at random purely via external mechanism, but I don't view that as
> a hard requirement and so don't necessarily oppose an injection
> mechanism in general. I also don't think this particular mechanism is as
> robust as suggested because it tests for one very particular failure
> scenario (i.e. failure to set the bit) over and over. If somebody was so
> misguided as to rewrite the superblock sometime later in repair without
> the bit set (somehow and for who knows what reason), this test wouldn't
> catch it.
> 
> Here are some handwavy random thoughts on approaches for inducing
> failures that I think would be preferable, yet wouldn't preclude
> the specific test this mechanism intends to support:
> 
> - Define a custom signal handler to trigger a do_abort() and invoke it
>   randomly via test (or just kill -9 randomly). Con: this might require
>   a non-trivial test fs and some looping to provide adequate coverage.

I don't think a randomly triggered abort is better than a targeted trip
point.  However...

> - Rework the current hook into somewhere more generic that allows either
>   a random or generally more configurable trigger:
> 	- I.e., randomly abort in the buffer I/O completion path based
> 	  on a percentage passed by the user.

...since we know that a given xfs_repair run will trigger a bunch of
disk writes between phase 2 and phase 6, I think I could build a
trigger that would abort() after N writes to a device.  From there it
wouldn't be hard to add a test that does (more or less):

for i in {0..1000..10}; do
	xfs_mdrestore <dumpfile> /dev/sda
	XFS_REPAIR_DEBUG_FAIL_WRITE=$i xfs_repair /dev/sda
	xfs_db -c version /dev/sda | grep NEEDSREPAIR || _fail
	xfs_repair /dev/sda
	xfs_db -c version /dev/sda | grep NEEDSREPAIR && _fail
done
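
The repair side of that loop could look roughly like the sketch below. This is only an illustration under stated assumptions: the helper names are invented, nothing here is existing xfsprogs code, and it treats a count of 0 as "disabled", which is a guess at the semantics rather than something settled in this thread. The caller would pass in the value of the XFS_REPAIR_DEBUG_FAIL_WRITE variable used in the loop above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumed semantics: 0 means never fail, N > 0 means trip on the Nth write. */
static long fail_after_writes;
static long writes_seen;

/*
 * Parse the write count, e.g. the string obtained from
 * getenv("XFS_REPAIR_DEBUG_FAIL_WRITE").  Returns 0 on success or
 * -EINVAL if the value is missing, negative, or not a number.
 */
int
debug_set_fail_write(const char *val)
{
	char *end;
	long n;

	if (!val || *val == '\0')
		return -EINVAL;
	errno = 0;
	n = strtol(val, &end, 0);
	if (errno || *end != '\0' || n < 0)
		return -EINVAL;
	fail_after_writes = n;
	writes_seen = 0;
	return 0;
}

/*
 * Count one completed write.  Returns true once the configured number
 * of writes has been reached; the caller would then abort(), as in:
 *
 *	if (debug_write_should_fail())
 *		abort();
 */
bool
debug_write_should_fail(void)
{
	if (fail_after_writes == 0)
		return false;
	return ++writes_seen >= fail_after_writes;
}
```

Keeping the decision separate from the abort() call makes the counter easy to check on its own, without having to crash a real repair run.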

> 	- Refactor the per-phase timestamp() calls into a helper and
> 	  wire in a per-phase injection point, then let the test produce
> 	  explicit failures at the end of each phase, 1-7. This is not
> 	  quite as random, but certainly more thorough than a single
> 	  specific failure point.

This sounds like a reasonable second trip point for the directory repair
checker, since we know that the sketchy directory repair bits happen in
phase 3 and/or phase 6:

<fuzz dirent>
XFS_REPAIR_DEBUG_FAIL_PHASE=6 xfs_repair /dev/sda
xfs_db -c version /dev/sda | grep NEEDSREPAIR || _fail
xfs_repair /dev/sda
xfs_db -c version /dev/sda | grep NEEDSREPAIR && _fail

This is a good starting point, thanks. :)
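
For the per-phase trip point, a minimal sketch of what the helper might look like follows. The function names are hypothetical, the phase range 1-7 is taken from Brian's description, and the string would come from the XFS_REPAIR_DEBUG_FAIL_PHASE variable used in the test snippet above. None of this is existing xfsprogs code.

```c
#include <stdbool.h>
#include <stdlib.h>

#define REPAIR_PHASE_MIN	1
#define REPAIR_PHASE_MAX	7

/* Phase after which to fail; 0 means the trip point is disabled. */
static int fail_phase;

/*
 * Parse a phase number, e.g. the string obtained from
 * getenv("XFS_REPAIR_DEBUG_FAIL_PHASE").  Returns 0 on success, or -1
 * if the value is missing, malformed, or outside phases 1-7.
 */
int
debug_set_fail_phase(const char *val)
{
	char *end;
	long n;

	if (!val || *val == '\0')
		return -1;
	n = strtol(val, &end, 10);
	if (*end != '\0' || n < REPAIR_PHASE_MIN || n > REPAIR_PHASE_MAX)
		return -1;
	fail_phase = (int)n;
	return 0;
}

/*
 * Meant to be called from a shared end-of-phase helper, next to the
 * per-phase timestamp() call.  Returns true if repair should abort now.
 */
bool
debug_phase_should_fail(int phase)
{
	return fail_phase != 0 && phase == fail_phase;
}
```

A test would then run repair once per phase with the knob set, check that NEEDSREPAIR is still present, and run repair again without the knob to confirm the flag gets cleared.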

> These would probably still require some command line option to enable,
> but it becomes less of a "test that nobody screws up these few lines of
> code we just added" regression test. IMO, those tests tend to fail more
> rarely than the randomized stress/failure tests that have at least some
> capability to produce unforeseen failure scenarios.

Fair 'nuff.

> > > It seems sufficient to me to test that needsrepair functionality works
> > > as expected and that individual feature upgrade works as well.
> > 
> > ...so in other words, we need some point to inject an error to make sure
> > that the upgrade interlock is correct.
> > 
> > > Given the discussion on patch 7, perhaps it makes more sense to at
> > > least defer this sort of injection mechanism until we have a scheme
> > > for generic needsrepair usage worked out for xfs_repair?
> > 
> > I'm in the midst of prototyping what I said in the last thread --
> > > hooking the buffer cache so that repair can catch the first time we
> > actually write anything to the filesystem, and using that to set
> > NEEDSREPAIR.  I've not run it through full fstests yet, but AFAICT I can
> > keep using the same tests and the same injection knobs I already wrote.
> > 
> > > I am wondering if there's a way to make repair fail without requiring
> > > additional code, but if not and we do require some sort of injection
> > > mode, I suspect we might end up better served by something more
> > > generic (i.e. capable of failures at random points) rather than
> > > defining a command line option specifically for a particular fstest..
> > 
> > Probably yes, but ... uh I don't want this to drag on into building a
> > generic error injection framework for userspace.
> > 
> 
> That's certainly fair. That's partly why I suggested kicking this can
> down the road just a bit. At the same time, I don't see the suggestions
> above as necessarily more complex or more involved than this patch. It
> may require around the same amount of code either way, just with a
> somewhat more generic implementation.

In the meantime I guess Eric can take the other 2 fully reviewed series
as well as patches 1-7 and 10 from this series since (AFAICT) those
pieces are fully reviewed.

--D

> Brian
> 
> > I would /really/ like to get inobtcount/bigtime tests into the kernel
> > without a giant detour; they have nearly zero test coverage from the
> > wider community.
> > 
> > --D
> > 
> > > 
> > > > ---
> > > >  repair/globals.c    |    2 ++
> > > >  repair/globals.h    |    2 ++
> > > >  repair/phase2.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  repair/xfs_repair.c |    9 +++++++
> > > >  4 files changed, 76 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/repair/globals.c b/repair/globals.c
> > > > index 110d98b6..699a96ee 100644
> > > > --- a/repair/globals.c
> > > > +++ b/repair/globals.c
> > > > @@ -49,6 +49,8 @@ int	rt_spec;		/* Realtime dev specified as option */
> > > >  int	convert_lazy_count;	/* Convert lazy-count mode on/off */
> > > >  int	lazy_count;		/* What to set if to if converting */
> > > >  
> > > > +bool	add_needsrepair;	/* forcibly set needsrepair while repairing */
> > > > +
> > > >  /* misc status variables */
> > > >  
> > > >  int	primary_sb_modified;
> > > > diff --git a/repair/globals.h b/repair/globals.h
> > > > index 1d397b35..043b3e8e 100644
> > > > --- a/repair/globals.h
> > > > +++ b/repair/globals.h
> > > > @@ -90,6 +90,8 @@ extern int	rt_spec;		/* Realtime dev specified as option */
> > > >  extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
> > > >  extern int	lazy_count;		/* What to set if to if converting */
> > > >  
> > > > +extern bool	add_needsrepair;
> > > > +
> > > >  /* misc status variables */
> > > >  
> > > >  extern int		primary_sb_modified;
> > > > diff --git a/repair/phase2.c b/repair/phase2.c
> > > > index 952ac4a5..9a8d42e1 100644
> > > > --- a/repair/phase2.c
> > > > +++ b/repair/phase2.c
> > > > @@ -131,6 +131,63 @@ zero_log(
> > > >  		libxfs_max_lsn = log->l_last_sync_lsn;
> > > >  }
> > > >  
> > > > +static bool
> > > > +set_needsrepair(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	if (!xfs_sb_version_hascrc(&mp->m_sb)) {
> > > > +		printf(
> > > > +	_("needsrepair flag only supported on V5 filesystems.\n"));
> > > > +		exit(0);
> > > > +	}
> > > > +
> > > > +	if (xfs_sb_version_needsrepair(&mp->m_sb)) {
> > > > +		printf(_("Filesystem already marked as needing repair.\n"));
> > > > +		exit(0);
> > > > +	}
> > > > +
> > > > +	printf(_("Marking filesystem in need of repair.\n"));
> > > > +	mp->m_sb.sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
> > > > +	return true;
> > > > +}
> > > > +
> > > > +/* Perform the user's requested upgrades on filesystem. */
> > > > +static void
> > > > +upgrade_filesystem(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	struct xfs_buf		*bp;
> > > > +	bool			dirty = false;
> > > > +	int			error;
> > > > +
> > > > +	if (add_needsrepair)
> > > > +		dirty |= set_needsrepair(mp);
> > > > +
> > > > +        if (no_modify || !dirty)
> > > > +                return;
> > > > +
> > > > +        bp = libxfs_getsb(mp);
> > > > +        if (!bp || bp->b_error) {
> > > > +                do_error(
> > > > +	_("couldn't get superblock for feature upgrade, err=%d\n"),
> > > > +                                bp ? bp->b_error : ENOMEM);
> > > > +        } else {
> > > > +                libxfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> > > > +
> > > > +                /*
> > > > +		 * Write the primary super to disk immediately so that
> > > > +		 * needsrepair will be set if repair doesn't complete.
> > > > +		 */
> > > > +                error = -libxfs_bwrite(bp);
> > > > +                if (error)
> > > > +                        do_error(
> > > > +	_("filesystem feature upgrade failed, err=%d\n"),
> > > > +                                        error);
> > > > +        }
> > > > +        if (bp)
> > > > +                libxfs_buf_relse(bp);
> > > > +}
> > > > +
> > > >  /*
> > > >   * ok, at this point, the fs is mounted but the root inode may be
> > > >   * trashed and the ag headers haven't been checked.  So we have
> > > > @@ -235,4 +292,10 @@ phase2(
> > > >  				do_warn(_("would correct\n"));
> > > >  		}
> > > >  	}
> > > > +
> > > > +	/*
> > > > +	 * Upgrade the filesystem now that we've done a preliminary check of
> > > > +	 * the superblocks, the AGs, the log, and the metadata inodes.
> > > > +	 */
> > > > +	upgrade_filesystem(mp);
> > > >  }
> > > > diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> > > > index 90d1a95a..a613505f 100644
> > > > --- a/repair/xfs_repair.c
> > > > +++ b/repair/xfs_repair.c
> > > > @@ -65,11 +65,13 @@ static char *o_opts[] = {
> > > >   */
> > > >  enum c_opt_nums {
> > > >  	CONVERT_LAZY_COUNT = 0,
> > > > +	CONVERT_NEEDSREPAIR,
> > > >  	C_MAX_OPTS,
> > > >  };
> > > >  
> > > >  static char *c_opts[] = {
> > > >  	[CONVERT_LAZY_COUNT]	= "lazycount",
> > > > +	[CONVERT_NEEDSREPAIR]	= "needsrepair",
> > > >  	[C_MAX_OPTS]		= NULL,
> > > >  };
> > > >  
> > > > @@ -302,6 +304,13 @@ process_args(int argc, char **argv)
> > > >  					lazy_count = (int)strtol(val, NULL, 0);
> > > >  					convert_lazy_count = 1;
> > > >  					break;
> > > > +				case CONVERT_NEEDSREPAIR:
> > > > +					if (!val)
> > > > +						do_abort(
> > > > +		_("-c needsrepair requires a parameter\n"));
> > > > +					if (strtol(val, NULL, 0) == 1)
> > > > +						add_needsrepair = true;
> > > > +					break;
> > > >  				default:
> > > >  					unknown('c', val);
> > > >  					break;
> > > > 
> > 
>
