diff mbox series

[RFC[RAP] PATCH] xfs: allow setting and clearing of log incompat feature flags

Message ID 20201208004028.GU629293@magnolia (mailing list archive)
State Superseded, archived
Headers show
Series [RFC[RAP] PATCH] xfs: allow setting and clearing of log incompat feature flags | expand

Commit Message

Darrick J. Wong Dec. 8, 2020, 12:40 a.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Log incompat feature flags in the superblock exist for one purpose: to
protect the contents of a dirty log from replay on a kernel that isn't
prepared to handle those dirty contents.  This means that they can be
cleared if (a) we know the log is clean and (b) we know that there
aren't any other threads in the system that might be setting or relying
upon a log incompat flag.

Therefore, clear the log incompat flags when we've finished recovering
the log, when we're unmounting cleanly, remounting read-only, or
freezing; and provide a function so that subsequent patches can start
using this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
Note: I wrote this so that we could turn on log incompat flags for
atomic extent swapping and Allison could probably use it for the delayed
logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
---
 fs/xfs/libxfs/xfs_format.h |   15 ++++++
 fs/xfs/xfs_log.c           |    8 +++
 fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h         |    2 +
 fs/xfs/xfs_super.c         |    9 +++
 5 files changed, 153 insertions(+)

Comments

Brian Foster Dec. 8, 2020, 11:19 a.m. UTC | #1
On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Log incompat feature flags in the superblock exist for one purpose: to
> protect the contents of a dirty log from replay on a kernel that isn't
> prepared to handle those dirty contents.  This means that they can be
> cleared if (a) we know the log is clean and (b) we know that there
> aren't any other threads in the system that might be setting or relying
> upon a log incompat flag.
> 
> Therefore, clear the log incompat flags when we've finished recovering
> the log, when we're unmounting cleanly, remounting read-only, or
> freezing; and provide a function so that subsequent patches can start
> using this.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> Note: I wrote this so that we could turn on log incompat flags for
> atomic extent swapping and Allison could probably use it for the delayed
> logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> ---
>  fs/xfs/libxfs/xfs_format.h |   15 ++++++
>  fs/xfs/xfs_log.c           |    8 +++
>  fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h         |    2 +
>  fs/xfs/xfs_super.c         |    9 +++
>  5 files changed, 153 insertions(+)
> 
...
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 669a9cf43582..68a5c90ec65b 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
...
> @@ -1361,6 +1369,117 @@ xfs_force_summary_recalc(
>  	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
>  }
>  
> +/*
> + * Enable a log incompat feature flag in the primary superblock.  The caller
> + * cannot have any other transactions in progress.
> + */
> +int
> +xfs_add_incompat_log_feature(
> +	struct xfs_mount	*mp,
> +	uint32_t		feature)
> +{
> +	struct xfs_dsb		*dsb;
> +	int			error;
> +
> +	ASSERT(hweight32(feature) == 1);
> +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> +
> +	/*
> +	 * Force the log to disk and kick the background AIL thread to reduce
> +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> +	 * the primary superblock buffer.  This isn't a data integrity
> +	 * operation, so we don't need a synchronous push.
> +	 */
> +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	if (error)
> +		return error;
> +	xfs_ail_push_all(mp->m_ail);
> +
> +	/*
> +	 * Lock the primary superblock buffer to serialize all callers that
> +	 * are trying to set feature bits.
> +	 */
> +	xfs_buf_lock(mp->m_sb_bp);
> +	xfs_buf_hold(mp->m_sb_bp);
> +
> +	if (XFS_FORCED_SHUTDOWN(mp)) {
> +		error = -EIO;
> +		goto rele;
> +	}
> +
> +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> +		goto rele;
> +
> +	/*
> +	 * Write the primary superblock to disk immediately, because we need
> +	 * the log_incompat bit to be set in the primary super now to protect
> +	 * the log items that we're going to commit later.
> +	 */
> +	dsb = mp->m_sb_bp->b_addr;
> +	xfs_sb_to_disk(dsb, &mp->m_sb);
> +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> +	error = xfs_bwrite(mp->m_sb_bp);
> +	if (error)
> +		goto shutdown;
> +

Do we need to do all of this serialization and whatnot to set the
incompat bit? Can we just modify and log the in-core sb when or before
an incompatible item is logged, similar to how xfs_sbversion_add_attr2()
is implemented for example?

> +	/*
> +	 * Add the feature bits to the incore superblock before we unlock the
> +	 * buffer.
> +	 */
> +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> +	xfs_buf_relse(mp->m_sb_bp);
> +
> +	/* Log the superblock to disk. */
> +	return xfs_sync_sb(mp, false);
> +shutdown:
> +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> +rele:
> +	xfs_buf_relse(mp->m_sb_bp);
> +	return error;
> +}
> +
> +/*
> + * Clear all the log incompat flags from the superblock.
> + *
> + * The caller must have freeze protection, cannot have any other transactions
> + * in progress, must ensure that the log does not contain any log items
> + * protected by any log incompat bit, and must ensure that there are no other
> + * threads that could be reading or writing the log incompat feature flag in
> + * the primary super.  In other words, we can only turn features off as one of
> + * the final steps of quiescing or unmounting the log.
> + */
> +int
> +xfs_clear_incompat_log_features(
> +	struct xfs_mount	*mp)
> +{
> +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> +	    XFS_FORCED_SHUTDOWN(mp))
> +		return 0;
> +
> +	/*
> +	 * Update the incore superblock.  We synchronize on the primary super
> +	 * buffer lock to be consistent with the add function, though at least
> +	 * in theory this shouldn't be necessary.
> +	 */
> +	xfs_buf_lock(mp->m_sb_bp);
> +	xfs_buf_hold(mp->m_sb_bp);
> +
> +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> +		goto rele;
> +
> +	xfs_sb_remove_incompat_log_features(&mp->m_sb);
> +	xfs_buf_relse(mp->m_sb_bp);
> +
> +	/* Log the superblock to disk. */
> +	return xfs_sync_sb(mp, false);

Ok, so here it looks like we log the superblock change and presumably
rely on the unmount sequence to write it back...

> +rele:
> +	xfs_buf_relse(mp->m_sb_bp);
> +	return 0;
> +}
> +
>  /*
>   * Update the in-core delayed block counter.
>   *
...
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 343435a677f9..71e304c15e6b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -919,6 +919,15 @@ xfs_quiesce_attr(
>  	/* force the log to unpin objects from the now complete transactions */
>  	xfs_log_force(mp, XFS_LOG_SYNC);
>  
> +	/*
> +	 * Clear log incompat features since we're freezing or remounting ro.
> +	 * Report failures, though it's not fatal to have a higher log feature
> +	 * protection level than the log contents actually require.
> +	 */
> +	error = xfs_clear_incompat_log_features(mp);
> +	if (error)
> +		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
> +				"Frozen image may not be consistent.");
>  

What happens if the modified superblock is written back and then we
crash during a quiesce but before the log is fully clean? ISTM we could
end up in recovery of a log with potentially incompatible log items
where the feature bit has already been cleared. Not sure that's a
problem with intents, but seems risky in general.

Perhaps this needs to be pushed further down such that similar to
unmount, we ensure a full/sync AIL push completes (and moves the in-core
tail) before we log the feature bit change. I do wonder if it's worth
complicating the log quiesce path to clear feature bits at all, but I
suppose it could be a little inconsistent to clean the log on freeze yet
leave an incompat bit around. Perhaps we should push the clear bit
sequence down into the log quiesce path between completing the AIL push
and writing the unmount record. We may have to commit a sync transaction
and then push the AIL again, but that would cover the unmount and freeze
cases and I think we could probably do away with the post-recovery bit
clearing case entirely. A current/recovered mount should clear the
associated bits on the next log quiesce anyways. Hm?

Brian

>  	/* Push the superblock and write an unmount record */
>  	error = xfs_log_sbcount(mp);
>
Darrick J. Wong Dec. 8, 2020, 6:10 p.m. UTC | #2
On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Log incompat feature flags in the superblock exist for one purpose: to
> > protect the contents of a dirty log from replay on a kernel that isn't
> > prepared to handle those dirty contents.  This means that they can be
> > cleared if (a) we know the log is clean and (b) we know that there
> > aren't any other threads in the system that might be setting or relying
> > upon a log incompat flag.
> > 
> > Therefore, clear the log incompat flags when we've finished recovering
> > the log, when we're unmounting cleanly, remounting read-only, or
> > freezing; and provide a function so that subsequent patches can start
> > using this.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > Note: I wrote this so that we could turn on log incompat flags for
> > atomic extent swapping and Allison could probably use it for the delayed
> > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > ---
> >  fs/xfs/libxfs/xfs_format.h |   15 ++++++
> >  fs/xfs/xfs_log.c           |    8 +++
> >  fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_mount.h         |    2 +
> >  fs/xfs/xfs_super.c         |    9 +++
> >  5 files changed, 153 insertions(+)
> > 
> ...
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 669a9cf43582..68a5c90ec65b 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> ...
> > @@ -1361,6 +1369,117 @@ xfs_force_summary_recalc(
> >  	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
> >  }
> >  
> > +/*
> > + * Enable a log incompat feature flag in the primary superblock.  The caller
> > + * cannot have any other transactions in progress.
> > + */
> > +int
> > +xfs_add_incompat_log_feature(
> > +	struct xfs_mount	*mp,
> > +	uint32_t		feature)
> > +{
> > +	struct xfs_dsb		*dsb;
> > +	int			error;
> > +
> > +	ASSERT(hweight32(feature) == 1);
> > +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> > +
> > +	/*
> > +	 * Force the log to disk and kick the background AIL thread to reduce
> > +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> > +	 * the primary superblock buffer.  This isn't a data integrity
> > +	 * operation, so we don't need a synchronous push.
> > +	 */
> > +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > +	if (error)
> > +		return error;
> > +	xfs_ail_push_all(mp->m_ail);
> > +
> > +	/*
> > +	 * Lock the primary superblock buffer to serialize all callers that
> > +	 * are trying to set feature bits.
> > +	 */
> > +	xfs_buf_lock(mp->m_sb_bp);
> > +	xfs_buf_hold(mp->m_sb_bp);
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp)) {
> > +		error = -EIO;
> > +		goto rele;
> > +	}
> > +
> > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> > +		goto rele;
> > +
> > +	/*
> > +	 * Write the primary superblock to disk immediately, because we need
> > +	 * the log_incompat bit to be set in the primary super now to protect
> > +	 * the log items that we're going to commit later.
> > +	 */
> > +	dsb = mp->m_sb_bp->b_addr;
> > +	xfs_sb_to_disk(dsb, &mp->m_sb);
> > +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> > +	error = xfs_bwrite(mp->m_sb_bp);
> > +	if (error)
> > +		goto shutdown;
> > +
> 
> Do we need to do all of this serialization and whatnot to set the
> incompat bit? Can we just modify and log the in-core sb when or before
> an incompatible item is logged, similar to how xfs_sbversion_add_attr2()
> is implemented for example?

Hm.  The goal here is to ensure that the primary super gets updated
before anything starts logging the protected log items.

The first iteration of this patch simply updated the incore superblock
and called xfs_sync_sb(mp, true) to flush it out to the primary super.
This proved to be the source of log livelocks because that function
bholds the primary super and pushes the log and AIL, which sometimes
stalled out because the AIL can't get the lock on the primary super
buffer.

The second version did the bare bwrite, but suffered from the
xfs_buf_lock occasionally stalling for 30s while waiting for a log flush
or something.  This third version forces the log and pushes the AIL
prior to adding the feature bit to speed things up.  There's a small
window between the xfs_buf_relse and the end of xfs_sync_sb where
another thread could start logging items before we add the super update
to the log stream, but the bwrite is supposed to cover that case. :)

The difficulty here is that any other threads that think they might need
to set the log_incompat bit have to be held off until this thread
confirms that the primary super has been updated on disk, because the
three outcomes we want here are (a) feature already set, just go; (b)
feature not set, but by the time you get the locks it's already taken
care of; or (c) feature not set, and you get to do all the updates.

We can't do that with the primary super buffer lock alone (because we
can't bhold it and push the AIL at the same time) but I guess I could
just add a "feature update" mutex into the mix.

> > +	/*
> > +	 * Add the feature bits to the incore superblock before we unlock the
> > +	 * buffer.
> > +	 */
> > +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> > +	xfs_buf_relse(mp->m_sb_bp);
> > +
> > +	/* Log the superblock to disk. */
> > +	return xfs_sync_sb(mp, false);
> > +shutdown:
> > +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > +rele:
> > +	xfs_buf_relse(mp->m_sb_bp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Clear all the log incompat flags from the superblock.
> > + *
> > + * The caller must have freeze protection, cannot have any other transactions
> > + * in progress, must ensure that the log does not contain any log items
> > + * protected by any log incompat bit, and must ensure that there are no other
> > + * threads that could be reading or writing the log incompat feature flag in
> > + * the primary super.  In other words, we can only turn features off as one of
> > + * the final steps of quiescing or unmounting the log.
> > + */
> > +int
> > +xfs_clear_incompat_log_features(
> > +	struct xfs_mount	*mp)
> > +{
> > +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> > +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> > +	    XFS_FORCED_SHUTDOWN(mp))
> > +		return 0;
> > +
> > +	/*
> > +	 * Update the incore superblock.  We synchronize on the primary super
> > +	 * buffer lock to be consistent with the add function, though at least
> > +	 * in theory this shouldn't be necessary.
> > +	 */
> > +	xfs_buf_lock(mp->m_sb_bp);
> > +	xfs_buf_hold(mp->m_sb_bp);
> > +
> > +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> > +		goto rele;
> > +
> > +	xfs_sb_remove_incompat_log_features(&mp->m_sb);
> > +	xfs_buf_relse(mp->m_sb_bp);
> > +
> > +	/* Log the superblock to disk. */
> > +	return xfs_sync_sb(mp, false);
> 
> Ok, so here it looks like we log the superblock change and presumably
> rely on the unmount sequence to write it back...
> 
> > +rele:
> > +	xfs_buf_relse(mp->m_sb_bp);
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Update the in-core delayed block counter.
> >   *
> ...
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 343435a677f9..71e304c15e6b 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -919,6 +919,15 @@ xfs_quiesce_attr(
> >  	/* force the log to unpin objects from the now complete transactions */
> >  	xfs_log_force(mp, XFS_LOG_SYNC);
> >  
> > +	/*
> > +	 * Clear log incompat features since we're freezing or remounting ro.
> > +	 * Report failures, though it's not fatal to have a higher log feature
> > +	 * protection level than the log contents actually require.
> > +	 */
> > +	error = xfs_clear_incompat_log_features(mp);
> > +	if (error)
> > +		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
> > +				"Frozen image may not be consistent.");
> >  
> 
> What happens if the modified superblock is written back and then we
> crash during a quiesce but before the log is fully clean? ISTM we could
> end up in recovery of a log with potentially incompatible log items
> where the feature bit has already been cleared. Not sure that's a
> problem with intents, but seems risky in general.

I think you're right that this isn't generally correct.  It works for
intent items because those are all started by front end callers and
those are guaranteed not to be running transactions by the time we get
to xfs_quiesce_attr... but that wouldn't necessarily hold true for other
types of log items that we could create.

> Perhaps this needs to be pushed further down such that similar to
> unmount, we ensure a full/sync AIL push completes (and moves the in-core
> tail) before we log the feature bit change. I do wonder if it's worth
> complicating the log quiesce path to clear feature bits at all, but I
> suppose it could be a little inconsistent to clean the log on freeze yet
> leave an incompat bit around. Perhaps we should push the clear bit
> sequence down into the log quiesce path between completing the AIL push
> and writing the unmount record. We may have to commit a sync transaction
> and then push the AIL again, but that would cover the unmount and freeze
> cases and I think we could probably do away with the post-recovery bit
> clearing case entirely. A current/recovered mount should clear the
> associated bits on the next log quiesce anyways. Hm?

Hm.  You know how xfs_log_quiesce takes and releases the superblock
buffer lock after pushing the AIL but before writing the unmount record?
What if we did the log_incompat feature clearing + bwrite right after
that?

--D

> 
> Brian
> 
> >  	/* Push the superblock and write an unmount record */
> >  	error = xfs_log_sbcount(mp);
> > 
>
Brian Foster Dec. 8, 2020, 7:19 p.m. UTC | #3
On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Log incompat feature flags in the superblock exist for one purpose: to
> > > protect the contents of a dirty log from replay on a kernel that isn't
> > > prepared to handle those dirty contents.  This means that they can be
> > > cleared if (a) we know the log is clean and (b) we know that there
> > > aren't any other threads in the system that might be setting or relying
> > > upon a log incompat flag.
> > > 
> > > Therefore, clear the log incompat flags when we've finished recovering
> > > the log, when we're unmounting cleanly, remounting read-only, or
> > > freezing; and provide a function so that subsequent patches can start
> > > using this.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > Note: I wrote this so that we could turn on log incompat flags for
> > > atomic extent swapping and Allison could probably use it for the delayed
> > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h |   15 ++++++
> > >  fs/xfs/xfs_log.c           |    8 +++
> > >  fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_mount.h         |    2 +
> > >  fs/xfs/xfs_super.c         |    9 +++
> > >  5 files changed, 153 insertions(+)
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index 669a9cf43582..68a5c90ec65b 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > ...
> > > @@ -1361,6 +1369,117 @@ xfs_force_summary_recalc(
> > >  	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
> > >  }
> > >  
> > > +/*
> > > + * Enable a log incompat feature flag in the primary superblock.  The caller
> > > + * cannot have any other transactions in progress.
> > > + */
> > > +int
> > > +xfs_add_incompat_log_feature(
> > > +	struct xfs_mount	*mp,
> > > +	uint32_t		feature)
> > > +{
> > > +	struct xfs_dsb		*dsb;
> > > +	int			error;
> > > +
> > > +	ASSERT(hweight32(feature) == 1);
> > > +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> > > +
> > > +	/*
> > > +	 * Force the log to disk and kick the background AIL thread to reduce
> > > +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> > > +	 * the primary superblock buffer.  This isn't a data integrity
> > > +	 * operation, so we don't need a synchronous push.
> > > +	 */
> > > +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > > +	if (error)
> > > +		return error;
> > > +	xfs_ail_push_all(mp->m_ail);
> > > +
> > > +	/*
> > > +	 * Lock the primary superblock buffer to serialize all callers that
> > > +	 * are trying to set feature bits.
> > > +	 */
> > > +	xfs_buf_lock(mp->m_sb_bp);
> > > +	xfs_buf_hold(mp->m_sb_bp);
> > > +
> > > +	if (XFS_FORCED_SHUTDOWN(mp)) {
> > > +		error = -EIO;
> > > +		goto rele;
> > > +	}
> > > +
> > > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> > > +		goto rele;
> > > +
> > > +	/*
> > > +	 * Write the primary superblock to disk immediately, because we need
> > > +	 * the log_incompat bit to be set in the primary super now to protect
> > > +	 * the log items that we're going to commit later.
> > > +	 */
> > > +	dsb = mp->m_sb_bp->b_addr;
> > > +	xfs_sb_to_disk(dsb, &mp->m_sb);
> > > +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> > > +	error = xfs_bwrite(mp->m_sb_bp);
> > > +	if (error)
> > > +		goto shutdown;
> > > +
> > 
> > Do we need to do all of this serialization and whatnot to set the
> > incompat bit? Can we just modify and log the in-core sb when or before
> > an incompatible item is logged, similar to how xfs_sbversion_add_attr2()
> > is implemented for example?
> 
> Hm.  The goal here is to ensure that the primary super gets updated
> before anything starts logging the protected log items.
> 

Ah, right. We can't just dump the change in the log and go because if we
crash at that point, an older kernel might not see the superblock update
and try to recover subsequent incompat items. I'll have to take another
look at the code with your description below in mind to try and wrap my
head around that.

Quick thought in the meantime... have you given any thought toward
trying to simplify the serialization requirements by perhaps doing the
set/clear of the log incompat bits from already isolated contexts? For
example, if the filesystem supports some featureX, just set the
associated log incompat bit at mount time (before any user transactions
can run) and clear it as a final step on clean unmount.

> The first iteration of this patch simply updated the incore superblock
> and called xfs_sync_sb(mp, true) to flush it out to the primary super.
> This proved to be the source of log livelocks because that function
> bholds the primary super and pushes the log and AIL, which sometimes
> stalled out because the AIL can't get the lock on the primary super
> buffer.
> 
> The second version did the bare bwrite, but suffered from the
> xfs_buf_lock occasionally stalling for 30s while waiting for a log flush
> or something.  This third version forces the log and pushes the AIL
> prior to adding the feature bit to speed things up.  There's a small
> window between the xfs_buf_relse and the end of xfs_sync_sb where
> another thread could start logging items before we add the super update
> to the log stream, but the bwrite is supposed to cover that case. :)
> 
> The difficulty here is that any other threads that think they might need
> to set the log_incompat bit have to be held off until this thread
> confirms that the primary super has been updated on disk, because the
> three outcomes we want here are (a) feature already set, just go; (b)
> feature not set, but by the time you get the locks it's already taken
> care of; or (c) feature not set, and you get to do all the updates.
> 
> We can't do that with the primary super buffer lock alone (because we
> can't bhold it and push the AIL at the same time) but I guess I could
> just add a "feature update" mutex into the mix.
> 
> > > +	/*
> > > +	 * Add the feature bits to the incore superblock before we unlock the
> > > +	 * buffer.
> > > +	 */
> > > +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> > > +	xfs_buf_relse(mp->m_sb_bp);
> > > +
> > > +	/* Log the superblock to disk. */
> > > +	return xfs_sync_sb(mp, false);
> > > +shutdown:
> > > +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > > +rele:
> > > +	xfs_buf_relse(mp->m_sb_bp);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Clear all the log incompat flags from the superblock.
> > > + *
> > > + * The caller must have freeze protection, cannot have any other transactions
> > > + * in progress, must ensure that the log does not contain any log items
> > > + * protected by any log incompat bit, and must ensure that there are no other
> > > + * threads that could be reading or writing the log incompat feature flag in
> > > + * the primary super.  In other words, we can only turn features off as one of
> > > + * the final steps of quiescing or unmounting the log.
> > > + */
> > > +int
> > > +xfs_clear_incompat_log_features(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> > > +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> > > +	    XFS_FORCED_SHUTDOWN(mp))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Update the incore superblock.  We synchronize on the primary super
> > > +	 * buffer lock to be consistent with the add function, though at least
> > > +	 * in theory this shouldn't be necessary.
> > > +	 */
> > > +	xfs_buf_lock(mp->m_sb_bp);
> > > +	xfs_buf_hold(mp->m_sb_bp);
> > > +
> > > +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> > > +		goto rele;
> > > +
> > > +	xfs_sb_remove_incompat_log_features(&mp->m_sb);
> > > +	xfs_buf_relse(mp->m_sb_bp);
> > > +
> > > +	/* Log the superblock to disk. */
> > > +	return xfs_sync_sb(mp, false);
> > 
> > Ok, so here it looks like we log the superblock change and presumably
> > rely on the unmount sequence to write it back...
> > 
> > > +rele:
> > > +	xfs_buf_relse(mp->m_sb_bp);
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * Update the in-core delayed block counter.
> > >   *
> > ...
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 343435a677f9..71e304c15e6b 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -919,6 +919,15 @@ xfs_quiesce_attr(
> > >  	/* force the log to unpin objects from the now complete transactions */
> > >  	xfs_log_force(mp, XFS_LOG_SYNC);
> > >  
> > > +	/*
> > > +	 * Clear log incompat features since we're freezing or remounting ro.
> > > +	 * Report failures, though it's not fatal to have a higher log feature
> > > +	 * protection level than the log contents actually require.
> > > +	 */
> > > +	error = xfs_clear_incompat_log_features(mp);
> > > +	if (error)
> > > +		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
> > > +				"Frozen image may not be consistent.");
> > >  
> > 
> > What happens if the modified superblock is written back and then we
> > crash during a quiesce but before the log is fully clean? ISTM we could
> > end up in recovery of a log with potentially incompatible log items
> > where the feature bit has already been cleared. Not sure that's a
> > problem with intents, but seems risky in general.
> 
> I think you're right that this isn't generally correct.  It works for
> intent items because those are all started by front end callers and
> those are guaranteed not to be running transactions by the time we get
> to xfs_quiesce_attr... but that wouldn't necessarily hold true for other
> types of log items that we could create.
> 
> > Perhaps this needs to be pushed further down such that similar to
> > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > tail) before we log the feature bit change. I do wonder if it's worth
> > complicating the log quiesce path to clear feature bits at all, but I
> > suppose it could be a little inconsistent to clean the log on freeze yet
> > leave an incompat bit around. Perhaps we should push the clear bit
> > sequence down into the log quiesce path between completing the AIL push
> > and writing the unmount record. We may have to commit a sync transaction
> > and then push the AIL again, but that would cover the unmount and freeze
> > cases and I think we could probably do away with the post-recovery bit
> > clearing case entirely. A current/recovered mount should clear the
> > associated bits on the next log quiesce anyways. Hm?
> 
> Hm.  You know how xfs_log_quiesce takes and releases the superblock
> buffer lock after pushing the AIL but before writing the unmount record?
> What if we did the log_incompat feature clearing + bwrite right after
> that?
> 

Yeah, that's where I was thinking for the unmount side. I'm not sure we
want to just bwrite there vs. log+push though. Otherwise, I think
there's still a window of potential inconsistency between when the
superblock write completes and the unmount record lands on disk. I
_think_ the whole log force -> AIL push (i.e. move log tail) -> update
super -> log force -> AIL push sequence ensures that if an older kernel
saw the updated super, the log would technically be dirty but we'd at
least be sure that all incompat log items are behind the tail.
Alternatively, I guess we could do a sync sb write after the unmount
record and we'd just risk that a clean unmount wouldn't clear the
feature bit if the write failed, which might not be so bad. Hmm...

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  	/* Push the superblock and write an unmount record */
> > >  	error = xfs_log_sbcount(mp);
> > > 
> > 
>
Darrick J. Wong Dec. 9, 2020, 3:26 a.m. UTC | #4
On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > prepared to handle those dirty contents.  This means that they can be
> > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > aren't any other threads in the system that might be setting or relying
> > > > upon a log incompat flag.
> > > > 
> > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > freezing; and provide a function so that subsequent patches can start
> > > > using this.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h |   15 ++++++
> > > >  fs/xfs/xfs_log.c           |    8 +++
> > > >  fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_mount.h         |    2 +
> > > >  fs/xfs/xfs_super.c         |    9 +++
> > > >  5 files changed, 153 insertions(+)
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > index 669a9cf43582..68a5c90ec65b 100644
> > > > --- a/fs/xfs/xfs_mount.c
> > > > +++ b/fs/xfs/xfs_mount.c
> > > ...
> > > > @@ -1361,6 +1369,117 @@ xfs_force_summary_recalc(
> > > >  	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Enable a log incompat feature flag in the primary superblock.  The caller
> > > > + * cannot have any other transactions in progress.
> > > > + */
> > > > +int
> > > > +xfs_add_incompat_log_feature(
> > > > +	struct xfs_mount	*mp,
> > > > +	uint32_t		feature)
> > > > +{
> > > > +	struct xfs_dsb		*dsb;
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(hweight32(feature) == 1);
> > > > +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> > > > +
> > > > +	/*
> > > > +	 * Force the log to disk and kick the background AIL thread to reduce
> > > > +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> > > > +	 * the primary superblock buffer.  This isn't a data integrity
> > > > +	 * operation, so we don't need a synchronous push.
> > > > +	 */
> > > > +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > > > +	if (error)
> > > > +		return error;
> > > > +	xfs_ail_push_all(mp->m_ail);
> > > > +
> > > > +	/*
> > > > +	 * Lock the primary superblock buffer to serialize all callers that
> > > > +	 * are trying to set feature bits.
> > > > +	 */
> > > > +	xfs_buf_lock(mp->m_sb_bp);
> > > > +	xfs_buf_hold(mp->m_sb_bp);
> > > > +
> > > > +	if (XFS_FORCED_SHUTDOWN(mp)) {
> > > > +		error = -EIO;
> > > > +		goto rele;
> > > > +	}
> > > > +
> > > > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> > > > +		goto rele;
> > > > +
> > > > +	/*
> > > > +	 * Write the primary superblock to disk immediately, because we need
> > > > +	 * the log_incompat bit to be set in the primary super now to protect
> > > > +	 * the log items that we're going to commit later.
> > > > +	 */
> > > > +	dsb = mp->m_sb_bp->b_addr;
> > > > +	xfs_sb_to_disk(dsb, &mp->m_sb);
> > > > +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> > > > +	error = xfs_bwrite(mp->m_sb_bp);
> > > > +	if (error)
> > > > +		goto shutdown;
> > > > +
> > > 
> > > Do we need to do all of this serialization and whatnot to set the
> > > incompat bit? Can we just modify and log the in-core sb when or before
> > > an incompatible item is logged, similar to how xfs_sbversion_add_attr2()
> > > is implemented for example?
> > 
> > Hm.  The goal here is to ensure that the primary super gets updated
> > before anything starts logging the protected log items.
> > 
> 
> Ah, right. We can't just dump the change in the log and go because if we
> crash at that point, an older kernel might not see the superblock update
> and try to recover subsequent incompat items. I'll have to take another
> look at the code with your description below in mind to try and wrap my
> head around that.
> 
> Quick thought in the meantime... have you given any thought toward
> trying to simplify the serialization requirements by perhaps doing the
> set/clear of the log incompat bits from already isolated contexts? For
> example, if the filesystem supports some featureX, just set the
> associated log incompat bit at mount time (before any user transactions
> can run) and clear it as a final step on clean unmount.

I don't want to set the log incompat bit until we're actually going to
use the protected functionality, because setting it at mount time
unconditionally means that older kernels cannot recover the log.

> > The first iteration of this patch simply updated the incore superblock
> > and called xfs_sync_sb(mp, true) to flush it out to the primary super.
> > This proved to be the source of log livelocks because that function
> > bholds the primary super and pushes the log and AIL, which sometimes
> > stalled out because the AIL can't get the lock on the primary super
> > buffer.
> > 
> > The second version did the bare bwrite, but suffered from the
> > xfs_buf_lock occasionally stalling for 30s while waiting for a log flush
> > or something.  This third version forces the log and pushes the AIL
> > prior to adding the feature bit to speed things up.  There's a small
> > window between the xfs_buf_relse and the end of xfs_sync_sb where
> > another thread could start logging items before we add the super update
> > to the log stream, but the bwrite is supposed to cover that case. :)
> > 
> > The difficulty here is that any other threads that think they might need
> > to set the log_incompat bit have to be held off until this thread
> > confirms that the primary super has been updated on disk, because the
> > three outcomes we want here are (a) feature already set, just go; (b)
> > feature not set, but by the time you get the locks it's already taken
> > care of; or (c) feature not set, and you get to do all the updates.
> > 
> > We can't do that with the primary super buffer lock alone (because we
> > can't bhold it and push the AIL at the same time) but I guess I could
> > just add a "feature update" mutex into the mix.
> > 
> > > > +	/*
> > > > +	 * Add the feature bits to the incore superblock before we unlock the
> > > > +	 * buffer.
> > > > +	 */
> > > > +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > +
> > > > +	/* Log the superblock to disk. */
> > > > +	return xfs_sync_sb(mp, false);
> > > > +shutdown:
> > > > +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > > > +rele:
> > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Clear all the log incompat flags from the superblock.
> > > > + *
> > > > + * The caller must have freeze protection, cannot have any other transactions
> > > > + * in progress, must ensure that the log does not contain any log items
> > > > + * protected by any log incompat bit, and must ensure that there are no other
> > > > + * threads that could be reading or writing the log incompat feature flag in
> > > > + * the primary super.  In other words, we can only turn features off as one of
> > > > + * the final steps of quiescing or unmounting the log.
> > > > + */
> > > > +int
> > > > +xfs_clear_incompat_log_features(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> > > > +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> > > > +	    XFS_FORCED_SHUTDOWN(mp))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * Update the incore superblock.  We synchronize on the primary super
> > > > +	 * buffer lock to be consistent with the add function, though at least
> > > > +	 * in theory this shouldn't be necessary.
> > > > +	 */
> > > > +	xfs_buf_lock(mp->m_sb_bp);
> > > > +	xfs_buf_hold(mp->m_sb_bp);
> > > > +
> > > > +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> > > > +		goto rele;
> > > > +
> > > > +	xfs_sb_remove_incompat_log_features(&mp->m_sb);
> > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > +
> > > > +	/* Log the superblock to disk. */
> > > > +	return xfs_sync_sb(mp, false);
> > > 
> > > Ok, so here it looks like we log the superblock change and presumably
> > > rely on the unmount sequence to write it back...
> > > 
> > > > +rele:
> > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Update the in-core delayed block counter.
> > > >   *
> > > ...
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 343435a677f9..71e304c15e6b 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -919,6 +919,15 @@ xfs_quiesce_attr(
> > > >  	/* force the log to unpin objects from the now complete transactions */
> > > >  	xfs_log_force(mp, XFS_LOG_SYNC);
> > > >  
> > > > +	/*
> > > > +	 * Clear log incompat features since we're freezing or remounting ro.
> > > > +	 * Report failures, though it's not fatal to have a higher log feature
> > > > +	 * protection level than the log contents actually require.
> > > > +	 */
> > > > +	error = xfs_clear_incompat_log_features(mp);
> > > > +	if (error)
> > > > +		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
> > > > +				"Frozen image may not be consistent.");
> > > >  
> > > 
> > > What happens if the modified superblock is written back and then we
> > > crash during a quiesce but before the log is fully clean? ISTM we could
> > > end up in recovery of a log with potentially incompatible log items
> > > where the feature bit has already been cleared. Not sure that's a
> > > problem with intents, but seems risky in general.
> > 
> > I think you're right that this isn't generally correct.  It works for
> > intent items because those are all started by front end callers and
> > those are guaranteed not to be running transactions by the time we get
> > to xfs_quiesce_attr... but that wouldn't necessarily hold true for other
> > types of log items that we could create.
> > 
> > > Perhaps this needs to be pushed further down such that similar to
> > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > tail) before we log the feature bit change. I do wonder if it's worth
> > > complicating the log quiesce path to clear feature bits at all, but I
> > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > leave an incompat bit around. Perhaps we should push the clear bit
> > > sequence down into the log quiesce path between completing the AIL push
> > > and writing the unmount record. We may have to commit a sync transaction
> > > and then push the AIL again, but that would cover the unmount and freeze
> > > cases and I think we could probably do away with the post-recovery bit
> > > clearing case entirely. A current/recovered mount should clear the
> > > associated bits on the next log quiesce anyways. Hm?
> > 
> > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > buffer lock after pushing the AIL but before writing the unmount record?
> > What if we did the log_incompat feature clearing + bwrite right after
> > that?
> > 
> 
> Yeah, that's where I was thinking for the unmount side. I'm not sure we
> want to just bwrite there vs. log+push though. Otherwise, I think
> there's still a window of potential inconsistency between when the
> superblock write completes and the unmount record lands on disk. I
> _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> super -> log force -> AIL push sequence ensures that if an older kernel
> saw the updated super, the log would technically be dirty but we'd at
> least be sure that all incompat log items are behind the tail.
> Alternatively, I guess we could do a sync sb write after the unmount
> record and we'd just risk that a clean unmount wouldn't clear the
> feature bit if the write failed, which might not be so bad. Hmm...

Hm.  That would work too I think.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  	/* Push the superblock and write an unmount record */
> > > >  	error = xfs_log_sbcount(mp);
> > > > 
> > > 
> > 
>
Dave Chinner Dec. 9, 2020, 4:19 a.m. UTC | #5
On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > aren't any other threads in the system that might be setting or relying
> > > > > upon a log incompat flag.
> > > > > 
> > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > freezing; and provide a function so that subsequent patches can start
> > > > > using this.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > ---
....
> > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > sequence down into the log quiesce path between completing the AIL push
> > > > and writing the unmount record. We may have to commit a sync transaction
> > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > cases and I think we could probably do away with the post-recovery bit
> > > > clearing case entirely. A current/recovered mount should clear the
> > > > associated bits on the next log quiesce anyways. Hm?
> > > 
> > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > buffer lock after pushing the AIL but before writing the unmount record?
> > > What if we did the log_incompat feature clearing + bwrite right after
> > > that?
> > > 
> > 
> > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > want to just bwrite there vs. log+push though. Otherwise, I think
> > there's still a window of potential inconsistency between when the
> > superblock write completes and the unmount record lands on disk.

The only thing the unmount record really does is ensure that the
head and tail of the log point to the unmount record. that's what
marks the log clean, not the unmount record itself. We use a special
record here because we're not actually modifying anything - it's not
a transaction, just a marker to get the log head + tail to point to
the same LSN in the log.

Log covering does the same thing via logging and flushing the
superblock multiple times - it makes sure that the only item in the
commit at the head of the journal has a tail pointer that also
points to the same checkpoint. The fact that it's got a "dirty"
unmodified superblock in rather than being an unmount record is
largely irrelevant - the rest of the log has been marked as clean by
this checkpoint and the replay of the SB from this checkpoint is
effectively a no-op.

In fact, I think it's a "no-op" we can take advantage of.....

> > I
> > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > super -> log force -> AIL push sequence ensures that if an older kernel
> > saw the updated super, the log would technically be dirty but we'd at
> > least be sure that all incompat log items are behind the tail.

It's a bit more complex than that - if the log is not dirty, then
the second log force does nothing, and the tail of the log does not
get updated. Hence you have to log, commit and writeback the
superblock twice to ensure that the log tail has been updated in the
journal to after the point in time where the superblock was *first
logged*.

The log covering state machine handles this all correctly, and it
does it using the superblock as the mechanism that moves the tail of
the log forward. I suspect that we could use the second logging of
the superblock in that state machine to clear/set log incompat
flags, as that is the commit that marks the log "empty". If we crash
before the sb write at this point, log recovery will only see the
superblock to recover, so it doesn't matter that there's a change in
log incompat bits here because after recovery we're going to clear
them, right?

I'd like to avoid re-inventing the wheel here if we can :)

Cheers,

Dave.
Brian Foster Dec. 9, 2020, 3:50 p.m. UTC | #6
On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > aren't any other threads in the system that might be setting or relying
> > > > > upon a log incompat flag.
> > > > > 
> > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > freezing; and provide a function so that subsequent patches can start
> > > > > using this.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_format.h |   15 ++++++
> > > > >  fs/xfs/xfs_log.c           |    8 +++
> > > > >  fs/xfs/xfs_mount.c         |  119 ++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/xfs_mount.h         |    2 +
> > > > >  fs/xfs/xfs_super.c         |    9 +++
> > > > >  5 files changed, 153 insertions(+)
> > > > > 
> > > > ...
> > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > > index 669a9cf43582..68a5c90ec65b 100644
> > > > > --- a/fs/xfs/xfs_mount.c
> > > > > +++ b/fs/xfs/xfs_mount.c
> > > > ...
> > > > > @@ -1361,6 +1369,117 @@ xfs_force_summary_recalc(
> > > > >  	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * Enable a log incompat feature flag in the primary superblock.  The caller
> > > > > + * cannot have any other transactions in progress.
> > > > > + */
> > > > > +int
> > > > > +xfs_add_incompat_log_feature(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	uint32_t		feature)
> > > > > +{
> > > > > +	struct xfs_dsb		*dsb;
> > > > > +	int			error;
> > > > > +
> > > > > +	ASSERT(hweight32(feature) == 1);
> > > > > +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> > > > > +
> > > > > +	/*
> > > > > +	 * Force the log to disk and kick the background AIL thread to reduce
> > > > > +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> > > > > +	 * the primary superblock buffer.  This isn't a data integrity
> > > > > +	 * operation, so we don't need a synchronous push.
> > > > > +	 */
> > > > > +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +	xfs_ail_push_all(mp->m_ail);
> > > > > +
> > > > > +	/*
> > > > > +	 * Lock the primary superblock buffer to serialize all callers that
> > > > > +	 * are trying to set feature bits.
> > > > > +	 */
> > > > > +	xfs_buf_lock(mp->m_sb_bp);
> > > > > +	xfs_buf_hold(mp->m_sb_bp);
> > > > > +
> > > > > +	if (XFS_FORCED_SHUTDOWN(mp)) {
> > > > > +		error = -EIO;
> > > > > +		goto rele;
> > > > > +	}
> > > > > +
> > > > > +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> > > > > +		goto rele;
> > > > > +
> > > > > +	/*
> > > > > +	 * Write the primary superblock to disk immediately, because we need
> > > > > +	 * the log_incompat bit to be set in the primary super now to protect
> > > > > +	 * the log items that we're going to commit later.
> > > > > +	 */
> > > > > +	dsb = mp->m_sb_bp->b_addr;
> > > > > +	xfs_sb_to_disk(dsb, &mp->m_sb);
> > > > > +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> > > > > +	error = xfs_bwrite(mp->m_sb_bp);
> > > > > +	if (error)
> > > > > +		goto shutdown;
> > > > > +
> > > > 
> > > > Do we need to do all of this serialization and whatnot to set the
> > > > incompat bit? Can we just modify and log the in-core sb when or before
> > > > an incompatible item is logged, similar to how xfs_sbversion_add_attr2()
> > > > is implemented for example?
> > > 
> > > Hm.  The goal here is to ensure that the primary super gets updated
> > > before anything starts logging the protected log items.
> > > 
> > 
> > Ah, right. We can't just dump the change in the log and go because if we
> > crash at that point, an older kernel might not see the superblock update
> > and try to recover subsequent incompat items. I'll have to take another
> > look at the code with your description below in mind to try and wrap my
> > head around that.
> > 
> > Quick thought in the meantime... have you given any thought toward
> > trying to simplify the serialization requirements by perhaps doing the
> > set/clear of the log incompat bits from already isolated contexts? For
> > example, if the filesystem supports some featureX, just set the
> > associated log incompat bit at mount time (before any user transactions
> > can run) and clear it as a final step on clean unmount.
> 
> I don't want to set the log incompat bit until we're actually going to
> use the protected functionality, because setting it at mount time
> unconditionally means that older kernels cannot recover the log.
> 

True, but we're susceptible to that anyways if we don't ever clear the
bit until explicit quiesce (i.e. unmount/freeze). Even if we adopted a
log idle feature bit clearing hook, there's no guarantee the filesystem
will idle between the time incompat items are out of the log and a crash
occurs. ISTM that the only way to truly meet that requirement is to also
clear the feature bit when the last incompat log item falls off the tail
of the log, which really doesn't seem worth the complexity.

I guess this is more of a question of requirements. Is it necessary to
define the requirement as "dirty logs generated by this kernel may be
unrecoverable on older kernels if incompat log items are present in the
active range?" Is it acceptable to alternatively define it as "dirty
logs generated by this kernel must be recovered by a kernel with
equivalent feature support?"

The former is certainly possible, but as noted above I'm skeptical we'd
actually achieve that level of flexibility. There may also be log write
error -> shutdown (or log recovery error) scenarios where things might
be inconsistent. It might be fine from an implementation/correctness
standpoint to have cases where the bit is set but said items hadn't made
it to the log, but if we have to accept that situation anyways, I do
wonder whether the requirements are unnecessarily strict.

The latter option seems quite reasonable to me. I'm not sure a user is
going to know or care enough about whether a dirty log actually has
resident incompat log items unless there is some extremely clear high
level feature association (where I suspect we'd more likely have a
regular incompat feature bit anyways). I suppose it might also depend on
how the bits are used. Do we have use cases where the difference between
setting the incompat state conditionally or unconditionally is
significant?

Brian

> > > The first iteration of this patch simply updated the incore superblock
> > > and called xfs_sync_sb(mp, true) to flush it out to the primary super.
> > > This proved to be the source of log livelocks because that function
> > > bholds the primary super and pushes the log and AIL, which sometimes
> > > stalled out because the AIL can't get the lock on the primary super
> > > buffer.
> > > 
> > > The second version did the bare bwrite, but suffered from the
> > > xfs_buf_lock occasionally stalling for 30s while waiting for a log flush
> > > or something.  This third version forces the log and pushes the AIL
> > > prior to adding the feature bit to speed things up.  There's a small
> > > window between the xfs_buf_relse and the end of xfs_sync_sb where
> > > another thread could start logging items before we add the super update
> > > to the log stream, but the bwrite is supposed to cover that case. :)
> > > 
> > > The difficulty here is that any other threads that think they might need
> > > to set the log_incompat bit have to be held off until this thread
> > > confirms that the primary super has been updated on disk, because the
> > > three outcomes we want here are (a) feature already set, just go; (b)
> > > feature not set, but by the time you get the locks it's already taken
> > > care of; or (c) feature not set, and you get to do all the updates.
> > > 
> > > We can't do that with the primary super buffer lock alone (because we
> > > can't bhold it and push the AIL at the same time) but I guess I could
> > > just add a "feature update" mutex into the mix.
> > > 
> > > > > +	/*
> > > > > +	 * Add the feature bits to the incore superblock before we unlock the
> > > > > +	 * buffer.
> > > > > +	 */
> > > > > +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> > > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > > +
> > > > > +	/* Log the superblock to disk. */
> > > > > +	return xfs_sync_sb(mp, false);
> > > > > +shutdown:
> > > > > +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > > > > +rele:
> > > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Clear all the log incompat flags from the superblock.
> > > > > + *
> > > > > + * The caller must have freeze protection, cannot have any other transactions
> > > > > + * in progress, must ensure that the log does not contain any log items
> > > > > + * protected by any log incompat bit, and must ensure that there are no other
> > > > > + * threads that could be reading or writing the log incompat feature flag in
> > > > > + * the primary super.  In other words, we can only turn features off as one of
> > > > > + * the final steps of quiescing or unmounting the log.
> > > > > + */
> > > > > +int
> > > > > +xfs_clear_incompat_log_features(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> > > > > +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> > > > > +	    XFS_FORCED_SHUTDOWN(mp))
> > > > > +		return 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Update the incore superblock.  We synchronize on the primary super
> > > > > +	 * buffer lock to be consistent with the add function, though at least
> > > > > +	 * in theory this shouldn't be necessary.
> > > > > +	 */
> > > > > +	xfs_buf_lock(mp->m_sb_bp);
> > > > > +	xfs_buf_hold(mp->m_sb_bp);
> > > > > +
> > > > > +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> > > > > +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> > > > > +		goto rele;
> > > > > +
> > > > > +	xfs_sb_remove_incompat_log_features(&mp->m_sb);
> > > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > > +
> > > > > +	/* Log the superblock to disk. */
> > > > > +	return xfs_sync_sb(mp, false);
> > > > 
> > > > Ok, so here it looks like we log the superblock change and presumably
> > > > rely on the unmount sequence to write it back...
> > > > 
> > > > > +rele:
> > > > > +	xfs_buf_relse(mp->m_sb_bp);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Update the in-core delayed block counter.
> > > > >   *
> > > > ...
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index 343435a677f9..71e304c15e6b 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -919,6 +919,15 @@ xfs_quiesce_attr(
> > > > >  	/* force the log to unpin objects from the now complete transactions */
> > > > >  	xfs_log_force(mp, XFS_LOG_SYNC);
> > > > >  
> > > > > +	/*
> > > > > +	 * Clear log incompat features since we're freezing or remounting ro.
> > > > > +	 * Report failures, though it's not fatal to have a higher log feature
> > > > > +	 * protection level than the log contents actually require.
> > > > > +	 */
> > > > > +	error = xfs_clear_incompat_log_features(mp);
> > > > > +	if (error)
> > > > > +		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
> > > > > +				"Frozen image may not be consistent.");
> > > > >  
> > > > 
> > > > What happens if the modified superblock is written back and then we
> > > > crash during a quiesce but before the log is fully clean? ISTM we could
> > > > end up in recovery of a log with potentially incompatible log items
> > > > where the feature bit has already been cleared. Not sure that's a
> > > > problem with intents, but seems risky in general.
> > > 
> > > I think you're right that this isn't generally correct.  It works for
> > > intent items because those are all started by front end callers and
> > > those are guaranteed not to be running transactions by the time we get
> > > to xfs_quiesce_attr... but that wouldn't necessarily hold true for other
> > > types of log items that we could create.
> > > 
> > > > Perhaps this needs to be pushed further down such that similar to
> > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > sequence down into the log quiesce path between completing the AIL push
> > > > and writing the unmount record. We may have to commit a sync transaction
> > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > cases and I think we could probably do away with the post-recovery bit
> > > > clearing case entirely. A current/recovered mount should clear the
> > > > associated bits on the next log quiesce anyways. Hm?
> > > 
> > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > buffer lock after pushing the AIL but before writing the unmount record?
> > > What if we did the log_incompat feature clearing + bwrite right after
> > > that?
> > > 
> > 
> > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > want to just bwrite there vs. log+push though. Otherwise, I think
> > there's still a window of potential inconsistency between when the
> > superblock write completes and the unmount record lands on disk. I
> > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > super -> log force -> AIL push sequence ensures that if an older kernel
> > saw the updated super, the log would technically be dirty but we'd at
> > least be sure that all incompat log items are behind the tail.
> > Alternatively, I guess we could do a sync sb write after the unmount
> > record and we'd just risk that a clean unmount wouldn't clear the
> > feature bit if the write failed, which might not be so bad. Hmm...
> 
> Hm.  That would work too I think.
> 
> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > >  	/* Push the superblock and write an unmount record */
> > > > >  	error = xfs_log_sbcount(mp);
> > > > > 
> > > > 
> > > 
> > 
>
Brian Foster Dec. 9, 2020, 3:52 p.m. UTC | #7
On Wed, Dec 09, 2020 at 03:19:50PM +1100, Dave Chinner wrote:
> On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> > On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > > aren't any other threads in the system that might be setting or relying
> > > > > > upon a log incompat flag.
> > > > > > 
> > > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > > freezing; and provide a function so that subsequent patches can start
> > > > > > using this.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > > ---
> ....
> > > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > > sequence down into the log quiesce path between completing the AIL push
> > > > > and writing the unmount record. We may have to commit a sync transaction
> > > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > > cases and I think we could probably do away with the post-recovery bit
> > > > > clearing case entirely. A current/recovered mount should clear the
> > > > > associated bits on the next log quiesce anyways. Hm?
> > > > 
> > > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > > buffer lock after pushing the AIL but before writing the unmount record?
> > > > What if we did the log_incompat feature clearing + bwrite right after
> > > > that?
> > > > 
> > > 
> > > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > > want to just bwrite there vs. log+push though. Otherwise, I think
> > > there's still a window of potential inconsistency between when the
> > > superblock write completes and the unmount record lands on disk.
> 
> The only thing the unmount record really does is ensure that the
> head and tail of the log point to the unmount record. that's what
> marks the log clean, not the unmount record itself. We use a special
> record here because we're not actually modifying anything - it's not
> a transaction, just a marker to get the log head + tail to point to
> the same LSN in the log.
> 

Right..

> Log covering does the same thing via logging and flushing the
> superblock multiple times - it makes sure that the only item in the
> commit at the head of the journal has a tail pointer that also
> points to the same checkpoint. The fact that it's got a "dirty"
> unmodified superblock in rather than being an unmount record is
> largely irrelevant - the rest of the log has been marked as clean by
> this checkpoint and the replay of the SB from this checkpoint is
> effectively a no-op.
> 

That's close to what I was trying to accomplish above (independent of
whether the approach was actually correct ;). The difference is that I
wasn't trying to mark the log clean generically, but guarantee it's
"clean of incompat items" by the time the superblock buffer that clears
the incompat feature bit hits the log. That way a recovery would update
the superblock on current kernels and old kernels would either see a
dirty log or the a clean log with the updated super.

> In fact, I think it's a "no-op" we can take advantage of.....
> 
> > > I
> > > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > > super -> log force -> AIL push sequence ensures that if an older kernel
> > > saw the updated super, the log would technically be dirty but we'd at
> > > least be sure that all incompat log items are behind the tail.
> 
> It's a bit more complex than that - if the log is not dirty, then
> the second log force does nothing, and the tail of the log does not
> get updated. Hence you have to log, commit and writeback the
> superblock twice to ensure that the log tail has been updated in the
> journal to after the point in time where the superblock was *first
> logged*.
> 

I'm a little confused by your statement. The second log force would come
after we've explicitly logged the superblock (having cleared the feature
bit), so the log would have to be dirty at that point. Do you mean if
the log is not dirty from the start, something else, or just misread?

> The log covering state machine handles this all correctly, and it
> does it using the superblock as the mechanism that moves the tail of
> the log forward. I suspect that we could use the second logging of
> the superblock in that state machine to clear/set log incompat
> flags, as that is the commit that marks the log "empty". If we crash
> before the sb write at this point, log recovery will only see the
> superblock to recover, so it doesn't matter that there's a change in
> log incompat bits here because after recovery we're going to clear
> them, right?
> 

Hmm.. but if we update the superblock in the same checkpoint that marks
the log empty and then fail before the superblock write, the superblock
update is lost since recovery would never happen (because the log is
empty). I think the more significant problem with that is we can now
present a filesystem with a clean log to an older kernel with the bit
set.

This is why I was trying to basically log the superblock update in a
checkpoint that also ensured the tail moved forward past incompat items
but didn't necessarily clean the log. I wonder if we can fit that state
into the existing log covering mechanism by logging the update earlier
(or maybe via an intermediate state..)? I'll need to go stare at that
code a bit..

> I'd like to avoid re-inventing the wheel here if we can :)
> 

Agreed. The idle log covering mechanism crossed my mind but I didn't
look further into it because I thought it was an isolated mechanism (and
hadn't grok'd the details of it). I.e., we'd still have to handle freeze
and unmount separately so it made more sense to grok those cases first.
If you think there's a possibility to rework it so the same covering
mechanism can be reused across idle, quiesce, unmount, and that we can
conditionally fit a superblock log incompat bit clear into that
sequence, that sounds like a nice overall approach to me.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Brian Foster Dec. 9, 2020, 5:04 p.m. UTC | #8
On Wed, Dec 09, 2020 at 10:52:11AM -0500, Brian Foster wrote:
> On Wed, Dec 09, 2020 at 03:19:50PM +1100, Dave Chinner wrote:
> > On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> > > On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > > > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > 
> > > > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > > > aren't any other threads in the system that might be setting or relying
> > > > > > > upon a log incompat flag.
> > > > > > > 
> > > > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > > > freezing; and provide a function so that subsequent patches can start
> > > > > > > using this.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > > > ---
> > ....
> > > > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > > > sequence down into the log quiesce path between completing the AIL push
> > > > > > and writing the unmount record. We may have to commit a sync transaction
> > > > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > > > cases and I think we could probably do away with the post-recovery bit
> > > > > > clearing case entirely. A current/recovered mount should clear the
> > > > > > associated bits on the next log quiesce anyways. Hm?
> > > > > 
> > > > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > > > buffer lock after pushing the AIL but before writing the unmount record?
> > > > > What if we did the log_incompat feature clearing + bwrite right after
> > > > > that?
> > > > > 
> > > > 
> > > > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > > > want to just bwrite there vs. log+push though. Otherwise, I think
> > > > there's still a window of potential inconsistency between when the
> > > > superblock write completes and the unmount record lands on disk.
> > 
> > The only thing the unmount record really does is ensure that the
> > head and tail of the log point to the unmount record. that's what
> > marks the log clean, not the unmount record itself. We use a special
> > record here because we're not actually modifying anything - it's not
> > a transaction, just a marker to get the log head + tail to point to
> > the same LSN in the log.
> > 
> 
> Right..
> 
> > Log covering does the same thing via logging and flushing the
> > superblock multiple times - it makes sure that the only item in the
> > commit at the head of the journal has a tail pointer that also
> > points to the same checkpoint. The fact that it's got a "dirty"
> > unmodified superblock in rather than being an unmount record is
> > largely irrelevant - the rest of the log has been marked as clean by
> > this checkpoint and the replay of the SB from this checkpoint is
> > effectively a no-op.
> > 
> 
> That's close to what I was trying to accomplish above (independent of
> whether the approach was actually correct ;). The difference is that I
> wasn't trying to mark the log clean generically, but guarantee it's
> "clean of incompat items" by the time the superblock buffer that clears
> the incompat feature bit hits the log. That way a recovery would update
> the superblock on current kernels and old kernels would either see a
> dirty log or the a clean log with the updated super.
> 
> > In fact, I think it's a "no-op" we can take advantage of.....
> > 
> > > > I
> > > > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > > > super -> log force -> AIL push sequence ensures that if an older kernel
> > > > saw the updated super, the log would technically be dirty but we'd at
> > > > least be sure that all incompat log items are behind the tail.
> > 
> > It's a bit more complex than that - if the log is not dirty, then
> > the second log force does nothing, and the tail of the log does not
> > get updated. Hence you have to log, commit and writeback the
> > superblock twice to ensure that the log tail has been updated in the
> > journal to after the point in time where the superblock was *first
> > logged*.
> > 
> 
> I'm a little confused by your statement. The second log force would come
> after we've explicitly logged the superblock (having cleared the feature
> bit), so the log would have to be dirty at that point. Do you mean if
> the log is not dirty from the start, something else, or just misread?
> 
> > The log covering state machine handles this all correctly, and it
> > does it using the superblock as the mechanism that moves the tail of
> > the log forward. I suspect that we could use the second logging of
> > the superblock in that state machine to clear/set log incompat
> > flags, as that is the commit that marks the log "empty". If we crash
> > before the sb write at this point, log recovery will only see the
> > superblock to recover, so it doesn't matter that there's a change in
> > log incompat bits here because after recovery we're going to clear
> > them, right?
> > 
> 
> Hmm.. but if we update the superblock in the same checkpoint that marks
> the log empty and then fail before the superblock write, the superblock
> update is lost since recovery would never happen (because the log is
> empty). I think the more significant problem with that is we can now
> present a filesystem with a clean log to an older kernel with the bit
> set.
> 

Hmmmm.. but does log covering actually mark the log clean? I was
thinking it did from your description but from a quick test it appears
it does not. Perhaps that's what you mean by "empty" instead of clean,
since the log is not technically empty but it contains a no-op change.

If that's the case, perhaps we can do an explicit log cover in any of
the quiesce paths, as you describe, before logging the unmount record.
The first superblock update is essentially no-op to stamp a new tail in
the log (as is done today), the next logs the actual superblock change
and updates the on-disk tail to point to the just written checkpoint,
then we writeback the superblock and mark the log clean with an unmount
record. On recovery, we either see a dirty log with the superblock bit
set, a dirty log with the superblock bit cleared but that has already
been covered (so is safely recoverable on an old kernel), or a clean
log. Am I following that about right..?

Brian

> This is why I was trying to basically log the superblock update in a
> checkpoint that also ensured the tail moved forward past incompat items
> but didn't necessarily clean the log. I wonder if we can fit that state
> into the existing log covering mechanism by logging the update earlier
> (or maybe via an intermediate state..)? I'll need to go stare at that
> code a bit..
> 
> > I'd like to avoid re-inventing the wheel here if we can :)
> > 
> 
> Agreed. The idle log covering mechanism crossed my mind but I didn't
> look further into it because I thought it was an isolated mechanism (and
> hadn't grok'd the details of it). I.e., we'd still have to handle freeze
> and unmount separately so it made more sense to grok those cases first.
> If you think there's a possibility to rework it so the same covering
> mechanism can be reused across idle, quiesce, unmount, and that we can
> conditionally fit a superblock log incompat bit clear into that
> sequence, that sounds like a nice overall approach to me.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Dave Chinner Dec. 9, 2020, 8:51 p.m. UTC | #9
On Wed, Dec 09, 2020 at 12:04:28PM -0500, Brian Foster wrote:
> On Wed, Dec 09, 2020 at 10:52:11AM -0500, Brian Foster wrote:
> > On Wed, Dec 09, 2020 at 03:19:50PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > > > > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > > > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > 
> > > > > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > > > > aren't any other threads in the system that might be setting or relying
> > > > > > > > upon a log incompat flag.
> > > > > > > > 
> > > > > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > > > > freezing; and provide a function so that subsequent patches can start
> > > > > > > > using this.
> > > > > > > > 
> > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > ---
> > > > > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > > > > ---
> > > ....
> > > > > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > > > > sequence down into the log quiesce path between completing the AIL push
> > > > > > > and writing the unmount record. We may have to commit a sync transaction
> > > > > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > > > > cases and I think we could probably do away with the post-recovery bit
> > > > > > > clearing case entirely. A current/recovered mount should clear the
> > > > > > > associated bits on the next log quiesce anyways. Hm?
> > > > > > 
> > > > > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > > > > buffer lock after pushing the AIL but before writing the unmount record?
> > > > > > What if we did the log_incompat feature clearing + bwrite right after
> > > > > > that?
> > > > > > 
> > > > > 
> > > > > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > > > > want to just bwrite there vs. log+push though. Otherwise, I think
> > > > > there's still a window of potential inconsistency between when the
> > > > > superblock write completes and the unmount record lands on disk.
> > > 
> > > The only thing the unmount record really does is ensure that the
> > > head and tail of the log point to the unmount record. that's what
> > > marks the log clean, not the unmount record itself. We use a special
> > > record here because we're not actually modifying anything - it's not
> > > a transaction, just a marker to get the log head + tail to point to
> > > the same LSN in the log.
> > > 
> > 
> > Right..
> > 
> > > Log covering does the same thing via logging and flushing the
> > > superblock multiple times - it makes sure that the only item in the
> > > commit at the head of the journal has a tail pointer that also
> > > points to the same checkpoint. The fact that it's got a "dirty"
> > > unmodified superblock in rather than being an unmount record is
> > > largely irrelevant - the rest of the log has been marked as clean by
> > > this checkpoint and the replay of the SB from this checkpoint is
> > > effectively a no-op.
> > > 
> > 
> > That's close to what I was trying to accomplish above (independent of
> > whether the approach was actually correct ;). The difference is that I
> > wasn't trying to mark the log clean generically, but guarantee it's
> > "clean of incompat items" by the time the superblock buffer that clears
> > the incompat feature bit hits the log. That way a recovery would update
> > the superblock on current kernels and old kernels would either see a
> > dirty log or the a clean log with the updated super.
> > 
> > > In fact, I think it's a "no-op" we can take advantage of.....
> > > 
> > > > > I
> > > > > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > > > > super -> log force -> AIL push sequence ensures that if an older kernel
> > > > > saw the updated super, the log would technically be dirty but we'd at
> > > > > least be sure that all incompat log items are behind the tail.
> > > 
> > > It's a bit more complex than that - if the log is not dirty, then
> > > the second log force does nothing, and the tail of the log does not
> > > get updated. Hence you have to log, commit and writeback the
> > > superblock twice to ensure that the log tail has been updated in the
> > > journal to after the point in time where the superblock was *first
> > > logged*.
> > > 
> > 
> > I'm a little confused by your statement. The second log force would come
> > after we've explicitly logged the superblock (having cleared the feature
> > bit), so the log would have to be dirty at that point. Do you mean if
> > the log is not dirty from the start, something else, or just misread?

There can be things logged while the AIL is pushing, hence by the
time the push waiter is woken and running again the head of the log
could have moved forwards. Hence the second log force is not
indicating that the log is empty because the head and tail aren't in
the same place. Even if you push the AIL after that, this doesn't
guarantee that the tail LSN in the log has moved past the superblock
that is sitting in the log - that requires another log write after
the second AIL push....

i.e. without a log force after the second AIL push, log recovery
will still try to recover anything still active in the journal after
the second log force, including the superblock update, because the
AIL push only updates the log tail in memory, not on disk...

> > > The log covering state machine handles this all correctly, and it
> > > does it using the superblock as the mechanism that moves the tail of
> > > the log forward. I suspect that we could use the second logging of
> > > the superblock in that state machine to clear/set log incompat
> > > flags, as that is the commit that marks the log "empty". If we crash
> > > before the sb write at this point, log recovery will only see the
> > > superblock to recover, so it doesn't matter that there's a change in
> > > log incompat bits here because after recovery we're going to clear
> > > them, right?
> > > 
> > 
> > Hmm.. but if we update the superblock in the same checkpoint that marks
> > the log empty and then fail before the superblock write, the superblock
> > update is lost since recovery would never happen (because the log is
> > empty). I think the more significant problem with that is we can now
> > present a filesystem with a clean log to an older kernel with the bit
> > set.
> > 
> 
> Hmmmm.. but does log covering actually mark the log clean? I was

No, it doesn't. But it's empty and has nothing to recover, except
for the dummy superblock which has no actual changes in it...

> thinking it did from your description but from a quick test it appears
> it does not. Perhaps that's what you mean by "empty" instead of clean,
> since the log is not technically empty but it contains a no-op change.
> 
> If that's the case, perhaps we can do an explicit log cover in any of
> the quiesce paths, as you describe, before logging the unmount record.

Yup, that's the gist of it.

> The first superblock update is essentially no-op to stamp a new tail in
> the log (as is done today), the next logs the actual superblock change
> and updates the on-disk tail to point to the just written checkpoint,
> then we writeback the superblock and mark the log clean with an unmount
> record. On recovery, we either see a dirty log with the superblock bit
> set, a dirty log with the superblock bit cleared but that has already
> been covered (so is safely recoverable on an old kernel), or a clean
> log. Am I following that about right..?

Yes, pretty much.

Only I don't think we should be using an unmount record for
this. Looking at what we do with a freeze these days, and how
xfs_log_quiesce() is implemented, I think it's all wrong for
freezing. Freezing does not require us to empty the buffer cache,
and freezing immediately dirties the log again by logging the
superblock after quiesce to ensure recovery will run.

IOWs, what freeze actually needs is the quiesce to return with an
empty but dirty log, and that's exactly what covering the log does.

For changing incompat log flags, covering also provides exactly what
we need - an empty log with only a dirty superblock in the journal
to recover. All that we need then is to recheck feature flags after
recovery (not just clear log incompat flags) because we might need
to use this to also set incompat feature flags dynamically.

Hence it seems to me that what xfs_log_quiesce() should do is cover
the log, and the callers of xfs_log_quiesce() can then write the
unmount record directly if they need a clean log to be left behind.

The emptying of the buffer cache should also be removed from
xfs_log_quiesce(), because the only thing that needs this is
xfs_log_unmount(). Freeze, rw->ro transitions and the new log
incompat flag modifications do not need to empty the buffer cache.

And that leaves us with a xfs_log_quiesce() that can then be used to
modify the superblock during the second phase of the covering
process.

> > This is why I was trying to basically log the superblock update in a
> > checkpoint that also ensured the tail moved forward past incompat items
> > but didn't necessarily clean the log. I wonder if we can fit that state
> > into the existing log covering mechanism by logging the update earlier
> > (or maybe via an intermediate state..)? I'll need to go stare at that
> > code a bit..

That's kinda what I'd like to see done - it gives us a generic way
of altering superblock incompat feature fields and filesystem
structures dynamically with no risk of log recovery needing to parse
structures it may not know about.

> > > I'd like to avoid re-inventing the wheel here if we can :)
> > > 
> > 
> > Agreed. The idle log covering mechanism crossed my mind but I didn't
> > look further into it because I thought it was an isolated mechanism (and
> > hadn't grok'd the details of it). I.e., we'd still have to handle freeze
> > and unmount separately so it made more sense to grok those cases first.
> > If you think there's a possibility to rework it so the same covering
> > mechanism can be reused across idle, quiesce, unmount, and that we can
> > conditionally fit a superblock log incompat bit clear into that
> > sequence, that sounds like a nice overall approach to me.

*nod*

That's exactly what I'm trying to acheive here - one mechanism to
rule them all :)

Cheers,

Dave.
Brian Foster Dec. 10, 2020, 2:23 p.m. UTC | #10
On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> On Wed, Dec 09, 2020 at 12:04:28PM -0500, Brian Foster wrote:
> > On Wed, Dec 09, 2020 at 10:52:11AM -0500, Brian Foster wrote:
> > > On Wed, Dec 09, 2020 at 03:19:50PM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 08, 2020 at 07:26:24PM -0800, Darrick J. Wong wrote:
> > > > > On Tue, Dec 08, 2020 at 02:19:13PM -0500, Brian Foster wrote:
> > > > > > On Tue, Dec 08, 2020 at 10:10:27AM -0800, Darrick J. Wong wrote:
> > > > > > > On Tue, Dec 08, 2020 at 06:19:06AM -0500, Brian Foster wrote:
> > > > > > > > On Mon, Dec 07, 2020 at 04:40:28PM -0800, Darrick J. Wong wrote:
> > > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > 
> > > > > > > > > Log incompat feature flags in the superblock exist for one purpose: to
> > > > > > > > > protect the contents of a dirty log from replay on a kernel that isn't
> > > > > > > > > prepared to handle those dirty contents.  This means that they can be
> > > > > > > > > cleared if (a) we know the log is clean and (b) we know that there
> > > > > > > > > aren't any other threads in the system that might be setting or relying
> > > > > > > > > upon a log incompat flag.
> > > > > > > > > 
> > > > > > > > > Therefore, clear the log incompat flags when we've finished recovering
> > > > > > > > > the log, when we're unmounting cleanly, remounting read-only, or
> > > > > > > > > freezing; and provide a function so that subsequent patches can start
> > > > > > > > > using this.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > ---
> > > > > > > > > Note: I wrote this so that we could turn on log incompat flags for
> > > > > > > > > atomic extent swapping and Allison could probably use it for the delayed
> > > > > > > > > logged xattr patchset.  Not gonna try to land this in 5.11, FWIW...
> > > > > > > > > ---
> > > > ....
> > > > > > > > unmount, we ensure a full/sync AIL push completes (and moves the in-core
> > > > > > > > tail) before we log the feature bit change. I do wonder if it's worth
> > > > > > > > complicating the log quiesce path to clear feature bits at all, but I
> > > > > > > > suppose it could be a little inconsistent to clean the log on freeze yet
> > > > > > > > leave an incompat bit around. Perhaps we should push the clear bit
> > > > > > > > sequence down into the log quiesce path between completing the AIL push
> > > > > > > > and writing the unmount record. We may have to commit a sync transaction
> > > > > > > > and then push the AIL again, but that would cover the unmount and freeze
> > > > > > > > cases and I think we could probably do away with the post-recovery bit
> > > > > > > > clearing case entirely. A current/recovered mount should clear the
> > > > > > > > associated bits on the next log quiesce anyways. Hm?
> > > > > > > 
> > > > > > > Hm.  You know how xfs_log_quiesce takes and releases the superblock
> > > > > > > buffer lock after pushing the AIL but before writing the unmount record?
> > > > > > > What if we did the log_incompat feature clearing + bwrite right after
> > > > > > > that?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, that's where I was thinking for the unmount side. I'm not sure we
> > > > > > want to just bwrite there vs. log+push though. Otherwise, I think
> > > > > > there's still a window of potential inconsistency between when the
> > > > > > superblock write completes and the unmount record lands on disk.
> > > > 
> > > > The only thing the unmount record really does is ensure that the
> > > > head and tail of the log point to the unmount record. that's what
> > > > marks the log clean, not the unmount record itself. We use a special
> > > > record here because we're not actually modifying anything - it's not
> > > > a transaction, just a marker to get the log head + tail to point to
> > > > the same LSN in the log.
> > > > 
> > > 
> > > Right..
> > > 
> > > > Log covering does the same thing via logging and flushing the
> > > > superblock multiple times - it makes sure that the only item in the
> > > > commit at the head of the journal has a tail pointer that also
> > > > points to the same checkpoint. The fact that it's got a "dirty"
> > > > unmodified superblock in rather than being an unmount record is
> > > > largely irrelevant - the rest of the log has been marked as clean by
> > > > this checkpoint and the replay of the SB from this checkpoint is
> > > > effectively a no-op.
> > > > 
> > > 
> > > That's close to what I was trying to accomplish above (independent of
> > > whether the approach was actually correct ;). The difference is that I
> > > wasn't trying to mark the log clean generically, but guarantee it's
> > > "clean of incompat items" by the time the superblock buffer that clears
> > > the incompat feature bit hits the log. That way a recovery would update
> > > the superblock on current kernels and old kernels would either see a
> > > dirty log or the a clean log with the updated super.
> > > 
> > > > In fact, I think it's a "no-op" we can take advantage of.....
> > > > 
> > > > > > I
> > > > > > _think_ the whole log force -> AIL push (i.e. move log tail) -> update
> > > > > > super -> log force -> AIL push sequence ensures that if an older kernel
> > > > > > saw the updated super, the log would technically be dirty but we'd at
> > > > > > least be sure that all incompat log items are behind the tail.
> > > > 
> > > > It's a bit more complex than that - if the log is not dirty, then
> > > > the second log force does nothing, and the tail of the log does not
> > > > get updated. Hence you have to log, commit and writeback the
> > > > superblock twice to ensure that the log tail has been updated in the
> > > > journal to after the point in time where the superblock was *first
> > > > logged*.
> > > > 
> > > 
> > > I'm a little confused by your statement. The second log force would come
> > > after we've explicitly logged the superblock (having cleared the feature
> > > bit), so the log would have to be dirty at that point. Do you mean if
> > > the log is not dirty from the start, something else, or just misread?
> 
> There can be things logged while the AIL is pushing, hence by the
> time the push waiter is woken and running again the head of the log
> could have moved forwards. Hence the second log force is not
> indicating that the log is empty because the head and tail aren't in
> the same place. Even if you push the AIL after that, this doesn't
> guarantee that the tail LSN in the log has moved past the superblock
> that is sitting in the log - that requires another log write after
> the second AIL push....
> 

Yes, that is the intent. A crash at that point should see and recover
the superblock in the log. The log would have to be marked clean
separately (if necessary) via an unmount record after the superblock is
written back.

> i.e. without a log force after the second AIL push, log recovery
> will still try to recover anything still active in the journal after
> the second log force, including the superblock update, because the
> AIL push only updates the log tail in memory, not on disk...
> 

Yes. I think we're talking about nearly the same thing here. What I was
trying to accomplish is basically the same exact behavior discussed
below with regard to the log covering mechanism, this is just
effectively open coded and explicit (as opposed to being a background
thing based on an idle log like log covering currently is) from an
otherwise isolated (i.e. no concurrent transactions) context.
Regardless, it's not important if we go down the covering path..

> > > > The log covering state machine handles this all correctly, and it
> > > > does it using the superblock as the mechanism that moves the tail of
> > > > the log forward. I suspect that we could use the second logging of
> > > > the superblock in that state machine to clear/set log incompat
> > > > flags, as that is the commit that marks the log "empty". If we crash
> > > > before the sb write at this point, log recovery will only see the
> > > > superblock to recover, so it doesn't matter that there's a change in
> > > > log incompat bits here because after recovery we're going to clear
> > > > them, right?
> > > > 
> > > 
> > > Hmm.. but if we update the superblock in the same checkpoint that marks
> > > the log empty and then fail before the superblock write, the superblock
> > > update is lost since recovery would never happen (because the log is
> > > empty). I think the more significant problem with that is we can now
> > > present a filesystem with a clean log to an older kernel with the bit
> > > set.
> > > 
> > 
> > Hmmmm.. but does log covering actually mark the log clean? I was
> 
> No, it doesn't. But it's empty and has nothing to recover, except
> for the dummy superblock which has no actual changes in it...
> 

Ok. I don't consider that "empty," but whatever, I know what you mean...
:P

> > thinking it did from your description but from a quick test it appears
> > it does not. Perhaps that's what you mean by "empty" instead of clean,
> > since the log is not technically empty but it contains a no-op change.
> > 
> > If that's the case, perhaps we can do an explicit log cover in any of
> > the quiesce paths, as you describe, before logging the unmount record.
> 
> Yup, that's the gist of it.
> 
> > The first superblock update is essentially no-op to stamp a new tail in
> > the log (as is done today), the next logs the actual superblock change
> > and updates the on-disk tail to point to the just written checkpoint,
> > then we writeback the superblock and mark the log clean with an unmount
> > record. On recovery, we either see a dirty log with the superblock bit
> > set, a dirty log with the superblock bit cleared but that has already
> > been covered (so is safely recoverable on an old kernel), or a clean
> > log. Am I following that about right..?
> 
> Yes, pretty much.
> 
> Only I don't think we should be using an unmount record for
> this. Looking at what we do with a freeze these days, and how
> xfs_log_quiesce() is implemented, I think it's all wrong for
> freezing. Freezing does not require us to empty the buffer cache,
> and freezing immediately dirties the log again by logging the
> superblock after quiesce to ensure recovery will run.
> 
> IOWs, what freeze actually needs is the quiesce to return with an
> empty but dirty log, and that's exactly what covering the log does.
> 

Well freeze does end up with such a log today, it's just implemented in
a hacky and brute force fashion by doing the log quiesce (i.e. unmount
record) followed by the dummy superblock commit instead of running
through the log covering sequence. (But pedantry aside, yes, agreed.)

> For changing incompat log flags, covering also provides exactly what
> we need - an empty log with only a dirty superblock in the journal
> to recover. All that we need then is to recheck feature flags after
> recovery (not just clear log incompat flags) because we might need
> to use this to also set incompat feature flags dynamically.
> 

I'd love it if we use a better term for a log isolated to the superblock
buffer. So far we've discussed an empty log with a dummy superblock, an
empty log with a dirty superblock, and I suppose we could also have an
actual empty log. :P "Covered log," perhaps?

> Hence it seems to me that what xfs_log_quiesce() should do is cover
> the log, and the callers of xfs_log_quiesce() can then write the
> unmount record directly if they need a clean log to be left behind.
> 

Ack..

> The emptying of the buffer cache should also be removed from
> xfs_log_quiesce(), because the only thing that needs this is
> xfs_log_unmount(). Freeze, rw->ro transitions and the new log
> incompat flag modifications do not need to empty the buffer cache.
> 

Yeah, I have patches I've been testing the past couple days that do
exactly that. I'll post them today.

> And that leaves us with a xfs_log_quiesce() that can then be used to
> modify the superblock during the second phase of the covering
> process.
> 

Yep, I like it.

> > > This is why I was trying to basically log the superblock update in a
> > > checkpoint that also ensured the tail moved forward past incompat items
> > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > into the existing log covering mechanism by logging the update earlier
> > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > code a bit..
> 
> That's kinda what I'd like to see done - it gives us a generic way
> of altering superblock incompat feature fields and filesystem
> structures dynamically with no risk of log recovery needing to parse
> structures it may not know about.
> 

We might have to think about if we want to clear such feature bits in
all log covering scenarios (idle, freeze) vs. just unmount. My previous
suggestion in the other sub-thread was to set bits on mount and clear on
unmount because that (hopefully) simplifies the broader implementation.
Otherwise we need to manage dynamic setting of bits when the associated
log items are active, and that still isn't truly representative because
the bits would remain set long after active items fall out of the log,
until the log is covered. Either way is possible, but I'm curious what
others think.

> > > > I'd like to avoid re-inventing the wheel here if we can :)
> > > > 
> > > 
> > > Agreed. The idle log covering mechanism crossed my mind but I didn't
> > > look further into it because I thought it was an isolated mechanism (and
> > > hadn't grok'd the details of it). I.e., we'd still have to handle freeze
> > > and unmount separately so it made more sense to grok those cases first.
> > > If you think there's a possibility to rework it so the same covering
> > > mechanism can be reused across idle, quiesce, unmount, and that we can
> > > conditionally fit a superblock log incompat bit clear into that
> > > sequence, that sounds like a nice overall approach to me.
> 
> *nod*
> 
> That's exactly what I'm trying to acheive here - one mechanism to
> rule them all :)
> 

Thanks. Once I get the freeze/remount buffer reclaim bits settled I'll
try and take a closer look at the covering code and think about how we
could generically repurpose it. If that works out, Darrick's patch would
then just have to update the in-core superblock at the right time and
force the AIL.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Dec. 10, 2020, 9:50 p.m. UTC | #11
On Thu, Dec 10, 2020 at 09:23:58AM -0500, Brian Foster wrote:
> On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> > For changing incompat log flags, covering also provides exactly what
> > we need - an empty log with only a dirty superblock in the journal
> > to recover. All that we need then is to recheck feature flags after
> > recovery (not just clear log incompat flags) because we might need
> > to use this to also set incompat feature flags dynamically.
> > 
> 
> I'd love it if we use a better term for a log isolated to the superblock
> buffer. So far we've discussed an empty log with a dummy superblock, an
> empty log with a dirty superblock, and I suppose we could also have an
> actual empty log. :P "Covered log," perhaps?

On terminology: "covered log" has been around for 25 years
in XFS, and it's very definition is "an empty, up-to-date log except
for a dummy object we logged without any changes to update the log
tail". And, by definition, that dummy object is dirty in the log
even if it contains no actual modifications when it was logged.
So in this case, "dummy" is directly interchangable with "dirty"
when looking at the log contents w.r.t. recovery behaviour.

It's worth looking at a historical change, too. Log covering
originally used the root inode and not the superblock. That caused
problems on linux dirtying VFS inode state, so we changed it to the
superblock in commit 1a387d3be2b3 ("xfs: dummy transactions should
not dirty VFS state") about a decade ago.  Note the name of the
function at the time (xfs_fs_log_dummy()) and that it's description
explicitly mentions that a dummy transaction used for updating the
log tail when covering it.

The only difference in this discussion is fact that we may be
replacing the "dummy" with a modified superblock. Both are still
dirty superblock objects in the log, and serve the same purpose for
covering the log, but instead of using a dummy object we update the
log incompat flags so taht the only thing that gets replayed if we
crash is the modification to the superblock flags...

Finally, no, we can't have a truly empty log while the filesystem is
mounted because log transaction records must not be empty. Further,
we can only bring the log down to an "empty but not quite clean"
state while mounted, because if the log is actually marked clean and
we crash then log recovery will not run and so we will not clean up
files that were open-but-unlinked when the system crashed.


> > > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > > into the existing log covering mechanism by logging the update earlier
> > > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > > code a bit..
> > 
> > That's kinda what I'd like to see done - it gives us a generic way
> > of altering superblock incompat feature fields and filesystem
> > structures dynamically with no risk of log recovery needing to parse
> > structures it may not know about.
> > 
> 
> We might have to think about if we want to clear such feature bits in
> all log covering scenarios (idle, freeze) vs. just unmount.

I think clearing bits can probably be lazy. Set a state flag to tell
log covering that it needs to do an update, and so when the covering
code finally runs the feature bit modification just happens.


> My previous suggestion in the other sub-thread was to set bits on
> mount and clear on unmount because that (hopefully) simplifies the
> broader implementation.  Otherwise we need to manage dynamic
> setting of bits when the associated log items are active, and that
> still isn't truly representative because the bits would remain set
> long after active items fall out of the log, until the log is
> covered. Either way is possible, but I'm curious what others
> think.

I think that unless we are setting a log incompat flag, log covering
should unconditionally clear the log incompat flags. Because the log
is empty, and recovery of a superblock buffer should -always- be
possible, then by definition we have no log incompat state
present....

As for a mechanism for dynamically adding log incompat flags?
Perhaps we just do that in xfs_trans_alloc() - add an log incompat
flags field into the transaction reservation structure, and if
xfs_trans_alloc() sees an incompat field set and the superblock
doesn't have it set, the first thing it does is run a "set log
incompat flag" transaction before then doing it's normal work...

This should be rare enough it doesn't have any measurable
performance overhead, and it's flexible enough to support any log
incompat feature we might need to implement...

Cheers,

Dave.
Brian Foster Dec. 11, 2020, 1:39 p.m. UTC | #12
On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> On Thu, Dec 10, 2020 at 09:23:58AM -0500, Brian Foster wrote:
> > On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> > > For changing incompat log flags, covering also provides exactly what
> > > we need - an empty log with only a dirty superblock in the journal
> > > to recover. All that we need then is to recheck feature flags after
> > > recovery (not just clear log incompat flags) because we might need
> > > to use this to also set incompat feature flags dynamically.
> > > 
> > 
> > I'd love it if we use a better term for a log isolated to the superblock
> > buffer. So far we've discussed an empty log with a dummy superblock, an
> > empty log with a dirty superblock, and I suppose we could also have an
> > actual empty log. :P "Covered log," perhaps?
> 
> On terminology: "covered log" has been around for 25 years
> in XFS, and it's very definition is "an empty, up-to-date log except
> for a dummy object we logged without any changes to update the log
> tail". And, by definition, that dummy object is dirty in the log
> even if it contains no actual modifications when it was logged.
> So in this case, "dummy" is directly interchangable with "dirty"
> when looking at the log contents w.r.t. recovery behaviour.
> 
> It's worth looking at a historical change, too. Log covering
> originally used the root inode and not the superblock. That caused
> problems on linux dirtying VFS inode state, so we changed it to the
> superblock in commit 1a387d3be2b3 ("xfs: dummy transactions should
> not dirty VFS state") about a decade ago.  Note the name of the
> function at the time (xfs_fs_log_dummy()) and that it's description
> explicitly mentions that a dummy transaction used for updating the
> log tail when covering it.
> 
> The only difference in this discussion is fact that we may be
> replacing the "dummy" with a modified superblock. Both are still
> dirty superblock objects in the log, and serve the same purpose for
> covering the log, but instead of using a dummy object we update the
> log incompat flags so taht the only thing that gets replayed if we
> crash is the modification to the superblock flags...
> 

Makes sense, thanks.

> Finally, no, we can't have a truly empty log while the filesystem is
> mounted because log transaction records must not be empty. Further,
> we can only bring the log down to an "empty but not quite clean"
> state while mounted, because if the log is actually marked clean and
> we crash then log recovery will not run and so we will not clean up
> files that were open-but-unlinked when the system crashed.
> 

Yeah, I wasn't suggesting to do that, just surmising about terminology.
It sounds like "covered log" refers to the current state logging an
unmodified superblock object.

> 
> > > > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > > > into the existing log covering mechanism by logging the update earlier
> > > > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > > > code a bit..
> > > 
> > > That's kinda what I'd like to see done - it gives us a generic way
> > > of altering superblock incompat feature fields and filesystem
> > > structures dynamically with no risk of log recovery needing to parse
> > > structures it may not know about.
> > > 
> > 
> > We might have to think about if we want to clear such feature bits in
> > all log covering scenarios (idle, freeze) vs. just unmount.
> 
> I think clearing bits can probably be lazy. Set a state flag to tell
> log covering that it needs to do an update, and so when the covering
> code finally runs the feature bit modification just happens.
> 

That's reasonable on its own, it just means we have to support dynamic
setting of the associated bit(s) at runtime...

> 
> > My previous suggestion in the other sub-thread was to set bits on
> > mount and clear on unmount because that (hopefully) simplifies the
> > broader implementation.  Otherwise we need to manage dynamic
> > setting of bits when the associated log items are active, and that
> > still isn't truly representative because the bits would remain set
> > long after active items fall out of the log, until the log is
> > covered. Either way is possible, but I'm curious what others
> > think.
> 
> I think that unless we are setting a log incompat flag, log covering
> should unconditionally clear the log incompat flags. Because the log
> is empty, and recovery of a superblock buffer should -always- be
> possible, then by definition we have no log incompat state
> present....
> 

It should, but in practice will depend on whether the superblock was
written back or not before a crash while in the covered state. So
recovery may or may not actually work on an older kernel after the log
is covered. I think that is fine from a correctness standpoint because
the important part is to make sure an older kernel will not attempt to
recover incompatible items, and we make that guarantee by covering the
log before clearing the bit. I'm merely pointing this out because I
still think it's more straightforward to just enforce that a kernel with
the associated feature bit be required to recover a dirty log.

> As for a mechanism for dynamically adding log incompat flags?
> Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> flags field into the transaction reservation structure, and if
> xfs_trans_alloc() sees an incompat field set and the superblock
> doesn't have it set, the first thing it does is run a "set log
> incompat flag" transaction before then doing it's normal work...
> 
> This should be rare enough it doesn't have any measurable
> performance overhead, and it's flexible enough to support any log
> incompat feature we might need to implement...
> 

But I don't think that is sufficient. As Darrick pointed out up-thread,
the updated superblock has to be written back before we're allowed to
commit transactions with incompatible items. Otherwise, an older kernel
can attempt log recovery with incompatible items present if the
filesystem crashes before the superblock is written back.

We could do some sync transaction and/or sync write dance at runtime,
but I think the performance/overhead aspect becomes slightly less
deterministic. It's not clear to me how many bits we'd support over
time, and whether users would notice hiccups when running some sustained
workload and happen to trigger sync transaction/write or AIL push
sequences to set internal bits.

My question is how flexible do we really need to make incompatible log
recovery support? Why not just commit the superblock once at mount time
with however many bits the current kernel supports and clear them on
unmount? (Or perhaps consider a lazy setting variant where we set all
supported bits on the first modification..?)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong Dec. 11, 2020, 11:35 p.m. UTC | #13
On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > On Thu, Dec 10, 2020 at 09:23:58AM -0500, Brian Foster wrote:
> > > On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> > > > For changing incompat log flags, covering also provides exactly what
> > > > we need - an empty log with only a dirty superblock in the journal
> > > > to recover. All that we need then is to recheck feature flags after
> > > > recovery (not just clear log incompat flags) because we might need
> > > > to use this to also set incompat feature flags dynamically.
> > > > 
> > > 
> > > I'd love it if we use a better term for a log isolated to the superblock
> > > buffer. So far we've discussed an empty log with a dummy superblock, an
> > > empty log with a dirty superblock, and I suppose we could also have an
> > > actual empty log. :P "Covered log," perhaps?
> > 
> > On terminology: "covered log" has been around for 25 years
> > in XFS, and it's very definition is "an empty, up-to-date log except
> > for a dummy object we logged without any changes to update the log
> > tail". And, by definition, that dummy object is dirty in the log

Hmm, can we capture in a comment in the logging code the precise meaning
of "covered log" in whatever patch we end up generating?  I've long
suspected that's more or less what a covered log meant (the log has one
item just to prove that head==tail and there's nothing else to see here)
but it sure would've been nice to confirm that. :)

> > even if it contains no actual modifications when it was logged.
> > So in this case, "dummy" is directly interchangable with "dirty"
> > when looking at the log contents w.r.t. recovery behaviour.
> > 
> > It's worth looking at a historical change, too. Log covering
> > originally used the root inode and not the superblock. That caused
> > problems on linux dirtying VFS inode state, so we changed it to the
> > superblock in commit 1a387d3be2b3 ("xfs: dummy transactions should
> > not dirty VFS state") about a decade ago.  Note the name of the
> > function at the time (xfs_fs_log_dummy()) and that it's description
> > explicitly mentions that a dummy transaction used for updating the
> > log tail when covering it.
> > 
> > The only difference in this discussion is fact that we may be
> > replacing the "dummy" with a modified superblock. Both are still
> > dirty superblock objects in the log, and serve the same purpose for
> > covering the log, but instead of using a dummy object we update the
> > log incompat flags so taht the only thing that gets replayed if we
> > crash is the modification to the superblock flags...
> > 
> 
> Makes sense, thanks.
> 
> > Finally, no, we can't have a truly empty log while the filesystem is
> > mounted because log transaction records must not be empty. Further,
> > we can only bring the log down to an "empty but not quite clean"
> > state while mounted, because if the log is actually marked clean and
> > we crash then log recovery will not run and so we will not clean up
> > files that were open-but-unlinked when the system crashed.

Whatever happened to Eric's patchset to make us process unlinked inodes
at mount time so that freeze images wouldn't require recovery?

> Yeah, I wasn't suggesting to do that, just surmising about terminology.
> It sounds like "covered log" refers to the current state logging an
> unmodified superblock object.
> 
> > 
> > > > > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > > > > into the existing log covering mechanism by logging the update earlier
> > > > > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > > > > code a bit..
> > > > 
> > > > That's kinda what I'd like to see done - it gives us a generic way
> > > > of altering superblock incompat feature fields and filesystem
> > > > structures dynamically with no risk of log recovery needing to parse
> > > > structures it may not know about.
> > > > 
> > > 
> > > We might have to think about if we want to clear such feature bits in
> > > all log covering scenarios (idle, freeze) vs. just unmount.
> > 
> > I think clearing bits can probably be lazy. Set a state flag to tell
> > log covering that it needs to do an update, and so when the covering
> > code finally runs the feature bit modification just happens.
> > 
> 
> That's reasonable on its own, it just means we have to support dynamic
> setting of the associated bit(s) at runtime...

The one downside of clearing the log incompat flags when the log goes
idle is that right now I print a big EXPERIMENTAL warning whenever we
turn on the atomic swapext feature, and doing this will cause that
message to cycle over and over...

That's kind of a minor point though.  Either we add a superblock flag to
remember that we've already warned about the experimental feature, or we
just don't bother at all.

Next time I have spare cycles I'll look into adapting this patch to make
it clear log incompat flags at unmount and/or covering time, assuming
nobody beats me to it.

> > > My previous suggestion in the other sub-thread was to set bits on
> > > mount and clear on unmount because that (hopefully) simplifies the
> > > broader implementation.  Otherwise we need to manage dynamic
> > > setting of bits when the associated log items are active, and that
> > > still isn't truly representative because the bits would remain set
> > > long after active items fall out of the log, until the log is
> > > covered. Either way is possible, but I'm curious what others
> > > think.
> > 
> > I think that unless we are setting a log incompat flag, log covering
> > should unconditionally clear the log incompat flags. Because the log
> > is empty, and recovery of a superblock buffer should -always- be
> > possible, then by definition we have no log incompat state
> > present....
> > 
> 
> It should, but in practice will depend on whether the superblock was
> written back or not before a crash while in the covered state. So
> recovery may or may not actually work on an older kernel after the log
> is covered. I think that is fine from a correctness standpoint because
> the important part is to make sure an older kernel will not attempt to
> recover incompatible items, and we make that guarantee by covering the
> log before clearing the bit. I'm merely pointing this out because I
> still think it's more straightforward to just enforce that a kernel with
> the associated feature bit be required to recover a dirty log.

<nod>

> > As for a mechanism for dynamically adding log incompat flags?
> > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > flags field into the transaction reservation structure, and if
> > xfs_trans_alloc() sees an incompat field set and the superblock
> > doesn't have it set, the first thing it does is run a "set log
> > incompat flag" transaction before then doing it's normal work...
> > 
> > This should be rare enough it doesn't have any measurable
> > performance overhead, and it's flexible enough to support any log
> > incompat feature we might need to implement...
> > 
> 
> But I don't think that is sufficient. As Darrick pointed out up-thread,
> the updated superblock has to be written back before we're allowed to
> commit transactions with incompatible items. Otherwise, an older kernel
> can attempt log recovery with incompatible items present if the
> filesystem crashes before the superblock is written back.
> 
> We could do some sync transaction and/or sync write dance at runtime,
> but I think the performance/overhead aspect becomes slightly less
> deterministic. It's not clear to me how many bits we'd support over
> time, and whether users would notice hiccups when running some sustained
> workload and happen to trigger sync transaction/write or AIL push
> sequences to set internal bits.

Probably few, since I imagined that most new log items are going to get
built on behalf of new disk format features.  OTOH Allison and I are
both building features that don't change the disk format...

> My question is how flexible do we really need to make incompatible log
> recovery support? Why not just commit the superblock once at mount time
> with however many bits the current kernel supports and clear them on
> unmount? (Or perhaps consider a lazy setting variant where we set all
> supported bits on the first modification..?)

I don't think setting it at mount (or first mod) time is a good idea
either, since that means that the fs cannot be recovered on an old
kernel even if we never actually used the new log items.

--D

> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Dave Chinner Dec. 12, 2020, 9:14 p.m. UTC | #14
On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > As for a mechanism for dynamically adding log incompat flags?
> > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > flags field into the transaction reservation structure, and if
> > xfs_trans_alloc() sees an incompat field set and the superblock
> > doesn't have it set, the first thing it does is run a "set log
> > incompat flag" transaction before then doing it's normal work...
> > 
> > This should be rare enough it doesn't have any measurable
> > performance overhead, and it's flexible enough to support any log
> > incompat feature we might need to implement...
> > 
> 
> But I don't think that is sufficient. As Darrick pointed out up-thread,
> the updated superblock has to be written back before we're allowed to
> commit transactions with incompatible items. Otherwise, an older kernel
> can attempt log recovery with incompatible items present if the
> filesystem crashes before the superblock is written back.

Sure, that's what the hook in xfs_trans_alloc() would do. It can do
the work in the context that is going to need it, and set a wait
flag for all incoming transactions that need a log incompat flag to
wait for it do it's work.  Once it's done and the flag is set, it
can continue and wake all the waiters now that the log incompat flag
has been set. Anything that doesn't need a log incompat flag can
just keep going and doesn't ever get blocked....

> We could do some sync transaction and/or sync write dance at runtime,
> but I think the performance/overhead aspect becomes slightly less
> deterministic. It's not clear to me how many bits we'd support over
> time, and whether users would notice hiccups when running some sustained
> workload and happen to trigger sync transaction/write or AIL push
> sequences to set internal bits.

I don't think the number of bits is ever going to be a worry.  If we
do it on a transaction granularlity, it will only block transactions
taht need the log incomapt bit, and only until the bit is set.

I suspect this is one of the rare occasions where an unlogged
modification makes an awful lot of sense: we don't even log that we
are adding a log incompat flag, we just do an atomic synchronous
write straight to the superblock to set the incompat flag(s). The
entire modification can be done under the superblock buffer lock to
serialise multiple transactions all trying to set incompat bits, and
we don't set the in-memory superblock incompat bit until after it
has been set and written to disk. Hence multiple waits can check the
flag after they've got the sb buffer lock, and they'll see that it's
already been set and just continue...

This gets rid of the whole "what about a log containing an item that
sets the incompat bit" problem, and it provides a simple means of
serialising and co-ordinating setting of a log incompat flag....

> My question is how flexible do we really need to make incompatible log
> recovery support? Why not just commit the superblock once at mount time
> with however many bits the current kernel supports and clear them on
> unmount? (Or perhaps consider a lazy setting variant where we set all
> supported bits on the first modification..?)

We don't want to set the incompat bits if we don't need to. That
just guarantees user horror stories that start with "boot system
with new kernel, crash, go back to old kernel, can't mount root
filesystem anymore".

Cheers,

Dave.
Brian Foster Dec. 14, 2020, 3:25 p.m. UTC | #15
On Fri, Dec 11, 2020 at 03:35:07PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > On Thu, Dec 10, 2020 at 09:23:58AM -0500, Brian Foster wrote:
> > > > On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> > > > > For changing incompat log flags, covering also provides exactly what
> > > > > we need - an empty log with only a dirty superblock in the journal
> > > > > to recover. All that we need then is to recheck feature flags after
> > > > > recovery (not just clear log incompat flags) because we might need
> > > > > to use this to also set incompat feature flags dynamically.
> > > > > 
> > > > 
> > > > I'd love it if we use a better term for a log isolated to the superblock
> > > > buffer. So far we've discussed an empty log with a dummy superblock, an
> > > > empty log with a dirty superblock, and I suppose we could also have an
> > > > actual empty log. :P "Covered log," perhaps?
> > > 
> > > On terminology: "covered log" has been around for 25 years
> > > in XFS, and it's very definition is "an empty, up-to-date log except
> > > for a dummy object we logged without any changes to update the log
> > > tail". And, by definition, that dummy object is dirty in the log
> 
> Hmm, can we capture in a comment in the logging code the precise meaning
> of "covered log" in whatever patch we end up generating?  I've long
> suspected that's more or less what a covered log meant (the log has one
> item just to prove that head==tail and there's nothing else to see here)
> but it sure would've been nice to confirm that. :)
> 

Yeah, I was asking about proper terminology up front to help make sure
commit logs and whatnot accurately describe what's going on. :)

> > > even if it contains no actual modifications when it was logged.
> > > So in this case, "dummy" is directly interchangable with "dirty"
> > > when looking at the log contents w.r.t. recovery behaviour.
> > > 
> > > It's worth looking at a historical change, too. Log covering
> > > originally used the root inode and not the superblock. That caused
> > > problems on linux dirtying VFS inode state, so we changed it to the
> > > superblock in commit 1a387d3be2b3 ("xfs: dummy transactions should
> > > not dirty VFS state") about a decade ago.  Note the name of the
> > > function at the time (xfs_fs_log_dummy()) and that it's description
> > > explicitly mentions that a dummy transaction used for updating the
> > > log tail when covering it.
> > > 
> > > The only difference in this discussion is fact that we may be
> > > replacing the "dummy" with a modified superblock. Both are still
> > > dirty superblock objects in the log, and serve the same purpose for
> > > covering the log, but instead of using a dummy object we update the
> > > log incompat flags so taht the only thing that gets replayed if we
> > > crash is the modification to the superblock flags...
> > > 
> > 
> > Makes sense, thanks.
> > 
> > > Finally, no, we can't have a truly empty log while the filesystem is
> > > mounted because log transaction records must not be empty. Further,
> > > we can only bring the log down to an "empty but not quite clean"
> > > state while mounted, because if the log is actually marked clean and
> > > we crash then log recovery will not run and so we will not clean up
> > > files that were open-but-unlinked when the system crashed.
> 
> Whatever happened to Eric's patchset to make us process unlinked inodes
> at mount time so that freeze images wouldn't require recovery?
> 
> > Yeah, I wasn't suggesting to do that, just surmising about terminology.
> > It sounds like "covered log" refers to the current state logging an
> > unmodified superblock object.
> > 
> > > 
> > > > > > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > > > > > into the existing log covering mechanism by logging the update earlier
> > > > > > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > > > > > code a bit..
> > > > > 
> > > > > That's kinda what I'd like to see done - it gives us a generic way
> > > > > of altering superblock incompat feature fields and filesystem
> > > > > structures dynamically with no risk of log recovery needing to parse
> > > > > structures it may not know about.
> > > > > 
> > > > 
> > > > We might have to think about if we want to clear such feature bits in
> > > > all log covering scenarios (idle, freeze) vs. just unmount.
> > > 
> > > I think clearing bits can probably be lazy. Set a state flag to tell
> > > log covering that it needs to do an update, and so when the covering
> > > code finally runs the feature bit modification just happens.
> > > 
> > 
> > That's reasonable on its own, it just means we have to support dynamic
> > setting of the associated bit(s) at runtime...
> 
> The one downside of clearing the log incompat flags when the log goes
> idle is that right now I print a big EXPERIMENTAL warning whenever we
> turn on the atomic swapext feature, and doing this will cause that
> message to cycle over and over...
> 
> That's kind of a minor point though.  Either we add a superblock flag to
> remember that we've already warned about the experimental feature, or we
> just don't bother at all.
> 
> Next time I have spare cycles I'll look into adapting this patch to make
> it clear log incompat flags at unmount and/or covering time, assuming
> nobody beats me to it.
> 
> > > > My previous suggestion in the other sub-thread was to set bits on
> > > > mount and clear on unmount because that (hopefully) simplifies the
> > > > broader implementation.  Otherwise we need to manage dynamic
> > > > setting of bits when the associated log items are active, and that
> > > > still isn't truly representative because the bits would remain set
> > > > long after active items fall out of the log, until the log is
> > > > covered. Either way is possible, but I'm curious what others
> > > > think.
> > > 
> > > I think that unless we are setting a log incompat flag, log covering
> > > should unconditionally clear the log incompat flags. Because the log
> > > is empty, and recovery of a superblock buffer should -always- be
> > > possible, then by definition we have no log incompat state
> > > present....
> > > 
> > 
> > It should, but in practice will depend on whether the superblock was
> > written back or not before a crash while in the covered state. So
> > recovery may or may not actually work on an older kernel after the log
> > is covered. I think that is fine from a correctness standpoint because
> > the important part is to make sure an older kernel will not attempt to
> > recover incompatible items, and we make that guarantee by covering the
> > log before clearing the bit. I'm merely pointing this out because I
> > still think it's more straightforward to just enforce that a kernel with
> > the associated feature bit be required to recover a dirty log.
> 
> <nod>
> 
> > > As for a mechanism for dynamically adding log incompat flags?
> > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > flags field into the transaction reservation structure, and if
> > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > doesn't have it set, the first thing it does is run a "set log
> > > incompat flag" transaction before then doing it's normal work...
> > > 
> > > This should be rare enough it doesn't have any measurable
> > > performance overhead, and it's flexible enough to support any log
> > > incompat feature we might need to implement...
> > > 
> > 
> > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > the updated superblock has to be written back before we're allowed to
> > commit transactions with incompatible items. Otherwise, an older kernel
> > can attempt log recovery with incompatible items present if the
> > filesystem crashes before the superblock is written back.
> > 
> > We could do some sync transaction and/or sync write dance at runtime,
> > but I think the performance/overhead aspect becomes slightly less
> > deterministic. It's not clear to me how many bits we'd support over
> > time, and whether users would notice hiccups when running some sustained
> > workload and happen to trigger sync transaction/write or AIL push
> > sequences to set internal bits.
> 
> Probably few, since I imagined that most new log items are going to get
> built on behalf of new disk format features.  OTOH Allison and I are
> both building features that don't change the disk format...
> 

Right..

> > My question is how flexible do we really need to make incompatible log
> > recovery support? Why not just commit the superblock once at mount time
> > with however many bits the current kernel supports and clear them on
> > unmount? (Or perhaps consider a lazy setting variant where we set all
> > supported bits on the first modification..?)
> 
> I don't think setting it at mount (or first mod) time is a good idea
> either, since that means that the fs cannot be recovered on an old
> kernel even if we never actually used the new log items.
> 

Indeed, that's the primary tradeoff. I think it's worth the tradeoff for
simplicity because I don't think users will know or understand the
difference between when the bit is set and cleared dynamically or just
set by default. I suppose that is less of a concern if a bit is tied to
some exclusive operation like extent swap and nothing else, but with
something like xattrs I think the reality is that a writeable filesystem
will mostly likely require recovery on a feature bit enabled kernel
regardless.

That said, I do see value in clearing the bit when the log idles and
thus obviously that requires bits to be set again at runtime. I'm just
not clear on how much value that brings in practice, given the above.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
>
Brian Foster Dec. 14, 2020, 3:58 p.m. UTC | #16
On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > As for a mechanism for dynamically adding log incompat flags?
> > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > flags field into the transaction reservation structure, and if
> > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > doesn't have it set, the first thing it does is run a "set log
> > > incompat flag" transaction before then doing it's normal work...
> > > 
> > > This should be rare enough it doesn't have any measurable
> > > performance overhead, and it's flexible enough to support any log
> > > incompat feature we might need to implement...
> > > 
> > 
> > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > the updated superblock has to be written back before we're allowed to
> > commit transactions with incompatible items. Otherwise, an older kernel
> > can attempt log recovery with incompatible items present if the
> > filesystem crashes before the superblock is written back.
> 
> Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> the work in the context that is going to need it, and set a wait
> flag for all incoming transactions that need a log incompat flag to
> wait for it do it's work.  Once it's done and the flag is set, it
> can continue and wake all the waiters now that the log incompat flag
> has been set. Anything that doesn't need a log incompat flag can
> just keep going and doesn't ever get blocked....
> 

It would have to be a sync transaction plus sync AIL force in
transaction allocation context if we were to log the superblock change,
which sounds a bit hairy...

> > We could do some sync transaction and/or sync write dance at runtime,
> > but I think the performance/overhead aspect becomes slightly less
> > deterministic. It's not clear to me how many bits we'd support over
> > time, and whether users would notice hiccups when running some sustained
> > workload and happen to trigger sync transaction/write or AIL push
> > sequences to set internal bits.
> 
> I don't think the number of bits is ever going to be a worry.  If we
> do it on a transaction granularlity, it will only block transactions
> taht need the log incomapt bit, and only until the bit is set.
> 

... until it's cleared again. Granted, that would only occur once the
log idles.

> I suspect this is one of the rare occasions where an unlogged
> modification makes an awful lot of sense: we don't even log that we
> are adding a log incompat flag, we just do an atomic synchronous
> write straight to the superblock to set the incompat flag(s). The
> entire modification can be done under the superblock buffer lock to
> serialise multiple transactions all trying to set incompat bits, and
> we don't set the in-memory superblock incompat bit until after it
> has been set and written to disk. Hence multiple waits can check the
> flag after they've got the sb buffer lock, and they'll see that it's
> already been set and just continue...
> 

Agreed. That is a notable simplification and I think much more
preferable than the above for the dynamic approach.

That said, note that dynamic feature bits might introduce complexity in
more subtle ways. For example, nothing that I can see currently
serializes idle log covering with an active transaction (that may have
just set an incompat bit via some hook yet not committed anything to the
log subsystem), so it might not be as simple as just adding a hook
somewhere.

> This gets rid of the whole "what about a log containing an item that
> sets the incompat bit" problem, and it provides a simple means of
> serialising and co-ordinating setting of a log incompat flag....
> 
> > My question is how flexible do we really need to make incompatible log
> > recovery support? Why not just commit the superblock once at mount time
> > with however many bits the current kernel supports and clear them on
> > unmount? (Or perhaps consider a lazy setting variant where we set all
> > supported bits on the first modification..?)
> 
> We don't want to set the incompat bits if we don't need to. That
> just guarantees user horror stories that start with "boot system
> with new kernel, crash, go back to old kernel, can't mount root
> filesystem anymore".
> 

Indeed, that is a potential wart with just setting bits on mount. I do
think this is likely to be the case with or without dynamic feature
bits, because at least in certain cases we'll be setting incompat bits
in short order anyways. E.g., one of the primary use cases here is for
xattrs, which is likely to be active on any root filesystem via things
like SELinux, etc. Point being, all it takes is one feature bit
associated with some core operation to introduce this risky update
scenario in practice.

I dunno... I'm just trying to explore whether we can simplify this whole
concept to something more easily managed and less likely to cause us
headache. I'm a bit concerned that we're disregarding other tradeoffs
like the complexity noted above, the risk and cost of bugs in the
mechanism itself (because log recovery has historically been so well
tested.. :P) or whether the idea of new kernels immediately delivering
new incompat log formats is a robust/reliable solution in the first
place. IIRC, the last time we did this was ICREATE and that was hidden
behind the v5 update. IOW, for certain things like the xattr rework, I'd
think that kind of experimental stabilization cycle is warranted before
we'd consider enabling such a feature, even dynamically (which means a
revertible kernel should be available in common/incremental upgrade
cases).

Anyways, it sounds like both you and Darrick still prefer this approach
so this is just my .02 for the time being..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Dec. 14, 2020, 8:54 p.m. UTC | #17
On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > As for a mechanism for dynamically adding log incompat flags?
> > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > flags field into the transaction reservation structure, and if
> > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > doesn't have it set, the first thing it does is run a "set log
> > > > incompat flag" transaction before then doing it's normal work...
> > > > 
> > > > This should be rare enough it doesn't have any measurable
> > > > performance overhead, and it's flexible enough to support any log
> > > > incompat feature we might need to implement...
> > > > 
> > > 
> > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > the updated superblock has to be written back before we're allowed to
> > > commit transactions with incompatible items. Otherwise, an older kernel
> > > can attempt log recovery with incompatible items present if the
> > > filesystem crashes before the superblock is written back.
> > 
> > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > the work in the context that is going to need it, and set a wait
> > flag for all incoming transactions that need a log incompat flag to
> > wait for it do it's work.  Once it's done and the flag is set, it
> > can continue and wake all the waiters now that the log incompat flag
> > has been set. Anything that doesn't need a log incompat flag can
> > just keep going and doesn't ever get blocked....
> > 
> 
> It would have to be a sync transaction plus sync AIL force in
> transaction allocation context if we were to log the superblock change,
> which sounds a bit hairy...

Well, we already do sync AIL forces in transaction reservation when
we run out of log space, so there's no technical reason for this
being a problem at all. xfs_trans_alloc() is expected to block
waiting on AIL tail pushing....

> > I suspect this is one of the rare occasions where an unlogged
> > modification makes an awful lot of sense: we don't even log that we
> > are adding a log incompat flag, we just do an atomic synchronous
> > write straight to the superblock to set the incompat flag(s). The
> > entire modification can be done under the superblock buffer lock to
> > serialise multiple transactions all trying to set incompat bits, and
> > we don't set the in-memory superblock incompat bit until after it
> > has been set and written to disk. Hence multiple waits can check the
> > flag after they've got the sb buffer lock, and they'll see that it's
> > already been set and just continue...
> > 
> 
> Agreed. That is a notable simplification and I think much more
> preferable than the above for the dynamic approach.
> 
> That said, note that dynamic feature bits might introduce complexity in
> more subtle ways. For example, nothing that I can see currently
> serializes idle log covering with an active transaction (that may have
> just set an incompat bit via some hook yet not committed anything to the
> log subsystem), so it might not be as simple as just adding a hook
> somewhere.

Right, we had to make log covering away of the CIL to prevent it
from idling while there were multiple active committed transactions
in memory. So the state machine only progresses if both the CIL and
AIL are empty. If we had some way of knowing that a transaction is
in progress, we could check that in xfs_log_need_covered() and we'd
stop the state machine progress at that point. But we got rid of the
active transaction counter that we could use for that....

[Hmmm, didn't I recently have a patch that re-introduced that
counter to fix some other "we need to know if there's an active
transaction running" issue? Can't remember what that was now...]

> > This gets rid of the whole "what about a log containing an item that
> > sets the incompat bit" problem, and it provides a simple means of
> > serialising and co-ordinating setting of a log incompat flag....
> > 
> > > My question is how flexible do we really need to make incompatible log
> > > recovery support? Why not just commit the superblock once at mount time
> > > with however many bits the current kernel supports and clear them on
> > > unmount? (Or perhaps consider a lazy setting variant where we set all
> > > supported bits on the first modification..?)
> > 
> > We don't want to set the incompat bits if we don't need to. That
> > just guarantees user horror stories that start with "boot system
> > with new kernel, crash, go back to old kernel, can't mount root
> > filesystem anymore".
> > 
> 
> Indeed, that is a potential wart with just setting bits on mount. I do
> think this is likely to be the case with or without dynamic feature
> bits, because at least in certain cases we'll be setting incompat bits
> in short order anyways. E.g., one of the primary use cases here is for
> xattrs, which is likely to be active on any root filesystem via things
> like SELinux, etc. Point being, all it takes is one feature bit
> associated with some core operation to introduce this risky update
> scenario in practice.

That may well be the case for some distros and some root
filesystems, and that's an argument against using log incompat flags
for the -xattr feature-. It's not an argument against
dynamically setting and clearing log incompat features in general.

That is, if xattrs are so wide spread that we expose users to
"upgrade-fail-can't downgrade" by use of a dynamic log incompat
flag, then we should not be making that feature dynamic and
"autoset". In this situation, it needs to be opt-in and planned,
likely done in maintenance downtime rather than a side effect of a
kernel upgrade.

So, yeah, this discussion is making me think that the xattr logging
upgrade is going to need a full ATTR3 feature bit like the other
ATTR and ATTR2 feature bits, not just a log incompat bit...

> I dunno... I'm just trying to explore whether we can simplify this whole
> concept to something more easily managed and less likely to cause us
> headache. I'm a bit concerned that we're disregarding other tradeoffs
> like the complexity noted above, the risk and cost of bugs in the
> mechanism itself (because log recovery has historically been so well
> tested.. :P) or whether the idea of new kernels immediately delivering
> new incompat log formats is a robust/reliable solution in the first
> place. IIRC, the last time we did this was ICREATE and that was hidden
> behind the v5 update. IOW, for certain things like the xattr rework, I'd
> think that kind of experimental stabilization cycle is warranted before
> we'd consider enabling such a feature, even dynamically (which means a
> revertible kernel should be available in common/incremental upgrade
> cases).

IMO, the xattr logging rework is most definitely under the
EXPERIMENTAL umbrella and that was always going to be the case.
Also, I don't think we're ignoring the potential complexity of
dynamically setting/clearing stuff - otherwise we wouldn't be having
this conversation about how simple we can actually make it. If it
turns out that we can't do it simply, then setting/clearing at
mount/unmount should be considered "plan B"....

But right now, I think the discussion has come up with some ideas to
greatly simplify the dynamic flag setting + clearing....

Cheers,

Dave.
Brian Foster Dec. 15, 2020, 1:50 p.m. UTC | #18
On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > flags field into the transaction reservation structure, and if
> > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > incompat flag" transaction before then doing it's normal work...
> > > > > 
> > > > > This should be rare enough it doesn't have any measurable
> > > > > performance overhead, and it's flexible enough to support any log
> > > > > incompat feature we might need to implement...
> > > > > 
> > > > 
> > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > the updated superblock has to be written back before we're allowed to
> > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > can attempt log recovery with incompatible items present if the
> > > > filesystem crashes before the superblock is written back.
> > > 
> > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > the work in the context that is going to need it, and set a wait
> > > flag for all incoming transactions that need a log incompat flag to
> > > wait for it do it's work.  Once it's done and the flag is set, it
> > > can continue and wake all the waiters now that the log incompat flag
> > > has been set. Anything that doesn't need a log incompat flag can
> > > just keep going and doesn't ever get blocked....
> > > 
> > 
> > It would have to be a sync transaction plus sync AIL force in
> > transaction allocation context if we were to log the superblock change,
> > which sounds a bit hairy...
> 
> Well, we already do sync AIL forces in transaction reservation when
> we run out of log space, so there's no technical reason for this
> being a problem at all. xfs_trans_alloc() is expected to block
> waiting on AIL tail pushing....
> 
> > > I suspect this is one of the rare occasions where an unlogged
> > > modification makes an awful lot of sense: we don't even log that we
> > > are adding a log incompat flag, we just do an atomic synchronous
> > > write straight to the superblock to set the incompat flag(s). The
> > > entire modification can be done under the superblock buffer lock to
> > > serialise multiple transactions all trying to set incompat bits, and
> > > we don't set the in-memory superblock incompat bit until after it
> > > has been set and written to disk. Hence multiple waits can check the
> > > flag after they've got the sb buffer lock, and they'll see that it's
> > > already been set and just continue...
> > > 
> > 
> > Agreed. That is a notable simplification and I think much more
> > preferable than the above for the dynamic approach.
> > 
> > That said, note that dynamic feature bits might introduce complexity in
> > more subtle ways. For example, nothing that I can see currently
> > serializes idle log covering with an active transaction (that may have
> > just set an incompat bit via some hook yet not committed anything to the
> > log subsystem), so it might not be as simple as just adding a hook
> > somewhere.
> 
> Right, we had to make log covering away of the CIL to prevent it
> from idling while there were multiple active committed transactions
> in memory. So the state machine only progresses if both the CIL and
> AIL are empty. If we had some way of knowing that a transaction is
> in progress, we could check that in xfs_log_need_covered() and we'd
> stop the state machine progress at that point. But we got rid of the
> active transaction counter that we could use for that....
> 
> [Hmmm, didn't I recently have a patch that re-introduced that
> counter to fix some other "we need to know if there's an active
> transaction running" issue? Can't remember what that was now...]
> 

I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
the m_active_trans counter"). We subsequently discussed reintroducing
the same concept for the quotaoff rework [1], which might be what you're
thinking of. That uses a percpu rwsem since we don't really need a
counter, but I suspect could be reused for serialization in this use
case as well (assuming I can get some reviews on it.. ;).

FWIW, I was considering putting those quotaoff patches ahead of the log
covering work so we could reuse that code again in attr quiesce, but I
think I'm pretty close to being able to remove that particular usage
entirely.

[1] https://lore.kernel.org/linux-xfs/20201001150310.141467-1-bfoster@redhat.com/

> > > This gets rid of the whole "what about a log containing an item that
> > > sets the incompat bit" problem, and it provides a simple means of
> > > serialising and co-ordinating setting of a log incompat flag....
> > > 
> > > > My question is how flexible do we really need to make incompatible log
> > > > recovery support? Why not just commit the superblock once at mount time
> > > > with however many bits the current kernel supports and clear them on
> > > > unmount? (Or perhaps consider a lazy setting variant where we set all
> > > > supported bits on the first modification..?)
> > > 
> > > We don't want to set the incompat bits if we don't need to. That
> > > just guarantees user horror stories that start with "boot system
> > > with new kernel, crash, go back to old kernel, can't mount root
> > > filesystem anymore".
> > > 
> > 
> > Indeed, that is a potential wart with just setting bits on mount. I do
> > think this is likely to be the case with or without dynamic feature
> > bits, because at least in certain cases we'll be setting incompat bits
> > in short order anyways. E.g., one of the primary use cases here is for
> > xattrs, which is likely to be active on any root filesystem via things
> > like SELinux, etc. Point being, all it takes is one feature bit
> > associated with some core operation to introduce this risky update
> > scenario in practice.
> 
> That may well be the case for some distros and some root
> filesystems, and that's an argument against using log incompat flags
> for the -xattr feature-. It's not an argument against
> dynamically setting and clearing log incompat features in general.
> 

Sure. I mentioned in past mails that my concerns/feedback depend heavily
on use case. xattrs is one of the two (?) or so motivating this work.

> That is, if xattrs are so wide spread that we expose users to
> "upgrade-fail-can't downgrade" by use of a dynamic log incompat
> flag, then we should not be making that feature dynamic and
> "autoset". In this situation, it needs to be opt-in and planned,
> likely done in maintenance downtime rather than a side effect of a
> kernel upgrade.
> 
> So, yeah, this discussion is making me think that the xattr logging
> upgrade is going to need a full ATTR3 feature bit like the other
> ATTR and ATTR2 feature bits, not just a log incompat bit...
> 

Perhaps. Not using this at all for xattrs does address quite a bit of my
concerns, but I think if we wanted the potential flexibility of the log
incompat bit down the road, it might be reasonable to manage the
experimental cycle "manually" as described above (i.e., essentially
don't set/clear that bit automatically for a period of time). I don't
feel strongly about one approach over the other in that regard, though,
just that we don't immediately turn the mechanism on right out of the
gate because the feature bit mechanism happens to support it.

> > I dunno... I'm just trying to explore whether we can simplify this whole
> > concept to something more easily managed and less likely to cause us
> > headache. I'm a bit concerned that we're disregarding other tradeoffs
> > like the complexity noted above, the risk and cost of bugs in the
> > mechanism itself (because log recovery has historically been so well
> > tested.. :P) or whether the idea of new kernels immediately delivering
> > new incompat log formats is a robust/reliable solution in the first
> > place. IIRC, the last time we did this was ICREATE and that was hidden
> > behind the v5 update. IOW, for certain things like the xattr rework, I'd
> > think that kind of experimental stabilization cycle is warranted before
> > we'd consider enabling such a feature, even dynamically (which means a
> > revertible kernel should be available in common/incremental upgrade
> > cases).
> 
> IMO, the xattr logging rework is most definitely under the
> EXPERIMENTAL umbrella and that was always going to be the case.
> Also, I don't think we're ignoring the potential complexity of
> dynamically setting/clearing stuff - otherwise we wouldn't be having
> this conversation about how simple we can actually make it. If it
> turns out that we can't do it simply, then setting/clearing at
> mount/unmount should be considered "plan B"....
> 

I'm more approaching this from a "what are the requirements and how/why
do they justify the associated complexity?" angle. That's why I'm asking
things like how much difference does a dynamic bit really make for
something like xattrs. But I agree that's less of a concern when
associated with more obscure or rarely used operations, so on balance I
think that's a fair approach to this mechanism provided we consider
suitability on a per feature basis.

> But right now, I think the discussion has come up with some ideas to
> greatly simplify the dynamic flag setting + clearing....
> 

Agreed, thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Eric Sandeen Dec. 15, 2020, 2:56 p.m. UTC | #19
On 12/11/20 5:35 PM, Darrick J. Wong wrote:
>>> Finally, no, we can't have a truly empty log while the filesystem is
>>> mounted because log transaction records must not be empty. Further,
>>> we can only bring the log down to an "empty but not quite clean"
>>> state while mounted, because if the log is actually marked clean and
>>> we crash then log recovery will not run and so we will not clean up
>>> files that were open-but-unlinked when the system crashed.

> Whatever happened to Eric's patchset to make us process unlinked inodes
> at mount time so that freeze images wouldn't require recovery?

It suffered a fate typical of my patches.  You observed an issue in a test,
Dave did some brief triage, and I haven't gotten back to it.

I'd still like to do that if we can, the requirement to run recovery and/or
use xfs-specific mount options to get a snapshot mounted is still a weird wart.
Pretty sure I've seen at least one user who thought we completely failed
to generate consistent snapshots because "the log was dirty" and
"I had to run repair."

So it'd still be nice to get rid of that behavior IMHO.

-Eric
Darrick J. Wong Jan. 7, 2021, 11:28 p.m. UTC | #20
On Tue, Dec 15, 2020 at 08:50:03AM -0500, Brian Foster wrote:
> On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> > On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > > flags field into the transaction reservation structure, and if
> > > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > > incompat flag" transaction before then doing it's normal work...
> > > > > > 
> > > > > > This should be rare enough it doesn't have any measurable
> > > > > > performance overhead, and it's flexible enough to support any log
> > > > > > incompat feature we might need to implement...
> > > > > > 
> > > > > 
> > > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > > the updated superblock has to be written back before we're allowed to
> > > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > > can attempt log recovery with incompatible items present if the
> > > > > filesystem crashes before the superblock is written back.
> > > > 
> > > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > > the work in the context that is going to need it, and set a wait
> > > > flag for all incoming transactions that need a log incompat flag to
> > > > wait for it do it's work.  Once it's done and the flag is set, it
> > > > can continue and wake all the waiters now that the log incompat flag
> > > > has been set. Anything that doesn't need a log incompat flag can
> > > > just keep going and doesn't ever get blocked....
> > > > 
> > > 
> > > It would have to be a sync transaction plus sync AIL force in
> > > transaction allocation context if we were to log the superblock change,
> > > which sounds a bit hairy...
> > 
> > Well, we already do sync AIL forces in transaction reservation when
> > we run out of log space, so there's no technical reason for this
> > being a problem at all. xfs_trans_alloc() is expected to block
> > waiting on AIL tail pushing....
> > 
> > > > I suspect this is one of the rare occasions where an unlogged
> > > > modification makes an awful lot of sense: we don't even log that we
> > > > are adding a log incompat flag, we just do an atomic synchronous
> > > > write straight to the superblock to set the incompat flag(s). The
> > > > entire modification can be done under the superblock buffer lock to
> > > > serialise multiple transactions all trying to set incompat bits, and
> > > > we don't set the in-memory superblock incompat bit until after it
> > > > has been set and written to disk. Hence multiple waits can check the
> > > > flag after they've got the sb buffer lock, and they'll see that it's
> > > > already been set and just continue...
> > > > 
> > > 
> > > Agreed. That is a notable simplification and I think much more
> > > preferable than the above for the dynamic approach.
> > > 
> > > That said, note that dynamic feature bits might introduce complexity in
> > > more subtle ways. For example, nothing that I can see currently
> > > serializes idle log covering with an active transaction (that may have
> > > just set an incompat bit via some hook yet not committed anything to the
> > > log subsystem), so it might not be as simple as just adding a hook
> > > somewhere.
> > 
> > Right, we had to make log covering away of the CIL to prevent it
> > from idling while there were multiple active committed transactions
> > in memory. So the state machine only progresses if both the CIL and
> > AIL are empty. If we had some way of knowing that a transaction is
> > in progress, we could check that in xfs_log_need_covered() and we'd
> > stop the state machine progress at that point. But we got rid of the
> > active transaction counter that we could use for that....
> > 
> > [Hmmm, didn't I recently have a patch that re-introduced that
> > counter to fix some other "we need to know if there's an active
> > transaction running" issue? Can't remember what that was now...]
> > 
> 
> I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
> the m_active_trans counter"). We subsequently discussed reintroducing
> the same concept for the quotaoff rework [1], which might be what you're
> thinking of. That uses a percpu rwsem since we don't really need a
> counter, but I suspect could be reused for serialization in this use
> case as well (assuming I can get some reviews on it.. ;).
> 
> FWIW, I was considering putting those quotaoff patches ahead of the log
> covering work so we could reuse that code again in attr quiesce, but I
> think I'm pretty close to being able to remove that particular usage
> entirely.

I was thinking about using a rwsem to protect the log incompat flags --
code that thinks it might use a protected feature takes the lock in
read mode until commit; and the log covering code only clears the
flags if down_write_trylock succeeds.  That constrains the overhead to
threads that are trying to use the feature, instead of making all
threads pay the cost of bumping the counter.

> [1] https://lore.kernel.org/linux-xfs/20201001150310.141467-1-bfoster@redhat.com/
> 
> > > > This gets rid of the whole "what about a log containing an item that
> > > > sets the incompat bit" problem, and it provides a simple means of
> > > > serialising and co-ordinating setting of a log incompat flag....
> > > > 
> > > > > My question is how flexible do we really need to make incompatible log
> > > > > recovery support? Why not just commit the superblock once at mount time
> > > > > with however many bits the current kernel supports and clear them on
> > > > > unmount? (Or perhaps consider a lazy setting variant where we set all
> > > > > supported bits on the first modification..?)
> > > > 
> > > > We don't want to set the incompat bits if we don't need to. That
> > > > just guarantees user horror stories that start with "boot system
> > > > with new kernel, crash, go back to old kernel, can't mount root
> > > > filesystem anymore".
> > > > 
> > > 
> > > Indeed, that is a potential wart with just setting bits on mount. I do
> > > think this is likely to be the case with or without dynamic feature
> > > bits, because at least in certain cases we'll be setting incompat bits
> > > in short order anyways. E.g., one of the primary use cases here is for
> > > xattrs, which is likely to be active on any root filesystem via things
> > > like SELinux, etc. Point being, all it takes is one feature bit
> > > associated with some core operation to introduce this risky update
> > > scenario in practice.
> > 
> > That may well be the case for some distros and some root
> > filesystems, and that's an argument against using log incompat flags
> > for the -xattr feature-. It's not an argument against
> > dynamically setting and clearing log incompat features in general.
> > 
> 
> Sure. I mentioned in past mails that my concerns/feedback depend heavily
> on use case. xattrs is one of the two (?) or so motivating this work.
> 
> > That is, if xattrs are so wide spread that we expose users to
> > "upgrade-fail-can't downgrade" by use of a dynamic log incompat
> > flag, then we should not be making that feature dynamic and
> > "autoset". In this situation, it needs to be opt-in and planned,
> > likely done in maintenance downtime rather than a side effect of a
> > kernel upgrade.
> > 
> > So, yeah, this discussion is making me think that the xattr logging
> > upgrade is going to need a full ATTR3 feature bit like the other
> > ATTR and ATTR2 feature bits, not just a log incompat bit...
> > 
> 
> Perhaps. Not using this at all for xattrs does address quite a bit of my
> concerns, but I think if we wanted the potential flexibility of the log
> incompat bit down the road, it might be reasonable to manage the
> experimental cycle "manually" as described above (i.e., essentially
> don't set/clear that bit automatically for a period of time). I don't
> feel strongly about one approach over the other in that regard, though,
> just that we don't immediately turn the mechanism on right out of the
> gate because the feature bit mechanism happens to support it.

I suggested to Allison that enabling logged xattrs should be (for now) a
CONFIG_XFS_DEBUG=y mount option so that only bleeding edge people
actually get the new functionality.  As we build confidence in the
feature we can think about letting the kernel turn it on automatically.

As for a persistent feature flag, let's use directory parent pointers
since that will force us to create a new rocompat flag anyway.

> > > I dunno... I'm just trying to explore whether we can simplify this whole
> > > concept to something more easily managed and less likely to cause us
> > > headache. I'm a bit concerned that we're disregarding other tradeoffs
> > > like the complexity noted above, the risk and cost of bugs in the
> > > mechanism itself (because log recovery has historically been so well
> > > tested.. :P) or whether the idea of new kernels immediately delivering
> > > new incompat log formats is a robust/reliable solution in the first
> > > place. IIRC, the last time we did this was ICREATE and that was hidden
> > > behind the v5 update. IOW, for certain things like the xattr rework, I'd
> > > think that kind of experimental stabilization cycle is warranted before
> > > we'd consider enabling such a feature, even dynamically (which means a
> > > revertible kernel should be available in common/incremental upgrade
> > > cases).
> > 
> > IMO, the xattr logging rework is most definitely under the
> > EXPERIMENTAL umbrella and that was always going to be the case.
> > Also, I don't think we're ignoring the potential complexity of
> > dynamically setting/clearing stuff - otherwise we wouldn't be having
> > this conversation about how simple we can actually make it. If it
> > turns out that we can't do it simply, then setting/clearing at
> > mount/unmount should be considered "plan B"....
> > 
> 
> I'm more approaching this from a "what are the requirements and how/why
> do they justify the associated complexity?" angle. That's why I'm asking
> things like how much difference does a dynamic bit really make for
> something like xattrs. But I agree that's less of a concern when
> associated with more obscure or rarely used operations, so on balance I
> think that's a fair approach to this mechanism provided we consider
> suitability on a per feature basis.

Hm.  If I had to peer into my crystal ball I'd guess that the current
xattr logging scheme works fine for most xattr users, so I wouldn't
worry much about the dynamic bit.

However, I could see things like atomic range exchange being more
popular, in which case people might notice the overhead of tracking when
we can turn off the feature bit...

--D

> > But right now, I think the discussion has come up with some ideas to
> > greatly simplify the dynamic flag setting + clearing....
> > 
> 
> Agreed, thanks.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Dave Chinner Jan. 13, 2021, 9:31 p.m. UTC | #21
On Thu, Jan 07, 2021 at 03:28:21PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 15, 2020 at 08:50:03AM -0500, Brian Foster wrote:
> > On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> > > On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > > > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > > > flags field into the transaction reservation structure, and if
> > > > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > > > incompat flag" transaction before then doing it's normal work...
> > > > > > > 
> > > > > > > This should be rare enough it doesn't have any measurable
> > > > > > > performance overhead, and it's flexible enough to support any log
> > > > > > > incompat feature we might need to implement...
> > > > > > > 
> > > > > > 
> > > > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > > > the updated superblock has to be written back before we're allowed to
> > > > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > > > can attempt log recovery with incompatible items present if the
> > > > > > filesystem crashes before the superblock is written back.
> > > > > 
> > > > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > > > the work in the context that is going to need it, and set a wait
> > > > > flag for all incoming transactions that need a log incompat flag to
> > > > > wait for it do it's work.  Once it's done and the flag is set, it
> > > > > can continue and wake all the waiters now that the log incompat flag
> > > > > has been set. Anything that doesn't need a log incompat flag can
> > > > > just keep going and doesn't ever get blocked....
> > > > > 
> > > > 
> > > > It would have to be a sync transaction plus sync AIL force in
> > > > transaction allocation context if we were to log the superblock change,
> > > > which sounds a bit hairy...
> > > 
> > > Well, we already do sync AIL forces in transaction reservation when
> > > we run out of log space, so there's no technical reason for this
> > > being a problem at all. xfs_trans_alloc() is expected to block
> > > waiting on AIL tail pushing....
> > > 
> > > > > I suspect this is one of the rare occasions where an unlogged
> > > > > modification makes an awful lot of sense: we don't even log that we
> > > > > are adding a log incompat flag, we just do an atomic synchronous
> > > > > write straight to the superblock to set the incompat flag(s). The
> > > > > entire modification can be done under the superblock buffer lock to
> > > > > serialise multiple transactions all trying to set incompat bits, and
> > > > > we don't set the in-memory superblock incompat bit until after it
> > > > > has been set and written to disk. Hence multiple waits can check the
> > > > > flag after they've got the sb buffer lock, and they'll see that it's
> > > > > already been set and just continue...
> > > > > 
> > > > 
> > > > Agreed. That is a notable simplification and I think much more
> > > > preferable than the above for the dynamic approach.
> > > > 
> > > > That said, note that dynamic feature bits might introduce complexity in
> > > > more subtle ways. For example, nothing that I can see currently
> > > > serializes idle log covering with an active transaction (that may have
> > > > just set an incompat bit via some hook yet not committed anything to the
> > > > log subsystem), so it might not be as simple as just adding a hook
> > > > somewhere.
> > > 
> > > Right, we had to make log covering away of the CIL to prevent it
> > > from idling while there were multiple active committed transactions
> > > in memory. So the state machine only progresses if both the CIL and
> > > AIL are empty. If we had some way of knowing that a transaction is
> > > in progress, we could check that in xfs_log_need_covered() and we'd
> > > stop the state machine progress at that point. But we got rid of the
> > > active transaction counter that we could use for that....
> > > 
> > > [Hmmm, didn't I recently have a patch that re-introduced that
> > > counter to fix some other "we need to know if there's an active
> > > transaction running" issue? Can't remember what that was now...]
> > > 
> > 
> > I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
> > the m_active_trans counter"). We subsequently discussed reintroducing
> > the same concept for the quotaoff rework [1], which might be what you're
> > thinking of. That uses a percpu rwsem since we don't really need a
> > counter, but I suspect could be reused for serialization in this use
> > case as well (assuming I can get some reviews on it.. ;).
> > 
> > FWIW, I was considering putting those quotaoff patches ahead of the log
> > covering work so we could reuse that code again in attr quiesce, but I
> > think I'm pretty close to being able to remove that particular usage
> > entirely.
> 
> I was thinking about using a rwsem to protect the log incompat flags --
> code that thinks it might use a protected feature takes the lock in
> read mode until commit; and the log covering code only clears the
> flags if down_write_trylock succeeds.  That constrains the overhead to
> threads that are trying to use the feature, instead of making all
> threads pay the cost of bumping the counter.

If you are going to do that, make it a per-cpu rwsem, because we
really only care about the global shared read overhead in the hot
paths and not the overhead of taking it in write mode if
it is only the log covering code that does that...

> > I'm more approaching this from a "what are the requirements and how/why
> > do they justify the associated complexity?" angle. That's why I'm asking
> > things like how much difference does a dynamic bit really make for
> > something like xattrs. But I agree that's less of a concern when
> > associated with more obscure or rarely used operations, so on balance I
> > think that's a fair approach to this mechanism provided we consider
> > suitability on a per feature basis.
> 
> Hm.  If I had to peer into my crystal ball I'd guess that the current
> xattr logging scheme works fine for most xattr users, so I wouldn't
> worry much about the dynamic bit.
> 
> However, I could see things like atomic range exchange being more
> popular, in which case people might notice the overhead of tracking when
> we can turn off the feature bit...

Hence a per-cpu rwsem... :)

Cheers,

Dave.
Darrick J. Wong Jan. 14, 2021, 2:25 a.m. UTC | #22
On Thu, Jan 14, 2021 at 08:31:05AM +1100, Dave Chinner wrote:
> On Thu, Jan 07, 2021 at 03:28:21PM -0800, Darrick J. Wong wrote:
> > On Tue, Dec 15, 2020 at 08:50:03AM -0500, Brian Foster wrote:
> > > On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> > > > On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > > > > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > > > > flags field into the transaction reservation structure, and if
> > > > > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > > > > incompat flag" transaction before then doing it's normal work...
> > > > > > > > 
> > > > > > > > This should be rare enough it doesn't have any measurable
> > > > > > > > performance overhead, and it's flexible enough to support any log
> > > > > > > > incompat feature we might need to implement...
> > > > > > > > 
> > > > > > > 
> > > > > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > > > > the updated superblock has to be written back before we're allowed to
> > > > > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > > > > can attempt log recovery with incompatible items present if the
> > > > > > > filesystem crashes before the superblock is written back.
> > > > > > 
> > > > > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > > > > the work in the context that is going to need it, and set a wait
> > > > > > flag for all incoming transactions that need a log incompat flag to
> > > > > > wait for it do it's work.  Once it's done and the flag is set, it
> > > > > > can continue and wake all the waiters now that the log incompat flag
> > > > > > has been set. Anything that doesn't need a log incompat flag can
> > > > > > just keep going and doesn't ever get blocked....
> > > > > > 
> > > > > 
> > > > > It would have to be a sync transaction plus sync AIL force in
> > > > > transaction allocation context if we were to log the superblock change,
> > > > > which sounds a bit hairy...
> > > > 
> > > > Well, we already do sync AIL forces in transaction reservation when
> > > > we run out of log space, so there's no technical reason for this
> > > > being a problem at all. xfs_trans_alloc() is expected to block
> > > > waiting on AIL tail pushing....
> > > > 
> > > > > > I suspect this is one of the rare occasions where an unlogged
> > > > > > modification makes an awful lot of sense: we don't even log that we
> > > > > > are adding a log incompat flag, we just do an atomic synchronous
> > > > > > write straight to the superblock to set the incompat flag(s). The
> > > > > > entire modification can be done under the superblock buffer lock to
> > > > > > serialise multiple transactions all trying to set incompat bits, and
> > > > > > we don't set the in-memory superblock incompat bit until after it
> > > > > > has been set and written to disk. Hence multiple waits can check the
> > > > > > flag after they've got the sb buffer lock, and they'll see that it's
> > > > > > already been set and just continue...
> > > > > > 
> > > > > 
> > > > > Agreed. That is a notable simplification and I think much more
> > > > > preferable than the above for the dynamic approach.
> > > > > 
> > > > > That said, note that dynamic feature bits might introduce complexity in
> > > > > more subtle ways. For example, nothing that I can see currently
> > > > > serializes idle log covering with an active transaction (that may have
> > > > > just set an incompat bit via some hook yet not committed anything to the
> > > > > log subsystem), so it might not be as simple as just adding a hook
> > > > > somewhere.
> > > > 
> > > > Right, we had to make log covering away of the CIL to prevent it
> > > > from idling while there were multiple active committed transactions
> > > > in memory. So the state machine only progresses if both the CIL and
> > > > AIL are empty. If we had some way of knowing that a transaction is
> > > > in progress, we could check that in xfs_log_need_covered() and we'd
> > > > stop the state machine progress at that point. But we got rid of the
> > > > active transaction counter that we could use for that....
> > > > 
> > > > [Hmmm, didn't I recently have a patch that re-introduced that
> > > > counter to fix some other "we need to know if there's an active
> > > > transaction running" issue? Can't remember what that was now...]
> > > > 
> > > 
> > > I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
> > > the m_active_trans counter"). We subsequently discussed reintroducing
> > > the same concept for the quotaoff rework [1], which might be what you're
> > > thinking of. That uses a percpu rwsem since we don't really need a
> > > counter, but I suspect could be reused for serialization in this use
> > > case as well (assuming I can get some reviews on it.. ;).
> > > 
> > > FWIW, I was considering putting those quotaoff patches ahead of the log
> > > covering work so we could reuse that code again in attr quiesce, but I
> > > think I'm pretty close to being able to remove that particular usage
> > > entirely.
> > 
> > I was thinking about using a rwsem to protect the log incompat flags --
> > code that thinks it might use a protected feature takes the lock in
> > read mode until commit; and the log covering code only clears the
> > flags if down_write_trylock succeeds.  That constrains the overhead to
> > threads that are trying to use the feature, instead of making all
> > threads pay the cost of bumping the counter.
> 
> If you are going to do that, make it a per-cpu rwsem, because we
> really only care about the global shared read overhead in the hot
> paths and not the overhead of taking it in write mode if
> it is only the log covering code that does that...
> 
> > > I'm more approaching this from a "what are the requirements and how/why
> > > do they justify the associated complexity?" angle. That's why I'm asking
> > > things like how much difference does a dynamic bit really make for
> > > something like xattrs. But I agree that's less of a concern when
> > > associated with more obscure or rarely used operations, so on balance I
> > > think that's a fair approach to this mechanism provided we consider
> > > suitability on a per feature basis.
> > 
> > Hm.  If I had to peer into my crystal ball I'd guess that the current
> > xattr logging scheme works fine for most xattr users, so I wouldn't
> > worry much about the dynamic bit.
> > 
> > However, I could see things like atomic range exchange being more
> > popular, in which case people might notice the overhead of tracking when
> > we can turn off the feature bit...
> 
> Hence a per-cpu rwsem... :)

Yup, it seems to work fine, though now I'm distracted over the posteof
cleanup serieses... :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Jan. 14, 2021, 10:30 a.m. UTC | #23
On Wed, Jan 13, 2021 at 06:25:47PM -0800, Darrick J. Wong wrote:
> On Thu, Jan 14, 2021 at 08:31:05AM +1100, Dave Chinner wrote:
> > On Thu, Jan 07, 2021 at 03:28:21PM -0800, Darrick J. Wong wrote:
> > > On Tue, Dec 15, 2020 at 08:50:03AM -0500, Brian Foster wrote:
> > > > On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> > > > > On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > > > > > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > > > > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > > > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > > > > > flags field into the transaction reservation structure, and if
> > > > > > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > > > > > incompat flag" transaction before then doing it's normal work...
> > > > > > > > > 
> > > > > > > > > This should be rare enough it doesn't have any measurable
> > > > > > > > > performance overhead, and it's flexible enough to support any log
> > > > > > > > > incompat feature we might need to implement...
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > > > > > the updated superblock has to be written back before we're allowed to
> > > > > > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > > > > > can attempt log recovery with incompatible items present if the
> > > > > > > > filesystem crashes before the superblock is written back.
> > > > > > > 
> > > > > > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > > > > > the work in the context that is going to need it, and set a wait
> > > > > > > flag for all incoming transactions that need a log incompat flag to
> > > > > > > wait for it do it's work.  Once it's done and the flag is set, it
> > > > > > > can continue and wake all the waiters now that the log incompat flag
> > > > > > > has been set. Anything that doesn't need a log incompat flag can
> > > > > > > just keep going and doesn't ever get blocked....
> > > > > > > 
> > > > > > 
> > > > > > It would have to be a sync transaction plus sync AIL force in
> > > > > > transaction allocation context if we were to log the superblock change,
> > > > > > which sounds a bit hairy...
> > > > > 
> > > > > Well, we already do sync AIL forces in transaction reservation when
> > > > > we run out of log space, so there's no technical reason for this
> > > > > being a problem at all. xfs_trans_alloc() is expected to block
> > > > > waiting on AIL tail pushing....
> > > > > 
> > > > > > > I suspect this is one of the rare occasions where an unlogged
> > > > > > > modification makes an awful lot of sense: we don't even log that we
> > > > > > > are adding a log incompat flag, we just do an atomic synchronous
> > > > > > > write straight to the superblock to set the incompat flag(s). The
> > > > > > > entire modification can be done under the superblock buffer lock to
> > > > > > > serialise multiple transactions all trying to set incompat bits, and
> > > > > > > we don't set the in-memory superblock incompat bit until after it
> > > > > > > has been set and written to disk. Hence multiple waits can check the
> > > > > > > flag after they've got the sb buffer lock, and they'll see that it's
> > > > > > > already been set and just continue...
> > > > > > > 
> > > > > > 
> > > > > > Agreed. That is a notable simplification and I think much more
> > > > > > preferable than the above for the dynamic approach.
> > > > > > 
> > > > > > That said, note that dynamic feature bits might introduce complexity in
> > > > > > more subtle ways. For example, nothing that I can see currently
> > > > > > serializes idle log covering with an active transaction (that may have
> > > > > > just set an incompat bit via some hook yet not committed anything to the
> > > > > > log subsystem), so it might not be as simple as just adding a hook
> > > > > > somewhere.
> > > > > 
> > > > > Right, we had to make log covering away of the CIL to prevent it
> > > > > from idling while there were multiple active committed transactions
> > > > > in memory. So the state machine only progresses if both the CIL and
> > > > > AIL are empty. If we had some way of knowing that a transaction is
> > > > > in progress, we could check that in xfs_log_need_covered() and we'd
> > > > > stop the state machine progress at that point. But we got rid of the
> > > > > active transaction counter that we could use for that....
> > > > > 
> > > > > [Hmmm, didn't I recently have a patch that re-introduced that
> > > > > counter to fix some other "we need to know if there's an active
> > > > > transaction running" issue? Can't remember what that was now...]
> > > > > 
> > > > 
> > > > I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
> > > > the m_active_trans counter"). We subsequently discussed reintroducing
> > > > the same concept for the quotaoff rework [1], which might be what you're
> > > > thinking of. That uses a percpu rwsem since we don't really need a
> > > > counter, but I suspect could be reused for serialization in this use
> > > > case as well (assuming I can get some reviews on it.. ;).
> > > > 
> > > > FWIW, I was considering putting those quotaoff patches ahead of the log
> > > > covering work so we could reuse that code again in attr quiesce, but I
> > > > think I'm pretty close to being able to remove that particular usage
> > > > entirely.
> > > 
> > > I was thinking about using a rwsem to protect the log incompat flags --
> > > code that thinks it might use a protected feature takes the lock in
> > > read mode until commit; and the log covering code only clears the
> > > flags if down_write_trylock succeeds.  That constrains the overhead to
> > > threads that are trying to use the feature, instead of making all
> > > threads pay the cost of bumping the counter.
> > 
> > If you are going to do that, make it a per-cpu rwsem, because we
> > really only care about the global shared read overhead in the hot
> > paths and not the overhead of taking it in write mode if
> > it is only the log covering code that does that...
> > 
> > > > I'm more approaching this from a "what are the requirements and how/why
> > > > do they justify the associated complexity?" angle. That's why I'm asking
> > > > things like how much difference does a dynamic bit really make for
> > > > something like xattrs. But I agree that's less of a concern when
> > > > associated with more obscure or rarely used operations, so on balance I
> > > > think that's a fair approach to this mechanism provided we consider
> > > > suitability on a per feature basis.
> > > 
> > > Hm.  If I had to peer into my crystal ball I'd guess that the current
> > > xattr logging scheme works fine for most xattr users, so I wouldn't
> > > worry much about the dynamic bit.
> > > 
> > > However, I could see things like atomic range exchange being more
> > > popular, in which case people might notice the overhead of tracking when
> > > we can turn off the feature bit...
> > 
> > Hence a per-cpu rwsem... :)
> 
> Yup, it seems to work fine, though now I'm distracted over the posteof
> cleanup serieses... :)
> 

FWIW, I have a patch on the list from a few months ago that introduces a
transaction percpu rwsem for the quotaoff rework:

https://lore.kernel.org/linux-xfs/20201001150310.141467-3-bfoster@redhat.com/

Perhaps I should repost?

Brian

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Darrick J. Wong Jan. 21, 2021, 12:12 a.m. UTC | #24
On Thu, Jan 14, 2021 at 05:30:47AM -0500, Brian Foster wrote:
> On Wed, Jan 13, 2021 at 06:25:47PM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 14, 2021 at 08:31:05AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 07, 2021 at 03:28:21PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Dec 15, 2020 at 08:50:03AM -0500, Brian Foster wrote:
> > > > > On Tue, Dec 15, 2020 at 07:54:56AM +1100, Dave Chinner wrote:
> > > > > > On Mon, Dec 14, 2020 at 10:58:31AM -0500, Brian Foster wrote:
> > > > > > > On Sun, Dec 13, 2020 at 08:14:39AM +1100, Dave Chinner wrote:
> > > > > > > > On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> > > > > > > > > On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > > > > > > > > > As for a mechanism for dynamically adding log incompat flags?
> > > > > > > > > > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > > > > > > > > > flags field into the transaction reservation structure, and if
> > > > > > > > > > xfs_trans_alloc() sees an incompat field set and the superblock
> > > > > > > > > > doesn't have it set, the first thing it does is run a "set log
> > > > > > > > > > incompat flag" transaction before then doing it's normal work...
> > > > > > > > > > 
> > > > > > > > > > This should be rare enough it doesn't have any measurable
> > > > > > > > > > performance overhead, and it's flexible enough to support any log
> > > > > > > > > > incompat feature we might need to implement...
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > But I don't think that is sufficient. As Darrick pointed out up-thread,
> > > > > > > > > the updated superblock has to be written back before we're allowed to
> > > > > > > > > commit transactions with incompatible items. Otherwise, an older kernel
> > > > > > > > > can attempt log recovery with incompatible items present if the
> > > > > > > > > filesystem crashes before the superblock is written back.
> > > > > > > > 
> > > > > > > > Sure, that's what the hook in xfs_trans_alloc() would do. It can do
> > > > > > > > the work in the context that is going to need it, and set a wait
> > > > > > > > flag for all incoming transactions that need a log incompat flag to
> > > > > > > > wait for it do it's work.  Once it's done and the flag is set, it
> > > > > > > > can continue and wake all the waiters now that the log incompat flag
> > > > > > > > has been set. Anything that doesn't need a log incompat flag can
> > > > > > > > just keep going and doesn't ever get blocked....
> > > > > > > > 
> > > > > > > 
> > > > > > > It would have to be a sync transaction plus sync AIL force in
> > > > > > > transaction allocation context if we were to log the superblock change,
> > > > > > > which sounds a bit hairy...
> > > > > > 
> > > > > > Well, we already do sync AIL forces in transaction reservation when
> > > > > > we run out of log space, so there's no technical reason for this
> > > > > > being a problem at all. xfs_trans_alloc() is expected to block
> > > > > > waiting on AIL tail pushing....
> > > > > > 
> > > > > > > > I suspect this is one of the rare occasions where an unlogged
> > > > > > > > modification makes an awful lot of sense: we don't even log that we
> > > > > > > > are adding a log incompat flag, we just do an atomic synchronous
> > > > > > > > write straight to the superblock to set the incompat flag(s). The
> > > > > > > > entire modification can be done under the superblock buffer lock to
> > > > > > > > serialise multiple transactions all trying to set incompat bits, and
> > > > > > > > we don't set the in-memory superblock incompat bit until after it
> > > > > > > > has been set and written to disk. Hence multiple waits can check the
> > > > > > > > flag after they've got the sb buffer lock, and they'll see that it's
> > > > > > > > already been set and just continue...
> > > > > > > > 
> > > > > > > 
> > > > > > > Agreed. That is a notable simplification and I think much more
> > > > > > > preferable than the above for the dynamic approach.
> > > > > > > 
> > > > > > > That said, note that dynamic feature bits might introduce complexity in
> > > > > > > more subtle ways. For example, nothing that I can see currently
> > > > > > > serializes idle log covering with an active transaction (that may have
> > > > > > > just set an incompat bit via some hook yet not committed anything to the
> > > > > > > log subsystem), so it might not be as simple as just adding a hook
> > > > > > > somewhere.
> > > > > > 
> > > > > > Right, we had to make log covering away of the CIL to prevent it
> > > > > > from idling while there were multiple active committed transactions
> > > > > > in memory. So the state machine only progresses if both the CIL and
> > > > > > AIL are empty. If we had some way of knowing that a transaction is
> > > > > > in progress, we could check that in xfs_log_need_covered() and we'd
> > > > > > stop the state machine progress at that point. But we got rid of the
> > > > > > active transaction counter that we could use for that....
> > > > > > 
> > > > > > [Hmmm, didn't I recently have a patch that re-introduced that
> > > > > > counter to fix some other "we need to know if there's an active
> > > > > > transaction running" issue? Can't remember what that was now...]
> > > > > > 
> > > > > 
> > > > > I think you removed it, actually, via commit b41b46c20c0bd ("xfs: remove
> > > > > the m_active_trans counter"). We subsequently discussed reintroducing
> > > > > the same concept for the quotaoff rework [1], which might be what you're
> > > > > thinking of. That uses a percpu rwsem since we don't really need a
> > > > > counter, but I suspect could be reused for serialization in this use
> > > > > case as well (assuming I can get some reviews on it.. ;).
> > > > > 
> > > > > FWIW, I was considering putting those quotaoff patches ahead of the log
> > > > > covering work so we could reuse that code again in attr quiesce, but I
> > > > > think I'm pretty close to being able to remove that particular usage
> > > > > entirely.
> > > > 
> > > > I was thinking about using a rwsem to protect the log incompat flags --
> > > > code that thinks it might use a protected feature takes the lock in
> > > > read mode until commit; and the log covering code only clears the
> > > > flags if down_write_trylock succeeds.  That constrains the overhead to
> > > > threads that are trying to use the feature, instead of making all
> > > > threads pay the cost of bumping the counter.
> > > 
> > > If you are going to do that, make it a per-cpu rwsem, because we
> > > really only care about the global shared read overhead in the hot
> > > paths and not the overhead of taking it in write mode if
> > > it is only the log covering code that does that...
> > > 
> > > > > I'm more approaching this from a "what are the requirements and how/why
> > > > > do they justify the associated complexity?" angle. That's why I'm asking
> > > > > things like how much difference does a dynamic bit really make for
> > > > > something like xattrs. But I agree that's less of a concern when
> > > > > associated with more obscure or rarely used operations, so on balance I
> > > > > think that's a fair approach to this mechanism provided we consider
> > > > > suitability on a per feature basis.
> > > > 
> > > > Hm.  If I had to peer into my crystal ball I'd guess that the current
> > > > xattr logging scheme works fine for most xattr users, so I wouldn't
> > > > worry much about the dynamic bit.
> > > > 
> > > > However, I could see things like atomic range exchange being more
> > > > popular, in which case people might notice the overhead of tracking when
> > > > we can turn off the feature bit...
> > > 
> > > Hence a per-cpu rwsem... :)
> > 
> > Yup, it seems to work fine, though now I'm distracted over the posteof
> > cleanup serieses... :)

Well ok, it works with the /huge/ exception that there is no
percpu_rwsem_down_write_trylock, which means that we can't have the log
opportunistically try to grab it while covering.  That's kind of a
bummer, since I don't think that's at all desirable.

> FWIW, I have a patch on the list from a few months ago that introduces a
> transaction percpu rwsem for the quotaoff rework:
> 
> https://lore.kernel.org/linux-xfs/20201001150310.141467-3-bfoster@redhat.com/
> 
> Perhaps I should repost?

Ok?  Though /me is busy collecting things for 5.12 ATM... :)

--D

> 
> Brian
> 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index c5481444d61e..84806986009c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -495,6 +495,21 @@  xfs_sb_has_incompat_log_feature(
 	return (sbp->sb_features_log_incompat & feature) != 0;
 }
 
+static inline void
+xfs_sb_remove_incompat_log_features(
+	struct xfs_sb	*sbp)
+{
+	sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
+}
+
+static inline void
+xfs_sb_add_incompat_log_features(
+	struct xfs_sb	*sbp,
+	unsigned int	features)
+{
+	sbp->sb_features_log_incompat |= features;
+}
+
 /*
  * V5 superblock specific feature checks
  */
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fa2d05e65ff1..eeb29f4c676b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -738,6 +738,14 @@  xfs_log_mount_finish(
 	 * that aren't removed until recovery is cancelled.
 	 */
 	if (!error && recovered) {
+		/*
+		 * Now that we've recovered the log and all the intents, we can
+		 * clear the log incompat feature bits in the superblock
+		 * because there's no longer anything to protect.  We rely on
+		 * the AIL push to write out the updated superblock after
+		 * everything else.
+		 */
+		error = xfs_clear_incompat_log_features(mp);
 		xfs_log_force(mp, XFS_LOG_SYNC);
 		xfs_ail_push_all_sync(mp->m_ail);
 	}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 669a9cf43582..68a5c90ec65b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1151,6 +1151,14 @@  xfs_unmountfs(
 		xfs_warn(mp, "Unable to update superblock counters. "
 				"Freespace may not be correct on next mount.");
 
+	/*
+	 * Clear log incompat features since we're unmounting cleanly.  Report
+	 * failures, though it's not fatal to have a higher log feature
+	 * protection level than the log contents actually require.
+	 */
+	error = xfs_clear_incompat_log_features(mp);
+	if (error)
+		xfs_warn(mp, "Unable to clear superblock log incompat flags.");
 
 	xfs_log_unmount(mp);
 	xfs_da_unmount(mp);
@@ -1361,6 +1369,117 @@  xfs_force_summary_recalc(
 	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
 }
 
+/*
+ * Enable a log incompat feature flag in the primary superblock.  The caller
+ * cannot have any other transactions in progress.
+ */
+int
+xfs_add_incompat_log_feature(
+	struct xfs_mount	*mp,
+	uint32_t		feature)
+{
+	struct xfs_dsb		*dsb;
+	int			error;
+
+	ASSERT(hweight32(feature) == 1);
+	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
+
+	/*
+	 * Force the log to disk and kick the background AIL thread to reduce
+	 * the chances that the bwrite will stall waiting for the AIL to unpin
+	 * the primary superblock buffer.  This isn't a data integrity
+	 * operation, so we don't need a synchronous push.
+	 */
+	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	if (error)
+		return error;
+	xfs_ail_push_all(mp->m_ail);
+
+	/*
+	 * Lock the primary superblock buffer to serialize all callers that
+	 * are trying to set feature bits.
+	 */
+	xfs_buf_lock(mp->m_sb_bp);
+	xfs_buf_hold(mp->m_sb_bp);
+
+	if (XFS_FORCED_SHUTDOWN(mp)) {
+		error = -EIO;
+		goto rele;
+	}
+
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
+		goto rele;
+
+	/*
+	 * Write the primary superblock to disk immediately, because we need
+	 * the log_incompat bit to be set in the primary super now to protect
+	 * the log items that we're going to commit later.
+	 */
+	dsb = mp->m_sb_bp->b_addr;
+	xfs_sb_to_disk(dsb, &mp->m_sb);
+	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
+	error = xfs_bwrite(mp->m_sb_bp);
+	if (error)
+		goto shutdown;
+
+	/*
+	 * Add the feature bits to the incore superblock before we unlock the
+	 * buffer.
+	 */
+	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
+	xfs_buf_relse(mp->m_sb_bp);
+
+	/* Log the superblock to disk. */
+	return xfs_sync_sb(mp, false);
+shutdown:
+	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
+rele:
+	xfs_buf_relse(mp->m_sb_bp);
+	return error;
+}
+
+/*
+ * Clear all the log incompat flags from the superblock.
+ *
+ * The caller must have freeze protection, cannot have any other transactions
+ * in progress, must ensure that the log does not contain any log items
+ * protected by any log incompat bit, and must ensure that there are no other
+ * threads that could be reading or writing the log incompat feature flag in
+ * the primary super.  In other words, we can only turn features off as one of
+ * the final steps of quiescing or unmounting the log.
+ */
+int
+xfs_clear_incompat_log_features(
+	struct xfs_mount	*mp)
+{
+	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
+	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
+	    XFS_FORCED_SHUTDOWN(mp))
+		return 0;
+
+	/*
+	 * Update the incore superblock.  We synchronize on the primary super
+	 * buffer lock to be consistent with the add function, though at least
+	 * in theory this shouldn't be necessary.
+	 */
+	xfs_buf_lock(mp->m_sb_bp);
+	xfs_buf_hold(mp->m_sb_bp);
+
+	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
+		goto rele;
+
+	xfs_sb_remove_incompat_log_features(&mp->m_sb);
+	xfs_buf_relse(mp->m_sb_bp);
+
+	/* Log the superblock to disk. */
+	return xfs_sync_sb(mp, false);
+rele:
+	xfs_buf_relse(mp->m_sb_bp);
+	return 0;
+}
+
 /*
  * Update the in-core delayed block counter.
  *
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 232f2383e8ba..02bd940d66fb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -441,6 +441,8 @@  int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
 struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
 		int error_class, int error);
 void xfs_force_summary_recalc(struct xfs_mount *mp);
+int xfs_add_incompat_log_feature(struct xfs_mount *mp, uint32_t feature);
+int xfs_clear_incompat_log_features(struct xfs_mount *mp);
 void xfs_mod_delalloc(struct xfs_mount *mp, int64_t delta);
 unsigned int xfs_guess_metadata_threads(struct xfs_mount *mp);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 343435a677f9..71e304c15e6b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -919,6 +919,15 @@  xfs_quiesce_attr(
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
+	/*
+	 * Clear log incompat features since we're freezing or remounting ro.
+	 * Report failures, though it's not fatal to have a higher log feature
+	 * protection level than the log contents actually require.
+	 */
+	error = xfs_clear_incompat_log_features(mp);
+	if (error)
+		xfs_warn(mp, "Unable to clear superblock log incompat flags. "
+				"Frozen image may not be consistent.");
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);