
[6/7] xfs: don't update file system geometry through transaction deltas

Message ID 20240930164211.2357358-7-hch@lst.de (mailing list archive)
State Not Applicable, archived
Series [1/7] xfs: pass the exact range to initialize to xfs_initialize_perag

Commit Message

Christoph Hellwig Sept. 30, 2024, 4:41 p.m. UTC
Updates to the file system geometry in growfs need to be committed to
stable store before the allocator can see them, so that they do not end
up in the same CIL checkpoint as transactions that make use of the new
information, which would make recovery impossible or broken.

To do this add two new helpers to prepare a superblock for direct
manipulation of the on-disk buffer, and to commit these updates while
holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
those in growfs instead of applying the changes through the deltas in the
xfs_trans structure (which also happens to shrink the xfs_trans structure
a fair bit).

The rtbitmap repair code was also using the transaction deltas and is
converted to also update the superblock buffer directly under the buffer
lock.

This new method establishes a locking protocol where even in-core
superblock fields must only be updated with the superblock buffer
locked.  For now it is only applied to affected geometry fields,
but in the future it would make sense to apply it universally.
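
As a rough usage sketch (error handling trimmed, mirroring the
xfs_growfs_imaxpct conversion in the patch below):

	sbp = xfs_prepare_sb_update(tp, &bp);
	sbp->sb_imax_pct = imaxpct;		/* edit the ondisk fields */
	error = xfs_commit_sb_update(tp, bp);	/* sync commit, bp stays locked */
	if (!error)
		mp->m_sb.sb_imax_pct = imaxpct;	/* in-core copy updated last */
	xfs_buf_relse(bp);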

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_sb.c         |  97 ++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_sb.h         |   3 +
 fs/xfs/libxfs/xfs_shared.h     |   8 ---
 fs/xfs/scrub/rtbitmap_repair.c |  26 +++++----
 fs/xfs/xfs_fsops.c             |  80 ++++++++++++++++----------
 fs/xfs/xfs_rtalloc.c           |  92 +++++++++++++++++-------------
 fs/xfs/xfs_trans.c             | 101 ++-------------------------------
 fs/xfs/xfs_trans.h             |   8 ---
 8 files changed, 198 insertions(+), 217 deletions(-)

Comments

Brian Foster Oct. 10, 2024, 2:05 p.m. UTC | #1
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to
> stable store before the allocator can see them, so that they do not end
> up in the same CIL checkpoint as transactions that make use of the new
> information, which would make recovery impossible or broken.
> 

Ok, so we don't want geometry changes transactions in the same CIL
checkpoint as alloc related transactions that might depend on the
geometry changes. That seems reasonable and on a first pass I have an
idea of what this is doing, but the description is kind of vague.
Obviously this fixes an issue on the recovery side (since I've tested
it), but it's not quite clear to me from the description and/or logic
changes how that issue manifests.

Could you elaborate please? For example, is this some kind of race
situation between an allocation request and a growfs transaction, where
the former perhaps sees a newly added AG between the time the growfs
transaction commits (applying the sb deltas) and it actually commits to
the log due to being a sync transaction, thus allowing an alloc on a new
AG into the same checkpoint that adds the AG?

Is there any ordering issue on the recovery side, or is it mainly that
we don't init the in-core perags until after recovery, so log recovery
of grow operations is basically broken unless we don't reference the new
AGs in the log before the geometry update pushes out of the active
log..? I'm trying to grok how much of this patch is fixing a
reproducible bug vs. supporting what the earlier patches are doing on
the recovery side.

I also wonder if/how this might fundamentally apply to shrink, but maybe
that's getting too far into the weeds. IIRC we don't currently support
AG shrink, but can lop off the end of the tail AG if space happens to be
free. So for example, is there any issue where that tail AG space is
freed and immediately shrunk off of the fs? Hm.. maybe the hacky growfs
test should try and include some shrink operations as well.

Anyways, context on the specific problem would make this easier to
review and, IMO, should be included in the commit log anyway for
historical reference if you're going to change how superblock fields are
logged. Just my .02.

Brian

> To do this add two new helpers to prepare a superblock for direct
> manipulation of the on-disk buffer, and to commit these updates while
> holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
> those in growfs instead of applying the changes through the deltas in the
> xfs_trans structure (which also happens to shrink the xfs_trans structure
> a fair bit).
> 
> The rtbitmap repair code was also using the transaction deltas and is
> converted to also update the superblock buffer directly under the buffer
> lock.
> 
> This new method establishes a locking protocol where even in-core
> superblock fields must only be updated with the superblock buffer
> locked.  For now it is only applied to affected geometry fields,
> but in the future it would make sense to apply it universally.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_sb.c         |  97 ++++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_sb.h         |   3 +
>  fs/xfs/libxfs/xfs_shared.h     |   8 ---
>  fs/xfs/scrub/rtbitmap_repair.c |  26 +++++----
>  fs/xfs/xfs_fsops.c             |  80 ++++++++++++++++----------
>  fs/xfs/xfs_rtalloc.c           |  92 +++++++++++++++++-------------
>  fs/xfs/xfs_trans.c             | 101 ++-------------------------------
>  fs/xfs/xfs_trans.h             |   8 ---
>  8 files changed, 198 insertions(+), 217 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba667..2c83ab7441ade5 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1025,6 +1025,80 @@ xfs_sb_mount_common(
>  	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
>  }
>  
> +/*
> + * Mirror the lazy sb counters to the in-core superblock.
> + *
> + * If this is at unmount, the counters will be exactly correct, but at any other
> + * time they will only be ballpark correct because of reservations that have
> + * been taken out percpu counters.  If we have an unclean shutdown, this will be
> + * corrected by log recovery rebuilding the counters from the AGF block counts.
> + *
> + * Do not update sb_frextents here because it is not part of the lazy sb
> + * counters, despite having a percpu counter.  It is always kept consistent with
> + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need
> + * have to update it here.
> + */
> +static void
> +xfs_flush_sb_counters(
> +	struct xfs_mount	*mp)
> +{
> +	if (xfs_has_lazysbcount(mp)) {
> +		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> +		mp->m_sb.sb_ifree = min_t(uint64_t,
> +				percpu_counter_sum_positive(&mp->m_ifree),
> +				mp->m_sb.sb_icount);
> +		mp->m_sb.sb_fdblocks =
> +				percpu_counter_sum_positive(&mp->m_fdblocks);
> +	}
> +}
> +
> +/*
> + * Prepare a direct update to the superblock through the on-disk buffer.
> + *
> + * This locks out other modifications through the buffer lock and then syncs all
> + * in-core values to the on-disk buffer (including the percpu counters).
> + *
> + * The caller then directly manipulates the on-disk fields and calls
> + * xfs_commit_sb_update to write the updates to disk.  The caller is
> + * responsible for also updating the in-core fields, but it can do so after
> + * the transaction has been committed to disk.
> + *
> + * Updating the in-core field only after xfs_commit_sb_update ensures that other
> + * processes only see the update once it is stable on disk, and is usually the
> + * right thing to do for superblock updates.
> + *
> + * Note that writes to superblock fields updated using this helper are
> + * synchronized using the superblock buffer lock, which must be taken around
> + * all updates to the in-core fields as well.
> + */
> +struct xfs_dsb *
> +xfs_prepare_sb_update(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		**bpp)
> +{
> +	*bpp = xfs_trans_getsb(tp);
> +	xfs_flush_sb_counters(tp->t_mountp);
> +	xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
> +	return (*bpp)->b_addr;
> +}
> +
> +/*
> + * Commit a direct update to the on-disk superblock.  Keeps @bp locked and
> + * referenced, so the caller must call xfs_buf_relse() manually.
> + */
> +int
> +xfs_commit_sb_update(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*bp)
> +{
> +	xfs_trans_bhold(tp, bp);
> +	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> +	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +
> +	xfs_trans_set_sync(tp);
> +	return xfs_trans_commit(tp);
> +}
> +
>  /*
>   * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
>   * into the superblock buffer to be logged.  It does not provide the higher
> @@ -1038,28 +1112,7 @@ xfs_log_sb(
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*bp = xfs_trans_getsb(tp);
>  
> -	/*
> -	 * Lazy sb counters don't update the in-core superblock so do that now.
> -	 * If this is at unmount, the counters will be exactly correct, but at
> -	 * any other time they will only be ballpark correct because of
> -	 * reservations that have been taken out percpu counters. If we have an
> -	 * unclean shutdown, this will be corrected by log recovery rebuilding
> -	 * the counters from the AGF block counts.
> -	 *
> -	 * Do not update sb_frextents here because it is not part of the lazy
> -	 * sb counters, despite having a percpu counter. It is always kept
> -	 * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
> -	 * and hence we don't need have to update it here.
> -	 */
> -	if (xfs_has_lazysbcount(mp)) {
> -		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> -		mp->m_sb.sb_ifree = min_t(uint64_t,
> -				percpu_counter_sum_positive(&mp->m_ifree),
> -				mp->m_sb.sb_icount);
> -		mp->m_sb.sb_fdblocks =
> -				percpu_counter_sum_positive(&mp->m_fdblocks);
> -	}
> -
> +	xfs_flush_sb_counters(mp);
>  	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
>  	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 885c837559914d..3649d071687e33 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -13,6 +13,9 @@ struct xfs_trans;
>  struct xfs_fsop_geom;
>  struct xfs_perag;
>  
> +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
> +			struct xfs_buf **bpp);
> +int		xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
>  extern void	xfs_log_sb(struct xfs_trans *tp);
>  extern int	xfs_sync_sb(struct xfs_mount *mp, bool wait);
>  extern int	xfs_sync_sb_buf(struct xfs_mount *mp);
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 33b84a3a83ff63..45a32ea426164a 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -149,14 +149,6 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>  #define	XFS_TRANS_SB_RES_FDBLOCKS	0x00000008
>  #define	XFS_TRANS_SB_FREXTENTS		0x00000010
>  #define	XFS_TRANS_SB_RES_FREXTENTS	0x00000020
> -#define	XFS_TRANS_SB_DBLOCKS		0x00000040
> -#define	XFS_TRANS_SB_AGCOUNT		0x00000080
> -#define	XFS_TRANS_SB_IMAXPCT		0x00000100
> -#define	XFS_TRANS_SB_REXTSIZE		0x00000200
> -#define	XFS_TRANS_SB_RBMBLOCKS		0x00000400
> -#define	XFS_TRANS_SB_RBLOCKS		0x00000800
> -#define	XFS_TRANS_SB_REXTENTS		0x00001000
> -#define	XFS_TRANS_SB_REXTSLOG		0x00002000
>  
>  /*
>   * Here we centralize the specification of XFS meta-data buffer reference count
> diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
> index 0fef98e9f83409..be9d31f032b1bf 100644
> --- a/fs/xfs/scrub/rtbitmap_repair.c
> +++ b/fs/xfs/scrub/rtbitmap_repair.c
> @@ -16,6 +16,7 @@
>  #include "xfs_bit.h"
>  #include "xfs_bmap.h"
>  #include "xfs_bmap_btree.h"
> +#include "xfs_sb.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
>  #include "scrub/trace.h"
> @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry(
>  	struct xchk_rtbitmap	*rtb)
>  {
>  	struct xfs_mount	*mp = sc->mp;
> -	struct xfs_trans	*tp = sc->tp;
>  
>  	/* Superblock fields */
> -	if (mp->m_sb.sb_rextents != rtb->rextents)
> -		xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
> -				rtb->rextents - mp->m_sb.sb_rextents);
> -
> -	if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
> -				rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
> -
> -	if (mp->m_sb.sb_rextslog != rtb->rextslog)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
> -				rtb->rextslog - mp->m_sb.sb_rextslog);
> +	if (mp->m_sb.sb_rextents != rtb->rextents ||
> +	    mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
> +	    mp->m_sb.sb_rextslog != rtb->rextslog) {
> +		struct xfs_buf		*bp = xfs_trans_getsb(sc->tp);
> +
> +		mp->m_sb.sb_rextents = rtb->rextents;
> +		mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
> +		mp->m_sb.sb_rextslog = rtb->rextslog;
> +		xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> +
> +		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
> +		xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +	}
>  
>  	/* Fix broken isize */
>  	sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b247d895c276d2..4168ccf21068cb 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_data_update_sb(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		nagcount,
> +	xfs_rfsblock_t		nb,
> +	xfs_agnumber_t		nagimax)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_agcount = cpu_to_be32(nagcount);
> +	sbp->sb_dblocks = cpu_to_be64(nb);
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		goto out_unlock;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_agcount = nagcount;
> +	mp->m_sb.sb_dblocks = nb;
> +
> +	if (nagimax)
> +		mp->m_maxagi = nagimax;
> +	xfs_set_low_space_thresholds(mp);
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +
> +out_unlock:
> +	xfs_buf_relse(bp);
> +	return error;
> +}
> +
>  /*
>   * growfs operations
>   */
> @@ -171,37 +211,13 @@ xfs_growfs_data_private(
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	/*
> -	 * Update changed superblock fields transactionally. These are not
> -	 * seen by the rest of the world until the transaction commit applies
> -	 * them atomically to the superblock.
> -	 */
> -	if (nagcount > oagcount)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> -	if (delta)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
>  	if (id.nfree)
>  		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
>  
> -	/*
> -	 * Sync sb counters now to reflect the updated values. This is
> -	 * particularly important for shrink because the write verifier
> -	 * will fail if sb_fdblocks is ever larger than sb_dblocks.
> -	 */
> -	if (xfs_has_lazysbcount(mp))
> -		xfs_log_sb(tp);
> -
> -	xfs_trans_set_sync(tp);
> -	error = xfs_trans_commit(tp);
> +	error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
>  	if (error)
>  		return error;
>  
> -	/* New allocation groups fully initialized, so update mount struct */
> -	if (nagimax)
> -		mp->m_maxagi = nagimax;
> -	xfs_set_low_space_thresholds(mp);
> -	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> -
>  	if (delta > 0) {
>  		/*
>  		 * If we expanded the last AG, free the per-AG reservation
> @@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
>  	struct xfs_mount	*mp,
>  	__u32			imaxpct)
>  {
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
>  	struct xfs_trans	*tp;
> -	int			dpct;
>  	int			error;
>  
>  	if (imaxpct > 100)
> @@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
>  	if (error)
>  		return error;
>  
> -	dpct = imaxpct - mp->m_sb.sb_imax_pct;
> -	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> -	xfs_trans_set_sync(tp);
> -	return xfs_trans_commit(tp);
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_imax_pct = imaxpct;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (!error)
> +		mp->m_sb.sb_imax_pct = imaxpct;
> +	xfs_buf_relse(bp);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3a2005a1e673dc..994e5efedab20f 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_rt_update_sb(
> +	struct xfs_trans	*tp,
> +	struct xfs_mount	*mp,
> +	struct xfs_mount	*nmp,
> +	xfs_rtbxlen_t		freed_rtx)
> +{
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
> +	sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
> +	sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
> +	sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
> +	sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
> +	mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
> +	mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
> +	mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
> +	mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
> +	mp->m_rsumlevels = nmp->m_rsumlevels;
> +	mp->m_rsumblocks = nmp->m_rsumblocks;
> +
> +	/*
> +	 * Recompute the growfsrt reservation from the new rsumsize.
> +	 */
> +	xfs_trans_resv_calc(mp, &mp->m_resv);
> +
> +	/*
> +	 * Ensure the mount RT feature flag is now set.
> +	 */
> +	mp->m_features |= XFS_FEAT_REALTIME;
> +	xfs_buf_relse(bp);
> +	return 0;
> +}
> +
>  static int
>  xfs_growfs_rt_bmblock(
>  	struct xfs_mount	*mp,
> @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
>  			goto out_cancel;
>  	}
>  
> -	/*
> -	 * Update superblock fields.
> -	 */
> -	if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
> -			nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
> -	if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
> -			nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
> -	if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
> -			nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
> -	if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
> -			nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
> -	if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
> -			nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
> -
>  	/*
>  	 * Free the new extent.
>  	 */
> @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
>  	xfs_rtbuf_cache_relse(&nargs);
>  	if (error)
>  		goto out_cancel;
> -
> -	/*
> -	 * Mark more blocks free in the superblock.
> -	 */
>  	xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
>  
> -	/*
> -	 * Update the calculated values in the real mount structure.
> -	 */
> -	mp->m_rsumlevels = nmp->m_rsumlevels;
> -	mp->m_rsumblocks = nmp->m_rsumblocks;
> -	xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
> -
> -	/*
> -	 * Recompute the growfsrt reservation from the new rsumsize.
> -	 */
> -	xfs_trans_resv_calc(mp, &mp->m_resv);
> -
> -	error = xfs_trans_commit(args.tp);
> +	error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
>  	if (error)
>  		goto out_free;
>  
> -	/*
> -	 * Ensure the mount RT feature flag is now set.
> -	 */
> -	mp->m_features |= XFS_FEAT_REALTIME;
> -
>  	kfree(nmp);
>  	return 0;
>  
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bdf3704dc30118..56505cb94f877d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -430,31 +430,6 @@ xfs_trans_mod_sb(
>  		ASSERT(delta < 0);
>  		tp->t_res_frextents_delta += delta;
>  		break;
> -	case XFS_TRANS_SB_DBLOCKS:
> -		tp->t_dblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_AGCOUNT:
> -		ASSERT(delta > 0);
> -		tp->t_agcount_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_IMAXPCT:
> -		tp->t_imaxpct_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSIZE:
> -		tp->t_rextsize_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBMBLOCKS:
> -		tp->t_rbmblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBLOCKS:
> -		tp->t_rblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTENTS:
> -		tp->t_rextents_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSLOG:
> -		tp->t_rextslog_delta += delta;
> -		break;
>  	default:
>  		ASSERT(0);
>  		return;
> @@ -475,12 +450,8 @@ STATIC void
>  xfs_trans_apply_sb_deltas(
>  	xfs_trans_t	*tp)
>  {
> -	struct xfs_dsb	*sbp;
> -	struct xfs_buf	*bp;
> -	int		whole = 0;
> -
> -	bp = xfs_trans_getsb(tp);
> -	sbp = bp->b_addr;
> +	struct xfs_buf	*bp = xfs_trans_getsb(tp);
> +	struct xfs_dsb	*sbp = bp->b_addr;
>  
>  	/*
>  	 * Only update the superblock counters if we are logging them
> @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
>  		spin_unlock(&mp->m_sb_lock);
>  	}
>  
> -	if (tp->t_dblocks_delta) {
> -		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_agcount_delta) {
> -		be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_imaxpct_delta) {
> -		sbp->sb_imax_pct += tp->t_imaxpct_delta;
> -		whole = 1;
> -	}
> -	if (tp->t_rextsize_delta) {
> -		be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rbmblocks_delta) {
> -		be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rblocks_delta) {
> -		be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextents_delta) {
> -		be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextslog_delta) {
> -		sbp->sb_rextslog += tp->t_rextslog_delta;
> -		whole = 1;
> -	}
> -
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> -	if (whole)
> -		/*
> -		 * Log the whole thing, the fields are noncontiguous.
> -		 */
> -		xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> -	else
> -		/*
> -		 * Since all the modifiable fields are contiguous, we
> -		 * can get away with this.
> -		 */
> -		xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> -				  offsetof(struct xfs_dsb, sb_frextents) +
> -				  sizeof(sbp->sb_frextents) - 1);
> +	xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> +			  offsetof(struct xfs_dsb, sb_frextents) +
> +			  sizeof(sbp->sb_frextents) - 1);
>  }
>  
>  /*
> @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
>  	 * must be consistent with the ondisk rtbitmap and must never include
>  	 * incore reservations.
>  	 */
> -	mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
> -	mp->m_sb.sb_agcount += tp->t_agcount_delta;
> -	mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
> -	mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
> -	if (tp->t_rextsize_delta) {
> -		mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
> -		mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
> -	}
> -	mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
> -	mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
> -	mp->m_sb.sb_rextents += tp->t_rextents_delta;
> -	mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
>  	spin_unlock(&mp->m_sb_lock);
> -
> -	/*
> -	 * Debug checks outside of the spinlock so they don't lock up the
> -	 * machine if they fail.
> -	 */
> -	ASSERT(mp->m_sb.sb_imax_pct >= 0);
> -	ASSERT(mp->m_sb.sb_rextslog >= 0);
>  }
>  
>  /* Add the given log item to the transaction's list of log items. */
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665ad..e5911cf09be444 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -140,14 +140,6 @@ typedef struct xfs_trans {
>  	int64_t			t_res_fdblocks_delta; /* on-disk only chg */
>  	int64_t			t_frextents_delta;/* superblock freextents chg*/
>  	int64_t			t_res_frextents_delta; /* on-disk only chg */
> -	int64_t			t_dblocks_delta;/* superblock dblocks change */
> -	int64_t			t_agcount_delta;/* superblock agcount change */
> -	int64_t			t_imaxpct_delta;/* superblock imaxpct change */
> -	int64_t			t_rextsize_delta;/* superblock rextsize chg */
> -	int64_t			t_rbmblocks_delta;/* superblock rbmblocks chg */
> -	int64_t			t_rblocks_delta;/* superblock rblocks change */
> -	int64_t			t_rextents_delta;/* superblocks rextents chg */
> -	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
>  	struct list_head	t_items;	/* log item descriptors */
>  	struct list_head	t_busy;		/* list of busy extents */
>  	struct list_head	t_dfops;	/* deferred operations */
> -- 
> 2.45.2
> 
>
Darrick J. Wong Oct. 10, 2024, 7:01 p.m. UTC | #2
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to
> stable store before the allocator can see them, so that they do not end
> up in the same CIL checkpoint as transactions that make use of the new
> information, which would make recovery impossible or broken.
> 
> To do this add two new helpers to prepare a superblock for direct
> manipulation of the on-disk buffer, and to commit these updates while
> holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
> those in growfs instead of applying the changes through the deltas in the
> xfs_trans structure (which also happens to shrink the xfs_trans structure
> a fair bit).

Yay for shrinking xfs_trans!

> The rtbitmap repair code was also using the transaction deltas and is
> converted to also update the superblock buffer directly under the buffer
> lock.
> 
> This new method establishes a locking protocol where even in-core
> superblock fields must only be updated with the superblock buffer
> locked.  For now it is only applied to affected geometry fields,
> but in the future it would make sense to apply it universally.

Hmm.  One thing that I don't quite like here is the separation between
updating the ondisk sb fields and updating the incore sb/recomputing the
cached geometry fields.  I think that's been handled correctly here, but
the pending changes in growfsrt for rtgroups are going to make this more
ugly.

What if instead this took the form of a new defer_ops type?  The
xfs_prepare_sb_update function would allocate a tracking object where
we'd pin the sb buffer and record which fields get changed, as well as
the new values.  xfs_commit_sb_update then xfs_defer_add()s it to the
transaction and commits it.  (The ->create_intent function would return
NULL so that no log item is created.)

The ->finish_item function would then bhold the sb buffer, update the
ondisk super like how xfs_commit_sb_update does in this patch, set
XFS_SB_TRANS_SYNC, and return -EAGAIN.  The defer ops would commit and
flush that transaction and call ->finish_item again, at which point it
would recompute the incore/cached geometry as necessary, bwrite the sb
buffer, and release it.
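
Roughly -- with every name here made up and the hook signature simplified
from the real defer_ops interface -- I'm picturing an untested sketch like:

	struct xfs_sb_update_intent {
		struct list_head	sui_list;
		struct xfs_buf		*sui_bp;	/* sb buffer, held across the roll */
		void			(*sui_apply)(struct xfs_dsb *sbp);   /* set ondisk fields */
		void			(*sui_commit)(struct xfs_mount *mp); /* update incore/cached */
		bool			sui_synced;
	};

	/* rough shape of ->finish_item; the real prototype differs */
	static int
	xfs_sb_update_finish_item(
		struct xfs_trans		*tp,
		struct xfs_sb_update_intent	*sui)
	{
		int				error;

		if (!sui->sui_synced) {
			sui->sui_bp = xfs_trans_getsb(tp);
			sui->sui_apply(sui->sui_bp->b_addr);
			xfs_trans_bhold(tp, sui->sui_bp);
			xfs_trans_buf_set_type(tp, sui->sui_bp, XFS_BLFT_SB_BUF);
			xfs_trans_log_buf(tp, sui->sui_bp, 0,
					sizeof(struct xfs_dsb) - 1);
			xfs_trans_set_sync(tp);
			sui->sui_synced = true;
			/* roll: defer ops commits+flushes and calls us again */
			return -EAGAIN;
		}

		/* the sync commit is on disk, now expose the new geometry */
		sui->sui_commit(tp->t_mountp);
		error = xfs_bwrite(sui->sui_bp);
		xfs_buf_relse(sui->sui_bp);
		return error;
	}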

The downside is that it's more complexity, but the upside is that the
geometry changes are contained in one place instead of being scattered
around, and the incore changes only happen if the synchronous
transaction actually gets written to disk.  IOWs, the end result is the
same as what you propose here, but structured differently.  

I guess the biggest downside is that log recovery has to call the
incore/cached geometry recomputation function directly because there's
no actual log intent item to recover.

(The code changes themselves look acceptable to me.)

--D

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_sb.c         |  97 ++++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_sb.h         |   3 +
>  fs/xfs/libxfs/xfs_shared.h     |   8 ---
>  fs/xfs/scrub/rtbitmap_repair.c |  26 +++++----
>  fs/xfs/xfs_fsops.c             |  80 ++++++++++++++++----------
>  fs/xfs/xfs_rtalloc.c           |  92 +++++++++++++++++-------------
>  fs/xfs/xfs_trans.c             | 101 ++-------------------------------
>  fs/xfs/xfs_trans.h             |   8 ---
>  8 files changed, 198 insertions(+), 217 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba667..2c83ab7441ade5 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1025,6 +1025,80 @@ xfs_sb_mount_common(
>  	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
>  }
>  
> +/*
> + * Mirror the lazy sb counters to the in-core superblock.
> + *
> + * If this is at unmount, the counters will be exactly correct, but at any other
> + * time they will only be ballpark correct because of reservations that have
> + * been taken out percpu counters.  If we have an unclean shutdown, this will be
> + * corrected by log recovery rebuilding the counters from the AGF block counts.
> + *
> + * Do not update sb_frextents here because it is not part of the lazy sb
> + * counters, despite having a percpu counter.  It is always kept consistent with
> + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need
> + * have to update it here.
> + */
> +static void
> +xfs_flush_sb_counters(
> +	struct xfs_mount	*mp)
> +{
> +	if (xfs_has_lazysbcount(mp)) {
> +		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> +		mp->m_sb.sb_ifree = min_t(uint64_t,
> +				percpu_counter_sum_positive(&mp->m_ifree),
> +				mp->m_sb.sb_icount);
> +		mp->m_sb.sb_fdblocks =
> +				percpu_counter_sum_positive(&mp->m_fdblocks);
> +	}
> +}
> +
> +/*
> + * Prepare a direct update to the superblock through the on-disk buffer.
> + *
> + * This locks out other modifications through the buffer lock and then syncs all
> + * in-core values to the on-disk buffer (including the percpu counters).
> + *
> + * The caller then directly manipulates the on-disk fields and calls
> + * xfs_commit_sb_update to write the updates to disk.  The caller is
> + * responsible for also updating the in-core fields, but it can do so after
> + * the transaction has been committed to disk.
> + *
> + * Updating the in-core field only after xfs_commit_sb_update ensures that other
> + * processes only see the update once it is stable on disk, and is usually the
> + * right thing to do for superblock updates.
> + *
> + * Note that writes to superblock fields updated using this helper are
> + * synchronized using the superblock buffer lock, which must be taken around
> + * all updates to the in-core fields as well.
> + */
> +struct xfs_dsb *
> +xfs_prepare_sb_update(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		**bpp)
> +{
> +	*bpp = xfs_trans_getsb(tp);
> +	xfs_flush_sb_counters(tp->t_mountp);
> +	xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
> +	return (*bpp)->b_addr;
> +}
> +
> +/*
> + * Commit a direct update to the on-disk superblock.  Keeps @bp locked and
> + * referenced, so the caller must call xfs_buf_relse() manually.
> + */
> +int
> +xfs_commit_sb_update(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*bp)
> +{
> +	xfs_trans_bhold(tp, bp);
> +	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> +	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +
> +	xfs_trans_set_sync(tp);
> +	return xfs_trans_commit(tp);
> +}
> +
>  /*
>   * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
>   * into the superblock buffer to be logged.  It does not provide the higher
> @@ -1038,28 +1112,7 @@ xfs_log_sb(
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*bp = xfs_trans_getsb(tp);
>  
> -	/*
> -	 * Lazy sb counters don't update the in-core superblock so do that now.
> -	 * If this is at unmount, the counters will be exactly correct, but at
> -	 * any other time they will only be ballpark correct because of
> -	 * reservations that have been taken out percpu counters. If we have an
> -	 * unclean shutdown, this will be corrected by log recovery rebuilding
> -	 * the counters from the AGF block counts.
> -	 *
> -	 * Do not update sb_frextents here because it is not part of the lazy
> -	 * sb counters, despite having a percpu counter. It is always kept
> -	 * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
> -	 * and hence we don't need have to update it here.
> -	 */
> -	if (xfs_has_lazysbcount(mp)) {
> -		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> -		mp->m_sb.sb_ifree = min_t(uint64_t,
> -				percpu_counter_sum_positive(&mp->m_ifree),
> -				mp->m_sb.sb_icount);
> -		mp->m_sb.sb_fdblocks =
> -				percpu_counter_sum_positive(&mp->m_fdblocks);
> -	}
> -
> +	xfs_flush_sb_counters(mp);
>  	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
>  	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 885c837559914d..3649d071687e33 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -13,6 +13,9 @@ struct xfs_trans;
>  struct xfs_fsop_geom;
>  struct xfs_perag;
>  
> +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
> +			struct xfs_buf **bpp);
> +int		xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
>  extern void	xfs_log_sb(struct xfs_trans *tp);
>  extern int	xfs_sync_sb(struct xfs_mount *mp, bool wait);
>  extern int	xfs_sync_sb_buf(struct xfs_mount *mp);
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 33b84a3a83ff63..45a32ea426164a 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -149,14 +149,6 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>  #define	XFS_TRANS_SB_RES_FDBLOCKS	0x00000008
>  #define	XFS_TRANS_SB_FREXTENTS		0x00000010
>  #define	XFS_TRANS_SB_RES_FREXTENTS	0x00000020
> -#define	XFS_TRANS_SB_DBLOCKS		0x00000040
> -#define	XFS_TRANS_SB_AGCOUNT		0x00000080
> -#define	XFS_TRANS_SB_IMAXPCT		0x00000100
> -#define	XFS_TRANS_SB_REXTSIZE		0x00000200
> -#define	XFS_TRANS_SB_RBMBLOCKS		0x00000400
> -#define	XFS_TRANS_SB_RBLOCKS		0x00000800
> -#define	XFS_TRANS_SB_REXTENTS		0x00001000
> -#define	XFS_TRANS_SB_REXTSLOG		0x00002000
>  
>  /*
>   * Here we centralize the specification of XFS meta-data buffer reference count
> diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
> index 0fef98e9f83409..be9d31f032b1bf 100644
> --- a/fs/xfs/scrub/rtbitmap_repair.c
> +++ b/fs/xfs/scrub/rtbitmap_repair.c
> @@ -16,6 +16,7 @@
>  #include "xfs_bit.h"
>  #include "xfs_bmap.h"
>  #include "xfs_bmap_btree.h"
> +#include "xfs_sb.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
>  #include "scrub/trace.h"
> @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry(
>  	struct xchk_rtbitmap	*rtb)
>  {
>  	struct xfs_mount	*mp = sc->mp;
> -	struct xfs_trans	*tp = sc->tp;
>  
>  	/* Superblock fields */
> -	if (mp->m_sb.sb_rextents != rtb->rextents)
> -		xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
> -				rtb->rextents - mp->m_sb.sb_rextents);
> -
> -	if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
> -				rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
> -
> -	if (mp->m_sb.sb_rextslog != rtb->rextslog)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
> -				rtb->rextslog - mp->m_sb.sb_rextslog);
> +	if (mp->m_sb.sb_rextents != rtb->rextents ||
> +	    mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
> +	    mp->m_sb.sb_rextslog != rtb->rextslog) {
> +		struct xfs_buf		*bp = xfs_trans_getsb(sc->tp);
> +
> +		mp->m_sb.sb_rextents = rtb->rextents;
> +		mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
> +		mp->m_sb.sb_rextslog = rtb->rextslog;
> +		xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> +
> +		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
> +		xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +	}
>  
>  	/* Fix broken isize */
>  	sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b247d895c276d2..4168ccf21068cb 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_data_update_sb(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		nagcount,
> +	xfs_rfsblock_t		nb,
> +	xfs_agnumber_t		nagimax)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_agcount = cpu_to_be32(nagcount);
> +	sbp->sb_dblocks = cpu_to_be64(nb);
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		goto out_unlock;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_agcount = nagcount;
> +	mp->m_sb.sb_dblocks = nb;
> +
> +	if (nagimax)
> +		mp->m_maxagi = nagimax;
> +	xfs_set_low_space_thresholds(mp);
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +
> +out_unlock:
> +	xfs_buf_relse(bp);
> +	return error;
> +}
> +
>  /*
>   * growfs operations
>   */
> @@ -171,37 +211,13 @@ xfs_growfs_data_private(
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	/*
> -	 * Update changed superblock fields transactionally. These are not
> -	 * seen by the rest of the world until the transaction commit applies
> -	 * them atomically to the superblock.
> -	 */
> -	if (nagcount > oagcount)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> -	if (delta)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
>  	if (id.nfree)
>  		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
>  
> -	/*
> -	 * Sync sb counters now to reflect the updated values. This is
> -	 * particularly important for shrink because the write verifier
> -	 * will fail if sb_fdblocks is ever larger than sb_dblocks.
> -	 */
> -	if (xfs_has_lazysbcount(mp))
> -		xfs_log_sb(tp);
> -
> -	xfs_trans_set_sync(tp);
> -	error = xfs_trans_commit(tp);
> +	error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
>  	if (error)
>  		return error;
>  
> -	/* New allocation groups fully initialized, so update mount struct */
> -	if (nagimax)
> -		mp->m_maxagi = nagimax;
> -	xfs_set_low_space_thresholds(mp);
> -	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> -
>  	if (delta > 0) {
>  		/*
>  		 * If we expanded the last AG, free the per-AG reservation
> @@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
>  	struct xfs_mount	*mp,
>  	__u32			imaxpct)
>  {
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
>  	struct xfs_trans	*tp;
> -	int			dpct;
>  	int			error;
>  
>  	if (imaxpct > 100)
> @@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
>  	if (error)
>  		return error;
>  
> -	dpct = imaxpct - mp->m_sb.sb_imax_pct;
> -	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> -	xfs_trans_set_sync(tp);
> -	return xfs_trans_commit(tp);
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_imax_pct = imaxpct;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (!error)
> +		mp->m_sb.sb_imax_pct = imaxpct;
> +	xfs_buf_relse(bp);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3a2005a1e673dc..994e5efedab20f 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_rt_update_sb(
> +	struct xfs_trans	*tp,
> +	struct xfs_mount	*mp,
> +	struct xfs_mount	*nmp,
> +	xfs_rtbxlen_t		freed_rtx)
> +{
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
> +	sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
> +	sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
> +	sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
> +	sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
> +	mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
> +	mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
> +	mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
> +	mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
> +	mp->m_rsumlevels = nmp->m_rsumlevels;
> +	mp->m_rsumblocks = nmp->m_rsumblocks;
> +
> +	/*
> +	 * Recompute the growfsrt reservation from the new rsumsize.
> +	 */
> +	xfs_trans_resv_calc(mp, &mp->m_resv);
> +
> +	/*
> +	 * Ensure the mount RT feature flag is now set.
> +	 */
> +	mp->m_features |= XFS_FEAT_REALTIME;
> +	xfs_buf_relse(bp);
> +	return 0;
> +}
> +
>  static int
>  xfs_growfs_rt_bmblock(
>  	struct xfs_mount	*mp,
> @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
>  			goto out_cancel;
>  	}
>  
> -	/*
> -	 * Update superblock fields.
> -	 */
> -	if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
> -			nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
> -	if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
> -			nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
> -	if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
> -			nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
> -	if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
> -			nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
> -	if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
> -			nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
> -
>  	/*
>  	 * Free the new extent.
>  	 */
> @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
>  	xfs_rtbuf_cache_relse(&nargs);
>  	if (error)
>  		goto out_cancel;
> -
> -	/*
> -	 * Mark more blocks free in the superblock.
> -	 */
>  	xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
>  
> -	/*
> -	 * Update the calculated values in the real mount structure.
> -	 */
> -	mp->m_rsumlevels = nmp->m_rsumlevels;
> -	mp->m_rsumblocks = nmp->m_rsumblocks;
> -	xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
> -
> -	/*
> -	 * Recompute the growfsrt reservation from the new rsumsize.
> -	 */
> -	xfs_trans_resv_calc(mp, &mp->m_resv);
> -
> -	error = xfs_trans_commit(args.tp);
> +	error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
>  	if (error)
>  		goto out_free;
>  
> -	/*
> -	 * Ensure the mount RT feature flag is now set.
> -	 */
> -	mp->m_features |= XFS_FEAT_REALTIME;
> -
>  	kfree(nmp);
>  	return 0;
>  
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bdf3704dc30118..56505cb94f877d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -430,31 +430,6 @@ xfs_trans_mod_sb(
>  		ASSERT(delta < 0);
>  		tp->t_res_frextents_delta += delta;
>  		break;
> -	case XFS_TRANS_SB_DBLOCKS:
> -		tp->t_dblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_AGCOUNT:
> -		ASSERT(delta > 0);
> -		tp->t_agcount_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_IMAXPCT:
> -		tp->t_imaxpct_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSIZE:
> -		tp->t_rextsize_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBMBLOCKS:
> -		tp->t_rbmblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBLOCKS:
> -		tp->t_rblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTENTS:
> -		tp->t_rextents_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSLOG:
> -		tp->t_rextslog_delta += delta;
> -		break;
>  	default:
>  		ASSERT(0);
>  		return;
> @@ -475,12 +450,8 @@ STATIC void
>  xfs_trans_apply_sb_deltas(
>  	xfs_trans_t	*tp)
>  {
> -	struct xfs_dsb	*sbp;
> -	struct xfs_buf	*bp;
> -	int		whole = 0;
> -
> -	bp = xfs_trans_getsb(tp);
> -	sbp = bp->b_addr;
> +	struct xfs_buf	*bp = xfs_trans_getsb(tp);
> +	struct xfs_dsb	*sbp = bp->b_addr;
>  
>  	/*
>  	 * Only update the superblock counters if we are logging them
> @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
>  		spin_unlock(&mp->m_sb_lock);
>  	}
>  
> -	if (tp->t_dblocks_delta) {
> -		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_agcount_delta) {
> -		be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_imaxpct_delta) {
> -		sbp->sb_imax_pct += tp->t_imaxpct_delta;
> -		whole = 1;
> -	}
> -	if (tp->t_rextsize_delta) {
> -		be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rbmblocks_delta) {
> -		be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rblocks_delta) {
> -		be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextents_delta) {
> -		be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextslog_delta) {
> -		sbp->sb_rextslog += tp->t_rextslog_delta;
> -		whole = 1;
> -	}
> -
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> -	if (whole)
> -		/*
> -		 * Log the whole thing, the fields are noncontiguous.
> -		 */
> -		xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> -	else
> -		/*
> -		 * Since all the modifiable fields are contiguous, we
> -		 * can get away with this.
> -		 */
> -		xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> -				  offsetof(struct xfs_dsb, sb_frextents) +
> -				  sizeof(sbp->sb_frextents) - 1);
> +	xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> +			  offsetof(struct xfs_dsb, sb_frextents) +
> +			  sizeof(sbp->sb_frextents) - 1);
>  }
>  
>  /*
> @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
>  	 * must be consistent with the ondisk rtbitmap and must never include
>  	 * incore reservations.
>  	 */
> -	mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
> -	mp->m_sb.sb_agcount += tp->t_agcount_delta;
> -	mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
> -	mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
> -	if (tp->t_rextsize_delta) {
> -		mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
> -		mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
> -	}
> -	mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
> -	mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
> -	mp->m_sb.sb_rextents += tp->t_rextents_delta;
> -	mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
>  	spin_unlock(&mp->m_sb_lock);
> -
> -	/*
> -	 * Debug checks outside of the spinlock so they don't lock up the
> -	 * machine if they fail.
> -	 */
> -	ASSERT(mp->m_sb.sb_imax_pct >= 0);
> -	ASSERT(mp->m_sb.sb_rextslog >= 0);
>  }
>  
>  /* Add the given log item to the transaction's list of log items. */
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665ad..e5911cf09be444 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -140,14 +140,6 @@ typedef struct xfs_trans {
>  	int64_t			t_res_fdblocks_delta; /* on-disk only chg */
>  	int64_t			t_frextents_delta;/* superblock freextents chg*/
>  	int64_t			t_res_frextents_delta; /* on-disk only chg */
> -	int64_t			t_dblocks_delta;/* superblock dblocks change */
> -	int64_t			t_agcount_delta;/* superblock agcount change */
> -	int64_t			t_imaxpct_delta;/* superblock imaxpct change */
> -	int64_t			t_rextsize_delta;/* superblock rextsize chg */
> -	int64_t			t_rbmblocks_delta;/* superblock rbmblocks chg */
> -	int64_t			t_rblocks_delta;/* superblock rblocks change */
> -	int64_t			t_rextents_delta;/* superblocks rextents chg */
> -	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
>  	struct list_head	t_items;	/* log item descriptors */
>  	struct list_head	t_busy;		/* list of busy extents */
>  	struct list_head	t_dfops;	/* deferred operations */
> -- 
> 2.45.2
> 
>
Christoph Hellwig Oct. 11, 2024, 7:57 a.m. UTC | #3
On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> Ok, so we don't want geometry changes transactions in the same CIL
> checkpoint as alloc related transactions that might depend on the
> geometry changes. That seems reasonable and on a first pass I have an
> idea of what this is doing, but the description is kind of vague.
> Obviously this fixes an issue on the recovery side (since I've tested
> it), but it's not quite clear to me from the description and/or logic
> changes how that issue manifests.
> 
> Could you elaborate please? For example, is this some kind of race
> situation between an allocation request and a growfs transaction, where
> the former perhaps sees a newly added AG between the time the growfs
> transaction commits (applying the sb deltas) and it actually commits to
> the log due to being a sync transaction, thus allowing an alloc on a new
> AG into the same checkpoint that adds the AG?

This is based on the feedback by Dave on the previous version:

https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/

Just doing the perag/in-core sb updates earlier fixes all the issues
with my test case, so I'm not actually sure how to get more updates
into the same checkpoint.  I'll try your exercisers to see if they can
hit that.
Christoph Hellwig Oct. 11, 2024, 7:59 a.m. UTC | #4
On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> What if instead this took the form of a new defer_ops type?  The
> xfs_prepare_sb_update function would allocate a tracking object where
> we'd pin the sb buffer and record which fields get changed, as well as
> the new values.  xfs_commit_sb_update then xfs_defer_add()s it to the
> transaction and commits it.  (The ->create_intent function would return
> NULL so that no log item is created.)
> 
> The ->finish_item function would then bhold the sb buffer, update the
> ondisk super like how xfs_commit_sb_update does in this patch, set
> XFS_SB_TRANS_SYNC, and return -EAGAIN.  The defer ops would commit and
> flush that transaction and call ->finish_item again, at which point it
> would recompute the incore/cached geometry as necessary, bwrite the sb
> buffer, and release it.
> 
> The downside is that it's more complexity, but the upside is that the
> geometry changes are contained in one place instead of being scattered
> around, and the incore changes only happen if the synchronous
> transaction actually gets written to disk.  IOWs, the end result is the
> same as what you propose here, but structured differently.  

That sounds like overkill at first, but if we want to move all sb updates
to that model, more structured infrastructure might be very useful.
Brian Foster Oct. 11, 2024, 2:02 p.m. UTC | #5
On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > Ok, so we don't want geometry changes transactions in the same CIL
> > checkpoint as alloc related transactions that might depend on the
> > geometry changes. That seems reasonable and on a first pass I have an
> > idea of what this is doing, but the description is kind of vague.
> > Obviously this fixes an issue on the recovery side (since I've tested
> > it), but it's not quite clear to me from the description and/or logic
> > changes how that issue manifests.
> > 
> > Could you elaborate please? For example, is this some kind of race
> > situation between an allocation request and a growfs transaction, where
> > the former perhaps sees a newly added AG between the time the growfs
> > transaction commits (applying the sb deltas) and it actually commits to
> > the log due to being a sync transaction, thus allowing an alloc on a new
> > AG into the same checkpoint that adds the AG?
> 
> This is based on the feedback by Dave on the previous version:
> 
> https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> 

Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
not sure I'd go straight to this change given the situation...

> Just doing the perag/in-core sb updates earlier fixes all the issues
> with my test case, so I'm not actually sure how to get more updates
> into the same checkpoint.  I'll try your exercisers to see if they can
> hit that.
> 

Ok, that explains things a bit. My observation is that the first 5
patches or so address the mount failure problem, but from there I'm not
reproducing much difference with or without the final patch. Either way,
I see aborts and splats all over the place, which implies at minimum
this isn't the only issue here.

So given that 1. growfs recovery seems pretty much broken, 2. this
particular patch has no straightforward way to test that it fixes
something and at the same time doesn't break anything else, and 3. we do
have at least one fairly straightforward growfs/recovery test in the
works that reliably explodes, personally I'd suggest to split this work
off into separate series.

It seems reasonable enough to me to get patches 1-5 in asap once they're
fully cleaned up, and then leave the next two as part of a followon
series pending further investigation into these other issues. As part of
that I'd like to know whether the recovery test reproduces (or can be
made to reproduce) the issue this patch presumably fixes, but I'd also
settle for "the grow recovery test now passes reliably and this doesn't
regress it." But once again, just my .02.

Brian
Darrick J. Wong Oct. 11, 2024, 4:44 p.m. UTC | #6
On Fri, Oct 11, 2024 at 09:59:03AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> > What if instead this took the form of a new defer_ops type?  The
> > xfs_prepare_sb_update function would allocate a tracking object where
> > we'd pin the sb buffer and record which fields get changed, as well as
> > the new values.  xfs_commit_sb_update then xfs_defer_add()s it to the
> > transaction and commits it.  (The ->create_intent function would return
> > NULL so that no log item is created.)
> > 
> > The ->finish_item function would then bhold the sb buffer, update the
> > ondisk super like how xfs_commit_sb_update does in this patch, set
> > XFS_SB_TRANS_SYNC, and return -EAGAIN.  The defer ops would commit and
> > flush that transaction and call ->finish_item again, at which point it
> > would recompute the incore/cached geometry as necessary, bwrite the sb
> > buffer, and release it.
> > 
> > The downside is that it's more complexity, but the upside is that the
> > geometry changes are contained in one place instead of being scattered
> > around, and the incore changes only happen if the synchronous
> > transaction actually gets written to disk.  IOWs, the end result is the
> > same as what you propose here, but structured differently.  
> 
> That sounds like overkill at first, but if we want to move all sb updates
> to that model, more structured infrastructure might be very useful.

<nod> We could just take this as-is and refactor it into the defer item
code once we're done making all the other sb geometry growfsrt updates.
I'd rather do that than rebase *two* entire patchsets just to get the
same results.

--D
Darrick J. Wong Oct. 11, 2024, 5:13 p.m. UTC | #7
On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > Ok, so we don't want geometry changes transactions in the same CIL
> > > checkpoint as alloc related transactions that might depend on the
> > > geometry changes. That seems reasonable and on a first pass I have an
> > > idea of what this is doing, but the description is kind of vague.
> > > Obviously this fixes an issue on the recovery side (since I've tested
> > > it), but it's not quite clear to me from the description and/or logic
> > > changes how that issue manifests.
> > > 
> > > Could you elaborate please? For example, is this some kind of race
> > > situation between an allocation request and a growfs transaction, where
> > > the former perhaps sees a newly added AG between the time the growfs
> > > transaction commits (applying the sb deltas) and it actually commits to
> > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > AG into the same checkpoint that adds the AG?
> > 
> > This is based on the feedback by Dave on the previous version:
> > 
> > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > 
> 
> Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> not sure I'd go straight to this change given the situation...
> 
> > Just doing the perag/in-core sb updates earlier fixes all the issues
> > with my test case, so I'm not actually sure how to get more updates
> > into the same checkpoint.  I'll try your exercisers if it could hit
> > that.
> > 
> 
> Ok, that explains things a bit. My observation is that the first 5
> patches or so address the mount failure problem, but from there I'm not
> reproducing much difference with or without the final patch.

Does this change to flush the log after committing the new sb fix the
recovery problems on older kernels?  I /think/ that's the point of this
patch.

>                                                              Either way,
> I see aborts and splats all over the place, which implies at minimum
> this isn't the only issue here.

Ugh.  I've recently noticed the long soak logrecovery test VM has seen
a slight tick up in failure rates -- random blocks that have clearly had
garbage written to them such that recovery tries to read the block to
recover a buffer log item and kaboom.  At this point it's unclear if
that's a problem with xfs or somewhere else. :(

> So given that 1. growfs recovery seems pretty much broken, 2. this
> particular patch has no straightforward way to test that it fixes
> something and at the same time doesn't break anything else, and 3. we do

I'm curious, what might break?  Was that merely a general comment, or do
you have something specific in mind?  (iows: do you see more string to
pull? :))

> have at least one fairly straightforward growfs/recovery test in the
> works that reliably explodes, personally I'd suggest to split this work
> off into separate series.
> 
> It seems reasonable enough to me to get patches 1-5 in asap once they're
> fully cleaned up, and then leave the next two as part of a followon
> series pending further investigation into these other issues. As part of
> that I'd like to know whether the recovery test reproduces (or can be
> made to reproduce) the issue this patch presumably fixes, but I'd also
> settle for "the grow recovery test now passes reliably and this doesn't
> regress it." But once again, just my .02.

Yeah, it's too bad there's no good way to test recovery with older
kernels either. :(

--D

> Brian
> 
>
Brian Foster Oct. 11, 2024, 6:41 p.m. UTC | #8
On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > checkpoint as alloc related transactions that might depend on the
> > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > idea of what this is doing, but the description is kind of vague.
> > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > it), but it's not quite clear to me from the description and/or logic
> > > > changes how that issue manifests.
> > > > 
> > > > Could you elaborate please? For example, is this some kind of race
> > > > situation between an allocation request and a growfs transaction, where
> > > > the former perhaps sees a newly added AG between the time the growfs
> > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > AG into the same checkpoint that adds the AG?
> > > 
> > > This is based on the feedback by Dave on the previous version:
> > > 
> > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > 
> > 
> > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > not sure I'd go straight to this change given the situation...
> > 
> > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > with my test case, so I'm not actually sure how to get more updates
> > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > that.
> > > 
> > 
> > Ok, that explains things a bit. My observation is that the first 5
> > patches or so address the mount failure problem, but from there I'm not
> > reproducing much difference with or without the final patch.
> 
> Does this change to flush the log after committing the new sb fix the
> recovery problems on older kernels?  I /think/ that's the point of this
> patch.
> 

I don't follow.. growfs always forced the log via the sync transaction,
right? Or do you mean something else by "change to flush the log?"

I thought the main functional change of this patch was to hold the
superblock buffer locked across the force so nothing else can relog the
new geometry superblock buffer in the same CIL checkpoint. Presumably,
the theory is that this prevents recovery from seeing updates to different
buffers that depend on the geometry update before the actual sb geometry
update is recovered (because the latter might have been relogged).

Maybe we're saying the same thing..? Or maybe I just misunderstand.
Either way I think the patch could use a more detailed commit log...

> >                                                              Either way,
> > I see aborts and splats all over the place, which implies at minimum
> > this isn't the only issue here.
> 
> Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> a slight tick up in failure rates -- random blocks that have clearly had
> garbage written to them such that recovery tries to read the block to
> recover a buffer log item and kaboom.  At this point it's unclear if
> that's a problem with xfs or somewhere else. :(
> 
> > So given that 1. growfs recovery seems pretty much broken, 2. this
> > particular patch has no straightforward way to test that it fixes
> > something and at the same time doesn't break anything else, and 3. we do
> 
> I'm curious, what might break?  Was that merely a general comment, or do
> you have something specific in mind?  (iows: do you see more string to
> pull? :))
> 

Just a general comment..

Something related that isn't totally clear to me is what about the
inverse shrink situation where dblocks is reduced. I.e., is there some
similar scenario where for example instead of the sb buffer being
relogged past some other buffer update that depends on it, some other
change is relogged past a sb update that invalidates/removes blocks
referenced by the relogged buffer..? If so, does that imply a shrink
should flush the log before the shrink transaction commits to ensure it
lands in a new checkpoint (as opposed to ensuring followon updates land
in a new checkpoint)..?

Anyways, my point is just that if it were me I wouldn't get too deep
into this until some of the reproducible growfs recovery issues are at
least characterized and testing is more sorted out.

The context for testing is here [1]. The TLDR is basically that
Christoph has a targeted test that reproduces the initial mount failure
and I hacked up a more general test that also reproduces it and
additional growfs recovery problems. This test does seem to confirm that
the previous patches address the mount failure issue, but this patch
doesn't seem to prevent any of the other problems produced by the
generic test. That might just mean the test doesn't reproduce what this
fixes, but it's kind of hard to at least regression test something like
this when basic growfs crash-recovery seems pretty much broken.

Brian

[1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/

> > have at least one fairly straightforward growfs/recovery test in the
> > works that reliably explodes, personally I'd suggest to split this work
> > off into separate series.
> > 
> > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > fully cleaned up, and then leave the next two as part of a followon
> > series pending further investigation into these other issues. As part of
> > that I'd like to know whether the recovery test reproduces (or can be
> > made to reproduce) the issue this patch presumably fixes, but I'd also
> > settle for "the grow recovery test now passes reliably and this doesn't
> > regress it." But once again, just my .02.
> 
> Yeah, it's too bad there's no good way to test recovery with older
> kernels either. :(
> 
> --D
> 
> > Brian
> > 
> > 
>
Darrick J. Wong Oct. 11, 2024, 11:12 p.m. UTC | #9
On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > checkpoint as alloc related transactions that might depend on the
> > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > idea of what this is doing, but the description is kind of vague.
> > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > changes how that issue manifests.
> > > > > 
> > > > > Could you elaborate please? For example, is this some kind of race
> > > > > situation between an allocation request and a growfs transaction, where
> > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > AG into the same checkpoint that adds the AG?
> > > > 
> > > > This is based on the feedback by Dave on the previous version:
> > > > 
> > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > 
> > > 
> > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > not sure I'd go straight to this change given the situation...
> > > 
> > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > with my test case, so I'm not actually sure how to get more updates
> > > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > > that.
> > > > 
> > > 
> > > Ok, that explains things a bit. My observation is that the first 5
> > > patches or so address the mount failure problem, but from there I'm not
> > > reproducing much difference with or without the final patch.
> > 
> > Does this change to flush the log after committing the new sb fix the
> > recovery problems on older kernels?  I /think/ that's the point of this
> > patch.
> > 
> 
> I don't follow.. growfs always forced the log via the sync transaction,
> right? Or do you mean something else by "change to flush the log?"

I guess I was typing a bit too fast this morning -- "change to flush the
log to disk before anyone else can get their hands on the superblock".
You're right that xfs_log_sb and data-device growfs already do that.

That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
that this patch fixes, right?

> I thought the main functional change of this patch was to hold the
> superblock buffer locked across the force so nothing else can relog the
> new geometry superblock buffer in the same cil checkpoint. Presumably,
> the theory is that prevents recovery from seeing updates to different
> buffers that depend on the geometry update before the actual sb geometry
> update is recovered (because the latter might have been relogged).
> 
> Maybe we're saying the same thing..? Or maybe I just misunderstand.
> Either way I think patch could use a more detailed commit log...

<nod> The commit message should point out that we're fixing a real bug
here, which is that growfsrt doesn't force the log to disk when it
commits the new rt geometry.

> > >                                                              Either way,
> > > I see aborts and splats all over the place, which implies at minimum
> > > this isn't the only issue here.
> > 
> > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > a slight tick up in failure rates -- random blocks that have clearly had
> > garbage written to them such that recovery tries to read the block to
> > recover a buffer log item and kaboom.  At this point it's unclear if
> > that's a problem with xfs or somewhere else. :(
> > 
> > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > particular patch has no straightforward way to test that it fixes
> > > something and at the same time doesn't break anything else, and 3. we do
> > 
> > I'm curious, what might break?  Was that merely a general comment, or do
> > you have something specific in mind?  (iows: do you see more string to
> > pull? :))
> > 
> 
> Just a general comment..
> 
> Something related that isn't totally clear to me is what about the
> inverse shrink situation where dblocks is reduced. I.e., is there some
> similar scenario where for example instead of the sb buffer being
> relogged past some other buffer update that depends on it, some other
> change is relogged past a sb update that invalidates/removes blocks
> referenced by the relogged buffer..? If so, does that imply a shrink
> should flush the log before the shrink transaction commits to ensure it
> lands in a new checkpoint (as opposed to ensuring followon updates land
> in a new checkpoint)..?

I think so.  Might we want to do that before and after to be careful?
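
Roughly, the ordering I'm thinking of for the shrink side would be
something like the sketch below -- purely illustrative, this isn't what
the current shrink path does, and error handling is omitted:

	int	error;

	/*
	 * 1) Push everything already committed to stable storage so no
	 *    earlier checkpoint can still reference the blocks we're
	 *    about to remove.
	 */
	error = xfs_log_force(mp, XFS_LOG_SYNC);

	/* 2) ... build the shrink transaction that removes the space ... */

	/*
	 * 3) Commit it synchronously so the geometry update sits in its
	 *    own checkpoint ahead of anything that follows.
	 */
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);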

> Anyways, my point is just that if it were me I wouldn't get too deep
> into this until some of the reproducible growfs recovery issues are at
> least characterized and testing is more sorted out.
> 
> The context for testing is here [1]. The TLDR is basically that
> Christoph has a targeted test that reproduces the initial mount failure
> and I hacked up a more general test that also reproduces it and
> additional growfs recovery problems. This test does seem to confirm that
> the previous patches address the mount failure issue, but this patch
> doesn't seem to prevent any of the other problems produced by the
> generic test. That might just mean the test doesn't reproduce what this
> fixes, but it's kind of hard to at least regression test something like
> this when basic growfs crash-recovery seems pretty much broken.

Hmm, if you make a variant of that test which formats with an rt device
and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
similar blowups?  Let's see what happens if I do that...

--D

> Brian
> 
> [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> 
> > > have at least one fairly straightforward growfs/recovery test in the
> > > works that reliably explodes, personally I'd suggest to split this work
> > > off into separate series.
> > > 
> > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > fully cleaned up, and then leave the next two as part of a followon
> > > series pending further investigation into these other issues. As part of
> > > that I'd like to know whether the recovery test reproduces (or can be
> > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > settle for "the grow recovery test now passes reliably and this doesn't
> > > regress it." But once again, just my .02.
> > 
> > Yeah, it's too bad there's no good way to test recovery with older
> > kernels either. :(
> > 
> > --D
> > 
> > > Brian
> > > 
> > > 
> > 
> 
>
Darrick J. Wong Oct. 11, 2024, 11:29 p.m. UTC | #10
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > changes how that issue manifests.
> > > > > > 
> > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > AG into the same checkpoint that adds the AG?
> > > > > 
> > > > > This is based on the feedback by Dave on the previous version:
> > > > > 
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > 
> > > > 
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > not sure I'd go straight to this change given the situation...
> > > > 
> > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > > > that.
> > > > > 
> > > > 
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm not
> > > > reproducing much difference with or without the final patch.
> > > 
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels?  I /think/ that's the point of this
> > > patch.
> > > 
> > 
> > I don't follow.. growfs always forced the log via the sync transaction,
> > right? Or do you mean something else by "change to flush the log?"
> 
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
> 
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
> 
> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog the
> > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > the theory is that prevents recovery from seeing updates to different
> > buffers that depend on the geometry update before the actual sb geometry
> > update is recovered (because the latter might have been relogged).
> > 
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > Either way I think patch could use a more detailed commit log...
> 
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
> 
> > > >                                                              Either way,
> > > > I see aborts and splats all over the place, which implies at minimum
> > > > this isn't the only issue here.
> > > 
> > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > a slight tick up in failure rates -- random blocks that have clearly had
> > > garbage written to them such that recovery tries to read the block to
> > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > that's a problem with xfs or somewhere else. :(
> > > 
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and 3. we do
> > > 
> > > I'm curious, what might break?  Was that merely a general comment, or do
> > > you have something specific in mind?  (iows: do you see more string to
> > > pull? :))
> > > 
> > 
> > Just a general comment..
> > 
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where for example instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure it
> > lands in a new checkpoint (as opposed to ensuring followon updates land
> > in a new checkpoint)..?
> 
> I think so.  Might we want to do that before and after to be careful?
> 
> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> > 
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount failure
> > and I hacked up a more general test that also reproduces it and
> > additional growfs recovery problems. This test does seem to confirm that
> > the previous patches address the mount failure issue, but this patch
> > doesn't seem to prevent any of the other problems produced by the
> > generic test. That might just mean the test doesn't reproduce what this
> > fixes, but it's kind of hard to at least regression test something like
> > this when basic growfs crash-recovery seems pretty much broken.
> 
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups?  Let's see what happens if I do that...

Ahahaha awesome it corrupts the filesystem:

_check_xfs_filesystem: filesystem on /dev/sdf is inconsistent (r)
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - generate realtime summary info and bitmap...
sb_frextents 389, counted 10329
discrepancy in summary (0) at dblock 0x0 words 0x3f-0x3f/0x400
discrepancy in summary (0) at dblock 0x0 words 0x44-0x44/0x400
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...

--D

--- /dev/null
+++ b/tests/xfs/610
@@ -0,0 +1,102 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2000-2004 Silicon Graphics, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 610
+#
+# XFS online growfs-while-allocating tests (rt subvol variant)
+#
+. ./common/preamble
+_begin_fstest growfs ioctl prealloc auto stress
+
+# Import common functions.
+. ./common/filter
+
+_create_scratch()
+{
+	_scratch_mkfs_xfs "$@" >> $seqres.full
+
+	if ! _try_scratch_mount 2>/dev/null
+	then
+		echo "failed to mount $SCRATCH_DEV"
+		exit 1
+	fi
+
+	_xfs_force_bdev realtime $SCRATCH_MNT &> /dev/null
+
+	# fix the reserve block pool to a known size so that the enospc
+	# calculations work out correctly.
+	_scratch_resvblks 1024 >  /dev/null 2>&1
+}
+
+_fill_scratch()
+{
+	$XFS_IO_PROG -f -c "resvsp 0 ${1}" $SCRATCH_MNT/resvfile
+}
+
+_stress_scratch()
+{
+	procs=3
+	nops=1000
+	# -w ensures that the only ops are ones which cause write I/O
+	FSSTRESS_ARGS=`_scale_fsstress_args -d $SCRATCH_MNT -w -p $procs \
+	    -n $nops $FSSTRESS_AVOID`
+	$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full 2>&1 &
+}
+
+_require_realtime
+_require_scratch
+_require_xfs_io_command "falloc"
+
+_scratch_mkfs_xfs | tee -a $seqres.full | _filter_mkfs 2>$tmp.mkfs
+. $tmp.mkfs	# extract blocksize and data size for scratch device
+
+endsize=`expr 550 \* 1048576`	# stop after growing this big
+incsize=`expr  42 \* 1048576`	# grow in chunks of this size
+modsize=`expr   4 \* $incsize`	# pause after this many increments
+
+[ `expr $endsize / $dbsize` -lt $dblocks ] || _notrun "Scratch device too small"
+
+size=`expr 125 \* 1048576`	# 120 megabytes initially
+sizeb=`expr $size / $dbsize`	# in data blocks
+logblks=$(_scratch_find_xfs_min_logblocks -rsize=${size})
+_create_scratch -lsize=${logblks}b -rsize=${size}
+
+for i in `seq 125 -1 90`; do
+	fillsize=`expr $i \* 1048576`
+	out="$(_fill_scratch $fillsize 2>&1)"
+	echo "$out" | grep -q 'No space left on device' && continue
+	test -n "${out}" && echo "$out"
+	break
+done
+
+#
+# Grow the filesystem while actively stressing it...
+# Kick off more stress threads on each iteration, grow; repeat.
+#
+while [ $size -le $endsize ]; do
+	echo "*** stressing a ${sizeb} block filesystem" >> $seqres.full
+	_stress_scratch
+	size=`expr $size + $incsize`
+	sizeb=`expr $size / $dbsize`	# in data blocks
+	echo "*** growing to a ${sizeb} block filesystem" >> $seqres.full
+	xfs_growfs -R ${sizeb} $SCRATCH_MNT >> $seqres.full
+	echo AGCOUNT=$agcount >> $seqres.full
+	echo >> $seqres.full
+
+	sleep $((RANDOM % 3))
+	_scratch_shutdown
+	ps -e | grep fsstress > /dev/null 2>&1
+	while [ $? -eq 0 ]; do
+		killall -9 fsstress > /dev/null 2>&1
+		wait > /dev/null 2>&1
+		ps -e | grep fsstress > /dev/null 2>&1
+	done
+	_scratch_cycle_mount || _fail "cycle mount failed"
+done > /dev/null 2>&1
+wait	# stop for any remaining stress processes
+
+_scratch_unmount
+
+status=0
+exit
--- /dev/null
+++ b/tests/xfs/610.out
@@ -0,0 +1,7 @@
+QA output created by 610
+meta-data=DDEV isize=XXX agcount=N, agsize=XXX blks
+data     = bsize=XXX blocks=XXX, imaxpct=PCT
+         = sunit=XXX swidth=XXX, unwritten=X
+naming   =VERN bsize=XXX
+log      =LDEV bsize=XXX blocks=XXX
+realtime =RDEV extsz=XXX blocks=XXX, rtextents=XXX
Christoph Hellwig Oct. 14, 2024, 5:58 a.m. UTC | #11
On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote:
> Ahahaha awesome it corrupts the filesystem:

Is this with a rtgroup file system?  I can't get your test to fail
with the latest xfs staging tree.
Darrick J. Wong Oct. 14, 2024, 3:30 p.m. UTC | #12
On Mon, Oct 14, 2024 at 07:58:50AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote:
> > Ahahaha awesome it corrupts the filesystem:
> 
> Is this with a rtgroup file system?  I can't get your test to fail
> with the latest xfs staging tree.

aha, yes, it's with rtgroups.

--D
Brian Foster Oct. 14, 2024, 6:50 p.m. UTC | #13
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > changes how that issue manifests.
> > > > > > 
> > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > AG into the same checkpoint that adds the AG?
> > > > > 
> > > > > This is based on the feedback by Dave on the previous version:
> > > > > 
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > 
> > > > 
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > not sure I'd go straight to this change given the situation...
> > > > 
> > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > > > that.
> > > > > 
> > > > 
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm not
> > > > reproducing much difference with or without the final patch.
> > > 
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels?  I /think/ that's the point of this
> > > patch.
> > > 
> > 
> > I don't follow.. growfs always forced the log via the sync transaction,
> > right? Or do you mean something else by "change to flush the log?"
> 
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
> 
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
> 

Ah, Ok.. that makes sense. Sounds like it could be..

> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog the
> > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > the theory is that prevents recovery from seeing updates to different
> > buffers that depend on the geometry update before the actual sb geometry
> > update is recovered (because the latter might have been relogged).
> > 
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > Either way I think patch could use a more detailed commit log...
> 
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
> 

Maybe even make it a separate patch to pull apart some of these cleanups
from fixes. I was also wondering if the whole locking change is the
moral equivalent of locking the sb across the growfs trans (i.e.
trans_getsb() + trans_bhold()), at which point maybe that would be a
reasonable incremental patch too.
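
To illustrate what I mean by the moral equivalent (just a sketch of the
idea, not the actual patch -- error handling and the incore field
updates are left out):

	struct xfs_buf	*bp;
	int		error;

	/* join the sb buffer to the growfs transaction and keep it locked */
	bp = xfs_trans_getsb(tp);
	xfs_trans_bhold(tp, bp);

	/* ... apply and log the new geometry ... */

	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);

	/* nothing else could have relogged the sb into that checkpoint */
	xfs_buf_relse(bp);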

> > > >                                                              Either way,
> > > > I see aborts and splats all over the place, which implies at minimum
> > > > this isn't the only issue here.
> > > 
> > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > a slight tick up in failure rates -- random blocks that have clearly had
> > > garbage written to them such that recovery tries to read the block to
> > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > that's a problem with xfs or somewhere else. :(
> > > 
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and 3. we do
> > > 
> > > I'm curious, what might break?  Was that merely a general comment, or do
> > > you have something specific in mind?  (iows: do you see more string to
> > > pull? :))
> > > 
> > 
> > Just a general comment..
> > 
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where for example instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure it
> > lands in a new checkpoint (as opposed to ensuring followon updates land
> > in a new checkpoint)..?
> 
> I think so.  Might we want to do that before and after to be careful?
> 

Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
had a log preflush to flush out already committed changes before the
grow, I don't think anything really prevents another "problematic"
transaction from committing after that preflush.

I dunno.. on one hand it does seem like an unlikely thing due to the
nature of needing space to be free in order to shrink in the first
place, but OTOH if you have something like grow that is rare, not
performance sensitive, has a history of not being well tested, and has
these subtle ordering requirements that might be affected indirectly by
changes to other transactions, ISTM it could be a wise engineering
decision to simplify
to the degree possible and find the most basic model that enforces
predictable ordering.

So for a hacky thought/example, suppose we defined a transaction mode
that basically implemented an exclusive checkpoint requirement (i.e.
this transaction owns an entire checkpoint, nothing else is allowed in
the CIL concurrently). Presumably that would ensure everything before
the grow would flush out to disk in one checkpoint, everything
concurrent would block on synchronous commit of the grow trans (before
new geometry is exposed), and then after that point everything pending
would drain into another checkpoint.

It kind of sounds like overkill, but really if it could be implemented
simply enough then we wouldn't have to think too hard about auditing all
other relog scenarios. I'd probably want to see at least some reproducer
for this sort of problem to prove the theory though too, even if it
required debug instrumentation or something. Hm?
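
Just to sketch where that would plug in (entirely hypothetical -- no
such transaction flag or CIL hook exists today):

	int	error;

	/* hypothetical: this transaction owns an entire CIL checkpoint */
	tp->t_flags |= XFS_TRANS_EXCL_CHECKPOINT;

	/*
	 * Committing it would then 1) push the CIL so prior transactions
	 * land in an earlier checkpoint, 2) keep other committers out of
	 * the CIL, 3) commit this transaction synchronously into its own
	 * checkpoint, and 4) let the blocked committers drain into the
	 * next one.
	 */
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);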

> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> > 
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount failure
> > and I hacked up a more general test that also reproduces it and
> > additional growfs recovery problems. This test does seem to confirm that
> > the previous patches address the mount failure issue, but this patch
> > doesn't seem to prevent any of the other problems produced by the
> > generic test. That might just mean the test doesn't reproduce what this
> > fixes, but it's kind of hard to at least regression test something like
> > this when basic growfs crash-recovery seems pretty much broken.
> 
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups?  Let's see what happens if I do that...
> 

Heh, sounds like so from your followup. Fun times.

I guess that test should probably work its way upstream. I made some
tweaks locally since last posted to try and make it a little more
aggressive, but it didn't repro anything new so not sure how much
difference it makes really. Do we want a separate version like yours for
the rt case or would you expect to cover both cases in a single test?

Brian

> --D
> 
> > Brian
> > 
> > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > 
> > > > have at least one fairly straightforward growfs/recovery test in the
> > > > works that reliably explodes, personally I'd suggest to split this work
> > > > off into separate series.
> > > > 
> > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > fully cleaned up, and then leave the next two as part of a followon
> > > > series pending further investigation into these other issues. As part of
> > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > regress it." But once again, just my .02.
> > > 
> > > Yeah, it's too bad there's no good way to test recovery with older
> > > kernels either. :(
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > 
> > > 
> > 
> > 
>
Darrick J. Wong Oct. 15, 2024, 4:42 p.m. UTC | #14
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > changes how that issue manifests.
> > > > > > > 
> > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > 
> > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > 
> > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > 
> > > > > 
> > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > not sure I'd go straight to this change given the situation...
> > > > > 
> > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > > > > that.
> > > > > > 
> > > > > 
> > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > reproducing much difference with or without the final patch.
> > > > 
> > > > Does this change to flush the log after committing the new sb fix the
> > > > recovery problems on older kernels?  I /think/ that's the point of this
> > > > patch.
> > > > 
> > > 
> > > I don't follow.. growfs always forced the log via the sync transaction,
> > > right? Or do you mean something else by "change to flush the log?"
> > 
> > I guess I was typing a bit too fast this morning -- "change to flush the
> > log to disk before anyone else can get their hands on the superblock".
> > You're right that xfs_log_sb and data-device growfs already do that.
> > 
> > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > that this patch fixes, right?
> > 
> 
> Ah, Ok.. that makes sense. Sounds like it could be..

Yeah.  Hey Christoph, would you mind prepending a minimal fixpatch to
set xfs_trans_set_sync in growfsrt before this one that refactors the
existing growfs/sb updates?
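
Something like the snippet below is all I have in mind (a sketch --
exactly which growfsrt transaction commit needs it is for the real
patch to decide):

	/*
	 * Make the growfsrt transaction that logs the new rt geometry
	 * synchronous, the way data-device growfs already is.
	 */
	xfs_trans_set_sync(tp);
	return xfs_trans_commit(tp);
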

> > > I thought the main functional change of this patch was to hold the
> > > superblock buffer locked across the force so nothing else can relog the
> > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > the theory is that prevents recovery from seeing updates to different
> > > buffers that depend on the geometry update before the actual sb geometry
> > > update is recovered (because the latter might have been relogged).
> > > 
> > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > Either way I think patch could use a more detailed commit log...
> > 
> > <nod> The commit message should point out that we're fixing a real bug
> > here, which is that growfsrt doesn't force the log to disk when it
> > commits the new rt geometry.
> > 
> 
> Maybe even make it a separate patch to pull apart some of these cleanups
> from fixes. I was also wondering if the whole locking change is the
> moral equivalent of locking the sb across the growfs trans (i.e.
> trans_getsb() + trans_bhold()), at which point maybe that would be a
> reasonable incremental patch too.
> 
> > > > >                                                              Either way,
> > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > this isn't the only issue here.
> > > > 
> > > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > garbage written to them such that recovery tries to read the block to
> > > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > > that's a problem with xfs or somewhere else. :(
> > > > 
> > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > particular patch has no straightforward way to test that it fixes
> > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > 
> > > > I'm curious, what might break?  Was that merely a general comment, or do
> > > > you have something specific in mind?  (iows: do you see more string to
> > > > pull? :))
> > > > 
> > > 
> > > Just a general comment..
> > > 
> > > Something related that isn't totally clear to me is what about the
> > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > similar scenario where for example instead of the sb buffer being
> > > relogged past some other buffer update that depends on it, some other
> > > change is relogged past a sb update that invalidates/removes blocks
> > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > should flush the log before the shrink transaction commits to ensure it
> > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > in a new checkpoint)..?
> > 
> > I think so.  Might we want to do that before and after to be careful?
> > 
> 
> Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> had a log preflush to flush out already committed changes before the
> grow, I don't think anything really prevents another "problematic"
> transaction from committing after that preflush.

Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
you?

> I dunno.. on one hand it does seem like an unlikely thing due to the
> nature of needing space to be free in order to shrink in the first
> place, but OTOH if you have something like grow that is rare, not
> performance sensitive, has a history of not being well tested, and has
> these subtle ordering requirements that might change indirectly to other
> transactions, ISTM it could be a wise engineering decision to simplify
> to the degree possible and find the most basic model that enforces
> predictable ordering.
> 
> So for a hacky thought/example, suppose we defined a transaction mode
> that basically implemented an exclusive checkpoint requirement (i.e.
> this transaction owns an entire checkpoint, nothing else is allowed in
> the CIL concurrently). Presumably that would ensure everything before
> the grow would flush out to disk in one checkpoint, everything
> concurrent would block on synchronous commit of the grow trans (before
> new geometry is exposed), and then after that point everything pending
> would drain into another checkpoint.
> 
> It kind of sounds like overkill, but really if it could be implemented
> simply enough then we wouldn't have to think too hard about auditing all
> other relog scenarios. I'd probably want to see at least some reproducer
> for this sort of problem to prove the theory though too, even if it
> required debug instrumentation or something. Hm?

What if we redefined the input requirements to shrink?  Let's say we
require that the fd argument to a shrink ioctl is actually an unlinkable
O_TMPFILE regular file with the EOFS blocks mapped to it.  Then we can
force the log without holding any locks, and the shrink transaction can
remove the bmap and rmap records at the same time that it updates the sb
geometry.  The otherwise inaccessible file means that nobody can reuse
that space between the log force and the sb update.

> > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > into this until some of the reproducible growfs recovery issues are at
> > > least characterized and testing is more sorted out.
> > > 
> > > The context for testing is here [1]. The TLDR is basically that
> > > Christoph has a targeted test that reproduces the initial mount failure
> > > and I hacked up a more general test that also reproduces it and
> > > additional growfs recovery problems. This test does seem to confirm that
> > > the previous patches address the mount failure issue, but this patch
> > > doesn't seem to prevent any of the other problems produced by the
> > > generic test. That might just mean the test doesn't reproduce what this
> > > fixes, but it's kind of hard to at least regression test something like
> > > this when basic growfs crash-recovery seems pretty much broken.
> > 
> > Hmm, if you make a variant of that test which formats with an rt device
> > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > similar blowups?  Let's see what happens if I do that...
> > 
> 
> Heh, sounds like so from your followup. Fun times.
> 
> I guess that test should probably work its way upstream. I made some
> tweaks locally since last posted to try and make it a little more
> aggressive, but it didn't repro anything new so not sure how much
> difference it makes really. Do we want a separate version like yours for
> the rt case or would you expect to cover both cases in a single test?

This probably should be different tests, because rt is its own very
weird animal.

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > 
> > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > off into separate series.
> > > > > 
> > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > series pending further investigation into these other issues. As part of
> > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > regress it." But once again, just my .02.
> > > > 
> > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > kernels either. :(
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
> 
>
Brian Foster Oct. 18, 2024, 12:27 p.m. UTC | #15
On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > changes how that issue manifests.
> > > > > > > > 
> > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > 
> > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > 
> > > > > > 
> > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > not sure I'd go straight to this change given the situation...
> > > > > > 
> > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > into the same checkpoint.  I'll try your exercisers if it could hit
> > > > > > > that.
> > > > > > > 
> > > > > > 
> > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > reproducing much difference with or without the final patch.
> > > > > 
> > > > > Does this change to flush the log after committing the new sb fix the
> > > > > recovery problems on older kernels?  I /think/ that's the point of this
> > > > > patch.
> > > > > 
> > > > 
> > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > right? Or do you mean something else by "change to flush the log?"
> > > 
> > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > log to disk before anyone else can get their hands on the superblock".
> > > You're right that xfs_log_sb and data-device growfs already do that.
> > > 
> > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > that this patch fixes, right?
> > > 
> > 
> > Ah, Ok.. that makes sense. Sounds like it could be..
> 
> Yeah.  Hey Christoph, would you mind pre-pending a minimal fixpatch to
> set xfs_trans_set_sync in growfsrt before this one that refactors the
> existing growfs/sb updates?
> 
> > > > I thought the main functional change of this patch was to hold the
> > > > superblock buffer locked across the force so nothing else can relog the
> > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > the theory is that prevents recovery from seeing updates to different
> > > > buffers that depend on the geometry update before the actual sb geometry
> > > > update is recovered (because the latter might have been relogged).
> > > > 
> > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > Either way I think patch could use a more detailed commit log...
> > > 
> > > <nod> The commit message should point out that we're fixing a real bug
> > > here, which is that growfsrt doesn't force the log to disk when it
> > > commits the new rt geometry.
> > > 
> > 
> > Maybe even make it a separate patch to pull apart some of these cleanups
> > from fixes. I was also wondering if the whole locking change is the
> > moral equivalent of locking the sb across the growfs trans (i.e.
> > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > reasonable incremental patch too.
> > 
> > > > > >                                                              Either way,
> > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > this isn't the only issue here.
> > > > > 
> > > > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > garbage written to them such that recovery tries to read the block to
> > > > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > > > that's a problem with xfs or somewhere else. :(
> > > > > 
> > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > 
> > > > > I'm curious, what might break?  Was that merely a general comment, or do
> > > > > you have something specific in mind?  (iows: do you see more string to
> > > > > pull? :))
> > > > > 
> > > > 
> > > > Just a general comment..
> > > > 
> > > > Something related that isn't totally clear to me is what about the
> > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > similar scenario where for example instead of the sb buffer being
> > > > relogged past some other buffer update that depends on it, some other
> > > > change is relogged past a sb update that invalidates/removes blocks
> > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > should flush the log before the shrink transaction commits to ensure it
> > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > in a new checkpoint)..?
> > > 
> > > I think so.  Might we want to do that before and after to be careful?
> > > 
> > 
> > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > had a log preflush to flush out already committed changes before the
> > grow, I don't think anything really prevents another "problematic"
> > transaction from committing after that preflush.
> 
> Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> you?
> 

I guess it depends on how far into the weeds we want to get. I'm not
necessarily sure that anything exists today that is definitely
problematic wrt shrink. That would probably warrant an audit of
transactions or some other high level analysis to disprove. More thought
needed.

Short of the latter, I'm more thinking about the question "is there some
new thing we could add years down the line that 1. adds something to the
log that could conflict and 2. could be reordered past a shrink
transaction in a problematic way?" If the answer to that is open ended
and some such thing does come along, I think it's highly likely this
would just break growfs logging again until somebody trips over it in
the field.

> > I dunno.. on one hand it does seem like an unlikely thing due to the
> > nature of needing space to be free in order to shrink in the first
> > place, but OTOH if you have something like grow that is rare, not
> > performance sensitive, has a history of not being well tested, and has
> > these subtle ordering requirements that might change indirectly to other
> > transactions, ISTM it could be a wise engineering decision to simplify
> > to the degree possible and find the most basic model that enforces
> > predictable ordering.
> > 
> > So for a hacky thought/example, suppose we defined a transaction mode
> > that basically implemented an exclusive checkpoint requirement (i.e.
> > this transaction owns an entire checkpoint, nothing else is allowed in
> > the CIL concurrently). Presumably that would ensure everything before
> > the grow would flush out to disk in one checkpoint, everything
> > concurrent would block on synchronous commit of the grow trans (before
> > new geometry is exposed), and then after that point everything pending
> > would drain into another checkpoint.
> > 
> > It kind of sounds like overkill, but really if it could be implemented
> > simply enough then we wouldn't have to think too hard about auditing all
> > other relog scenarios. I'd probably want to see at least some reproducer
> > for this sort of problem to prove the theory though too, even if it
> > required debug instrumentation or something. Hm?
> 
> What if we redefined the input requirements to shrink?  Lets say we
> require that the fd argument to a shrink ioctl is actually an unlinkable
> O_TMPFILE regular file with the EOFS blocks mapped to it.  Then we can
> force the log without holding any locks, and the shrink transaction can
> remove the bmap and rmap records at the same time that it updates the sb
> geometry.  The otherwise inaccessible file means that nobody can reuse
> that space between the log force and the sb update.
> 

Interesting thought. It kind of sounds like how shrink already works to
some degree, right? I.e. the kernel side allocs the blocks out of the
btrees and tosses them, just no inode in the mix?

Honestly I'd probably need to stare at this code and think about it and
work through some scenarios to quantify how much of a concern this
really is, and I don't really have the bandwidth for that just now. I
mainly wanted to raise the notion that if we're assessing high level log
ordering requirements for growfs, we should consider the shrink case as
well.

> > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > into this until some of the reproducible growfs recovery issues are at
> > > > least characterized and testing is more sorted out.
> > > > 
> > > > The context for testing is here [1]. The TLDR is basically that
> > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > and I hacked up a more general test that also reproduces it and
> > > > additional growfs recovery problems. This test does seem to confirm that
> > > > the previous patches address the mount failure issue, but this patch
> > > > doesn't seem to prevent any of the other problems produced by the
> > > > generic test. That might just mean the test doesn't reproduce what this
> > > > fixes, but it's kind of hard to at least regression test something like
> > > > this when basic growfs crash-recovery seems pretty much broken.
> > > 
> > > Hmm, if you make a variant of that test which formats with an rt device
> > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > similar blowups?  Let's see what happens if I do that...
> > > 
> > 
> > Heh, sounds like so from your followup. Fun times.
> > 
> > I guess that test should probably work its way upstream. I made some
> > tweaks locally since last posted to try and make it a little more
> > aggressive, but it didn't repro anything new so not sure how much
> > difference it makes really. Do we want a separate version like yours for
> > the rt case or would you expect to cover both cases in a single test?
> 
> This probably should be different tests, because rt is its own very
> weird animal.
> 

Posted a couple tests the other day, JFYI.

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > 
> > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > off into separate series.
> > > > > > 
> > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > series pending further investigation into these other issues. As part of
> > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > regress it." But once again, just my .02.
> > > > > 
> > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > kernels either. :(
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
>
Dave Chinner Oct. 21, 2024, 1:38 p.m. UTC | #16
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> So for a hacky thought/example, suppose we defined a transaction mode
> that basically implemented an exclusive checkpoint requirement (i.e.
> this transaction owns an entire checkpoint, nothing else is allowed in
> the CIL concurrently).

Transactions know nothing about the CIL, nor should they. The CIL
also has no place in ordering transactions - it's purely an
aggregation mechanism that flushes committed transactions to stable
storage when it is told to. i.e. when a log force is issued.

A globally serialised transaction requires ordering at the
transaction allocation/reservation level, not at the CIL. i.e. it is
essentially the same ordering problem as serialising against
untracked DIO on the inode before we can run a truncate (lock,
drain, do operation, unlock).
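
For illustration, that lock/drain/do/unlock shape is the same one the
truncate path already uses to serialise against untracked direct I/O.
A minimal sketch of the pattern (the function name is made up; this is
not a growfs proposal):

static void
xfs_serialise_example(
	struct xfs_inode	*ip)
{
	xfs_ilock(ip, XFS_IOLOCK_EXCL);		/* lock: stop new submitters */
	inode_dio_wait(VFS_I(ip));		/* drain untracked direct I/O */
	/* ... do the operation, e.g. the truncate transaction ... */
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);	/* unlock */
}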

> Presumably that would ensure everything before
> the grow would flush out to disk in one checkpoint, everything
> concurrent would block on synchronous commit of the grow trans (before
> new geometry is exposed), and then after that point everything pending
> would drain into another checkpoint.

Yup, that's high level transaction level ordering and really has
nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the
only ordering it does is to preserve transaction commit ordering
down to the journal.

> It kind of sounds like overkill, but really if it could be implemented
> simply enough then we wouldn't have to think too hard about auditing all
> other relog scenarios. I'd probably want to see at least some reproducer
> for this sort of problem to prove the theory though too, even if it
> required debug instrumentation or something. Hm?

It's relatively straightforward to do, but it seems like total
overkill for growfs, as growfs only requires ordering
between the change of size and new allocations. We can do that by
not exposing the new space until after the superblock has been
modified on stable storage in the case of grow.
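
To make that concrete, here is a minimal sketch of the grow-side ordering
using the helpers this patch adds; the function name and the use of the
tr_growdata reservation are illustrative, not the actual growfs code:

static int
xfs_grow_set_dblocks(
	struct xfs_mount	*mp,
	xfs_rfsblock_t		new_dblocks)
{
	struct xfs_trans	*tp;
	struct xfs_buf		*bp;
	struct xfs_dsb		*sbp;
	int			error;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, 0, 0, 0, &tp);
	if (error)
		return error;

	/* lock the sb buffer and sync the in-core superblock into it */
	sbp = xfs_prepare_sb_update(tp, &bp);
	sbp->sb_dblocks = cpu_to_be64(new_dblocks);

	/* synchronous commit: the new geometry is now on stable storage */
	error = xfs_commit_sb_update(tp, bp);
	if (!error) {
		/*
		 * Only now expose the new space to the in-core superblock
		 * (and hence the allocator), still under the sb buffer lock.
		 */
		mp->m_sb.sb_dblocks = new_dblocks;
	}
	xfs_buf_relse(bp);
	return error;
}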

In the case of shrink, globally serialising the growfs
transaction for shrink won't actually do anything useful because we
have to deny access to the free space we are removing before we
even start the shrink transaction. Hence we need allocation vs
shrink co-ordination before we shrink the superblock space, not a
globally serialised size modification transaction...

-Dave.
Darrick J. Wong Oct. 21, 2024, 4:59 p.m. UTC | #17
On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > changes how that issue manifests.
> > > > > > > > > 
> > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > > 
> > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > > 
> > > > > > > 
> > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > > 
> > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > into the check checkpoint.  I'll try your exercisers if it could hit
> > > > > > > > that.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > reproducing much difference with or without the final patch.
> > > > > > 
> > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > recovery problems on older kernels?  I /think/ that's the point of this
> > > > > > patch.
> > > > > > 
> > > > > 
> > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > right? Or do you mean something else by "change to flush the log?"
> > > > 
> > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > log to disk before anyone else can get their hands on the superblock".
> > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > > 
> > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > that this patch fixes, right?
> > > > 
> > > 
> > > Ah, Ok.. that makes sense. Sounds like it could be..
> > 
> > Yeah.  Hey Christoph, would you mind pre-pending a minimal fixpatch to
> > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > existing growfs/sb updates?
> > 
> > > > > I thought the main functional change of this patch was to hold the
> > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > the theory is that prevents recovery from seeing updates to different
> > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > update is recovered (because the latter might have been relogged).
> > > > > 
> > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > Either way I think patch could use a more detailed commit log...
> > > > 
> > > > <nod> The commit message should point out that we're fixing a real bug
> > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > commits the new rt geometry.
> > > > 
> > > 
> > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > from fixes. I was also wondering if the whole locking change is the
> > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > reasonable incremental patch too.
> > > 
> > > > > > >                                                              Either way,
> > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > this isn't the only issue here.
> > > > > > 
> > > > > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > > > > that's a problem with xfs or somewhere else. :(
> > > > > > 
> > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > > 
> > > > > > I'm curious, what might break?  Was that merely a general comment, or do
> > > > > > you have something specific in mind?  (iows: do you see more string to
> > > > > > pull? :))
> > > > > > 
> > > > > 
> > > > > Just a general comment..
> > > > > 
> > > > > Something related that isn't totally clear to me is what about the
> > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > similar scenario where for example instead of the sb buffer being
> > > > > relogged past some other buffer update that depends on it, some other
> > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > in a new checkpoint)..?
> > > > 
> > > > I think so.  Might we want to do that before and after to be careful?
> > > > 
> > > 
> > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > had a log preflush to flush out already committed changes before the
> > > grow, I don't think anything really prevents another "problematic"
> > > transaction from committing after that preflush.
> > 
> > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > you?
> > 
> 
> I guess it depends on how far into the weeds we want to get. I'm not
> necessarily sure than anything exists today that is definitely
> problematic wrt shrink. That would probably warrant an audit of
> transactions or some other high level analysis to disprove. More thought
> needed.

<nod> I think there isn't a problem with shrink because the shrink
transaction itself must be able to find the space, which means that
there cannot be any files or unfinished deferred ops pointing to that
space.

> Short of the latter, I'm more thinking about the question "is there some
> new thing we could add years down the line that 1. adds something to the
> log that could conflict and 2. could be reordered past a shrink
> transaction in a problematic way?" If the answer to that is open ended
> and some such thing does come along, I think it's highly likely this
> would just break growfs logging again until somebody trips over it in
> the field.

Good thing we have a couple of tests now? :)

> > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > nature of needing space to be free in order to shrink in the first
> > > place, but OTOH if you have something like grow that is rare, not
> > > performance sensitive, has a history of not being well tested, and has
> > > these subtle ordering requirements that might change indirectly to other
> > > transactions, ISTM it could be a wise engineering decision to simplify
> > > to the degree possible and find the most basic model that enforces
> > > predictable ordering.
> > > 
> > > So for a hacky thought/example, suppose we defined a transaction mode
> > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > the CIL concurrently). Presumably that would ensure everything before
> > > the grow would flush out to disk in one checkpoint, everything
> > > concurrent would block on synchronous commit of the grow trans (before
> > > new geometry is exposed), and then after that point everything pending
> > > would drain into another checkpoint.
> > > 
> > > It kind of sounds like overkill, but really if it could be implemented
> > > simply enough then we wouldn't have to think too hard about auditing all
> > > other relog scenarios. I'd probably want to see at least some reproducer
> > > for this sort of problem to prove the theory though too, even if it
> > > required debug instrumentation or something. Hm?
> > 
> > What if we redefined the input requirements to shrink?  Lets say we
> > require that the fd argument to a shrink ioctl is actually an unlinkable
> > O_TMPFILE regular file with the EOFS blocks mapped to it.  Then we can
> > force the log without holding any locks, and the shrink transaction can
> > remove the bmap and rmap records at the same time that it updates the sb
> > geometry.  The otherwise inaccessible file means that nobody can reuse
> > that space between the log force and the sb update.
> > 
> 
> Interesting thought. It kind of sounds like how shrink already works to
> some degree, right? I.e. the kernel side allocs the blocks out of the
> btrees and tosses them, just no inode in the mix?

Right.

> Honestly I'd probably need to stare at this code and think about it and
> work through some scenarios to quantify how much of a concern this
> really is, and I don't really have the bandwidth for that just now. I
> mainly wanted to raise the notion that if we're assessing high level log
> ordering requirements for growfs, we should consider the shrink case as
> well.

<nod>

> > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > least characterized and testing is more sorted out.
> > > > > 
> > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > and I hacked up a more general test that also reproduces it and
> > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > the previous patches address the mount failure issue, but this patch
> > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > > 
> > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > similar blowups?  Let's see what happens if I do that...
> > > > 
> > > 
> > > Heh, sounds like so from your followup. Fun times.
> > > 
> > > I guess that test should probably work its way upstream. I made some
> > > tweaks locally since last posted to try and make it a little more
> > > aggressive, but it didn't repro anything new so not sure how much
> > > difference it makes really. Do we want a separate version like yours for
> > > the rt case or would you expect to cover both cases in a single test?
> > 
> > This probably should be different tests, because rt is its own very
> > weird animal.
> > 
> 
> Posted a couple tests the other day, JFYI.
> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > > 
> > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > off into separate series.
> > > > > > > 
> > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > regress it." But once again, just my .02.
> > > > > > 
> > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > kernels either. :(
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
>
Brian Foster Oct. 23, 2024, 2:45 p.m. UTC | #18
On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > > changes how that issue manifests.
> > > > > > > > > > 
> > > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > > > 
> > > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > > > 
> > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > > into the check checkpoint.  I'll try your exercisers if it could hit
> > > > > > > > > that.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > > reproducing much difference with or without the final patch.
> > > > > > > 
> > > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > > recovery problems on older kernels?  I /think/ that's the point of this
> > > > > > > patch.
> > > > > > > 
> > > > > > 
> > > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > > right? Or do you mean something else by "change to flush the log?"
> > > > > 
> > > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > > log to disk before anyone else can get their hands on the superblock".
> > > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > > > 
> > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > > that this patch fixes, right?
> > > > > 
> > > > 
> > > > Ah, Ok.. that makes sense. Sounds like it could be..
> > > 
> > > Yeah.  Hey Christoph, would you mind pre-pending a minimal fixpatch to
> > > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > > existing growfs/sb updates?
> > > 
> > > > > > I thought the main functional change of this patch was to hold the
> > > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > > the theory is that prevents recovery from seeing updates to different
> > > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > > update is recovered (because the latter might have been relogged).
> > > > > > 
> > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > > Either way I think patch could use a more detailed commit log...
> > > > > 
> > > > > <nod> The commit message should point out that we're fixing a real bug
> > > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > > commits the new rt geometry.
> > > > > 
> > > > 
> > > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > > from fixes. I was also wondering if the whole locking change is the
> > > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > > reasonable incremental patch too.
> > > > 
> > > > > > > >                                                              Either way,
> > > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > > this isn't the only issue here.
> > > > > > > 
> > > > > > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > > > > > that's a problem with xfs or somewhere else. :(
> > > > > > > 
> > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > > > 
> > > > > > > I'm curious, what might break?  Was that merely a general comment, or do
> > > > > > > you have something specific in mind?  (iows: do you see more string to
> > > > > > > pull? :))
> > > > > > > 
> > > > > > 
> > > > > > Just a general comment..
> > > > > > 
> > > > > > Something related that isn't totally clear to me is what about the
> > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > > similar scenario where for example instead of the sb buffer being
> > > > > > relogged past some other buffer update that depends on it, some other
> > > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > > in a new checkpoint)..?
> > > > > 
> > > > > I think so.  Might we want to do that before and after to be careful?
> > > > > 
> > > > 
> > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > > had a log preflush to flush out already committed changes before the
> > > > grow, I don't think anything really prevents another "problematic"
> > > > transaction from committing after that preflush.
> > > 
> > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > > you?
> > > 
> > 
> > I guess it depends on how far into the weeds we want to get. I'm not
> > necessarily sure than anything exists today that is definitely
> > problematic wrt shrink. That would probably warrant an audit of
> > transactions or some other high level analysis to disprove. More thought
> > needed.
> 
> <nod> I think there isn't a problem with shrink because the shrink
> transaction itself must be able to find the space, which means that
> there cannot be any files or unfinished deferred ops pointing to that
> space.
> 

Ok, that makes sense to me. On poking through the code, one thing it
looks like it misses is the case where the allocation fails purely
because extents are busy (i.e. even if the blocks are free).

That doesn't seem harmful, just perhaps a little odd that a shrink might
fail and then succeed some time later, once the log tail has pushed, with
no other visible changes. Unless I'm missing something, it might be
worth adding the log force there (which I think you suggested earlier).
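
(As a sketch of what that could look like in the shrink allocation
failure path; the error value used to detect the busy-extent case is an
assumption here, purely for illustration:)

	/*
	 * The blocks we want to remove are free but still busy, so force
	 * the log to let the busy extents clear and have the caller retry
	 * rather than failing the shrink outright.
	 */
	if (error == -EAGAIN) {			/* assumed "busy" signal */
		error = xfs_log_force(mp, XFS_LOG_SYNC);
		if (!error)
			error = -EAGAIN;	/* retry the allocation */
	}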

All that said, I'm not so sure this will all apply the same if shrink
grows to a more active implementation. For example, I suspect an
implementation that learns to quiesce/truncate full AGs isn't going to
necessarily need to allocate all of the blocks out of the AG free space
trees. But of course, this is all vaporware anyways. :)

> > Short of the latter, I'm more thinking about the question "is there some
> > new thing we could add years down the line that 1. adds something to the
> > log that could conflict and 2. could be reordered past a shrink
> > transaction in a problematic way?" If the answer to that is open ended
> > and some such thing does come along, I think it's highly likely this
> > would just break growfs logging again until somebody trips over it in
> > the field.
> 
> Good thing we have a couple of tests now? :)
> 

Not with good shrink support, unfortunately. :/ I tried adding basic
shrink calls into the proposed tests, but I wasn't able to see them
actually do anything, presumably because of the stress workload and
smallish filesystem keeping those blocks in use. It might require a more
active shrink implementation before we could support this kind of test,
or otherwise maybe something more limited/targeted than fsstress.

Hmm.. a random, off the top of my head idea might be a debug knob that
artificially restricts all allocations beyond a certain disk offset...
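
(For example, something along these lines in the allocator, where the
cutoff is a hypothetical debug-only mount field settable via sysfs; just
a sketch of the idea:)

#ifdef DEBUG
	/* pretend space at or beyond the cutoff is unusable, for testing */
	if (mp->m_dbg_alloc_cutoff &&
	    XFS_AGB_TO_FSB(mp, args->agno, args->agbno) >=
	    mp->m_dbg_alloc_cutoff)
		return -ENOSPC;
#endif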

Brian

> > > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > > nature of needing space to be free in order to shrink in the first
> > > > place, but OTOH if you have something like grow that is rare, not
> > > > performance sensitive, has a history of not being well tested, and has
> > > > these subtle ordering requirements that might change indirectly to other
> > > > transactions, ISTM it could be a wise engineering decision to simplify
> > > > to the degree possible and find the most basic model that enforces
> > > > predictable ordering.
> > > > 
> > > > So for a hacky thought/example, suppose we defined a transaction mode
> > > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > > the CIL concurrently). Presumably that would ensure everything before
> > > > the grow would flush out to disk in one checkpoint, everything
> > > > concurrent would block on synchronous commit of the grow trans (before
> > > > new geometry is exposed), and then after that point everything pending
> > > > would drain into another checkpoint.
> > > > 
> > > > It kind of sounds like overkill, but really if it could be implemented
> > > > simply enough then we wouldn't have to think too hard about auditing all
> > > > other relog scenarios. I'd probably want to see at least some reproducer
> > > > for this sort of problem to prove the theory though too, even if it
> > > > required debug instrumentation or something. Hm?
> > > 
> > > What if we redefined the input requirements to shrink?  Lets say we
> > > require that the fd argument to a shrink ioctl is actually an unlinkable
> > > O_TMPFILE regular file with the EOFS blocks mapped to it.  Then we can
> > > force the log without holding any locks, and the shrink transaction can
> > > remove the bmap and rmap records at the same time that it updates the sb
> > > geometry.  The otherwise inaccessible file means that nobody can reuse
> > > that space between the log force and the sb update.
> > > 
> > 
> > Interesting thought. It kind of sounds like how shrink already works to
> > some degree, right? I.e. the kernel side allocs the blocks out of the
> > btrees and tosses them, just no inode in the mix?
> 
> Right.
> 
> > Honestly I'd probably need to stare at this code and think about it and
> > work through some scenarios to quantify how much of a concern this
> > really is, and I don't really have the bandwidth for that just now. I
> > mainly wanted to raise the notion that if we're assessing high level log
> > ordering requirements for growfs, we should consider the shrink case as
> > well.
> 
> <nod>
> 
> > > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > > least characterized and testing is more sorted out.
> > > > > > 
> > > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > > and I hacked up a more general test that also reproduces it and
> > > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > > the previous patches address the mount failure issue, but this patch
> > > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > > > 
> > > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > > similar blowups?  Let's see what happens if I do that...
> > > > > 
> > > > 
> > > > Heh, sounds like so from your followup. Fun times.
> > > > 
> > > > I guess that test should probably work its way upstream. I made some
> > > > tweaks locally since last posted to try and make it a little more
> > > > aggressive, but it didn't repro anything new so not sure how much
> > > > difference it makes really. Do we want a separate version like yours for
> > > > the rt case or would you expect to cover both cases in a single test?
> > > 
> > > This probably should be different tests, because rt is its own very
> > > weird animal.
> > > 
> > 
> > Posted a couple tests the other day, JFYI.
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > > > 
> > > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > > off into separate series.
> > > > > > > > 
> > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > > regress it." But once again, just my .02.
> > > > > > > 
> > > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > > kernels either. :(
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
>
Brian Foster Oct. 23, 2024, 3:06 p.m. UTC | #19
On Tue, Oct 22, 2024 at 12:38:23AM +1100, Dave Chinner wrote:
> On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > So for a hacky thought/example, suppose we defined a transaction mode
> > that basically implemented an exclusive checkpoint requirement (i.e.
> > this transaction owns an entire checkpoint, nothing else is allowed in
> > the CIL concurrently).
> 
> Transactions know nothing about the CIL, nor should they. The CIL
> also has no place in ordering transactions - it's purely an
> aggregation mechanism that flushes committed transactions to stable
> storage when it is told to. i.e. when a log force is issued.
> 
> A globally serialised transaction requires ordering at the
> transaction allocation/reservation level, not at the CIL. i.e. it is
> essentially the same ordering problem as serialising against
> untracked DIO on the inode before we can run a truncate (lock,
> drain, do operation, unlock).
> 
> > Presumably that would ensure everything before
> > the grow would flush out to disk in one checkpoint, everything
> > concurrent would block on synchronous commit of the grow trans (before
> > new geometry is exposed), and then after that point everything pending
> > would drain into another checkpoint.
> 
> Yup, that's high level transaction level ordering and really has
> nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the
> only ordering it does is to preserve transaction commit ordering
> down to the journal.
> 
> > It kind of sounds like overkill, but really if it could be implemented
> > simply enough then we wouldn't have to think too hard about auditing all
> > other relog scenarios. I'd probably want to see at least some reproducer
> > for this sort of problem to prove the theory though too, even if it
> > required debug instrumentation or something. Hm?
> 
> It's relatively straight forward to do, but it seems like total
> overkill for growfs, as growfs only requires ordering
> between the change of size and new allocations. We can do that by
> not exposing the new space until after then superblock has been
> modifed on stable storage in the case of grow.
> 
> In the case of shrink, globally serialising the growfs
> transaction for shrink won't actually do any thing useful because we
> have to deny access to the free space we are removing before we
> even start the shrink transaction. Hence we need allocation vs
> shrink co-ordination before we shrink the superblock space, not a
> globally serialised size modification transaction...
> 

Not sure what you mean here; at least I don't see that requirement in
the current code. It looks like shrink acquires the blocks all in the
same transaction as the shrink itself. If something fails, it rolls or
returns the space depending on what actually failed...

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong Oct. 24, 2024, 6:02 p.m. UTC | #20
On Wed, Oct 23, 2024 at 10:45:51AM -0400, Brian Foster wrote:
> On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> > > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > > > changes how that issue manifests.
> > > > > > > > > > > 
> > > > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > > > > 
> > > > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > > > > 
> > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > > > > 
> > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > > > into the check checkpoint.  I'll try your exercisers if it could hit
> > > > > > > > > > that.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > > > reproducing much difference with or without the final patch.
> > > > > > > > 
> > > > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > > > recovery problems on older kernels?  I /think/ that's the point of this
> > > > > > > > patch.
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > > > right? Or do you mean something else by "change to flush the log?"
> > > > > > 
> > > > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > > > log to disk before anyone else can get their hands on the superblock".
> > > > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > > > > 
> > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > > > that this patch fixes, right?
> > > > > > 
> > > > > 
> > > > > Ah, Ok.. that makes sense. Sounds like it could be..
> > > > 
> > > > Yeah.  Hey Christoph, would you mind pre-pending a minimal fixpatch to
> > > > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > > > existing growfs/sb updates?
> > > > 
> > > > > > > I thought the main functional change of this patch was to hold the
> > > > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > > > the theory is that prevents recovery from seeing updates to different
> > > > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > > > update is recovered (because the latter might have been relogged).
> > > > > > > 
> > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > > > Either way I think patch could use a more detailed commit log...
> > > > > > 
> > > > > > <nod> The commit message should point out that we're fixing a real bug
> > > > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > > > commits the new rt geometry.
> > > > > > 
> > > > > 
> > > > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > > > from fixes. I was also wondering if the whole locking change is the
> > > > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > > > reasonable incremental patch too.
> > > > > 
> > > > > > > > >                                                              Either way,
> > > > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > > > this isn't the only issue here.
> > > > > > > > 
> > > > > > > > Ugh.  I've recently noticed the long soak logrecovery test vm have seen
> > > > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > > > recover a buffer log item and kaboom.  At this point it's unclear if
> > > > > > > > that's a problem with xfs or somewhere else. :(
> > > > > > > > 
> > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > > > > 
> > > > > > > > I'm curious, what might break?  Was that merely a general comment, or do
> > > > > > > > you have something specific in mind?  (iows: do you see more string to
> > > > > > > > pull? :))
> > > > > > > > 
> > > > > > > 
> > > > > > > Just a general comment..
> > > > > > > 
> > > > > > > Something related that isn't totally clear to me is what about the
> > > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > > > similar scenario where for example instead of the sb buffer being
> > > > > > > relogged past some other buffer update that depends on it, some other
> > > > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > > > in a new checkpoint)..?
> > > > > > 
> > > > > > I think so.  Might we want to do that before and after to be careful?
> > > > > > 
> > > > > 
> > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > > > had a log preflush to flush out already committed changes before the
> > > > > grow, I don't think anything really prevents another "problematic"
> > > > > transaction from committing after that preflush.
> > > > 
> > > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > > > you?
> > > > 
> > > 
> > > I guess it depends on how far into the weeds we want to get. I'm not
> > > necessarily sure than anything exists today that is definitely
> > > problematic wrt shrink. That would probably warrant an audit of
> > > transactions or some other high level analysis to disprove. More thought
> > > needed.
> > 
> > <nod> I think there isn't a problem with shrink because the shrink
> > transaction itself must be able to find the space, which means that
> > there cannot be any files or unfinished deferred ops pointing to that
> > space.
> > 
> 
> Ok, that makes sense to me. On poking through the code, one thing that
> it looks like it misses is the case where the allocation fails purely
> due to extents being busy (i.e. even if the blocks are free).
> 
> That doesn't seem harmful, just perhaps a little odd that a shrink might
> fail and then succeed after some period of time for the log tail to push
> with no other visible changes. Unless I'm missing something, it might be
> worth adding the log force there (that I think you suggested earlier).

Any program that relies on a particular piece of space being in a
particular state can fail due to other active threads, so I'm not
worried about that.

> All that said, I'm not so sure this will all apply the same if shrink
> grows to a more active implementation. For example, I suspect an
> implementation that learns to quiesce/truncate full AGs isn't going to
> necessarily need to allocate all of the blocks out of the AG free space
> trees. But of course, this is all vaporware anyways. :)

<nod> That's a burden for whoever ends up working on AG removal. ;)

> > > Short of the latter, I'm more thinking about the question "is there some
> > > new thing we could add years down the line that 1. adds something to the
> > > log that could conflict and 2. could be reordered past a shrink
> > > transaction in a problematic way?" If the answer to that is open ended
> > > and some such thing does come along, I think it's highly likely this
> > > would just break growfs logging again until somebody trips over it in
> > > the field.
> > 
> > Good thing we have a couple of tests now? :)
> > 
> 
> Not with good shrink support, unfortunately. :/ I tried adding basic
> shrink calls into the proposed tests, but I wasn't able to see them
> actually do anything, presumably because of the stress workload and
> smallish filesystem keeping those blocks in use. It might require a more
> active shrink implementation before we could support this kind of test,
> or otherwise maybe something more limited/targeted than fsstress.

I suspect you're right.

--D

> Hmm.. a random, off the top of my head idea might be a debug knob that
> artificially restricts all allocations beyond a certain disk offset...
> 
> Brian
> 
> > > > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > > > nature of needing space to be free in order to shrink in the first
> > > > > place, but OTOH if you have something like grow that is rare, not
> > > > > performance sensitive, has a history of not being well tested, and has
> > > > > these subtle ordering requirements that might change indirectly to other
> > > > > transactions, ISTM it could be a wise engineering decision to simplify
> > > > > to the degree possible and find the most basic model that enforces
> > > > > predictable ordering.
> > > > > 
> > > > > So for a hacky thought/example, suppose we defined a transaction mode
> > > > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > > > the CIL concurrently). Presumably that would ensure everything before
> > > > > the grow would flush out to disk in one checkpoint, everything
> > > > > concurrent would block on synchronous commit of the grow trans (before
> > > > > new geometry is exposed), and then after that point everything pending
> > > > > would drain into another checkpoint.
> > > > > 
> > > > > It kind of sounds like overkill, but really if it could be implemented
> > > > > simply enough then we wouldn't have to think too hard about auditing all
> > > > > other relog scenarios. I'd probably want to see at least some reproducer
> > > > > for this sort of problem to prove the theory though too, even if it
> > > > > required debug instrumentation or something. Hm?
> > > > 
> > > > What if we redefined the input requirements to shrink?  Lets say we
> > > > require that the fd argument to a shrink ioctl is actually an unlinkable
> > > > O_TMPFILE regular file with the EOFS blocks mapped to it.  Then we can
> > > > force the log without holding any locks, and the shrink transaction can
> > > > remove the bmap and rmap records at the same time that it updates the sb
> > > > geometry.  The otherwise inaccessible file means that nobody can reuse
> > > > that space between the log force and the sb update.
> > > > 
> > > 
> > > Interesting thought. It kind of sounds like how shrink already works to
> > > some degree, right? I.e. the kernel side allocs the blocks out of the
> > > btrees and tosses them, just no inode in the mix?
> > 
> > Right.
> > 
> > > Honestly I'd probably need to stare at this code and think about it and
> > > work through some scenarios to quantify how much of a concern this
> > > really is, and I don't really have the bandwidth for that just now. I
> > > mainly wanted to raise the notion that if we're assessing high level log
> > > ordering requirements for growfs, we should consider the shrink case as
> > > well.
> > 
> > <nod>
> > 
> > > > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > > > least characterized and testing is more sorted out.
> > > > > > > 
> > > > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > > > and I hacked up a more general test that also reproduces it and
> > > > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > > > the previous patches address the mount failure issue, but this patch
> > > > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > > > > 
> > > > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > > > similar blowups?  Let's see what happens if I do that...
> > > > > > 
> > > > > 
> > > > > Heh, sounds like so from your followup. Fun times.
> > > > > 
> > > > > I guess that test should probably work its way upstream. I made some
> > > > > tweaks locally since last posted to try and make it a little more
> > > > > aggressive, but it didn't repro anything new so not sure how much
> > > > > difference it makes really. Do we want a separate version like yours for
> > > > > the rt case or would you expect to cover both cases in a single test?
> > > > 
> > > > This probably should be different tests, because rt is its own very
> > > > weird animal.
> > > > 
> > > 
> > > Posted a couple tests the other day, JFYI.
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > > > > 
> > > > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > > > off into separate series.
> > > > > > > > > 
> > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > > > regress it." But once again, just my .02.
> > > > > > > > 
> > > > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > > > kernels either. :(
> > > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d95409f3cba667..2c83ab7441ade5 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1025,6 +1025,80 @@  xfs_sb_mount_common(
 	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
 
+/*
+ * Mirror the lazy sb counters to the in-core superblock.
+ *
+ * If this is at unmount, the counters will be exactly correct, but at any other
+ * time they will only be ballpark correct because of reservations that have
+ * been taken out of the percpu counters.  After an unclean shutdown this is
+ * corrected by log recovery rebuilding the counters from the AGF block counts.
+ *
+ * Do not update sb_frextents here because it is not part of the lazy sb
+ * counters, despite having a percpu counter.  It is always kept consistent with
+ * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need
+ * to update it here.
+ */
+static void
+xfs_flush_sb_counters(
+	struct xfs_mount	*mp)
+{
+	if (xfs_has_lazysbcount(mp)) {
+		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
+		mp->m_sb.sb_ifree = min_t(uint64_t,
+				percpu_counter_sum_positive(&mp->m_ifree),
+				mp->m_sb.sb_icount);
+		mp->m_sb.sb_fdblocks =
+				percpu_counter_sum_positive(&mp->m_fdblocks);
+	}
+}
+
+/*
+ * Prepare a direct update to the superblock through the on-disk buffer.
+ *
+ * This locks out other modifications through the buffer lock and then syncs all
+ * in-core values to the on-disk buffer (including the percpu counters).
+ *
+ * The caller then directly manipulates the on-disk fields and calls
+ * xfs_commit_sb_update to write the updates to disk.  The caller is
+ * responsible for also updating the in-core fields, but it can do so after
+ * the transaction has been committed to disk.
+ *
+ * Updating the in-core fields only after xfs_commit_sb_update ensures that
+ * other processes only see the update once it is stable on disk, and is
+ * usually the right thing to do for superblock updates.
+ *
+ * Note that writes to superblock fields updated using this helper are
+ * synchronized using the superblock buffer lock, which must be taken around
+ * all updates to the in-core fields as well.
+ */
+struct xfs_dsb *
+xfs_prepare_sb_update(
+	struct xfs_trans	*tp,
+	struct xfs_buf		**bpp)
+{
+	*bpp = xfs_trans_getsb(tp);
+	xfs_flush_sb_counters(tp->t_mountp);
+	xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
+	return (*bpp)->b_addr;
+}
+
+/*
+ * Commit a direct update to the on-disk superblock.  Keeps @bp locked and
+ * referenced, so the caller must call xfs_buf_relse() manually.
+ */
+int
+xfs_commit_sb_update(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	xfs_trans_bhold(tp, bp);
+	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
+	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
+
+	xfs_trans_set_sync(tp);
+	return xfs_trans_commit(tp);
+}
+
 /*
  * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
  * into the superblock buffer to be logged.  It does not provide the higher
@@ -1038,28 +1112,7 @@  xfs_log_sb(
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_buf		*bp = xfs_trans_getsb(tp);
 
-	/*
-	 * Lazy sb counters don't update the in-core superblock so do that now.
-	 * If this is at unmount, the counters will be exactly correct, but at
-	 * any other time they will only be ballpark correct because of
-	 * reservations that have been taken out percpu counters. If we have an
-	 * unclean shutdown, this will be corrected by log recovery rebuilding
-	 * the counters from the AGF block counts.
-	 *
-	 * Do not update sb_frextents here because it is not part of the lazy
-	 * sb counters, despite having a percpu counter. It is always kept
-	 * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
-	 * and hence we don't need have to update it here.
-	 */
-	if (xfs_has_lazysbcount(mp)) {
-		mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
-		mp->m_sb.sb_ifree = min_t(uint64_t,
-				percpu_counter_sum_positive(&mp->m_ifree),
-				mp->m_sb.sb_icount);
-		mp->m_sb.sb_fdblocks =
-				percpu_counter_sum_positive(&mp->m_fdblocks);
-	}
-
+	xfs_flush_sb_counters(mp);
 	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
 	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
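
For reference, a minimal sketch of the calling convention the comments above
describe, modeled on the xfs_growfs_imaxpct() conversion further down in this
patch; the function name and the new_imaxpct argument are illustrative, only
xfs_prepare_sb_update(), xfs_commit_sb_update() and xfs_buf_relse() come from
the patch itself:

static int
example_sb_geometry_update(
	struct xfs_trans	*tp,
	struct xfs_mount	*mp,
	__u32			new_imaxpct)
{
	struct xfs_dsb		*sbp;
	struct xfs_buf		*bp;
	int			error;

	/* Lock the sb buffer and sync the in-core superblock into it. */
	sbp = xfs_prepare_sb_update(tp, &bp);

	/* Manipulate the on-disk fields directly. */
	sbp->sb_imax_pct = new_imaxpct;

	/* Log the full superblock and commit synchronously; bp stays locked. */
	error = xfs_commit_sb_update(tp, bp);

	/*
	 * Update the in-core copy only once the change is stable on disk,
	 * still under the superblock buffer lock.
	 */
	if (!error)
		mp->m_sb.sb_imax_pct = new_imaxpct;

	xfs_buf_relse(bp);
	return error;
}
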
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 885c837559914d..3649d071687e33 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -13,6 +13,9 @@  struct xfs_trans;
 struct xfs_fsop_geom;
 struct xfs_perag;
 
+struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
+			struct xfs_buf **bpp);
+int		xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
 extern void	xfs_log_sb(struct xfs_trans *tp);
 extern int	xfs_sync_sb(struct xfs_mount *mp, bool wait);
 extern int	xfs_sync_sb_buf(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 33b84a3a83ff63..45a32ea426164a 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -149,14 +149,6 @@  void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_TRANS_SB_RES_FDBLOCKS	0x00000008
 #define	XFS_TRANS_SB_FREXTENTS		0x00000010
 #define	XFS_TRANS_SB_RES_FREXTENTS	0x00000020
-#define	XFS_TRANS_SB_DBLOCKS		0x00000040
-#define	XFS_TRANS_SB_AGCOUNT		0x00000080
-#define	XFS_TRANS_SB_IMAXPCT		0x00000100
-#define	XFS_TRANS_SB_REXTSIZE		0x00000200
-#define	XFS_TRANS_SB_RBMBLOCKS		0x00000400
-#define	XFS_TRANS_SB_RBLOCKS		0x00000800
-#define	XFS_TRANS_SB_REXTENTS		0x00001000
-#define	XFS_TRANS_SB_REXTSLOG		0x00002000
 
 /*
  * Here we centralize the specification of XFS meta-data buffer reference count
diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
index 0fef98e9f83409..be9d31f032b1bf 100644
--- a/fs/xfs/scrub/rtbitmap_repair.c
+++ b/fs/xfs/scrub/rtbitmap_repair.c
@@ -16,6 +16,7 @@ 
 #include "xfs_bit.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_sb.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -127,20 +128,21 @@  xrep_rtbitmap_geometry(
 	struct xchk_rtbitmap	*rtb)
 {
 	struct xfs_mount	*mp = sc->mp;
-	struct xfs_trans	*tp = sc->tp;
 
 	/* Superblock fields */
-	if (mp->m_sb.sb_rextents != rtb->rextents)
-		xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
-				rtb->rextents - mp->m_sb.sb_rextents);
-
-	if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
-				rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
-
-	if (mp->m_sb.sb_rextslog != rtb->rextslog)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
-				rtb->rextslog - mp->m_sb.sb_rextslog);
+	if (mp->m_sb.sb_rextents != rtb->rextents ||
+	    mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
+	    mp->m_sb.sb_rextslog != rtb->rextslog) {
+		struct xfs_buf		*bp = xfs_trans_getsb(sc->tp);
+
+		mp->m_sb.sb_rextents = rtb->rextents;
+		mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
+		mp->m_sb.sb_rextslog = rtb->rextslog;
+		xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
+
+		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
+		xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
+	}
 
 	/* Fix broken isize */
 	sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b247d895c276d2..4168ccf21068cb 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -79,6 +79,46 @@  xfs_resizefs_init_new_ags(
 	return error;
 }
 
+static int
+xfs_growfs_data_update_sb(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		nagcount,
+	xfs_rfsblock_t		nb,
+	xfs_agnumber_t		nagimax)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_dsb		*sbp;
+	struct xfs_buf		*bp;
+	int			error;
+
+	/*
+	 * Update the geometry fields in the on-disk superblock first, and
+	 * ensure they make it to disk before the superblock can be relogged.
+	 */
+	sbp = xfs_prepare_sb_update(tp, &bp);
+	sbp->sb_agcount = cpu_to_be32(nagcount);
+	sbp->sb_dblocks = cpu_to_be64(nb);
+	error = xfs_commit_sb_update(tp, bp);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Propagate the new values to the live mount structure after they have
+	 * made it to disk, with the superblock buffer still locked.
+	 */
+	mp->m_sb.sb_agcount = nagcount;
+	mp->m_sb.sb_dblocks = nb;
+
+	if (nagimax)
+		mp->m_maxagi = nagimax;
+	xfs_set_low_space_thresholds(mp);
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+
+out_unlock:
+	xfs_buf_relse(bp);
+	return error;
+}
+
 /*
  * growfs operations
  */
@@ -171,37 +211,13 @@  xfs_growfs_data_private(
 	if (error)
 		goto out_trans_cancel;
 
-	/*
-	 * Update changed superblock fields transactionally. These are not
-	 * seen by the rest of the world until the transaction commit applies
-	 * them atomically to the superblock.
-	 */
-	if (nagcount > oagcount)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
-	if (delta)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
 	if (id.nfree)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
 
-	/*
-	 * Sync sb counters now to reflect the updated values. This is
-	 * particularly important for shrink because the write verifier
-	 * will fail if sb_fdblocks is ever larger than sb_dblocks.
-	 */
-	if (xfs_has_lazysbcount(mp))
-		xfs_log_sb(tp);
-
-	xfs_trans_set_sync(tp);
-	error = xfs_trans_commit(tp);
+	error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
 	if (error)
 		return error;
 
-	/* New allocation groups fully initialized, so update mount struct */
-	if (nagimax)
-		mp->m_maxagi = nagimax;
-	xfs_set_low_space_thresholds(mp);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
-
 	if (delta > 0) {
 		/*
 		 * If we expanded the last AG, free the per-AG reservation
@@ -260,8 +276,9 @@  xfs_growfs_imaxpct(
 	struct xfs_mount	*mp,
 	__u32			imaxpct)
 {
+	struct xfs_dsb		*sbp;
+	struct xfs_buf		*bp;
 	struct xfs_trans	*tp;
-	int			dpct;
 	int			error;
 
 	if (imaxpct > 100)
@@ -272,10 +289,13 @@  xfs_growfs_imaxpct(
 	if (error)
 		return error;
 
-	dpct = imaxpct - mp->m_sb.sb_imax_pct;
-	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
-	xfs_trans_set_sync(tp);
-	return xfs_trans_commit(tp);
+	sbp = xfs_prepare_sb_update(tp, &bp);
+	sbp->sb_imax_pct = imaxpct;
+	error = xfs_commit_sb_update(tp, bp);
+	if (!error)
+		mp->m_sb.sb_imax_pct = imaxpct;
+	xfs_buf_relse(bp);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3a2005a1e673dc..994e5efedab20f 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -698,6 +698,56 @@  xfs_growfs_rt_fixup_extsize(
 	return error;
 }
 
+static int
+xfs_growfs_rt_update_sb(
+	struct xfs_trans	*tp,
+	struct xfs_mount	*mp,
+	struct xfs_mount	*nmp,
+	xfs_rtbxlen_t		freed_rtx)
+{
+	struct xfs_dsb		*sbp;
+	struct xfs_buf		*bp;
+	int			error;
+
+	/*
+	 * Update the geometry fields in the on-disk superblock first, and
+	 * ensure they make it to disk before the superblock can be relogged.
+	 */
+	sbp = xfs_prepare_sb_update(tp, &bp);
+	sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
+	sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
+	sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
+	sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
+	sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
+	error = xfs_commit_sb_update(tp, bp);
+	if (error)
+		return error;
+
+	/*
+	 * Propagate the new values to the live mount structure after they have
+	 * made it to disk, with the superblock buffer still locked.
+	 */
+	mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
+	mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
+	mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
+	mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
+	mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
+	mp->m_rsumlevels = nmp->m_rsumlevels;
+	mp->m_rsumblocks = nmp->m_rsumblocks;
+
+	/*
+	 * Recompute the growfsrt reservation from the new rsumsize.
+	 */
+	xfs_trans_resv_calc(mp, &mp->m_resv);
+
+	/*
+	 * Ensure the mount RT feature flag is now set.
+	 */
+	mp->m_features |= XFS_FEAT_REALTIME;
+	xfs_buf_relse(bp);
+	return 0;
+}
+
 static int
 xfs_growfs_rt_bmblock(
 	struct xfs_mount	*mp,
@@ -780,25 +830,6 @@  xfs_growfs_rt_bmblock(
 			goto out_cancel;
 	}
 
-	/*
-	 * Update superblock fields.
-	 */
-	if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
-		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
-			nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
-	if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
-		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
-			nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
-	if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
-		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
-			nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
-	if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
-		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
-			nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
-	if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
-		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
-			nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
-
 	/*
 	 * Free the new extent.
 	 */
@@ -807,33 +838,12 @@  xfs_growfs_rt_bmblock(
 	xfs_rtbuf_cache_relse(&nargs);
 	if (error)
 		goto out_cancel;
-
-	/*
-	 * Mark more blocks free in the superblock.
-	 */
 	xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
 
-	/*
-	 * Update the calculated values in the real mount structure.
-	 */
-	mp->m_rsumlevels = nmp->m_rsumlevels;
-	mp->m_rsumblocks = nmp->m_rsumblocks;
-	xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
-
-	/*
-	 * Recompute the growfsrt reservation from the new rsumsize.
-	 */
-	xfs_trans_resv_calc(mp, &mp->m_resv);
-
-	error = xfs_trans_commit(args.tp);
+	error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
 	if (error)
 		goto out_free;
 
-	/*
-	 * Ensure the mount RT feature flag is now set.
-	 */
-	mp->m_features |= XFS_FEAT_REALTIME;
-
 	kfree(nmp);
 	return 0;
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bdf3704dc30118..56505cb94f877d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -430,31 +430,6 @@  xfs_trans_mod_sb(
 		ASSERT(delta < 0);
 		tp->t_res_frextents_delta += delta;
 		break;
-	case XFS_TRANS_SB_DBLOCKS:
-		tp->t_dblocks_delta += delta;
-		break;
-	case XFS_TRANS_SB_AGCOUNT:
-		ASSERT(delta > 0);
-		tp->t_agcount_delta += delta;
-		break;
-	case XFS_TRANS_SB_IMAXPCT:
-		tp->t_imaxpct_delta += delta;
-		break;
-	case XFS_TRANS_SB_REXTSIZE:
-		tp->t_rextsize_delta += delta;
-		break;
-	case XFS_TRANS_SB_RBMBLOCKS:
-		tp->t_rbmblocks_delta += delta;
-		break;
-	case XFS_TRANS_SB_RBLOCKS:
-		tp->t_rblocks_delta += delta;
-		break;
-	case XFS_TRANS_SB_REXTENTS:
-		tp->t_rextents_delta += delta;
-		break;
-	case XFS_TRANS_SB_REXTSLOG:
-		tp->t_rextslog_delta += delta;
-		break;
 	default:
 		ASSERT(0);
 		return;
@@ -475,12 +450,8 @@  STATIC void
 xfs_trans_apply_sb_deltas(
 	xfs_trans_t	*tp)
 {
-	struct xfs_dsb	*sbp;
-	struct xfs_buf	*bp;
-	int		whole = 0;
-
-	bp = xfs_trans_getsb(tp);
-	sbp = bp->b_addr;
+	struct xfs_buf	*bp = xfs_trans_getsb(tp);
+	struct xfs_dsb	*sbp = bp->b_addr;
 
 	/*
 	 * Only update the superblock counters if we are logging them
@@ -522,53 +493,10 @@  xfs_trans_apply_sb_deltas(
 		spin_unlock(&mp->m_sb_lock);
 	}
 
-	if (tp->t_dblocks_delta) {
-		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
-		whole = 1;
-	}
-	if (tp->t_agcount_delta) {
-		be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
-		whole = 1;
-	}
-	if (tp->t_imaxpct_delta) {
-		sbp->sb_imax_pct += tp->t_imaxpct_delta;
-		whole = 1;
-	}
-	if (tp->t_rextsize_delta) {
-		be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
-		whole = 1;
-	}
-	if (tp->t_rbmblocks_delta) {
-		be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
-		whole = 1;
-	}
-	if (tp->t_rblocks_delta) {
-		be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
-		whole = 1;
-	}
-	if (tp->t_rextents_delta) {
-		be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
-		whole = 1;
-	}
-	if (tp->t_rextslog_delta) {
-		sbp->sb_rextslog += tp->t_rextslog_delta;
-		whole = 1;
-	}
-
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
-	if (whole)
-		/*
-		 * Log the whole thing, the fields are noncontiguous.
-		 */
-		xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
-	else
-		/*
-		 * Since all the modifiable fields are contiguous, we
-		 * can get away with this.
-		 */
-		xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
-				  offsetof(struct xfs_dsb, sb_frextents) +
-				  sizeof(sbp->sb_frextents) - 1);
+	xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
+			  offsetof(struct xfs_dsb, sb_frextents) +
+			  sizeof(sbp->sb_frextents) - 1);
 }
 
 /*
@@ -656,26 +584,7 @@  xfs_trans_unreserve_and_mod_sb(
 	 * must be consistent with the ondisk rtbitmap and must never include
 	 * incore reservations.
 	 */
-	mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
-	mp->m_sb.sb_agcount += tp->t_agcount_delta;
-	mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
-	mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
-	if (tp->t_rextsize_delta) {
-		mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
-		mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
-	}
-	mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
-	mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
-	mp->m_sb.sb_rextents += tp->t_rextents_delta;
-	mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
 	spin_unlock(&mp->m_sb_lock);
-
-	/*
-	 * Debug checks outside of the spinlock so they don't lock up the
-	 * machine if they fail.
-	 */
-	ASSERT(mp->m_sb.sb_imax_pct >= 0);
-	ASSERT(mp->m_sb.sb_rextslog >= 0);
 }
 
 /* Add the given log item to the transaction's list of log items. */
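
With the geometry deltas gone, xfs_trans_apply_sb_deltas() only ever dirties
the lazy counters plus sb_frextents, which sit back to back in struct xfs_dsb,
so logging just the sb_icount..sb_frextents byte range is always sufficient.
A build-time check along these lines (illustrative only, not part of this
patch, and it would have to live inside some init function) documents that
layout assumption:

	/* sb_icount, sb_ifree, sb_fdblocks and sb_frextents are contiguous. */
	BUILD_BUG_ON(offsetof(struct xfs_dsb, sb_ifree) !=
		     offsetof(struct xfs_dsb, sb_icount) + sizeof(__be64));
	BUILD_BUG_ON(offsetof(struct xfs_dsb, sb_fdblocks) !=
		     offsetof(struct xfs_dsb, sb_ifree) + sizeof(__be64));
	BUILD_BUG_ON(offsetof(struct xfs_dsb, sb_frextents) !=
		     offsetof(struct xfs_dsb, sb_fdblocks) + sizeof(__be64));
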
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f06cc0f41665ad..e5911cf09be444 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -140,14 +140,6 @@  typedef struct xfs_trans {
 	int64_t			t_res_fdblocks_delta; /* on-disk only chg */
 	int64_t			t_frextents_delta;/* superblock freextents chg*/
 	int64_t			t_res_frextents_delta; /* on-disk only chg */
-	int64_t			t_dblocks_delta;/* superblock dblocks change */
-	int64_t			t_agcount_delta;/* superblock agcount change */
-	int64_t			t_imaxpct_delta;/* superblock imaxpct change */
-	int64_t			t_rextsize_delta;/* superblock rextsize chg */
-	int64_t			t_rbmblocks_delta;/* superblock rbmblocks chg */
-	int64_t			t_rblocks_delta;/* superblock rblocks change */
-	int64_t			t_rextents_delta;/* superblocks rextents chg */
-	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_busy;		/* list of busy extents */
 	struct list_head	t_dfops;	/* deferred operations */