| Message ID | 20240930164211.2357358-7-hch@lst.de (mailing list archive) |
|---|---|
| State | Not Applicable, archived |
| Series | [1/7] xfs: pass the exact range to initialize to xfs_initialize_perag |
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to stable store before the allocator can see them to avoid that they are in the same CIL checkpoint as transactions that make use of this new information, which will make recovery impossible or broken.

Ok, so we don't want geometry changes transactions in the same CIL checkpoint as alloc related transactions that might depend on the geometry changes. That seems reasonable and on a first pass I have an idea of what this is doing, but the description is kind of vague. Obviously this fixes an issue on the recovery side (since I've tested it), but it's not quite clear to me from the description and/or logic changes how that issue manifests.

Could you elaborate please? For example, is this some kind of race situation between an allocation request and a growfs transaction, where the former perhaps sees a newly added AG between the time the growfs transaction commits (applying the sb deltas) and it actually commits to the log due to being a sync transaction, thus allowing an alloc on a new AG into the same checkpoint that adds the AG?

Is there any ordering issue on the recovery side, or is it mainly that we don't init the in-core perags until after recovery, so log recovery of grow operations is basically broken unless we don't reference the new AGs in the log before the geometry update pushes out of the active log..?

I'm trying to grok how much of this patch is fixing a reproducible bug vs. supporting what the earlier patches are doing on the recovery side. I also wonder if/how this might fundamentally apply to shrink, but maybe that's getting too far into the weeds. IIRC we don't currently support AG shrink, but can lop off the end of the tail AG if space happens to be free. So for example, is there any issue where that tail AG space is freed and immediately shrunk off of the fs? Hm.. maybe the hacky growfs test should try and include some shrink operations as well.

Anyways, context on the specific problem would make this easier to review and IMO, should be included in the commit log anyways for historical reference if you're going to change how superblock fields are logged. Just my .02.

Brian

> To do this add two new helpers to prepare a superblock for direct manipulation of the on-disk buffer, and to commit these updates while holding the buffer locked (similar to what xfs_sync_sb_buf does) and use those in growfs instead of applying the changes through the deltas in the xfs_trans structure (which also happens to shrink the xfs_trans structure a fair bit).

> The rtbmimap repair code was also using the transaction deltas and is converted to also update the superblock buffer directly under the buffer lock.

> This new method establishes a locking protocol where even in-core superblock fields must only be updated with the superblock buffer locked. For now it is only applied to affected geometry fields, but in the future it would make sense to apply it universally.
> > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/xfs/libxfs/xfs_sb.c | 97 ++++++++++++++++++++++++------- > fs/xfs/libxfs/xfs_sb.h | 3 + > fs/xfs/libxfs/xfs_shared.h | 8 --- > fs/xfs/scrub/rtbitmap_repair.c | 26 +++++---- > fs/xfs/xfs_fsops.c | 80 ++++++++++++++++---------- > fs/xfs/xfs_rtalloc.c | 92 +++++++++++++++++------------- > fs/xfs/xfs_trans.c | 101 ++------------------------------- > fs/xfs/xfs_trans.h | 8 --- > 8 files changed, 198 insertions(+), 217 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c > index d95409f3cba667..2c83ab7441ade5 100644 > --- a/fs/xfs/libxfs/xfs_sb.c > +++ b/fs/xfs/libxfs/xfs_sb.c > @@ -1025,6 +1025,80 @@ xfs_sb_mount_common( > mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp); > } > > +/* > + * Mirror the lazy sb counters to the in-core superblock. > + * > + * If this is at unmount, the counters will be exactly correct, but at any other > + * time they will only be ballpark correct because of reservations that have > + * been taken out percpu counters. If we have an unclean shutdown, this will be > + * corrected by log recovery rebuilding the counters from the AGF block counts. > + * > + * Do not update sb_frextents here because it is not part of the lazy sb > + * counters, despite having a percpu counter. It is always kept consistent with > + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need > + * have to update it here. > + */ > +static void > +xfs_flush_sb_counters( > + struct xfs_mount *mp) > +{ > + if (xfs_has_lazysbcount(mp)) { > + mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount); > + mp->m_sb.sb_ifree = min_t(uint64_t, > + percpu_counter_sum_positive(&mp->m_ifree), > + mp->m_sb.sb_icount); > + mp->m_sb.sb_fdblocks = > + percpu_counter_sum_positive(&mp->m_fdblocks); > + } > +} > + > +/* > + * Prepare a direct update to the superblock through the on-disk buffer. > + * > + * This locks out other modifications through the buffer lock and then syncs all > + * in-core values to the on-disk buffer (including the percpu counters). > + * > + * The caller then directly manipulates the on-disk fields and calls > + * xfs_commit_sb_update to the updates to disk them. The caller is responsible > + * to also update the in-core field, but it can do so after the transaction has > + * been committed to disk. > + * > + * Updating the in-core field only after xfs_commit_sb_update ensures that other > + * processes only see the update once it is stable on disk, and is usually the > + * right thing to do for superblock updates. > + * > + * Note that writes to superblock fields updated using this helper are > + * synchronized using the superblock buffer lock, which must be taken around > + * all updates to the in-core fields as well. > + */ > +struct xfs_dsb * > +xfs_prepare_sb_update( > + struct xfs_trans *tp, > + struct xfs_buf **bpp) > +{ > + *bpp = xfs_trans_getsb(tp); > + xfs_flush_sb_counters(tp->t_mountp); > + xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb); > + return (*bpp)->b_addr; > +} > + > +/* > + * Commit a direct update to the on-disk superblock. Keeps @bp locked and > + * referenced, so the caller must call xfs_buf_relse() manually. 
> + */ > +int > +xfs_commit_sb_update( > + struct xfs_trans *tp, > + struct xfs_buf *bp) > +{ > + xfs_trans_bhold(tp, bp); > + xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF); > + xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1); > + > + xfs_trans_set_sync(tp); > + return xfs_trans_commit(tp); > +} > + > /* > * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock > * into the superblock buffer to be logged. It does not provide the higher > @@ -1038,28 +1112,7 @@ xfs_log_sb( > struct xfs_mount *mp = tp->t_mountp; > struct xfs_buf *bp = xfs_trans_getsb(tp); > > - /* > - * Lazy sb counters don't update the in-core superblock so do that now. > - * If this is at unmount, the counters will be exactly correct, but at > - * any other time they will only be ballpark correct because of > - * reservations that have been taken out percpu counters. If we have an > - * unclean shutdown, this will be corrected by log recovery rebuilding > - * the counters from the AGF block counts. > - * > - * Do not update sb_frextents here because it is not part of the lazy > - * sb counters, despite having a percpu counter. It is always kept > - * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas() > - * and hence we don't need have to update it here. > - */ > - if (xfs_has_lazysbcount(mp)) { > - mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount); > - mp->m_sb.sb_ifree = min_t(uint64_t, > - percpu_counter_sum_positive(&mp->m_ifree), > - mp->m_sb.sb_icount); > - mp->m_sb.sb_fdblocks = > - percpu_counter_sum_positive(&mp->m_fdblocks); > - } > - > + xfs_flush_sb_counters(mp); > xfs_sb_to_disk(bp->b_addr, &mp->m_sb); > xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF); > xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1); > diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h > index 885c837559914d..3649d071687e33 100644 > --- a/fs/xfs/libxfs/xfs_sb.h > +++ b/fs/xfs/libxfs/xfs_sb.h > @@ -13,6 +13,9 @@ struct xfs_trans; > struct xfs_fsop_geom; > struct xfs_perag; > > +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp, > + struct xfs_buf **bpp); > +int xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp); > extern void xfs_log_sb(struct xfs_trans *tp); > extern int xfs_sync_sb(struct xfs_mount *mp, bool wait); > extern int xfs_sync_sb_buf(struct xfs_mount *mp); > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > index 33b84a3a83ff63..45a32ea426164a 100644 > --- a/fs/xfs/libxfs/xfs_shared.h > +++ b/fs/xfs/libxfs/xfs_shared.h > @@ -149,14 +149,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > #define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008 > #define XFS_TRANS_SB_FREXTENTS 0x00000010 > #define XFS_TRANS_SB_RES_FREXTENTS 0x00000020 > -#define XFS_TRANS_SB_DBLOCKS 0x00000040 > -#define XFS_TRANS_SB_AGCOUNT 0x00000080 > -#define XFS_TRANS_SB_IMAXPCT 0x00000100 > -#define XFS_TRANS_SB_REXTSIZE 0x00000200 > -#define XFS_TRANS_SB_RBMBLOCKS 0x00000400 > -#define XFS_TRANS_SB_RBLOCKS 0x00000800 > -#define XFS_TRANS_SB_REXTENTS 0x00001000 > -#define XFS_TRANS_SB_REXTSLOG 0x00002000 > > /* > * Here we centralize the specification of XFS meta-data buffer reference count > diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c > index 0fef98e9f83409..be9d31f032b1bf 100644 > --- a/fs/xfs/scrub/rtbitmap_repair.c > +++ b/fs/xfs/scrub/rtbitmap_repair.c > @@ -16,6 +16,7 @@ > #include "xfs_bit.h" > #include "xfs_bmap.h" > #include "xfs_bmap_btree.h" > +#include "xfs_sb.h" > #include "scrub/scrub.h" > 
#include "scrub/common.h" > #include "scrub/trace.h" > @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry( > struct xchk_rtbitmap *rtb) > { > struct xfs_mount *mp = sc->mp; > - struct xfs_trans *tp = sc->tp; > > /* Superblock fields */ > - if (mp->m_sb.sb_rextents != rtb->rextents) > - xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS, > - rtb->rextents - mp->m_sb.sb_rextents); > - > - if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks) > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS, > - rtb->rbmblocks - mp->m_sb.sb_rbmblocks); > - > - if (mp->m_sb.sb_rextslog != rtb->rextslog) > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG, > - rtb->rextslog - mp->m_sb.sb_rextslog); > + if (mp->m_sb.sb_rextents != rtb->rextents || > + mp->m_sb.sb_rbmblocks != rtb->rbmblocks || > + mp->m_sb.sb_rextslog != rtb->rextslog) { > + struct xfs_buf *bp = xfs_trans_getsb(sc->tp); > + > + mp->m_sb.sb_rextents = rtb->rextents; > + mp->m_sb.sb_rbmblocks = rtb->rbmblocks; > + mp->m_sb.sb_rextslog = rtb->rextslog; > + xfs_sb_to_disk(bp->b_addr, &mp->m_sb); > + > + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF); > + xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1); > + } > > /* Fix broken isize */ > sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size, > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c > index b247d895c276d2..4168ccf21068cb 100644 > --- a/fs/xfs/xfs_fsops.c > +++ b/fs/xfs/xfs_fsops.c > @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags( > return error; > } > > +static int > +xfs_growfs_data_update_sb( > + struct xfs_trans *tp, > + xfs_agnumber_t nagcount, > + xfs_rfsblock_t nb, > + xfs_agnumber_t nagimax) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_dsb *sbp; > + struct xfs_buf *bp; > + int error; > + > + /* > + * Update the geometry in the on-disk superblock first, and ensure > + * they make it to disk before the superblock can be relogged. > + */ > + sbp = xfs_prepare_sb_update(tp, &bp); > + sbp->sb_agcount = cpu_to_be32(nagcount); > + sbp->sb_dblocks = cpu_to_be64(nb); > + error = xfs_commit_sb_update(tp, bp); > + if (error) > + goto out_unlock; > + > + /* > + * Propagate the new values to the live mount structure after they made > + * it to disk with the superblock buffer still locked. > + */ > + mp->m_sb.sb_agcount = nagcount; > + mp->m_sb.sb_dblocks = nb; > + > + if (nagimax) > + mp->m_maxagi = nagimax; > + xfs_set_low_space_thresholds(mp); > + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); > + > +out_unlock: > + xfs_buf_relse(bp); > + return error; > +} > + > /* > * growfs operations > */ > @@ -171,37 +211,13 @@ xfs_growfs_data_private( > if (error) > goto out_trans_cancel; > > - /* > - * Update changed superblock fields transactionally. These are not > - * seen by the rest of the world until the transaction commit applies > - * them atomically to the superblock. > - */ > - if (nagcount > oagcount) > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); > - if (delta) > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta); > if (id.nfree) > xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree); > > - /* > - * Sync sb counters now to reflect the updated values. This is > - * particularly important for shrink because the write verifier > - * will fail if sb_fdblocks is ever larger than sb_dblocks. 
> - */ > - if (xfs_has_lazysbcount(mp)) > - xfs_log_sb(tp); > - > - xfs_trans_set_sync(tp); > - error = xfs_trans_commit(tp); > + error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax); > if (error) > return error; > > - /* New allocation groups fully initialized, so update mount struct */ > - if (nagimax) > - mp->m_maxagi = nagimax; > - xfs_set_low_space_thresholds(mp); > - mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); > - > if (delta > 0) { > /* > * If we expanded the last AG, free the per-AG reservation > @@ -260,8 +276,9 @@ xfs_growfs_imaxpct( > struct xfs_mount *mp, > __u32 imaxpct) > { > + struct xfs_dsb *sbp; > + struct xfs_buf *bp; > struct xfs_trans *tp; > - int dpct; > int error; > > if (imaxpct > 100) > @@ -272,10 +289,13 @@ xfs_growfs_imaxpct( > if (error) > return error; > > - dpct = imaxpct - mp->m_sb.sb_imax_pct; > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct); > - xfs_trans_set_sync(tp); > - return xfs_trans_commit(tp); > + sbp = xfs_prepare_sb_update(tp, &bp); > + sbp->sb_imax_pct = imaxpct; > + error = xfs_commit_sb_update(tp, bp); > + if (!error) > + mp->m_sb.sb_imax_pct = imaxpct; > + xfs_buf_relse(bp); > + return error; > } > > /* > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c > index 3a2005a1e673dc..994e5efedab20f 100644 > --- a/fs/xfs/xfs_rtalloc.c > +++ b/fs/xfs/xfs_rtalloc.c > @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize( > return error; > } > > +static int > +xfs_growfs_rt_update_sb( > + struct xfs_trans *tp, > + struct xfs_mount *mp, > + struct xfs_mount *nmp, > + xfs_rtbxlen_t freed_rtx) > +{ > + struct xfs_dsb *sbp; > + struct xfs_buf *bp; > + int error; > + > + /* > + * Update the geometry in the on-disk superblock first, and ensure > + * they make it to disk before the superblock can be relogged. > + */ > + sbp = xfs_prepare_sb_update(tp, &bp); > + sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize); > + sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks); > + sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks); > + sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents); > + sbp->sb_rextslog = nmp->m_sb.sb_rextslog; > + error = xfs_commit_sb_update(tp, bp); > + if (error) > + return error; > + > + /* > + * Propagate the new values to the live mount structure after they made > + * it to disk with the superblock buffer still locked. > + */ > + mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize; > + mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks; > + mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks; > + mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents; > + mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog; > + mp->m_rsumlevels = nmp->m_rsumlevels; > + mp->m_rsumblocks = nmp->m_rsumblocks; > + > + /* > + * Recompute the growfsrt reservation from the new rsumsize. > + */ > + xfs_trans_resv_calc(mp, &mp->m_resv); > + > + /* > + * Ensure the mount RT feature flag is now set. > + */ > + mp->m_features |= XFS_FEAT_REALTIME; > + xfs_buf_relse(bp); > + return 0; > +} > + > static int > xfs_growfs_rt_bmblock( > struct xfs_mount *mp, > @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock( > goto out_cancel; > } > > - /* > - * Update superblock fields. 
> - */ > - if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize) > - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE, > - nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize); > - if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks) > - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS, > - nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks); > - if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks) > - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS, > - nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks); > - if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents) > - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS, > - nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents); > - if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog) > - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG, > - nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog); > - > /* > * Free the new extent. > */ > @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock( > xfs_rtbuf_cache_relse(&nargs); > if (error) > goto out_cancel; > - > - /* > - * Mark more blocks free in the superblock. > - */ > xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx); > > - /* > - * Update the calculated values in the real mount structure. > - */ > - mp->m_rsumlevels = nmp->m_rsumlevels; > - mp->m_rsumblocks = nmp->m_rsumblocks; > - xfs_mount_sb_set_rextsize(mp, &mp->m_sb); > - > - /* > - * Recompute the growfsrt reservation from the new rsumsize. > - */ > - xfs_trans_resv_calc(mp, &mp->m_resv); > - > - error = xfs_trans_commit(args.tp); > + error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx); > if (error) > goto out_free; > > - /* > - * Ensure the mount RT feature flag is now set. > - */ > - mp->m_features |= XFS_FEAT_REALTIME; > - > kfree(nmp); > return 0; > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index bdf3704dc30118..56505cb94f877d 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -430,31 +430,6 @@ xfs_trans_mod_sb( > ASSERT(delta < 0); > tp->t_res_frextents_delta += delta; > break; > - case XFS_TRANS_SB_DBLOCKS: > - tp->t_dblocks_delta += delta; > - break; > - case XFS_TRANS_SB_AGCOUNT: > - ASSERT(delta > 0); > - tp->t_agcount_delta += delta; > - break; > - case XFS_TRANS_SB_IMAXPCT: > - tp->t_imaxpct_delta += delta; > - break; > - case XFS_TRANS_SB_REXTSIZE: > - tp->t_rextsize_delta += delta; > - break; > - case XFS_TRANS_SB_RBMBLOCKS: > - tp->t_rbmblocks_delta += delta; > - break; > - case XFS_TRANS_SB_RBLOCKS: > - tp->t_rblocks_delta += delta; > - break; > - case XFS_TRANS_SB_REXTENTS: > - tp->t_rextents_delta += delta; > - break; > - case XFS_TRANS_SB_REXTSLOG: > - tp->t_rextslog_delta += delta; > - break; > default: > ASSERT(0); > return; > @@ -475,12 +450,8 @@ STATIC void > xfs_trans_apply_sb_deltas( > xfs_trans_t *tp) > { > - struct xfs_dsb *sbp; > - struct xfs_buf *bp; > - int whole = 0; > - > - bp = xfs_trans_getsb(tp); > - sbp = bp->b_addr; > + struct xfs_buf *bp = xfs_trans_getsb(tp); > + struct xfs_dsb *sbp = bp->b_addr; > > /* > * Only update the superblock counters if we are logging them > @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas( > spin_unlock(&mp->m_sb_lock); > } > > - if (tp->t_dblocks_delta) { > - be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta); > - whole = 1; > - } > - if (tp->t_agcount_delta) { > - be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta); > - whole = 1; > - } > - if (tp->t_imaxpct_delta) { > - sbp->sb_imax_pct += tp->t_imaxpct_delta; > - whole = 1; > - } > - if (tp->t_rextsize_delta) { > - be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta); > - whole = 1; > - } > - if (tp->t_rbmblocks_delta) { > - 
be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta); > - whole = 1; > - } > - if (tp->t_rblocks_delta) { > - be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta); > - whole = 1; > - } > - if (tp->t_rextents_delta) { > - be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta); > - whole = 1; > - } > - if (tp->t_rextslog_delta) { > - sbp->sb_rextslog += tp->t_rextslog_delta; > - whole = 1; > - } > - > xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF); > - if (whole) > - /* > - * Log the whole thing, the fields are noncontiguous. > - */ > - xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1); > - else > - /* > - * Since all the modifiable fields are contiguous, we > - * can get away with this. > - */ > - xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount), > - offsetof(struct xfs_dsb, sb_frextents) + > - sizeof(sbp->sb_frextents) - 1); > + xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount), > + offsetof(struct xfs_dsb, sb_frextents) + > + sizeof(sbp->sb_frextents) - 1); > } > > /* > @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb( > * must be consistent with the ondisk rtbitmap and must never include > * incore reservations. > */ > - mp->m_sb.sb_dblocks += tp->t_dblocks_delta; > - mp->m_sb.sb_agcount += tp->t_agcount_delta; > - mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta; > - mp->m_sb.sb_rextsize += tp->t_rextsize_delta; > - if (tp->t_rextsize_delta) { > - mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize); > - mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize); > - } > - mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta; > - mp->m_sb.sb_rblocks += tp->t_rblocks_delta; > - mp->m_sb.sb_rextents += tp->t_rextents_delta; > - mp->m_sb.sb_rextslog += tp->t_rextslog_delta; > spin_unlock(&mp->m_sb_lock); > - > - /* > - * Debug checks outside of the spinlock so they don't lock up the > - * machine if they fail. > - */ > - ASSERT(mp->m_sb.sb_imax_pct >= 0); > - ASSERT(mp->m_sb.sb_rextslog >= 0); > } > > /* Add the given log item to the transaction's list of log items. */ > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index f06cc0f41665ad..e5911cf09be444 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -140,14 +140,6 @@ typedef struct xfs_trans { > int64_t t_res_fdblocks_delta; /* on-disk only chg */ > int64_t t_frextents_delta;/* superblock freextents chg*/ > int64_t t_res_frextents_delta; /* on-disk only chg */ > - int64_t t_dblocks_delta;/* superblock dblocks change */ > - int64_t t_agcount_delta;/* superblock agcount change */ > - int64_t t_imaxpct_delta;/* superblock imaxpct change */ > - int64_t t_rextsize_delta;/* superblock rextsize chg */ > - int64_t t_rbmblocks_delta;/* superblock rbmblocks chg */ > - int64_t t_rblocks_delta;/* superblock rblocks change */ > - int64_t t_rextents_delta;/* superblocks rextents chg */ > - int64_t t_rextslog_delta;/* superblocks rextslog chg */ > struct list_head t_items; /* log item descriptors */ > struct list_head t_busy; /* list of busy extents */ > struct list_head t_dfops; /* deferred operations */ > -- > 2.45.2 > >
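For quick reference, the caller pattern the two new helpers establish is easiest to see in the xfs_growfs_imaxpct() hunk of the quoted patch; the same lines are condensed below, with annotation comments added here (the comments are not part of the patch):

	sbp = xfs_prepare_sb_update(tp, &bp);	/* lock the sb buffer, sync incore -> ondisk */
	sbp->sb_imax_pct = imaxpct;		/* edit the on-disk fields directly */
	error = xfs_commit_sb_update(tp, bp);	/* log whole sb, sync commit; bp stays locked */
	if (!error)
		mp->m_sb.sb_imax_pct = imaxpct;	/* incore update only once stable on disk */
	xfs_buf_relse(bp);			/* caller drops the still-held buffer */
	return error;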
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to stable store before the allocator can see them to avoid that they are in the same CIL checkpoint as transactions that make use of this new information, which will make recovery impossible or broken.

> To do this add two new helpers to prepare a superblock for direct manipulation of the on-disk buffer, and to commit these updates while holding the buffer locked (similar to what xfs_sync_sb_buf does) and use those in growfs instead of applying the changes through the deltas in the xfs_trans structure (which also happens to shrink the xfs_trans structure a fair bit).

Yay for shrinking xfs_trans!

> The rtbmimap repair code was also using the transaction deltas and is converted to also update the superblock buffer directly under the buffer lock.

> This new method establishes a locking protocol where even in-core superblock fields must only be updated with the superblock buffer locked. For now it is only applied to affected geometry fields, but in the future it would make sense to apply it universally.

Hmm. One thing that I don't quite like here is the separation between updating the ondisk sb fields and updating the incore sb/recomputing the cached geometry fields. I think that's been handled correctly here, but the pending changes in growfsrt for rtgroups is going to make this more ugly.

What if instead this took the form of a new defer_ops type? The xfs_prepare_sb_update function would allocate a tracking object where we'd pin the sb buffer and record which fields get changed, as well as the new values. xfs_commit_sb_update then xfs_defer_add()s it to the transaction and commits it. (The ->create_intent function would return NULL so that no log item is created.)

The ->finish_item function would then bhold the sb buffer, update the ondisk super like how xfs_commit_sb_update does in this patch, set XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and flush that transaction and call ->finish_item again, at which point it would recompute the incore/cached geometry as necessary, bwrite the sb buffer, and release it.

The downside is that it's more complexity, but the upside is that the geometry changes are contained in one place instead of being scattered around, and the incore changes only happen if the synchronous transaction actually gets written to disk. IOWs, the end result is the same as what you propose here, but structured differently.

I guess the biggest downside is that log recovery has to call the incore/cached geometry recomputation function directly because there's no actual log intent item to recover.

(The code changes themselves look acceptable to me.)
--D
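As a rough illustration of the defer-ops structure proposed above: the sketch below uses hypothetical names and a deliberately simplified callback signature (the real xfs_defer_op_type callbacks take different arguments), so it only outlines the two-pass flow described in the email and is not actual fs/xfs code:

	/* Hypothetical tracking object; not part of the posted patch. */
	struct xfs_sb_update_intent {
		struct xfs_buf	*sui_bp;	/* superblock buffer, held across commits */
		struct xfs_sb	sui_new;	/* new geometry values to apply */
		bool		sui_ondisk_done;
	};

	/* ->create_intent would return NULL so no log intent item is created. */

	/* Simplified stand-in for a ->finish_item callback. */
	static int
	xfs_sb_update_finish_item(
		struct xfs_trans		*tp,
		struct xfs_sb_update_intent	*sui)
	{
		if (!sui->sui_ondisk_done) {
			/* pass 1: copy the new geometry into the ondisk sb and log it */
			xfs_trans_bhold(tp, sui->sui_bp);
			xfs_trans_set_sync(tp);
			sui->sui_ondisk_done = true;
			return -EAGAIN;	/* defer ops commits and flushes, then calls again */
		}
		/* pass 2: the sync commit is on disk; update the incore sb and
		 * cached geometry, then write out and release the sb buffer. */
		xfs_buf_relse(sui->sui_bp);
		return 0;
	}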
On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> Ok, so we don't want geometry changes transactions in the same CIL checkpoint as alloc related transactions that might depend on the geometry changes. That seems reasonable and on a first pass I have an idea of what this is doing, but the description is kind of vague. Obviously this fixes an issue on the recovery side (since I've tested it), but it's not quite clear to me from the description and/or logic changes how that issue manifests.

> Could you elaborate please? For example, is this some kind of race situation between an allocation request and a growfs transaction, where the former perhaps sees a newly added AG between the time the growfs transaction commits (applying the sb deltas) and it actually commits to the log due to being a sync transaction, thus allowing an alloc on a new AG into the same checkpoint that adds the AG?

This is based on the feedback by Dave on the previous version:

https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/

Just doing the perag/in-core sb updates earlier fixes all the issues with my test case, so I'm not actually sure how to get more updates into the check checkpoint. I'll try your exercisers if it could hit that.
On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> What if instead this took the form of a new defer_ops type? The xfs_prepare_sb_update function would allocate a tracking object where we'd pin the sb buffer and record which fields get changed, as well as the new values. xfs_commit_sb_update then xfs_defer_add()s it to the transaction and commits it. (The ->create_intent function would return NULL so that no log item is created.)

> The ->finish_item function would then bhold the sb buffer, update the ondisk super like how xfs_commit_sb_update does in this patch, set XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and flush that transaction and call ->finish_item again, at which point it would recompute the incore/cached geometry as necessary, bwrite the sb buffer, and release it.

> The downside is that it's more complexity, but the upside is that the geometry changes are contained in one place instead of being scattered around, and the incore changes only happen if the synchronous transaction actually gets written to disk. IOWs, the end result is the same as what you propose here, but structured differently.

That sounds overkill at first, but if we want to move all sb updates to that model more structured infrastructure might be very useful.
On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > Ok, so we don't want geometry changes transactions in the same CIL checkpoint as alloc related transactions that might depend on the geometry changes. That seems reasonable and on a first pass I have an idea of what this is doing, but the description is kind of vague. Obviously this fixes an issue on the recovery side (since I've tested it), but it's not quite clear to me from the description and/or logic changes how that issue manifests.

> > Could you elaborate please? For example, is this some kind of race situation between an allocation request and a growfs transaction, where the former perhaps sees a newly added AG between the time the growfs transaction commits (applying the sb deltas) and it actually commits to the log due to being a sync transaction, thus allowing an alloc on a new AG into the same checkpoint that adds the AG?

> This is based on the feedback by Dave on the previous version:
> https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/

Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm not sure I'd go straight to this change given the situation...

> Just doing the perag/in-core sb updates earlier fixes all the issues with my test case, so I'm not actually sure how to get more updates into the check checkpoint. I'll try your exercisers if it could hit that.

Ok, that explains things a bit. My observation is that the first 5 patches or so address the mount failure problem, but from there I'm not reproducing much difference with or without the final patch. Either way, I see aborts and splats all over the place, which implies at minimum this isn't the only issue here.

So given that 1. growfs recovery seems pretty much broken, 2. this particular patch has no straightforward way to test that it fixes something and at the same time doesn't break anything else, and 3. we do have at least one fairly straightforward growfs/recovery test in the works that reliably explodes, personally I'd suggest to split this work off into separate series.

It seems reasonable enough to me to get patches 1-5 in asap once they're fully cleaned up, and then leave the next two as part of a followon series pending further investigation into these other issues. As part of that I'd like to know whether the recovery test reproduces (or can be made to reproduce) the issue this patch presumably fixes, but I'd also settle for "the grow recovery test now passes reliably and this doesn't regress it." But once again, just my .02.

Brian
On Fri, Oct 11, 2024 at 09:59:03AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> > What if instead this took the form of a new defer_ops type? The xfs_prepare_sb_update function would allocate a tracking object where we'd pin the sb buffer and record which fields get changed, as well as the new values. xfs_commit_sb_update then xfs_defer_add()s it to the transaction and commits it. (The ->create_intent function would return NULL so that no log item is created.)

> > The ->finish_item function would then bhold the sb buffer, update the ondisk super like how xfs_commit_sb_update does in this patch, set XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and flush that transaction and call ->finish_item again, at which point it would recompute the incore/cached geometry as necessary, bwrite the sb buffer, and release it.

> > The downside is that it's more complexity, but the upside is that the geometry changes are contained in one place instead of being scattered around, and the incore changes only happen if the synchronous transaction actually gets written to disk. IOWs, the end result is the same as what you propose here, but structured differently.

> That sounds overkill at first, but if we want to move all sb updates to that model more structured infrastructure might be very useful.

<nod> We could just take this as-is and refactor it into the defer item code once we're done making all the other sb geometry growfsrt updates. I'd rather do that than rebase *two* entire patchsets just to get the same results.

--D
On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > Ok, so we don't want geometry changes transactions in the same CIL checkpoint as alloc related transactions that might depend on the geometry changes. That seems reasonable and on a first pass I have an idea of what this is doing, but the description is kind of vague. Obviously this fixes an issue on the recovery side (since I've tested it), but it's not quite clear to me from the description and/or logic changes how that issue manifests.

> > > Could you elaborate please? For example, is this some kind of race situation between an allocation request and a growfs transaction, where the former perhaps sees a newly added AG between the time the growfs transaction commits (applying the sb deltas) and it actually commits to the log due to being a sync transaction, thus allowing an alloc on a new AG into the same checkpoint that adds the AG?

> > This is based on the feedback by Dave on the previous version:
> > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/

> Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm not sure I'd go straight to this change given the situation...

> > Just doing the perag/in-core sb updates earlier fixes all the issues with my test case, so I'm not actually sure how to get more updates into the check checkpoint. I'll try your exercisers if it could hit that.

> Ok, that explains things a bit. My observation is that the first 5 patches or so address the mount failure problem, but from there I'm not reproducing much difference with or without the final patch.

Does this change to flush the log after committing the new sb fix the recovery problems on older kernels? I /think/ that's the point of this patch.

> Either way, I see aborts and splats all over the place, which implies at minimum this isn't the only issue here.

Ugh. I've recently noticed the long soak logrecovery test vm have seen a slight tick up in failure rates -- random blocks that have clearly had garbage written to them such that recovery tries to read the block to recover a buffer log item and kaboom. At this point it's unclear if that's a problem with xfs or somewhere else. :(

> So given that 1. growfs recovery seems pretty much broken, 2. this particular patch has no straightforward way to test that it fixes something and at the same time doesn't break anything else, and 3. we do

I'm curious, what might break? Was that merely a general comment, or do you have something specific in mind? (iows: do you see more string to pull? :))

> have at least one fairly straightforward growfs/recovery test in the works that reliably explodes, personally I'd suggest to split this work off into separate series.

> It seems reasonable enough to me to get patches 1-5 in asap once they're fully cleaned up, and then leave the next two as part of a followon series pending further investigation into these other issues. As part of that I'd like to know whether the recovery test reproduces (or can be made to reproduce) the issue this patch presumably fixes, but I'd also settle for "the grow recovery test now passes reliably and this doesn't regress it." But once again, just my .02.

Yeah, it's too bad there's no good way to test recovery with older kernels either. :(

--D

> Brian
On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > checkpoint as alloc related transactions that might depend on the > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > idea of what this is doing, but the description is kind of vague. > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > it), but it's not quite clear to me from the description and/or logic > > > > changes how that issue manifests. > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > situation between an allocation request and a growfs transaction, where > > > > the former perhaps sees a newly added AG between the time the growfs > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > AG into the same checkpoint that adds the AG? > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > not sure I'd go straight to this change given the situation... > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > with my test case, so I'm not actually sure how to get more updates > > > into the check checkpoint. I'll try your exercisers if it could hit > > > that. > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > patches or so address the mount failure problem, but from there I'm not > > reproducing much difference with or without the final patch. > > Does this change to flush the log after committing the new sb fix the > recovery problems on older kernels? I /think/ that's the point of this > patch. > I don't follow.. growfs always forced the log via the sync transaction, right? Or do you mean something else by "change to flush the log?" I thought the main functional change of this patch was to hold the superblock buffer locked across the force so nothing else can relog the new geometry superblock buffer in the same cil checkpoint. Presumably, the theory is that prevents recovery from seeing updates to different buffers that depend on the geometry update before the actual sb geometry update is recovered (because the latter might have been relogged). Maybe we're saying the same thing..? Or maybe I just misunderstand. Either way I think patch could use a more detailed commit log... > > Either way, > > I see aborts and splats all over the place, which implies at minimum > > this isn't the only issue here. > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > a slight tick up in failure rates -- random blocks that have clearly had > garbage written to them such that recovery tries to read the block to > recover a buffer log item and kaboom. At this point it's unclear if > that's a problem with xfs or somewhere else. :( > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > particular patch has no straightforward way to test that it fixes > > something and at the same time doesn't break anything else, and 3. 
we do > > I'm curious, what might break? Was that merely a general comment, or do > you have something specific in mind? (iows: do you see more string to > pull? :)) > Just a general comment.. Something related that isn't totally clear to me is what about the inverse shrink situation where dblocks is reduced. I.e., is there some similar scenario where for example instead of the sb buffer being relogged past some other buffer update that depends on it, some other change is relogged past a sb update that invalidates/removes blocks referenced by the relogged buffer..? If so, does that imply a shrink should flush the log before the shrink transaction commits to ensure it lands in a new checkpoint (as opposed to ensuring followon updates land in a new checkpoint)..? Anyways, my point is just that if it were me I wouldn't get too deep into this until some of the reproducible growfs recovery issues are at least characterized and testing is more sorted out. The context for testing is here [1]. The TLDR is basically that Christoph has a targeted test that reproduces the initial mount failure and I hacked up a more general test that also reproduces it and additional growfs recovery problems. This test does seem to confirm that the previous patches address the mount failure issue, but this patch doesn't seem to prevent any of the other problems produced by the generic test. That might just mean the test doesn't reproduce what this fixes, but it's kind of hard to at least regression test something like this when basic growfs crash-recovery seems pretty much broken. Brian [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > have at least one fairly straightforward growfs/recovery test in the > > works that reliably explodes, personally I'd suggest to split this work > > off into separate series. > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > fully cleaned up, and then leave the next two as part of a followon > > series pending further investigation into these other issues. As part of > > that I'd like to know whether the recovery test reproduces (or can be > > made to reproduce) the issue this patch presumably fixes, but I'd also > > settle for "the grow recovery test now passes reliably and this doesn't > > regress it." But once again, just my .02. > > Yeah, it's too bad there's no good way to test recovery with older > kernels either. :( > > --D > > > Brian > > > > >
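For readers following the thread, here is a minimal sketch of the locking pattern Brian describes above: commit the geometry change synchronously while keeping the superblock buffer locked, so nothing else can relog the superblock into the same CIL checkpoint before the change is stable. The function name and elided details are illustrative assumptions, not the code from Christoph's patch.

static int
xfs_growfs_commit_geometry_sketch(
	struct xfs_trans	*tp)
{
	struct xfs_buf		*bp;
	int			error;

	bp = xfs_trans_getsb(tp);	/* takes the superblock buffer lock */
	xfs_trans_bhold(tp, bp);	/* keep it held and locked past commit */

	/* ... apply the new geometry to the superblock here ... */

	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
	xfs_trans_set_sync(tp);		/* commit waits for the CIL to flush */
	error = xfs_trans_commit(tp);

	/* only now can anyone else lock, see and relog the superblock */
	xfs_buf_relse(bp);
	return error;
}

The point is that other transactions cannot take the superblock buffer lock, and therefore cannot relog the superblock, until the synchronous commit has pushed the geometry change to stable storage.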
On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > checkpoint as alloc related transactions that might depend on the > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > idea of what this is doing, but the description is kind of vague. > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > changes how that issue manifests. > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > situation between an allocation request and a growfs transaction, where > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > not sure I'd go straight to this change given the situation... > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > with my test case, so I'm not actually sure how to get more updates > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > that. > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > patches or so address the mount failure problem, but from there I'm not > > > reproducing much difference with or without the final patch. > > > > Does this change to flush the log after committing the new sb fix the > > recovery problems on older kernels? I /think/ that's the point of this > > patch. > > > > I don't follow.. growfs always forced the log via the sync transaction, > right? Or do you mean something else by "change to flush the log?" I guess I was typing a bit too fast this morning -- "change to flush the log to disk before anyone else can get their hands on the superblock". You're right that xfs_log_sb and data-device growfs already do that. That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug that this patch fixes, right? > I thought the main functional change of this patch was to hold the > superblock buffer locked across the force so nothing else can relog the > new geometry superblock buffer in the same cil checkpoint. Presumably, > the theory is that prevents recovery from seeing updates to different > buffers that depend on the geometry update before the actual sb geometry > update is recovered (because the latter might have been relogged). > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > Either way I think patch could use a more detailed commit log... <nod> The commit message should point out that we're fixing a real bug here, which is that growfsrt doesn't force the log to disk when it commits the new rt geometry. 
> > > Either way, > > > I see aborts and splats all over the place, which implies at minimum > > > this isn't the only issue here. > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > a slight tick up in failure rates -- random blocks that have clearly had > > garbage written to them such that recovery tries to read the block to > > recover a buffer log item and kaboom. At this point it's unclear if > > that's a problem with xfs or somewhere else. :( > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > particular patch has no straightforward way to test that it fixes > > > something and at the same time doesn't break anything else, and 3. we do > > > > I'm curious, what might break? Was that merely a general comment, or do > > you have something specific in mind? (iows: do you see more string to > > pull? :)) > > > > Just a general comment.. > > Something related that isn't totally clear to me is what about the > inverse shrink situation where dblocks is reduced. I.e., is there some > similar scenario where for example instead of the sb buffer being > relogged past some other buffer update that depends on it, some other > change is relogged past a sb update that invalidates/removes blocks > referenced by the relogged buffer..? If so, does that imply a shrink > should flush the log before the shrink transaction commits to ensure it > lands in a new checkpoint (as opposed to ensuring followon updates land > in a new checkpoint)..? I think so. Might we want to do that before and after to be careful? > Anyways, my point is just that if it were me I wouldn't get too deep > into this until some of the reproducible growfs recovery issues are at > least characterized and testing is more sorted out. > > The context for testing is here [1]. The TLDR is basically that > Christoph has a targeted test that reproduces the initial mount failure > and I hacked up a more general test that also reproduces it and > additional growfs recovery problems. This test does seem to confirm that > the previous patches address the mount failure issue, but this patch > doesn't seem to prevent any of the other problems produced by the > generic test. That might just mean the test doesn't reproduce what this > fixes, but it's kind of hard to at least regression test something like > this when basic growfs crash-recovery seems pretty much broken. Hmm, if you make a variant of that test which formats with an rt device and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see similar blowups? Let's see what happens if I do that... --D > Brian > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > have at least one fairly straightforward growfs/recovery test in the > > > works that reliably explodes, personally I'd suggest to split this work > > > off into separate series. > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > fully cleaned up, and then leave the next two as part of a followon > > > series pending further investigation into these other issues. As part of > > > that I'd like to know whether the recovery test reproduces (or can be > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > settle for "the grow recovery test now passes reliably and this doesn't > > > regress it." But once again, just my .02. > > > > Yeah, it's too bad there's no good way to test recovery with older > > kernels either. :( > > > > --D > > > > > Brian > > > > > > > > > >
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > changes how that issue manifests. > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > that. > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > patches or so address the mount failure problem, but from there I'm not > > > > reproducing much difference with or without the final patch. > > > > > > Does this change to flush the log after committing the new sb fix the > > > recovery problems on older kernels? I /think/ that's the point of this > > > patch. > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > right? Or do you mean something else by "change to flush the log?" > > I guess I was typing a bit too fast this morning -- "change to flush the > log to disk before anyone else can get their hands on the superblock". > You're right that xfs_log_sb and data-device growfs already do that. > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > that this patch fixes, right? > > > I thought the main functional change of this patch was to hold the > > superblock buffer locked across the force so nothing else can relog the > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > the theory is that prevents recovery from seeing updates to different > > buffers that depend on the geometry update before the actual sb geometry > > update is recovered (because the latter might have been relogged). > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > Either way I think patch could use a more detailed commit log... 
> > <nod> The commit message should point out that we're fixing a real bug > here, which is that growfsrt doesn't force the log to disk when it > commits the new rt geometry. > > > > > Either way, > > > > I see aborts and splats all over the place, which implies at minimum > > > > this isn't the only issue here. > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > a slight tick up in failure rates -- random blocks that have clearly had > > > garbage written to them such that recovery tries to read the block to > > > recover a buffer log item and kaboom. At this point it's unclear if > > > that's a problem with xfs or somewhere else. :( > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > particular patch has no straightforward way to test that it fixes > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > you have something specific in mind? (iows: do you see more string to > > > pull? :)) > > > > > > > Just a general comment.. > > > > Something related that isn't totally clear to me is what about the > > inverse shrink situation where dblocks is reduced. I.e., is there some > > similar scenario where for example instead of the sb buffer being > > relogged past some other buffer update that depends on it, some other > > change is relogged past a sb update that invalidates/removes blocks > > referenced by the relogged buffer..? If so, does that imply a shrink > > should flush the log before the shrink transaction commits to ensure it > > lands in a new checkpoint (as opposed to ensuring followon updates land > > in a new checkpoint)..? > > I think so. Might we want to do that before and after to be careful? > > > Anyways, my point is just that if it were me I wouldn't get too deep > > into this until some of the reproducible growfs recovery issues are at > > least characterized and testing is more sorted out. > > > > The context for testing is here [1]. The TLDR is basically that > > Christoph has a targeted test that reproduces the initial mount failure > > and I hacked up a more general test that also reproduces it and > > additional growfs recovery problems. This test does seem to confirm that > > the previous patches address the mount failure issue, but this patch > > doesn't seem to prevent any of the other problems produced by the > > generic test. That might just mean the test doesn't reproduce what this > > fixes, but it's kind of hard to at least regression test something like > > this when basic growfs crash-recovery seems pretty much broken. > > Hmm, if you make a variant of that test which formats with an rt device > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > similar blowups? Let's see what happens if I do that... Ahahaha awesome it corrupts the filesystem: _check_xfs_filesystem: filesystem on /dev/sdf is inconsistent (r) *** xfs_repair -n output *** Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... - scan filesystem freespace and inode maps... - found root inode chunk Phase 3 - for each AG... - scan (but don't clear) agi unlinked lists... - process known inodes and perform inode discovery... - agno = 0 - agno = 1 - agno = 2 - agno = 3 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... 
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - generate realtime summary info and bitmap...
sb_frextents 389, counted 10329
discrepancy in summary (0) at dblock 0x0 words 0x3f-0x3f/0x400
discrepancy in summary (0) at dblock 0x0 words 0x44-0x44/0x400
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...

--D

--- /dev/null
+++ b/tests/xfs/610
@@ -0,0 +1,102 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2000-2004 Silicon Graphics, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 610
+#
+# XFS online growfs-while-allocating tests (rt subvol variant)
+#
+. ./common/preamble
+_begin_fstest growfs ioctl prealloc auto stress
+
+# Import common functions.
+. ./common/filter
+
+_create_scratch()
+{
+	_scratch_mkfs_xfs "$@" >> $seqres.full
+
+	if ! _try_scratch_mount 2>/dev/null
+	then
+		echo "failed to mount $SCRATCH_DEV"
+		exit 1
+	fi
+
+	_xfs_force_bdev realtime $SCRATCH_MNT &> /dev/null
+
+	# fix the reserve block pool to a known size so that the enospc
+	# calculations work out correctly.
+	_scratch_resvblks 1024 > /dev/null 2>&1
+}
+
+_fill_scratch()
+{
+	$XFS_IO_PROG -f -c "resvsp 0 ${1}" $SCRATCH_MNT/resvfile
+}
+
+_stress_scratch()
+{
+	procs=3
+	nops=1000
+	# -w ensures that the only ops are ones which cause write I/O
+	FSSTRESS_ARGS=`_scale_fsstress_args -d $SCRATCH_MNT -w -p $procs \
+	    -n $nops $FSSTRESS_AVOID`
+	$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full 2>&1 &
+}
+
+_require_realtime
+_require_scratch
+_require_xfs_io_command "falloc"
+
+_scratch_mkfs_xfs | tee -a $seqres.full | _filter_mkfs 2>$tmp.mkfs
+. $tmp.mkfs	# extract blocksize and data size for scratch device
+
+endsize=`expr 550 \* 1048576`	# stop after growing this big
+incsize=`expr 42 \* 1048576`	# grow in chunks of this size
+modsize=`expr 4 \* $incsize`	# pause after this many increments
+
+[ `expr $endsize / $dbsize` -lt $dblocks ] || _notrun "Scratch device too small"
+
+size=`expr 125 \* 1048576`	# 120 megabytes initially
+sizeb=`expr $size / $dbsize`	# in data blocks
+logblks=$(_scratch_find_xfs_min_logblocks -rsize=${size})
+_create_scratch -lsize=${logblks}b -rsize=${size}
+
+for i in `seq 125 -1 90`; do
+	fillsize=`expr $i \* 1048576`
+	out="$(_fill_scratch $fillsize 2>&1)"
+	echo "$out" | grep -q 'No space left on device' && continue
+	test -n "${out}" && echo "$out"
+	break
+done
+
+#
+# Grow the filesystem while actively stressing it...
+# Kick off more stress threads on each iteration, grow; repeat.
+#
+while [ $size -le $endsize ]; do
+	echo "*** stressing a ${sizeb} block filesystem" >> $seqres.full
+	_stress_scratch
+	size=`expr $size + $incsize`
+	sizeb=`expr $size / $dbsize`	# in data blocks
+	echo "*** growing to a ${sizeb} block filesystem" >> $seqres.full
+	xfs_growfs -R ${sizeb} $SCRATCH_MNT >> $seqres.full
+	echo AGCOUNT=$agcount >> $seqres.full
+	echo >> $seqres.full
+
+	sleep $((RANDOM % 3))
+	_scratch_shutdown
+	ps -e | grep fsstress > /dev/null 2>&1
+	while [ $? -eq 0 ]; do
+		killall -9 fsstress > /dev/null 2>&1
+		wait > /dev/null 2>&1
+		ps -e | grep fsstress > /dev/null 2>&1
+	done
+	_scratch_cycle_mount || _fail "cycle mount failed"
+done > /dev/null 2>&1
+wait	# stop for any remaining stress processes
+
+_scratch_unmount
+
+status=0
+exit
--- /dev/null
+++ b/tests/xfs/610.out
@@ -0,0 +1,7 @@
+QA output created by 610
+meta-data=DDEV isize=XXX agcount=N, agsize=XXX blks
+data     = bsize=XXX blocks=XXX, imaxpct=PCT
+         = sunit=XXX swidth=XXX, unwritten=X
+naming   =VERN bsize=XXX
+log      =LDEV bsize=XXX blocks=XXX
+realtime =RDEV extsz=XXX blocks=XXX, rtextents=XXX
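As context for the growfsrt point above (the rt grow path not forcing the log when it commits the new geometry), here is a minimal sketch of the shape of fix being discussed. The helper name is hypothetical and the rt geometry updates are elided; the xfs_trans_set_sync() call is the point, mirroring what data-device growfs already does.

static int
xfs_growfs_rt_commit_sketch(
	struct xfs_trans	*tp)
{
	/* ... sb_rblocks, sb_rextents, etc. updated and logged above ... */

	xfs_trans_set_sync(tp);		/* make the commit flush the log */
	return xfs_trans_commit(tp);
}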
On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote:
> Ahahaha awesome it corrupts the filesystem:
Is this with an rtgroup file system? I can't get your test to fail
with the latest xfs staging tree.
On Mon, Oct 14, 2024 at 07:58:50AM +0200, Christoph Hellwig wrote: > On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote: > > Ahahaha awesome it corrupts the filesystem: > > Is this with a rtgroup file system? I can't get your test to fail > with the latest xfs staging tree. aha, yes, it's with rtgroups. --D
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > changes how that issue manifests. > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > that. > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > patches or so address the mount failure problem, but from there I'm not > > > > reproducing much difference with or without the final patch. > > > > > > Does this change to flush the log after committing the new sb fix the > > > recovery problems on older kernels? I /think/ that's the point of this > > > patch. > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > right? Or do you mean something else by "change to flush the log?" > > I guess I was typing a bit too fast this morning -- "change to flush the > log to disk before anyone else can get their hands on the superblock". > You're right that xfs_log_sb and data-device growfs already do that. > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > that this patch fixes, right? > Ah, Ok.. that makes sense. Sounds like it could be.. > > I thought the main functional change of this patch was to hold the > > superblock buffer locked across the force so nothing else can relog the > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > the theory is that prevents recovery from seeing updates to different > > buffers that depend on the geometry update before the actual sb geometry > > update is recovered (because the latter might have been relogged). > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. 
> > Either way I think patch could use a more detailed commit log... > > <nod> The commit message should point out that we're fixing a real bug > here, which is that growfsrt doesn't force the log to disk when it > commits the new rt geometry. > Maybe even make it a separate patch to pull apart some of these cleanups from fixes. I was also wondering if the whole locking change is the moral equivalent of locking the sb across the growfs trans (i.e. trans_getsb() + trans_bhold()), at which point maybe that would be a reasonable incremental patch too. > > > > Either way, > > > > I see aborts and splats all over the place, which implies at minimum > > > > this isn't the only issue here. > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > a slight tick up in failure rates -- random blocks that have clearly had > > > garbage written to them such that recovery tries to read the block to > > > recover a buffer log item and kaboom. At this point it's unclear if > > > that's a problem with xfs or somewhere else. :( > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > particular patch has no straightforward way to test that it fixes > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > you have something specific in mind? (iows: do you see more string to > > > pull? :)) > > > > > > > Just a general comment.. > > > > Something related that isn't totally clear to me is what about the > > inverse shrink situation where dblocks is reduced. I.e., is there some > > similar scenario where for example instead of the sb buffer being > > relogged past some other buffer update that depends on it, some other > > change is relogged past a sb update that invalidates/removes blocks > > referenced by the relogged buffer..? If so, does that imply a shrink > > should flush the log before the shrink transaction commits to ensure it > > lands in a new checkpoint (as opposed to ensuring followon updates land > > in a new checkpoint)..? > > I think so. Might we want to do that before and after to be careful? > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we had a log preflush to flush out already committed changes before the grow, I don't think anything really prevents another "problematic" transaction from committing after that preflush. I dunno.. on one hand it does seem like an unlikely thing due to the nature of needing space to be free in order to shrink in the first place, but OTOH if you have something like grow that is rare, not performance sensitive, has a history of not being well tested, and has these subtle ordering requirements that might change indirectly to other transactions, ISTM it could be a wise engineering decision to simplify to the degree possible and find the most basic model that enforces predictable ordering. So for a hacky thought/example, suppose we defined a transaction mode that basically implemented an exclusive checkpoint requirement (i.e. this transaction owns an entire checkpoint, nothing else is allowed in the CIL concurrently). Presumably that would ensure everything before the grow would flush out to disk in one checkpoint, everything concurrent would block on synchronous commit of the grow trans (before new geometry is exposed), and then after that point everything pending would drain into another checkpoint. 
It kind of sounds like overkill, but really if it could be implemented simply enough then we wouldn't have to think too hard about auditing all other relog scenarios. I'd probably want to see at least some reproducer for this sort of problem to prove the theory though too, even if it required debug instrumentation or something. Hm? > > Anyways, my point is just that if it were me I wouldn't get too deep > > into this until some of the reproducible growfs recovery issues are at > > least characterized and testing is more sorted out. > > > > The context for testing is here [1]. The TLDR is basically that > > Christoph has a targeted test that reproduces the initial mount failure > > and I hacked up a more general test that also reproduces it and > > additional growfs recovery problems. This test does seem to confirm that > > the previous patches address the mount failure issue, but this patch > > doesn't seem to prevent any of the other problems produced by the > > generic test. That might just mean the test doesn't reproduce what this > > fixes, but it's kind of hard to at least regression test something like > > this when basic growfs crash-recovery seems pretty much broken. > > Hmm, if you make a variant of that test which formats with an rt device > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > similar blowups? Let's see what happens if I do that... > Heh, sounds like so from your followup. Fun times. I guess that test should probably work its way upstream. I made some tweaks locally since last posted to try and make it a little more aggressive, but it didn't repro anything new so not sure how much difference it makes really. Do we want a separate version like yours for the rt case or would you expect to cover both cases in a single test? Brian > --D > > > Brian > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > works that reliably explodes, personally I'd suggest to split this work > > > > off into separate series. > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > fully cleaned up, and then leave the next two as part of a followon > > > > series pending further investigation into these other issues. As part of > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > regress it." But once again, just my .02. > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > kernels either. :( > > > > > > --D > > > > > > > Brian > > > > > > > > > > > > > > > >
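A rough sketch of the shrink ordering idea raised above (hypothetical, not part of this series): flush previously committed changes into their own checkpoint before the shrink transaction commits, and make the shrink commit itself synchronous. The function name and the elided block-removal steps are assumptions.

static int
xfs_shrink_ordered_commit_sketch(
	struct xfs_mount	*mp,
	struct xfs_trans	*tp)
{
	/*
	 * Push everything already committed into its own checkpoint so
	 * nothing that references the soon-to-be-removed blocks can share
	 * a checkpoint with (or be reordered after) the geometry change.
	 */
	xfs_log_force(mp, XFS_LOG_SYNC);

	/* ... remove the tail blocks and update sb_dblocks in @tp ... */

	xfs_trans_set_sync(tp);		/* and make the shrink itself sync */
	return xfs_trans_commit(tp);
}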
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > > changes how that issue manifests. > > > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > > that. > > > > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > > patches or so address the mount failure problem, but from there I'm not > > > > > reproducing much difference with or without the final patch. > > > > > > > > Does this change to flush the log after committing the new sb fix the > > > > recovery problems on older kernels? I /think/ that's the point of this > > > > patch. > > > > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > > right? Or do you mean something else by "change to flush the log?" > > > > I guess I was typing a bit too fast this morning -- "change to flush the > > log to disk before anyone else can get their hands on the superblock". > > You're right that xfs_log_sb and data-device growfs already do that. > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > > that this patch fixes, right? > > > > Ah, Ok.. that makes sense. Sounds like it could be.. Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to set xfs_trans_set_sync in growfsrt before this one that refactors the existing growfs/sb updates? > > > I thought the main functional change of this patch was to hold the > > > superblock buffer locked across the force so nothing else can relog the > > > new geometry superblock buffer in the same cil checkpoint. 
Presumably, > > > the theory is that prevents recovery from seeing updates to different > > > buffers that depend on the geometry update before the actual sb geometry > > > update is recovered (because the latter might have been relogged). > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > > Either way I think patch could use a more detailed commit log... > > > > <nod> The commit message should point out that we're fixing a real bug > > here, which is that growfsrt doesn't force the log to disk when it > > commits the new rt geometry. > > > > Maybe even make it a separate patch to pull apart some of these cleanups > from fixes. I was also wondering if the whole locking change is the > moral equivalent of locking the sb across the growfs trans (i.e. > trans_getsb() + trans_bhold()), at which point maybe that would be a > reasonable incremental patch too. > > > > > > Either way, > > > > > I see aborts and splats all over the place, which implies at minimum > > > > > this isn't the only issue here. > > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > > a slight tick up in failure rates -- random blocks that have clearly had > > > > garbage written to them such that recovery tries to read the block to > > > > recover a buffer log item and kaboom. At this point it's unclear if > > > > that's a problem with xfs or somewhere else. :( > > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > > particular patch has no straightforward way to test that it fixes > > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > > you have something specific in mind? (iows: do you see more string to > > > > pull? :)) > > > > > > > > > > Just a general comment.. > > > > > > Something related that isn't totally clear to me is what about the > > > inverse shrink situation where dblocks is reduced. I.e., is there some > > > similar scenario where for example instead of the sb buffer being > > > relogged past some other buffer update that depends on it, some other > > > change is relogged past a sb update that invalidates/removes blocks > > > referenced by the relogged buffer..? If so, does that imply a shrink > > > should flush the log before the shrink transaction commits to ensure it > > > lands in a new checkpoint (as opposed to ensuring followon updates land > > > in a new checkpoint)..? > > > > I think so. Might we want to do that before and after to be careful? > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we > had a log preflush to flush out already committed changes before the > grow, I don't think anything really prevents another "problematic" > transaction from committing after that preflush. Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't you? > I dunno.. on one hand it does seem like an unlikely thing due to the > nature of needing space to be free in order to shrink in the first > place, but OTOH if you have something like grow that is rare, not > performance sensitive, has a history of not being well tested, and has > these subtle ordering requirements that might change indirectly to other > transactions, ISTM it could be a wise engineering decision to simplify > to the degree possible and find the most basic model that enforces > predictable ordering. 
> > So for a hacky thought/example, suppose we defined a transaction mode > that basically implemented an exclusive checkpoint requirement (i.e. > this transaction owns an entire checkpoint, nothing else is allowed in > the CIL concurrently). Presumably that would ensure everything before > the grow would flush out to disk in one checkpoint, everything > concurrent would block on synchronous commit of the grow trans (before > new geometry is exposed), and then after that point everything pending > would drain into another checkpoint. > > It kind of sounds like overkill, but really if it could be implemented > simply enough then we wouldn't have to think too hard about auditing all > other relog scenarios. I'd probably want to see at least some reproducer > for this sort of problem to prove the theory though too, even if it > required debug instrumentation or something. Hm? What if we redefined the input requirements to shrink? Lets say we require that the fd argument to a shrink ioctl is actually an unlinkable O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can force the log without holding any locks, and the shrink transaction can remove the bmap and rmap records at the same time that it updates the sb geometry. The otherwise inaccessible file means that nobody can reuse that space between the log force and the sb update. > > > Anyways, my point is just that if it were me I wouldn't get too deep > > > into this until some of the reproducible growfs recovery issues are at > > > least characterized and testing is more sorted out. > > > > > > The context for testing is here [1]. The TLDR is basically that > > > Christoph has a targeted test that reproduces the initial mount failure > > > and I hacked up a more general test that also reproduces it and > > > additional growfs recovery problems. This test does seem to confirm that > > > the previous patches address the mount failure issue, but this patch > > > doesn't seem to prevent any of the other problems produced by the > > > generic test. That might just mean the test doesn't reproduce what this > > > fixes, but it's kind of hard to at least regression test something like > > > this when basic growfs crash-recovery seems pretty much broken. > > > > Hmm, if you make a variant of that test which formats with an rt device > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > > similar blowups? Let's see what happens if I do that... > > > > Heh, sounds like so from your followup. Fun times. > > I guess that test should probably work its way upstream. I made some > tweaks locally since last posted to try and make it a little more > aggressive, but it didn't repro anything new so not sure how much > difference it makes really. Do we want a separate version like yours for > the rt case or would you expect to cover both cases in a single test? This probably should be different tests, because rt is its own very weird animal. --D > Brian > > > --D > > > > > Brian > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > > works that reliably explodes, personally I'd suggest to split this work > > > > > off into separate series. > > > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > > fully cleaned up, and then leave the next two as part of a followon > > > > > series pending further investigation into these other issues. 
As part of > > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > > regress it." But once again, just my .02. > > > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > > kernels either. :( > > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > >
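For illustration, a rough user-space sketch of the O_TMPFILE input model proposed above. XFS_IOC_FSGROWFSDATA and struct xfs_growfs_data are the existing grow/shrink interface; issuing the ioctl against the tmpfile, the hardcoded imaxpct, and the assumption that the preallocation actually covers the end-of-fs blocks are all hypothetical.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

static int
shrink_via_tmpfile(const char *mntpt, unsigned long long newblocks,
		   off_t eofs_off, off_t len)
{
	/* a real caller would carry over the current imaxpct (XFS_IOC_FSGEOMETRY) */
	struct xfs_growfs_data in = { .newblocks = newblocks, .imaxpct = 25 };
	int fd;

	/* unlinked tmpfile on the filesystem being shrunk */
	fd = open(mntpt, O_TMPFILE | O_RDWR, 0600);
	if (fd < 0)
		return -1;

	/* pin the range being removed so nobody can reallocate it */
	if (fallocate(fd, 0, eofs_off, len) < 0)
		return -1;

	/* hypothetical: the shrink ioctl issued against the tmpfile itself */
	return ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
}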
On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote: > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > > > changes how that issue manifests. > > > > > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > > > that. > > > > > > > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > > > patches or so address the mount failure problem, but from there I'm not > > > > > > reproducing much difference with or without the final patch. > > > > > > > > > > Does this change to flush the log after committing the new sb fix the > > > > > recovery problems on older kernels? I /think/ that's the point of this > > > > > patch. > > > > > > > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > > > right? Or do you mean something else by "change to flush the log?" > > > > > > I guess I was typing a bit too fast this morning -- "change to flush the > > > log to disk before anyone else can get their hands on the superblock". > > > You're right that xfs_log_sb and data-device growfs already do that. > > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > > > that this patch fixes, right? > > > > > > > Ah, Ok.. that makes sense. Sounds like it could be.. > > Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to > set xfs_trans_set_sync in growfsrt before this one that refactors the > existing growfs/sb updates? 
> > > > > I thought the main functional change of this patch was to hold the > > > > superblock buffer locked across the force so nothing else can relog the > > > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > > > the theory is that prevents recovery from seeing updates to different > > > > buffers that depend on the geometry update before the actual sb geometry > > > > update is recovered (because the latter might have been relogged). > > > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > > > Either way I think patch could use a more detailed commit log... > > > > > > <nod> The commit message should point out that we're fixing a real bug > > > here, which is that growfsrt doesn't force the log to disk when it > > > commits the new rt geometry. > > > > > > > Maybe even make it a separate patch to pull apart some of these cleanups > > from fixes. I was also wondering if the whole locking change is the > > moral equivalent of locking the sb across the growfs trans (i.e. > > trans_getsb() + trans_bhold()), at which point maybe that would be a > > reasonable incremental patch too. > > > > > > > > Either way, > > > > > > I see aborts and splats all over the place, which implies at minimum > > > > > > this isn't the only issue here. > > > > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > > > a slight tick up in failure rates -- random blocks that have clearly had > > > > > garbage written to them such that recovery tries to read the block to > > > > > recover a buffer log item and kaboom. At this point it's unclear if > > > > > that's a problem with xfs or somewhere else. :( > > > > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > > > particular patch has no straightforward way to test that it fixes > > > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > > > you have something specific in mind? (iows: do you see more string to > > > > > pull? :)) > > > > > > > > > > > > > Just a general comment.. > > > > > > > > Something related that isn't totally clear to me is what about the > > > > inverse shrink situation where dblocks is reduced. I.e., is there some > > > > similar scenario where for example instead of the sb buffer being > > > > relogged past some other buffer update that depends on it, some other > > > > change is relogged past a sb update that invalidates/removes blocks > > > > referenced by the relogged buffer..? If so, does that imply a shrink > > > > should flush the log before the shrink transaction commits to ensure it > > > > lands in a new checkpoint (as opposed to ensuring followon updates land > > > > in a new checkpoint)..? > > > > > > I think so. Might we want to do that before and after to be careful? > > > > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we > > had a log preflush to flush out already committed changes before the > > grow, I don't think anything really prevents another "problematic" > > transaction from committing after that preflush. > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't > you? > I guess it depends on how far into the weeds we want to get. I'm not necessarily sure than anything exists today that is definitely problematic wrt shrink. 
That would probably warrant an audit of transactions or some other high level analysis to disprove. More thought needed. Short of the latter, I'm more thinking about the question "is there some new thing we could add years down the line that 1. adds something to the log that could conflict and 2. could be reordered past a shrink transaction in a problematic way?" If the answer to that is open ended and some such thing does come along, I think it's highly likely this would just break growfs logging again until somebody trips over it in the field. > > I dunno.. on one hand it does seem like an unlikely thing due to the > > nature of needing space to be free in order to shrink in the first > > place, but OTOH if you have something like grow that is rare, not > > performance sensitive, has a history of not being well tested, and has > > these subtle ordering requirements that might change indirectly to other > > transactions, ISTM it could be a wise engineering decision to simplify > > to the degree possible and find the most basic model that enforces > > predictable ordering. > > > > So for a hacky thought/example, suppose we defined a transaction mode > > that basically implemented an exclusive checkpoint requirement (i.e. > > this transaction owns an entire checkpoint, nothing else is allowed in > > the CIL concurrently). Presumably that would ensure everything before > > the grow would flush out to disk in one checkpoint, everything > > concurrent would block on synchronous commit of the grow trans (before > > new geometry is exposed), and then after that point everything pending > > would drain into another checkpoint. > > > > It kind of sounds like overkill, but really if it could be implemented > > simply enough then we wouldn't have to think too hard about auditing all > > other relog scenarios. I'd probably want to see at least some reproducer > > for this sort of problem to prove the theory though too, even if it > > required debug instrumentation or something. Hm? > > What if we redefined the input requirements to shrink? Lets say we > require that the fd argument to a shrink ioctl is actually an unlinkable > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can > force the log without holding any locks, and the shrink transaction can > remove the bmap and rmap records at the same time that it updates the sb > geometry. The otherwise inaccessible file means that nobody can reuse > that space between the log force and the sb update. > Interesting thought. It kind of sounds like how shrink already works to some degree, right? I.e. the kernel side allocs the blocks out of the btrees and tosses them, just no inode in the mix? Honestly I'd probably need to stare at this code and think about it and work through some scenarios to quantify how much of a concern this really is, and I don't really have the bandwidth for that just now. I mainly wanted to raise the notion that if we're assessing high level log ordering requirements for growfs, we should consider the shrink case as well. > > > > Anyways, my point is just that if it were me I wouldn't get too deep > > > > into this until some of the reproducible growfs recovery issues are at > > > > least characterized and testing is more sorted out. > > > > > > > > The context for testing is here [1]. The TLDR is basically that > > > > Christoph has a targeted test that reproduces the initial mount failure > > > > and I hacked up a more general test that also reproduces it and > > > > additional growfs recovery problems. 
This test does seem to confirm that > > > > the previous patches address the mount failure issue, but this patch > > > > doesn't seem to prevent any of the other problems produced by the > > > > generic test. That might just mean the test doesn't reproduce what this > > > > fixes, but it's kind of hard to at least regression test something like > > > > this when basic growfs crash-recovery seems pretty much broken. > > > > > > Hmm, if you make a variant of that test which formats with an rt device > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > > > similar blowups? Let's see what happens if I do that... > > > > > > > Heh, sounds like so from your followup. Fun times. > > > > I guess that test should probably work its way upstream. I made some > > tweaks locally since last posted to try and make it a little more > > aggressive, but it didn't repro anything new so not sure how much > > difference it makes really. Do we want a separate version like yours for > > the rt case or would you expect to cover both cases in a single test? > > This probably should be different tests, because rt is its own very > weird animal. > Posted a couple tests the other day, JFYI. Brian > --D > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > > > works that reliably explodes, personally I'd suggest to split this work > > > > > > off into separate series. > > > > > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > > > fully cleaned up, and then leave the next two as part of a followon > > > > > > series pending further investigation into these other issues. As part of > > > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > > > regress it." But once again, just my .02. > > > > > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > > > kernels either. :( > > > > > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > So for a hacky thought/example, suppose we defined a transaction mode > that basically implemented an exclusive checkpoint requirement (i.e. > this transaction owns an entire checkpoint, nothing else is allowed in > the CIL concurrently). Transactions know nothing about the CIL, nor should they. The CIL also has no place in ordering transactions - it's purely an aggregation mechanism that flushes committed transactions to stable storage when it is told to. i.e. when a log force is issued. A globally serialised transaction requires ordering at the transaction allocation/reservation level, not at the CIL. i.e. it is essentially the same ordering problem as serialising against untracked DIO on the inode before we can run a truncate (lock, drain, do operation, unlock). > Presumably that would ensure everything before > the grow would flush out to disk in one checkpoint, everything > concurrent would block on synchronous commit of the grow trans (before > new geometry is exposed), and then after that point everything pending > would drain into another checkpoint. Yup, that's high level transaction level ordering and really has nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the only ordering it does is to preserve transaction commit ordering down to the journal. > It kind of sounds like overkill, but really if it could be implemented > simply enough then we wouldn't have to think too hard about auditing all > other relog scenarios. I'd probably want to see at least some reproducer > for this sort of problem to prove the theory though too, even if it > required debug instrumentation or something. Hm? It's relatively straight forward to do, but it seems like total overkill for growfs, as growfs only requires ordering between the change of size and new allocations. We can do that by not exposing the new space until after then superblock has been modifed on stable storage in the case of grow. In the case of shrink, globally serialising the growfs transaction for shrink won't actually do any thing useful because we have to deny access to the free space we are removing before we even start the shrink transaction. Hence we need allocation vs shrink co-ordination before we shrink the superblock space, not a globally serialised size modification transaction... -Dave.
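To illustrate the transaction-level "lock, drain, do operation, unlock" model described here, as opposed to ordering in the CIL, a purely hypothetical sketch: nothing like a growfs_barrier exists in XFS today, and the structure and function names are made up only to show where such serialisation would live.

#include <linux/rwsem.h>

struct example_mount {
	struct rw_semaphore	growfs_barrier;	/* hypothetical, init_rwsem()ed at mount */
};

/* every normal transaction reservation takes the barrier shared ... */
static void example_trans_reserve(struct example_mount *mp)
{
	down_read(&mp->growfs_barrier);
	/* ... reserve log space and proceed as usual ... */
}

/* ... and drops it at commit */
static void example_trans_commit(struct example_mount *mp)
{
	/* ... commit to the CIL as usual ... */
	up_read(&mp->growfs_barrier);
}

/* a globally serialised operation: lock, drain, do operation, unlock */
static void example_global_geometry_change(struct example_mount *mp)
{
	down_write(&mp->growfs_barrier);	/* blocks new reservations */
	/* all in-flight transactions have drained once we get here */

	/* ... change the geometry and force the log to stable storage ... */

	up_write(&mp->growfs_barrier);
}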
On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote: > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote: > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > > > > changes how that issue manifests. > > > > > > > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > > > > patches or so address the mount failure problem, but from there I'm not > > > > > > > reproducing much difference with or without the final patch. > > > > > > > > > > > > Does this change to flush the log after committing the new sb fix the > > > > > > recovery problems on older kernels? I /think/ that's the point of this > > > > > > patch. > > > > > > > > > > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > > > > right? Or do you mean something else by "change to flush the log?" > > > > > > > > I guess I was typing a bit too fast this morning -- "change to flush the > > > > log to disk before anyone else can get their hands on the superblock". > > > > You're right that xfs_log_sb and data-device growfs already do that. > > > > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > > > > that this patch fixes, right? > > > > > > > > > > Ah, Ok.. that makes sense. Sounds like it could be.. > > > > Yeah. 
Hey Christoph, would you mind pre-pending a minimal fixpatch to > > set xfs_trans_set_sync in growfsrt before this one that refactors the > > existing growfs/sb updates? > > > > > > > I thought the main functional change of this patch was to hold the > > > > > superblock buffer locked across the force so nothing else can relog the > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > > > > the theory is that prevents recovery from seeing updates to different > > > > > buffers that depend on the geometry update before the actual sb geometry > > > > > update is recovered (because the latter might have been relogged). > > > > > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > > > > Either way I think patch could use a more detailed commit log... > > > > > > > > <nod> The commit message should point out that we're fixing a real bug > > > > here, which is that growfsrt doesn't force the log to disk when it > > > > commits the new rt geometry. > > > > > > > > > > Maybe even make it a separate patch to pull apart some of these cleanups > > > from fixes. I was also wondering if the whole locking change is the > > > moral equivalent of locking the sb across the growfs trans (i.e. > > > trans_getsb() + trans_bhold()), at which point maybe that would be a > > > reasonable incremental patch too. > > > > > > > > > > Either way, > > > > > > > I see aborts and splats all over the place, which implies at minimum > > > > > > > this isn't the only issue here. > > > > > > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > > > > a slight tick up in failure rates -- random blocks that have clearly had > > > > > > garbage written to them such that recovery tries to read the block to > > > > > > recover a buffer log item and kaboom. At this point it's unclear if > > > > > > that's a problem with xfs or somewhere else. :( > > > > > > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > > > > particular patch has no straightforward way to test that it fixes > > > > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > > > > you have something specific in mind? (iows: do you see more string to > > > > > > pull? :)) > > > > > > > > > > > > > > > > Just a general comment.. > > > > > > > > > > Something related that isn't totally clear to me is what about the > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some > > > > > similar scenario where for example instead of the sb buffer being > > > > > relogged past some other buffer update that depends on it, some other > > > > > change is relogged past a sb update that invalidates/removes blocks > > > > > referenced by the relogged buffer..? If so, does that imply a shrink > > > > > should flush the log before the shrink transaction commits to ensure it > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land > > > > > in a new checkpoint)..? > > > > > > > > I think so. Might we want to do that before and after to be careful? > > > > > > > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we > > > had a log preflush to flush out already committed changes before the > > > grow, I don't think anything really prevents another "problematic" > > > transaction from committing after that preflush. 
> > > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't > > you? > > > > I guess it depends on how far into the weeds we want to get. I'm not > necessarily sure than anything exists today that is definitely > problematic wrt shrink. That would probably warrant an audit of > transactions or some other high level analysis to disprove. More thought > needed. <nod> I think there isn't a problem with shrink because the shrink transaction itself must be able to find the space, which means that there cannot be any files or unfinished deferred ops pointing to that space. > Short of the latter, I'm more thinking about the question "is there some > new thing we could add years down the line that 1. adds something to the > log that could conflict and 2. could be reordered past a shrink > transaction in a problematic way?" If the answer to that is open ended > and some such thing does come along, I think it's highly likely this > would just break growfs logging again until somebody trips over it in > the field. Good thing we have a couple of tests now? :) > > > I dunno.. on one hand it does seem like an unlikely thing due to the > > > nature of needing space to be free in order to shrink in the first > > > place, but OTOH if you have something like grow that is rare, not > > > performance sensitive, has a history of not being well tested, and has > > > these subtle ordering requirements that might change indirectly to other > > > transactions, ISTM it could be a wise engineering decision to simplify > > > to the degree possible and find the most basic model that enforces > > > predictable ordering. > > > > > > So for a hacky thought/example, suppose we defined a transaction mode > > > that basically implemented an exclusive checkpoint requirement (i.e. > > > this transaction owns an entire checkpoint, nothing else is allowed in > > > the CIL concurrently). Presumably that would ensure everything before > > > the grow would flush out to disk in one checkpoint, everything > > > concurrent would block on synchronous commit of the grow trans (before > > > new geometry is exposed), and then after that point everything pending > > > would drain into another checkpoint. > > > > > > It kind of sounds like overkill, but really if it could be implemented > > > simply enough then we wouldn't have to think too hard about auditing all > > > other relog scenarios. I'd probably want to see at least some reproducer > > > for this sort of problem to prove the theory though too, even if it > > > required debug instrumentation or something. Hm? > > > > What if we redefined the input requirements to shrink? Lets say we > > require that the fd argument to a shrink ioctl is actually an unlinkable > > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can > > force the log without holding any locks, and the shrink transaction can > > remove the bmap and rmap records at the same time that it updates the sb > > geometry. The otherwise inaccessible file means that nobody can reuse > > that space between the log force and the sb update. > > > > Interesting thought. It kind of sounds like how shrink already works to > some degree, right? I.e. the kernel side allocs the blocks out of the > btrees and tosses them, just no inode in the mix? Right. > Honestly I'd probably need to stare at this code and think about it and > work through some scenarios to quantify how much of a concern this > really is, and I don't really have the bandwidth for that just now. 
I > mainly wanted to raise the notion that if we're assessing high level log > ordering requirements for growfs, we should consider the shrink case as > well. <nod> > > > > > Anyways, my point is just that if it were me I wouldn't get too deep > > > > > into this until some of the reproducible growfs recovery issues are at > > > > > least characterized and testing is more sorted out. > > > > > > > > > > The context for testing is here [1]. The TLDR is basically that > > > > > Christoph has a targeted test that reproduces the initial mount failure > > > > > and I hacked up a more general test that also reproduces it and > > > > > additional growfs recovery problems. This test does seem to confirm that > > > > > the previous patches address the mount failure issue, but this patch > > > > > doesn't seem to prevent any of the other problems produced by the > > > > > generic test. That might just mean the test doesn't reproduce what this > > > > > fixes, but it's kind of hard to at least regression test something like > > > > > this when basic growfs crash-recovery seems pretty much broken. > > > > > > > > Hmm, if you make a variant of that test which formats with an rt device > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > > > > similar blowups? Let's see what happens if I do that... > > > > > > > > > > Heh, sounds like so from your followup. Fun times. > > > > > > I guess that test should probably work its way upstream. I made some > > > tweaks locally since last posted to try and make it a little more > > > aggressive, but it didn't repro anything new so not sure how much > > > difference it makes really. Do we want a separate version like yours for > > > the rt case or would you expect to cover both cases in a single test? > > > > This probably should be different tests, because rt is its own very > > weird animal. > > > > Posted a couple tests the other day, JFYI. > > Brian > > > --D > > > > > Brian > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > > > > works that reliably explodes, personally I'd suggest to split this work > > > > > > > off into separate series. > > > > > > > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > > > > fully cleaned up, and then leave the next two as part of a followon > > > > > > > series pending further investigation into these other issues. As part of > > > > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > > > > regress it." But once again, just my .02. > > > > > > > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > > > > kernels either. :( > > > > > > > > > > > > --D > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote: > On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote: > > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote: > > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > > > > > changes how that issue manifests. > > > > > > > > > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > > > > > patches or so address the mount failure problem, but from there I'm not > > > > > > > > reproducing much difference with or without the final patch. > > > > > > > > > > > > > > Does this change to flush the log after committing the new sb fix the > > > > > > > recovery problems on older kernels? I /think/ that's the point of this > > > > > > > patch. > > > > > > > > > > > > > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > > > > > right? Or do you mean something else by "change to flush the log?" > > > > > > > > > > I guess I was typing a bit too fast this morning -- "change to flush the > > > > > log to disk before anyone else can get their hands on the superblock". > > > > > You're right that xfs_log_sb and data-device growfs already do that. 
> > > > > > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > > > > > that this patch fixes, right? > > > > > > > > > > > > > Ah, Ok.. that makes sense. Sounds like it could be.. > > > > > > Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to > > > set xfs_trans_set_sync in growfsrt before this one that refactors the > > > existing growfs/sb updates? > > > > > > > > > I thought the main functional change of this patch was to hold the > > > > > > superblock buffer locked across the force so nothing else can relog the > > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > > > > > the theory is that prevents recovery from seeing updates to different > > > > > > buffers that depend on the geometry update before the actual sb geometry > > > > > > update is recovered (because the latter might have been relogged). > > > > > > > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > > > > > Either way I think patch could use a more detailed commit log... > > > > > > > > > > <nod> The commit message should point out that we're fixing a real bug > > > > > here, which is that growfsrt doesn't force the log to disk when it > > > > > commits the new rt geometry. > > > > > > > > > > > > > Maybe even make it a separate patch to pull apart some of these cleanups > > > > from fixes. I was also wondering if the whole locking change is the > > > > moral equivalent of locking the sb across the growfs trans (i.e. > > > > trans_getsb() + trans_bhold()), at which point maybe that would be a > > > > reasonable incremental patch too. > > > > > > > > > > > > Either way, > > > > > > > > I see aborts and splats all over the place, which implies at minimum > > > > > > > > this isn't the only issue here. > > > > > > > > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > > > > > a slight tick up in failure rates -- random blocks that have clearly had > > > > > > > garbage written to them such that recovery tries to read the block to > > > > > > > recover a buffer log item and kaboom. At this point it's unclear if > > > > > > > that's a problem with xfs or somewhere else. :( > > > > > > > > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > > > > > particular patch has no straightforward way to test that it fixes > > > > > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > > > > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > > > > > you have something specific in mind? (iows: do you see more string to > > > > > > > pull? :)) > > > > > > > > > > > > > > > > > > > Just a general comment.. > > > > > > > > > > > > Something related that isn't totally clear to me is what about the > > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some > > > > > > similar scenario where for example instead of the sb buffer being > > > > > > relogged past some other buffer update that depends on it, some other > > > > > > change is relogged past a sb update that invalidates/removes blocks > > > > > > referenced by the relogged buffer..? If so, does that imply a shrink > > > > > > should flush the log before the shrink transaction commits to ensure it > > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land > > > > > > in a new checkpoint)..? > > > > > > > > > > I think so. Might we want to do that before and after to be careful? 
> > > > > > > > > > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we > > > > had a log preflush to flush out already committed changes before the > > > > grow, I don't think anything really prevents another "problematic" > > > > transaction from committing after that preflush. > > > > > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't > > > you? > > > > > > > I guess it depends on how far into the weeds we want to get. I'm not > > necessarily sure than anything exists today that is definitely > > problematic wrt shrink. That would probably warrant an audit of > > transactions or some other high level analysis to disprove. More thought > > needed. > > <nod> I think there isn't a problem with shrink because the shrink > transaction itself must be able to find the space, which means that > there cannot be any files or unfinished deferred ops pointing to that > space. > Ok, that makes sense to me. On poking through the code, one thing that it looks like it misses is the case where the allocation fails purely due to extents being busy (i.e. even if the blocks are free). That doesn't seem harmful, just perhaps a little odd that a shrink might fail and then succeed after some period of time for the log tail to push with no other visible changes. Unless I'm missing something, it might be worth adding the log force there (that I think you suggested earlier). All that said, I'm not so sure this will all apply the same if shrink grows to a more active implementation. For example, I suspect an implementation that learns to quiesce/truncate full AGs isn't going to necessarily need to allocate all of the blocks out of the AG free space trees. But of course, this is all vaporware anyways. :) > > Short of the latter, I'm more thinking about the question "is there some > > new thing we could add years down the line that 1. adds something to the > > log that could conflict and 2. could be reordered past a shrink > > transaction in a problematic way?" If the answer to that is open ended > > and some such thing does come along, I think it's highly likely this > > would just break growfs logging again until somebody trips over it in > > the field. > > Good thing we have a couple of tests now? :) > Not with good shrink support, unfortunately. :/ I tried adding basic shrink calls into the proposed tests, but I wasn't able to see them actually do anything, presumably because of the stress workload and smallish filesystem keeping those blocks in use. It might require a more active shrink implementation before we could support this kind of test, or otherwise maybe something more limited/targeted than fsstress. Hmm.. a random, off the top of my head idea might be a debug knob that artificially restricts all allocations beyond a certain disk offset... Brian > > > > I dunno.. on one hand it does seem like an unlikely thing due to the > > > > nature of needing space to be free in order to shrink in the first > > > > place, but OTOH if you have something like grow that is rare, not > > > > performance sensitive, has a history of not being well tested, and has > > > > these subtle ordering requirements that might change indirectly to other > > > > transactions, ISTM it could be a wise engineering decision to simplify > > > > to the degree possible and find the most basic model that enforces > > > > predictable ordering. 
> > > > > > > > So for a hacky thought/example, suppose we defined a transaction mode > > > > that basically implemented an exclusive checkpoint requirement (i.e. > > > > this transaction owns an entire checkpoint, nothing else is allowed in > > > > the CIL concurrently). Presumably that would ensure everything before > > > > the grow would flush out to disk in one checkpoint, everything > > > > concurrent would block on synchronous commit of the grow trans (before > > > > new geometry is exposed), and then after that point everything pending > > > > would drain into another checkpoint. > > > > > > > > It kind of sounds like overkill, but really if it could be implemented > > > > simply enough then we wouldn't have to think too hard about auditing all > > > > other relog scenarios. I'd probably want to see at least some reproducer > > > > for this sort of problem to prove the theory though too, even if it > > > > required debug instrumentation or something. Hm? > > > > > > What if we redefined the input requirements to shrink? Lets say we > > > require that the fd argument to a shrink ioctl is actually an unlinkable > > > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can > > > force the log without holding any locks, and the shrink transaction can > > > remove the bmap and rmap records at the same time that it updates the sb > > > geometry. The otherwise inaccessible file means that nobody can reuse > > > that space between the log force and the sb update. > > > > > > > Interesting thought. It kind of sounds like how shrink already works to > > some degree, right? I.e. the kernel side allocs the blocks out of the > > btrees and tosses them, just no inode in the mix? > > Right. > > > Honestly I'd probably need to stare at this code and think about it and > > work through some scenarios to quantify how much of a concern this > > really is, and I don't really have the bandwidth for that just now. I > > mainly wanted to raise the notion that if we're assessing high level log > > ordering requirements for growfs, we should consider the shrink case as > > well. > > <nod> > > > > > > > Anyways, my point is just that if it were me I wouldn't get too deep > > > > > > into this until some of the reproducible growfs recovery issues are at > > > > > > least characterized and testing is more sorted out. > > > > > > > > > > > > The context for testing is here [1]. The TLDR is basically that > > > > > > Christoph has a targeted test that reproduces the initial mount failure > > > > > > and I hacked up a more general test that also reproduces it and > > > > > > additional growfs recovery problems. This test does seem to confirm that > > > > > > the previous patches address the mount failure issue, but this patch > > > > > > doesn't seem to prevent any of the other problems produced by the > > > > > > generic test. That might just mean the test doesn't reproduce what this > > > > > > fixes, but it's kind of hard to at least regression test something like > > > > > > this when basic growfs crash-recovery seems pretty much broken. > > > > > > > > > > Hmm, if you make a variant of that test which formats with an rt device > > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > > > > > similar blowups? Let's see what happens if I do that... > > > > > > > > > > > > > Heh, sounds like so from your followup. Fun times. > > > > > > > > I guess that test should probably work its way upstream. 
I made some > > > > tweaks locally since last posted to try and make it a little more > > > > aggressive, but it didn't repro anything new so not sure how much > > > > difference it makes really. Do we want a separate version like yours for > > > > the rt case or would you expect to cover both cases in a single test? > > > > > > This probably should be different tests, because rt is its own very > > > weird animal. > > > > > > > Posted a couple tests the other day, JFYI. > > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > > > > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > > > > > works that reliably explodes, personally I'd suggest to split this work > > > > > > > > off into separate series. > > > > > > > > > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > > > > > fully cleaned up, and then leave the next two as part of a followon > > > > > > > > series pending further investigation into these other issues. As part of > > > > > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > > > > > regress it." But once again, just my .02. > > > > > > > > > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > > > > > kernels either. :( > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On Tue, Oct 22, 2024 at 12:38:23AM +1100, Dave Chinner wrote: > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > > So for a hacky thought/example, suppose we defined a transaction mode > > that basically implemented an exclusive checkpoint requirement (i.e. > > this transaction owns an entire checkpoint, nothing else is allowed in > > the CIL concurrently). > > Transactions know nothing about the CIL, nor should they. The CIL > also has no place in ordering transactions - it's purely an > aggregation mechanism that flushes committed transactions to stable > storage when it is told to. i.e. when a log force is issued. > > A globally serialised transaction requires ordering at the > transaction allocation/reservation level, not at the CIL. i.e. it is > essentially the same ordering problem as serialising against > untracked DIO on the inode before we can run a truncate (lock, > drain, do operation, unlock). > > > Presumably that would ensure everything before > > the grow would flush out to disk in one checkpoint, everything > > concurrent would block on synchronous commit of the grow trans (before > > new geometry is exposed), and then after that point everything pending > > would drain into another checkpoint. > > Yup, that's high level transaction level ordering and really has > nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the > only ordering it does is to preserve transaction commit ordering > down to the journal. > > > It kind of sounds like overkill, but really if it could be implemented > > simply enough then we wouldn't have to think too hard about auditing all > > other relog scenarios. I'd probably want to see at least some reproducer > > for this sort of problem to prove the theory though too, even if it > > required debug instrumentation or something. Hm? > > It's relatively straight forward to do, but it seems like total > overkill for growfs, as growfs only requires ordering > between the change of size and new allocations. We can do that by > not exposing the new space until after then superblock has been > modifed on stable storage in the case of grow. > > In the case of shrink, globally serialising the growfs > transaction for shrink won't actually do any thing useful because we > have to deny access to the free space we are removing before we > even start the shrink transaction. Hence we need allocation vs > shrink co-ordination before we shrink the superblock space, not a > globally serialised size modification transaction... > Not sure what you mean here, at least I don't see that requirement in the current code. It looks like shrink acquires the blocks all in the same transaction as the shrink. If something fails, it rolls or returns the space depending on what actually failed.. Brian > -Dave. > -- > Dave Chinner > david@fromorbit.com >
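To put Brian's description of the current shrink flow into code form: the blocks being removed are claimed out of the free space btrees in the same transaction that logs the smaller sb_dblocks, so a failed or cancelled shrink simply leaves (or returns) the space to the free space btrees. In the sketch below, xfs_alloc_removed_space() is a made-up placeholder for however that allocation is actually done; the superblock update reuses the helpers introduced by this patch:

	/*
	 * Hypothetical sketch: xfs_alloc_removed_space() is a placeholder,
	 * not a real kernel function.
	 */
	error = xfs_alloc_removed_space(tp, new_dblocks, delta);
	if (error)
		goto out_cancel;	/* cancelling rolls back / returns the space */

	/* The same transaction commits the smaller geometry. */
	sbp = xfs_prepare_sb_update(tp, &bp);
	sbp->sb_dblocks = cpu_to_be64(new_dblocks);
	error = xfs_commit_sb_update(tp, bp);
	if (!error)
		mp->m_sb.sb_dblocks = new_dblocks;
	xfs_buf_relse(bp);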
On Wed, Oct 23, 2024 at 10:45:51AM -0400, Brian Foster wrote: > On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote: > > On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote: > > > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote: > > > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote: > > > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote: > > > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote: > > > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote: > > > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote: > > > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote: > > > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote: > > > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL > > > > > > > > > > > checkpoint as alloc related transactions that might depend on the > > > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an > > > > > > > > > > > idea of what this is doing, but the description is kind of vague. > > > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested > > > > > > > > > > > it), but it's not quite clear to me from the description and/or logic > > > > > > > > > > > changes how that issue manifests. > > > > > > > > > > > > > > > > > > > > > > Could you elaborate please? For example, is this some kind of race > > > > > > > > > > > situation between an allocation request and a growfs transaction, where > > > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs > > > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to > > > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new > > > > > > > > > > > AG into the same checkpoint that adds the AG? > > > > > > > > > > > > > > > > > > > > This is based on the feedback by Dave on the previous version: > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm > > > > > > > > > not sure I'd go straight to this change given the situation... > > > > > > > > > > > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues > > > > > > > > > > with my test case, so I'm not actually sure how to get more updates > > > > > > > > > > into the check checkpoint. I'll try your exercisers if it could hit > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ok, that explains things a bit. My observation is that the first 5 > > > > > > > > > patches or so address the mount failure problem, but from there I'm not > > > > > > > > > reproducing much difference with or without the final patch. > > > > > > > > > > > > > > > > Does this change to flush the log after committing the new sb fix the > > > > > > > > recovery problems on older kernels? I /think/ that's the point of this > > > > > > > > patch. > > > > > > > > > > > > > > > > > > > > > > I don't follow.. growfs always forced the log via the sync transaction, > > > > > > > right? Or do you mean something else by "change to flush the log?" 
> > > > > > > > > > > > I guess I was typing a bit too fast this morning -- "change to flush the > > > > > > log to disk before anyone else can get their hands on the superblock". > > > > > > You're right that xfs_log_sb and data-device growfs already do that. > > > > > > > > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug > > > > > > that this patch fixes, right? > > > > > > > > > > > > > > > > Ah, Ok.. that makes sense. Sounds like it could be.. > > > > > > > > Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to > > > > set xfs_trans_set_sync in growfsrt before this one that refactors the > > > > existing growfs/sb updates? > > > > > > > > > > > I thought the main functional change of this patch was to hold the > > > > > > > superblock buffer locked across the force so nothing else can relog the > > > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably, > > > > > > > the theory is that prevents recovery from seeing updates to different > > > > > > > buffers that depend on the geometry update before the actual sb geometry > > > > > > > update is recovered (because the latter might have been relogged). > > > > > > > > > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand. > > > > > > > Either way I think patch could use a more detailed commit log... > > > > > > > > > > > > <nod> The commit message should point out that we're fixing a real bug > > > > > > here, which is that growfsrt doesn't force the log to disk when it > > > > > > commits the new rt geometry. > > > > > > > > > > > > > > > > Maybe even make it a separate patch to pull apart some of these cleanups > > > > > from fixes. I was also wondering if the whole locking change is the > > > > > moral equivalent of locking the sb across the growfs trans (i.e. > > > > > trans_getsb() + trans_bhold()), at which point maybe that would be a > > > > > reasonable incremental patch too. > > > > > > > > > > > > > > Either way, > > > > > > > > > I see aborts and splats all over the place, which implies at minimum > > > > > > > > > this isn't the only issue here. > > > > > > > > > > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen > > > > > > > > a slight tick up in failure rates -- random blocks that have clearly had > > > > > > > > garbage written to them such that recovery tries to read the block to > > > > > > > > recover a buffer log item and kaboom. At this point it's unclear if > > > > > > > > that's a problem with xfs or somewhere else. :( > > > > > > > > > > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this > > > > > > > > > particular patch has no straightforward way to test that it fixes > > > > > > > > > something and at the same time doesn't break anything else, and 3. we do > > > > > > > > > > > > > > > > I'm curious, what might break? Was that merely a general comment, or do > > > > > > > > you have something specific in mind? (iows: do you see more string to > > > > > > > > pull? :)) > > > > > > > > > > > > > > > > > > > > > > Just a general comment.. > > > > > > > > > > > > > > Something related that isn't totally clear to me is what about the > > > > > > > inverse shrink situation where dblocks is reduced. 
I.e., is there some > > > > > > > similar scenario where for example instead of the sb buffer being > > > > > > > relogged past some other buffer update that depends on it, some other > > > > > > > change is relogged past a sb update that invalidates/removes blocks > > > > > > > referenced by the relogged buffer..? If so, does that imply a shrink > > > > > > > should flush the log before the shrink transaction commits to ensure it > > > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land > > > > > > > in a new checkpoint)..? > > > > > > > > > > > > I think so. Might we want to do that before and after to be careful? > > > > > > > > > > > > > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we > > > > > had a log preflush to flush out already committed changes before the > > > > > grow, I don't think anything really prevents another "problematic" > > > > > transaction from committing after that preflush. > > > > > > > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't > > > > you? > > > > > > > > > > I guess it depends on how far into the weeds we want to get. I'm not > > > necessarily sure than anything exists today that is definitely > > > problematic wrt shrink. That would probably warrant an audit of > > > transactions or some other high level analysis to disprove. More thought > > > needed. > > > > <nod> I think there isn't a problem with shrink because the shrink > > transaction itself must be able to find the space, which means that > > there cannot be any files or unfinished deferred ops pointing to that > > space. > > > > Ok, that makes sense to me. On poking through the code, one thing that > it looks like it misses is the case where the allocation fails purely > due to extents being busy (i.e. even if the blocks are free). > > That doesn't seem harmful, just perhaps a little odd that a shrink might > fail and then succeed after some period of time for the log tail to push > with no other visible changes. Unless I'm missing something, it might be > worth adding the log force there (that I think you suggested earlier). Any program that relies on a particular piece of space being in a particular state can fail due to other active threads, so I'm not worried about that. > All that said, I'm not so sure this will all apply the same if shrink > grows to a more active implementation. For example, I suspect an > implementation that learns to quiesce/truncate full AGs isn't going to > necessarily need to allocate all of the blocks out of the AG free space > trees. But of course, this is all vaporware anyways. :) <nod> That's a burden for whoever ends up working on AG removal. ;) > > > Short of the latter, I'm more thinking about the question "is there some > > > new thing we could add years down the line that 1. adds something to the > > > log that could conflict and 2. could be reordered past a shrink > > > transaction in a problematic way?" If the answer to that is open ended > > > and some such thing does come along, I think it's highly likely this > > > would just break growfs logging again until somebody trips over it in > > > the field. > > > > Good thing we have a couple of tests now? :) > > > > Not with good shrink support, unfortunately. :/ I tried adding basic > shrink calls into the proposed tests, but I wasn't able to see them > actually do anything, presumably because of the stress workload and > smallish filesystem keeping those blocks in use. 
It might require a more > active shrink implementation before we could support this kind of test, > or otherwise maybe something more limited/targeted than fsstress. I suspect you're right. --D > Hmm.. a random, off the top of my head idea might be a debug knob that > artificially restricts all allocations beyond a certain disk offset... > > Brian > > > > > > I dunno.. on one hand it does seem like an unlikely thing due to the > > > > > nature of needing space to be free in order to shrink in the first > > > > > place, but OTOH if you have something like grow that is rare, not > > > > > performance sensitive, has a history of not being well tested, and has > > > > > these subtle ordering requirements that might change indirectly to other > > > > > transactions, ISTM it could be a wise engineering decision to simplify > > > > > to the degree possible and find the most basic model that enforces > > > > > predictable ordering. > > > > > > > > > > So for a hacky thought/example, suppose we defined a transaction mode > > > > > that basically implemented an exclusive checkpoint requirement (i.e. > > > > > this transaction owns an entire checkpoint, nothing else is allowed in > > > > > the CIL concurrently). Presumably that would ensure everything before > > > > > the grow would flush out to disk in one checkpoint, everything > > > > > concurrent would block on synchronous commit of the grow trans (before > > > > > new geometry is exposed), and then after that point everything pending > > > > > would drain into another checkpoint. > > > > > > > > > > It kind of sounds like overkill, but really if it could be implemented > > > > > simply enough then we wouldn't have to think too hard about auditing all > > > > > other relog scenarios. I'd probably want to see at least some reproducer > > > > > for this sort of problem to prove the theory though too, even if it > > > > > required debug instrumentation or something. Hm? > > > > > > > > What if we redefined the input requirements to shrink? Lets say we > > > > require that the fd argument to a shrink ioctl is actually an unlinkable > > > > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can > > > > force the log without holding any locks, and the shrink transaction can > > > > remove the bmap and rmap records at the same time that it updates the sb > > > > geometry. The otherwise inaccessible file means that nobody can reuse > > > > that space between the log force and the sb update. > > > > > > > > > > Interesting thought. It kind of sounds like how shrink already works to > > > some degree, right? I.e. the kernel side allocs the blocks out of the > > > btrees and tosses them, just no inode in the mix? > > > > Right. > > > > > Honestly I'd probably need to stare at this code and think about it and > > > work through some scenarios to quantify how much of a concern this > > > really is, and I don't really have the bandwidth for that just now. I > > > mainly wanted to raise the notion that if we're assessing high level log > > > ordering requirements for growfs, we should consider the shrink case as > > > well. > > > > <nod> > > > > > > > > > Anyways, my point is just that if it were me I wouldn't get too deep > > > > > > > into this until some of the reproducible growfs recovery issues are at > > > > > > > least characterized and testing is more sorted out. > > > > > > > > > > > > > > The context for testing is here [1]. 
The TLDR is basically that > > > > > > > Christoph has a targeted test that reproduces the initial mount failure > > > > > > > and I hacked up a more general test that also reproduces it and > > > > > > > additional growfs recovery problems. This test does seem to confirm that > > > > > > > the previous patches address the mount failure issue, but this patch > > > > > > > doesn't seem to prevent any of the other problems produced by the > > > > > > > generic test. That might just mean the test doesn't reproduce what this > > > > > > > fixes, but it's kind of hard to at least regression test something like > > > > > > > this when basic growfs crash-recovery seems pretty much broken. > > > > > > > > > > > > Hmm, if you make a variant of that test which formats with an rt device > > > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see > > > > > > similar blowups? Let's see what happens if I do that... > > > > > > > > > > > > > > > > Heh, sounds like so from your followup. Fun times. > > > > > > > > > > I guess that test should probably work its way upstream. I made some > > > > > tweaks locally since last posted to try and make it a little more > > > > > aggressive, but it didn't repro anything new so not sure how much > > > > > difference it makes really. Do we want a separate version like yours for > > > > > the rt case or would you expect to cover both cases in a single test? > > > > > > > > This probably should be different tests, because rt is its own very > > > > weird animal. > > > > > > > > > > Posted a couple tests the other day, JFYI. > > > > > > Brian > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > > --D > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/ > > > > > > > > > > > > > > > > have at least one fairly straightforward growfs/recovery test in the > > > > > > > > > works that reliably explodes, personally I'd suggest to split this work > > > > > > > > > off into separate series. > > > > > > > > > > > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're > > > > > > > > > fully cleaned up, and then leave the next two as part of a followon > > > > > > > > > series pending further investigation into these other issues. As part of > > > > > > > > > that I'd like to know whether the recovery test reproduces (or can be > > > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also > > > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't > > > > > > > > > regress it." But once again, just my .02. > > > > > > > > > > > > > > > > Yeah, it's too bad there's no good way to test recovery with older > > > > > > > > kernels either. :( > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index d95409f3cba667..2c83ab7441ade5 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -1025,6 +1025,80 @@ xfs_sb_mount_common( mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp); } +/* + * Mirror the lazy sb counters to the in-core superblock. + * + * If this is at unmount, the counters will be exactly correct, but at any other + * time they will only be ballpark correct because of reservations that have + * been taken out percpu counters. If we have an unclean shutdown, this will be + * corrected by log recovery rebuilding the counters from the AGF block counts. + * + * Do not update sb_frextents here because it is not part of the lazy sb + * counters, despite having a percpu counter. It is always kept consistent with + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need + * have to update it here. + */ +static void +xfs_flush_sb_counters( + struct xfs_mount *mp) +{ + if (xfs_has_lazysbcount(mp)) { + mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount); + mp->m_sb.sb_ifree = min_t(uint64_t, + percpu_counter_sum_positive(&mp->m_ifree), + mp->m_sb.sb_icount); + mp->m_sb.sb_fdblocks = + percpu_counter_sum_positive(&mp->m_fdblocks); + } +} + +/* + * Prepare a direct update to the superblock through the on-disk buffer. + * + * This locks out other modifications through the buffer lock and then syncs all + * in-core values to the on-disk buffer (including the percpu counters). + * + * The caller then directly manipulates the on-disk fields and calls + * xfs_commit_sb_update to the updates to disk them. The caller is responsible + * to also update the in-core field, but it can do so after the transaction has + * been committed to disk. + * + * Updating the in-core field only after xfs_commit_sb_update ensures that other + * processes only see the update once it is stable on disk, and is usually the + * right thing to do for superblock updates. + * + * Note that writes to superblock fields updated using this helper are + * synchronized using the superblock buffer lock, which must be taken around + * all updates to the in-core fields as well. + */ +struct xfs_dsb * +xfs_prepare_sb_update( + struct xfs_trans *tp, + struct xfs_buf **bpp) +{ + *bpp = xfs_trans_getsb(tp); + xfs_flush_sb_counters(tp->t_mountp); + xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb); + return (*bpp)->b_addr; +} + +/* + * Commit a direct update to the on-disk superblock. Keeps @bp locked and + * referenced, so the caller must call xfs_buf_relse() manually. + */ +int +xfs_commit_sb_update( + struct xfs_trans *tp, + struct xfs_buf *bp) +{ + xfs_trans_bhold(tp, bp); + xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF); + xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1); + + xfs_trans_set_sync(tp); + return xfs_trans_commit(tp); +} + /* * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock * into the superblock buffer to be logged. It does not provide the higher @@ -1038,28 +1112,7 @@ xfs_log_sb( struct xfs_mount *mp = tp->t_mountp; struct xfs_buf *bp = xfs_trans_getsb(tp); - /* - * Lazy sb counters don't update the in-core superblock so do that now. - * If this is at unmount, the counters will be exactly correct, but at - * any other time they will only be ballpark correct because of - * reservations that have been taken out percpu counters. If we have an - * unclean shutdown, this will be corrected by log recovery rebuilding - * the counters from the AGF block counts. 
- * - * Do not update sb_frextents here because it is not part of the lazy - * sb counters, despite having a percpu counter. It is always kept - * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas() - * and hence we don't need have to update it here. - */ - if (xfs_has_lazysbcount(mp)) { - mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount); - mp->m_sb.sb_ifree = min_t(uint64_t, - percpu_counter_sum_positive(&mp->m_ifree), - mp->m_sb.sb_icount); - mp->m_sb.sb_fdblocks = - percpu_counter_sum_positive(&mp->m_fdblocks); - } - + xfs_flush_sb_counters(mp); xfs_sb_to_disk(bp->b_addr, &mp->m_sb); xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF); xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1); diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h index 885c837559914d..3649d071687e33 100644 --- a/fs/xfs/libxfs/xfs_sb.h +++ b/fs/xfs/libxfs/xfs_sb.h @@ -13,6 +13,9 @@ struct xfs_trans; struct xfs_fsop_geom; struct xfs_perag; +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp, + struct xfs_buf **bpp); +int xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp); extern void xfs_log_sb(struct xfs_trans *tp); extern int xfs_sync_sb(struct xfs_mount *mp, bool wait); extern int xfs_sync_sb_buf(struct xfs_mount *mp); diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h index 33b84a3a83ff63..45a32ea426164a 100644 --- a/fs/xfs/libxfs/xfs_shared.h +++ b/fs/xfs/libxfs/xfs_shared.h @@ -149,14 +149,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, #define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008 #define XFS_TRANS_SB_FREXTENTS 0x00000010 #define XFS_TRANS_SB_RES_FREXTENTS 0x00000020 -#define XFS_TRANS_SB_DBLOCKS 0x00000040 -#define XFS_TRANS_SB_AGCOUNT 0x00000080 -#define XFS_TRANS_SB_IMAXPCT 0x00000100 -#define XFS_TRANS_SB_REXTSIZE 0x00000200 -#define XFS_TRANS_SB_RBMBLOCKS 0x00000400 -#define XFS_TRANS_SB_RBLOCKS 0x00000800 -#define XFS_TRANS_SB_REXTENTS 0x00001000 -#define XFS_TRANS_SB_REXTSLOG 0x00002000 /* * Here we centralize the specification of XFS meta-data buffer reference count diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c index 0fef98e9f83409..be9d31f032b1bf 100644 --- a/fs/xfs/scrub/rtbitmap_repair.c +++ b/fs/xfs/scrub/rtbitmap_repair.c @@ -16,6 +16,7 @@ #include "xfs_bit.h" #include "xfs_bmap.h" #include "xfs_bmap_btree.h" +#include "xfs_sb.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry( struct xchk_rtbitmap *rtb) { struct xfs_mount *mp = sc->mp; - struct xfs_trans *tp = sc->tp; /* Superblock fields */ - if (mp->m_sb.sb_rextents != rtb->rextents) - xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS, - rtb->rextents - mp->m_sb.sb_rextents); - - if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks) - xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS, - rtb->rbmblocks - mp->m_sb.sb_rbmblocks); - - if (mp->m_sb.sb_rextslog != rtb->rextslog) - xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG, - rtb->rextslog - mp->m_sb.sb_rextslog); + if (mp->m_sb.sb_rextents != rtb->rextents || + mp->m_sb.sb_rbmblocks != rtb->rbmblocks || + mp->m_sb.sb_rextslog != rtb->rextslog) { + struct xfs_buf *bp = xfs_trans_getsb(sc->tp); + + mp->m_sb.sb_rextents = rtb->rextents; + mp->m_sb.sb_rbmblocks = rtb->rbmblocks; + mp->m_sb.sb_rextslog = rtb->rextslog; + xfs_sb_to_disk(bp->b_addr, &mp->m_sb); + + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF); + xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1); + } /* Fix broken isize */ sc->ip->i_disk_size = 
roundup_64(sc->ip->i_disk_size,
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b247d895c276d2..4168ccf21068cb 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_data_update_sb(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		nagcount,
> +	xfs_rfsblock_t		nb,
> +	xfs_agnumber_t		nagimax)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_agcount = cpu_to_be32(nagcount);
> +	sbp->sb_dblocks = cpu_to_be64(nb);
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		goto out_unlock;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_agcount = nagcount;
> +	mp->m_sb.sb_dblocks = nb;
> +
> +	if (nagimax)
> +		mp->m_maxagi = nagimax;
> +	xfs_set_low_space_thresholds(mp);
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +
> +out_unlock:
> +	xfs_buf_relse(bp);
> +	return error;
> +}
> +
>  /*
>   * growfs operations
>   */
> @@ -171,37 +211,13 @@ xfs_growfs_data_private(
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	/*
> -	 * Update changed superblock fields transactionally. These are not
> -	 * seen by the rest of the world until the transaction commit applies
> -	 * them atomically to the superblock.
> -	 */
> -	if (nagcount > oagcount)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> -	if (delta)
> -		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
>  	if (id.nfree)
>  		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
>  
> -	/*
> -	 * Sync sb counters now to reflect the updated values. This is
> -	 * particularly important for shrink because the write verifier
> -	 * will fail if sb_fdblocks is ever larger than sb_dblocks.
> -	 */
> -	if (xfs_has_lazysbcount(mp))
> -		xfs_log_sb(tp);
> -
> -	xfs_trans_set_sync(tp);
> -	error = xfs_trans_commit(tp);
> +	error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
>  	if (error)
>  		return error;
>  
> -	/* New allocation groups fully initialized, so update mount struct */
> -	if (nagimax)
> -		mp->m_maxagi = nagimax;
> -	xfs_set_low_space_thresholds(mp);
> -	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> -
>  	if (delta > 0) {
>  		/*
>  		 * If we expanded the last AG, free the per-AG reservation
> @@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
>  	struct xfs_mount	*mp,
>  	__u32			imaxpct)
>  {
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
>  	struct xfs_trans	*tp;
> -	int			dpct;
>  	int			error;
>  
>  	if (imaxpct > 100)
> @@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
>  	if (error)
>  		return error;
>  
> -	dpct = imaxpct - mp->m_sb.sb_imax_pct;
> -	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> -	xfs_trans_set_sync(tp);
> -	return xfs_trans_commit(tp);
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_imax_pct = imaxpct;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (!error)
> +		mp->m_sb.sb_imax_pct = imaxpct;
> +	xfs_buf_relse(bp);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3a2005a1e673dc..994e5efedab20f 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
>  	return error;
>  }
>  
> +static int
> +xfs_growfs_rt_update_sb(
> +	struct xfs_trans	*tp,
> +	struct xfs_mount	*mp,
> +	struct xfs_mount	*nmp,
> +	xfs_rtbxlen_t		freed_rtx)
> +{
> +	struct xfs_dsb		*sbp;
> +	struct xfs_buf		*bp;
> +	int			error;
> +
> +	/*
> +	 * Update the geometry in the on-disk superblock first, and ensure
> +	 * they make it to disk before the superblock can be relogged.
> +	 */
> +	sbp = xfs_prepare_sb_update(tp, &bp);
> +	sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
> +	sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
> +	sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
> +	sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
> +	sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
> +	error = xfs_commit_sb_update(tp, bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Propagate the new values to the live mount structure after they made
> +	 * it to disk with the superblock buffer still locked.
> +	 */
> +	mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
> +	mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
> +	mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
> +	mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
> +	mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
> +	mp->m_rsumlevels = nmp->m_rsumlevels;
> +	mp->m_rsumblocks = nmp->m_rsumblocks;
> +
> +	/*
> +	 * Recompute the growfsrt reservation from the new rsumsize.
> +	 */
> +	xfs_trans_resv_calc(mp, &mp->m_resv);
> +
> +	/*
> +	 * Ensure the mount RT feature flag is now set.
> +	 */
> +	mp->m_features |= XFS_FEAT_REALTIME;
> +	xfs_buf_relse(bp);
> +	return 0;
> +}
> +
>  static int
>  xfs_growfs_rt_bmblock(
>  	struct xfs_mount	*mp,
> @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
>  		goto out_cancel;
>  	}
>  
> -	/*
> -	 * Update superblock fields.
> -	 */
> -	if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
> -			nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
> -	if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
> -			nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
> -	if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
> -			nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
> -	if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
> -			nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
> -	if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
> -		xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
> -			nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
> -
>  	/*
>  	 * Free the new extent.
>  	 */
> @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
>  	xfs_rtbuf_cache_relse(&nargs);
>  	if (error)
>  		goto out_cancel;
> -
> -	/*
> -	 * Mark more blocks free in the superblock.
> -	 */
>  	xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
>  
> -	/*
> -	 * Update the calculated values in the real mount structure.
> -	 */
> -	mp->m_rsumlevels = nmp->m_rsumlevels;
> -	mp->m_rsumblocks = nmp->m_rsumblocks;
> -	xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
> -
> -	/*
> -	 * Recompute the growfsrt reservation from the new rsumsize.
> -	 */
> -	xfs_trans_resv_calc(mp, &mp->m_resv);
> -
> -	error = xfs_trans_commit(args.tp);
> +	error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
>  	if (error)
>  		goto out_free;
>  
> -	/*
> -	 * Ensure the mount RT feature flag is now set.
> -	 */
> -	mp->m_features |= XFS_FEAT_REALTIME;
> -
>  	kfree(nmp);
>  	return 0;
>  
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bdf3704dc30118..56505cb94f877d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -430,31 +430,6 @@ xfs_trans_mod_sb(
>  		ASSERT(delta < 0);
>  		tp->t_res_frextents_delta += delta;
>  		break;
> -	case XFS_TRANS_SB_DBLOCKS:
> -		tp->t_dblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_AGCOUNT:
> -		ASSERT(delta > 0);
> -		tp->t_agcount_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_IMAXPCT:
> -		tp->t_imaxpct_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSIZE:
> -		tp->t_rextsize_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBMBLOCKS:
> -		tp->t_rbmblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_RBLOCKS:
> -		tp->t_rblocks_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTENTS:
> -		tp->t_rextents_delta += delta;
> -		break;
> -	case XFS_TRANS_SB_REXTSLOG:
> -		tp->t_rextslog_delta += delta;
> -		break;
>  	default:
>  		ASSERT(0);
>  		return;
> @@ -475,12 +450,8 @@ STATIC void
>  xfs_trans_apply_sb_deltas(
>  	xfs_trans_t	*tp)
>  {
> -	struct xfs_dsb	*sbp;
> -	struct xfs_buf	*bp;
> -	int		whole = 0;
> -
> -	bp = xfs_trans_getsb(tp);
> -	sbp = bp->b_addr;
> +	struct xfs_buf	*bp = xfs_trans_getsb(tp);
> +	struct xfs_dsb	*sbp = bp->b_addr;
>  
>  	/*
>  	 * Only update the superblock counters if we are logging them
> @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
>  		spin_unlock(&mp->m_sb_lock);
>  	}
>  
> -	if (tp->t_dblocks_delta) {
> -		be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_agcount_delta) {
> -		be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_imaxpct_delta) {
> -		sbp->sb_imax_pct += tp->t_imaxpct_delta;
> -		whole = 1;
> -	}
> -	if (tp->t_rextsize_delta) {
> -		be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rbmblocks_delta) {
> -		be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rblocks_delta) {
> -		be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextents_delta) {
> -		be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
> -		whole = 1;
> -	}
> -	if (tp->t_rextslog_delta) {
> -		sbp->sb_rextslog += tp->t_rextslog_delta;
> -		whole = 1;
> -	}
> -
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> -	if (whole)
> -		/*
> -		 * Log the whole thing, the fields are noncontiguous.
> -		 */
> -		xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> -	else
> -		/*
> -		 * Since all the modifiable fields are contiguous, we
> -		 * can get away with this.
> -		 */
> -		xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> -				  offsetof(struct xfs_dsb, sb_frextents) +
> -				  sizeof(sbp->sb_frextents) - 1);
> +	xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> +			  offsetof(struct xfs_dsb, sb_frextents) +
> +			  sizeof(sbp->sb_frextents) - 1);
>  }
>  
>  /*
> @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
>  	 * must be consistent with the ondisk rtbitmap and must never include
>  	 * incore reservations.
>  	 */
> -	mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
> -	mp->m_sb.sb_agcount += tp->t_agcount_delta;
> -	mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
> -	mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
> -	if (tp->t_rextsize_delta) {
> -		mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
> -		mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
> -	}
> -	mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
> -	mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
> -	mp->m_sb.sb_rextents += tp->t_rextents_delta;
> -	mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
>  	spin_unlock(&mp->m_sb_lock);
> -
> -	/*
> -	 * Debug checks outside of the spinlock so they don't lock up the
> -	 * machine if they fail.
> -	 */
> -	ASSERT(mp->m_sb.sb_imax_pct >= 0);
> -	ASSERT(mp->m_sb.sb_rextslog >= 0);
>  }
>  
>  /* Add the given log item to the transaction's list of log items. */
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665ad..e5911cf09be444 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -140,14 +140,6 @@ typedef struct xfs_trans {
>  	int64_t			t_res_fdblocks_delta; /* on-disk only chg */
>  	int64_t			t_frextents_delta;/* superblock freextents chg*/
>  	int64_t			t_res_frextents_delta; /* on-disk only chg */
> -	int64_t			t_dblocks_delta;/* superblock dblocks change */
> -	int64_t			t_agcount_delta;/* superblock agcount change */
> -	int64_t			t_imaxpct_delta;/* superblock imaxpct change */
> -	int64_t			t_rextsize_delta;/* superblock rextsize chg */
> -	int64_t			t_rbmblocks_delta;/* superblock rbmblocks chg */
> -	int64_t			t_rblocks_delta;/* superblock rblocks change */
> -	int64_t			t_rextents_delta;/* superblocks rextents chg */
> -	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
>  	struct list_head	t_items;	/* log item descriptors */
>  	struct list_head	t_busy;		/* list of busy extents */
>  	struct list_head	t_dfops;	/* deferred operations */