diff mbox series

xfs: fix debug-mode accounting for deferred AGFL freeing

Message ID 20210425154634.GZ3122264@magnolia (mailing list archive)
State New, archived
Headers show
Series xfs: fix debug-mode accounting for deferred AGFL freeing | expand

Commit Message

Darrick J. Wong April 25, 2021, 3:46 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to
remove blocks from an overfilled AGFL because there were complaints
about transaction overruns that stemmed from trying to free multiple
blocks in a single transaction.

Unfortunately, that commit missed a subtlety in the debug-mode
transaction accounting when a realtime volume is attached.  If a
realtime file undergoes a data fork mapping change such that realtime
extents are allocated (or freed) in the same transaction that a data
device block is also allocated (or freed), we can trip a debugging
assertion.  This can happen (for example) if a realtime extent is
allocated and it is necessary to reshape the bmbt to hold the new
mapping.

When we go to allocate a bmbt block from an AG, the first thing the data
device block allocator does is ensure that the freelist is the proper
length.  If the freelist is too long, it will trim the freelist to the
proper length.

In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
record the decrement in the AG free list count.  Prior to f8f28 we would
put the free block back in the free space btrees in the same
transaction, which calls xfs_trans_agblocks_delta() to record the
increment in the AG free block count.  Since AGFL blocks are included in
the global free block count (fdblocks), there is no corresponding
fdblocks update, so the AGFL free satisfies the following condition in
xfs_trans_apply_sb_deltas:

	/*
	 * Check that superblock mods match the mods made to AGF counters.
	 */
	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
		tp->t_ag_btree_delta));

The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
the number blocks that were allocated.

After commit f8f28 we defer the block freeing to the next chained
transaction, which means that the calls to xfs_trans_agflist_delta and
xfs_trans_agblocks_delta occur in separate transactions.  The (first)
transaction that shortens the free list trips on the comparison, which
has now become:

(X + 0) == ((X) + -1 + 0)

because we haven't freed the AGFL block yet; we've only logged an
intention to free it.  When the second transaction (the deferred free)
commits, it will evaluate the expression as:

(0 + 0) == (1 + 0 + 0)

and trip over that in turn.

At this point, the astute reader may note that the two commits tagged by
this patch have been in the kernel for a long time but haven't generated
any bug reports.  How is it that the author became aware of this bug?

This originally surfaced as an intermittent failure when I was testing
realtime rmap, but a different bug report by Zorro Lang reveals the same
assertion occuring on !lazysbcount filesystems.

The common factor to both reports (and why this problem wasn't
previously reported) becomes apparent if we consider when
xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():

	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
		xfs_trans_apply_sb_deltas(tp);

With a modern lazysbcount filesystem, transactions update only the
percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
xfs_trans_apply_sb_deltas is rarely called.

However, updates to the count of free realtime extents are not part of
lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
removing data fork mappings to realtime files; similarly,
XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.

This bug was found by running generic/051 on either a V4 filesystem
lacking lazysbcount; or a V5 filesystem with a realtime volume.

Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Dave Chinner April 25, 2021, 10:50 p.m. UTC | #1
On Sun, Apr 25, 2021 at 08:46:34AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to
> remove blocks from an overfilled AGFL because there were complaints
> about transaction overruns that stemmed from trying to free multiple
> blocks in a single transaction.
> 
> Unfortunately, that commit missed a subtlety in the debug-mode
> transaction accounting when a realtime volume is attached.  If a
> realtime file undergoes a data fork mapping change such that realtime
> extents are allocated (or freed) in the same transaction that a data
> device block is also allocated (or freed), we can trip a debugging
> assertion.  This can happen (for example) if a realtime extent is
> allocated and it is necessary to reshape the bmbt to hold the new
> mapping.
> 
> When we go to allocate a bmbt block from an AG, the first thing the data
> device block allocator does is ensure that the freelist is the proper
> length.  If the freelist is too long, it will trim the freelist to the
> proper length.
> 
> In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
> record the decrement in the AG free list count.  Prior to f8f28 we would
> put the free block back in the free space btrees in the same
> transaction, which calls xfs_trans_agblocks_delta() to record the
> increment in the AG free block count.  Since AGFL blocks are included in
> the global free block count (fdblocks), there is no corresponding
> fdblocks update, so the AGFL free satisfies the following condition in
> xfs_trans_apply_sb_deltas:
> 
> 	/*
> 	 * Check that superblock mods match the mods made to AGF counters.
> 	 */
> 	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
> 	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
> 		tp->t_ag_btree_delta));
> 
> The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
> the number blocks that were allocated.
> 
> After commit f8f28 we defer the block freeing to the next chained
> transaction, which means that the calls to xfs_trans_agflist_delta and
> xfs_trans_agblocks_delta occur in separate transactions.  The (first)
> transaction that shortens the free list trips on the comparison, which
> has now become:
> 
> (X + 0) == ((X) + -1 + 0)
> 
> because we haven't freed the AGFL block yet; we've only logged an
> intention to free it.  When the second transaction (the deferred free)
> commits, it will evaluate the expression as:
> 
> (0 + 0) == (1 + 0 + 0)
> 
> and trip over that in turn.
> 
> At this point, the astute reader may note that the two commits tagged by
> this patch have been in the kernel for a long time but haven't generated
> any bug reports.  How is it that the author became aware of this bug?
> 
> This originally surfaced as an intermittent failure when I was testing
> realtime rmap, but a different bug report by Zorro Lang reveals the same
> assertion occuring on !lazysbcount filesystems.
> 
> The common factor to both reports (and why this problem wasn't
> previously reported) becomes apparent if we consider when
> xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
> 
> 	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
> 		xfs_trans_apply_sb_deltas(tp);

IIUC, what you are saying is that this debug code is simply not
exercised in normal testing and hasn't been for the past decade?
And it still won't be exercised on anything other than realtime
device testing?

I don't really consider !lazycount worth investing time and
effort into because deprecated and default for over a decade, so if
this is the case, then why not just remove it?

Cheers,

Dave.
Darrick J. Wong April 25, 2021, 11:05 p.m. UTC | #2
On Mon, Apr 26, 2021 at 08:50:52AM +1000, Dave Chinner wrote:
> On Sun, Apr 25, 2021 at 08:46:34AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to
> > remove blocks from an overfilled AGFL because there were complaints
> > about transaction overruns that stemmed from trying to free multiple
> > blocks in a single transaction.
> > 
> > Unfortunately, that commit missed a subtlety in the debug-mode
> > transaction accounting when a realtime volume is attached.  If a
> > realtime file undergoes a data fork mapping change such that realtime
> > extents are allocated (or freed) in the same transaction that a data
> > device block is also allocated (or freed), we can trip a debugging
> > assertion.  This can happen (for example) if a realtime extent is
> > allocated and it is necessary to reshape the bmbt to hold the new
> > mapping.
> > 
> > When we go to allocate a bmbt block from an AG, the first thing the data
> > device block allocator does is ensure that the freelist is the proper
> > length.  If the freelist is too long, it will trim the freelist to the
> > proper length.
> > 
> > In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
> > record the decrement in the AG free list count.  Prior to f8f28 we would
> > put the free block back in the free space btrees in the same
> > transaction, which calls xfs_trans_agblocks_delta() to record the
> > increment in the AG free block count.  Since AGFL blocks are included in
> > the global free block count (fdblocks), there is no corresponding
> > fdblocks update, so the AGFL free satisfies the following condition in
> > xfs_trans_apply_sb_deltas:
> > 
> > 	/*
> > 	 * Check that superblock mods match the mods made to AGF counters.
> > 	 */
> > 	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
> > 	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
> > 		tp->t_ag_btree_delta));
> > 
> > The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
> > the number blocks that were allocated.
> > 
> > After commit f8f28 we defer the block freeing to the next chained
> > transaction, which means that the calls to xfs_trans_agflist_delta and
> > xfs_trans_agblocks_delta occur in separate transactions.  The (first)
> > transaction that shortens the free list trips on the comparison, which
> > has now become:
> > 
> > (X + 0) == ((X) + -1 + 0)
> > 
> > because we haven't freed the AGFL block yet; we've only logged an
> > intention to free it.  When the second transaction (the deferred free)
> > commits, it will evaluate the expression as:
> > 
> > (0 + 0) == (1 + 0 + 0)
> > 
> > and trip over that in turn.
> > 
> > At this point, the astute reader may note that the two commits tagged by
> > this patch have been in the kernel for a long time but haven't generated
> > any bug reports.  How is it that the author became aware of this bug?
> > 
> > This originally surfaced as an intermittent failure when I was testing
> > realtime rmap, but a different bug report by Zorro Lang reveals the same
> > assertion occuring on !lazysbcount filesystems.
> > 
> > The common factor to both reports (and why this problem wasn't
> > previously reported) becomes apparent if we consider when
> > xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
> > 
> > 	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
> > 		xfs_trans_apply_sb_deltas(tp);
> 
> IIUC, what you are saying is that this debug code is simply not
> exercised in normal testing and hasn't been for the past decade?

Yes.

> And it still won't be exercised on anything other than realtime
> device testing?

Yes.  Note that I've added it to my regular testing routine since I
actually /do/ have users of rt volumes now.

> I don't really consider !lazycount worth investing time and
> effort into because deprecated and default for over a decade, so if
> this is the case, then why not just remove it?

I think I understand [this infrequently exercised code path], which
means I don't understand it, so rather than be one of those idiots who
rips out everything he doesn't understand, I thought I'd at least try to
figure out what it did and patch it back together.

So now that I actually understand what that debug code does having dug
around to see what those fields do, I'd also be just fine ripping it
out.  But I wanted to know what I would be tearing out first...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner April 25, 2021, 11:21 p.m. UTC | #3
On Sun, Apr 25, 2021 at 04:05:50PM -0700, Darrick J. Wong wrote:
> On Mon, Apr 26, 2021 at 08:50:52AM +1000, Dave Chinner wrote:
> > On Sun, Apr 25, 2021 at 08:46:34AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to
> > > remove blocks from an overfilled AGFL because there were complaints
> > > about transaction overruns that stemmed from trying to free multiple
> > > blocks in a single transaction.
> > > 
> > > Unfortunately, that commit missed a subtlety in the debug-mode
> > > transaction accounting when a realtime volume is attached.  If a
> > > realtime file undergoes a data fork mapping change such that realtime
> > > extents are allocated (or freed) in the same transaction that a data
> > > device block is also allocated (or freed), we can trip a debugging
> > > assertion.  This can happen (for example) if a realtime extent is
> > > allocated and it is necessary to reshape the bmbt to hold the new
> > > mapping.
> > > 
> > > When we go to allocate a bmbt block from an AG, the first thing the data
> > > device block allocator does is ensure that the freelist is the proper
> > > length.  If the freelist is too long, it will trim the freelist to the
> > > proper length.
> > > 
> > > In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
> > > record the decrement in the AG free list count.  Prior to f8f28 we would
> > > put the free block back in the free space btrees in the same
> > > transaction, which calls xfs_trans_agblocks_delta() to record the
> > > increment in the AG free block count.  Since AGFL blocks are included in
> > > the global free block count (fdblocks), there is no corresponding
> > > fdblocks update, so the AGFL free satisfies the following condition in
> > > xfs_trans_apply_sb_deltas:
> > > 
> > > 	/*
> > > 	 * Check that superblock mods match the mods made to AGF counters.
> > > 	 */
> > > 	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
> > > 	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
> > > 		tp->t_ag_btree_delta));
> > > 
> > > The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
> > > the number blocks that were allocated.
> > > 
> > > After commit f8f28 we defer the block freeing to the next chained
> > > transaction, which means that the calls to xfs_trans_agflist_delta and
> > > xfs_trans_agblocks_delta occur in separate transactions.  The (first)
> > > transaction that shortens the free list trips on the comparison, which
> > > has now become:
> > > 
> > > (X + 0) == ((X) + -1 + 0)
> > > 
> > > because we haven't freed the AGFL block yet; we've only logged an
> > > intention to free it.  When the second transaction (the deferred free)
> > > commits, it will evaluate the expression as:
> > > 
> > > (0 + 0) == (1 + 0 + 0)
> > > 
> > > and trip over that in turn.
> > > 
> > > At this point, the astute reader may note that the two commits tagged by
> > > this patch have been in the kernel for a long time but haven't generated
> > > any bug reports.  How is it that the author became aware of this bug?
> > > 
> > > This originally surfaced as an intermittent failure when I was testing
> > > realtime rmap, but a different bug report by Zorro Lang reveals the same
> > > assertion occuring on !lazysbcount filesystems.
> > > 
> > > The common factor to both reports (and why this problem wasn't
> > > previously reported) becomes apparent if we consider when
> > > xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
> > > 
> > > 	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
> > > 		xfs_trans_apply_sb_deltas(tp);
> > 
> > IIUC, what you are saying is that this debug code is simply not
> > exercised in normal testing and hasn't been for the past decade?
> 
> Yes.
> 
> > And it still won't be exercised on anything other than realtime
> > device testing?
> 
> Yes.  Note that I've added it to my regular testing routine since I
> actually /do/ have users of rt volumes now.

Ok.

> 
> > I don't really consider !lazycount worth investing time and
> > effort into because deprecated and default for over a decade, so if
> > this is the case, then why not just remove it?
> 
> I think I understand [this infrequently exercised code path], which
> means I don't understand it, so rather than be one of those idiots who
> rips out everything he doesn't understand, I thought I'd at least try to
> figure out what it did and patch it back together.
> 
> So now that I actually understand what that debug code does having dug
> around to see what those fields do, I'd also be just fine ripping it
> out.  But I wanted to know what I would be tearing out first...

History. In 2006 this went from being calculated and checked on
every transaction, to just being debug code:

commit a7b0e16180bdc21c42331259b741c86ebd27b9bf
Author: Nathan Scott <nathans@sgi.com>
Date:   Fri Jun 23 02:49:59 2006 +0000

    Reduce size of xfs_trans_t structure.
    * remove ->t_forw, ->t_back -- unused
    * ->t_ag_freeblks_delta, ->t_ag_flist_delta, ->t_ag_btree_delta
      are debugging aid -- wrap them in everyone's favourite way.
    
    As a result, cut "xfs_trans" slab object size from 592 to 572 bytes here.
    
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Merge of xfs-linux-melb:xfs-kern:26319a by kenmcd.
    
      Reduce size of xfs_trans_t structure.

But the fields are largely unchanged from when they were introduced
in 1994 as a debugging aid:

commit a6116f5f96658a400f122527163d2a233c45904c
Author: Adam Sweeney <ajs@sgi.com>
Date:   Mon Jun 27 23:53:59 1994 +0000

    Add debugging accounting for agf btree blocks.

commit 785055c77c166e935f22afd3379a861e2967a3b1
Author: Adam Sweeney <ajs@sgi.com>
Date:   Fri Jun 24 20:46:39 1994 +0000

    Add calls to track agf counter changes in the transaction
    so we can ensure that they are consistent with the changes
    to the superblock.

So it was debugging code from 1994 that was largely turned into dead
code when lazysbcounters were introduced in 2007. Hence I'm not sure
it holds any value anymore, especially if we are just going to turn
lazycount for everyone...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index aaa19101bb2a..66e7edf86a50 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2472,6 +2472,22 @@  xfs_defer_agfl_block(
 	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
 
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &new->xefi_list);
+
+	/*
+	 * Debugging assertions in the transaction accounting code require that
+	 * updates to an AGF freelist count are cancelled out by an update to
+	 * an AGF free block count.  Prior to deferred AGFL freeing, we removed
+	 * a block from the AGFL and freed it in the same transaction, which
+	 * satisfied this invariant.
+	 *
+	 * Now that we defer the actual freeing to a subsequent transaction in
+	 * the chain, this no longer holds true.  Debug delta counters do not
+	 * roll over, so pretend that we updated an AGF free block count
+	 * somewhere.  The EFI for the AGFL block is logged in the same
+	 * transaction as the AGFL change, which is how we maintain integrity
+	 * even if the system goes down.
+	 */
+	xfs_trans_agblocks_delta(tp, 1);
 }
 
 #ifdef DEBUG