diff mbox series

[v2] xfs: fix internal error from AGFL exhaustion

Message ID 68cd85697855f686529829a2825b044913148caf.1698699188.git.osandov@fb.com (mailing list archive)
State Superseded
Headers show
Series [v2] xfs: fix internal error from AGFL exhaustion | expand

Commit Message

Omar Sandoval Oct. 30, 2023, 9 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

We've been seeing XFS errors like the following:

XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
...
Call Trace:
 xfs_corruption_error+0x94/0xa0
 xfs_btree_insert+0x221/0x280
 xfs_alloc_fixup_trees+0x104/0x3e0
 xfs_alloc_ag_vextent_size+0x667/0x820
 xfs_alloc_fix_freelist+0x5d9/0x750
 xfs_free_extent_fix_freelist+0x65/0xa0
 __xfs_free_extent+0x57/0x180
...

This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
xfs_btree_insrec() fails.

After converting this into a panic and dissecting the core dump, I found
that xfs_btree_insrec() is failing because it's trying to split a leaf
node in the cntbt when the AG free list is empty. In particular, it's
failing to get a block from the AGFL _while trying to refill the AGFL_.

If a single operation splits every level of the bnobt and the cntbt (and
the rmapbt if it is enabled) at once, the free list will be empty. Then,
when the next operation tries to refill the free list, it allocates
space. If the allocation does not use a full extent, it will need to
insert records for the remaining space in the bnobt and cntbt. And if
those new records go in full leaves, the leaves (and potentially more
nodes up to the old root) need to be split.

Fix it by accounting for the additional splits that may be required to
refill the free list in the calculation for the minimum free list size.

P.S. As far as I can tell, this bug has existed for a long time -- maybe
back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
...") in April 1994! It requires a very unlucky sequence of events, and
in fact we didn't hit it until a particular sparse mmap workload updated
from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
exposed by some other change in allocation or writeback patterns. It's
also much less likely to be hit with the rmapbt enabled, since that
increases the minimum free list size and is unlikely to split at the
same time as the bnobt and cntbt.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes since v1 [1]:

- Updated calculation to account for internal nodes that may also need
  to be split.
- Updated comments and commit message accordingly.

1: https://lore.kernel.org/linux-xfs/ZTrSiwF7OAq0hJHn@dread.disaster.area/T/

 fs/xfs/libxfs/xfs_alloc.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Comments

Darrick J. Wong Oct. 31, 2023, 9:24 p.m. UTC | #1
On Mon, Oct 30, 2023 at 02:00:02PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We've been seeing XFS errors like the following:
> 
> XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> ...
> Call Trace:
>  xfs_corruption_error+0x94/0xa0
>  xfs_btree_insert+0x221/0x280
>  xfs_alloc_fixup_trees+0x104/0x3e0
>  xfs_alloc_ag_vextent_size+0x667/0x820
>  xfs_alloc_fix_freelist+0x5d9/0x750
>  xfs_free_extent_fix_freelist+0x65/0xa0
>  __xfs_free_extent+0x57/0x180
> ...
> 
> This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
> xfs_btree_insrec() fails.
> 
> After converting this into a panic and dissecting the core dump, I found
> that xfs_btree_insrec() is failing because it's trying to split a leaf
> node in the cntbt when the AG free list is empty. In particular, it's
> failing to get a block from the AGFL _while trying to refill the AGFL_.
> 
> If a single operation splits every level of the bnobt and the cntbt (and
> the rmapbt if it is enabled) at once, the free list will be empty. Then,
> when the next operation tries to refill the free list, it allocates
> space. If the allocation does not use a full extent, it will need to
> insert records for the remaining space in the bnobt and cntbt. And if
> those new records go in full leaves, the leaves (and potentially more
> nodes up to the old root) need to be split.
> 
> Fix it by accounting for the additional splits that may be required to
> refill the free list in the calculation for the minimum free list size.
> 
> P.S. As far as I can tell, this bug has existed for a long time -- maybe
> back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
> ...") in April 1994! It requires a very unlucky sequence of events, and
> in fact we didn't hit it until a particular sparse mmap workload updated
> from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
> exposed by some other change in allocation or writeback patterns. It's
> also much less likely to be hit with the rmapbt enabled, since that
> increases the minimum free list size and is unlikely to split at the
> same time as the bnobt and cntbt.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Changes since v1 [1]:
> 
> - Updated calculation to account for internal nodes that may also need
>   to be split.
> - Updated comments and commit message accordingly.
> 
> 1: https://lore.kernel.org/linux-xfs/ZTrSiwF7OAq0hJHn@dread.disaster.area/T/
> 
>  fs/xfs/libxfs/xfs_alloc.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 3069194527dd..3949c6ad0cce 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2275,16 +2275,35 @@ xfs_alloc_min_freelist(
>  
>  	ASSERT(mp->m_alloc_maxlevels > 0);
>  
> +	/*
> +	 * For a btree shorter than the maximum height, the worst case is that
> +	 * every level gets split and a new level is added, then while inserting
> +	 * another entry to refill the AGFL, every level under the old root gets
> +	 * split again. This is:
> +	 *
> +	 *   (current height + 1) + (current height - 1)

Could you make this comment define this relationship more explicitly?

	 *   (full height split reservation) + (AGFL refill split height)
	 *   (current height + 1) + (current height - 1)

With that added,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +	 * = (new height)         + (new height - 2)
> +	 * = 2 * new height - 2
> +	 *
> +	 * For a btree of maximum height, the worst case is that every level
> +	 * under the root gets split, then while inserting another entry to
> +	 * refill the AGFL, every level under the root gets split again. This is
> +	 * also:
> +	 *
> +	 *   2 * (new_height - 1)
> +	 * = 2 * new height - 2
> +	 */
> +
>  	/* space needed by-bno freespace btree */
>  	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
> -				       mp->m_alloc_maxlevels);
> +				       mp->m_alloc_maxlevels) * 2 - 2;
>  	/* space needed by-size freespace btree */
>  	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> -				       mp->m_alloc_maxlevels);
> +				       mp->m_alloc_maxlevels) * 2 - 2;
>  	/* space needed reverse mapping used space btree */
>  	if (xfs_has_rmapbt(mp))
>  		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,
> -						mp->m_rmap_maxlevels);
> +						mp->m_rmap_maxlevels) * 2 - 2;
>  
>  	return min_free;
>  }
> -- 
> 2.41.0
>
Dave Chinner Oct. 31, 2023, 9:59 p.m. UTC | #2
On Tue, Oct 31, 2023 at 02:24:00PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 30, 2023 at 02:00:02PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We've been seeing XFS errors like the following:
> > 
> > XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> > ...
> > Call Trace:
> >  xfs_corruption_error+0x94/0xa0
> >  xfs_btree_insert+0x221/0x280
> >  xfs_alloc_fixup_trees+0x104/0x3e0
> >  xfs_alloc_ag_vextent_size+0x667/0x820
> >  xfs_alloc_fix_freelist+0x5d9/0x750
> >  xfs_free_extent_fix_freelist+0x65/0xa0
> >  __xfs_free_extent+0x57/0x180
> > ...
> > 
> > This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
> > xfs_btree_insrec() fails.
> > 
> > After converting this into a panic and dissecting the core dump, I found
> > that xfs_btree_insrec() is failing because it's trying to split a leaf
> > node in the cntbt when the AG free list is empty. In particular, it's
> > failing to get a block from the AGFL _while trying to refill the AGFL_.
> > 
> > If a single operation splits every level of the bnobt and the cntbt (and
> > the rmapbt if it is enabled) at once, the free list will be empty. Then,
> > when the next operation tries to refill the free list, it allocates
> > space. If the allocation does not use a full extent, it will need to
> > insert records for the remaining space in the bnobt and cntbt. And if
> > those new records go in full leaves, the leaves (and potentially more
> > nodes up to the old root) need to be split.
> > 
> > Fix it by accounting for the additional splits that may be required to
> > refill the free list in the calculation for the minimum free list size.
> > 
> > P.S. As far as I can tell, this bug has existed for a long time -- maybe
> > back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
> > ...") in April 1994! It requires a very unlucky sequence of events, and
> > in fact we didn't hit it until a particular sparse mmap workload updated
> > from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
> > exposed by some other change in allocation or writeback patterns. It's
> > also much less likely to be hit with the rmapbt enabled, since that
> > increases the minimum free list size and is unlikely to split at the
> > same time as the bnobt and cntbt.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Changes since v1 [1]:
> > 
> > - Updated calculation to account for internal nodes that may also need
> >   to be split.
> > - Updated comments and commit message accordingly.
> > 
> > 1: https://lore.kernel.org/linux-xfs/ZTrSiwF7OAq0hJHn@dread.disaster.area/T/
> > 
> >  fs/xfs/libxfs/xfs_alloc.c | 25 ++++++++++++++++++++++---
> >  1 file changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 3069194527dd..3949c6ad0cce 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2275,16 +2275,35 @@ xfs_alloc_min_freelist(
> >  
> >  	ASSERT(mp->m_alloc_maxlevels > 0);
> >  
> > +	/*
> > +	 * For a btree shorter than the maximum height, the worst case is that
> > +	 * every level gets split and a new level is added, then while inserting
> > +	 * another entry to refill the AGFL, every level under the old root gets
> > +	 * split again. This is:
> > +	 *
> > +	 *   (current height + 1) + (current height - 1)
> 
> Could you make this comment define this relationship more explicitly?
> 
> 	 *   (full height split reservation) + (AGFL refill split height)
> 	 *   (current height + 1) + (current height - 1)
> 
> With that added,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Yup, with that comment change I'm happy with it, too.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3069194527dd..3949c6ad0cce 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2275,16 +2275,35 @@  xfs_alloc_min_freelist(
 
 	ASSERT(mp->m_alloc_maxlevels > 0);
 
+	/*
+	 * For a btree shorter than the maximum height, the worst case is that
+	 * every level gets split and a new level is added, then while inserting
+	 * another entry to refill the AGFL, every level under the old root gets
+	 * split again. This is:
+	 *
+	 *   (current height + 1) + (current height - 1)
+	 * = (new height)         + (new height - 2)
+	 * = 2 * new height - 2
+	 *
+	 * For a btree of maximum height, the worst case is that every level
+	 * under the root gets split, then while inserting another entry to
+	 * refill the AGFL, every level under the root gets split again. This is
+	 * also:
+	 *
+	 *   2 * (new_height - 1)
+	 * = 2 * new height - 2
+	 */
+
 	/* space needed by-bno freespace btree */
 	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
-				       mp->m_alloc_maxlevels);
+				       mp->m_alloc_maxlevels) * 2 - 2;
 	/* space needed by-size freespace btree */
 	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
-				       mp->m_alloc_maxlevels);
+				       mp->m_alloc_maxlevels) * 2 - 2;
 	/* space needed reverse mapping used space btree */
 	if (xfs_has_rmapbt(mp))
 		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,
-						mp->m_rmap_maxlevels);
+						mp->m_rmap_maxlevels) * 2 - 2;
 
 	return min_free;
 }