
xfs: fix internal error from AGFL exhaustion

Message ID 013168d2de9d25c56fe45ad75e9257cf9664f2d6.1698190191.git.osandov@fb.com (mailing list archive)
State Superseded
Series xfs: fix internal error from AGFL exhaustion

Commit Message

Omar Sandoval Oct. 24, 2023, 11:37 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

We've been seeing XFS errors like the following:

XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
...
Call Trace:
 xfs_corruption_error+0x94/0xa0
 xfs_btree_insert+0x221/0x280
 xfs_alloc_fixup_trees+0x104/0x3e0
 xfs_alloc_ag_vextent_size+0x667/0x820
 xfs_alloc_fix_freelist+0x5d9/0x750
 xfs_free_extent_fix_freelist+0x65/0xa0
 __xfs_free_extent+0x57/0x180
...

This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
xfs_btree_insrec() fails.

After converting this into a panic and dissecting the core dump, I found
that xfs_btree_insrec() is failing because it's trying to split a leaf
node in the cntbt when the AG free list is empty. In particular, it's
failing to get a block from the AGFL _while trying to refill the AGFL_.

Our filesystems don't have the rmapbt enabled, so if both the bnobt and
cntbt are 2 levels, the free list is 6 blocks. If a single operation
causes both the bnobt and cntbt to split from 2 levels to 3 levels, this
will allocate 6 new blocks and exhaust the free list. Then, when the
next operation tries to refill the freelist, it allocates space. If the
allocation does not use a full extent, it will need to insert records
for the remaining space in the bnobt and cntbt. And if those new records
go in full leaves, the leaves need to be split. (It's guaranteed that
none of the internal nodes need to be split because they were just
split.)

Fix it by adding an extra block of slack in the freelist for the
potential leaf split in each of the bnobt and cntbt.

P.S. As far as I can tell, this bug has existed for a long time -- maybe
back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
...") in April 1994! It requires a very unlucky sequence of events, and
in fact we didn't hit it until a particular sparse mmap workload updated
from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
exposed by some other change in allocation or writeback patterns.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Hi,

This patch is based on v6.6-rc7. I've also replied with a regression
test for fstests.

Thanks,
Omar

 fs/xfs/libxfs/xfs_alloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Dave Chinner Oct. 25, 2023, 4:35 a.m. UTC | #1
On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We've been seeing XFS errors like the following:
> 
> XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> ...
> Call Trace:
>  xfs_corruption_error+0x94/0xa0
>  xfs_btree_insert+0x221/0x280
>  xfs_alloc_fixup_trees+0x104/0x3e0
>  xfs_alloc_ag_vextent_size+0x667/0x820
>  xfs_alloc_fix_freelist+0x5d9/0x750
>  xfs_free_extent_fix_freelist+0x65/0xa0
>  __xfs_free_extent+0x57/0x180
> ...

Good find, Omar!

For completeness, what's the rest of the trace? We've recently
changed how the extent freeing path works w.r.t. busy extents, so I'm
curious as to what the path into this code is.

Lots of what follows is really notes as I try to pull this apart and
understand it. Please check my working!

> 
> This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
> xfs_btree_insrec() fails.
> 
> After converting this into a panic and dissecting the core dump, I found
> that xfs_btree_insrec() is failing because it's trying to split a leaf
> node in the cntbt when the AG free list is empty. In particular, it's
> failing to get a block from the AGFL _while trying to refill the AGFL_.
> Our filesystems don't have the rmapbt enabled, so if both the bnobt and
> cntbt are 2 levels, the free list is 6 blocks.
>
> If a single operation
> causes both the bnobt and cntbt to split from 2 levels to 3 levels, this
> will allocate 6 new blocks and exhaust the free list.

Which is just fine - the AGFL was sized correctly for that specific
operation to allow it to succeed.

> Then, when the
> next operation tries to refill the freelist, it allocates space.

Ok, so the initial condition for this allocation is a completely
empty AGFL.

> If the
> allocation does not use a full extent, it will need to insert records
> for the remaining space in the bnobt and cntbt. And if those new records
> go in full leaves, the leaves need to be split.

Ok, so we've gone to do a by-size allocation, which will start the
search at:

	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0,
                        args->maxlen + args->alignment - 1, &i)))

an extent that covers maxlen + alignment. The allocation parameters
set by the freelist fixup will be:

	targs.alignment = targs.minlen = targs.prod = 1;
...
	targs.maxlen = need - pag->pagf_flcount;

Ok, so the by-size allocation will select the first extent the same
size or larger than what the AGFL needs to fill the empty slots.

Oversize extents get selected because the allocation doesn't find an
extent of the exact size requested and so pulls a larger extent off
the free space tree. If that extent is not suitable (e.g. it is
busy), it then keeps walking in the direction of larger extents to
carve the free list out of.

IOWs, this search direction guarantees that we'll have to split
the free space extent if there is no free extent the exact size the
AGFL needs.

Ok, that's a possible avenue for fixing the issue - when the agfl is
completely empty, do a LE search so we never have to split a free
space extent and hence completely avoid btree splits....

> (It's guaranteed that
> none of the internal nodes need to be split because they were just
> split.)

Hmmm. That doesn't sound right - btrees don't work that way.

We don't know what path the tree split along in the previous insert
operation  - a split only guarantees the next insert along that same
path won't split if the insert hits one of the two leaf blocks from
the split.  Insert into a different leaf can still split interior
nodes all the way back up the tree to the point where it intersects
with the previous full height split path.

e.g. if the previous full height split was down the left side of the
tree and the next insert is in the right half of the tree, the right
half can split on insert all the way up to the level below the old
root.

> Fix it by adding an extra block of slack in the freelist for the
> potential leaf split in each of the bnobt and cntbt.

Hmmm. yeah - example given is 2->3 level split, hence only need to
handle a single level leaf split before path intersection occurs.
Yup, adding an extra block will make the exact problem being seen go
away.

Ok, what's the general solution? 4-level, 5-level or larger trees?

Worst split after a full split is up to the level below the old root
block. The root block was split, so it won't need a split again, so
only its children could split again. OK, so that's (height - 1)
needed for a max split to refill the AGFL after a full height split
occurred.

Hence it looks like the minimum AGFL space for any given
allocation/free operation needs to be:

	(full height split reservation) + (AGFL refill split height)

which is:

	= (new height) + (new height - 2)
	= 2 * new height - 2
	= 2 * (current height + 1) - 2
	= 2 * current height

Ok, so we essentially have to double the AGFL minimum size to handle
the generic case of AGFL refill splitting free space trees after a
transaction that has exhausted its AGFL reservation.

Now, did I get that right?

> P.S. As far as I can tell, this bug has existed for a long time -- maybe
> back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
> ...") in April 1994! It requires a very unlucky sequence of events, and
> in fact we didn't hit it until a particular sparse mmap workload updated
> from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
> exposed by some other change in allocation or writeback patterns.

No surprise there - these sorts of fundamental reservations and
limits that largely avoid ENOSPC issues don't get touched unless a
problem is found. Who knows how many deep dark corners this specific
change will expose to the light...

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 3069194527dd..2cbcf18fb903 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2275,12 +2275,16 @@ xfs_alloc_min_freelist(
>  
>  	ASSERT(mp->m_alloc_maxlevels > 0);
>  
> -	/* space needed by-bno freespace btree */
> +	/*
> +	 * space needed by-bno freespace btree: one per level that may be split
> +	 * by an insert, plus one more for a leaf split that may be necessary to
> +	 * refill the AGFL
> +	 */
>  	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
> -				       mp->m_alloc_maxlevels);
> -	/* space needed by-size freespace btree */
> +				       mp->m_alloc_maxlevels) + 1;
> +	/* space needed by-size freespace btree, same as above */
>  	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> -				       mp->m_alloc_maxlevels);
> +				       mp->m_alloc_maxlevels) + 1;
>  	/* space needed reverse mapping used space btree */
>  	if (xfs_has_rmapbt(mp))
>  		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,

The rmapbt case will need this change, too, because rmapbt blocks
are allocated from the free list and so an rmapbt update can exhaust
the free list completely, too.

Cheers,

Dave.
Omar Sandoval Oct. 25, 2023, 5:14 a.m. UTC | #2
On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > We've been seeing XFS errors like the following:
> > 
> > XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> > ...
> > Call Trace:
> >  xfs_corruption_error+0x94/0xa0
> >  xfs_btree_insert+0x221/0x280
> >  xfs_alloc_fixup_trees+0x104/0x3e0
> >  xfs_alloc_ag_vextent_size+0x667/0x820
> >  xfs_alloc_fix_freelist+0x5d9/0x750
> >  xfs_free_extent_fix_freelist+0x65/0xa0
> >  __xfs_free_extent+0x57/0x180
> > ...
> 
> Good find, Omar!
> 
> For completeness, what's the rest of the trace? We've recently
> changed how the extent freeing path works w.r.t. busy extents, so I'm
> curious as to what the path into this code is.

This one is from my fpunch reproducer, but there were two call traces we
actually saw in production. One from writeback:

xfs_corruption_error+0x90/0xa0
xfs_btree_insert+0x1b1/0x1d0
xfs_alloc_fixup_trees+0x28e/0x450
xfs_alloc_ag_vextent_size+0x4e0/0x780
xfs_alloc_ag_vextent+0x11c/0x140
xfs_alloc_fix_freelist+0x3f2/0x510
xfs_alloc_vextent+0x254/0x490
xfs_bmap_btalloc+0x301/0x6e0
xfs_bmapi_allocate+0x206/0x2d0
xfs_bmapi_convert_delalloc+0x2b9/0x450
xfs_map_blocks+0x1cc/0x470
iomap_do_writepage+0x22b/0x860
write_cache_pages+0x1ae/0x4c0
iomap_writepages+0x1c/0x40
xfs_vm_writepages+0x6b/0xa0
do_writepages+0x6f/0x1d0
__writeback_single_inode+0x39/0x2b0
writeback_sb_inodes+0x277/0x510
__writeback_inodes_wb+0x6e/0xe0
wb_writeback+0xda/0x250
wb_workfn+0x1f4/0x430
process_one_work+0x12c/0x490
worker_thread+0x143/0x3e0
kthread+0xa3/0xd0
ret_from_fork+0x1f/0x30

And one from inodegc:

xfs_corruption_error+0x90/0xa0
xfs_btree_insert+0x1b1/0x1d0
xfs_alloc_fixup_trees+0x28e/0x450
xfs_alloc_ag_vextent_size+0x4e0/0x780
xfs_alloc_ag_vextent+0x11c/0x140
xfs_alloc_fix_freelist+0x3f2/0x510
__xfs_free_extent+0xde/0x1f0
xfs_trans_free_extent+0x54/0xd0
xfs_extent_free_finish_item+0x69/0xa0
xfs_defer_finish_noroll+0x163/0x510
xfs_defer_finish+0x10/0x70
xfs_itruncate_extents_flags+0x13a/0x250
xfs_inactive_truncate+0xac/0xe0
xfs_inactive+0x14f/0x160
xfs_inodegc_worker+0x81/0x100
process_one_work+0x12c/0x490
worker_thread+0x143/0x3e0
kthread+0xa3/0xd0
ret_from_fork+0x1f/0x30

These are both from 5.19.

> > (It's guaranteed that
> > none of the internal nodes need to be split because they were just
> > split.)
> 
> Hmmm. That doesn't sound right - btrees don't work that way.
> 
> We don't know what path the tree split along in the previous insert
> operation  - a split only guarantees the next insert along that same
> path won't split if the insert hits one of the two leaf blocks from
> the split.  Insert into a different leaf can still split interior
> nodes all the way back up the tree to the point where it intersects
> with the previous full height split path.
> 
> e.g. if the previous full height split was down the left side of the
> tree and the next insert is in the right half of the tree, the right
> half can split on insert all the way up to the level below the old
> root.

You're right, I was only thinking about the 2->3 level case. My
assumption is only true in that case.

> > Fix it by adding an extra block of slack in the freelist for the
> > potential leaf split in each of the bnobt and cntbt.
> 
> Hmmm. yeah - example given is 2->3 level split, hence only need to
> handle a single level leaf split before path intersection occurs.
> Yup, adding an extra block will make the exact problem being seen go
> away.
> 
> Ok, what's the general solution? 4-level, 5-level or larger trees?
> 
> Worst split after a full split is up to the level below the old root
> block. The root block was split, so it won't need a split again, so
> only its children could split again. OK, so that's (height - 1)
> needed for a max split to refill the AGFL after a full height split
> occurred.
> 
> Hence it looks like the minimum AGFL space for any given
> allocation/free operation needs to be:
> 
> 	(full height split reservation) + (AGFL refill split height)
> 
> which is:
> 
> 	= (new height) + (new height - 2)
> 	= 2 * new height - 2
> 	= 2 * (current height + 1) - 2
> 	= 2 * current height
> 
> Ok, so we essentially have to double the AGFL minimum size to handle
> the generic case of AGFL refill splitting free space trees after a
> transaction that has exhausted its AGFL reservation.
> 
> Now, did I get that right?

That sounds right, but I'll have to think about it more after some sleep
:)

Assuming that is correct, your LE search suggestion sounds kind of nice
rather than such a drastic change to the AGFL size.

> The rmapbt case will need this change, too, because rmapbt blocks
> are allocated from the free list and so an rmapbt update can exhaust
> the free list completely, too.

Ah, I missed where the rmapbt is updated since it was slightly removed
from the xfs_alloc_fixup_trees() code path I was looking at.

Thanks for the quick review!

Omar
Darrick J. Wong Oct. 25, 2023, 3:50 p.m. UTC | #3
On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > We've been seeing XFS errors like the following:
> > > 
> > > XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> > > ...
> > > Call Trace:
> > >  xfs_corruption_error+0x94/0xa0
> > >  xfs_btree_insert+0x221/0x280
> > >  xfs_alloc_fixup_trees+0x104/0x3e0
> > >  xfs_alloc_ag_vextent_size+0x667/0x820
> > >  xfs_alloc_fix_freelist+0x5d9/0x750
> > >  xfs_free_extent_fix_freelist+0x65/0xa0
> > >  __xfs_free_extent+0x57/0x180
> > > ...
> > 
> > Good find, Omar!
> > 
> > For completeness, what's the rest of the trace? We've recently
> > changed how the extent freeing path works w.r.t. busy extents, so I'm
> > curious as to what the path into this code is.
> 
> This one is from my fpunch reproducer, but there were two call traces we
> actually saw in production. One from writeback:
> 
> xfs_corruption_error+0x90/0xa0
> xfs_btree_insert+0x1b1/0x1d0
> xfs_alloc_fixup_trees+0x28e/0x450
> xfs_alloc_ag_vextent_size+0x4e0/0x780
> xfs_alloc_ag_vextent+0x11c/0x140
> xfs_alloc_fix_freelist+0x3f2/0x510
> xfs_alloc_vextent+0x254/0x490
> xfs_bmap_btalloc+0x301/0x6e0
> xfs_bmapi_allocate+0x206/0x2d0
> xfs_bmapi_convert_delalloc+0x2b9/0x450
> xfs_map_blocks+0x1cc/0x470
> iomap_do_writepage+0x22b/0x860
> write_cache_pages+0x1ae/0x4c0
> iomap_writepages+0x1c/0x40
> xfs_vm_writepages+0x6b/0xa0
> do_writepages+0x6f/0x1d0
> __writeback_single_inode+0x39/0x2b0
> writeback_sb_inodes+0x277/0x510
> __writeback_inodes_wb+0x6e/0xe0
> wb_writeback+0xda/0x250
> wb_workfn+0x1f4/0x430
> process_one_work+0x12c/0x490
> worker_thread+0x143/0x3e0
> kthread+0xa3/0xd0
> ret_from_fork+0x1f/0x30
> 
> And one from inodegc:
> 
> xfs_corruption_error+0x90/0xa0
> xfs_btree_insert+0x1b1/0x1d0
> xfs_alloc_fixup_trees+0x28e/0x450
> xfs_alloc_ag_vextent_size+0x4e0/0x780
> xfs_alloc_ag_vextent+0x11c/0x140
> xfs_alloc_fix_freelist+0x3f2/0x510
> __xfs_free_extent+0xde/0x1f0
> xfs_trans_free_extent+0x54/0xd0
> xfs_extent_free_finish_item+0x69/0xa0
> xfs_defer_finish_noroll+0x163/0x510
> xfs_defer_finish+0x10/0x70
> xfs_itruncate_extents_flags+0x13a/0x250
> xfs_inactive_truncate+0xac/0xe0
> xfs_inactive+0x14f/0x160
> xfs_inodegc_worker+0x81/0x100
> process_one_work+0x12c/0x490
> worker_thread+0x143/0x3e0
> kthread+0xa3/0xd0
> ret_from_fork+0x1f/0x30
> 
> These are both from 5.19.

Hey, I've seen that backtrace before!  Only a handful of times,
fortunately.

> > > (It's guaranteed that
> > > none of the internal nodes need to be split because they were just
> > > split.)
> > 
> > Hmmm. That doesn't sound right - btrees don't work that way.
> > 
> > We don't know what path the tree split along in the previous insert
> > operation  - a split only guarantees the next insert along that same
> > path won't split if the insert hits one of the two leaf blocks from
> > the split.  Insert into a different leaf can still split interior
> > nodes all the way back up the tree to the point where it intersects
> > with the previous full height split path.
> > 
> > e.g. if the previous full height split was down the left side of the
> > tree and the next insert is in the right half of the tree, the right
> > half can split on insert all the way up to the level below the old
> > root.
> 
> You're right, I was only thinking about the 2->3 level case. My
> assumption is only true in that case.
> 
> > > Fix it by adding an extra block of slack in the freelist for the
> > > potential leaf split in each of the bnobt and cntbt.

I see how the cntbt can split because changing the length of a freespace
extent can require the record to move to a totally different part of the
cntbt.  The reinsertion causes the split.

How does the bnobt split during a refresh of the AGFL?  Will we ever
allocate a block from the /middle/ of a free space extent?

> > 
> > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > handle a single level leaf split before path intersection occurs.
> > Yup, adding an extra block will make the exact problem being seen go
> > away.
> > 
> > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > 
> > Worst split after a full split is up to the level below the old root
> > block. The root block was split, so it won't need a split again, so
> > only its children could split again. OK, so that's (height - 1)
> > needed for a max split to refill the AGFL after a full height split
> > occurred.
> > 
> > Hence it looks like the minimum AGFL space for any given
> > allocation/free operation needs to be:
> > 
> > 	(full height split reservation) + (AGFL refill split height)
> > 
> > which is:
> > 
> > 	= (new height) + (new height - 2)
> > 	= 2 * new height - 2
> > 	= 2 * (current height + 1) - 2
> > 	= 2 * current height
> > 
> > Ok, so we essentially have to double the AGFL minimum size to handle
> > the generic case of AGFL refill splitting free space trees after a
> > transaction that has exhausted its AGFL reservation.
> > 
> > Now, did I get that right?
> 
> That sounds right, but I'll have to think about it more after some sleep
> :)

I think that's correct, assuming that (2 * current_height) is the
per-btree calculation.

> Assuming that is correct, your LE search suggestion sounds kind of nice
> rather than such a drastic change to the AGFL size.

The absolute maximum height of a free space btree is 7 blocks, according
to xfs_db:

# xfs_db /dev/sda -c 'btheight -w absmax all'
bnobt: 7
cntbt: 7
inobt: 6
finobt: 6
bmapbt: 14
refcountbt: 6
rmapbt: 10

The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
that minimum jumps to 121 blocks.

> > The rmapbt case will need this change, too, because rmapbt blocks
> > are allocated from the free list and so an rmapbt update can exhaust
> > the free list completely, too.
> 
> Ah, I missed where the rmapbt is updated since it was slightly removed
> from the xfs_alloc_fixup_trees() code path I was looking at.

The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure
that there's always enough free space to refill the AGFL.  Is that why
the testcase turns off rmapbt?

--D

> Thanks for the quick review!
> 
> Omar
Omar Sandoval Oct. 25, 2023, 8:03 p.m. UTC | #4
On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > Fix it by adding an extra block of slack in the freelist for the
> > > > potential leaf split in each of the bnobt and cntbt.
> 
> I see how the cntbt can split because changing the length of a freespace
> extent can require the record to move to a totally different part of the
> cntbt.  The reinsertion causes the split.
> 
> How does the bnobt split during a refresh of the AGFL?  Will we ever
> allocate a block from the /middle/ of a free space extent?

That's the case I was worried about for the bnobt. I see two ways that
can happen:

1. alignment, prod, or mod requirements, which the freelist doesn't
   have.
2. Busy extents. I don't know enough about XFS to say whether or not
   this is applicable, but at first glance I don't see why not.

> > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > handle a single level leaf split before path intersection occurs.
> > > Yup, adding an extra block will make the exact problem being seen go
> > > away.
> > > 
> > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > 
> > > Worst split after a full split is up to the level below the old root
> > > block. The root block was split, so it won't need a split again, so
> > > only its children could split again. OK, so that's (height - 1)
> > > needed for a max split to refill the AGFL after a full height split
> > > occurred.
> > > 
> > > Hence it looks like the minimum AGFL space for any given
> > > allocation/free operation needs to be:
> > > 
> > > 	(full height split reservation) + (AGFL refill split height)
> > > 
> > > which is:
> > > 
> > > 	= (new height) + (new height - 2)
> > > 	= 2 * new height - 2
> > > 	= 2 * (current height + 1) - 2
> > > 	= 2 * current height
> > > 
> > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > the generic case of AGFL refill splitting free space trees after a
> > > transaction that has exhausted its AGFL reservation.
> > > 
> > > Now, did I get that right?
> > 
> > That sounds right, but I'll have to think about it more after some sleep
> > :)
> 
> I think that's correct, assuming that (2 * current_height) is the
> per-btree calculation.

Agreed, except when the tree is already the maximum level. In that case,
the worst case is splitting up to but not including the root node twice,
which is 2 * height - 2 (i.e., stopping before Dave substituted new
height with current height + 1). So I think we want:

min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
			       mp->m_alloc_maxlevels) * 2 - 2;

> > Assuming that is correct, your LE search suggestion sounds kind of nice
> > rather than such a drastic change to the AGFL size.

Come to think of it, if there is nothing <= the desired size but there
is something >, we have no choice but to do the split, so that doesn't
work.

> The absolute maximum height of a free space btree is 7 blocks, according
> to xfs_db:
> 
> # xfs_db /dev/sda -c 'btheight -w absmax all'
> bnobt: 7
> cntbt: 7
> inobt: 6
> finobt: 6
> bmapbt: 14
> refcountbt: 6
> rmapbt: 10
> 
> The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> that minimum jumps to 121 blocks.
> 
> > > The rmapbt case will need this change, too, because rmapbt blocks
> > > are allocated from the free list and so an rmapbt update can exhaust
> > > the free list completely, too.
> > 
> > Ah, I missed where the rmapbt is updated since it was slightly removed
> > from the xfs_alloc_fixup_trees() code path I was looking at.
> 
> The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure
> that there's always enough free space to refill the AGFL.  Is that why
> the testcase turns off rmapbt?

Nope, I turned it off in my test case because with the rmapbt enabled,
the freelist is larger. Therefore, for this bug to happen, the bnobt,
cntbt, and rmapbt all need a maximum split at the same time. This is
"easy" to do for just the bnobt and cntbt since they store the same
records, but it's a lot harder to get the rmapbt in on it, too.

Nevertheless, it seems possible, right? XFS_AG_RESV_RMAPBT only seems to
guarantee that there is space in the AG to refill the AGFL, but that
doesn't mean we won't need to split the rmapbt at an inopportune time?
Darrick J. Wong Oct. 25, 2023, 10:39 p.m. UTC | #5
On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > potential leaf split in each of the bnobt and cntbt.
> > 
> > I see how the cntbt can split because changing the length of a freespace
> > extent can require the record to move to a totally different part of the
> > cntbt.  The reinsertion causes the split.
> > 
> > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > allocate a block from the /middle/ of a free space extent?
> 
> That's the case I was worried about for the bnobt. I see two ways that
> can happen:
> 
> 1. alignment, prod, or mod requirements, which the freelist doesn't
>    have.
> 2. Busy extents. I don't know enough about XFS to say whether or not
>    this is applicable, but at first glance I don't see why not.

Yes, I think it is possible -- we allocate an extent to fill the AGFL,
the beginning of that extent is still busy due to slow discard, so
xfs_alloc_compute_aligned gives us a block from the middle of the free
space.  Since AGFL filling has no particular alignment/prod/mod, any
number of blocks are good enough, so we end up taking the blocks from
the middle of the extent.

> > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > handle a single level leaf split before path intersection occurs.
> > > > Yup, adding an extra block will make the exact problem being seen go
> > > > away.
> > > > 
> > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > 
> > > > Worst split after a full split is up to the level below the old root
> > > > block. The root block was split, so it won't need a split again, so
> > > > only its children could split again. OK, so that's (height - 1)
> > > > needed for a max split to refill the AGFL after a full height split
> > > > occurred.
> > > > 
> > > > Hence it looks like the minimum AGFL space for any given
> > > > allocation/free operation needs to be:
> > > > 
> > > > 	(full height split reservation) + (AGFL refill split height)
> > > > 
> > > > which is:
> > > > 
> > > > 	= (new height) + (new height - 2)
> > > > 	= 2 * new height - 2
> > > > 	= 2 * (current height + 1) - 2
> > > > 	= 2 * current height
> > > > 
> > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > the generic case of AGFL refill splitting free space trees after a
> > > > transaction that has exhausted its AGFL reservation.
> > > > 
> > > > Now, did I get that right?
> > > 
> > > That sounds right, but I'll have to think about it more after some sleep
> > > :)
> > 
> > I think that's correct, assuming that (2 * current_height) is the
> > per-btree calculation.
> 
> Agreed, except when the tree is already the maximum level. In that case,
> the worst case is splitting up to but not including the root node twice,
> which is 2 * height - 2 (i.e., stopping before Dave substituted new
> height with current height + 1). So I think we want:
> 
> min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> 			       mp->m_alloc_maxlevels) * 2 - 2;
> 
> > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > rather than such a drastic change to the AGFL size.
> 
> Come to think of it, if there is nothing <= the desired size but there
> is something >, we have no choice but to do the split, so that doesn't
> work.

Yeah, I was kind of wondering that myself.

> > The absolute maximum height of a free space btree is 7 blocks, according
> > to xfs_db:
> > 
> > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > bnobt: 7
> > cntbt: 7
> > inobt: 6
> > finobt: 6
> > bmapbt: 14
> > refcountbt: 6
> > rmapbt: 10
> > 
> > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > that minimum jumps to 121 blocks.
> > 
> > > > The rmapbt case will need this change, too, because rmapbt blocks
> > > > are allocated from the free list and so an rmapbt update can exhaust
> > > > the free list completely, too.
> > > 
> > > Ah, I missed where the rmapbt is updated since it was slightly removed
> > > from the xfs_alloc_fixup_trees() code path I was looking at.
> > 
> > The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure
> > that there's always enough free space to refill the AGFL.  Is that why
> > the testcase turns off rmapbt?
> 
> Nope, I turned it off in my test case because with the rmapbt enabled,
> the freelist is larger. Therefore, for this bug to happen, the bnobt,
> cntbt, and rmapbt all need a maximum split at the same time. This is
> "easy" to do for just the bnobt and cntbt since they store the same
> records, but it's a lot harder to get the rmapbt in on it, too.

<nod>

> Nevertheless, it seems possible, right? XFS_AG_RESV_RMAPBT only seems to
> guarantee that there is space in the AG to refill the AGFL, but that
> doesn't mean we won't need to split the rmapbt at an inopportune time?

Ooh, good point.  Just because we have plenty of free space doesn't mean
the AGFL actually has one free block to handle an rmapbt split resulting
from filling the AGFL.

--D
Dave Chinner Oct. 25, 2023, 11:37 p.m. UTC | #6
On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > potential leaf split in each of the bnobt and cntbt.
> > > 
> > > I see how the cntbt can split because changing the length of a freespace
> > > extent can require the record to move to a totally different part of the
> > > cntbt.  The reinsertion causes the split.
> > > 
> > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > allocate a block from the /middle/ of a free space extent?
> > 
> > That's the case I was worried about for the bnobt. I see two ways that
> > can happen:
> > 
> > 1. alignment, prod, or mod requirements, which the freelist doesn't
> >    have.
> > 2. Busy extents. I don't know enough about XFS to say whether or not
> >    this is applicable, but at first glance I don't see why not.
> 
> Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> the beginning of that extent is still busy due to slow discard, so
> xfs_alloc_compute_aligned gives us a block from the middle of the free
> space.  Since AGFL filling has no particular alignment/prod/mod, any
> number of blocks are good enough, so we end up taking the blocks from
> the middle of the extent.

*nod*

That's exactly the conclusion I came to when I wondered how it was
possible...

> > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > handle a single level leaf split before path intersection occurs.
> > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > away.
> > > > > 
> > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > 
> > > > > Worst split after a full split is up to the level below the old root
> > > > > block. The root block was split, so it won't need a split again, so
> > > > > only its children could split again. OK, so that's (height - 1)
> > > > > needed for a max split to refill the AGFL after a full height split
> > > > > occurred.
> > > > > 
> > > > > Hence it looks like the minimum AGFL space for any given
> > > > > allocation/free operation needs to be:
> > > > > 
> > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > 
> > > > > which is:
> > > > > 
> > > > > 	= (new height) + (new height - 2)
> > > > > 	= 2 * new height - 2
> > > > > 	= 2 * (current height + 1) - 2
> > > > > 	= 2 * current height
> > > > > 
> > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > transaction that has exhausted its AGFL reservation.
> > > > > 
> > > > > Now, did I get that right?
> > > > 
> > > > That sounds right, but I'll have to think about it more after some sleep
> > > > :)
> > > 
> > > I think that's correct, assuming that (2 * current_height) is the
> > > per-btree calculation.
> > 
> > Agreed, except when the tree is already the maximum level. In that case,
> > the worst case is splitting up to but not including the root node twice,
> > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > height with current height + 1). So I think we want:
> > 
> > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > 			       mp->m_alloc_maxlevels) * 2 - 2;
> > 
> > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > rather than such a drastic change to the AGFL size.
> > 
> > Come to think of it, if there is nothing <= the desired size but there
> > is something >, we have no choice but to do the split, so that doesn't
> > work.
> 
> Yeah, I was kind of wondering that myself.

Ok, I'm glad to see there are enough eyes on this to point out the
things I missed when musing about it as a possible solution...

> > > The absolute maximum height of a free space btree is 7 blocks, according
> > > to xfs_db:
> > > 
> > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > bnobt: 7
> > > cntbt: 7
> > > inobt: 6
> > > finobt: 6
> > > bmapbt: 14
> > > refcountbt: 6
> > > rmapbt: 10
> > > 
> > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > that minimum jumps to 121 blocks.

Yup. keep in mind this comment in xfs_alloc_fix_freelist():

	 * XXX (dgc): When we have lots of free space, does this buy us
	 * anything other than extra overhead when we need to put more blocks
	 * back on the free list? Maybe we should only do this when space is
	 * getting low or the AGFL is more than half full?

IOWs, I've previously mused about keeping the AGFL much fuller than
the absolute minimum. There doesn't seem to be any real gain to
keeping it at just the minimum size when we are nowhere near ENOSPC,
it just means we are doing lots of management operations like
reducing it immediately after a transaction that has released blocks
to the freelist, only to have to extend it again when we do an
operation that consumes blocks from the free list.

i.e. should we also add hysteresis to limit the number of AGFL
size manipulations we need to do in mixed workloads? i.e. we don't free
blocks from the AGFL until it might get overfilled by an operation
(i.e. max size - max merge free blocks) or we are nearing ENOSPC,
and we don't fill till we are at the minimum. In either case, the
target for empty or fill operations is "half full"...

Cheers,

Dave.
Darrick J. Wong Oct. 26, 2023, 3:21 a.m. UTC | #7
On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > > potential leaf split in each of the bnobt and cntbt.
> > > > 
> > > > I see how the cntbt can split because changing the length of a freespace
> > > > extent can require the record to move to a totally different part of the
> > > > cntbt.  The reinsertion causes the split.
> > > > 
> > > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > > allocate a block from the /middle/ of a free space extent?
> > > 
> > > That's the case I was worried about for the bnobt. I see two ways that
> > > can happen:
> > > 
> > > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > >    have.
> > > 2. Busy extents. I don't know enough about XFS to say whether or not
> > >    this is applicable, but at first glance I don't see why not.
> > 
> > Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> > the beginning of that extent is still busy due to slow discard, so
> > xfs_alloc_compute_aligned gives us a block from the middle of the free
> > space.  Since AGFL filling has no particular alignment/prod/mod, any
> > number of blocks are good enough, so we end up taking the blocks from
> > the middle of the extent.
> 
> *nod*
> 
> That's exactly the conclusion I came to when I wondered how it was
> possible...
> 
> > > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > > handle a single level leaf split before path intersection occurs.
> > > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > > away.
> > > > > > 
> > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > > 
> > > > > > Worst split after a full split is up to the level below the old root
> > > > > > block. The root block was split, so it won't need a split again, so
> > > > > > only its children could split again. OK, so that's (height - 1)
> > > > > > needed for a max split to refill the AGFL after a full height split
> > > > > > occurred.
> > > > > > 
> > > > > > Hence it looks like the minimum AGFL space for any given
> > > > > > allocation/free operation needs to be:
> > > > > > 
> > > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > > 
> > > > > > which is:
> > > > > > 
> > > > > > 	= (new height) + (new height - 2)
> > > > > > 	= 2 * new height - 2
> > > > > > 	= 2 * (current height + 1) - 2
> > > > > > 	= 2 * current height
> > > > > > 
> > > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > > transaction that has exhausted its AGFL reservation.
> > > > > > 
> > > > > > Now, did I get that right?
> > > > > 
> > > > > That sounds right, but I'll have to think about it more after some sleep
> > > > > :)
> > > > 
> > > > I think that's correct, assuming that (2 * current_height) is the
> > > > per-btree calculation.
> > > 
> > > Agreed, except when the tree is already the maximum level. In that case,
> > > the worst case is splitting up to but not including the root node twice,
> > > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > > height with current height + 1). So I think we want:
> > > 
> > > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > > 			       mp->m_alloc_maxlevels) * 2 - 2;
> > > 
> > > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > > rather than such a drastic change to the AGFL size.
> > > 
> > > Come to think of it, if there is nothing <= the desired size but there
> > > is something >, we have no choice but to do the split, so that doesn't
> > > work.
> > 
> > Yeah, I was kind of wondering that myself.
> 
> Ok, I'm glad to see there are enough eyes on this to point out the
> things I missed when musing about it as a possible solution...
> 
> > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > to xfs_db:
> > > > 
> > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > bnobt: 7
> > > > cntbt: 7
> > > > inobt: 6
> > > > finobt: 6
> > > > bmapbt: 14
> > > > refcountbt: 6
> > > > rmapbt: 10
> > > > 
> > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > > that minimum jumps to 121 blocks.
> 
> Yup. keep in mind this comment in xfs_alloc_fix_freelist():
> 
> 	 * XXX (dgc): When we have lots of free space, does this buy us
>          * anything other than extra overhead when we need to put more blocks
>          * back on the free list? Maybe we should only do this when space is
>          * getting low or the AGFL is more than half full?
> 
> IOWs, I've previously mused about keeping the AGFL much fuller than
> the absolute minimum. There doesn't seem to be any real gain to
> keeping it at just the minimum size when we are nowhere near ENOSPC,
> it just means we are doing lots of management operations like
> reducing it immediately after a transaction that has released blocks
> to the freelist, only to have to extend it again when we do an
> operation that consumes blocks from the free list.
> 
> i.e. should we also add hysteresis to limit the number of AGFL
> size manipulations we need to do in mixed workloads? i.e. we don't free
> blocks from the AGFL until it might get overfilled by an operation
> (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> and we don't fill till we are at the minimum. In either case, the
> target for empty or fill operations is "half full"...

Hmm.   What if we kept the AGFL completely full?

On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of 7
blocks.  128 blocks == 64K per AG.

On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG, or
~1GB per AG.  Ok, that might be too much.

On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.

A gigabyte is a bit ridiculous, but what if we always did the maximum
bnobt/cntbt/rmapbt height instead of current_height + 1?

For that 4K-fsblock filesystem, that's 24 blocks per AG, or 96K.  Given
that we only support 300M+ filesystems now, that's 75M per AG.  I feel
like we could go crazy and try to keep 640K in the AGFL.

640K ought to be enough for anyone.

(Numbers chosen arbitrarily, of course.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 26, 2023, 8:56 p.m. UTC | #8
On Wed, Oct 25, 2023 at 08:21:02PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> > On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > > to xfs_db:
> > > > > 
> > > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > > bnobt: 7
> > > > > cntbt: 7
> > > > > inobt: 6
> > > > > finobt: 6
> > > > > bmapbt: 14
> > > > > refcountbt: 6
> > > > > rmapbt: 10
> > > > > 
> > > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > > > that minimum jumps to 121 blocks.
> > 
> > Yup. keep in mind this comment in xfs_alloc_fix_freelist():
> > 
> > 	 * XXX (dgc): When we have lots of free space, does this buy us
> >          * anything other than extra overhead when we need to put more blocks
> >          * back on the free list? Maybe we should only do this when space is
> >          * getting low or the AGFL is more than half full?
> > 
> > IOWs, I've previously mused about keeping the AGFL much fuller than
> > the absolute minimum. There doesn't seem to be any real gain to
> > keeping it at just the minimum size when we are nowhere near ENOSPC,
> > it just means we are doing lots of management operations like
> > reducing it immediately after a transaction that has released blocks
> > to the freelist, only to have to extend it again when we do an
> > operation that consumes blocks from the free list.
> > 
> > i.e. should we also add hysteresis to limit the number of AGFL
> > size manipulations we need to do in mixed workloads? i.e. we don't free
> > blocks from the AGFL until it might get overfilled by an operation
> > (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> > and we don't fill till we are at the minimum. In either case, the
> > target for empty or fill operations is "half full"...
> 
> Hmm.   What if we kept the AGFL completely full?

It can't be kept that way - we have to have space to store blocks
that are released by tree merges. i.e. like a tree split requires a
minimum number of blocks in the free list for it to succeed, a tree
merge needs a minimum number of free slots in the AGFL for it to
succeed.

> On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of 7
> blocks.  128 blocks == 64K per AG.
> 
> On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG, or
> ~1GB per AG.  Ok, that might be too much.
> 
> On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.
> 
> A gigabyte is a bit ridiculous, but what if we always did the maximum
> bnobt/cntbt/rmapbt height instead of current_height + 1?

Yes, largely full AGFLs all the time doesn't make a whole lot of
sense, but just a max height reservation still doesn't cover the
case exposed by this fix. It would need to be 2 * max height
to cover the double split case we are talking about here.

Also, if we change from the current height to always use the
maximum, it re-introduces a bunch of ENOSPC issues resulting from
overcommitting user data space. That is, we must ensure that global
free space does not exhaust the free space that we need to refill
the AGFLs to enable an extent free at ENOSPC to proceed. i.e.
the mp->m_alloc_set_aside value (i.e. XFS_ALLOCBT_AGFL_RESERVE)
needs to reflect the fact this would change the minimum AGFL fill at
ENOSPC.

Further, changing to max reservation for the minimum fill fails to
recognise that at ENOSPC, the free space trees are empty and so
will never split, and so a large minimum AGFL fill is completely
unnecessary - at most all we need is to have enough blocks in the
AGFL for the single level root split to occur (i.e. 2 blocks per
free space btree). That's what the current XFS_ALLOCBT_AGFL_RESERVE
value is set to.

So, really, I don't see moving away from the variable minimum
reservation we currently have is a good idea. Adding a bit of
hysteresis to the AGFL fill/drain operations doesn't require
changing the min fill requirements, just picking a sensible target
above the current minimum fill...

Cheers,

Dave.
Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3069194527dd..2cbcf18fb903 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2275,12 +2275,16 @@  xfs_alloc_min_freelist(
 
 	ASSERT(mp->m_alloc_maxlevels > 0);
 
-	/* space needed by-bno freespace btree */
+	/*
+	 * space needed by-bno freespace btree: one per level that may be split
+	 * by an insert, plus one more for a leaf split that may be necessary to
+	 * refill the AGFL
+	 */
 	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
-				       mp->m_alloc_maxlevels);
-	/* space needed by-size freespace btree */
+				       mp->m_alloc_maxlevels) + 1;
+	/* space needed by-size freespace btree, same as above */
 	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
-				       mp->m_alloc_maxlevels);
+				       mp->m_alloc_maxlevels) + 1;
 	/* space needed reverse mapping used space btree */
 	if (xfs_has_rmapbt(mp))
 		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,