diff mbox series

[2/5] mkfs: don't let internal logs consume more than 95% of an AG

Message ID 164738661360.3191861.16773208450465120679.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series xfsprogs: stop allowing tiny filesystems | expand

Commit Message

Darrick J. Wong March 15, 2022, 11:23 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Currently, we don't let an internal log consume every last block in an
AG.  According to the comment, we're doing this to avoid tripping AGF
verifiers if freeblks==0, but on a modern filesystem this isn't
sufficient to avoid problems.  First, the per-AG reservations for
reflink and rmap claim up to about 1.7% of each AG for btree expansion,
and secondly, we need to have enough space in the AG to allocate the
root inode chunk, if it should be the case that the log ends up in AG 0.
We don't care about nonredundant (i.e. agcount==1) filesystems, but it
can also happen if the user passes in -lagnum=0.

Change this constraint so that we can't leave less than 5% free space
after allocating the log.  This is perhaps a bit much, but as we're
about to disallow tiny filesystems anyway, it seems unlikely to cause
problems with scenarios that we care about.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

Comments

Eric Sandeen March 16, 2022, 6:50 p.m. UTC | #1
On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Currently, we don't let an internal log consume every last block in an
> AG.  According to the comment, we're doing this to avoid tripping AGF
> verifiers if freeblks==0, but on a modern filesystem this isn't
> sufficient to avoid problems.  First, the per-AG reservations for
> reflink and rmap claim up to about 1.7% of each AG for btree expansion,

Hm, will that be a factor if the log consumes every last block in that
AG? Or is the problem that if we consume "most" blocks, that leaves the
possibility of reflink/rmap btree expansion subsequently failing because
we do have a little room for new allocations in that AG?

Or is it a problem right out of the gate because the per-ag reservations
collide with a maximal log before the filesystem is even in use?

> and secondly, we need to have enough space in the AG to allocate the
> root inode chunk, if it should be the case that the log ends up in AG 0.
> We don't care about nonredundant (i.e. agcount==1) filesystems, but it
> can also happen if the user passes in -lagnum=0.
> 
> Change this constraint so that we can't leave less than 5% free space
> after allocating the log.  This is perhaps a bit much, but as we're
> about to disallow tiny filesystems anyway, it seems unlikely to cause
> problems with scenarios that we care about.

This is only modifying the case where we automatically calculated a
log size, and doesn't affect a manually-specified size. Is that
intentional? (I guess we already had this discrepancy, whether it was
the old "-1" heuristic or the new "95%" heuristic...

But 5% is likely to be a fair bit bigger than 1 block, so I'm wondering
if the manually-specified case needs to be limited as well.

Thanks,
-Eric
Eric Sandeen March 31, 2022, 5:21 a.m. UTC | #2
On 3/16/22 8:50 AM, Eric Sandeen wrote:
> On 3/15/22 6:23 PM, Darrick J. Wong wrote:
>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> Currently, we don't let an internal log consume every last block in an
>> AG.  According to the comment, we're doing this to avoid tripping AGF
>> verifiers if freeblks==0, but on a modern filesystem this isn't
>> sufficient to avoid problems.  First, the per-AG reservations for
>> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
> 
> Hm, will that be a factor if the log consumes every last block in that
> AG? Or is the problem that if we consume "most" blocks, that leaves the
> possibility of reflink/rmap btree expansion subsequently failing because
> we do have a little room for new allocations in that AG?
> 
> Or is it a problem right out of the gate because the per-ag reservations
> collide with a maximal log before the filesystem is even in use?

Darrick, any comment on this? What did you actually run into that prompted
this change?

Still bugs me a little that a manually-sized log escapes this limit, and if
it's needed for proper functioning, we should probably enforce it everywhere.

I do understand that the existing code only validates auto-sized logs. But
I don't want to sweep this under the rug, even if we choose to not fix it all
right now.

Mostly looking for clarification on the what fails and how, with the current
code.

Thanks,
-Eric
Darrick J. Wong March 31, 2022, 4:20 p.m. UTC | #3
On Wed, Mar 30, 2022 at 07:21:36PM -1000, Eric Sandeen wrote:
> On 3/16/22 8:50 AM, Eric Sandeen wrote:
> > On 3/15/22 6:23 PM, Darrick J. Wong wrote:
> >> From: Darrick J. Wong <djwong@kernel.org>
> >>
> >> Currently, we don't let an internal log consume every last block in an
> >> AG.  According to the comment, we're doing this to avoid tripping AGF
> >> verifiers if freeblks==0, but on a modern filesystem this isn't
> >> sufficient to avoid problems.  First, the per-AG reservations for
> >> reflink and rmap claim up to about 1.7% of each AG for btree expansion,
> > 
> > Hm, will that be a factor if the log consumes every last block in that
> > AG? Or is the problem that if we consume "most" blocks, that leaves the
> > possibility of reflink/rmap btree expansion subsequently failing because
> > we do have a little room for new allocations in that AG?
> > 
> > Or is it a problem right out of the gate because the per-ag reservations
> > collide with a maximal log before the filesystem is even in use?
> 
> Darrick, any comment on this? What did you actually run into that prompted
> this change?

Oops, sorry, forgot to reply to this.

The per-AG reservations shouldn't be a problem on a modern kernel
because we assume an internal log never moves or grows, so we only need
to account for rmap/refcount btree expansions to cover space used
elsewhere in the AG.  That said, *one* block is cutting it really close.

Also, earlier kernels (4.9 era) weren't smart about that, though 4.9 is
dead according to gregkh and any kernel that says rmap/reflink are
experimental should not be used.

The problem that I saw is that you could trick the default calculations
into making the log to consume so much space that either (a) the root
directory gets allocated in the next AG, or if you were <cough>
formatting a single AG then mkfs just exits with a cryptic message about
ENOSPC.

It's fairly difficult to trigger this for a plain old block device, but
if mkfs thinks the disk has a gigantic stripe unit, it'll go around
putting the root directory at a much higher block number than is usual.
For the most part the "max_ag_blocks - 1" was actually fine, but not so
much for the case where the AG size is close to 64MB.

So in the end the 5% figure is still rather handwavy wormcanning.

> Still bugs me a little that a manually-sized log escapes this limit, and if
> it's needed for proper functioning, we should probably enforce it everywhere.

<nod> I guess the same limits ought to be applied to explicit logsize to
improve the "Well if it hurts don't do that!" experience. ;)

> I do understand that the existing code only validates auto-sized logs. But
> I don't want to sweep this under the rug, even if we choose to not fix it all
> right now.
> 
> Mostly looking for clarification on the what fails and how, with the current
> code.

<nod> I'll try to add a sample mkfs failure to the commit message.

--D

> 
> Thanks,
> -Eric
diff mbox series

Patch

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index b97bd360..ad776492 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3271,15 +3271,18 @@  clamp_internal_log_size(
 	/*
 	 * Make sure the log fits wholly within an AG
 	 *
-	 * XXX: If agf->freeblks ends up as 0 because the log uses all
-	 * the free space, it causes the kernel all sorts of problems
-	 * with per-ag reservations. Right now just back it off one
-	 * block, but there's a whole can of worms here that needs to be
-	 * opened to decide what is the valid maximum size of a log in
-	 * an AG.
+	 * XXX: If agf->freeblks ends up as 0 because the log uses all the free
+	 * space, it causes the kernel all sorts of problems with per-AG
+	 * reservations.  The reservations are only supposed to take 2% of the
+	 * AG, but there's a further problem that if the log ends up in AG 0,
+	 * we also need space to allocate the root directory inode chunk.
+	 *
+	 * Right now just back it off by 5%, but there's a whole can of worms
+	 * here that needs to be opened to decide what is the valid maximum
+	 * size of a log in an AG.
 	 */
 	cfg->logblocks = min(cfg->logblocks,
-			     libxfs_alloc_ag_max_usable(mp) - 1);
+			     libxfs_alloc_ag_max_usable(mp) * 95 / 100);
 
 	/* and now clamp the size to the maximum supported size */
 	cfg->logblocks = min(cfg->logblocks, XFS_MAX_LOG_BLOCKS);