Message ID | 164738661360.3191861.16773208450465120679.stgit@magnolia (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | xfsprogs: stop allowing tiny filesystems | expand |
On 3/15/22 6:23 PM, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Currently, we don't let an internal log consume every last block in an > AG. According to the comment, we're doing this to avoid tripping AGF > verifiers if freeblks==0, but on a modern filesystem this isn't > sufficient to avoid problems. First, the per-AG reservations for > reflink and rmap claim up to about 1.7% of each AG for btree expansion, Hm, will that be a factor if the log consumes every last block in that AG? Or is the problem that if we consume "most" blocks, that leaves the possibility of reflink/rmap btree expansion subsequently failing because we do have a little room for new allocations in that AG? Or is it a problem right out of the gate because the per-ag reservations collide with a maximal log before the filesystem is even in use? > and secondly, we need to have enough space in the AG to allocate the > root inode chunk, if it should be the case that the log ends up in AG 0. > We don't care about nonredundant (i.e. agcount==1) filesystems, but it > can also happen if the user passes in -lagnum=0. > > Change this constraint so that we can't leave less than 5% free space > after allocating the log. This is perhaps a bit much, but as we're > about to disallow tiny filesystems anyway, it seems unlikely to cause > problems with scenarios that we care about. This is only modifying the case where we automatically calculated a log size, and doesn't affect a manually-specified size. Is that intentional? (I guess we already had this discrepancy, whether it was the old "-1" heuristic or the new "95%" heuristic... But 5% is likely to be a fair bit bigger than 1 block, so I'm wondering if the manually-specified case needs to be limited as well. Thanks, -Eric
On 3/16/22 8:50 AM, Eric Sandeen wrote: > On 3/15/22 6:23 PM, Darrick J. Wong wrote: >> From: Darrick J. Wong <djwong@kernel.org> >> >> Currently, we don't let an internal log consume every last block in an >> AG. According to the comment, we're doing this to avoid tripping AGF >> verifiers if freeblks==0, but on a modern filesystem this isn't >> sufficient to avoid problems. First, the per-AG reservations for >> reflink and rmap claim up to about 1.7% of each AG for btree expansion, > > Hm, will that be a factor if the log consumes every last block in that > AG? Or is the problem that if we consume "most" blocks, that leaves the > possibility of reflink/rmap btree expansion subsequently failing because > we do have a little room for new allocations in that AG? > > Or is it a problem right out of the gate because the per-ag reservations > collide with a maximal log before the filesystem is even in use? Darrick, any comment on this? What did you actually run into that prompted this change? Still bugs me a little that a manually-sized log escapes this limit, and if it's needed for proper functioning, we should probably enforce it everywhere. I do understand that the existing code only validates auto-sized logs. But I don't want to sweep this under the rug, even if we choose to not fix it all right now. Mostly looking for clarification on the what fails and how, with the current code. Thanks, -Eric
On Wed, Mar 30, 2022 at 07:21:36PM -1000, Eric Sandeen wrote: > On 3/16/22 8:50 AM, Eric Sandeen wrote: > > On 3/15/22 6:23 PM, Darrick J. Wong wrote: > >> From: Darrick J. Wong <djwong@kernel.org> > >> > >> Currently, we don't let an internal log consume every last block in an > >> AG. According to the comment, we're doing this to avoid tripping AGF > >> verifiers if freeblks==0, but on a modern filesystem this isn't > >> sufficient to avoid problems. First, the per-AG reservations for > >> reflink and rmap claim up to about 1.7% of each AG for btree expansion, > > > > Hm, will that be a factor if the log consumes every last block in that > > AG? Or is the problem that if we consume "most" blocks, that leaves the > > possibility of reflink/rmap btree expansion subsequently failing because > > we do have a little room for new allocations in that AG? > > > > Or is it a problem right out of the gate because the per-ag reservations > > collide with a maximal log before the filesystem is even in use? > > Darrick, any comment on this? What did you actually run into that prompted > this change? Oops, sorry, forgot to reply to this. The per-AG reservations shouldn't be a problem on a modern kernel because we assume an internal log never moves or grows, so we only need to account for rmap/refcount btree expansions to cover space used elsewhere in the AG. That said, *one* block is cutting it really close. Also, earlier kernels (4.9 era) weren't smart about that, though 4.9 is dead according to gregkh and any kernel that says rmap/reflink are experimental should not be used. The problem that I saw is that you could trick the default calculations into making the log to consume so much space that either (a) the root directory gets allocated in the next AG, or if you were <cough> formatting a single AG then mkfs just exits with a cryptic message about ENOSPC. It's fairly difficult to trigger this for a plain old block device, but if mkfs thinks the disk has a gigantic stripe unit, it'll go around putting the root directory at a much higher block number than is usual. For the most part the "max_ag_blocks - 1" was actually fine, but not so much for the case where the AG size is close to 64MB. So in the end the 5% figure is still rather handwavy wormcanning. > Still bugs me a little that a manually-sized log escapes this limit, and if > it's needed for proper functioning, we should probably enforce it everywhere. <nod> I guess the same limits ought to be applied to explicit logsize to improve the "Well if it hurts don't do that!" experience. ;) > I do understand that the existing code only validates auto-sized logs. But > I don't want to sweep this under the rug, even if we choose to not fix it all > right now. > > Mostly looking for clarification on the what fails and how, with the current > code. <nod> I'll try to add a sample mkfs failure to the commit message. --D > > Thanks, > -Eric
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c index b97bd360..ad776492 100644 --- a/mkfs/xfs_mkfs.c +++ b/mkfs/xfs_mkfs.c @@ -3271,15 +3271,18 @@ clamp_internal_log_size( /* * Make sure the log fits wholly within an AG * - * XXX: If agf->freeblks ends up as 0 because the log uses all - * the free space, it causes the kernel all sorts of problems - * with per-ag reservations. Right now just back it off one - * block, but there's a whole can of worms here that needs to be - * opened to decide what is the valid maximum size of a log in - * an AG. + * XXX: If agf->freeblks ends up as 0 because the log uses all the free + * space, it causes the kernel all sorts of problems with per-AG + * reservations. The reservations are only supposed to take 2% of the + * AG, but there's a further problem that if the log ends up in AG 0, + * we also need space to allocate the root directory inode chunk. + * + * Right now just back it off by 5%, but there's a whole can of worms + * here that needs to be opened to decide what is the valid maximum + * size of a log in an AG. */ cfg->logblocks = min(cfg->logblocks, - libxfs_alloc_ag_max_usable(mp) - 1); + libxfs_alloc_ag_max_usable(mp) * 95 / 100); /* and now clamp the size to the maximum supported size */ cfg->logblocks = min(cfg->logblocks, XFS_MAX_LOG_BLOCKS);