diff mbox series

[1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant

Message ID 164779461278.550479.17511700626088235894.stgit@magnolia (mailing list archive)
State Accepted
Headers show
Series xfs: fix incorrect reserve pool calculations and reporting | expand

Commit Message

Darrick J. Wong March 20, 2022, 4:43 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full.  This has lead to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
 fs/xfs/libxfs/xfs_alloc.h |    1 -
 2 files changed, 18 insertions(+), 6 deletions(-)

Comments

Dave Chinner March 23, 2022, 8:39 p.m. UTC | #1
On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Currently, we use this undocumented macro to encode the minimum number
> of blocks needed to replenish a completely empty AGFL when an AG is
> nearly full.  This has lead to confusion on the part of the maintainers,
> so let's document what the value actually means, and move it to
> xfs_alloc.c since it's not used outside of that module.

Code change looks fine, but....

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
>  fs/xfs/libxfs/xfs_alloc.h |    1 -
>  2 files changed, 18 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 353e53b892e6..b0678e96ce61 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
>  }
>  
>  /*
> + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> + * full AG.  We require two blocks per free space btree because free space
> + * btrees shrink to a single block as the AG fills up, and any allocation can
> + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> + * space from xfs_mod_fdblocks, so we do not account for that here.
> + */
> +#define XFS_ALLOCBT_AGFL_RESERVE	4

.... that comment is not correct.  this number had nothing to do
with btree split reservations and is all about preventing
oversubscription of the AG when the free space trees are completely
empty.  By the time there is enough free space records in the AG for
the bno/cnt trees to be at risk of a single level root split
(several hundred free extent records), there is enough free space to
fully allocate the 4 blocks that the AGFL needs for that split.

That is, the set aside was designed to prevent AG selection in
xfs_alloc_vextent() having xfs_alloc_fixup_freelist() modifying an
AG to fix up the AGFL and then fail to fully fill the AGFL because,
say, there were only 2 blocks free in the AG and the AGFL needed 3.
Then we try all other AGs and get ENOSPC from all of them, and we
end up cancelling a dirty transaction and shutting down instead of
propagating ENOSPC up to the user.

This "not enough blocks to populate the AGFL" problem arose because
we can allocate extents directly from the AGFL if the free space
btree is empty, resulting in depletion of the AGFL and no free space
to re-populate it. Freeing a block will then go back into the btree,
and so the next allocation attempt might need 2 blocks for the AGFL,
have one block in the free space tree, and then we fail to fill
the AGFL completely because we still need one block for the AGFL and
there's no space available anymore. If all other AGs are like this
or ENOSPC, then kaboom.

IOWs, I originally added this per-ag set aside so that when the AG
was almost completely empty and we were down to allocating the last
blocks from the AG, users couldn't oversubscribe global free space by
consuming the blocks the AGs required to fill the AGFLs to allow the
last blocks that users could allocate to be allocated safely.

Hence the set aside of 4 blocks per AG was not to ensure the
freespace trees could be split, but to ensure every last possible
block could be allocated from the AG without causing the AG
selection algorithms to select and modify AGs that could not have
their AGFL fully fixed up to allocate the blocks that the caller
needed when near ENOSPC...

Cheers,

Dave.
Darrick J. Wong March 24, 2022, 5:15 a.m. UTC | #2
On Thu, Mar 24, 2022 at 07:39:16AM +1100, Dave Chinner wrote:
> On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Currently, we use this undocumented macro to encode the minimum number
> > of blocks needed to replenish a completely empty AGFL when an AG is
> > nearly full.  This has lead to confusion on the part of the maintainers,
> > so let's document what the value actually means, and move it to
> > xfs_alloc.c since it's not used outside of that module.
> 
> Code change looks fine, but....
> 
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
> >  fs/xfs/libxfs/xfs_alloc.h |    1 -
> >  2 files changed, 18 insertions(+), 6 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 353e53b892e6..b0678e96ce61 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
> >  }
> >  
> >  /*
> > + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> > + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> > + * full AG.  We require two blocks per free space btree because free space
> > + * btrees shrink to a single block as the AG fills up, and any allocation can
> > + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> > + * space from xfs_mod_fdblocks, so we do not account for that here.
> > + */
> > +#define XFS_ALLOCBT_AGFL_RESERVE	4
> 
> .... that comment is not correct.  this number had nothing to do
> with btree split reservations and is all about preventing
> oversubscription of the AG when the free space trees are completely
> empty.  By the time there is enough free space records in the AG for
> the bno/cnt trees to be at risk of a single level root split
> (several hundred free extent records), there is enough free space to
> fully allocate the 4 blocks that the AGFL needs for that split.
> 
> That is, the set aside was designed to prevent AG selection in
> xfs_alloc_vextent() having xfs_alloc_fixup_freelist() modifying an
> AG to fix up the AGFL and then fail to fully fill the AGFL because,
> say, there were only 2 blocks free in the AG and the AGFL needed 3.
> Then we try all other AGs and get ENOSPC from all of them, and we
> end up cancelling a dirty transaction and shutting down instead of
> propagating ENOSPC up to the user.
> 
> This "not enough blocks to populate the AGFL" problem arose because
> we can allocate extents directly from the AGFL if the free space
> btree is empty, resulting in depletion of the AGFL and no free space
> to re-populate it. Freeing a block will then go back into the btree,
> and so the next allocation attempt might need 2 blocks for the AGFL,
> have one block in the free space tree, and then we fail to fill
> the AGFL completely because we still need one block for the AGFL and
> there's no space available anymore. If all other AGs are like this
> or ENOSPC, then kaboom.
> 
> IOWs, I originally added this per-ag set aside so that when the AG
> was almost completely empty and we were down to allocating the last
> blocks from the AG, users couldn't oversubscribe global free space by
> consuming the blocks the AGs required to fill the AGFLs to allow the
> last blocks that users could allocate to be allocated safely.
> 
> Hence the set aside of 4 blocks per AG was not to ensure the
> freespace trees could be split, but to ensure every last possible
> block could be allocated from the AG without causing the AG
> selection algorithms to select and modify AGs that could not have
> their AGFL fully fixed up to allocate the blocks that the caller
> needed when near ENOSPC...

Hmmm, thanks for the context!  What do you think about this revised
comment?

/*
 * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
 * guarantee that we can refill the AGFL prior to allocating space in a
 * nearly full AG.  Although the the space described in the free space
 * btrees, the freesp btrees, and the blocks owned by the agfl are all
 * added to the ondisk fdblocks, it's a mistake to let the ondisk free
 * space in the AG drop so low that the free space btrees cannot refill
 * an empty AGFL up to the minimum level.  Rather than grind through
 * empty AGs until the fs goes down, we subtract this many AG blocks
 * from the incore fdblocks to ENOSPC early.
 */
#define XFS_ALLOCBT_AGFL_RESERVE	4

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 24, 2022, 5:58 a.m. UTC | #3
On Wed, Mar 23, 2022 at 10:15:42PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 24, 2022 at 07:39:16AM +1100, Dave Chinner wrote:
> > On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Currently, we use this undocumented macro to encode the minimum number
> > > of blocks needed to replenish a completely empty AGFL when an AG is
> > > nearly full.  This has lead to confusion on the part of the maintainers,
> > > so let's document what the value actually means, and move it to
> > > xfs_alloc.c since it's not used outside of that module.
> > 
> > Code change looks fine, but....
> > 
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
> > >  fs/xfs/libxfs/xfs_alloc.h |    1 -
> > >  2 files changed, 18 insertions(+), 6 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 353e53b892e6..b0678e96ce61 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
> > >  }
> > >  
> > >  /*
> > > + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> > > + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> > > + * full AG.  We require two blocks per free space btree because free space
> > > + * btrees shrink to a single block as the AG fills up, and any allocation can
> > > + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> > > + * space from xfs_mod_fdblocks, so we do not account for that here.
> > > + */
> > > +#define XFS_ALLOCBT_AGFL_RESERVE	4
> > 
> > .... that comment is not correct.  this number had nothing to do
> > with btree split reservations and is all about preventing
> > oversubscription of the AG when the free space trees are completely
> > empty.  By the time there is enough free space records in the AG for
> > the bno/cnt trees to be at risk of a single level root split
> > (several hundred free extent records), there is enough free space to
> > fully allocate the 4 blocks that the AGFL needs for that split.
> > 
> > That is, the set aside was designed to prevent AG selection in
> > xfs_alloc_vextent() having xfs_alloc_fixup_freelist() modifying an
> > AG to fix up the AGFL and then fail to fully fill the AGFL because,
> > say, there were only 2 blocks free in the AG and the AGFL needed 3.
> > Then we try all other AGs and get ENOSPC from all of them, and we
> > end up cancelling a dirty transaction and shutting down instead of
> > propagating ENOSPC up to the user.
> > 
> > This "not enough blocks to populate the AGFL" problem arose because
> > we can allocate extents directly from the AGFL if the free space
> > btree is empty, resulting in depletion of the AGFL and no free space
> > to re-populate it. Freeing a block will then go back into the btree,
> > and so the next allocation attempt might need 2 blocks for the AGFL,
> > have one block in the free space tree, and then we fail to fill
> > the AGFL completely because we still need one block for the AGFL and
> > there's no space available anymore. If all other AGs are like this
> > or ENOSPC, then kaboom.
> > 
> > IOWs, I originally added this per-ag set aside so that when the AG
> > was almost completely empty and we were down to allocating the last
> > blocks from the AG, users couldn't oversubscribe global free space by
> > consuming the blocks the AGs required to fill the AGFLs to allow the
> > last blocks that users could allocate to be allocated safely.
> > 
> > Hence the set aside of 4 blocks per AG was not to ensure the
> > freespace trees could be split, but to ensure every last possible
> > block could be allocated from the AG without causing the AG
> > selection algorithms to select and modify AGs that could not have
> > their AGFL fully fixed up to allocate the blocks that the caller
> > needed when near ENOSPC...
> 
> Hmmm, thanks for the context!  What do you think about this revised
> comment?
> 
> /*
>  * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
>  * guarantee that we can refill the AGFL prior to allocating space in a
>  * nearly full AG.  Although the the space described in the free space
>  * btrees, the freesp btrees, and the blocks owned by the agfl are all
>  * added to the ondisk fdblocks, it's a mistake to let the ondisk free
>  * space in the AG drop so low that the free space btrees cannot refill
>  * an empty AGFL up to the minimum level.  Rather than grind through
>  * empty AGs until the fs goes down, we subtract this many AG blocks
>  * from the incore fdblocks to ENOSPC early.

s/ENOSPC early./ensure user allocation does not overcommit the space
the filesystem needs for the AGFLs./

And it's all good by me.

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b892e6..b0678e96ce61 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -82,6 +82,19 @@  xfs_prealloc_blocks(
 }
 
 /*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG.  We require two blocks per free space btree because free space
+ * btrees shrink to a single block as the AG fills up, and any allocation can
+ * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
+ * space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+#define XFS_ALLOCBT_AGFL_RESERVE	4
+
+/*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
  * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
  * AGF buffer (PV 947395), we place constraints on the relationship among
  * actual allocations for data blocks, freelist blocks, and potential file data
@@ -93,14 +106,14 @@  xfs_prealloc_blocks(
  * extents need to be actually allocated. To get around this, we explicitly set
  * aside a few blocks which will not be reserved in delayed allocation.
  *
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
 }
 
 /*
@@ -124,12 +137,12 @@  xfs_alloc_ag_max_usable(
 	unsigned int		blocks;
 
 	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
-	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += XFS_ALLOCBT_AGFL_RESERVE;
 	blocks += 3;			/* AGF, AGI btree root blocks */
 	if (xfs_has_finobt(mp))
 		blocks++;		/* finobt root block */
 	if (xfs_has_rmapbt(mp))
-		blocks++; 		/* rmap root block */
+		blocks++;		/* rmap root block */
 	if (xfs_has_reflink(mp))
 		blocks++;		/* refcount root block */
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 1c14a0b1abea..d4c057b764f9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,6 @@  typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-#define XFS_ALLOC_AGFL_RESERVE	4
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);