diff mbox series

[2/3] xfs: add a free space extent change reservation

Message ID 20200909081912.1185392-3-david@fromorbit.com (mailing list archive)
State Superseded, archived
Headers show
Series xfs: intent recovery reservation changes | expand

Commit Message

Dave Chinner Sept. 9, 2020, 8:19 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Lots of the transaction reservation code reserves space for an
extent allocation. It is inconsistently implemented, and many of
them get it wrong. Introduce a new function to calculate the log
space reservation for adding or removing an extent from the free
space btrees.

This function reserves space for logging the AGF, the AGFL and the
free space btrees, avoiding the need to account for them seperately
in every reservation that manipulates free space.

Convert the EFI recovery reservation to use this transaction
reservation as EFI recovery only needs to manipulate the free space
index.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

Comments

Brian Foster Sept. 9, 2020, 1:35 p.m. UTC | #1
On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lots of the transaction reservation code reserves space for an
> extent allocation. It is inconsistently implemented, and many of
> them get it wrong. Introduce a new function to calculate the log
> space reservation for adding or removing an extent from the free
> space btrees.
> 
> This function reserves space for logging the AGF, the AGFL and the
> free space btrees, avoiding the need to account for them seperately
> in every reservation that manipulates free space.
> 
> Convert the EFI recovery reservation to use this transaction
> reservation as EFI recovery only needs to manipulate the free space
> index.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index da2ec052ac0a..621ddb277dfa 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
>  	return blocks;
>  }
>  
> +/*
> + * Log reservation required to add or remove a single extent to the free space
> + * btrees.  This requires modifying:
> + *
> + * the agf header: 1 sector
> + * the agfl header: 1 sector
> + * the allocation btrees: 2 trees * (max depth - 1) * block size

Nit, but the xfs_allocfree_log_count() helper this uses clearly
indicates reservation for up to four trees. It might be worth referring
to that here just to minimize spreading details all over the place that
are likely to become stale or inconsistent over time.

> + */
> +uint
> +xfs_allocfree_extent_res(
> +	struct xfs_mount *mp)
> +{
> +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),

Why calculate for a single extent when an EFI can refer to multiple
extents? I thought the max was 2, but the extent free portion of the
itruncate calculation actually uses an op count of 4. The reason for
that is not immediately clear to me. It actually accounts 4 agf/agfl
blocks as well, so perhaps there's a wrong assumption somewhere. FWIW,
the truncate code allows 2 unmaps per transaction and the
xfs_extent_free_defer_type struct limits the dfp to 16. I suspect the
latter is not relevant for the current code.

Either way, multiple extents are factored into the current freeing
reservation and the extent freeing code at runtime (dfops) and during
recovery both appear to iterate on an extent count (potentially > 1) per
transaction. The itruncate comment, for reference (I also just noticed
that the subsequent patch modifies this comment, so you're presumably
aware of this mess):

/*
 * ...
 * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
 *    the agf for each of the ags: 4 * sector size
 *    the agfl for each of the ags: 4 * sector size
 *    the super block to reflect the freed blocks: sector size
 *    worst case split in allocation btrees per extent assuming 4 extents:
 *              4 exts * 2 trees * (2 * max depth - 1) * block size
 * ...
 */

Brian

> +				XFS_FSB_TO_B(mp, 1));
> +}
> +
>  /*
>   * Logging inodes is really tricksy. They are logged in memory format,
>   * which means that what we write into the log doesn't directly translate into
> @@ -922,7 +939,7 @@ xfs_trans_resv_calc(
>  	 * EFI recovery is itruncate minus the initial transaction that logs
>  	 * logs the EFI.
>  	 */
> -	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> +	resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp);
>  	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
>  	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
> -- 
> 2.28.0
>
Dave Chinner Sept. 9, 2020, 10:51 p.m. UTC | #2
On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote:
> On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Lots of the transaction reservation code reserves space for an
> > extent allocation. It is inconsistently implemented, and many of
> > them get it wrong. Introduce a new function to calculate the log
> > space reservation for adding or removing an extent from the free
> > space btrees.
> > 
> > This function reserves space for logging the AGF, the AGFL and the
> > free space btrees, avoiding the need to account for them seperately
> > in every reservation that manipulates free space.
> > 
> > Convert the EFI recovery reservation to use this transaction
> > reservation as EFI recovery only needs to manipulate the free space
> > index.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index da2ec052ac0a..621ddb277dfa 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
> >  	return blocks;
> >  }
> >  
> > +/*
> > + * Log reservation required to add or remove a single extent to the free space
> > + * btrees.  This requires modifying:
> > + *
> > + * the agf header: 1 sector
> > + * the agfl header: 1 sector
> > + * the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> Nit, but the xfs_allocfree_log_count() helper this uses clearly
> indicates reservation for up to four trees. It might be worth referring
> to that here just to minimize spreading details all over the place that
> are likely to become stale or inconsistent over time.

Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem
right for freeing, and I need to check how allocation works again
because stuff gets deferred.

i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink
tree and the rmap trees all in the same transaction. We do a cycle
that looks like this:

log new intent, commit, execute the intent, log the intent done,
log new intent, commit, execute the intent, log the intent done,
log new intent, commit, ....

And so I'm not sure that we are modifying the reflink, rmap and free
space trees all in the same transaction and commit.

> 
> > + */
> > +uint
> > +xfs_allocfree_extent_res(
> > +	struct xfs_mount *mp)
> > +{
> > +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> 
> Why calculate for a single extent when an EFI can refer to multiple
> extents?

This isn't anything to do with an EFI at this point. It's just the
reservation needed to remove a single extent from a single AG in
isolation. I wanted to isolate this reservation completely from the
rest of the transaction reservations and how it ends up being used.

This deadlock surprised me, and so reflection and insight has lead
me to think that we actually need our reservations to reflect the
specific operations that will be performed by the transaction rather
than an aggregation of things that get modified.

Then each reservation can contain the reservations for each
operation it performance (e.g. free an extent) 

That's kinda what this "intent needs it's own reservation"

> I thought the max was 2, but the extent free portion of the
> itruncate calculation actually uses an op count of 4. The reason for
> that is not immediately clear to me.

THe reason is that itruncate used to allow 4 extents to be freed in
a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined
to 4, it is now 2.

However, if you want to relate this to EFIs, I think this
reservation is completely bogus. If all the extents freed a single
BMBT extent free are packed into a single EFI (i.e. all the BMBT
blocks freed and the data extent) then we'll overrun the reservation
extent count of 4. The max extents being freed in a single
BMBT extent free operations is a full bmbt merge + the data extent.
i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that
we're allowed to pack two data extent frees into a single EFI, and
that means it is, worst case, a full BMBT merge + a BMBT merge up to
level below root + 2 data extents, or:

= (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1)
= XFS_BM_MAXLEVELS * 2 extents

And then we try to free all those extents in a single itruncate unit
reservation. We probably don't hit this often because the maximum
BMBT level is bound by filesystem size, and it's rare to have a >=
4-level BMBT tree needed to make this problematic.  We do not have a
reservation large enough to do that as a single transaction, and we
don't want to have a reservation that large, either.

This is not a limitation of the EFI/EFD construct - that can
effectively be unbound. The limitation is that we don't roll and
relog the EFD between each extent in the EFI that we free. The
itruncate reservation has a bound limit for freeing extents, the
EFI reservation does not define that.

So, really, I think our extent free reservations and some of the
intent execution behaviour needs a serious audit and rework, and we
need to decouple the extent allocation and freeing reservations as
extent allocation is now a fundamentally different operation to
freeing an extent. i.e. Allocation does BMBT, refcount, rmap and
freespace tree mods all in one transaction, while freeing does
multiple individual transactional modifications linked by chained
intents.

> Either way, multiple extents are factored into the current freeing
> reservation and the extent freeing code at runtime (dfops) and during
> recovery both appear to iterate on an extent count (potentially > 1) per
> transaction. The itruncate comment, for reference (I also just noticed
> that the subsequent patch modifies this comment, so you're presumably
> aware of this mess):
> 
> /*
>  * ...
>  * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
>  *    the agf for each of the ags: 4 * sector size
>  *    the agfl for each of the ags: 4 * sector size
>  *    the super block to reflect the freed blocks: sector size
>  *    worst case split in allocation btrees per extent assuming 4 extents:
>  *              4 exts * 2 trees * (2 * max depth - 1) * block size
>  * ...
>  */

Oh, yeah, I'm well aware of it - ISTR commenting on it past
discussions about the transaction reservation sizes being much
larger than necessary....

Cheers,

Dave.
Brian Foster Sept. 10, 2020, 1:19 p.m. UTC | #3
On Thu, Sep 10, 2020 at 08:51:27AM +1000, Dave Chinner wrote:
> On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote:
> > On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Lots of the transaction reservation code reserves space for an
> > > extent allocation. It is inconsistently implemented, and many of
> > > them get it wrong. Introduce a new function to calculate the log
> > > space reservation for adding or removing an extent from the free
> > > space btrees.
> > > 
> > > This function reserves space for logging the AGF, the AGFL and the
> > > free space btrees, avoiding the need to account for them seperately
> > > in every reservation that manipulates free space.
> > > 
> > > Convert the EFI recovery reservation to use this transaction
> > > reservation as EFI recovery only needs to manipulate the free space
> > > index.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > index da2ec052ac0a..621ddb277dfa 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
> > >  	return blocks;
> > >  }
> > >  
> > > +/*
> > > + * Log reservation required to add or remove a single extent to the free space
> > > + * btrees.  This requires modifying:
> > > + *
> > > + * the agf header: 1 sector
> > > + * the agfl header: 1 sector
> > > + * the allocation btrees: 2 trees * (max depth - 1) * block size
> > 
> > Nit, but the xfs_allocfree_log_count() helper this uses clearly
> > indicates reservation for up to four trees. It might be worth referring
> > to that here just to minimize spreading details all over the place that
> > are likely to become stale or inconsistent over time.
> 
> Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem
> right for freeing, and I need to check how allocation works again
> because stuff gets deferred.
> 
> i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink
> tree and the rmap trees all in the same transaction. We do a cycle
> that looks like this:
> 
> log new intent, commit, execute the intent, log the intent done,
> log new intent, commit, execute the intent, log the intent done,
> log new intent, commit, ....
> 
> And so I'm not sure that we are modifying the reflink, rmap and free
> space trees all in the same transaction and commit.
> 

Indeed.

> > 
> > > + */
> > > +uint
> > > +xfs_allocfree_extent_res(
> > > +	struct xfs_mount *mp)
> > > +{
> > > +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > 
> > Why calculate for a single extent when an EFI can refer to multiple
> > extents?
> 
> This isn't anything to do with an EFI at this point. It's just the
> reservation needed to remove a single extent from a single AG in
> isolation. I wanted to isolate this reservation completely from the
> rest of the transaction reservations and how it ends up being used.
> 

Sure, but this is used as the reservation size for tr_efi. The current
itruncate reservation is clearly more than is required for EFI recovery,
but my point is that the current itruncate code means recovery must
handle EFIs with at least a couple extents (perhaps more based on your
analysis below). Recovery currently processes efi_nextents all in a
single transaction. The current recovery reservation seems to
accommodate current runtime behavior, but does the new reservation based
on this helper satisfy EFIs with >1 extents? I haven't tested it, but it
seems potentially deficient based on inspection...

> This deadlock surprised me, and so reflection and insight has lead
> me to think that we actually need our reservations to reflect the
> specific operations that will be performed by the transaction rather
> than an aggregation of things that get modified.
> 
> Then each reservation can contain the reservations for each
> operation it performance (e.g. free an extent) 
> 
> That's kinda what this "intent needs it's own reservation"
> 

Ack.

> > I thought the max was 2, but the extent free portion of the
> > itruncate calculation actually uses an op count of 4. The reason for
> > that is not immediately clear to me.
> 
> THe reason is that itruncate used to allow 4 extents to be freed in
> a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined
> to 4, it is now 2.
> 

Ok, well that's a simple enough explanation.. :P

> However, if you want to relate this to EFIs, I think this
> reservation is completely bogus. If all the extents freed a single
> BMBT extent free are packed into a single EFI (i.e. all the BMBT
> blocks freed and the data extent) then we'll overrun the reservation
> extent count of 4. The max extents being freed in a single
> BMBT extent free operations is a full bmbt merge + the data extent.
> i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that
> we're allowed to pack two data extent frees into a single EFI, and
> that means it is, worst case, a full BMBT merge + a BMBT merge up to
> level below root + 2 data extents, or:
> 

All I want to do wrt to this patch is make sure the recovery reservation
is large enough to process the EFIs it might see in the log based on the
current code.

> = (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1)
> = XFS_BM_MAXLEVELS * 2 extents
> 
> And then we try to free all those extents in a single itruncate unit
> reservation. We probably don't hit this often because the maximum
> BMBT level is bound by filesystem size, and it's rare to have a >=
> 4-level BMBT tree needed to make this problematic.  We do not have a
> reservation large enough to do that as a single transaction, and we
> don't want to have a reservation that large, either.
> 

Agreed.

> This is not a limitation of the EFI/EFD construct - that can
> effectively be unbound. The limitation is that we don't roll and
> relog the EFD between each extent in the EFI that we free. The
> itruncate reservation has a bound limit for freeing extents, the
> EFI reservation does not define that.
> 

Sure.

> So, really, I think our extent free reservations and some of the
> intent execution behaviour needs a serious audit and rework, and we
> need to decouple the extent allocation and freeing reservations as
> extent allocation is now a fundamentally different operation to
> freeing an extent. i.e. Allocation does BMBT, refcount, rmap and
> freespace tree mods all in one transaction, while freeing does
> multiple individual transactional modifications linked by chained
> intents.
> 

I don't think an allocation transaction is so overloaded as compared to
freeing. I thought reflink was more of a separate remap operation vs
being tied to block allocation, not to say that a remap might depend on
block allocs or not, but both remaps and refcount updates look like
individual intents (along with rmapbt updates associated with block
mappings/allocations). Regardless, I agree that the current transaction
(and associated reservation) situation is a mess, particularly now with
more use of intents, complex chains thereof, and certain transactions
(tr_itruncate/tr_write) being overloaded to the point where they're huge
or simply too difficult to reason about..

Brian

> > Either way, multiple extents are factored into the current freeing
> > reservation and the extent freeing code at runtime (dfops) and during
> > recovery both appear to iterate on an extent count (potentially > 1) per
> > transaction. The itruncate comment, for reference (I also just noticed
> > that the subsequent patch modifies this comment, so you're presumably
> > aware of this mess):
> > 
> > /*
> >  * ...
> >  * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
> >  *    the agf for each of the ags: 4 * sector size
> >  *    the agfl for each of the ags: 4 * sector size
> >  *    the super block to reflect the freed blocks: sector size
> >  *    worst case split in allocation btrees per extent assuming 4 extents:
> >  *              4 exts * 2 trees * (2 * max depth - 1) * block size
> >  * ...
> >  */
> 
> Oh, yeah, I'm well aware of it - ISTR commenting on it past
> discussions about the transaction reservation sizes being much
> larger than necessary....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index da2ec052ac0a..621ddb277dfa 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -79,6 +79,23 @@  xfs_allocfree_log_count(
 	return blocks;
 }
 
+/*
+ * Log reservation required to add or remove a single extent to the free space
+ * btrees.  This requires modifying:
+ *
+ * the agf header: 1 sector
+ * the agfl header: 1 sector
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
+ */
+uint
+xfs_allocfree_extent_res(
+	struct xfs_mount *mp)
+{
+	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+				XFS_FSB_TO_B(mp, 1));
+}
+
 /*
  * Logging inodes is really tricksy. They are logged in memory format,
  * which means that what we write into the log doesn't directly translate into
@@ -922,7 +939,7 @@  xfs_trans_resv_calc(
 	 * EFI recovery is itruncate minus the initial transaction that logs
 	 * logs the EFI.
 	 */
-	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
+	resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp);
 	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
 	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;