diff mbox

[RFC,4/4] xfs: include an allocfree res for inobt modifications

Message ID 20171128154922.GE45759@bfoster.bfoster (mailing list archive)
State Accepted
Headers show

Commit Message

Brian Foster Nov. 28, 2017, 3:49 p.m. UTC
On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > Analysis of recent reports of log reservation overruns and code
> > inspection has uncovered that the reservations associated with inode
> > operations may not cover the worst case scenarios. In particular,
> > many cases only include one allocfree res. for a particular
> > operation even though said operations may also entail AGFL fixups
> > and inode btree block allocations in addition to the actual inode
> > chunk allocation. This can easily turn into two or three block
> > allocations (or frees) per operation.
> > 
> > In theory, the only way to define the worst case reservation is to
> > include an allocfree res for each individual allocation in a
> > transaction. Since that is impractical (we can perform multiple agfl
> > fixups per tx and not every allocation is going to result in a full
> > tree operation), implement a reasonable compromise that addresses
> > the deficiency in practice without blowing out the size of the
> > transactions.
> > 
> > Refactor the inode transaction reservation code to include one
> > allocfree res. per inode btree modification to cover allocations
> > required by the tree itself. This essentially separates the
> > reservation required to allocate the physical inode chunk from
> > additional reservation required to perform inobt record
> > insertion/removal.
> 
> I think you missed the most important reason the inobt/finobt
> modifications need there own allocfree reservation - btree
> modifications that cause btree blocks to be freed do not use defered
> ops and so the freeing of blocks occurs directly within the primary
> transaction. Hence the primary transaction reservation must take
> this into account ....
> 
> > Apply the same logic to the finobt reservation.
> > This results in killing off the finobt modify condition because we
> > no longer assume that the broader transaction reservation will cover
> > finobt block allocations.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Code looks fine. Comments below are for another follow-on patch.
> 

Actually, what do you think of the following variant (compile tested
only)? This facilites another cleanup patch I just wrote to kill off
about half of the (now effectively duplicate) xfs_calc_create_*()
helpers because the finobt and inode chunk res. helpers only include the
associated reservation based on the associated feature bits.

Brian

--- 8< ---

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner Nov. 28, 2017, 10:34 p.m. UTC | #1
On Tue, Nov 28, 2017 at 10:49:22AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > > Analysis of recent reports of log reservation overruns and code
> > > inspection has uncovered that the reservations associated with inode
> > > operations may not cover the worst case scenarios. In particular,
> > > many cases only include one allocfree res. for a particular
> > > operation even though said operations may also entail AGFL fixups
> > > and inode btree block allocations in addition to the actual inode
> > > chunk allocation. This can easily turn into two or three block
> > > allocations (or frees) per operation.
> > > 
> > > In theory, the only way to define the worst case reservation is to
> > > include an allocfree res for each individual allocation in a
> > > transaction. Since that is impractical (we can perform multiple agfl
> > > fixups per tx and not every allocation is going to result in a full
> > > tree operation), implement a reasonable compromise that addresses
> > > the deficiency in practice without blowing out the size of the
> > > transactions.
> > > 
> > > Refactor the inode transaction reservation code to include one
> > > allocfree res. per inode btree modification to cover allocations
> > > required by the tree itself. This essentially separates the
> > > reservation required to allocate the physical inode chunk from
> > > additional reservation required to perform inobt record
> > > insertion/removal.
> > 
> > I think you missed the most important reason the inobt/finobt
> > modifications need there own allocfree reservation - btree
> > modifications that cause btree blocks to be freed do not use defered
> > ops and so the freeing of blocks occurs directly within the primary
> > transaction. Hence the primary transaction reservation must take
> > this into account ....
> > 
> > > Apply the same logic to the finobt reservation.
> > > This results in killing off the finobt modify condition because we
> > > no longer assume that the broader transaction reservation will cover
> > > finobt block allocations.
> > > 
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > 
> > Code looks fine. Comments below are for another follow-on patch.
> > 
> 
> Actually, what do you think of the following variant (compile tested
> only)? This facilites another cleanup patch I just wrote to kill off
> about half of the (now effectively duplicate) xfs_calc_create_*()
> helpers because the finobt and inode chunk res. helpers only include the
> associated reservation based on the associated feature bits.

Yup, makes a lot of sense to do that.

FWIW, because we've got so many alloc vs free type reservations,
would it make more sense to do something like:

#define _ALLOC	true
#define _FREE	false

And use those in the callers rather than true/false?

i.e. this code remains the same in that it takes an "bool alloc"
flag as a parameter:

> +STATIC uint
> +xfs_calc_inode_chunk_res(
> +	struct xfs_mount	*mp,
> +	bool			alloc)
> +{
> +	uint			res, size = 0;
> +
> +	res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> +			       XFS_FSB_TO_B(mp, 1));
> +	if (alloc) {
> +		/* icreate tx uses ordered buffers */
> +		if (xfs_sb_version_hascrc(&mp->m_sb))
> +			return res;
> +		size = XFS_FSB_TO_B(mp, 1);
> +	}
> +
> +	res += xfs_calc_buf_res(mp->m_ialloc_blks, size);
> +	return res;
> +}

but makes the callers  look like:

>  STATIC uint
> @@ -396,9 +430,7 @@ xfs_calc_create_resv_alloc(
>  {
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
> -		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_inode_chunk_res(mp, true) +

		xfs_calc_inode_chunk_res(mp, _ALLOC) +
		.....

and

> @@ -511,7 +542,7 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>  		xfs_calc_iunlink_remove_reservation(mp) +
>  		xfs_calc_buf_res(1, 0) +
> -		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> +		xfs_calc_inode_chunk_res(mp, false) +

		xfs_calc_inode_chunk_res(mp, _FREE) +
		.....

That would make it very clear what type of reservation we are
acutally expecting to take out in the calculation. i.e. the code
is now clearly self documenting :)

Cheers,

Dave.
Brian Foster Nov. 29, 2017, 2:32 p.m. UTC | #2
On Wed, Nov 29, 2017 at 09:34:45AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 10:49:22AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
...
> > > Code looks fine. Comments below are for another follow-on patch.
> > > 
> > 
> > Actually, what do you think of the following variant (compile tested
> > only)? This facilites another cleanup patch I just wrote to kill off
> > about half of the (now effectively duplicate) xfs_calc_create_*()
> > helpers because the finobt and inode chunk res. helpers only include the
> > associated reservation based on the associated feature bits.
> 
> Yup, makes a lot of sense to do that.
> 
> FWIW, because we've got so many alloc vs free type reservations,
> would it make more sense to do something like:
> 
> #define _ALLOC	true
> #define _FREE	false
> 
> And use those in the callers rather than true/false?
> 
...
> > @@ -511,7 +542,7 @@ xfs_calc_ifree_reservation(
> >  		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> >  		xfs_calc_iunlink_remove_reservation(mp) +
> >  		xfs_calc_buf_res(1, 0) +
> > -		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > +		xfs_calc_inode_chunk_res(mp, false) +
> 
> 		xfs_calc_inode_chunk_res(mp, _FREE) +
> 		.....
> 
> That would make it very clear what type of reservation we are
> acutally expecting to take out in the calculation. i.e. the code
> is now clearly self documenting :)
> 

Sure, that looks like a nice enhancement. I'll factor that in..

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index c27c57e65e15..045781f2cf00 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -172,6 +172,41 @@  xfs_calc_finobt_res(
 }
 
 /*
+ * Calculate the reservation required to allocate or free an inode chunk. This
+ * includes:
+ *
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
+ * the inode chunk: m_ialloc_blks * N
+ *
+ * The size N of the inode chunk reservation depends on whether it is for
+ * allocation or free and which type of create transaction is in use. An inode
+ * chunk free always invalidates the buffers and only requires reservation for
+ * headers (N == 0). An inode chunk allocation requires a chunk sized
+ * reservation on v4 and older superblocks to initialize the chunk. No chunk
+ * reservation is required for allocation on v5 supers, which use ordered
+ * buffers to initialize.
+ */
+STATIC uint
+xfs_calc_inode_chunk_res(
+	struct xfs_mount	*mp,
+	bool			alloc)
+{
+	uint			res, size = 0;
+
+	res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+			       XFS_FSB_TO_B(mp, 1));
+	if (alloc) {
+		/* icreate tx uses ordered buffers */
+		if (xfs_sb_version_hascrc(&mp->m_sb))
+			return res;
+		size = XFS_FSB_TO_B(mp, 1);
+	}
+
+	res += xfs_calc_buf_res(mp->m_ialloc_blks, size);
+	return res;
+}
+
+/*
  * Various log reservation values.
  *
  * These are based on the size of the file system block because that is what
@@ -386,8 +421,7 @@  xfs_calc_create_resv_modify(
  * For create we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode chunk (allocation/init)
  *    the inode btree (record insertion)
  */
 STATIC uint
@@ -396,9 +430,7 @@  xfs_calc_create_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inode_chunk_res(mp, true) +
 		xfs_calc_inobt_res(mp);
 }
 
@@ -415,7 +447,7 @@  __xfs_calc_create_reservation(
  * For icreate we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode chunk (allocation, no init)
  *    the inobt (record insertion)
  *    the finobt (record insertion)
  */
@@ -425,8 +457,7 @@  xfs_calc_icreate_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inode_chunk_res(mp, true) +
 		xfs_calc_inobt_res(mp) +
 		xfs_calc_finobt_res(mp);
 }
@@ -492,15 +523,15 @@  xfs_calc_symlink_reservation(
  *    the inode being freed: inode size
  *    the super block free inode counter, AGF and AGFL: sector size
  *    the on disk inode (agi unlinked list removal)
- *    the inode chunk is marked stale (headers only)
+ *    the inode chunk (invalidated, headers only)
  *    the inode btree
  *    the finobt (record insertion, removal or modification)
  *
- * Note that the allocfree res. for the inode chunk itself is not included
- * because the extent free occurs after a transaction roll. We could take the
- * maximum of the pre/post roll operations, but the pre-roll reservation already
- * includes at least one allocfree res. for the inobt and is thus guaranteed to
- * be larger.
+ * Note that the inode chunk res. includes an allocfree res. for freeing of the
+ * inode chunk. This is technically extraneous because the inode chunk free is
+ * deferred (it occurs after a transaction roll). Include the extra reservation
+ * anyways since we've had reports of ifree transaction overruns due to too many
+ * agfl fixups during inode chunk frees.
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -511,7 +542,7 @@  xfs_calc_ifree_reservation(
 		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 		xfs_calc_iunlink_remove_reservation(mp) +
 		xfs_calc_buf_res(1, 0) +
-		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
+		xfs_calc_inode_chunk_res(mp, false) +
 		xfs_calc_inobt_res(mp) +
 		xfs_calc_finobt_res(mp);
 }