diff mbox

[RFC,4/4] xfs: include an allocfree res for inobt modifications

Message ID 20171127202434.43125-5-bfoster@redhat.com (mailing list archive)
State Superseded
Headers show

Commit Message

Brian Foster Nov. 27, 2017, 8:24 p.m. UTC
Analysis of recent reports of log reservation overruns and code
inspection has uncovered that the reservations associated with inode
operations may not cover the worst case scenarios. In particular,
many cases only include one allocfree res. for a particular
operation even though said operations may also entail AGFL fixups
and inode btree block allocations in addition to the actual inode
chunk allocation. This can easily turn into two or three block
allocations (or frees) per operation.

In theory, the only way to define the worst case reservation is to
include an allocfree res for each individual allocation in a
transaction. Since that is impractical (we can perform multiple agfl
fixups per tx and not every allocation is going to result in a full
tree operation), implement a reasonable compromise that addresses
the deficiency in practice without blowing out the size of the
transactions.

Refactor the inode transaction reservation code to include one
allocfree res. per inode btree modification to cover allocations
required by the tree itself. This essentially separates the
reservation required to allocate the physical inode chunk from
additional reservation required to perform inobt record
insertion/removal. Apply the same logic to the finobt reservation.
This results in killing off the finobt modify condition because we
no longer assume that the broader transaction reservation will cover
finobt block allocations.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 84 +++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 41 deletions(-)

Comments

Dave Chinner Nov. 27, 2017, 11:27 p.m. UTC | #1
On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> Analysis of recent reports of log reservation overruns and code
> inspection has uncovered that the reservations associated with inode
> operations may not cover the worst case scenarios. In particular,
> many cases only include one allocfree res. for a particular
> operation even though said operations may also entail AGFL fixups
> and inode btree block allocations in addition to the actual inode
> chunk allocation. This can easily turn into two or three block
> allocations (or frees) per operation.
> 
> In theory, the only way to define the worst case reservation is to
> include an allocfree res for each individual allocation in a
> transaction. Since that is impractical (we can perform multiple agfl
> fixups per tx and not every allocation is going to result in a full
> tree operation), implement a reasonable compromise that addresses
> the deficiency in practice without blowing out the size of the
> transactions.
> 
> Refactor the inode transaction reservation code to include one
> allocfree res. per inode btree modification to cover allocations
> required by the tree itself. This essentially separates the
> reservation required to allocate the physical inode chunk from
> additional reservation required to perform inobt record
> insertion/removal.

I think you missed the most important reason the inobt/finobt
modifications need there own allocfree reservation - btree
modifications that cause btree blocks to be freed do not use defered
ops and so the freeing of blocks occurs directly within the primary
transaction. Hence the primary transaction reservation must take
this into account ....

> Apply the same logic to the finobt reservation.
> This results in killing off the finobt modify condition because we
> no longer assume that the broader transaction reservation will cover
> finobt block allocations.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Code looks fine. Comments below are for another follow-on patch.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> @@ -387,8 +386,8 @@ xfs_calc_create_resv_modify(
>   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
>   *    the superblock for the nlink flag: sector size
>   *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
> - *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inode btree (record insertion)
>   */
>  STATIC uint
>  xfs_calc_create_resv_alloc(
> @@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc(
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1));
> +				 XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_inobt_res(mp);
>  }

Looking at this,  I'm wondering if there should also be a function
for calculating the inode chunk reservation. Something like:

static uint
xfs_calc_inode_chunk_res(
	struct xfs-mount	*mp,
	bool			chunk_contents_logged)
{
	uint	res;

	if (chunk_contents_logged) {
		res = xfs_calc_buf_res(mp->m_ialloc_blks,
					XFS_FSB_TO_B(mp, 1));
	} else {
		/* take into account logged headers for freeing */
		res = xfs_calc_buf_res(mp->m_ialloc_blks, 0);
	}

	res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
				XFS_FSB_TO_B(mp, 1));
	return res;
}

Then xfs_calc_create_resv_alloc() reads like:

	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
		mp->m_sb.sb_sectsize + 
		xfs_calc_inode_chunk_res(mp, true) +
		xfs_calc_inobt_res(mp);


>  
>  STATIC uint
> @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
>   * For icreate we can allocate some inodes giving:
>   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
>   *    the superblock for the nlink flag: sector size
> - *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inobt (record insertion)
>   *    the finobt (record insertion)
>   */
>  STATIC uint
> @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
>  {
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 0);
> +		xfs_calc_inobt_res(mp) +
> +		xfs_calc_finobt_res(mp);
>  }

	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
		mp->m_sb.sb_sectsize +
		xfs_calc_inode_chunk_res(mp, false) +
		xfs_calc_inobt_res(mp) +
		xfs_calc_finobt_res(mp);

>  
>  STATIC uint
> @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
>   *    the agi hash list and counters: sector size
>   *    the on disk inode before ours in the agi hash list: inode cluster size
>   *    the inode chunk is marked stale (headers only)
> - *    the inode btree: max depth * blocksize
> - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inode btree
>   *    the finobt (record insertion, removal or modification)
> + *
> + * Note that the allocfree res. for the inode chunk itself is not included
> + * because the extent free occurs after a transaction roll. We could take the
> + * maximum of the pre/post roll operations, but the pre-roll reservation already
> + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> + * be larger.
>   */
>  STATIC uint
>  xfs_calc_ifree_reservation(
> @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_iunlink_remove_reservation(mp) +
>  		xfs_calc_buf_res(1, 0) +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 1);
> +		xfs_calc_inobt_res(mp) +
> +		xfs_calc_finobt_res(mp);
>  }

	.....
		xfs_calc_iunlink_remove_reservation(mp) +
		xfs_calc_buf_res(1, 0) +
		xfs_calc_inode_chunk_res(mp, false) +
		xfs_calc_inobt_res(mp) +
		xfs_calc_finobt_res(mp);

This seems to make more sense to me - it's clear what the individual
components of the reservation are, and we can ensure that the
individual components have the necessary reservation independently
of the overall reservations that need them.

Cheers,

Dave.
Brian Foster Nov. 28, 2017, 2:04 p.m. UTC | #2
On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > Analysis of recent reports of log reservation overruns and code
> > inspection has uncovered that the reservations associated with inode
> > operations may not cover the worst case scenarios. In particular,
> > many cases only include one allocfree res. for a particular
> > operation even though said operations may also entail AGFL fixups
> > and inode btree block allocations in addition to the actual inode
> > chunk allocation. This can easily turn into two or three block
> > allocations (or frees) per operation.
> > 
> > In theory, the only way to define the worst case reservation is to
> > include an allocfree res for each individual allocation in a
> > transaction. Since that is impractical (we can perform multiple agfl
> > fixups per tx and not every allocation is going to result in a full
> > tree operation), implement a reasonable compromise that addresses
> > the deficiency in practice without blowing out the size of the
> > transactions.
> > 
> > Refactor the inode transaction reservation code to include one
> > allocfree res. per inode btree modification to cover allocations
> > required by the tree itself. This essentially separates the
> > reservation required to allocate the physical inode chunk from
> > additional reservation required to perform inobt record
> > insertion/removal.
> 
> I think you missed the most important reason the inobt/finobt
> modifications need there own allocfree reservation - btree
> modifications that cause btree blocks to be freed do not use defered
> ops and so the freeing of blocks occurs directly within the primary
> transaction. Hence the primary transaction reservation must take
> this into account ....
> 

Sort of implied by "to cover allocations by the tree itself," but I'll
update the commit log to be more explicit.

> > Apply the same logic to the finobt reservation.
> > This results in killing off the finobt modify condition because we
> > no longer assume that the broader transaction reservation will cover
> > finobt block allocations.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Code looks fine. Comments below are for another follow-on patch.
> 

Ok..

> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> > @@ -387,8 +386,8 @@ xfs_calc_create_resv_modify(
> >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> >   *    the superblock for the nlink flag: sector size
> >   *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
> > - *    the inode btree: max depth * blocksize
> >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inode btree (record insertion)
> >   */
> >  STATIC uint
> >  xfs_calc_create_resv_alloc(
> > @@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc(
> >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> >  		mp->m_sb.sb_sectsize +
> >  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > -				 XFS_FSB_TO_B(mp, 1));
> > +				 XFS_FSB_TO_B(mp, 1)) +
> > +		xfs_calc_inobt_res(mp);
> >  }
> 
> Looking at this,  I'm wondering if there should also be a function
> for calculating the inode chunk reservation. Something like:
> 
> static uint
> xfs_calc_inode_chunk_res(
> 	struct xfs-mount	*mp,
> 	bool			chunk_contents_logged)
> {
> 	uint	res;
> 
> 	if (chunk_contents_logged) {
> 		res = xfs_calc_buf_res(mp->m_ialloc_blks,
> 					XFS_FSB_TO_B(mp, 1));
> 	} else {
> 		/* take into account logged headers for freeing */
> 		res = xfs_calc_buf_res(mp->m_ialloc_blks, 0);
> 	}
> 
> 	res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> 				XFS_FSB_TO_B(mp, 1));
> 	return res;
> }
> 
> Then xfs_calc_create_resv_alloc() reads like:
> 
> 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> 		mp->m_sb.sb_sectsize + 
> 		xfs_calc_inode_chunk_res(mp, true) +
> 		xfs_calc_inobt_res(mp);
> 

Looks reasonable.

> 
> >  
> >  STATIC uint
> > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> >   * For icreate we can allocate some inodes giving:
> >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> >   *    the superblock for the nlink flag: sector size
> > - *    the inode btree: max depth * blocksize
> >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inobt (record insertion)
> >   *    the finobt (record insertion)
> >   */
> >  STATIC uint
> > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> >  {
> >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> >  		mp->m_sb.sb_sectsize +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> >  				 XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_finobt_res(mp, 0, 0);
> > +		xfs_calc_inobt_res(mp) +
> > +		xfs_calc_finobt_res(mp);
> >  }
> 
> 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> 		mp->m_sb.sb_sectsize +
> 		xfs_calc_inode_chunk_res(mp, false) +
> 		xfs_calc_inobt_res(mp) +
> 		xfs_calc_finobt_res(mp);
> 

The icreate reservation doesn't currently include m_ialloc_blks at all.
The helper, as defined above, adds a reservation for associated headers.
Is that really necessary?  My understanding is that icreate doesn't log
the inode chunk. I suppose we could get around that by tweaking the
parameter to take the size to reserve instead of a bool and pass a dummy
value (i.e., negative) to avoid logging the chunk at all. A little ugly,
but I could live with it.

> >  
> >  STATIC uint
> > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> >   *    the agi hash list and counters: sector size
> >   *    the on disk inode before ours in the agi hash list: inode cluster size
> >   *    the inode chunk is marked stale (headers only)
> > - *    the inode btree: max depth * blocksize
> > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inode btree
> >   *    the finobt (record insertion, removal or modification)
> > + *
> > + * Note that the allocfree res. for the inode chunk itself is not included
> > + * because the extent free occurs after a transaction roll. We could take the
> > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > + * be larger.
> >   */
> >  STATIC uint
> >  xfs_calc_ifree_reservation(
> > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> >  		xfs_calc_iunlink_remove_reservation(mp) +
> >  		xfs_calc_buf_res(1, 0) +
> >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > -				 XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_finobt_res(mp, 0, 1);
> > +		xfs_calc_inobt_res(mp) +
> > +		xfs_calc_finobt_res(mp);
> >  }
> 
> 	.....
> 		xfs_calc_iunlink_remove_reservation(mp) +
> 		xfs_calc_buf_res(1, 0) +
> 		xfs_calc_inode_chunk_res(mp, false) +
> 		xfs_calc_inobt_res(mp) +
> 		xfs_calc_finobt_res(mp);
> 

This covers the inode chunk invalidation, but also adds the allocfree
res. for the chunk free where we technically don't need it (because the
free is deferred, re: the comment above).

I guess I'm fine with just adding that one, but I'd update the comment
above to point out that it's technically unecessary. Hm?

> This seems to make more sense to me - it's clear what the individual
> components of the reservation are, and we can ensure that the
> individual components have the necessary reservation independently
> of the overall reservations that need them.
> 

I agree in principle. I think the underlying helpers (and pushing down
some of the associated documentation) help clearly declare the intent of
the reservations, particularly when we include multiple factors of a
single reservation and/or have situations where we don't technically
have a definition of worst case, but want to define something logically
reasonable (like the whole allocfree per inode tree thing).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Nov. 28, 2017, 10:26 p.m. UTC | #3
On Tue, Nov 28, 2017 at 09:04:59AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > >  STATIC uint
> > > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> > >   * For icreate we can allocate some inodes giving:
> > >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> > >   *    the superblock for the nlink flag: sector size
> > > - *    the inode btree: max depth * blocksize
> > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > + *    the inobt (record insertion)
> > >   *    the finobt (record insertion)
> > >   */
> > >  STATIC uint
> > > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> > >  {
> > >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > >  		mp->m_sb.sb_sectsize +
> > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > >  				 XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_finobt_res(mp, 0, 0);
> > > +		xfs_calc_inobt_res(mp) +
> > > +		xfs_calc_finobt_res(mp);
> > >  }
> > 
> > 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > 		mp->m_sb.sb_sectsize +
> > 		xfs_calc_inode_chunk_res(mp, false) +
> > 		xfs_calc_inobt_res(mp) +
> > 		xfs_calc_finobt_res(mp);
> > 
> 
> The icreate reservation doesn't currently include m_ialloc_blks at all.
> The helper, as defined above, adds a reservation for associated headers.
> Is that really necessary?  My understanding is that icreate doesn't log
> the inode chunk.

Right, it uses ordered buffers to avoid needing to log them.

> I suppose we could get around that by tweaking the
> parameter to take the size to reserve instead of a bool and pass a dummy
> value (i.e., negative) to avoid logging the chunk at all. A little ugly,
> but I could live with it.

I don't think that having an extra few hundred bytes of overhead in
the reservation is going to be noticable by anyone. I'd just
ignore the problem (as I did when suggesting this).

> > >  STATIC uint
> > > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> > >   *    the agi hash list and counters: sector size
> > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > >   *    the inode chunk is marked stale (headers only)
> > > - *    the inode btree: max depth * blocksize
> > > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > + *    the inode btree
> > >   *    the finobt (record insertion, removal or modification)
> > > + *
> > > + * Note that the allocfree res. for the inode chunk itself is not included
> > > + * because the extent free occurs after a transaction roll. We could take the
> > > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > > + * be larger.
> > >   */
> > >  STATIC uint
> > >  xfs_calc_ifree_reservation(
> > > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > >  		xfs_calc_buf_res(1, 0) +
> > >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > -				 XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_finobt_res(mp, 0, 1);
> > > +		xfs_calc_inobt_res(mp) +
> > > +		xfs_calc_finobt_res(mp);
> > >  }
> > 
> > 	.....
> > 		xfs_calc_iunlink_remove_reservation(mp) +
> > 		xfs_calc_buf_res(1, 0) +
> > 		xfs_calc_inode_chunk_res(mp, false) +
> > 		xfs_calc_inobt_res(mp) +
> > 		xfs_calc_finobt_res(mp);
> > 
> 
> This covers the inode chunk invalidation, but also adds the allocfree
> res. for the chunk free where we technically don't need it (because the
> free is deferred, re: the comment above).
> 
> I guess I'm fine with just adding that one, but I'd update the comment
> above to point out that it's technically unecessary. Hm?

*nod*

Though with sparse inodes, we might be freeing multiple extents,
right?  which means we probably need all the allocfree reservations
we can get....

> > This seems to make more sense to me - it's clear what the individual
> > components of the reservation are, and we can ensure that the
> > individual components have the necessary reservation independently
> > of the overall reservations that need them.
> > 
> 
> I agree in principle. I think the underlying helpers (and pushing down
> some of the associated documentation) help clearly declare the intent of
> the reservations, particularly when we include multiple factors of a
> single reservation and/or have situations where we don't technically
> have a definition of worst case, but want to define something logically
> reasonable (like the whole allocfree per inode tree thing).

Yup, that's pretty much what I was thinking.

Cheers,

Dave.
Brian Foster Nov. 29, 2017, 2:32 p.m. UTC | #4
On Wed, Nov 29, 2017 at 09:26:37AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 09:04:59AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > > >  STATIC uint
> > > > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> > > >   * For icreate we can allocate some inodes giving:
> > > >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> > > >   *    the superblock for the nlink flag: sector size
> > > > - *    the inode btree: max depth * blocksize
> > > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > > + *    the inobt (record insertion)
> > > >   *    the finobt (record insertion)
> > > >   */
> > > >  STATIC uint
> > > > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> > > >  {
> > > >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > >  		mp->m_sb.sb_sectsize +
> > > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > >  				 XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_finobt_res(mp, 0, 0);
> > > > +		xfs_calc_inobt_res(mp) +
> > > > +		xfs_calc_finobt_res(mp);
> > > >  }
> > > 
> > > 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > 		mp->m_sb.sb_sectsize +
> > > 		xfs_calc_inode_chunk_res(mp, false) +
> > > 		xfs_calc_inobt_res(mp) +
> > > 		xfs_calc_finobt_res(mp);
> > > 
> > 
> > The icreate reservation doesn't currently include m_ialloc_blks at all.
> > The helper, as defined above, adds a reservation for associated headers.
> > Is that really necessary?  My understanding is that icreate doesn't log
> > the inode chunk.
> 
> Right, it uses ordered buffers to avoid needing to log them.
> 
> > I suppose we could get around that by tweaking the
> > parameter to take the size to reserve instead of a bool and pass a dummy
> > value (i.e., negative) to avoid logging the chunk at all. A little ugly,
> > but I could live with it.
> 
> I don't think that having an extra few hundred bytes of overhead in
> the reservation is going to be noticable by anyone. I'd just
> ignore the problem (as I did when suggesting this).
> 

I'm not as concerned with the overhead as much as I want to make sure
the intent of the changes is clearly documented. The refactoring
proposed in the other thread factors this out more cleanly than I
originally anticipated, so it's a moot point now.

> > > >  STATIC uint
> > > > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> > > >   *    the agi hash list and counters: sector size
> > > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > > >   *    the inode chunk is marked stale (headers only)
> > > > - *    the inode btree: max depth * blocksize
> > > > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > > + *    the inode btree
> > > >   *    the finobt (record insertion, removal or modification)
> > > > + *
> > > > + * Note that the allocfree res. for the inode chunk itself is not included
> > > > + * because the extent free occurs after a transaction roll. We could take the
> > > > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > > > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > > > + * be larger.
> > > >   */
> > > >  STATIC uint
> > > >  xfs_calc_ifree_reservation(
> > > > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> > > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > > >  		xfs_calc_buf_res(1, 0) +
> > > >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > > -				 XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_finobt_res(mp, 0, 1);
> > > > +		xfs_calc_inobt_res(mp) +
> > > > +		xfs_calc_finobt_res(mp);
> > > >  }
> > > 
> > > 	.....
> > > 		xfs_calc_iunlink_remove_reservation(mp) +
> > > 		xfs_calc_buf_res(1, 0) +
> > > 		xfs_calc_inode_chunk_res(mp, false) +
> > > 		xfs_calc_inobt_res(mp) +
> > > 		xfs_calc_finobt_res(mp);
> > > 
> > 
> > This covers the inode chunk invalidation, but also adds the allocfree
> > res. for the chunk free where we technically don't need it (because the
> > free is deferred, re: the comment above).
> > 
> > I guess I'm fine with just adding that one, but I'd update the comment
> > above to point out that it's technically unecessary. Hm?
> 
> *nod*
> 
> Though with sparse inodes, we might be freeing multiple extents,
> right?  which means we probably need all the allocfree reservations
> we can get....
> 

Yep good point, I suppose that is possible.

Brian

> > > This seems to make more sense to me - it's clear what the individual
> > > components of the reservation are, and we can ensure that the
> > > individual components have the necessary reservation independently
> > > of the overall reservations that need them.
> > > 
> > 
> > I agree in principle. I think the underlying helpers (and pushing down
> > some of the associated documentation) help clearly declare the intent of
> > the reservations, particularly when we include multiple factors of a
> > single reservation and/or have situations where we don't technically
> > have a definition of worst case, but want to define something logically
> > reasonable (like the whole allocfree per inode tree thing).
> 
> Yup, that's pretty much what I was thinking.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 4cd7cd1e60da..79e4aea74065 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -132,44 +132,43 @@  xfs_calc_inode_res(
 }
 
 /*
- * The free inode btree is a conditional feature and the log reservation
- * requirements differ slightly from that of the traditional inode allocation
- * btree. The finobt tracks records for inode chunks with at least one free
- * inode. A record can be removed from the tree for an inode allocation
- * or free and thus the finobt reservation is unconditional across:
+ * Inode btree record insertion/removal modifies the inode btree and free space
+ * btrees (since the inobt does not use the agfl). This requires the following
+ * reservation:
  *
- * 	- inode allocation
- * 	- inode free
- * 	- inode chunk allocation
+ * the inode btree: max depth * blocksize
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
  *
- * The 'modify' param indicates to include the record modification scenario. The
- * 'alloc' param indicates to include the reservation for free space btree
- * modifications on behalf of finobt modifications. This is required only for
- * transactions that do not already account for free space btree modifications.
+ * The caller must account for SB and AG header modifications, etc.
+ */
+STATIC uint
+xfs_calc_inobt_res(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+				 XFS_FSB_TO_B(mp, 1));
+}
+
+/*
+ * The free inode btree is a conditional feature. The behavior differs slightly
+ * from that of the traditional inode btree in that the finobt tracks records
+ * for inode chunks with at least one free inode. A record can be removed from
+ * the tree during individual inode allocation. Therefore the finobt
+ * reservation is unconditional for both the inode chunk allocation and
+ * individual inode allocation (modify) cases.
  *
- * the free inode btree: max depth * block size
- * the allocation btrees: 2 trees * (max depth - 1) * block size
- * the free inode btree entry: block size
+ * Behavior aside, the reservation for finobt modification is equivalent to the
+ * traditional inobt: cover a full finobt shape change plus block allocation.
  */
 STATIC uint
 xfs_calc_finobt_res(
-	struct xfs_mount	*mp,
-	int			alloc,
-	int			modify)
+	struct xfs_mount	*mp)
 {
-	uint res;
-
 	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
 		return 0;
 
-	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
-	if (alloc)
-		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-					XFS_FSB_TO_B(mp, 1));
-	if (modify)
-		res += (uint)XFS_FSB_TO_B(mp, 1);
-
-	return res;
+	return xfs_calc_inobt_res(mp);
 }
 
 /*
@@ -379,7 +378,7 @@  xfs_calc_create_resv_modify(
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
 		(uint)XFS_FSB_TO_B(mp, 1) +
 		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 1, 1);
+		xfs_calc_finobt_res(mp);
 }
 
 /*
@@ -387,8 +386,8 @@  xfs_calc_create_resv_modify(
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
  *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
- *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode btree (record insertion)
  */
 STATIC uint
 xfs_calc_create_resv_alloc(
@@ -397,9 +396,9 @@  xfs_calc_create_resv_alloc(
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inobt_res(mp);
 }
 
 STATIC uint
@@ -415,8 +414,8 @@  __xfs_calc_create_reservation(
  * For icreate we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inobt (record insertion)
  *    the finobt (record insertion)
  */
 STATIC uint
@@ -425,10 +424,10 @@  xfs_calc_icreate_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 0);
+		xfs_calc_inobt_res(mp) +
+		xfs_calc_finobt_res(mp);
 }
 
 STATIC uint
@@ -494,9 +493,14 @@  xfs_calc_symlink_reservation(
  *    the agi hash list and counters: sector size
  *    the on disk inode before ours in the agi hash list: inode cluster size
  *    the inode chunk is marked stale (headers only)
- *    the inode btree: max depth * blocksize
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode btree
  *    the finobt (record insertion, removal or modification)
+ *
+ * Note that the allocfree res. for the inode chunk itself is not included
+ * because the extent free occurs after a transaction roll. We could take the
+ * maximum of the pre/post roll operations, but the pre-roll reservation already
+ * includes at least one allocfree res. for the inobt and is thus guaranteed to
+ * be larger.
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -508,10 +512,8 @@  xfs_calc_ifree_reservation(
 		xfs_calc_iunlink_remove_reservation(mp) +
 		xfs_calc_buf_res(1, 0) +
 		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 1);
+		xfs_calc_inobt_res(mp) +
+		xfs_calc_finobt_res(mp);
 }
 
 /*