diff mbox series

[4/4] xfs: fix off-by-one error in xfs_btree_space_to_height

Message ID 167149471987.336919.3277522603824048839.stgit@magnolia (mailing list archive)
State Accepted
Headers show
Series xfs: random fixes for 6.2, part 3 | expand

Commit Message

Darrick J. Wong Dec. 20, 2022, 12:05 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Lately I've been stress-testing extreme-sized rmap btrees by using the
(new) xfs_db bmap_inflate command to clone bmbt mappings billions of
times and then using xfs_repair to build new rmap and refcount btrees.
This of course is /much/ faster than actually FICLONEing a file billions
of times.

Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
sufficiently large for the test scenario.  For a 1TB filesystem (~67
million AG blocks, 4 AGs) the btheight command reports:

$ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
level 0: 4400801200 records, 52390491 blocks
level 1: 52390491 records, 1164234 blocks
level 2: 1164234 records, 25872 blocks
level 3: 25872 records, 575 blocks
level 4: 575 records, 13 blocks
level 5: 13 records, 1 block
6 levels, 53581186 blocks total

The AG is sufficiently large to build this rmap btree.  Unfortunately,
m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
function to report height, node blocks, and blocks remaining produces
this:

ht 1 node_blocks 45 blockleft 67108863
ht 2 node_blocks 2025 blockleft 67108818
ht 3 node_blocks 91125 blockleft 67106793
ht 4 node_blocks 4100625 blockleft 67015668
final height: 5

The goal of this function is to compute the maximum height btree that
can be stored in the given number of ondisk fsblocks.  Starting with the
top level of the tree, each iteration through the loop adds the fanout
factor of the next level down until we run out of blocks.  IOWs, maximum
height is achieved by using the smallest fanout factor that can apply
to that level.

However, the loop setup is not correct.  Top level btree blocks are
allowed to contain fewer than minrecs items, so the computation is
incorrect because the first time through the loop it should be using a
fanout factor of 2.  With this corrected, the above becomes:

ht 1 node_blocks 2 blockleft 67108863
ht 2 node_blocks 90 blockleft 67108861
ht 3 node_blocks 4050 blockleft 67108771
ht 4 node_blocks 182250 blockleft 67104721
ht 5 node_blocks 8201250 blockleft 66922471
final height: 6

Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Dave Chinner Dec. 20, 2022, 5 a.m. UTC | #1
On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Lately I've been stress-testing extreme-sized rmap btrees by using the
> (new) xfs_db bmap_inflate command to clone bmbt mappings billions of
> times and then using xfs_repair to build new rmap and refcount btrees.
> This of course is /much/ faster than actually FICLONEing a file billions
> of times.
> 
> Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
> EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
> sufficiently large for the test scenario.  For a 1TB filesystem (~67
> million AG blocks, 4 AGs) the btheight command reports:
> 
> $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
> rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
> level 0: 4400801200 records, 52390491 blocks
> level 1: 52390491 records, 1164234 blocks
> level 2: 1164234 records, 25872 blocks
> level 3: 25872 records, 575 blocks
> level 4: 575 records, 13 blocks
> level 5: 13 records, 1 block
> 6 levels, 53581186 blocks total
> 
> The AG is sufficiently large to build this rmap btree.  Unfortunately,
> m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
> function to report height, node blocks, and blocks remaining produces
> this:
> 
> ht 1 node_blocks 45 blockleft 67108863
> ht 2 node_blocks 2025 blockleft 67108818
> ht 3 node_blocks 91125 blockleft 67106793
> ht 4 node_blocks 4100625 blockleft 67015668
> final height: 5
> 
> The goal of this function is to compute the maximum height btree that
> can be stored in the given number of ondisk fsblocks.  Starting with the
> top level of the tree, each iteration through the loop adds the fanout
> factor of the next level down until we run out of blocks.  IOWs, maximum
> height is achieved by using the smallest fanout factor that can apply
> to that level.
> 
> However, the loop setup is not correct.  Top level btree blocks are
> allowed to contain fewer than minrecs items, so the computation is
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ah, that's the critical piece of information I was looking for. I
couldn't work out from the code change below what was wrong with
limits[1]. So....

> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 4c16c8c31fcb..8d11d3f5e529 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
>  	const unsigned int	*limits,
>  	unsigned long long	leaf_blocks)
>  {
> -	unsigned long long	node_blocks = limits[1];
> +	/*
> +	 * The root btree block can have a fanout between 2 and maxrecs because
> +	 * the tree might not be big enough to fill it.
> +	 */

Can you change this comment to say something like:

	/*
	 * The root btree block can have less than minrecs pointers
	 * in it because the tree might not be big enough to require
	 * that amount of fanout. Hence it has a minimum size of
	 * 2 pointers, not limits[1].
	 */

Otherwise it looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> +	unsigned long long	node_blocks = 2;
>  	unsigned long long	blocks_left = leaf_blocks - 1;
>  	unsigned int		height = 1;

For future consideration, we don't use maxrecs in this calculation
at all - should we just pass minrecs into the function rather than
an array of limits?

Cheers,

Dave.
Darrick J. Wong Dec. 20, 2022, 4:10 p.m. UTC | #2
On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Lately I've been stress-testing extreme-sized rmap btrees by using the
> > (new) xfs_db bmap_inflate command to clone bmbt mappings billions of
> > times and then using xfs_repair to build new rmap and refcount btrees.
> > This of course is /much/ faster than actually FICLONEing a file billions
> > of times.
> > 
> > Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
> > EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
> > sufficiently large for the test scenario.  For a 1TB filesystem (~67
> > million AG blocks, 4 AGs) the btheight command reports:
> > 
> > $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
> > rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
> > level 0: 4400801200 records, 52390491 blocks
> > level 1: 52390491 records, 1164234 blocks
> > level 2: 1164234 records, 25872 blocks
> > level 3: 25872 records, 575 blocks
> > level 4: 575 records, 13 blocks
> > level 5: 13 records, 1 block
> > 6 levels, 53581186 blocks total
> > 
> > The AG is sufficiently large to build this rmap btree.  Unfortunately,
> > m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
> > function to report height, node blocks, and blocks remaining produces
> > this:
> > 
> > ht 1 node_blocks 45 blockleft 67108863
> > ht 2 node_blocks 2025 blockleft 67108818
> > ht 3 node_blocks 91125 blockleft 67106793
> > ht 4 node_blocks 4100625 blockleft 67015668
> > final height: 5
> > 
> > The goal of this function is to compute the maximum height btree that
> > can be stored in the given number of ondisk fsblocks.  Starting with the
> > top level of the tree, each iteration through the loop adds the fanout
> > factor of the next level down until we run out of blocks.  IOWs, maximum
> > height is achieved by using the smallest fanout factor that can apply
> > to that level.
> > 
> > However, the loop setup is not correct.  Top level btree blocks are
> > allowed to contain fewer than minrecs items, so the computation is
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Ah, that's the critical piece of information I was looking for. I
> couldn't work out from the code change below what was wrong with
> limits[1]. So....
> 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 4c16c8c31fcb..8d11d3f5e529 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
> >  	const unsigned int	*limits,
> >  	unsigned long long	leaf_blocks)
> >  {
> > -	unsigned long long	node_blocks = limits[1];
> > +	/*
> > +	 * The root btree block can have a fanout between 2 and maxrecs because
> > +	 * the tree might not be big enough to fill it.
> > +	 */
> 
> Can you change this comment to say something like:
> 
> 	/*
> 	 * The root btree block can have less than minrecs pointers
> 	 * in it because the tree might not be big enough to require
> 	 * that amount of fanout. Hence it has a minimum size of
> 	 * 2 pointers, not limits[1].
> 	 */

Done.  Thanks for the reviews! :)

--D

> 
> Otherwise it looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> > +	unsigned long long	node_blocks = 2;
> >  	unsigned long long	blocks_left = leaf_blocks - 1;
> >  	unsigned int		height = 1;
> 
> For future consideration, we don't use maxrecs in this calculation
> at all - should we just pass minrecs into the function rather than
> an array of limits?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Dec. 20, 2022, 4:20 p.m. UTC | #3
On Tue, Dec 20, 2022 at 08:10:08AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> > On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Lately I've been stress-testing extreme-sized rmap btrees by using the
> > > (new) xfs_db bmap_inflate command to clone bmbt mappings billions of
> > > times and then using xfs_repair to build new rmap and refcount btrees.
> > > This of course is /much/ faster than actually FICLONEing a file billions
> > > of times.
> > > 
> > > Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
> > > EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
> > > sufficiently large for the test scenario.  For a 1TB filesystem (~67
> > > million AG blocks, 4 AGs) the btheight command reports:
> > > 
> > > $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
> > > rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
> > > level 0: 4400801200 records, 52390491 blocks
> > > level 1: 52390491 records, 1164234 blocks
> > > level 2: 1164234 records, 25872 blocks
> > > level 3: 25872 records, 575 blocks
> > > level 4: 575 records, 13 blocks
> > > level 5: 13 records, 1 block
> > > 6 levels, 53581186 blocks total
> > > 
> > > The AG is sufficiently large to build this rmap btree.  Unfortunately,
> > > m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
> > > function to report height, node blocks, and blocks remaining produces
> > > this:
> > > 
> > > ht 1 node_blocks 45 blockleft 67108863
> > > ht 2 node_blocks 2025 blockleft 67108818
> > > ht 3 node_blocks 91125 blockleft 67106793
> > > ht 4 node_blocks 4100625 blockleft 67015668
> > > final height: 5
> > > 
> > > The goal of this function is to compute the maximum height btree that
> > > can be stored in the given number of ondisk fsblocks.  Starting with the
> > > top level of the tree, each iteration through the loop adds the fanout
> > > factor of the next level down until we run out of blocks.  IOWs, maximum
> > > height is achieved by using the smallest fanout factor that can apply
> > > to that level.
> > > 
> > > However, the loop setup is not correct.  Top level btree blocks are
> > > allowed to contain fewer than minrecs items, so the computation is
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Ah, that's the critical piece of information I was looking for. I
> > couldn't work out from the code change below what was wrong with
> > limits[1]. So....
> > 
> > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > index 4c16c8c31fcb..8d11d3f5e529 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_btree.c
> > > @@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
> > >  	const unsigned int	*limits,
> > >  	unsigned long long	leaf_blocks)
> > >  {
> > > -	unsigned long long	node_blocks = limits[1];
> > > +	/*
> > > +	 * The root btree block can have a fanout between 2 and maxrecs because
> > > +	 * the tree might not be big enough to fill it.
> > > +	 */
> > 
> > Can you change this comment to say something like:
> > 
> > 	/*
> > 	 * The root btree block can have less than minrecs pointers
> > 	 * in it because the tree might not be big enough to require
> > 	 * that amount of fanout. Hence it has a minimum size of
> > 	 * 2 pointers, not limits[1].
> > 	 */
> 
> Done.  Thanks for the reviews! :)
> 
> --D
> 
> > 
> > Otherwise it looks good.
> > 
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > 
> > > +	unsigned long long	node_blocks = 2;
> > >  	unsigned long long	blocks_left = leaf_blocks - 1;
> > >  	unsigned int		height = 1;
> > 
> > For future consideration, we don't use maxrecs in this calculation
> > at all - should we just pass minrecs into the function rather than
> > an array of limits?

I prefer to replace all those mxr/mnr array constructs with an explicit
structure:

struct xfs_btblock_geometry {
	uint16_t	leaf_maxrecs;
	uint16_t	leaf_minrecs;

	uint16_t	node_maxrecs;
	uint16_t	node_minrecs;
};

and get rid of all the mp->m_rmap_mxr[leaf != 0] stuff that slows me
down every time I have to read it.

--D

> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner Dec. 20, 2022, 8:39 p.m. UTC | #4
On Tue, Dec 20, 2022 at 08:20:03AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2022 at 08:10:08AM -0800, Darrick J. Wong wrote:
> > On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> > > For future consideration, we don't use maxrecs in this calculation
> > > at all - should we just pass minrecs into the function rather than
> > > an array of limits?
> 
> I prefer to replace all those mxr/mnr array constructs with an explicit
> structure:
> 
> struct xfs_btblock_geometry {
> 	uint16_t	leaf_maxrecs;
> 	uint16_t	leaf_minrecs;
> 
> 	uint16_t	node_maxrecs;
> 	uint16_t	node_minrecs;
> };
> 
> and get rid of all the mp->m_rmap_mxr[leaf != 0] stuff that slows me
> down every time I have to read it.

Yup, that sounds like a great idea - I have to work it out from
first principles every time, too.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4c16c8c31fcb..8d11d3f5e529 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4666,7 +4666,11 @@  xfs_btree_space_to_height(
 	const unsigned int	*limits,
 	unsigned long long	leaf_blocks)
 {
-	unsigned long long	node_blocks = limits[1];
+	/*
+	 * The root btree block can have a fanout between 2 and maxrecs because
+	 * the tree might not be big enough to fill it.
+	 */
+	unsigned long long	node_blocks = 2;
 	unsigned long long	blocks_left = leaf_blocks - 1;
 	unsigned int		height = 1;