diff mbox series

[3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check

Message ID 163961697197.3129691.1911552605195534271.stgit@magnolia (mailing list archive)
State Accepted, archived
Series xfs: random fixes for 5.17

Commit Message

Darrick J. Wong Dec. 16, 2021, 1:09 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size.  Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.

Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize.  The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.

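Fix the check by comparing bestcount against the number of directory
data blocks that scrub actually scanned, instead of the block count
implied by isize.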

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/dir.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
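
To make the difference between the two comparisons concrete, here is a
minimal userspace sketch, not the kernel code.  The names dir_geo,
byte_to_db, old_check_ok and new_check_ok are hypothetical; only the shift
by blklog is meant to mirror what xfs_dir2_byte_to_db() does, and the
4096-byte directory block size used in the note below is an assumed
example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the directory geometry. */
struct dir_geo {
	unsigned int	blklog;	/* log2 of the directory block size in bytes */
};

/* Byte offset -> directory block index (same idea as xfs_dir2_byte_to_db). */
static uint32_t byte_to_db(const struct dir_geo *geo, uint64_t bytes)
{
	assert(geo->blklog < 64);
	return (uint32_t)(bytes >> geo->blklog);
}

/* Old check: bestcount must equal the number of blocks that fit under isize. */
static bool old_check_ok(const struct dir_geo *geo, uint32_t bestcount,
			 uint64_t isize)
{
	return bestcount == byte_to_db(geo, isize);
}

/* New check: bestcount must cover exactly the data blocks that were scanned. */
static bool new_check_ok(uint32_t bestcount, uint32_t last_data_db)
{
	return bestcount == last_data_db + 1;
}
```

With blklog = 12, three scanned data blocks (last_data_db = 2,
bestcount = 3) and an isize of 5 * 4096 bytes, old_check_ok() returns
false and would flag the directory as corrupt, while new_check_ok()
returns true.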

Comments

Dave Chinner Dec. 16, 2021, 5:05 a.m. UTC | #1
On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When xfs_scrub encounters a directory with a leaf1 block, it tries to
> validate that the leaf1 block's bestcount (aka the best free count of
> each directory data block) is the correct size.  Previously, this author
> believed that comparing bestcount to the directory isize (since
> directory data blocks are under isize, and leaf/bestfree blocks are
> above it) was sufficient.
> 
> Unfortunately during testing of online repair, it was discovered that it
> is possible to create a directory with a hole between the last directory
> block and isize.

We have xfs_da3_swap_lastblock() that can leave an -empty- da block
between the last referenced block and isize, but that's not a "hole"
in the file. If you don't mean xfs_da3_swap_lastblock(), then can
you clarify what you mean by a "hole" here and explain to me how that
situation comes about?

Cheers,

Dave.
Darrick J. Wong Dec. 16, 2021, 7:25 p.m. UTC | #2
On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > validate that the leaf1 block's bestcount (aka the best free count of
> > each directory data block) is the correct size.  Previously, this author
> > believed that comparing bestcount to the directory isize (since
> > directory data blocks are under isize, and leaf/bestfree blocks are
> > above it) was sufficient.
> > 
> > Unfortunately during testing of online repair, it was discovered that it
> > is possible to create a directory with a hole between the last directory
> > block and isize.
> 
> We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> between the last referenced block and isize, but that's not a "hole"
> in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> you clarify what you mean by a "hole" here and explain to me how the
> situation it occurs in comes about?

I don't actually know how it comes about.  I wrote a test that sets up
fsstress to expand and contract directories and races xfs_scrub -n, and
noticed that I'd periodically get complaints about directories (usually
$SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
holes.

I began reading the dir2 code to try to figure out how this came about
(clearly we're not updating i_size somewhere) but then took the shortcut
of seeing if xfs_repair or xfs_check complained about this situation.
Neither of them did, and I found a couple more directories in a similar
situation on my crash test dummy machine, and concluded "Wellllp, I
guess this is part of the ondisk format!" and committed the patch.

Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
btree blocks, not the blocks containing directory entries?  I /think/
the actual explanation is that something goes wrong in
xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
i_disk_size doesn't get updated?  Not sure how /that/ can happen,
though...

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Dec. 16, 2021, 9:17 p.m. UTC | #3
On Thu, Dec 16, 2021 at 11:25:49AM -0800, Darrick J. Wong wrote:
> On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> > On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > > validate that the leaf1 block's bestcount (aka the best free count of
> > > each directory data block) is the correct size.  Previously, this author
> > > believed that comparing bestcount to the directory isize (since
> > > directory data blocks are under isize, and leaf/bestfree blocks are
> > > above it) was sufficient.
> > > 
> > > Unfortunately during testing of online repair, it was discovered that it
> > > is possible to create a directory with a hole between the last directory
> > > block and isize.
> > 
> > We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> > between the last referenced block and isize, but that's not a "hole"
> > in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> > you clarify what you mean by a "hole" here and explain to me how the
> > situation it occurs in comes about?
> 
> I don't actually know how it comes about.  I wrote a test that sets up
> fsstress to expand and contract directories and races xfs_scrub -n, and
> noticed that I'd periodically get complaints about directories (usually
> $SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
> holes.

Is that test getting to ENOSPC at all?

> I began reading the dir2 code to try to figure out how this came about
> (clearly we're not updating i_size somewhere) but then took the shortcut
> of seeing if xfs_repair or xfs_check complained about this situation.
> Neither of them did, and I found a couple more directories in a similar
> situation on my crash test dummy machine, and concluded "Wellllp, I
> guess this is part of the ondisk format!" and committed the patch.
> 
> Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
> btree blocks, not the blocks containing directory entries?

Ah, right you are. I noticed xfs_da_shrink_inode() being called from
leaf_to_block() and thought it might be swapping the leaf with the
last data block that we probably just removed. Looking at the code,
that is not going to happen AFAICT...

> I /think/
> the actual explanation is that something goes wrong in
> xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
> i_disk_size doesn't get updated?  Not sure how /that/ can happen,
> though...

Actually, the ENOSPC case in xfs_dir2_shrink_inode is the likely
case. If we can't free the block because bunmapi gets ENOSPC due
to xfs_dir_rename() being called without a block reservation, it'll
just get left there as an empty data block. If all the other dir
data blocks around it get removed properly, it could eventually end
up between the last valid entry and isize....

There are lots of weird corner cases around ENOSPC in the directory
code, perhaps this is just another of them...

Cheers,

Dave.
Darrick J. Wong Dec. 16, 2021, 9:40 p.m. UTC | #4
On Fri, Dec 17, 2021 at 08:17:48AM +1100, Dave Chinner wrote:
> On Thu, Dec 16, 2021 at 11:25:49AM -0800, Darrick J. Wong wrote:
> > On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > > > validate that the leaf1 block's bestcount (aka the best free count of
> > > > each directory data block) is the correct size.  Previously, this author
> > > > believed that comparing bestcount to the directory isize (since
> > > > directory data blocks are under isize, and leaf/bestfree blocks are
> > > > above it) was sufficient.
> > > > 
> > > > Unfortunately during testing of online repair, it was discovered that it
> > > > is possible to create a directory with a hole between the last directory
> > > > block and isize.
> > > 
> > > We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> > > between the last referenced block and isize, but that's not a "hole"
> > > in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> > > you clarify what you mean by a "hole" here and explain to me how the
> > > situation it occurs in comes about?
> > 
> > I don't actually know how it comes about.  I wrote a test that sets up
> > fsstress to expand and contract directories and races xfs_scrub -n, and
> > noticed that I'd periodically get complaints about directories (usually
> > $SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
> > holes.
> 
> Is that test getting to ENOSPC at all?

Yes.  That particular VM has a generous 8GB of SCRATCH_DEV to make the
repairs more interesting.

> > I began reading the dir2 code to try to figure out how this came about
> > (clearly we're not updating i_size somewhere) but then took the shortcut
> > of seeing if xfs_repair or xfs_check complained about this situation.
> > Neither of them did, and I found a couple more directories in a similar
> > situation on my crash test dummy machine, and concluded "Wellllp, I
> > guess this is part of the ondisk format!" and committed the patch.
> > 
> > Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
> > btree blocks, not the blocks containing directory entries?
> 
> Ah, right you are. I noticed xfs_da_shrink_inode() being called from
> leaf_to_block() and thought it might be swapping the leaf with the
> last data block that we probably just removed. Looking at the code,
> that is not going to happen AFAICT...
> 
> > I /think/
> > the actual explanation is that something goes wrong in
> > xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
> > i_disk_size doesn't get updated?  Not sure how /that/ can happen,
> > though...
> 
> Actually, the ENOSPC case in xfs_dir2_shrink_inode is the likely
> case. If we can't free the block because bunmapi gets ENOSPC due
> to xfs_dir_rename() being called without a block reservation, it'll
> just get left there as an empty data block. If all the other dir
> data blocks around it get removed properly, it could eventually end
> up between the last valid entry and isize....
> 
> There are lots of weird corner cases around ENOSPC in the directory
> code, perhaps this is just another of them...

<nod> The next time I reproduce it, I'll send you a metadump.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Dec. 16, 2021, 10:04 p.m. UTC | #5
On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When xfs_scrub encounters a directory with a leaf1 block, it tries to
> validate that the leaf1 block's bestcount (aka the best free count of
> each directory data block) is the correct size.  Previously, this author
> believed that comparing bestcount to the directory isize (since
> directory data blocks are under isize, and leaf/bestfree blocks are
> above it) was sufficient.
> 
> Unfortunately during testing of online repair, it was discovered that it
> is possible to create a directory with a hole between the last directory
> block and isize.  The directory code seems to handle this situation just
> fine and xfs_repair doesn't complain, which effectively makes this quirk
> part of the disk format.
> 
> Fix the check to work properly.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

With the "we're not sure how this happens" discussion out of the
way, the change to handle the empty space between the last block and
isize looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Christoph Hellwig Dec. 24, 2021, 7:17 a.m. UTC | #6
Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Patch

diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 200a63f58fe7..9a16932d77ce 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -497,6 +497,7 @@  STATIC int
 xchk_directory_leaf1_bestfree(
 	struct xfs_scrub		*sc,
 	struct xfs_da_args		*args,
+	xfs_dir2_db_t			last_data_db,
 	xfs_dablk_t			lblk)
 {
 	struct xfs_dir3_icleaf_hdr	leafhdr;
@@ -534,10 +535,14 @@  xchk_directory_leaf1_bestfree(
 	}
 
 	/*
-	 * There should be as many bestfree slots as there are dir data
-	 * blocks that can fit under i_size.
+	 * There must be enough bestfree slots to cover all the directory data
+	 * blocks that we scanned.  It is possible for there to be a hole
+	 * between the last data block and i_disk_size.  This seems like an
+	 * oversight to the scrub author, but as we have been writing out
+	 * directories like this (and xfs_repair doesn't mind them) for years,
+	 * that's what we have to check.
 	 */
-	if (bestcount != xfs_dir2_byte_to_db(geo, sc->ip->i_disk_size)) {
+	if (bestcount != last_data_db + 1) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
 		goto out;
 	}
@@ -669,6 +674,7 @@  xchk_directory_blocks(
 	xfs_fileoff_t		lblk;
 	struct xfs_iext_cursor	icur;
 	xfs_dablk_t		dabno;
+	xfs_dir2_db_t		last_data_db = 0;
 	bool			found;
 	int			is_block = 0;
 	int			error;
@@ -712,6 +718,7 @@  xchk_directory_blocks(
 				args.geo->fsbcount);
 		     lblk < got.br_startoff + got.br_blockcount;
 		     lblk += args.geo->fsbcount) {
+			last_data_db = xfs_dir2_da_to_db(mp->m_dir_geo, lblk);
 			error = xchk_directory_data_bestfree(sc, lblk,
 					is_block);
 			if (error)
@@ -734,7 +741,7 @@  xchk_directory_blocks(
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
 			goto out;
 		}
-		error = xchk_directory_leaf1_bestfree(sc, &args,
+		error = xchk_directory_leaf1_bestfree(sc, &args, last_data_db,
 				leaf_lblk);
 		if (error)
 			goto out;
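
The hunks in xchk_directory_blocks() record the index of each data block
as the extent walk visits it.  A simplified userspace sketch of that idea
follows.  struct extent, last_data_block() and its parameters are
hypothetical stand-ins, and the shift by fsblog is an assumption about how
xfs_dir2_da_to_db() turns a da block number into a directory block index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record: a run of mapped fs blocks. */
struct extent {
	uint64_t	startoff;	/* first file block offset of the mapping */
	uint64_t	blockcount;	/* number of mapped filesystem blocks */
};

/*
 * Walk the mapped extents below leaf_lblk and return the index of the last
 * directory data block visited.  Returns 0 if no data block was seen, which
 * matches the initial value used in the patch.
 */
static uint32_t last_data_block(const struct extent *ext, size_t nr,
				unsigned int fsblog, uint64_t leaf_lblk)
{
	uint64_t	fsbcount;
	uint32_t	last_data_db = 0;

	assert(fsblog < 32);
	fsbcount = 1ULL << fsblog;	/* fs blocks per directory block */

	for (size_t i = 0; i < nr; i++) {
		/* Round the extent start up to a directory block boundary. */
		uint64_t lblk = (ext[i].startoff + fsbcount - 1) &
				~(fsbcount - 1);
		uint64_t end = ext[i].startoff + ext[i].blockcount;

		/* Only blocks below the leaf offset hold directory data. */
		for (; lblk < end && lblk < leaf_lblk; lblk += fsbcount)
			last_data_db = (uint32_t)(lblk >> fsblog);
	}
	return last_data_db;
}
```

The caller would then compare bestcount against the returned value plus
one.  Because the walk only sees mapped extents, unmapped space between
the last data block and isize never inflates the expected count.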