xfsdump: find root inode, not first inode
diff mbox series

Message ID f66f26f7-5e29-80fc-206c-9a53cf4640fa@redhat.com
State New
Headers show
Series
  • xfsdump: find root inode, not first inode
Related show

Commit Message

Eric Sandeen Aug. 21, 2019, 10:14 p.m. UTC
The prior effort to identify the actual root inode in a filesystem
failed in the (rare) case where inodes were allocated with a lower
number than the root.  As a result, the wrong root inode number
went into the dump, and restore would fail with:

xfsrestore: tree.c:757: tree_begindir: Assertion `ino != persp->p_rootino || hardh == persp->p_rooth' failed.

Fix this by iterating over a chunk's worth of inodes until we find
a directory inode with generation 0, which should only be true
for the real root inode.

Fixes: 25195ebf107 ("xfsdump: handle bind mount targets")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

Comments

Dave Chinner Aug. 22, 2019, 6:01 a.m. UTC | #1
On Wed, Aug 21, 2019 at 05:14:51PM -0500, Eric Sandeen wrote:
> The prior effort to identify the actual root inode in a filesystem
> failed in the (rare) case where inodes were allocated with a lower
> number than the root.  As a result, the wrong root inode number
> went into the dump, and restore would fail with:
> 
> xfsrestore: tree.c:757: tree_begindir: Assertion `ino != persp->p_rootino || hardh == persp->p_rooth' failed.
> 
> Fix this by iterating over a chunk's worth of inodes until we find
> a directory inode with generation 0, which should only be true
> for the real root inode.

I'm not sure addresses the actual case that can cause this error.

i.e. the root inode is always the first inode in the "root chunk"
that is allocated at mkfs time. THat location is where repair will
always calculate it to be, and it will be the inode that it uses
to rebuild the root dir from.

The problem, as I understand it, is that we have a situation where
the per-ag btree root blocks have moved (due to split/merge) from
the their original location which is packed against the AG headers.
The root inode chunk is located at the lowest inode cluster aligned
block after these AG btree roots.

Once the btree roots have moved, they may be space between the AG
headers and the root inode chunk to allocate a new inode chunk,
in which case we have at least 64 inodes allocated underneath the
root inode. And if we have 64k blocks and 256 byte inodes, we could
have 256 inodes per fs block between the AG headers and the root
indoe chunk.

Hence scanning all the inodes in the first indoe chunk won't find
the root indoe if any of these situations has occurred, and we may
have to scan 1500 or more inodes in to find the root chunk (6 btree
roots - BNOBT,CNTBT,INOBT,FINOBT,RMAPBT,REFCBT - and 64k blocks).

So I think the best thing to do here is try to calculate the root
inode number as per xfs_repair, and then bulkstat that. i.e. see
the calculation of "first_prealloc_ino" in repair/xfs_repair.c
(about line 450). Probably requires a XFS_IOC_FSGEOMETRY_V1 call to
get the necessary info to calculate it...

Cheers,

Dave.
Eric Sandeen Aug. 22, 2019, 2:48 p.m. UTC | #2
On 8/22/19 1:01 AM, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 05:14:51PM -0500, Eric Sandeen wrote:
>> The prior effort to identify the actual root inode in a filesystem
>> failed in the (rare) case where inodes were allocated with a lower
>> number than the root.  As a result, the wrong root inode number
>> went into the dump, and restore would fail with:
>>
>> xfsrestore: tree.c:757: tree_begindir: Assertion `ino != persp->p_rootino || hardh == persp->p_rooth' failed.
>>
>> Fix this by iterating over a chunk's worth of inodes until we find
>> a directory inode with generation 0, which should only be true
>> for the real root inode.
> 
> I'm not sure addresses the actual case that can cause this error.
> 
> i.e. the root inode is always the first inode in the "root chunk"
> that is allocated at mkfs time. THat location is where repair will
> always calculate it to be, and it will be the inode that it uses
> to rebuild the root dir from.
> 
> The problem, as I understand it, is that we have a situation where
> the per-ag btree root blocks have moved (due to split/merge) from
> the their original location which is packed against the AG headers.
> The root inode chunk is located at the lowest inode cluster aligned
> block after these AG btree roots.
> 
> Once the btree roots have moved, they may be space between the AG
> headers and the root inode chunk to allocate a new inode chunk,
> in which case we have at least 64 inodes allocated underneath the
> root inode. And if we have 64k blocks and 256 byte inodes, we could
> have 256 inodes per fs block between the AG headers and the root
> indoe chunk.
> 
> Hence scanning all the inodes in the first indoe chunk won't find
> the root indoe if any of these situations has occurred, and we may
> have to scan 1500 or more inodes in to find the root chunk (6 btree
> roots - BNOBT,CNTBT,INOBT,FINOBT,RMAPBT,REFCBT - and 64k blocks).

Hm.  So, the one report I had of this had a root inode of 128, and
bulkstat returned a first inode of 64, triggering the assert above.
(but even then you're right, scanning 64 isn't enough)

If it really can be that far off then maybe this is a bad way to go,
although the long scan is going to be exceedingly rare I think.  Most
people will get the root inode on the first bulkstat call.

> So I think the best thing to do here is try to calculate the root
> inode number as per xfs_repair, and then bulkstat that. i.e. see
> the calculation of "first_prealloc_ino" in repair/xfs_repair.c
> (about line 450). Probably requires a XFS_IOC_FSGEOMETRY_V1 call to
> get the necessary info to calculate it...

I had started to go down that path and it seemed like a real mess.
We need m_ag_maxlevels, m_rmap_maxlevels, sb_logstart, features,
etc etc etc.  I guess I can revisit it.  I had thought about a common
function to calculate this so we aren't coding it twice but I'm not
sure we have the same sets of inputs for the various cases ...

-Eric
Eric Sandeen Aug. 22, 2019, 7:13 p.m. UTC | #3
On 8/22/19 9:48 AM, Eric Sandeen wrote:
> On 8/22/19 1:01 AM, Dave Chinner wrote:

...

>> Hence scanning all the inodes in the first indoe chunk won't find
>> the root indoe if any of these situations has occurred, and we may
>> have to scan 1500 or more inodes in to find the root chunk (6 btree
>> roots - BNOBT,CNTBT,INOBT,FINOBT,RMAPBT,REFCBT - and 64k blocks).
> 
> Hm.  So, the one report I had of this had a root inode of 128, and
> bulkstat returned a first inode of 64, triggering the assert above.
> (but even then you're right, scanning 64 isn't enough)
> 
> If it really can be that far off then maybe this is a bad way to go,
> although the long scan is going to be exceedingly rare I think.  Most
> people will get the root inode on the first bulkstat call.
> 
>> So I think the best thing to do here is try to calculate the root
>> inode number as per xfs_repair, and then bulkstat that. i.e. see
>> the calculation of "first_prealloc_ino" in repair/xfs_repair.c
>> (about line 450). Probably requires a XFS_IOC_FSGEOMETRY_V1 call to
>> get the necessary info to calculate it...
> 
> I had started to go down that path and it seemed like a real mess.
> We need m_ag_maxlevels, m_rmap_maxlevels, sb_logstart, features,
> etc etc etc.  I guess I can revisit it.  I had thought about a common
> function to calculate this so we aren't coding it twice but I'm not
> sure we have the same sets of inputs for the various cases ...

I'm not sure it's possible without a ton of setup (or at all).
(repair has already done a full libxfs_mount setup, so it has what it needs.)

For example we need inode alignment; calculating that depends on whether
we have sparse inodes, and sparse inodes is an incompat feature not
reported by GEOM.  :(

        if (fp->inode_align) {
                int     cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;

                sbp->sb_versionnum |= XFS_SB_VERSION_ALIGNBIT;
                if (cfg->sb_feat.crcs_enabled)
                        cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
                sbp->sb_inoalignmt = cluster_size >> cfg->blocklog;
        } else
                sbp->sb_inoalignmt = 0;
...
        /*
         * Sparse inode chunk support has two main inode alignment requirements.
         * First, sparse chunk alignment must match the cluster size. Second,
         * full chunk alignment must match the inode chunk size.
         *
         * Copy the already calculated/scaled inoalignmt to spino_align and
         * update the former to the full inode chunk size.
         */
        if (fp->spinodes) {
                sbp->sb_spino_align = sbp->sb_inoalignmt;
                sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
                                cfg->inodesize >> cfg->blocklog;
                sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_SPINODES;
        }

-Eric
Eric Sandeen Aug. 22, 2019, 7:19 p.m. UTC | #4
On 8/22/19 2:13 PM, Eric Sandeen wrote:
> On 8/22/19 9:48 AM, Eric Sandeen wrote:
>> On 8/22/19 1:01 AM, Dave Chinner wrote:
> 
> ...
> 
>>> Hence scanning all the inodes in the first indoe chunk won't find
>>> the root indoe if any of these situations has occurred, and we may
>>> have to scan 1500 or more inodes in to find the root chunk (6 btree
>>> roots - BNOBT,CNTBT,INOBT,FINOBT,RMAPBT,REFCBT - and 64k blocks).
>>
>> Hm.  So, the one report I had of this had a root inode of 128, and
>> bulkstat returned a first inode of 64, triggering the assert above.
>> (but even then you're right, scanning 64 isn't enough)
>>
>> If it really can be that far off then maybe this is a bad way to go,
>> although the long scan is going to be exceedingly rare I think.  Most
>> people will get the root inode on the first bulkstat call.
>>
>>> So I think the best thing to do here is try to calculate the root
>>> inode number as per xfs_repair, and then bulkstat that. i.e. see
>>> the calculation of "first_prealloc_ino" in repair/xfs_repair.c
>>> (about line 450). Probably requires a XFS_IOC_FSGEOMETRY_V1 call to
>>> get the necessary info to calculate it...
>>
>> I had started to go down that path and it seemed like a real mess.
>> We need m_ag_maxlevels, m_rmap_maxlevels, sb_logstart, features,
>> etc etc etc.  I guess I can revisit it.  I had thought about a common
>> function to calculate this so we aren't coding it twice but I'm not
>> sure we have the same sets of inputs for the various cases ...
> 
> I'm not sure it's possible without a ton of setup (or at all).
> (repair has already done a full libxfs_mount setup, so it has what it needs.)
> 
> For example we need inode alignment; calculating that depends on whether
> we have sparse inodes, and sparse inodes is an incompat feature not
> reported by GEOM.  :(
> 
>         if (fp->inode_align) {
>                 int     cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> 
>                 sbp->sb_versionnum |= XFS_SB_VERSION_ALIGNBIT;
>                 if (cfg->sb_feat.crcs_enabled)
>                         cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
>                 sbp->sb_inoalignmt = cluster_size >> cfg->blocklog;
>         } else
>                 sbp->sb_inoalignmt = 0;
> ...
>         /*
>          * Sparse inode chunk support has two main inode alignment requirements.
>          * First, sparse chunk alignment must match the cluster size. Second,
>          * full chunk alignment must match the inode chunk size.
>          *
>          * Copy the already calculated/scaled inoalignmt to spino_align and
>          * update the former to the full inode chunk size.
>          */
>         if (fp->spinodes) {
>                 sbp->sb_spino_align = sbp->sb_inoalignmt;
>                 sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
>                                 cfg->inodesize >> cfg->blocklog;
>                 sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_SPINODES;
>         }
> 
> -Eric

Should the root inode number always be a multiple of XFS_INODES_PER_CHUNK?
if so I could speed up a longer search that way ... but this sucks.  :/

Maybe I should just disallow dumping via a bind-mounted dir.  Stat the target dir,
get the inode, bulkstat it, check the gen, and if it's not zero, bail out with
an error.

... that's probably better and less surprising anyway.

-Eric

Patch
diff mbox series

diff --git a/dump/content.c b/dump/content.c
index 30232d4..9f9c03b 100644
--- a/dump/content.c
+++ b/dump/content.c
@@ -1384,7 +1384,7 @@  baseuuidbypass:
 	/* figure out the ino for the root directory of the fs
 	 * and get its struct xfs_bstat for inomap_build().  This could
 	 * be a bind mount; don't ask for the mount point inode,
-	 * find the actual lowest inode number in the filesystem.
+	 * actually find the root inode number in the filesystem.
 	 */
 	{
 		stat64_t rootstat;
@@ -1404,20 +1404,36 @@  baseuuidbypass:
 			(struct xfs_bstat *)calloc(1, sizeof(struct xfs_bstat));
 		assert(sc_rootxfsstatp);
 
-		/* Get the first valid (i.e. root) inode in this fs */
-		bulkreq.lastip = (__u64 *)&lastino;
-		bulkreq.icount = 1;
-		bulkreq.ubuffer = sc_rootxfsstatp;
-		bulkreq.ocount = &ocount;
-		if (ioctl(sc_fsfd, XFS_IOC_FSBULKSTAT, &bulkreq) < 0) {
-			mlog(MLOG_ERROR,
-			      _("failed to get bulkstat information for root inode\n"));
+		/*
+		 * Find the root inode in this fs.  It is (rarely) possible to
+		 * have a non-root inode come before the root inode, so iterate
+		 * over a chunk's worth looking for the first dir inode with
+		 * bs_gen == 0, which should only be true for the root inode.
+		 */
+		for (i = 0; i < 64; i++) {
+			bulkreq.lastip = (__u64 *)&lastino;
+			bulkreq.icount = 1;
+			bulkreq.ubuffer = sc_rootxfsstatp;
+			bulkreq.ocount = &ocount;
+			if (ioctl(sc_fsfd, XFS_IOC_FSBULKSTAT, &bulkreq) < 0) {
+				mlog(MLOG_ERROR,
+_("failed to get bulkstat information for root inode\n"));
+				return BOOL_FALSE;
+			}
+			/* found it? */
+			if ((sc_rootxfsstatp->bs_mode & S_IFMT) == S_IFDIR &&
+			    sc_rootxfsstatp->bs_gen == 0)
+				break;
+		}
+
+		if (i == 64) {
+			mlog(MLOG_ERROR, _("failed to find root inode\n"));
 			return BOOL_FALSE;
 		}
 
 		if (sc_rootxfsstatp->bs_ino != rootstat.st_ino)
 			mlog (MLOG_NORMAL | MLOG_NOTE,
-			       _("root ino %lld differs from mount dir ino %lld, bind mount?\n"),
+_("root ino %lld differs from mount dir ino %lld, bind mount?\n"),
 			         sc_rootxfsstatp->bs_ino, rootstat.st_ino);
 	}