[11/21] xfs: repair the rmapbt

Message ID 152986827881.3155.10096839660329617215.stgit@magnolia (mailing list archive)
State Superseded

Commit Message

Darrick J. Wong June 24, 2018, 7:24 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the reverse mapping btree from all primary metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/repair.h      |   11 
 fs/xfs/scrub/rmap.c        |    6 
 fs/xfs/scrub/rmap_repair.c | 1036 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c       |    2 
 5 files changed, 1054 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/rmap_repair.c




Comments

Dave Chinner July 3, 2018, 5:32 a.m. UTC | #1
On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the reverse mapping btree from all primary metadata.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

....

> +static inline int xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	/* We don't support rmap repair, but we can still do a scan. */
> +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> +}

This comment seems at odds with the commit message....

....

> +/*
> + * Reverse Mapping Btree Repair
> + * ============================
> + *
> + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> + * else in XFS we lock inodes and then AG data structures, but generating the
> + * list of rmap records requires that we be able to scan both block mapping
> + * btrees of every inode in the filesystem to see if it owns any extents in
> + * this AG.  We can't tolerate any inode updates while we do this, so we
> + * freeze the filesystem to lock everyone else out, and grant ourselves
> + * special privileges to run transactions with regular background reclamation
> + * turned off.

Hmmm. This implies we are going to scan the entire filesystem for
every AG we need to rebuild the rmap tree in. That seems like an
awful lot of work if there's more than one rmap btree that needs
rebuild.

> + * We also have to be very careful not to allow inode reclaim to start a
> + * transaction because all transactions (other than our own) will block.

What happens when we run out of memory? Inode reclaim will need to
run at that point, right?

> + * So basically we scan all primary per-AG metadata and all block maps of all
> + * inodes to generate a huge list of reverse map records.  Next we look for
> + * gaps in the rmap records to calculate all the unclaimed free space (1).
> + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> + * the space used by those btrees from (1), and also subtract the free space
> + * listed in the bnobt from (1).  What's left are the gaps in assigned space
> + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> + * the blocks from the old rmapbt and they can be freed.
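
The subtraction being described works out to plain set arithmetic.  As
a toy standalone model -- the per-block bitmaps here stand in for the
in-kernel extent lists the patch actually uses:

#include <stdbool.h>
#include <stdio.h>

#define AG_BLOCKS 16				/* toy AG size */

int main(void)
{
	bool rmapped[AG_BLOCKS] = { false };	/* covered by a generated rmap */
	bool own_ag[AG_BLOCKS] = { false };	/* bnobt/cntbt/agfl blocks */
	bool bnobt_free[AG_BLOCKS] = { false };	/* free space per the bnobt */
	int b;

	/* Pretend the metadata scan found these (made-up values): */
	rmapped[0] = rmapped[1] = rmapped[2] = true;
	own_ag[4] = true;
	bnobt_free[5] = bnobt_free[6] = true;

	/*
	 * Gaps in the rmap records (1), minus the OWN_AG btree blocks
	 * and the bnobt free space, leave the old rmapbt blocks.
	 */
	for (b = 0; b < AG_BLOCKS; b++)
		if (!rmapped[b] && !own_ag[b] && !bnobt_free[b])
			printf("old rmapbt block %d can be freed\n", b);
	return 0;
}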

This looks like a lot of repeated work. We've already scanned a
bunch of these trees to repair them, then thrown away the scan
results. Now we do another scan of what we've rebuilt.....

... hold on. Chicken and egg.

We verify and rebuild all the other trees from the rmap information
- how do we determine that the rmap needs to be rebuilt and that the
metadata it's being rebuilt from is valid?

Given that we've effectively got to shut down access to the
filesystem for the entire rmap rebuild while we do an entire
filesystem scan, why would we do this online? It's going to be
faster to do this rebuild offline (because of all the prefetching,
rebuilding all AG trees from the state gathered in the full
filesystem passes, etc) and we don't have to hack around potential
transaction and memory reclaim deadlock situations, either?

So why do rmap rebuilds online at all?

> + */
> +
> +/* Set us up to repair reverse mapping btrees. */
> +int
> +xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	int				error;
> +
> +	/*
> +	 * Freeze out anything that can lock an inode.  We reconstruct
> +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> +	 * inodes.
> +	 */
> +	error = xfs_scrub_fs_freeze(sc);
> +	if (error)
> +		return error;
> +
> +	/* Check the AG number and set up the scrub context. */
> +	error = xfs_scrub_setup_fs(sc, ip);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Lock all the AG header buffers so that we can read all the
> +	 * per-AG metadata too.
> +	 */
> +	error = xfs_repair_grab_all_ag_headers(sc);
> +	if (error)
> +		return error;

So if we have thousands of AGs (think PB scale filesystems) then
we're going to hold many thousands of locked buffers here? Just so we
can rebuild the rmapbt in one AG?

What does holding these buffers locked protect us against that
an active freeze doesn't?

> +xfs_repair_rmapbt_new_rmap(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agblock_t			startblock,
> +	xfs_extlen_t			blockcount,
> +	uint64_t			owner,
> +	uint64_t			offset,
> +	unsigned int			flags)
> +{
> +	struct xfs_repair_rmapbt_extent	*rre;
> +	int				error = 0;
> +
> +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> +			startblock, blockcount, owner, offset, flags);
> +
> +	if (xfs_scrub_should_terminate(rr->sc, &error))
> +		return error;
> +
> +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> +	if (!rre)
> +		return -ENOMEM;

This seems like a likely thing to happen given the "no reclaim"
state of the filesystem and the memory demand a rmapbt rebuild
can have. If we've got GBs of rmap info in the AG that needs to be
rebuilt, how much RAM are we going to need to index it all as we
scan the filesystem?

> +xfs_repair_rmapbt_scan_ifork(
> +	struct xfs_repair_rmapbt	*rr,
> +	struct xfs_inode		*ip,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		rec;
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_mount		*mp = rr->sc->mp;
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_ifork		*ifp;
> +	unsigned int			rflags;
> +	int				fmt;
> +	int				error = 0;
> +
> +	/* Do we even have data mapping extents? */
> +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	switch (fmt) {
> +	case XFS_DINODE_FMT_BTREE:
> +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> +			if (error)
> +				return error;
> +		}

Ok, so we need inodes locked to do this....

....
> +/* Iterate all the inodes in an AG group. */
> +STATIC int
> +xfs_repair_rmapbt_scan_inobt(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_rec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_inobt_rec_incore	irec;
> +	struct xfs_repair_rmapbt	*rr = priv;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_inode		*ip = NULL;
> +	xfs_ino_t			ino;
> +	xfs_agino_t			agino;
> +	int				chunkidx;
> +	int				lock_mode = 0;
> +	int				error = 0;
> +
> +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> +
> +	for (chunkidx = 0, agino = irec.ir_startino;
> +	     chunkidx < XFS_INODES_PER_CHUNK;
> +	     chunkidx++, agino++) {
> +		bool	inuse;
> +
> +		/* Skip if this inode is free */
> +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> +			continue;
> +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> +
> +		/* Back off and try again if an inode is being reclaimed */
> +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> +				&inuse);
> +		if (error == -EAGAIN)
> +			return -EDEADLOCK;

And we can get inode access errors here.....

FWIW, how is the inode being reclaimed if the filesystem is frozen?

> +
> +		/*
> +		 * Grab inode for scanning.  We cannot use DONTCACHE here
> +		 * because we already have a transaction so the iput must not
> +		 * trigger inode reclaim (which might allocate a transaction
> +		 * to clean up posteof blocks).
> +		 */
> +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);

So if there are enough inodes in the AG, we'll run out of memory
here because we aren't reclaiming inodes from the cache but instead
putting them all on the deferred iput list?

> +		if (error)
> +			return error;
> +		trace_xfs_scrub_iget(ip, __this_address);
> +
> +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> +			lock_mode = XFS_ILOCK_EXCL;
> +		else
> +			lock_mode = XFS_ILOCK_SHARED;
> +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> +			error = -EBUSY;
> +			goto out_rele;
> +		}

And in what situation do we get inodes stuck with the ilock held on
frozen filesystems?

....

> +out_unlock:
> +	xfs_iunlock(ip, lock_mode);
> +out_rele:
> +	iput(VFS_I(ip));
> +	return error;

calling iput in the error path is a bug - it will trigger all the
paths you're trying to avoid by using the deferred iput list.

....


> +/* Collect rmaps for all block mappings for every inode in this AG. */
> +STATIC int
> +xfs_repair_rmapbt_generate_aginode_rmaps(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agnumber_t			agno)
> +{
> +	struct xfs_scrub_context	*sc = rr->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_buf			*agi_bp;
> +	int				error;
> +
> +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> +	if (error)
> +		return error;
> +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);

So if we get a locked or reclaiming inode anywhere in the
filesystem we see EDEADLOCK/EBUSY here without having scanned all
the inodes in the AG, right?

> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_trans_brelse(sc->tp, agi_bp);
> +	return error;
> +}
> +
> +/*
> + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> + * blocks, and the new btreeblks count.  Figure out if we have enough free
> + * space to reconstruct the inode btrees.  The caller must clean up the lists
> + * if anything goes wrong.
> + */
> +STATIC int
> +xfs_repair_rmapbt_find_rmaps(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*rmap_records,
> +	xfs_agblock_t			*new_btreeblks)
> +{
> +	struct xfs_repair_rmapbt	rr;
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	rr.rmaplist = rmap_records;
> +	rr.sc = sc;
> +	rr.nr_records = 0;
> +
> +	/* Generate rmaps for AG space metadata */
> +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> +	if (error)
> +		return error;
> +
> +	/* Iterate all AGs for inodes rmaps. */
> +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> +		if (error)
> +			return error;

And that means we abort here....

> +/* Repair the rmap btree for some AG. */
> +int
> +xfs_repair_rmapbt(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct list_head		rmap_records;
> +	xfs_extlen_t			new_btreeblks;
> +	int				log_flags = 0;
> +	int				error;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect rmaps for all AG headers. */
> +	INIT_LIST_HEAD(&rmap_records);
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> +	if (error)
> +		goto out;

And we drop out here. So, essentially, any ENOMEM, locked inode or
inode in reclaim anywhere in the filesystem will prevent rmap
rebuild. Which says to me that rebuilding the rmap on any
substantial filesystem is likely to fail.

Which brings me back to my original question: why attempt to do
rmap rebuild online given how complex it is, the performance
implications of a full filesystem scan per AG that needs rebuild,
and all the ways it could easily fail?

Cheers,

Dave.
Darrick J. Wong July 3, 2018, 11:59 p.m. UTC | #2
On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Rebuild the reverse mapping btree from all primary metadata.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> ....
> 
> > +static inline int xfs_repair_rmapbt_setup(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	/* We don't support rmap repair, but we can still do a scan. */
> > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > +}
> 
> This comment seems at odds with the commit message....

This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.

> 
> ....
> 
> > +/*
> > + * Reverse Mapping Btree Repair
> > + * ============================
> > + *
> > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > + * else in XFS we lock inodes and then AG data structures, but generating the
> > + * list of rmap records requires that we be able to scan both block mapping
> > + * btrees of every inode in the filesystem to see if it owns any extents in
> > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > + * special privileges to run transactions with regular background reclamation
> > + * turned off.
> 
> Hmmm. This implies we are going to scan the entire filesystem for
> every AG we need to rebuild the rmap tree in. That seems like an
> awful lot of work if there's more than one rmap btree that needs
> rebuild.

[some of this Dave and I discussed on IRC, so I'll summarize for
everyone else here...]

For this initial v0 iteration of the rmap repair code, yes, we have to
freeze the fs and iterate everything.  However, unless your computer and
storage are particularly untrustworthy, rmapbt reconstruction should be
a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
has to opt-in to slow repairs, and presumably it could choose instead to
unmount and run xfs_repair if that's too dear or there are too many
broken AGs, etc.  More on that later.

In the long run I don't see a need to freeze the filesystem to scan
every inode for bmbt entries in the damaged AG.  In fact, we can improve
the performance of all the AG repair functions in general with the
scheme I'm about to outline:

Create a "shut down this AG" primitive.  Once set, block and inode
allocation routines will bypass this AG.  Unlinked inodes are moved to
the unlinked list to avoid touching as much of the AGI as we practically
can.  Unmapped/freed blocks can be moved to a hidden inode (in another
AG) to be freed later.  Growfs operation in that AG can be rejected.

With AG shutdown in place, we no longer need to freeze the whole
filesystem.  Instead we traverse the live fs looking for mappings into
the afflicted AG, and when we're done rebuilding the rmapbt we can then
clean out the unlinked inodes and blocks, which catches us up to the
rest of the system.

(It would be awesome if we could capture the log intent items for the
removals until we're done rebuilding the rmapbt, but I suspect that
could fatally pin the log on us.)
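
Sketched as code, the allocator gate this implies might look something
like the following -- every name here is hypothetical, since no such
flag or helper exists in the tree:

/* Hypothetical sketch; none of these names exist in XFS today. */
#include <stdbool.h>

#define XFS_PAG_SHUTDOWN	(1U << 0)	/* AG offline for repair */

struct xfs_perag_sketch {
	unsigned int		pag_repair_flags;
};

/*
 * Block and inode allocation would test this and bypass the AG
 * entirely; unlinked inodes and freed blocks get parked elsewhere
 * for processing after the repair finishes.
 */
static inline bool
xfs_perag_usable(const struct xfs_perag_sketch *pag)
{
	return !(pag->pag_repair_flags & XFS_PAG_SHUTDOWN);
}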

However, given that adding AG shutdowns would be a lot of work I'd
prefer to get this launched even if it's reasonably likely early
versions will fail due to ENOMEM.  I tried to structure the rmapbt
repair so that we collect all the records and touch all the inodes
before we start writing anything so that if we run out of memory or
encounter a locked/reclaimable/whatever inode we'll just free all the
memory and bail out.  After that, the admin can take the fs offline and
run xfs_repair, so it shouldn't be much worse than today.

> > + * We also have to be very careful not to allow inode reclaim to start a
> > + * transaction because all transactions (other than our own) will block.
> 
> What happens when we run out of memory? Inode reclaim will need to
> run at that point, right?

Yes.

> > + * So basically we scan all primary per-AG metadata and all block maps of all
> > + * inodes to generate a huge list of reverse map records.  Next we look for
> > + * gaps in the rmap records to calculate all the unclaimed free space (1).
> > + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> > + * the space used by those btrees from (1), and also subtract the free space
> > + * listed in the bnobt from (1).  What's left are the gaps in assigned space
> > + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> > + * the blocks from the old rmapbt and they can be freed.
> 
> This looks like a lot of repeated work. We've already scanned a
> bunch of these trees to repair them, then thrown away the scan
> results. Now we do another scan of what we've rebuilt.....
> 
> ... hold on. Chicken and egg.
> 
> We verify and rebuild all the other trees from the rmap information
> - how do we determine that the rmap needs to be rebuilt and that the
> metadata it's being rebuilt from is valid?

Userspace should invoke the other scrubbers for the AG before deciding
to re-run the rmap scrubber with IFLAG_REPAIR set.  xfs_scrub won't try
to repair rmap btrees until after it's already checked everything else
in the filesystem.  Currently xfs_scrub will complain if it finds
corruption in the primary data & the rmapbt; as the code matures I will
probably change it to error out when this happens.
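
From userspace that ordering looks roughly like this (a sketch: the
scrub UAPI header's install path may vary, and error handling is
omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>	/* struct xfs_scrub_metadata et al. */

/*
 * Scrub one AG's rmapbt; pass repair=1 only after the other scrubbers
 * for this AG have already come back clean.
 */
static int
scrub_rmapbt(int fd, unsigned int agno, int repair)
{
	struct xfs_scrub_metadata sm;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_RMAPBT;
	sm.sm_agno = agno;
	if (repair)
		sm.sm_flags |= XFS_SCRUB_IFLAG_REPAIR;
	return ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
}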

> Given that we've effectively got to shut down access to the
> filesystem for the entire rmap rebuild while we do an entire
> filesystem scan, why would we do this online? It's going to be
> faster to do this rebuild offline (because of all the prefetching,
> rebuilding all AG trees from the state gathered in the full
> filesystem passes, etc) and we don't have to hack around potential
> transaction and memory reclaim deadlock situations, either?
> 
> So why do rmap rebuilds online at all?

The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
3 while it checks everything.  By the time it gets to rmapbt repairs
towards the end of phase 4 (if there's enough memory) those blocks will
still be in cache and online repair doesn't have to wait for the disk.

If instead you unmount and run xfs_repair then xfs_repair has to reload
all that metadata and recheck it, all of which happens with the fs
offline.

So except for the extra complexity of avoiding deadlocks (which I
readily admit is not a small task) I at least don't think it's a
clear-cut downtime win to rely on xfs_repair.

> > + */
> > +
> > +/* Set us up to repair reverse mapping btrees. */
> > +int
> > +xfs_repair_rmapbt_setup(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	int				error;
> > +
> > +	/*
> > +	 * Freeze out anything that can lock an inode.  We reconstruct
> > +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> > +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> > +	 * inodes.
> > +	 */
> > +	error = xfs_scrub_fs_freeze(sc);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Check the AG number and set up the scrub context. */
> > +	error = xfs_scrub_setup_fs(sc, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Lock all the AG header buffers so that we can read all the
> > +	 * per-AG metadata too.
> > +	 */
> > +	error = xfs_repair_grab_all_ag_headers(sc);
> > +	if (error)
> > +		return error;
> 
> So if we have thousands of AGs (think PB scale filesystems) then
> we're going to hold many thousands of locked buffers here? Just so we
> can rebuild the rmapbt in one AG?
> 
> What does holding these buffers locked protect us against that
> an active freeze doesn't?

Nothing, since we're frozen.  I think that's been around since before
rmapbt repair learned to freeze the fs.

> > +xfs_repair_rmapbt_new_rmap(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	xfs_agblock_t			startblock,
> > +	xfs_extlen_t			blockcount,
> > +	uint64_t			owner,
> > +	uint64_t			offset,
> > +	unsigned int			flags)
> > +{
> > +	struct xfs_repair_rmapbt_extent	*rre;
> > +	int				error = 0;
> > +
> > +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> > +			startblock, blockcount, owner, offset, flags);
> > +
> > +	if (xfs_scrub_should_terminate(rr->sc, &error))
> > +		return error;
> > +
> > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > +	if (!rre)
> > +		return -ENOMEM;
> 
> This seems like a likely thing to happen given the "no reclaim"
> state of the filesystem and the memory demand a rmapbt rebuild
> can have. If we've got GBs of rmap info in the AG that needs to be
> rebuilt, how much RAM are we going to need to index it all as we
> scan the filesystem?

More than I'd like -- at least 24 bytes per record (which at least is no
larger than the size of the on-disk btree) plus a list_head until I can
move the repairers away from creating huge lists.
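
For reference, the incore record in question is roughly this shape
(sizes assume a 64-bit kernel):

struct xfs_repair_rmapbt_extent {
	struct list_head	list;	/* 16 bytes of linkage overhead */
	struct xfs_rmap_irec	rmap;	/* startblock, blockcount, owner,
					 * offset, flags: the ~24 bytes of
					 * actual record data */
};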

> > +xfs_repair_rmapbt_scan_ifork(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	struct xfs_inode		*ip,
> > +	int				whichfork)
> > +{
> > +	struct xfs_bmbt_irec		rec;
> > +	struct xfs_iext_cursor		icur;
> > +	struct xfs_mount		*mp = rr->sc->mp;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	struct xfs_ifork		*ifp;
> > +	unsigned int			rflags;
> > +	int				fmt;
> > +	int				error = 0;
> > +
> > +	/* Do we even have data mapping extents? */
> > +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +	switch (fmt) {
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> > +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> > +			if (error)
> > +				return error;
> > +		}
> 
> Ok, so we need inodes locked to do this....
> 
> ....
> > +/* Iterate all the inodes in an AG group. */
> > +STATIC int
> > +xfs_repair_rmapbt_scan_inobt(
> > +	struct xfs_btree_cur		*cur,
> > +	union xfs_btree_rec		*rec,
> > +	void				*priv)
> > +{
> > +	struct xfs_inobt_rec_incore	irec;
> > +	struct xfs_repair_rmapbt	*rr = priv;
> > +	struct xfs_mount		*mp = cur->bc_mp;
> > +	struct xfs_inode		*ip = NULL;
> > +	xfs_ino_t			ino;
> > +	xfs_agino_t			agino;
> > +	int				chunkidx;
> > +	int				lock_mode = 0;
> > +	int				error = 0;
> > +
> > +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> > +
> > +	for (chunkidx = 0, agino = irec.ir_startino;
> > +	     chunkidx < XFS_INODES_PER_CHUNK;
> > +	     chunkidx++, agino++) {
> > +		bool	inuse;
> > +
> > +		/* Skip if this inode is free */
> > +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> > +			continue;
> > +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> > +
> > +		/* Back off and try again if an inode is being reclaimed */
> > +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> > +				&inuse);
> > +		if (error == -EAGAIN)
> > +			return -EDEADLOCK;
> 
> And we can get inode access errors here.....
> 
> FWIW, how is the inode being reclaimed if the filesystem is frozen?

Memory reclaim from some other process?  Maybe? :)

I'll do some research to learn if other processes can get into inode
reclaim on our frozen fs.

> > +
> > +		/*
> > +		 * Grab inode for scanning.  We cannot use DONTCACHE here
> > +		 * because we already have a transaction so the iput must not
> > +		 * trigger inode reclaim (which might allocate a transaction
> > +		 * to clean up posteof blocks).
> > +		 */
> > +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);
> 
> So if there are enough inodes in the AG, we'll run out of memory
> here because we aren't reclaiming inodes from the cache but instead
> putting them all on the deferred iput list?

Enough inodes in the AG that also have post-eof blocks or cow blocks to
free.  If we don't have to do any inactive work then they just go away.

> > +		if (error)
> > +			return error;
> > +		trace_xfs_scrub_iget(ip, __this_address);
> > +
> > +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> > +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> > +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> > +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> > +			lock_mode = XFS_ILOCK_EXCL;
> > +		else
> > +			lock_mode = XFS_ILOCK_SHARED;
> > +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> > +			error = -EBUSY;
> > +			goto out_rele;
> > +		}
> 
> And in what situation do we get inodes stuck with the ilock held on
> frozen filesystems?

I think we do not, and that this is a relic of pre-freeze rmap repair.

> ....
> 
> > +out_unlock:
> > +	xfs_iunlock(ip, lock_mode);
> > +out_rele:
> > +	iput(VFS_I(ip));
> > +	return error;
> 
> calling iput in the error path is a bug - it will trigger all the
> paths you're trying to avoid by using the deferred iput list.

Oops, that should be our xfs_scrub_repair_iput.

> ....
> 
> 
> > +/* Collect rmaps for all block mappings for every inode in this AG. */
> > +STATIC int
> > +xfs_repair_rmapbt_generate_aginode_rmaps(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	xfs_agnumber_t			agno)
> > +{
> > +	struct xfs_scrub_context	*sc = rr->sc;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_buf			*agi_bp;
> > +	int				error;
> > +
> > +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> > +	if (error)
> > +		return error;
> > +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> > +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);
> 
> So if we get a locked or reclaiming inode anywhere in the
> filesystem we see EDEADLOCK/EBUSY here without having scanned all
> the inodes in the AG, right?

Right.

> > +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > +	xfs_trans_brelse(sc->tp, agi_bp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> > + * blocks, and the new btreeblks count.  Figure out if we have enough free
> > + * space to reconstruct the inode btrees.  The caller must clean up the lists
> > + * if anything goes wrong.
> > + */
> > +STATIC int
> > +xfs_repair_rmapbt_find_rmaps(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*rmap_records,
> > +	xfs_agblock_t			*new_btreeblks)
> > +{
> > +	struct xfs_repair_rmapbt	rr;
> > +	xfs_agnumber_t			agno;
> > +	int				error;
> > +
> > +	rr.rmaplist = rmap_records;
> > +	rr.sc = sc;
> > +	rr.nr_records = 0;
> > +
> > +	/* Generate rmaps for AG space metadata */
> > +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Iterate all AGs for inodes rmaps. */
> > +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> > +		if (error)
> > +			return error;
> 
> And that means we abort here....
> 
> > +/* Repair the rmap btree for some AG. */
> > +int
> > +xfs_repair_rmapbt(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct list_head		rmap_records;
> > +	xfs_extlen_t			new_btreeblks;
> > +	int				log_flags = 0;
> > +	int				error;
> > +
> > +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> > +
> > +	/* Collect rmaps for all AG headers. */
> > +	INIT_LIST_HEAD(&rmap_records);
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> > +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> > +	if (error)
> > +		goto out;
> 
> And we drop out here. So, essentially, any ENOMEM, locked inode or
> inode in reclaim anywhere in the filesystem will prevent rmap
> rebuild. Which says to me that rebuilding the rmap on any
> substantial filesystem is likely to fail.
> 
> Which brings me back to my original question: why attempt to do
> rmap rebuild online given how complex it is, the performance
> implications of a full filesystem scan per AG that needs rebuild,
> and all the ways it could easily fail?

Right.  If we run out of memory or hit a locked/in-reclaim inode we'll
bounce back out to userspace having not touched anything.  Userspace can
decide if it wants to migrate or shut down other services and retry the
online scrub, or if it wants to unmount and run xfs_repair instead.
My mid-term goal for the repair patchset is to minimize the memory use
and the amount of time we spend in the freezer, but for that I need to
add a few more (largeish) tools to XFS, and I am trying to avoid
snowballing a ton of code in front of repair. :)
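
In code, the caller's side of that contract might look like this --
a sketch only, reusing the hypothetical scrub_rmapbt() helper from
the earlier sketch:

#include <errno.h>

static void
try_online_rmapbt_repair(int fd, unsigned int agno)
{
	if (scrub_rmapbt(fd, agno, 1) == 0)
		return;		/* repaired; nothing more to do */
	if (errno == ENOMEM || errno == EDEADLK || errno == EBUSY) {
		/*
		 * Repair bailed out before modifying anything.
		 * Quiesce other services and retry, or schedule an
		 * unmount and an offline xfs_repair run instead.
		 */
	}
}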

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Carlos Maiolino July 4, 2018, 8:44 a.m. UTC | #3
> [some of this Dave and I discussed on IRC, so I'll summarize for
> everyone else here...]
> 
> For this initial v0 iteration of the rmap repair code, yes, we have to
> freeze the fs and iterate everything.  However, unless your computer and
> storage are particularly untrustworthy, rmapbt reconstruction should be
> a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
> has to opt-in to slow repairs, and presumably it could choose instead to
> unmount and run xfs_repair if that's too dear or there are too many
> broken AGs, etc.  More on that later.
> 
> In the long run I don't see a need to freeze the filesystem to scan
> every inode for bmbt entries in the damaged AG.  In fact, we can improve
> the performance of all the AG repair functions in general with the
> scheme I'm about to outline:
> 
> Create a "shut down this AG" primitive.  Once set, block and inode
> allocation routines will bypass this AG.  Unlinked inodes are moved to
> the unlinked list to avoid touching as much of the AGI as we practically
> can.  Unmapped/freed blocks can be moved to a hidden inode (in another
> AG) to be freed later.  Growfs operation in that AG can be rejected.
> 

Does it mean that new block allocation requests for inodes already existing in
the frozen AG will block until the AG is thawed, or these block allocations
will be redirected to another AG? I'm just asking because in either case, we
should document it well. The repair case is certainly (or should be) a rare
case, but if there is any heavy workload going on on the frozen AG, and we
redirect it to another AG, it can end up heavily fragmenting the files on the
frozen AG.
So, I wonder if any operation to the AG under repair should actually be blocked
too?

Cheers
Darrick J. Wong July 4, 2018, 6:40 p.m. UTC | #4
On Wed, Jul 04, 2018 at 10:44:38AM +0200, Carlos Maiolino wrote:
> > [some of this Dave and I discussed on IRC, so I'll summarize for
> > everyone else here...]
> > 
> > For this initial v0 iteration of the rmap repair code, yes, we have to
> > freeze the fs and iterate everything.  However, unless your computer and
> > storage are particularly untrustworthy, rmapbt reconstruction should be
> > a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
> > has to opt-in to slow repairs, and presumably it could choose instead to
> > unmount and run xfs_repair if that's too dear or there are too many
> > broken AGs, etc.  More on that later.
> > 
> > In the long run I don't see a need to freeze the filesystem to scan
> > every inode for bmbt entries in the damaged AG.  In fact, we can improve
> > the performance of all the AG repair functions in general with the
> > scheme I'm about to outline:
> > 
> > Create a "shut down this AG" primitive.  Once set, block and inode
> > allocation routines will bypass this AG.  Unlinked inodes are moved to
> > the unlinked list to avoid touching as much of the AGI as we practically
> > can.  Unmapped/freed blocks can be moved to a hidden inode (in another
> > AG) to be freed later.  Growfs operation in that AG can be rejected.
> > 
> 
> Does it mean that new block allocation requests for inodes already existing in
> the frozen AG will block until the AG is thawed, or these block allocations
> will be redirected to another AG? I'm just asking because in either case, we
> should document it well. The repair case is certainly (or should be) a rare
> case, but if there is any heavy workload going on on the frozen AG, and we
> redirect it to another AG, it can end up heavily fragmenting the files on the
> frozen AG.
> So, I wonder if any operation to the AG under repair should actually be blocked
> too?

I don't think that will be possible for rmapbt repair -- we need to be
able to take locks in the wrong order (agf -> inodes) without
deadlocking with a regular operation that's blocked on the AG (inodes ->
agf).  The freezer mechanism eliminates the deadlock possibility by
eliminating the regular IO paths, so this proposed AG shutdown would
also have to protect against that by absorbing operations.

--D

> Cheers
> 
> 
> -- 
> Carlos
Dave Chinner July 4, 2018, 11:21 p.m. UTC | #5
On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Rebuild the reverse mapping btree from all primary metadata.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > ....
> > 
> > > +static inline int xfs_repair_rmapbt_setup(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	/* We don't support rmap repair, but we can still do a scan. */
> > > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > > +}
> > 
> > This comment seems at odds with the commit message....
> 
> This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.

Ok, that wasn't clear from the patch context.

> > > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > > + * else in XFS we lock inodes and then AG data structures, but generating the
> > > + * list of rmap records requires that we be able to scan both block mapping
> > > + * btrees of every inode in the filesystem to see if it owns any extents in
> > > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > > + * special privileges to run transactions with regular background reclamation
> > > + * turned off.
> > 
> > Hmmm. This implies we are going to scan the entire filesystem for
> > every AG we need to rebuild the rmap tree in. That seems like an
> > awful lot of work if there's more than one rmap btree that needs
> > rebuild.
> 
> [some of this Dave and I discussed on IRC, so I'll summarize for
> everyone else here...]

....

> > Given that we've effectively got to shut down access to the
> > filesystem for the entire rmap rebuild while we do an entire
> > filesystem scan, why would we do this online? It's going to be
> > faster to do this rebuild offline (because of all the prefetching,
> > rebuilding all AG trees from the state gathered in the full
> > filesystem passes, etc) and we don't have to hack around potential
> > transaction and memory reclaim deadlock situations, either?
> > 
> > So why do rmap rebuilds online at all?
> 
> The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
> 3 while it checks everything.  By the time it gets to rmapbt repairs
> towards the end of phase 4 (if there's enough memory) those blocks will
> still be in cache and online repair doesn't have to wait for the disk.

Therein lies the problem: "if there's enough memory". If there's
enough memory to cache all the filesystem metadata, track all the
bits repair needs to track, and there's no other memory pressure
then it will hit the cache. But populating that cache is still going
to be slower than an offline repair because IO patterns (see below)
and there is competing IO from other work being done on the
system (i.e. online repair competes for IO resources and memory
resources).

As such, I don't see that we're going to have everything we need
cached for any significantly sized or busy filesystem, and that
means we actually have to care about how much IO online repair
algorithms require. We also have to take into account that much of
that IO is going to be synchronous single metadata block reads.
This will be a limitation on any sort of high IO latency storage
(spinning rust, network based block devices, slow SSDs, etc).

> If instead you unmount and run xfs_repair then xfs_repair has to reload
> all that metadata and recheck it, all of which happens with the fs
> offline.

xfs_repair has all sorts of concurrency and prefetching
optimisations that allow it to scan and process metadata orders of
magnitude faster than online repair, especially on slow storage.
i.e. online repair is going to be IO seek bound, while offline
repair is typically IO bandwidth and/or CPU bound.  Offline repair
can do full filesystem metadata scans measured in GB/s; as long as
online repair does serialised synchronous single structure walks it
will be orders of magnitude slower than an offline repair.

> So except for the extra complexity of avoiding deadlocks (which I
> readily admit is not a small task) I at least don't think it's a
> clear-cut downtime win to rely on xfs_repair.

Back then - as it is still now - I couldn't see how the IO load
required by synchronous full filesystem scans one structure at a
time was going to reduce filesystem downtime compared to an offline
repair doing optimised "all metadata types at once" concurrent
linear AG scans.

Keep in mind that online repair will never guarantee that it can fix
all problems, so we're always going to have to offline repair.  What
we want to achieve is minimising downtime for users when a repair is
required. With the above IO limitations in mind, I've always
considered that online repair would just be for all the simple,
quick, easy to fix stuff, because complex stuff that required huge
amounts of RAM and full filesystem scans to resolve would always be
done faster offline.

That's why I think that offline repair will be a better choice for
users for the forseeable future if repairing the damage requires
full filesystem metadata scans.

> > > +
> > > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > > +	if (!rre)
> > > +		return -ENOMEM;
> > 
> > This seems like a likely thing to happen given the "no reclaim"
> > state of the filesystem and the memory demand a rmapbt rebuild
> > can have. If we've got GBs of rmap info in the AG that needs to be
> > rebuilt, how much RAM are we going to need to index it all as we
> > scan the filesystem?
> 
> More than I'd like -- at least 24 bytes per record (which at least is no
> larger than the size of the on-disk btree) plus a list_head until I can
> move the repairers away from creating huge lists.

Ok, it kinda sounds a bit like we need to be able to create the new
btree on the fly, rather than as a single operation at the end. e.g.
if the list builds up to, say, 100k records, we push them into the
new tree and can free them. e.g. can we iteratively build the new
tree on disk as we go, then do a root block swap at the end to
switch from the old tree to the new tree?

If that's a potential direction, then maybe we can look at this as a
future direction? It also leads to the possibility of
pausing/continuing repair from where the last chunk of records were
processed, so if we do run out of memory we don't have to start from
the beginning again?
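
Roughly something like the following -- all names hypothetical, just
to pin down the shape of the idea:

#define RMAP_FLUSH_THRESHOLD	100000	/* max records held in memory */

struct rmap_rebuild {
	unsigned int	nr_accumulated;	/* records not yet flushed */
	/* record list, staging btree cursor, resume cookie, ... */
};

/* Hypothetical helpers; nothing like them exists yet. */
int rmap_record_append(struct rmap_rebuild *rr, const void *rec);
int rmap_staging_flush(struct rmap_rebuild *rr);

static int
rmap_collect_one(struct rmap_rebuild *rr, const void *rec)
{
	int error = rmap_record_append(rr, rec);

	if (error)
		return error;
	if (++rr->nr_accumulated < RMAP_FLUSH_THRESHOLD)
		return 0;
	/* Push this batch into the new on-disk tree, free the memory. */
	rr->nr_accumulated = 0;
	return rmap_staging_flush(rr);
}
/* ...and at the end, swap the staging tree's root into the AGF. */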

Cheers,

Dave.
Darrick J. Wong July 5, 2018, 3:48 a.m. UTC | #6
On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Rebuild the reverse mapping btree from all primary metadata.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > ....
> > > 
> > > > +static inline int xfs_repair_rmapbt_setup(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	/* We don't support rmap repair, but we can still do a scan. */
> > > > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > > > +}
> > > 
> > > This comment seems at odds with the commit message....
> > 
> > This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.
> 
> Ok, that wasn't clear from the patch context.

I'll add a noisier comment in that section about ...

/*
 * Compatibility shims for CONFIG_XFS_ONLINE_REPAIR=n.
 */

> > > > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > > > + * else in XFS we lock inodes and then AG data structures, but generating the
> > > > + * list of rmap records requires that we be able to scan both block mapping
> > > > + * btrees of every inode in the filesystem to see if it owns any extents in
> > > > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > > > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > > > + * special privileges to run transactions with regular background reclamation
> > > > + * turned off.
> > > 
> > > Hmmm. This implies we are going to scan the entire filesystem for
> > > every AG we need to rebuild the rmap tree in. That seems like an
> > > awful lot of work if there's more than one rmap btree that needs
> > > rebuild.
> > 
> > [some of this Dave and I discussed on IRC, so I'll summarize for
> > everyone else here...]
> 
> ....
> 
> > > Given that we've effectively got to shut down access to the
> > > filesystem for the entire rmap rebuild while we do an entire
> > > filesystem scan, why would we do this online? It's going to be
> > > faster to do this rebuild offline (because of all the prefetching,
> > > rebuilding all AG trees from the state gathered in the full
> > > filesystem passes, etc) and we don't have to hack around potential
> > > transaction and memory reclaim deadlock situations, either?
> > > 
> > > So why do rmap rebuilds online at all?
> > 
> > The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
> > 3 while it checks everything.  By the time it gets to rmapbt repairs
> > towards the end of phase 4 (if there's enough memory) those blocks will
> > still be in cache and online repair doesn't have to wait for the disk.
> 
> Therein lies the problem: "if there's enough memory". If there's
> enough memory to cache all the filesystem metadata, track all the
> bits repair needs to track, and there's no other memory pressure
> then it will hit the cache.

Fair enough, it is true that the memory requirements of online repair
are high (and higher than they could be), though online repair only
requires enough memory to store all the incore rmap records for the AG
it's repairing, whereas xfs_repair will try to store all rmaps for the
entire filesystem.  It's not a straightforward comparison since
xfs_repair memory can be swapped and its slabs are much more efficient
than the linked lists that online repair has to use, but some of this
seems solvable.

> But populating that cache is still going to be slower than an offline
> repair because IO patterns (see below) and there is competing IO from
> other work being done on the system (i.e. online repair competes for
> IO resources and memory resources).
>
> As such, I don't see that we're going to have everything we need
> cached for any significantly sized or busy filesystem, and that
> means we actually have to care about how much IO online repair
> algorithms require. We also have to take into account that much of
> that IO is going to be synchronous single metadata block reads.
> This will be a limitation on any sort of high IO latency storage
> (spinning rust, network based block devices, slow SSDs, etc).

TBH I'm not sure online repair is a good match for spinning rust, nbd,
or usb sticks since there's a general presumption that it won't consume
so much storage time that everything else sees huge latency spikes.
Repairs will cause much larger spikes but again should be infrequent and
still an improvement over the fs exploding.

> > If instead you unmount and run xfs_repair then xfs_repair has to reload
> > all that metadata and recheck it, all of which happens with the fs
> > offline.
> 
> xfs_repair has all sorts of concurrency and prefetching
> optimisations that allow it to scan and process metadata orders of
> magnitude faster than online repair, especially on slow storage.
> i.e. online repair is going to be IO seek bound, while offline
> repair is typically IO bandwidth and/or CPU bound.  Offline repair
> can do full filesystem metadata scans measured in GB/s; as long as
> online repair does serialised synchronous single structure walks it
> will be orders of magnitude slower than an offline repair.

My long view of online repair is that it satisfies a different set of
constraints than xfs_repair.  Offline repair needs to be as efficient
with resources (particularly CPU) as it can be, because the fs is 100%
offline while it runs.  It needs to run in as little time as possible
and we optimize heavily for that.

Compare this to the situation facing online repair -- we've sharded the
FS into a number of allocation groups, and (in the future) we can take
an AG offline to repair the rmapbt.  Assuming there's sufficient free
space in the surviving AGs that the filesystem can handle everything
asked of it, there are users who prefer to lose 1/4 (or 1/32, or
whatever) of the FS for 10x the time that they would lose all of it.

Even if online repair is bound by uncached synchronous random reads,
it can still be a win.  I have a few customers in mind who have told me
that losing pieces of the storage for long(er) periods of time are
easier for them to handle than all of it, even for a short time.

> > So except for the extra complexity of avoiding deadlocks (which I
> > readily admit is not a small task) I at least don't think it's a
> > clear-cut downtime win to rely on xfs_repair.
> 
> Back then - as it is still now - I couldn't see how the IO load
> required by synchronous full filesystem scans one structure at a
> time was going to reduce filesystem downtime compared to an offline
> repair doing optimised "all metadata types at once" concurrent
> linear AG scans.

<nod> For the scrub half of this story I deliberately avoided anything
that required hard fs downtime, at least until we got to the
fscounters thing.  I'm hoping that background scrubs will be gentle
enough that they will be a better option than what ext4 online scrub does
(which is to say it creates lvm snapshots and fscks them).

> Keep in mind that online repair will never guarantee that it can fix
> all problems, so we're always going to have to offline repair.  What
> we want to achieve is minimising downtime for users when a repair is
> required. With the above IO limitations in mind, I've always
> considered that online repair would just be for all the simple,
> quick, easy to fix stuff, because complex stuff that required huge
> amounts of RAM and full filesystem scans to resolve would always be
> done faster offline.
>
> That's why I think that offline repair will be a better choice for
> users for the forseeable future if repairing the damage requires
> full filesystem metadata scans.

That might be, but how else can we determine that than to merge it under
an experimental banner, iterate it for a while, and try to persuade a
wider band of users than ourselves to try it on a non-production system
to see if it better solves their problems? :)

At worst, if future us decide that we will never figure out how to make
online rmapbt repair robust I'm fully prepared to withdraw it.

> > > > +
> > > > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > > > +	if (!rre)
> > > > +		return -ENOMEM;
> > > 
> > > This seems like a likely thing to happen given the "no reclaim"
> > > state of the filesystem and the memory demand a rmapbt rebuild
> > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > rebuilt, how much RAM are we going to need to index it all as we
> > > scan the filesystem?
> > 
> > More than I'd like -- at least 24 bytes per record (which at least is no
> > larger than the size of the on-disk btree) plus a list_head until I can
> > move the repairers away from creating huge lists.
> 
> Ok, it kinda sounds a bit like we need to be able to create the new
> btree on the fly, rather than as a single operation at the end. e.g.
> if the list builds up to, say, 100k records, we push them into the
> new tree and can free them. e.g. can we iteratively build the new
> tree on disk as we go, then do a root block swap at the end to
> switch from the old tree to the new tree?

Seeing as we have a bunch of storage at our disposal, I think we could
push the records out to disk (or just write them straight into a new
btree) when we detect low memory conditions or consume more than some
threshold of memory.  For v0 I was focusing on getting it to work at all
even with a largeish memory cost. :)

> If that's a potential direction, then maybe we can look at this as a
> future direction? It also leads to the possibility of
> pausing/continuing repair from where the last chunk of records were
> processed, so if we do run out of memory we don't have to start from
> the beginning again?

That could be difficult to do in practice -- we'll continue to
accumulate deferred frees for that AG in the meantime.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner July 5, 2018, 7:03 a.m. UTC | #7
On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > Keep in mind that online repair will never guarantee that it can fix
> > all problems, so we're always going to have to offline repair.  What
> > we want to achieve is minimising downtime for users when a repair is
> > required. With the above IO limitations in mind, I've always
> > considered that online repair would just be for all the simple,
> > quick, easy to fix stuff, because complex stuff that required huge
> > amounts of RAM and full filesystem scans to resolve would always be
> > done faster offline.
> >
> > That's why I think that offline repair will be a better choice for
> > users for the forseeable future if repairing the damage requires
> > full filesystem metadata scans.
> 
> That might be, but how else can we determine that than to merge it under
> an experimental banner, iterate it for a while, and try to persuade a
> wider band of users than ourselves to try it on a non-production system
> to see if it better solves their problems? :)

I think "deploy and idea to users and see if they have problems" is
a horrible development model. That's where btrfs went off the rails
almost 10 years ago....

That aside, I'd like to make sure we're on the same page - we don't
need online repair to fix /everything/ before it gets merged. Even
without rmap rebuild, it will still be usable and useful to the vast
majority of users wanting the filesystem to self repair the typical
one-off corruption problems filesystems encounter during their
typical production life.

What I'm worried about is that the online repair algorithms have a
fundamental dependence on rmap lookups to avoid the need to build the
global cross-references needed to isolate and repair corruptions.
This use of rmap, from one angle, can be seen as a performance
optimisation, but as I've come understand more of the algorithms the
online repair code uses, it looks more like an intractable problem
than one of performance optimisation.

That is, if the rmap is corrupt, or we even suspect that it's
corrupt, then we cannot use it to validate the state and contents of
the other on-disk structures. We could propagate rmap corruptions to
those other structures and it would go undetected. And because we
can't determine the validity of all the other structures, the only
way we can correctly repair the filesystem is to build a global
cross-reference, resolve all the inconsistencies, and then rewrite
all the structures. i.e. we need to do what xfs_repair does in phase
3, 4 and 5.

IOWs, ISTM that the online scrub/repair algorithms break down in the
face of rmap corruption and it is correctable only by building a
global cross-reference which we can then reconcile and use to
rebuild all the per-AG metadata btrees in the filesystem. Trying to
do it online via scanning of unverifiable structures risks making
the problem worse and there is a good possibility that we can't
repair the inconsistencies in a single pass.

So at this point, I'd prefer that we stop, discuss and work out a
solution to the rmap rebuild problem rather than merge a rmap
rebuild algorithm prematurely. This doesn't need to hold up any of
the other online repair functionality - none of that is dependent on
this particular piece of the puzzle being implemented.

> At worst, if future us decide that we will never figure out how to make
> online rmapbt repair robust I'm fully prepared to withdraw it.

Merging new features before sorting out the fundamental principles,
behaviours and algorithms of those features is how btrfs ended up
with a pile of stuff that doesn't quite work which no-one
understands quite well enough to fix.

> > > > This seems like a likely thing to happen given the "no reclaim"
> > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > scan the filesystem?
> > > 
> > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > larger than the size of the on-disk btree) plus a list_head until I can
> > > move the repairers away from creating huge lists.
> > 
> > Ok, it kinda sounds a bit like we need to be able to create the new
> > btree on the fly, rather than as a single operation at the end. e.g.
> > if the list builds up to, say, 100k records, we push them into the
> > new tree and can free them. e.g. can we iteratively build the new
> > tree on disk as we go, then do a root block swap at the end to
> > switch from the old tree to the new tree?
> 
> Seeing as we have a bunch of storage at our disposal, I think we could
> push the records out to disk (or just write them straight into a new
> btree) when we detect low memory conditions or consume more than some
> threshold of memory.

*nod* That's kinda along the lines of what I was thinking of.

> For v0 I was focusing on getting it to work at all
> even with a largeish memory cost. :)

Right, I'm not suggesting that this needs to be done before the
initial merge - I just want to make sure we don't back ourselves
into a corner we can't get out of.
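
For scale, the record the patch currently queues up for every reverse
mapping is a list_head plus the incore rmap; annotated with 64-bit
sizes (back-of-the-envelope only, ignoring slab rounding):

struct xfs_repair_rmapbt_extent {
	struct list_head	list;	/* two pointers: 16 bytes */
	struct xfs_rmap_irec	rmap;	/* 4+4+8+8+4 bytes, padded to 32 */
};

/*
 * Roughly 48 bytes per rmap record before allocator overhead, so an
 * AG holding 100 million rmaps would pin on the order of 4.5GB of
 * RAM for the duration of the rebuild.
 */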

> > If that's a potential direction, then maybe we can look at this as a
> > future direction? It also leads to the possibility of
> > pausing/continuing repair from where the last chunk of records were
> > processed, so if we do run out of memory we don't have to start from
> > the beginning again?
> 
> That could be difficult to do in practice -- we'll continue to
> accumulate deferred frees for that AG in the meantime.

I think it depends on the way we record deferred frees, but it's a
bit premature to be discussing this right now :P

Cheers,

Dave.
Darrick J. Wong July 6, 2018, 12:47 a.m. UTC | #8
On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > Keep in mind that online repair will never guarantee that it can fix
> > > all problems, so we're always going to need offline repair. What
> > > we want to achieve is minimising downtime for users when a repair is
> > > required. With the above IO limitations in mind, I've always
> > > considered that online repair would just be for all the simple,
> > > quick, easy to fix stuff, because complex stuff that required huge
> > > amounts of RAM and full filesystem scans to resolve would always be
> > > done faster offline.
> > >
> > > That's why I think that offline repair will be a better choice for
> > > users for the forseeable future if repairing the damage requires
> > > full filesystem metadata scans.
> > 
> > That might be, but how else can we determine that other than to merge it under
> > an experimental banner, iterate it for a while, and try to persuade a
> > wider band of users than ourselves to try it on a non-production system
> > to see if it better solves their problems? :)
> 
> I think "deploy and idea to users and see if they have problems" is
> a horrible development model. That's where btrfs went off the rails
> almost 10 years ago....

I would only test-deploy to a very small and carefully selected group of
tightly-controlled internal users.  For everyone else, it hides behind a
default=N kconfig option and we disavow any support of EXPERIMENTAL
features. :)

> That aside, I'd like to make sure we're on the same page - we don't
> need online repair to fix /everything/ before it gets merged. Even

Indeed.  I'm fine with merging the non-freezing repairers in their
current form.  I regularly run the fuzz xfstests against the repair
tools to compare the effectiveness of xfs_scrub and xfs_repair.  They both have
deficiencies here and there, which will keep us busy for years to come.
:P

> without rmap rebuild, it will still be usable and useful to the vast
> majority of users wanting the filesystem to self repair the typical
> one-off corruption problems filesystems encounter during their
> production life.
> 
> What I'm worried about is that the online repair algorithms have a
> fundamental dependence on rmap lookups to avoid the need to build the
> global cross-references needed to isolate and repair corruptions.

Yes, the primary metadata repairers are absolutely dependent on the
secondary data to be faster than a full scan, and repairing secondary
metadata was never going to be easy to do online.
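
To make that dependence concrete, here is a minimal sketch (purely
illustrative, not from the patch) of the kind of query a primary
repairer leans on: a single ordered rmapbt walk yields every block
that a given owner holds in the AG.

/* Count the blocks the rmapbt attributes to the inobt. */
STATIC int
xfs_repair_example_count_inobt_blocks(
	struct xfs_btree_cur		*cur,
	struct xfs_rmap_irec		*rec,
	void				*priv)
{
	xfs_extlen_t			*blocks = priv;

	if (rec->rm_owner == XFS_RMAP_OWN_INOBT)
		*blocks += rec->rm_blockcount;
	return 0;
}

/* ...then: error = xfs_rmap_query_all(cur, xfs_repair_example_count_inobt_blocks, &blocks); */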

> This use of rmap, from one angle, can be seen as a performance
> optimisation, but as I've come to understand more of the algorithms the
> online repair code uses, it looks more like an intractable problem
> than one of performance optimisation.
> 
> That is, if the rmap is corrupt, or we even suspect that it's
> corrupt, then we cannot use it to validate the state and contents of
> the other on-disk structures. We could propagate rmap corruptions to
> those other structures and it would go undetected. And because we
> can't determine the validity of all the other structures, the only
> way we can correctly repair the filesystem is to build a global
> cross-reference, resolve all the inconsistencies, and then rewrite
> all the structures. i.e. we need to do what xfs_repair does in phases
> 3, 4 and 5.
> 
> IOWs, ISTM that the online scrub/repair algorithms break down in the
> face of rmap corruption and it is correctable only by building a
> global cross-reference which we can then reconcile and use to
> rebuild all the per-AG metadata btrees in the filesystem. Trying to
> do it online via scanning of unverifiable structures risks making
> the problem worse and there is a good possibility that we can't
> repair the inconsistencies in a single pass.

Yeah.  I've pondered whether or not the primary repairers ought to
require at least a quick rmap scan before they touch anything, but up
till now I've preferred to keep that knowledge in xfs_scrub.
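
A minimal sketch of such a gate, assuming it simply reuses the
existing rmapbt scrubber and its corruption flag (nothing like this
is in the patch; the helper name is made up):

STATIC int
xfs_repair_rmapbt_precheck(
	struct xfs_scrub_context	*sc)
{
	int				error;

	/* Cross-check the rmapbt before trusting it to guide a repair. */
	error = xfs_scrub_rmapbt(sc);
	if (error)
		return error;

	/* If the rmapbt itself looks bad, punt to offline repair. */
	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
		return -EOPNOTSUPP;
	return 0;
}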

> So at this point, I'd prefer that we stop, discuss and work out a
> solution to the rmap rebuild problem rather than merge a rmap
> rebuild algorithm prematurely. This doesn't need to hold up any of
> the other online repair functionality - none of that is dependent on
> this particular piece of the puzzle being implemented.

I didn't think it would hold up the other repairers.  I always knew
that the rmapbt repair was going to be a tough project. :)

So, seeing as you're about to head off on vacation, let's set a rough
time to discuss the rmap rebuild problem in detail once you're back.
Does that sound good?

> > At worst, if future us decide that we will never figure out how to make
> > online rmapbt repair robust, I'm fully prepared to withdraw it.
> 
> Merging new features before sorting out the fundamental principles,
> behaviours and algorithms of those features is how btrfs ended up
> with a pile of stuff that doesn't quite work and that no-one
> understands well enough to fix.

Ok, ok, it can stay out of tree until we're all fairly confident it'll
survive our bombardment tests.  I do have a test, xfs/422, to
simulate repairing an rmapbt under heavy load by running fsstress,
fsfreeze, and xfs_io injecting and force-repairing rmapbts in a loop.
So far it hasn't blown up or failed to regenerate the rmapbt...

> > > > > This seems like a likely thing to happen given the "no reclaim"
> > > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > > scan the filesystem?
> > > > 
> > > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > > larger than the size of the on-disk btree) plus a list_head until I can
> > > > move the repairers away from creating huge lists.
> > > 
> > > Ok, it kinda sounds a bit like we need to be able to create the new
> > > btree on the fly, rather than as a single operation at the end. e.g.
> > > if the list builds up to, say, 100k records, we push them into the
> > > new tree and then free them. e.g. can we iteratively build the new
> > > tree on disk as we go, then do a root block swap at the end to
> > > switch from the old tree to the new tree?
> > 
> > Seeing as we have a bunch of storage at our disposal, I think we could
> > push the records out to disk (or just write them straight into a new
> > btree) when we detect low memory conditions or consume more than some
> > threshold of memory.
> 
> *nod* That's kinda along the lines of what I was thinking of.
> 
> > For v0 I was focusing on getting it to work at all
> > even with a largeish memory cost. :)
> 
> Right, I'm not suggesting that this needs to be done before the
> initial merge - I just want to make sure we don't back ourselves
> into a corner we can't get out of.

Agreed.  I don't want to get stuck in a corner either, but I don't know
that I know where all those corners are. :)
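
A sketch of that threshold-flush idea on top of the patch's existing
collection context; the flush helper is hypothetical and the constant
is just the number floated above:

#define XFS_REPAIR_RMAPBT_MAX_RECORDS	(100000)

/* Push collected records out of memory once the list grows too big. */
static int
xfs_repair_rmapbt_maybe_flush(
	struct xfs_repair_rmapbt	*rr)
{
	if (rr->nr_records < XFS_REPAIR_RMAPBT_MAX_RECORDS)
		return 0;

	/*
	 * Hypothetical helper: sort what we have, append it to on-disk
	 * staging storage (or straight into the half-built btree), then
	 * free the list entries and reset nr_records.
	 */
	return xfs_repair_rmapbt_flush_to_disk(rr);
}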

> > > If that's a potential direction, then maybe we can look at this as a
> > > future direction? It also leads to the possibility of
> > > pausing/continuing repair from where the last chunk of records were
> > > processed, so if we do run out of memory we don't have to start from
> > > the beginning again?
> > 
> > That could be difficult to do in practice -- we'll continue to
> > accumulate deferred frees for that AG in the meantime.
> 
> I think it depends on the way we record deferred frees, but it's a
> bit premature to be discussing this right now :P

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner July 6, 2018, 1:08 a.m. UTC | #9
On Thu, Jul 05, 2018 at 05:47:48PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> > IOWs, ISTM that the online scrub/repair algorithms break down in the
> > face of rmap corruption and it is correctable only by building a
> > global cross-reference which we can then reconcile and use to
> > rebuild all the per-AG metadata btrees in the filesystem. Trying to
> > do it online via scanning of unverifiable structures risks making
> > the problem worse and there is a good possibility that we can't
> > repair the inconsistencies in a single pass.
> 
> Yeah.  I've pondered whether or not the primary repairers ought to
> require at least a quick rmap scan before they touch anything, but up
> till now I've preferred to keep that knowledge in xfs_scrub.

I think that's fine - scrub is most likely going to be the trigger
to run online repair, anyway.

> > So at this point, I'd prefer that we stop, discuss and work out a
> > solution to the rmap rebuild problem rather than merge a rmap
> > rebuild algorithm prematurely. This doesn't need to hold up any of
> > the other online repair functionality - none of that is dependent on
> > this particular piece of the puzzle being implemented.
> 
> I didn't think it would hold up the other repairers.  I always knew
> that the rmapbt repair was going to be a tough project. :)
> 
> So, seeing as you're about to head off on vacation, let's set a rough
> time to discuss the rmap rebuild problem in detail once you're back.
> Does that sound good?

Yes. Gives us some thinking time, too :P

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 837fd4a95f6f..c71c5deef4c9 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -167,6 +167,7 @@  xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   ialloc_repair.o \
 				   repair.o \
+				   rmap_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 1cdf457e41da..3d9e064147ec 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -96,6 +96,7 @@  int xfs_repair_find_ag_btree_roots(struct xfs_scrub_context *sc,
 void xfs_repair_force_quotacheck(struct xfs_scrub_context *sc, uint dqtype);
 int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
 int xfs_repair_grab_all_ag_headers(struct xfs_scrub_context *sc);
+int xfs_repair_rmapbt_setup(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 /* Metadata repairers */
 
@@ -106,6 +107,7 @@  int xfs_repair_agfl(struct xfs_scrub_context *sc);
 int xfs_repair_agi(struct xfs_scrub_context *sc);
 int xfs_repair_allocbt(struct xfs_scrub_context *sc);
 int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
+int xfs_repair_rmapbt(struct xfs_scrub_context *sc);
 
 #else
 
@@ -127,6 +129,14 @@  xfs_repair_calc_ag_resblks(
 	return 0;
 }
 
+static inline int xfs_repair_rmapbt_setup(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/* We don't support rmap repair, but we can still do a scan. */
+	return xfs_scrub_setup_ag_btree(sc, ip, false);
+}
+
 #define xfs_repair_probe		xfs_repair_notsupported
 #define xfs_repair_superblock		xfs_repair_notsupported
 #define xfs_repair_agf			xfs_repair_notsupported
@@ -134,6 +144,7 @@  xfs_repair_calc_ag_resblks(
 #define xfs_repair_agi			xfs_repair_notsupported
 #define xfs_repair_allocbt		xfs_repair_notsupported
 #define xfs_repair_iallocbt		xfs_repair_notsupported
+#define xfs_repair_rmapbt		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index c6d763236ba7..dd1cccfbb31a 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -24,6 +24,7 @@ 
 #include "scrub/common.h"
 #include "scrub/btree.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
 
 /*
  * Set us up to scrub reverse mapping btrees.
@@ -33,7 +34,10 @@  xfs_scrub_setup_ag_rmapbt(
 	struct xfs_scrub_context	*sc,
 	struct xfs_inode		*ip)
 {
-	return xfs_scrub_setup_ag_btree(sc, ip, false);
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		return xfs_repair_rmapbt_setup(sc, ip);
+	else
+		return xfs_scrub_setup_ag_btree(sc, ip, false);
 }
 
 /* Reverse-mapping scrubber. */
diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c
new file mode 100644
index 000000000000..2ade606060c8
--- /dev/null
+++ b/fs/xfs/scrub/rmap_repair.c
@@ -0,0 +1,1036 @@ 
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Reverse Mapping Btree Repair
+ * ============================
+ *
+ * This is the most involved of all the AG space btree rebuilds.  Everywhere
+ * else in XFS we lock inodes and then AG data structures, but generating the
+ * list of rmap records requires that we be able to scan both block mapping
+ * btrees of every inode in the filesystem to see if it owns any extents in
+ * this AG.  We can't tolerate any inode updates while we do this, so we
+ * freeze the filesystem to lock everyone else out, and grant ourselves
+ * special privileges to run transactions with regular background reclamation
+ * turned off.
+ *
+ * We also have to be very careful not to allow inode reclaim to start a
+ * transaction because all transactions (other than our own) will block.
+ *
+ * So basically we scan all primary per-AG metadata and all block maps of all
+ * inodes to generate a huge list of reverse map records.  Next we look for
+ * gaps in the rmap records to calculate all the unclaimed free space (1).
+ * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
+ * the space used by those btrees from (1), and also subtract the free space
+ * listed in the bnobt from (1).  What's left are the gaps in assigned space
+ * that the new rmapbt knows about but the existing bnobt doesn't; these are
+ * the blocks from the old rmapbt and they can be freed.
+ */
+
+/* Set us up to repair reverse mapping btrees. */
+int
+xfs_repair_rmapbt_setup(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	int				error;
+
+	/*
+	 * Freeze out anything that can lock an inode.  We reconstruct
+	 * the rmapbt by reading inode bmaps with the AGF held, which is
+	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
+	 * inodes.
+	 */
+	error = xfs_scrub_fs_freeze(sc);
+	if (error)
+		return error;
+
+	/* Check the AG number and set up the scrub context. */
+	error = xfs_scrub_setup_fs(sc, ip);
+	if (error)
+		return error;
+
+	/*
+	 * Lock all the AG header buffers so that we can read all the
+	 * per-AG metadata too.
+	 */
+	error = xfs_repair_grab_all_ag_headers(sc);
+	if (error)
+		return error;
+
+	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
+}
+
+struct xfs_repair_rmapbt_extent {
+	struct list_head		list;
+	struct xfs_rmap_irec		rmap;
+};
+
+/* Context for collecting rmaps */
+struct xfs_repair_rmapbt {
+	struct list_head		*rmaplist;
+	struct xfs_scrub_context	*sc;
+	uint64_t			owner;
+	xfs_agblock_t			btblocks;
+	uint64_t			nr_records;
+};
+
+/* Context for calculating old rmapbt blocks */
+struct xfs_repair_rmapbt_freesp {
+	struct xfs_repair_extent_list	rmap_freelist;
+	struct xfs_repair_extent_list	bno_freelist;
+	struct xfs_scrub_context	*sc;
+	xfs_agblock_t			next_bno;
+};
+
+/* Initialize an rmap. */
+static inline int
+xfs_repair_rmapbt_new_rmap(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	uint64_t			owner,
+	uint64_t			offset,
+	unsigned int			flags)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	int				error = 0;
+
+	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
+			startblock, blockcount, owner, offset, flags);
+
+	if (xfs_scrub_should_terminate(rr->sc, &error))
+		return error;
+
+	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
+	if (!rre)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&rre->list);
+	rre->rmap.rm_startblock = startblock;
+	rre->rmap.rm_blockcount = blockcount;
+	rre->rmap.rm_owner = owner;
+	rre->rmap.rm_offset = offset;
+	rre->rmap.rm_flags = flags;
+	list_add_tail(&rre->list, rr->rmaplist);
+	rr->nr_records++;
+
+	return 0;
+}
+
+/* Add an AGFL block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_walk_agfl(
+	struct xfs_mount		*mp,
+	xfs_agblock_t			bno,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+
+	return xfs_repair_rmapbt_new_rmap(rr, bno, 1, XFS_RMAP_OWN_AG, 0, 0);
+}
+
+/* Add a btree block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_visit_btblock(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	rr->btblocks++;
+	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	return xfs_repair_rmapbt_new_rmap(rr, XFS_FSB_TO_AGBNO(cur->bc_mp, fsb),
+			1, rr->owner, 0, 0);
+}
+
+STATIC int
+xfs_repair_rmapbt_stash_btree_rmap(
+	struct xfs_scrub_context	*sc,
+	xfs_fsblock_t			fsbno,
+	xfs_fsblock_t			len,
+	void				*priv)
+{
+	return xfs_repair_rmapbt_new_rmap(priv, XFS_FSB_TO_AGBNO(sc->mp, fsbno),
+			len, XFS_RMAP_OWN_INOBT, 0, 0);
+}
+
+/* Record inode btree rmaps. */
+STATIC int
+xfs_repair_rmapbt_inodes(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_agino_t			agino;
+	xfs_agino_t			iperhole;
+	unsigned int			i;
+	int				error;
+
+	/* Record the inobt blocks. */
+	error = xfs_repair_collect_btree_cur_blocks(rr->sc, cur,
+			xfs_repair_rmapbt_stash_btree_rmap, rr);
+	if (error)
+		return error;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	/* Record a non-sparse inode chunk. */
+	if (irec.ir_holemask == XFS_INOBT_HOLEMASK_FULL)
+		return xfs_repair_rmapbt_new_rmap(rr,
+				XFS_AGINO_TO_AGBNO(mp, irec.ir_startino),
+				XFS_INODES_PER_CHUNK / mp->m_sb.sb_inopblock,
+				XFS_RMAP_OWN_INODES, 0, 0);
+
+	/* Iterate each chunk. */
+	iperhole = max_t(xfs_agino_t, mp->m_sb.sb_inopblock,
+			XFS_INODES_PER_HOLEMASK_BIT);
+	for (i = 0, agino = irec.ir_startino;
+	     i < XFS_INOBT_HOLEMASK_BITS;
+	     i += iperhole / XFS_INODES_PER_HOLEMASK_BIT, agino += iperhole) {
+		/* Skip holes. */
+		if (irec.ir_holemask & (1 << i))
+			continue;
+
+		/* Record the inode chunk otherwise. */
+		error = xfs_repair_rmapbt_new_rmap(rr,
+				XFS_AGINO_TO_AGBNO(mp, agino),
+				iperhole / mp->m_sb.sb_inopblock,
+				XFS_RMAP_OWN_INODES, 0, 0);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Record a CoW staging extent. */
+STATIC int
+xfs_repair_rmapbt_refcount(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_refcount_irec	refc;
+
+	xfs_refcount_btrec_to_irec(rec, &refc);
+	if (refc.rc_refcount != 1)
+		return -EFSCORRUPTED;
+
+	return xfs_repair_rmapbt_new_rmap(rr,
+			refc.rc_startblock - XFS_REFC_COW_START,
+			refc.rc_blockcount, XFS_RMAP_OWN_COW, 0, 0);
+}
+
+/* Add a bmbt block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_visit_bmbt(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+	unsigned int			flags = XFS_RMAP_BMBT_BLOCK;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	if (XFS_FSB_TO_AGNO(cur->bc_mp, fsb) != rr->sc->sa.agno)
+		return 0;
+
+	if (cur->bc_private.b.whichfork == XFS_ATTR_FORK)
+		flags |= XFS_RMAP_ATTR_FORK;
+	return xfs_repair_rmapbt_new_rmap(rr,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsb), 1,
+			cur->bc_private.b.ip->i_ino, 0, flags);
+}
+
+/* Determine rmap flags from fork and bmbt state. */
+static inline unsigned int
+xfs_repair_rmapbt_bmap_flags(
+	int			whichfork,
+	xfs_exntst_t		state)
+{
+	return  (whichfork == XFS_ATTR_FORK ? XFS_RMAP_ATTR_FORK : 0) |
+		(state == XFS_EXT_UNWRITTEN ? XFS_RMAP_UNWRITTEN : 0);
+}
+
+/* Find all the extents from a given AG in an inode fork. */
+STATIC int
+xfs_repair_rmapbt_scan_ifork(
+	struct xfs_repair_rmapbt	*rr,
+	struct xfs_inode		*ip,
+	int				whichfork)
+{
+	struct xfs_bmbt_irec		rec;
+	struct xfs_iext_cursor		icur;
+	struct xfs_mount		*mp = rr->sc->mp;
+	struct xfs_btree_cur		*cur = NULL;
+	struct xfs_ifork		*ifp;
+	unsigned int			rflags;
+	int				fmt;
+	int				error = 0;
+
+	/* Do we even have data mapping extents? */
+	fmt = XFS_IFORK_FORMAT(ip, whichfork);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	switch (fmt) {
+	case XFS_DINODE_FMT_BTREE:
+		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
+			if (error)
+				return error;
+		}
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		break;
+	default:
+		return 0;
+	}
+	if (!ifp)
+		return 0;
+
+	/* Find all the BMBT blocks in the AG. */
+	if (fmt == XFS_DINODE_FMT_BTREE) {
+		cur = xfs_bmbt_init_cursor(mp, rr->sc->tp, ip, whichfork);
+		error = xfs_btree_visit_blocks(cur,
+				xfs_repair_rmapbt_visit_bmbt, rr);
+		if (error)
+			goto out;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		cur = NULL;
+	}
+
+	/* We're done if this is an rt inode's data fork. */
+	if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip))
+		return 0;
+
+	/* Find all the extents in the AG. */
+	for_each_xfs_iext(ifp, &icur, &rec) {
+		if (isnullstartblock(rec.br_startblock))
+			continue;
+		/* Stash non-hole extent. */
+		if (XFS_FSB_TO_AGNO(mp, rec.br_startblock) == rr->sc->sa.agno) {
+			rflags = xfs_repair_rmapbt_bmap_flags(whichfork,
+					rec.br_state);
+			error = xfs_repair_rmapbt_new_rmap(rr,
+					XFS_FSB_TO_AGBNO(mp, rec.br_startblock),
+					rec.br_blockcount, ip->i_ino,
+					rec.br_startoff, rflags);
+			if (error)
+				goto out;
+		}
+	}
+out:
+	if (cur)
+		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Iterate all the inodes in an AG. */
+STATIC int
+xfs_repair_rmapbt_scan_inobt(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip = NULL;
+	xfs_ino_t			ino;
+	xfs_agino_t			agino;
+	int				chunkidx;
+	int				lock_mode = 0;
+	int				error = 0;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	for (chunkidx = 0, agino = irec.ir_startino;
+	     chunkidx < XFS_INODES_PER_CHUNK;
+	     chunkidx++, agino++) {
+		bool	inuse;
+
+		/* Skip if this inode is free */
+		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
+			continue;
+		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
+
+		/* Back off and try again if an inode is being reclaimed */
+		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
+				&inuse);
+		if (error == -EAGAIN)
+			return -EDEADLOCK;
+
+		/*
+		 * Grab inode for scanning.  We cannot use DONTCACHE here
+		 * because we already have a transaction so the iput must not
+		 * trigger inode reclaim (which might allocate a transaction
+		 * to clean up posteof blocks).
+		 */
+		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);
+		if (error)
+			return error;
+		trace_xfs_scrub_iget(ip, __this_address);
+
+		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
+		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
+		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
+		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
+			lock_mode = XFS_ILOCK_EXCL;
+		else
+			lock_mode = XFS_ILOCK_SHARED;
+		if (!xfs_ilock_nowait(ip, lock_mode)) {
+			error = -EBUSY;
+			goto out_rele;
+		}
+
+		/* Check the data fork. */
+		error = xfs_repair_rmapbt_scan_ifork(rr, ip, XFS_DATA_FORK);
+		if (error)
+			goto out_unlock;
+
+		/* Check the attr fork. */
+		error = xfs_repair_rmapbt_scan_ifork(rr, ip, XFS_ATTR_FORK);
+		if (error)
+			goto out_unlock;
+
+		xfs_iunlock(ip, lock_mode);
+		xfs_scrub_iput(rr->sc, ip);
+		ip = NULL;
+	}
+
+	return error;
+out_unlock:
+	xfs_iunlock(ip, lock_mode);
+out_rele:
+	iput(VFS_I(ip));
+	return error;
+}
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xfs_repair_rmapbt_record_rmap_freesp(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt_freesp	*rrf = priv;
+	xfs_fsblock_t			fsb;
+	int				error;
+
+	/* Record the free space we find. */
+	if (rec->rm_startblock > rrf->next_bno) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rrf->next_bno);
+		error = xfs_repair_collect_btree_extent(rrf->sc,
+				&rrf->rmap_freelist, fsb,
+				rec->rm_startblock - rrf->next_bno);
+		if (error)
+			return error;
+	}
+	rrf->next_bno = max_t(xfs_agblock_t, rrf->next_bno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Record extents that aren't in use from the bnobt records. */
+STATIC int
+xfs_repair_rmapbt_record_bno_freesp(
+	struct xfs_btree_cur		*cur,
+	struct xfs_alloc_rec_incore	*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt_freesp	*rrf = priv;
+	xfs_fsblock_t			fsb;
+
+	/* Record the free space we find. */
+	fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+			rec->ar_startblock);
+	return xfs_repair_collect_btree_extent(rrf->sc, &rrf->bno_freelist,
+			fsb, rec->ar_blockcount);
+}
+
+/* Compare two rmapbt extents. */
+static int
+xfs_repair_rmapbt_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_rmapbt_extent	*ap;
+	struct xfs_repair_rmapbt_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_rmapbt_extent, list);
+	bp = container_of(b, struct xfs_repair_rmapbt_extent, list);
+	return xfs_rmap_compare(&ap->rmap, &bp->rmap);
+}
+
+/* Generate rmaps for the AG headers (AGI/AGF/AGFL) */
+STATIC int
+xfs_repair_rmapbt_generate_agheader_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	int				error;
+
+	/* Create a record for the AG sb->agfl. */
+	error = xfs_repair_rmapbt_new_rmap(rr, XFS_SB_BLOCK(sc->mp),
+			XFS_AGFL_BLOCK(sc->mp) - XFS_SB_BLOCK(sc->mp) + 1,
+			XFS_RMAP_OWN_FS, 0, 0);
+	if (error)
+		return error;
+
+	/* Generate rmaps for the blocks in the AGFL. */
+	return xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
+			sc->sa.agfl_bp, xfs_repair_rmapbt_walk_agfl, rr);
+}
+
+/* Generate rmaps for the log, if it's in this AG. */
+STATIC int
+xfs_repair_rmapbt_generate_log_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+
+	if (sc->mp->m_sb.sb_logstart == 0 ||
+	    XFS_FSB_TO_AGNO(sc->mp, sc->mp->m_sb.sb_logstart) != sc->sa.agno)
+		return 0;
+
+	return xfs_repair_rmapbt_new_rmap(rr,
+			XFS_FSB_TO_AGBNO(sc->mp, sc->mp->m_sb.sb_logstart),
+			sc->mp->m_sb.sb_logblocks, XFS_RMAP_OWN_LOG, 0, 0);
+}
+
+/* Collect rmaps for the blocks containing the free space btrees. */
+STATIC int
+xfs_repair_rmapbt_generate_freesp_rmaps(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agblock_t			*new_btreeblks)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	rr->owner = XFS_RMAP_OWN_AG;
+	rr->btblocks = 0;
+
+	/* bnobt */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_BNO);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* cntbt */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_CNT);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* btreeblks doesn't include the bnobt/cntbt btree roots */
+	*new_btreeblks = rr->btblocks - 2;
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Collect rmaps for the blocks containing inode btrees and the inode chunks. */
+STATIC int
+xfs_repair_rmapbt_generate_inobt_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	rr->owner = XFS_RMAP_OWN_INOBT;
+
+	/*
+	 * Iterate every record in the inobt so we can capture all the inode
+	 * chunks and the blocks in the inobt itself.  Note that if there are
+	 * zero records in the inobt then query_all does nothing and we have
+	 * to account the empty inobt root manually.
+	 */
+	if (sc->sa.pag->pagi_count > 0) {
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.agno, XFS_BTNUM_INO);
+		error = xfs_btree_query_all(cur, xfs_repair_rmapbt_inodes, rr);
+		if (error)
+			goto err_cur;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	} else {
+		struct xfs_agi		*agi;
+
+		agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+		error = xfs_repair_rmapbt_new_rmap(rr,
+				be32_to_cpu(agi->agi_root), 1,
+				XFS_RMAP_OWN_INOBT, 0, 0);
+		if (error)
+			goto err;
+	}
+
+	/* finobt */
+	if (!xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		return 0;
+
+	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
+			XFS_BTNUM_FINO);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/*
+ * Collect rmaps for the blocks containing the refcount btree, and all CoW
+ * staging extents.
+ */
+STATIC int
+xfs_repair_rmapbt_generate_refcountbt_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	union xfs_btree_irec		low;
+	union xfs_btree_irec		high;
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	if (!xfs_sb_version_hasreflink(&sc->mp->m_sb))
+		return 0;
+
+	rr->owner = XFS_RMAP_OWN_REFC;
+
+	/* refcountbt */
+	cur = xfs_refcountbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, NULL);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err_cur;
+
+	/* Collect rmaps for CoW staging extents. */
+	memset(&low, 0, sizeof(low));
+	low.rc.rc_startblock = XFS_REFC_COW_START;
+	memset(&high, 0xFF, sizeof(high));
+	error = xfs_btree_query_range(cur, &low, &high,
+			xfs_repair_rmapbt_refcount, rr);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Collect rmaps for all block mappings for every inode in this AG. */
+STATIC int
+xfs_repair_rmapbt_generate_aginode_rmaps(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agi_bp;
+	int				error;
+
+	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
+	if (error)
+		return error;
+	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
+	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(sc->tp, agi_bp);
+	return error;
+}
+
+/*
+ * Generate all the reverse-mappings for this AG, a list of the old rmapbt
+ * blocks, and the new btreeblks count.  Figure out if we have enough free
+ * space to reconstruct the rmap btree.  The caller must clean up the lists
+ * if anything goes wrong.
+ */
+STATIC int
+xfs_repair_rmapbt_find_rmaps(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*rmap_records,
+	xfs_agblock_t			*new_btreeblks)
+{
+	struct xfs_repair_rmapbt	rr;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	rr.rmaplist = rmap_records;
+	rr.sc = sc;
+	rr.nr_records = 0;
+
+	/* Generate rmaps for AG space metadata */
+	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
+	if (error)
+		return error;
+
+	/* Iterate all AGs for inode rmaps. */
+	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
+		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
+		if (error)
+			return error;
+	}
+
+	/* Do we actually have enough space to do this? */
+	if (!xfs_repair_ag_has_space(sc->sa.pag,
+			xfs_rmapbt_calc_size(sc->mp, rr.nr_records),
+			XFS_AG_RESV_RMAPBT))
+		return -ENOSPC;
+
+	return 0;
+}
+
+/* Update the AGF counters. */
+STATIC int
+xfs_repair_rmapbt_reset_counters(
+	struct xfs_scrub_context	*sc,
+	xfs_agblock_t			new_btreeblks,
+	int				*log_flags)
+{
+	struct xfs_agf			*agf;
+	struct xfs_perag		*pag = sc->sa.pag;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	pag->pagf_btreeblks = new_btreeblks;
+	agf->agf_btreeblks = cpu_to_be32(new_btreeblks);
+	*log_flags |= XFS_AGF_BTREEBLKS;
+
+	return 0;
+}
+
+/* Initialize a new rmapbt root and implant it into the AGF. */
+STATIC int
+xfs_repair_rmapbt_reset_btree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo,
+	int				*log_flags)
+{
+	struct xfs_buf			*bp;
+	struct xfs_agf			*agf;
+	struct xfs_perag		*pag = sc->sa.pag;
+	xfs_fsblock_t			btfsb;
+	int				error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* Initialize a new rmapbt root. */
+	error = xfs_repair_alloc_ag_block(sc, oinfo, &btfsb,
+			XFS_AG_RESV_RMAPBT);
+	if (error)
+		return error;
+
+	/* The root block is not a btreeblks block. */
+	be32_add_cpu(&agf->agf_btreeblks, -1);
+	pag->pagf_btreeblks--;
+	*log_flags |= XFS_AGF_BTREEBLKS;
+
+	error = xfs_repair_init_btblock(sc, btfsb, &bp, XFS_BTNUM_RMAP,
+			&xfs_rmapbt_buf_ops);
+	if (error)
+		return error;
+
+	agf->agf_roots[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, btfsb));
+	agf->agf_levels[XFS_BTNUM_RMAPi] = cpu_to_be32(1);
+	agf->agf_rmap_blocks = cpu_to_be32(1);
+	pag->pagf_levels[XFS_BTNUM_RMAPi] = 1;
+	*log_flags |= XFS_AGF_ROOTS | XFS_AGF_LEVELS | XFS_AGF_RMAP_BLOCKS;
+
+	return 0;
+}
+
+/*
+ * Roll and fix the free list while reloading the rmapbt.  Do not shrink the
+ * freelist because the rmapbt is not fully set up yet.
+ */
+STATIC int
+xfs_repair_rmapbt_fix_freelist(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		return error;
+	return xfs_repair_fix_freelist(sc, false);
+}
+
+/* Insert all the rmaps we collected. */
+STATIC int
+xfs_repair_rmapbt_rebuild_tree(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*rmap_records)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	struct xfs_repair_rmapbt_extent	*n;
+	struct xfs_btree_cur		*cur;
+	struct xfs_mount		*mp = sc->mp;
+	uint32_t			old_flcount;
+	int				error;
+
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	old_flcount = sc->sa.pag->pagf_flcount;
+
+	list_sort(NULL, rmap_records, xfs_repair_rmapbt_extent_cmp);
+	list_for_each_entry_safe(rre, n, rmap_records, list) {
+		/* Add the rmap. */
+		error = xfs_rmap_map_raw(cur, &rre->rmap);
+		if (error)
+			goto err_cur;
+		list_del(&rre->list);
+		kmem_free(rre);
+
+		/*
+		 * If the flcount changed because the rmap btree changed shape
+		 * then we need to fix the freelist to keep it full enough to
+		 * handle a total btree split.  We'll roll this transaction to
+		 * get it out of the way and then fix the freelist in a fresh
+		 * transaction.
+		 *
+		 * However, two things we must be careful about: (1) fixing
+		 * the freelist changes the rmapbt so drop the rmapbt cursor
+		 * and (2) we can't let the freelist shrink.  The rmapbt isn't
+		 * fully set up yet, which means that the current AGFL blocks
+		 * might not be reflected in the rmapbt, which is a problem if
+		 * we want to unmap blocks from the AGFL.
+		 */
+		if (sc->sa.pag->pagf_flcount == old_flcount)
+			continue;
+		if (list_empty(rmap_records))
+			break;
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		error = xfs_repair_rmapbt_fix_freelist(sc);
+		if (error)
+			goto err;
+		old_flcount = sc->sa.pag->pagf_flcount;
+		cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.agno);
+	}
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Fix the freelist once more, if necessary. */
+	if (sc->sa.pag->pagf_flcount != old_flcount) {
+		error = xfs_repair_rmapbt_fix_freelist(sc);
+		if (error)
+			goto err;
+	}
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/* Cancel every rmapbt record. */
+STATIC void
+xfs_repair_rmapbt_cancel_rmaps(
+	struct list_head	*reclist)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	struct xfs_repair_rmapbt_extent	*n;
+
+	list_for_each_entry_safe(rre, n, reclist, list) {
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+}
+
+/*
+ * Reap the old rmapbt blocks.  Now that the rmapbt is fully rebuilt, we make
+ * a list of gaps in the rmap records and a list of the extents mentioned in
+ * the bnobt.  Any block that's in the new rmapbt gap list but not mentioned
+ * in the bnobt is a block from the old rmapbt and can be removed.
+ */
+STATIC int
+xfs_repair_rmapbt_reap_old_blocks(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo)
+{
+	struct xfs_repair_rmapbt_freesp	rrf;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf;
+	struct xfs_btree_cur		*cur;
+	xfs_fsblock_t			btfsb;
+	xfs_agblock_t			agend;
+	int				error;
+
+	xfs_repair_init_extent_list(&rrf.rmap_freelist);
+	xfs_repair_init_extent_list(&rrf.bno_freelist);
+	rrf.next_bno = 0;
+	rrf.sc = sc;
+
+	/* Compute free space from the new rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_rmapbt_record_rmap_freesp,
+			&rrf);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	agend = be32_to_cpu(agf->agf_length);
+	if (rrf.next_bno < agend) {
+		btfsb = XFS_AGB_TO_FSB(mp, sc->sa.agno, rrf.next_bno);
+		error = xfs_repair_collect_btree_extent(sc, &rrf.rmap_freelist,
+				btfsb, agend - rrf.next_bno);
+		if (error)
+			goto err;
+	}
+
+	/* Compute free space from the existing bnobt. */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_BNO);
+	error = xfs_alloc_query_all(cur, xfs_repair_rmapbt_record_bno_freesp,
+			&rrf);
+	if (error)
+		goto err_lists;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/*
+	 * Free the "free" blocks that the new rmapbt knows about but
+	 * the old bnobt doesn't.  These are the old rmapbt blocks.
+	 */
+	error = xfs_repair_subtract_extents(sc, &rrf.rmap_freelist,
+			&rrf.bno_freelist);
+	xfs_repair_cancel_btree_extents(sc, &rrf.bno_freelist);
+	if (error)
+		goto err;
+	error = xfs_repair_invalidate_blocks(sc, &rrf.rmap_freelist);
+	if (error)
+		goto err;
+	return xfs_repair_reap_btree_extents(sc, &rrf.rmap_freelist, oinfo,
+			XFS_AG_RESV_RMAPBT);
+err_lists:
+	xfs_repair_cancel_btree_extents(sc, &rrf.bno_freelist);
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/* Repair the rmap btree for some AG. */
+int
+xfs_repair_rmapbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+	struct list_head		rmap_records;
+	xfs_extlen_t			new_btreeblks;
+	int				log_flags = 0;
+	int				error;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+
+	/* Collect rmaps for all AG headers. */
+	INIT_LIST_HEAD(&rmap_records);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
+	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old rmap btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_rmapbt_reset_counters(sc, new_btreeblks, &log_flags);
+	if (error)
+		goto out;
+	error = xfs_repair_rmapbt_reset_btree(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the rmap information. */
+	error = xfs_repair_rmapbt_rebuild_tree(sc, &rmap_records);
+	if (error)
+		goto out;
+
+	/* Find and destroy the blocks from the old rmapbt. */
+	error = xfs_repair_rmapbt_reap_old_blocks(sc, &oinfo);
+out:
+	xfs_repair_rmapbt_cancel_rmaps(&rmap_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 424f01130f14..3f8036ee3971 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -280,7 +280,7 @@  static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_rmapbt,
 		.scrub	= xfs_scrub_rmapbt,
 		.has	= xfs_sb_version_hasrmapbt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_rmapbt,
 	},
 	[XFS_SCRUB_TYPE_REFCNTBT] = {	/* refcountbt */
 		.type	= ST_PERAG,