diff mbox series

[2/2] xfs: only run COW extent recovery when there are no live extents

Message ID 163900531629.374528.14641806907962114873.stgit@magnolia (mailing list archive)
State Superseded, archived
Headers show
Series xfs: fix data corruption when cycling ro/rw mounts | expand

Commit Message

Darrick J. Wong Dec. 8, 2021, 11:15 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.  While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
 fs/xfs/xfs_mount.c   |   10 ----------
 fs/xfs/xfs_reflink.c |    5 ++++-
 fs/xfs/xfs_super.c   |    9 ---------
 4 files changed, 26 insertions(+), 21 deletions(-)

Comments

Dave Chinner Dec. 9, 2021, 5:41 a.m. UTC | #1
On Wed, Dec 08, 2021 at 03:15:16PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
> 
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
> 
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
> 
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
> 
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
> 
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
> 
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.  While we're at it, move
> the callsite to xfs_log_mount_finish because any refcount btree
> expansion (however unlikely given that we're removing records from the
> right side of the index) must be fed by a per-AG reservation, which
> doesn't exist in its current location.
> 
> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
>  fs/xfs/xfs_mount.c   |   10 ----------
>  fs/xfs/xfs_reflink.c |    5 ++++-
>  fs/xfs/xfs_super.c   |    9 ---------
>  4 files changed, 26 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 89fec9a18c34..c17344fc1275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_inode.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
> @@ -20,6 +21,7 @@
>  #include "xfs_sysfs.h"
>  #include "xfs_sb.h"
>  #include "xfs_health.h"
> +#include "xfs_reflink.h"
>  
>  struct kmem_cache	*xfs_log_ticket_cache;
>  
> @@ -847,9 +849,28 @@ xfs_log_mount_finish(
>  	/* Make sure the log is dead if we're returning failure. */
>  	ASSERT(!error || xlog_is_shutdown(log));
>  
> -	return error;
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Recover any CoW staging blocks that are still referenced by the
> +	 * ondisk refcount metadata.  During mount there cannot be any live
> +	 * staging extents as we have not permitted any user modifications.
> +	 * Therefore, it is safe to free them all right now, even on a
> +	 * read-only mount.
> +	 */
> +	error = xfs_reflink_recover_cow(mp);
> +	if (error) {
> +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> +				error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		return error;
> +	}
> +
> +	return 0;
>  }

THis is after we've emitted an "Ending recovery ...." log message.
I kinda expected you to move this up to just after the
evict_inodes() call before we force the log, push the AIL, drain
the buffers used during recovery and potentially turn the
"filesystem is read-only" flag back on.

i.e. if this is a recovery operation, it should be done before we
finish recovery....

Other than that, the change is fine.

Cheers,

Dave.
Chandan Babu R Dec. 9, 2021, 2:44 p.m. UTC | #2
On 09 Dec 2021 at 04:45, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
>
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
>
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
>
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
>
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
>
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
>
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.  While we're at it, move
> the callsite to xfs_log_mount_finish because any refcount btree
> expansion (however unlikely given that we're removing records from the
> right side of the index) must be fed by a per-AG reservation, which
> doesn't exist in its current location.
>

With Dave's review comments addressed,

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
>  fs/xfs/xfs_mount.c   |   10 ----------
>  fs/xfs/xfs_reflink.c |    5 ++++-
>  fs/xfs/xfs_super.c   |    9 ---------
>  4 files changed, 26 insertions(+), 21 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 89fec9a18c34..c17344fc1275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_inode.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
> @@ -20,6 +21,7 @@
>  #include "xfs_sysfs.h"
>  #include "xfs_sb.h"
>  #include "xfs_health.h"
> +#include "xfs_reflink.h"
>  
>  struct kmem_cache	*xfs_log_ticket_cache;
>  
> @@ -847,9 +849,28 @@ xfs_log_mount_finish(
>  	/* Make sure the log is dead if we're returning failure. */
>  	ASSERT(!error || xlog_is_shutdown(log));
>  
> -	return error;
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Recover any CoW staging blocks that are still referenced by the
> +	 * ondisk refcount metadata.  During mount there cannot be any live
> +	 * staging extents as we have not permitted any user modifications.
> +	 * Therefore, it is safe to free them all right now, even on a
> +	 * read-only mount.
> +	 */
> +	error = xfs_reflink_recover_cow(mp);
> +	if (error) {
> +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> +				error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		return error;
> +	}
> +
> +	return 0;
>  }
>  
> +
>  /*
>   * The mount has failed. Cancel the recovery if it hasn't completed and destroy
>   * the log.
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 359109b6f0d3..bed73e8002a5 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -936,15 +936,6 @@ xfs_mountfs(
>  			xfs_warn(mp,
>  	"Unable to allocate reserve blocks. Continuing without reserve pool.");
>  
> -		/* Recover any CoW blocks that never got remapped. */
> -		error = xfs_reflink_recover_cow(mp);
> -		if (error) {
> -			xfs_err(mp,
> -	"Error %d recovering leftover CoW allocations.", error);
> -			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -			goto out_quota;
> -		}
> -
>  		/* Reserve AG blocks for future btree expansion. */
>  		error = xfs_fs_reserve_ag_blocks(mp);
>  		if (error && error != -ENOSPC)
> @@ -955,7 +946,6 @@ xfs_mountfs(
>  
>   out_agresv:
>  	xfs_fs_unreserve_ag_blocks(mp);
> - out_quota:
>  	xfs_qm_unmount_quotas(mp);
>   out_rtunmount:
>  	xfs_rtunmount_inodes(mp);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index cb0edb1d68ef..8b6c7163f684 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -749,7 +749,10 @@ xfs_reflink_end_cow(
>  }
>  
>  /*
> - * Free leftover CoW reservations that didn't get cleaned out.
> + * Free all CoW staging blocks that are still referenced by the ondisk refcount
> + * metadata.  The ondisk metadata does not track which inode created the
> + * staging extent, so callers must ensure that there are no cached inodes with
> + * live CoW staging extents.
>   */
>  int
>  xfs_reflink_recover_cow(
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 778b57b1f020..c7ac486ca5d3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1739,15 +1739,6 @@ xfs_remount_rw(
>  	 */
>  	xfs_restore_resvblks(mp);
>  	xfs_log_work_queue(mp);
> -
> -	/* Recover any CoW blocks that never got remapped. */
> -	error = xfs_reflink_recover_cow(mp);
> -	if (error) {
> -		xfs_err(mp,
> -			"Error %d recovering leftover CoW allocations.", error);
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -		return error;
> -	}
>  	xfs_blockgc_start(mp);
>  
>  	/* Create the per-AG metadata reservation pool .*/
Darrick J. Wong Dec. 11, 2021, 1:24 a.m. UTC | #3
On Thu, Dec 09, 2021 at 04:41:49PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2021 at 03:15:16PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > As part of multiple customer escalations due to file data corruption
> > after copy on write operations, I wrote some fstests that use fsstress
> > to hammer on COW to shake things loose.  Regrettably, I caught some
> > filesystem shutdowns due to incorrect rmap operations with the following
> > loop:
> > 
> > mount <filesystem>				# (0)
> > fsstress <run only readonly ops> &		# (1)
> > while true; do
> > 	fsstress <run all ops>
> > 	mount -o remount,ro			# (2)
> > 	fsstress <run only readonly ops>
> > 	mount -o remount,rw			# (3)
> > done
> > 
> > When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> > call xfs_blockgc_stop to walk the inode cache to free all the COW
> > extents, but the blockgc mechanism races with (1)'s reader threads to
> > take IOLOCKs and loses, which means that it doesn't clean them all out.
> > Call such a file (A).
> > 
> > When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> > walks the ondisk refcount btree and frees any COW extent that it finds.
> > This function does not check the inode cache, which means that incore
> > COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> > one of those former COW extents are allocated and mapped into another
> > file (B) and someone triggers a COW to the stale reservation in (A), A's
> > dirty data will be written into (B) and once that's done, those blocks
> > will be transferred to (A)'s data fork without bumping the refcount.
> > 
> > The results are catastrophic -- file (B) and the refcount btree are now
> > corrupt.  In the first patch, we fixed the race condition in (2) so that
> > (A) will always flush the COW fork.  In this second patch, we move the
> > _recover_cow call to the initial mount call in (0) for safety.
> > 
> > As mentioned previously, xfs_reflink_recover_cow walks the refcount
> > btree looking for COW staging extents, and frees them.  This was
> > intended to be run at mount time (when we know there are no live inodes)
> > to clean up any leftover staging events that may have been left behind
> > during an unclean shutdown.  As a time "optimization" for readonly
> > mounts, we deferred this to the ro->rw transition, not realizing that
> > any failure to clean all COW forks during a rw->ro transition would
> > result in catastrophic corruption.
> > 
> > Therefore, remove this optimization and only run the recovery routine
> > when we're guaranteed not to have any COW staging extents anywhere,
> > which means we always run this at mount time.  While we're at it, move
> > the callsite to xfs_log_mount_finish because any refcount btree
> > expansion (however unlikely given that we're removing records from the
> > right side of the index) must be fed by a per-AG reservation, which
> > doesn't exist in its current location.
> > 
> > Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
> >  fs/xfs/xfs_mount.c   |   10 ----------
> >  fs/xfs/xfs_reflink.c |    5 ++++-
> >  fs/xfs/xfs_super.c   |    9 ---------
> >  4 files changed, 26 insertions(+), 21 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 89fec9a18c34..c17344fc1275 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -10,6 +10,7 @@
> >  #include "xfs_log_format.h"
> >  #include "xfs_trans_resv.h"
> >  #include "xfs_mount.h"
> > +#include "xfs_inode.h"
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  #include "xfs_trans.h"
> > @@ -20,6 +21,7 @@
> >  #include "xfs_sysfs.h"
> >  #include "xfs_sb.h"
> >  #include "xfs_health.h"
> > +#include "xfs_reflink.h"
> >  
> >  struct kmem_cache	*xfs_log_ticket_cache;
> >  
> > @@ -847,9 +849,28 @@ xfs_log_mount_finish(
> >  	/* Make sure the log is dead if we're returning failure. */
> >  	ASSERT(!error || xlog_is_shutdown(log));
> >  
> > -	return error;
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Recover any CoW staging blocks that are still referenced by the
> > +	 * ondisk refcount metadata.  During mount there cannot be any live
> > +	 * staging extents as we have not permitted any user modifications.
> > +	 * Therefore, it is safe to free them all right now, even on a
> > +	 * read-only mount.
> > +	 */
> > +	error = xfs_reflink_recover_cow(mp);
> > +	if (error) {
> > +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> > +				error);
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +		return error;
> > +	}
> > +
> > +	return 0;
> >  }
> 
> THis is after we've emitted an "Ending recovery ...." log message.
> I kinda expected you to move this up to just after the
> evict_inodes() call before we force the log, push the AIL, drain
> the buffers used during recovery and potentially turn the
> "filesystem is read-only" flag back on.
> 
> i.e. if this is a recovery operation, it should be done before we
> finish recovery....
> 
> Other than that, the change is fine.

It's a recovery function, albeit one that we always run during mount,
even if the log didn't require recovery.  That's a behavior change,
which is beyond the scope of a fix patch.  Not to mention it causes a
hang in xfs/434, which means now I have to withdraw this until I sort
out why we stall forever in xfs_buftarg_drain, and ftrace isn't helpful
in this regard.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 89fec9a18c34..c17344fc1275 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -10,6 +10,7 @@ 
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_inode.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_trans.h"
@@ -20,6 +21,7 @@ 
 #include "xfs_sysfs.h"
 #include "xfs_sb.h"
 #include "xfs_health.h"
+#include "xfs_reflink.h"
 
 struct kmem_cache	*xfs_log_ticket_cache;
 
@@ -847,9 +849,28 @@  xfs_log_mount_finish(
 	/* Make sure the log is dead if we're returning failure. */
 	ASSERT(!error || xlog_is_shutdown(log));
 
-	return error;
+	if (error)
+		return error;
+
+	/*
+	 * Recover any CoW staging blocks that are still referenced by the
+	 * ondisk refcount metadata.  During mount there cannot be any live
+	 * staging extents as we have not permitted any user modifications.
+	 * Therefore, it is safe to free them all right now, even on a
+	 * read-only mount.
+	 */
+	error = xfs_reflink_recover_cow(mp);
+	if (error) {
+		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
+				error);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
+	}
+
+	return 0;
 }
 
+
 /*
  * The mount has failed. Cancel the recovery if it hasn't completed and destroy
  * the log.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 359109b6f0d3..bed73e8002a5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -936,15 +936,6 @@  xfs_mountfs(
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
 
-		/* Recover any CoW blocks that never got remapped. */
-		error = xfs_reflink_recover_cow(mp);
-		if (error) {
-			xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-			goto out_quota;
-		}
-
 		/* Reserve AG blocks for future btree expansion. */
 		error = xfs_fs_reserve_ag_blocks(mp);
 		if (error && error != -ENOSPC)
@@ -955,7 +946,6 @@  xfs_mountfs(
 
  out_agresv:
 	xfs_fs_unreserve_ag_blocks(mp);
- out_quota:
 	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
 	xfs_rtunmount_inodes(mp);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cb0edb1d68ef..8b6c7163f684 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -749,7 +749,10 @@  xfs_reflink_end_cow(
 }
 
 /*
- * Free leftover CoW reservations that didn't get cleaned out.
+ * Free all CoW staging blocks that are still referenced by the ondisk refcount
+ * metadata.  The ondisk metadata does not track which inode created the
+ * staging extent, so callers must ensure that there are no cached inodes with
+ * live CoW staging extents.
  */
 int
 xfs_reflink_recover_cow(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 778b57b1f020..c7ac486ca5d3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1739,15 +1739,6 @@  xfs_remount_rw(
 	 */
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-
-	/* Recover any CoW blocks that never got remapped. */
-	error = xfs_reflink_recover_cow(mp);
-	if (error) {
-		xfs_err(mp,
-			"Error %d recovering leftover CoW allocations.", error);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return error;
-	}
 	xfs_blockgc_start(mp);
 
 	/* Create the per-AG metadata reservation pool .*/