Message ID | 20171122182949.GH2135@magnolia (mailing list archive) |
---|---|
State | Accepted |
On Wed, Nov 22, 2017 at 8:29 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> As part of testing log recovery with dm_log_writes, Amir Goldstein
> discovered an error in the deferred ops recovery that led to corruption
> of the filesystem metadata if a reflink+rmap filesystem happened to shut
> down midway through a CoW remap:
>
> "This is what happens [after failed log recovery]:
>
> "Phase 1 - find and verify superblock...
> "Phase 2 - using internal log
> "        - zero log...
> "        - scan filesystem freespace and inode maps...
> "        - found root inode chunk
> "Phase 3 - for each AG...
> "        - scan (but don't clear) agi unlinked lists...
> "        - process known inodes and perform inode discovery...
> "        - agno = 0
> "data fork in regular inode 134 claims CoW block 376
> "correcting nextents for inode 134
> "bad data fork in inode 134
> "would have cleared inode 134"
>
> Hou Tao dissected the log contents of exactly such a crash:
>
> "According to the implementation of xfs_defer_finish(), these ops should
> be completed in the following sequence:
>
> "Have been done:
> "(1) CUI: Oper (160)
> "(2) BUI: Oper (161)
> "(3) CUD: Oper (194), for CUI Oper (160)
> "(4) RUI A: Oper (197), free rmap [0x155, 2, -9]
>
> "Should be done:
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI A
> "(8) RUD: for RUI B
>
> "Actually be done by xlog_recover_process_intents()
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI B
> "(8) RUD: for RUI A
>
> "So the rmap entry [0x155, 2, -9] for COW should be freed firstly, then
> a new rmap entry [0x155, 2, 137] will be added.  However, as we can see
> from the log record in post_mount.log (generated after umount) and the
> trace print, the new rmap entry [0x155, 2, 137] are added firstly, then
> the rmap entry [0x155, 2, -9] are freed."
>
> When reconstructing the internal log state from the log items found on
> disk, it's required that deferred ops replay in exactly the same order
> that they would have had the filesystem not gone down.  However,
> replaying unfinished deferred ops can create /more/ deferred ops.  These
> new deferred ops are finished in the wrong order.  This causes fs
> corruption and replay crashes, so let's create a single defer_ops to
> handle the subsequent ops created during replay, then use one single
> transaction at the end of log recovery to ensure that everything is
> replayed in the same order as they're supposed to be.
>
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Analyzed-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Since it fixes the problem I reported and I couldn't link this patch with
other crashes I found, you may add:

Tested-by: Amir Goldstein <amir73il@gmail.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Darrick,

this looks generally good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But I have a few nitpicks below:

> +	/*
> +	 * We're finishing the defer_ops that accumulated as a result of
> +	 * recovering unfinished intent items during log recovery.  We
> +	 * reserve an itruncate transaction because it is the largest
> +	 * permanent transaction type.  Since we're the only user of the
> +	 * fs right now, take 93% of the available free blocks.  Use
> +	 * weird math to avoid a 64-bit division.
> +	 */

Those magic 93% are 15/16th, maybe better to write it that way :)

That being said, / 16 is still a 64-bit division, and I'm pretty sure
we'll run into an old gcc on some weird architecture that will not turn
it into a shift, so better write it as a shift to start with.

> +	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +	resblks = min_t(s64, UINT_MAX, freeblks);

Didn't I just get called out for using u* types?  The same applies to
s*, so use int64_t here to be consistent.

> +			/*
> +			 * NOTE: If your intent processing routine can create
> +			 * more deferred ops, you /must/ attach them to the
> +			 * dfops in this routine or else those subsequent
> +			 * intents will get replayed in the wrong order!
> +			 */

Please use up all available 80 characters for your comments.
On Thu, Nov 23, 2017 at 02:36:40PM +0100, Christoph Hellwig wrote:
> Hi Darrick,
>
> this looks generally good to me:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> But I have a few nitpicks below:
>
> > +	/*
> > +	 * We're finishing the defer_ops that accumulated as a result of
> > +	 * recovering unfinished intent items during log recovery.  We
> > +	 * reserve an itruncate transaction because it is the largest
> > +	 * permanent transaction type.  Since we're the only user of the
> > +	 * fs right now, take 93% of the available free blocks.  Use
> > +	 * weird math to avoid a 64-bit division.
> > +	 */
>
> Those magic 93% are 15/16th, maybe better to write it that way :)

Ok, will update the comment.

> That being said, / 16 is still a 64-bit division, and I'm pretty sure
> we'll run into an old gcc on some weird architecture that will not
> turn it into a shift, so better write it as a shift to start with.

It shouldn't be, since resblks is unsigned int, which afaik is 32-bit.
But meh, I'll just change it to ">> 4" to make it obvious.

> > +	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +	resblks = min_t(s64, UINT_MAX, freeblks);
>
> Didn't I just get called out for using u* types?  The same applies
> to s*, so use int64_t here to be consistent.

Yeah... I pulled the return type directly from percpu_counter_sum, sigh.
Will change to int64_t.

> > +			/*
> > +			 * NOTE: If your intent processing routine can create
> > +			 * more deferred ops, you /must/ attach them to the
> > +			 * dfops in this routine or else those subsequent
> > +			 * intents will get replayed in the wrong order!
> > +			 */
>
> Please use up all available 80 characters for your comments.

Ok, will fix the comment.

--D
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index dd136f7..e5fb008 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -389,7 +389,8 @@ xfs_bud_init(
 int
 xfs_bui_recover(
 	struct xfs_mount		*mp,
-	struct xfs_bui_log_item		*buip)
+	struct xfs_bui_log_item		*buip,
+	struct xfs_defer_ops		*dfops)
 {
 	int				error = 0;
 	unsigned int			bui_type;
@@ -404,9 +405,7 @@ xfs_bui_recover(
 	xfs_exntst_t			state;
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_defer_ops		dfops;
 	struct xfs_bmbt_irec		irec;
-	xfs_fsblock_t			firstfsb;
 
 	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
 
@@ -464,7 +463,6 @@ xfs_bui_recover(
 	if (VFS_I(ip)->i_nlink == 0)
 		xfs_iflags_set(ip, XFS_IRECOVERY);
-	xfs_defer_init(&dfops, &firstfsb);
 
 	/* Process deferred bmap item. */
 	state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
@@ -479,16 +477,16 @@ xfs_bui_recover(
 		break;
 	default:
 		error = -EFSCORRUPTED;
-		goto err_dfops;
+		goto err_inode;
 	}
 	xfs_trans_ijoin(tp, ip, 0);
 
 	count = bmap->me_len;
-	error = xfs_trans_log_finish_bmap_update(tp, budp, &dfops, type,
+	error = xfs_trans_log_finish_bmap_update(tp, budp, dfops, type,
 			ip, whichfork, bmap->me_startoff,
 			bmap->me_startblock, &count, state);
 	if (error)
-		goto err_dfops;
+		goto err_inode;
 
 	if (count > 0) {
 		ASSERT(type == XFS_BMAP_UNMAP);
@@ -496,16 +494,11 @@ xfs_bui_recover(
 		irec.br_blockcount = count;
 		irec.br_startoff = bmap->me_startoff;
 		irec.br_state = state;
-		error = xfs_bmap_unmap_extent(tp->t_mountp, &dfops, ip, &irec);
+		error = xfs_bmap_unmap_extent(tp->t_mountp, dfops, ip, &irec);
 		if (error)
-			goto err_dfops;
+			goto err_inode;
 	}
 
-	/* Finish transaction, free inodes. */
-	error = xfs_defer_finish(&tp, &dfops);
-	if (error)
-		goto err_dfops;
-
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
 	error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -513,8 +506,6 @@ xfs_bui_recover(
 
 	return error;
 
-err_dfops:
-	xfs_defer_cancel(&dfops);
 err_inode:
 	xfs_trans_cancel(tp);
 	if (ip) {
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
index c867daa..24b354a 100644
--- a/fs/xfs/xfs_bmap_item.h
+++ b/fs/xfs/xfs_bmap_item.h
@@ -93,6 +93,7 @@ struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
 		struct xfs_bui_log_item *);
 void xfs_bui_item_free(struct xfs_bui_log_item *);
 void xfs_bui_release(struct xfs_bui_log_item *);
-int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
+int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip,
+		struct xfs_defer_ops *dfops);
 
 #endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 87b1c33..36da419 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
@@ -4716,7 +4717,8 @@ STATIC int
 xlog_recover_process_cui(
 	struct xfs_mount		*mp,
 	struct xfs_ail			*ailp,
-	struct xfs_log_item		*lip)
+	struct xfs_log_item		*lip,
+	struct xfs_defer_ops		*dfops)
 {
 	struct xfs_cui_log_item		*cuip;
 	int				error;
@@ -4729,7 +4731,7 @@ xlog_recover_process_cui(
 		return 0;
 	spin_unlock(&ailp->xa_lock);
 
-	error = xfs_cui_recover(mp, cuip);
+	error = xfs_cui_recover(mp, cuip, dfops);
 	spin_lock(&ailp->xa_lock);
 
 	return error;
@@ -4756,7 +4758,8 @@ STATIC int
 xlog_recover_process_bui(
 	struct xfs_mount		*mp,
 	struct xfs_ail			*ailp,
-	struct xfs_log_item		*lip)
+	struct xfs_log_item		*lip,
+	struct xfs_defer_ops		*dfops)
 {
 	struct xfs_bui_log_item		*buip;
 	int				error;
@@ -4769,7 +4772,7 @@ xlog_recover_process_bui(
 		return 0;
 	spin_unlock(&ailp->xa_lock);
 
-	error = xfs_bui_recover(mp, buip);
+	error = xfs_bui_recover(mp, buip, dfops);
 	spin_lock(&ailp->xa_lock);
 
 	return error;
@@ -4805,6 +4808,44 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 	}
 }
 
+/* Take all the collected deferred ops and finish them in order. */
+static int
+xlog_finish_defer_ops(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops)
+{
+	struct xfs_trans	*tp;
+	s64			freeblks;
+	uint			resblks;
+	int			error;
+
+	/*
+	 * We're finishing the defer_ops that accumulated as a result of
+	 * recovering unfinished intent items during log recovery.  We
+	 * reserve an itruncate transaction because it is the largest
+	 * permanent transaction type.  Since we're the only user of the
+	 * fs right now, take 93% of the available free blocks.  Use
+	 * weird math to avoid a 64-bit division.
+	 */
+	freeblks = percpu_counter_sum(&mp->m_fdblocks);
+	resblks = min_t(s64, UINT_MAX, freeblks);
+	resblks = resblks * 15 / 16;
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+			0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	error = xfs_defer_finish(&tp, dfops);
+	if (error)
+		goto out_cancel;
+
+	return xfs_trans_commit(tp);
+
+out_cancel:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
 /*
  * When this is called, all of the log intent items which did not have
  * corresponding log done items should be in the AIL.  What we do now
@@ -4825,10 +4866,12 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_log_item	*lip;
-	int			error = 0;
+	struct xfs_defer_ops	dfops;
 	struct xfs_ail_cursor	cur;
+	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
+	xfs_fsblock_t		firstfsb;
+	int			error = 0;
 #if defined(DEBUG) || defined(XFS_WARN)
 	xfs_lsn_t		last_lsn;
 #endif
@@ -4839,6 +4882,7 @@ xlog_recover_process_intents(
 #if defined(DEBUG) || defined(XFS_WARN)
 	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
 #endif
+	xfs_defer_init(&dfops, &firstfsb);
 	while (lip != NULL) {
 		/*
 		 * We're done when we see something other than an intent.
@@ -4859,6 +4903,12 @@ xlog_recover_process_intents(
 		 */
 		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
 
+		/*
+		 * NOTE: If your intent processing routine can create
+		 * more deferred ops, you /must/ attach them to the
+		 * dfops in this routine or else those subsequent
+		 * intents will get replayed in the wrong order!
+		 */
 		switch (lip->li_type) {
 		case XFS_LI_EFI:
 			error = xlog_recover_process_efi(log->l_mp, ailp, lip);
@@ -4867,10 +4917,12 @@ xlog_recover_process_intents(
 			error = xlog_recover_process_rui(log->l_mp, ailp, lip);
 			break;
 		case XFS_LI_CUI:
-			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
+			error = xlog_recover_process_cui(log->l_mp, ailp, lip,
+					&dfops);
 			break;
 		case XFS_LI_BUI:
-			error = xlog_recover_process_bui(log->l_mp, ailp, lip);
+			error = xlog_recover_process_bui(log->l_mp, ailp, lip,
+					&dfops);
 			break;
 		}
 		if (error)
@@ -4880,6 +4932,11 @@ out:
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->xa_lock);
+	if (error)
+		xfs_defer_cancel(&dfops);
+	else
+		error = xlog_finish_defer_ops(log->l_mp, &dfops);
+
 	return error;
 }
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 8f2e2fa..3a55d6f 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -393,7 +393,8 @@ xfs_cud_init(
 int
 xfs_cui_recover(
 	struct xfs_mount		*mp,
-	struct xfs_cui_log_item		*cuip)
+	struct xfs_cui_log_item		*cuip,
+	struct xfs_defer_ops		*dfops)
 {
 	int				i;
 	int				error = 0;
@@ -405,11 +406,9 @@ xfs_cui_recover(
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
 	enum xfs_refcount_intent_type	type;
-	xfs_fsblock_t			firstfsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
 	struct xfs_bmbt_irec		irec;
-	struct xfs_defer_ops		dfops;
 	bool				requeue_only = false;
 
 	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
@@ -465,7 +464,6 @@ xfs_cui_recover(
 		return error;
 	cudp = xfs_trans_get_cud(tp, cuip);
 
-	xfs_defer_init(&dfops, &firstfsb);
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
 		refc = &cuip->cui_format.cui_extents[i];
 		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
@@ -485,7 +483,7 @@ xfs_cui_recover(
 			new_len = refc->pe_len;
 		} else
 			error = xfs_trans_log_finish_refcount_update(tp, cudp,
-				&dfops, type, refc->pe_startblock, refc->pe_len,
+				dfops, type, refc->pe_startblock, refc->pe_len,
 				&new_fsb, &new_len, &rcur);
 		if (error)
 			goto abort_error;
@@ -497,21 +495,21 @@ xfs_cui_recover(
 			switch (type) {
 			case XFS_REFCOUNT_INCREASE:
 				error = xfs_refcount_increase_extent(
-						tp->t_mountp, &dfops, &irec);
+						tp->t_mountp, dfops, &irec);
 				break;
 			case XFS_REFCOUNT_DECREASE:
 				error = xfs_refcount_decrease_extent(
-						tp->t_mountp, &dfops, &irec);
+						tp->t_mountp, dfops, &irec);
 				break;
 			case XFS_REFCOUNT_ALLOC_COW:
 				error = xfs_refcount_alloc_cow_extent(
-						tp->t_mountp, &dfops,
+						tp->t_mountp, dfops,
 						irec.br_startblock,
 						irec.br_blockcount);
 				break;
 			case XFS_REFCOUNT_FREE_COW:
 				error = xfs_refcount_free_cow_extent(
-						tp->t_mountp, &dfops,
+						tp->t_mountp, dfops,
 						irec.br_startblock,
 						irec.br_blockcount);
 				break;
@@ -525,17 +523,12 @@ xfs_cui_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	error = xfs_defer_finish(&tp, &dfops);
-	if (error)
-		goto abort_defer;
 	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
 	error = xfs_trans_commit(tp);
 	return error;
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-abort_defer:
-	xfs_defer_cancel(&dfops);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
index 5b74ddd..0e53273 100644
--- a/fs/xfs/xfs_refcount_item.h
+++ b/fs/xfs/xfs_refcount_item.h
@@ -96,6 +96,7 @@ struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
 		struct xfs_cui_log_item *);
 void xfs_cui_item_free(struct xfs_cui_log_item *);
 void xfs_cui_release(struct xfs_cui_log_item *);
-int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip);
+int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip,
+		struct xfs_defer_ops *dfops);
 
 #endif	/* __XFS_REFCOUNT_ITEM_H__ */