Message ID | 20171122182949.GH2135@magnolia (mailing list archive) |
---|---|
State | Accepted |
On Wed, Nov 22, 2017 at 8:29 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> As part of testing log recovery with dm_log_writes, Amir Goldstein
> discovered an error in the deferred ops recovery that led to corruption
> of the filesystem metadata if a reflink+rmap filesystem happened to shut
> down midway through a CoW remap:
>
> "This is what happens [after failed log recovery]:
>
> "Phase 1 - find and verify superblock...
> "Phase 2 - using internal log
> "        - zero log...
> "        - scan filesystem freespace and inode maps...
> "        - found root inode chunk
> "Phase 3 - for each AG...
> "        - scan (but don't clear) agi unlinked lists...
> "        - process known inodes and perform inode discovery...
> "        - agno = 0
> "data fork in regular inode 134 claims CoW block 376
> "correcting nextents for inode 134
> "bad data fork in inode 134
> "would have cleared inode 134"
>
> Hou Tao dissected the log contents of exactly such a crash:
>
> "According to the implementation of xfs_defer_finish(), these ops should
> be completed in the following sequence:
>
> "Have been done:
> "(1) CUI: Oper (160)
> "(2) BUI: Oper (161)
> "(3) CUD: Oper (194), for CUI Oper (160)
> "(4) RUI A: Oper (197), free rmap [0x155, 2, -9]
>
> "Should be done:
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI A
> "(8) RUD: for RUI B
>
> "Actually be done by xlog_recover_process_intents()
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI B
> "(8) RUD: for RUI A
>
> "So the rmap entry [0x155, 2, -9] for COW should be freed firstly, then
> a new rmap entry [0x155, 2, 137] will be added.  However, as we can see
> from the log record in post_mount.log (generated after umount) and the
> trace print, the new rmap entry [0x155, 2, 137] are added firstly, then
> the rmap entry [0x155, 2, -9] are freed."
>
> When reconstructing the internal log state from the log items found on
> disk, it's required that deferred ops replay in exactly the same order
> that they would have had the filesystem not gone down.  However,
> replaying unfinished deferred ops can create /more/ deferred ops.  These
> new deferred ops are finished in the wrong order.  This causes fs
> corruption and replay crashes, so let's create a single defer_ops to
> handle the subsequent ops created during replay, then use one single
> transaction at the end of log recovery to ensure that everything is
> replayed in the same order as they're supposed to be.
>
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Analyzed-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Since it fixes the problem I reported and I couldn't link this patch with
other crashes I found, you may add:

Tested-by: Amir Goldstein <amir73il@gmail.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Darrick,

this looks generally good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But I have a few nitpicks below:

> +	/*
> +	 * We're finishing the defer_ops that accumulated as a result of
> +	 * recovering unfinished intent items during log recovery.  We
> +	 * reserve an itruncate transaction because it is the largest
> +	 * permanent transaction type.  Since we're the only user of the
> +	 * fs right now, take 93% of the available free blocks.  Use
> +	 * weird math to avoid a 64-bit division.
> +	 */

Those magic 93% are 15/16th, maybe better to write it that way :)

That being said, / 16 is still a 64-bit division, and I'm pretty sure
we'll run into an old gcc on some weird architecture that will not turn
it into a shift, so better write it as a shift to start with.

> +	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +	resblks = min_t(s64, UINT_MAX, freeblks);

Didn't I just get called out for using u* types?  The same applies to
s*, so use int64_t here to be consistent.

> +			/*
> +			 * NOTE: If your intent processing routine can create
> +			 * more deferred ops, you /must/ attach them to the
> +			 * dfops in this routine or else those subsequent
> +			 * intents will get replayed in the wrong order!
> +			 */

Please use up all available 80 characters for your comments.
On Thu, Nov 23, 2017 at 02:36:40PM +0100, Christoph Hellwig wrote:
> Hi Darrick,
>
> this looks generally good to me:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> But I have a few nitpicks below:
>
> > +	/*
> > +	 * We're finishing the defer_ops that accumulated as a result of
> > +	 * recovering unfinished intent items during log recovery.  We
> > +	 * reserve an itruncate transaction because it is the largest
> > +	 * permanent transaction type.  Since we're the only user of the
> > +	 * fs right now, take 93% of the available free blocks.  Use
> > +	 * weird math to avoid a 64-bit division.
> > +	 */
>
> Those magic 93% are 15/16th, maybe better to write it that way :)

Ok, will update the comment.

> That being said, / 16 is still a 64-bit division, and I'm pretty sure
> we'll run into an old gcc on some weird architecture that will not
> turn it into a shift, so better write it as a shift to start with.

It shouldn't be, since resblks is unsigned int, which afaik is 32-bit.
But meh, I'll just change it to ">> 4" to make it obvious.

> > +	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +	resblks = min_t(s64, UINT_MAX, freeblks);
>
> Didn't I just get called out for using u* types?  The same applies
> to s*, so use int64_t here to be consistent.

Yeah... I pulled the return type directly from percpu_counter_sum, sigh.
Will change to int64_t.

> > +			/*
> > +			 * NOTE: If your intent processing routine can create
> > +			 * more deferred ops, you /must/ attach them to the
> > +			 * dfops in this routine or else those subsequent
> > +			 * intents will get replayed in the wrong order!
> > +			 */
>
> Please use up all available 80 characters for your comments.

Ok, will fix the comment.

--D
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index dd136f7..e5fb008 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -389,7 +389,8 @@ xfs_bud_init(
 int
 xfs_bui_recover(
 	struct xfs_mount		*mp,
-	struct xfs_bui_log_item		*buip)
+	struct xfs_bui_log_item		*buip,
+	struct xfs_defer_ops		*dfops)
 {
 	int				error = 0;
 	unsigned int			bui_type;
@@ -404,9 +405,7 @@ xfs_bui_recover(
 	xfs_exntst_t			state;
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_defer_ops		dfops;
 	struct xfs_bmbt_irec		irec;
-	xfs_fsblock_t			firstfsb;
 
 	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
 
@@ -464,7 +463,6 @@ xfs_bui_recover(
 	if (VFS_I(ip)->i_nlink == 0)
 		xfs_iflags_set(ip, XFS_IRECOVERY);
-	xfs_defer_init(&dfops, &firstfsb);
 
 	/* Process deferred bmap item. */
 	state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
@@ -479,16 +477,16 @@ xfs_bui_recover(
 		break;
 	default:
 		error = -EFSCORRUPTED;
-		goto err_dfops;
+		goto err_inode;
 	}
 	xfs_trans_ijoin(tp, ip, 0);
 
 	count = bmap->me_len;
-	error = xfs_trans_log_finish_bmap_update(tp, budp, &dfops, type,
+	error = xfs_trans_log_finish_bmap_update(tp, budp, dfops, type,
 			ip, whichfork, bmap->me_startoff,
 			bmap->me_startblock, &count, state);
 	if (error)
-		goto err_dfops;
+		goto err_inode;
 
 	if (count > 0) {
 		ASSERT(type == XFS_BMAP_UNMAP);
@@ -496,16 +494,11 @@ xfs_bui_recover(
 		irec.br_blockcount = count;
 		irec.br_startoff = bmap->me_startoff;
 		irec.br_state = state;
-		error = xfs_bmap_unmap_extent(tp->t_mountp, &dfops, ip, &irec);
+		error = xfs_bmap_unmap_extent(tp->t_mountp, dfops, ip, &irec);
 		if (error)
-			goto err_dfops;
+			goto err_inode;
 	}
 
-	/* Finish transaction, free inodes. */
-	error = xfs_defer_finish(&tp, &dfops);
-	if (error)
-		goto err_dfops;
-
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
 	error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -513,8 +506,6 @@ xfs_bui_recover(
 
 	return error;
 
-err_dfops:
-	xfs_defer_cancel(&dfops);
 err_inode:
 	xfs_trans_cancel(tp);
 	if (ip) {
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
index c867daa..24b354a 100644
--- a/fs/xfs/xfs_bmap_item.h
+++ b/fs/xfs/xfs_bmap_item.h
@@ -93,6 +93,7 @@ struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
 		struct xfs_bui_log_item *);
 void xfs_bui_item_free(struct xfs_bui_log_item *);
 void xfs_bui_release(struct xfs_bui_log_item *);
-int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
+int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip,
+		struct xfs_defer_ops *dfops);
 
 #endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 87b1c33..36da419 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
@@ -4716,7 +4717,8 @@ STATIC int
 xlog_recover_process_cui(
 	struct xfs_mount		*mp,
 	struct xfs_ail			*ailp,
-	struct xfs_log_item		*lip)
+	struct xfs_log_item		*lip,
+	struct xfs_defer_ops		*dfops)
 {
 	struct xfs_cui_log_item		*cuip;
 	int				error;
@@ -4729,7 +4731,7 @@ xlog_recover_process_cui(
 		return 0;
 	spin_unlock(&ailp->xa_lock);
 
-	error = xfs_cui_recover(mp, cuip);
+	error = xfs_cui_recover(mp, cuip, dfops);
 	spin_lock(&ailp->xa_lock);
 
 	return error;
@@ -4756,7 +4758,8 @@ STATIC int
 xlog_recover_process_bui(
 	struct xfs_mount		*mp,
 	struct xfs_ail			*ailp,
-	struct xfs_log_item		*lip)
+	struct xfs_log_item		*lip,
+	struct xfs_defer_ops		*dfops)
 {
 	struct xfs_bui_log_item		*buip;
 	int				error;
@@ -4769,7 +4772,7 @@ xlog_recover_process_bui(
 		return 0;
 	spin_unlock(&ailp->xa_lock);
 
-	error = xfs_bui_recover(mp, buip);
+	error = xfs_bui_recover(mp, buip, dfops);
 	spin_lock(&ailp->xa_lock);
 
 	return error;
@@ -4805,6 +4808,44 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 	}
 }
 
+/* Take all the collected deferred ops and finish them in order. */
+static int
+xlog_finish_defer_ops(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops)
+{
+	struct xfs_trans	*tp;
+	s64			freeblks;
+	uint			resblks;
+	int			error;
+
+	/*
+	 * We're finishing the defer_ops that accumulated as a result of
+	 * recovering unfinished intent items during log recovery.  We
+	 * reserve an itruncate transaction because it is the largest
+	 * permanent transaction type.  Since we're the only user of the
+	 * fs right now, take 93% of the available free blocks.  Use
+	 * weird math to avoid a 64-bit division.
+	 */
+	freeblks = percpu_counter_sum(&mp->m_fdblocks);
+	resblks = min_t(s64, UINT_MAX, freeblks);
+	resblks = resblks * 15 / 16;
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+			0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	error = xfs_defer_finish(&tp, dfops);
+	if (error)
+		goto out_cancel;
+
+	return xfs_trans_commit(tp);
+
+out_cancel:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
 /*
  * When this is called, all of the log intent items which did not have
  * corresponding log done items should be in the AIL.  What we do now
@@ -4825,10 +4866,12 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_log_item	*lip;
-	int			error = 0;
+	struct xfs_defer_ops	dfops;
 	struct xfs_ail_cursor	cur;
+	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
+	xfs_fsblock_t		firstfsb;
+	int			error = 0;
 #if defined(DEBUG) || defined(XFS_WARN)
 	xfs_lsn_t		last_lsn;
 #endif
@@ -4839,6 +4882,7 @@ xlog_recover_process_intents(
 #if defined(DEBUG) || defined(XFS_WARN)
 	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
 #endif
+	xfs_defer_init(&dfops, &firstfsb);
 	while (lip != NULL) {
 		/*
 		 * We're done when we see something other than an intent.
@@ -4859,6 +4903,12 @@ xlog_recover_process_intents(
 		 */
 		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
 
+		/*
+		 * NOTE: If your intent processing routine can create
+		 * more deferred ops, you /must/ attach them to the
+		 * dfops in this routine or else those subsequent
+		 * intents will get replayed in the wrong order!
+		 */
 		switch (lip->li_type) {
 		case XFS_LI_EFI:
 			error = xlog_recover_process_efi(log->l_mp, ailp, lip);
@@ -4867,10 +4917,12 @@ xlog_recover_process_intents(
 			error = xlog_recover_process_rui(log->l_mp, ailp, lip);
 			break;
 		case XFS_LI_CUI:
-			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
+			error = xlog_recover_process_cui(log->l_mp, ailp, lip,
+					&dfops);
 			break;
 		case XFS_LI_BUI:
-			error = xlog_recover_process_bui(log->l_mp, ailp, lip);
+			error = xlog_recover_process_bui(log->l_mp, ailp, lip,
+					&dfops);
 			break;
 		}
 		if (error)
@@ -4880,6 +4932,11 @@ out:
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->xa_lock);
+	if (error)
+		xfs_defer_cancel(&dfops);
+	else
+		error = xlog_finish_defer_ops(log->l_mp, &dfops);
+
 	return error;
 }
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 8f2e2fa..3a55d6f 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -393,7 +393,8 @@ xfs_cud_init(
 int
 xfs_cui_recover(
 	struct xfs_mount		*mp,
-	struct xfs_cui_log_item		*cuip)
+	struct xfs_cui_log_item		*cuip,
+	struct xfs_defer_ops		*dfops)
 {
 	int				i;
 	int				error = 0;
@@ -405,11 +406,9 @@ xfs_cui_recover(
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
 	enum xfs_refcount_intent_type	type;
-	xfs_fsblock_t			firstfsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
 	struct xfs_bmbt_irec		irec;
-	struct xfs_defer_ops		dfops;
 	bool				requeue_only = false;
 
 	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
@@ -465,7 +464,6 @@ xfs_cui_recover(
 		return error;
 	cudp = xfs_trans_get_cud(tp, cuip);
 
-	xfs_defer_init(&dfops, &firstfsb);
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
 		refc = &cuip->cui_format.cui_extents[i];
 		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
@@ -485,7 +483,7 @@ xfs_cui_recover(
 			new_len = refc->pe_len;
 		} else
 			error = xfs_trans_log_finish_refcount_update(tp, cudp,
-				&dfops, type, refc->pe_startblock, refc->pe_len,
+				dfops, type, refc->pe_startblock, refc->pe_len,
 				&new_fsb, &new_len, &rcur);
 		if (error)
 			goto abort_error;
@@ -497,21 +495,21 @@ xfs_cui_recover(
 			switch (type) {
 			case XFS_REFCOUNT_INCREASE:
 				error = xfs_refcount_increase_extent(
-						tp->t_mountp, &dfops, &irec);
+						tp->t_mountp, dfops, &irec);
 				break;
 			case XFS_REFCOUNT_DECREASE:
 				error = xfs_refcount_decrease_extent(
-						tp->t_mountp, &dfops, &irec);
+						tp->t_mountp, dfops, &irec);
 				break;
 			case XFS_REFCOUNT_ALLOC_COW:
 				error = xfs_refcount_alloc_cow_extent(
-						tp->t_mountp, &dfops,
+						tp->t_mountp, dfops,
 						irec.br_startblock,
 						irec.br_blockcount);
 				break;
 			case XFS_REFCOUNT_FREE_COW:
 				error = xfs_refcount_free_cow_extent(
-						tp->t_mountp, &dfops,
+						tp->t_mountp, dfops,
 						irec.br_startblock,
 						irec.br_blockcount);
 				break;
@@ -525,17 +523,12 @@ xfs_cui_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	error = xfs_defer_finish(&tp, &dfops);
-	if (error)
-		goto abort_defer;
 	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
 	error = xfs_trans_commit(tp);
 	return error;
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-abort_defer:
-	xfs_defer_cancel(&dfops);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
index 5b74ddd..0e53273 100644
--- a/fs/xfs/xfs_refcount_item.h
+++ b/fs/xfs/xfs_refcount_item.h
@@ -96,6 +96,7 @@ struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
 		struct xfs_cui_log_item *);
 void xfs_cui_item_free(struct xfs_cui_log_item *);
 void xfs_cui_release(struct xfs_cui_log_item *);
-int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip);
+int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip,
+		struct xfs_defer_ops *dfops);
 
 #endif	/* __XFS_REFCOUNT_ITEM_H__ */