diff mbox series

[4/4] xfs: fix AGF vs inode cluster buffer deadlock

Message ID 20230517000449.3997582-5-david@fromorbit.com (mailing list archive)
State Accepted
Headers show
Series [1/4] xfs: buffer pins need to hold a buffer reference | expand

Commit Message

Dave Chinner May 17, 2023, 12:04 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Lock order in XFS is AGI -> AGF, hence for operations involving the
inode unlinked list we always lock the AGI first. Unlinked list
operations modify the inode cluster buffer, so the lock order there
is AGI -> inode cluster buffer.

For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.

Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.

This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.

To fix this we need to move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to change the xfs_trans_log_inode() call sites
to alter the locking order.

However, we do now have a mechanism that allows us to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.

This change largely moves the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and provides extra flag context in the
inode log item to track the dirty state of the inode in the current
transaction. It also means we do a lot less repeated work, because
the processing formerly done on every xfs_trans_log_inode() call now
happens only once per transaction, after all the modifications are
complete.

Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h  |   9 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
 fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode_item.h         |   1 +
 4 files changed, 171 insertions(+), 106 deletions(-)

Comments

Darrick J. Wong May 17, 2023, 1:26 a.m. UTC | #1
On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> ....
> This change is largely moving the guts of xfs_trans_log_inode() to
> xfs_inode_item_precommit() and providing an extra flag context in
> the inode log item to track the dirty state of the inode in the
> current transaction. This also means we do a lot less repeated work
> in xfs_trans_log_inode() by only doing it once per transaction when
> all the work is done.

Aha, and that's why you moved all the "opportunistically tweak inode
metadata while we're already logging it" bits to the precommit hook.

> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
>  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode_item.h         |   1 +
>  4 files changed, 171 insertions(+), 106 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index f13e0809dc63..269573c82808 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
>  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
>  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
>  
> -
>  /*
>   * The timestamps are dirty, but not necessarily anything else in the inode
>   * core.  Unlike the other fields above this one must never make it to disk
> @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
>   */
>  #define XFS_ILOG_TIMESTAMP	0x4000
>  
> +/*
> + * The version field has been changed, but not necessarily anything else of
> + * interest. This must never make it to disk - it is used purely to ensure that
> + * the inode item ->precommit operation can update the fsync flag triggers
> + * in the inode item correctly.
> + */
> +#define XFS_ILOG_IVERSION	0x8000
> +
>  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
>  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
>  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..2d164d0588b1 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -40,9 +40,8 @@ xfs_trans_ijoin(
>  	iip->ili_lock_flags = lock_flags;
>  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
> -	/*
> -	 * Get a log_item_desc to point at the new item.
> -	 */
> +	/* Reset the per-tx dirty context and add the item to the tx. */
> +	iip->ili_dirty_flags = 0;
>  	xfs_trans_add_item(tp, &iip->ili_item);
>  }
>  
> @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
>  /*
>   * This is called to mark the fields indicated in fieldmask as needing to be
>   * logged when the transaction is committed.  The inode must already be
> - * associated with the given transaction.
> - *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> - * of the core inode if any of it has changed, and we always log all of the
> - * inline data/extents/b-tree root if any of them has changed.
> - *
> - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> - * cycles at inode writeback time. Avoid the need to add error handling to every
> - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> - * transactions to fail and everything to error out, just like if we return a
> - * read error in a dirty transaction and cancel it.
> + * associated with the given transaction. All we do here is record where the
> + * inode was dirtied and mark the transaction and inode log item dirty;
> + * everything else is done in the ->precommit log item operation after the
> + * changes in the transaction have been completed.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -96,7 +88,6 @@ xfs_trans_log_inode(
>  {
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct inode		*inode = VFS_I(ip);
> -	uint			iversion_flags = 0;
>  
>  	ASSERT(iip);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> @@ -104,18 +95,6 @@ xfs_trans_log_inode(
>  
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  
> -	/*
> -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> -	 * don't matter - we either will need an extra transaction in 24 hours
> -	 * to log the timestamps, or will clear already cleared fields in the
> -	 * worst case.
> -	 */
> -	if (inode->i_state & I_DIRTY_TIME) {
> -		spin_lock(&inode->i_lock);
> -		inode->i_state &= ~I_DIRTY_TIME;
> -		spin_unlock(&inode->i_lock);
> -	}
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -128,86 +107,12 @@ xfs_trans_log_inode(
>  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
>  		if (IS_I_VERSION(inode) &&
>  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			iversion_flags = XFS_ILOG_CORE;
> +			flags |= XFS_ILOG_IVERSION;
>  	}
>  
> -	/*
> -	 * If we're updating the inode core or the timestamps and it's possible
> -	 * to upgrade this inode to bigtime format, do so now.
> -	 */
> -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> -	    xfs_has_bigtime(ip->i_mount) &&
> -	    !xfs_inode_has_bigtime(ip)) {
> -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Inode verifiers do not check that the extent size hint is an integer
> -	 * multiple of the rt extent size on a directory with both rtinherit
> -	 * and extszinherit flags set.  If we're logging a directory that is
> -	 * misconfigured in this way, clear the hint.
> -	 */
> -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> -				   XFS_DIFLAG_EXTSZINHERIT);
> -		ip->i_extsize = 0;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This allows
> -	 * fdatasync to skip log forces for inodes that are only timestamp
> -	 * dirty.
> -	 */
> -	spin_lock(&iip->ili_lock);
> -	iip->ili_fsync_fields |= flags;
> -
> -	if (!iip->ili_item.li_buf) {
> -		struct xfs_buf	*bp;
> -		int		error;
> -
> -		/*
> -		 * We hold the ILOCK here, so this inode is not going to be
> -		 * flushed while we are here. Further, because there is no
> -		 * buffer attached to the item, we know that there is no IO in
> -		 * progress, so nothing will clear the ili_fields while we read
> -		 * in the buffer. Hence we can safely drop the spin lock and
> -		 * read the buffer knowing that the state will not change from
> -		 * here.
> -		 */
> -		spin_unlock(&iip->ili_lock);
> -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> -		if (error) {
> -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> -			return;
> -		}
> -
> -		/*
> -		 * We need an explicit buffer reference for the log item but
> -		 * don't want the buffer to remain attached to the transaction.
> -		 * Hold the buffer but release the transaction reference once
> -		 * we've attached the inode log item to the buffer log item
> -		 * list.
> -		 */
> -		xfs_buf_hold(bp);
> -		spin_lock(&iip->ili_lock);
> -		iip->ili_item.li_buf = bp;
> -		bp->b_flags |= _XBF_INODES;
> -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> -		xfs_trans_brelse(tp, bp);
> -	}
> -
> -	/*
> -	 * Always OR in the bits from the ili_last_fields field.  This is to
> -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> -	 */
> -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> -	spin_unlock(&iip->ili_lock);
> +	iip->ili_dirty_flags |= flags;
> +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> +		ip->i_ino, flags, iip->ili_dirty_flags);

Urk, leftover debugging info?

--D
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ca2941ab6cbc..586af11b7cd1 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_inode_log_item, ili_item);
>  }
>  
> +static uint64_t
> +xfs_inode_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return INODE_ITEM(lip)->ili_inode->i_ino;
> +}
> +
> +/*
> + * Prior to finally logging the inode, we have to ensure that all the
> + * per-modification inode state changes are applied. This includes VFS inode
> + * state updates, format conversions, verifier state synchronisation and
> + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> + *
> + * We have to be careful when we grab the inode cluster buffer due to lock
> + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> + * not locked until ->precommit, so it happens after everything else has been
> + * modified.
> + *
> + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> + * AGF lock modifying directory blocks.
> + *
> + * Rather than force a complete rework of all the transactions to call
> + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> + * ensures that the inode cluster buffer locking is always done last in a
> + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> + * cluster buffer.
> + *
> + * If we return the inode number as the precommit sort key then we'll also
> + * guarantee that the inode cluster buffer locking order is the same for all
> + * the inodes and unlink items in the transaction.
> + */
> +static int
> +xfs_inode_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_inode	*ip = iip->ili_inode;
> +	struct inode		*inode = VFS_I(ip);
> +	unsigned int		flags = iip->ili_dirty_flags;
> +
> +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> +	/*
> +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> +	 * don't matter - we either will need an extra transaction in 24 hours
> +	 * to log the timestamps, or will clear already cleared fields in the
> +	 * worst case.
> +	 */
> +	if (inode->i_state & I_DIRTY_TIME) {
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_DIRTY_TIME;
> +		spin_unlock(&inode->i_lock);
> +	}
> +
> +
> +	/*
> +	 * If we're updating the inode core or the timestamps and it's possible
> +	 * to upgrade this inode to bigtime format, do so now.
> +	 */
> +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> +	    xfs_has_bigtime(ip->i_mount) &&
> +	    !xfs_inode_has_bigtime(ip)) {
> +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Inode verifiers do not check that the extent size hint is an integer
> +	 * multiple of the rt extent size on a directory with both rtinherit
> +	 * and extszinherit flags set.  If we're logging a directory that is
> +	 * misconfigured in this way, clear the hint.
> +	 */
> +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> +				   XFS_DIFLAG_EXTSZINHERIT);
> +		ip->i_extsize = 0;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> +	 * (ili_fields) correctly tracks that the version has changed.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> +	if (flags & XFS_ILOG_IVERSION)
> +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> +		if (error)
> +			return error;
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference once
> +		 * we've attached the inode log item to the buffer log item
> +		 * list.
> +		 */
> +		xfs_buf_hold(bp);
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +		bp->b_flags |= _XBF_INODES;
> +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +		xfs_trans_brelse(tp, bp);
> +	}
> +
> +	/*
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> +	 */
> +	iip->ili_fields |= (flags | iip->ili_last_fields);
> +	spin_unlock(&iip->ili_lock);
> +
> +	/*
> +	 * We are done with the log item transaction dirty state, so clear it so
> +	 * that it doesn't pollute future transactions.
> +	 */
> +	iip->ili_dirty_flags = 0;
> +	return 0;
> +}
> +
>  /*
>   * The logged size of an inode fork is always the current size of the inode
>   * fork. This means that when an inode fork is relogged, the size of the logged
> @@ -662,6 +812,8 @@ xfs_inode_item_committing(
>  }
>  
>  static const struct xfs_item_ops xfs_inode_item_ops = {
> +	.iop_sort	= xfs_inode_item_sort,
> +	.iop_precommit	= xfs_inode_item_precommit,
>  	.iop_size	= xfs_inode_item_size,
>  	.iop_format	= xfs_inode_item_format,
>  	.iop_pin	= xfs_inode_item_pin,
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index bbd836a44ff0..377e06007804 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
>  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
>  	/*
>  	 * The ili_lock protects the interactions between the dirty state and
>  	 * the flush state of the inode log item. This allows us to do atomic
> -- 
> 2.40.1
>
Dave Chinner May 17, 2023, 1:47 a.m. UTC | #2
On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > ....
> > This change is largely moving the guts of xfs_trans_log_inode() to
> > xfs_inode_item_precommit() and providing an extra flag context in
> > the inode log item to track the dirty state of the inode in the
> > current transaction. This also means we do a lot less repeated work
> > in xfs_trans_log_inode() by only doing it once per transaction when
> > all the work is done.
> 
> Aha, and that's why you moved all the "opportunistically tweak inode
> metadata while we're already logging it" bits to the precommit hook.

Yes. It didn't make sense to move just some of the "only need to do
once per transaction while the inode is locked" stuff from one place
to the other. I figured it's better to have it all in one place and
do it all just once...

....
> > -	/*
> > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > -	 */
> > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > -	spin_unlock(&iip->ili_lock);
> > +	iip->ili_dirty_flags |= flags;
> > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > +		ip->i_ino, flags, iip->ili_dirty_flags);
> 
> Urk, leftover debugging info?

Yes, I just realised I'd left it there when writing the last email.
Ignoring those two trace-printk() statements, the rest of the code
should be ok to review...

-Dave.
Dave Chinner June 1, 2023, 1:51 a.m. UTC | #3
Friendly Ping.

Apart from the stray trace_printk()s I forgot to remove, are
there any other problems with this patch I need to fix?

-Dave.

> > -		if (error) {
> > -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > -			return;
> > -		}
> > -
> > -		/*
> > -		 * We need an explicit buffer reference for the log item but
> > -		 * don't want the buffer to remain attached to the transaction.
> > -		 * Hold the buffer but release the transaction reference once
> > -		 * we've attached the inode log item to the buffer log item
> > -		 * list.
> > -		 */
> > -		xfs_buf_hold(bp);
> > -		spin_lock(&iip->ili_lock);
> > -		iip->ili_item.li_buf = bp;
> > -		bp->b_flags |= _XBF_INODES;
> > -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > -		xfs_trans_brelse(tp, bp);
> > -	}
> > -
> > -	/*
> > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > -	 */
> > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > -	spin_unlock(&iip->ili_lock);
> > +	iip->ili_dirty_flags |= flags;
> > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > +		ip->i_ino, flags, iip->ili_dirty_flags);
> 
> Urk, leftover debugging info?
> 
> --D
> >  }
> >  
> >  int
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index ca2941ab6cbc..586af11b7cd1 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> >  	return container_of(lip, struct xfs_inode_log_item, ili_item);
> >  }
> >  
> > +static uint64_t
> > +xfs_inode_item_sort(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	return INODE_ITEM(lip)->ili_inode->i_ino;
> > +}
> > +
> > +/*
> > + * Prior to finally logging the inode, we have to ensure that all the
> > + * per-modification inode state changes are applied. This includes VFS inode
> > + * state updates, format conversions, verifier state synchronisation and
> > + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> > + *
> > + * We have to be careful when we grab the inode cluster buffer due to lock
> > + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> > + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> > + * not locked until ->precommit, so it happens after everything else has been
> > + * modified.
> > + *
> > + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> > + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> > + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> > + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> > + * AGF lock modifying directory blocks.
> > + *
> > + * Rather than force a complete rework of all the transactions to call
> > + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> > + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> > + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> > + * ensures that the inode cluster buffer locking is always done last in a
> > + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> > + * cluster buffer.
> > + *
> > + * If we return the inode number as the precommit sort key then we'll also
> > + * guarantee that the order of inode cluster buffer locking is the same for all
> > + * the inodes and unlink items in the transaction.
> > + */
> > +static int
> > +xfs_inode_item_precommit(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > +	struct xfs_inode	*ip = iip->ili_inode;
> > +	struct inode		*inode = VFS_I(ip);
> > +	unsigned int		flags = iip->ili_dirty_flags;
> > +
> > +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> > +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> > +	/*
> > +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > +	 * don't matter - we either will need an extra transaction in 24 hours
> > +	 * to log the timestamps, or will clear already cleared fields in the
> > +	 * worst case.
> > +	 */
> > +	if (inode->i_state & I_DIRTY_TIME) {
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state &= ~I_DIRTY_TIME;
> > +		spin_unlock(&inode->i_lock);
> > +	}
> > +
> > +
> > +	/*
> > +	 * If we're updating the inode core or the timestamps and it's possible
> > +	 * to upgrade this inode to bigtime format, do so now.
> > +	 */
> > +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > +	    xfs_has_bigtime(ip->i_mount) &&
> > +	    !xfs_inode_has_bigtime(ip)) {
> > +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > +		flags |= XFS_ILOG_CORE;
> > +	}
> > +
> > +	/*
> > +	 * Inode verifiers do not check that the extent size hint is an integer
> > +	 * multiple of the rt extent size on a directory with both rtinherit
> > +	 * and extszinherit flags set.  If we're logging a directory that is
> > +	 * misconfigured in this way, clear the hint.
> > +	 */
> > +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > +				   XFS_DIFLAG_EXTSZINHERIT);
> > +		ip->i_extsize = 0;
> > +		flags |= XFS_ILOG_CORE;
> > +	}
> > +
> > +	/*
> > +	 * Record the specific change for fdatasync optimisation. This allows
> > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> > +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> > +	 * (ili_fields) correctly tracks that the version has changed.
> > +	 */
> > +	spin_lock(&iip->ili_lock);
> > +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> > +	if (flags & XFS_ILOG_IVERSION)
> > +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> > +
> > +	if (!iip->ili_item.li_buf) {
> > +		struct xfs_buf	*bp;
> > +		int		error;
> > +
> > +		/*
> > +		 * We hold the ILOCK here, so this inode is not going to be
> > +		 * flushed while we are here. Further, because there is no
> > +		 * buffer attached to the item, we know that there is no IO in
> > +		 * progress, so nothing will clear the ili_fields while we read
> > +		 * in the buffer. Hence we can safely drop the spin lock and
> > +		 * read the buffer knowing that the state will not change from
> > +		 * here.
> > +		 */
> > +		spin_unlock(&iip->ili_lock);
> > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > +		if (error)
> > +			return error;
> > +
> > +		/*
> > +		 * We need an explicit buffer reference for the log item but
> > +		 * don't want the buffer to remain attached to the transaction.
> > +		 * Hold the buffer but release the transaction reference once
> > +		 * we've attached the inode log item to the buffer log item
> > +		 * list.
> > +		 */
> > +		xfs_buf_hold(bp);
> > +		spin_lock(&iip->ili_lock);
> > +		iip->ili_item.li_buf = bp;
> > +		bp->b_flags |= _XBF_INODES;
> > +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > +		xfs_trans_brelse(tp, bp);
> > +	}
> > +
> > +	/*
> > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > +	 */
> > +	iip->ili_fields |= (flags | iip->ili_last_fields);
> > +	spin_unlock(&iip->ili_lock);
> > +
> > +	/*
> > +	 * We are done with the log item transaction dirty state, so clear it so
> > +	 * that it doesn't pollute future transactions.
> > +	 */
> > +	iip->ili_dirty_flags = 0;
> > +	return 0;
> > +}
> > +
> >  /*
> >   * The logged size of an inode fork is always the current size of the inode
> >   * fork. This means that when an inode fork is relogged, the size of the logged
> > @@ -662,6 +812,8 @@ xfs_inode_item_committing(
> >  }
> >  
> >  static const struct xfs_item_ops xfs_inode_item_ops = {
> > +	.iop_sort	= xfs_inode_item_sort,
> > +	.iop_precommit	= xfs_inode_item_precommit,
> >  	.iop_size	= xfs_inode_item_size,
> >  	.iop_format	= xfs_inode_item_format,
> >  	.iop_pin	= xfs_inode_item_pin,
> > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> > index bbd836a44ff0..377e06007804 100644
> > --- a/fs/xfs/xfs_inode_item.h
> > +++ b/fs/xfs/xfs_inode_item.h
> > @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
> >  	struct xfs_log_item	ili_item;	   /* common portion */
> >  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> >  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> > +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
> >  	/*
> >  	 * The ili_lock protects the interactions between the dirty state and
> >  	 * the flush state of the inode log item. This allows us to do atomic
> > -- 
> > 2.40.1
> > 
>
Darrick J. Wong June 1, 2023, 2:38 p.m. UTC | #4
On Thu, Jun 01, 2023 at 11:51:49AM +1000, Dave Chinner wrote:
> Friendly Ping.
> 
> Apart from the stray trace_printk()s I forgot to remove, are
> there any other problems with this patch I need to fix?

Nothing obvious that I could see.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> -Dave.
> 
> On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> > On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Lock order in XFS is AGI -> AGF, hence for operations involving
> > > inode unlinked list operations we always lock the AGI first. Inode
> > > unlinked list operations operate on the inode cluster buffer,
> > > so the lock order there is AGI -> inode cluster buffer.
> > > 
> > > For O_TMPFILE operations, this now means the lock order set down in
> > > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > > unlinked ops are done before the directory modifications that may
> > > allocate space and lock the AGF.
> > > 
> > > Unfortunately, we also now lock the inode cluster buffer when
> > > logging an inode so that we can attach the inode to the cluster
> > > buffer and pin it in memory. This creates a lock order of AGF ->
> > > inode cluster buffer in directory operations as we have to log the
> > > inode after we've allocated new space for it.
> > > 
> > > This creates a lock inversion between the AGF and the inode cluster
> > > buffer. Because the inode cluster buffer is shared across multiple
> > > inodes, the inversion is not specific to individual inodes but can
> > > occur when inodes in the same cluster buffer are accessed in
> > > different orders.
> > > 
> > > To fix this we need to move all the inode log item cluster buffer
> > > interactions to the end of the current transaction. Unfortunately,
> > > xfs_trans_log_inode() calls are littered throughout the transactions
> > > with no thought to ordering against other items or locking. This
> > > makes it difficult to do anything that involves changing the call
> > > sites of xfs_trans_log_inode() to change locking orders.
> > > 
> > > However, we do now have a mechanism that allows us to postpone dirty
> > > item processing to just before we commit the transaction: the
> > > ->iop_precommit method. This will be called after all the
> > > modifications are done and high level objects like AGI and AGF
> > > buffers have been locked and modified, thereby providing a mechanism
> > > that guarantees we don't lock the inode cluster buffer before those
> > > high level objects are locked.
> > > 
> > > This change is largely moving the guts of xfs_trans_log_inode() to
> > > xfs_inode_item_precommit() and providing an extra flag context in
> > > the inode log item to track the dirty state of the inode in the
> > > current transaction. This also means we do a lot less repeated work
> > > in xfs_trans_log_inode() by only doing it once per transaction when
> > > all the work is done.
> > 
> > Aha, and that's why you moved all the "opportunistically tweak inode
> > metadata while we're already logging it" bits to the precommit hook.
> > 
> > > Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
> > >  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_inode_item.h         |   1 +
> > >  4 files changed, 171 insertions(+), 106 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > > index f13e0809dc63..269573c82808 100644
> > > --- a/fs/xfs/libxfs/xfs_log_format.h
> > > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > > @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
> > >  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
> > >  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
> > >  
> > > -
> > >  /*
> > >   * The timestamps are dirty, but not necessarily anything else in the inode
> > >   * core.  Unlike the other fields above this one must never make it to disk
> > > @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
> > >   */
> > >  #define XFS_ILOG_TIMESTAMP	0x4000
> > >  
> > > +/*
> > > + * The version field has been changed, but not necessarily anything else of
> > > + * interest. This must never make it to disk - it is used purely to ensure that
> > > + * the inode item ->precommit operation can update the fsync flag triggers
> > > + * in the inode item correctly.
> > > + */
> > > +#define XFS_ILOG_IVERSION	0x8000
> > > +
> > >  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
> > >  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
> > >  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..2d164d0588b1 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -40,9 +40,8 @@ xfs_trans_ijoin(
> > >  	iip->ili_lock_flags = lock_flags;
> > >  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
> > >  
> > > -	/*
> > > -	 * Get a log_item_desc to point at the new item.
> > > -	 */
> > > +	/* Reset the per-tx dirty context and add the item to the tx. */
> > > +	iip->ili_dirty_flags = 0;
> > >  	xfs_trans_add_item(tp, &iip->ili_item);
> > >  }
> > >  
> > > @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
> > >  /*
> > >   * This is called to mark the fields indicated in fieldmask as needing to be
> > >   * logged when the transaction is committed.  The inode must already be
> > > - * associated with the given transaction.
> > > - *
> > > - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> > > - * of the core inode if any of it has changed, and we always log all of the
> > > - * inline data/extents/b-tree root if any of them has changed.
> > > - *
> > > - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> > > - * cycles at inode writeback time. Avoid the need to add error handling to every
> > > - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> > > - * transactions to fail and everything to error out, just like if we return a
> > > - * read error in a dirty transaction and cancel it.
> > > + * associated with the given transaction. All we do here is record where the
> > > + * inode was dirtied and mark the transaction and inode log item dirty;
> > > + * everything else is done in the ->precommit log item operation after the
> > > + * changes in the transaction have been completed.
> > >   */
> > >  void
> > >  xfs_trans_log_inode(
> > > @@ -96,7 +88,6 @@ xfs_trans_log_inode(
> > >  {
> > >  	struct xfs_inode_log_item *iip = ip->i_itemp;
> > >  	struct inode		*inode = VFS_I(ip);
> > > -	uint			iversion_flags = 0;
> > >  
> > >  	ASSERT(iip);
> > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > @@ -104,18 +95,6 @@ xfs_trans_log_inode(
> > >  
> > >  	tp->t_flags |= XFS_TRANS_DIRTY;
> > >  
> > > -	/*
> > > -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > -	 * don't matter - we either will need an extra transaction in 24 hours
> > > -	 * to log the timestamps, or will clear already cleared fields in the
> > > -	 * worst case.
> > > -	 */
> > > -	if (inode->i_state & I_DIRTY_TIME) {
> > > -		spin_lock(&inode->i_lock);
> > > -		inode->i_state &= ~I_DIRTY_TIME;
> > > -		spin_unlock(&inode->i_lock);
> > > -	}
> > > -
> > >  	/*
> > >  	 * First time we log the inode in a transaction, bump the inode change
> > >  	 * counter if it is configured for this to occur. While we have the
> > > @@ -128,86 +107,12 @@ xfs_trans_log_inode(
> > >  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > >  		if (IS_I_VERSION(inode) &&
> > >  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > -			iversion_flags = XFS_ILOG_CORE;
> > > +			flags |= XFS_ILOG_IVERSION;
> > >  	}
> > >  
> > > -	/*
> > > -	 * If we're updating the inode core or the timestamps and it's possible
> > > -	 * to upgrade this inode to bigtime format, do so now.
> > > -	 */
> > > -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > -	    xfs_has_bigtime(ip->i_mount) &&
> > > -	    !xfs_inode_has_bigtime(ip)) {
> > > -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Inode verifiers do not check that the extent size hint is an integer
> > > -	 * multiple of the rt extent size on a directory with both rtinherit
> > > -	 * and extszinherit flags set.  If we're logging a directory that is
> > > -	 * misconfigured in this way, clear the hint.
> > > -	 */
> > > -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > -				   XFS_DIFLAG_EXTSZINHERIT);
> > > -		ip->i_extsize = 0;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Record the specific change for fdatasync optimisation. This allows
> > > -	 * fdatasync to skip log forces for inodes that are only timestamp
> > > -	 * dirty.
> > > -	 */
> > > -	spin_lock(&iip->ili_lock);
> > > -	iip->ili_fsync_fields |= flags;
> > > -
> > > -	if (!iip->ili_item.li_buf) {
> > > -		struct xfs_buf	*bp;
> > > -		int		error;
> > > -
> > > -		/*
> > > -		 * We hold the ILOCK here, so this inode is not going to be
> > > -		 * flushed while we are here. Further, because there is no
> > > -		 * buffer attached to the item, we know that there is no IO in
> > > -		 * progress, so nothing will clear the ili_fields while we read
> > > -		 * in the buffer. Hence we can safely drop the spin lock and
> > > -		 * read the buffer knowing that the state will not change from
> > > -		 * here.
> > > -		 */
> > > -		spin_unlock(&iip->ili_lock);
> > > -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > -		if (error) {
> > > -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > > -			return;
> > > -		}
> > > -
> > > -		/*
> > > -		 * We need an explicit buffer reference for the log item but
> > > -		 * don't want the buffer to remain attached to the transaction.
> > > -		 * Hold the buffer but release the transaction reference once
> > > -		 * we've attached the inode log item to the buffer log item
> > > -		 * list.
> > > -		 */
> > > -		xfs_buf_hold(bp);
> > > -		spin_lock(&iip->ili_lock);
> > > -		iip->ili_item.li_buf = bp;
> > > -		bp->b_flags |= _XBF_INODES;
> > > -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > -		xfs_trans_brelse(tp, bp);
> > > -	}
> > > -
> > > -	/*
> > > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > -	 */
> > > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > > -	spin_unlock(&iip->ili_lock);
> > > +	iip->ili_dirty_flags |= flags;
> > > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > > +		ip->i_ino, flags, iip->ili_dirty_flags);
> > 
> > Urk, leftover debugging info?
> > 
> > --D
> > >  }
> > >  
> > >  int
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index ca2941ab6cbc..586af11b7cd1 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> > >  	return container_of(lip, struct xfs_inode_log_item, ili_item);
> > >  }
> > >  
> > > +static uint64_t
> > > +xfs_inode_item_sort(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	return INODE_ITEM(lip)->ili_inode->i_ino;
> > > +}
> > > +
> > > +/*
> > > + * Prior to finally logging the inode, we have to ensure that all the
> > > + * per-modification inode state changes are applied. This includes VFS inode
> > > + * state updates, format conversions, verifier state synchronisation and
> > > + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> > > + *
> > > + * We have to be careful when we grab the inode cluster buffer due to lock
> > > + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> > > + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> > > + * not locked until ->precommit, so it happens after everything else has been
> > > + * modified.
> > > + *
> > > + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> > > + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> > > + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> > > + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> > > + * AGF lock modifying directory blocks.
> > > + *
> > > + * Rather than force a complete rework of all the transactions to call
> > > + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> > > + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> > > + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> > > + * ensures that the inode cluster buffer locking is always done last in a
> > > + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> > > + * cluster buffer.
> > > + *
> > > + * If we return the inode number as the precommit sort key then we'll also
> > > + * guarantee that the order of inode cluster buffer locking is the same for all
> > > + * the inodes and unlink items in the transaction.
> > > + */
> > > +static int
> > > +xfs_inode_item_precommit(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > > +	struct xfs_inode	*ip = iip->ili_inode;
> > > +	struct inode		*inode = VFS_I(ip);
> > > +	unsigned int		flags = iip->ili_dirty_flags;
> > > +
> > > +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> > > +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> > > +	/*
> > > +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > +	 * don't matter - we either will need an extra transaction in 24 hours
> > > +	 * to log the timestamps, or will clear already cleared fields in the
> > > +	 * worst case.
> > > +	 */
> > > +	if (inode->i_state & I_DIRTY_TIME) {
> > > +		spin_lock(&inode->i_lock);
> > > +		inode->i_state &= ~I_DIRTY_TIME;
> > > +		spin_unlock(&inode->i_lock);
> > > +	}
> > > +
> > > +
> > > +	/*
> > > +	 * If we're updating the inode core or the timestamps and it's possible
> > > +	 * to upgrade this inode to bigtime format, do so now.
> > > +	 */
> > > +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > +	    xfs_has_bigtime(ip->i_mount) &&
> > > +	    !xfs_inode_has_bigtime(ip)) {
> > > +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Inode verifiers do not check that the extent size hint is an integer
> > > +	 * multiple of the rt extent size on a directory with both rtinherit
> > > +	 * and extszinherit flags set.  If we're logging a directory that is
> > > +	 * misconfigured in this way, clear the hint.
> > > +	 */
> > > +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > +				   XFS_DIFLAG_EXTSZINHERIT);
> > > +		ip->i_extsize = 0;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Record the specific change for fdatasync optimisation. This allows
> > > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > > +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> > > +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> > > +	 * (ili_fields) correctly tracks that the version has changed.
> > > +	 */
> > > +	spin_lock(&iip->ili_lock);
> > > +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> > > +	if (flags & XFS_ILOG_IVERSION)
> > > +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> > > +
> > > +	if (!iip->ili_item.li_buf) {
> > > +		struct xfs_buf	*bp;
> > > +		int		error;
> > > +
> > > +		/*
> > > +		 * We hold the ILOCK here, so this inode is not going to be
> > > +		 * flushed while we are here. Further, because there is no
> > > +		 * buffer attached to the item, we know that there is no IO in
> > > +		 * progress, so nothing will clear the ili_fields while we read
> > > +		 * in the buffer. Hence we can safely drop the spin lock and
> > > +		 * read the buffer knowing that the state will not change from
> > > +		 * here.
> > > +		 */
> > > +		spin_unlock(&iip->ili_lock);
> > > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > +		if (error)
> > > +			return error;
> > > +
> > > +		/*
> > > +		 * We need an explicit buffer reference for the log item but
> > > +		 * don't want the buffer to remain attached to the transaction.
> > > +		 * Hold the buffer but release the transaction reference once
> > > +		 * we've attached the inode log item to the buffer log item
> > > +		 * list.
> > > +		 */
> > > +		xfs_buf_hold(bp);
> > > +		spin_lock(&iip->ili_lock);
> > > +		iip->ili_item.li_buf = bp;
> > > +		bp->b_flags |= _XBF_INODES;
> > > +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > +		xfs_trans_brelse(tp, bp);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > +	 */
> > > +	iip->ili_fields |= (flags | iip->ili_last_fields);
> > > +	spin_unlock(&iip->ili_lock);
> > > +
> > > +	/*
> > > +	 * We are done with the log item transaction dirty state, so clear it so
> > > +	 * that it doesn't pollute future transactions.
> > > +	 */
> > > +	iip->ili_dirty_flags = 0;
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * The logged size of an inode fork is always the current size of the inode
> > >   * fork. This means that when an inode fork is relogged, the size of the logged
> > > @@ -662,6 +812,8 @@ xfs_inode_item_committing(
> > >  }
> > >  
> > >  static const struct xfs_item_ops xfs_inode_item_ops = {
> > > +	.iop_sort	= xfs_inode_item_sort,
> > > +	.iop_precommit	= xfs_inode_item_precommit,
> > >  	.iop_size	= xfs_inode_item_size,
> > >  	.iop_format	= xfs_inode_item_format,
> > >  	.iop_pin	= xfs_inode_item_pin,
> > > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> > > index bbd836a44ff0..377e06007804 100644
> > > --- a/fs/xfs/xfs_inode_item.h
> > > +++ b/fs/xfs/xfs_inode_item.h
> > > @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
> > >  	struct xfs_log_item	ili_item;	   /* common portion */
> > >  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> > >  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> > > +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
> > >  	/*
> > >  	 * The ili_lock protects the interactions between the dirty state and
> > >  	 * the flush state of the inode log item. This allows us to do atomic
> > > -- 
> > > 2.40.1
> > > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig June 1, 2023, 3:12 p.m. UTC | #5
[apparently your gmail server decided my previous reply was spam, so
 I'm not sure this reaches you]

This looks good minus the left over trace_printk statements.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Matthew Wilcox June 25, 2023, 2:58 a.m. UTC | #6
On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> Lock order in XFS is AGI -> AGF, hence for operations involving
> inode unlinked list operations we always lock the AGI first. Inode
> unlinked list operations operate on the inode cluster buffer,
> so the lock order there is AGI -> inode cluster buffer.

Hi Dave,

This commit reliably produces an assertion failure for me.  I haven't
tried to analyse why.  It's pretty clear though; I can run generic/426
in a loop for hundreds of seconds on the parent commit (cb042117488d),
but it'll die within 30 seconds on commit 82842fee6e59.

    export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

I suspect the size=1024 is the important thing, but I haven't tested
that hypothesis.  This is on an x86-64 virtual machine; full qemu
command line at the end [1]

00028 FSTYP         -- xfs (debug)
00028 PLATFORM      -- Linux/x86_64 pepe-kvm 6.4.0-rc5-00004-g82842fee6e59 #182 SMP PREEMPT_DYNAMIC Sat Jun 24 22:51:32 EDT 2023
00028 MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
00028 MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
00028
00028 XFS (sdc): Mounting V5 Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
00028 XFS (sdc): Ending clean mount
00028 XFS (sdc): Unmounting Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
00028 XFS (sdb): EXPERIMENTAL online scrub feature in use. Use at your own risk!
00028 XFS (sdb): Unmounting Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
00028 XFS (sdb): Mounting V5 Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
00028 XFS (sdb): Ending clean mount
00028 generic/426       run fstests generic/426 at 2023-06-25 02:52:07
00029 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
00029 ------------[ cut here ]------------
00029 kernel BUG at fs/xfs/xfs_message.c:102!
00029 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
00029 CPU: 1 PID: 62 Comm: kworker/1:1 Kdump: loaded Not tainted 6.4.0-rc5-00004-g82842fee6e59 #182
00029 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
00029 Workqueue: xfs-inodegc/sdb xfs_inodegc_worker
00029 RIP: 0010:assfail+0x30/0x40
00029 Code: c9 48 c7 c2 48 f8 ea 81 48 89 f1 48 89 e5 48 89 fe 48 c7 c7 b9 cc e5 81 e8 fd fd ff ff 80 3d f6 2f d3 00 00 75 04 0f 0b 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 63 f6 49 89
00029 RSP: 0018:ffff88800317bc78 EFLAGS: 00010202
00029 RAX: 00000000ffffffea RBX: ffff88800611e000 RCX: 000000007fffffff
00029 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffffffff81e5ccb9
00029 RBP: ffff88800317bc78 R08: 0000000000000000 R09: 000000000000000a
00029 R10: 000000000000000a R11: 0fffffffffffffff R12: ffff88800c780800
00029 R13: ffff88800317bce0 R14: 0000000000000001 R15: ffff88800c73d000
00029 FS:  0000000000000000(0000) GS:ffff88807d840000(0000) knlGS:0000000000000000
00029 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
00029 CR2: 00005623b1911068 CR3: 000000000ee28003 CR4: 0000000000770ea0
00029 PKRU: 55555554
00029 Call Trace:
00029  <TASK>
00029  ? show_regs+0x5c/0x70
00029  ? die+0x32/0x90
00029  ? do_trap+0xbb/0xe0
00029  ? do_error_trap+0x67/0x90
00029  ? assfail+0x30/0x40
00029  ? exc_invalid_op+0x52/0x70
00029  ? assfail+0x30/0x40
00029  ? asm_exc_invalid_op+0x1b/0x20
00029  ? assfail+0x30/0x40
00029  ? assfail+0x23/0x40
00029  xfs_trans_read_buf_map+0x2d9/0x480
00029  xfs_imap_to_bp+0x3d/0x40
00029  xfs_inode_item_precommit+0x176/0x200
00029  xfs_trans_run_precommits+0x65/0xc0
00029  __xfs_trans_commit+0x3d/0x360
00029  xfs_trans_commit+0xb/0x10
00029  xfs_inactive_ifree.isra.0+0xea/0x200
00029  xfs_inactive+0x132/0x230
00029  xfs_inodegc_worker+0xb6/0x1a0
00029  process_one_work+0x1a9/0x3a0
00029  worker_thread+0x4e/0x3a0
00029  ? process_one_work+0x3a0/0x3a0
00029  kthread+0xf9/0x130

In case things have moved around since that commit, the particular line
throwing the assertion is in this paragraph:

        if (bp) {
                ASSERT(xfs_buf_islocked(bp));
                ASSERT(bp->b_transp == tp);
                ASSERT(bp->b_log_item != NULL);
                ASSERT(!bp->b_error);
                ASSERT(bp->b_flags & XBF_DONE);

It's the last one that trips.  Sorry for not catching this earlier; my
test suite experienced a bit of a failure and I only just got around to
fixing it enough to run all the way through.

[1] qemu-system-x86_64 -nodefaults -nographic -cpu host -machine type=q35,accel=kvm,nvdimm=on -m 2G,slots=8,maxmem=256G -smp 8 -kernel /home/willy/kernel/linux-next/.build_test_kernel-x86_64/kpgk/vmlinuz -append mitigations=off console=hvc0 root=/dev/sda rw log_buf_len=8M ktest.dir=/home/willy/kernel/ktest ktest.env=/tmp/build-test-kernel-FzOfFCHDVD/env crashkernel=128M no_console_suspend page_owner=on -device virtio-serial -chardev stdio,id=console -device virtconsole,chardev=console -serial unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-kgdb,server,nowait -monitor unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-mon,server,nowait -gdb unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-gdb,server,nowait -device virtio-rng-pci -virtfs local,path=/,mount_tag=host,security_model=none -device virtio-scsi-pci,id=hba -nic user,model=virtio,hostfwd=tcp:127.0.0.1:28201-:22 -drive if=none,format=raw,id=disk0,file=/var/lib/ktest/root.amd64,snapshot=on -device scsi-hd,bus=hba.0,drive=disk0 -drive if=none,format=raw,id=disk1,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-1,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk1 -drive if=none,format=raw,id=disk2,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-2,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk2 -drive if=none,format=raw,id=disk3,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-3,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk3 -drive if=none,format=raw,id=disk4,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-4,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk4
Dave Chinner June 25, 2023, 10:34 p.m. UTC | #7
On Sun, Jun 25, 2023 at 03:58:15AM +0100, Matthew Wilcox wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> 
> Hi Dave,
> 
> This commit reliably produces an assertion failure for me.  I haven't
> tried to analyse why.  It's pretty clear though; I can run generic/426
> in a loop for hundreds of seconds on the parent commit (cb042117488d),
> but it'll die within 30 seconds on commit 82842fee6e59.
> 
>     export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

That's part of my regular regression test config (mkfs defaults w/
1kB block size), and I haven't seen this problem.

I've just kicked off an iteration of g/426 on a couple of machines,
on a current TOT, and they are already a couple of hundred
iterations in without a failure....

Can you grab a trace for me? i.e. run

# trace-cmd record -e xfs\* -e printk

in one shell and leave it running. Then in another shell run the
test. When the test fails, ctrl-c the trace-cmd and send me
the output of

# trace-cmd report > report.txt

> I suspect the size=1024 is the important thing, but I haven't tested
> that hypothesis.  This is on an x86-64 virtual machine; full qemu
> command line at the end [1]

As it's an inode cluster buffer failure, I very much doubt
filesystem block size plays a part. Inode buffer size is defined by
inode size, not filesystem block size, and so the buffer in question
will be a 16kB unmapped buffer because inode size is 512 bytes...

> [...]
> In case things have moved around since that commit, the particular line
> throwing the assertion is in this paragraph:
> 
>         if (bp) {
>                 ASSERT(xfs_buf_islocked(bp));
>                 ASSERT(bp->b_transp == tp);
>                 ASSERT(bp->b_log_item != NULL);
>                 ASSERT(!bp->b_error);
>                 ASSERT(bp->b_flags & XBF_DONE);

Nothing immediately obvious stands out here. 

I suspect it may be an interaction with memory reclaim freeing the
inode cluster buffer while it is clean after the inode has been
brought into memory, then xfs_ifree_cluster using
xfs_trans_get_buf() to invalidate the inode cluster (hence bringing
it into memory without reading its contents), and then this path
trying to read it, finding it already linked into the transaction,
and skipping the buffer cache lookup that would have read the data
in....

The trace will tell me if this is roughly what is happening.

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index f13e0809dc63..269573c82808 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -324,7 +324,6 @@  struct xfs_inode_log_format_32 {
 #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
 #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
 
-
 /*
  * The timestamps are dirty, but not necessarily anything else in the inode
  * core.  Unlike the other fields above this one must never make it to disk
@@ -333,6 +332,14 @@  struct xfs_inode_log_format_32 {
  */
 #define XFS_ILOG_TIMESTAMP	0x4000
 
+/*
+ * The version field has been changed, but not necessarily anything else of
+ * interest. This must never make it to disk - it is used purely to ensure that
+ * the inode item ->precommit operation can update the fsync flag triggers
+ * in the inode item correctly.
+ */
+#define XFS_ILOG_IVERSION	0x8000
+
 #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
 				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
 				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..2d164d0588b1 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -40,9 +40,8 @@  xfs_trans_ijoin(
 	iip->ili_lock_flags = lock_flags;
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
-	/*
-	 * Get a log_item_desc to point at the new item.
-	 */
+	/* Reset the per-tx dirty context and add the item to the tx. */
+	iip->ili_dirty_flags = 0;
 	xfs_trans_add_item(tp, &iip->ili_item);
 }
 
@@ -76,17 +75,10 @@  xfs_trans_ichgtime(
 /*
  * This is called to mark the fields indicated in fieldmask as needing to be
  * logged when the transaction is committed.  The inode must already be
- * associated with the given transaction.
- *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
- * of the core inode if any of it has changed, and we always log all of the
- * inline data/extents/b-tree root if any of them has changed.
- *
- * Grab and pin the cluster buffer associated with this inode to avoid RMW
- * cycles at inode writeback time. Avoid the need to add error handling to every
- * xfs_trans_log_inode() call by shutting down on read error.  This will cause
- * transactions to fail and everything to error out, just like if we return a
- * read error in a dirty transaction and cancel it.
+ * associated with the given transaction. All we do here is record where the
+ * inode was dirtied and mark the transaction and inode log item dirty;
+ * everything else is done in the ->precommit log item operation after the
+ * changes in the transaction have been completed.
  */
 void
 xfs_trans_log_inode(
@@ -96,7 +88,6 @@  xfs_trans_log_inode(
 {
 	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct inode		*inode = VFS_I(ip);
-	uint			iversion_flags = 0;
 
 	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
@@ -104,18 +95,6 @@  xfs_trans_log_inode(
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 
-	/*
-	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
-	 * don't matter - we either will need an extra transaction in 24 hours
-	 * to log the timestamps, or will clear already cleared fields in the
-	 * worst case.
-	 */
-	if (inode->i_state & I_DIRTY_TIME) {
-		spin_lock(&inode->i_lock);
-		inode->i_state &= ~I_DIRTY_TIME;
-		spin_unlock(&inode->i_lock);
-	}
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -128,86 +107,12 @@  xfs_trans_log_inode(
 	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
 		if (IS_I_VERSION(inode) &&
 		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
-			iversion_flags = XFS_ILOG_CORE;
+			flags |= XFS_ILOG_IVERSION;
 	}
 
-	/*
-	 * If we're updating the inode core or the timestamps and it's possible
-	 * to upgrade this inode to bigtime format, do so now.
-	 */
-	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
-	    xfs_has_bigtime(ip->i_mount) &&
-	    !xfs_inode_has_bigtime(ip)) {
-		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
-		flags |= XFS_ILOG_CORE;
-	}
-
-	/*
-	 * Inode verifiers do not check that the extent size hint is an integer
-	 * multiple of the rt extent size on a directory with both rtinherit
-	 * and extszinherit flags set.  If we're logging a directory that is
-	 * misconfigured in this way, clear the hint.
-	 */
-	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
-	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
-	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
-		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
-				   XFS_DIFLAG_EXTSZINHERIT);
-		ip->i_extsize = 0;
-		flags |= XFS_ILOG_CORE;
-	}
-
-	/*
-	 * Record the specific change for fdatasync optimisation. This allows
-	 * fdatasync to skip log forces for inodes that are only timestamp
-	 * dirty.
-	 */
-	spin_lock(&iip->ili_lock);
-	iip->ili_fsync_fields |= flags;
-
-	if (!iip->ili_item.li_buf) {
-		struct xfs_buf	*bp;
-		int		error;
-
-		/*
-		 * We hold the ILOCK here, so this inode is not going to be
-		 * flushed while we are here. Further, because there is no
-		 * buffer attached to the item, we know that there is no IO in
-		 * progress, so nothing will clear the ili_fields while we read
-		 * in the buffer. Hence we can safely drop the spin lock and
-		 * read the buffer knowing that the state will not change from
-		 * here.
-		 */
-		spin_unlock(&iip->ili_lock);
-		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
-		if (error) {
-			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
-			return;
-		}
-
-		/*
-		 * We need an explicit buffer reference for the log item but
-		 * don't want the buffer to remain attached to the transaction.
-		 * Hold the buffer but release the transaction reference once
-		 * we've attached the inode log item to the buffer log item
-		 * list.
-		 */
-		xfs_buf_hold(bp);
-		spin_lock(&iip->ili_lock);
-		iip->ili_item.li_buf = bp;
-		bp->b_flags |= _XBF_INODES;
-		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-		xfs_trans_brelse(tp, bp);
-	}
-
-	/*
-	 * Always OR in the bits from the ili_last_fields field.  This is to
-	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
-	 * in the eventual clearing of the ili_fields bits.  See the big comment
-	 * in xfs_iflush() for an explanation of this coordination mechanism.
-	 */
-	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
-	spin_unlock(&iip->ili_lock);
+	iip->ili_dirty_flags |= flags;
+	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
+		ip->i_ino, flags, iip->ili_dirty_flags);
 }
 
 int
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ca2941ab6cbc..586af11b7cd1 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -29,6 +29,156 @@  static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+static uint64_t
+xfs_inode_item_sort(
+	struct xfs_log_item	*lip)
+{
+	return INODE_ITEM(lip)->ili_inode->i_ino;
+}
+
+/*
+ * Prior to finally logging the inode, we have to ensure that all the
+ * per-modification inode state changes are applied. This includes VFS inode
+ * state updates, format conversions, verifier state synchronisation and
+ * ensuring the inode buffer remains in memory whilst the inode is dirty.
+ *
+ * We have to be careful when we grab the inode cluster buffer due to lock
+ * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
+ * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
+ * not locked until ->precommit, so it happens after everything else has been
+ * modified.
+ *
+ * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
+ * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
+ * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
+ * it can be called on an inode (e.g. via bumplink/droplink) before we take the
+ * AGF lock modifying directory blocks.
+ *
+ * Rather than force a complete rework of all the transactions to call
+ * xfs_trans_log_inode() once and once only at the end of every transaction, we
+ * move the pinning of the inode cluster buffer to a ->precommit operation. This
+ * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
+ * ensures that the inode cluster buffer locking is always done last in a
+ * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
+ * cluster buffer.
+ *
+ * If we return the inode number as the precommit sort key, then we also
+ * guarantee that the inode cluster buffer locking order is the same for all
+ * the inodes and unlink items in the transaction.
+ */
+static int
+xfs_inode_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_inode	*ip = iip->ili_inode;
+	struct inode		*inode = VFS_I(ip);
+	unsigned int		flags = iip->ili_dirty_flags;
+
+	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
+		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
+	/*
+	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
+	 * don't matter - we either will need an extra transaction in 24 hours
+	 * to log the timestamps, or will clear already cleared fields in the
+	 * worst case.
+	 */
+	if (inode->i_state & I_DIRTY_TIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+	}
+
+
+	/*
+	 * If we're updating the inode core or the timestamps and it's possible
+	 * to upgrade this inode to bigtime format, do so now.
+	 */
+	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
+	    xfs_has_bigtime(ip->i_mount) &&
+	    !xfs_inode_has_bigtime(ip)) {
+		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
+		flags |= XFS_ILOG_CORE;
+	}
+
+	/*
+	 * Inode verifiers do not check that the extent size hint is an integer
+	 * multiple of the rt extent size on a directory with both rtinherit
+	 * and extszinherit flags set.  If we're logging a directory that is
+	 * misconfigured in this way, clear the hint.
+	 */
+	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
+	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
+	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
+		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
+				   XFS_DIFLAG_EXTSZINHERIT);
+		ip->i_extsize = 0;
+		flags |= XFS_ILOG_CORE;
+	}
+
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
+	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
+	 * (ili_fields) correctly tracks that the version has changed.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
+	if (flags & XFS_ILOG_IVERSION)
+		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
+
+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
+		if (error)
+			return error;
+
+		/*
+		 * We need an explicit buffer reference for the log item but
+		 * don't want the buffer to remain attached to the transaction.
+		 * Hold the buffer but release the transaction reference once
+		 * we've attached the inode log item to the buffer log item
+		 * list.
+		 */
+		xfs_buf_hold(bp);
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
+	}
+
+	/*
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
+	 * in the eventual clearing of the ili_fields bits.  See the big comment
+	 * in xfs_iflush() for an explanation of this coordination mechanism.
+	 */
+	iip->ili_fields |= (flags | iip->ili_last_fields);
+	spin_unlock(&iip->ili_lock);
+
+	/*
+	 * We are done with the log item transaction dirty state, so clear it so
+	 * that it doesn't pollute future transactions.
+	 */
+	iip->ili_dirty_flags = 0;
+	return 0;
+}
+
 /*
  * The logged size of an inode fork is always the current size of the inode
  * fork. This means that when an inode fork is relogged, the size of the logged
@@ -662,6 +812,8 @@  xfs_inode_item_committing(
 }
 
 static const struct xfs_item_ops xfs_inode_item_ops = {
+	.iop_sort	= xfs_inode_item_sort,
+	.iop_precommit	= xfs_inode_item_precommit,
 	.iop_size	= xfs_inode_item_size,
 	.iop_format	= xfs_inode_item_format,
 	.iop_pin	= xfs_inode_item_pin,
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index bbd836a44ff0..377e06007804 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -17,6 +17,7 @@  struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
 	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	unsigned int		ili_dirty_flags;   /* dirty in current tx */
 	/*
 	 * The ili_lock protects the interactions between the dirty state and
 	 * the flush state of the inode log item. This allows us to do atomic