diff mbox series

[03/30] xfs: add an inode item lock

Message ID 20200601214251.4167140-4-david@fromorbit.com (mailing list archive)
State Superseded
Headers show
Series xfs: rework inode flushing to make inode reclaim fully asynchronous | expand

Commit Message

Dave Chinner June 1, 2020, 9:42 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged.  Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.

We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 54 +++++++++++++++++----------------
 fs/xfs/xfs_file.c               |  9 ++++--
 fs/xfs/xfs_inode.c              | 20 +++++++-----
 fs/xfs/xfs_inode_item.c         |  7 +++++
 fs/xfs/xfs_inode_item.h         | 18 +++++++++--
 5 files changed, 68 insertions(+), 40 deletions(-)

Comments

Brian Foster June 2, 2020, 4:34 p.m. UTC | #1
On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time time existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
> 
> e.g. updating ili_last_fields occurs at flush time under the
> ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
> 
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
> 
> However, if we want to do more complex state manipulations on the
> log item that requires updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
> 
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
> 
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.
> 

Thanks for pointing this out.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 54 +++++++++++++++++----------------
>  fs/xfs/xfs_file.c               |  9 ++++--
>  fs/xfs/xfs_inode.c              | 20 +++++++-----
>  fs/xfs/xfs_inode_item.c         |  7 +++++
>  fs/xfs/xfs_inode_item.h         | 18 +++++++++--
>  5 files changed, 68 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 4504d215cd590..fe6c2e39be85d 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
...
> @@ -122,23 +117,30 @@ xfs_trans_log_inode(
>  	 * set however, then go ahead and bump the i_version counter
>  	 * unconditionally.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> +		if (IS_I_VERSION(inode) &&
> +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;
>  	}
>  
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. We do this before the change count so that the core being
> +	 * logged in this case does not impact on fdatasync behaviour.
> +	 */

We no longer do this before the change count logic so that part of the
comment is bogus.

> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= flags;
>  
>  	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> +	 * xfs_iflush() for an explanation of this coordination mechanism.
>  	 */
> -	flags |= ip->i_itemp->ili_last_fields;
> -	ip->i_itemp->ili_fields |= flags;
> +	iip->ili_fields |= (flags | iip->ili_last_fields |
> +			    iversion_flags);
> +	spin_unlock(&iip->ili_lock);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>  	struct inode		*inode = file->f_mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
>  	int			log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip)) {
>  		if (!datasync ||
> -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -			lsn = ip->i_itemp->ili_last_lsn;
> +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +			lsn = iip->ili_last_lsn;

I am still a little confused why the lock is elided in other read cases,
such as this one or perhaps the similar check in xfs_bmbt_to_iomap()..?

Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
of this lock (though last_lsn looks like it's also covered by ilock),
yet the update to the inode_log_item struct implies they should be
protected. What's the intent there?

>  	}
>  
>  	if (lsn) {
>  		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
...
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..44c47c08b0b59 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -16,12 +16,24 @@ struct xfs_mount;
>  struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> -	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> -	unsigned short		ili_lock_flags;	   /* lock flags */
> +	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	/*
> +	 * The ili_lock protects the interactions between the dirty state and
> +	 * the flush state of the inode log item. This allows us to do atomic
> +	 * modifications of multiple state fields without having to hold a
> +	 * specific inode lock to serialise them.
> +	 *
> +	 * We need atomic changes between indoe dirtying, inode flushing and

s/indoe/inode/

Brian

> +	 * inode completion, but these all hold different combinations of
> +	 * ILOCK and iflock and hence we need some other method of serialising
> +	 * updates to the flush state.
> +	 */
> +	spinlock_t		ili_lock;	   /* flush state lock */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> +	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> +	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  };
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> -- 
> 2.26.2.761.g0e0b3e54be
>
Dave Chinner June 4, 2020, 1:54 a.m. UTC | #2
On Tue, Jun 02, 2020 at 12:34:44PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> ...
> > @@ -122,23 +117,30 @@ xfs_trans_log_inode(
> >  	 * set however, then go ahead and bump the i_version counter
> >  	 * unconditionally.
> >  	 */
> > -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> > -	    IS_I_VERSION(VFS_I(ip))) {
> > -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> > -			flags |= XFS_ILOG_CORE;
> > +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > +		if (IS_I_VERSION(inode) &&
> > +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > +			iversion_flags = XFS_ILOG_CORE;
> >  	}
> >  
> > -	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	/*
> > +	 * Record the specific change for fdatasync optimisation. This allows
> > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > +	 * dirty. We do this before the change count so that the core being
> > +	 * logged in this case does not impact on fdatasync behaviour.
> > +	 */
> 
> We no longer do this before the change count logic so that part of the
> comment is bogus.

Ugh. Another 6 patch conflicts to resolve coming right up....

> > +	spin_lock(&iip->ili_lock);
> > +	iip->ili_fsync_fields |= flags;
> >  
> >  	/*
> > -	 * Always OR in the bits from the ili_last_fields field.
> > -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> > -	 * routines in the eventual clearing of the ili_fields bits.
> > -	 * See the big comment in xfs_iflush() for an explanation of
> > -	 * this coordination mechanism.
> > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> > +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> > +	 * xfs_iflush() for an explanation of this coordination mechanism.
> >  	 */
> > -	flags |= ip->i_itemp->ili_last_fields;
> > -	ip->i_itemp->ili_fields |= flags;
> > +	iip->ili_fields |= (flags | iip->ili_last_fields |
> > +			    iversion_flags);
> > +	spin_unlock(&iip->ili_lock);
> >  }
> >  
> >  int
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 403c90309a8ff..0abf770b77498 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -94,6 +94,7 @@ xfs_file_fsync(
> >  {
> >  	struct inode		*inode = file->f_mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> >  	int			log_flushed = 0;
> > @@ -137,13 +138,15 @@ xfs_file_fsync(
> >  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> >  	if (xfs_ipincount(ip)) {
> >  		if (!datasync ||
> > -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > -			lsn = ip->i_itemp->ili_last_lsn;
> > +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > +			lsn = iip->ili_last_lsn;
> 
> I am still a little confused why the lock is elided in other read cases,
> such as this one or perhaps the similar check in xfs_bmbt_to_iomap()..?

They are still all serialised against those field changing the same
way they currently are. i.e. they are all under the ILOCK, so
changes that occur during IO submission will never occur.  Hence the
only thing that we can race with is IO completion clearing the
fields, in which case the subsequent operations if the item is now
clean turn into no-ops.

i.e:
- ILOCK serialises transaction logging vs IO submission.
- iflock serialises IO submission vs IO completion.
- Nothing serialises transaction logging vs IO completion.

The latter is what the ili_lock is intended for; everything else is
still protected by the existing serialisation mechanisms that they
are now. Any races in areas outside xfs_trans_log_inode() vs
xfs_iflush_done/abort() is largely outside the scope of this patch
and this lock...

> Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
> of this lock (though last_lsn looks like it's also covered by ilock),
> yet the update to the inode_log_item struct implies they should be
> protected. What's the intent there?

The lsn fields are updated via xfs_trans_ail_lsn_copy(), which on 32
bit systems takes the AIL lock, and I don't think it's a good idea
to put the AIL lock inside the inode item lock.

Cheers,

Dave.
Brian Foster June 4, 2020, 2:03 p.m. UTC | #3
On Thu, Jun 04, 2020 at 11:54:56AM +1000, Dave Chinner wrote:
> On Tue, Jun 02, 2020 at 12:34:44PM -0400, Brian Foster wrote:
> > On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > ...
> > > @@ -122,23 +117,30 @@ xfs_trans_log_inode(
> > >  	 * set however, then go ahead and bump the i_version counter
> > >  	 * unconditionally.
> > >  	 */
> > > -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> > > -	    IS_I_VERSION(VFS_I(ip))) {
> > > -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> > > -			flags |= XFS_ILOG_CORE;
> > > +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > +		if (IS_I_VERSION(inode) &&
> > > +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > +			iversion_flags = XFS_ILOG_CORE;
> > >  	}
> > >  
> > > -	tp->t_flags |= XFS_TRANS_DIRTY;
> > > +	/*
> > > +	 * Record the specific change for fdatasync optimisation. This allows
> > > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > > +	 * dirty. We do this before the change count so that the core being
> > > +	 * logged in this case does not impact on fdatasync behaviour.
> > > +	 */
> > 
> > We no longer do this before the change count logic so that part of the
> > comment is bogus.
> 
> Ugh. Another 6 patch conflicts to resolve coming right up....
> 
> > > +	spin_lock(&iip->ili_lock);
> > > +	iip->ili_fsync_fields |= flags;
> > >  
> > >  	/*
> > > -	 * Always OR in the bits from the ili_last_fields field.
> > > -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> > > -	 * routines in the eventual clearing of the ili_fields bits.
> > > -	 * See the big comment in xfs_iflush() for an explanation of
> > > -	 * this coordination mechanism.
> > > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> > > +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> > > +	 * xfs_iflush() for an explanation of this coordination mechanism.
> > >  	 */
> > > -	flags |= ip->i_itemp->ili_last_fields;
> > > -	ip->i_itemp->ili_fields |= flags;
> > > +	iip->ili_fields |= (flags | iip->ili_last_fields |
> > > +			    iversion_flags);
> > > +	spin_unlock(&iip->ili_lock);
> > >  }
> > >  
> > >  int
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 403c90309a8ff..0abf770b77498 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -94,6 +94,7 @@ xfs_file_fsync(
> > >  {
> > >  	struct inode		*inode = file->f_mapping->host;
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > >  	int			error = 0;
> > >  	int			log_flushed = 0;
> > > @@ -137,13 +138,15 @@ xfs_file_fsync(
> > >  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > >  	if (xfs_ipincount(ip)) {
> > >  		if (!datasync ||
> > > -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > -			lsn = ip->i_itemp->ili_last_lsn;
> > > +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > +			lsn = iip->ili_last_lsn;
> > 
> > I am still a little confused why the lock is elided in other read cases,
> > such as this one or perhaps the similar check in xfs_bmbt_to_iomap()..?
> 
> They are still all serialised against those field changing the same
> way they currently are. i.e. they are all under the ILOCK, so
> changes that occur during IO submission will never occur.  Hence the
> only thing that we can race with is IO completion clearing the
> fields, in which case the subsequent operations if the item is now
> clean turn into no-ops.
> 
> i.e:
> - ILOCK serialises transaction logging vs IO submission.
> - iflock serialises IO submission vs IO completion.
> - Nothing serialises transaction logging vs IO completion.
> 
> The latter is what the ili_lock is intended for; everything else is
> still protected by the existing serialisation mechanisms that they
> are now. Any races in areas outside xfs_trans_log_inode() vs
> xfs_iflush_done/abort() is largely outside the scope of this patch
> and this lock...
> 

Ok, but in this particular case we use the ili_lock around the
ili_fsync_fields reset (but not the read in the same function), and that
field is cleared when the inode is flushed. Is the lock used here for
the abort case?

I think I'll probably have to get through the rest of the series, see
how the lock is used with the logging changes in place, and then come
back and see if I can grok this aspect of it a little better..

> > Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
> > of this lock (though last_lsn looks like it's also covered by ilock),
> > yet the update to the inode_log_item struct implies they should be
> > protected. What's the intent there?
> 
> The lsn fields are updated via xfs_trans_ail_lsn_copy(), which on 32
> bit systems takes the AIL lock, and I don't think it's a good idea
> to put the AIL lock inside the inode item lock.
> 

Ok.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 4504d215cd590..fe6c2e39be85d 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -82,16 +82,20 @@  xfs_trans_ichgtime(
  */
 void
 xfs_trans_log_inode(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		flags)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			flags)
 {
-	struct inode	*inode = VFS_I(ip);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct inode		*inode = VFS_I(ip);
+	uint			iversion_flags = 0;
 
-	ASSERT(ip->i_itemp != NULL);
+	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
+	tp->t_flags |= XFS_TRANS_DIRTY;
+
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
 	 * don't matter - we either will need an extra transaction in 24 hours
@@ -104,15 +108,6 @@  xfs_trans_log_inode(
 		spin_unlock(&inode->i_lock);
 	}
 
-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -122,23 +117,30 @@  xfs_trans_log_inode(
 	 * set however, then go ahead and bump the i_version counter
 	 * unconditionally.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
-	    IS_I_VERSION(VFS_I(ip))) {
-		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
+		if (IS_I_VERSION(inode) &&
+		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
+			iversion_flags = XFS_ILOG_CORE;
 	}
 
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty. We do this before the change count so that the core being
+	 * logged in this case does not impact on fdatasync behaviour.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= flags;
 
 	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
+	 * the eventual clearing of the ili_fields bits.  See the big comment in
+	 * xfs_iflush() for an explanation of this coordination mechanism.
 	 */
-	flags |= ip->i_itemp->ili_last_fields;
-	ip->i_itemp->ili_fields |= flags;
+	iip->ili_fields |= (flags | iip->ili_last_fields |
+			    iversion_flags);
+	spin_unlock(&iip->ili_lock);
 }
 
 int
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 403c90309a8ff..0abf770b77498 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -94,6 +94,7 @@  xfs_file_fsync(
 {
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
 	int			log_flushed = 0;
@@ -137,13 +138,15 @@  xfs_file_fsync(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
 		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
+		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+			lsn = iip->ili_last_lsn;
 	}
 
 	if (lsn) {
 		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
+		spin_lock(&iip->ili_lock);
+		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4fa12775ac146..ac3c8af8c9a14 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2702,9 +2702,11 @@  xfs_ifree_cluster(
 				continue;
 
 			iip = ip->i_itemp;
+			spin_lock(&iip->ili_lock);
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
+			spin_unlock(&iip->ili_lock);
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -2740,6 +2742,7 @@  xfs_ifree(
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(VFS_I(ip)->i_nlink == 0);
@@ -2777,7 +2780,9 @@  xfs_ifree(
 	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
 
 	/* Don't attempt to replay owner changes for a deleted inode */
-	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_lock(&iip->ili_lock);
+	iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
+	spin_unlock(&iip->ili_lock);
 
 	/*
 	 * Bump the generation count so no one will be confused
@@ -3833,20 +3838,19 @@  xfs_iflush_int(
 	 * know that the information those bits represent is permanently on
 	 * disk.  As long as the flush completes before the inode is logged
 	 * again, then both ili_fields and ili_last_fields will be cleared.
-	 *
-	 * We can play with the ili_fields bits here, because the inode lock
-	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Store the current LSN of the
-	 * inode so that we can tell whether the item has moved in the AIL from
-	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
-	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
+	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
 
+	/*
+	 * Store the current LSN of the inode so that we can tell whether the
+	 * item has moved in the AIL from xfs_iflush_done().
+	 */
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b17384aa8df40..6ef9cbcfc94a7 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -637,6 +637,7 @@  xfs_inode_item_init(
 	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
 
 	iip->ili_inode = ip;
+	spin_lock_init(&iip->ili_lock);
 	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
 						&xfs_inode_item_ops);
 }
@@ -738,7 +739,11 @@  xfs_iflush_done(
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
+
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
+		spin_unlock(&iip->ili_lock);
+
 		xfs_ifunlock(iip->ili_inode);
 	}
 	list_del(&tmp);
@@ -762,9 +767,11 @@  xfs_iflush_abort(
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4de5070e07655..44c47c08b0b59 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -16,12 +16,24 @@  struct xfs_mount;
 struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
-	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
-	unsigned short		ili_lock_flags;	   /* lock flags */
+	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	/*
+	 * The ili_lock protects the interactions between the dirty state and
+	 * the flush state of the inode log item. This allows us to do atomic
+	 * modifications of multiple state fields without having to hold a
+	 * specific inode lock to serialise them.
+	 *
+	 * We need atomic changes between indoe dirtying, inode flushing and
+	 * inode completion, but these all hold different combinations of
+	 * ILOCK and iflock and hence we need some other method of serialising
+	 * updates to the flush state.
+	 */
+	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
+	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
+	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)