diff mbox

xfs: rewrite the fdatasync optimization

Message ID 20180205073933.17065-1-hch@lst.de (mailing list archive)
State Superseded
Headers show

Commit Message

Christoph Hellwig Feb. 5, 2018, 7:39 a.m. UTC
Currently we need to the ilock over the log force in xfs_fsync so that we
can protect ili_fsync_fields against incorrect manipulation.

But if instead we add new XFS_ILOG_VERSION pseudo log area similar to the
timestamp one we can use that to just record the last dirty / fdatasync
dirty lsn as long as the inode is pinned, and clear it when unpinning to
avoid holding the ilock over I/O.

This will drastically reduce latency on multithreaded workloads that
mix writes with fsync calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_log_format.h | 11 +++++++++--
 fs/xfs/xfs_file.c              | 20 ++++++--------------
 fs/xfs/xfs_inode.c             |  2 --
 fs/xfs/xfs_inode_item.c        | 13 ++++++++++---
 fs/xfs/xfs_inode_item.h        |  2 +-
 fs/xfs/xfs_iomap.c             |  3 +--
 fs/xfs/xfs_trans_inode.c       | 11 +----------
 7 files changed, 28 insertions(+), 34 deletions(-)

Comments

Dave Chinner Feb. 5, 2018, 10:17 p.m. UTC | #1
On Mon, Feb 05, 2018 at 08:39:33AM +0100, Christoph Hellwig wrote:
> Currently we need to the ilock over the log force in xfs_fsync so that we
> can protect ili_fsync_fields against incorrect manipulation.
> 
> But if instead we add new XFS_ILOG_VERSION pseudo log area similar to the
> timestamp one we can use that to just record the last dirty / fdatasync
> dirty lsn as long as the inode is pinned, and clear it when unpinning to
> avoid holding the ilock over I/O.

I thought that NFS requires the iversion changes to be made stable
at the same time as the data changes they correspond to is made
stable? i.e. NFS requires us to stabilise the on disk iversion field
during fdatasync.

As it is, I think this change XFS_ILOG_VERSION change is unnecessary
because Jeff Layton's changes to avoid iversion changes when the
value is not being sampled has been merged. Hence iversion won't be
changed on every write anymore unless NFS is in the picture and is
sampling iversion. 

IOWs, I think this may be needed for older kernels, but its not
clear that we need it for upstream kernels.

> This will drastically reduce latency on multithreaded workloads that
> mix writes with fsync calls.

The change to use ili_datasync_lsn is what reduces the latency
because the ilock is not held over the log force anymore. That's
useful and stands alone from the iversion modification so I
think, at minimum, this needs to be separated into two patches....

Cheers,

Dave.
Christoph Hellwig Feb. 6, 2018, 7:23 a.m. UTC | #2
On Tue, Feb 06, 2018 at 09:17:26AM +1100, Dave Chinner wrote:
> > Currently we need to the ilock over the log force in xfs_fsync so that we
> > can protect ili_fsync_fields against incorrect manipulation.
> > 
> > But if instead we add new XFS_ILOG_VERSION pseudo log area similar to the
> > timestamp one we can use that to just record the last dirty / fdatasync
> > dirty lsn as long as the inode is pinned, and clear it when unpinning to
> > avoid holding the ilock over I/O.
> 
> I thought that NFS requires the iversion changes to be made stable
> at the same time as the data changes they correspond to is made
> stable? i.e. NFS requires us to stabilise the on disk iversion field
> during fdatasync.

Yes, and this patch doesn't change that behavior.  The reason why the
new XFS_ILOG_VERSION flag is needed is to be able to check if
an fdatasync needs to flush out an inode log item.  This is done the
same way we did the XFS_ILOG_TIMESTAMP optimization, i.e. mark the
version updates as XFS_ILOG_VERSION, which gets propagatated to
XFS_ILOG_CORE when flushing (and we still commit every transaction
with a dirty inode), but we then use the ili_datasync_lsn field
that is only set if something that is not the timestamp or version
is logged to check in fdatasync if we need to force the log or not.

> 
> As it is, I think this change XFS_ILOG_VERSION change is unnecessary
> because Jeff Layton's changes to avoid iversion changes when the
> value is not being sampled has been merged. Hence iversion won't be
> changed on every write anymore unless NFS is in the picture and is
> sampling iversion. 

It still is needed.  We need to change i_version everytime we update
the timestamps (IFF it is queried in 4.15+), and we need to log every
transaction that updates timestamps and i_version.  But we do not need
to force out the log in fdatasync if only the timestamps and i_version
have changed.

So the only change in 4.15 is that for non-NFS workloads the change
might not matter that much, but the fundamental issue is the same.

> The change to use ili_datasync_lsn is what reduces the latency
> because the ilock is not held over the log force anymore. That's
> useful and stands alone from the iversion modification so I
> think, at minimum, this needs to be separated into two patches....

I can split it out.  XFS_ILOG_VERSION is the bit that makes our
old XFS_ILOG_TIMESTAMP optimization actually work on v5 file systems.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 8, 2018, 11:17 p.m. UTC | #3
On Tue, Feb 06, 2018 at 08:23:56AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2018 at 09:17:26AM +1100, Dave Chinner wrote:
> > > Currently we need to the ilock over the log force in xfs_fsync so that we
> > > can protect ili_fsync_fields against incorrect manipulation.
> > > 
> > > But if instead we add new XFS_ILOG_VERSION pseudo log area similar to the
> > > timestamp one we can use that to just record the last dirty / fdatasync
> > > dirty lsn as long as the inode is pinned, and clear it when unpinning to
> > > avoid holding the ilock over I/O.
> > 
> > I thought that NFS requires the iversion changes to be made stable
> > at the same time as the data changes they correspond to is made
> > stable? i.e. NFS requires us to stabilise the on disk iversion field
> > during fdatasync.
> 
> Yes,

So you're agreeing with me that if iversion is changed, then
fdatasync needs to force the log.

> and this patch doesn't change that behavior.

I think it does.

> The reason why the
> new XFS_ILOG_VERSION flag is needed is to be able to check if
> an fdatasync needs to flush out an inode log item.

If we don't change iversion because it's not been queried, then in
the current code then XFS_ILOG_CORE will not be set and so the inode
remains only timestamp dirty and does not get flushed by fdatasync.
If we change iversion, then XFS_ILOG_CORE gets set, and the next
fdatasync will correctly flush the inode log item.

That's exactly the behaviour we need to meet the above NFS
requirements - the iversion change is associated with the new data
being written, not the timestamp change we are making as a result of
the data change.

If we look at your code now, when we change iversion we'll set
XFS_ILOG_VERSION and that will now behave identically to the
timestamp code w.r.t. fdatasync. i.e. we will *never* flush pure
iversion/timestamp changes on fdatasync.

That's not the behaviour NFS requires - we'll stabilise data but we
will not stabilise the iversion change that goes along with that
data change. Hence if we crash after fdatasync, there'll be new data
on disk and an old iversion....

So, AFAICT, this patch does change behviour and it breaks the
iversion semantics that the NFS server requires for data
modification.

Cheers,

Dave.
Christoph Hellwig Feb. 13, 2018, 3:04 p.m. UTC | #4
On Fri, Feb 09, 2018 at 10:17:25AM +1100, Dave Chinner wrote:
> > > I thought that NFS requires the iversion changes to be made stable
> > > at the same time as the data changes they correspond to is made
> > > stable? i.e. NFS requires us to stabilise the on disk iversion field
> > > during fdatasync.
> > 
> > Yes,
> 
> So you're agreeing with me that if iversion is changed, then
> fdatasync needs to force the log.

No, not at all.  fdatasync does simply not matter for NFS changetimes.
NFS comes in through commit_inode_metadata, for which the fdatasync
optimization is irrelevant.  But even discounting NFS - anyone caring
about timestamps persisting must use fsync and not fdatasync, in fact
that is the whole point for fdatasyncs existence!

> > new XFS_ILOG_VERSION flag is needed is to be able to check if
> > an fdatasync needs to flush out an inode log item.
> 
> If we don't change iversion because it's not been queried, then in
> the current code then XFS_ILOG_CORE will not be set and so the inode
> remains only timestamp dirty and does not get flushed by fdatasync.
> If we change iversion, then XFS_ILOG_CORE gets set, and the next
> fdatasync will correctly flush the inode log item.

Yes, if we talk about ili_fields.  Once we are deep down in the logging
core XFS_ILOG_CORE of course gets set, see xfs_inode_item_format.

> If we look at your code now, when we change iversion we'll set
> XFS_ILOG_VERSION and that will now behave identically to the
> timestamp code w.r.t. fdatasync. i.e. we will *never* flush pure
> iversion/timestamp changes on fdatasync.

And that is the correct behavior for fdatasync.

> That's not the behaviour NFS requires - we'll stabilise data but we
> will not stabilise the iversion change that goes along with that
> data change. Hence if we crash after fdatasync, there'll be new data
> on disk and an old iversion....
> 
> So, AFAICT, this patch does change behviour and it breaks the
> iversion semantics that the NFS server requires for data
> modification.

We still log exactly the same transaction to the log with this change.
The only difference is that a fdatasync will only force out the latest
transaction that has some non-timestamp, non-version change instead of
just the latests non-timestamp change.  For anyone using fsync-like
semantics like NFS nothing changes at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 13, 2018, 11:26 p.m. UTC | #5
On Tue, Feb 13, 2018 at 04:04:02PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 09, 2018 at 10:17:25AM +1100, Dave Chinner wrote:
> > > > I thought that NFS requires the iversion changes to be made stable
> > > > at the same time as the data changes they correspond to is made
> > > > stable? i.e. NFS requires us to stabilise the on disk iversion field
> > > > during fdatasync.
> > > 
> > > Yes,
> > 
> > So you're agreeing with me that if iversion is changed, then
> > fdatasync needs to force the log.
> 
> No, not at all.  fdatasync does simply not matter for NFS changetimes.
> NFS comes in through commit_inode_metadata, for which the fdatasync
> optimization is irrelevant.

Yes, for NFS server based IO.i

No, for local user IO to the local fs that is also exported via NFS.

> But even discounting NFS - anyone caring
> about timestamps persisting must use fsync and not fdatasync, in fact
> that is the whole point for fdatasyncs existence!

Yes, but timestamps are not the issue here - we know the behaviour
is correct.  The issue here is the iversion change counter
semantics be w.r.t. fdatasync, and you haven't established a case
fdatasync being able to avoid it.

> > > new XFS_ILOG_VERSION flag is needed is to be able to check if
> > > an fdatasync needs to flush out an inode log item.
> > 
> > If we don't change iversion because it's not been queried, then in
> > the current code then XFS_ILOG_CORE will not be set and so the inode
> > remains only timestamp dirty and does not get flushed by fdatasync.
> > If we change iversion, then XFS_ILOG_CORE gets set, and the next
> > fdatasync will correctly flush the inode log item.
> 
> Yes, if we talk about ili_fields.  Once we are deep down in the logging
> core XFS_ILOG_CORE of course gets set, see xfs_inode_item_format.
> 
> > If we look at your code now, when we change iversion we'll set
> > XFS_ILOG_VERSION and that will now behave identically to the
> > timestamp code w.r.t. fdatasync. i.e. we will *never* flush pure
> > iversion/timestamp changes on fdatasync.
> 
> And that is the correct behavior for fdatasync.

For *timestamps*, yes.

But iversion is a *data change notifier*, and hence it's directly
related to the data being stabilised during fdatasync.

e.g. In the case of a local write to an NFS exported filesystem, the
iversion needs to be made stable at the same time as the data via
fdatasync. That way if the server crashes after fdatasync, when it
comes back up all the NFS clients see that the data in that file
has changed via the recovered iversion change from the fdatasync.
If the iversion change is not written to the log during the
fdatasync, then none of the NFS clients will detect that the data
has actually changed after the server has recovered from the crash.

AIUI, this violates the semantics NFS requires from the iversion
field, and that means fdatasync needs to flush iversion changes.

Cheers,

Dave.
Christoph Hellwig Feb. 14, 2018, 3:53 p.m. UTC | #6
Ok, for the case where the same file also accessed throught NFS is
locally written and fdatasync we'll run into an issue.

But we already have the exact same issue on NFSv3 which always uses
ctime and never looks at iversion.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 14, 2018, 10:43 p.m. UTC | #7
On Wed, Feb 14, 2018 at 04:53:43PM +0100, Christoph Hellwig wrote:
> Ok, for the case where the same file also accessed throught NFS is
> locally written and fdatasync we'll run into an issue.
> 
> But we already have the exact same issue on NFSv3 which always uses
> ctime and never looks at iversion.

The NFSv3 server won't be querying the iversion field, so on
upstream kernels it will never change or need to be logged on local
fs accesses. Hence I don't think we need to care about NFSv3 at all
here....

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 349d9f8edb89..0cb4875b29ec 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -327,6 +327,13 @@  struct xfs_inode_log_format_32 {
  */
 #define XFS_ILOG_TIMESTAMP	0x4000
 
+/*
+ * Similar for this one:  it means we increased the inode version, which
+ * when combined with just XFS_ILOG_TIMESTAMP does not require blocking
+ * in fdatasync.
+ */
+#define XFS_ILOG_VERSION	0x8000
+
 #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
 				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
 				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
@@ -343,8 +350,8 @@  struct xfs_inode_log_format_32 {
 				 XFS_ILOG_DEXT | XFS_ILOG_DBROOT | \
 				 XFS_ILOG_DEV | XFS_ILOG_ADATA | \
 				 XFS_ILOG_AEXT | XFS_ILOG_ABROOT | \
-				 XFS_ILOG_TIMESTAMP | XFS_ILOG_DOWNER | \
-				 XFS_ILOG_AOWNER)
+				 XFS_ILOG_DOWNER | XFS_ILOG_AOWNER | \
+				 XFS_ILOG_TIMESTAMP | XFS_ILOG_VERSION)
 
 static inline int xfs_ilog_fbroot(int w)
 {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..eeb72f2b10c4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -165,27 +165,19 @@  xfs_file_fsync(
 	 * All metadata updates are logged, which means that we just have to
 	 * flush the log up to the latest LSN that touched the inode. If we have
 	 * concurrent fsync/fdatasync() calls, we need them to all block on the
-	 * log force before we clear the ili_fsync_fields field. This ensures
-	 * that we don't get a racing sync operation that does not wait for the
-	 * metadata to hit the journal before returning. If we race with
-	 * clearing the ili_fsync_fields, then all that will happen is the log
-	 * force will do nothing as the lsn will already be on disk. We can't
-	 * race with setting ili_fsync_fields because that is done under
-	 * XFS_ILOCK_EXCL, and that can't happen because we hold the lock shared
-	 * until after the ili_fsync_fields is cleared.
+	 * log force before returning.
 	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
-		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+		if (datasync)
+			lsn = ip->i_itemp->ili_datasync_lsn;
+		else
 			lsn = ip->i_itemp->ili_last_lsn;
 	}
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
-	if (lsn) {
+	if (lsn)
 		error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
-	}
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	/*
 	 * If we only have a single device, and the log force about was
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 604ee384a00a..a57772553a2a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2385,7 +2385,6 @@  xfs_ifree_cluster(
 
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
-			iip->ili_fsync_fields = 0;
 			iip->ili_logged = 1;
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
@@ -3649,7 +3648,6 @@  xfs_iflush_int(
 	 */
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
-	iip->ili_fsync_fields = 0;
 	iip->ili_logged = 1;
 
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index d5037f060d6f..3e33a4f421ed 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -442,7 +442,8 @@  xfs_inode_item_format(
 	}
 
 	/* update the format with the exact fields we actually logged */
-	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
+	ilf->ilf_fields |=
+		(iip->ili_fields & ~(XFS_ILOG_TIMESTAMP | XFS_ILOG_VERSION));
 }
 
 /*
@@ -630,6 +631,9 @@  xfs_inode_item_committed(
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
 
+	iip->ili_last_lsn = 0;
+	iip->ili_datasync_lsn = 0;
+
 	if (xfs_iflags_test(ip, XFS_ISTALE)) {
 		xfs_inode_item_unpin(lip, 0);
 		return -1;
@@ -646,7 +650,11 @@  xfs_inode_item_committing(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		lsn)
 {
-	INODE_ITEM(lip)->ili_last_lsn = lsn;
+	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+
+	iip->ili_last_lsn = lsn;
+	if (iip->ili_fields & ~(XFS_ILOG_TIMESTAMP | XFS_ILOG_VERSION))
+		iip->ili_datasync_lsn = lsn;
 }
 
 /*
@@ -826,7 +834,6 @@  xfs_iflush_abort(
 		 * attempted.
 		 */
 		iip->ili_fields = 0;
-		iip->ili_fsync_fields = 0;
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index b72373a33cd9..9377ff41322f 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -30,11 +30,11 @@  typedef struct xfs_inode_log_item {
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
+	xfs_lsn_t		ili_datasync_lsn;
 	unsigned short		ili_lock_flags;	   /* lock flags */
 	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
-	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 } xfs_inode_log_item_t;
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 66e1edbfb2b2..221d218f08a9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1090,8 +1090,7 @@  xfs_file_iomap_begin(
 		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
 	}
 
-	if (xfs_ipincount(ip) && (ip->i_itemp->ili_fsync_fields
-				& ~XFS_ILOG_TIMESTAMP))
+	if (xfs_ipincount(ip) && ip->i_itemp->ili_datasync_lsn)
 		iomap->flags |= IOMAP_F_DIRTY;
 
 	xfs_bmbt_to_iomap(ip, iomap, &imap);
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 4a89da4b6fe7..0f9fbdea86ba 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -101,15 +101,6 @@  xfs_trans_log_inode(
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -122,7 +113,7 @@  xfs_trans_log_inode(
 	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
 	    IS_I_VERSION(VFS_I(ip))) {
 		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+			flags |= XFS_ILOG_VERSION;
 	}
 
 	tp->t_flags |= XFS_TRANS_DIRTY;