
[2/2] xfs: reduce ilock acquisitions in xfs_file_fsync

Message ID 20210122164643.620257-3-hch@lst.de (mailing list archive)
State Accepted
Series [1/2] xfs: refactor xfs_file_fsync

Commit Message

Christoph Hellwig Jan. 22, 2021, 4:46 p.m. UTC
If the inode is not pinned by the time fsync is called we don't need the
ilock to protect against concurrent clearing of ili_fsync_fields as the
inode won't need a log flush or clearing of these fields.  Not taking
the ilock allows for full concurrency of fsync and thus O_DSYNC
completions with io_uring/aio write submissions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Dave Chinner Jan. 22, 2021, 9:08 p.m. UTC | #1
On Fri, Jan 22, 2021 at 05:46:43PM +0100, Christoph Hellwig wrote:
> If the inode is not pinned by the time fsync is called we don't need the
> ilock to protect against concurrent clearing of ili_fsync_fields as the
> inode won't need a log flush or clearing of these fields.  Not taking
> the ilock allows for full concurrency of fsync and thus O_DSYNC
> completions with io_uring/aio write submissions.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Code looks good, so

Reviewed-by: Dave Chinner <dchinner@redhat.com>

But it makes me wonder...

That is, we already elide the call to generic_write_sync() in direct
IO in the case that the device supports FUA and it's a pure
overwrite with no dirty metadata on the inode. Hence for a lot of
storage and AIO/io_uring+DIO w/ O_DSYNC workloads we're already
eliding this fsync-based lock cycle.

In the case where we can't do a REQ_FUA IO because it is not
supported by the device, then don't we really only need a cache
flush at IO completion rather than the full generic_write_sync()
call path? That would provide this optimisation to all the
filesystems using iomap_dio_rw(), not just XFS....

In fact, I wonder if we need to do anything other than just use
REQ_FUA unconditionally in iomap for this situation, as the block
layer will translate REQ_FUA to a write+post-flush if the device
doesn't support FUA writes directly.
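
A rough sketch of that idea, assuming an iomap-style submission helper (the
helper name and its arguments are made up for illustration; only
REQ_OP_WRITE, REQ_FUA, struct bio and submit_bio() are real kernel
interfaces):

#include <linux/bio.h>

/*
 * Illustrative only -- not the actual fs/iomap/direct-io.c code.  Tag
 * every write bio of an O_DSYNC pure overwrite with REQ_FUA and skip
 * the generic_write_sync() call at completion.  On devices without
 * native FUA, the block layer flush state machine turns each such bio
 * into a write followed by a cache flush.
 */
static void dio_submit_write_sketch(struct bio *bio, bool need_dsync)
{
	bio->bi_opf = REQ_OP_WRITE;
	if (need_dsync)
		bio->bi_opf |= REQ_FUA;	/* emulated as write + post-flush if unsupported */
	submit_bio(bio);
}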

Your thoughts on that, Christoph?

Cheers,

Dave.
Christoph Hellwig Jan. 23, 2021, 6:41 a.m. UTC | #2
On Sat, Jan 23, 2021 at 08:08:01AM +1100, Dave Chinner wrote:
> That is, we already elide the call to generic_write_sync() in direct
> IO in the case that the device supports FUA and it's a pure
> overwrite with no dirty metadata on the inode. Hence for a lot of
> storage and AIO/io_uring+DIO w/ O_DSYNC workloads we're already
> eliding this fsync-based lock cycle.
> 
> In the case where we can't do a REQ_FUA IO because it is not
> supported by the device, then don't we really only need a cache
> flush at IO completion rather than the full generic_write_sync()
> call path? That would provide this optimisation to all the
> filesystems using iomap_dio_rw(), not just XFS....
> 
> In fact, I wonder if we need to do anything other than just use
> REQ_FUA unconditionally in iomap for this situation, as the block
> layer will translate REQ_FUA to a write+post-flush if the device
> doesn't support FUA writes directly.
> 
> Your thoughts on that, Christoph?

For the pure overwrite O_DIRECT + O_DSYNC case we'd get away with just
a flush.  And using REQ_FUA will get us there, so it might be worth
a try.
Brian Foster Jan. 25, 2021, 1:16 p.m. UTC | #3
On Fri, Jan 22, 2021 at 05:46:43PM +0100, Christoph Hellwig wrote:
> If the inode is not pinned by the time fsync is called we don't need the
> ilock to protect against concurrent clearing of ili_fsync_fields as the
> inode won't need a log flush or clearing of these fields.  Not taking
> the ilock allows for full concurrency of fsync and thus O_DSYNC
> completions with io_uring/aio write submissions.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_file.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 588232c77f11e0..ffe2d7c37e26cd 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -200,7 +200,14 @@ xfs_file_fsync(
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
>  		xfs_blkdev_issue_flush(mp->m_ddev_targp);
>  
> -	error = xfs_fsync_flush_log(ip, datasync, &log_flushed);
> +	/*
> +	 * Any inode that has dirty modifications in the log is pinned.  The
> +	 * racy check here for a pinned inode while not catch modifications

s/while/will/ ?

Otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	 * that happen concurrently to the fsync call, but fsync semantics
> +	 * only require to sync previously completed I/O.
> +	 */
> +	if (xfs_ipincount(ip))
> +		error = xfs_fsync_flush_log(ip, datasync, &log_flushed);
>  
>  	/*
>  	 * If we only have a single device, and the log force about was
> -- 
> 2.29.2
>
Christoph Hellwig Jan. 26, 2021, 6:56 a.m. UTC | #4
On Sat, Jan 23, 2021 at 07:41:39AM +0100, Christoph Hellwig wrote:
> > In fact, I wonder if we need to do anything other than just use
> > REQ_FUA unconditionally in iomap for this situation, as the block
> > layer will translate REQ_FUA to a write+post-flush if the device
> > doesn't support FUA writes directly.
> > 
> > Your thoughts on that, Christoph?
> 
> For the pure overwrite O_DIRECT + O_DSYNC case we'd get away with just
> a flush.  And using REQ_FUA will get us there, so it might be worth
> a try.

And looking at this a little more, while just using REQ_FUA would
work it would be rather suboptimal for many cases, as the block layer
flush state machine would do a flush for every bio.  So for each
O_DIRECT + O_DSYNC write that generates more than one bio we'd grow
extra flushes.
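
A sketch of the completion-time alternative this points at, with made-up
helper and parameter names (only struct kiocb, IOCB_DSYNC and
blkdev_issue_flush() are real kernel interfaces, and the exact
blkdev_issue_flush() signature varies by kernel version):

#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Illustrative only: submit plain (non-FUA) writes and issue a single
 * cache flush once the whole O_DIRECT + O_DSYNC request completes
 * (assuming a pure overwrite with no dirty metadata, as in the FUA
 * case).  A request that spans many bios then costs one flush instead
 * of one flush per bio.
 */
static int dio_dsync_complete_sketch(struct kiocb *iocb,
		struct block_device *bdev, bool all_bios_were_fua)
{
	if ((iocb->ki_flags & IOCB_DSYNC) && !all_bios_were_fua)
		return blkdev_issue_flush(bdev, GFP_KERNEL);
	return 0;
}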
Christoph Hellwig Jan. 28, 2021, 8 a.m. UTC | #5
On Mon, Jan 25, 2021 at 08:16:18AM -0500, Brian Foster wrote:
> > -	error = xfs_fsync_flush_log(ip, datasync, &log_flushed);
> > +	/*
> > +	 * Any inode that has dirty modifications in the log is pinned.  The
> > +	 * racy check here for a pinned inode while not catch modifications
> 
> s/while/will/ ?

Yes.  Darrick, can you fix this up when applying the patch, or do you
want me to resend?

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 588232c77f11e0..ffe2d7c37e26cd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -200,7 +200,14 @@  xfs_file_fsync(
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
-	error = xfs_fsync_flush_log(ip, datasync, &log_flushed);
+	/*
+	 * Any inode that has dirty modifications in the log is pinned.  The
+	 * racy check here for a pinned inode while not catch modifications
+	 * that happen concurrently to the fsync call, but fsync semantics
+	 * only require to sync previously completed I/O.
+	 */
+	if (xfs_ipincount(ip))
+		error = xfs_fsync_flush_log(ip, datasync, &log_flushed);
 
 	/*
 	 * If we only have a single device, and the log force about was