[RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

Message ID 20180301014144.28892-1-david@fromorbit.com (mailing list archive)
State New, archived

Commit Message

Dave Chinner March 1, 2018, 1:41 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

If we are doing direct IO writes with datasync semantics, we often
have to flush metadata changes along with the data write. However,
if we are overwriting existing data, there are no metadata changes
that we need to flush. In this case, optimising the IO by using a
FUA write makes sense.

The IOMAP_F_DIRTY flag tells us whether a specific inode
requires a metadata flush - this is currently used by DAX to ensure
extent modifications are stable in page fault operations. For direct
IO writes, we can use it to determine if we need to flush metadata
or not once the data is on disk.

Hence if we have been returned a mapped extent that is not new and
the IO mapping is not dirty, then we can use a FUA write to provide
datasync semantics. This allows us to short-cut the
generic_write_sync() call in IO completion and hence avoid
unnecessary operations. This makes pure direct IO data write
behaviour identical to the way block devices use REQ_FUA to provide
datasync semantics.

Now that iomap_dio_rw() is determining if REQ_FUA can be used, we
have to stop issuing generic_write_sync() calls from the XFS code
when REQ_FUA is issued, otherwise it will still throw a cache flush
to the device via xfs_file_fsync(). To do this, we need to make
iomap_dio_rw() always responsible for issuing generic_write_sync()
when necessary, not just for AIO calls. This means the filesystem
doesn't have to guess when cache flushes are necessary now.

On a FUA enabled device, a synchronous direct IO write workload
(sequential 4k overwrites in 32MB file) had the following results:

# xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo

kernel		time	write()s	write iops	Write b/w
------		----	--------	----------	---------
(no dsync)	 4s	2173/s		2173		8.5MB/s
vanilla		22s	 370/s		 750		1.4MB/s
patched		19s	 420/s		 420		1.6MB/s

The patched code clearly doesn't send cache flushes anymore, but
instead uses FUA (confirmed via blktrace), and performance improves
a bit as a result. However, the benefits will be higher on workloads
that mix O_DSYNC overwrites with other write IO as we won't be
flushing the entire device cache on every DSYNC overwrite IO
anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c        | 38 +++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_file.c |  5 +++++
 2 files changed, 38 insertions(+), 5 deletions(-)

Comments

Darrick J. Wong March 2, 2018, 5:05 p.m. UTC | #1
On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are doing direct IO writes with datasync semantics, we often
> have to flush metadata changes along with the data write. However,
> if we are overwriting existing data, there are no metadata changes
> that we need to flush. In this case, optimising the IO by using
> FUA write makes sense.
> 
> We know from teh IOMAP_F_DIRTY flag as to whether a specific inode
> requires a metadata flush - this is currently used by DAX to ensure
> extent modi$fication as stable in page fault operations. For direct

           ^^^^ modification

> IO writes, we can use it to determine if we need to flush metadata
> or not once the data is on disk.
> 
> Hence if we have been returned a mapped extent that is not new and
> the IO mapping is not dirty, then we can use a FUA write to provide
> datasync semantics. This allows us to short-cut the
> generic_write_sync() call in IO completion and hence avoid
> unnecessary operations. This makes pure direct IO data write
> behaviour identical to the way block devices use REQ_FUA to provide
> datasync semantics.
> 
> Now that iomap_dio_rw() is determining if REQ_FUA can be used, we
> have to stop issuing generic_write_sync() calls from the XFS code
> when REQ_FUA is issued, otherwise it will still throw a cache flush
> to the device via xfs_file_fsync(). To do this, we need to make
> iomap_dio_rw() always responsible for issuing generic_write_sync()
> when necessary, not just for AIO calls. This means the filesystem
> doesn't have to guess when cache flushes are necessary now.
> 
> On a FUA enabled device, a synchronous direct IO write workload
> (sequential 4k overwrites in 32MB file) had the following results:
> 
> # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
> 
> kernel		time	write()s	write iops	Write b/w
> ------		----	--------	----------	---------
> (no dsync)	 4s	2173/s		2173		8.5MB/s
> vanilla		22s	 370/s		 750		1.4MB/s
> patched		19s	 420/s		 420		1.6MB/s
> 
> The patched code clearly doesn't send cache flushes anymore, but
> instead uses FUA (confirmed via blktrace), and performance improves
> a bit as a result. However, the benefits will be higher on workloads
> that mix O_DSYNC overwrites with other write IO as we won't be
> flushing the entire device cache on every DSYNC overwrite IO
> anymore.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c        | 38 +++++++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_file.c |  5 +++++
>  2 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..bcc90e3a2e3f 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_WRITE_FUA	(1 << 29)
>  #define IOMAP_DIO_WRITE		(1 << 30)
>  #define IOMAP_DIO_DIRTY		(1 << 31)
>  
> @@ -760,8 +761,19 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  	}
>  
>  	inode_dio_end(file_inode(iocb->ki_filp));
> -	kfree(dio);
>  
> +	/*
> +	 * If a FUA write was done, then that is all we required for datasync
> +	 * semantics -. we don't need to call generic_write_sync() to complete

Minor nit: space-dash-period.

Looks ok to me, but does anyone else have comments?  (Looking at hch
here. ;))

--D

> +	 * the write.
> +	 */
> +	if (ret > 0 &&
> +	    (dio->flags & (IOMAP_DIO_WRITE|IOMAP_DIO_WRITE_FUA)) ==
> +							IOMAP_DIO_WRITE) {
> +		ret = generic_write_sync(iocb, ret);
> +	}
> +
> +	kfree(dio);
>  	return ret;
>  }
>  
> @@ -769,12 +781,9 @@ static void iomap_dio_complete_work(struct work_struct *work)
>  {
>  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
>  	struct kiocb *iocb = dio->iocb;
> -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
>  	ssize_t ret;
>  
>  	ret = iomap_dio_complete(dio);
> -	if (is_write && ret > 0)
> -		ret = generic_write_sync(iocb, ret);
>  	iocb->ki_complete(iocb, ret, 0);
>  }
>  
> @@ -883,6 +892,15 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  			dio->flags |= IOMAP_DIO_COW;
>  		if (iomap->flags & IOMAP_F_NEW)
>  			need_zeroout = true;
> +		/*
> +		 * Use a FUA write if we need datasync semantics and this is a
> +		 * pure data IO that doesn't require any metadata updates. This
> +		 * allows us to avoid cache flushes on IO completion.
> +		 */
> +		else if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
> +			 (dio->flags & IOMAP_DIO_WRITE) &&
> +			 (dio->iocb->ki_flags & IOCB_DSYNC))
> +			dio->flags |= IOMAP_DIO_WRITE_FUA;
>  		break;
>  	default:
>  		WARN_ON_ONCE(1);
> @@ -930,7 +948,11 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
> +			int op_flags = REQ_SYNC | REQ_IDLE;
> +
> +			if (dio->flags & IOMAP_DIO_WRITE_FUA)
> +				op_flags |= REQ_FUA;
> +			bio_set_op_attrs(bio, REQ_OP_WRITE, op_flags);
>  			task_io_account_write(n);
>  		} else {
>  			bio_set_op_attrs(bio, REQ_OP_READ, 0);
> @@ -961,6 +983,12 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	return copied;
>  }
>  
> +/*
> + * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
> + * is being issued as AIO or not. This allows us to optimise pure data writes to
> + * use REQ_FUA rather than requiring generic_write_sync() to issue a REQ_FLUSH
> + * post write.
> + */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 260ff5e5c264..81aa3b73471e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -732,6 +732,11 @@ xfs_file_write_iter(
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  		if (ret == -EREMCHG)
>  			goto buffered;
> +		/*
> +		 * Direct IO handles sync type writes internally on I/O
> +		 * completion.
> +		 */
> +		return ret;
>  	} else {
>  buffered:
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> -- 
> 2.16.1
> 
Christoph Hellwig March 2, 2018, 10:20 p.m. UTC | #2
> @@ -760,8 +761,19 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  	}
>  
>  	inode_dio_end(file_inode(iocb->ki_filp));
> -	kfree(dio);
>  
> +	/*
> +	 * If a FUA write was done, then that is all we required for datasync
> +	 * semantics -. we don't need to call generic_write_sync() to complete
> +	 * the write.
> +	 */
> +	if (ret > 0 &&
> +	    (dio->flags & (IOMAP_DIO_WRITE|IOMAP_DIO_WRITE_FUA)) ==
> +							IOMAP_DIO_WRITE) {
> +		ret = generic_write_sync(iocb, ret);
> +	}
> +
> +	kfree(dio);

Can you please split the move of the generic_write_sync call into
a separate prep patch?  It's enough of a logic change
on its own that it warrants a separate commit with a separate explanation.

Also I'd be tempted to invert the IOMAP_DIO_WRITE_FUA flag and replace
it with an IOMAP_DIO_WRITE_SYNC flag to indicate we need the
generic_write_sync call, as that should make the logic much more clear.
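
For illustration, a minimal sketch of that inversion (the flag name, the
use_fua local and the exact placement are assumptions, not part of the
posted patch):

	#define IOMAP_DIO_NEED_SYNC	(1 << 29)

	/* submission side, iomap_dio_rw(): record that O_DSYNC needs a sync */
	if (iov_iter_rw(iter) == WRITE) {
		dio->flags |= IOMAP_DIO_WRITE;
		if (iocb->ki_flags & IOCB_DSYNC)
			dio->flags |= IOMAP_DIO_NEED_SYNC;
	}

	/* only if every bio could be issued as REQ_FUA is the sync done */
	if (use_fua)
		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

	/* completion side, iomap_dio_complete(): one obvious test */
	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
		ret = generic_write_sync(iocb, ret);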

> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 260ff5e5c264..81aa3b73471e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -732,6 +732,11 @@ xfs_file_write_iter(
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  		if (ret == -EREMCHG)
>  			goto buffered;
> +		/*
> +		 * Direct IO handles sync type writes internally on I/O
> +		 * completion.
> +		 */
> +		return ret;
>  	} else {
>  buffered:
>  		ret = xfs_file_buffered_aio_write(iocb, from);

The else is not needed and you can now have a much more sensible
code flow here:

		ret = xfs_file_dio_aio_write(iocb, from);
		if (ret != -EREMCHG)
			return ret;
	}

	ret = xfs_file_buffered_aio_write(iocb, from);
Christoph Hellwig March 2, 2018, 10:26 p.m. UTC | #3
On Fri, Mar 02, 2018 at 11:20:31PM +0100, Christoph Hellwig wrote:
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -732,6 +732,11 @@ xfs_file_write_iter(
> >  		ret = xfs_file_dio_aio_write(iocb, from);
> >  		if (ret == -EREMCHG)
> >  			goto buffered;
> > +		/*
> > +		 * Direct IO handles sync type writes internally on I/O
> > +		 * completion.
> > +		 */
> > +		return ret;
> >  	} else {
> >  buffered:
> >  		ret = xfs_file_buffered_aio_write(iocb, from);
> 
> The else is not needed and you can now have a much more sensible
> code flow here:
> 
> 		ret = xfs_file_dio_aio_write(iocb, from);
> 		if (ret != -EREMCHG)
> 			return ret;
> 	}
> 
> 	ret = xfs_file_buffered_aio_write(iocb, from);

Actually a little more complicated due to the DAX case, and the fact
that both your original and this new version miss the XFS_STATS_ADD
call.  But you get the idea.  While we're at it we should probably
also skip the generic_write_sync call for DAX pure overwrites.
Dave Chinner March 2, 2018, 10:53 p.m. UTC | #4
On Fri, Mar 02, 2018 at 11:20:31PM +0100, Christoph Hellwig wrote:
> > @@ -760,8 +761,19 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  	}
> >  
> >  	inode_dio_end(file_inode(iocb->ki_filp));
> > -	kfree(dio);
> >  
> > +	/*
> > +	 * If a FUA write was done, then that is all we required for datasync
> > +	 * semantics -. we don't need to call generic_write_sync() to complete
> > +	 * the write.
> > +	 */
> > +	if (ret > 0 &&
> > +	    (dio->flags & (IOMAP_DIO_WRITE|IOMAP_DIO_WRITE_FUA)) ==
> > +							IOMAP_DIO_WRITE) {
> > +		ret = generic_write_sync(iocb, ret);
> > +	}
> > +
> > +	kfree(dio);
> 
> Can you please split the move of the generic_write_sync call into
> a separate prep patch?  It's enough of a logic change
> on its own that it warrants a separate commit with a separate explanation.

Ok, that makes sense.

> Also I'd be tempted to invert the IOMAP_DIO_WRITE_FUA flag and replace
> it with an IOMAP_DIO_WRITE_SYNC flag to indicate we need the
> generic_write_sync call, as that should make the logic much more clear.

Ah, much cleaner.

> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 260ff5e5c264..81aa3b73471e 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -732,6 +732,11 @@ xfs_file_write_iter(
> >  		ret = xfs_file_dio_aio_write(iocb, from);
> >  		if (ret == -EREMCHG)
> >  			goto buffered;
> > +		/*
> > +		 * Direct IO handles sync type writes internally on I/O
> > +		 * completion.
> > +		 */
> > +		return ret;
> >  	} else {
> >  buffered:
> >  		ret = xfs_file_buffered_aio_write(iocb, from);
> 
> The else is not needed and you can now have a much more sensible
> code flow here:
> 
> 		ret = xfs_file_dio_aio_write(iocb, from);
> 		if (ret != -EREMCHG)
> 			return ret;
> 	}
> 
> 	ret = xfs_file_buffered_aio_write(iocb, from);

I thought about that, but as you noticed the DAX case gets in the
way so I didn't bother for a RFC patch.

I'll have a look at making the DAX iomap code do something similar
with generic_write_sync() so we can simplify this high level code.

Thanks Christoph!

Cheers,

Dave.
Christoph Hellwig March 2, 2018, 10:59 p.m. UTC | #5
On Sat, Mar 03, 2018 at 09:53:19AM +1100, Dave Chinner wrote:
> I thought about that, but as you noticed the DAX case gets in the
> way so I didn't bother for a RFC patch.
> 
> I'll have a look at making the DAX iomap code do something similar
> with generic_write_sync() so we can simplify this high level code.

I suspect it might make sense to just move generic_write_sync into
each of the three write routines as a prep patch before the prep
patch, which should clean everything up nicely.
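
As a sketch, the tail of xfs_file_dio_aio_write() after such a prep patch
might look like this (the XFS_STATS_ADD accounting and the label come from
the existing xfs_file_write_iter()/xfs_file_dio_aio_write() code; the
placement is an assumption):

	out:
		xfs_iunlock(ip, iolock);
		if (ret > 0) {
			XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
			/* Handle various SYNC-type writes */
			ret = generic_write_sync(iocb, ret);
		}
		return ret;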
Christoph Hellwig March 2, 2018, 11 p.m. UTC | #6
Oh, and another thing:  I think you want to make this new code dependent
on the block device actually supporting REQ_FUA natively.  Otherwise
you'll cause a flush for every emulated FUA write, which is only going to
make things worse, especially for ATA where FLUSH is not queued.  And
last time I checked, libata still disabled FUA by default.
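
Concretely, the actor's test could grow one extra condition - a sketch
only, though blk_queue_fua() and bdev_get_queue() are existing block
layer helpers:

	else if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
		 (dio->flags & IOMAP_DIO_WRITE) &&
		 (dio->iocb->ki_flags & IOCB_DSYNC) &&
		 blk_queue_fua(bdev_get_queue(iomap->bdev)))
		dio->flags |= IOMAP_DIO_WRITE_FUA;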

On Sat, Mar 03, 2018 at 09:53:19AM +1100, Dave Chinner wrote:
> [...]
Dave Chinner March 2, 2018, 11:15 p.m. UTC | #7
On Sat, Mar 03, 2018 at 12:00:42AM +0100, Christoph Hellwig wrote:
> Oh, and another thing:  I think you want to make this new code dependent
> on the block device actually supporting REQ_FUA natively.  Otherwise
> you'll cause a flush for every emulated FUA write, which is only going to
> make things worse, especially for ATA where FLUSH is not queued.  And
> last time I checked, libata still disabled FUA by default.

Yup, but the issue we have right now is that for pure RWF_DSYNC data
overwrites we are already doing a post-flush on every IO. It's being
issued as a separate zero-length IO, which is why REQ_FUA is faster
and results in lower overall IOPS.  The flush comes from this path:

generic_write_sync
  vfs_fsync_range
    xfs_file_fsync
      ....
        /*
         * If we only have a single device, and the log force above was
         * a no-op we might have to flush the data device cache here.
         * This can only happen for fdatasync/O_DSYNC if we were overwriting
         * an already allocated file and thus do not have any metadata to
         * commit.
         */
        if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
            mp->m_logdev_targp == mp->m_ddev_targp)
                xfs_blkdev_issue_flush(mp->m_ddev_targp);

So the end result of using REQ_FUA and not calling
generic_write_sync() on devices that don't support REQ_FUA is that
we get a post-flush attached to the IO rather than separately
issuing the post-flush via generic_write_sync().

Hence I think we end up with the same behaviour on devices that
don't support REQ_FUA, just via a slightly different mechanism.

Cheers,

Dave.
Christoph Hellwig March 2, 2018, 11:21 p.m. UTC | #8
On Sat, Mar 03, 2018 at 10:15:17AM +1100, Dave Chinner wrote:
> On Sat, Mar 03, 2018 at 12:00:42AM +0100, Christoph Hellwig wrote:
> > Oh, and another thing:  I think you want to make this new code dependent
> > on the block device actually supporting REQ_FUA natively.  Otherwise
> > you'll cause a flush for every emulated FUA write, which is only going to
> > make things worse, especially for ATA where FLUSH is not queued.  And
> > last time I checked, libata still disabled FUA by default.
> 
> Yup, but the issue we have right now is that for pure RWF_DSYNC data
> overwrites we are already doing a post-flush on every IO. It's being
> issued as a separate zero-length IO, which is why REQ_FUA is faster
> and results in lower overall IOPS.  The flush comes from this path:

That is only the case if your device actually supports FUA.  If the
device does not, it is emulated by the block/blk-flush.c code by issuing a
FLUSH once the write has returned.

So, for example, a direct I/O write() call with O_DSYNC that turns into
four write calls on the wire currently gives you:

   WRITE
   WRITE
   WRITE
   WRITE
   FLUSH

with your patch and a device that supports FUA you get

   WRITE (FUA)
   WRITE (FUA)
   WRITE (FUA)
   WRITE (FUA)

but with a device that does not support FUA you get

   WRITE
   FLUSH
   WRITE
   FLUSH
   WRITE
   FLUSH
   WRITE
   FLUSH

with the additional pain point that on ATA FLUSH is not a queueable
command, so it will have to wait for the completion of every other
non-related command first, and no other command can be started.

So we should absolutely use your new approach IFF the device actually
supports FUA (aka QUEUE_FLAG_FUA is set), but it will not help much
or even be harmful if the device does not actually support the FUA bit.
Dave Chinner March 4, 2018, 11 p.m. UTC | #9
On Fri, Mar 02, 2018 at 11:26:41PM +0100, Christoph Hellwig wrote:
> While we're at it we should probably
> also skip the generic_write_sync call for DAX pure overwrites.

Now that I've looked at it, DAX is a friggin' mess.

We can't push the generic_write_sync() call into dax_iomap_rw()
because dax_iomap_rw is called with the inode_lock() held, and both
ext2's and ext4's ->fsync implementation can call
__generic_file_fsync() which takes the inode_lock(). Deadlock
central right there - someone is going to have to screw with ext4's
inode sync code before we can do this.

Further, XFS has post-write metadata updates to do in the case of
extending writes, and hence we'd still need the call to
generic_write_sync() after we have returned from dax_iomap_rw(). I'm
thinking we'd really need a dax write IO completion callback (say in
the struct iomap_ops) to run the filesystem specific IO completions
(like the DIO path) before we can start to think about optimisations
like this for DAX for XFS....
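
For the sake of argument, such a hook could mirror the end_io argument
that iomap_dio_rw() already takes - this prototype is entirely
hypothetical:

	/* hypothetical, modelled on iomap_dio_rw()'s end_io argument */
	ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
			const struct iomap_ops *ops, iomap_dio_end_io_t end_io);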

Cheers,

Dave.
Christoph Hellwig March 5, 2018, 3:11 p.m. UTC | #10
Oh well, let's skip DAX for now then.  I still think moving the
generic_write_sync call into xfs_file_dax_write/xfs_file_dio_aio_write/
xfs_file_buffered_aio_write functions would be a useful cleanup.

On Mon, Mar 05, 2018 at 10:00:49AM +1100, Dave Chinner wrote:
> [...]
Dan Williams March 12, 2018, 11:53 p.m. UTC | #11
[ adding Robert who had a question about this patch, and Jan + Ted for
the ext4 implications ]

I talked with Robert offline and he had some questions about this
patch that belong on the mailing list.

On Wed, Feb 28, 2018 at 5:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> [...]
Robert Dorr March 13, 2018, 12:15 a.m. UTC | #12
Hello all.  I have a couple of follow-up questions around this effort; thank you all for all your kind inputs, patience and knowledge transfer.

1.    How does xfs or ext4 make sure a pattern of WS followed by FWS does not allow the write (WS) completion to be visible before the flush completes?  I suspected the write was held in iomap_dio_complete_work, but with the generic_write_sync change in the patch, would an O_DSYNC write request to a DpoFua=0 block queue allow T2 to see the completion via io_getevents before T1 completed the actual flush?

2.    How will my application be able to dynamically determine if xfs and ext4 have the performance enhancement for FUA, or whether I need to engage alternate methods to use fsync/fdatasync at strategic locations?

3.    Are there any plans yet to optimize ext4 as well?

4.    Before the patched code the xfs_file_write_iter would call generic_write_sync and that calls submit_bio_wait.  Does this hold the thread issuing the io_submit so it is unable to drive more async I/O?


P.S. I tested the xfs patch today with great results
______________________________________________________________
Stamp the file with all 0xCCs before I do the write timing, so each sector is used and metadata flushed.  Throughput way up on my system.

O_DIRECT | O_DSYNC    WFS    19.19 sec   (getting FUA behavior)
O_DIRECT              WS     18.65 sec



-----Original Message-----
From: dan.j.williams@gmail.com <dan.j.williams@gmail.com> On Behalf Of Dan Williams
Sent: Monday, March 12, 2018 6:54 PM
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org; Christoph Hellwig <hch@lst.de>; linux-fsdevel <linux-fsdevel@vger.kernel.org>; Robert Dorr <rdorr@microsoft.com>; Jan Kara <jack@suse.cz>; Theodore Ts'o <tytso@mit.edu>; Matthew Wilcox <mawilcox@microsoft.com>
Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

[ adding Robert who had a question about this patch, and Jan + Ted for the ext4 implications ]

I talked with Robert offline and he had some questions about this patch that belong on the mailing list.

On Wed, Feb 28, 2018 at 5:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> [...]
Dave Chinner March 13, 2018, 5:10 a.m. UTC | #13
On Tue, Mar 13, 2018 at 12:15:28AM +0000, Robert Dorr wrote:
> Hello all.  I have a couple of follow-up questions around this
> effort; thank you all for all your kind inputs, patience and
> knowledge transfer.
> 
> 1.    How does xfs or ext4 make sure a pattern of WS followed by
> FWS does not allow the write (WS) completion to be visible before
> the flush completes?

I'm not sure what you are asking here. You need to be more precise
about what these IOs are, who dispatched them and what their
dependencies are. Where exactly did that FWS (REQ_FLUSH) come from?

I think, though, you're asking questions about IO ordering at the
wrong level - filesystems serialise and order IO, not the block
layer. Hence what you see at the block layer is not necessarily a
reflection of the ordering the filesystem is doing.

(I've already explained this earlier today in a different thread:
https://marc.info/?l=linux-xfs&m=152091489100831&w=2)

That's why I ask about the operation causing a REQ_FLUSH to be
issued to the storage device, as that cannot be directly issued from
userspace. It will occur as a side effect of a data integrity
operation the filesystem is asked to perform, but without knowing
the relationship between the integrity operation and the write in
question an answer cannot be given.

It would be better to describe your IO ordering and integrity
requirements at a higher level (e.g. the syscall layer), because
then we know what you are trying to achieve rather than trying to
understand your problem from context-less questions about "IO
barriers" that don't actually exist...

> I suspected the write was held in
> iomap_dio_complete_work but with the generic_write_sync change in
> the patch would a O_DSYNC write request to a DpoFua=0 block queue
> allow T2 to see the completion via io_getevents before T1
> completed the actual flush?

Yes, that can happen as concurrent data direct IOs are not serialised
against each other and will always race to completion without
providing any ordering guarantees.  IOWs, if you have an IO ordering
dependency in your application, then that ordering dependency needs
to be handled in the application.

> 2.    How will my application be able to dynamically determine if
> xfs and ext4 have the performance enhancement for FUA or I need
> engage alternate methods to use fsync/fdatasync at strategic
> locations?

You don't. The filesystem will provide the same integrity guarantees
in either case - FUA is just a performance optimisation that will
get used if your hardware supports it. Applications should not care
what capabilities the storage hardware has - the kernel should do
what is fastest and most reliable for the underlying storage....

> 3.    Are there any plans yet to optimize ext4 as well?

Not from me.

> 4.    Before the patched code the xfs_file_write_iter would call
> generic_write_sync and that calls submit_bio_wait.  Does this hold
> the thread issuing the io_submit so it is unable to drive more
> async I/O?

No, -EIOCBQUEUED is returned to avoid blocking.
AIO calls generic_write_sync() from the IO completion
path via a worker thread so it's all done asynchronously.


Cheers,

Dave.
Robert Dorr March 13, 2018, 4 p.m. UTC | #14
Very helpful, allow me to provide some concrete scenarios.  SQL Server has data and log files and we honor WAL to provide for proper crash recovery and durability of the database.

	*	http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx 
	*	http://download.microsoft.com/download/4/f/8/4f8f2dc9-a9a7-4b68-98cb-163482c95e0b/SQLIOBasicsCh2.doc 

---------------------------------------------------------------------------------------------
#1 Scenario - Write vs Flush and io_getevents Ordering
---------------------------------------------------------------------------------------------
If SQL Server opens data and log files with O_DIRECT | O_DSYNC

1.	T1 - SQL Server issues io_submit (IOCB_CMD_WRITE) for data file Page
		xfs_write_file_iter issues xfs_file_dio_aio_write (W1)
		As you pointed out the return is -EIOCBQUEUED (from the block device direct I/O queue insert when !sync) so T1 can continue forward and would skip the generic_write_sync in xfs_write_file_iter

2.	T2 - SQL Server is waiting on io_getevents for the I/O (W1) posted by T1 via the same async kernel I/O context
	When W1 is considered complete SQL Server can truncate/reuse our database log file.  I need to validate that io_getevents only returns W1 after the hardware has processed the write and any required flush (F1) has been issued and completed; then SQL Server can safely truncate the database transaction log.
	
	The code shows iomap_dio_complete_work, which I expected to issue the REQ_PREFLUSH for the O_DSYNC based write and prevent the completion from reaching the io_getevents caller.  The patch moved this to iomap_dio_complete, which I believe meets my needs.

I think the answer is that the iomap* logic makes sure to issue generic_write_sync for an O_DSYNC W1 after W1 is completed by hardware, and then waits for completion of the flush request (REQ_PREFLUSH) before W1 is returned to the AIO completion ring, preventing io_getevents from processing W1 before the flush occurs and completes.   I just need proper confirmation from the experts on this code that this is the expected behavior.
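
(To make the scenario concrete, a minimal sketch of that submit/reap
pattern using libaio - the file name is hypothetical, error handling is
omitted, and it builds with -laio:)

	#define _GNU_SOURCE
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		void *buf;
		int fd;

		/* O_DSYNC asks for integrity on every write (W1, W2, ...) */
		fd = open("/mnt/scratch/log.ldf", O_RDWR | O_DIRECT | O_DSYNC);
		posix_memalign(&buf, 4096, 4096);   /* O_DIRECT needs alignment */
		io_setup(128, &ctx);

		/* T1: queue W1; io_submit() returns once the iocb is queued
		 * (the -EIOCBQUEUED path), so T1 can keep driving more IO */
		io_prep_pwrite(&cb, fd, buf, 4096, 0);
		io_submit(ctx, 1, cbs);

		/* T2: W1 only appears here after the data - and, for O_DSYNC,
		 * any required flush or FUA - has completed in the kernel */
		io_getevents(ctx, 1, 1, &ev, NULL);
		return 0;
	}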

---------------------------------------------------------------------------------------------
#2 Dynamic determination of performant FUA capabilities
---------------------------------------------------------------------------------------------
For SQL Server using O_DIRECT | O_DSYNC on current kernels is very performance impacting.   Instead we enable a mode for SQL that opens O_DIRECT only and issues fsync/fdatasync when we are hardening log files or checkpointing data files.   This reduces the write, flush, write, flush pattern, allowing for write, write, write, ... then flush, as we only issue flush requests when required to maintain the data integrity boundaries of SQL Server.  The performance is significantly better than the device flush for each write, as you can imagine.

Testing shows the FUA enhancement is better than the write, flush pattern.   For SQL Server we want to dynamically open with O_DIRECT | O_DSYNC when REQ_FUA can be properly used, and open with O_DIRECT and leverage SQL Server's alternate flush scheme when running on an older kernel or a system that does not support FUA (SATA, IDE, ...)


Christoph Hellwig March 13, 2018, 4:12 p.m. UTC | #15
> I think the answer is the iomap* logic makes sure to issue generic_write_sync for O_DSYNC W1 after W1 is completed by hardware and then waits for completion of the flush request (REQ_PREFLUSH) before W1 is returned to the AIO completion ring, preventing io_getevents from processing W1 before the flush occurs and completes.   I just need proper confirmation from the experts on this code that this is the expected behavior.

Yes, that is the case.

> For SQL Server using O_DIRECT | O_DSYNC on current kernels is very performance impacting.   Instead we enable a mode for SQL that opens O_DIRECT only and issues fsync/fdatasync when we are hardening log files or checkpointing data files.   This reduces the write, flush, write, flush pattern allowing for write, write, write, ... then flush as we only issue flush requests when required to maintain the data integrity boundaries of SQL Server.  The performance is significantly better then the device flush for each write as you can imagine.
> 
> Testing shows the FUA enhancement is better then the write, flush pattern.   For SQL Server we want to dynamically open with O_DIRECT | O_DSYNC when REQ_FUA can be properly used and open with O_DIRECT and leverage SQL Server's alternate flush scheme when running on older kernel or a system that does not support FUA (SATA, IDE, ...)

There is no really good way to figure this out except for benchmarking,
given that the FUA use is an implementation detail.

Btw, another feature that might be interesting to you is the RWF_DSYNC
flag to the pwritev2 syscall, which allows applying O_DSYNC semantics on
a per-I/O basis instead of using it at open time.
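
A small sketch of that, assuming a glibc new enough to expose the
pwritev2() wrapper and the RWF_DSYNC definition (the file name reuses the
test file from the patch description):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/uio.h>

	int main(void)
	{
		struct iovec iov;
		void *buf;
		int fd = open("/mnt/scratch/boo", O_RDWR | O_DIRECT);

		posix_memalign(&buf, 4096, 4096);
		iov.iov_base = buf;
		iov.iov_len  = 4096;

		/* only this write gets O_DSYNC semantics; other writes on
		 * this fd stay plain O_DIRECT writes */
		pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
		return 0;
	}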
Robert Dorr March 13, 2018, 6:52 p.m. UTC | #16
I tried the RWF_ODSYNC and xfs seems to honor it properly, but it causes ext4 to spin out a single CPU and hang the system.
Jan Kara March 19, 2018, 4:06 p.m. UTC | #17
On Tue 13-03-18 18:52:57, Robert Dorr wrote:
> I tried the RWF_ODSYNC and xfs seems to honor it properly, but it causes
> ext4 to spin out a single CPU and hang the system.
Robert Dorr March 19, 2018, 4:14 p.m. UTC | #18
Awesome news on the spin.   

Are you going to be able to make changes to EXT4 to accommodate REQ_FUA without generic_write_sync to improve performance like was done for xfs?

Robert Dorr March 21, 2018, 11:52 p.m. UTC | #19
Jan - are you able to make the ext4 performance changes?


Jan Kara March 22, 2018, 2:35 p.m. UTC | #20
On Mon 19-03-18 16:14:16, Robert Dorr wrote:
> Awesome news on the spin.   
> 
> Are you going to be able to make changes to EXT4 to accommodate REQ_FUA
> without generic_write_sync to improve performance like was done for xfs?

I'm currently on sick leave so I'm slow on replies. Sorry. I don't see a
reason why ext4 could not have the same optimization as XFS for that case
of direct IO. However I'm not sure when I'll get to implementing that.

								Honza
Robert Dorr March 22, 2018, 2:38 p.m. UTC | #21
Sorry to hear that.  Get well soon.

Thanks for the response.
Robert Dorr April 24, 2018, 2:09 p.m. UTC | #22
Hope you are feeling better.

1. How can we effect similar changes in ext4?
2. How do we get these changes pushed into all common release builds, for both the xfs and ext4 optimizations?

We are seeing dramatic performance increases with the xfs fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1548524

4 socket, TPC-C
9,108 - Old kernel
36,618 - Patched kernel

2 socket, TPC-C
9,606 - Old kernel
18,840 - Patched kernel

--------------------------------------------------------------------------------------------------
-- Times in milliseconds
-- wt  = SQL writethrough setting
-- awt = SQL alternatewritethrough setting
-- -T3979 = SQL opens with FUA behavior
-- -T3982 = SQL opens without FUA and uses fsync (forced flush) behavior
--------------------------------------------------------------------------------------------------
--
-- Linux patched kernel for XFS FUA opts
-- SQL Server 2017 CU6
-- blktrace active
-- CentOS 7 - 4 proc, 8GB RAM, running on Hyper-V
-- Win10, 8 proc, 32GB RAM host
-- Samsung SSD drive
-- Clean SQL Server restarts
--------------------------------------------------------------------------------------------------
-- PATCHED KERNEL 4.16.x
--                                   Create DB  Inserts  Checkpoint  Options
-- --------------------------------- ---------  -------  ----------  ------------------------------------
-- Default (-T3982 = forced flush)   3787       29814    47          O_DIRECT and fsync from SQL
-- -T3979 and wt=1 awt=1             4224       25880    30          O_DIRECT and specialized fdatasync (best without FUA; ~10x slower than FUA)
--
-- -T3979 and wt=0                   2860       2570     13          O_DIRECT
-- -T3979 and wt=1 awt=0             4204       2573     17          O_DIRECT | O_DSYNC - inserts and checkpoint on allocated space near O_DIRECT-only speed  <--- DESIRED BEHAVIOR
-- -T3979 and wt=1 awt=0 NO STAMP    4127       2823     17          O_DIRECT | O_DSYNC - issues internal flushes for first writes; create faster but inserts slower
--
-- UNPATCHED KERNEL 3.x
--                                   Create DB  Inserts  Checkpoint  Options
-- --------------------------------- ---------  -------  ----------  ------------------------------------
-- Default (-T3982 = forced flush)   3097       29700    26          O_DIRECT and fsync from SQL
-- -T3979 and wt=1 awt=1             4210       25800    27          O_DIRECT and specialized fdatasync
--
-- -T3979 and wt=0                   2880       2134     6           O_DIRECT
-- -T3979 and wt=1 awt=0             4443       25290    42          O_DIRECT | O_DSYNC
-- -T3979 and wt=1 awt=0 NO STAMP    4120       25680    40          O_DIRECT | O_DSYNC
--
Nikolay Borisov April 24, 2018, 3:32 p.m. UTC | #23
On 24.04.2018 17:09, Robert Dorr wrote:
> Hope you are feeling better.
> 
> 1. How can we effect similar changes in ext4?
> 2. How do we get these changes pushed into all common release builds, for both the xfs and ext4 optimizations?
> 
> We are seeing dramatic performance increases with the xfs fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1548524
> 
> 4 socket, TPC-C
> 9,108 - Old kernel
> 36,618 - Patched kernel
> 
> 2 socket, TPC-C
> 9,606 - Old kernel
> 18,840 - Patched kernel
> 
> [benchmark table snipped]

So these numbers should be taken with a VERY BIG grain of salt. Not
because the change is not valid, but because at the very least you are
comparing kernel 3.x with the latest 4.16. It's unlikely that this report
will be the thing that tips the scales toward either "implement" or "not"
for ext4. But at the very least, show numbers for the same kernel with
the only difference being the patch.
Jan Kara April 25, 2018, 10:28 p.m. UTC | #24
On Tue 24-04-18 14:09:53, Robert Dorr wrote:
> Hope you are feeling better.

Yes :) Thanks.

> 1. How can we effect similar changes in ext4?

So ext4 currently does not use the iomap infrastructure for direct IO and
thus cannot directly use the optimization of using FUA. We'd like to switch
ext4 direct IO to use iomap instead of the blockdev_direct_IO() helper. It
should be pretty straightforward but not completely trivial. I'll look into
it but it will take a while.
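
[Editor's note: for illustration, a hedged sketch of what such an iomap-based DIO write path could look like, loosely modelled on the pattern this patch assumes for XFS. The ext4_example_* name is hypothetical, this is not the eventual ext4 conversion, and it assumes an ext4_iomap_ops that can map the write range.]

/*
 * Hypothetical sketch, not actual ext4 code: route the direct IO
 * write through iomap_dio_rw() so iomap-level optimizations such
 * as the O_DSYNC FUA fast path apply automatically.
 */
static ssize_t
ext4_example_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode	*inode = file_inode(iocb->ki_filp);
	ssize_t		ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	/*
	 * iomap_dio_rw() now also issues generic_write_sync() when
	 * necessary, so the caller no longer has to guess about
	 * cache flushes (see the iomap hunks at the end of this page).
	 */
	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, NULL);
out:
	inode_unlock(inode);
	return ret;
}
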

> 2. How do we get these changes pushed into all common release builds, for
> both the xfs and ext4 optimizations?

I'm not sure what you are asking about here. Do you mean how you get
it into distributions like SLES, RHEL, etc.? Generally that happens when
they update to a new enough kernel version. That is rather fast (weeks
after Linus releases a kernel with this improvement) for community
distros such as Fedora or openSUSE Tumbleweed, but it will take much
longer for enterprise distros (generally only the next major release
will pick up a new kernel, so a year or two).

								Honza
Theodore Ts'o Dec. 7, 2023, 6:50 a.m. UTC | #25
On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are doing direct IO writes with datasync semantics, we often
> have to flush metadata changes along with the data write. However,
> if we are overwriting existing data, there are no metadata changes
> that we need to flush. In this case, optimising the IO by using
> FUA write makes sense.
> 
> We know from the IOMAP_F_DIRTY flag whether a specific inode
> requires a metadata flush - this is currently used by DAX to ensure
> extent modification is stable in page fault operations. For direct
> IO writes, we can use it to determine if we need to flush metadata
> or not once the data is on disk.

Hi,

I've gotten an inquiry from some engineers at Microsoft who would
really like it if ext4 could use FUA writes when doing O_DSYNC writes,
since this is something that SQL Server uses.  In the discussion for
this patch series back in 2018[1], ext4 hadn't yet converted over to
iomap for Direct I/O, and so adding this feature for ext4 wasn't
really practical.

[1] https://lore.kernel.org/all/20180319160650.mavedzwienzgwgqi@quack2.suse.cz/

Today, ext4 does use iomap for DIO, but an experiment seems to
indicate that something hasn't been wired up to enable FUA for O_DSYNC
writes.  I've looked at fs/iomap/direct-io.c and it wasn't immediately
obvious what I need to add to enable this feature.

I was wondering if you could give me some quick hints about what and
where I should be looking?

Many thanks!

						- Ted
Christoph Hellwig Dec. 7, 2023, 7:32 a.m. UTC | #26
On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote:
> Today, ext4 does use iomap for DIO, but an experiment seems to
> indicate that something hasn't been wired up to enable FUA for O_DSYNC
> writes.  I've looked at fs/iomap/direct-io.c and it wasn't immediately
> obvious what I need to add to enable this feature.
> 
> I was wondering if you could give me some quick hints about what and
> where I should be looking?

There isn't really anything strange going on.  First, your device needs to
support FUA: for NVMe the feature is mandatory, but we still disable it
globally for SATA.  Second, you need to remember FUA only makes sense
for O_DSYNC-style writes, not O_SYNC, which also writes out the inode.

Then you need to make sure neither IOMAP_F_SHARED nor IOMAP_F_DIRTY
is set on the iomap. IOMAP_F_SHARED is never set by ext4, so if none
of the above is the culprit, I'd look into whether IOMAP_F_DIRTY ends up
being set for your test I/O.
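
[Editor's note: condensing that checklist, the eligibility test amounts to something like the sketch below; the helper name and parameters are ours and the private dio bookkeeping is elided, so treat this as a summary of the patch at the end of this page rather than the exact upstream code.]

/*
 * Hedged restatement of the FUA eligibility logic: a pure data
 * O_DSYNC overwrite may use REQ_FUA only when no metadata remains
 * to be flushed. Helper name and parameters are illustrative.
 */
static bool example_can_use_fua(const struct iomap *iomap, bool is_write,
				bool is_dsync)
{
	/* Shared or dirty inode state still needs a metadata flush. */
	if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY))
		return false;
	/* Newly allocated blocks are a metadata change as well. */
	if (iomap->flags & IOMAP_F_NEW)
		return false;
	/* FUA only replaces the flush for O_DSYNC-style data writes. */
	return is_write && is_dsync;
}
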
Dave Chinner Dec. 7, 2023, 11:03 p.m. UTC | #27
On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we are doing direct IO writes with datasync semantics, we often
> > have to flush metadata changes along with the data write. However,
> > if we are overwriting existing data, there are no metadata changes
> > that we need to flush. In this case, optimising the IO by using
> > FUA write makes sense.
> > 
> > We know from the IOMAP_F_DIRTY flag whether a specific inode
> > requires a metadata flush - this is currently used by DAX to ensure
> > extent modification is stable in page fault operations. For direct
> > IO writes, we can use it to determine if we need to flush metadata
> > or not once the data is on disk.
> 
> Hi,
> 
> I've gotten an inquiry from some engineers at Microsoft who would
> really like it if ext4 could use FUA writes when doing O_DSYNC writes,
> since this is something that SQL Server uses.  In the discussion for
> this patch series back in 2018[1], ext4 hadn't yet converted over to
> iomap for Direct I/O, and so adding this feature for ext4 wasn't
> really practical.
> 
> [1] https://lore.kernel.org/all/20180319160650.mavedzwienzgwgqi@quack2.suse.cz/
> 
> Today, ext4 does use iomap for DIO, but an experiment seems to
> indicate that something hasn't been wired up to enable FUA for O_DSYNC
> writes.  I've looked at fs/iomap/direct-io.c and it wasn't immediately
> obvious what I need to add to enable this feature.

I've actually looked briefly into this in the past for ext4 - IIRC
the problem is that mtime updates during the write are causing the
inode to get marked dirty.

Yeah, typical trace from a RWF_DSYNC direct IO through ext4 is:

xfs_io-6002  [007]  2986.814841: ext4_es_lookup_extent_enter: dev 7,0 ino 12 lblk 0
xfs_io-6002  [007]  2986.814848: ext4_es_lookup_extent_exit: dev 7,0 ino 12 found 1 [0/30720) 34304 W
xfs_io-6002  [007]  2986.814856: ext4_journal_start_inode: dev 7,0 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x39
xfs_io-6002  [007]  2986.814875: ext4_mark_inode_dirty: dev 7,0 ino 12 caller ext4_dirty_inode+0x5c
xfs_io-6002  [007]  2986.814917: iomap_dio_rw_begin:   dev 7:0 ino 0xc size 0x40000000 offset 0x0 length 0x20000 done_before 0x0 flags DIRECT dio_flags  aio 0
xfs_io-6002  [007]  2986.814922: iomap_iter:           dev 7:0 ino 0xc pos 0x0 length 0x0 flags WRITE|DIRECT (0x11) ops ext4_iomap_overwrite_ops caller __iomap_dio_rw+0x1c2

So I think the problem may be that file_modified() from
ext4_dio_write_checks() triggers mtime updates, which then come
back through mark_inode_dirty, and that ends up causing
the inode to be marked dirty and tracked by the journal.

Yup, there it is in generic_update_time():

	if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC;

it's setting the inode I_DIRTY_SYNC for mtime updates.

XFS does not do this. It implements the ->update_time() method, so it
doesn't use generic_update_time(). XFS tracks pure timestamp updates
separately from other dirty inode metadata. We then elide "timestamp
dirty" state from the IOMAP_F_DIRTY checks because O_DSYNC doesn't
require timestamps to be persisted.
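
[Editor's note: to make the eliding concrete, here is a hedged sketch, not XFS's actual implementation, of an ->update_time method that records timestamp-only updates as I_DIRTY_TIME rather than I_DIRTY_SYNC, so they don't surface through IOMAP_F_DIRTY. The name is hypothetical, and the prototype follows the older (inode, now, flags) form, which has since changed across kernel versions.]

/*
 * Hypothetical sketch of timestamp eliding: apply the update but
 * mark the inode I_DIRTY_TIME only, so pure atime/mtime/ctime
 * changes are not reported as dirty metadata that an O_DSYNC
 * write would have to flush.
 */
static int example_update_time(struct inode *inode,
			       struct timespec64 *now, int flags)
{
	if (flags & S_ATIME)
		inode->i_atime = *now;
	if (flags & S_MTIME)
		inode->i_mtime = *now;
	if (flags & S_CTIME)
		inode->i_ctime = *now;

	/* Timestamp-only dirtiness: no I_DIRTY_SYNC, no forced flush. */
	__mark_inode_dirty(inode, I_DIRTY_TIME);
	return 0;
}
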

FWIW, the patch below allows testing FUA optimisations on loop
devices....

-Dave.
diff mbox

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index afd163586aa0..bcc90e3a2e3f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -685,6 +685,7 @@  EXPORT_SYMBOL_GPL(iomap_seek_data);
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_WRITE_FUA	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
 #define IOMAP_DIO_DIRTY		(1 << 31)
 
@@ -760,8 +761,19 @@  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	}
 
 	inode_dio_end(file_inode(iocb->ki_filp));
-	kfree(dio);
 
+	/*
+	 * If a FUA write was done, then that is all we required for datasync
+	 * semantics - we don't need to call generic_write_sync() to complete
+	 * the write.
+	 */
+	if (ret > 0 &&
+	    (dio->flags & (IOMAP_DIO_WRITE|IOMAP_DIO_WRITE_FUA)) ==
+							IOMAP_DIO_WRITE) {
+		ret = generic_write_sync(iocb, ret);
+	}
+
+	kfree(dio);
 	return ret;
 }
 
@@ -769,12 +781,9 @@  static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
 	struct kiocb *iocb = dio->iocb;
-	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
 	ssize_t ret;
 
 	ret = iomap_dio_complete(dio);
-	if (is_write && ret > 0)
-		ret = generic_write_sync(iocb, ret);
 	iocb->ki_complete(iocb, ret, 0);
 }
 
@@ -883,6 +892,15 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 			dio->flags |= IOMAP_DIO_COW;
 		if (iomap->flags & IOMAP_F_NEW)
 			need_zeroout = true;
+		/*
+		 * Use a FUA write if we need datasync semantics and this is a
+		 * pure data IO that doesn't require any metadata updates. This
+		 * allows us to avoid cache flushes on IO completion.
+		 */
+		else if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
+			 (dio->flags & IOMAP_DIO_WRITE) &&
+			 (dio->iocb->ki_flags & IOCB_DSYNC))
+			dio->flags |= IOMAP_DIO_WRITE_FUA;
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -930,7 +948,11 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			int op_flags = REQ_SYNC | REQ_IDLE;
+
+			if (dio->flags & IOMAP_DIO_WRITE_FUA)
+				op_flags |= REQ_FUA;
+			bio_set_op_attrs(bio, REQ_OP_WRITE, op_flags);
 			task_io_account_write(n);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
@@ -961,6 +983,12 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	return copied;
 }
 
+/*
+ * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
+ * is being issued as AIO or not. This allows us to optimise pure data writes to
+ * use REQ_FUA rather than requiring generic_write_sync() to issue a REQ_FLUSH
+ * post write.
+ */
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 260ff5e5c264..81aa3b73471e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -732,6 +732,11 @@  xfs_file_write_iter(
 		ret = xfs_file_dio_aio_write(iocb, from);
 		if (ret == -EREMCHG)
 			goto buffered;
+		/*
+		 * Direct IO handles sync type writes internally on I/O
+		 * completion.
+		 */
+		return ret;
 	} else {
 buffered:
 		ret = xfs_file_buffered_aio_write(iocb, from);