[6/6] xfs: reduce exclusive locking on unaligned dio

Message ID	20210112010746.1154363-7-david@fromorbit.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, avi@scylladb.com, andres@anarazel.de Subject: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio Date: Tue, 12 Jan 2021 12:07:46 +1100 Message-Id: <20210112010746.1154363-7-david@fromorbit.com> In-Reply-To: <20210112010746.1154363-1-david@fromorbit.com> References: <20210112010746.1154363-1-david@fromorbit.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[1/6] iomap: convert iomap_dio_rw() to an args structure \| expand [1/6] iomap: convert iomap_dio_rw() to an args structure [2/6] iomap: move DIO NOWAIT setup up into filesystems [3/6] xfs: factor out a xfs_ilock_iocb helper [4/6] xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware [5/6] xfs: split unaligned DIO write code out [6/6] xfs: reduce exclusive locking on unaligned dio

Message ID

20210112010746.1154363-7-david@fromorbit.com (mailing list archive)

State

Superseded

Headers

From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, avi@scylladb.com, andres@anarazel.de
Subject: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
Date: Tue, 12 Jan 2021 12:07:46 +1100
Message-Id: <20210112010746.1154363-7-david@fromorbit.com>
In-Reply-To: <20210112010746.1154363-1-david@fromorbit.com>
References: <20210112010746.1154363-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

[1/6] iomap: convert iomap_dio_rw() to an args structure | expand

Commit Message

Dave Chinner Jan. 12, 2021, 1:07 a.m. UTC

From: Dave Chinner <dchinner@redhat.com>

Attempt shared locking for unaligned DIO, but only if the the
underlying extent is already allocated and in written state. On
failure, retry with the existing exclusive locking.

Test case is fio randrw of 512 byte IOs using AIO and an iodepth of
32 IOs.

Vanilla:

  READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec
  WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec


Patched:
   READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec

That's an improvement from ~18k IOPS to a ~150k IOPS, which is
about the IOPS limit of the VM block device setup I'm testing on.

4kB block IO comparison:

   READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec
  WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec

Which is ~150k IOPS, same as what the test gets for sub-block
AIO+DIO writes with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 94 +++++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_iomap.c | 32 +++++++++++-----
 2 files changed, 86 insertions(+), 40 deletions(-)

Comments

Christoph Hellwig Jan. 12, 2021, 10:42 a.m. UTC | #1

> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bba33be17eff..f5c75404b8a5 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +

Spurious unrelated whitespace change.

>  	struct iomap_dio_rw_args args = {
>  		.iocb			= iocb,
>  		.iter			= from,
>  		.ops			= &xfs_direct_write_iomap_ops,
>  		.dops			= &xfs_dio_write_ops,
>  		.wait_for_completion	= is_sync_kiocb(iocb),
> -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> +		.nonblocking		= true,

I think this is in many ways wrong.  As far as I can tell you want this
so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
we should not trigger any of the other checks, so we'd really need
another flag instead of reusing this one.

imap_spans_range is a bit pessimistic for avoiding the exclusive lock,
but I guess we could live that if it is clearly documented as helping
with the implementation, but we really should not automatically trigger
all the other effects of nowait I/O.

Brian Foster Jan. 12, 2021, 5:01 p.m. UTC | #2

On Tue, Jan 12, 2021 at 11:42:57AM +0100, Christoph Hellwig wrote:
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index bba33be17eff..f5c75404b8a5 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
> >  			drained_dio = true;
> >  			goto restart;
> >  		}
> > -	
> > +
> 
> Spurious unrelated whitespace change.
> 
> >  	struct iomap_dio_rw_args args = {
> >  		.iocb			= iocb,
> >  		.iter			= from,
> >  		.ops			= &xfs_direct_write_iomap_ops,
> >  		.dops			= &xfs_dio_write_ops,
> >  		.wait_for_completion	= is_sync_kiocb(iocb),
> > -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> > +		.nonblocking		= true,
> 
> I think this is in many ways wrong.  As far as I can tell you want this
> so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> we should not trigger any of the other checks, so we'd really need
> another flag instead of reusing this one.
> 

It's really the br_state != XFS_EXT_NORM check that we want for the
unaligned case, isn't it?

> imap_spans_range is a bit pessimistic for avoiding the exclusive lock,
> but I guess we could live that if it is clearly documented as helping
> with the implementation, but we really should not automatically trigger
> all the other effects of nowait I/O.
> 

Regardless, I agree on this point. I don't have a strong opinion in
general on this approach vs. the other, but it does seem odd to me to
overload the broader nowait semantics with the unaligned I/O checks. I
see that it works for the primary case we care about, but this also
means things like the _has_page() check now trigger exclusivity for the
unaligned case where that doesn't seem to be necessary. I do like the
previous cleanups so I suspect if we worked this into a new
'subblock_io' flag that indicates to the lower layer whether the
filesystem can allow zeroing, that might clean much of this up.

Brian

Christoph Hellwig Jan. 12, 2021, 5:10 p.m. UTC | #3

On Tue, Jan 12, 2021 at 12:01:33PM -0500, Brian Foster wrote:
> > I think this is in many ways wrong.  As far as I can tell you want this
> > so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> > we should not trigger any of the other checks, so we'd really need
> > another flag instead of reusing this one.
> > 
> 
> It's really the br_state != XFS_EXT_NORM check that we want for the
> unaligned case, isn't it?

Inherently, yes.  But if we want to avoid the extra irec lookup outside
->iomap_begin we have to limit us to a single I/O, as we'll do a partial
write otherwise if only the extent that the end of write falls into is
unwritten and not block aligned.

Dave Chinner Jan. 12, 2021, 10:06 p.m. UTC | #4

On Tue, Jan 12, 2021 at 12:01:33PM -0500, Brian Foster wrote:
> On Tue, Jan 12, 2021 at 11:42:57AM +0100, Christoph Hellwig wrote:
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index bba33be17eff..f5c75404b8a5 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
> > >  			drained_dio = true;
> > >  			goto restart;
> > >  		}
> > > -	
> > > +
> > 
> > Spurious unrelated whitespace change.
> > 
> > >  	struct iomap_dio_rw_args args = {
> > >  		.iocb			= iocb,
> > >  		.iter			= from,
> > >  		.ops			= &xfs_direct_write_iomap_ops,
> > >  		.dops			= &xfs_dio_write_ops,
> > >  		.wait_for_completion	= is_sync_kiocb(iocb),
> > > -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> > > +		.nonblocking		= true,
> > 
> > I think this is in many ways wrong.  As far as I can tell you want this
> > so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> > we should not trigger any of the other checks, so we'd really need
> > another flag instead of reusing this one.
> > 
> 
> It's really the br_state != XFS_EXT_NORM check that we want for the
> unaligned case, isn't it?

We can only submit unaligned DIO with a shared IOLOCK to a written
range, which means we need to abort the IO if we hit a COW range
(imap_needs_cow()), a hole (imap_needs_alloc()), the range spans
multiple extents (imap_spans_range()) and, finally, unwritten
extents (the new check I added).

IOMAP_NOWAIT aborts on all these cases and returns EAGAIN.

> > imap_spans_range is a bit pessimistic for avoiding the exclusive lock,

No, it's absolutely required.

If the sub-block aligned dio spans multiple extents, we don't know
what locking is required for that next extent until iomap_apply()
loops and calls us again for that range. WHile the first range might
be written and OK to issue, the next extent range could
require allocation, COW or unwritten extent conversion and so would
require exclusive IO locking.  And so we end up with partial IO
submission, which causes all sorts of problems...

IOWs, if the unaligned dio cannot be mapped to a single written
extent, we can't do it under shared locking conditions - it must be
done under exclusive locking to maintain the "no partial submission"
rules we have for DIO.

> > but I guess we could live that if it is clearly documented as helping
> > with the implementation, but we really should not automatically trigger
> > all the other effects of nowait I/O.
> 
> Regardless, I agree on this point.

The only thing that IOMAP_NOWAIT does that might be questionable is
the xfs_ilock_nowait() call on the ILOCK. We want it to abort shared
IO if we don't have the extents read in - Christoph's patch made
this trigger exclusive IO, too and so of all the things that
IOMAP_NOWAIT triggers, the -only thing- we can raise a question
about is the trylock.

And, quite frankly, if something is modifying the inode metadata
while we are trying to sub-block DIO, I want the sub-block DIO to
fall back to exclusive locking just to be safe. It may not be
necessary, but right now I'd prefer to err on the side of caution
and be conservative about when this optimisation triggers. If we get
it wrong, we corrupt data....

> I don't have a strong opinion in general on this approach vs. the
> other, but it does seem odd to me to overload the broader nowait
> semantics with the unaligned I/O checks. I see that it works for
> the primary case we care about, but this also means things like
> the _has_page() check now trigger exclusivity for the unaligned
> case where that doesn't seem to be necessary.

Actually, it's another case of being safe rather than sorry. In the
sub-block DIO is racing with mmap or write() dirtying the page that
spans the DIO range, we end up issuing concurrent IOs to the same
LBA range, something that results in undefined behaviour and is
something we must absolutely not do.

That is:

	DIO	(1024, 512)
		submit_bio (1024, 512)
		.....
	mmap
		(0, 4096)
		touch byte 0
		page dirty

	DIO	(2048, 512)
		filemap_write_and_wait_range(2048, 512)
		submit_bio(0, 4096)
		.....

and now we have overlapping concurrent IO in flight even though
usrespace has not done any overlapping modifications at all.
Overlapping IO should never be issued by the filesystem as the
result is undefined. Yes, the application should not be mixing
mmap+DIO, but we the filesystem in this case is doing something even
worse and something we tell userspace developers that *they should
never do*. We can trivially avoid this corruption case by falling
back to exclusive locking for subblock dio if writeback and/or page
cache invalidation may be required.

IOWs, IOMAP_NOWAIT gives us exactly the behaviour we need here for
serialising concurrent sub-block dio against page cache based IO...

> I do like the
> previous cleanups so I suspect if we worked this into a new
> 'subblock_io' flag that indicates to the lower layer whether the
> filesystem can allow zeroing, that might clean much of this up.

Allow zeroing where, exactly? e.g. some filesystems do zeroing in
their allocation routines during mapping. IOWs, this strikes me as
encoding specific filesystem implementation requirements into the
generic API as opposed to using generic functionality to implement
specific FS behavioural requirements.

Cheers,

Dave.

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bba33be17eff..f5c75404b8a5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -408,7 +408,7 @@  xfs_file_aio_write_checks(
 			drained_dio = true;
 			goto restart;
 		}
-	
+
 		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
 		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
 				NULL, &xfs_buffered_write_iomap_ops);
@@ -510,9 +510,9 @@  static const struct iomap_dio_ops xfs_dio_write_ops = {
 /*
  * Handle block aligned direct IO writes
  *
- * Lock the inode appropriately to prepare for and issue a direct IO write.
- * By separating it from the buffered write path we remove all the tricky to
- * follow locking changes and looping.
+ * Lock the inode appropriately to prepare for and issue a direct IO write.  By
+ * separating it from the buffered write path we remove all the tricky to follow
+ * locking changes and looping.
  *
  * If there are cached pages or we're extending the file, we need IOLOCK_EXCL
  * until we're sure the bytes at the new EOF have been zeroed and/or the cached
@@ -578,18 +578,31 @@  xfs_file_dio_write_aligned(
  * allowing them to be done in parallel with reads and other direct IO writes.
  * However, if the IO is not aligned to filesystem blocks, the direct IO layer
  * may need to do sub-block zeroing and that requires serialisation against other
- * direct IOs to the same block. In this case we need to serialise the
- * submission of the unaligned IOs so that we don't get racing block zeroing in
- * the dio layer.
+ * direct IOs to the same block. In the case where sub-block zeroing is not
+ * required, we can do concurrent sub-block dios to the same block successfully.
+ *
+ * Hence we have two cases here - the shared, optimisitic fast path for written
+ * extents, and everything else that needs exclusive IO path access across the
+ * entire IO.
+ *
+ * For the first case, we do all the checks we need at the mapping layer in the
+ * DIO code as part of the existing NOWAIT infrastructure. Hence all we need to
+ * do to support concurrent subblock dio is first try a non-blocking submission.
+ * If that returns -EAGAIN, then we simply repeat the IO submission with full
+ * IO exclusivity guaranteed so that we avoid racing sub-block zeroing.
+ *
+ * The only wrinkle in this case is that the iomap DIO code always does
+ * partial tail sub-block zeroing for post-EOF writes. Hence for any IO that
+ * _ends_ past the current EOF we need to run with full exclusivity. Note that
+ * we also check for the start of IO being beyond EOF because then zeroing
+ * between the old EOF and the start of the IO is required and that also
+ * requires exclusivity. Hence we avoid lock cycles and blocking under
+ * IOCB_NOWAIT for this situation, too.
  *
- * To provide the same serialisation for AIO, we also need to wait for
+ * To provide the exclusivity required when using AIO, we also need to wait for
  * outstanding IOs to complete so that unwritten extent conversion is completed
  * before we try to map the overlapping block. This is currently implemented by
  * hitting it with a big hammer (i.e. inode_dio_wait()).
- *
- * This means that unaligned dio writes alwys block. There is no "nowait" fast
- * path in this code - if IOCB_NOWAIT is set we simply return -EAGAIN up front
- * and we don't have to worry about that anymore.
  */
 static ssize_t
 xfs_file_dio_write_unaligned(
@@ -597,23 +610,35 @@  xfs_file_dio_write_unaligned(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
-	int			iolock = XFS_IOLOCK_EXCL;
+	int			iolock = XFS_IOLOCK_SHARED;
 	size_t			count;
 	ssize_t			ret;
+	size_t			isize = i_size_read(VFS_I(ip));
 	struct iomap_dio_rw_args args = {
 		.iocb			= iocb,
 		.iter			= from,
 		.ops			= &xfs_direct_write_iomap_ops,
 		.dops			= &xfs_dio_write_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
-		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
+		.nonblocking		= true,
 	};
 
 	/*
-	 * This must be the only IO in-flight. Wait on it before we
-	 * release the iolock to prevent subsequent overlapping IO.
+	 * Extending writes need exclusivity because of the sub-block zeroing
+	 * that the DIO code always does for partial tail blocks beyond EOF.
 	 */
-	args.wait_for_completion = true;
+	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
+retry_exclusive:
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		iolock = XFS_IOLOCK_EXCL;
+		args.nonblocking = false;
+		args.wait_for_completion = true;
+	}
+
+	ret = xfs_ilock_iocb(iocb, iolock);
+	if (ret)
+		return ret;
 
 	/*
 	 * We can't properly handle unaligned direct I/O to reflink
@@ -621,30 +646,37 @@  xfs_file_dio_write_unaligned(
 	 */
 	if (xfs_is_cow_inode(ip)) {
 		trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
-		return -ENOTBLK;
+		ret = -ENOTBLK;
+		goto out_unlock;
 	}
 
-	/* unaligned dio always waits, bail */
-	if (iocb->ki_flags & IOCB_NOWAIT)
-		return -EAGAIN;
-	xfs_ilock(ip, iolock);
-
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
-		goto out;
+		goto out_unlock;
 	count = iov_iter_count(from);
 
 	/*
-	 * If we are doing unaligned IO, we can't allow any other overlapping IO
-	 * in-flight at the same time or we risk data corruption. Wait for all
-	 * other IO to drain before we submit. If the IO is aligned, demote the
-	 * iolock if we had to take the exclusive lock in
-	 * xfs_file_aio_write_checks() for other reasons.
+	 * If we are doing exclusive unaligned IO, we can't allow any other
+	 * overlapping IO in-flight at the same time or we risk data corruption.
+	 * Wait for all other IO to drain before we submit.
 	 */
-	inode_dio_wait(VFS_I(ip));
+	if (!args.nonblocking)
+		inode_dio_wait(VFS_I(ip));
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(&args);
-out:
+
+	/*
+	 * Retry unaligned IO with exclusive blocking semantics if the DIO
+	 * layer rejected it for mapping or locking reasons. If we are doing
+	 * nonblocking user IO, propagate the error.
+	 */
+	if (ret == -EAGAIN) {
+		ASSERT(args.nonblocking == true);
+		xfs_iunlock(ip, iolock);
+		goto retry_exclusive;
+	}
+
+out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7b9ff824e82d..e5659200e5e8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,16 +783,30 @@  xfs_direct_write_iomap_begin(
 	if (imap_needs_alloc(inode, flags, &imap, nimaps))
 		goto allocate_blocks;
 
-	/*
-	 * NOWAIT IO needs to span the entire requested IO with a single map so
-	 * that we avoid partial IO failures due to the rest of the IO range not
-	 * covered by this map triggering an EAGAIN condition when it is
-	 * subsequently mapped and aborting the IO.
-	 */
-	if ((flags & IOMAP_NOWAIT) &&
-	    !imap_spans_range(&imap, offset_fsb, end_fsb)) {
+	/* Handle special NOWAIT conditions for existing allocated extents. */
+	if (flags & IOMAP_NOWAIT) {
 		error = -EAGAIN;
-		goto out_unlock;
+		/*
+		 * NOWAIT IO needs to span the entire requested IO with a single
+		 * map so that we avoid partial IO failures due to the rest of
+		 * the IO range not covered by this map triggering an EAGAIN
+		 * condition when it is subsequently mapped and aborting the IO.
+		 */
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
+			goto out_unlock;
+
+		/*
+		 * If the IO is unaligned and the caller holds a shared IOLOCK,
+		 * NOWAIT will be set because we can only do the IO if it spans
+		 * a written extent. Otherwise we have to do sub-block zeroing,
+		 * and that can only be done under an exclusive IOLOCK. Hence if
+		 * this is not a written extent, return EAGAIN to tell the
+		 * caller to try again.
+		 */
+		if (imap.br_state != XFS_EXT_NORM &&
+		    ((offset & mp->m_blockmask) ||
+		     ((offset + length) & mp->m_blockmask)))
+			goto out_unlock;
 	}
 
 	xfs_iunlock(ip, lockmode);