xfs: don't allow NOWAIT DIO across extent boundaries

Message ID 20201119042315.535693-1-david@fromorbit.com (mailing list archive)
State Accepted
Series xfs: don't allow NOWAIT DIO across extent boundaries

Commit Message

Dave Chinner Nov. 19, 2020, 4:23 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Jens has reported a situation where partial direct IOs can be issued
and completed yet still return -EAGAIN. We don't want to report a
short IO here, as we want XFS to complete a user DIO entirely or not
at all.
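
The pattern userspace relies on looks something like this minimal
sketch (assuming a kernel and glibc that provide pwritev2() and
RWF_NOWAIT; the helper name is made up):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Minimal sketch of the contract: an RWF_NOWAIT write either
 * completes in full or fails with EAGAIN having written nothing,
 * so the caller can safely redo the entire range on a blocking
 * path.
 */
static ssize_t write_nowait_or_fallback(int fd, void *buf, size_t len,
					off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

	if (ret >= 0 || errno != EAGAIN)
		return ret;

	/* EAGAIN: nothing was written, safe to retry the whole IO. */
	return pwritev2(fd, &iov, 1, off, 0);
}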

This partial IO situation can occur on a write IO that is split
across an allocated extent and a hole, and the second mapping is
returning EAGAIN because allocation would be required.

The trivial reproducer:

$ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
pwrite: Resource temporarily unavailable
$

The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
the first 4kB write:

 xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
 iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
 xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
 xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
 xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
 iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY

Here the first iomap loop has mapped the first 4kB of the file and
issued the IO, and we enter the second iomap_apply loop:

 iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
 xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
 xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin

And we exit with -EAGAIN because we hit the allocate case trying to
map the second 4kB block.

Then IO completes on the first 4kB and the original IO context
completes and unlocks the inode, returning -EAGAIN to userspace:

 xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
 xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write

There are other vectors to the same problem whenever we have to
re-enter the mapping code to make multiple mappings under NOWAIT
conditions: failing trylocks, COW extents being found, allocation
being required, and so on.

Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
to go ahead if the mapping we retrieve spans the entire requested IO
range as a single allocated extent. This removes the possibility of
a subsequent mapping triggering NOWAIT semantics partway through, as
NOWAIT IO will now enter the mapping code only once per IO.

Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
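
For context, a heavily simplified, hypothetical model of the
per-mapping direct IO loop (none of these names are the real kernel
interfaces) shows how a later mapping pass can fail after IO for an
earlier mapping has already been issued:

/*
 * Hypothetical, heavily simplified model of the per-mapping direct
 * IO loop; every name here is illustrative, none of them are real
 * kernel interfaces.
 */
struct map { long long offset, length; };

int map_one_extent(long long pos, long long count, struct map *m);
void issue_io(const struct map *m);

int dio_write_model(long long pos, long long count)
{
	while (count > 0) {
		struct map m;
		int error = map_one_extent(pos, count, &m);

		/*
		 * Bug vector: a second pass can fail with -EAGAIN here
		 * even though issue_io() already ran for the first
		 * mapping, so userspace sees -EAGAIN for a write that
		 * partially completed.
		 */
		if (error)
			return error;

		issue_io(&m);
		pos += m.length;
		count -= m.length;
	}
	return 0;
}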

Comments

Darrick J. Wong Nov. 19, 2020, 5:55 p.m. UTC | #1
On Thu, Nov 19, 2020 at 03:23:15PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Jens has reported a situation where partial direct IOs can be issued
> and completed yet still return -EAGAIN. We don't want to report a
> short IO here, as we want XFS to complete a user DIO entirely or not
> at all.
> 
> This partial IO situation can occur on a write IO that is split
> across an allocated extent and a hole, and the second mapping is
> returning EAGAIN because allocation would be required.
> 
> The trivial reproducer:
> 
> $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
> pwrite: Resource temporarily unavailable
> $
> 
> The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
> the first 4kB write:
> 
>  xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
>  iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
>  xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
>  xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
>  xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
>  iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
> 
> Here the first iomap loop has mapped the first 4kB of the file and
> issued the IO, and we enter the second iomap_apply loop:
> 
>  iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
>  xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
>  xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
> 
> And we exit with -EAGAIN because we hit the allocate case trying to
> map the second 4kB block.
> 
> Then IO completes on the first 4kB and the original IO context
> completes and unlocks the inode, returning -EAGAIN to userspace:
> 
>  xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
>  xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
> 
> There are other vectors to the same problem whenever we have to
> re-enter the mapping code to make multiple mappings under NOWAIT
> conditions: failing trylocks, COW extents being found, allocation
> being required, and so on.
> 
> Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
> to go ahead if the mapping we retrieve spans the entire requested IO
> range as a single allocated extent. This removes the possibility of
> a subsequent mapping triggering NOWAIT semantics partway through, as
> NOWAIT IO will now enter the mapping code only once per IO.
> 
> Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Hmm so I guess the whole point of this is speculative writing?

As in, thread X either wants us to write impatiently, or to fail the
whole IO so that it can hand the slow-mode write to some other thread or
something?  The manpage says something about short reads but doesn't
say much about writes; silently writing some but not all seems
broken. :)

If so then I guess this is fine, and to the (limited) extent that I grok
the whole usecase,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 3abb8b9d6f4c..7b9ff824e82d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -706,6 +706,23 @@ xfs_ilock_for_iomap(
>  	return 0;
>  }
>  
> +/*
> + * Check that the imap we are going to return to the caller spans the entire
> + * range that the caller requested for the IO.
> + */
> +static bool
> +imap_spans_range(
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_fileoff_t		offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	if (imap->br_startoff > offset_fsb)
> +		return false;
> +	if (imap->br_startoff + imap->br_blockcount < end_fsb)
> +		return false;
> +	return true;
> +}
> +
>  static int
>  xfs_direct_write_iomap_begin(
>  	struct inode		*inode,
> @@ -766,6 +783,18 @@ xfs_direct_write_iomap_begin(
>  	if (imap_needs_alloc(inode, flags, &imap, nimaps))
>  		goto allocate_blocks;
>  
> +	/*
> +	 * NOWAIT IO needs to span the entire requested IO with a single map so
> +	 * that we avoid partial IO failures due to the rest of the IO range not
> +	 * covered by this map triggering an EAGAIN condition when it is
> +	 * subsequently mapped and aborting the IO.
> +	 */
> +	if ((flags & IOMAP_NOWAIT) &&
> +	    !imap_spans_range(&imap, offset_fsb, end_fsb)) {
> +		error = -EAGAIN;
> +		goto out_unlock;
> +	}
> +
>  	xfs_iunlock(ip, lockmode);
>  	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, iomap_flags);
> -- 
> 2.28.0
>
Dave Chinner Nov. 19, 2020, 8:50 p.m. UTC | #2
On Thu, Nov 19, 2020 at 09:55:26AM -0800, Darrick J. Wong wrote:
> On Thu, Nov 19, 2020 at 03:23:15PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Jens has reported a situation where partial direct IOs can be issued
> > and completed yet still return -EAGAIN. We don't want to report a
> > short IO here, as we want XFS to complete a user DIO entirely or not
> > at all.
> > 
> > This partial IO situation can occur on a write IO that is split
> > across an allocated extent and a hole, and the second mapping is
> > returning EAGAIN because allocation would be required.
> > 
> > The trivial reproducer:
> > 
> > $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
> > wrote 4096/4096 bytes at offset 0
> > 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
> > pwrite: Resource temporarily unavailable
> > $
> > 
> > The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
> > the first 4kB write:
> > 
> >  xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
> >  iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
> >  xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
> >  xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
> >  xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
> >  iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
> > 
> > Here the first iomap loop has mapped the first 4kB of the file and
> > issued the IO, and we enter the second iomap_apply loop:
> > 
> >  iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
> >  xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
> >  xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
> > 
> > And we exit with -EAGAIN because we hit the allocate case trying to
> > map the second 4kB block.
> > 
> > Then IO completes on the first 4kB and the original IO context
> > completes and unlocks the inode, returning -EAGAIN to userspace:
> > 
> >  xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
> >  xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
> > 
> > There are other vectors to the same problem whenever we have to
> > re-enter the mapping code to make multiple mappings under NOWAIT
> > conditions: failing trylocks, COW extents being found, allocation
> > being required, and so on.
> > 
> > Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
> > to go ahead if the mapping we retrieve spans the entire requested IO
> > range as a single allocated extent. This removes the possibility of
> > a subsequent mapping triggering NOWAIT semantics partway through, as
> > NOWAIT IO will now enter the mapping code only once per IO.
> > 
> > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Hmm so I guess the whole point of this is speculative writing?

Non-blocking async writes, not speculative writes.

> As in, thread X either wants us to write impatiently, or to fail the
> whole IO so that it can hand the slow-mode write to some other thread or
> something?

Yes, that's exactly the sort of thing RWF_NOWAIT was originally
intended to support (e.g. the SeaStar IO framework).
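
The usage pattern is roughly this kind of fast path/slow path split;
a hypothetical sketch with made-up names (a real framework would use
a proper work queue rather than a thread per IO):

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sys/uio.h>

struct write_req { int fd; struct iovec iov; off_t off; };

static void *blocking_writer(void *arg)
{
	struct write_req *req = arg;

	/* Slow path: allowed to block on locks, allocation, etc. */
	pwritev2(req->fd, &req->iov, 1, req->off, 0);
	return NULL;
}

static int submit_write(struct write_req *req)
{
	ssize_t ret = pwritev2(req->fd, &req->iov, 1, req->off,
			       RWF_NOWAIT);
	pthread_t tid;

	if (ret >= 0)
		return 0;	/* completed without blocking */
	if (errno != EAGAIN)
		return -1;	/* hard error */

	/*
	 * Punt the entire IO to a thread that may block. Only safe
	 * because EAGAIN guarantees no bytes were written.
	 */
	return pthread_create(&tid, NULL, blocking_writer, req) ? -1 : 0;
}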

> The manpage says something about short reads but doesn't say much
> about writes; silently writing some but not all seems broken. :)

Well, that depends on how you look at it.

In the end, the current behaviour has a transient data state that is
finalised when the write gets re-run by the application. The only
issue here is efficiency, and we don't really even care if there's a
crash between these two IOs that would result in the silent write
being exposed.

I say this because the write is always going to be done as two
separate IOs so there's always a failure window where the non-atomic
writes can result in only one of them hitting stable storage.

Hence this change doesn't actually affect anything from the data
integrity perspective - it just closes the window down slightly and
prevents unnecessary repeated IO and/or blocking a task we shouldn't
be blocking.

> If so then I guess this is fine, and to the (limited) extent that I grok
> the whole usecase,
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Thanks!

-Dave.

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3abb8b9d6f4c..7b9ff824e82d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -706,6 +706,23 @@ xfs_ilock_for_iomap(
 	return 0;
 }
 
+/*
+ * Check that the imap we are going to return to the caller spans the entire
+ * range that the caller requested for the IO.
+ */
+static bool
+imap_spans_range(
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	if (imap->br_startoff > offset_fsb)
+		return false;
+	if (imap->br_startoff + imap->br_blockcount < end_fsb)
+		return false;
+	return true;
+}
+
 static int
 xfs_direct_write_iomap_begin(
 	struct inode		*inode,
@@ -766,6 +783,18 @@ xfs_direct_write_iomap_begin(
 	if (imap_needs_alloc(inode, flags, &imap, nimaps))
 		goto allocate_blocks;
 
+	/*
+	 * NOWAIT IO needs to span the entire requested IO with a single map so
+	 * that we avoid partial IO failures due to the rest of the IO range not
+	 * covered by this map triggering an EAGAIN condition when it is
+	 * subsequently mapped and aborting the IO.
+	 */
+	if ((flags & IOMAP_NOWAIT) &&
+	    !imap_spans_range(&imap, offset_fsb, end_fsb)) {
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, iomap_flags);