
[RFC] iomap: zeroing needs to be pagecache aware

Message ID 20221201005214.3836105-1-david@fromorbit.com (mailing list archive)
State New, archived
Series [RFC] iomap: zeroing needs to be pagecache aware

Commit Message

Dave Chinner Dec. 1, 2022, 12:52 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Unwritten extents can have page cache data over the range being
zeroed, so we can't just skip them entirely. Fix this by checking for
an existing dirty folio over the unwritten range we are zeroing
and only performing zeroing if the folio is already dirty.

XXX: how do we detect an iomap containing a COW mapping over a hole
in iomap_zero_iter()? The XFS code implies this case also needs to
zero the page cache if there is data present, so the page cache
lookup-only trigger in iomap_zero_iter() needs to handle this case
as well.

Before:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m14.103s
user    0m0.015s
sys     0m0.020s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.90    0.847616          16     50000           ftruncate
 14.01    0.138229           2     50000           pwrite64
....

After:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m0.144s
user    0m0.021s
sys     0m0.012s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.86    0.505964          10     50000           ftruncate
 46.12    0.433251           8     50000           pwrite64
....

Yup, we get back all the performance.
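The pwrite-trunc reproducer itself is not included in this patch, so the
following is only a hypothetical userspace sketch consistent with the
strace output above (one pwrite64 plus one ftruncate per iteration; the
buffer size, offsets, and the pwrite_trunc helper name are assumptions):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical reconstruction of the pwrite-trunc loop: each iteration
 * does one buffered write followed by one truncate, the pattern that
 * hammers the EOF zeroing path measured above.
 */
static int pwrite_trunc(const char *path, long iters)
{
	char buf[1024];
	long i;
	int fd;

	memset(buf, 'a', sizeof(buf));
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0)
		return -1;

	for (i = 0; i < iters; i++) {
		/* Extend the file with a small buffered write... */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
			break;
		/* ...then truncate it away again, forcing the EOF
		 * zeroing path in xfs_setattr_size() every time. */
		if (ftruncate(fd, 0) < 0)
			break;
	}
	close(fd);
	return i == iters ? 0 : -1;
}
```

The real tool may differ in details; the only property that matters for
the measurement is the 1:1 pwrite/ftruncate ratio visible in the syscall
counts.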

As for the "mmap write beyond EOF" data exposure aspect
documented here:

https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/

With this command:

$ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
  -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
  -c fsync -c "pread -v 3k 32" /mnt/scratch/foo

Before:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec)

After:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)

We see that this post-EOF unwritten extent dirty page zeroing is
working correctly.

This has passed most of fstests on a couple of test VMs without
issues so far, so I think this approach to fixing the issue will be
solid once we've worked out how to detect the COW-hole mapping case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 50 +++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_iops.c      | 12 +---------
 2 files changed, 43 insertions(+), 19 deletions(-)

Comments

Darrick J. Wong Dec. 1, 2022, 2:08 a.m. UTC | #1
On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.

Hm, I'll look at this tomorrow morning when I'm less bleary.  From a
cursory glance it looks ok though.

> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.

I've been wondering for a while if we ought to rename iomap_iter.iomap
to write_iomap and iomap_iter.srcmap to read_iomap, and change all the
->iomap_begin and ->iomap_end functions as needed.  I think that would
make it clearer to iomap users which one they're supposed to use.
Right now we overload iomap_iter.iomap for reads, and for writes when
srcmap is a hole (or SHARED isn't set on the iomap), and it's getting
confusing to keep track of all that.

I guess the hard part of all that is that writes to the pagecache don't
touch storage; and writeback doesn't care about the source mapping since
it's only using block granularity.

--D

> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 50 +++++++++++++++++++++++++++++++++++-------
>  fs/xfs/xfs_iops.c      | 12 +---------
>  2 files changed, 43 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..0969e4525507 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -585,14 +585,14 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
>  }
>  
>  static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> -		size_t len, struct folio **foliop)
> +		size_t len, struct folio **foliop, unsigned fgp)
>  {
>  	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  	struct folio *folio;
> -	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
>  	int status = 0;
>  
> +	fgp |= FGP_LOCK | FGP_WRITE | FGP_STABLE | FGP_NOFS;
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
>  
> @@ -615,7 +615,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
>  	if (!folio) {
> -		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> +		if (!(fgp & FGP_CREAT))
> +			status = -ENODATA;
> +		else if (iter->flags & IOMAP_NOWAIT)
> +			status = -EAGAIN;
> +		else
> +			status = -ENOMEM;
>  		goto out_no_page;
>  	}
>  
> @@ -791,7 +796,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			break;
>  		}
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
>  		if (unlikely(status))
>  			break;
>  		if (iter->iomap.flags & IOMAP_F_STALE)
> @@ -1101,7 +1106,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>  		struct folio *folio;
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
>  		if (unlikely(status))
>  			return status;
>  		if (iter->iomap.flags & IOMAP_F_STALE)
> @@ -1147,10 +1152,14 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
> +	unsigned fgp = FGP_CREAT;
>  
>  	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if (srcmap->type == IOMAP_HOLE)
>  		return length;
> +	/* only do page cache lookups over unwritten extents */
> +	if (srcmap->type == IOMAP_UNWRITTEN)
> +		fgp = 0;
>  
>  	do {
>  		struct folio *folio;
> @@ -1158,9 +1167,20 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t offset;
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status)
> +		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
> +		if (status) {
> +			if (status == -ENODATA) {
> +				/*
> +				 * No folio was found, so skip to the start of
> +				 * the next potential entry in the page cache
> +				 * and continue from there.
> +				 */
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}
>  			return status;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -1168,6 +1188,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
>  
> +		/*
> +		 * If the folio over an unwritten extent is clean, then we
> +		 * aren't going to touch the data in it at all. We don't want to
> +		 * mark it dirty or change the uptodate state of data in the
> +		 * page, so we just unlock it and skip to the next range over
> +		 * the unwritten extent we need to check.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    !folio_test_dirty(folio)) {
> +			folio_unlock(folio);
> +			goto loop_continue;
> +		}
> +
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
>  
> @@ -1175,6 +1208,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> +loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 2e10e1c66ad6..d7c2f8c49f94 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -839,17 +839,7 @@ xfs_setattr_size(
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
> -	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
> +	} else if (newsize != oldsize) {
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.38.1
>
Dave Chinner Dec. 1, 2022, 2:43 a.m. UTC | #2
On Wed, Nov 30, 2022 at 06:08:04PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Unwritten extents can have page cache data over the range being
> > zeroed so we can't just skip them entirely. Fix this by checking for
> > an existing dirty folio over the unwritten range we are zeroing
> > and only performing zeroing if the folio is already dirty.
> 
> Hm, I'll look at this tomorrow morning when I'm less bleary.  From a
> cursory glance it looks ok though.
> 
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> I've been wondering for a while if we ought to rename iomap_iter.iomap
> to write_iomap and iomap_iter.srcmap to read_iomap, and change all the
> ->iomap_begin and ->iomap_end functions as needed.  I think that would
> make it more clear to iomap users which one they're supposed to use.
> Right now we overload iomap_iter.iomap for reads and for writes if
> srcmap is a hole (or SHARED isn't set on iomap) and it's getting
> confusing to keep track of all that.

*nod*

We definitely need to clarify this - I find the overloading
confusing at the best of times.  No idea what the solution to this
looks like, though...

> I guess the hard part of all that is that writes to the pagecache don't
> touch storage; and writeback doesn't care about the source mapping since
> it's only using block granularity.

Yup, that's why this code needs the IOMAP_F_STALE code to be in
place before we can use the page cache lookups like this.

Cheers,

Dave.
Darrick J. Wong Dec. 1, 2022, 3:59 a.m. UTC | #3
On Thu, Dec 01, 2022 at 01:43:29PM +1100, Dave Chinner wrote:
> On Wed, Nov 30, 2022 at 06:08:04PM -0800, Darrick J. Wong wrote:
> > On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Unwritten extents can have page cache data over the range being
> > > zeroed so we can't just skip them entirely. Fix this by checking for
> > > an existing dirty folio over the unwritten range we are zeroing
> > > and only performing zeroing if the folio is already dirty.
> > 
> > Hm, I'll look at this tomorrow morning when I'm less bleary.  From a
> > cursory glance it looks ok though.
> > 
> > > XXX: how do we detect a iomap containing a cow mapping over a hole
> > > in iomap_zero_iter()? The XFS code implies this case also needs to
> > > zero the page cache if there is data present, so trigger for page
> > > cache lookup only in iomap_zero_iter() needs to handle this case as
> > > well.
> > 
> > I've been wondering for a while if we ought to rename iomap_iter.iomap
> > to write_iomap and iomap_iter.srcmap to read_iomap, and change all the
> > ->iomap_begin and ->iomap_end functions as needed.  I think that would
> > make it more clear to iomap users which one they're supposed to use.
> > Right now we overload iomap_iter.iomap for reads and for writes if
> > srcmap is a hole (or SHARED isn't set on iomap) and it's getting
> > confusing to keep track of all that.
> 
> *nod*
> 
> We definitely need to clarify this - I find the overloading
> confusing at the best of times.  No idea what the solution to this
> looks like, though...

<nod> It probably means a largeish cleanup for 6.3.  Maybe start with
the patches I'd sent earlier that shortened the ->iomap_begin and
->iomap_end parameter list, and move on to redefining the two?

https://lore.kernel.org/linux-xfs/166801778962.3992140.13451716594530581376.stgit@magnolia/
https://lore.kernel.org/linux-xfs/166801779522.3992140.4135946031734299717.stgit@magnolia/


> > I guess the hard part of all that is that writes to the pagecache don't
> > touch storage; and writeback doesn't care about the source mapping since
> > it's only using block granularity.
> 
> Yup, that's why this code needs the IOMAP_F_STALE code to be in
> place before we can use the page cache lookups like this.

Well it's in for-next now, so it'll land in 6.2.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Dec. 13, 2022, 3:18 p.m. UTC | #4
On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.
> 

I think iomap generally would report the COW mapping in srcmap for that
case. The problem I suspect is that if XFS sees a hole in the data fork
for an IOMAP_ZERO operation, it doesn't bother checking the COW fork at
all and just returns the data fork hole. That probably needs to be
reworked to perform IOMAP_ZERO lookups similar to those for a normal
buffered write, otherwise iomap_zero_range() would fail to zero in the
particular case where a COW prealloc might overlap a hole.

Either way, I agree with the follow-on point that the iomap map
naming/reporting wrt COW is far too confusing to keep straight.

> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
> 

Does fstests currently have any coverage for zero range operations over
multiple folios? I'm not sure that it does. I suspect the reason this
problem went undetected for so long is that the primary uses of zero
range (i.e. falloc) always flush and invalidate the range. The only
exceptions that I'm aware of are the eof boundary handling scenarios in
XFS (i.e. write past eof, truncate), and those typically only expect to
handle a single folio at the eof block.

Given there's also no test cases for the historical and subtle stale
data exposure problems with this code, I suspect we'd want to come up
with at least one test case that has the ability to effectively stress
new zero range logic. Perhaps fallocate zero range could elide the cache
flush/inval? Even if that were just an occasional/random debug mode
thing, that would add some fsx coverage from the various associated
tests, which would be better than what we have now.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 50 +++++++++++++++++++++++++++++++++++-------
>  fs/xfs/xfs_iops.c      | 12 +---------
>  2 files changed, 43 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..0969e4525507 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -791,7 +796,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			break;
>  		}
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
>  		if (unlikely(status))
>  			break;
>  		if (iter->iomap.flags & IOMAP_F_STALE)

Not necessarily related to this patch, but this looks like it creates a
bit of a negative feedback loop for buffered writes. I.e., a streaming
buffered write workload performs delayed allocation, XFS converts
delalloc to unwritten extents, writeback comes in behind and converts
unwritten extents in ioend batch size chunks, each extent conversion
updates the data fork sequence number and leads to spurious
invalidations on subsequent writes. I don't know if that will ever lead
to enough of a hit that it might be observable on a real workload, or
otherwise if it can just be chalked up to being like any other
background work that comes as a side effect of user operations (i.e.
balance dirty pages, writeback, etc.).

Just out of curiosity and to explore potential worst case
characteristics, I ran a quick test to spin up a bunch of background
flushers of a file while doing a larger streaming write. This seems able
to trigger a 40-50% drop in buffered write throughput compared to the
same workload without stale iomap detection, fwiw. Obviously that is not
testing a sane workload, but probably something to keep in mind if
somebody happens to observe/report that sort of behavior in the near
future.

...
> @@ -1158,9 +1167,20 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t offset;
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status)
> +		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
> +		if (status) {
> +			if (status == -ENODATA) {
> +				/*
> +				 * No folio was found, so skip to the start of
> +				 * the next potential entry in the page cache
> +				 * and continue from there.
> +				 */
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}

I don't think this logic is safe because nothing prevents folio eviction
between the mapping lookup and folio lookup. So if there's a dirty folio
over an unwritten extent, iomap looks up the unwritten extent, and the
folio writes back and is evicted before this lookup, we skip zeroing
data on disk when we shouldn't. That reintroduces a variant of the stale
data exposure vector the flush in xfs_setattr_size() is intended to
prevent.

If we wanted to use stale iomap detection for this purpose, one possible
approach might be to try and support a revalidation check outside of
locked folio context (because there's no guarantee one will exist for a
mapping that might have become stale). With something like that
available, uncached ranges would just have to be revalidated before they
can be tracked as processed by the iter. That might not be too much
trouble to do here since I think the existing stale check (if a folio is
found) would indirectly validate any preceding uncached range. The iter
function would just have to process the 'written' count appropriately
and handle the added scenario where an uncached range must be directly
revalidated.

Brian

>  			return status;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -1168,6 +1188,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
>  
> +		/*
> +		 * If the folio over an unwritten extent is clean, then we
> +		 * aren't going to touch the data in it at all. We don't want to
> +		 * mark it dirty or change the uptodate state of data in the
> +		 * page, so we just unlock it and skip to the next range over
> +		 * the unwritten extent we need to check.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    !folio_test_dirty(folio)) {
> +			folio_unlock(folio);
> +			goto loop_continue;
> +		}
> +
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
>  
> @@ -1175,6 +1208,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> +loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 2e10e1c66ad6..d7c2f8c49f94 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -839,17 +839,7 @@ xfs_setattr_size(
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
> -	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
> +	} else if (newsize != oldsize) {
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.38.1
>
Christoph Hellwig Dec. 23, 2022, 4:15 p.m. UTC | #5
On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
>  	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
>  	if (!folio) {
> -		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> +		if (!(fgp & FGP_CREAT))
> +			status = -ENODATA;
> +		else if (iter->flags & IOMAP_NOWAIT)
> +			status = -EAGAIN;
> +		else
> +			status = -ENOMEM;

Trying to reverse engineer the error cause based on the flags passed
seems a bit problematic, even if I think for now it mostly still
works, as !FGP_CREAT is exclusive vs IOMAP_NOWAIT.

Matthew, what do you think of switching __filemap_get_folio to an
ERR_PTR return and return actual cause?  Or do we have too many
callers and need ____filemap_get_folio (urrg..)?

> +		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
> +		if (status) {
> +			if (status == -ENODATA) {
> +				/*
> +				 * No folio was found, so skip to the start of
> +				 * the next potential entry in the page cache
> +				 * and continue from there.
> +				 */
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}
>  			return status;
> +		}

Nit: I'd check for -ENODATA one level of indentation out to keep
the nesting limited:

		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
		if (status == -ENODATA) {
			...
			goto loop_continue;
		}
		if (status)
 			return status;
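Independent of the indentation nit, the clamp inside the -ENODATA
branch is worth spelling out: it advances at most to the next page
boundary, so the following lookup probes the next possible page cache
slot rather than overshooting it. A userspace model of that arithmetic
(PAGE_SIZE fixed at 4096 here purely for illustration):

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define offset_in_page(p) ((unsigned long)(p) & (PAGE_SIZE - 1))

/*
 * Model of the -ENODATA skip in iomap_zero_iter(): when no folio is
 * cached at pos, consume only the bytes up to the next page boundary
 * so the next __filemap_get_folio() call lands on the next page-sized
 * slot in the mapping.
 */
static size_t skip_len(unsigned long long pos, size_t bytes)
{
	size_t to_boundary = PAGE_SIZE - offset_in_page(pos);

	return bytes > to_boundary ? to_boundary : bytes;
}
```

Note this only bounds the step; large-folio page cache entries are
still found because any folio covering pos is returned by the lookup.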
Christoph Hellwig Dec. 23, 2022, 4:15 p.m. UTC | #6
On Wed, Nov 30, 2022 at 06:08:04PM -0800, Darrick J. Wong wrote:
> I've been wondering for a while if we ought to rename iomap_iter.iomap
> to write_iomap and iomap_iter.srcmap to read_iomap, and change all the
> ->iomap_begin and ->iomap_end functions as needed.

I'd do src vs dst, but yes, something like that is long overdue.

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..0969e4525507 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -585,14 +585,14 @@  static int iomap_write_begin_inline(const struct iomap_iter *iter,
 }
 
 static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
-		size_t len, struct folio **foliop)
+		size_t len, struct folio **foliop, unsigned fgp)
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
-	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	int status = 0;
 
+	fgp |= FGP_LOCK | FGP_WRITE | FGP_STABLE | FGP_NOFS;
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 
@@ -615,7 +615,12 @@  static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
 	if (!folio) {
-		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
+		if (!(fgp & FGP_CREAT))
+			status = -ENODATA;
+		else if (iter->flags & IOMAP_NOWAIT)
+			status = -EAGAIN;
+		else
+			status = -ENOMEM;
 		goto out_no_page;
 	}
 
@@ -791,7 +796,7 @@  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			break;
 		}
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
 		if (unlikely(status))
 			break;
 		if (iter->iomap.flags & IOMAP_F_STALE)
@@ -1101,7 +1106,7 @@  static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 		struct folio *folio;
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
 		if (unlikely(status))
 			return status;
 		if (iter->iomap.flags & IOMAP_F_STALE)
@@ -1147,10 +1152,14 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	unsigned fgp = FGP_CREAT;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
+	/* only do page cache lookups over unwritten extents */
+	if (srcmap->type == IOMAP_UNWRITTEN)
+		fgp = 0;
 
 	do {
 		struct folio *folio;
@@ -1158,9 +1167,20 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
+		if (status) {
+			if (status == -ENODATA) {
+				/*
+				 * No folio was found, so skip to the start of
+				 * the next potential entry in the page cache
+				 * and continue from there.
+				 */
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1168,6 +1188,19 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean, then we
+		 * aren't going to touch the data in it at all. We don't want to
+		 * mark it dirty or change the uptodate state of data in the
+		 * page, so we just unlock it and skip to the next range over
+		 * the unwritten extent we need to check.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1175,6 +1208,7 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..d7c2f8c49f94 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -839,17 +839,7 @@  xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}