
[RFC,v4,1/8] iomap: zeroing needs to be pagecache aware

Message ID 20240529095206.2568162-2-yi.zhang@huaweicloud.com (mailing list archive)
State New
Series iomap/xfs: fix stale data exposure when truncating realtime inodes

Commit Message

Zhang Yi May 29, 2024, 9:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Unwritten extents can have page cache data over the range being
zeroed so we can't just skip them entirely. Fix this by checking for
an existing dirty folio over the unwritten range we are zeroing
and only performing zeroing if the folio is already dirty.

XXX: how do we detect an iomap containing a COW mapping over a hole
in iomap_zero_iter()? The XFS code implies this case also needs to
zero the page cache if there is data present, so the lookup-only
page cache trigger in iomap_zero_iter() needs to handle this case as
well.
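In outline, the behavioural change under discussion is the following
(a simplified sketch of the patched zeroing loop, not the literal diff;
`next_range` is a placeholder label for advancing the iteration):

```c
	/* Sketch: zeroing over an unwritten extent only touches dirty cache. */
	status = iomap_write_begin(iter, pos, bytes, &folio);
	if (status == -ENOENT) {
		/*
		 * Lookup-only mode found no folio over the unwritten
		 * extent: nothing to zero, skip ahead.
		 */
		goto next_range;
	}
	if (status)
		return status;

	if (srcmap->type == IOMAP_UNWRITTEN && !folio_test_dirty(folio)) {
		/* A clean folio over unwritten space already reads back zeros. */
		folio_unlock(folio);
		goto next_range;
	}

	folio_zero_range(folio, offset, bytes);
```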

Before:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m14.103s
user    0m0.015s
sys     0m0.020s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.90    0.847616          16     50000           ftruncate
 14.01    0.138229           2     50000           pwrite64
....

After:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m0.144s
user    0m0.021s
sys     0m0.012s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.86    0.505964          10     50000           ftruncate
 46.12    0.433251           8     50000           pwrite64
....

Yup, we get back all the performance.

As for the "mmap write beyond EOF" data exposure aspect
documented here:

https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/

With this command:

$ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
  -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
  -c fsync -c "pread -v 3k 32" /mnt/scratch/foo

Before:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec)

After:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)

We see that this post-eof unwritten extent dirty page zeroing is
working correctly.

This has passed through most of fstests on a couple of test VMs
without issues at the moment, so I think this approach to fixing the
issue is going to be solid once we've worked out how to detect the
COW-hole mapping case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iops.c      | 12 +-----------
 2 files changed, 41 insertions(+), 13 deletions(-)

Comments

Christoph Hellwig May 31, 2024, 1:11 p.m. UTC | #1
On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.

If there is no data in the page cache and we have either a hole or an
unwritten extent, it really should not matter what is in the COW fork,
as there obviously isn't any data we could zero.

If there is data in the page cache for something that is marked as
a hole in the srcmap, but we have data in the COW fork due to
COW extsize preallocation we'd need to zero it, but as the
xfs iomap ops don't return a separate srcmap for that case we
should be fine.  Or am I missing something?

> + * Note: when zeroing unwritten extents, we might have data in the page cache
> + * over an unwritten extent. In this case, we want to do a pure lookup on the
> + * page cache and not create a new folio as we don't need to perform zeroing on
> + * unwritten extents if there is no cached data over the given range.
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
> +	if (iter->flags & IOMAP_ZERO) {
> +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +
> +		if (srcmap->type == IOMAP_UNWRITTEN)
> +			fgp &= ~FGP_CREAT;
> +	}

Nit:  The comment would probably stand out a little better if it was
right next to the IOMAP_ZERO conditional instead of above the
function.
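Concretely, the rearrangement being suggested would look something like
this (illustrative sketch based on the quoted hunk, with the comment
moved next to the conditional):

```c
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
{
	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;

	if (iter->flags & IOMAP_ZERO) {
		const struct iomap *srcmap = iomap_iter_srcmap(iter);

		/*
		 * When zeroing unwritten extents we might have data in
		 * the page cache over the range, so do a pure lookup
		 * and do not create a new folio: if nothing is cached
		 * over the range, there is nothing to zero.
		 */
		if (srcmap->type == IOMAP_UNWRITTEN)
			fgp &= ~FGP_CREAT;
	}
	/* ... */
}
```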

> +		if (status) {
> +			if (status == -ENOENT) {
> +				/*
> +				 * Unwritten extents need to have page cache
> +				 * lookups done to determine if they have data
> +				 * over them that needs zeroing. If there is no
> +				 * data, we'll get -ENOENT returned here, so we
> +				 * can just skip over this index.
> +				 */
> +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);

I'd return -EIO if the WARN_ON triggers.

> +loop_continue:

While I'm no stranger to gotos for loop control, something trips me
up about jumping to the end of the loop.  Here is what I could come
up with instead.  Not arguing it's objectively better, but I somehow
like it a little better:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 700b22d6807783..81378f7cd8d7ff 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status) {
-			if (status == -ENOENT) {
-				/*
-				 * Unwritten extents need to have page cache
-				 * lookups done to determine if they have data
-				 * over them that needs zeroing. If there is no
-				 * data, we'll get -ENOENT returned here, so we
-				 * can just skip over this index.
-				 */
-				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
-				if (bytes > PAGE_SIZE - offset_in_page(pos))
-					bytes = PAGE_SIZE - offset_in_page(pos);
-				goto loop_continue;
-			}
+		if (status && status != -ENOENT)
 			return status;
-		}
-		if (iter->iomap.flags & IOMAP_F_STALE)
-			break;
 
-		offset = offset_in_folio(folio, pos);
-		if (bytes > folio_size(folio) - offset)
-			bytes = folio_size(folio) - offset;
+		if (status == -ENOENT) {
+			/*
+			 * If we end up here, we did not find a folio in the
+			 * page cache for an unwritten extent and thus can
+			 * skip over the range.
+			 */
+			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
+				return -EIO;
 
-		/*
-		 * If the folio over an unwritten extent is clean (i.e. because
-		 * it has been read from), then it already contains zeros. Hence
-		 * we can just skip it.
-		 */
-		if (srcmap->type == IOMAP_UNWRITTEN &&
-		    !folio_test_dirty(folio)) {
-			folio_unlock(folio);
-			goto loop_continue;
+			/*
+			 * XXX: It would be nice if we could get the offset of
+			 * the next entry in the pagecache so that we don't have
+			 * to iterate one page at a time here.
+			 */
+			offset = offset_in_page(pos);
+			if (bytes > PAGE_SIZE - offset)
+				bytes = PAGE_SIZE - offset;
+		} else {
+			if (iter->iomap.flags & IOMAP_F_STALE)
+				break;
+
+			offset = offset_in_folio(folio, pos);
+			if (bytes > folio_size(folio) - offset)
+				bytes = folio_size(folio) - offset;
+		
+			/*
+			 * If the folio over an unwritten extent is clean (i.e.
+			 * because it has only been read from), then it already
+			 * contains zeros.  Hence we can just skip it.
+			 */
+			if (srcmap->type == IOMAP_UNWRITTEN &&
+			    !folio_test_dirty(folio)) {
+				folio_unlock(folio);
+				status = -ENOENT;
+			}
 		}
 
-		folio_zero_range(folio, offset, bytes);
-		folio_mark_accessed(folio);
+		if (status != -ENOENT) {
+			folio_zero_range(folio, offset, bytes);
+			folio_mark_accessed(folio);
 
-		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
-		__iomap_put_folio(iter, pos, bytes, folio);
-		if (WARN_ON_ONCE(!ret))
-			return -EIO;
+			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+			__iomap_put_folio(iter, pos, bytes, folio);
+			if (WARN_ON_ONCE(!ret))
+				return -EIO;
+		}
 
-loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
Darrick J. Wong May 31, 2024, 2:03 p.m. UTC | #2
On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> If there is no data in the page cache and either a whole or unwritten
> extent it really should not matter what is in the COW fork, a there
> obviously isn't any data we could zero.
> 
> If there is data in the page cache for something that is marked as
> a hole in the srcmap, but we have data in the COW fork due to
> COW extsize preallocation we'd need to zero it, but as the
> xfs iomap ops don't return a separate srcmap for that case we
> should be fine.  Or am I missing something?

It might be useful to skip the scan for dirty pagecache if both forks
have holes, since (in theory) that's never possible on xfs.

OTOH maybe there are filesystems that allow dirty pagecache over a hole?

> > + * Note: when zeroing unwritten extents, we might have data in the page cache
> > + * over an unwritten extent. In this case, we want to do a pure lookup on the
> > + * page cache and not create a new folio as we don't need to perform zeroing on
> > + * unwritten extents if there is no cached data over the given range.
> >   */
> >  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> >  {
> >  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> >  
> > +	if (iter->flags & IOMAP_ZERO) {
> > +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > +
> > +		if (srcmap->type == IOMAP_UNWRITTEN)
> > +			fgp &= ~FGP_CREAT;
> > +	}
> 
> Nit:  The comment would probably stand out a little better if it was
> right next to the IOMAP_ZERO conditional instead of above the
> function.

Agreed.

> > +		if (status) {
> > +			if (status == -ENOENT) {
> > +				/*
> > +				 * Unwritten extents need to have page cache
> > +				 * lookups done to determine if they have data
> > +				 * over them that needs zeroing. If there is no
> > +				 * data, we'll get -ENOENT returned here, so we
> > +				 * can just skip over this index.
> > +				 */
> > +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> 
> I'd return -EIO if the WARN_ON triggers.
> 
> > +loop_continue:
> 
> While I'm no strange to gotos for loop control something trips me
> up about jumping to the end of the loop.  Here is what I could come
> up with instead.  Not arguing it's objectively better, but I somehow
> like it a little better:
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 700b22d6807783..81378f7cd8d7ff 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		bool ret;
>  
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status) {
> -			if (status == -ENOENT) {
> -				/*
> -				 * Unwritten extents need to have page cache
> -				 * lookups done to determine if they have data
> -				 * over them that needs zeroing. If there is no
> -				 * data, we'll get -ENOENT returned here, so we
> -				 * can just skip over this index.
> -				 */
> -				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> -				if (bytes > PAGE_SIZE - offset_in_page(pos))
> -					bytes = PAGE_SIZE - offset_in_page(pos);
> -				goto loop_continue;
> -			}
> +		if (status && status != -ENOENT)
>  			return status;
> -		}
> -		if (iter->iomap.flags & IOMAP_F_STALE)
> -			break;
>  
> -		offset = offset_in_folio(folio, pos);
> -		if (bytes > folio_size(folio) - offset)
> -			bytes = folio_size(folio) - offset;
> +		if (status == -ENOENT) {
> +			/*
> +			 * If we end up here, we did not find a folio in the
> +			 * page cache for an unwritten extent and thus can
> +			 * skip over the range.
> +			 */
> +			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
> +				return -EIO;
>  
> -		/*
> -		 * If the folio over an unwritten extent is clean (i.e. because
> -		 * it has been read from), then it already contains zeros. Hence
> -		 * we can just skip it.
> -		 */
> -		if (srcmap->type == IOMAP_UNWRITTEN &&
> -		    !folio_test_dirty(folio)) {
> -			folio_unlock(folio);
> -			goto loop_continue;
> +			/*
> +			 * XXX: It would be nice if we could get the offset of
> +			 * the next entry in the pagecache so that we don't have
> +			 * to iterate one page at a time here.
> +			 */
> +			offset = offset_in_page(pos);
> +			if (bytes > PAGE_SIZE - offset)
> +				bytes = PAGE_SIZE - offset;

Why is it PAGE_SIZE here and not folio_size() like below?

(I know you're just copying the existing code; I'm merely wondering if
this is some minor bug.)

--D

> +		} else {
> +			if (iter->iomap.flags & IOMAP_F_STALE)
> +				break;
> +
> +			offset = offset_in_folio(folio, pos);
> +			if (bytes > folio_size(folio) - offset)
> +				bytes = folio_size(folio) - offset;
> +		
> +			/*
> +			 * If the folio over an unwritten extent is clean (i.e.
> +			 * because it has only been read from), then it already
> +			 * contains zeros.  Hence we can just skip it.
> +			 */
> +			if (srcmap->type == IOMAP_UNWRITTEN &&
> +			    !folio_test_dirty(folio)) {
> +				folio_unlock(folio);
> +				status = -ENOENT;
> +			}
>  		}
>  
> -		folio_zero_range(folio, offset, bytes);
> -		folio_mark_accessed(folio);
> +		if (status != -ENOENT) {
> +			folio_zero_range(folio, offset, bytes);
> +			folio_mark_accessed(folio);
>  
> -		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> -		__iomap_put_folio(iter, pos, bytes, folio);
> -		if (WARN_ON_ONCE(!ret))
> -			return -EIO;
> +			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> +			__iomap_put_folio(iter, pos, bytes, folio);
> +			if (WARN_ON_ONCE(!ret))
> +				return -EIO;
> +		}
>  
> -loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
>
Christoph Hellwig May 31, 2024, 2:05 p.m. UTC | #3
On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> > +			/*
> > +			 * XXX: It would be nice if we could get the offset of
> > +			 * the next entry in the pagecache so that we don't have
> > +			 * to iterate one page at a time here.
> > +			 */
> > +			offset = offset_in_page(pos);
> > +			if (bytes > PAGE_SIZE - offset)
> > +				bytes = PAGE_SIZE - offset;
> 
> Why is it PAGE_SIZE here and not folio_size() like below?
> 
> (I know you're just copying the existing code; I'm merely wondering if
> this is some minor bug.)

See the comment just above :)
Brian Foster May 31, 2024, 3:43 p.m. UTC | #4
On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > > XXX: how do we detect a iomap containing a cow mapping over a hole
> > > in iomap_zero_iter()? The XFS code implies this case also needs to
> > > zero the page cache if there is data present, so trigger for page
> > > cache lookup only in iomap_zero_iter() needs to handle this case as
> > > well.
> > 
> > If there is no data in the page cache and either a whole or unwritten
> > extent it really should not matter what is in the COW fork, a there
> > obviously isn't any data we could zero.
> > 
> > If there is data in the page cache for something that is marked as
> > a hole in the srcmap, but we have data in the COW fork due to
> > COW extsize preallocation we'd need to zero it, but as the
> > xfs iomap ops don't return a separate srcmap for that case we
> > should be fine.  Or am I missing something?
> 
> It might be useful to skip the scan for dirty pagecache if both forks
> have holes, since (in theory) that's never possible on xfs.
> 
> OTOH maybe there are filesystems that allow dirty pagecache over a hole?
> 

IIRC there was a case where dirty cache can exist over what is reported
as a hole to zero range. I want to say it was something like a COW
prealloc over a data fork hole followed by a buffered write and then a
zero range, but I don't recall the details. That is all something that
should be fixed on the lookup side anyways.

Brian

> > > + * Note: when zeroing unwritten extents, we might have data in the page cache
> > > + * over an unwritten extent. In this case, we want to do a pure lookup on the
> > > + * page cache and not create a new folio as we don't need to perform zeroing on
> > > + * unwritten extents if there is no cached data over the given range.
> > >   */
> > >  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > >  {
> > >  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > >  
> > > +	if (iter->flags & IOMAP_ZERO) {
> > > +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > +
> > > +		if (srcmap->type == IOMAP_UNWRITTEN)
> > > +			fgp &= ~FGP_CREAT;
> > > +	}
> > 
> > Nit:  The comment would probably stand out a little better if it was
> > right next to the IOMAP_ZERO conditional instead of above the
> > function.
> 
> Agreed.
> 
> > > +		if (status) {
> > > +			if (status == -ENOENT) {
> > > +				/*
> > > +				 * Unwritten extents need to have page cache
> > > +				 * lookups done to determine if they have data
> > > +				 * over them that needs zeroing. If there is no
> > > +				 * data, we'll get -ENOENT returned here, so we
> > > +				 * can just skip over this index.
> > > +				 */
> > > +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> > 
> > I'd return -EIO if the WARN_ON triggers.
> > 
> > > +loop_continue:
> > 
> > While I'm no strange to gotos for loop control something trips me
> > up about jumping to the end of the loop.  Here is what I could come
> > up with instead.  Not arguing it's objectively better, but I somehow
> > like it a little better:
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 700b22d6807783..81378f7cd8d7ff 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  		bool ret;
> >  
> >  		status = iomap_write_begin(iter, pos, bytes, &folio);
> > -		if (status) {
> > -			if (status == -ENOENT) {
> > -				/*
> > -				 * Unwritten extents need to have page cache
> > -				 * lookups done to determine if they have data
> > -				 * over them that needs zeroing. If there is no
> > -				 * data, we'll get -ENOENT returned here, so we
> > -				 * can just skip over this index.
> > -				 */
> > -				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> > -				if (bytes > PAGE_SIZE - offset_in_page(pos))
> > -					bytes = PAGE_SIZE - offset_in_page(pos);
> > -				goto loop_continue;
> > -			}
> > +		if (status && status != -ENOENT)
> >  			return status;
> > -		}
> > -		if (iter->iomap.flags & IOMAP_F_STALE)
> > -			break;
> >  
> > -		offset = offset_in_folio(folio, pos);
> > -		if (bytes > folio_size(folio) - offset)
> > -			bytes = folio_size(folio) - offset;
> > +		if (status == -ENOENT) {
> > +			/*
> > +			 * If we end up here, we did not find a folio in the
> > +			 * page cache for an unwritten extent and thus can
> > +			 * skip over the range.
> > +			 */
> > +			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
> > +				return -EIO;
> >  
> > -		/*
> > -		 * If the folio over an unwritten extent is clean (i.e. because
> > -		 * it has been read from), then it already contains zeros. Hence
> > -		 * we can just skip it.
> > -		 */
> > -		if (srcmap->type == IOMAP_UNWRITTEN &&
> > -		    !folio_test_dirty(folio)) {
> > -			folio_unlock(folio);
> > -			goto loop_continue;
> > +			/*
> > +			 * XXX: It would be nice if we could get the offset of
> > +			 * the next entry in the pagecache so that we don't have
> > +			 * to iterate one page at a time here.
> > +			 */
> > +			offset = offset_in_page(pos);
> > +			if (bytes > PAGE_SIZE - offset)
> > +				bytes = PAGE_SIZE - offset;
> 
> Why is it PAGE_SIZE here and not folio_size() like below?
> 
> (I know you're just copying the existing code; I'm merely wondering if
> this is some minor bug.)
> 
> --D
> 
> > +		} else {
> > +			if (iter->iomap.flags & IOMAP_F_STALE)
> > +				break;
> > +
> > +			offset = offset_in_folio(folio, pos);
> > +			if (bytes > folio_size(folio) - offset)
> > +				bytes = folio_size(folio) - offset;
> > +		
> > +			/*
> > +			 * If the folio over an unwritten extent is clean (i.e.
> > +			 * because it has only been read from), then it already
> > +			 * contains zeros.  Hence we can just skip it.
> > +			 */
> > +			if (srcmap->type == IOMAP_UNWRITTEN &&
> > +			    !folio_test_dirty(folio)) {
> > +				folio_unlock(folio);
> > +				status = -ENOENT;
> > +			}
> >  		}
> >  
> > -		folio_zero_range(folio, offset, bytes);
> > -		folio_mark_accessed(folio);
> > +		if (status != -ENOENT) {
> > +			folio_zero_range(folio, offset, bytes);
> > +			folio_mark_accessed(folio);
> >  
> > -		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > -		__iomap_put_folio(iter, pos, bytes, folio);
> > -		if (WARN_ON_ONCE(!ret))
> > -			return -EIO;
> > +			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > +			__iomap_put_folio(iter, pos, bytes, folio);
> > +			if (WARN_ON_ONCE(!ret))
> > +				return -EIO;
> > +		}
> >  
> > -loop_continue:
> >  		pos += bytes;
> >  		length -= bytes;
> >  		written += bytes;
> > 
>
Brian Foster May 31, 2024, 3:44 p.m. UTC | #5
On Fri, May 31, 2024 at 07:05:38AM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> > > +			/*
> > > +			 * XXX: It would be nice if we could get the offset of
> > > +			 * the next entry in the pagecache so that we don't have
> > > +			 * to iterate one page at a time here.
> > > +			 */
> > > +			offset = offset_in_page(pos);
> > > +			if (bytes > PAGE_SIZE - offset)
> > > +				bytes = PAGE_SIZE - offset;
> > 
> > Why is it PAGE_SIZE here and not folio_size() like below?
> > 
> > (I know you're just copying the existing code; I'm merely wondering if
> > this is some minor bug.)
> 
> See the comment just above :)
> 
> 

FWIW, something like the following is pretty slow with the current
implementation on a quick test:

  xfs_io -fc "falloc -k 0 1t" -c "pwrite 1000g 4k" <file>

... so I'd think you'd want some kind of data seek or something to more
efficiently process the range.
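For reference, the kind of data seek mentioned here is exposed to
userspace as lseek(SEEK_DATA). The following is a minimal userspace
sketch of the idea (not the kernel-side fix); the function names are
illustrative, and filesystems without sparse-file awareness may report
the whole file as data, so the result is best treated as a hint:

```c
/*
 * Sketch: instead of walking a sparse range one page at a time, ask
 * the filesystem where the next data region actually starts.
 */
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Next offset at or after pos that contains data, or the file size
 * if only holes remain past pos (lseek fails with ENXIO then). */
static off_t next_data(int fd, off_t pos)
{
	off_t off = lseek(fd, pos, SEEK_DATA);

	if (off < 0)
		off = lseek(fd, 0, SEEK_END);
	return off;
}

/* Build a file with a 1 MiB hole followed by 4 KiB of data and
 * return where the first data region is reported to start. */
static off_t demo(void)
{
	char path[] = "/tmp/seekdata-XXXXXX";
	char buf[4096];
	off_t off;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	/* Writing past EOF leaves a hole over [0, 1 MiB). */
	memset(buf, 'x', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 1 << 20) != (ssize_t)sizeof(buf)) {
		close(fd);
		return -1;
	}

	off = next_data(fd, 0);
	close(fd);
	return off;	/* 1 MiB on sparse-aware filesystems */
}
```

On a sparse-aware filesystem demo() reports 1 MiB; filesystems that do
not track holes may legitimately report offset 0.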

Brian
Brian Foster June 2, 2024, 11:04 a.m. UTC | #6
On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.
> 
> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
>    ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
>    ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 

I've pointed this out in the past, but IIRC this implementation is racy
vs. reclaim. Specifically, relying on folio lookup after mapping lookup
doesn't take reclaim into account, so if we look up an unwritten mapping
and the folio over it is flushed and reclaimed by the time the scan
reaches that offset, it incorrectly treats that subrange as already zero
when it actually isn't (because the extent is actually stale by that
point, but the stale extent check is skipped).

A simple example to demonstrate this is something like the following:

# looping truncate zeroing
while [ true ]; do
	xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
	xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
done

vs.

# looping writeback and reclaim
while [ true ]; do
	xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
done

If I ran that against this patch, the first loop will eventually detect
stale data exposed past eof.

Brian
Dave Chinner June 2, 2024, 10:22 p.m. UTC | #7
On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> If there is no data in the page cache and either a whole or unwritten
> extent it really should not matter what is in the COW fork, a there
> obviously isn't any data we could zero.
> 
> If there is data in the page cache for something that is marked as
> a hole in the srcmap, but we have data in the COW fork due to
> COW extsize preallocation we'd need to zero it, but as the
> xfs iomap ops don't return a separate srcmap for that case we
> should be fine.  Or am I missing something?

If the data extent is a hole, xfs_buffered_write_iomap_begin()
doesn't even check the cow fork for extents if IOMAP_ZERO is being
done. Hence if there is a pending COW extent that extends over a
data fork hole (cow fork preallocation can do that, right?), then we
may have data in the page cache over an unwritten extent in the COW
fork.

This code:

	/* We never need to allocate blocks for zeroing or unsharing a hole. */
        if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
            imap.br_startoff > offset_fsb) {
                xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
                goto out_unlock;
        }

The comment, IMO, indicates the issue here:  we're not going to
allocate blocks in IOMAP_ZERO, but we do need to map anything that
might contain page cache data for the IOMAP_ZERO case. If "data
hole, COW unwritten, page cache dirty" can exist as the comment in
xfs_setattr_size() implies, then this code is broken and needs
fixing.

I don't know what that fix looks like yet - I suspect that all we
need to do for IOMAP_ZERO is to return the COW extent in the srcmap,
and then the zeroing code should do the right thing if it's an
unwritten COW extent...

-Dave.
Zhang Yi June 3, 2024, 9:07 a.m. UTC | #8
On 2024/6/2 19:04, Brian Foster wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Unwritten extents can have page cache data over the range being
>> zeroed so we can't just skip them entirely. Fix this by checking for
>> an existing dirty folio over the unwritten range we are zeroing
>> and only performing zeroing if the folio is already dirty.
>>
>> XXX: how do we detect a iomap containing a cow mapping over a hole
>> in iomap_zero_iter()? The XFS code implies this case also needs to
>> zero the page cache if there is data present, so trigger for page
>> cache lookup only in iomap_zero_iter() needs to handle this case as
>> well.
>>
>> Before:
>>
>> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>>
>> real    0m14.103s
>> user    0m0.015s
>> sys     0m0.020s
>>
>> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  85.90    0.847616          16     50000           ftruncate
>>  14.01    0.138229           2     50000           pwrite64
>> ....
>>
>> After:
>>
>> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>>
>> real    0m0.144s
>> user    0m0.021s
>> sys     0m0.012s
>>
>> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  53.86    0.505964          10     50000           ftruncate
>>  46.12    0.433251           8     50000           pwrite64
>> ....
>>
>> Yup, we get back all the performance.
>>
>> As for the "mmap write beyond EOF" data exposure aspect
>> documented here:
>>
>> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
>>
>> With this command:
>>
>> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
>>
>> Before:
>>
>> wrote 1024/1024 bytes at offset 0
>> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
>> wrote 4096/4096 bytes at offset 32768
>> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
>> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> XXXXXXXXXXXXXXXX
>> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> XXXXXXXXXXXXXXXX
>> read 32/32 bytes at offset 3072
>> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
>>    ops/sec)
>>
>> After:
>>
>> wrote 1024/1024 bytes at offset 0
>> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
>> wrote 4096/4096 bytes at offset 32768
>> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
>> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ................
>> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ................
>> read 32/32 bytes at offset 3072
>> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
>>    ops/sec)
>>
>> We see that this post-eof unwritten extent dirty page zeroing is
>> working correctly.
>>
> 
> I've pointed this out in the past, but IIRC this implementation is racy
> vs. reclaim. Specifically, relying on folio lookup after mapping lookup
> doesn't take reclaim into account, so if we look up an unwritten mapping
> and then a folio flushes and reclaims by the time the scan reaches that
> offset, it incorrectly treats that subrange as already zero when it
> actually isn't (because the extent is actually stale by that point, but
> the stale extent check is skipped).
> 

Hello, Brian!

I'm confused, how could that happen? We do the stale check under the
folio lock, so if the folio is flushed and reclaimed before we get and
lock it in iomap_zero_iter()->iomap_write_begin(), ->iomap_valid()
would catch the stale mapping and we would zero again in the next
iteration. Am I missing something?

Thanks,
Yi.

> A simple example to demonstrate this is something like the following:
> 
> # looping truncate zeroing
> while [ true ]; do
> 	xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
> 	xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
> done
> 
> vs.
> 
> # looping writeback and reclaim
> while [ true ]; do
> 	xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
> done
> 
> If I run that against this patch, the first loop will eventually detect
> stale data exposed past eof.
> 
> Brian
>
Brian Foster June 3, 2024, 2:37 p.m. UTC | #9
On Mon, Jun 03, 2024 at 05:07:02PM +0800, Zhang Yi wrote:
> On 2024/6/2 19:04, Brian Foster wrote:
> > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> >> From: Dave Chinner <dchinner@redhat.com>
> >>
> >> [snip patch description and test results, quoted in full upthread]
> >>
> > 
> > I've pointed this out in the past, but IIRC this implementation is racy
> > vs. reclaim. Specifically, relying on folio lookup after mapping lookup
> > doesn't take reclaim into account, so if we look up an unwritten mapping
> > and then a folio flushes and reclaims by the time the scan reaches that
> > offset, it incorrectly treats that subrange as already zero when it
> > actually isn't (because the extent is actually stale by that point, but
> > the stale extent check is skipped).
> > 
> 
> Hello, Brian!
> 
> I'm confused, how could that happen? We do the stale check under the
> folio lock, so if the folio is flushed and reclaimed before we get and
> lock it in iomap_zero_iter()->iomap_write_begin(), ->iomap_valid()
> would catch the stale mapping and we would zero again in the next
> iteration. Am I missing something?
> 

Hi Yi,

Yep, that is my understanding of how the revalidation thing works in
general as well. The nuance in this particular case is that no folio
exists at the associated offset. Therefore, the reval is skipped in
iomap_write_begin(), iomap_zero_iter() skips over the range as well, and
the operation carries on as normal.

Have you tried the test sequence above? I just retried on latest master
plus this series and it still trips for me. I haven't redone the low
level analysis, but in general what this is trying to show is something
like the following...

Suppose we start with an unwritten block on disk with a dirty folio in
cache:

- iomap looks up the extent and finds the unwritten mapping.
- Reclaim kicks in and writes back the page and removes it from cache.
  The underlying block is no longer unwritten (current mapping is now
  stale).
- iomap_zero_iter() processes the associated offset. iomap_get_folio()
  clears FGP_CREAT, no folio is found.
- iomap_write_begin() returns -ENOENT (i.e. no folio lock ->
  iomap_valid()). iomap_zero_iter() consumes the error and skips to the
  next range.

At that point we have data on disk from a previous write that a
post-write zero range (i.e. truncate) failed to zero. Let me know if you
think this doesn't match up with the current series. It's been a while
since I've looked at this and it's certainly possible I've missed
something or something changed.

Otherwise, I think there are multiple potential ways around this. I.e.,
you could consider a fallback locking mode for reval, explicit (large?)
folio recreation and invalidation (if non-dirty) over unwritten mappings
for revalidation purposes, or something like the POC diff in reply to
hch on patch 3 around checking cache prior to iomap lookup (which also
has tradeoffs). I'm sure there are other ideas as well.

Brian

> Thanks,
> Yi.
> 
> > [snip reproducer, quoted in full upthread]
>
Dave Chinner June 4, 2024, 11:38 p.m. UTC | #10
On Mon, Jun 03, 2024 at 10:37:48AM -0400, Brian Foster wrote:
> On Mon, Jun 03, 2024 at 05:07:02PM +0800, Zhang Yi wrote:
> > On 2024/6/2 19:04, Brian Foster wrote:
> > > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > >> From: Dave Chinner <dchinner@redhat.com>
> > >>
> > >> [snip patch description and test results, quoted in full upthread]
> > >>
> > > 
> > > I've pointed this out in the past, but IIRC this implementation is racy
> > > vs. reclaim. Specifically, relying on folio lookup after mapping lookup
> > > doesn't take reclaim into account, so if we look up an unwritten mapping
> > > and then a folio flushes and reclaims by the time the scan reaches that
> > > offset, it incorrectly treats that subrange as already zero when it
> > > actually isn't (because the extent is actually stale by that point, but
> > > the stale extent check is skipped).
> > > 
> > 
> > Hello, Brian!
> > 
> > I'm confused, how could that happen? We do the stale check under the
> > folio lock, so if the folio is flushed and reclaimed before we get and
> > lock it in iomap_zero_iter()->iomap_write_begin(), ->iomap_valid()
> > would catch the stale mapping and we would zero again in the next
> > iteration. Am I missing something?
> > 
> 
> Hi Yi,
> 
> Yep, that is my understanding of how the revalidation thing works in
> general as well. The nuance in this particular case is that no folio
> exists at the associated offset. Therefore, the reval is skipped in
> iomap_write_begin(), iomap_zero_iter() skips over the range as well, and
> the operation carries on as normal.
> 
> Have you tried the test sequence above? I just retried on latest master
> plus this series and it still trips for me. I haven't redone the low
> level analysis, but in general what this is trying to show is something
> like the following...
> 
> Suppose we start with an unwritten block on disk with a dirty folio in
> cache:
> 
> - iomap looks up the extent and finds the unwritten mapping.
> - Reclaim kicks in and writes back the page and removes it from cache.

To be pedantic, reclaim doesn't write folios back - we haven't done
that for the best part of a decade on XFS. We don't even have a
->writepage method for reclaim to write back pages anymore.

Hence writeback has to come from background flusher threads hitting
that specific folio, then IO completion running and converting the
unwritten extent, then reclaim hitting that folio, all while the
zeroing of the current iomap is being walked and zeroed.

That makes it an extremely rare and difficult condition to hit. Yes,
it's possible, but it's also something we can easily detect. So as
long as detection is low cost, the cost of resolution when such a
rare event is detected isn't going to be noticed by anyone.

>   The underlying block is no longer unwritten (current mapping is now
>   stale).
> - iomap_zero_iter() processes the associated offset. iomap_get_folio()
>   clears FGP_CREAT, no folio is found.

Actually, this is really easy to fix - we simply revalidate the
mapping at this point rather than just skipping the folio range. If
the mapping has changed because it's now written, the zeroing code
backs out and gets a new mapping that reflects the changed state of
this range.

However, with the above cost analysis in mind, a lower overhead
common case alternative might be to only revalidate the mapping at
->iomap_end() time. If the mapping has changed while zeroing, we
return -EBUSY/-ESTALE and that triggers the zeroing to restart from
the offset at the beginning of the "stale" iomap.  The runtime cost
is one extra mapping revalidation call per mapping, and the
resolution cost is refetching and zeroing the range of a single
unwritten iomap.

-Dave.
Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41c8f0c68ef5..a3a5d4eb7289 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -583,11 +583,23 @@  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
+ *
+ * Note: when zeroing unwritten extents, we might have data in the page cache
+ * over an unwritten extent. In this case, we want to do a pure lookup on the
+ * page cache and not create a new folio as we don't need to perform zeroing on
+ * unwritten extents if there is no cached data over the given range.
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
+	if (iter->flags & IOMAP_ZERO) {
+		const struct iomap *srcmap = iomap_iter_srcmap(iter);
+
+		if (srcmap->type == IOMAP_UNWRITTEN)
+			fgp &= ~FGP_CREAT;
+	}
+
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 	fgp |= fgf_set_order(len);
@@ -1388,7 +1400,7 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
 
 	do {
@@ -1399,8 +1411,22 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENOENT) {
+				/*
+				 * Unwritten extents need to have page cache
+				 * lookups done to determine if they have data
+				 * over them that needs zeroing. If there is no
+				 * data, we'll get -ENOENT returned here, so we
+				 * can just skip over this index.
+				 */
+				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1408,6 +1434,17 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean (i.e. because
+		 * it has been read from), then it already contains zeros. Hence
+		 * we can just skip it.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1416,6 +1453,7 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..d44508930b67 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -861,17 +861,7 @@  xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}