
xfs: flush new eof page on truncate to avoid post-eof corruption

Message ID 20201007143509.669729-1-bfoster@redhat.com (mailing list archive)
State Superseded
Series xfs: flush new eof page on truncate to avoid post-eof corruption

Commit Message

Brian Foster Oct. 7, 2020, 2:35 p.m. UTC
It is possible to expose non-zeroed post-EOF data in XFS if the new
EOF page is dirty, backed by an unwritten block and the truncate
happens to race with writeback. iomap_truncate_page() will not zero
the post-EOF portion of the page if the underlying block is
unwritten. The subsequent call to truncate_setsize() will, but
doesn't dirty the page. Therefore, if writeback happens to complete
after iomap_truncate_page() (so it still sees the unwritten block)
but before truncate_setsize(), the cached page becomes inconsistent
with the on-disk block. A mapped read after the associated page is
reclaimed or invalidated exposes non-zero post-EOF data.

For example, consider the following sequence when run on a kernel
modified to explicitly flush the new EOF page within the race
window:

$ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
$ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
  ...
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  00 00 00 00 00 00 00 00  ........
$ umount /mnt/; mount <dev> /mnt/
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  cd cd cd cd cd cd cd cd  ........

Update xfs_setattr_size() to explicitly flush the new EOF page prior
to the page truncate to ensure iomap has the latest state of the
underlying block.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This patch is intentionally simplistic because I wanted to get some
thoughts on a proper fix and at the same time consider something easily
backportable. The iomap behavior seems rather odd to me in general,
particularly if we consider that the same kind of behavior can occur on
file-extending writes. It's just not a user observable problem in that
case because a sub-page write of a current EOF page (backed by an
unwritten block) will zero fill the rest of the page at write time
(before the zero range essentially skips it due to the unwritten block).
It's not totally clear to me if that's an intentional design
characteristic of iomap or something we should address.

It _seems_ like the more appropriate fix is that iomap truncate page
should at least accommodate a dirty page over an unwritten block and
modify the page (or perhaps just unconditionally do a buffered write on
a non-aligned truncate, similar to what block_truncate_page() does). For
example, we could push the UNWRITTEN check from iomap_zero_range_actor()
down into iomap_zero(), actually check for an existing page there, and
then either zero it or skip out if none exists. Thoughts?

Brian

 fs/xfs/xfs_iops.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Darrick J. Wong Oct. 7, 2020, 3:33 p.m. UTC | #1
On Wed, Oct 07, 2020 at 10:35:09AM -0400, Brian Foster wrote:
> It is possible to expose non-zeroed post-EOF data in XFS if the new
> EOF page is dirty, backed by an unwritten block and the truncate
> happens to race with writeback. iomap_truncate_page() will not zero
> the post-EOF portion of the page if the underlying block is
> unwritten. The subsequent call to truncate_setsize() will, but
> doesn't dirty the page. Therefore, if writeback happens to complete
> after iomap_truncate_page() (so it still sees the unwritten block)
> but before truncate_setsize(), the cached page becomes inconsistent
> with the on-disk block. A mapped read after the associated page is
> reclaimed or invalidated exposes non-zero post-EOF data.
> 
> For example, consider the following sequence when run on a kernel
> modified to explicitly flush the new EOF page within the race
> window:
> 
> $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
>   ...
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  00 00 00 00 00 00 00 00  ........
> $ umount /mnt/; mount <dev> /mnt/
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  cd cd cd cd cd cd cd cd  ........
> 
> Update xfs_setattr_size() to explicitly flush the new EOF page prior
> to the page truncate to ensure iomap has the latest state of the
> underlying block.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This patch is intentionally simplistic because I wanted to get some
> thoughts on a proper fix and at the same time consider something easily
> backportable. The iomap behavior seems rather odd to me in general,
> particularly if we consider the same kind of behavior can occur on
> file-extending writes. It's just not a user observable problem in that
> case because a sub-page write of a current EOF page (backed by an
> unwritten block) will zero fill the rest of the page at write time
> (before the zero range essentially skips it due to the unwritten block).
> It's not totally clear to me if that's an intentional design
> characteristic of iomap or something we should address.
> 
> It _seems_ like the more appropriate fix is that iomap truncate page
> should at least accommodate a dirty page over an unwritten block and
> modify the page (or perhaps just unconditionally do a buffered write on
> a non-aligned truncate, similar to what block_truncate_page() does). For
> example, we could push the UNWRITTEN check from iomap_zero_range_actor()
> down into iomap_zero(), actually check for an existing page there, and
> then either zero it or skip out if none exists. Thoughts?

I haven't looked at this in much depth yet, but I agree with the
principle that iomap ought to handle the case of unwritten extents
fronted by dirty pagecache.

--D

> Brian
> 
>  fs/xfs/xfs_iops.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 80a13c8561d8..3ef2e77b454e 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,6 +911,16 @@ xfs_setattr_size(
>  		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
>  				&did_zeroing, &xfs_buffered_write_iomap_ops);
>  	} else {
> +		/*
> +		 * iomap won't detect a dirty page over an unwritten block and
> +		 * subsequently skips zeroing the newly post-eof portion of the
> +		 * page. Flush the new EOF to convert the block before the
> +		 * pagecache truncate.
> +		 */
> +		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> +						     newsize);
> +		if (error)
> +			return error;
>  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
>  				&xfs_buffered_write_iomap_ops);
>  	}
> -- 
> 2.25.4
>
Brian Foster Oct. 8, 2020, 7:11 p.m. UTC | #2
On Wed, Oct 07, 2020 at 08:33:59AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 07, 2020 at 10:35:09AM -0400, Brian Foster wrote:
> > It is possible to expose non-zeroed post-EOF data in XFS if the new
> > EOF page is dirty, backed by an unwritten block and the truncate
> > happens to race with writeback. iomap_truncate_page() will not zero
> > the post-EOF portion of the page if the underlying block is
> > unwritten. The subsequent call to truncate_setsize() will, but
> > doesn't dirty the page. Therefore, if writeback happens to complete
> > after iomap_truncate_page() (so it still sees the unwritten block)
> > but before truncate_setsize(), the cached page becomes inconsistent
> > with the on-disk block. A mapped read after the associated page is
> > reclaimed or invalidated exposes non-zero post-EOF data.
> > 
> > For example, consider the following sequence when run on a kernel
> > modified to explicitly flush the new EOF page within the race
> > window:
> > 
> > $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> > $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
> >   ...
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  00 00 00 00 00 00 00 00  ........
> > $ umount /mnt/; mount <dev> /mnt/
> > $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> > 00000400:  cd cd cd cd cd cd cd cd  ........
> > 
> > Update xfs_setattr_size() to explicitly flush the new EOF page prior
> > to the page truncate to ensure iomap has the latest state of the
> > underlying block.
> > 
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This patch is intentionally simplistic because I wanted to get some
> > thoughts on a proper fix and at the same time consider something easily
> > backportable. The iomap behavior seems rather odd to me in general,
> > particularly if we consider the same kind of behavior can occur on
> > file-extending writes. It's just not a user observable problem in that
> > case because a sub-page write of a current EOF page (backed by an
> > unwritten block) will zero fill the rest of the page at write time
> > (before the zero range essentially skips it due to the unwritten block).
> > It's not totally clear to me if that's an intentional design
> > characteristic of iomap or something we should address.
> > 
> > It _seems_ like the more appropriate fix is that iomap truncate page
> > should at least accommodate a dirty page over an unwritten block and
> > modify the page (or perhaps just unconditionally do a buffered write on
> > a non-aligned truncate, similar to what block_truncate_page() does). For
> > example, we could push the UNWRITTEN check from iomap_zero_range_actor()
> > down into iomap_zero(), actually check for an existing page there, and
> > then either zero it or skip out if none exists. Thoughts?
> 
> I haven't looked at this in much depth yet, but I agree with the
> principle that iomap ought to handle the case of unwritten extents
> fronted by dirty pagecache.
> 

Ok. What I was originally thinking above turned out to be too
inefficient. However, it occurred to me that we already have an
efficient cache scanning mechanism in seek data/hole, so I think
something like the appended might be doable. Thoughts?

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..676d8d2ae7c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -944,6 +944,26 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
+static void
+iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
+		loff_t *count, loff_t *written)
+{
+	loff_t dirty_offset, bytes = 0;
+
+	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
+				SEEK_DATA);
+	if (dirty_offset == -ENOENT)
+		bytes = *count;
+	else if (dirty_offset > *pos)
+		bytes = dirty_offset - *pos;
+
+	if (bytes) {
+		*pos += bytes;
+		*count -= bytes;
+		*written += bytes;
+	}
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
@@ -953,12 +973,19 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 	int status;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return count;
 
 	do {
 		unsigned offset, bytes;
 
+		if (srcmap->type == IOMAP_UNWRITTEN) {
+			iomap_zero_range_skip_uncached(inode, &pos, &count,
+				&written);
+			if (!count)
+				break;
+		}
+
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..4f5e4eca9906 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -70,7 +70,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
  *
 *   * Returns the resulting offset on success, and -ENOENT otherwise.
  */
-static loff_t
+loff_t
 page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 		int whence)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..437ae0d708d6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -184,6 +184,9 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops);
+loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
+		loff_t length, int whence);
+
 
 /*
  * Structure for writeback I/O completions.
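The arithmetic in the proposed skip helper above can be checked outside the kernel. The following is an editorial sketch in Python (not kernel code; `skip_uncached` and the sentinel are illustrative stand-ins for iomap_zero_range_skip_uncached() and -ENOENT) showing how the pos/count/written triple advances past uncached, unwritten-backed ranges that need no explicit zeroing:

```python
ENOENT = -2  # models the -ENOENT "no cached data found" return value

def skip_uncached(pos, count, written, dirty_offset):
    """Return updated (pos, count, written) after skipping uncached bytes.

    dirty_offset is the offset of the next dirty page in [pos, pos+count),
    or ENOENT if the whole range is uncached.
    """
    if dirty_offset == ENOENT:
        nbytes = count               # nothing cached: skip the entire range
    elif dirty_offset > pos:
        nbytes = dirty_offset - pos  # skip up to the first dirty page
    else:
        nbytes = 0                   # dirty data starts at pos: skip nothing
    return pos + nbytes, count - nbytes, written + nbytes

# No cached data in the 8k range starting at 4k: the whole range is skipped.
print(skip_uncached(4096, 8192, 0, ENOENT))   # (12288, 0, 8192)
# First dirty page at 8k: skip the clean 4k, leave the tail for zeroing.
print(skip_uncached(4096, 8192, 0, 8192))     # (8192, 4096, 4096)
```

The interesting property is the third branch: when the dirty page sits exactly at pos, nothing is skipped and the caller falls through to the normal per-page zeroing loop.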

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 80a13c8561d8..3ef2e77b454e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,6 +911,16 @@  xfs_setattr_size(
 		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
 				&did_zeroing, &xfs_buffered_write_iomap_ops);
 	} else {
+		/*
+		 * iomap won't detect a dirty page over an unwritten block and
+		 * subsequently skips zeroing the newly post-eof portion of the
+		 * page. Flush the new EOF to convert the block before the
+		 * pagecache truncate.
+		 */
+		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
+						     newsize);
+		if (error)
+			return error;
 		error = iomap_truncate_page(inode, newsize, &did_zeroing,
 				&xfs_buffered_write_iomap_ops);
 	}
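The race in the commit message reduces to an ordering question: does writeback of the dirty, unwritten-backed EOF page complete between iomap_truncate_page() (which skips zeroing because the block is unwritten) and truncate_setsize() (which zeroes the cached page but does not redirty it)? The following is an editorial toy model in Python (not kernel code; names and states are illustrative) of what a mapped read returns after the page is reclaimed:

```python
def read_after_reclaim(writeback_between):
    """Model the post-EOF bytes a mapped read sees after page reclaim."""
    page_dirty = True          # pwrite left the EOF page dirty
    block_unwritten = True     # falloc'd block not yet converted
    disk_bytes = None          # nothing on disk yet

    # iomap_truncate_page(): block is unwritten, so zeroing is skipped
    # and the page is left dirty with stale post-EOF bytes.

    if writeback_between:
        # Writeback completes in the race window: the stale page goes to
        # disk, the block is converted, and the page is now clean.
        disk_bytes = "stale"
        block_unwritten = False
        page_dirty = False

    # truncate_setsize(): zeroes the cached page but does NOT dirty it.
    page_bytes = "zero"

    if page_dirty:
        # Writeback after truncate_setsize() writes the zeroed page.
        disk_bytes = "zero"

    # Page reclaimed or invalidated; a mapped read now comes from disk.
    return disk_bytes

print(read_after_reclaim(False))  # zero  -> consistent, no corruption
print(read_after_reclaim(True))   # stale -> the exposed post-EOF data
```

In the racing case the zeroed page is clean, so the zeroes never reach disk and reclaim exposes the stale bytes, which is exactly why the fix flushes the new EOF page (or teaches iomap to zero dirty pages over unwritten blocks) before the pagecache truncate.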