
[5/9] xfs: buffered write failure should not truncate the page cache

Message ID 20221115013043.360610-6-david@fromorbit.com (mailing list archive)
State New
Series xfs, iomap: fix data corruption due to stale cached iomaps

Commit Message

Dave Chinner Nov. 15, 2022, 1:30 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently invalidates the page cache
over the unused range of the delalloc extent it allocated. While the
write allocated the delalloc extent, it does not own it exclusively
as the write does not hold any locks that prevent either writeback
or mmap page faults from changing the state of either the page cache
or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.

generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.

To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.

Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 141 insertions(+), 10 deletions(-)
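
For reference, the core of the new error handling path boils down to roughly
the following shape (a condensed sketch of the hunk quoted in the review
below, with assertions and the inner folio walk elided; see the full patch
for the exact code):

	/*
	 * Condensed sketch of xfs_buffered_write_delalloc_release(): walk the
	 * byte range of the failed write with mapping_seek_hole_data() and
	 * punch the delalloc backing only over the gaps between ranges of
	 * cached data.
	 */
	filemap_invalidate_lock(inode->i_mapping);
	while (start_byte < end_byte) {
		loff_t	data_end;

		/* find the start of the next range of cached data */
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_DATA);
		if (start_byte == -ENXIO || start_byte == end_byte)
			break;			/* no more data in range */
		if (start_byte < 0) {
			error = start_byte;
			break;
		}

		/* ... and the end of that contiguous data range */
		data_end = mapping_seek_hole_data(inode->i_mapping,
				start_byte, end_byte, SEEK_HOLE);
		if (data_end < 0) {
			error = data_end;
			break;
		}

		/*
		 * Walk the folios over [start_byte, data_end), punching the
		 * delalloc extent over clean ranges and advancing
		 * punch_start_byte past dirty ones.
		 */
		error = xfs_buffered_write_delalloc_scan(inode,
				&punch_start_byte, start_byte, data_end);
		if (error)
			break;

		/* the next data search starts at the end of this one */
		start_byte = data_end;
	}
	/* punch out everything left after the last dirty range found */
	if (!error && punch_start_byte < end_byte)
		error = xfs_buffered_write_delalloc_punch(inode,
				punch_start_byte, end_byte);
	filemap_invalidate_unlock(inode->i_mapping);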

Comments

Christoph Hellwig Nov. 15, 2022, 8:43 a.m. UTC | #1
> +static int
> +xfs_buffered_write_delalloc_scan(

As pointed out last time, this is generic iomap (or in fact filemap.c,
but maybe start small) logic, so move it there and just pass
xfs_buffered_write_delalloc_punch as a callback.
Darrick J. Wong Nov. 16, 2022, 12:48 a.m. UTC | #2
On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() currently invalidates the page cache
> over the unused range of the delalloc extent it allocated. While the
> write allocated the delalloc extent, it does not own it exclusively
> as the write does not hold any locks that prevent either writeback
> or mmap page faults from changing the state of either the page cache
> or the extent state backing this range.
> 
> Whilst xfs_bmap_punch_delalloc_range() already handles races in
> extent conversion - it will only punch out delalloc extents and it
> ignores any other type of extent - the page cache truncate does not
> discriminate between data written by this write or some other task.
> As a result, truncating the page cache can result in data corruption
> if the write races with mmap modifications to the file over the same
> range.
> 
> generic/346 exercises this workload, and if we randomly fail writes
> (as will happen when iomap gets stale iomap detection later in the
> patchset), it will randomly corrupt the file data because it removes
> data written by mmap() in the same page as the write() that failed.
> 
> Hence we do not want to punch out the page cache over the range of
> the extent we failed to write to - what we actually need to do is
> detect the ranges that have dirty data in cache over them and *not
> punch them out*.
> 
> To do this, we have to walk the page cache over the range of the
> delalloc extent we want to remove. This is made complex by the fact
> we have to handle partially up-to-date folios correctly and this can
> happen even when the FSB size == PAGE_SIZE because we now support
> multi-page folios in the page cache.
> 
> Because we are only interested in discovering the edges of data
> ranges in the page cache (i.e. hole-data boundaries) we can make use
> of mapping_seek_hole_data() to find those transitions in the page
> cache. As we hold the invalidate_lock, we know that the boundaries
> are not going to change while we walk the range. This interface is
> also byte-based and is sub-page block aware, so we can find the data
> ranges in the cache based on byte offsets rather than page, folio or
> fs block sized chunks. This greatly simplifies the logic of finding
> dirty cached ranges in the page cache.
> 
> Once we've identified a range that contains cached data, we can then
> iterate the range folio by folio. This allows us to determine if the
> data is dirty and hence perform the correct delalloc extent punching
> operations. The seek interface we use to iterate data ranges will
> give us sub-folio start/end granularity, so we may end up looking up
> the same folio multiple times as the seek interface iterates across
> each discontiguous data region in the folio.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
>  				end_fsb - start_fsb);
>  }
>  
> +/*
> + * Scan the data range passed to us for dirty page cache folios. If we find a
> + * dirty folio, punch out the preceding range and update the offset from which
> + * the next punch will start.
> + *
> + * We can punch out clean pages because they either contain data that has been
> + * written back - in which case the delalloc punch over that range is a no-op -
> + * or they have been read faulted in, in which case they contain zeroes and we can
> + * remove the delalloc backing range and any new writes to those pages will do
> + * the normal hole filling operation...
> + *
> + * This makes the logic simple: we only need to keep the delalloc extents
> + * over the dirty ranges of the page cache.
> + */
> +static int
> +xfs_buffered_write_delalloc_scan(
> +	struct inode		*inode,
> +	loff_t			*punch_start_byte,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			offset = start_byte;
> +
> +	while (offset < end_byte) {
> +		struct folio	*folio;
> +
> +		/* grab locked page */
> +		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> +		if (!folio) {
> +			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* if dirty, punch up to offset */
> +		if (folio_test_dirty(folio)) {
> +			if (offset > *punch_start_byte) {
> +				int	error;
> +
> +				error = xfs_buffered_write_delalloc_punch(inode,
> +						*punch_start_byte, offset);

This sounds an awful lot like what iomap_writeback_ops.discard_folio()
does, albeit without the xfs_alert screaming everywhere.

Moving along... so we punch out delalloc reservations for any part of
the page cache that is clean.  "punch_start_byte" is the start pos of
the last range of clean cache, and we want to punch up to the start of
this dirty folio...

> +				if (error) {
> +					folio_unlock(folio);
> +					folio_put(folio);
> +					return error;
> +				}
> +			}
> +
> +			/*
> +			 * Make sure the next punch start is correctly bound to
> +			 * the end of this data range, not the end of the folio.
> +			 */
> +			*punch_start_byte = min_t(loff_t, end_byte,
> +					folio_next_index(folio) << PAGE_SHIFT);

...and then start a new clean range after this folio (or at the end_byte
to signal that we're done)...

> +		}
> +
> +		/* move offset to start of next folio in range */
> +		offset = folio_next_index(folio) << PAGE_SHIFT;
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, if they lie within the same folio or even if
> + * there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			punch_start_byte = start_byte;
> +	int			error = 0;
> +
> +	/*
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> +	 * cache and perform delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	while (start_byte < end_byte) {
> +		loff_t		data_end;
> +
> +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> +				start_byte, end_byte, SEEK_DATA);
> +		/*
> +		 * If there is no more data to scan, all that is left is to
> +		 * punch out the remaining range.
> +		 */
> +		if (start_byte == -ENXIO || start_byte == end_byte)
> +			break;
> +		if (start_byte < 0) {
> +			error = start_byte;
> +			goto out_unlock;
> +		}
> +		ASSERT(start_byte >= punch_start_byte);
> +		ASSERT(start_byte < end_byte);
> +
> +		/*
> +		 * We find the end of this contiguous cached data range by
> +		 * seeking from start_byte to the beginning of the next hole.
> +		 */
> +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> +				end_byte, SEEK_HOLE);
> +		if (data_end < 0) {
> +			error = data_end;
> +			goto out_unlock;
> +		}
> +		ASSERT(data_end > start_byte);
> +		ASSERT(data_end <= end_byte);
> +
> +		error = xfs_buffered_write_delalloc_scan(inode,
> +				&punch_start_byte, start_byte, data_end);

...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where
there's even going to be folios mapped.  But in structuring the code
like this, xfs now has to know details about folio state again, and the
point of iomap/buffered-io.c is to delegate handling of the pagecache
away from XFS, at least for filesystems that want to manage buffered IO
the same way XFS does.

IOWs, I agree with Christoph that these two functions that compute the
ranges that need delalloc-punching really ought to be in the iomap code.
TBH I wonder if this isn't better suited for being called by
iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with a
function pointer in iomap page_ops for iomap to tell xfs to drop the
delalloc reservations.

--D

> +		if (error)
> +			goto out_unlock;
> +
> +		/* The next data search starts at the end of this one. */
> +		start_byte = data_end;
> +	}
> +
> +	if (punch_start_byte < end_byte)
> +		error = xfs_buffered_write_delalloc_punch(inode,
> +				punch_start_byte, end_byte);
> +out_unlock:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return error;
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
>  	if (start_byte >= end_byte)
>  		return 0;
>  
> -	/*
> -	 * Lock the mapping to avoid races with page faults re-instantiating
> -	 * folios and dirtying them via ->page_mkwrite between the page cache
> -	 * truncation and the delalloc extent removal. Failing to do this can
> -	 * leave dirty pages with no space reservation in the cache.
> -	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> -	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
>  	if (error && !xfs_is_shutdown(mp)) {
>  		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
>  			__func__, XFS_I(inode)->i_ino);
> -- 
> 2.37.2
>
Brian Foster Nov. 16, 2022, 1:57 p.m. UTC | #3
On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
>  				end_fsb - start_fsb);
>  }
>  
...
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, if they lie within the same folio or even if
> + * there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			punch_start_byte = start_byte;
> +	int			error = 0;
> +
> +	/*
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> +	 * cache and perform delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	while (start_byte < end_byte) {
> +		loff_t		data_end;
> +
> +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> +				start_byte, end_byte, SEEK_DATA);

FWIW, the fact that mapping seek data is based on uptodate status means
that seek behavior can change based on prior reads. For example, see how
seek hole/data presents reads of unwritten ranges as data [1]. The same
thing isn't observable for holes because iomap doesn't check the mapping
in that case, but underlying iop state is the same and that is what this
code is looking at.

The filtering being done here means we essentially only care about dirty
pages backed by delalloc blocks. That means if you get here with a dirty
page and the portion of the page affected by this failed write is
uptodate, this won't punch an underlying delalloc block even though
nothing else may have written to it in the meantime. That sort of state
can be created by a prior read of the range on a sub-page block size fs,
or perhaps a racing async readahead (via read fault of a lower
offset..?), etc.

I suspect this is not a serious error because the page is dirty and
writeback will thus convert the block. The only exception to that I can
see is if the block is beyond EOF (consider a mapped read to a page that
straddles EOF, followed by a post-eof write that fails), writeback won't
actually map the block directly. It may convert if contiguous with
delalloc blocks inside EOF (and sufficiently sized physical extents
exist), or even if not, should still otherwise be cleaned up by the
various other means we already have to manage post-eof blocks.

So IOW there's a tradeoff being made here for possible spurious
allocation and I/O and a subtle dependency on writeback that should
probably be documented somewhere. The larger concern is that if
writeback eventually changes based on dirty range tracking in a way that
breaks this dependency, that introduces yet another stale delalloc block
landmine associated with this error handling code (regardless of whether
you want to call that a bug in this code, seek data, whatever), and
those problems are difficult enough to root cause as it is.

Brian

[1]

# xfs_io -fc "falloc 0 4k" -c "seek -a 0" -c "pread 0 4k" -c "seek -a 0" <file>
Whence  Result
HOLE    0
read 4096/4096 bytes at offset 0
4 KiB, 4 ops; 0.0000 sec (156 MiB/sec and 160000.0000 ops/sec)
Whence  Result
DATA    0
HOLE    4096

> +		/*
> +		 * If there is no more data to scan, all that is left is to
> +		 * punch out the remaining range.
> +		 */
> +		if (start_byte == -ENXIO || start_byte == end_byte)
> +			break;
> +		if (start_byte < 0) {
> +			error = start_byte;
> +			goto out_unlock;
> +		}
> +		ASSERT(start_byte >= punch_start_byte);
> +		ASSERT(start_byte < end_byte);
> +
> +		/*
> +		 * We find the end of this contiguous cached data range by
> +		 * seeking from start_byte to the beginning of the next hole.
> +		 */
> +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> +				end_byte, SEEK_HOLE);
> +		if (data_end < 0) {
> +			error = data_end;
> +			goto out_unlock;
> +		}
> +		ASSERT(data_end > start_byte);
> +		ASSERT(data_end <= end_byte);
> +
> +		error = xfs_buffered_write_delalloc_scan(inode,
> +				&punch_start_byte, start_byte, data_end);
> +		if (error)
> +			goto out_unlock;
> +
> +		/* The next data search starts at the end of this one. */
> +		start_byte = data_end;
> +	}
> +
> +	if (punch_start_byte < end_byte)
> +		error = xfs_buffered_write_delalloc_punch(inode,
> +				punch_start_byte, end_byte);
> +out_unlock:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return error;
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
>  	if (start_byte >= end_byte)
>  		return 0;
>  
> -	/*
> -	 * Lock the mapping to avoid races with page faults re-instantiating
> -	 * folios and dirtying them via ->page_mkwrite between the page cache
> -	 * truncation and the delalloc extent removal. Failing to do this can
> -	 * leave dirty pages with no space reservation in the cache.
> -	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> -	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
>  	if (error && !xfs_is_shutdown(mp)) {
>  		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
>  			__func__, XFS_I(inode)->i_ino);
> -- 
> 2.37.2
> 
>
Dave Chinner Nov. 17, 2022, 12:41 a.m. UTC | #4
On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> ...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 141 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> >  				end_fsb - start_fsb);
> >  }
> >  
> ...
> > +/*
> > + * Punch out all the delalloc blocks in the range given except for those that
> > + * have dirty data still pending in the page cache - those are going to be
> > + * written and so must still retain the delalloc backing for writeback.
> > + *
> > + * As we are scanning the page cache for data, we don't need to reimplement the
> > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > + * byte range based iteration is especially convenient because it means we don't
> > + * have to care about variable size folios, nor where the start or end of the
> > + * data range lies within a folio, if they lie within the same folio or even if
> > + * there are multiple discontiguous data ranges within the folio.
> > + */
> > +static int
> > +xfs_buffered_write_delalloc_release(
> > +	struct inode		*inode,
> > +	loff_t			start_byte,
> > +	loff_t			end_byte)
> > +{
> > +	loff_t			punch_start_byte = start_byte;
> > +	int			error = 0;
> > +
> > +	/*
> > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > +	 * cache and perform delalloc extent removal. Failing to do this can
> > +	 * leave dirty pages with no space reservation in the cache.
> > +	 */
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	while (start_byte < end_byte) {
> > +		loff_t		data_end;
> > +
> > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > +				start_byte, end_byte, SEEK_DATA);
> 
> FWIW, the fact that mapping seek data is based on uptodate status means
> that seek behavior can change based on prior reads.

Yup. It should be obvious that any page cache scan based algorithm
will change based on changing page cache residency.

> For example, see how
> seek hole/data presents reads of unwritten ranges as data [1]. The same
> thing isn't observable for holes because iomap doesn't check the mapping
> in that case, but underlying iop state is the same and that is what this
> code is looking at.

Well, yes.  That's the fundamental, underlying issue that this
patchset is addressing for the write() operation: that the page
cache contents and the underlying filesystem extent map are not
guaranteed to be coherent and can be changed independently of each
other.

The whole problem with looking exclusively at filesystem level
extent state (and hence FIEMAP) is that the extent state doesn't
tell us whether there is uncommitted data over the range of the extent
in the page cache.  The filesystem extent state and page cache data
*can't be coherent* in a writeback caching environment. This is the
fundamental difference between what the filesystem extent map tells
us (FIEMAP) and what querying the page cache tells us
(SEEK_DATA/SEEK_HOLE).

This is also the underlying problem with iomap_truncate_page() - it
fails to query the page cache for data over unwritten extents, so
fails to zero the post-EOF part of dirty folios over unwritten
extents and so it all goes wrong...

> The filtering being done here means we essentially only care about dirty
> pages backed by delalloc blocks. That means if you get here with a dirty
> page and the portion of the page affected by this failed write is
> uptodate, this won't punch an underlying delalloc block even though
> nothing else may have written to it in the meantime.

Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
will not span ranges that have pre-existing *dirty* data in the
page cache. Those *must* already have (del)allocated extents, hence
the iomap for the newly allocated delalloc extent will always end
before pre-existing dirty data in the page cache starts.

Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
precludes stepping into ranges that have pre-existing cached dirty
data.

We also can't get a racing write() to the same range right now
because this is all under IOLOCK_EXCL, hence we only ever see dirty
folios as a result of race with page faults. page faults zero the
entire folio they insert into the page cache and
iomap_folio_mkwrite_iter() asserts that the entire folio is marked
up to date. Hence if we find a dirty folio outside the range the
write() dirtied, we are guaranteed that the entire dirty folio is up
to date....

Yes, there can be pre-existing *clean* folios (and clean partially
up to date folios) in the page cache, but we won't have dirty
partially up to date pages in the middle of the range we are
scanning. Hence we only need to care about the edge cases (folios
that overlap start and ends). We skip the partially written start
block, and we always punch up to the end block if it is different
from the last block we punched up to. If the end of the data spans
into a dirty folio, we know that dirty range is up to date because
the seek scan only returns ranges that are up to date. Hence we
don't punch those partial blocks out....

Regardless, let's assume we have a racing write that has partially
updated and dirtied a folio (because we've moved to
XFS_IOLOCK_SHARED locking for writes). This case is already handled
by the mapping_seek_hole_data() based iteration.

That is, the mapping_seek_hole_data() loop provides us with
*discrete ranges of up to date data* that are independent of folio
size, up-to-date range granularity, dirty range tracking, filesystem
block size, etc.

Hence if the next discrete range we discover is in the same dirty
folio as the previous discrete range of up to date data, we know we
have a sub-folio sized hole in the data that is not up to date.
Because there is no data over this range, we have to punch out the
underlying delalloc extent over that range. 

IOWs, the dirty state of the folio and/or the granularity of the
dirty range tracking is irrelevant here - we know there was no data
in the cache (clean or dirty) over this range because it is
discontiguous with the previous range of data returned.

IOWs, if we have this "up to date" map on a dirty folio like this:

Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+

Then the unrolled iteration and punching we do would look like this:

First iteration of the range:

punch_start:
		V
		+-------+UUUUUUU+-------+UUUUUUU+-------+

SEEK_DATA:		V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_HOLE:			^
Data range:		+UUUUUUU+
Punch range:	+-------+
Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+

Second iteration:

punch_start			V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_DATA:				V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_HOLE:					^
Data range:				+UUUUUUU+
Punch range:			+-------+
Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+

Third iteration:

punch_start					V
		+-------+UUUUUUU+-------+UUUUUUU+-------+
SEEK_DATA: - moves into next folio in cache
....
Punch range:					+-------+ ......
Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
			(to end of scan range or start of next data)

As you can see, this scan does not care about folio size, sub-folio
range granularity or filesystem block sizes.  It also matches
exactly how writeback handles dirty, partially up to date folios, so
there's no stray delalloc blocks left around to be tripped over
after failed or short writes occur.

Indeed, if we move to sub-folio dirty range tracking, we can simply
add a mapping_seek_hole_data() variant that walks dirty ranges in
the page cache rather than up to date ranges. Then we can remove the
inner loop from this code that looks up folios to determine dirty
state. The above algorithm does not change - we just walk from
discrete range to discrete range punching the gaps between them....

IOWs, the algorithm is largely future proof - the only thing that
needs to change if we change iomap to track sub-folio dirty ranges
is how we check the data range for being dirty. That should be no
surprise, really, the surprise should be that we can make some
simple mods to page cache seek to remove the need for checking dirty
state in this code altogether....
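
As a purely illustrative sketch of that future shape - assuming a
hypothetical mapping_seek_dirty_data() variant with the same calling
convention as mapping_seek_hole_data(), which does not exist today - the
release loop would reduce to punching the gaps between dirty ranges, with
no per-folio checks at all:

	/*
	 * Hypothetical: mapping_seek_dirty_data() walks dirty ranges
	 * instead of up to date ranges. Error handling elided for brevity.
	 */
	while (start_byte < end_byte) {
		loff_t	dirty_start, dirty_end;

		dirty_start = mapping_seek_dirty_data(inode->i_mapping,
				start_byte, end_byte, SEEK_DATA);
		if (dirty_start == -ENXIO || dirty_start == end_byte)
			break;
		dirty_end = mapping_seek_dirty_data(inode->i_mapping,
				dirty_start, end_byte, SEEK_HOLE);

		/* punch the delalloc backing over the clean gap */
		if (dirty_start > punch_start_byte)
			xfs_buffered_write_delalloc_punch(inode,
					punch_start_byte, dirty_start);

		/* keep the delalloc extent under the dirty range */
		punch_start_byte = dirty_end;
		start_byte = dirty_end;
	}
	if (punch_start_byte < end_byte)
		xfs_buffered_write_delalloc_punch(inode,
				punch_start_byte, end_byte);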

> That sort of state
> can be created by a prior read of the range on a sub-page block size fs,
> or perhaps a racing async readahead (via read fault of a lower
> offset..?), etc.

Yup, generic/346 exercises this racing unaligned, sub-folio mmap
write vs write() case. This test, specifically, was the reason I
moved to using mapping_seek_hole_data() - g/346 found an endless
stream of bugs in the sub-multi-page-folio range iteration code I
kept trying to write....

> I suspect this is not a serious error because the page is dirty
> and writeback will thus convert the block. The only exception to
> that I can see is if the block is beyond EOF (consider a mapped
> read to a page that straddles EOF, followed by a post-eof write
> that fails), writeback won't actually map the block directly.

I don't think that can happen. iomap_write_failed() calls
truncate_pagecache_range() to remove any newly instantiated cached
data beyond the original EOF. Hence the delalloc punch will remove
everything beyond the original EOF that was allocated for the failed
write. Hence when we get to writeback we're not going to find any
up-to-date data beyond the EOF block in the page cache, nor any
stray delalloc blocks way beyond EOF....

> It may convert if contiguous with delalloc blocks inside EOF (and
> sufficiently sized physical extents exist), or even if not, should
> still otherwise be cleaned up by the various other means we
> already have to manage post-eof blocks.
>
> So IOW there's a tradeoff being made here for possible spurious
> allocation and I/O and a subtle dependency on writeback that
> should probably be documented somewhere.

As per above, I don't think there is any spurious/stale allocation
left behind by the punch code, nor is there any dependency on
writeback to ignore such issues.

> The larger concern is that if
> writeback eventually changes based on dirty range tracking in a way that
> breaks this dependency, that introduces yet another stale delalloc block
> landmine associated with this error handling code (regardless of whether
> you want to call that a bug in this code, seek data, whatever), and
> those problems are difficult enough to root cause as it is.

If iomap changes how it tracks dirty ranges, this punch code only
needs small changes to work with that correctly. There aren't any
unknown landmines here - if we change dirty tracking, we know that
we have to update the code that depends on the existing dirty
tracking mechanisms to work correctly with the new infrastructure...

Cheers,

Dave.
Dave Chinner Nov. 17, 2022, 1:06 a.m. UTC | #5
On Tue, Nov 15, 2022 at 04:48:58PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > +/*
> > + * Scan the data range passed to us for dirty page cache folios. If we find a
> > + * dirty folio, punch out the preceding range and update the offset from which
> > + * the next punch will start.
> > + *
> > + * We can punch out clean pages because they either contain data that has been
> > + * written back - in which case the delalloc punch over that range is a no-op -
> > + * or they have been read faulted in, in which case they contain zeroes and we can
> > + * remove the delalloc backing range and any new writes to those pages will do
> > + * the normal hole filling operation...
> > + *
> > + * This makes the logic simple: we only need to keep the delalloc extents
> > + * over the dirty ranges of the page cache.
> > + */
> > +static int
> > +xfs_buffered_write_delalloc_scan(
> > +	struct inode		*inode,
> > +	loff_t			*punch_start_byte,
> > +	loff_t			start_byte,
> > +	loff_t			end_byte)
> > +{
> > +	loff_t			offset = start_byte;
> > +
> > +	while (offset < end_byte) {
> > +		struct folio	*folio;
> > +
> > +		/* grab locked page */
> > +		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> > +		if (!folio) {
> > +			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		/* if dirty, punch up to offset */
> > +		if (folio_test_dirty(folio)) {
> > +			if (offset > *punch_start_byte) {
> > +				int	error;
> > +
> > +				error = xfs_buffered_write_delalloc_punch(inode,
> > +						*punch_start_byte, offset);
> 
> This sounds an awful lot like what iomap_writeback_ops.discard_folio()
> does, albeit without the xfs_alert screaming everywhere.

similar, but .discard_folio() is trashing uncommitted written data,
whilst this loop is explicitly preserving uncommitted written
data....

> Moving along... so we punch out delalloc reservations for any part of
> the page cache that is clean.  "punch_start_byte" is the start pos of
> the last range of clean cache, and we want to punch up to the start of
> this dirty folio...
> 
> > +				if (error) {
> > +					folio_unlock(folio);
> > +					folio_put(folio);
> > +					return error;
> > +				}
> > +			}
> > +
> > +			/*
> > +			 * Make sure the next punch start is correctly bound to
> > +			 * the end of this data range, not the end of the folio.
> > +			 */
> > +			*punch_start_byte = min_t(loff_t, end_byte,
> > +					folio_next_index(folio) << PAGE_SHIFT);
> 
> ...and then start a new clean range after this folio (or at the end_byte
> to signal that we're done)...

Yes.

> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	while (start_byte < end_byte) {
> > +		loff_t		data_end;
> > +
> > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > +				start_byte, end_byte, SEEK_DATA);
> > +		/*
> > +		 * If there is no more data to scan, all that is left is to
> > +		 * punch out the remaining range.
> > +		 */
> > +		if (start_byte == -ENXIO || start_byte == end_byte)
> > +			break;
> > +		if (start_byte < 0) {
> > +			error = start_byte;
> > +			goto out_unlock;
> > +		}
> > +		ASSERT(start_byte >= punch_start_byte);
> > +		ASSERT(start_byte < end_byte);
> > +
> > +		/*
> > +		 * We find the end of this contiguous cached data range by
> > +		 * seeking from start_byte to the beginning of the next hole.
> > +		 */
> > +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> > +				end_byte, SEEK_HOLE);
> > +		if (data_end < 0) {
> > +			error = data_end;
> > +			goto out_unlock;
> > +		}
> > +		ASSERT(data_end > start_byte);
> > +		ASSERT(data_end <= end_byte);
> > +
> > +		error = xfs_buffered_write_delalloc_scan(inode,
> > +				&punch_start_byte, start_byte, data_end);
> 
> ...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where
> there's even going to be folios mapped.  But in structuring the code
> like this, xfs now has to know details about folio state again, and the
> point of iomap/buffered-io.c is to delegate handling of the pagecache
> away from XFS, at least for filesystems that want to manage buffered IO
> the same way XFS does.
> 
> IOWs, I agree with Christoph that these two functions that compute the
> ranges that need delalloc-punching really ought to be in the iomap code.

Which I've already done.

> TBH I wonder if this isn't better suited for being called by
> iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with a
> function pointer in iomap page_ops for iomap to tell xfs to drop the
> delalloc reservations.

IIRC, the whole point of the iomap_begin()/iomap_end() operation
pairs is that the iter functions don't need to concern themselves
with the filesystem operations needed to manipulate extents and
clean up filesystem state as a result of failed operations.

I'm pretty sure we need this same error handling for zeroing and
unshare operations, because they could also detect stale cached
iomaps and have to go remap their extents. Maybe they don't allocate
new extents, but that is beside the point - the error handling that
is necessary is common to multiple functions, and right now they
don't have to care about cleaning up iomap/filesystem state when
short writes/errors occur.

Hence I don't think we want to break the architectural layering by
doing this - I think it would just lead to having to handle the same
issues in multiple places, and we don't gain any extra control over
or information about how to perform the cleanup of the unused
portion of the iomap....

Cheers,

Dave.
Darrick J. Wong Nov. 17, 2022, 6:28 p.m. UTC | #6
On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > >  				end_fsb - start_fsb);
> > >  }
> > >  
> > ...
> > > +/*
> > > + * Punch out all the delalloc blocks in the range given except for those that
> > > + * have dirty data still pending in the page cache - those are going to be
> > > + * written and so must still retain the delalloc backing for writeback.
> > > + *
> > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > + * byte range based iteration is especially convenient because it means we don't
> > > + * have to care about variable size folios, nor where the start or end of the
> > > + * data range lies within a folio, if they lie within the same folio or even if
> > > + * there are multiple discontiguous data ranges within the folio.
> > > + */
> > > +static int
> > > +xfs_buffered_write_delalloc_release(
> > > +	struct inode		*inode,
> > > +	loff_t			start_byte,
> > > +	loff_t			end_byte)
> > > +{
> > > +	loff_t			punch_start_byte = start_byte;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > +	 * leave dirty pages with no space reservation in the cache.
> > > +	 */
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	while (start_byte < end_byte) {
> > > +		loff_t		data_end;
> > > +
> > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > +				start_byte, end_byte, SEEK_DATA);
> > 
> > FWIW, the fact that mapping seek data is based on uptodate status means
> > that seek behavior can change based on prior reads.
> 
> Yup. It should be obvious that any page cache scan based algorithm
> will change based on changing page cache residency.
> 
> > For example, see how
> > seek hole/data presents reads of unwritten ranges as data [1]. The same
> > thing isn't observable for holes because iomap doesn't check the mapping
> > in that case, but underlying iop state is the same and that is what this
> > code is looking at.
> 
> Well, yes.  That's the fundamental, underlying issue that this
> patchset is addressing for the write() operation: that the page
> cache contents and the underlying filesystem extent map are not
> guaranteed to be coherent and can be changed independently of each
> other.
> 
> The whole problem with looking exclusively at filesystem level
> extent state (and hence FIEMAP) is that the extent state doesn't
> tell us whether there is uncommitted data over the range of the extent
> in the page cache.  The filesystem extent state and page cache data
> *can't be coherent* in a writeback caching environment. This is the
> fundamental difference between what the filesystem extent map tells
> us (FIEMAP) and what querying the page cache tells us
> (SEEK_DATA/SEEK_HOLE).
> 
> This is also the underlying problem with iomap_truncate_page() - it
> fails to query the page cache for data over unwritten extents, so
> fails to zero the post-EOF part of dirty folios over unwritten
> extents and so it all goes wrong...
> 
> > The filtering being done here means we essentially only care about dirty
> > pages backed by delalloc blocks. That means if you get here with a dirty
> > page and the portion of the page affected by this failed write is
> > uptodate, this won't punch an underlying delalloc block even though
> > nothing else may have written to it in the meantime.
> 
> Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
> will not span ranges that have pre-existing *dirty* data in the
> page cache. Those *must* already have (del)allocated extents, hence
> the iomap for the newly allocated delalloc extent will always end
> before pre-existing dirty data in the page cache starts.
> 
> Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> precludes stepping into ranges that have pre-existing cached dirty
> data.
> 
> We also can't get a racing write() to the same range right now
> because this is all under IOLOCK_EXCL, hence we only ever see dirty
> folios as a result of race with page faults. page faults zero the
> entire folio they insert into the page cache and
> iomap_folio_mkwrite_iter() asserts that the entire folio is marked
> up to date. Hence if we find a dirty folio outside the range the
> write() dirtied, we are guaranteed that the entire dirty folio is up
> to date....
> 
> Yes, there can be pre-existing *clean* folios (and clean partially
> up to date folios) in the page cache, but we won't have dirty
> partially up to date pages in the middle of the range we are
> scanning. Hence we only need to care about the edge cases (folios
> that overlap start and ends). We skip the partially written start
> block, and we always punch up to the end block if it is different
> from the last block we punched up to. If the end of the data spans
> into a dirty folio, we know that dirty range is up to date because
> the seek scan only returns ranges that are up to date. Hence we
> don't punch those partial blocks out....
> 
> Regardless, let's assume we have a racing write that has partially
> updated and dirtied a folio (because we've moved to
> XFS_IOLOCK_SHARED locking for writes). This case is already handled
> by the mapping_seek_hole_data() based iteration.
> 
> That is, the mapping_seek_hole_data() loop provides us with
> *discrete ranges of up to date data* that are independent of folio
> size, up-to-date range granularity, dirty range tracking, filesystem
> block size, etc.
> 
> Hence if the next discrete range we discover is in the same dirty
> folio as the previous discrete range of up to date data, we know we
> have a sub-folio sized hole in the data that is not up to date.
> Because there is no data over this range, we have to punch out the
> underlying delalloc extent over that range. 
> 
> IOWs, the dirty state of the folio and/or the granularity of the
> dirty range tracking is irrelevant here - we know there was no data
> in the cache (clean or dirty) over this range because it is
> discontiguous with the previous range of data returned.
> 
> IOWs, if we have this "up to date" map on a dirty folio like this:
> 
> Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Then the unrolled iteration and punching we do would look like this:
> 
> First iteration of the range:
> 
> punch_start:
> 		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> 
> SEEK_DATA:		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:			^
> Data range:		+UUUUUUU+
> Punch range:	+-------+
> Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Second iteration:
> 
> punch_start			V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:				V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:					^
> Data range:				+UUUUUUU+
> Punch range:			+-------+
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> 
> Third iteration:
> 
> punch_start					V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA: - moves into next folio in cache
> ....
> Punch range:					+-------+ ......
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> 			(to end of scan range or start of next data)
> 
> As you can see, this scan does not care about folio size, sub-folio
> range granularity or filesystem block sizes.  It also matches
> exactly how writeback handles dirty, partially up to date folios, so
> there's no stray delalloc blocks left around to be tripped over
> after failed or short writes occur.

I had been wondering about the particular case of dealing with partially
uptodate folios, so it was really helpful to watch this all in
ASCIIVision(tm).

> Indeed, if we move to sub-folio dirty range tracking, we can simply
> add a mapping_seek_hole_data() variant that walks dirty ranges in
> the page cache rather than up to date ranges. Then we can remove the
> inner loop from this code that looks up folios to determine dirty
> state. The above algorithm does not change - we just walk from
> discrete range to discrete range punching the gaps between them....
> 
> IOWs, the algorithm is largely future proof - the only thing that
> needs to change if we change iomap to track sub-folio dirty ranges
> is how we check the data range for being dirty. That should be no
> surprise, really, the surprise should be that we can make some
> simple mods to page cache seek to remove the need for checking dirty
> state in this code altogether....
> 
> > That sort of state
> > can be created by a prior read of the range on a sub-page block size fs,
> > or perhaps a racing async readahead (via read fault of a lower
> > offset..?), etc.
> 
> Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> write vs write() case. This test, specifically, was the reason I
> moved to using mapping_seek_hole_data() - g/346 found an endless
> stream of bugs in the sub-multi-page-folio range iteration code I
> kept trying to write....
> 
> > I suspect this is not a serious error because the page is dirty
> > and writeback will thus convert the block. The only exception to
> > that I can see is if the block is beyond EOF (consider a mapped
> > read to a page that straddles EOF, followed by a post-eof write
> > that fails), writeback won't actually map the block directly.
> 
> I don't think that can happen. iomap_write_failed() calls
> truncate_pagecache_range() to remove any newly instantiated cached
> data beyond the original EOF. Hence the delalloc punch will remove
> everything beyond the original EOF that was allocated for the failed
> write. Hence when we get to writeback we're not going to find any
> up-to-date data beyond the EOF block in the page cache, nor any
> stray delalloc blocks way beyond EOF....
> 
> > It may convert if contiguous with delalloc blocks inside EOF (and
> > sufficiently sized physical extents exist), or even if not, should
> > still otherwise be cleaned up by the various other means we
> > already have to manage post-eof blocks.
> >
> > So IOW there's a tradeoff being made here for possible spurious
> > allocation and I/O and a subtle dependency on writeback that
> > should probably be documented somewhere.
> 
> As per above, I don't think there is any spurious/stale allocation
> left behind by the punch code, nor is there any dependency on
> writeback to ignore such issues.
> 
> > The larger concern is that if
> > writeback eventually changes based on dirty range tracking in a way that
> > breaks this dependency, that introduces yet another stale delalloc block
> > landmine associated with this error handling code (regardless of whether
> > you want to call that a bug in this code, seek data, whatever), and
> > those problems are difficult enough to root cause as it is.
> 
> If iomap changes how it tracks dirty ranges, this punch code only
> needs small changes to work with that correctly. There aren't any
> unknown landmines here - if we change dirty tracking, we know that
> we have to update the code that depends on the existing dirty
> tracking mechanisms to work correctly with the new infrastructure...

I hope that anyone trying to change the iomap dirty tracking code will
find it easier to do with the iteration code in buffered-io.c.  Thanks
for taking the time to rework that part for v3.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Nov. 18, 2022, 5:20 p.m. UTC | #7
On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > >  				end_fsb - start_fsb);
> > >  }
> > >  
> > ...
> > > +/*
> > > + * Punch out all the delalloc blocks in the range given except for those that
> > > + * have dirty data still pending in the page cache - those are going to be
> > > + * written and so must still retain the delalloc backing for writeback.
> > > + *
> > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > + * byte range based iteration is especially convenient because it means we don't
> > > + * have to care about variable size folios, nor where the start or end of the
> > > + * data range lies within a folio, if they lie within the same folio or even if
> > > + * there are multiple discontiguous data ranges within the folio.
> > > + */
> > > +static int
> > > +xfs_buffered_write_delalloc_release(
> > > +	struct inode		*inode,
> > > +	loff_t			start_byte,
> > > +	loff_t			end_byte)
> > > +{
> > > +	loff_t			punch_start_byte = start_byte;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > +	 * leave dirty pages with no space reservation in the cache.
> > > +	 */
> > > +	filemap_invalidate_lock(inode->i_mapping);
> > > +	while (start_byte < end_byte) {
> > > +		loff_t		data_end;
> > > +
> > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > +				start_byte, end_byte, SEEK_DATA);
> > 
> > FWIW, the fact that mapping seek data is based on uptodate status means
> > that seek behavior can change based on prior reads.
> 
> Yup. It should be obvious that any page cache scan based algorithm
> will change based on changing page cache residency.
> 
> > For example, see how
> > seek hole/data presents reads of unwritten ranges as data [1]. The same
> > thing isn't observable for holes because iomap doesn't check the mapping
> > in that case, but underlying iop state is the same and that is what this
> > code is looking at.
> 
> Well, yes.  That's the fundamental, underlying issue that this
> patchset is addressing for the write() operation: that the page
> cache contents and the underlying filesystem extent map are not
> guaranteed to be coherent and can be changed independently of each
> other.
> 
> The whole problem with looking exclusively at filesystem level
> extent state (and hence FIEMAP) is that the extent state doesn't
> > tell us whether there is uncommitted data over the range of the extent
> in the page cache.  The filesystem extent state and page cache data
> *can't be coherent* in a writeback caching environment. This is the
> fundamental difference between what the filesystem extent map tells
> us (FIEMAP) and what querying the page cache tells us
> (SEEK_DATA/SEEK_HOLE).
> 
> This is also the underlying problem with iomap_truncate_page() - it
> fails to query the page cache for data over unwritten extents, so
> fails to zero the post-EOF part of dirty folios over unwritten
> extents and so it all goes wrong...
> 
> > The filtering being done here means we essentially only care about dirty
> > pages backed by delalloc blocks. That means if you get here with a dirty
> > page and the portion of the page affected by this failed write is
> > uptodate, this won't punch an underlying delalloc block even though
> > nothing else may have written to it in the meantime.
> 
> Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap
> will not span ranges that have pre-existing *dirty* data in the
> page cache. Those *must* already have (del)allocated extents, hence
> the iomap for the newly allocated delalloc extent will always end
> before pre-existing dirty data in the page cache starts.
> 
> Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> precludes stepping into ranges that have pre-existing cached dirty
> data.
> 
> We also can't get a racing write() to the same range right now
> because this is all under IOLOCK_EXCL, hence we only ever see dirty
> folios as a result of race with page faults. page faults zero the
> entire folio they insert into the page cache and
> iomap_folio_mkwrite_iter() asserts that the entire folio is marked
> up to date. Hence if we find a dirty folio outside the range the
> write() dirtied, we are guaranteed that the entire dirty folio is up
> to date....
> 
> Yes, there can be pre-existing *clean* folios (and clean partially
> up to date folios) in the page cache, but we won't have dirty
> partially up to date pages in the middle of the range we are
> scanning. Hence we only need to care about the edge cases (folios
> that overlap start and ends). We skip the partially written start
> block, and we always punch up to the end block if it is different
> from the last block we punched up to. If the end of the data spans
> into a dirty folio, we know that dirty range is up to date because
> the seek scan only returns ranges that are up to date. Hence we
> don't punch those partial blocks out....
> 
> Regardless, let's assume we have a racing write that has partially
> updated and dirtied a folio (because we've moved to
> XFS_IOLOCK_SHARED locking for writes). This case is already handled
> by the mapping_seek_hole_data() based iteration.
> 
> That is, the mapping_seek_hole_data() loop provides us with
> *discrete ranges of up to date data* that are independent of folio
> size, up-to-date range granularity, dirty range tracking, filesystem
> block size, etc.
> 
> Hence if the next discrete range we discover is in the same dirty
> folio as the previous discrete range of up to date data, we know we
> have a sub-folio sized hole in the data that is not up to date.
> Because there is no data over this range, we have to punch out the
> underlying delalloc extent over that range. 
> 
> IOWs, the dirty state of the folio and/or the granularity of the
> dirty range tracking is irrelevant here - we know there was no data
> > in the cache (clean or dirty) over this range because it is
> discontiguous with the previous range of data returned.
> 
> IOWs, if we have this "up to date" map on a dirty folio like this:
> 
> Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Then the unrolled iteration and punching we do would look like this:
> 
> First iteration of the range:
> 
> punch_start:
> 		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> 
> SEEK_DATA:		V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:			^
> Data range:		+UUUUUUU+
> Punch range:	+-------+
> Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> 
> Second iteration:
> 
> punch_start			V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA:				V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_HOLE:					^
> Data range:				+UUUUUUU+
> Punch range:			+-------+
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> 
> Third iteration:
> 
> punch_start					V
> 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> SEEK_DATA: - moves into next folio in cache
> ....
> Punch range:					+-------+ ......
> Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> 			(to end of scan range or start of next data)
> 
> As you can see, this scan does not care about folio size, sub-folio
> range granularity or filesystem block sizes.  It also matches
> exactly how writeback handles dirty, partially up to date folios, so
> there's no stray delalloc blocks left around to be tripped over
> after failed or short writes occur.
> 
> Indeed, if we move to sub-folio dirty range tracking, we can simply
> add a mapping_seek_hole_data() variant that walks dirty ranges in
> the page cache rather than up to date ranges. Then we can remove the
> inner loop from this code that looks up folios to determine dirty
> state. The above algorithm does not change - we just walk from
> discrete range to discrete range punching the gaps between them....
> 
> IOWs, the algorithm is largely future proof - the only thing that
> needs to change if we change iomap to track sub-folio dirty ranges
> is how we check the data range for being dirty. That should be no
> surprise, really, the surprise should be that we can make some
> simple mods to page cache seek to remove the need for checking dirty
> state in this code altogether....
> 
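
A rough sketch of what that future version of the loop could look like -
purely illustrative, and note that seek_dirty_data() below is a
hypothetical helper, not an existing page cache API:

	while (start_byte < end_byte) {
		loff_t	dirty_start, dirty_end;

		/* hypothetical: find the next dirty range in the cache */
		dirty_start = seek_dirty_data(inode->i_mapping, start_byte,
					      end_byte, &dirty_end);
		if (dirty_start >= end_byte)
			break;
		/* punch the gap between the previous dirty range and this one */
		if (dirty_start > punch_start_byte)
			error = xfs_buffered_write_delalloc_punch(inode,
					punch_start_byte, dirty_start);
		/* dirty data keeps its delalloc backing for writeback */
		punch_start_byte = dirty_end;
		start_byte = dirty_end;
	}
	if (!error && punch_start_byte < end_byte)
		error = xfs_buffered_write_delalloc_punch(inode,
				punch_start_byte, end_byte);
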
> > That sort of state
> > can be created by a prior read of the range on a sub-page block size fs,
> > or perhaps a racing async readahead (via read fault of a lower
> > offset..?), etc.
> 
> Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> write vs write() case. This test, specifically, was the reason I
> moved to using mapping_seek_hole_data() - g/346 found an endless
> stream of bugs in the sub-multi-page-folio range iteration code I
> kept trying to write....
> 
> > I suspect this is not a serious error because the page is dirty
> > and writeback will thus convert the block. The only exception to
> > that I can see is if the block is beyond EOF (consider a mapped
> > read to a page that straddles EOF, followed by a post-eof write
> > that fails), writeback won't actually map the block directly.
> 
> I don't think that can happen. iomap_write_failed() calls
> truncate_pagecache_range() to remove any newly instantiated cached
> data beyond the original EOF. Hence the delalloc punch will remove
> everything beyond the original EOF that was allocated for the failed
> write. Hence when we get to writeback we're not going to find any
> up-to-date data beyond the EOF block in the page cache, nor any
> stray delalloc blocks way beyond EOF....
> 

It can happen if the page straddles eof. I don't think much of the above
relates to the behavior I'm describing. This doesn't require racing
writes or theoretical shared write locking, and the IOMAP_F_NEW flag and
core iteration algorithm are not major factors either.

It's easier to just use examples. Consider the following sequence on a
bsize=1k fs on a kernel with this patch series. Note that my xfs_io is
hacked up to turn 'pwrite -f' into a "fail" flag that triggers the
-EFAULT error check in iomap_write_iter() by passing a bad user buffer:

# set eof and make sure entire page is uptodate (even posteof)
$XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" $file
# do a delalloc and open/close cycle to set dirty release
$XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file
# fail an appending write to a posteof, discontig, uptodate range on already dirty page
$XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c "fiemap -v" $file

wrote 1/1 bytes at offset 0
1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
pwrite: Bad address
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..1]:          22..23               2   0x0
   1: [2..3]:          hole                 2
   2: [4..5]:          0..1                 2   0x7

So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
there that hasn't been written to. I call this a landmine because it's an
unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
reclamation won't pick it up because the inode was never tagged for speculative
preallocation:

$ xfs_spaceman -c "prealloc -s" /mnt
$ xfs_io -c "fiemap -v" /mnt/file 
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..1]:          22..23               2   0x0
   1: [2..3]:          hole                 2
   2: [4..5]:          0..1                 2   0x7

At the same time, it's not an observable problem in XFS because inode
reclaim will eventually truncate that block off due to the delalloc
block check, which will prevent any sort of space accounting corruption.
I also don't think anybody will miss a single block dangling off like
that in the meantime. The only other odd side effect I can think of is
if this happened and the user performed an explicit post-eof fallocate,
reclaim could now lop that off due to the last resort delalloc block
check. That can still technically happen anyways as a side effect of
speculative prealloc so I don't see that as a major problem either.

The more important point here is that just because XFS can handle this
situation reasonably gracefully doesn't mean every other filesystem
expects it.

> > It may convert if contiguous with delalloc blocks inside EOF (and
> > sufficiently sized physical extents exist), or even if not, should
> > still otherwise be cleaned up by the various other means we
> > already have to manage post-eof blocks.
> >
> > So IOW there's a tradeoff being made here for possible spurious
> > allocation and I/O and a subtle dependency on writeback that
> > should probably be documented somewhere.
> 
> As per above, I don't think there is any spurious/stale allocation
> left behind by the punch code, nor is there any dependency on
> writeback to ignore such issues.
> 

The same sort of example as above (but within eof) demonstrates the
writeback dependency. I.e., with this series alone the punch code leaves
behind an in-eof uptodate delalloc block, but writeback comes along and
maps it regardless because it also only has uptodate granularity within
the dirty folio.

When writeback changes to only map dirty sub-ranges, clearly that no
longer happens and the landmine turns into a stale delalloc leak. This
is also easy to reproduce with the last dirty range RFC patches pulled
on top of this series.

The changes you describe above to seek out and punch all !dirty delalloc
ranges should address both of these problems, because the error handling
code doesn't consider i_size the way writeback does.

> > The larger concern is that if
> > writeback eventually changes based on dirty range tracking in a way that
> > breaks this dependency, that introduces yet another stale delalloc block
> > landmine associated with this error handling code (regardless of whether
> > you want to call that a bug in this code, seek data, whatever), and
> > those problems are difficult enough to root cause as it is.
> 
> If iomap changes how it tracks dirty ranges, this punch code only
> needs small changes to work with that correctly. There aren't any
> unknown landmines here - if we change dirty tracking, we know that
> we have to update the code that depends on the existing dirty
> tracking mechanisms to work correctly with the new infrastructure...
> 

I'm just pointing out there are side effects / corner cases that don't
depend on the complicated shared locking / racing write fault scenarios
described above. Therefore, this code probably needs to be fixed at the
same time dirty range tracking is enabled to not introduce stale
delalloc problems for XFS.

In the meantime, I don't think any fs developer just looking to use or
debug the iomap layer and considering this punch error handling
mechanism should be expected to immediately connect the dots: that it
uses pagecache seek, that pagecache seek depends on Uptodate state, the
behavioral implications of that, that writeback potentially won't map
post-eof ranges, and therefore that the fs in question may need its own
custom post-eof error handling if dangling post-eof blocks aren't
expected anywhere else for that fs.

It may be possible to address that by doing more aggressive punching
past the i_size boundary here, but that comes with other tradeoffs.
Short of that, I was really just asking for a more descriptive comment
for the punch mechanism so these quirks are not lost before they can be
properly fixed up or are apparent to any new users besides XFS this
might attract in the meantime. For example, something like the following
added to the _punch_delalloc() comment (from v3):

"
...

Note that since this depends on pagecache seek data, and that depends on
folio Uptodate state to identify data regions, the scanning behavior
here can be non-deterministic based on the state of cache at the time of
the failed operation. For example, if the target folio of a failed write
was already dirty and the target (sub-folio) range was previously an
Uptodate hole before the failing write may have performed delayed
allocation, we have no choice but to skip the punch for that sub-range
of the folio because we can't tell whether sub-folio Uptodate ranges
were actually written to or not. This means that delalloc block
allocations may persist for writes that never succeeded.

This can similarly skip punches beyond EOF for failed post-eof writes,
except note that writeback will not explicitly map blocks beyond i_size.
Therefore, filesystems that cannot gracefully accommodate dangling
post-eof blocks via other means may need to check for and handle that
case directly after calling this function.

These quirks should all be addressed with proper sub-folio dirty range
tracking, at which point we can deterministically scan and punch
non-dirty delalloc ranges at the time of write failure.
"

... but I also don't want to go in circles just over a comment, so I'm
content that it's at least documented on the list here.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Nov. 21, 2022, 11:13 p.m. UTC | #8
On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote:
> On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > ...
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > > >  				end_fsb - start_fsb);
> > > >  }
> > > >  
> > > ...
> > > > +/*
> > > > + * Punch out all the delalloc blocks in the range given except for those that
> > > > + * have dirty data still pending in the page cache - those are going to be
> > > > + * written and so must still retain the delalloc backing for writeback.
> > > > + *
> > > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > > + * byte range based iteration is especially convenient because it means we don't
> > > > + * have to care about variable size folios, nor where the start or end of the
> > > > + * data range lies within a folio, if they lie within the same folio or even if
> > > > + * there are multiple discontiguous data ranges within the folio.
> > > > + */
> > > > +static int
> > > > +xfs_buffered_write_delalloc_release(
> > > > +	struct inode		*inode,
> > > > +	loff_t			start_byte,
> > > > +	loff_t			end_byte)
> > > > +{
> > > > +	loff_t			punch_start_byte = start_byte;
> > > > +	int			error = 0;
> > > > +
> > > > +	/*
> > > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > > +	 * leave dirty pages with no space reservation in the cache.
> > > > +	 */
> > > > +	filemap_invalidate_lock(inode->i_mapping);
> > > > +	while (start_byte < end_byte) {
> > > > +		loff_t		data_end;
> > > > +
> > > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > > +				start_byte, end_byte, SEEK_DATA);
> > > 
> > > FWIW, the fact that mapping seek data is based on uptodate
> > > status means that seek behavior can change based on prior
> > > reads.
> > 
> > Yup. It should be obvious that any page cache scan based
> > algorithm will change based on changing page cache residency.
> > 
> > > For example, see how seek hole/data presents reads of
> > > unwritten ranges as data [1]. The same thing isn't observable
> > > for holes because iomap doesn't check the mapping in that
> > > case, but underlying iop state is the same and that is what
> > > this code is looking at.
> > 
> > Well, yes.  That's the fundamental, underlying issue that this
> > patchset is addressing for the write() operation: that the page
> > cache contents and the underlying filesystem extent map are not
> > guaranteed to be coherent and can be changed independently of
> > each other.
> > 
> > The whole problem with looking exclusively at filesystem level
> > extent state (and hence FIEMAP) is that the extent state doesn't
> > > tell us whether there is uncommitted data over the range of the
> > extent in the page cache.  The filesystem extent state and page
> > cache data *can't be coherent* in a writeback caching
> > environment. This is the fundamental difference between what the
> > filesystem extent map tells us (FIEMAP) and what querying the
> > page cache tells us (SEEK_DATA/SEEK_HOLE).
> > 
> > This is also the underlying problem with iomap_truncate_page() -
> > it fails to query the page cache for data over unwritten
> > extents, so fails to zero the post-EOF part of dirty folios over
> > unwritten extents and so it all goes wrong...
> > 
> > > The filtering being done here means we essentially only care
> > > about dirty pages backed by delalloc blocks. That means if you
> > > get here with a dirty page and the portion of the page
> > > affected by this failed write is uptodate, this won't punch an
> > > underlying delalloc block even though nothing else may have
> > > written to it in the meantime.
> > 
> > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc
> > iomap will not span ranges that have pre-existing *dirty* data
> > in the page cache. Those *must* already have (del)allocated
> > extents, hence the iomap for the newly allocated delalloc extent
> > will always end before pre-existing dirty data in the page cache
> > starts.
> > 
> > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> > precludes stepping into ranges that have pre-existing cached
> > dirty data.
> > 
> > We also can't get a racing write() to the same range right now
> > because this is all under IOLOCK_EXCL, hence we only ever see
> > dirty folios as a result of race with page faults. page faults
> > zero the entire folio they insert into the page cache and
> > iomap_folio_mkwrite_iter() asserts that the entire folio is
> > marked up to date. Hence if we find a dirty folio outside the
> > range the write() dirtied, we are guaranteed that the entire
> > dirty folio is up to date....
> > 
> > Yes, there can be pre-existing *clean* folios (and clean
> > partially up to date folios) in the page cache, but we won't
> > have dirty partially up to date pages in the middle of the range
> > we are scanning. Hence we only need to care about the edge cases
> > (folios that overlap start and ends). We skip the partially
> > written start block, and we always punch up to the end block if
> > it is different from the last block we punched up to. If the end
> > of the data spans into a dirty folio, we know that dirty range
> > is up to date because the seek scan only returns ranges that are
> > up to date. Hence we don't punch those partial blocks out....
> > 
> > Regardless, let's assume we have a racing write that has
> > partially updated and dirtied a folio (because we've moved to
> > XFS_IOLOCK_SHARED locking for writes). This case is already
> > handled by the mapping_seek_hole_data() based iteration.
> > 
> > That is, the mapping_seek_hole_data() loop provides us with
> > *discrete ranges of up to date data* that are independent of
> > folio size, up-to-date range granularity, dirty range tracking,
> > filesystem block size, etc.
> > 
> > Hence if the next discrete range we discover is in the same
> > dirty folio as the previous discrete range of up to date data,
> > we know we have a sub-folio sized hole in the data that is not
> > up to date.  Because there is no data over this range, we have
> > to punch out the underlying delalloc extent over that range. 
> > 
> > IOWs, the dirty state of the folio and/or the granularity of the
> > dirty range tracking is irrelevant here - we know there was no
> > data in the cache (clean or dirty) over this range because it is
> > discontiguous with the previous range of data returned.
> > 
> > IOWs, if we have this "up to date" map on a dirty folio like
> > this:
> > 
> > Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > 
> > Then the unrolled iteration and punching we do would look like
> > this:
> > 
> > First iteration of the range:
> > 
> > punch_start:
> > 		V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > 
> > SEEK_DATA:		V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_HOLE:			^
> > Data range:		+UUUUUUU+
> > Punch range:	+-------+
> > Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > 
> > Second iteration:
> > 
> > punch_start			V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_DATA:				V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_HOLE:					^
> > Data range:				+UUUUUUU+
> > Punch range:			+-------+
> > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> > 
> > Third iteration:
> > 
> > punch_start					V
> > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > SEEK_DATA: - moves into next folio in cache
> > ....
> > Punch range:					+-------+ ......
> > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> > 			(to end of scan range or start of next data)
> > 
> > As you can see, this scan does not care about folio size,
> > sub-folio range granularity or filesystem block sizes.  It also
> > matches exactly how writeback handles dirty, partially up to
> > date folios, so there's no stray delalloc blocks left around to
> > be tripped over after failed or short writes occur.
> > 
> > Indeed, if we move to sub-folio dirty range tracking, we can
> > simply add a mapping_seek_hole_data() variant that walks dirty
> > ranges in the page cache rather than up to date ranges. Then we
> > can remove the inner loop from this code that looks up folios to
> > determine dirty state. The above algorithm does not change - we
> > just walk from discrete range to discrete range punching the
> > gaps between them....
> > 
> > IOWs, the algorithm is largely future proof - the only thing
> > that needs to change if we change iomap to track sub-folio dirty
> > ranges is how we check the data range for being dirty. That
> > should be no surprise, really, the surprise should be that we
> > can make some simple mods to page cache seek to remove the need
> > for checking dirty state in this code altogether....
> > 
> > > That sort of state can be created by a prior read of the range
> > > on a sub-page block size fs, or perhaps a racing async
> > > readahead (via read fault of a lower offset..?), etc.
> > 
> > Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> > write vs write() case. This test, specifically, was the reason I
> > moved to using mapping_seek_hole_data() - g/346 found an endless
> > stream of bugs in the sub-multi-page-folio range iteration code
> > I kept trying to write....
> > 
> > > I suspect this is not a serious error because the page is
> > > dirty and writeback will thus convert the block. The only
> > > exception to that I can see is if the block is beyond EOF
> > > (consider a mapped read to a page that straddles EOF, followed
> > > by a post-eof write that fails), writeback won't actually map
> > > the block directly.
> > 
> > I don't think that can happen. iomap_write_failed() calls
> > truncate_pagecache_range() to remove any newly instantiated
> > cached data beyond the original EOF. Hence the delalloc punch
> > will remove everything beyond the original EOF that was
> > allocated for the failed write. Hence when we get to writeback
> > we're not going to find any up-to-date data beyond the EOF block
> > in the page cache, nor any stray delalloc blocks way beyond
> > EOF....
> > 
> 
> It can happen if the page straddles eof. I don't think much of the
> above relates to the behavior I'm describing. This doesn't require
> racing writes, theoretical shared write locking, the IOMAP_F_NEW
> flag or core iteration algorithm are not major factors, etc.
> 
> It's easier to just use examples. Consider the following sequence
> on a bsize=1k fs on a kernel with this patch series. Note that my
> xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that
> triggers the -EFAULT error check in iomap_write_iter() by passing
> a bad user buffer:
> 
> # set eof and make sure entire page is uptodate (even posteof)
> $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k"
> $file # do a delalloc and open/close cycle to set dirty release
> $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file #
> fail an appending write to a posteof, discontig, uptodate range on
> already dirty page $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c
> "fiemap -v" $file
> 
Firstly, can you tell me which patchset you reproduced this on? Just
a vanilla kernel, the version of the patchset in this thread or the
latest one I sent out?

Secondly, can you please send a patch with that failed write
functionality upstream for xfs_io - it is clearly useful.

> wrote 1/1 bytes at offset 0
> 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
> pwrite: Bad address
> /mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..1]:          22..23               2   0x0
>    1: [2..3]:          hole                 2
>    2: [4..5]:          0..1                 2   0x7
> 
> So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
> there that hasn't been written to.

Ok, so this is failing before even iomap_write_begin() is called,
so it's a failed write that has not touched page cache state, but
has allocated a delalloc block beyond EOF.

This is a case that writeback handles because writeback
is supposed to skip writeback for pages and blocks beyond EOF. i.e.
iomap_writepage_map() is passed the current EOF to terminate the
writeback mapping loop when the folio being written back spans EOF.
Hence iomap_writepage_map never walks past EOF and so does not ever
iterate filesystem blocks that contain up to date data in the page
cache beyond EOF. That's why writeback leaves that stray delalloc
block behind, even though it's covered by up to date data.

IOWs, it appears that the problem here is that the page cache
considers data in the folio beyond EOF to be up to date and contain
data, whilst most filesystems do not consider data beyond EOF to be
valid. This means there is a boundary condition that
mapping_seek_hole_data() doesn't handle correctly - it completely
ignores EOF and so SEEK_DATA for an offset past EOF will report data
beyond EOF if the file has been mmap()d and the last folio faulted
in.

This specific boundary condition behaviour is hidden by
iomap_seek_data() and shmem_file_llseek() by not allowing seeks to
start and/or end beyond EOF. Hence they never call
mapping_seek_hole_data() with a start byte beyond EOF and so will
not ever try to find data in the page cache beyond EOF.

That looks easy enough to fix - just consider any data range
returned that is beyond i_size_read() to require punching, and then
the semantics match writeback ignoring anything beyond EOF....
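
A minimal sketch of that fix, layered on the release loop in this patch
(illustrative only - not what the patch below currently does):

	start_byte = mapping_seek_hole_data(inode->i_mapping,
			start_byte, end_byte, SEEK_DATA);
	/*
	 * Sketch: data the seek finds at or beyond EOF is treated as if it
	 * were a hole, so the final punch removes the delalloc blocks
	 * backing it - matching writeback, which never maps blocks beyond
	 * i_size. A range straddling EOF would need data_end clamped to
	 * i_size_read(inode) in the same way.
	 */
	if (start_byte >= i_size_read(inode))
		break;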

> I call this a landmine because it's an
> unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
> reclamation won't pick it up because the inode was never tagged for speculative
> preallocation:

Of course - this wasn't allocated by speculative preallocation -
that doesn't kick in until the file is >64kB in length, so
xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's
post-eof preallocation to trim away on the inode.

> > > It may convert if contiguous with delalloc blocks inside EOF (and
> > > sufficiently sized physical extents exist), or even if not, should
> > > still otherwise be cleaned up by the various other means we
> > > already have to manage post-eof blocks.
> > >
> > > So IOW there's a tradeoff being made here for possible spurious
> > > allocation and I/O and a subtle dependency on writeback that
> > > should probably be documented somewhere.
> > 
> > As per above, I don't think there is any spurious/stale allocation
> > left behind by the punch code, nor is there any dependency on
> > writeback to ignore such issues.
> > 
> 
> The same sort of example as above (but within eof) demonstrates the
> writeback dependency. I.e., with this series alone the punch code leaves
> behind an in-eof uptodate delalloc block, but writeback comes along and
> maps it regardless because it also only has uptodate granularity within
> the dirty folio.

This "unflushable delalloc extent" case can't happen within EOF. The
page fault initialises the folio full of zeroes, so when we get the
failed write, the only change is that we now have a single delalloc
block within EOF that writeback will iterate over, allocate and
write all the up to date zeros to it.

Sure, this error path might cause a small amount of extra IO over
the dirty folio, but it does not cause any other sort of issue.
There are no stray delalloc blocks, there are no data corruption
issues, there are no stale data exposure issues, etc. Leaving a
delalloc block under up-to-date page cache data should not cause
a user visible issue.

> When writeback changes to only map dirty sub-ranges, clearly that no
> longer happens and the landmine turns into a stale delalloc leak. This
> is also easy to reproduce with the last dirty range RFC patches pulled
> on top of this series.

No, not even then. If we have sub-folio dirty ranges, we know this
failed write range is *still clean* and so we'll punch it out,
regardless of whether it is inside or outside EOF.

> It may be possible to address that by doing more aggressive punching
> past the i_size boundary here, but that comes with other tradeoffs.

Tradeoffs such as .... ?

> For example, if the target folio of a failed write
> was already dirty and the target (sub-folio) range was previously an
> Uptodate hole before the failing write may have performed delayed
> allocation, we have no choice but to skip the punch for that sub-range
> of the folio because we can't tell whether sub-folio Uptodate ranges
> were actually written to or not. This means that delalloc block
> allocations may persist for writes that never succeeded.

I can add something like that, though I'd need to make sure it is
clear that this is not a data corruption vector, just a side effect
of a highly unlikely set of corner case conditions that we only care
about handling without corruption, not about efficiency or performance.

Cheers,

Dave.
Brian Foster Nov. 23, 2022, 5:25 p.m. UTC | #9
On Tue, Nov 22, 2022 at 10:13:04AM +1100, Dave Chinner wrote:
> On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote:
> > On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote:
> > > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > ...
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> > > > >  1 file changed, 141 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index 7bb55dbc19d3..2d48fcc7bd6f 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> > > > >  				end_fsb - start_fsb);
> > > > >  }
> > > > >  
> > > > ...
> > > > > +/*
> > > > > + * Punch out all the delalloc blocks in the range given except for those that
> > > > > + * have dirty data still pending in the page cache - those are going to be
> > > > > + * written and so must still retain the delalloc backing for writeback.
> > > > > + *
> > > > > + * As we are scanning the page cache for data, we don't need to reimplement the
> > > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> > > > > + * start and end of data ranges correctly even for sub-folio block sizes. This
> > > > > + * byte range based iteration is especially convenient because it means we don't
> > > > > + * have to care about variable size folios, nor where the start or end of the
> > > > > + * data range lies within a folio, if they lie within the same folio or even if
> > > > > + * there are multiple discontiguous data ranges within the folio.
> > > > > + */
> > > > > +static int
> > > > > +xfs_buffered_write_delalloc_release(
> > > > > +	struct inode		*inode,
> > > > > +	loff_t			start_byte,
> > > > > +	loff_t			end_byte)
> > > > > +{
> > > > > +	loff_t			punch_start_byte = start_byte;
> > > > > +	int			error = 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * Lock the mapping to avoid races with page faults re-instantiating
> > > > > +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > > > > +	 * cache and perform delalloc extent removal. Failing to do this can
> > > > > +	 * leave dirty pages with no space reservation in the cache.
> > > > > +	 */
> > > > > +	filemap_invalidate_lock(inode->i_mapping);
> > > > > +	while (start_byte < end_byte) {
> > > > > +		loff_t		data_end;
> > > > > +
> > > > > +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> > > > > +				start_byte, end_byte, SEEK_DATA);
> > > > 
> > > > FWIW, the fact that mapping seek data is based on uptodate
> > > > status means that seek behavior can change based on prior
> > > > reads.
> > > 
> > > Yup. It should be obvious that any page cache scan based
> > > algorithm will change based on changing page cache residency.
> > > 
> > > > For example, see how seek hole/data presents reads of
> > > > unwritten ranges as data [1]. The same thing isn't observable
> > > > for holes because iomap doesn't check the mapping in that
> > > > case, but underlying iop state is the same and that is what
> > > > this code is looking at.
> > > 
> > > Well, yes.  That's the fundamental, underlying issue that this
> > > patchset is addressing for the write() operation: that the page
> > > cache contents and the underlying filesystem extent map are not
> > > guaranteed to be coherent and can be changed independently of
> > > each other.
> > > 
> > > The whole problem with looking exclusively at filesystem level
> > > extent state (and hence FIEMAP) is that the extent state doesn't
> > > tell us whether there is uncommitted data over the range of the
> > > extent in the page cache.  The filesystem extent state and page
> > > cache data *can't be coherent* in a writeback caching
> > > environment. This is the fundamental difference between what the
> > > filesystem extent map tells us (FIEMAP) and what querying the
> > > page cache tells us (SEEK_DATA/SEEK_HOLE).
> > > 
> > > This is also the underlying problem with iomap_truncate_page() -
> > > it fails to query the page cache for data over unwritten
> > > extents, so fails to zero the post-EOF part of dirty folios over
> > > unwritten extents and so it all goes wrong...
> > > 
> > > > The filtering being done here means we essentially only care
> > > > about dirty pages backed by delalloc blocks. That means if you
> > > > get here with a dirty page and the portion of the page
> > > > affected by this failed write is uptodate, this won't punch an
> > > > underlying delalloc block even though nothing else may have
> > > > written to it in the meantime.
> > > 
> > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc
> > > iomap will not span ranges that have pre-existing *dirty* data
> > > in the page cache. Those *must* already have (del)allocated
> > > extents, hence the iomap for the newly allocated delalloc extent
> > > will always end before pre-existing dirty data in the page cache
> > > starts.
> > > 
> > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map
> > > precludes stepping into ranges that have pre-existing cached
> > > dirty data.
> > > 
> > > We also can't get a racing write() to the same range right now
> > > because this is all under IOLOCK_EXCL, hence we only ever see
> > > dirty folios as a result of race with page faults. page faults
> > > zero the entire folio they insert into the page cache and
> > > iomap_folio_mkwrite_iter() asserts that the entire folio is
> > > marked up to date. Hence if we find a dirty folio outside the
> > > range the write() dirtied, we are guaranteed that the entire
> > > dirty folio is up to date....
> > > 
> > > Yes, there can be pre-existing *clean* folios (and clean
> > > partially up to date folios) in the page cache, but we won't
> > > have dirty partially up to date pages in the middle of the range
> > > we are scanning. Hence we only need to care about the edge cases
> > > (folios that overlap start and ends). We skip the partially
> > > written start block, and we always punch up to the end block if
> > > it is different from the last block we punched up to. If the end
> > > of the data spans into a dirty folio, we know that dirty range
> > > is up to date because the seek scan only returns ranges that are
> > > up to date. Hence we don't punch those partial blocks out....
> > > 
> > > Regardless, let's assume we have a racing write that has
> > > partially updated and dirtied a folio (because we've moved to
> > > XFS_IOLOCK_SHARED locking for writes). This case is already
> > > handled by the mapping_seek_hole_data() based iteration.
> > > 
> > > That is, the mapping_seek_hole_data() loop provides us with
> > > *discrete ranges of up to date data* that are independent of
> > > folio size, up-to-date range granularity, dirty range tracking,
> > > filesystem block size, etc.
> > > 
> > > Hence if the next discrete range we discover is in the same
> > > dirty folio as the previous discrete range of up to date data,
> > > we know we have a sub-folio sized hole in the data that is not
> > > up to date.  Because there is no data over this range, we have
> > > to punch out the underlying delalloc extent over that range. 
> > > 
> > > IOWs, the dirty state of the folio and/or the granularity of the
> > > dirty range tracking is irrelevant here - we know there was no
> > > data in the cache (clean or dirty) over this range because it is
> > > discontiguous with the previous range of data returned.
> > > 
> > > IOWs, if we have this "up to date" map on a dirty folio like
> > > this:
> > > 
> > > Data		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > Extent map	+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > > 
> > > Then the unrolled iteration and punching we do would look like
> > > this:
> > > 
> > > First iteration of the range:
> > > 
> > > punch_start:
> > > 		V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > 
> > > SEEK_DATA:		V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_HOLE:			^
> > > Data range:		+UUUUUUU+
> > > Punch range:	+-------+
> > > Extent map:	+-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+
> > > 
> > > Second iteration:
> > > 
> > > punch_start			V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_DATA:				V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_HOLE:					^
> > > Data range:				+UUUUUUU+
> > > Punch range:			+-------+
> > > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+
> > > 
> > > Third iteration:
> > > 
> > > punch_start					V
> > > 		+-------+UUUUUUU+-------+UUUUUUU+-------+
> > > SEEK_DATA: - moves into next folio in cache
> > > ....
> > > Punch range:					+-------+ ......
> > > Extent map:	+-------+DDDDDDD+-------+DDDDDDD+-------+ ......
> > > 			(to end of scan range or start of next data)
> > > 
> > > As you can see, this scan does not care about folio size,
> > > sub-folio range granularity or filesystem block sizes.  It also
> > > matches exactly how writeback handles dirty, partially up to
> > > date folios, so there's no stray delalloc blocks left around to
> > > be tripped over after failed or short writes occur.
> > > 
> > > Indeed, if we move to sub-folio dirty range tracking, we can
> > > simply add a mapping_seek_hole_data() variant that walks dirty
> > > ranges in the page cache rather than up to date ranges. Then we
> > > can remove the inner loop from this code that looks up folios to
> > > determine dirty state. The above algorithm does not change - we
> > > just walk from discrete range to discrete range punching the
> > > gaps between them....
> > > 
> > > IOWs, the algorithm is largely future proof - the only thing
> > > that needs to change if we change iomap to track sub-folio dirty
> > > ranges is how we check the data range for being dirty. That
> > > should be no surprise, really, the surprise should be that we
> > > can make some simple mods to page cache seek to remove the need
> > > for checking dirty state in this code altogether....
> > > 
> > > > That sort of state can be created by a prior read of the range
> > > > on a sub-page block size fs, or perhaps a racing async
> > > > readahead (via read fault of a lower offset..?), etc.
> > > 
> > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap
> > > write vs write() case. This test, specifically, was the reason I
> > > moved to using mapping_seek_hole_data() - g/346 found an endless
> > > stream of bugs in the sub-multi-page-folio range iteration code
> > > I kept trying to write....
> > > 
> > > > I suspect this is not a serious error because the page is
> > > > dirty and writeback will thus convert the block. The only
> > > > exception to that I can see is if the block is beyond EOF
> > > > (consider a mapped read to a page that straddles EOF, followed
> > > > by a post-eof write that fails), writeback won't actually map
> > > > the block directly.
> > > 
> > > I don't think that can happen. iomap_write_failed() calls
> > > truncate_pagecache_range() to remove any newly instantiated
> > > cached data beyond the original EOF. Hence the delalloc punch
> > > will remove everything beyond the original EOF that was
> > > allocated for the failed write. Hence when we get to writeback
> > > we're not going to find any up-to-date data beyond the EOF block
> > > in the page cache, nor any stray delalloc blocks way beyond
> > > EOF....
> > > 
> > 
> > It can happen if the page straddles eof. I don't think much of the
> > above relates to the behavior I'm describing. This doesn't require
> > racing writes, theoretical shared write locking, the IOMAP_F_NEW
> > flag or core iteration algorithm are not major factors, etc.
> > 
> > It's easier to just use examples. Consider the following sequence
> > on a bsize=1k fs on a kernel with this patch series. Note that my
> > xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that
> > triggers the -EFAULT error check in iomap_write_iter() by passing
> > a bad user buffer:
> > 
> > # set eof and make sure entire page is uptodate (even posteof)
> > $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k"
> > $file # do a delalloc and open/close cycle to set dirty release
> > $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file #
> > fail an appending write to a posteof, discontig, uptodate range on
> > already dirty page $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c
> > "fiemap -v" $file
> > 
> Firstly, can you tell me which patchset you reproduced this on? Just
> a vanilla kernel, the version of the patchset in this thread or the
> latest one I sent out?
> 

This series.

> Secondly, can you please send a patch with that failed write
> functionality upstream for xfs_io - it is clearly useful.
> 

Sure. It trivially passes a NULL buffer, but I can post an RFC for now
at least.
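
For reference, the same failure can be triggered without a patched
xfs_io at all - a write from an unreadable user buffer fails with
-EFAULT in iomap_write_iter() after ->iomap_begin() has reserved the
delalloc block. A minimal sketch, assuming the file state was set up as
in the example above:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_WRONLY);
		if (fd < 0)
			return 1;
		/* NULL buffer: copy-in fails, write returns -EFAULT */
		if (pwrite(fd, NULL, 1, 2048) < 0)
			perror("pwrite");	/* expect "Bad address" */
		close(fd);
		return 0;
	}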

> > wrote 1/1 bytes at offset 0
> > 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec)
> > pwrite: Bad address
> > /mnt/file:
> >  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> >    0: [0..1]:          22..23               2   0x0
> >    1: [2..3]:          hole                 2
> >    2: [4..5]:          0..1                 2   0x7
> > 
> > So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block
> > there that hasn't been written to.
> 
> Ok, so this is failing before even iomap_write_begin() is called,
> so it's a failed write that has not touched page cache state, but
> has allocated a delalloc block beyond EOF.
> 

Yes.

> This is a case that writeback handles because writeback
> is supposed to skip writeback for pages and blocks beyond EOF. i.e.
> iomap_writepage_map() is passed the current EOF to terminate the
> writeback mapping loop when the folio being written back spans EOF.
> Hence iomap_writepage_map never walks past EOF and so does not ever
> iterate filesystem blocks that contain up to date data in the page
> cache beyond EOF. That's why writeback leaves that stray delalloc
> block behind, even though it's covered by up to date data.
> 
> IOWs, it appears that the problem here is that the page cache
> considers data in the folio beyond EOF to be up to date and contain
> data, whilst most filesystems do not consider data beyond EOF to be
> valid. This means there is a boundary condition that
> mapping_seek_hole_data() doesn't handle correctly - it completely
> ignores EOF and so SEEK_DATA for an offset past EOF will report data
> beyond EOF if the file has been mmap()d and the last folio faulted
> in.
> 
> This specific boundary condition behaviour is hidden by
> iomap_seek_data() and shmem_file_llseek() by not allowing seeks to
> start and/or end beyond EOF. Hence they never call
> mapping_seek_hole_data() with a start byte beyond EOF and so will
> not ever try to find data in the page cache beyond EOF.
> 
> That looks easy enough to fix - just consider any data range
> returned that is beyond i_size_read() to require punching, and then
> the semantics match writeback ignoring anything beyond EOF....
> 
> > I call this a landmine because it's an
> > unexpected/untracked instance of post-eof delalloc. I.e., post-eof block
> > reclamation won't pick it up because the inode was never tagged for speculative
> > preallocation:
> 
> Of course - this wasn't allocated by speculative preallocation -
> that doesn't kick in until the file is >64kB in length, so
> xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's
> post-eof preallocation to trim away on the inode.
> 
> > > > It may convert if contiguous with delalloc blocks inside EOF (and
> > > > sufficiently sized physical extents exist), or even if not, should
> > > > still otherwise be cleaned up by the various other means we
> > > > already have to manage post-eof blocks.
> > > >
> > > > So IOW there's a tradeoff being made here for possible spurious
> > > > allocation and I/O and a subtle dependency on writeback that
> > > > should probably be documented somewhere.
> > > 
> > > As per above, I don't think there is any spurious/stale allocation
> > > left behind by the punch code, nor is there any dependency on
> > > writeback to ignore such issues.
> > > 
> > 
> > The same sort of example as above (but within eof) demonstrates the
> > writeback dependency. I.e., with this series alone the punch code leaves
> > behind an in-eof uptodate delalloc block, but writeback comes along and
> > maps it regardless because it also only has uptodate granularity within
> > the dirty folio.
> 
> This "unflushable delalloc extent" case can't happen within EOF. The
> page fault initialises the folio full of zeroes, so when we get the
> failed write, the only change is that we now have a single delalloc
> block within EOF that writeback will iterate over, allocate and
> write all the up to date zeros to it.
> 
> Sure, this error path might cause a small amount of extra IO over
> the dirty folio, but it does not cause any other sort of issue.
> There are no stray delalloc blocks, there are no data corruption
> issues, there are no stale data exposure issues, etc. Leaving a
> delalloc block under up-to-date page cache data should not cause
> a user visible issue.
> 

This is why I suggested a comment might be sufficient.

> > When writeback changes to only map dirty sub-ranges, clearly that no
> > longer happens and the landmine turns into a stale delalloc leak. This
> > is also easy to reproduce with the last dirty range RFC patches pulled
> > on top of this series.
> 
> No, not even then. If we have sub-folio dirty ranges, we know this
> failed write range is *still clean* and so we'll punch it out,
> regardless of whether it is inside or outside EOF.
> 

At this point I suspect we're talking past each other, and based on
previous discussion I think we're on the same page wrt what actually
needs to happen and why.

> > It may be possible to address that by doing more aggressive punching
> > past the i_size boundary here, but that comes with other tradeoffs.
> 
> Tradeoffs such as .... ?
> 

When I wrote that I was thinking about the potential for odd
interactions with speculative preallocation. For example, consider the
case where the write that allocates speculative prealloc happens to
fail, how xfs_can_free_eofblocks() is implemented, etc.

I'm not explicitly saying it's some catastrophic problem, corruption
vector, etc., or anything like that. I'm just saying it's replacing one
potential wonky interaction with another, and thus introduces different
considerations for things that probably don't have great test coverage
to begin with.

I suppose it might make some sense for the iomap mechanism to invoke the
callback for post-eof ranges, but then have the fs handle that
particular case specially if appropriate. So for example if an XFS inode
sees a post-eof punch and is tagged with post-eof prealloc, it might
make more sense to either punch the whole of it out or just leave it
around to be cleaned up later vs. just punch out a subset of it. But I
don't have a strong opinion either way and still no objection to simply
documenting the quirk for the time being.
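
Something like the following in the XFS punch callback is the sort of
special casing I mean - purely hypothetical, and it assumes the
XFS_IEOFBLOCKS iflag is the right indicator to test:

	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * Hypothetical: if the whole punch range sits beyond EOF and the
	 * inode is already tagged for post-EOF block trimming, leave the
	 * blocks to the normal eofblocks reclaim paths instead of punching
	 * out only the subset belonging to this failed write.
	 */
	if (start_byte >= XFS_ISIZE(ip) &&
	    xfs_iflags_test(ip, XFS_IEOFBLOCKS))
		return 0;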

> > For example, if the target folio of a failed write
> > was already dirty and the target (sub-folio) range was previously an
> > Uptodate hole over which the failing write may have performed delayed
> > allocation, we have no choice but to skip the punch for that sub-range
> > of the folio because we can't tell whether sub-folio Uptodate ranges
> > were actually written to or not. This means that delalloc block
> > allocations may persist for writes that never succeeded.
> 
> I can add something like that, though I'd need to make sure it is
> clear that this is not a data corruption vector, just a side effect
> of a highly unlikely set of corner case conditions that we only care
> about handling without corruption, not about efficiency or performance.
> 

I couldn't think of any such issues for upstream XFS, at least. One
reason is that delalloc blocks convert to unwritten extents by default,
so anything converted that wasn't directly targeted with writes should
remain in that state. Another fallback is that iomap explicitly zeroes
cache pages for post-eof blocks on reads, regardless of extent type. Of
course those were just initial observations and don't necessarily mean
problems don't exist. :)

FWIW, the more subtle reason I commented on these quirks, but isn't so
relevant upstream, is that if this work is going to get backported to
older/stable/distro kernels it might be possible for this to land in
kernels without one or both of those previously mentioned behaviors (at
which point there very well might be a vector for stale data exposure or
a subtle feature dependency, depending on context, what the final patch
series looks like, etc.).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7bb55dbc19d3..2d48fcc7bd6f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1134,6 +1134,146 @@  xfs_buffered_write_delalloc_punch(
 				end_fsb - start_fsb);
 }
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceding range and update the offset from which
+ * the next punch will start from.
+ *
+ * We can punch out clean pages because they either contain data that has been
+ * written back - in which case the delalloc punch over that range is a no-op -
+ * or they have been read faulted in, in which case they contain zeroes and we
+ * can remove the delalloc backing range and any new writes to those pages will
+ * do the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep the delalloc extents over
+ * the dirty ranges of the page cache.
+ */
+static int
+xfs_buffered_write_delalloc_scan(
+	struct inode		*inode,
+	loff_t			*punch_start_byte,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			offset = start_byte;
+
+	while (offset < end_byte) {
+		struct folio	*folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
+		if (!folio) {
+			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (offset > *punch_start_byte) {
+				int	error;
+
+				error = xfs_buffered_write_delalloc_punch(inode,
+						*punch_start_byte, offset);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		offset = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we don't
+ * have to care about variable size folios, nor where the start or end of the
+ * data range lies within a folio, if they lie within the same folio or even if
+ * there are multiple discontiguous data ranges within the folio.
+ */
+static int
+xfs_buffered_write_delalloc_release(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			punch_start_byte = start_byte;
+	int			error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < end_byte) {
+		loff_t		data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		ASSERT(start_byte >= punch_start_byte);
+		ASSERT(start_byte < end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		ASSERT(data_end > start_byte);
+		ASSERT(data_end <= end_byte);
+
+		error = xfs_buffered_write_delalloc_scan(inode,
+				&punch_start_byte, start_byte, data_end);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = xfs_buffered_write_delalloc_punch(inode,
+				punch_start_byte, end_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1179,16 +1319,7 @@  xfs_buffered_write_iomap_end(
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
+	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
 			__func__, XFS_I(inode)->i_ino);
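
For context when reading the hunks above: the xfs_buffered_write_delalloc_punch()
helper they call is added earlier in this series. Reconstructed roughly from the
context lines at the top of the diff (a sketch, not the verbatim patch):

	static int
	xfs_buffered_write_delalloc_punch(
		struct inode		*inode,
		loff_t			start_byte,
		loff_t			end_byte)
	{
		struct xfs_mount	*mp = XFS_M(inode->i_sb);
		xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
		xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);

		/* only delalloc extents in the range are punched out */
		return xfs_bmap_punch_delalloc_range(XFS_I(inode), start_fsb,
					end_fsb - start_fsb);
	}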