
[4/7] xfs: buffered write failure should not truncate the page cache

Message ID 20221101003412.3842572-5-david@fromorbit.com (mailing list archive)
State New, archived
Series [1/7] xfs: write page faults in iomap are not buffered writes

Commit Message

Dave Chinner Nov. 1, 2022, 12:34 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

xfs_buffered_write_iomap_end() currently invalidates the page cache
over the unused range of the delalloc extent it allocated. While the
write allocated the delalloc extent, it does not own it exclusively
as the write does not hold any locks that prevent either writeback
or mmap page faults from changing the state of either the page cache
or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.

generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.

To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
that we have to handle partially up-to-date folios correctly, and this
can happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.

Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 141 insertions(+), 10 deletions(-)

Comments

kernel test robot Nov. 1, 2022, 11:57 a.m. UTC | #1
Hi Dave,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on linus/master v6.1-rc3 next-20221101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Chinner/xfs-write-page-faults-in-iomap-are-not-buffered-writes/20221101-104935
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20221101003412.3842572-5-david%40fromorbit.com
patch subject: [PATCH 4/7] xfs: buffered write failure should not truncate the page cache
config: m68k-buildonly-randconfig-r004-20221030
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/1008d06381363506c0e2fbe0ed34f57be234ab2c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Dave-Chinner/xfs-write-page-faults-in-iomap-are-not-buffered-writes/20221101-104935
        git checkout 1008d06381363506c0e2fbe0ed34f57be234ab2c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "mapping_seek_hole_data" [fs/xfs/xfs.ko] undefined!
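
This link error indicates that xfs.ko references mapping_seek_hole_data(),
which is built into the kernel but not exported to modules. A minimal sketch
of one way to make a modular XFS build link (exporting the helper; moving the
caller into built-in code would be the alternative) is:

	/* mm/filemap.c - sketch only: export the existing helper so modular
	 * callers such as xfs.ko can link against it. */
	EXPORT_SYMBOL_GPL(mapping_seek_hole_data);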
Christoph Hellwig Nov. 2, 2022, 7:24 a.m. UTC | #2
Just a little dumb question while I'm wrapping my head around the change
here - why do we even punch the pagecache to start with?  As long as the
regions that we failed to write to aren't marked uptodate (at the page
or block level for sub-block writes), who cares if they remain in the
page cache for now?
Darrick J. Wong Nov. 2, 2022, 4:41 p.m. UTC | #3
On Tue, Nov 01, 2022 at 11:34:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buffered_write_iomap_end() currently invalidates the page cache
> over the unused range of the delalloc extent it allocated. While the
> write allocated the delalloc extent, it does not own it exclusively
> as the write does not hold any locks that prevent either writeback
> or mmap page faults from changing the state of either the page cache
> or the extent state backing this range.
> 
> Whilst xfs_bmap_punch_delalloc_range() already handles races in
> extent conversion - it will only punch out delalloc extents and it
> ignores any other type of extent - the page cache truncate does not
> discriminate between data written by this write or some other task.
> As a result, truncating the page cache can result in data corruption
> if the write races with mmap modifications to the file over the same
> range.
> 
> generic/346 exercises this workload, and if we randomly fail writes
> (as will happen when iomap gets stale iomap detection later in the
> patchset), it will randomly corrupt the file data because it removes
> data written by mmap() in the same page as the write() that failed.
> 
> Hence we do not want to punch out the page cache over the range of
> the extent we failed to write to - what we actually need to do is
> detect the ranges that have dirty data in cache over them and *not
> punch them out*.

Same dumb question as hch -- why do we need to punch out the nondirty
pagecache after a failed write?  If the folios are uptodate then we're
evicting cache unnecessarily, and if they're !uptodate can't we let
reclaim do the dirty work for us?

I don't know if there are hysterical raisins for this or if the goal is
to undo memory consumption after a write failure?  If we're stale-ing
the write because the iomapping changed, why not leave the folio where
it is, refresh the iomapping, and come back to (possibly?) the same
folio?

--D

> To do this, we have to walk the page cache over the range of the
> delalloc extent we want to remove. This is made complex by the fact
> that we have to handle partially up-to-date folios correctly, and this
> can happen even when the FSB size == PAGE_SIZE because we now support
> multi-page folios in the page cache.
> 
> Because we are only interested in discovering the edges of data
> ranges in the page cache (i.e. hole-data boundaries) we can make use
> of mapping_seek_hole_data() to find those transitions in the page
> cache. As we hold the invalidate_lock, we know that the boundaries
> are not going to change while we walk the range. This interface is
> also byte-based and is sub-page block aware, so we can find the data
> ranges in the cache based on byte offsets rather than page, folio or
> fs block sized chunks. This greatly simplifies the logic of finding
> dirty cached ranges in the page cache.
> 
> Once we've identified a range that contains cached data, we can then
> iterate the range folio by folio. This allows us to determine if the
> data is dirty and hence perform the correct delalloc extent punching
> operations. The seek interface we use to iterate data ranges will
> give us sub-folio start/end granularity, so we may end up looking up
> the same folio multiple times as the seek interface iterates across
> each discontiguous data region in the folio.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
>  				end_fsb - start_fsb);
>  }
>  
> +/*
> + * Scan the data range passed to us for dirty page cache folios. If we find a
> + * dirty folio, punch out the preceding range and update the offset from which
> + * the next punch will start.
> + *
> + * We can punch out clean pages because they either contain data that has been
> + * written back - in which case the delalloc punch over that range is a no-op -
> + * or they were instantiated by read faults, in which case they contain zeroes and we can
> + * remove the delalloc backing range and any new writes to those pages will do
> + * the normal hole filling operation...
> + *
> + * This makes the logic simple: we only need to keep delalloc extents over the
> + * dirty ranges of the page cache.
> + */
> +static int
> +xfs_buffered_write_delalloc_scan(
> +	struct inode		*inode,
> +	loff_t			*punch_start_byte,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			offset = start_byte;
> +
> +	while (offset < end_byte) {
> +		struct folio	*folio;
> +
> +		/* grab locked page */
> +		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> +		if (!folio) {
> +			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> +			continue;
> +		}
> +
> +		/* if dirty, punch up to offset */
> +		if (folio_test_dirty(folio)) {
> +			if (offset > *punch_start_byte) {
> +				int	error;
> +
> +				error = xfs_buffered_write_delalloc_punch(inode,
> +						*punch_start_byte, offset);
> +				if (error) {
> +					folio_unlock(folio);
> +					folio_put(folio);
> +					return error;
> +				}
> +			}
> +
> +			/*
> +			 * Make sure the next punch start is correctly bound to
> +			 * the end of this data range, not the end of the folio.
> +			 */
> +			*punch_start_byte = min_t(loff_t, end_byte,
> +					folio_next_index(folio) << PAGE_SHIFT);
> +		}
> +
> +		/* move offset to start of next folio in range */
> +		offset = folio_next_index(folio) << PAGE_SHIFT;
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, if they lie within the same folio or even if
> + * there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> +	struct inode		*inode,
> +	loff_t			start_byte,
> +	loff_t			end_byte)
> +{
> +	loff_t			punch_start_byte = start_byte;
> +	int			error = 0;
> +
> +	/*
> +	 * Lock the mapping to avoid races with page faults re-instantiating
> +	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> +	 * cache and perform delalloc extent removal. Failing to do this can
> +	 * leave dirty pages with no space reservation in the cache.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	while (start_byte < end_byte) {
> +		loff_t		data_end;
> +
> +		start_byte = mapping_seek_hole_data(inode->i_mapping,
> +				start_byte, end_byte, SEEK_DATA);
> +		/*
> +		 * If there is no more data to scan, all that is left is to
> +		 * punch out the remaining range.
> +		 */
> +		if (start_byte == -ENXIO || start_byte == end_byte)
> +			break;
> +		if (start_byte < 0) {
> +			error = start_byte;
> +			goto out_unlock;
> +		}
> +		ASSERT(start_byte >= punch_start_byte);
> +		ASSERT(start_byte < end_byte);
> +
> +		/*
> +		 * We find the end of this contiguous cached data range by
> +		 * seeking from start_byte to the beginning of the next hole.
> +		 */
> +		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> +				end_byte, SEEK_HOLE);
> +		if (data_end < 0) {
> +			error = data_end;
> +			goto out_unlock;
> +		}
> +		ASSERT(data_end > start_byte);
> +		ASSERT(data_end <= end_byte);
> +
> +		error = xfs_buffered_write_delalloc_scan(inode,
> +				&punch_start_byte, start_byte, data_end);
> +		if (error)
> +			goto out_unlock;
> +
> +		/* The next data search starts at the end of this one. */
> +		start_byte = data_end;
> +	}
> +
> +	if (punch_start_byte < end_byte)
> +		error = xfs_buffered_write_delalloc_punch(inode,
> +				punch_start_byte, end_byte);
> +out_unlock:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return error;
> +}
> +
>  static int
>  xfs_buffered_write_iomap_end(
>  	struct inode		*inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
>  	if (start_byte >= end_byte)
>  		return 0;
>  
> -	/*
> -	 * Lock the mapping to avoid races with page faults re-instantiating
> -	 * folios and dirtying them via ->page_mkwrite between the page cache
> -	 * truncation and the delalloc extent removal. Failing to do this can
> -	 * leave dirty pages with no space reservation in the cache.
> -	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> -	truncate_pagecache_range(inode, start_byte, end_byte - 1);
> -	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
>  	if (error && !xfs_is_shutdown(mp)) {
>  		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
>  			__func__, XFS_I(inode)->i_ino);
> -- 
> 2.37.2
>
Dave Chinner Nov. 2, 2022, 8:57 p.m. UTC | #4
On Wed, Nov 02, 2022 at 12:24:42AM -0700, Christoph Hellwig wrote:
> Just a little dumb question while I'm wrapping my head around the change
> here - why do we even punch the pagecache to start with?

We don't. It's just wrong because it assumes that the write() owns
the range of the page cache and that the range contains no non-zero
data: the write allocated the delalloc range, therefore the write was
into a hole and, by definition, the range contains nothing but
zeroes.

Hence if the write is short, we punched out the page cache under the
assumption that doing so would only remove cached zeroes. If those
zeroes are dirty for some reason (zeroing prior to the iomap
hole/unwritten detection?) we don't need to write them, and they have
to be removed from the page cache before we punch out the underlying
delalloc extent.

Unfortunately, this assumption has always been completely invalid
because both writeback and mmap page fault access to the page cache
can race with write()...

> As long as the
> regions that we failed to write to aren't marked uptodate (at the page
> or block level for sub-block writes), who cares if they remain in the
> page cache for now?

Exactly - that's the premise this patchset is based on - we only
need to care about dirty pages across the range of the delalloc
extent, and nothing else matters as it will be properly instantiated
with new delalloc backing if it gets dirtied in future...

Cheers,

Dave.
Dave Chinner Nov. 2, 2022, 9:04 p.m. UTC | #5
On Wed, Nov 02, 2022 at 09:41:30AM -0700, Darrick J. Wong wrote:
> On Tue, Nov 01, 2022 at 11:34:09AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_buffered_write_iomap_end() currently invalidates the page cache
> > over the unused range of the delalloc extent it allocated. While the
> > write allocated the delalloc extent, it does not own it exclusively
> > as the write does not hold any locks that prevent either writeback
> > or mmap page faults from changing the state of either the page cache
> > or the extent state backing this range.
> > 
> > Whilst xfs_bmap_punch_delalloc_range() already handles races in
> > extent conversion - it will only punch out delalloc extents and it
> > ignores any other type of extent - the page cache truncate does not
> > discriminate between data written by this write or some other task.
> > As a result, truncating the page cache can result in data corruption
> > if the write races with mmap modifications to the file over the same
> > range.
> > 
> > generic/346 exercises this workload, and if we randomly fail writes
> > (as will happen when iomap gets stale iomap detection later in the
> > patchset), it will randomly corrupt the file data because it removes
> > data written by mmap() in the same page as the write() that failed.
> > 
> > Hence we do not want to punch out the page cache over the range of
> > the extent we failed to write to - what we actually need to do is
> > detect the ranges that have dirty data in cache over them and *not
> > punch them out*.
> 
> Same dumb question as hch -- why do we need to punch out the nondirty
> pagecache after a failed write?  If the folios are uptodate then we're
> evicting cache unnecessarily, and if they're !uptodate can't we let
> reclaim do the dirty work for us?

Sorry, we don't punch out the page cache  - this was badly worded. I
meant:

"[...] - what we actually need to do is
detect the ranges that have dirty data in cache over the delalloc
extent and retain those regions of the delalloc extent."

i.e. we never punch the page cache anymore, and we only selectively
punch the delalloc extent that backs the clean regions of the write
range...

> I don't know if there are hysterical raisins for this or if the goal is
> to undo memory consumption after a write failure?  If we're stale-ing
> the write because the iomapping changed, why not leave the folio where
> it is, refresh the iomapping, and come back to (possibly?) the same
> folio?

I can't say for certain - I haven't gone and looked at the history.
I suspect it goes back to the days when write() could write zeroes
into the page cache for eof zeroing or zeroing for file extension
before the write() started writing data. Maybe that was part of what
it was trying to undo?

-Dave.
Darrick J. Wong Nov. 2, 2022, 10:26 p.m. UTC | #6
On Thu, Nov 03, 2022 at 08:04:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 02, 2022 at 09:41:30AM -0700, Darrick J. Wong wrote:
> > On Tue, Nov 01, 2022 at 11:34:09AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > xfs_buffered_write_iomap_end() currently invalidates the page cache
> > > over the unused range of the delalloc extent it allocated. While the
> > > write allocated the delalloc extent, it does not own it exclusively
> > > as the write does not hold any locks that prevent either writeback
> > > or mmap page faults from changing the state of either the page cache
> > > or the extent state backing this range.
> > > 
> > > Whilst xfs_bmap_punch_delalloc_range() already handles races in
> > > extent conversion - it will only punch out delalloc extents and it
> > > ignores any other type of extent - the page cache truncate does not
> > > discriminate between data written by this write or some other task.
> > > As a result, truncating the page cache can result in data corruption
> > > if the write races with mmap modifications to the file over the same
> > > range.
> > > 
> > > generic/346 exercises this workload, and if we randomly fail writes
> > > (as will happen when iomap gets stale iomap detection later in the
> > > patchset), it will randomly corrupt the file data because it removes
> > > data written by mmap() in the same page as the write() that failed.
> > > 
> > > Hence we do not want to punch out the page cache over the range of
> > > the extent we failed to write to - what we actually need to do is
> > > detect the ranges that have dirty data in cache over them and *not
> > > punch them out*.
> > 
> > Same dumb question as hch -- why do we need to punch out the nondirty
> > pagecache after a failed write?  If the folios are uptodate then we're
> > evicting cache unnecessarily, and if they're !uptodate can't we let
> > reclaim do the dirty work for us?
> 
> Sorry, we don't punch out the page cache  - this was badly worded. I
> meant:
> 
> "[...] - what we actually need to do is
> detect the ranges that have dirty data in cache over the delalloc
> extent and retain those regions of the delalloc extent."
> 
> i.e. we never punch the page cache anymore, and we only selectively
> punch the delalloc extent that backs the clean regions of the write
> range...

Ah ok.  I was thinking there was a discrepancy between the description
in the commit message and the code, but then I zoomed out and asked
myself why dump the pagecache at all...

> > I don't know if there are hysterical raisins for this or if the goal is
> > to undo memory consumption after a write failure?  If we're stale-ing
> > the write because the iomapping changed, why not leave the folio where
> > it is, refresh the iomapping, and come back to (possibly?) the same
> > folio?
> 
> I can't say for certain - I haven't gone and looked at the history.
> I suspect it goes back to the days when write() could write zeroes
> into the page cache for eof zeroing or zeroing for file extension
> before the write() started writing data. Maybe that was part of what
> it was trying to undo?

Yeah, I dunno, but it certainly looks like it might be an attempt to
undo the effects of posteof zeroing if the write fails.  Back in 2.6.36
days, commit 155130a4f784 ("get rid of block_write_begin_newtrunc")
changed xfs_vm_write_begin to truncate any posteof areas after a failed
write.  This was part of some sort of "new truncate sequence" that
itself got fixed and patched various times after that.

In 4.8, commit 68a9f5e7007c ("xfs: implement iomap based buffered write
path") changed this truncation to the unprocessed part of a short write,
even if it wasn't posteof.  The commit says it's part of a broad
rewrite, but doesn't mention anything about that particular change.

/me shrugs

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Nov. 4, 2022, 8:08 a.m. UTC | #7
So, the whole scan for delalloc logic seems pretty generic, I think
it can and should be lifted to iomap, with
xfs_buffered_write_delalloc_punch provided as a callback.
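
A rough sketch of what that lifted interface could look like - placeholder
names only, nothing that exists in the tree:

	/* Sketch with placeholder names: iomap owns the page cache walk and
	 * calls back into the filesystem to punch out the delalloc blocks
	 * backing each clean sub-range it finds in [start_byte, end_byte). */
	typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset,
			loff_t length);

	int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
			loff_t start_byte, loff_t end_byte,
			iomap_punch_t punch);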

As for the reuse of the seek hole / data helpers, I'm not sure
this actually helps all that much, and it certainly is not super
efficient.  I don't want to talk you into directly rewriting this
once again, but a simple

	while ((folio = find_get_entry(&xas, max, XA_PRESENT))) {
		if (xa_is_value(folio) || folio_test_uptodate(folio) ||
		    ops->is_partially_uptodate) {
			xas_pause(xas);
			rcu_read_unlock();
			folio_lock(folio);
			if (likely(folio->mapping == mapping) &&
			    folio_test_dirty(folio)) {
				// crop folio start / end by total range
				if (ops->is_partially_uptodate) {
					// find partial range
				}
				// add to or start new range
			}
			
		} else {
			// pause rcu unlock and submit previous range if
			// needed
		}

		// move to next folio with size magic etc
	}

	// submit pending range if it exists
Dave Chinner Nov. 4, 2022, 11:10 p.m. UTC | #8
On Fri, Nov 04, 2022 at 01:08:50AM -0700, Christoph Hellwig wrote:
> So, the whole scan for delalloc logic seems pretty generic, I think
> it can and should be lifted to iomap, with
> xfs_buffered_write_delalloc_punch provided as a callback.

Maybe. When we get another filesystem that has the same problem with
short writes needing to punch delalloc extents, we can look at
lifting it into the generic code. But until then, it is exclusively
an XFS issue...

> As for the reuse of the seek hole / data helpers, I'm not sure
> this actually helps all that much, and it certainly is not super
> efficient.  I don't want to talk you into directly rewriting this
> once again, but a simple

[snip]

I started with the method you are suggesting, and it took me 4 weeks
of fighting with boundary condition bugs before I realised there was
a better way.

Searching for sub-folio discontiguities is highly inefficient
however you look at it - we have to scan dirty folios block by block
to determine the uptodate state of each block. We can't do a range scan
because is_partially_uptodate() will return false if any block
within the range is not up to date.  Hence we have to iterate one
block at a time to determine the state of each block, and that
greatly complicates things.
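
To illustrate, a per-block scan ends up looking something like this (a
sketch only, not code from any patch):

	/*
	 * Sketch only: because ->is_partially_uptodate() answers for a whole
	 * range, extracting per-block state from a dirty folio means asking
	 * about one filesystem block at a time.
	 */
	static void scan_folio_blocks(struct folio *folio,
			const struct address_space_operations *ops,
			unsigned int blocksize)
	{
		size_t off;

		for (off = 0; off < folio_size(folio); off += blocksize) {
			if (folio_test_uptodate(folio) ||
			    (ops->is_partially_uptodate &&
			     ops->is_partially_uptodate(folio, off, blocksize))) {
				/* block holds cached data */
			} else {
				/* block is a hole in the cache */
			}
		}
	}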

i.e. we now have range boundaries at the edges of the write() op,
range boundaries at the edges of filesystem blocks, and range
boundaries at unpredictable folio_size() edges. I couldn't keep all
this straight in my head - I have to be able to maintain and debug
this code, so if I can't track all the edge cases in my head, I sure
as hell can't debug the code, nor expect to understand it when I
next look at it in a few months time.

Just because one person is smart enough to be able to write code
that uses multiply-nested range iterations full of boundary
conditions that have to be handled correctly, it doesn't mean that
it is the best way to write slow-path/error handling code that *must
be correct*. The best code is the code that anyone can understand
and say "yes, that is correct".

So, yes, using the seek hole / data helpers might be a tiny bit more
code, but compactness, efficiency and speed really don't matter.
What matters is that the code is correct and that the people who
need to track down the bugs and data corruptions in this code are
able to understand and debug the code.  i.e. to make the code
maintainable we need to break the complex problems down into
algorithms and code that can be understood and debugged by anyone,
not just the smartest person in the room.

Cheers,

Dave.
Darrick J. Wong Nov. 7, 2022, 11:48 p.m. UTC | #9
On Sat, Nov 05, 2022 at 10:10:36AM +1100, Dave Chinner wrote:
> On Fri, Nov 04, 2022 at 01:08:50AM -0700, Christoph Hellwig wrote:
> > So, the whole scan for delalloc logic seems pretty generic, I think
> > it can and should be lifted to iomap, with
> > xfs_buffered_write_delalloc_punch provided as a callback.
> 
> Maybe. When we get another filesystem that has the same problem with
> short writes needing to punch delalloc extents, we can look at
> lifting it into the generic code. But until then, it is exclusively
> an XFS issue...
> 
> > As for the reuse of the seek hole / data helpers, I'm not sure
> > this actually helps all that much, and it certainly is not super
> > efficient.  I don't want to talk you into directly rewriting this
> > once again, but a simple
> 
> [snip]
> 
> I started with the method you are suggesting, and it took me 4 weeks
> of fighting with boundary condition bugs before I realised there was
> a better way.
> 
> Searching for sub-folio discontiguities is highly inefficient
> however you look at it - we have to scan dirty folios block by block
> to determine the uptodate state of each block. We can't do a range scan
> because is_partially_uptodate() will return false if any block
> within the range is not up to date.  Hence we have to iterate one
> block at a time to determine the state of each block, and that
> greatly complicates things.

This sounds like a neat optimization for seek hole/data, but that's an
optimization that can be deferred to another cleanup.  As it is, this
fix patchset already introduces plenty to think about.

--D

> i.e. we now have range boundaries at the edges of the write() op,
> range boundaries at the edges of filesystem blocks, and range
> boundaries at unpredictable folio_size() edges. I couldn't keep all
> this straight in my head - I have to be able to maintain and debug
> this code, so if I can't track all the edge cases in my head, I sure
> as hell can't debug the code, nor expect to understand it when I
> next look at it in a few months time.

> Just because one person is smart enough to be able to write code
> that uses multiply-nested range iterations full of boundary
> conditions that have to be handled correctly, it doesn't mean that
> it is the best way to write slow-path/error handling code that *must
> be correct*. The best code is the code that anyone can understand
> and say "yes, that is correct".
> 
> So, yes, using the seek hole / data helpers might be a tiny bit more
> code, but compactness, efficiency and speed really don't matter.
> What matters is that the code is correct and that the people who
> need to track down the bugs and data corruptions in this code are
> able to understand and debug the code.  i.e. to make the code
> maintainable we need to break the complex problems down into
> algorithms and code that can be understood and debugged by anyone,
> not just the smartest person in the room.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7bb55dbc19d3..2d48fcc7bd6f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1134,6 +1134,146 @@  xfs_buffered_write_delalloc_punch(
 				end_fsb - start_fsb);
 }
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceding range and update the offset from which
+ * the next punch will start.
+ *
+ * We can punch out clean pages because they either contain data that has been
+ * written back - in which case the delalloc punch over that range is a no-op -
+ * or they were instantiated by read faults, in which case they contain zeroes and we can
+ * remove the delalloc backing range and any new writes to those pages will do
+ * the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep delalloc extents over the
+ * dirty ranges of the page cache.
+ */
+static int
+xfs_buffered_write_delalloc_scan(
+	struct inode		*inode,
+	loff_t			*punch_start_byte,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			offset = start_byte;
+
+	while (offset < end_byte) {
+		struct folio	*folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
+		if (!folio) {
+			offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (offset > *punch_start_byte) {
+				int	error;
+
+				error = xfs_buffered_write_delalloc_punch(inode,
+						*punch_start_byte, offset);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		offset = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we don't
+ * have to care about variable size folios, nor where the start or end of the
+ * data range lies within a folio, if they lie within the same folio or even if
+ * there are multiple discontiguous data ranges within the folio.
+ */
+static int
+xfs_buffered_write_delalloc_release(
+	struct inode		*inode,
+	loff_t			start_byte,
+	loff_t			end_byte)
+{
+	loff_t			punch_start_byte = start_byte;
+	int			error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < end_byte) {
+		loff_t		data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		ASSERT(start_byte >= punch_start_byte);
+		ASSERT(start_byte < end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		ASSERT(data_end > start_byte);
+		ASSERT(data_end <= end_byte);
+
+		error = xfs_buffered_write_delalloc_scan(inode,
+				&punch_start_byte, start_byte, data_end);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = xfs_buffered_write_delalloc_punch(inode,
+				punch_start_byte, end_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 static int
 xfs_buffered_write_iomap_end(
 	struct inode		*inode,
@@ -1179,16 +1319,7 @@  xfs_buffered_write_iomap_end(
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
+	error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
 			__func__, XFS_I(inode)->i_ino);