mbox series

[RFC,0/4] iomap: zero dirty folios over unwritten mappings on zero range

Message ID 20240718130212.23905-1-bfoster@redhat.com (mailing list archive)
Headers show
Series iomap: zero dirty folios over unwritten mappings on zero range | expand

Message

Brian Foster July 18, 2024, 1:02 p.m. UTC
Hi all,

This is a stab at fixing the iomap zero range problem where it doesn't
correctly handle the case of an unwritten mapping with dirty pagecache.
The gist is that we scan the mapping for dirty cache, zero any
already-dirty folios via buffered writes as normal, but then otherwise
skip clean ranges once we have a chance to validate those ranges against
races with writeback or reclaim.

This is somewhat simplistic in terms of how it scans, but that is
intentional based on the existing use cases for zero range. From poking
around a bit, my current sense is that there isn't any user of zero
range that would ever expect to see more than a single dirty folio. Most
callers either straddle the EOF folio or flush in higher level code for
presumably (fs) context specific reasons. If somebody has an example to
the contrary, please let me know because I'd love to be able to use it
for testing.

The caveat to this approach is that it only works for filesystems that
implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
doesn't use ->iomap_valid() and does call zero range, but AFAICT it
doesn't actually export unwritten mappings so I suspect this is not a
problem. My understanding is that ext4 iomap support is in progress, but
I've not yet dug into what that looks like (though I suspect similar to
XFS). The concern is mainly that this leaves a landmine for fs that
might grow support for unwritten mappings && zero range but not
->iomap_valid(). We'd likely never know zero range was broken for such
fs until stale data exposure problems start to materialize.

I considered adding a fallback to just add a flush at the top of
iomap_zero_range() so at least all future users would be correct, but I
wanted to gate that on the absence of ->iomap_valid() and folio_ops
isn't provided until iomap_begin() time. I suppose another way around
that could be to add a flags param to iomap_zero_range() where the
caller could explicitly opt out of a flush, but that's still kind of
ugly. I dunno, maybe better than nothing..?

So IMO, this raises the question of whether this is just unnecessarily
overcomplicated. The KISS principle implies that it would also be
perfectly fine to do a conditional "flush and stale" in zero range
whenever we see the combination of an unwritten mapping and dirty
pagecache (the latter checked before or during ->iomap_begin()). That's
simple to implement and AFAICT would work/perform adequately and
generically for all filesystems. I have one or two prototypes of this
sort of thing if folks want to see it as an alternative.

Otherwise, this series has so far seen some light regression and
targeted testing. Patches 1-2 are factoring dependencies, patch 3
implements the main fix, and patch 4 drops the flush from XFS truncate.
Thoughts, reviews, flames appreciated.

Brian

Brian Foster (4):
  filemap: return pos of first dirty folio from range_has_writeback
  iomap: refactor an iomap_revalidate() helper
  iomap: fix handling of dirty folios over unwritten extents
  xfs: remove unnecessary flush of eof page from truncate

 fs/iomap/buffered-io.c  | 81 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_iops.c       | 10 -----
 include/linux/pagemap.h |  4 +-
 mm/filemap.c            |  8 ++--
 4 files changed, 75 insertions(+), 28 deletions(-)

Comments

Matthew Wilcox July 18, 2024, 3:09 p.m. UTC | #1
On Thu, Jul 18, 2024 at 09:02:09AM -0400, Brian Foster wrote:
> @@ -655,6 +655,8 @@ bool filemap_range_has_writeback(struct address_space *mapping,
>  				folio_test_writeback(folio))
>  			break;
>  	}
> +	if (folio)
> +		*start_byte = folio_pos(folio);
>  	rcu_read_unlock();
>  	return folio != NULL;
>  }

Distressingly, this is unsafe.

We have no reference on the folio at this point (not one that matters,
anyway).  We have the rcu read lock, yes, but that doesn't protect enough
to make folio_pos() safe.

Since we do't have folio_get() here, the folio can be freed, sent back to
the page allocator, and then reallocated to literally any purpose.  As I'm
reviewing patch 1/4, I have no idea if this is just a hint and you can
survive it being completely wrong, or if this is going to cause problems.
Josef Bacik July 18, 2024, 3:36 p.m. UTC | #2
On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a stab at fixing the iomap zero range problem where it doesn't
> correctly handle the case of an unwritten mapping with dirty pagecache.
> The gist is that we scan the mapping for dirty cache, zero any
> already-dirty folios via buffered writes as normal, but then otherwise
> skip clean ranges once we have a chance to validate those ranges against
> races with writeback or reclaim.
> 
> This is somewhat simplistic in terms of how it scans, but that is
> intentional based on the existing use cases for zero range. From poking
> around a bit, my current sense is that there isn't any user of zero
> range that would ever expect to see more than a single dirty folio. Most
> callers either straddle the EOF folio or flush in higher level code for
> presumably (fs) context specific reasons. If somebody has an example to
> the contrary, please let me know because I'd love to be able to use it
> for testing.
> 
> The caveat to this approach is that it only works for filesystems that
> implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> doesn't actually export unwritten mappings so I suspect this is not a
> problem. My understanding is that ext4 iomap support is in progress, but
> I've not yet dug into what that looks like (though I suspect similar to
> XFS). The concern is mainly that this leaves a landmine for fs that
> might grow support for unwritten mappings && zero range but not
> ->iomap_valid(). We'd likely never know zero range was broken for such
> fs until stale data exposure problems start to materialize.
> 
> I considered adding a fallback to just add a flush at the top of
> iomap_zero_range() so at least all future users would be correct, but I
> wanted to gate that on the absence of ->iomap_valid() and folio_ops
> isn't provided until iomap_begin() time. I suppose another way around
> that could be to add a flags param to iomap_zero_range() where the
> caller could explicitly opt out of a flush, but that's still kind of
> ugly. I dunno, maybe better than nothing..?
> 
> So IMO, this raises the question of whether this is just unnecessarily
> overcomplicated. The KISS principle implies that it would also be
> perfectly fine to do a conditional "flush and stale" in zero range
> whenever we see the combination of an unwritten mapping and dirty
> pagecache (the latter checked before or during ->iomap_begin()). That's
> simple to implement and AFAICT would work/perform adequately and
> generically for all filesystems. I have one or two prototypes of this
> sort of thing if folks want to see it as an alternative.

I think this is the better approach, otherwise there's another behavior that's
gated behind having a callback that other filesystems may not know about and
thus have a gap.

Additionally do you have a test for this stale data exposure?  I think no matter
what the solution it would be good to have a test for this so that we can make
sure we're all doing the correct thing with zero range.  Thanks,

Josef
Darrick J. Wong July 18, 2024, 4:02 p.m. UTC | #3
On Thu, Jul 18, 2024 at 11:36:13AM -0400, Josef Bacik wrote:
> On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is a stab at fixing the iomap zero range problem where it doesn't
> > correctly handle the case of an unwritten mapping with dirty pagecache.
> > The gist is that we scan the mapping for dirty cache, zero any
> > already-dirty folios via buffered writes as normal, but then otherwise
> > skip clean ranges once we have a chance to validate those ranges against
> > races with writeback or reclaim.
> > 
> > This is somewhat simplistic in terms of how it scans, but that is
> > intentional based on the existing use cases for zero range. From poking
> > around a bit, my current sense is that there isn't any user of zero
> > range that would ever expect to see more than a single dirty folio. Most
> > callers either straddle the EOF folio or flush in higher level code for
> > presumably (fs) context specific reasons. If somebody has an example to
> > the contrary, please let me know because I'd love to be able to use it
> > for testing.
> > 
> > The caveat to this approach is that it only works for filesystems that
> > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > doesn't actually export unwritten mappings so I suspect this is not a
> > problem. My understanding is that ext4 iomap support is in progress, but
> > I've not yet dug into what that looks like (though I suspect similar to
> > XFS). The concern is mainly that this leaves a landmine for fs that
> > might grow support for unwritten mappings && zero range but not
> > ->iomap_valid(). We'd likely never know zero range was broken for such
> > fs until stale data exposure problems start to materialize.
> > 
> > I considered adding a fallback to just add a flush at the top of
> > iomap_zero_range() so at least all future users would be correct, but I
> > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > isn't provided until iomap_begin() time. I suppose another way around
> > that could be to add a flags param to iomap_zero_range() where the
> > caller could explicitly opt out of a flush, but that's still kind of
> > ugly. I dunno, maybe better than nothing..?

Or move ->iomap_valid to the iomap ops structure.  It's a mapping
predicate, and has nothing to do with folios.

> > So IMO, this raises the question of whether this is just unnecessarily
> > overcomplicated. The KISS principle implies that it would also be
> > perfectly fine to do a conditional "flush and stale" in zero range
> > whenever we see the combination of an unwritten mapping and dirty
> > pagecache (the latter checked before or during ->iomap_begin()). That's
> > simple to implement and AFAICT would work/perform adequately and
> > generically for all filesystems. I have one or two prototypes of this
> > sort of thing if folks want to see it as an alternative.

I wouldn't mind seeing such a prototype.  Start by hoisting the
filemap_write_and_wait_range call to iomap, then adjust it only to do
that if there's dirty pagecache + unwritten mappings?  Then get more
complicated from there, and we can decide if we want the increasing
levels of trickiness.

> I think this is the better approach, otherwise there's another behavior that's
> gated behind having a callback that other filesystems may not know about and
> thus have a gap.

<nod> I think filesystems currently only need to supply an ->iomap_valid
function for pagecache operations because those are the only ones where
we have to maintain consistency between something that isn't locked when
we get the mapping, and the mapping not being locked when we lock that
first thing.  I suspect they also only need to supply it if they support
unwritten extents.

From what I can tell, the rest (e.g. directio/FIEMAP) don't care because
callers get to manage concurrency.

*But* in general it makes sense to me that any iomap operation ought to
be able to revalidate a mapping at any time.

> Additionally do you have a test for this stale data exposure?  I think no matter
> what the solution it would be good to have a test for this so that we can make
> sure we're all doing the correct thing with zero range.  Thanks,

I was also curious about this.   IIRC we have some tests for the
validiting checking itself, but I don't recall if there's a specific
regression test for the eofblock clearing.

--D

> Josef
>
Brian Foster July 18, 2024, 4:40 p.m. UTC | #4
On Thu, Jul 18, 2024 at 09:02:02AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 18, 2024 at 11:36:13AM -0400, Josef Bacik wrote:
> > On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > This is a stab at fixing the iomap zero range problem where it doesn't
> > > correctly handle the case of an unwritten mapping with dirty pagecache.
> > > The gist is that we scan the mapping for dirty cache, zero any
> > > already-dirty folios via buffered writes as normal, but then otherwise
> > > skip clean ranges once we have a chance to validate those ranges against
> > > races with writeback or reclaim.
> > > 
> > > This is somewhat simplistic in terms of how it scans, but that is
> > > intentional based on the existing use cases for zero range. From poking
> > > around a bit, my current sense is that there isn't any user of zero
> > > range that would ever expect to see more than a single dirty folio. Most
> > > callers either straddle the EOF folio or flush in higher level code for
> > > presumably (fs) context specific reasons. If somebody has an example to
> > > the contrary, please let me know because I'd love to be able to use it
> > > for testing.
> > > 
> > > The caveat to this approach is that it only works for filesystems that
> > > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > > doesn't actually export unwritten mappings so I suspect this is not a
> > > problem. My understanding is that ext4 iomap support is in progress, but
> > > I've not yet dug into what that looks like (though I suspect similar to
> > > XFS). The concern is mainly that this leaves a landmine for fs that
> > > might grow support for unwritten mappings && zero range but not
> > > ->iomap_valid(). We'd likely never know zero range was broken for such
> > > fs until stale data exposure problems start to materialize.
> > > 
> > > I considered adding a fallback to just add a flush at the top of
> > > iomap_zero_range() so at least all future users would be correct, but I
> > > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > > isn't provided until iomap_begin() time. I suppose another way around
> > > that could be to add a flags param to iomap_zero_range() where the
> > > caller could explicitly opt out of a flush, but that's still kind of
> > > ugly. I dunno, maybe better than nothing..?
> 
> Or move ->iomap_valid to the iomap ops structure.  It's a mapping
> predicate, and has nothing to do with folios.
> 

Good idea. That might be an option.

> > > So IMO, this raises the question of whether this is just unnecessarily
> > > overcomplicated. The KISS principle implies that it would also be
> > > perfectly fine to do a conditional "flush and stale" in zero range
> > > whenever we see the combination of an unwritten mapping and dirty
> > > pagecache (the latter checked before or during ->iomap_begin()). That's
> > > simple to implement and AFAICT would work/perform adequately and
> > > generically for all filesystems. I have one or two prototypes of this
> > > sort of thing if folks want to see it as an alternative.
> 
> I wouldn't mind seeing such a prototype.  Start by hoisting the
> filemap_write_and_wait_range call to iomap, then adjust it only to do
> that if there's dirty pagecache + unwritten mappings?  Then get more
> complicated from there, and we can decide if we want the increasing
> levels of trickiness.
> 

Yeah, exactly. Start with an unconditional flush at the top of
iomap_zero_range() (which perhaps also serves as a -stable fix), then
replace it with an unconditional dirty cache check and a conditional
flush/stale down in zero_iter() (for the dirty+unwritten case). With
that false positives from the cache check are less of an issue because
the only consequence is basically just a spurious flush. From there, the
revalidation approach could be an optional further optimization to avoid
the flush entirely, but we'll have to see if it's worth the complexity.

I have various experimental patches around that pretty much do the
conditional flush thing. I just have to form it into a presentable
series.

> > I think this is the better approach, otherwise there's another behavior that's
> > gated behind having a callback that other filesystems may not know about and
> > thus have a gap.
> 
> <nod> I think filesystems currently only need to supply an ->iomap_valid
> function for pagecache operations because those are the only ones where
> we have to maintain consistency between something that isn't locked when
> we get the mapping, and the mapping not being locked when we lock that
> first thing.  I suspect they also only need to supply it if they support
> unwritten extents.
> 
> From what I can tell, the rest (e.g. directio/FIEMAP) don't care because
> callers get to manage concurrency.
> 
> *But* in general it makes sense to me that any iomap operation ought to
> be able to revalidate a mapping at any time.
> 
> > Additionally do you have a test for this stale data exposure?  I think no matter
> > what the solution it would be good to have a test for this so that we can make
> > sure we're all doing the correct thing with zero range.  Thanks,
> 
> I was also curious about this.   IIRC we have some tests for the
> validiting checking itself, but I don't recall if there's a specific
> regression test for the eofblock clearing.
> 

Err.. yeah. I have some random test sequences around that reproduce some
of these issues. I'll form them into an fstest to go along with this.

Thank you both for the feedback.

Brian

> --D
> 
> > Josef
> > 
>
Dave Chinner July 19, 2024, 1:10 a.m. UTC | #5
On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a stab at fixing the iomap zero range problem where it doesn't
> correctly handle the case of an unwritten mapping with dirty pagecache.
> The gist is that we scan the mapping for dirty cache, zero any
> already-dirty folios via buffered writes as normal, but then otherwise
> skip clean ranges once we have a chance to validate those ranges against
> races with writeback or reclaim.
> 
> This is somewhat simplistic in terms of how it scans, but that is
> intentional based on the existing use cases for zero range. From poking
> around a bit, my current sense is that there isn't any user of zero
> range that would ever expect to see more than a single dirty folio.

The current code generally only zeroes a single filesystem block or
less because that's all we need to zero for partial writes.  This is
not going to be true for very much longer with XFS forcealign
functionality, and I suspect it's not true right now for large rt
extent sizes when doing sub-extent writes. In these cases, we are
going to have to zero multiple filesystem blocks during truncate,
hole punch, unaligned writes, etc.

So even if we don't do this now, I think this is something we will
almost certainly be doing in the next kernel release or two.

> Most
> callers either straddle the EOF folio or flush in higher level code for
> presumably (fs) context specific reasons. If somebody has an example to
> the contrary, please let me know because I'd love to be able to use it
> for testing.

Check the xfs_inode_has_bigrtalloc() and xfs_inode_alloc_unitsize()
cases. These are currently being worked on and expanded and factored
so eventually these cases will all fall under
xfs_inode_alloc_unitsize().

> The caveat to this approach is that it only works for filesystems that
> implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> doesn't actually export unwritten mappings so I suspect this is not a
> problem. My understanding is that ext4 iomap support is in progress, but
> I've not yet dug into what that looks like (though I suspect similar to
> XFS). The concern is mainly that this leaves a landmine for fs that
> might grow support for unwritten mappings && zero range but not
> ->iomap_valid(). We'd likely never know zero range was broken for such
> fs until stale data exposure problems start to materialize.
> 
> I considered adding a fallback to just add a flush at the top of
> iomap_zero_range() so at least all future users would be correct, but I
> wanted to gate that on the absence of ->iomap_valid() and folio_ops
> isn't provided until iomap_begin() time. I suppose another way around
> that could be to add a flags param to iomap_zero_range() where the
> caller could explicitly opt out of a flush, but that's still kind of
> ugly. I dunno, maybe better than nothing..?

We want to avoid the flush in this case if we can - what XFS does is
a workaround for iomap not handling dirty data over unwritten
extents. That first flush causes performance issues with certain
truncate heavy workloads, so we really want to avoid it in the
generic code if we can.

> So IMO, this raises the question of whether this is just unnecessarily
> overcomplicated. The KISS principle implies that it would also be
> perfectly fine to do a conditional "flush and stale" in zero range
> whenever we see the combination of an unwritten mapping and dirty
> pagecache (the latter checked before or during ->iomap_begin()). That's
> simple to implement and AFAICT would work/perform adequately and
> generically for all filesystems. I have one or two prototypes of this
> sort of thing if folks want to see it as an alternative.

If we are going to zero the range, and the range is already
unwritten, then why do we need to flush the data in the cache to
make it clean and written before running the zeroing? Why not just
invalidate the entire cache over the unwritten region and so return it
all to containing zeroes (i.e. is unwritten!) without doing any IO.

Yes, if some of the range is under writeback, the invalidation will
have to wait for that to complete - invalidate_inode_pages2_range()
does this for us - but after the invalidation those regions will now
be written and iomap revalidation after page cache invalidation will
detect this.

So maybe the solution is simply to invalidate the cache over
unwritten extents and then revalidate the iomap? If the iomap is
still valid, then we can skip the unwritten extent completely. If
the invalidation returned -EBUSY or the iomap is stale, then remap
it and try again?

If we don't have an iomap validation function, then we could check
filemap_range_needs_writeback() before calling
invalidate_inode_pages2_range() as that will tell us if there were
folios that might have been under writeback during the invalidation.
In that case, we can treat "needs writeback" the same as a failed
iomap revalidation.

So what am I missing? What does the flush actually accomplish that
simply calling invalidate_inode_pages2_range() to throw the data we
need to zero away doesn't?

-Dave.
Brian Foster July 19, 2024, 3:22 p.m. UTC | #6
On Fri, Jul 19, 2024 at 11:10:28AM +1000, Dave Chinner wrote:
> On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is a stab at fixing the iomap zero range problem where it doesn't
> > correctly handle the case of an unwritten mapping with dirty pagecache.
> > The gist is that we scan the mapping for dirty cache, zero any
> > already-dirty folios via buffered writes as normal, but then otherwise
> > skip clean ranges once we have a chance to validate those ranges against
> > races with writeback or reclaim.
> > 
> > This is somewhat simplistic in terms of how it scans, but that is
> > intentional based on the existing use cases for zero range. From poking
> > around a bit, my current sense is that there isn't any user of zero
> > range that would ever expect to see more than a single dirty folio.
> 
> The current code generally only zeroes a single filesystem block or
> less because that's all we need to zero for partial writes.  This is
> not going to be true for very much longer with XFS forcealign
> functionality, and I suspect it's not true right now for large rt
> extent sizes when doing sub-extent writes. In these cases, we are
> going to have to zero multiple filesystem blocks during truncate,
> hole punch, unaligned writes, etc.
> 
> So even if we don't do this now, I think this is something we will
> almost certainly be doing in the next kernel release or two.
> 
> > Most
> > callers either straddle the EOF folio or flush in higher level code for
> > presumably (fs) context specific reasons. If somebody has an example to
> > the contrary, please let me know because I'd love to be able to use it
> > for testing.
> 
> Check the xfs_inode_has_bigrtalloc() and xfs_inode_alloc_unitsize()
> cases. These are currently being worked on and expanded and factored
> so eventually these cases will all fall under
> xfs_inode_alloc_unitsize().
> 
> > The caveat to this approach is that it only works for filesystems that
> > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > doesn't actually export unwritten mappings so I suspect this is not a
> > problem. My understanding is that ext4 iomap support is in progress, but
> > I've not yet dug into what that looks like (though I suspect similar to
> > XFS). The concern is mainly that this leaves a landmine for fs that
> > might grow support for unwritten mappings && zero range but not
> > ->iomap_valid(). We'd likely never know zero range was broken for such
> > fs until stale data exposure problems start to materialize.
> > 
> > I considered adding a fallback to just add a flush at the top of
> > iomap_zero_range() so at least all future users would be correct, but I
> > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > isn't provided until iomap_begin() time. I suppose another way around
> > that could be to add a flags param to iomap_zero_range() where the
> > caller could explicitly opt out of a flush, but that's still kind of
> > ugly. I dunno, maybe better than nothing..?
> 
> We want to avoid the flush in this case if we can - what XFS does is
> a workaround for iomap not handling dirty data over unwritten
> extents. That first flush causes performance issues with certain
> truncate heavy workloads, so we really want to avoid it in the
> generic code if we can.
> 

Sort of.. the only complaint I've heard about this was due to reliance
on a userspace program with a subtly dumb and repetitive truncate
workload. We worked around this problem by fixing the userspace tool and
I've not heard a complaint since.

Even without that userspace fix, a conditional flush in the kernel would
have been perfectly suitable for the same workload (as in, pretty much
unnoticeable). So the broader point here is just that this isn't so
black and white that flushing at all is necessarily a problem.

> > So IMO, this raises the question of whether this is just unnecessarily
> > overcomplicated. The KISS principle implies that it would also be
> > perfectly fine to do a conditional "flush and stale" in zero range
> > whenever we see the combination of an unwritten mapping and dirty
> > pagecache (the latter checked before or during ->iomap_begin()). That's
> > simple to implement and AFAICT would work/perform adequately and
> > generically for all filesystems. I have one or two prototypes of this
> > sort of thing if folks want to see it as an alternative.
> 
> If we are going to zero the range, and the range is already
> unwritten, then why do we need to flush the data in the cache to
> make it clean and written before running the zeroing? Why not just
> invalidate the entire cache over the unwritten region and so return it
> all to containing zeroes (i.e. is unwritten!) without doing any IO.
> 
> Yes, if some of the range is under writeback, the invalidation will
> have to wait for that to complete - invalidate_inode_pages2_range()
> does this for us - but after the invalidation those regions will now
> be written and iomap revalidation after page cache invalidation will
> detect this.
> 
> So maybe the solution is simply to invalidate the cache over
> unwritten extents and then revalidate the iomap? If the iomap is
> still valid, then we can skip the unwritten extent completely. If
> the invalidation returned -EBUSY or the iomap is stale, then remap
> it and try again?
> 
> If we don't have an iomap validation function, then we could check
> filemap_range_needs_writeback() before calling
> invalidate_inode_pages2_range() as that will tell us if there were
> folios that might have been under writeback during the invalidation.
> In that case, we can treat "needs writeback" the same as a failed
> iomap revalidation.
> 
> So what am I missing? What does the flush actually accomplish that
> simply calling invalidate_inode_pages2_range() to throw the data we
> need to zero away doesn't?
> 

Ugh.. when I look at invalidate I see various additional corner case
complexities to think about, for pretty much no additional value over a
conditional flush, especially when you start getting into some of the
writeback complexities you've already noted above. Even if you could
make it work technically (which I'm not sure of), it looks like it would
be increasingly more difficult to properly test and maintain. I don't
really see much reason to go down that path without explicit
justification and perhaps some proving out.

I think the right approach to this problem is precisely what was
discussed up thread with Josef and Darrick. Do the simple and
generically correct thing by lifting the unconditional flush from XFS,
optimize to make the flush conditional to dirty+unwritten while still
being fs generic, and then finally optimize away the flush entirely for
filesystems that provide the proper revalidation support like XFS.

A benefit of that is we can start with a tested and veriable functional
base to iterate from and fall back on. I think the mapping iteration
thing from the patch 3 discussion might turn out elegant enough that
maybe we'll come up with a generic non-flushing solution based on that
(I have some vague thoughts that require investigation) once we have
some code to poke around at.

Brian

P.S., I'm off for a week or so after today. I'll think more about the
mapping iteration idea and then try to put something together once I'm
back.

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>