
[RFC,v6,08/10] iomap/xfs: Eliminate the iomap_valid handler

Message ID 20230108194034.1444764-9-agruenba@redhat.com (mailing list archive)
State New, archived
Series Turn iomap_page_ops into iomap_folio_ops

Commit Message

Andreas Gruenbacher Jan. 8, 2023, 7:40 p.m. UTC
Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
handler and validating the mapping there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 26 +++++---------------------
 fs/xfs/xfs_iomap.c     | 37 ++++++++++++++++++++++++++-----------
 include/linux/iomap.h  | 23 ++++++-----------------
 3 files changed, 37 insertions(+), 49 deletions(-)

Comments

Dave Chinner Jan. 8, 2023, 9:59 p.m. UTC | #1
On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> handler and validating the mapping there.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

I think this is wrong.

The ->iomap_valid() function handles a fundamental architectural
issue with cached iomaps: the iomap can become stale at any time
whilst it is in use by the iomap core code.

The current problem it solves in the iomap_write_begin() path has to
do with writeback and memory reclaim races over unwritten extents,
but the general case is that we must be able to check the iomap
at any point in time to assess its validity.

Indeed, we also have this same "iomap valid check" functionality in the
writeback code as cached iomaps can become stale due to racing
writeback, truncation, etc. But you wouldn't know it by looking at the iomap
writeback code - this is currently hidden by XFS by embedding
the checks into the iomap writeback ->map_blocks function.

That is, the first thing that xfs_map_blocks() does is check if the
cached iomap is valid, and if it is valid it returns immediately and
the iomap writeback code uses it without question.

The reason that this is embedded like this is that the iomap did not
have a validity cookie field in it, and so the validity information
was wrapped around the outside of the iomap_writepage_ctx and the
filesystem has to decode it from that private wrapping structure.

However, the validity information in the structure wrapper is
identical to the iomap validity cookie, and so the direction I've
been working towards is to replace this implicit, hidden cached
iomap validity check with an explicit ->iomap_valid call and then
only call ->map_blocks if the validity check fails (or is not
implemented).
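
As a rough sketch of that direction, assuming a hypothetical
->iomap_valid member on the iomap writeback ops (it does not exist
today; struct iomap_writepage_ctx and ->map_blocks are the existing
interfaces), the writeback iteration could do:

static int iomap_writeback_map_blocks(struct iomap_writepage_ctx *wpc,
		struct inode *inode, loff_t pos)
{
	/* Reuse the cached iomap only if it covers pos and still validates. */
	if (wpc->iomap.length &&
	    pos >= wpc->iomap.offset &&
	    pos < wpc->iomap.offset + wpc->iomap.length &&
	    (!wpc->ops->iomap_valid ||
	     wpc->ops->iomap_valid(inode, &wpc->iomap)))
		return 0;

	/* Otherwise ask the filesystem for a fresh mapping at pos. */
	return wpc->ops->map_blocks(wpc, inode, pos);
}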

I want to use the same code for all the iomap validity checks in all
the iomap core code - this is an iomap issue, the conditions where
we need to check for iomap validity are different depending on
the iomap context being run, and the checks are not necessarily
dependent on first having locked a folio.

Yes, the validity cookie needs to be decoded by the filesystem, but
that does not dictate where the validity checking needs to be done
by the iomap core.

Hence I think removing ->iomap_valid is a big step backwards for the
iomap core code - the iomap core needs to be able to formally verify
the iomap is valid at any point in time, not just at the point in
time a folio in the page cache has been locked...

-Dave.
Andreas Gruenbacher Jan. 9, 2023, 6:45 p.m. UTC | #2
On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > handler and validating the mapping there.
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
>
> I think this is wrong.
>
> The ->iomap_valid() function handles a fundamental architectural
> issue with cached iomaps: the iomap can become stale at any time
> whilst it is in use by the iomap core code.
>
> The current problem it solves in the iomap_write_begin() path has to
> do with writeback and memory reclaim races over unwritten extents,
> but the general case is that we must be able to check the iomap
> at any point in time to assess its validity.
>
> Indeed, we also have this same "iomap valid check" functionality in the
> writeback code as cached iomaps can become stale due to racing
> writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> writeback code - this is currently hidden by XFS by embedding
> the checks into the iomap writeback ->map_blocks function.
>
> That is, the first thing that xfs_map_blocks() does is check if the
> cached iomap is valid, and if it is valid it returns immediately and
> the iomap writeback code uses it without question.
>
> The reason that this is embedded like this is that the iomap did not
> have a validity cookie field in it, and so the validity information
> was wrapped around the outside of the iomap_writepage_ctx and the
> filesystem has to decode it from that private wrapping structure.
>
> However, the validity information in the structure wrapper is
> identical to the iomap validity cookie,

Then could that part of the xfs code be converted to use
iomap->validity_cookie so that struct iomap_writepage_ctx can be
eliminated?

> and so the direction I've
> been working towards is to replace this implicit, hidden cached
> iomap validity check with an explicit ->iomap_valid call and then
> only call ->map_blocks if the validity check fails (or is not
> implemented).
>
> I want to use the same code for all the iomap validity checks in all
> the iomap core code - this is an iomap issue, the conditions where
> we need to check for iomap validity are different depending on
> the iomap context being run, and the checks are not necessarily
> dependent on first having locked a folio.
>
> Yes, the validity cookie needs to be decoded by the filesystem, but
> that does not dictate where the validity checking needs to be done
> by the iomap core.
>
> Hence I think removing ->iomap_valid is a big step backwards for the
> iomap core code - the iomap core needs to be able to formally verify
> the iomap is valid at any point in time, not just at the point in
> time a folio in the page cache has been locked...

We don't need to validate an iomap "at any time". It's two specific
places in the code in which we need to check, and we're not going to
end up with ten more such places tomorrow. I'd prefer to keep those
filesystem internals in the filesystem specific code instead of
exposing them to the iomap layer. But that's just me ...

If we ignore this particular commit for now, do you have any
objections to the patches in this series? If not, it would be great if
we could add the other patches to iomap-for-next.

By the way, I'm still not sure if gfs2 is affected by this whole iomap
validation drama given that it neither implements unwritten extents
nor delayed allocation. This is a mess.

Thanks,
Andreas
Dave Chinner Jan. 9, 2023, 10:54 p.m. UTC | #3
On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > handler and validating the mapping there.
> > >
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> >
> > I think this is wrong.
> >
> > The ->iomap_valid() function handles a fundamental architectural
> > issue with cached iomaps: the iomap can become stale at any time
> > whilst it is in use by the iomap core code.
> >
> > The current problem it solves in the iomap_write_begin() path has to
> > do with writeback and memory reclaim races over unwritten extents,
> > but the general case is that we must be able to check the iomap
> > at any point in time to assess its validity.
> >
> > Indeed, we also have this same "iomap valid check" functionality in the
> > writeback code as cached iomaps can become stale due to racing
> > writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> > writeback code - this is currently hidden by XFS by embedding
> > the checks into the iomap writeback ->map_blocks function.
> >
> > That is, the first thing that xfs_map_blocks() does is check if the
> > cached iomap is valid, and if it is valid it returns immediately and
> > the iomap writeback code uses it without question.
> >
> > The reason that this is embedded like this is that the iomap did not
> > have a validity cookie field in it, and so the validity information
> > was wrapped around the outside of the iomap_writepage_ctx and the
> > filesystem has to decode it from that private wrapping structure.
> >
> > However, the validity information in the structure wrapper is
> > identical to the iomap validity cookie,
> 
> Then could that part of the xfs code be converted to use
> iomap->validity_cookie so that struct iomap_writepage_ctx can be
> eliminated?

Yes, that is the plan.

> 
> > and so the direction I've
> > been working towards is to replace this implicit, hidden cached
> > iomap validity check with an explicit ->iomap_valid call and then
> > only call ->map_blocks if the validity check fails (or is not
> > implemented).
> >
> > I want to use the same code for all the iomap validity checks in all
> > the iomap core code - this is an iomap issue, the conditions where
> > we need to check for iomap validity are different depending on
> > the iomap context being run, and the checks are not necessarily
> > dependent on first having locked a folio.
> >
> > Yes, the validity cookie needs to be decoded by the filesystem, but
> > that does not dictate where the validity checking needs to be done
> > by the iomap core.
> >
> > Hence I think removing ->iomap_valid is a big step backwards for the
> > iomap core code - the iomap core needs to be able to formally verify
> > the iomap is valid at any point in time, not just at the point in
> > time a folio in the page cache has been locked...
> 
> We don't need to validate an iomap "at any time". It's two specific
> places in the code in which we need to check, and we're not going to
> end up with ten more such places tomorrow.

Not immediately, but that doesn't change the fact this is not a
filesystem specific issue - it's an inherent characteristic of
cached iomaps and unsynchronised extent state changes that occur
outside exclusive inode->i_rwsem IO context (e.g. in writeback and
IO completion contexts).

Racing mmap + buffered writes can expose these state changes as the
iomap buffered write IO path is not serialised against the iomap
mmap IO path except via folio locks. Hence a mmap page fault can
invalidate a cached buffered write iomap by causing a hole ->
unwritten, hole -> delalloc or hole -> written conversion in the
middle of the buffered write range. The buffered write still has a
hole mapping cached for that entire range, and it is now incorrect.

If the mmap write happens to change extent state at the trailing
edge of a partial buffered write, data corruption will occur if we
race just right with writeback and memory reclaim. I'm pretty sure
that this corruption can be reproduced on gfs2 if we try hard enough
- generic/346 triggers the mmap/write race condition, all that is
needed from that point is for writeback and reclaiming pages at
exactly the right time...
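
For reference, the core already reacts to a detected stale mapping in
the buffered write path by remapping rather than failing:
iomap_write_begin() sets IOMAP_F_STALE and reports zero bytes
processed, and the iterator then repeats ->iomap_begin for the same
position. A simplified sketch of that logic, condensed from
iomap_iter_advance() (not the verbatim code):

static int iomap_iter_advance_sketch(struct iomap_iter *iter)
{
	bool stale = iter->iomap.flags & IOMAP_F_STALE;

	/* Handle the previous iteration (if any). */
	if (iter->iomap.length) {
		if (iter->processed < 0)
			return iter->processed;		/* hard error */
		if (!iter->processed && !stale)
			return 0;			/* all done */
		/*
		 * A stale iomap with nothing processed falls through with
		 * iter->pos unchanged, so the same range gets remapped.
		 */
		iter->pos += iter->processed;
		iter->len -= iter->processed;
		if (!iter->len)
			return 0;
	}

	/* Clear the per-iteration state before calling ->iomap_begin again. */
	iter->processed = 0;
	memset(&iter->iomap, 0, sizeof(iter->iomap));
	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
	return 1;
}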

> I'd prefer to keep those
> filesystem internals in the filesystem specific code instead of
> exposing them to the iomap layer. But that's just me ...

My point is that there is nothing XFS specific about these stale
cached iomap race conditions, nor is it specifically related to
folio locking. The folio locking inversions w.r.t. iomap caching and
the interactions with writeback and reclaim are simply the
manifestation that brought the issue to our attention.

This is why I think hiding iomap validation in filesystem specific page
cache allocation/lookup functions is entirely the wrong layer to be
doing iomap validity checks. Especially as it prevents us from
adding more validity checks in the core infrastructure when we need
them in future.

AFAIC, an iomap must carry with it a method for checking
that it is still valid. We need it in the write path, we need it in
the writeback path. If we want to relax the restrictions on clone
operations (e.g. shared locking on the source file), we'll need to
be able to detect stale cached iomaps in those paths, too. And I
haven't really thought through all the implications of shared
locking on buffered writes yet, but that may well require more
checks in other places as well.
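
To make the cookie mechanism concrete, here is a sketch for a
hypothetical filesystem "foo" (all foo_* names are invented for
illustration; iomap->validity_cookie and the ->iomap_begin signature
are the real interfaces):

struct foo_inode {
	struct inode	vfs_inode;
	u32		extent_seq;	/* bumped on every extent map change */
};

static u64 foo_iomap_sequence(struct foo_inode *fi)
{
	return READ_ONCE(fi->extent_seq);
}

/* ->iomap_begin snapshots the sequence when the mapping is handed out. */
static int foo_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
	struct foo_inode *fi = container_of(inode, struct foo_inode,
					    vfs_inode);

	/* ... fill in iomap->type/addr/offset/length under the ilock ... */
	iomap->validity_cookie = foo_iomap_sequence(fi);
	return 0;
}

/* Any consumer of the cached iomap can later revalidate it. */
static bool foo_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	struct foo_inode *fi = container_of(inode, struct foo_inode,
					    vfs_inode);

	return iomap->validity_cookie == foo_iomap_sequence(fi);
}

Writeback, buffered writes, and any future shared-locking paths could
all share the same comparison; only the cookie encoding is filesystem
specific.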

> If we ignore this particular commit for now, do you have any
> objections to the patches in this series? If not, it would be great if
> we could add the other patches to iomap-for-next.

I still don't like moving page cache operations into individual
filesystems, but for the moment I can live with the IOMAP_NOCREATE
hack to drill iomap state through the filesystem without the
filesystem being aware of it.

> By the way, I'm still not sure if gfs2 is affected by this whole iomap
> validation drama given that it neither implements unwritten extents
> nor delayed allocation. This is a mess.

See above - I'm pretty sure it will be, but it may be very difficult
to expose. After all, it's taken several years before anyone noticed
this issue with XFS, even though we were aware of the issue of stale
cached iomaps causing data corruption in the writeback path....

Cheers,

Dave.
Andreas Grünbacher Jan. 10, 2023, 1:09 a.m. UTC | #4
On Mon, Jan 9, 2023 at 11:58 PM Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > handler and validating the mapping there.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > >
> > > I think this is wrong.
> > >
> > > The ->iomap_valid() function handles a fundamental architectural
> > > issue with cached iomaps: the iomap can become stale at any time
> > > whilst it is in use by the iomap core code.
> > >
> > > The current problem it solves in the iomap_write_begin() path has to
> > > do with writeback and memory reclaim races over unwritten extents,
> > > but the general case is that we must be able to check the iomap
> > > at any point in time to assess its validity.
> > >
> > > Indeed, we also have this same "iomap valid check" functionality in the
> > > writeback code as cached iomaps can become stale due to racing
> > > writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> > > writeback code - this is currently hidden by XFS by embedding
> > > the checks into the iomap writeback ->map_blocks function.
> > >
> > > That is, the first thing that xfs_map_blocks() does is check if the
> > > cached iomap is valid, and if it is valid it returns immediately and
> > > the iomap writeback code uses it without question.
> > >
> > > The reason that this is embedded like this is that the iomap did not
> > > have a validity cookie field in it, and so the validity information
> > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > filesystem has to decode it from that private wrapping structure.
> > >
> > > However, the validity information in the structure wrapper is
> > > identical to the iomap validity cookie,
> >
> > Then could that part of the xfs code be converted to use
> > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > eliminated?
>
> Yes, that is the plan.
>
> >
> > > and so the direction I've
> > > been working towards is to replace this implicit, hidden cached
> > > iomap validity check with an explicit ->iomap_valid call and then
> > > only call ->map_blocks if the validity check fails (or is not
> > > implemented).
> > >
> > > I want to use the same code for all the iomap validity checks in all
> > > the iomap core code - this is an iomap issue, the conditions where
> > > we need to check for iomap validity are different depending on
> > > the iomap context being run, and the checks are not necessarily
> > > dependent on first having locked a folio.
> > >
> > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > that does not dictate where the validity checking needs to be done
> > > by the iomap core.
> > >
> > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > iomap core code - the iomap core needs to be able to formally verify
> > > the iomap is valid at any point in time, not just at the point in
> > > time a folio in the page cache has been locked...
> >
> > We don't need to validate an iomap "at any time". It's two specific
> > places in the code in which we need to check, and we're not going to
> > end up with ten more such places tomorrow.
>
> Not immediately, but that doesn't change the fact this is not a
> filesystem specific issue - it's an inherent characteristic of
> cached iomaps and unsynchronised extent state changes that occur
> outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> IO completion contexts).
>
> Racing mmap + buffered writes can expose these state changes as the
> iomap buffered write IO path is not serialised against the iomap
> mmap IO path except via folio locks. Hence a mmap page fault can
> invalidate a cached buffered write iomap by causing a hole ->
> unwritten, hole -> delalloc or hole -> written conversion in the
> middle of the buffered write range. The buffered write still has a
> hole mapping cached for that entire range, and it is now incorrect.
>
> If the mmap write happens to change extent state at the trailing
> edge of a partial buffered write, data corruption will occur if we
> race just right with writeback and memory reclaim. I'm pretty sure
> that this corruption can be reproduced on gfs2 if we try hard enough
> - generic/346 triggers the mmap/write race condition, all that is
> needed from that point is for writeback and reclaiming pages at
> exactly the right time...
>
> > I'd prefer to keep those
> > filesystem internals in the filesystem specific code instead of
> > exposing them to the iomap layer. But that's just me ...
>
> My point is that there is nothing XFS specific about these stale
> cached iomap race conditions, nor is it specifically related to
> folio locking. The folio locking inversions w.r.t. iomap caching and
> the interactions with writeback and reclaim are simply the
> manifestation that brought the issue to our attention.
>
> This is why I think hiding iomap validation in filesystem specific page
> cache allocation/lookup functions is entirely the wrong layer to be
> doing iomap validity checks. Especially as it prevents us from
> adding more validity checks in the core infrastructure when we need
> them in future.
>
> AFAIC, an iomap must carry with it a method for checking
> that it is still valid. We need it in the write path, we need it in
> the writeback path. If we want to relax the restrictions on clone
> operations (e.g. shared locking on the source file), we'll need to
> be able to detect stale cached iomaps in those paths, too. And I
> haven't really thought through all the implications of shared
> locking on buffered writes yet, but that may well require more
> checks in other places as well.
>
> > If we ignore this particular commit for now, do you have any
> > objections to the patches in this series? If not, it would be great if
> > we could add the other patches to iomap-for-next.
>
> I still don't like moving page cache operations into individual
> filesystems, but for the moment I can live with the IOMAP_NOCREATE
> hack to drill iomap state through the filesystem without the
> filesystem being aware of it.

Alright, works for me. Darrick?

> > By the way, I'm still not sure if gfs2 is affected by this whole iomap
> > validation drama given that it neither implements unwritten extents
> > nor delayed allocation. This is a mess.
>
> See above - I'm pretty sure it will be, but it may be very difficult
> to expose. After all, it's taken several years before anyone noticed
> this issue with XFS, even though we were aware of the issue of stale
> cached iomaps causing data corruption in the writeback path....

Okay, that's all pretty ugly. Thanks a lot for the detailed explanation.

Cheers,
Andreas

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Jan. 10, 2023, 8:51 a.m. UTC | #5
On Mon, Jan 09, 2023 at 08:59:11AM +1100, Dave Chinner wrote:
> Indeed, we also have this same "iomap valid check" functionality in the
> writeback code as cached iomaps can become stale due to racing
> writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> writeback code - this is currently hidden by XFS by embedding
> the checks into the iomap writeback ->map_blocks function.

And that's in many ways a good thing, as it avoids various callouts
that are expensive and confusing.  Just like how this patch gets it
right by not having a mess of badly interacting callbacks, but
one that ensures that the page is ready.

> Hence I think removing ->iomap_valid is a big step backwards for the
> iomap core code - the iomap core needs to be able to formally verify
> the iomap is valid at any point in time, not just at the point in
> time a folio in the page cache has been locked...

For using it anywhere else but the buffered write path it is in the
wrong place to start with, and notwithstanding my above concern I
can't really think of a good place and prototype for such a valid
callback to actually cover all use cases.
Christoph Hellwig Jan. 10, 2023, 8:52 a.m. UTC | #6
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong Jan. 15, 2023, 5:29 p.m. UTC | #7
On Tue, Jan 10, 2023 at 02:09:07AM +0100, Andreas Grünbacher wrote:
> On Mon, Jan 9, 2023 at 11:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > > handler and validating the mapping there.
> > > > >
> > > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > > >
> > > > I think this is wrong.
> > > >
> > > > The ->iomap_valid() function handles a fundamental architectural
> > > > issue with cached iomaps: the iomap can become stale at any time
> > > > whilst it is in use by the iomap core code.
> > > >
> > > > The current problem it solves in the iomap_write_begin() path has to
> > > > do with writeback and memory reclaim races over unwritten extents,
> > > > but the general case is that we must be able to check the iomap
> > > > at any point in time to assess its validity.
> > > >
> > > > Indeed, we also have this same "iomap valid check" functionality in the
> > > > writeback code as cached iomaps can become stale due to racing
> > > > writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> > > > writeback code - this is currently hidden by XFS by embedding
> > > > the checks into the iomap writeback ->map_blocks function.
> > > >
> > > > That is, the first thing that xfs_map_blocks() does is check if the
> > > > cached iomap is valid, and if it is valid it returns immediately and
> > > > the iomap writeback code uses it without question.
> > > >
> > > > The reason that this is embedded like this is that the iomap did not
> > > > have a validity cookie field in it, and so the validity information
> > > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > > filesystem has to decode it from that private wrapping structure.
> > > >
> > > > However, the validity information in the structure wrapper is
> > > > identical to the iomap validity cookie,
> > >
> > > Then could that part of the xfs code be converted to use
> > > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > > eliminated?
> >
> > Yes, that is the plan.
> >
> > >
> > > > and so the direction I've
> > > > been working towards is to replace this implicit, hidden cached
> > > > iomap validity check with an explicit ->iomap_valid call and then
> > > > only call ->map_blocks if the validity check fails (or is not
> > > > implemented).
> > > >
> > > > I want to use the same code for all the iomap validity checks in all
> > > > the iomap core code - this is an iomap issue, the conditions where
> > > > we need to check for iomap validity are different depending on
> > > > the iomap context being run, and the checks are not necessarily
> > > > dependent on first having locked a folio.
> > > >
> > > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > > that does not dictate where the validity checking needs to be done
> > > > by the iomap core.
> > > >
> > > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > > iomap core code - the iomap core needs to be able to formally verify
> > > > the iomap is valid at any point in time, not just at the point in
> > > > time a folio in the page cache has been locked...
> > >
> > > We don't need to validate an iomap "at any time". It's two specific
> > > places in the code in which we need to check, and we're not going to
> > > end up with ten more such places tomorrow.
> >
> > Not immediately, but that doesn't change the fact this is not a
> > filesystem specific issue - it's an inherent characteristic of
> > cached iomaps and unsynchronised extent state changes that occur
> > outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> > IO completion contexts).
> >
> > Racing mmap + buffered writes can expose these state changes as the
> > iomap buffered write IO path is not serialised against the iomap
> > mmap IO path except via folio locks. Hence a mmap page fault can
> > invalidate a cached buffered write iomap by causing a hole ->
> > unwritten, hole -> delalloc or hole -> written conversion in the
> > middle of the buffered write range. The buffered write still has a
> > hole mapping cached for that entire range, and it is now incorrect.
> >
> > If the mmap write happens to change extent state at the trailing
> > edge of a partial buffered write, data corruption will occur if we
> > race just right with writeback and memory reclaim. I'm pretty sure
> > that this corruption can be reproduced on gfs2 if we try hard enough
> > - generic/346 triggers the mmap/write race condition, all that is
> > needed from that point is for writeback and reclaiming pages at
> > exactly the right time...
> >
> > > I'd prefer to keep those
> > > filesystem internals in the filesystem specific code instead of
> > > exposing them to the iomap layer. But that's just me ...
> >
> > My point is that there is nothing XFS specific about these stale
> > cached iomap race conditions, nor is it specifically related to
> > folio locking. The folio locking inversions w.r.t. iomap caching and
> > the interactions with writeback and reclaim are simply the
> > manifestation that brought the issue to our attention.
> >
> > This is why I think hiding iomap validation in filesystem specific page
> > cache allocation/lookup functions is entirely the wrong layer to be
> > doing iomap validity checks. Especially as it prevents us from
> > adding more validity checks in the core infrastructure when we need
> > them in future.
> >
> > AFAIC, an iomap must carry with it a method for checking
> > that it is still valid. We need it in the write path, we need it in
> > the writeback path. If we want to relax the restrictions on clone
> > operations (e.g. shared locking on the source file), we'll need to
> > be able to detect stale cached iomaps in those paths, too. And I
> > haven't really thought through all the implications of shared
> > locking on buffered writes yet, but that may well require more
> > checks in other places as well.
> >
> > > If we ignore this particular commit for now, do you have any
> > > objections to the patches in this series? If not, it would be great if
> > > we could add the other patches to iomap-for-next.
> >
> > I still don't like moving page cache operations into individual
> > filesystems, but for the moment I can live with the IOMAP_NOCREATE
> > hack to drill iomap state through the filesystem without the
> > filesystem being aware of it.
> 
> Alright, works for me. Darrick?

Works for me too.

I've wondered if IOMAP_NOCREATE could be useful for more things (e.g.
determining if part of a file has been cached) though I've not thought
of a good use case for that.  Maybe something along the lines of a
"userspace wants us to redirty this critical file after fsync returned
EIO" type thing?

> > > By the way, I'm still not sure if gfs2 is affected by this whole iomap
> > > validation drama given that it neither implements unwritten extents
> > > nor delayed allocation. This is a mess.
> >
> > See above - I'm pretty sure it will be, but it may be very difficult
> > to expose. After all, it's taken several years before anyone noticed
> > this issue with XFS, even though we were aware of the issue of stale
> > cached iomaps causing data corruption in the writeback path....
> 
> Okay, that's all pretty ugly. Thanks a lot for the detailed explanation.

I don't have any objections to pulling everything except patches 8 and
10 for testing this week.  I find myself more in agreement with
Christoph and Andreas that whoever gets the folio is also responsible
for knowing if revalidating the mapping is necessary and then doing it.
However, I still have enough questions about the mapping revalidation to
make that a separate discussion.

Questions, namely:

1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
don't think it does, but OTOH zone pointer management might complicate
that.

2. How about porting the writeback iomap validation to use this
mechanism?  (I suspect Dave might already be working on this...)

3. Do we need to revalidate mappings for directio writes?  I think the
answer is no (for xfs) because the ->iomap_begin call will allocate
whatever blocks are needed and truncate/punch/reflink block on the
iolock while the directio writes are pending, so you'll never end up
with a stale mapping.  But I don't know if that statement applies
generally...

--D

> Cheers,
> Andreas
> 
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
Christoph Hellwig Jan. 18, 2023, 7:21 a.m. UTC | #8
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> I don't have any objections to pulling everything except patches 8 and
> 10 for testing this week. 

That would be great.  I now have a series to return the ERR_PTR
from __filemap_get_folio which will cause a minor conflict, but
I think that's easy enough for Linus to handle.

> 
> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> don't think it does, but OTOH zone pointer management might complicate
> that.

Adding Damien.

> 2. How about porting the writeback iomap validation to use this
> mechanism?  (I suspect Dave might already be working on this...)

What is "this mechanism"?  Do you mean the ->iomap_valid removed
here?  Writeback calls into ->map_blocks for every block while under
the folio lock, so the validation can be (and for XFS currently is)
done in that.  Moving it out into a separate method with extra
indirect function call overhead and interactions between the methods
seems like a retrograde step to me.

> 3. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.

Yes.
Damien Le Moal Jan. 18, 2023, 9:11 a.m. UTC | #9
On 1/18/23 16:21, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
>> I don't have any objections to pulling everything except patches 8 and
>> 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.
> 
>>
>> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
>> don't think it does, but OTOH zone pointer management might complicate
>> that.
> 
> Adding Damien.

zonefs has a static mapping of file blocks that never changes and is fully
populated up to the file's maximum size from mount. So zonefs is not using the
iomap_valid page operation. In fact, zonefs is not even using struct
iomap_page_ops.

> 
>> 2. How about porting the writeback iomap validation to use this
>> mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the ->iomap_valid removed
> here?  Writeback calls into ->map_blocks for every block while under
> the folio lock, so the validation can be (and for XFS currently is)
> done in that.  Moving it out into a separate method with extra
> indirect function call overhead and interactions between the methods
> seems like a retrograde step to me.
> 
>> 3. Do we need to revalidate mappings for directio writes?  I think the
>> answer is no (for xfs) because the ->iomap_begin call will allocate
>> whatever blocks are needed and truncate/punch/reflink block on the
>> iolock while the directio writes are pending, so you'll never end up
>> with a stale mapping.
> 
> Yes.
Darrick J. Wong Jan. 18, 2023, 7:04 p.m. UTC | #10
On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > I don't have any objections to pulling everything except patches 8 and
> > 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.

Ok, done.

> > 
> > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > don't think it does, but OTOH zone pointer management might complicate
> > that.
> 
> Adding Damien.
> 
> > 2. How about porting the writeback iomap validation to use this
> > mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the ->iomap_valid removed
> here?  Writeback calls into ->map_blocks for every block while under
> the folio lock, so the validation can be (and for XFS currently is)
> done in that.  Moving it out into a separate method with extra
> indirect function call overhead and interactions between the methods
> seems like a retrograde step to me.

Sorry, I should've been more specific -- can xfs writeback use the
validity cookie in struct iomap and thereby get rid of struct
xfs_writepage_ctx entirely?

> > 3. Do we need to revalidate mappings for directio writes?  I think the
> > answer is no (for xfs) because the ->iomap_begin call will allocate
> > whatever blocks are needed and truncate/punch/reflink block on the
> > iolock while the directio writes are pending, so you'll never end up
> > with a stale mapping.
> 
> Yes.

Er... yes as in "Yes, we *do* need to revalidate directio writes", or
"Yes, your reasoning is correct"?

--D
Andreas Grünbacher Jan. 18, 2023, 7:57 p.m. UTC | #11
On Wed, Jan 18, 2023 at 8:04 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> > On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > > I don't have any objections to pulling everything except patches 8 and
> > > 10 for testing this week.
> >
> > That would be great.  I now have a series to return the ERR_PTR
> > from __filemap_get_folio which will cause a minor conflict, but
> > I think that's easy enough for Linux to handle.
>
> Ok, done.
>
> > >
> > > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > > don't think it does, but OTOH zone pointer management might complicate
> > > that.
> >
> > Adding Damien.
> >
> > > 2. How about porting the writeback iomap validation to use this
> > > mechanism?  (I suspect Dave might already be working on this...)
> >
> > What is "this mechanism"?  Do you mean the ->iomap_valid removed
> > here?  Writeback calls into ->map_blocks for every block while under
> > the folio lock, so the validation can be (and for XFS currently is)
> > done in that.  Moving it out into a separate method with extra
> > indirect function call overhead and interactions between the methods
> > seems like a retrograde step to me.
>
> Sorry, I should've been more specific -- can xfs writeback use the
> validity cookie in struct iomap and thereby get rid of struct
> xfs_writepage_ctx entirely?

Already asked and answered in the same thread:

https://lore.kernel.org/linux-fsdevel/20230109225453.GQ1971568@dread.disaster.area/

> > > 3. Do we need to revalidate mappings for directio writes?  I think the
> > > answer is no (for xfs) because the ->iomap_begin call will allocate
> > > whatever blocks are needed and truncate/punch/reflink block on the
> > > iolock while the directio writes are pending, so you'll never end up
> > > with a stale mapping.
> >
> > Yes.
>
> Er... yes as in "Yes, we *do* need to revalidate directio writes", or
> "Yes, your reasoning is correct"?
>
> --D
Dave Chinner Jan. 18, 2023, 9:42 p.m. UTC | #12
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> 3. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.  But I don't know if that statement applies
> generally...

The issue is not truncate/punch/reflink for either DIO or buffered
IO - the issue that leads to stale iomaps is async extent state
changes, i.e. IO completion doing unwritten extent conversion.

For DIO, AIO doesn't hold the IOLOCK at all when completion is run
(like buffered writeback), but non-AIO DIO writes hold the IOLOCK
shared while waiting for completion. This means that we can have DIO
submission and completion still running concurrently, and so stale
iomaps are a definite possibility.

From my notes when I looked at this:

1. The race condition for a DIO write mapping going stale is an
overlapping DIO completion converting the block from unwritten
to written, and then the DIO write incorrectly issuing sub-block
zeroing because the mapping is now stale.

2. DIO read into a hole or unwritten extent zeroes the entire range
in the user buffer in one operation. If this is a large range, this
could race with small DIO writes within that range that have
completed.

3. There is a window between DIO write completion doing unwritten
extent conversion (by ->end_io) and the page cache being
invalidated, in which buffered read mappings can be stale and
incorrect read behaviour exposed to userspace.

These all stem from IO having overlapping ranges, which is largely
unsupported but can't be entirely prevented (e.g. backup
applications running in the background). Largely the problems are
confined to sub-block IOs, i.e. when sub-block DIO writes to the
same block are being performed, we have the possibility that one
write completes whilst the other is deciding what to zero, unaware
that the range is now MAPPED rather than UNWRITTEN.

We currently avoid issues with sub-block dio writes by using
IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the
unaligned IO fits entirely within a MAPPED extent so no sub-block
zeroing is required. If allocation or sub-block zeroing is required,
then we force the filesystem to fall back to exclusive IO locking
and wait for all concurrent DIO in flight to complete so that it
can't race with any other DIO write that might cause the map to
become stale while we are doing the zeroing.
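
A condensed sketch of that fallback, loosely modelled on
xfs_file_dio_write_unaligned() (the foo_* names and the elided locking
are placeholders; iomap_dio_rw(), IOMAP_DIO_OVERWRITE_ONLY,
IOMAP_DIO_FORCE_WAIT and inode_dio_wait() are the real interfaces):

static ssize_t foo_dio_write_unaligned(struct kiocb *iocb,
		struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* Fast path: shared locking, pure overwrite of MAPPED extents. */
	ret = iomap_dio_rw(iocb, from, &foo_iomap_ops, &foo_dio_write_ops,
			   IOMAP_DIO_OVERWRITE_ONLY, NULL, 0);
	if (ret != -EAGAIN)
		return ret;

	/*
	 * The overwrite-only attempt was rejected because allocation or
	 * sub-block zeroing would be needed.  Retry with the exclusive
	 * inode lock held and all in-flight DIO drained so the zeroing
	 * cannot race with a concurrent unwritten extent conversion.
	 * (Real code must also propagate -EAGAIN for IOCB_NOWAIT callers
	 * instead of retrying here.)
	 */
	/* ... take the exclusive inode rwsem here ... */
	inode_dio_wait(inode);
	return iomap_dio_rw(iocb, from, &foo_iomap_ops, &foo_dio_write_ops,
			    IOMAP_DIO_FORCE_WAIT, NULL, 0);
}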

This does not avoid potential issues with DIO write vs buffered
read, nor DIO write vs mmap IO. It's not totally clear to me
whether we need ->iomap_valid checks in the buffered read paths
to avoid the completion races with DIO writes, but there are windows
there where cached iomaps could be considered stale....

Cheers,

Dave.

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 006ddf933948..72dfbc3cb086 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -638,10 +638,9 @@  static int iomap_write_begin_inline(const struct iomap_iter *iter,
 static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio **foliop)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
-	int status = 0;
+	int status;
 
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
@@ -654,27 +653,12 @@  static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
 	folio = __iomap_get_folio(iter, pos, len);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
-
-	/*
-	 * Now we have a locked folio, before we do anything with it we need to
-	 * check that the iomap we have cached is not stale. The inode extent
-	 * mapping can change due to concurrent IO in flight (e.g.
-	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
-	 * reclaimed a previously partially written page at this index after IO
-	 * completion before this write reaches this file offset) and hence we
-	 * could do the wrong thing here (zero a page range incorrectly or fail
-	 * to zero) and corrupt data.
-	 */
-	if (page_ops && page_ops->iomap_valid) {
-		bool iomap_valid = page_ops->iomap_valid(iter->inode,
-							&iter->iomap);
-		if (!iomap_valid) {
+	if (IS_ERR(folio)) {
+		if (folio == ERR_PTR(-ESTALE)) {
 			iter->iomap.flags |= IOMAP_F_STALE;
-			status = 0;
-			goto out_unlock;
+			return 0;
 		}
+		return PTR_ERR(folio);
 	}
 
 	if (pos + len > folio_pos(folio) + folio_size(folio))
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 669c1bc5c3a7..d0bf99539180 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -62,29 +62,44 @@  xfs_iomap_inode_sequence(
 	return cookie | READ_ONCE(ip->i_df.if_seq);
 }
 
-/*
- * Check that the iomap passed to us is still valid for the given offset and
- * length.
- */
-static bool
-xfs_iomap_valid(
-	struct inode		*inode,
-	const struct iomap	*iomap)
+static struct folio *
+xfs_get_folio(
+	struct iomap_iter	*iter,
+	loff_t			pos,
+	unsigned		len)
 {
+	struct inode		*inode = iter->inode;
+	struct iomap		*iomap = &iter->iomap;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct folio		*folio;
 
+	folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio))
+		return folio;
+
+	/*
+	 * Now that we have a locked folio, we need to check that the iomap we
+	 * have cached is not stale.  The inode extent mapping can change due to
+	 * concurrent IO in flight (e.g., IOMAP_UNWRITTEN state can change and
+	 * memory reclaim could have reclaimed a previously partially written
+	 * page at this index after IO completion before this write reaches
+	 * this file offset) and hence we could do the wrong thing here (zero a
+	 * page range incorrectly or fail to zero) and corrupt data.
+	 */
 	if (iomap->validity_cookie !=
 			xfs_iomap_inode_sequence(ip, iomap->flags)) {
 		trace_xfs_iomap_invalid(ip, iomap);
-		return false;
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-ESTALE);
 	}
 
 	XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
-	return true;
+	return folio;
 }
 
 const struct iomap_page_ops xfs_iomap_page_ops = {
-	.iomap_valid		= xfs_iomap_valid,
+	.get_folio		= xfs_get_folio,
 };
 
 int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index da226032aedc..0ae2cddbedd6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -134,29 +134,18 @@  static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * When get_folio succeeds, put_folio will always be called to do any
  * cleanup work necessary.  put_folio is responsible for unlocking and putting
  * @folio.
+ *
+ * When an iomap is created, the filesystem can store internal state (e.g., a
+ * sequence number) in iomap->validity_cookie.  The get_folio handler can use
+ * this validity cookie to detect when the iomap needs to be refreshed because
+ * it is no longer up to date.  In that case, the function should return
+ * ERR_PTR(-ESTALE) to retry the operation with a fresh mapping.
  */
 struct iomap_page_ops {
 	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
-
-	/*
-	 * Check that the cached iomap still maps correctly to the filesystem's
-	 * internal extent map. FS internal extent maps can change while iomap
-	 * is iterating a cached iomap, so this hook allows iomap to detect that
-	 * the iomap needs to be refreshed during a long running write
-	 * operation.
-	 *
-	 * The filesystem can store internal state (e.g. a sequence number) in
-	 * iomap->validity_cookie when the iomap is first mapped to be able to
-	 * detect changes between mapping time and whenever .iomap_valid() is
-	 * called.
-	 *
-	 * This is called with the folio over the specified file position held
-	 * locked by the iomap code.
-	 */
-	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /*