
[RFC,2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Message ID f5e84d3a63de30def2f3800f534d14389f6ba137.1700506526.git.ritesh.list@gmail.com (mailing list archive)
State New
Series ext2: Use iomap in buffered-io for regular files and enable large folio support

Commit Message

Ritesh Harjani (IBM) Nov. 20, 2023, 7:05 p.m. UTC
This patch converts ext2 regular file's buffered-io path to use iomap.
- buffered-io path using iomap_file_buffered_write
- DIO fallback to buffered-io now uses iomap_file_buffered_write
- writeback path now uses a new aops - ext2_file_aops
- truncate now uses iomap_truncate_page
- mmap path of ext2 continues to use generic_file_vm_ops

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/file.c  | 20 +++++++++++++--
 fs/ext2/inode.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 80 insertions(+), 8 deletions(-)

Comments

Matthew Wilcox Nov. 20, 2023, 8 p.m. UTC | #1
On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> +	return iomap_readahead(rac, &ext2_iomap_ops);

We generally prefer to omit the 'return' when the function returns void
(it's a GCC extension not to warn about it, so it's not actually a bug).
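A tiny self-contained sketch of the style point (function names hypothetical, standing in for the readahead helpers in the patch):

```c
#include <assert.h>

static int readahead_calls;

static void do_readahead(void)
{
    readahead_calls++;
}

/* GCC accepts 'return do_readahead();' in a void function without
 * warning, but the preferred kernel style is a plain call: */
static void file_readahead(void)
{
    do_readahead();
}
```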

This all looks marvellously simple.  Good job!
Christoph Hellwig Nov. 21, 2023, 4:44 a.m. UTC | #2
On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
> - mmap path of ext2 continues to use generic_file_vm_ops

So this means there are no space reservations taken at write fault
time.  While iomap was written with the assumption those exist, I can't
instantly spot anything that relies on them - you are just a lot more
likely to hit an -ENOSPC from ->map_blocks now.  Maybe we should add
an xfstest that covers the case where we use up more than the existing
free space through writable mmaps to ensure it fully works (where works
means at least as good as the old behavior)?

> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret = 0;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	inode_lock(inode);
> +	ret = generic_write_checks(iocb, from);
> +	if (ret > 0)
> +		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> +	inode_unlock(inode);
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +	return ret;
> +}

Not for now, but if we end up doing a bunch of conversions of trivial
file systems, it might be worth eventually adding a wrapper for the
above code with just the iomap_ops passed in.  Only once we see a few
duplicates, though.
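Such a wrapper would hoist exactly the control flow quoted above, with only the iomap_ops as a parameter. As a rough user-space model of that flow (all names hypothetical; the kernel helpers are stubbed out, this is a sketch of the locking/ordering logic, not the real API):

```c
#include <assert.h>

static int inode_locked;

/* Stubs standing in for inode_lock()/inode_unlock(). */
static void inode_lock_stub(void)   { inode_locked = 1; }
static void inode_unlock_stub(void) { inode_locked = 0; }

/* Stand-in for generic_write_checks(): passes the byte count through. */
static long write_checks_stub(long count) { return count; }

/* Stand-in for iomap_file_buffered_write(): must run under the lock. */
static long buffered_write_stub(long count)
{
    assert(inode_locked);
    return count;
}

/* Stand-in for generic_write_sync(): must run after unlock. */
static long write_sync_stub(long written)
{
    assert(!inode_locked);
    return written;
}

/* Models ext2_buffered_write_iter(): lock, check, write, unlock, sync;
 * a non-positive return from the checks skips the write and the sync. */
static long buffered_write_iter_model(long count)
{
    long ret;

    inode_lock_stub();
    ret = write_checks_stub(count);
    if (ret > 0)
        ret = buffered_write_stub(ret);
    inode_unlock_stub();
    if (ret > 0)
        ret = write_sync_stub(ret);
    return ret;
}
```

A generic helper along these lines could take the iomap_ops as a parameter, leaving each trivial filesystem with a one-line body.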

> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> +				 struct inode *inode, loff_t offset)
> +{
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length)
> +		return 0;
> +
> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +				IOMAP_WRITE, &wpc->iomap, NULL);
> +}

Looking at gfs2 this also might become a pattern.  But I'm fine with
waiting for now.
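The pattern here - reuse the cached extent in wpc while it still covers the offset, otherwise ask the filesystem for a new mapping - can be sketched as a self-contained model (hypothetical names, user-space stand-ins for the kernel structures):

```c
#include <assert.h>

/* Model of the mapping cached in struct iomap_writepage_ctx. */
struct wpc_model {
    long long offset;   /* start of the cached extent */
    long long length;   /* bytes covered (0 = nothing cached yet) */
    int map_calls;      /* how often we had to go back to the fs */
};

/* Stand-in for ext2_iomap_begin(): maps 'blocksize' bytes at 'offset'. */
static void iomap_begin_model(struct wpc_model *wpc, long long offset,
                              long long blocksize)
{
    wpc->offset = offset;
    wpc->length = blocksize;
    wpc->map_calls++;
}

/* Models ext2_write_map_blocks(): a cache hit means no new mapping call. */
static void write_map_blocks_model(struct wpc_model *wpc, long long offset,
                                   long long blocksize)
{
    if (offset >= wpc->offset && offset < wpc->offset + wpc->length)
        return;
    iomap_begin_model(wpc, offset, blocksize);
}
```

Note that when each mapping call returns exactly one block, consecutive block offsets always miss the cache; the check only starts paying off once the mapping call can return extents longer than the requested length.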

> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode)
>  	if (IS_DAX(inode))
>  		inode->i_mapping->a_ops = &ext2_dax_aops;
>  	else
> -		inode->i_mapping->a_ops = &ext2_aops;
> +		inode->i_mapping->a_ops = &ext2_file_aops;
>  }

Did you run into issues in using the iomap based aops for the other uses
of ext2_aops, or are you just trying to address the users one at a time?
Ritesh Harjani (IBM) Nov. 21, 2023, 5:56 a.m. UTC | #3
Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
>> - mmap path of ext2 continues to use generic_file_vm_ops
>
> So this means there are no space reservations taken at write fault
> time.

Yes.

> While iomap was written with the assumption those exist, I can't
> instantly spot anything that relies on them - you are just a lot more
> likely to hit an -ENOSPC from ->map_blocks now.

Which is also true with the existing code, no? If the block reservation is
not done at the write fault, writeback is likely to fail due to ENOSPC?

> Maybe we should add
> an xfstest that covers the case where we use up more than the existing
> free space through writable mmaps to ensure it fully works (where works
> means at least as good as the old behavior)?
>

Sure, I can try writing an fstest for this.

Also I did find an old thread which tried implementing ->page_mkwrite in
ext2 [1]. However, it was rejected with the following reason:

"""
Allocating
blocks on write fault has the unwelcome impact that the file layout is
likely going to be much worse (much more fragmented) - I remember getting
some reports about performance regressions from users back when I did a
similar change for ext3. And so I'm reluctant to change behavior of such
an old legacy filesystem as ext2...
"""

https://lore.kernel.org/linux-ext4/20210105175348.GE15080@quack2.suse.cz/



>> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
>> +					struct iov_iter *from)
>> +{
>> +	ssize_t ret = 0;
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +	inode_lock(inode);
>> +	ret = generic_write_checks(iocb, from);
>> +	if (ret > 0)
>> +		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
>> +	inode_unlock(inode);
>> +	if (ret > 0)
>> +		ret = generic_write_sync(iocb, ret);
>> +	return ret;
>> +}
>
> Not for now, but if we end up doing a bunch of conversions of trivial
> file systems, it might be worth eventually adding a wrapper for the
> above code with just the iomap_ops passed in.  Only once we see a few
> duplicates, though.
>

Sure, makes sense. Thanks!
I can try and check if the wrapper helps.

>> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>> +				 struct inode *inode, loff_t offset)
>> +{
>> +	if (offset >= wpc->iomap.offset &&
>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>> +		return 0;
>> +
>> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
>> +				IOMAP_WRITE, &wpc->iomap, NULL);
>> +}
>
> Looking at gfs2 this also might become a pattern.  But I'm fine with
> waiting for now.
>

ok.

>> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode)
>>  	if (IS_DAX(inode))
>>  		inode->i_mapping->a_ops = &ext2_dax_aops;
>>  	else
>> -		inode->i_mapping->a_ops = &ext2_aops;
>> +		inode->i_mapping->a_ops = &ext2_file_aops;
>>  }
>
> Did you run into issues in using the iomap based aops for the other uses
> of ext2_aops, or are you just trying to address the users one at a time?

There are problems, e.g., for the dir type in ext2. It uses the pagecache
for directories, and it uses buffer_heads and attaches them to folio->private
    ...via block_write_begin()/block_write_end() calls.
    Look for ext2_make_empty() -> ext2_prepare_chunk() ->
    block_write_begin().
Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we
might take the iomap writeback path (if using ext2_file_aops for dir),
which sees folio->private and assumes it is "struct iomap_folio_state".
And bad things will happen...

Now we don't have equivalent APIs in iomap for
block_write_begin()/end() which users can call. Hence, Jan
suggested first converting the ext2 regular file path to iomap as an RFC.
I shall now check the dir type to see what changes we might require
in ext2/iomap code.


Thanks again for your review!

-ritesh
Christoph Hellwig Nov. 21, 2023, 6:08 a.m. UTC | #4
On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
> > instantly spot anything that relies on them - you are just a lot more
> > likely to hit an -ENOSPC from ->map_blocks now.
> 
> Which is also true with the existing code, no? If the block reservation is
> not done at the write fault, writeback is likely to fail due to ENOSPC?

Yes. Not saying you should change this, I just want to make sure the
iomap code handles this fine.  I think it does, but I'd rather be sure.

> Sure, makes sense. Thanks!
> I can try and check if the wrapper helps.

Let's wait until we have a few more conversions.

> > Did you run into issues in using the iomap based aops for the other uses
> > of ext2_aops, or are you just trying to address the users one at a time?
> 
> There are problems, e.g., for the dir type in ext2. It uses the pagecache
> for directories, and it uses buffer_heads and attaches them to folio->private
>     ...via block_write_begin()/block_write_end() calls.
>     Look for ext2_make_empty() -> ext2_prepare_chunk() ->
>     block_write_begin().
> Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we
> might take the iomap writeback path (if using ext2_file_aops for dir),
> which sees folio->private and assumes it is "struct iomap_folio_state".
> And bad things will happen...

Oh, indeed, bufferheads again.

> Now we don't have equivalent APIs in iomap for
> block_write_begin()/end() which users can call. Hence, Jan
> suggested first converting the ext2 regular file path to iomap as an RFC.

Yes, no problem.  But maybe worth documenting in the commit log.
Ritesh Harjani (IBM) Nov. 21, 2023, 6:15 a.m. UTC | #5
Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
>> > instantly spot anything that relies on them - you are just a lot more
>> > likely to hit an -ENOSPC from ->map_blocks now.
>> 
>> Which is also true with the existing code, no? If the block reservation is
>> not done at the write fault, writeback is likely to fail due to ENOSPC?
>
> Yes. Not saying you should change this, I just want to make sure the
> iomap code handles this fine.  I think it does, but I'd rather be sure.
>

Sure. I can write an fstest to test the behavior.

>> Sure, makes sense. Thanks!
>> I can try and check if the wrapper helps.
>
> Let's wait until we have a few more conversions.
>

Sure.

>> > Did you run into issues in using the iomap based aops for the other uses
>> > of ext2_aops, or are you just trying to address the users one at a time?
>> 
>> There are problems, e.g., for the dir type in ext2. It uses the pagecache
>> for directories, and it uses buffer_heads and attaches them to folio->private
>>     ...via block_write_begin()/block_write_end() calls.
>>     Look for ext2_make_empty() -> ext2_prepare_chunk() ->
>>     block_write_begin().
>> Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we
>> might take the iomap writeback path (if using ext2_file_aops for dir),
>> which sees folio->private and assumes it is "struct iomap_folio_state".
>> And bad things will happen...
>
> Oh, indeed, bufferheads again.
>
>> Now we don't have equivalent APIs in iomap for
>> block_write_begin()/end() which users can call. Hence, Jan
>> suggested first converting the ext2 regular file path to iomap as an RFC.
>
> Yes, no problem.  But maybe worth documenting in the commit log.

Sure, I will update the commit log.

-ritesh
Jan Kara Nov. 22, 2023, 12:29 p.m. UTC | #6
On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote:
> This patch converts ext2 regular file's buffered-io path to use iomap.
> - buffered-io path using iomap_file_buffered_write
> - DIO fallback to buffered-io now uses iomap_file_buffered_write
> - writeback path now uses a new aops - ext2_file_aops
> - truncate now uses iomap_truncate_page
> - mmap path of ext2 continues to use generic_file_vm_ops
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Looks nice and simple. Just one comment below:

> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> +	return iomap_readahead(rac, &ext2_iomap_ops);
> +}

As Matthew noted, the return of void here looks strange.

> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> +				 struct inode *inode, loff_t offset)
> +{
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length)
> +		return 0;
> +
> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +				IOMAP_WRITE, &wpc->iomap, NULL);
> +}

So this will end up mapping blocks for writeback one block at a time if I'm
understanding things right because ext2_iomap_begin() never sets extent
larger than 'length' in the iomap result. Hence the wpc checking looks
pointless (and actually dangerous - see below). So this will probably get
more expensive than necessary by calling into ext2_get_blocks() for each
block? Perhaps we could first call ext2_iomap_begin() for reading with high
length to get as many mapped blocks as we can and if we find unmapped block
(should happen only for writes through mmap), we resort to what you have
here... Hum, but this will not work because of the races with truncate
which could remove blocks whose mapping we have cached in wpc. We can
safely provide a mapping under a folio only once it is locked or has
writeback bit set. XFS plays the revalidation sequence counter games
because of this so we'd have to do something similar for ext2. Not that I'd
care as much about ext2 writeback performance but it should not be that
hard and we'll definitely need some similar solution for ext4 anyway. Can
you give that a try (as a followup "performance improvement" patch).

								Honza
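The revalidation scheme sketched above - a per-inode sequence counter bumped whenever the block mapping changes, sampled when a mapping is cached, and re-checked before the cached mapping is trusted - can be modeled in a few self-contained lines (hypothetical names; this is only a user-space sketch of the idea, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-inode counter, bumped on every change to the block mapping
 * (block allocation/free, truncate). */
struct inode_model {
    unsigned long ib_seq;
};

/* A cached mapping remembers the counter value seen when it was taken. */
struct cached_map {
    long long offset;
    long long length;
    unsigned long seq;
};

/* Stand-in for the fs mapping call: sample the counter, which in the
 * kernel would happen with the mapping locks held. */
static void map_blocks_model(struct inode_model *inode,
                             struct cached_map *map,
                             long long offset, long long length)
{
    map->offset = offset;
    map->length = length;
    map->seq = inode->ib_seq;
}

/* Truncate (or any other remapping) invalidates all cached mappings. */
static void truncate_model(struct inode_model *inode)
{
    inode->ib_seq++;
}

/* Writeback, once it holds the locked folio, only trusts the cached
 * mapping if the counter is unchanged; otherwise it must re-map. */
static bool map_still_valid(const struct inode_model *inode,
                            const struct cached_map *map)
{
    return map->seq == inode->ib_seq;
}
```

Map, then truncate, then check: the mapping comes back invalid, forcing a re-map under the folio lock - the same shape as the ib_seq PoC posted later in this thread.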
Christoph Hellwig Nov. 22, 2023, 1:11 p.m. UTC | #7
On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> writeback bit set. XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).

Darrick has mentioned that he is looking into lifting more of the
validation sequence counter validation into iomap.

In the meantime I have a series here that at least maps multiple blocks
inside a folio in a single go, which might be worth trying with ext2 as
well:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
Ritesh Harjani (IBM) Nov. 22, 2023, 8:25 p.m. UTC | #8
Jan Kara <jack@suse.cz> writes:

> On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote:
>> This patch converts ext2 regular file's buffered-io path to use iomap.
>> - buffered-io path using iomap_file_buffered_write
>> - DIO fallback to buffered-io now uses iomap_file_buffered_write
>> - writeback path now uses a new aops - ext2_file_aops
>> - truncate now uses iomap_truncate_page
>> - mmap path of ext2 continues to use generic_file_vm_ops
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> Looks nice and simple. Just one comment below:
>
>> +static void ext2_file_readahead(struct readahead_control *rac)
>> +{
>> +	return iomap_readahead(rac, &ext2_iomap_ops);
>> +}
>
> As Matthew noted, the return of void here looks strange.
>

yes, I will fix it.

>> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>> +				 struct inode *inode, loff_t offset)
>> +{
>> +	if (offset >= wpc->iomap.offset &&
>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>> +		return 0;
>> +
>> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
>> +				IOMAP_WRITE, &wpc->iomap, NULL);
>> +}
>
> So this will end up mapping blocks for writeback one block at a time if I'm
> understanding things right because ext2_iomap_begin() never sets extent
> larger than 'length' in the iomap result. Hence the wpc checking looks
> pointless (and actually dangerous - see below). So this will probably get
> more expensive than necessary by calling into ext2_get_blocks() for each
> block? Perhaps we could first call ext2_iomap_begin() for reading with high
> length to get as many mapped blocks as we can and if we find unmapped block
> (should happen only for writes through mmap), we resort to what you have
> here... Hum, but this will not work because of the races with truncate
> which could remove blocks whose mapping we have cached in wpc. We can
> safely provide a mapping under a folio only once it is locked or has
> writeback bit set. XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).
>

Yes, looking into ->map_blocks was on my todo list, since this RFC only
maps 1 block at a time, which can be expensive. I missed adding that as a
comment in the cover letter.

But also thanks for providing many details on the same. I am taking a
look at it and will get back with more details on how we can get this
working, as it will be required for ext4 anyway.

-ritesh
Ritesh Harjani (IBM) Nov. 22, 2023, 8:26 p.m. UTC | #9
Christoph Hellwig <hch@infradead.org> writes:

> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> writeback bit set. XFS plays the revalidation sequence counter games
>> because of this so we'd have to do something similar for ext2. Not that I'd
>> care as much about ext2 writeback performance but it should not be that
>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> you give that a try (as a followup "performance improvement" patch).
>
> Darrick has mentioned that he is looking into lifting more of the
> validation sequence counter validation into iomap.
>
> In the meantime I have a series here that at least maps multiple blocks
> inside a folio in a single go, which might be worth trying with ext2 as
> well:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks

Sure, thanks for providing details. I will check this.

-ritesh
Dave Chinner Nov. 22, 2023, 10:26 p.m. UTC | #10
On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> > writeback bit set. XFS plays the revalidation sequence counter games
> > because of this so we'd have to do something similar for ext2. Not that I'd
> > care as much about ext2 writeback performance but it should not be that
> > hard and we'll definitely need some similar solution for ext4 anyway. Can
> > you give that a try (as a followup "performance improvement" patch).
> 
> Darrick has mentioned that he is looking into lifting more of the
> validation sequence counter validation into iomap.

I think that was me, as part of aligning the writeback path with
the ->iomap_valid() checks in the write path after we lock the folio
we instantiated for the write.

It's basically the same thing - once we have a locked folio, we have
to check that the cached iomap is still valid before we use it for
anything.

I need to find the time to get back to that, though.

-Dave.
Darrick J. Wong Nov. 23, 2023, 4:09 a.m. UTC | #11
On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote:
> > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> > > writeback bit set. XFS plays the revalidation sequence counter games
> > > because of this so we'd have to do something similar for ext2. Not that I'd
> > > care as much about ext2 writeback performance but it should not be that
> > > hard and we'll definitely need some similar solution for ext4 anyway. Can
> > > you give that a try (as a followup "performance improvement" patch).
> > 
> > Darrick has mentioned that he is looking into lifting more of the
> > validation sequence counter validation into iomap.
> 
> I think that was me, as part of aligning the writeback path with
> the ->iomap_valid() checks in the write path after we lock the folio
> we instantiated for the write.
> 
> It's basically the same thing - once we have a locked folio, we have
> to check that the cached iomap is still valid before we use it for
> anything.
> 
> I need to find the time to get back to that, though.

Heh, we probably both have been chatting with willy on and off about
iomap.

The particular idea I had is to add a u64 counter to address_space that
we can bump in the same places where we bump xfs_inode_fork::if_seq
right now.  ->iomap_begin would sample this address_space::i_mappingseq
counter (with locks held), and now buffered writes and writeback can
check iomap::mappingseq == address_space::i_mappingseq to decide if it's
time to revalidate.

Anyway, I'll have time to go play with that (and further purging of
function pointers) next week or whenever is "after I put out v28 of
online repair".  ATM I have a rewrite of log intent recovery cooking
too, but that's going to need at least a week or two of recoveryloop
testing before I put that on the list.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Christoph Hellwig Nov. 23, 2023, 7:02 a.m. UTC | #12
On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> I think that was me

No, it was Darrick.  We talked about a lot of things, but not this :)

> , as part of aligning the writeback path with
> the ->iomap_valid() checks in the write path after we lock the folio
> we instantiated for the write.
> 
> It's basically the same thing - once we have a locked folio, we have
> to check that the cached iomap is still valid before we use it for
> anything.

Yes.  Currently we do that in the wb ops ->map_blocks.  This can get
called multiple times per folio, which is a bit silly.  With the series
I just posted the link to we at least stop doing that if the folio
is mapped contiguously, which solves all practical cases, as the
sequence check is almost free compared to the actual block mapping.

For steps beyond that I'll reply to Darrick's mail.
Christoph Hellwig Nov. 23, 2023, 7:09 a.m. UTC | #13
On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> The particular idea I had is to add a u64 counter to address_space that
> we can bump in the same places where we bump xfs_inode_fork::if_seq
> right now.  ->iomap_begin would sample this address_space::i_mappingseq
> counter (with locks held), and now buffered writes and writeback can
> check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> time to revalidate.

So I think moving this to the VFS is probably a good idea, and I
actually argued for that when the sequence checking was first proposed.
We just have to be careful to be able to map things like the two
separate data and cow seq counts in XFS (or anything else complicated
in other file systems) to it.

> Anyway, I'll have time to go play with that (and further purging of
> function pointers)

Do we have anything where the function pointer overhead is actually
hurting us right now?

One thing I'd like to move to is to merge the iomap_begin and iomap_end
callbacks into one similar to willy's series from 2020.  The big
benefit of that would be that (together with switching
write_cache_pages to an iterator model) that we could actually use
this single iterator callback also for writeback instead of
->map_blocks, which doesn't really work with the current begin/end
based iomap_iter as the folios actually written through
write_cache_pages might not be contiguous.  Using the same mapping
callback would not only save some code duplication, but should also
allow us to nicely implement Dave's old idea to not dirty pages for
O_SYNC writes, but directly write them out.  I did start prototyping
that in the last days, and iomap_begin vs map_blocks is currently
the biggest stumbling block.
Darrick J. Wong Nov. 29, 2023, 5:37 a.m. UTC | #14
On Wed, Nov 22, 2023 at 11:09:17PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> > The particular idea I had is to add a u64 counter to address_space that
> > we can bump in the same places where we bump xfs_inode_fork::if_seq
> > right now.  ->iomap_begin would sample this address_space::i_mappingseq
> > counter (with locks held), and now buffered writes and writeback can
> > check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> > time to revalidate.
> 
> So I think moving this to the VFS is probably a good idea, and I
> actually argued for that when the sequence checking was first proposed.
> We just have to be careful to be able to map things like the two
> separate data and cow seq counts in XFS (or anything else complicated
> in other file systems) to it.

TBH I've been wondering what would happen if we bumped i_mappingseq on
updates of either data or cow fork instead of the shift+or'd thing that
we use now for writeback and/or pagecache write.

I suppose the nice thing about the current encodings is that we elide
revalidations when the cow fork changes but the mapping isn't shared.

> > Anyway, I'll have time to go play with that (and further purging of
> > function pointers)
> 
> Do we have anything where the function pointer overhead is actually
> hurting us right now?

Not that I know of, but moving to a direct call model means that the fs
would know based on the iomap_XXX_iter function signature whether or not
iomap needs a srcmap; and then it can modify its iomap_begin function
accordingly.

Right now all those rules aren't especially obvious or well documented.
Maybe I can convince myself that improved documentation will suffice to
eliminate Ted's confusion. :)

Also I haven't checked how much the indirect calls hurt.

> One thing I'd like to move to is to merge the iomap_begin and iomap_end
> callbacks into one similar to willy's series from 2020.  The big

Got a link to that?  I need my memory refreshed, having DROP TABLE MEM2020;
pretty please.

> benefit of that would be that (together with switching
> write_cache_pages to an iterator model) that we could actually use
> this single iterator callback also for writeback instead of
> ->map_blocks, which doesn't really work with the current begin/end
> based iomap_iter as the folios actually written through
> write_cache_pages might not be contiguous.

Ooh, it'd be nice to get rid of that parallel callbacks thing finally.

>  Using the same mapping
> callback would not only save some code duplication, but should also
> allow us to nicely implement Dave's old idea to not dirty pages for
> O_SYNC writes, but directly write them out.  I did start prototyping
> that in the last days, and iomap_begin vs map_blocks is currently
> the biggest stumbling block.

Neat!  willy's been pushing me for that too.

--D
Christoph Hellwig Nov. 29, 2023, 6:32 a.m. UTC | #15
On Tue, Nov 28, 2023 at 09:37:21PM -0800, Darrick J. Wong wrote:
> TBH I've been wondering what would happen if we bumped i_mappingseq on
> updates of either data or cow fork instead of the shift+or'd thing that
> we use now for writeback and/or pagecache write.
> 
> I suppose the nice thing about the current encodings is that we elide
> revalidations when the cow fork changes but mapping isn't shared.

changed?  But yes.

> 
> > > Anyway, I'll have time to go play with that (and further purging of
> > > function pointers)
> > 
> > Do we have anything where the function pointer overhead is actually
> > hurting us right now?
> 
> Not that I know of, but moving to a direct call model means that the fs
> would know based on the iomap_XXX_iter function signature whether or not
> iomap needs a srcmap; and then it can modify its iomap_begin function
> accordingly.
> 
> Right now all those rules aren't especially obvious or well documented.
> Maybe I can convince myself that improved documentation will suffice to
> eliminate Ted's confusion. :)

Well, with an iter model I think we can actually kill the srcmap
entirely, as we could be doing two separate overlapping iterations at
the same time.  Which would probably be nice for large unaligned writes
as the write size won't be limited by the existing mapping only used
to read in the unaligned blocks at the beginning and end.

> > One thing I'd like to move to is to merge the iomap_begin and iomap_end
> > callbacks into one similar to willy's series from 2020.  The big
> 
> Got a link to that?  I need my memory refreshed, having DROP TABLE MEM2020;
> pretty please.

https://lore.kernel.org/all/20200811205314.GF6107@magnolia/T/
Dave Chinner Nov. 29, 2023, 9:19 a.m. UTC | #16
On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> > On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote:
> > > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> > > > writeback bit set. XFS plays the revalidation sequence counter games
> > > > because of this so we'd have to do something similar for ext2. Not that I'd
> > > > care as much about ext2 writeback performance but it should not be that
> > > > hard and we'll definitely need some similar solution for ext4 anyway. Can
> > > > you give that a try (as a followup "performance improvement" patch).
> > > 
> > > Darrick has mentioned that he is looking into lifting more of the
> > > validation sequence counter validation into iomap.
> > 
> > I think that was me, as part of aligning the writeback path with
> > the ->iomap_valid() checks in the write path after we lock the folio
> > we instantiated for the write.
> > 
> > It's basically the same thing - once we have a locked folio, we have
> > to check that the cached iomap is still valid before we use it for
> > anything.
> > 
> > I need to find the time to get back to that, though.
> 
> Heh, we probably both have been chatting with willy on and off about
> iomap.
> 
> The particular idea I had is to add a u64 counter to address_space that
> we can bump in the same places where we bump xfs_inode_fork::if_seq
> right now.  ->iomap_begin would sample this address_space::i_mappingseq
> counter (with locks held), and now buffered writes and writeback can
> check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> time to revalidate.

Can't say I'm a great fan of putting filesystem physical extent map
state cookies in the page cache address space.

One of the primary architectural drivers for iomap was to completely
separate the filesystem extent mapping information from the page
cache internals and granularities, so this kinda steps over an
architectural boundary in my mind.

Also, filesystem mapping operations move further away from the VFS
structures into deep filesystem internals - they do not interact with
the page cache structures at all. Hence requiring physical extent
mapping operations to poke values into the page cache address
space structure just seems like unnecessarily long pointer chasing
to me.

That said, I have no problems with extent sequence counters in the
VFS inode, but I just don't think they belong in the page cache
address space....

-Dave.
Ritesh Harjani (IBM) Nov. 30, 2023, 3:24 a.m. UTC | #17
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Christoph Hellwig <hch@infradead.org> writes:
>
>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>> writeback bit set. XFS plays the revalidation sequence counter games
>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>> care as much about ext2 writeback performance but it should not be that
>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>> you give that a try (as a followup "performance improvement" patch).
>>
>> Darrick has mentioned that he is looking into lifting more of the
>> validation sequence counter validation into iomap.
>>
>> In the meantime I have a series here that at least maps multiple blocks
>> inside a folio in a single go, which might be worth trying with ext2 as
>> well:
>>
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>
> Sure, thanks for providing details. I will check this.
>

So I took a look at this. Here is what I think -
So this is useful, of course, when we have a large folio, because
otherwise it's just one block at a time for each folio. This is not a
problem once the FS buffered-io handling code moves to iomap (because we
can then enable large folio support for it).

However, this would still require us to pass a folio to the ->map_blocks
call to determine the size of the folio (which I am not saying can't be
done, just stating my observations here).

Now coming to implementing the validation seq check. I am hoping it should
be straightforward, at least for ext2, because it mostly just has to
protect against the truncate operation (which can change the block mapping)...

...ok so here is the PoC for seq counter check for ext2. Next let me
try to see if we can lift this up from the FS side to iomap - 


From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks

This implements an inode block seq counter (ib_seq) which helps in validating
whether the cached wpc (struct iomap_writepage_ctx) still has valid
entries or not. The block mapping can get changed, e.g. due to a racing truncate.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/balloc.c |  1 +
 fs/ext2/ext2.h   |  6 ++++++
 fs/ext2/inode.c  | 51 +++++++++++++++++++++++++++++++++++++++++++-----
 fs/ext2/super.c  |  2 +-
 4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index e124f3d709b2..e97040194da4 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
 	}
 
 	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
+	ext2_inc_ib_seq(EXT2_I(inode));
 
 do_more:
 	overflow = 0;
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 677a9ad45dcb..882c14d20183 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -663,6 +663,7 @@ struct ext2_inode_info {
 	struct rw_semaphore xattr_sem;
 #endif
 	rwlock_t i_meta_lock;
+	unsigned int ib_seq;
 
 	/*
 	 * truncate_mutex is for serialising ext2_truncate() against
@@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
 	return container_of(inode, struct ext2_inode_info, vfs_inode);
 }
 
+static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
+{
+	WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
+}
+
 /* balloc.c */
 extern int ext2_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f4e0b9a93095..a5490d641c26 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
 	ext2_fsblk_t current_block = 0;
 	int ret = 0;
 
+	ext2_inc_ib_seq(EXT2_I(inode));
+
 	/*
 	 * Here we try to allocate the requested multiple blocks at once,
 	 * on a best-effort basis.
@@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+struct ext2_writepage_ctx {
+	struct iomap_writepage_ctx ctx;
+	unsigned int		ib_seq;
+};
+
+static inline
+struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
+{
+	return container_of(ctx, struct ext2_writepage_ctx, ctx);
+}
+
+static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
+			    loff_t offset)
+{
+	if (offset < wpc->iomap.offset ||
+	    offset >= wpc->iomap.offset + wpc->iomap.length)
+		return false;
+
+	if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
+		return false;
+
+	return true;
+}
+
 static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
 				 struct inode *inode, loff_t offset)
 {
-	if (offset >= wpc->iomap.offset &&
-	    offset < wpc->iomap.offset + wpc->iomap.length)
+	loff_t maxblocks = (loff_t)INT_MAX;
+	u8 blkbits = inode->i_blkbits;
+	u32 bno;
+	bool new, boundary;
+	int ret;
+
+	if (ext2_imap_valid(wpc, inode, offset))
 		return 0;
 
-	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
+
+	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
+			      &bno, &new, &boundary, 0);
+	if (ret < 0)
+		return ret;
+	/*
+	 * ret can be 0 in case of a hole which is possible for mmaped writes.
+	 */
+	ret = ret ? ret : 1;
+	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
 				IOMAP_WRITE, &wpc->iomap, NULL);
 }
 
@@ -984,9 +1025,9 @@ static const struct iomap_writeback_ops ext2_writeback_ops = {
 static int ext2_file_writepages(struct address_space *mapping,
 				struct writeback_control *wbc)
 {
-	struct iomap_writepage_ctx wpc = { };
+	struct ext2_writepage_ctx wpc = { };
 
-	return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+	return iomap_writepages(mapping, wbc, &wpc.ctx, &ext2_writeback_ops);
 }
 
 static int
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 645ee6142f69..cd1d1678e35b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
 #ifdef CONFIG_QUOTA
 	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
-
+	WRITE_ONCE(ei->ib_seq, 0);
 	return &ei->vfs_inode;
 }
 

2.30.2
Matthew Wilcox Nov. 30, 2023, 4:22 a.m. UTC | #18
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>  	struct rw_semaphore xattr_sem;
>  #endif
>  	rwlock_t i_meta_lock;
> +	unsigned int ib_seq;

Surely i_blkseq?  Almost everything in this struct is prefixed i_.
Christoph Hellwig Nov. 30, 2023, 4:30 a.m. UTC | #19
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).

Yes.

> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).

XFS currently maps based on the underlying reservation (delalloc extent)
and not the actual map size.  This works because on-disk extents are
allocated as unwritten extents, and only the actually written part is
then converted.  But if you only want to allocate blocks for the part
actually written, you need to pass in the dirty range and not
just use the whole folio.  This would be the incremental patch to do
that:

http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752

But unless your block allocator is very cheap doing what XFS does is
probably going to work much better.

> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap - 

This looks good to me from a very superficial view.  Dave is the expert
on this, though.
Ritesh Harjani (IBM) Nov. 30, 2023, 4:37 a.m. UTC | #20
Matthew Wilcox <willy@infradead.org> writes:

> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
>> index 677a9ad45dcb..882c14d20183 100644
>> --- a/fs/ext2/ext2.h
>> +++ b/fs/ext2/ext2.h
>> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>>  	struct rw_semaphore xattr_sem;
>>  #endif
>>  	rwlock_t i_meta_lock;
>> +	unsigned int ib_seq;
>
> Surely i_blkseq?  Almost everything in this struct is prefixed i_.

Certainly!

-ritesh
Ritesh Harjani (IBM) Nov. 30, 2023, 5:27 a.m. UTC | #21
Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> So I took a look at this. Here is what I think -
>> So this is useful of-course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
>
> Yes.
>
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
>
> XFS currently maps based on the underlyig reservation (delalloc extent)
> and not the actual map size.   This works because on-disk extents are
> allocated as unwritten extents, and only the actual written part is
> the converted.  But if you only want to allocate blocks for the part
> actually written you actually need to pass in the dirty range and not
> just use the whole folio.  This would be the incremental patch to do
> that:

yes. dirty range indeed.

>
> http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752
>
> But unless your block allocator is very cheap doing what XFS does is
> probably going to work much better.
>
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap - 
>
> This looks good to me from a very superficial view.  Dave is the expert
> on this, though.

Sure.
Ritesh Harjani (IBM) Nov. 30, 2023, 7:45 a.m. UTC | #22
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>
>> Christoph Hellwig <hch@infradead.org> writes:
>>
>>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>>> writeback bit set. XFS plays the revalidation sequence counter games
>>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>>> care as much about ext2 writeback performance but it should not be that
>>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>>> you give that a try (as a followup "performance improvement" patch).

ok. So I am re-thinking this, on why a filesystem like ext2
would require a sequence counter check. We don't have collapse range
or COW sort of operations; it is only truncate which can race,
but that should be taken care of by folio_lock. And even if a partial
truncate happens on a folio, since the logical to physical block mapping
never changes, it should not matter if the writeback wrote data to a
cached entry, right?

-ritesh

>>>
>>> Darrick has mentioned that he is looking into lifting more of the
>>> validation sequence counter validation into iomap.
>>>
>>> In the meantime I have a series here that at least maps multiple blocks
>>> inside a folio in a single go, which might be worth trying with ext2 as
>>> well:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>>
>> Sure, thanks for providing details. I will check this.
>>
>
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).
>
> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).
>
> Now coming to implementing validation seq check. I am hoping, it should
> be straight forward at least for ext2, because it mostly just have to
> protect against truncate operation (which can change the block mapping)...
>
> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap - 
Zhang Yi Nov. 30, 2023, 8:22 a.m. UTC | #23
On 2023/11/30 12:30, Christoph Hellwig wrote:
> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> So I took a look at this. Here is what I think -
>> So this is useful of-course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
> 
> Yes.
> 
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
> 
> XFS currently maps based on the underlyig reservation (delalloc extent)
> and not the actual map size.   This works because on-disk extents are
> allocated as unwritten extents, and only the actual written part is
> the converted.  But if you only want to allocate blocks for the part

IIUC, I noticed a side effect of this map method. Let's think about a
special case: if we sync a partial range of a delalloc extent, and then
the system crashes before the remaining data is written back, we could get a
file which has allocated blocks (unwritten) beyond EOF. I can reproduce
it on xfs.

 # write 10 blocks, but only sync 3 blocks, i_size becomes 12K after IO complete
 xfs_io -f -c "pwrite 0 40k" -c "sync_range 0 12k" /mnt/foo
 # postpone the remaining data writeback
 echo 20000 > /proc/sys/vm/dirty_writeback_centisecs
 echo 20000 > /proc/sys/vm/dirty_expire_centisecs
 # wait 35s to make sure xfs's cil log writeback
 sleep 35
 # trigger system crash
 echo c > /proc/sysrq-trigger

After system reboot, the 'foo' file's size is 12K but it has 10
allocated blocks.

 xfs_db> inode 131
 core.size = 12288
 core.nblocks = 10
 ...
 0:[0,24,3,0]
 1:[3,27,7,1]

I'm no expert on xfs, but I don't think this is the right result,
is it?

Thanks,
Yi.

> actually written you actually need to pass in the dirty range and not
> just use the whole folio.  This would be the incremental patch to do
> that:
> 
> http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752
> 
> But unless your block allocator is very cheap doing what XFS does is
> probably going to work much better.
> 
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap - 
> 
> This looks good to me from a very superficial view.  Dave is the expert
> on this, though.
> 
> 
> .
>
Jan Kara Nov. 30, 2023, 10:18 a.m. UTC | #24
On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> 
> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >
> >> Christoph Hellwig <hch@infradead.org> writes:
> >>
> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >>>> care as much about ext2 writeback performance but it should not be that
> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >>>> you give that a try (as a followup "performance improvement" patch).
> 
> ok. So I am re-thinknig over this on why will a filesystem like ext2
> would require sequence counter check. We don't have collapse range
> or COW sort of operations, it is only the truncate which can race,
> but that should be taken care by folio_lock. And even if the partial
> truncate happens on a folio, since the logical to physical block mapping
> never changes, it should not matter if the writeback wrote data to a
> cached entry, right?

Yes, so this is what I think I've already mentioned. As long as we map just
the block at the current offset (or a block under currently locked folio),
we are fine and we don't need any kind of sequence counter. But as soon as
we start caching any kind of mapping in iomap_writepage_ctx we need a way
to protect from races with truncate. So something like the sequence counter.

								Honza
Ritesh Harjani (IBM) Nov. 30, 2023, 10:59 a.m. UTC | #25
Jan Kara <jack@suse.cz> writes:

> On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
>> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> 
>> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> >
>> >> Christoph Hellwig <hch@infradead.org> writes:
>> >>
>> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> >>>> writeback bit set. XFS plays the revalidation sequence counter games
>> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
>> >>>> care as much about ext2 writeback performance but it should not be that
>> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> >>>> you give that a try (as a followup "performance improvement" patch).
>> 
>> ok. So I am re-thinknig over this on why will a filesystem like ext2
>> would require sequence counter check. We don't have collapse range
>> or COW sort of operations, it is only the truncate which can race,
>> but that should be taken care by folio_lock. And even if the partial
>> truncate happens on a folio, since the logical to physical block mapping
>> never changes, it should not matter if the writeback wrote data to a
>> cached entry, right?
>
> Yes, so this is what I think I've already mentioned. As long as we map just
> the block at the current offset (or a block under currently locked folio),
> we are fine and we don't need any kind of sequence counter. But as soon as
> we start caching any kind of mapping in iomap_writepage_ctx we need a way
> to protect from races with truncate. So something like the sequence counter.
>

Why do we need to protect from the race with truncate is my question here.
So, IMO, truncate will truncate the folio cache first, before releasing the FS
blocks. Truncation of the folio cache and the writeback path are
protected using folio_lock().
Truncate will clear the dirty flag of the folio before
releasing the folio_lock(), right? So writeback will not even proceed for
folios which are not marked dirty (even if we have a cached wpc entry for
which the folio is released from the folio cache).

Now coming to the stale cached wpc entry for which truncate is doing a
partial truncation. Say truncate ended up calling
truncate_inode_partial_folio(). Now for such a folio (it remains dirty
after partial truncation, and there is a stale cached wpc entry for it),
when writeback writes to the underlying stale block, there is no harm
in that, right?

Also this will "only" happen for folio which was partially truncated.
So why do we need to have sequence counter for protecting against this
race is my question. 

So could this only be needed when the existing logical to physical block
mapping changes, e.g. with COW or maybe collapse range?

-ritesh
Jan Kara Nov. 30, 2023, 2:08 p.m. UTC | #26
On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> 
> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >
> >> >> Christoph Hellwig <hch@infradead.org> writes:
> >> >>
> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >>>> you give that a try (as a followup "performance improvement" patch).
> >> 
> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
> >> would require sequence counter check. We don't have collapse range
> >> or COW sort of operations, it is only the truncate which can race,
> >> but that should be taken care by folio_lock. And even if the partial
> >> truncate happens on a folio, since the logical to physical block mapping
> >> never changes, it should not matter if the writeback wrote data to a
> >> cached entry, right?
> >
> > Yes, so this is what I think I've already mentioned. As long as we map just
> > the block at the current offset (or a block under currently locked folio),
> > we are fine and we don't need any kind of sequence counter. But as soon as
> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
> > to protect from races with truncate. So something like the sequence counter.
> >
> 
> Why do we need to protect from the race with truncate, is my question here.
> So, IMO, truncate will truncate the folio cache first before releasing the FS
> blocks. Truncation of the folio cache and the writeback path are
> protected using folio_lock()
> Truncate will clear the dirty flag of the folio before
> releasing the folio_lock() right, so writeback will not even proceed for
> folios which are not marked dirty (even if we have a cached wpc entry for
> which folio is released from folio cache).
> 
> Now coming to the stale cached wpc entry for which truncate is doing a
> partial truncation. Say, truncate ended up calling
> truncate_inode_partial_folio(). Now for such folio (it remains dirty
> after partial truncation) (for which there is a stale cached wpc entry),
> when writeback writes to the underlying stale block, there is no harm
> with that right?
> 
> Also this will "only" happen for folio which was partially truncated.
> So why do we need to have sequence counter for protecting against this
> race is my question. 

That's a very good question and it took me a while to formulate my "this
sounds problematic" feeling into a particular example :) We can still have
a race like:

write_cache_pages()
  cache extent covering 0..1MB range
  write page at offset 0k
					truncate(file, 4k)
					  drops all relevant pages
					  frees fs blocks
					pwrite(file, 4k, 4k)
					  creates dirty page in the page cache
  writes page at offset 4k to a stale block

								Honza
Ritesh Harjani (IBM) Nov. 30, 2023, 3:50 p.m. UTC | #27
Jan Kara <jack@suse.cz> writes:

> On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
>> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> >> 
>> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> >> >
>> >> >> Christoph Hellwig <hch@infradead.org> writes:
>> >> >>
>> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
>> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
>> >> >>>> care as much about ext2 writeback performance but it should not be that
>> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> >> >>>> you give that a try (as a followup "performance improvement" patch).
>> >> 
>> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
>> >> would require sequence counter check. We don't have collapse range
>> >> or COW sort of operations, it is only the truncate which can race,
>> >> but that should be taken care by folio_lock. And even if the partial
>> >> truncate happens on a folio, since the logical to physical block mapping
>> >> never changes, it should not matter if the writeback wrote data to a
>> >> cached entry, right?
>> >
>> > Yes, so this is what I think I've already mentioned. As long as we map just
>> > the block at the current offset (or a block under currently locked folio),
>> > we are fine and we don't need any kind of sequence counter. But as soon as
>> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
>> > to protect from races with truncate. So something like the sequence counter.
>> >
>> 
>> Why do we need to protect from the race with truncate, is my question here.
>> So, IMO, truncate will truncate the folio cache first before releasing the FS
>> blocks. Truncation of the folio cache and the writeback path are
>> protected using folio_lock()
>> Truncate will clear the dirty flag of the folio before
>> releasing the folio_lock() right, so writeback will not even proceed for
>> folios which are not marked dirty (even if we have a cached wpc entry for
>> which folio is released from folio cache).
>> 
>> Now coming to the stale cached wpc entry for which truncate is doing a
>> partial truncation. Say, truncate ended up calling
>> truncate_inode_partial_folio(). Now for such folio (it remains dirty
>> after partial truncation) (for which there is a stale cached wpc entry),
>> when writeback writes to the underlying stale block, there is no harm
>> with that right?
>> 
>> Also this will "only" happen for folio which was partially truncated.
>> So why do we need to have sequence counter for protecting against this
>> race is my question. 
>
> That's a very good question and it took me a while to formulate my "this
> sounds problematic" feeling into a particular example :) We can still have
> a race like:
>
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
> 					truncate(file, 4k)
> 					  drops all relevant pages
> 					  frees fs blocks
> 					pwrite(file, 4k, 4k)
> 					  creates dirty page in the page cache
>   writes page at offset 4k to a stale block

:) Nice! 
Truncate followed by an append write could cause this race with writeback
happening in parallel. And this might cause data corruption.

Thanks again for clearly explaining the race :) 

So I think just having a seq counter (no other locks required) should
be enough to prevent this race, since mostly what we are interested in is
whether the stale cached block mapping got changed for the folio which
writeback is going to continue writing to.

i.e. the race only happens when both of these operations happen while we
have a cached block mapping for a folio -
- a new physical block got allocated for the same logical offset to
which the previous folio belongs
- the folio got re-instantiated in the page cache as dirty, with the new
physical block mapping at the same logical offset of the file.

Now when the writeback continues, it will iterate over the same
dirty folio in question, lock it, and call ->map_blocks, which will
notice the changed seq counter and do ->get_blocks again; then
we submit_bio() to the right physical block.

So it should be ok if we simply update the seq counter every time a
block is allocated or freed for a given inode... because when we
check the seq counter, we know the folio is locked, and the other
operation of re-instating a new physical block mapping for a given folio
can also only happen under the folio lock.

OR, one other approach to this can be...

IIUC, for a new block mapping for the same folio, the folio must have been
invalidated once in between, right? So could this be handled in the folio
cache somehow, to know whether for a given folio the underlying mapping
might have changed, in which case iomap should query ->map_blocks() again?
... could that also help avoid unnecessary re-validations of
cached block mappings?

Thoughts?

-ritesh


>
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Nov. 30, 2023, 4:01 p.m. UTC | #28
On Thu 30-11-23 21:20:41, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> 
> >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >> 
> >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >> >
> >> >> >> Christoph Hellwig <hch@infradead.org> writes:
> >> >> >>
> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >> >>>> you give that a try (as a followup "performance improvement" patch).
> >> >> 
> >> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
> >> >> would require sequence counter check. We don't have collapse range
> >> >> or COW sort of operations, it is only the truncate which can race,
> >> >> but that should be taken care by folio_lock. And even if the partial
> >> >> truncate happens on a folio, since the logical to physical block mapping
> >> >> never changes, it should not matter if the writeback wrote data to a
> >> >> cached entry, right?
> >> >
> >> > Yes, so this is what I think I've already mentioned. As long as we map just
> >> > the block at the current offset (or a block under currently locked folio),
> >> > we are fine and we don't need any kind of sequence counter. But as soon as
> >> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
> >> > to protect from races with truncate. So something like the sequence counter.
> >> >
> >> 
> >> Why do we need to protect from the race with truncate, is my question here.
> >> So, IMO, truncate will truncate the folio cache first before releasing the FS
> >> blocks. Truncation of the folio cache and the writeback path are
> >> protected using folio_lock()
> >> Truncate will clear the dirty flag of the folio before
> >> releasing the folio_lock() right, so writeback will not even proceed for
> >> folios which are not marked dirty (even if we have a cached wpc entry for
> >> which folio is released from folio cache).
> >> 
> >> Now coming to the stale cached wpc entry for which truncate is doing a
> >> partial truncation. Say, truncate ended up calling
> >> truncate_inode_partial_folio(). Now for such folio (it remains dirty
> >> after partial truncation) (for which there is a stale cached wpc entry),
> >> when writeback writes to the underlying stale block, there is no harm
> >> with that right?
> >> 
> >> Also this will "only" happen for folio which was partially truncated.
> >> So why do we need to have sequence counter for protecting against this
> >> race is my question. 
> >
> > That's a very good question and it took me a while to formulate my "this
> > sounds problematic" feeling into a particular example :) We can still have
> > a race like:
> >
> > write_cache_pages()
> >   cache extent covering 0..1MB range
> >   write page at offset 0k
> > 					truncate(file, 4k)
> > 					  drops all relevant pages
> > 					  frees fs blocks
> > 					pwrite(file, 4k, 4k)
> > 					  creates dirty page in the page cache
> >   writes page at offset 4k to a stale block
> 
> :) Nice! 
> Truncate followed by an append write could cause this race with writeback
> happening in parallel. And this might cause data corruption.
> 
> Thanks again for clearly explaining the race :) 
> 
> So I think just having a seq. counter (no other locks required), should
> be ok to prevent this race. Since mostly what we are interested in is
> whether the stale cached block mapping got changed for the folio which
> writeback is going to continue writing to.
> 
> i.e. the race only happens when 2 of these operations happen while we
> have a cached block mapping for a folio -
> - a new physical block got allocated for the same logical offset to
> which the previous folio belonged
> - the folio got re-instantiated in the page cache as dirty with the new
> physical block mapping at the same logical offset of the file.
> 
> Now when the writeback continues, it will iterate over the same
> dirty folio in question, lock it, check for ->map_blocks which will
> notice a changed seq counter and do ->get_blocks again and then
> we do submit_bio() to the right physical block.
> 
> So, it should be ok if we simply update the seq counter every time a
> block is allocated or freed for a given inode... because when we
> check the seq counter, we know the folio is locked and the other
> operation of re-instating a new physical block mapping for a given folio
> can also only happen while under a folio lock. 

Yes.

> OR, one other approach to this can be... 
> 
> IIUC, for a new block mapping for the same folio, the folio can be made to
> get invalidated once in between right? So should this be handled in folio
> cache somehow to know if for a given folio the underlying mapping might
> have changed for which iomap should again query  ->map_blocks() ? 
> ... can that also help against unnecessary re-validations against cached
> block mappings?

This would be difficult because the page cache handling code does not know
that someone has cached a block mapping somewhere...

								Honza
Matthew Wilcox Nov. 30, 2023, 4:03 p.m. UTC | #29
On Thu, Nov 30, 2023 at 09:20:41PM +0530, Ritesh Harjani wrote:
> OR, one other approach to this can be... 
> 
> IIUC, for a new block mapping for the same folio, the folio can be made to
> get invalidated once in between right? So should this be handled in folio
> cache somehow to know if for a given folio the underlying mapping might
> have changed for which iomap should again query  ->map_blocks() ? 
> ... can that also help against unnecessary re-validations against cached
> block mappings?

That's pretty much exactly what buffer heads do -- we'd see the mapped
bit was now clear.  Maybe we could have a bit in the iomap_folio_state
that is set once a map has been returned for this folio.  On the one
hand, this is exactly against the direction of iomap; to divorce the
filesystem state from the page cache state.  On the other hand, maybe
that wasn't a great idea and deserves to be reexamined.
Dave Chinner Dec. 1, 2023, 11:09 p.m. UTC | #30
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> 
> > Christoph Hellwig <hch@infradead.org> writes:
> >
> >> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >>> writeback bit set. XFS plays the revalidation sequence counter games
> >>> because of this so we'd have to do something similar for ext2. Not that I'd
> >>> care as much about ext2 writeback performance but it should not be that
> >>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >>> you give that a try (as a followup "performance improvement" patch).
> >>
> >> Darrick has mentioned that he is looking into lifting more of the
> >> validation sequence counter validation into iomap.
> >>
> >> In the meantime I have a series here that at least maps multiple blocks
> >> inside a folio in a single go, which might be worth trying with ext2 as
> >> well:
> >>
> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
> >
> > Sure, thanks for providing details. I will check this.
> >
> 
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).
> 
> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).
> 
> Now coming to implementing validation seq check. I am hoping, it should
> be straight forward at least for ext2, because it mostly just have to
> protect against truncate operation (which can change the block mapping)...
> 
> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap - 
> 
> 
> From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
> Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks
> 
> This implements inode block seq counter (ib_seq) which helps in validating 
> whether the cached wpc (struct iomap_writepage_ctx) still has the valid 
> entries or not. Block mapping can get changed say due to a racing truncate. 
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext2/balloc.c |  1 +
>  fs/ext2/ext2.h   |  6 ++++++
>  fs/ext2/inode.c  | 51 +++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext2/super.c  |  2 +-
>  4 files changed, 54 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index e124f3d709b2..e97040194da4 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
>  	}
>  
>  	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
> +	ext2_inc_ib_seq(EXT2_I(inode));
>  
>  do_more:
>  	overflow = 0;
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>  	struct rw_semaphore xattr_sem;
>  #endif
>  	rwlock_t i_meta_lock;
> +	unsigned int ib_seq;
>  
>  	/*
>  	 * truncate_mutex is for serialising ext2_truncate() against
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
> +}
> +
>  /* balloc.c */
>  extern int ext2_bg_has_super(struct super_block *sb, int group);
>  extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index f4e0b9a93095..a5490d641c26 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
>  	ext2_fsblk_t current_block = 0;
>  	int ret = 0;
>  
> +	ext2_inc_ib_seq(EXT2_I(inode));
> +
>  	/*
>  	 * Here we try to allocate the requested multiple blocks at once,
>  	 * on a best-effort basis.
> @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  	return mpage_writepages(mapping, wbc, ext2_get_block);
>  }
>  
> +struct ext2_writepage_ctx {
> +	struct iomap_writepage_ctx ctx;
> +	unsigned int		ib_seq;
> +};

Why copy this structure from XFS? The iomap held in ctx->iomap now
has a 'u64 validity_cookie;' which is expressly intended to be used
for holding this information.

And then this becomes:

> +static inline
> +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
> +{
> +	return container_of(ctx, struct ext2_writepage_ctx, ctx);
> +}
> +
> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
> +			    loff_t offset)
> +{
> +	if (offset < wpc->iomap.offset ||
> +	    offset >= wpc->iomap.offset + wpc->iomap.length)
> +		return false;
> +
> +	if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
> +		return false;

	if (wpc->iomap->validity_cookie != EXT2_I(inode)->ib_seq)
		return false;

> +
> +	return true;
> +}
> +
>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);

This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it
does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is
immediately staled and the very next call to ext2_imap_valid() will
fail, causing a new iomap to be mapped even when not necessary.

Indeed, we also don't hold the ei->truncate_mutex here, so the
sampling of ib_seq could race with other allocation or truncation
calls on this inode. That's a landmine that will eventually bite
hard, because the sequence number returned with the iomap does not
reflect the state of the extent tree when the iomap is created.

If you look at the XFS code, we set iomap->validity_cookie whenever
we call xfs_bmbt_to_iomap() to format the found/allocated extent
into an iomap to return to the caller. The sequence number is passed
into that function alongside the extent map by the caller, because
when we format the extent into an iomap the caller has already dropped the
allocation lock. We must sample the sequence number after the allocation
but before we drop the allocation lock so that the sequence number
-exactly- matches the extent tree and the extent map we are going to
format into the iomap, otherwise the extent we map into the iomap
may already be stale and the validity cookie won't tell us that.

i.e. the coherency relationship between the validity cookie and the
iomap must be set up from a synchronised state.  If the sequence
number is allowed to change between mapping the extent and sampling
the sequence number, then the extent we map and cache may already be
invalid and this introduces the possibility that the validity cookie
won't always catch that...
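
To make the ordering requirement concrete, here is a single-threaded
userspace sketch (illustrative names only, not the actual XFS/ext2 code)
where a callback stands in for a truncate racing from another CPU.
Sampling the cookie together with the extent, before the lock would be
dropped, catches the race; sampling it after can pair a stale block with
a fresh cookie:

```c
#include <assert.h>
#include <stdint.h>

struct model_inode {
	uint32_t ib_seq;	/* bumped on every mapping change */
	uint64_t block;		/* block currently mapped at the offset */
};

struct model_iomap {
	uint64_t block;
	uint64_t validity_cookie;
};

static void model_truncate(struct model_inode *inode)
{
	/* would run under the inode's truncate_mutex */
	inode->block = 0;	/* mapping freed */
	inode->ib_seq++;
}

/* Correct: extent and cookie sampled atomically (under the lock). */
static void good_map_blocks(struct model_inode *inode,
			    struct model_iomap *iomap,
			    void (*race)(struct model_inode *))
{
	/* lock(inode) */
	iomap->block = inode->block;
	iomap->validity_cookie = inode->ib_seq;
	/* unlock(inode) */
	race(inode);		/* truncate runs after we are done */
}

/* Broken: cookie sampled after the lock is dropped, so it can match a
 * *newer* extent tree than the block we cached. */
static void bad_map_blocks(struct model_inode *inode,
			   struct model_iomap *iomap,
			   void (*race)(struct model_inode *))
{
	/* lock(inode) */
	iomap->block = inode->block;
	/* unlock(inode) */
	race(inode);		/* truncate sneaks in here */
	iomap->validity_cookie = inode->ib_seq;
}

static int model_iomap_valid(const struct model_inode *inode,
			     const struct model_iomap *iomap)
{
	return iomap->validity_cookie == inode->ib_seq;
}
```

In the broken variant the cookie ends up matching the current ib_seq
even though the cached block is stale, so revalidation would wrongly
pass and writeback would hit the freed block.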

> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);

So calling ext2_iomap_begin() here might have to change. Indeed,
given that we just allocated the blocks and we know where they are
located, calling ext2_iomap_begin()->ext2_get_blocks() to look them
up again just to format them into the iomap seems .... convoluted.

It would be better to factor ext2_iomap_begin() into a call to
ext2_get_blocks() and then a "ext2_blocks_to_iomap()" function to
format the returned blocks into the iomap that is returned. Then
ext2_write_map_blocks() can just call ext2_blocks_to_iomap() here
and it skips the unnecessary block mapping lookup....

It also means that the ib_seq returned by the allocation can be fed
directly into the iomap->validity_cookie...

-Dave.
Ritesh Harjani (IBM) Dec. 5, 2023, 3:22 p.m. UTC | #31
Dave Chinner <david@fromorbit.com> writes:

> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> 
>> > Christoph Hellwig <hch@infradead.org> writes:
>> >
>> >> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> >>> writeback bit set. XFS plays the revalidation sequence counter games
>> >>> because of this so we'd have to do something similar for ext2. Not that I'd
>> >>> care as much about ext2 writeback performance but it should not be that
>> >>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> >>> you give that a try (as a followup "performance improvement" patch).
>> >>
>> >> Darrick has mentioned that he is looking into lifting more of the
>> >> validation sequence counter validation into iomap.
>> >>
>> >> In the meantime I have a series here that at least maps multiple blocks
>> >> inside a folio in a single go, which might be worth trying with ext2 as
>> >> well:
>> >>
>> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>> >
>> > Sure, thanks for providing details. I will check this.
>> >
>> 
>> So I took a look at this. Here is what I think -
>> So this is useful of-course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
>> 
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
>> 
>> Now coming to implementing validation seq check. I am hoping, it should
>> be straight forward at least for ext2, because it mostly just have to
>> protect against truncate operation (which can change the block mapping)...
>> 
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap - 
>> 
>> 
>> From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>> Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks
>> 
>> This implements inode block seq counter (ib_seq) which helps in validating 
>> whether the cached wpc (struct iomap_writepage_ctx) still has the valid 
>> entries or not. Block mapping can get changed say due to a racing truncate. 
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/ext2/balloc.c |  1 +
>>  fs/ext2/ext2.h   |  6 ++++++
>>  fs/ext2/inode.c  | 51 +++++++++++++++++++++++++++++++++++++++++++-----
>>  fs/ext2/super.c  |  2 +-
>>  4 files changed, 54 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
>> index e124f3d709b2..e97040194da4 100644
>> --- a/fs/ext2/balloc.c
>> +++ b/fs/ext2/balloc.c
>> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
>>  	}
>>  
>>  	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
>> +	ext2_inc_ib_seq(EXT2_I(inode));
>>  
>>  do_more:
>>  	overflow = 0;
>> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
>> index 677a9ad45dcb..882c14d20183 100644
>> --- a/fs/ext2/ext2.h
>> +++ b/fs/ext2/ext2.h
>> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>>  	struct rw_semaphore xattr_sem;
>>  #endif
>>  	rwlock_t i_meta_lock;
>> +	unsigned int ib_seq;
>>  
>>  	/*
>>  	 * truncate_mutex is for serialising ext2_truncate() against
>> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>>  }
>>  
>> +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
>> +{
>> +	WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
>> +}
>> +
>>  /* balloc.c */
>>  extern int ext2_bg_has_super(struct super_block *sb, int group);
>>  extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
>> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
>> index f4e0b9a93095..a5490d641c26 100644
>> --- a/fs/ext2/inode.c
>> +++ b/fs/ext2/inode.c
>> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
>>  	ext2_fsblk_t current_block = 0;
>>  	int ret = 0;
>>  
>> +	ext2_inc_ib_seq(EXT2_I(inode));
>> +
>>  	/*
>>  	 * Here we try to allocate the requested multiple blocks at once,
>>  	 * on a best-effort basis.
>> @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>>  	return mpage_writepages(mapping, wbc, ext2_get_block);
>>  }
>>  
>> +struct ext2_writepage_ctx {
>> +	struct iomap_writepage_ctx ctx;
>> +	unsigned int		ib_seq;
>> +};
>
> Why copy this structure from XFS? The iomap held in ctx->iomap now
> has a 'u64 validity_cookie;' which is expressly intended to be used
> for holding this information.
>
> And then this becomes:
>

Right. Thanks for pointing out.

>> +static inline
>> +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
>> +{
>> +	return container_of(ctx, struct ext2_writepage_ctx, ctx);
>> +}
>> +
>> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
>> +			    loff_t offset)
>> +{
>> +	if (offset < wpc->iomap.offset ||
>> +	    offset >= wpc->iomap.offset + wpc->iomap.length)
>> +		return false;
>> +
>> +	if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
>> +		return false;
>
> 	if (wpc->iomap->validity_cookie != EXT2_I(inode)->ib_seq)
> 		return false;
>
>> +
>> +	return true;
>> +}
>> +
>>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>>  				 struct inode *inode, loff_t offset)
>>  {
>> -	if (offset >= wpc->iomap.offset &&
>> -	    offset < wpc->iomap.offset + wpc->iomap.length)
>> +	loff_t maxblocks = (loff_t)INT_MAX;
>> +	u8 blkbits = inode->i_blkbits;
>> +	u32 bno;
>> +	bool new, boundary;
>> +	int ret;
>> +
>> +	if (ext2_imap_valid(wpc, inode, offset))
>>  		return 0;
>>  
>> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
>> +	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
>> +
>> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
>> +			      &bno, &new, &boundary, 0);
>
> This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it
> does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is
> immediately staled and the very next call to ext2_imap_valid() will
> fail, causing a new iomap to be mapped even when not necessary.

In the case of ext2, the allocation happens at write time itself
for buffered writes. So what we are essentially doing here (at the time
of writeback) is querying ->get_blocks(..., create=0) and passing that
number of blocks (ret) further. So it is unlikely that block
allocations will happen right after we sample ib_seq
(unless we race with truncate).

For mmapped writes, we expect to find a hole and in case of a hole at
the offset, we only pass & cache 1 block in wpc. 

For the mmapped writes case, since we go and allocate 1 block, I agree
that the ib_seq will change right after in ->get_blocks. But since in
this case we only alloc and cache 1 block, we will anyway have to call
->get_blocks irrespective of the ib_seq check.
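
The argument above (sample ib_seq first, then do a create=0 lookup that
never allocates) can be sketched the same way; names are again
illustrative stand-ins, not the real ext2 functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_inode {
	uint32_t ib_seq;
	bool mapped;		/* does the offset have a block? */
};

/* create=0 lookup: returns number of blocks found, never allocates,
 * and therefore never bumps ib_seq. */
static int model_get_blocks_ro(const struct model_inode *inode)
{
	return inode->mapped ? 1 : 0;
}

/* Writeback-side mapping: sample the counter *before* the lookup and
 * return it, to be stored alongside the cached mapping. */
static uint32_t model_writeback_map(struct model_inode *inode, int *nblocks)
{
	uint32_t seq = inode->ib_seq;	/* sample first */

	*nblocks = model_get_blocks_ro(inode);
	if (*nblocks == 0)
		*nblocks = 1;	/* hole: map a single block only */
	return seq;
}

static void model_truncate(struct model_inode *inode)
{
	inode->mapped = false;
	inode->ib_seq++;
}
```

Since the read-only lookup leaves the counter untouched, the sampled
value only goes stale when something actually changed the mapping, which
is precisely the case revalidation must catch.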

>
> Indeed, we also don't hold the ei->truncate_mutex here, so the
> sampling of ib_seq could race with other allocation or truncation
> calls on this inode. That's a landmine that will eventually bite
> hard, because the sequence number returned with the iomap does not
> reflect the state of the extent tree when the iomap is created.
>

So by sampling the ib_seq before calling ext2_get_blocks(), we ensure
that any change in the block mapping after the sampling gets detected,
e.g. a race of writeback with truncate (which changes the ib_seq value).


> If you look at the XFS code, we set iomap->validity_cookie whenever
> we call xfs_bmbt_to_iomap() to format the found/allocated extent
> into an iomap to return to the caller. The sequence number is passed
> into that function alongside the extent map by the caller, because
> when we format the extent into an iomap the caller has already dropped the
> allocation lock. We must sample the sequence number after the allocation
> but before we drop the allocation lock so that the sequence number
> -exactly- matches the extent tree and the extent map we are going to
> format into the iomap, otherwise the extent we map into the iomap
> may already be stale and the validity cookie won't tell us that.
>
> i.e. the coherency relationship between the validity cookie and the
> iomap must be set up from a synchronised state.  If the sequence
> number is allowed to change between mapping the extent and sampling
> the sequence number, then the extent we map and cache may already be
> invalid and this introduces the possibility that the validity cookie
> won't always catch that...
>

Yes, Dave. Thanks for sharing the details.
In the current PoC, I believe it is synchronized w.r.t ib_seq change.

>> +	if (ret < 0)
>> +		return ret;
>> +	/*
>> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
>> +	 */
>> +	ret = ret ? ret : 1;
>> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>>  				IOMAP_WRITE, &wpc->iomap, NULL);
>
> So calling ext2_iomap_begin() here might have to change. Indeed,
> given that we just allocated the blocks and we know where they are
> located, calling ext2_iomap_begin()->ext2_get_blocks() to look them
> up again just to format them into the iomap seems .... convoluted.
>
> It would be better to factor ext2_iomap_begin() into a call to
> ext2_get_blocks() and then a "ext2_blocks_to_iomap()" function to
> format the returned blocks into the iomap that is returned. Then
> ext2_write_map_blocks() can just call ext2_blocks_to_iomap() here
> and it skips the unnecessary block mapping lookup....
>
> It also means that the ib_seq returned by the allocation can be fed
> directly into the iomap->validity_cookie...
>

Thanks for your extensive review and sharing more details around this. 
Let me spend more time to see if this can bring benefit in this case for
ext2.

-ritesh
Jan Kara Dec. 7, 2023, 8:58 a.m. UTC | #32
On Tue 05-12-23 20:52:26, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> >>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> >>  				 struct inode *inode, loff_t offset)
> >>  {
> >> -	if (offset >= wpc->iomap.offset &&
> >> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> >> +	loff_t maxblocks = (loff_t)INT_MAX;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	u32 bno;
> >> +	bool new, boundary;
> >> +	int ret;
> >> +
> >> +	if (ext2_imap_valid(wpc, inode, offset))
> >>  		return 0;
> >>  
> >> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> >> +	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
> >> +
> >> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> >> +			      &bno, &new, &boundary, 0);
> >
> > This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it
> > does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is
> > immediately staled and the very next call to ext2_imap_valid() will
> > fail, causing a new iomap to be mapped even when not necessary.
> 
> In case of ext2 here, the allocation happens at the write time itself
> for buffered writes. So what we are essentially doing here (at the time
> of writeback) is querying ->get_blocks(..., create=0) and passing those
> no. of blocks (ret) further. So it is unlikely that the block
> allocations will happen right after we sample ib_seq
> (unless we race with truncate).
> 
> For mmapped writes, we expect to find a hole and in case of a hole at
> the offset, we only pass & cache 1 block in wpc. 
> 
> For mmapped writes case since we go and allocate 1 block, so I agree
> that the ib_seq will change right after in ->get_blocks. But since in
> this case we only alloc and cache 1 block, we anyway will have to call
> ->get_blocks irrespective of ib_seq checking.

I agree with your reasoning, Ritesh, but I guess it would deserve a comment
because it is a bit subtle and an easily forgotten detail.

								Honza
Patch

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 4ddc36f4dbd4..ee5cd4a2f24f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -252,7 +252,7 @@  static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 		iocb->ki_flags &= ~IOCB_DIRECT;
 		pos = iocb->ki_pos;
-		status = generic_perform_write(iocb, from);
+		status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
 		if (unlikely(status < 0)) {
 			ret = status;
 			goto out_unlock;
@@ -278,6 +278,22 @@  static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return ret;
 }
 
+static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	ssize_t ret = 0;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret > 0)
+		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
+}
+
 static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 #ifdef CONFIG_FS_DAX
@@ -299,7 +315,7 @@  static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext2_dio_write_iter(iocb, from);
 
-	return generic_file_write_iter(iocb, from);
+	return ext2_buffered_write_iter(iocb, from);
 }
 
 const struct file_operations ext2_file_operations = {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 464faf6c217e..b6224d94a7dd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -879,10 +879,14 @@  ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
 		return -ENOTBLK;
 
-	if (iomap->type == IOMAP_MAPPED &&
-	    written < length &&
-	    (flags & IOMAP_WRITE))
+	if (iomap->type == IOMAP_MAPPED && written < length &&
+	   (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return 0;
+	}
+
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+		mark_inode_dirty(inode);
 	return 0;
 }
 
@@ -914,6 +918,16 @@  static void ext2_readahead(struct readahead_control *rac)
 	mpage_readahead(rac, ext2_get_block);
 }
 
+static int ext2_file_read_folio(struct file *file, struct folio *folio)
+{
+	return iomap_read_folio(folio, &ext2_iomap_ops);
+}
+
+static void ext2_file_readahead(struct readahead_control *rac)
+{
+	return iomap_readahead(rac, &ext2_iomap_ops);
+}
+
 static int
 ext2_write_begin(struct file *file, struct address_space *mapping,
 		loff_t pos, unsigned len, struct page **pagep, void **fsdata)
@@ -943,12 +957,40 @@  static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping,block,ext2_get_block);
 }
 
+static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block)
+{
+	return iomap_bmap(mapping, block, &ext2_iomap_ops);
+}
+
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
+				 struct inode *inode, loff_t offset)
+{
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length)
+		return 0;
+
+	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+				IOMAP_WRITE, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops ext2_writeback_ops = {
+	.map_blocks		= ext2_write_map_blocks,
+};
+
+static int ext2_file_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+}
+
 static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
@@ -957,6 +999,20 @@  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
 	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 }
 
+const struct address_space_operations ext2_file_aops = {
+	.dirty_folio		= iomap_dirty_folio,
+	.release_folio		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.read_folio		= ext2_file_read_folio,
+	.readahead		= ext2_file_readahead,
+	.bmap			= ext2_file_bmap,
+	.direct_IO		= noop_direct_IO,
+	.writepages		= ext2_file_writepages,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate	= iomap_is_partially_uptodate,
+	.error_remove_page	= generic_error_remove_page,
+};
+
 const struct address_space_operations ext2_aops = {
 	.dirty_folio		= block_dirty_folio,
 	.invalidate_folio	= block_invalidate_folio,
@@ -1281,8 +1337,8 @@  static int ext2_setsize(struct inode *inode, loff_t newsize)
 		error = dax_truncate_page(inode, newsize, NULL,
 					  &ext2_iomap_ops);
 	else
-		error = block_truncate_page(inode->i_mapping,
-				newsize, ext2_get_block);
+		error = iomap_truncate_page(inode, newsize, NULL,
+					    &ext2_iomap_ops);
 	if (error)
 		return error;
 
@@ -1372,7 +1428,7 @@  void ext2_set_file_ops(struct inode *inode)
 	if (IS_DAX(inode))
 		inode->i_mapping->a_ops = &ext2_dax_aops;
 	else
-		inode->i_mapping->a_ops = &ext2_aops;
+		inode->i_mapping->a_ops = &ext2_file_aops;
 }
 
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)