
[RFCv2,3/3] iomap: Support subpage size dirty tracking to improve write performance

Message ID: 5e49fa975ce9d719f5b6f765aa5d3a1d44d98d1d.1675093524.git.ritesh.list@gmail.com
Series: iomap: Add support for subpage dirty state tracking to improve write performance

Commit Message

Ritesh Harjani (IBM) Jan. 30, 2023, 4:14 p.m. UTC
On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
filesystem blocksize, this patch should improve performance by writing
back only the dirty subpage data.

This should also reduce write amplification, since we can now track
subpage dirty status within the state bitmaps. Earlier we had to
write the entire 64k page even if only a part of it (e.g. 4k) was
updated.

Performance testing of the below fio workload reveals ~16x performance
improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.

<test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 103 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/super.c      |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 98 insertions(+), 12 deletions(-)
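
At the heart of the patch below, iop->state grows from nr_blocks to
2 * nr_blocks bits: the first nr_blocks bits keep tracking per-block
uptodate state, and the following nr_blocks bits track per-block dirty
state. A minimal sketch of that layout and the index math (the helper
names here are hypothetical; the posted patch open-codes the offsets):

	/*
	 * Two bitmaps stored back to back in one flexible array:
	 *   bits [0, nr_blocks)             -> per-block uptodate state
	 *   bits [nr_blocks, 2 * nr_blocks) -> per-block dirty state
	 * (fields of struct iomap_page other than these are omitted)
	 */
	struct iomap_page {
		spinlock_t	state_lock;
		unsigned long	state[];
	};

	/* Hypothetical index helpers; not part of the posted patch. */
	static inline unsigned int iop_uptodate_bit(unsigned int block)
	{
		return block;
	}

	static inline unsigned int iop_dirty_bit(unsigned int nr_blocks,
						 unsigned int block)
	{
		return nr_blocks + block;
	}

For example, a 64k folio with 4k blocks has nr_blocks = 16, so a write
that touches only block 3 sets uptodate bit 3 and dirty bit 19, and
writeback can then submit just that one 4k block instead of all 64k.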

Comments

Christoph Hellwig Jan. 30, 2023, 5:16 p.m. UTC | #1
On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
> +		  bool from_writeback)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -58,12 +59,32 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {

Please just return early here for the allocation failure case instead of 
adding a lot of code with extra indentation.
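
Concretely, the early-return shape being asked for would look roughly
like this (a sketch against the posted code, not a tested change):

	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
		      gfp);
	if (!iop)
		return NULL;

	spin_lock_init(&iop->state_lock);
	/* ... bitmap initialization, now at the outer indentation level ... */
	folio_attach_private(folio, iop);
	return iop;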

>  		spin_lock_init(&iop->state_lock);
> -		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->state, nr_blocks);
> +		/*
> +		 * iomap_page_create can get called from writeback after
> +		 * a truncate_inode_partial_folio operation on a large folio.
> +		 * For large folio the iop structure is freed in
> +		 * iomap_invalidate_folio() to ensure we can split the folio.
> +		 * That means we will have to let go of the optimization of
> +		 * tracking dirty bits here and set all bits as dirty if
> +		 * the folio is marked uptodate.
> +		 */
> +		if (from_writeback && folio_test_uptodate(folio))
> +			bitmap_fill(iop->state, 2 * nr_blocks);
> +		else if (folio_test_uptodate(folio)) {

This code is very confusing.  First please only check
folio_test_uptodate once, and then check the from_writeback flag
inside the branch.  And as mentioned last time I think you really
need some symbolic constants for dealing with dirty vs uptodate
state and not just do a single fill for them.
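
One way to read "symbolic constants" here is a pair of names for the two
bitmap regions plus a range helper, so callers never compute raw bitmap
offsets. A sketch under that assumption (all names hypothetical, locking
elided):

	enum iop_state {
		IOP_STATE_UPTODATE = 0,	/* bits [0, nr_blocks) */
		IOP_STATE_DIRTY = 1,	/* bits [nr_blocks, 2 * nr_blocks) */
	};

	static void iop_set_range(struct iomap_page *iop, enum iop_state st,
				  unsigned int nr_blocks,
				  unsigned int first, unsigned int count)
	{
		bitmap_set(iop->state, st * nr_blocks + first, count);
	}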

> +			unsigned start = offset_in_folio(folio,
> +					folio_pos(folio)) >> inode->i_blkbits;
> +			bitmap_set(iop->state, start, nr_blocks);

Also this code leaves me scratching my head.  Unless I'm missing something
important

	 offset_in_folio(folio, folio_pos(folio))

must always return 0.

Also the from_writeback logic is weird.  I'd rather have a
"bool is_dirty" argument and then pass true for writeback because
we know the folio is dirty, false where we know it can't be
dirty and do the folio_test_dirty in the caller where we don't
know the state.
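
In other words, callers that know the state pass a constant, and the one
that does not asks the folio; roughly (a hypothetical sketch of the call
sites):

	/* writeback: the folio was dirty, the VFS just cleared the flag */
	iop = iomap_page_create(inode, folio, 0, true);

	/* read path: freshly read data cannot be dirty */
	iop = iomap_page_create(iter->inode, folio, iter->flags, false);

	/* a caller that does not know: ask the folio */
	iop = iomap_page_create(inode, folio, flags, folio_test_dirty(folio));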

> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
> +	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);

Please avoid the overly long line.  In fact with such long function
calls I'd generally prefer if the initialization was moved out of the
declaration.

> +
> +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),

Another overly long line here.
Matthew Wilcox Jan. 30, 2023, 5:54 p.m. UTC | #2
On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
> filesystem blocksize, this patch should improve performance by writing
> back only the dirty subpage data.
> 
> This should also reduce write amplification, since we can now track
> subpage dirty status within the state bitmaps. Earlier we had to
> write the entire 64k page even if only a part of it (e.g. 4k) was
> updated.
> 
> Performance testing of the below fio workload reveals ~16x performance
> improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
> FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.
> 
> <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]

You really need to include this sentence from the cover letter in this
patch:

2. Also our internal performance team reported that this patch improves their
   database workload performance by around ~83% (with XFS on Power)

because that's far more meaningful than "Look, I cooked up an artificial
workload where this makes a difference".
Matthew Wilcox Jan. 30, 2023, 6:01 p.m. UTC | #3
On Mon, Jan 30, 2023 at 09:16:33AM -0800, Christoph Hellwig wrote:
> > +		if (from_writeback && folio_test_uptodate(folio))
> > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > +		else if (folio_test_uptodate(folio)) {
> 
> This code is very confusing.  First please only check
> folio_test_uptodate once, and then check the from_writeback flag
> inside the branch.  And as mentioned last time I think you really
> need some symbolic constants for dealing with dirty vs uptodate
> state and not just do a single fill for them.

And I don't think this 'from_writeback' argument is well-named.
Presumably it's needed because folio_test_dirty() will be false
at this point in the writeback path because it got cleared by the VFS?
But in any case, it should be called 'dirty' or something, not tell me
where the function was called from.  I think what this should really
do is:

		if (dirty)
			iop_set_dirty(iop, 0, nr_blocks);
		if (folio_test_uptodate(folio))
			iop_set_uptodate(iop, 0, nr_blocks);

> > +			unsigned start = offset_in_folio(folio,
> > +					folio_pos(folio)) >> inode->i_blkbits;
> > +			bitmap_set(iop->state, start, nr_blocks);
> 
> Also this code leaves me scratching my head.  Unless I'm missing something
> important
> 
> 	 offset_in_folio(folio, folio_pos(folio))
> 
> must always return 0.

You are not missing anything.  I don't understand the mental process
that gets someone to write that.  It should logically be 0.

> Also the from_writeback logic is weird.  I'd rather have a
> "bool is_dirty" argument and then pass true for writeback because
> we know the folio is dirty, false where we know it can't be
> dirty and do the folio_test_dirty in the caller where we don't
> know the state.

hahaha, you think the same.  ok, i'm leaving my above comment though ;-)
Ritesh Harjani (IBM) Jan. 30, 2023, 8:27 p.m. UTC | #4
On 23/01/30 09:16AM, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> > +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
> > +		  bool from_writeback)
> >  {
> >  	struct iomap_page *iop = to_iomap_page(folio);
> >  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > @@ -58,12 +59,32 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> >  	else
> >  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >
> > -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> > +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >  		      gfp);
> >  	if (iop) {
>
> Please just return early here for the allocation failure case instead of
> adding a lot of code with extra indentation.

Sure. Will do that.


>
> >  		spin_lock_init(&iop->state_lock);
> > -		if (folio_test_uptodate(folio))
> > -			bitmap_fill(iop->state, nr_blocks);
> > +		/*
> > +		 * iomap_page_create can get called from writeback after
> > +		 * a truncate_inode_partial_folio operation on a large folio.
> > +		 * For large folio the iop structure is freed in
> > +		 * iomap_invalidate_folio() to ensure we can split the folio.
> > +		 * That means we will have to let go of the optimization of
> > +		 * tracking dirty bits here and set all bits as dirty if
> > +		 * the folio is marked uptodate.
> > +		 */
> > +		if (from_writeback && folio_test_uptodate(folio))
> > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > +		else if (folio_test_uptodate(folio)) {
>
> This code is very confusing.  First please only check
> folio_test_uptodate once, and then check the from_writeback flag
> inside the branch.  And as mentioned last time I think you really

Ok, sure. I will try and simplify it.


> need some symbolic constants for dealing with dirty vs uptodate
> state and not just do a single fill for them.

Yes, I agree. Sorry my bad. I will add that change in the next rev.


>
> > +			unsigned start = offset_in_folio(folio,
> > +					folio_pos(folio)) >> inode->i_blkbits;
> > +			bitmap_set(iop->state, start, nr_blocks);
>
> Also this code leaves me scratching my head.  Unless I'm missing something
> important
>
> 	 offset_in_folio(folio, folio_pos(folio))
>
> must always return 0.

That is always true, yes. In the next rev I can make it explicitly 0
(maybe with a small comment if required).

>
> Also the from_writeback logic is weird.  I'd rather have a
> "bool is_dirty" argument and then pass true for writeback because
> we know the folio is dirty, false where we know it can't be
> dirty and do the folio_test_dirty in the caller where we don't
> know the state.

Agreed. Thanks.


>
> > +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> > +{
> > +	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
> > +	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);
>
> Please avoid the overly long line.  In fact with such long function
> calls I'd generally prefer if the initialization was moved out of the
> declaration.

Sure, will do the same.

>
> > +
> > +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),
>
> Another overly long line here.

Sure. Will do the same.

-ritesh
Ritesh Harjani (IBM) Jan. 30, 2023, 8:34 p.m. UTC | #5
On 23/01/30 05:54PM, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> > On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
> > filesystem blocksize, this patch should improve performance by writing
> > back only the dirty subpage data.
> >
> > This should also reduce write amplification, since we can now track
> > subpage dirty status within the state bitmaps. Earlier we had to
> > write the entire 64k page even if only a part of it (e.g. 4k) was
> > updated.
> >
> > Performance testing of the below fio workload reveals ~16x performance
> > improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
> > FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.
> >
> > <test_randwrite.fio>
> > [global]
> > 	ioengine=psync
> > 	rw=randwrite
> > 	overwrite=1
> > 	pre_read=1
> > 	direct=0
> > 	bs=4k
> > 	size=1G
> > 	dir=./
> > 	numjobs=8
> > 	fdatasync=1
> > 	runtime=60
> > 	iodepth=64
> > 	group_reporting=1
> >
> > [fio-run]
>
> You really need to include this sentence from the cover letter in this
> patch:
>
> 2. Also our internal performance team reported that this patch improves their
>    database workload performance by around ~83% (with XFS on Power)
>
> because that's far more meaningful than "Look, I cooked up an artificial
> workload where this makes a difference".

Agreed. I will add the other lines too in the commit message.

The intention behind adding the fio workload is for others to have a test
case to verify against, and/or to provide more info when someone later
refers to the commit message.
One of the interesting observations with this synthetic fio workload was
that we can/should easily observe the theoretical performance gain of
around ~16x (i.e. 64k (pagesize) / 4k (blocksize)).

Thanks again for the review!
-ritesh
Ritesh Harjani (IBM) Jan. 30, 2023, 8:44 p.m. UTC | #6
On 23/01/30 06:01PM, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 09:16:33AM -0800, Christoph Hellwig wrote:
> > > +		if (from_writeback && folio_test_uptodate(folio))
> > > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > > +		else if (folio_test_uptodate(folio)) {
> >
> > This code is very confusing.  First please only check
> > folio_test_uptodate once, and then check the from_writeback flag
> > inside the branch.  And as mentioned last time I think you really
> > need some symbolic constants for dealing with dirty vs uptodate
> > state and not just do a single fill for them.
>
> And I don't think this 'from_writeback' argument is well-named.
> Presumably it's needed because folio_test_dirty() will be false
> at this point in the writeback path because it got cleared by the VFS?

Yes, folio_test_dirty() is false. We clear it in write_cache_pages() by calling
clear_page_dirty_for_io() before calling iomap_do_writepage().

> But in any case, it should be called 'dirty' or something, not tell me
> where the function was called from.  I think what this should really
> do is:
>
> 		if (dirty)
> 			iop_set_dirty(iop, 0, nr_blocks);
> 		if (folio_test_uptodate(folio))
> 			iop_set_uptodate(iop, 0, nr_blocks);

Sure, I got the idea. I will use "bool is_dirty".

>
> > > +			unsigned start = offset_in_folio(folio,
> > > +					folio_pos(folio)) >> inode->i_blkbits;
> > > +			bitmap_set(iop->state, start, nr_blocks);
> >
> > Also this code leaves me scratching my head.  Unless I'm missing something
> > important
> >
> > 	 offset_in_folio(folio, folio_pos(folio))
> >
> > must always return 0.
>
> You are not missing anything.  I don't understand the mental process
> that gets someone to write that.  It should logically be 0.

Sorry about the confusion. Yes, that is correct. I ended up using the
above in some places and 0 in others, and for the final cleanup I settled
on the above call.

I will correct it in the next rev.

>
> > Also the from_writeback logic is weird.  I'd rather have a
> > "bool is_dirty" argument and then pass true for writeback because
> > we know the folio is dirty, false where we know it can't be
> > dirty and do the folio_test_dirty in the caller where we don't
> > know the state.
>
> hahaha, you think the same.  ok, i'm leaving my above comment though ;-)
>
No problem ;)

Thanks for your review!!
-ritesh

Patch

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e782b4f1d104..b9c35288a5eb 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -741,7 +741,7 @@  static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index faee2852db8f..af3e77276dab 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -44,7 +44,8 @@  static inline struct iomap_page *to_iomap_page(struct folio *folio)
 static struct bio_set iomap_ioend_bioset;
 
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
+		  bool from_writeback)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -58,12 +59,32 @@  iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
-		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->state, nr_blocks);
+		/*
+		 * iomap_page_create can get called from writeback after
+		 * a truncate_inode_partial_folio operation on a large folio.
+		 * For large folio the iop structure is freed in
+		 * iomap_invalidate_folio() to ensure we can split the folio.
+		 * That means we will have to let go of the optimization of
+		 * tracking dirty bits here and set all bits as dirty if
+		 * the folio is marked uptodate.
+		 */
+		if (from_writeback && folio_test_uptodate(folio))
+			bitmap_fill(iop->state, 2 * nr_blocks);
+		else if (folio_test_uptodate(folio)) {
+			unsigned start = offset_in_folio(folio,
+					folio_pos(folio)) >> inode->i_blkbits;
+			bitmap_set(iop->state, start, nr_blocks);
+		}
+		if (folio_test_dirty(folio)) {
+			unsigned start = offset_in_folio(folio,
+					folio_pos(folio)) >> inode->i_blkbits;
+			start = start + nr_blocks;
+			bitmap_set(iop->state, start, nr_blocks);
+		}
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -168,6 +189,48 @@  static void iomap_set_range_uptodate(struct folio *folio,
 		folio_mark_uptodate(folio);
 }
 
+static void iomap_iop_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
+	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_set(iop->state, first, last - first + 1);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_set_range_dirty(folio, iop, off, len);
+}
+
+static void iomap_iop_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
+	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_clear(iop->state, first, last - first + 1);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_clear_range_dirty(folio, iop, off, len);
+}
+
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -231,7 +294,7 @@  static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_page_create(iter->inode, folio, iter->flags, false);
 	else
 		iop = to_iomap_page(folio);
 
@@ -269,7 +332,7 @@  static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags, false);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -497,6 +560,17 @@  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
+	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);
+
+	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),
+			      nr_blocks << mapping->host->i_blkbits);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -528,7 +602,7 @@  static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct iomap_page *iop = iomap_page_create(iter->inode, folio,
-						   iter->flags);
+						   iter->flags, false);
 	loff_t block_size = i_blocksize(iter->inode);
 	loff_t block_start = round_down(pos, block_size);
 	loff_t block_end = round_up(pos + len, block_size);
@@ -686,6 +760,7 @@  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1231,6 +1306,13 @@  static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 		block_commit_write(&folio->page, 0, length);
 	} else {
 		WARN_ON_ONCE(!folio_test_uptodate(folio));
+		/*
+		 * TODO: We need not set range of dirty bits in iop here.
+		 * This will be taken care by iomap_dirty_folio callback
+		 * function which gets called from folio_mark_dirty().
+		 */
+		iomap_set_range_dirty(folio, to_iomap_page(folio),
+				offset_in_folio(folio, iter->pos), length);
 		folio_mark_dirty(folio);
 	}
 
@@ -1590,7 +1672,7 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1606,7 +1688,7 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->state))
+		if (iop && !test_bit(i + nblocks, iop->state))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1650,6 +1732,9 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}
 
+	iomap_clear_range_dirty(folio, iop,
+				offset_in_folio(folio, folio_pos(folio)),
+				end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..7e6c54955b4f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -571,7 +571,7 @@  const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index a9c5c3f720ad..4cefc2af87f3 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -267,7 +267,7 @@  static const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0983dfc9a203..b60562a0b893 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -262,6 +262,7 @@  void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,