
[RFCv5,5/5] iomap: Add per-block dirty state tracking to improve performance

Message ID 86987466d8d7863bd0dca81e9d6c3eff7abd4964.1683485700.git.ritesh.list@gmail.com (mailing list archive)
State Mainlined, archived
Series iomap: Add support for per-block dirty state to improve write performance

Commit Message

Ritesh Harjani (IBM) May 7, 2023, 7:28 p.m. UTC
When filesystem blocksize is less than folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and when the
folio is uptodate in pagecache, then even a byte write can cause
an entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_page. We currently only track uptodate state.

This patch implements support for tracking per-block dirty state in
iomap_page->state bitmap. This should help improve the filesystem write
performance and help reduce write amplification.
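
As a rough illustration of the layout (the field list and indexing below
are only a sketch, not the exact code), iop->state is sized for
2 * nr_blocks bits, with the uptodate bits first and the dirty bits
directly after them:

/*
 * Sketch only -- fields abbreviated. Bits [0, nr_blocks) of ->state
 * track per-block uptodate state, bits [nr_blocks, 2 * nr_blocks)
 * track per-block dirty state.
 */
struct iomap_page {
        atomic_t        read_bytes_pending;
        atomic_t        write_bytes_pending;
        spinlock_t      state_lock;
        unsigned long   state[];
};

/* e.g. marking blocks [first_blk, first_blk + nr_blks) of a folio dirty: */
iop_set_range(iop, blks_per_folio + first_blk, nr_blks);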

Performance testing of the below fio workload reveals ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 112 insertions(+), 10 deletions(-)

--
2.39.2

Comments

Pankaj Raghav May 15, 2023, 8:16 a.m. UTC | #1
> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))

Shouldn't this be if (iop && iop_test_block_dirty(folio, i))?

Before, we were skipping if the blocks were not uptodate, but now we are
skipping if the blocks are not dirty (which means they are uptodate)?

I am new to iomap but let me know if I am missing something here.

>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
Ritesh Harjani (IBM) May 15, 2023, 8:31 a.m. UTC | #2
Pankaj Raghav <p.raghav@samsung.com> writes:

>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>
> Shouldn't this be if(iop && iop_test_block_dirty(folio, i))?
>
> Before we were skipping if the blocks were not uptodate but now we are
> skipping if the blocks are not dirty (which means they are uptodate)?
>
> I am new to iomap but let me know if I am missing something here.
>

We set the per-block dirty status in ->write_begin. The check above happens in the
writeback path, when we are about to write the dirty data to disk.
The check skips a block whose state is not dirty, which means the block was
not written to in ->write_begin(). Also, a block with dirty status is always
uptodate anyway.
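
(For reference -- the dirty-bit test that loop relies on isn't visible in
this hunk; a minimal sketch consistent with the split state bitmap could
look roughly like the below. The helper name matches the patch, the body
is illustrative only:)

static bool iop_test_block_dirty(struct folio *folio, int block)
{
        struct iomap_page *iop = to_iomap_page(folio);
        struct inode *inode = folio->mapping->host;
        unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);

        /* dirty bits are stored after the blks_per_folio uptodate bits */
        return test_bit(block + blks_per_folio, iop->state);
}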

Hope it helps!

-ritesh
Pankaj Raghav May 15, 2023, 1:23 p.m. UTC | #3
> 
> We set the per-block dirty status in ->write_begin. The check above happens in the
> writeback path when we are about to write the dirty data to the disk.
> What the check is doing is that, it checks if the block state is not dirty
> then skip it which means the block was not written to in the ->write_begin().
> Also the block with dirty status always means that the block is uptodate
> anyways.
> 
> Hope it helps!
> 

Definitely. I also checked your 1st RFC version to get more context. Thanks for
the explanation.
Brian Foster May 15, 2023, 3:15 p.m. UTC | #4
On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> When filesystem blocksize is less than folio size (either with
> mapping_large_folio_support() or with blocksize < pagesize) and when the
> folio is uptodate in pagecache, then even a byte write can cause
> an entire folio to be written to disk during writeback. This happens
> because we currently don't have a mechanism to track per-block dirty
> state within struct iomap_page. We currently only track uptodate state.
> 
> This patch implements support for tracking per-block dirty state in
> iomap_page->state bitmap. This should help improve the filesystem write
> performance and help reduce write amplification.
> 
> Performance testing of below fio workload reveals ~16x performance
> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves
>    their database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/file.c       |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 112 insertions(+), 10 deletions(-)
> 
...
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 25f20f269214..c7f41b26280a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> 
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iop->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-block uptodate and the
> +	 * second tracks per-block dirty state.
> +	 */
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range(iop, 0, nr_blocks);
> +		if (is_dirty)
> +			iop_set_range(iop, nr_blocks, nr_blocks);

I find the is_dirty logic here a bit confusing. AFAICT, this is
primarily to handle the case where a folio is completely overwritten, so
no iop is allocated at write time, and so then writeback needs to
allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
callers other than iomap_writepage_map() either pass the result of
folio_test_dirty() or explicitly dirty the entire range of the folio
anyways.  iomap_dirty_folio() essentially does the latter because it
needs to dirty the entire folio regardless of whether the iop already
exists or not, right?

If so and if I'm following all of that correctly, could this complexity
be isolated to iomap_writepage_map() by simply checking for the !iop
case first, then call iop_alloc() immediately followed by
set_range_dirty() of the entire folio? Then presumably iop_alloc() could
always just dirty based on folio state with the writeback path exception
case handled explicitly. Hm?

>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
...
> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> 
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	struct iomap_page *iop;
> +	struct inode *inode = mapping->host;
> +	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;

folio_size()?

> +
> +	iop = iop_alloc(inode, folio, 0, false);
> +	iop_set_range_dirty(inode, folio, 0, len);
> +	return filemap_dirty_folio(mapping, folio);
> +}
> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
...
> @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>  				}
>  			}
> 
> +			/*
> +			 * When we have per-block dirty tracking, there can be
> +			 * blocks within a folio which are marked uptodate
> +			 * but not dirty. In that case it is necessary to punch
> +			 * out such blocks to avoid leaking any delalloc blocks.
> +			 */
> +			iop = to_iomap_page(folio);
> +			if (!iop)
> +				goto skip_iop_punch;
> +			last_byte = min_t(loff_t, end_byte - 1,
> +				(folio_next_index(folio) << PAGE_SHIFT) - 1);
> +			first_blk = offset_in_folio(folio, start_byte) >>
> +						    blkbits;
> +			last_blk = offset_in_folio(folio, last_byte) >>
> +						   blkbits;
> +			blks_per_folio = i_blocks_per_folio(inode, folio);

Unused?

> +			for (i = first_blk; i <= last_blk; i++) {
> +				if (!iop_test_block_dirty(folio, i))
> +					punch(inode, i << blkbits,
> +						     1 << blkbits);
> +			}
> +skip_iop_punch:

Looks reasonable at first glance, though the deep indentation here kind
of makes my eyes cross. Could we stuff this loop into an
iomap_write_delalloc_scan_folio() helper or some such, and then maybe
update the comment at the top of the whole branch to explain both
punches?
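
(Purely as a sketch of that refactor -- the helper name is the one
suggested above, the signature is hypothetical, and the body is just the
quoted hunk moved out, with its offsets and omitted error handling kept
as-is:)

static void iomap_write_delalloc_scan_folio(struct inode *inode,
                struct folio *folio, loff_t start_byte, loff_t end_byte,
                int (*punch)(struct inode *inode, loff_t offset, loff_t length))
{
        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int blkbits = inode->i_blkbits;
        unsigned int first_blk, last_blk, i;
        loff_t last_byte;

        /* without sub-folio dirty state there is nothing to punch selectively */
        if (!iop)
                return;

        last_byte = min_t(loff_t, end_byte - 1,
                        (folio_next_index(folio) << PAGE_SHIFT) - 1);
        first_blk = offset_in_folio(folio, start_byte) >> blkbits;
        last_blk = offset_in_folio(folio, last_byte) >> blkbits;

        /* punch out blocks that are uptodate but not dirty */
        for (i = first_blk; i <= last_blk; i++) {
                if (!iop_test_block_dirty(folio, i))
                        punch(inode, i << blkbits, 1 << blkbits);
        }
}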

WRT to the !iop case.. I _think_ this is handled correctly here because
that means we'd handle the folio as completely dirty at writeback time.
Is that the case? If so, it might be nice to document that assumption
somewhere (here or perhaps in the writeback path).
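
(e.g. a comment along these lines at the spot where writeback picks up the
folio -- wording is only a suggestion:)

/*
 * A folio with no iop attached here either has blocksize == pagesize or
 * was dirtied by a full overwrite without an iop being allocated, so
 * writeback must treat every block in it as dirty.
 */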

Brian

>  			/*
>  			 * Make sure the next punch start is correctly bound to
>  			 * the end of this data range, not the end of the folio.
> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))
>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		}
>  	}
> 
> +	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
>  	folio_start_writeback(folio);
>  	folio_unlock(folio);
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2ef78aa1d3f6..77c7332ae197 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.read_folio		= xfs_vm_read_folio,
>  	.readahead		= xfs_vm_readahead,
>  	.writepages		= xfs_vm_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.bmap			= xfs_vm_bmap,
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 132f01d3461f..e508c8e97372 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>  	.read_folio		= zonefs_read_folio,
>  	.readahead		= zonefs_readahead,
>  	.writepages		= zonefs_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.migrate_folio		= filemap_migrate_folio,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0f8123504e5e..0c2bee80565c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> --
> 2.39.2
>
Ritesh Harjani (IBM) May 16, 2023, 2:49 p.m. UTC | #5
Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> When filesystem blocksize is less than folio size (either with
>> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> folio is uptodate in pagecache, then even a byte write can cause
>> an entire folio to be written to disk during writeback. This happens
>> because we currently don't have a mechanism to track per-block dirty
>> state within struct iomap_page. We currently only track uptodate state.
>>
>> This patch implements support for tracking per-block dirty state in
>> iomap_page->state bitmap. This should help improve the filesystem write
>> performance and help reduce write amplification.
>>
>> Performance testing of below fio workload reveals ~16x performance
>> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>>
>> 1. <test_randwrite.fio>
>> [global]
>> 	ioengine=psync
>> 	rw=randwrite
>> 	overwrite=1
>> 	pre_read=1
>> 	direct=0
>> 	bs=4k
>> 	size=1G
>> 	dir=./
>> 	numjobs=8
>> 	fdatasync=1
>> 	runtime=60
>> 	iodepth=64
>> 	group_reporting=1
>>
>> [fio-run]
>>
>> 2. Also our internal performance team reported that this patch improves
>>    their database workload performance by around ~83% (with XFS on Power)
>>
>> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> Reported-by: Brian Foster <bfoster@redhat.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/gfs2/aops.c         |   2 +-
>>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>>  fs/xfs/xfs_aops.c      |   2 +-
>>  fs/zonefs/file.c       |   2 +-
>>  include/linux/iomap.h  |   1 +
>>  5 files changed, 112 insertions(+), 10 deletions(-)
>>
> ...
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 25f20f269214..c7f41b26280a 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
> ...
>> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> +	/*
>> +	 * iop->state tracks two sets of state flags when the
>> +	 * filesystem block size is smaller than the folio size.
>> +	 * The first state tracks per-block uptodate and the
>> +	 * second tracks per-block dirty state.
>> +	 */
>> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>>  		      gfp);
>>  	if (iop) {
>>  		spin_lock_init(&iop->state_lock);
>>  		if (folio_test_uptodate(folio))
>>  			iop_set_range(iop, 0, nr_blocks);
>> +		if (is_dirty)
>> +			iop_set_range(iop, nr_blocks, nr_blocks);
>
> I find the is_dirty logic here a bit confusing. AFAICT, this is
> primarily to handle the case where a folio is completely overwritten, so
> no iop is allocated at write time, and so then writeback needs to
> allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> callers other than iomap_writepage_map() either pass the result of
> folio_test_dirty() or explicitly dirty the entire range of the folio
> anyways.  iomap_dirty_folio() essentially does the latter because it
> needs to dirty the entire folio regardless of whether the iop already
> exists or not, right?

Yes.

>
> If so and if I'm following all of that correctly, could this complexity
> be isolated to iomap_writepage_map() by simply checking for the !iop
> case first, then call iop_alloc() immediately followed by
> set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> always just dirty based on folio state with the writeback path exception
> case handled explicitly. Hm?
>

Hi Brian,

It was discussed here [1] to pass an is_dirty flag at the time of iop
allocation. We can do what you are essentially suggesting, but it's just
extra work, i.e. we would again do some calculations of blocks_per_folio,
start and end, and more importantly take and release the iop->state_lock
spinlock. Whereas with the above approach we get all of that done at the
time of iop allocation itself.

Besides, isn't it easier this way? As you also stated, we dirty all the
blocks based on the is_dirty flag, which is folio_test_dirty() everywhere
except at the writeback time.
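
(For context, each such set_range_dirty call goes through the state lock;
the patch's set-dirty helper -- the same hunk Christoph quotes further
down -- is roughly:)

static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
                                size_t off, size_t len)
{
        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
        unsigned int first_blk = off >> inode->i_blkbits;
        unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
        unsigned int nr_blks = last_blk - first_blk + 1;
        unsigned long flags;

        if (!iop)
                return;
        spin_lock_irqsave(&iop->state_lock, flags);
        /* dirty bits live after the blks_per_folio uptodate bits */
        iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
        spin_unlock_irqrestore(&iop->state_lock, flags);
}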


>>  		folio_attach_private(folio, iop);
>>  	}
>>  	return iop;
> ...
>> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>>  }
>>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>>
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>> +{
>> +	struct iomap_page *iop;
>> +	struct inode *inode = mapping->host;
>> +	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
>
> folio_size()?
>

sure.

>> +
>> +	iop = iop_alloc(inode, folio, 0, false);
>> +	iop_set_range_dirty(inode, folio, 0, len);
>> +	return filemap_dirty_folio(mapping, folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>> +
>>  static void
>>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>>  {
> ...
>> @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>>  				}
>>  			}
>>
>> +			/*
>> +			 * When we have per-block dirty tracking, there can be
>> +			 * blocks within a folio which are marked uptodate
>> +			 * but not dirty. In that case it is necessary to punch
>> +			 * out such blocks to avoid leaking any delalloc blocks.
>> +			 */
>> +			iop = to_iomap_page(folio);
>> +			if (!iop)
>> +				goto skip_iop_punch;
>> +			last_byte = min_t(loff_t, end_byte - 1,
>> +				(folio_next_index(folio) << PAGE_SHIFT) - 1);
>> +			first_blk = offset_in_folio(folio, start_byte) >>
>> +						    blkbits;
>> +			last_blk = offset_in_folio(folio, last_byte) >>
>> +						   blkbits;
>> +			blks_per_folio = i_blocks_per_folio(inode, folio);
>
> Unused?
>

Yes, I will fix it in the next rev; somehow compilation didn't give me any warnings.

>> +			for (i = first_blk; i <= last_blk; i++) {
>> +				if (!iop_test_block_dirty(folio, i))
>> +					punch(inode, i << blkbits,
>> +						     1 << blkbits);
>> +			}
>> +skip_iop_punch:
>
> Looks reasonable at first glance, though the deep indentation here kind
> of makes my eyes cross. Could we stuff this loop into a
> iomap_write_delalloc_scan_folio() helper or some such, and then maybe
> update the comment at the top of the whole branch to explain both
> punches?

I felt that too. Sure will give it a try in the next spin.

>
> WRT to the !iop case.. I _think_ this is handled correctly here because
> that means we'd handle the folio as completely dirty at writeback time.
> Is that the case? If so, it might be nice to document that assumption
> somewhere (here or perhaps in the writeback path).
>

The !iop case is simply when we don't have a large folio and blocksize ==
pagesize. In that case we don't allocate any iop and simply return from
iop_alloc(). So we just skip the loop, which is only meant for when we have
multiple blocks within a folio.
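
(That return is the early return at the top of iop_alloc() -- assuming it
keeps the pre-existing single-block short-circuit, roughly:)

        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

        /* a single-block folio needs no per-block state */
        if (iop || nr_blocks <= 1)
                return iop;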


-ritesh


> Brian
>
>>  			/*
>>  			 * Make sure the next punch start is correctly bound to
>>  			 * the end of this data range, not the end of the folio.
>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>> @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		}
>>  	}
>>
>> +	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
>>  	folio_start_writeback(folio);
>>  	folio_unlock(folio);
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 2ef78aa1d3f6..77c7332ae197 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>>  	.read_folio		= xfs_vm_read_folio,
>>  	.readahead		= xfs_vm_readahead,
>>  	.writepages		= xfs_vm_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.bmap			= xfs_vm_bmap,
>> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
>> index 132f01d3461f..e508c8e97372 100644
>> --- a/fs/zonefs/file.c
>> +++ b/fs/zonefs/file.c
>> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>>  	.read_folio		= zonefs_read_folio,
>>  	.readahead		= zonefs_readahead,
>>  	.writepages		= zonefs_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.migrate_folio		= filemap_migrate_folio,
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 0f8123504e5e..0c2bee80565c 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>>  		const struct iomap_ops *ops);
>>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>> --
>> 2.39.2
>>
Brian Foster May 16, 2023, 7:29 p.m. UTC | #6
On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> >> When filesystem blocksize is less than folio size (either with
> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
> >> folio is uptodate in pagecache, then even a byte write can cause
> >> an entire folio to be written to disk during writeback. This happens
> >> because we currently don't have a mechanism to track per-block dirty
> >> state within struct iomap_page. We currently only track uptodate state.
> >>
> >> This patch implements support for tracking per-block dirty state in
> >> iomap_page->state bitmap. This should help improve the filesystem write
> >> performance and help reduce write amplification.
> >>
> >> Performance testing of below fio workload reveals ~16x performance
> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> >>
> >> 1. <test_randwrite.fio>
> >> [global]
> >> 	ioengine=psync
> >> 	rw=randwrite
> >> 	overwrite=1
> >> 	pre_read=1
> >> 	direct=0
> >> 	bs=4k
> >> 	size=1G
> >> 	dir=./
> >> 	numjobs=8
> >> 	fdatasync=1
> >> 	runtime=60
> >> 	iodepth=64
> >> 	group_reporting=1
> >>
> >> [fio-run]
> >>
> >> 2. Also our internal performance team reported that this patch improves
> >>    their database workload performance by around ~83% (with XFS on Power)
> >>
> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> >> Reported-by: Brian Foster <bfoster@redhat.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/gfs2/aops.c         |   2 +-
> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
> >>  fs/xfs/xfs_aops.c      |   2 +-
> >>  fs/zonefs/file.c       |   2 +-
> >>  include/linux/iomap.h  |   1 +
> >>  5 files changed, 112 insertions(+), 10 deletions(-)
> >>
> > ...
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 25f20f269214..c7f41b26280a 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> > ...
> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >>  	else
> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >>
> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> >> +	/*
> >> +	 * iop->state tracks two sets of state flags when the
> >> +	 * filesystem block size is smaller than the folio size.
> >> +	 * The first state tracks per-block uptodate and the
> >> +	 * second tracks per-block dirty state.
> >> +	 */
> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >>  		      gfp);
> >>  	if (iop) {
> >>  		spin_lock_init(&iop->state_lock);
> >>  		if (folio_test_uptodate(folio))
> >>  			iop_set_range(iop, 0, nr_blocks);
> >> +		if (is_dirty)
> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
> >
> > I find the is_dirty logic here a bit confusing. AFAICT, this is
> > primarily to handle the case where a folio is completely overwritten, so
> > no iop is allocated at write time, and so then writeback needs to
> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> > callers other than iomap_writepage_map() either pass the result of
> > folio_test_dirty() or explicitly dirty the entire range of the folio
> > anyways.  iomap_dirty_folio() essentially does the latter because it
> > needs to dirty the entire folio regardless of whether the iop already
> > exists or not, right?
> 
> Yes.
> 
> >
> > If so and if I'm following all of that correctly, could this complexity
> > be isolated to iomap_writepage_map() by simply checking for the !iop
> > case first, then call iop_alloc() immediately followed by
> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> > always just dirty based on folio state with the writeback path exception
> > case handled explicitly. Hm?
> >
> 
> Hi Brian,
> 
> It was discussed here [1] to pass is_dirty flag at the time of iop
> allocation. We can do what you are essentially suggesting, but it's just
> extra work i.e. we will again do some calculations of blocks_per_folio,
> start, end and more importantly take and release iop->state_lock
> spinlock. Whereas with above approach we could get away with this at the
> time of iop allocation itself.
> 

Hi Ritesh,

Isn't that extra work already occurring in iomap_dirty_folio()? I was
just thinking that maybe moving it to where it's apparently needed (i.e.
writeback) might eliminate the need for the param.

I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
first to make sure iop_alloc() sees the dirty state, but maybe that
would also allow skipping the iop alloc if the folio was already dirty
(i.e. if the folio was previously dirtied by a full buffered overwrite
for example)?

I've appended a quick diff below (compile tested only) just to explain
what I mean. When doing that it also occurred to me that if we really
care about the separate call, we could keep the is_dirty param but do
the __iop_alloc() wrapper thing where iop_alloc() always passes
folio_test_dirty().

BTW, I think you left off your [1] discussion reference..

> Besides, isn't it easier this way? which as you also stated we will
> dirty all the blocks based on is_dirty flag, which is folio_test_dirty()
> except at the writeback time.
> 

My thinking was just that I kind of had to read through all of the
iop_alloc() callsites to grok the purpose of the parameter, which made
it seem unnecessarily confusing. But ultimately it made sense, so I
don't insist on changing it or anything if this approach is intentional
and/or preferred by others. That's just my .02 and I'll defer to your
preference. :)

> 
> >>  		folio_attach_private(folio, iop);
> >>  	}
> >>  	return iop;
> > ...
> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
...
> >
> > WRT to the !iop case.. I _think_ this is handled correctly here because
> > that means we'd handle the folio as completely dirty at writeback time.
> > Is that the case? If so, it might be nice to document that assumption
> > somewhere (here or perhaps in the writeback path).
> >
> 
> !iop case is simply when we don't have a large folio and blocksize ==
>  pagesize. In that case we don't allocate any iop and simply returns
>  from iop_alloc().
> So then we just skip the loop which is only meant when we have blocks
> within a folio.
> 

Isn't it also the case that iop might be NULL at this point if the fs
has sub-folio blocks, but the folio was dirtied by a full overwrite of
the folio? Or did I misunderstand patch 4?

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92e1e1061225..89b3053e3f2d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
 }
 
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
-				    unsigned int flags, bool is_dirty)
+				    unsigned int flags)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range(iop, 0, nr_blocks);
-		if (is_dirty)
+		if (folio_test_dirty(folio))
 			iop_set_range(iop, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
@@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iop_alloc(iter->inode, folio, iter->flags,
-				folio_test_dirty(folio));
+		iop = iop_alloc(iter->inode, folio, iter->flags);
 	else
 		iop = to_iomap_page(folio);
 
@@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iop_alloc(iter->inode, folio, iter->flags,
-			folio_test_dirty(folio));
+	iop = iop_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
 bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
 {
-	struct iomap_page *iop;
-	struct inode *inode = mapping->host;
-	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
-
-	iop = iop_alloc(inode, folio, 0, false);
-	iop_set_range_dirty(inode, folio, 0, len);
-	return filemap_dirty_folio(mapping, folio);
+	bool dirtied = filemap_dirty_folio(mapping, folio);
+	if (dirtied)
+		iop_alloc(mapping->host, folio, 0);
+	return dirtied;
 }
 EXPORT_SYMBOL_GPL(iomap_dirty_folio);
 
@@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
 
-	iop = iop_alloc(iter->inode, folio, iter->flags,
-			folio_test_dirty(folio));
+	iop = iop_alloc(iter->inode, folio, iter->flags);
 
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
@@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
+	struct iomap_page *iop = to_iomap_page(folio);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
+	if (!iop) {
+		iop = iop_alloc(inode, folio, 0);
+		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
+	}
+
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
 
 	/*
Ritesh Harjani (IBM) May 17, 2023, 3:20 p.m. UTC | #7
Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>>
>> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> >> When filesystem blocksize is less than folio size (either with
>> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> >> folio is uptodate in pagecache, then even a byte write can cause
>> >> an entire folio to be written to disk during writeback. This happens
>> >> because we currently don't have a mechanism to track per-block dirty
>> >> state within struct iomap_page. We currently only track uptodate state.
>> >>
>> >> This patch implements support for tracking per-block dirty state in
>> >> iomap_page->state bitmap. This should help improve the filesystem write
>> >> performance and help reduce write amplification.
>> >>
>> >> Performance testing of below fio workload reveals ~16x performance
>> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>> >>
>> >> 1. <test_randwrite.fio>
>> >> [global]
>> >> 	ioengine=psync
>> >> 	rw=randwrite
>> >> 	overwrite=1
>> >> 	pre_read=1
>> >> 	direct=0
>> >> 	bs=4k
>> >> 	size=1G
>> >> 	dir=./
>> >> 	numjobs=8
>> >> 	fdatasync=1
>> >> 	runtime=60
>> >> 	iodepth=64
>> >> 	group_reporting=1
>> >>
>> >> [fio-run]
>> >>
>> >> 2. Also our internal performance team reported that this patch improves
>> >>    their database workload performance by around ~83% (with XFS on Power)
>> >>
>> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> >> Reported-by: Brian Foster <bfoster@redhat.com>
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>  fs/gfs2/aops.c         |   2 +-
>> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>> >>  fs/xfs/xfs_aops.c      |   2 +-
>> >>  fs/zonefs/file.c       |   2 +-
>> >>  include/linux/iomap.h  |   1 +
>> >>  5 files changed, 112 insertions(+), 10 deletions(-)
>> >>
>> > ...
>> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> >> index 25f20f269214..c7f41b26280a 100644
>> >> --- a/fs/iomap/buffered-io.c
>> >> +++ b/fs/iomap/buffered-io.c
>> > ...
>> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>> >>  	else
>> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>> >>
>> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> >> +	/*
>> >> +	 * iop->state tracks two sets of state flags when the
>> >> +	 * filesystem block size is smaller than the folio size.
>> >> +	 * The first state tracks per-block uptodate and the
>> >> +	 * second tracks per-block dirty state.
>> >> +	 */
>> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>> >>  		      gfp);
>> >>  	if (iop) {
>> >>  		spin_lock_init(&iop->state_lock);
>> >>  		if (folio_test_uptodate(folio))
>> >>  			iop_set_range(iop, 0, nr_blocks);
>> >> +		if (is_dirty)
>> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
>> >
>> > I find the is_dirty logic here a bit confusing. AFAICT, this is
>> > primarily to handle the case where a folio is completely overwritten, so
>> > no iop is allocated at write time, and so then writeback needs to
>> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
>> > callers other than iomap_writepage_map() either pass the result of
>> > folio_test_dirty() or explicitly dirty the entire range of the folio
>> > anyways.  iomap_dirty_folio() essentially does the latter because it
>> > needs to dirty the entire folio regardless of whether the iop already
>> > exists or not, right?
>>
>> Yes.
>>
>> >
>> > If so and if I'm following all of that correctly, could this complexity
>> > be isolated to iomap_writepage_map() by simply checking for the !iop
>> > case first, then call iop_alloc() immediately followed by
>> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
>> > always just dirty based on folio state with the writeback path exception
>> > case handled explicitly. Hm?
>> >
>>
>> Hi Brian,
>>
>> It was discussed here [1] to pass is_dirty flag at the time of iop
>> allocation. We can do what you are essentially suggesting, but it's just
>> extra work i.e. we will again do some calculations of blocks_per_folio,
>> start, end and more importantly take and release iop->state_lock
>> spinlock. Whereas with above approach we could get away with this at the
>> time of iop allocation itself.
>>
>
> Hi Ritesh,
>
> Isn't that extra work already occurring in iomap_dirty_folio()? I was
> just thinking that maybe moving it to where it's apparently needed (i.e.
> writeback) might eliminate the need for the param.
>
> I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> first to make sure iop_alloc() sees the dirty state, but maybe that
> would also allow skipping the iop alloc if the folio was already dirty
> (i.e. if the folio was previously dirtied by a full buffered overwite
> for example)?
>
> I've appended a quick diff below (compile tested only) just to explain
> what I mean. When doing that it also occurred to me that if we really
> care about the separate call, we could keep the is_dirty param but do
> the __iop_alloc() wrapper thing where iop_alloc() always passes
> folio_test_dirty().

Sure, Brian. I guess when we got that comment, it was still in the initial
work & I was anyway passing a from_writeback flag. Thanks for the diff,
it does make sense to me. I will try to add that change in the next revision.

>
> BTW, I think you left off your [1] discussion reference..
>

Sorry, my bad.

[1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/

>> Besides, isn't it easier this way? which as you also stated we will
>> dirty all the blocks based on is_dirty flag, which is folio_test_dirty()
>> except at the writeback time.
>>
>
> My thinking was just that I kind of had to read through all of the
> iop_alloc() callsites to grok the purpose of the parameter, which made
> it seem unnecessarily confusing. But ultimately it made sense, so I
> don't insist on changing it or anything if this approach is intentional
> and/or preferred by others. That's just my .02 and I'll defer to your
> preference. :)
>

Sure Thanks!

>>
>> >>  		folio_attach_private(folio, iop);
>> >>  	}
>> >>  	return iop;
>> > ...
>> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> ...
>> >
>> > WRT to the !iop case.. I _think_ this is handled correctly here because
>> > that means we'd handle the folio as completely dirty at writeback time.
>> > Is that the case? If so, it might be nice to document that assumption
>> > somewhere (here or perhaps in the writeback path).
>> >
>>
>> !iop case is simply when we don't have a large folio and blocksize ==
>>  pagesize. In that case we don't allocate any iop and simply returns
>>  from iop_alloc().
>> So then we just skip the loop which is only meant when we have blocks
>> within a folio.
>>
>
> Isn't it also the case that iop might be NULL at this point if the fs
> has sub-folio blocks, but the folio was dirtied by a full overwrite of
> the folio? Or did I misunderstand patch 4?
>

Yes, both of the cases. We can either have bs == ps, or we can have a
completely overwritten folio, in which case we don't allocate any iop
until the iomap_writepage_map() function.

Sure, I will document that when we factor out that change into a separate function.

> Brian
>
> --- 8< ---
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92e1e1061225..89b3053e3f2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>  }
>
>  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> -				    unsigned int flags, bool is_dirty)
> +				    unsigned int flags)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range(iop, 0, nr_blocks);
> -		if (is_dirty)
> +		if (folio_test_dirty(folio))
>  			iop_set_range(iop, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
> @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iop = iop_alloc(iter->inode, folio, iter->flags,
> -				folio_test_dirty(folio));
> +		iop = iop_alloc(iter->inode, folio, iter->flags);
>  	else
>  		iop = to_iomap_page(folio);
>
> @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  		return iomap_read_inline_data(iter, folio);
>
>  	/* zero post-eof blocks as the page may be mapped */
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
> @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>
>  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>  {
> -	struct iomap_page *iop;
> -	struct inode *inode = mapping->host;
> -	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
> -
> -	iop = iop_alloc(inode, folio, 0, false);
> -	iop_set_range_dirty(inode, folio, 0, len);
> -	return filemap_dirty_folio(mapping, folio);
> +	bool dirtied = filemap_dirty_folio(mapping, folio);
> +	if (dirtied)
> +		iop_alloc(mapping->host, folio, 0);
> +	return dirtied;
>  }
>  EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>
> @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;
> @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
> +	struct iomap_page *iop = to_iomap_page(folio);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	int error = 0, count = 0, i;
>  	LIST_HEAD(submit_list);
>
> +	if (!iop) {
> +		iop = iop_alloc(inode, folio, 0);
> +		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> +	}
> +
>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
>  	/*

Thanks for the review!

-ritesh
Brian Foster May 17, 2023, 6:48 p.m. UTC | #8
On Wed, May 17, 2023 at 08:50:39PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >>
> >> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> >> >> When filesystem blocksize is less than folio size (either with
> >> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
> >> >> folio is uptodate in pagecache, then even a byte write can cause
> >> >> an entire folio to be written to disk during writeback. This happens
> >> >> because we currently don't have a mechanism to track per-block dirty
> >> >> state within struct iomap_page. We currently only track uptodate state.
> >> >>
> >> >> This patch implements support for tracking per-block dirty state in
> >> >> iomap_page->state bitmap. This should help improve the filesystem write
> >> >> performance and help reduce write amplification.
> >> >>
> >> >> Performance testing of below fio workload reveals ~16x performance
> >> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> >> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> >> >>
> >> >> 1. <test_randwrite.fio>
> >> >> [global]
> >> >> 	ioengine=psync
> >> >> 	rw=randwrite
> >> >> 	overwrite=1
> >> >> 	pre_read=1
> >> >> 	direct=0
> >> >> 	bs=4k
> >> >> 	size=1G
> >> >> 	dir=./
> >> >> 	numjobs=8
> >> >> 	fdatasync=1
> >> >> 	runtime=60
> >> >> 	iodepth=64
> >> >> 	group_reporting=1
> >> >>
> >> >> [fio-run]
> >> >>
> >> >> 2. Also our internal performance team reported that this patch improves
> >> >>    their database workload performance by around ~83% (with XFS on Power)
> >> >>
> >> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> >> >> Reported-by: Brian Foster <bfoster@redhat.com>
> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> >> ---
> >> >>  fs/gfs2/aops.c         |   2 +-
> >> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
> >> >>  fs/xfs/xfs_aops.c      |   2 +-
> >> >>  fs/zonefs/file.c       |   2 +-
> >> >>  include/linux/iomap.h  |   1 +
> >> >>  5 files changed, 112 insertions(+), 10 deletions(-)
> >> >>
> >> > ...
> >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> >> index 25f20f269214..c7f41b26280a 100644
> >> >> --- a/fs/iomap/buffered-io.c
> >> >> +++ b/fs/iomap/buffered-io.c
> >> > ...
> >> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >> >>  	else
> >> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >> >>
> >> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> >> >> +	/*
> >> >> +	 * iop->state tracks two sets of state flags when the
> >> >> +	 * filesystem block size is smaller than the folio size.
> >> >> +	 * The first state tracks per-block uptodate and the
> >> >> +	 * second tracks per-block dirty state.
> >> >> +	 */
> >> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >> >>  		      gfp);
> >> >>  	if (iop) {
> >> >>  		spin_lock_init(&iop->state_lock);
> >> >>  		if (folio_test_uptodate(folio))
> >> >>  			iop_set_range(iop, 0, nr_blocks);
> >> >> +		if (is_dirty)
> >> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
> >> >
> >> > I find the is_dirty logic here a bit confusing. AFAICT, this is
> >> > primarily to handle the case where a folio is completely overwritten, so
> >> > no iop is allocated at write time, and so then writeback needs to
> >> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> >> > callers other than iomap_writepage_map() either pass the result of
> >> > folio_test_dirty() or explicitly dirty the entire range of the folio
> >> > anyways.  iomap_dirty_folio() essentially does the latter because it
> >> > needs to dirty the entire folio regardless of whether the iop already
> >> > exists or not, right?
> >>
> >> Yes.
> >>
> >> >
> >> > If so and if I'm following all of that correctly, could this complexity
> >> > be isolated to iomap_writepage_map() by simply checking for the !iop
> >> > case first, then call iop_alloc() immediately followed by
> >> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> >> > always just dirty based on folio state with the writeback path exception
> >> > case handled explicitly. Hm?
> >> >
> >>
> >> Hi Brian,
> >>
> >> It was discussed here [1] to pass is_dirty flag at the time of iop
> >> allocation. We can do what you are essentially suggesting, but it's just
> >> extra work i.e. we will again do some calculations of blocks_per_folio,
> >> start, end and more importantly take and release iop->state_lock
> >> spinlock. Whereas with above approach we could get away with this at the
> >> time of iop allocation itself.
> >>
> >
> > Hi Ritesh,
> >
> > Isn't that extra work already occurring in iomap_dirty_folio()? I was
> > just thinking that maybe moving it to where it's apparently needed (i.e.
> > writeback) might eliminate the need for the param.
> >
> > I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> > first to make sure iop_alloc() sees the dirty state, but maybe that
> > would also allow skipping the iop alloc if the folio was already dirty
> > (i.e. if the folio was previously dirtied by a full buffered overwite
> > for example)?
> >
> > I've appended a quick diff below (compile tested only) just to explain
> > what I mean. When doing that it also occurred to me that if we really
> > care about the separate call, we could keep the is_dirty param but do
> > the __iop_alloc() wrapper thing where iop_alloc() always passes
> > folio_test_dirty().
> 
> Sure. Brian. I guess when we got the comment, it was still in the intial
> work & I was anyway passing a from_writeback flag. Thanks for the diff,
> it does make sense to me. I will try to add that change in the next revision.
> 

Ok, though I think what I did to iomap_dirty_folio() might be wrong...
If we have a folio/iop that is already partially dirty with sub-folio
blocks, and then that folio is mapped and write faulted, we still need
to explicitly dirty the rest of the iop during the fault, right? If so,
then I guess we'd still need to keep the iop_set_range_dirty() call
there at least for that case.

Hmm.. so I suppose I could see doing that a couple different ways. One
is just to just keep it as you already have it and just drop the dirty
param. I.e.:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
        iop_alloc(mapping->host, folio, 0);
        iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
        return filemap_dirty_folio(mapping, folio);
}

But I also wonder.. if we can skip the iop alloc on full folio buffered
overwrites, isn't that also true of mapped writes to folios that don't
already have an iop? I.e., I think what I was trying to do in the
previous diff was actually something like this:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
        iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
        return filemap_dirty_folio(mapping, folio);
}

... which would only dirty the full iop if it already exists. Thoughts?

Brian

> >
> > BTW, I think you left off your [1] discussion reference..
> >
> 
> Sorry, my bad.
> 
> [1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/
> 
> >> Besides, isn't it easier this way? which as you also stated we will
> >> dirty all the blocks based on is_dirty flag, which is folio_test_dirty()
> >> except at the writeback time.
> >>
> >
> > My thinking was just that I kind of had to read through all of the
> > iop_alloc() callsites to grok the purpose of the parameter, which made
> > it seem unnecessarily confusing. But ultimately it made sense, so I
> > don't insist on changing it or anything if this approach is intentional
> > and/or preferred by others. That's just my .02 and I'll defer to your
> > preference. :)
> >
> 
> Sure Thanks!
> 
> >>
> >> >>  		folio_attach_private(folio, iop);
> >> >>  	}
> >> >>  	return iop;
> >> > ...
> >> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> > ...
> >> >
> >> > WRT to the !iop case.. I _think_ this is handled correctly here because
> >> > that means we'd handle the folio as completely dirty at writeback time.
> >> > Is that the case? If so, it might be nice to document that assumption
> >> > somewhere (here or perhaps in the writeback path).
> >> >
> >>
> >> !iop case is simply when we don't have a large folio and blocksize ==
> >>  pagesize. In that case we don't allocate any iop and simply returns
> >>  from iop_alloc().
> >> So then we just skip the loop which is only meant when we have blocks
> >> within a folio.
> >>
> >
> > Isn't it also the case that iop might be NULL at this point if the fs
> > has sub-folio blocks, but the folio was dirtied by a full overwrite of
> > the folio? Or did I misunderstand patch 4?
> >
> 
> Yes. Both of the cases. We can either have bs == ps or we can have a
> complete overwritten folio in which case we don't allocate any iop in
> iomap_writepage_map() function.
> 
> Sure. I will document that when we factor out that change in a seperate function.
> 
> > Brian
> >
> > --- 8< ---
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 92e1e1061225..89b3053e3f2d 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
> >  }
> >
> >  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> > -				    unsigned int flags, bool is_dirty)
> > +				    unsigned int flags)
> >  {
> >  	struct iomap_page *iop = to_iomap_page(folio);
> >  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >  		spin_lock_init(&iop->state_lock);
> >  		if (folio_test_uptodate(folio))
> >  			iop_set_range(iop, 0, nr_blocks);
> > -		if (is_dirty)
> > +		if (folio_test_dirty(folio))
> >  			iop_set_range(iop, nr_blocks, nr_blocks);
> >  		folio_attach_private(folio, iop);
> >  	}
> > @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
> >  	if (WARN_ON_ONCE(size > iomap->length))
> >  		return -EIO;
> >  	if (offset > 0)
> > -		iop = iop_alloc(iter->inode, folio, iter->flags,
> > -				folio_test_dirty(folio));
> > +		iop = iop_alloc(iter->inode, folio, iter->flags);
> >  	else
> >  		iop = to_iomap_page(folio);
> >
> > @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >  		return iomap_read_inline_data(iter, folio);
> >
> >  	/* zero post-eof blocks as the page may be mapped */
> > -	iop = iop_alloc(iter->inode, folio, iter->flags,
> > -			folio_test_dirty(folio));
> > +	iop = iop_alloc(iter->inode, folio, iter->flags);
> >  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
> >  	if (plen == 0)
> >  		goto done;
> > @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> >
> >  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> >  {
> > -	struct iomap_page *iop;
> > -	struct inode *inode = mapping->host;
> > -	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
> > -
> > -	iop = iop_alloc(inode, folio, 0, false);
> > -	iop_set_range_dirty(inode, folio, 0, len);
> > -	return filemap_dirty_folio(mapping, folio);
> > +	bool dirtied = filemap_dirty_folio(mapping, folio);
> > +	if (dirtied)
> > +		iop_alloc(mapping->host, folio, 0);
> > +	return dirtied;
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> >
> > @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >  	    pos + len >= folio_pos(folio) + folio_size(folio))
> >  		return 0;
> >
> > -	iop = iop_alloc(iter->inode, folio, iter->flags,
> > -			folio_test_dirty(folio));
> > +	iop = iop_alloc(iter->inode, folio, iter->flags);
> >
> >  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
> >  		return -EAGAIN;
> > @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  		struct writeback_control *wbc, struct inode *inode,
> >  		struct folio *folio, u64 end_pos)
> >  {
> > -	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
> > +	struct iomap_page *iop = to_iomap_page(folio);
> >  	struct iomap_ioend *ioend, *next;
> >  	unsigned len = i_blocksize(inode);
> >  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> > @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	int error = 0, count = 0, i;
> >  	LIST_HEAD(submit_list);
> >
> > +	if (!iop) {
> > +		iop = iop_alloc(inode, folio, 0);
> > +		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> > +	}
> > +
> >  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
> >
> >  	/*
> 
> Thanks for the review!
> 
> -ritesh
>
Christoph Hellwig May 18, 2023, 1:23 p.m. UTC | #9
On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> But I also wonder.. if we can skip the iop alloc on full folio buffered
> overwrites, isn't that also true of mapped writes to folios that don't
> already have an iop?

Yes.

> I.e., I think what I was trying to do in the
> previous diff was actually something like this:
> 
> bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> {
>         iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
>         return filemap_dirty_folio(mapping, folio);
> }
> 
> ... which would only dirty the full iop if it already exists. Thoughts?

That does sound pretty good to me, and I can't think of any obvious
pitfall.
Christoph Hellwig May 18, 2023, 1:27 p.m. UTC | #10
On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> +static inline void iop_clear_range(struct iomap_page *iop,
> +				   unsigned int start_blk, unsigned int nr_blks)
> +{
> +	bitmap_clear(iop->state, start_blk, nr_blks);
> +}

Similar to the other trivial bitmap wrappers added earlier in the
series I don't think this one is very useful.

> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = (off >> inode->i_blkbits);
> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	if (!iop)
> +		return;
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);

All the variable initializations except for iop should really move
into a branch here.  Or we just use separate helpers for the case
where we actually have an iop and isolate all that, which would
be my preference (but goes counter to the direction of changes
earlier in the series to the existing functions).

> +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = (off >> inode->i_blkbits);
> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	if (!iop)
> +		return;
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}

Same here.
Matthew Wilcox (Oracle) May 18, 2023, 4:15 p.m. UTC | #11
On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > overwrites, isn't that also true of mapped writes to folios that don't
> > already have an iop?
> 
> Yes.

Hm, well, maybe?  If somebody stores to a page, we obviously set the
dirty flag on the folio, but depending on the architecture, we may
or may not have independent dirty bits on the PTEs (eg if it's a PMD,
we have one dirty bit for the entire folio; similarly if ARM uses the
contiguous PTE bit).  If we do have independent dirty bits, we could
dirty only the blocks corresponding to a single page at a time.

This has potential for causing some nasty bugs, so I'm inclined to
rule that if a folio is mmaped, then it's all dirty from any writable
page fault.  The fact is that applications generally do not perform
writes through mmap because the error handling story is so poor.

There may be a different answer for anonymous memory, but that doesn't
feel like my problem and shouldn't feel like any FS developer's problem.
Ritesh Harjani (IBM) May 19, 2023, 4:08 p.m. UTC | #12
Christoph Hellwig <hch@infradead.org> writes:

> On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> +static inline void iop_clear_range(struct iomap_page *iop,
>> +				   unsigned int start_blk, unsigned int nr_blks)
>> +{
>> +	bitmap_clear(iop->state, start_blk, nr_blks);
>> +}
>
> Similar to the other trivial bitmap wrappers added earlier in the
> series I don't think this one is very useful.
>
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = (off >> inode->i_blkbits);
>> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	if (!iop)
>> +		return;
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>
> All the variable initializations except for ios should really move
> into a branch here.

A branch won't be needed, I guess; I will just move the initializations
after the return.
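
Something along these lines, i.e. keep the early return and only compute
the block range once we know an iop actually exists (untested sketch;
signature assumed to match the one used elsewhere in this series):

static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
				size_t off, size_t len)
{
	struct iomap_page *iop = to_iomap_page(folio);
	unsigned int blks_per_folio, first_blk, last_blk, nr_blks;
	unsigned long flags;

	if (!iop)
		return;

	blks_per_folio = i_blocks_per_folio(inode, folio);
	first_blk = (off >> inode->i_blkbits);
	last_blk = ((off + len - 1) >> inode->i_blkbits);
	nr_blks = last_blk - first_blk + 1;

	spin_lock_irqsave(&iop->state_lock, flags);
	/* dirty bits live in the second half of the state bitmap */
	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
	spin_unlock_irqrestore(&iop->state_lock, flags);
}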

> Or we just use separate helpers for the case
> where we actually have an iops and isolate all that, which would
> be my preference (but goes counter to the direction of changes
> earlier in the series to the existing functions).
>
>> +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	struct inode *inode = folio->mapping->host;
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = (off >> inode->i_blkbits);
>> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	if (!iop)
>> +		return;
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>
> Same here.

Sure.

-ritesh
Ritesh Harjani (IBM) May 22, 2023, 4:33 a.m. UTC | #13
Matthew Wilcox <willy@infradead.org> writes:

> On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> > overwrites, isn't that also true of mapped writes to folios that don't
>> > already have an iop?
>>
>> Yes.
>
> Hm, well, maybe?  If somebody stores to a page, we obviously set the
> dirty flag on the folio, but depending on the architecture, we may
> or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> we have one dirty bit for the entire folio; similarly if ARM uses the
> contiguous PTE bit).  If we do have independent dirty bits, we could
> dirty only the blocks corresponding to a single page at a time.
>
> This has potential for causing some nasty bugs, so I'm inclined to
> rule that if a folio is mmaped, then it's all dirty from any writable
> page fault.  The fact is that applications generally do not perform
> writes through mmap because the error handling story is so poor.
>
> There may be a different answer for anonymous memory, but that doesn't
> feel like my problem and shouldn't feel like any FS developer's problem.

I am also skeptical about making the change Brian is suggesting
here, i.e. not marking all the blocks of the folio dirty when we are
going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).

However, I am sorry but I couldn't completely follow your reasoning
above. I think what Brian is suggesting here is that
filemap_dirty_folio() should be similar to the complete buffered
overwrite case, where we do not allocate the iop at ->write_begin()
time. Then at writeback time we allocate an iop and mark all blocks dirty.

In a way it is also a similar case to mmapped writes, but my only
worry is the way mmapped writes work; it makes more sense to keep the
dirty state of the folio and the per-block state within the iop in
sync. For that matter, we could even just make sure we always allocate
an iop in the complete overwrite case as well. I didn't change that
code because it was kept that way for the uptodate state as well, and
based on one of your inputs for the complete overwrite case.

Though I agree that we should ideally be allocating & marking all
blocks in the iop as dirty in the call to ->dirty_folio(), I just
wanted to understand your reasoning better.

Thanks!
-ritesh
Matthew Wilcox (Oracle) May 22, 2023, 4:48 a.m. UTC | #14
On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > already have an iop?
> >>
> >> Yes.
> >
> > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > dirty flag on the folio, but depending on the architecture, we may
> > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > we have one dirty bit for the entire folio; similarly if ARM uses the
> > contiguous PTE bit).  If we do have independent dirty bits, we could
> > dirty only the blocks corresponding to a single page at a time.
> >
> > This has potential for causing some nasty bugs, so I'm inclined to
> > rule that if a folio is mmaped, then it's all dirty from any writable
> > page fault.  The fact is that applications generally do not perform
> > writes through mmap because the error handling story is so poor.
> >
> > There may be a different answer for anonymous memory, but that doesn't
> > feel like my problem and shouldn't feel like any FS developer's problem.
> 
> Although I am skeptical too to do the changes which Brian is suggesting
> here. i.e. not making all the blocks of the folio dirty when we are
> going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> 
> However, I am sorry but I coudn't completely follow your reasoning
> above. I think what Brian is suggesting here is that
> filemap_dirty_folio() should be similar to complete buffered overwrite
> case where we do not allocate the iop at the ->write_begin() time.
> Then at the writeback time we allocate an iop and mark all blocks dirty.
> 
> In a way it is also the similar case as for mmapped writes too but my
> only worry is the way mmaped writes work and it makes more
> sense to keep the dirty state of folio and per-block within iop in sync.
> For that matter, we can even just make sure we always allocate an iop in
> the complete overwrites case as well. I didn't change that code because
> it was kept that way for uptodate state as well and based on one of your
> inputs for complete overwrite case.
> 
> Though I agree that we should ideally be allocatting & marking all
> blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> understand your reasoning better.

Think about it at a higher level than the implementation ("do we allocate
an iop or not").  If userspace dirties one page in a folio, should it
dirty all pages in the folio, or just the page that was actually dirtied?
I appreciate you're thinking about this from the point of view of 64kB
pages on PPC and using single-page folios, but pretend we've allocated a
1MB folio, mmaped it (or a part of it?)  and now userspace stores to it.
How much of it do we want to write back?

My argument is that this is RARE.  Userspace generally does not
mmap(MAP_SHARED), store to it and call msync() to do writes.  Writes are
almost always done using the write() syscall.  Userspace gets a lot more
control about when the writeback happens, and they actually get errors
back from the write() syscall.

If we attempt to track which pages have actually been dirtied, I worry
the fs and the mm will lose track of what the other needs to know.
eg the mm will make every page in the folio writable and then not notify
the fs when subsequent pages in the folio are stored to.
Brian Foster May 22, 2023, 11:18 a.m. UTC | #15
On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > already have an iop?
> >>
> >> Yes.
> >
> > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > dirty flag on the folio, but depending on the architecture, we may
> > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > we have one dirty bit for the entire folio; similarly if ARM uses the
> > contiguous PTE bit).  If we do have independent dirty bits, we could
> > dirty only the blocks corresponding to a single page at a time.
> >
> > This has potential for causing some nasty bugs, so I'm inclined to
> > rule that if a folio is mmaped, then it's all dirty from any writable
> > page fault.  The fact is that applications generally do not perform
> > writes through mmap because the error handling story is so poor.
> >
> > There may be a different answer for anonymous memory, but that doesn't
> > feel like my problem and shouldn't feel like any FS developer's problem.
> 
> Although I am skeptical too to do the changes which Brian is suggesting
> here. i.e. not making all the blocks of the folio dirty when we are
> going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> 
> However, I am sorry but I coudn't completely follow your reasoning
> above. I think what Brian is suggesting here is that
> filemap_dirty_folio() should be similar to complete buffered overwrite
> case where we do not allocate the iop at the ->write_begin() time.
> Then at the writeback time we allocate an iop and mark all blocks dirty.
> 

Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
granularity of intra-folio faults) makes sense, but I'm also not sure
what it has to do with the idea of being consistent with how full folio
overwrites are implemented (between buffered or mapped writes). We're
not changing historical dirtying granularity either way. I think this is
just a bigger picture thought for future consideration as opposed to
direct feedback on this patch..

> In a way it is also the similar case as for mmapped writes too but my
> only worry is the way mmaped writes work and it makes more
> sense to keep the dirty state of folio and per-block within iop in sync.
> For that matter, we can even just make sure we always allocate an iop in
> the complete overwrites case as well. I didn't change that code because
> it was kept that way for uptodate state as well and based on one of your
> inputs for complete overwrite case.
> 

Can you elaborate on your concerns, out of curiosity?

Either way, IMO it also seems reasonable to drop this behavior for the
basic implementation of dirty tracking (so always allocate the iop for
sub-folio tracking as you suggest above) and then potentially restore it
as a separate optimization patch at the end of the series.

That said, I'm not totally clear why it exists in the first place, so
that might warrant some investigation. Is it primarily to defer
allocations out of task write/fault contexts? To optimize the case where
pagecache is dirtied but truncated or something and thus never written
back? Is there any room for further improvement where the alloc could be
avoided completely for folio overwrites instead of just deferred? Was
that actually the case at some point and then something later decided
the iop was needed at writeback time, leading to current behavior?

Brian

> Though I agree that we should ideally be allocatting & marking all
> blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> understand your reasoning better.
> 
> Thanks!
> -ritesh
>
Darrick J. Wong May 23, 2023, 12:56 a.m. UTC | #16
On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > >> > overwrites, isn't that also true of mapped writes to folios that don't
> > >> > already have an iop?
> > >>
> > >> Yes.
> > >
> > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > > dirty flag on the folio, but depending on the architecture, we may
> > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > > we have one dirty bit for the entire folio; similarly if ARM uses the
> > > contiguous PTE bit).  If we do have independent dirty bits, we could
> > > dirty only the blocks corresponding to a single page at a time.
> > >
> > > This has potential for causing some nasty bugs, so I'm inclined to
> > > rule that if a folio is mmaped, then it's all dirty from any writable
> > > page fault.  The fact is that applications generally do not perform
> > > writes through mmap because the error handling story is so poor.
> > >
> > > There may be a different answer for anonymous memory, but that doesn't
> > > feel like my problem and shouldn't feel like any FS developer's problem.
> > 
> > Although I am skeptical too to do the changes which Brian is suggesting
> > here. i.e. not making all the blocks of the folio dirty when we are
> > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> > 
> > However, I am sorry but I coudn't completely follow your reasoning
> > above. I think what Brian is suggesting here is that
> > filemap_dirty_folio() should be similar to complete buffered overwrite
> > case where we do not allocate the iop at the ->write_begin() time.
> > Then at the writeback time we allocate an iop and mark all blocks dirty.
> > 
> 
> Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> granularity of intra-folio faults) makes sense, but I'm also not sure
> what it has to do with the idea of being consistent with how full folio
> overwrites are implemented (between buffered or mapped writes). We're
> not changing historical dirtying granularity either way. I think this is
> just a bigger picture thought for future consideration as opposed to
> direct feedback on this patch..

<nod>

> > In a way it is also the similar case as for mmapped writes too but my
> > only worry is the way mmaped writes work and it makes more
> > sense to keep the dirty state of folio and per-block within iop in sync.
> > For that matter, we can even just make sure we always allocate an iop in
> > the complete overwrites case as well. I didn't change that code because
> > it was kept that way for uptodate state as well and based on one of your
> > inputs for complete overwrite case.
> > 
> 
> Can you elaborate on your concerns, out of curiosity?
> 
> Either way, IMO it also seems reasonable to drop this behavior for the
> basic implementation of dirty tracking (so always allocate the iop for
> sub-folio tracking as you suggest above) and then potentially restore it
> as a separate optimization patch at the end of the series.

Agree.

> That said, I'm not totally clear why it exists in the first place, so
> that might warrant some investigation. Is it primarily to defer
> allocations out of task write/fault contexts?

(Assuming by 'it' you mean the behavior where we don't unconditionally
allocate iops for blocksize < foliosize...)

IIRC, reducing memory usage by eliding iop allocations unless they are
absolutely necessary for correctness was /my/ understanding of why we
don't always allocate the iop...

> To optimize the case where pagecache is dirtied but truncated or
> something and thus never written back?

...because this might very well happen.  Write a temporary .o file to
the filesystem, then delete the whole thing before writeback ever gets
its hands on the file.

> Is there any room for further improvement where the alloc could be
> avoided completely for folio overwrites instead of just deferred?

Once writeback starts, though, we need the iop so that we can know when
all the writeback for that folio is actually complete, no matter how
many IOs might be in flight for that folio.  I don't know how you'd get
around this problem.
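
(For reference, the completion side already depends on that accounting.
Roughly, as a simplified sketch paraphrasing the existing buffered-io
completion path, not the exact upstream code: each chunk of the folio
added to an ioend bumps iop->write_bytes_pending before submission, and
writeback on the folio can only be ended once all of it has completed.)

static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
		size_t len, int error)
{
	struct iomap_page *iop = to_iomap_page(folio);

	if (error) {
		folio_set_error(folio);
		mapping_set_error(inode->i_mapping, error);
	}

	/* without an iop there can only ever be one I/O for the folio */
	if (!iop || atomic_sub_return(len, &iop->write_bytes_pending) <= 0)
		folio_end_writeback(folio);
}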

> Was that actually the case at some point and then something later
> decided the iop was needed at writeback time, leading to current
> behavior?

It's been in iomap since the beginning when we lifted it from xfs.

--D (who is now weeks behind on reviewing things and stressed out)

> Brian
> 
> > Though I agree that we should ideally be allocatting & marking all
> > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> > understand your reasoning better.
> > 
> > Thanks!
> > -ritesh
> > 
>
Brian Foster May 23, 2023, 12:15 p.m. UTC | #17
On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > 
> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> > > >> > already have an iop?
> > > >>
> > > >> Yes.
> > > >
> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > > > dirty flag on the folio, but depending on the architecture, we may
> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> > > > dirty only the blocks corresponding to a single page at a time.
> > > >
> > > > This has potential for causing some nasty bugs, so I'm inclined to
> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> > > > page fault.  The fact is that applications generally do not perform
> > > > writes through mmap because the error handling story is so poor.
> > > >
> > > > There may be a different answer for anonymous memory, but that doesn't
> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> > > 
> > > Although I am skeptical too to do the changes which Brian is suggesting
> > > here. i.e. not making all the blocks of the folio dirty when we are
> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> > > 
> > > However, I am sorry but I coudn't completely follow your reasoning
> > > above. I think what Brian is suggesting here is that
> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> > > case where we do not allocate the iop at the ->write_begin() time.
> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> > > 
> > 
> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> > granularity of intra-folio faults) makes sense, but I'm also not sure
> > what it has to do with the idea of being consistent with how full folio
> > overwrites are implemented (between buffered or mapped writes). We're
> > not changing historical dirtying granularity either way. I think this is
> > just a bigger picture thought for future consideration as opposed to
> > direct feedback on this patch..
> 
> <nod>
> 
> > > In a way it is also the similar case as for mmapped writes too but my
> > > only worry is the way mmaped writes work and it makes more
> > > sense to keep the dirty state of folio and per-block within iop in sync.
> > > For that matter, we can even just make sure we always allocate an iop in
> > > the complete overwrites case as well. I didn't change that code because
> > > it was kept that way for uptodate state as well and based on one of your
> > > inputs for complete overwrite case.
> > > 
> > 
> > Can you elaborate on your concerns, out of curiosity?
> > 
> > Either way, IMO it also seems reasonable to drop this behavior for the
> > basic implementation of dirty tracking (so always allocate the iop for
> > sub-folio tracking as you suggest above) and then potentially restore it
> > as a separate optimization patch at the end of the series.
> 
> Agree.
> 
> > That said, I'm not totally clear why it exists in the first place, so
> > that might warrant some investigation. Is it primarily to defer
> > allocations out of task write/fault contexts?
> 
> (Assuming by 'it' you mean the behavior where we don't unconditionally
> allocate iops for blocksize < foliosize...)
> 
> IIRC the reason is to reduce memory usage by eliding iop allocations
> unless it's absolutely necessary for correctness was /my/ understanding
> of why we don't always allocate the iop...
> 
> > To optimize the case where pagecache is dirtied but truncated or
> > something and thus never written back?
> 
> ...because this might very well happen.  Write a temporary .o file to
> the filesystem, then delete the whole thing before writeback ever gets
> its hands on the file.
> 

I don't think a simple temp write will trigger this scenario currently
because the folios would have to be uptodate at the time of the write to
bypass the iop alloc. I guess you'd have to read folios (even if backed
by holes) first to start seeing the !iop case at writeback time (for bs
!= ps).

That could change with these patches, but I was trying to reason about
the intent of the existing code and whether there was some known reason
to continue to try and defer the iop allocation as the need/complexity
for deferring it grows with the addition of more (i.e. dirty) tracking.

> > Is there any room for further improvement where the alloc could be
> > avoided completely for folio overwrites instead of just deferred?
> 
> Once writeback starts, though, we need the iop so that we can know when
> all the writeback for that folio is actually complete, no matter how
> many IOs might be in flight for that folio.  I don't know how you'd get
> around this problem.
> 

Ok. I noticed some kind of counter or something being updated last time
I looked through that code, so it sounds like that's the reason the iop
eventually needs to exist. Thanks.

> > Was that actually the case at some point and then something later
> > decided the iop was needed at writeback time, leading to current
> > behavior?
> 
> It's been in iomap since the beginning when we lifted it from xfs.
> 

Not sure exactly what you're referring to here. iomap_writepage_map()
would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
("iomap: Permit pages without an iop to enter writeback"), so I don't
see how iop allocs were deferred (other than when bs == ps, obviously)
prior to that.

Heh, but I'm starting to get my wires crossed just trying to piece
things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
case is primarily a subtle side effect of the current writeback behavior
being driven by uptodate status. I think it's probably wise to drop it
at least initially, always alloc and dirty the appropriate iop ranges
for sub-folio blocks, and then if you or others think there is value in
the overwrite optimization to defer iop allocs, tack that on as a
separate patch and try to be consistent between buffered and mapped
writes.
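
(I.e. keep ->dirty_folio() doing what v5 already does, paraphrased
below with the iop_alloc()/iop_set_range_dirty() signatures from this
series: always allocate the iop and dirty the whole folio range, rather
than the deferred variant discussed earlier in the thread.)

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
	struct inode *inode = mapping->host;
	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;

	/* allocate the iop up front and mark every block dirty */
	iop_alloc(inode, folio, 0, false);
	iop_set_range_dirty(inode, folio, 0, len);
	return filemap_dirty_folio(mapping, folio);
}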

Darrick noted above that he also agrees with that separate patch
approach. For me, I think it would also be useful to show that there is
some measurable performance benefit on at least one reasonable workload
to help justify it.

Brian

> --D (who is now weeks behind on reviewing things and stressed out)
> 
> > Brian
> > 
> > > Though I agree that we should ideally be allocatting & marking all
> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> > > understand your reasoning better.
> > > 
> > > Thanks!
> > > -ritesh
> > > 
> > 
>
Ritesh Harjani (IBM) May 23, 2023, 1:43 p.m. UTC | #18
Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> > > Matthew Wilcox <willy@infradead.org> writes:
>> > >
>> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> > > >> > already have an iop?
>> > > >>
>> > > >> Yes.
>> > > >
>> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> > > > dirty flag on the folio, but depending on the architecture, we may
>> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> > > > dirty only the blocks corresponding to a single page at a time.
>> > > >
>> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> > > > page fault.  The fact is that applications generally do not perform
>> > > > writes through mmap because the error handling story is so poor.
>> > > >
>> > > > There may be a different answer for anonymous memory, but that doesn't
>> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> > >
>> > > Although I am skeptical too to do the changes which Brian is suggesting
>> > > here. i.e. not making all the blocks of the folio dirty when we are
>> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
>> > >
>> > > However, I am sorry but I coudn't completely follow your reasoning
>> > > above. I think what Brian is suggesting here is that
>> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> > > case where we do not allocate the iop at the ->write_begin() time.
>> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> > >
>> >
>> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> > what it has to do with the idea of being consistent with how full folio
>> > overwrites are implemented (between buffered or mapped writes). We're
>> > not changing historical dirtying granularity either way. I think this is
>> > just a bigger picture thought for future consideration as opposed to
>> > direct feedback on this patch..
>>
>> <nod>
>>
>> > > In a way it is also the similar case as for mmapped writes too but my
>> > > only worry is the way mmaped writes work and it makes more
>> > > sense to keep the dirty state of folio and per-block within iop in sync.
>> > > For that matter, we can even just make sure we always allocate an iop in
>> > > the complete overwrites case as well. I didn't change that code because
>> > > it was kept that way for uptodate state as well and based on one of your
>> > > inputs for complete overwrite case.
>> > >
>> >
>> > Can you elaborate on your concerns, out of curiosity?
>> >
>> > Either way, IMO it also seems reasonable to drop this behavior for the
>> > basic implementation of dirty tracking (so always allocate the iop for
>> > sub-folio tracking as you suggest above) and then potentially restore it
>> > as a separate optimization patch at the end of the series.
>>
>> Agree.
>>
>> > That said, I'm not totally clear why it exists in the first place, so
>> > that might warrant some investigation. Is it primarily to defer
>> > allocations out of task write/fault contexts?
>>
>> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> allocate iops for blocksize < foliosize...)
>>
>> IIRC the reason is to reduce memory usage by eliding iop allocations
>> unless it's absolutely necessary for correctness was /my/ understanding
>> of why we don't always allocate the iop...
>>
>> > To optimize the case where pagecache is dirtied but truncated or
>> > something and thus never written back?
>>
>> ...because this might very well happen.  Write a temporary .o file to
>> the filesystem, then delete the whole thing before writeback ever gets
>> its hands on the file.
>>
>
> I don't think a simple temp write will trigger this scenario currently
> because the folios would have to be uptodate at the time of the write to
> bypass the iop alloc. I guess you'd have to read folios (even if backed
> by holes) first to start seeing the !iop case at writeback time (for bs
> != ps).
>
> That could change with these patches, but I was trying to reason about
> the intent of the existing code and whether there was some known reason
> to continue to try and defer the iop allocation as the need/complexity
> for deferring it grows with the addition of more (i.e. dirty) tracking.
>

Here is the first discussion/reasoning where the deferred iop
allocation in the readpage path got discussed [1].
And here is the discussion where I first pointed out the deferred
allocation in the writepage path. IMO it slipped in with discussion
only on the mailing list, with nothing in the commit messages or
comments [2].

[1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
[2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/

>> > Is there any room for further improvement where the alloc could be
>> > avoided completely for folio overwrites instead of just deferred?
>>
>> Once writeback starts, though, we need the iop so that we can know when
>> all the writeback for that folio is actually complete, no matter how
>> many IOs might be in flight for that folio.  I don't know how you'd get
>> around this problem.
>>
>
> Ok. I noticed some kind of counter or something being updated last time
> I looked through that code, so it sounds like that's the reason the iop
> eventually needs to exist. Thanks.
>
>> > Was that actually the case at some point and then something later
>> > decided the iop was needed at writeback time, leading to current
>> > behavior?
>>
>> It's been in iomap since the beginning when we lifted it from xfs.
>>
>
> Not sure exactly what you're referring to here. iomap_writepage_map()
> would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> ("iomap: Permit pages without an iop to enter writeback"), so I don't
> see how iop allocs were deferred (other than when bs == ps, obviously)
> prior to that.
>
> Heh, but I'm starting to get my wires crossed just trying to piece
> things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> case is primarily a subtle side effect of the current writeback behavior
> being driven by uptodate status. I think it's probably wise to drop it
> at least initially, always alloc and dirty the appropriate iop ranges
> for sub-folio blocks, and then if you or others think there is value in
> the overwrite optimization to defer iop allocs, tack that on as a
> separate patch and try to be consistent between buffered and mapped
> writes.

Based on the discussion so far, I would like to think of this as follows:
We already have some sort of lazy iop allocation in the buffered I/O
path (discussed above). This patch series does not change that
behavior. For now I would like to keep the page mkwrite path as is,
without any lazy iop allocation optimization.
I am ok with picking this optimization work up as a separate series
because, IIUC, Christoph has some ideas on deferring iop allocations
even further [2] (from the link shared above).

Does that sound good?

>
> Darrick noted above that he also agrees with that separate patch
> approach. For me, I think it would also be useful to show that there is
> some measurable performance benefit on at least one reasonable workload
> to help justify it.

Agree that when we work on such optimizations as a separate series, it
will be worth measuring the performance benefits of that.


-ritesh

>
> Brian
>
>> --D (who is now weeks behind on reviewing things and stressed out)
>>
>> > Brian
>> >
>> > > Though I agree that we should ideally be allocatting & marking all
>> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> > > understand your reasoning better.
>> > >
>> > > Thanks!
>> > > -ritesh
>> > >
>> >
>>
Brian Foster May 23, 2023, 2:44 p.m. UTC | #19
On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> > >
> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > > >> > already have an iop?
> >> > > >>
> >> > > >> Yes.
> >> > > >
> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> > > >
> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> > > > page fault.  The fact is that applications generally do not perform
> >> > > > writes through mmap because the error handling story is so poor.
> >> > > >
> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> > >
> >> > > Although I am skeptical too to do the changes which Brian is suggesting
> >> > > here. i.e. not making all the blocks of the folio dirty when we are
> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> >> > >
> >> > > However, I am sorry but I coudn't completely follow your reasoning
> >> > > above. I think what Brian is suggesting here is that
> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> >> > > case where we do not allocate the iop at the ->write_begin() time.
> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> >> > >
> >> >
> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> > what it has to do with the idea of being consistent with how full folio
> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> > not changing historical dirtying granularity either way. I think this is
> >> > just a bigger picture thought for future consideration as opposed to
> >> > direct feedback on this patch..
> >>
> >> <nod>
> >>
> >> > > In a way it is also the similar case as for mmapped writes too but my
> >> > > only worry is the way mmaped writes work and it makes more
> >> > > sense to keep the dirty state of folio and per-block within iop in sync.
> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> > > the complete overwrites case as well. I didn't change that code because
> >> > > it was kept that way for uptodate state as well and based on one of your
> >> > > inputs for complete overwrite case.
> >> > >
> >> >
> >> > Can you elaborate on your concerns, out of curiosity?
> >> >
> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> > as a separate optimization patch at the end of the series.
> >>
> >> Agree.
> >>
> >> > That said, I'm not totally clear why it exists in the first place, so
> >> > that might warrant some investigation. Is it primarily to defer
> >> > allocations out of task write/fault contexts?
> >>
> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> allocate iops for blocksize < foliosize...)
> >>
> >> IIRC the reason is to reduce memory usage by eliding iop allocations
> >> unless it's absolutely necessary for correctness was /my/ understanding
> >> of why we don't always allocate the iop...
> >>
> >> > To optimize the case where pagecache is dirtied but truncated or
> >> > something and thus never written back?
> >>
> >> ...because this might very well happen.  Write a temporary .o file to
> >> the filesystem, then delete the whole thing before writeback ever gets
> >> its hands on the file.
> >>
> >
> > I don't think a simple temp write will trigger this scenario currently
> > because the folios would have to be uptodate at the time of the write to
> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> > by holes) first to start seeing the !iop case at writeback time (for bs
> > != ps).
> >
> > That could change with these patches, but I was trying to reason about
> > the intent of the existing code and whether there was some known reason
> > to continue to try and defer the iop allocation as the need/complexity
> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >
> 
> Here is the 1st discussion/reasoning where the deferred iop allocation
> in the readpage path got discussed [1].
> And here is the discussion when I first pointed out the deferred
> allocation in writepage path. IMO, it got slipped in with the
> discussions maybe only on mailing list but nothing in the commit
> messages or comments.[2]
> 
> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> 
> >> > Is there any room for further improvement where the alloc could be
> >> > avoided completely for folio overwrites instead of just deferred?
> >>
> >> Once writeback starts, though, we need the iop so that we can know when
> >> all the writeback for that folio is actually complete, no matter how
> >> many IOs might be in flight for that folio.  I don't know how you'd get
> >> around this problem.
> >>
> >
> > Ok. I noticed some kind of counter or something being updated last time
> > I looked through that code, so it sounds like that's the reason the iop
> > eventually needs to exist. Thanks.
> >
> >> > Was that actually the case at some point and then something later
> >> > decided the iop was needed at writeback time, leading to current
> >> > behavior?
> >>
> >> It's been in iomap since the beginning when we lifted it from xfs.
> >>
> >
> > Not sure exactly what you're referring to here. iomap_writepage_map()
> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> > see how iop allocs were deferred (other than when bs == ps, obviously)
> > prior to that.
> >
> > Heh, but I'm starting to get my wires crossed just trying to piece
> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> > case is primarily a subtle side effect of the current writeback behavior
> > being driven by uptodate status. I think it's probably wise to drop it
> > at least initially, always alloc and dirty the appropriate iop ranges
> > for sub-folio blocks, and then if you or others think there is value in
> > the overwrite optimization to defer iop allocs, tack that on as a
> > separate patch and try to be consistent between buffered and mapped
> > writes.
> 
> Based on the discussion so far, I would like to think of this as follow:
> We already have some sort of lazy iop allocation in the buffered I/O
> path (discussed above). This patch series does not changes that
> behavior. For now I would like to keep the page mkwrite page as is
> without any lazy iop allocation optimization.
> I am ok to pick this optimization work as a seperate series
> because, IIUC, Christoph has some ideas on deferring iop allocations
> even further [2] (from link shared above).
> 
> Does that sound good?
> 

Could you do that in two steps where the buffered I/O path variant is
replaced by explicit dirty tracking in the initial patch, and then is
restored by a subsequent patch in the same series? That would allow
keeping it around and documenting it explicitly in the commit log for
the separate patch, but IMO makes this a bit easier to review (and
potentially debug/bisect if needed down the road).

But I don't insist if that's too troublesome for some reason...

Brian

> >
> > Darrick noted above that he also agrees with that separate patch
> > approach. For me, I think it would also be useful to show that there is
> > some measurable performance benefit on at least one reasonable workload
> > to help justify it.
> 
> Agree that when we work on such optimizations as a seperate series, it
> will be worth measuring the performance benefits of that.
> 
> 
> -ritesh
> 
> >
> > Brian
> >
> >> --D (who is now weeks behind on reviewing things and stressed out)
> >>
> >> > Brian
> >> >
> >> > > Though I agree that we should ideally be allocatting & marking all
> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> > > understand your reasoning better.
> >> > >
> >> > > Thanks!
> >> > > -ritesh
> >> > >
> >> >
> >>
>
Ritesh Harjani (IBM) May 23, 2023, 3:02 p.m. UTC | #20
Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>> 
>> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> >> > > Matthew Wilcox <willy@infradead.org> writes:
>> >> > >
>> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> >> > > >> > already have an iop?
>> >> > > >>
>> >> > > >> Yes.
>> >> > > >
>> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> >> > > > dirty flag on the folio, but depending on the architecture, we may
>> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> >> > > > dirty only the blocks corresponding to a single page at a time.
>> >> > > >
>> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> >> > > > page fault.  The fact is that applications generally do not perform
>> >> > > > writes through mmap because the error handling story is so poor.
>> >> > > >
>> >> > > > There may be a different answer for anonymous memory, but that doesn't
>> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> >> > >
>> >> > > Although I am skeptical too to do the changes which Brian is suggesting
>> >> > > here. i.e. not making all the blocks of the folio dirty when we are
>> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
>> >> > >
>> >> > > However, I am sorry but I coudn't completely follow your reasoning
>> >> > > above. I think what Brian is suggesting here is that
>> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> >> > > case where we do not allocate the iop at the ->write_begin() time.
>> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> >> > >
>> >> >
>> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> >> > what it has to do with the idea of being consistent with how full folio
>> >> > overwrites are implemented (between buffered or mapped writes). We're
>> >> > not changing historical dirtying granularity either way. I think this is
>> >> > just a bigger picture thought for future consideration as opposed to
>> >> > direct feedback on this patch..
>> >>
>> >> <nod>
>> >>
>> >> > > In a way it is also the similar case as for mmapped writes too but my
>> >> > > only worry is the way mmaped writes work and it makes more
>> >> > > sense to keep the dirty state of folio and per-block within iop in sync.
>> >> > > For that matter, we can even just make sure we always allocate an iop in
>> >> > > the complete overwrites case as well. I didn't change that code because
>> >> > > it was kept that way for uptodate state as well and based on one of your
>> >> > > inputs for complete overwrite case.
>> >> > >
>> >> >
>> >> > Can you elaborate on your concerns, out of curiosity?
>> >> >
>> >> > Either way, IMO it also seems reasonable to drop this behavior for the
>> >> > basic implementation of dirty tracking (so always allocate the iop for
>> >> > sub-folio tracking as you suggest above) and then potentially restore it
>> >> > as a separate optimization patch at the end of the series.
>> >>
>> >> Agree.
>> >>
>> >> > That said, I'm not totally clear why it exists in the first place, so
>> >> > that might warrant some investigation. Is it primarily to defer
>> >> > allocations out of task write/fault contexts?
>> >>
>> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> >> allocate iops for blocksize < foliosize...)
>> >>
>> >> IIRC the reason is to reduce memory usage by eliding iop allocations
>> >> unless it's absolutely necessary for correctness was /my/ understanding
>> >> of why we don't always allocate the iop...
>> >>
>> >> > To optimize the case where pagecache is dirtied but truncated or
>> >> > something and thus never written back?
>> >>
>> >> ...because this might very well happen.  Write a temporary .o file to
>> >> the filesystem, then delete the whole thing before writeback ever gets
>> >> its hands on the file.
>> >>
>> >
>> > I don't think a simple temp write will trigger this scenario currently
>> > because the folios would have to be uptodate at the time of the write to
>> > bypass the iop alloc. I guess you'd have to read folios (even if backed
>> > by holes) first to start seeing the !iop case at writeback time (for bs
>> > != ps).
>> >
>> > That could change with these patches, but I was trying to reason about
>> > the intent of the existing code and whether there was some known reason
>> > to continue to try and defer the iop allocation as the need/complexity
>> > for deferring it grows with the addition of more (i.e. dirty) tracking.
>> >
>> 
>> Here is the 1st discussion/reasoning where the deferred iop allocation
>> in the readpage path got discussed [1].
>> And here is the discussion when I first pointed out the deferred
>> allocation in writepage path. IMO, it got slipped in with the
>> discussions maybe only on mailing list but nothing in the commit
>> messages or comments.[2]
>> 
>> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
>> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
>> 
>> >> > Is there any room for further improvement where the alloc could be
>> >> > avoided completely for folio overwrites instead of just deferred?
>> >>
>> >> Once writeback starts, though, we need the iop so that we can know when
>> >> all the writeback for that folio is actually complete, no matter how
>> >> many IOs might be in flight for that folio.  I don't know how you'd get
>> >> around this problem.
>> >>
>> >
>> > Ok. I noticed some kind of counter or something being updated last time
>> > I looked through that code, so it sounds like that's the reason the iop
>> > eventually needs to exist. Thanks.
>> >
>> >> > Was that actually the case at some point and then something later
>> >> > decided the iop was needed at writeback time, leading to current
>> >> > behavior?
>> >>
>> >> It's been in iomap since the beginning when we lifted it from xfs.
>> >>
>> >
>> > Not sure exactly what you're referring to here. iomap_writepage_map()
>> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
>> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
>> > see how iop allocs were deferred (other than when bs == ps, obviously)
>> > prior to that.
>> >
>> > Heh, but I'm starting to get my wires crossed just trying to piece
>> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
>> > case is primarily a subtle side effect of the current writeback behavior
>> > being driven by uptodate status. I think it's probably wise to drop it
>> > at least initially, always alloc and dirty the appropriate iop ranges
>> > for sub-folio blocks, and then if you or others think there is value in
>> > the overwrite optimization to defer iop allocs, tack that on as a
>> > separate patch and try to be consistent between buffered and mapped
>> > writes.
>> 
>> Based on the discussion so far, I would like to think of this as follow:
>> We already have some sort of lazy iop allocation in the buffered I/O
>> path (discussed above). This patch series does not changes that
>> behavior. For now I would like to keep the page mkwrite page as is
>> without any lazy iop allocation optimization.
>> I am ok to pick this optimization work as a seperate series
>> because, IIUC, Christoph has some ideas on deferring iop allocations
>> even further [2] (from link shared above).
>> 
>> Does that sound good?
>> 
>
> Could you do that in two steps where the buffered I/O path variant is
> replaced by explicit dirty tracking in the initial patch, and then is
> restored by a subsequent patch in the same series? That would allow

Sorry, I couldn't follow it. Can you please elaborate?

So, what I was suggesting was - for buffered I/O path we should continue
to have the lazy iop allocation scheme whereever possible.
Rest of the optimizations of further deferring the iop allocation
including in the ->mkwrite path should be dealt seperately in a later
patch series.

Also, we already have a separate patch in this series which defers the
iop allocation if the write completely overwrites the folio [1].
Earlier, the behavior was to skip the allocation entirely if the folio was
uptodate, but since we require the iop for per-block dirty tracking, we now
defer the allocation only when the write is a complete overwrite of the folio.

[1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
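
For context, the check in question sits at the top of __iomap_write_begin().
A rough sketch of that path as of this series is below; the IOMAP_UNSHARE
guard on the first line is an assumption on my part and may not match the
RFC exactly, so treat this as an illustration rather than the literal patch:

	/*
	 * If the write completely overwrites the folio, every block will be
	 * dirtied anyway, so defer allocating the iop here; writeback will
	 * allocate one later with all blocks marked dirty.
	 */
	if (!(iter->flags & IOMAP_UNSHARE) && pos >= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;

	iop = iop_alloc(iter->inode, folio, iter->flags,
			folio_test_dirty(folio));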


-ritesh

> keeping it around and documenting it explicitly in the commit log for
> the separate patch, but IMO makes this a bit easier to review (and
> potentially debug/bisect if needed down the road).
>
> But I don't insist if that's too troublesome for some reason...
>
> Brian
>
>> >
>> > Darrick noted above that he also agrees with that separate patch
>> > approach. For me, I think it would also be useful to show that there is
>> > some measurable performance benefit on at least one reasonable workload
>> > to help justify it.
>> 
>> Agree that when we work on such optimizations as a seperate series, it
>> will be worth measuring the performance benefits of that.
>> 
>> 
>> -ritesh
>> 
>> >
>> > Brian
>> >
>> >> --D (who is now weeks behind on reviewing things and stressed out)
>> >>
>> >> > Brian
>> >> >
>> >> > > Though I agree that we should ideally be allocatting & marking all
>> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> >> > > understand your reasoning better.
>> >> > >
>> >> > > Thanks!
>> >> > > -ritesh
>> >> > >
>> >> >
>> >>
>>
Brian Foster May 23, 2023, 3:22 p.m. UTC | #21
On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >> 
> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> >> > >
> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> >> > > >> > already have an iop?
> >> >> > > >>
> >> >> > > >> Yes.
> >> >> > > >
> >> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> >> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> >> > > >
> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> >> > > > page fault.  The fact is that applications generally do not perform
> >> >> > > > writes through mmap because the error handling story is so poor.
> >> >> > > >
> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> >> > >
> >> >> > > Although I am skeptical too to do the changes which Brian is suggesting
> >> >> > > here. i.e. not making all the blocks of the folio dirty when we are
> >> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> >> >> > >
> >> >> > > However, I am sorry but I coudn't completely follow your reasoning
> >> >> > > above. I think what Brian is suggesting here is that
> >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> >> >> > > case where we do not allocate the iop at the ->write_begin() time.
> >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> >> >> > >
> >> >> >
> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> >> > what it has to do with the idea of being consistent with how full folio
> >> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> >> > not changing historical dirtying granularity either way. I think this is
> >> >> > just a bigger picture thought for future consideration as opposed to
> >> >> > direct feedback on this patch..
> >> >>
> >> >> <nod>
> >> >>
> >> >> > > In a way it is also the similar case as for mmapped writes too but my
> >> >> > > only worry is the way mmaped writes work and it makes more
> >> >> > > sense to keep the dirty state of folio and per-block within iop in sync.
> >> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> >> > > the complete overwrites case as well. I didn't change that code because
> >> >> > > it was kept that way for uptodate state as well and based on one of your
> >> >> > > inputs for complete overwrite case.
> >> >> > >
> >> >> >
> >> >> > Can you elaborate on your concerns, out of curiosity?
> >> >> >
> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> >> > as a separate optimization patch at the end of the series.
> >> >>
> >> >> Agree.
> >> >>
> >> >> > That said, I'm not totally clear why it exists in the first place, so
> >> >> > that might warrant some investigation. Is it primarily to defer
> >> >> > allocations out of task write/fault contexts?
> >> >>
> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> >> allocate iops for blocksize < foliosize...)
> >> >>
> >> >> IIRC the reason is to reduce memory usage by eliding iop allocations
> >> >> unless it's absolutely necessary for correctness was /my/ understanding
> >> >> of why we don't always allocate the iop...
> >> >>
> >> >> > To optimize the case where pagecache is dirtied but truncated or
> >> >> > something and thus never written back?
> >> >>
> >> >> ...because this might very well happen.  Write a temporary .o file to
> >> >> the filesystem, then delete the whole thing before writeback ever gets
> >> >> its hands on the file.
> >> >>
> >> >
> >> > I don't think a simple temp write will trigger this scenario currently
> >> > because the folios would have to be uptodate at the time of the write to
> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> >> > by holes) first to start seeing the !iop case at writeback time (for bs
> >> > != ps).
> >> >
> >> > That could change with these patches, but I was trying to reason about
> >> > the intent of the existing code and whether there was some known reason
> >> > to continue to try and defer the iop allocation as the need/complexity
> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >> >
> >> 
> >> Here is the 1st discussion/reasoning where the deferred iop allocation
> >> in the readpage path got discussed [1].
> >> And here is the discussion when I first pointed out the deferred
> >> allocation in writepage path. IMO, it got slipped in with the
> >> discussions maybe only on mailing list but nothing in the commit
> >> messages or comments.[2]
> >> 
> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> >> 
> >> >> > Is there any room for further improvement where the alloc could be
> >> >> > avoided completely for folio overwrites instead of just deferred?
> >> >>
> >> >> Once writeback starts, though, we need the iop so that we can know when
> >> >> all the writeback for that folio is actually complete, no matter how
> >> >> many IOs might be in flight for that folio.  I don't know how you'd get
> >> >> around this problem.
> >> >>
> >> >
> >> > Ok. I noticed some kind of counter or something being updated last time
> >> > I looked through that code, so it sounds like that's the reason the iop
> >> > eventually needs to exist. Thanks.
> >> >
> >> >> > Was that actually the case at some point and then something later
> >> >> > decided the iop was needed at writeback time, leading to current
> >> >> > behavior?
> >> >>
> >> >> It's been in iomap since the beginning when we lifted it from xfs.
> >> >>
> >> >
> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
> >> > prior to that.
> >> >
> >> > Heh, but I'm starting to get my wires crossed just trying to piece
> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> >> > case is primarily a subtle side effect of the current writeback behavior
> >> > being driven by uptodate status. I think it's probably wise to drop it
> >> > at least initially, always alloc and dirty the appropriate iop ranges
> >> > for sub-folio blocks, and then if you or others think there is value in
> >> > the overwrite optimization to defer iop allocs, tack that on as a
> >> > separate patch and try to be consistent between buffered and mapped
> >> > writes.
> >> 
> >> Based on the discussion so far, I would like to think of this as follow:
> >> We already have some sort of lazy iop allocation in the buffered I/O
> >> path (discussed above). This patch series does not changes that
> >> behavior. For now I would like to keep the page mkwrite page as is
> >> without any lazy iop allocation optimization.
> >> I am ok to pick this optimization work as a seperate series
> >> because, IIUC, Christoph has some ideas on deferring iop allocations
> >> even further [2] (from link shared above).
> >> 
> >> Does that sound good?
> >> 
> >
> > Could you do that in two steps where the buffered I/O path variant is
> > replaced by explicit dirty tracking in the initial patch, and then is
> > restored by a subsequent patch in the same series? That would allow
> 
> Sorry, I couldn't follow it. Can you please elaborate.
> 

Sorry for the confusion...

> So, what I was suggesting was - for buffered I/O path we should continue
> to have the lazy iop allocation scheme whereever possible.
> Rest of the optimizations of further deferring the iop allocation
> including in the ->mkwrite path should be dealt seperately in a later
> patch series.
> 

Yup, agree.

> Also we already have a seperate patch in this series which defers the
> iop allocation if the write completely overwrites the folio [1].
> Earlier the behavior was that it skips it entirely if the folio was
> uptodate, but since we require it for per-block dirty tracking, we
> defer iop allocation only if it was a complete overwrite of the folio. 
> 

That is a preparatory patch before iop dirty tracking is enabled in patch
5, right? I was mainly just suggesting that the overwrite-checking
part of this patch come after dirty tracking is enabled (as a small
optimization patch) rather than before.

I don't want to harp on it if that's difficult or still doesn't make
sense for some reason. I'll take a closer look the next go-around when I
have a bit more time and just send a diff if it seems it can be done
cleanly...

Brian

> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
> 
> 
> -ritesh
> 
> > keeping it around and documenting it explicitly in the commit log for
> > the separate patch, but IMO makes this a bit easier to review (and
> > potentially debug/bisect if needed down the road).
> >
> > But I don't insist if that's too troublesome for some reason...
> >
> > Brian
> >
> >> >
> >> > Darrick noted above that he also agrees with that separate patch
> >> > approach. For me, I think it would also be useful to show that there is
> >> > some measurable performance benefit on at least one reasonable workload
> >> > to help justify it.
> >> 
> >> Agree that when we work on such optimizations as a seperate series, it
> >> will be worth measuring the performance benefits of that.
> >> 
> >> 
> >> -ritesh
> >> 
> >> >
> >> > Brian
> >> >
> >> >> --D (who is now weeks behind on reviewing things and stressed out)
> >> >>
> >> >> > Brian
> >> >> >
> >> >> > > Though I agree that we should ideally be allocatting & marking all
> >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> >> > > understand your reasoning better.
> >> >> > >
> >> >> > > Thanks!
> >> >> > > -ritesh
> >> >> > >
> >> >> >
> >> >>
> >> 
>
Ritesh Harjani (IBM) May 23, 2023, 3:38 p.m. UTC | #22
Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>>
>> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
>> >> Brian Foster <bfoster@redhat.com> writes:
>> >>
>> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
>> >> >> > >
>> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> >> >> > > >> > already have an iop?
>> >> >> > > >>
>> >> >> > > >> Yes.
>> >> >> > > >
>> >> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
>> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> >> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> >> >> > > > dirty only the blocks corresponding to a single page at a time.
>> >> >> > > >
>> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> >> >> > > > page fault.  The fact is that applications generally do not perform
>> >> >> > > > writes through mmap because the error handling story is so poor.
>> >> >> > > >
>> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
>> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> >> >> > >
>> >> >> > > Although I am skeptical too to do the changes which Brian is suggesting
>> >> >> > > here. i.e. not making all the blocks of the folio dirty when we are
>> >> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
>> >> >> > >
>> >> >> > > However, I am sorry but I coudn't completely follow your reasoning
>> >> >> > > above. I think what Brian is suggesting here is that
>> >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> >> >> > > case where we do not allocate the iop at the ->write_begin() time.
>> >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> >> >> > >
>> >> >> >
>> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> >> >> > what it has to do with the idea of being consistent with how full folio
>> >> >> > overwrites are implemented (between buffered or mapped writes). We're
>> >> >> > not changing historical dirtying granularity either way. I think this is
>> >> >> > just a bigger picture thought for future consideration as opposed to
>> >> >> > direct feedback on this patch..
>> >> >>
>> >> >> <nod>
>> >> >>
>> >> >> > > In a way it is also the similar case as for mmapped writes too but my
>> >> >> > > only worry is the way mmaped writes work and it makes more
>> >> >> > > sense to keep the dirty state of folio and per-block within iop in sync.
>> >> >> > > For that matter, we can even just make sure we always allocate an iop in
>> >> >> > > the complete overwrites case as well. I didn't change that code because
>> >> >> > > it was kept that way for uptodate state as well and based on one of your
>> >> >> > > inputs for complete overwrite case.
>> >> >> > >
>> >> >> >
>> >> >> > Can you elaborate on your concerns, out of curiosity?
>> >> >> >
>> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
>> >> >> > basic implementation of dirty tracking (so always allocate the iop for
>> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
>> >> >> > as a separate optimization patch at the end of the series.
>> >> >>
>> >> >> Agree.
>> >> >>
>> >> >> > That said, I'm not totally clear why it exists in the first place, so
>> >> >> > that might warrant some investigation. Is it primarily to defer
>> >> >> > allocations out of task write/fault contexts?
>> >> >>
>> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> >> >> allocate iops for blocksize < foliosize...)
>> >> >>
>> >> >> IIRC the reason is to reduce memory usage by eliding iop allocations
>> >> >> unless it's absolutely necessary for correctness was /my/ understanding
>> >> >> of why we don't always allocate the iop...
>> >> >>
>> >> >> > To optimize the case where pagecache is dirtied but truncated or
>> >> >> > something and thus never written back?
>> >> >>
>> >> >> ...because this might very well happen.  Write a temporary .o file to
>> >> >> the filesystem, then delete the whole thing before writeback ever gets
>> >> >> its hands on the file.
>> >> >>
>> >> >
>> >> > I don't think a simple temp write will trigger this scenario currently
>> >> > because the folios would have to be uptodate at the time of the write to
>> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
>> >> > by holes) first to start seeing the !iop case at writeback time (for bs
>> >> > != ps).
>> >> >
>> >> > That could change with these patches, but I was trying to reason about
>> >> > the intent of the existing code and whether there was some known reason
>> >> > to continue to try and defer the iop allocation as the need/complexity
>> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
>> >> >
>> >>
>> >> Here is the 1st discussion/reasoning where the deferred iop allocation
>> >> in the readpage path got discussed [1].
>> >> And here is the discussion when I first pointed out the deferred
>> >> allocation in writepage path. IMO, it got slipped in with the
>> >> discussions maybe only on mailing list but nothing in the commit
>> >> messages or comments.[2]
>> >>
>> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
>> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
>> >>
>> >> >> > Is there any room for further improvement where the alloc could be
>> >> >> > avoided completely for folio overwrites instead of just deferred?
>> >> >>
>> >> >> Once writeback starts, though, we need the iop so that we can know when
>> >> >> all the writeback for that folio is actually complete, no matter how
>> >> >> many IOs might be in flight for that folio.  I don't know how you'd get
>> >> >> around this problem.
>> >> >>
>> >> >
>> >> > Ok. I noticed some kind of counter or something being updated last time
>> >> > I looked through that code, so it sounds like that's the reason the iop
>> >> > eventually needs to exist. Thanks.
>> >> >
>> >> >> > Was that actually the case at some point and then something later
>> >> >> > decided the iop was needed at writeback time, leading to current
>> >> >> > behavior?
>> >> >>
>> >> >> It's been in iomap since the beginning when we lifted it from xfs.
>> >> >>
>> >> >
>> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
>> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
>> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
>> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
>> >> > prior to that.
>> >> >
>> >> > Heh, but I'm starting to get my wires crossed just trying to piece
>> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
>> >> > case is primarily a subtle side effect of the current writeback behavior
>> >> > being driven by uptodate status. I think it's probably wise to drop it
>> >> > at least initially, always alloc and dirty the appropriate iop ranges
>> >> > for sub-folio blocks, and then if you or others think there is value in
>> >> > the overwrite optimization to defer iop allocs, tack that on as a
>> >> > separate patch and try to be consistent between buffered and mapped
>> >> > writes.
>> >>
>> >> Based on the discussion so far, I would like to think of this as follow:
>> >> We already have some sort of lazy iop allocation in the buffered I/O
>> >> path (discussed above). This patch series does not changes that
>> >> behavior. For now I would like to keep the page mkwrite page as is
>> >> without any lazy iop allocation optimization.
>> >> I am ok to pick this optimization work as a seperate series
>> >> because, IIUC, Christoph has some ideas on deferring iop allocations
>> >> even further [2] (from link shared above).
>> >>
>> >> Does that sound good?
>> >>
>> >
>> > Could you do that in two steps where the buffered I/O path variant is
>> > replaced by explicit dirty tracking in the initial patch, and then is
>> > restored by a subsequent patch in the same series? That would allow
>>
>> Sorry, I couldn't follow it. Can you please elaborate.
>>
>
> Sorry for the confusion...
>
>> So, what I was suggesting was - for buffered I/O path we should continue
>> to have the lazy iop allocation scheme whereever possible.
>> Rest of the optimizations of further deferring the iop allocation
>> including in the ->mkwrite path should be dealt seperately in a later
>> patch series.
>>
>
> Yup, agree.
>
>> Also we already have a seperate patch in this series which defers the
>> iop allocation if the write completely overwrites the folio [1].
>> Earlier the behavior was that it skips it entirely if the folio was
>> uptodate, but since we require it for per-block dirty tracking, we
>> defer iop allocation only if it was a complete overwrite of the folio.
>>
>
> That is a prepatory patch before iop dirty tracking is enabled in patch
> 5, right?

Yes, right.


> I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
>
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look the next go around when I
> have a bit more time and just send a diff if it seems it can be done
> cleanly..

It should not be difficult. Will make the change in the next rev.

Thanks for the help. Seems a lot of things have been sorted out now :)

I will work on the review comments, finish off a few of the
remaining TODOs, get more testing done, and then spin out the next
revision.


-ritesh

>
> Brian
>
>> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
>>
>>
>> -ritesh
>>
>> > keeping it around and documenting it explicitly in the commit log for
>> > the separate patch, but IMO makes this a bit easier to review (and
>> > potentially debug/bisect if needed down the road).
>> >
>> > But I don't insist if that's too troublesome for some reason...
>> >
>> > Brian
>> >
>> >> >
>> >> > Darrick noted above that he also agrees with that separate patch
>> >> > approach. For me, I think it would also be useful to show that there is
>> >> > some measurable performance benefit on at least one reasonable workload
>> >> > to help justify it.
>> >>
>> >> Agree that when we work on such optimizations as a seperate series, it
>> >> will be worth measuring the performance benefits of that.
>> >>
>> >>
>> >> -ritesh
>> >>
>> >> >
>> >> > Brian
>> >> >
>> >> >> --D (who is now weeks behind on reviewing things and stressed out)
>> >> >>
>> >> >> > Brian
>> >> >> >
>> >> >> > > Though I agree that we should ideally be allocatting & marking all
>> >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> >> >> > > understand your reasoning better.
>> >> >> > >
>> >> >> > > Thanks!
>> >> >> > > -ritesh
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>
Matthew Wilcox (Oracle) May 23, 2023, 3:59 p.m. UTC | #23
On Tue, May 23, 2023 at 11:22:33AM -0400, Brian Foster wrote:
> That is a prepatory patch before iop dirty tracking is enabled in patch
> 5, right? I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
> 
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look the next go around when I
> have a bit more time and just send a diff if it seems it can be done
> cleanly..

Honestly, I think the way the patchset is structured now is fine.
There are always multiple ways to structure a set of patches, and I don't
think we need be too concerned about what order things are done in,
as long as they get done.

Patch

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@  static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 25f20f269214..c7f41b26280a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -52,6 +52,12 @@  static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
 	bitmap_set(iop->state, start_blk, nr_blks);
 }

+static inline void iop_clear_range(struct iomap_page *iop,
+				   unsigned int start_blk, unsigned int nr_blks)
+{
+	bitmap_clear(iop->state, start_blk, nr_blks);
+}
+
 static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
 {
 	return test_bit(block, iop->state);
@@ -84,6 +90,16 @@  static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
 	return iop_test_block(iop, block);
 }

+static bool iop_test_block_dirty(struct folio *folio, int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+	WARN_ON(!iop);
+	return iop_test_block(iop, block + blks_per_folio);
+}
+
 static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 				   size_t off, size_t len)
 {
@@ -104,8 +120,42 @@  static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 	}
 }

+static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
+				size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
-				    unsigned int flags)
+				    unsigned int flags, bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -119,12 +169,20 @@  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;

-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-block uptodate and the
+	 * second tracks per-block dirty state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range(iop, 0, nr_blocks);
+		if (is_dirty)
+			iop_set_range(iop, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -268,7 +326,8 @@  static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iop_alloc(iter->inode, folio, iter->flags);
+		iop = iop_alloc(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);

@@ -306,7 +365,8 @@  static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);

 	/* zero post-eof blocks as the page may be mapped */
-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -561,6 +621,18 @@  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	struct iomap_page *iop;
+	struct inode *inode = mapping->host;
+	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
+
+	iop = iop_alloc(inode, folio, 0, false);
+	iop_set_range_dirty(inode, folio, 0, len);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -608,7 +680,8 @@  static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;

-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));

 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
@@ -767,6 +840,7 @@  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
+	iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -954,6 +1028,10 @@  static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		struct iomap_page *iop;
+		unsigned int first_blk, last_blk, blks_per_folio, i;
+		loff_t last_byte;
+		u8 blkbits = inode->i_blkbits;

 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -978,6 +1056,28 @@  static int iomap_write_delalloc_scan(struct inode *inode,
 				}
 			}

+			/*
+			 * When we have per-block dirty tracking, there can be
+			 * blocks within a folio which are marked uptodate
+			 * but not dirty. In that case it is necessary to punch
+			 * out such blocks to avoid leaking any delalloc blocks.
+			 */
+			iop = to_iomap_page(folio);
+			if (!iop)
+				goto skip_iop_punch;
+			last_byte = min_t(loff_t, end_byte - 1,
+				(folio_next_index(folio) << PAGE_SHIFT) - 1);
+			first_blk = offset_in_folio(folio, start_byte) >>
+						    blkbits;
+			last_blk = offset_in_folio(folio, last_byte) >>
+						   blkbits;
+			blks_per_folio = i_blocks_per_folio(inode, folio);
+			for (i = first_blk; i <= last_blk; i++) {
+				if (!iop_test_block_dirty(folio, i))
+					punch(inode, i << blkbits,
+						     1 << blkbits);
+			}
+skip_iop_punch:
 			/*
 			 * Make sure the next punch start is correctly bound to
 			 * the end of this data range, not the end of the folio.
@@ -1666,7 +1766,7 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iop_alloc(inode, folio, 0);
+	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1682,7 +1782,7 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_block_uptodate(folio, i))
+		if (iop && !iop_test_block_dirty(folio, i))
 			continue;

 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1726,6 +1826,7 @@  iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}

+	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@  const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@  const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..0c2bee80565c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
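
Note (illustration only, not part of the patch): a tiny standalone userspace
C sketch, with a made-up block count, showing how the second half of
iop->state carries the per-block dirty bits, mirroring the
"block + blks_per_folio" offset used by iop_test_block_dirty() and
iop_set_range_dirty() above:

#include <stdio.h>
#include <stdbool.h>

/*
 * Toy model of the iop->state layout: bits [0, NBLOCKS) track per-block
 * uptodate state, bits [NBLOCKS, 2 * NBLOCKS) track per-block dirty state.
 * The real code uses the kernel bitmap helpers under iop->state_lock; this
 * only demonstrates the indexing.
 */
#define NBLOCKS 16			/* e.g. a 64K folio with 4K blocks */

static unsigned long state;		/* 2 * NBLOCKS = 32 bits fit in one word */

static void set_block_dirty(unsigned int block)
{
	state |= 1UL << (block + NBLOCKS);	/* dirty bits live in the second half */
}

static bool test_block_dirty(unsigned int block)
{
	return state & (1UL << (block + NBLOCKS));
}

int main(void)
{
	set_block_dirty(3);		/* a small write landing in block 3 */
	for (unsigned int i = 0; i < NBLOCKS; i++)
		printf("block %2u dirty=%d\n", i, test_block_dirty(i));
	return 0;
}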