[RFCv3,1/3] iomap: Allocate iop in ->write_begin() early

Message ID	34dafb5e15dba3bb0b0e072404ac6fb9f11561b8.1677428794.git.ritesh.list@gmail.com
State	New, archived
Headers	Return-Path: <linux-fsdevel-owner@vger.kernel.org>
	From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
	To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
	Cc: Ritesh Harjani <ritesh.list@gmail.com>
	Subject: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
	Date: Mon, 27 Feb 2023 01:13:30 +0530
	Message-Id: <34dafb5e15dba3bb0b0e072404ac6fb9f11561b8.1677428794.git.ritesh.list@gmail.com>
	In-Reply-To: <cover.1677428794.git.ritesh.list@gmail.com>
	References: <cover.1677428794.git.ritesh.list@gmail.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	iomap: Add support for subpage dirty state tracking to improve write performance
	[RFCv3,0/3] iomap: Add support for subpage dirty state tracking to improve write performance
	[RFCv3,1/3] iomap: Allocate iop in ->write_begin() early
	[RFCv3,2/3] iomap: Change uptodate variable name to state
	[RFCv3,3/3] iomap: Support subpage size dirty tracking to improve write performance

Ritesh Harjani (IBM) Feb. 26, 2023, 7:43 p.m. UTC

Previously, when the folio was uptodate, we allocated the iop only at
writeback time (in iomap_writepage_map()). This has been fine so far,
but once we add support for subpage size dirty bitmap tracking in the
iop, it could cause performance degradation. If we don't allocate the
iop during ->write_begin(), then ->write_end() can never mark the
necessary dirty bits, and we will have to mark all the bits dirty at
writeback time. That would cause the same write amplification and
performance problems we have now (without subpage dirty bitmap tracking
in the iop).

However, for writes whose (pos, len) completely overlaps the given
folio, there is no need to allocate an iop during ->write_begin(), so
skip those cases.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Dave Chinner Feb. 26, 2023, 10:41 p.m. UTC | #1

On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> Earlier when the folio is uptodate, we only allocate iop at writeback
> time (in iomap_writepage_map()). This is ok until now, but when we are
> going to add support for subpage size dirty bitmap tracking in iop, this
> could cause some performance degradation. The reason is that if we don't
> allocate iop during ->write_begin(), then we will never mark the
> necessary dirty bits in ->write_end() call. And we will have to mark all
> the bits as dirty at the writeback time, that could cause the same write
> amplification and performance problems as it is now (w/o subpage dirty
> bitmap tracking in iop).
> 
> However, for all the writes with (pos, len) which completely overlaps
> the given folio, there is no need to allocate an iop during
> ->write_begin(). So skip those cases.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..c5b51ab1184e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>  	size_t poff, plen;
>  
> +	if (pos <= folio_pos(folio) &&
> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> +		return 0;

This is magic without a comment explaining why it exists. You have
that explanation in the commit message, but that doesn't help anyone
looking at the code:

	/*
	 * If the write completely overlaps the current folio, then
	 * entire folio will be dirtied so there is no need for
	 * sub-folio state tracking structures to be attached to this folio.
	 */

-Dave.
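
For reference, the early-return check with the suggested comment folded in could be modelled in userspace as below. write_covers_folio() is a hypothetical helper; the folio position and size are plain parameters standing in for folio_pos(folio) and folio_size(folio), and every quantity is a long long for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the early-return check in
 * __iomap_write_begin().  fpos/fsize stand in for folio_pos(folio)
 * and folio_size(folio).
 */
static bool write_covers_folio(long long pos, long long len,
			       long long fpos, long long fsize)
{
	assert(len > 0);
	/*
	 * If the write completely overlaps the current folio, then
	 * the entire folio will be dirtied, so there is no need for
	 * sub-folio state tracking structures to be attached to this folio.
	 */
	return pos <= fpos && pos + len >= fpos + fsize;
}
```

A write that starts one byte past the folio start, or ends one byte short of the folio end, fails the check and so would get an iop allocated.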

Matthew Wilcox (Oracle) Feb. 26, 2023, 11:12 p.m. UTC | #2

On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> +++ b/fs/iomap/buffered-io.c
> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>  	size_t poff, plen;
>  
> +	if (pos <= folio_pos(folio) &&
> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> +		return 0;
> +
> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
> +
>  	if (folio_test_uptodate(folio))
>  		return 0;
>  	folio_clear_error(folio);
>  
> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;

Don't you want to move the -EAGAIN check up too?  Otherwise an
io_uring write will dirty the entire folio rather than a block.

It occurs to me (even though I was the one who suggested the current
check) that pos <= folio_pos etc is actually a bit tighter than
necessary.  We could get away with:

	if (pos < folio_pos(folio) + block_size &&
	    pos + len > folio_pos(folio) + folio_size(folio) - block_size)

since that will also cause the entire folio to be dirtied.  Not sure if
it's worth it.
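
The difference between the current check and the relaxed one can be seen with a small model. Both helpers are hypothetical userspace stand-ins, and the model assumes the folio size is a multiple of the block size and that dirty state is tracked per block.

```c
#include <assert.h>
#include <stdbool.h>

/* Current (strict) test: the write spans every byte of the folio. */
static bool covers_strict(long long pos, long long len,
			  long long fpos, long long fsize)
{
	return pos <= fpos && pos + len >= fpos + fsize;
}

/*
 * Relaxed test from the review: the write touches both the first and
 * the last block of the folio.  Because a write is contiguous and dirty
 * state is per block, every block in the folio would end up dirty.
 */
static bool covers_relaxed(long long pos, long long len,
			   long long fpos, long long fsize, long long bsize)
{
	assert(len > 0 && bsize > 0 && fsize % bsize == 0);
	return pos < fpos + bsize && pos + len > fpos + fsize - bsize;
}
```

With a 16 KiB folio at offset 0 and 4 KiB blocks, a write of [100, 16100) passes only the relaxed test, yet it touches all four blocks, so all four would be dirtied either way. The model only answers whether per-block dirty tracking is needed: a partially covered first or last block may still have to be read in when the folio is not uptodate, so a relaxed test would presumably gate only the iop allocation, not the read path.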

Ritesh Harjani (IBM) Feb. 28, 2023, 5:55 p.m. UTC | #3

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> Earlier when the folio is uptodate, we only allocate iop at writeback
>> time (in iomap_writepage_map()). This is ok until now, but when we are
>> going to add support for subpage size dirty bitmap tracking in iop, this
>> could cause some performance degradation. The reason is that if we don't
>> allocate iop during ->write_begin(), then we will never mark the
>> necessary dirty bits in ->write_end() call. And we will have to mark all
>> the bits as dirty at the writeback time, that could cause the same write
>> amplification and performance problems as it is now (w/o subpage dirty
>> bitmap tracking in iop).
>>
>> However, for all the writes with (pos, len) which completely overlaps
>> the given folio, there is no need to allocate an iop during
>> ->write_begin(). So skip those cases.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 356193e44cf0..c5b51ab1184e 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>>  	size_t poff, plen;
>>
>> +	if (pos <= folio_pos(folio) &&
>> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> +		return 0;
>
> This is magic without a comment explaining why it exists. You have
> that explanation in the commit message, but that doesn't help anyone
> looking at the code:
>
> 	/*
> 	 * If the write completely overlaps the current folio, then
> 	 * entire folio will be dirtied so there is no need for
> 	 * sub-folio state tracking structures to be attached to this folio.
> 	 */

Sure, got it. I will add a comment which explains this in the code as
well.

Thanks for the review!
-ritesh

Ritesh Harjani (IBM) Feb. 28, 2023, 6:33 p.m. UTC | #4

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> +++ b/fs/iomap/buffered-io.c
>> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>>  	size_t poff, plen;
>>
>> +	if (pos <= folio_pos(folio) &&
>> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> +		return 0;
>> +
>> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> +
>>  	if (folio_test_uptodate(folio))
>>  		return 0;
>>  	folio_clear_error(folio);
>>
>> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>>  		return -EAGAIN;
>
> Don't you want to move the -EAGAIN check up too?  Otherwise an
> io_uring write will dirty the entire folio rather than a block.

I am not entirely convinced that we should move this check up (to just
after the iop allocation). If the folio is uptodate, it is fine to
return 0 rather than -EAGAIN, because we are not going to read the
folio from disk anyway (it is already completely uptodate).

Thoughts? Or am I missing something here?

>
> It occurs to me (even though I was the one who suggested the current
> check) that pos <= folio_pos etc is actually a bit tighter than
> necessary.  We could get away with:
>
> 	if (pos < folio_pos(folio) + block_size &&
> 	    pos + len > folio_pos(folio) + folio_size(folio) - block_size)
>
> since that will also cause the entire folio to be dirtied.  Not sure if
> it's worth it.

I am not sure how much impact such a change would have, but I agree
that the above check is much less restrictive.

Let me spend some more time thinking it through.

Thanks for the review!
-ritesh

Matthew Wilcox (Oracle) Feb. 28, 2023, 6:36 p.m. UTC | #5

On Wed, Mar 01, 2023 at 12:03:48AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >>  	size_t from = offset_in_folio(folio, pos), to = from + len;
> >>  	size_t poff, plen;
> >>
> >> +	if (pos <= folio_pos(folio) &&
> >> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> >> +		return 0;
> >> +
> >> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
> >> +
> >>  	if (folio_test_uptodate(folio))
> >>  		return 0;
> >>  	folio_clear_error(folio);
> >>
> >> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
> >>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
> >>  		return -EAGAIN;
> >
> > Don't you want to move the -EAGAIN check up too?  Otherwise an
> > io_uring write will dirty the entire folio rather than a block.
> 
> I am not entirely convinced whether we should move this check up
> (to put it just after the iop allocation). The reason is if the folio is
> uptodate then it is ok to return 0 rather than -EAGAIN, because we are
> anyway not going to read the folio from disk (given it is completely
> uptodate).
> 
> Thoughts? Or am I missing anything here.

But then we won't have an iop, so a write will dirty the entire folio
instead of just the blocks you want to dirty.
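
The ordering under discussion can be sketched as a small decision function. This is a hypothetical model, not the patch: struct wb_state and write_begin_order() are illustrative names, and the read-in of partial blocks that follows in __iomap_write_begin() is left out. It places the IOMAP_NOWAIT check before the uptodate early return, as suggested above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs to the decision; names are illustrative only. */
struct wb_state {
	bool covers_folio;	/* write spans the whole folio */
	bool iop_allocated;	/* did the iop allocation succeed? */
	bool nowait;		/* IOMAP_NOWAIT set (e.g. io_uring) */
	bool uptodate;		/* folio_test_uptodate() */
	unsigned nr_blocks;	/* blocks per folio */
};

/*
 * Check for a failed NOWAIT iop allocation *before* the uptodate
 * early return, so a NOWAIT write never proceeds without per-block
 * dirty tracking.  Returns 0 to continue or -EAGAIN to ask for a
 * retry without IOMAP_NOWAIT.
 */
static int write_begin_order(const struct wb_state *s)
{
	assert(s->nr_blocks > 0);
	if (s->covers_folio)
		return 0;	/* whole folio dirtied; no iop needed */
	if (s->nowait && !s->iop_allocated && s->nr_blocks > 1)
		return -EAGAIN;	/* moved above the uptodate check */
	if (s->uptodate)
		return 0;	/* nothing needs to be read in */
	/* read-in of partially covered blocks would follow here */
	return 0;
}
```

With nowait set, an uptodate folio and a failed iop allocation, this returns -EAGAIN. With the ordering in the patch, the uptodate check would return 0 first and the write would go ahead with no iop, dirtying the whole folio.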

Ritesh Harjani (IBM) March 2, 2023, 6:59 p.m. UTC | #6

Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Mar 01, 2023 at 12:03:48AM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> >> +++ b/fs/iomap/buffered-io.c
>> >> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>> >>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>> >>  	size_t poff, plen;
>> >>
>> >> +	if (pos <= folio_pos(folio) &&
>> >> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> >> +		return 0;
>> >> +
>> >> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> >> +
>> >>  	if (folio_test_uptodate(folio))
>> >>  		return 0;
>> >>  	folio_clear_error(folio);
>> >>
>> >> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> >>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>> >>  		return -EAGAIN;
>> >
>> > Don't you want to move the -EAGAIN check up too?  Otherwise an
>> > io_uring write will dirty the entire folio rather than a block.
>>
>> I am not entirely convinced whether we should move this check up
>> (to put it just after the iop allocation). The reason is if the folio is
>> uptodate then it is ok to return 0 rather than -EAGAIN, because we are
>> anyway not going to read the folio from disk (given it is completely
>> uptodate).
>>
>> Thoughts? Or am I missing anything here.
>
> But then we won't have an iop, so a write will dirty the entire folio
> instead of just the blocks you want to dirty.

Ok, I see what you are saying. Makes sense. I will give it a try.

Thanks
-ritesh
