Message ID: 20240812121159.3775074-1-yi.zhang@huaweicloud.com
Series: iomap: some minor non-critical fixes and improvements when block size < folio size
On Mon, Aug 12, 2024 at 08:11:53PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Changes since v1:
> - Patch 5 fixes a stale data exposure problem pointed out by Willy:
>   drop the setting of uptodate bits after zeroing out an unaligned
>   range.
> - As Dave suggested, in order to avoid increasing the complexity of
>   maintaining the state_lock, don't drop all the state_lock in the
>   buffered write path; patch 6 introduces a new helper that sets the
>   uptodate and dirty bits together under the state_lock, saving one
>   locking cycle per write. The performance benefits do not change
>   much.

It's helpful to provide a lore link to the previous version so that
reviewers don't have to go looking for it themselves to remind them
of what was discussed last time.

https://lore.kernel.org/linux-xfs/20240731091305.2896873-1-yi.zhang@huaweicloud.com/T/

> This series contains some minor non-critical fixes and performance
> improvements for filesystems with block size < folio size.
>
> The first 4 patches fix the handling of setting and clearing folio
> ifs dirty bits when marking the folio dirty and when invalidating
> the folio. Although none of these mistakes cause a real problem now,
> they still deserve a fix to correct the behavior.
>
> The last 2 patches drop unnecessary state_lock cycles in ifs when
> setting and clearing dirty/uptodate bits in the buffered write path,
> which improves buffered write performance by ~8% on my machine. I
> tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
> (Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
> filesystem; the results are shown below.
>
> UnixBench test cmd:
> ./Run -i 1 -c 1 fstime-w
>
> Before:
> x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
> arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps
>
> After:
> x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
> arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps

Those are the same performance numbers as you posted for the
previous version of the patch. How does this new version perform
given that it's a complete rework of the optimisation? It's
important to know if the changes made actually provided the benefit
we expected them to make....

i.e. this is the sort of table of results I'd like to see provided:

platform  base      v1        v2
x86       524708.0  569218.0  ????
arm64     801965.0  871605.0  ????

-Dave.
On 2024/8/14 9:49, Dave Chinner wrote:
> On Mon, Aug 12, 2024 at 08:11:53PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Changes since v1:
>> - Patch 5 fixes a stale data exposure problem pointed out by Willy:
>>   drop the setting of uptodate bits after zeroing out an unaligned
>>   range.
>> - As Dave suggested, in order to avoid increasing the complexity of
>>   maintaining the state_lock, don't drop all the state_lock in the
>>   buffered write path; patch 6 introduces a new helper that sets the
>>   uptodate and dirty bits together under the state_lock, saving one
>>   locking cycle per write. The performance benefits do not change
>>   much.
>
> It's helpful to provide a lore link to the previous version so that
> reviewers don't have to go looking for it themselves to remind them
> of what was discussed last time.
>
> https://lore.kernel.org/linux-xfs/20240731091305.2896873-1-yi.zhang@huaweicloud.com/T/

Sure, will add in my later iterations.

>
>> This series contains some minor non-critical fixes and performance
>> improvements for filesystems with block size < folio size.
>>
>> The first 4 patches fix the handling of setting and clearing folio
>> ifs dirty bits when marking the folio dirty and when invalidating
>> the folio. Although none of these mistakes cause a real problem now,
>> they still deserve a fix to correct the behavior.
>>
>> The last 2 patches drop unnecessary state_lock cycles in ifs when
>> setting and clearing dirty/uptodate bits in the buffered write path,
>> which improves buffered write performance by ~8% on my machine. I
>> tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
>> (Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
>> filesystem; the results are shown below.
>>
>> UnixBench test cmd:
>> ./Run -i 1 -c 1 fstime-w
>>
>> Before:
>> x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
>> arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps
>>
>> After:
>> x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
>> arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps
>
> Those are the same performance numbers as you posted for the
> previous version of the patch. How does this new version perform
> given that it's a complete rework of the optimisation? It's

It's not exactly the same, but the difference is small; I've updated
the performance numbers in this cover letter.

> important to know if the changes made actually provided the benefit
> we expected them to make....
>
> i.e. this is the sort of table of results I'd like to see provided:
>
> platform  base      v1        v2
> x86       524708.0  569218.0  ????
> arm64     801965.0  871605.0  ????

platform  base      v1        v2
x86       524708.0  571315.0  569218.0
arm64     801965.0  876077.0  871605.0

Thanks,
Yi.
On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
> On 2024/8/14 9:49, Dave Chinner wrote:
> > important to know if the changes made actually provided the benefit
> > we expected them to make....
> >
> > i.e. this is the sort of table of results I'd like to see provided:
> >
> > platform  base      v1        v2
> > x86       524708.0  569218.0  ????
> > arm64     801965.0  871605.0  ????
>
> platform  base      v1        v2
> x86       524708.0  571315.0  569218.0
> arm64     801965.0  876077.0  871605.0

So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
this partial block write workload made no difference to performance
at all, and removing a lock cycle in iomap_write_end provided all
that gain?

Is this an overwrite workload or a file extending workload? The
result implies that iomap_block_needs_zeroing() is returning false,
hence it's an overwrite workload and it's reading partial blocks
from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
and so still calling the uptodate bitmap update function rather than
hitting the zeroing case and skipping it.

Hence I'm just trying to understand what the test is doing because
that tells me what the result should be...

Cheers,
Dave.
On 2024/8/14 10:47, Dave Chinner wrote:
> On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
>> On 2024/8/14 9:49, Dave Chinner wrote:
>>> important to know if the changes made actually provided the benefit
>>> we expected them to make....
>>>
>>> i.e. this is the sort of table of results I'd like to see provided:
>>>
>>> platform  base      v1        v2
>>> x86       524708.0  569218.0  ????
>>> arm64     801965.0  871605.0  ????
>>
>> platform  base      v1        v2
>> x86       524708.0  571315.0  569218.0
>> arm64     801965.0  876077.0  871605.0
>
> So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
> this partial block write workload made no difference to performance
> at all, and removing a lock cycle in iomap_write_end provided all
> that gain?

Yes.

>
> Is this an overwrite workload or a file extending workload? The
> result implies that iomap_block_needs_zeroing() is returning false,
> hence it's an overwrite workload and it's reading partial blocks
> from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
> and so still calling the uptodate bitmap update function rather than
> hitting the zeroing case and skipping it.
>
> Hence I'm just trying to understand what the test is doing because
> that tells me what the result should be...
>

I forgot to mention that I tested this on xfs with a 1K block size;
this is a simple case of block size < folio size that I can test
directly with UnixBench.

This test first does a buffered append write with bs=1K,count=2000 in
the first round, and then repeatedly overwrites from the start
position with the same parameters for 30 seconds. All the write
operations are block size aligned, so iomap_write_begin() just
continues after iomap_adjust_read_range() and doesn't call
iomap_set_range_uptodate() to set the range uptodate in the first
place, hence there is no difference with or without patch 5 in this
test case.

Thanks,
Yi.
On Wed, Aug 14, 2024 at 11:57:03AM +0800, Zhang Yi wrote:
> On 2024/8/14 10:47, Dave Chinner wrote:
> > On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
> >> On 2024/8/14 9:49, Dave Chinner wrote:
> >>> important to know if the changes made actually provided the benefit
> >>> we expected them to make....
> >>>
> >>> i.e. this is the sort of table of results I'd like to see provided:
> >>>
> >>> platform  base      v1        v2
> >>> x86       524708.0  569218.0  ????
> >>> arm64     801965.0  871605.0  ????
> >>
> >> platform  base      v1        v2
> >> x86       524708.0  571315.0  569218.0
> >> arm64     801965.0  876077.0  871605.0
> >
> > So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
> > this partial block write workload made no difference to performance
> > at all, and removing a lock cycle in iomap_write_end provided all
> > that gain?
>
> Yes.
>
> >
> > Is this an overwrite workload or a file extending workload? The
> > result implies that iomap_block_needs_zeroing() is returning false,
> > hence it's an overwrite workload and it's reading partial blocks
> > from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
> > and so still calling the uptodate bitmap update function rather than
> > hitting the zeroing case and skipping it.
> >
> > Hence I'm just trying to understand what the test is doing because
> > that tells me what the result should be...
>
> I forgot to mention that I tested this on xfs with a 1K block size;
> this is a simple case of block size < folio size that I can test
> directly with UnixBench.

OK. So it's an even more highly contrived microbenchmark than I
thought. :/

What is the impact on a 4kB block size filesystem running that same
1kB write test? That's going to be a far more common thing to occur
in production machines for such small IO, so let's make sure that we
haven't regressed that case in optimising for this one.

> This test first does a buffered append write with bs=1K,count=2000 in
> the first round, and then repeatedly overwrites from the start
> position with the same parameters for 30 seconds. All the write
> operations are block size aligned, so iomap_write_begin() just
> continues after iomap_adjust_read_range() and doesn't call
> iomap_set_range_uptodate() to set the range uptodate in the first
> place, hence there is no difference with or without patch 5 in this
> test case.

Ok, so you really need to come up with an equivalent test that
exercises the paths that patch 5 modifies, because right now we have
no real idea of what the impact of that change will be...

-Dave.
On 2024/8/14 13:16, Dave Chinner wrote:
> On Wed, Aug 14, 2024 at 11:57:03AM +0800, Zhang Yi wrote:
>> On 2024/8/14 10:47, Dave Chinner wrote:
>>> On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
>>>> On 2024/8/14 9:49, Dave Chinner wrote:
>>>>> important to know if the changes made actually provided the benefit
>>>>> we expected them to make....
>>>>>
>>>>> i.e. this is the sort of table of results I'd like to see provided:
>>>>>
>>>>> platform  base      v1        v2
>>>>> x86       524708.0  569218.0  ????
>>>>> arm64     801965.0  871605.0  ????
>>>>
>>>> platform  base      v1        v2
>>>> x86       524708.0  571315.0  569218.0
>>>> arm64     801965.0  876077.0  871605.0
>>>
>>> So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
>>> this partial block write workload made no difference to performance
>>> at all, and removing a lock cycle in iomap_write_end provided all
>>> that gain?
>>
>> Yes.
>>
>>>
>>> Is this an overwrite workload or a file extending workload? The
>>> result implies that iomap_block_needs_zeroing() is returning false,
>>> hence it's an overwrite workload and it's reading partial blocks
>>> from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
>>> and so still calling the uptodate bitmap update function rather than
>>> hitting the zeroing case and skipping it.
>>>
>>> Hence I'm just trying to understand what the test is doing because
>>> that tells me what the result should be...
>>>
>>
>> I forgot to mention that I tested this on xfs with a 1K block size;
>> this is a simple case of block size < folio size that I can test
>> directly with UnixBench.
>
> OK. So it's an even more highly contrived microbenchmark than I
> thought. :/
>
> What is the impact on a 4kB block size filesystem running that same
> 1kB write test? That's going to be a far more common thing to occur
> in production machines for such small IO,

Yeah, I agree with you. The original case I wanted to test is a
buffered overwrite with bs=4K on a 4KB block size filesystem that has
existing larger folios (> 4KB), which is a common case of block size <
folio size after large folios are enabled. But I couldn't find a
benchmark tool that does this test easily, so I used the above test
parameters to simulate it.

> let's make sure that we
> haven't regressed that case in optimising for this one.

Sure, I will test this case as well.

>
>> This test first does a buffered append write with bs=1K,count=2000 in
>> the first round, and then repeatedly overwrites from the start
>> position with the same parameters for 30 seconds. All the write
>> operations are block size aligned, so iomap_write_begin() just
>> continues after iomap_adjust_read_range() and doesn't call
>> iomap_set_range_uptodate() to set the range uptodate in the first
>> place, hence there is no difference with or without patch 5 in this
>> test case.
>
> Ok, so you really need to come up with an equivalent test that
> exercises the paths that patch 5 modifies, because right now we have
> no real idea of what the impact of that change will be...
>

Sure.

Thanks,
Yi.
From: Zhang Yi <yi.zhang@huawei.com>

Changes since v1:
- Patch 5 fixes a stale data exposure problem pointed out by Willy:
  drop the setting of uptodate bits after zeroing out an unaligned
  range.
- As Dave suggested, in order to avoid increasing the complexity of
  maintaining the state_lock, don't drop all the state_lock in the
  buffered write path; patch 6 introduces a new helper that sets the
  uptodate and dirty bits together under the state_lock, saving one
  locking cycle per write. The performance benefits do not change
  much.

This series contains some minor non-critical fixes and performance
improvements for filesystems with block size < folio size.

The first 4 patches fix the handling of setting and clearing folio ifs
dirty bits when marking the folio dirty and when invalidating the
folio. Although none of these mistakes cause a real problem now, they
still deserve a fix to correct the behavior.

The last 2 patches drop unnecessary state_lock cycles in ifs when
setting and clearing dirty/uptodate bits in the buffered write path,
which improves buffered write performance by ~8% on my machine. I
tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
(Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
filesystem; the results are shown below.

UnixBench test cmd:
./Run -i 1 -c 1 fstime-w

Before:
x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps

After:
x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps

Thanks,
Yi.
Zhang Yi (6):
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: advance the ifs allocation if we have more than one blocks
    per folio
  iomap: correct the dirty length in page mkwrite
  iomap: don't mark blocks uptodate after partial zeroing
  iomap: reduce unnecessary state_lock when setting ifs uptodate and
    dirty bits

 fs/iomap/buffered-io.c | 73 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 13 deletions(-)