Message ID: 20240812121159.3775074-1-yi.zhang@huaweicloud.com
Series: iomap: some minor non-critical fixes and improvements when block size < folio size
On Mon, Aug 12, 2024 at 08:11:53PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Changes since v1:
> - Patch 5 fixes a stale data exposure problem pointed out by Willy:
>   drop the setting of uptodate bits after zeroing out an unaligned
>   range.
> - As Dave suggested, in order to avoid increasing the complexity of
>   maintaining the state_lock, don't drop all the state_lock in the
>   buffered write path; patch 6 introduces a new helper that sets the
>   uptodate and dirty bits together under the state_lock, saving one
>   locking cycle per write. The performance benefits do not change
>   much.

It's helpful to provide a lore link to the previous version so that
reviewers don't have to go looking for it themselves to remind them
of what was discussed last time.

https://lore.kernel.org/linux-xfs/20240731091305.2896873-1-yi.zhang@huaweicloud.com/T/

> This series contains some minor non-critical fixes and performance
> improvements for filesystems with block size < folio size.
>
> The first 4 patches fix the handling of setting and clearing folio
> ifs dirty bits when marking the folio dirty and when invalidating
> the folio. Although none of these mistakes cause a real problem now,
> they still deserve a fix to correct the behavior.
>
> The last 2 patches drop unnecessary state_lock cycles in ifs when
> setting and clearing dirty/uptodate bits in the buffered write path,
> which improves buffered write performance by ~8% on my machine. I
> tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
> (Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
> filesystem; the results are shown below.
>
> UnixBench test cmd:
> ./Run -i 1 -c 1 fstime-w
>
> Before:
> x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
> arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps
>
> After:
> x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
> arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps

Those are the same performance numbers as you posted for the
previous version of the patch. How does this new version perform
given that it's a complete rework of the optimisation? It's
important to know if the changes made actually provided the benefit
we expected them to make....

i.e. this is the sort of table of results I'd like to see provided:

platform  base      v1        v2
x86       524708.0  569218.0  ????
arm64     801965.0  871605.0  ????

-Dave.
On 2024/8/14 9:49, Dave Chinner wrote:
> On Mon, Aug 12, 2024 at 08:11:53PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Changes since v1:
>> - Patch 5 fixes a stale data exposure problem pointed out by Willy:
>>   drop the setting of uptodate bits after zeroing out an unaligned
>>   range.
>> - As Dave suggested, in order to avoid increasing the complexity of
>>   maintaining the state_lock, don't drop all the state_lock in the
>>   buffered write path; patch 6 introduces a new helper that sets the
>>   uptodate and dirty bits together under the state_lock, saving one
>>   locking cycle per write. The performance benefits do not change
>>   much.
>
> It's helpful to provide a lore link to the previous version so that
> reviewers don't have to go looking for it themselves to remind them
> of what was discussed last time.
>
> https://lore.kernel.org/linux-xfs/20240731091305.2896873-1-yi.zhang@huaweicloud.com/T/

Sure, will add in my later iterations.

>
>> This series contains some minor non-critical fixes and performance
>> improvements for filesystems with block size < folio size.
>>
>> The first 4 patches fix the handling of setting and clearing folio
>> ifs dirty bits when marking the folio dirty and when invalidating
>> the folio. Although none of these mistakes cause a real problem now,
>> they still deserve a fix to correct the behavior.
>>
>> The last 2 patches drop unnecessary state_lock cycles in ifs when
>> setting and clearing dirty/uptodate bits in the buffered write path,
>> which improves buffered write performance by ~8% on my machine. I
>> tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
>> (Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
>> filesystem; the results are shown below.
>>
>> UnixBench test cmd:
>> ./Run -i 1 -c 1 fstime-w
>>
>> Before:
>> x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
>> arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps
>>
>> After:
>> x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
>> arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps
>
> Those are the same performance numbers as you posted for the
> previous version of the patch. How does this new version perform
> given that it's a complete rework of the optimisation? It's

It's not exactly the same, but the difference is small; I've updated
the performance numbers in this cover letter.

> important to know if the changes made actually provided the benefit
> we expected them to make....
>
> i.e. this is the sort of table of results I'd like to see provided:
>
> platform  base      v1        v2
> x86       524708.0  569218.0  ????
> arm64     801965.0  871605.0  ????

platform  base      v1        v2
x86       524708.0  571315.0  569218.0
arm64     801965.0  876077.0  871605.0

Thanks,
Yi.
On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
> On 2024/8/14 9:49, Dave Chinner wrote:
> > important to know if the changes made actually provided the benefit
> > we expected them to make....
> >
> > i.e. this is the sort of table of results I'd like to see provided:
> >
> > platform  base      v1        v2
> > x86       524708.0  569218.0  ????
> > arm64     801965.0  871605.0  ????
>
> platform  base      v1        v2
> x86       524708.0  571315.0  569218.0
> arm64     801965.0  876077.0  871605.0

So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
this partial block write workload made no difference to performance
at all, and removing a lock cycle in iomap_write_end provided all
that gain?

Is this an overwrite workload or a file extending workload? The
result implies that iomap_block_needs_zeroing() is returning false,
hence it's an overwrite workload and it's reading partial blocks
from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
and so still calling the uptodate bitmap update function rather than
hitting the zeroing case and skipping it.

Hence I'm just trying to understand what the test is doing because
that tells me what the result should be...

Cheers,
Dave.
On 2024/8/14 10:47, Dave Chinner wrote:
> On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
>> On 2024/8/14 9:49, Dave Chinner wrote:
>>> important to know if the changes made actually provided the benefit
>>> we expected them to make....
>>>
>>> i.e. this is the sort of table of results I'd like to see provided:
>>>
>>> platform  base      v1        v2
>>> x86       524708.0  569218.0  ????
>>> arm64     801965.0  871605.0  ????
>>
>> platform  base      v1        v2
>> x86       524708.0  571315.0  569218.0
>> arm64     801965.0  876077.0  871605.0
>
> So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
> this partial block write workload made no difference to performance
> at all, and removing a lock cycle in iomap_write_end provided all
> that gain?

Yes.

>
> Is this an overwrite workload or a file extending workload? The
> result implies that iomap_block_needs_zeroing() is returning false,
> hence it's an overwrite workload and it's reading partial blocks
> from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
> and so still calling the uptodate bitmap update function rather than
> hitting the zeroing case and skipping it.
>
> Hence I'm just trying to understand what the test is doing because
> that tells me what the result should be...
>

I forgot to mention that I tested this on xfs with a 1K block size;
this is a simple case of block size < folio size that I can test
directly with UnixBench.

This test first does a buffered append write with bs=1K,count=2000 in
the first round, and then repeatedly overwrites from the start
position with the same parameters for 30 seconds. All the write
operations are block size aligned, so iomap_write_begin() just
continues after iomap_adjust_read_range() and doesn't call
iomap_set_range_uptodate() to set the range uptodate in the first
place, hence there is no difference with or without patch 5 in this
test case.

Thanks,
Yi.
On Wed, Aug 14, 2024 at 11:57:03AM +0800, Zhang Yi wrote:
> On 2024/8/14 10:47, Dave Chinner wrote:
> > On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
> >> On 2024/8/14 9:49, Dave Chinner wrote:
> >>> important to know if the changes made actually provided the benefit
> >>> we expected them to make....
> >>>
> >>> i.e. this is the sort of table of results I'd like to see provided:
> >>>
> >>> platform  base      v1        v2
> >>> x86       524708.0  569218.0  ????
> >>> arm64     801965.0  871605.0  ????
> >>
> >> platform  base      v1        v2
> >> x86       524708.0  571315.0  569218.0
> >> arm64     801965.0  876077.0  871605.0
> >
> > So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
> > this partial block write workload made no difference to performance
> > at all, and removing a lock cycle in iomap_write_end provided all
> > that gain?
>
> Yes.
>
> >
> > Is this an overwrite workload or a file extending workload? The
> > result implies that iomap_block_needs_zeroing() is returning false,
> > hence it's an overwrite workload and it's reading partial blocks
> > from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
> > and so still calling the uptodate bitmap update function rather than
> > hitting the zeroing case and skipping it.
> >
> > Hence I'm just trying to understand what the test is doing because
> > that tells me what the result should be...
>
> I forgot to mention that I tested this on xfs with a 1K block size;
> this is a simple case of block size < folio size that I can test
> directly with UnixBench.

OK. So it's an even more highly contrived microbenchmark than I
thought. :/

What is the impact on a 4kB block size filesystem running that same
1kB write test? That's going to be a far more common thing to occur
in production machines for such small IO, so let's make sure that we
haven't regressed that case in optimising for this one.

> This test first does a buffered append write with bs=1K,count=2000 in
> the first round, and then repeatedly overwrites from the start
> position with the same parameters for 30 seconds. All the write
> operations are block size aligned, so iomap_write_begin() just
> continues after iomap_adjust_read_range() and doesn't call
> iomap_set_range_uptodate() to set the range uptodate in the first
> place, hence there is no difference with or without patch 5 in this
> test case.

Ok, so you really need to come up with an equivalent test that
exercises the paths that patch 5 modifies, because right now we have
no real idea of what the impact of that change will be...

-Dave.
On 2024/8/14 13:16, Dave Chinner wrote:
> On Wed, Aug 14, 2024 at 11:57:03AM +0800, Zhang Yi wrote:
>> On 2024/8/14 10:47, Dave Chinner wrote:
>>> On Wed, Aug 14, 2024 at 10:14:01AM +0800, Zhang Yi wrote:
>>>> On 2024/8/14 9:49, Dave Chinner wrote:
>>>>> important to know if the changes made actually provided the benefit
>>>>> we expected them to make....
>>>>>
>>>>> i.e. this is the sort of table of results I'd like to see provided:
>>>>>
>>>>> platform  base      v1        v2
>>>>> x86       524708.0  569218.0  ????
>>>>> arm64     801965.0  871605.0  ????
>>>>
>>>> platform  base      v1        v2
>>>> x86       524708.0  571315.0  569218.0
>>>> arm64     801965.0  876077.0  871605.0
>>>
>>> So avoiding the lock cycle in iomap_write_begin() (in patch 5) in
>>> this partial block write workload made no difference to performance
>>> at all, and removing a lock cycle in iomap_write_end provided all
>>> that gain?
>>
>> Yes.
>>
>>>
>>> Is this an overwrite workload or a file extending workload? The
>>> result implies that iomap_block_needs_zeroing() is returning false,
>>> hence it's an overwrite workload and it's reading partial blocks
>>> from disk. i.e. it is doing synchronous RMW cycles from the ramdisk
>>> and so still calling the uptodate bitmap update function rather than
>>> hitting the zeroing case and skipping it.
>>>
>>> Hence I'm just trying to understand what the test is doing because
>>> that tells me what the result should be...
>>>
>>
>> I forgot to mention that I tested this on xfs with a 1K block size;
>> this is a simple case of block size < folio size that I can test
>> directly with UnixBench.
>
> OK. So it's an even more highly contrived microbenchmark than I
> thought. :/
>
> What is the impact on a 4kB block size filesystem running that same
> 1kB write test? That's going to be a far more common thing to occur
> in production machines for such small IO,

Yeah, I agree with you. The original case I wanted to test is a
buffered overwrite with bs=4K on a 4KB block size filesystem that has
existing larger folios (> 4KB), which is a common case of block size <
folio size after large folios are enabled. But I couldn't find a
benchmark tool that does this test easily, so I used the above test
parameters to simulate it.

> let's make sure that we
> haven't regressed that case in optimising for this one.

Sure, I will test this case as well.

>
>> This test first does a buffered append write with bs=1K,count=2000 in
>> the first round, and then repeatedly overwrites from the start
>> position with the same parameters for 30 seconds. All the write
>> operations are block size aligned, so iomap_write_begin() just
>> continues after iomap_adjust_read_range() and doesn't call
>> iomap_set_range_uptodate() to set the range uptodate in the first
>> place, hence there is no difference with or without patch 5 in this
>> test case.
>
> Ok, so you really need to come up with an equivalent test that
> exercises the paths that patch 5 modifies, because right now we have
> no real idea of what the impact of that change will be...
>

Sure.

Thanks,
Yi.
From: Zhang Yi <yi.zhang@huawei.com>

Changes since v1:
- Patch 5 fixes a stale data exposure problem pointed out by Willy:
  drop the setting of uptodate bits after zeroing out an unaligned
  range.
- As Dave suggested, in order to avoid increasing the complexity of
  maintaining the state_lock, don't drop all the state_lock in the
  buffered write path; patch 6 introduces a new helper that sets the
  uptodate and dirty bits together under the state_lock, saving one
  locking cycle per write. The performance benefits do not change
  much.

This series contains some minor non-critical fixes and performance
improvements for filesystems with block size < folio size.

The first 4 patches fix the handling of setting and clearing folio ifs
dirty bits when marking the folio dirty and when invalidating the
folio. Although none of these mistakes cause a real problem now, they
still deserve a fix to correct the behavior.

The last 2 patches drop unnecessary state_lock cycles in ifs when
setting and clearing dirty/uptodate bits in the buffered write path,
which improves buffered write performance by ~8% on my machine. I
tested with UnixBench on my x86_64 (Xeon Gold 6151) and arm64
(Kunpeng-920) virtual machines with a 50GB ramdisk and an xfs
filesystem; the results are shown below.

UnixBench test cmd:
./Run -i 1 -c 1 fstime-w

Before:
x86     File Write 1024 bufsize 2000 maxblocks    524708.0 KBps
arm64   File Write 1024 bufsize 2000 maxblocks    801965.0 KBps

After:
x86     File Write 1024 bufsize 2000 maxblocks    569218.0 KBps
arm64   File Write 1024 bufsize 2000 maxblocks    871605.0 KBps

Thanks,
Yi.
Zhang Yi (6):
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: advance the ifs allocation if we have more than one blocks
    per folio
  iomap: correct the dirty length in page mkwrite
  iomap: don't mark blocks uptodate after partial zeroing
  iomap: reduce unnecessary state_lock when setting ifs uptodate and
    dirty bits

 fs/iomap/buffered-io.c | 73 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 13 deletions(-)