diff mbox series

[-next,v5,7/8] xfs: speed up truncating down a big realtime inode

Message ID 20240613090033.2246907-8-yi.zhang@huaweicloud.com (mailing list archive)
State Accepted, archived
Series iomap/xfs: fix stale data exposure when truncating realtime inodes

Commit Message

Zhang Yi June 13, 2024, 9 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

If we truncate down a big realtime inode, zeroing out the entire aligned
EOF extent can get slow as the rtextsize increases. Fortunately,
__xfs_bunmapi() aligns the unmapped range to rtextsize, splitting the
extent and converting the blocks beyond EOF to unwritten. So speed this
up by adjusting the unitsize to the filesystem blocksize when truncating
down a large realtime inode, letting __xfs_bunmapi() convert the tail
blocks to unwritten; this improves performance significantly.
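The effect of the rounding change can be sketched with a simplified model
(illustrative only, not kernel code; the 4k block size and 256k rtextsize
are taken from the benchmark below):

```python
# Simplified model of the first_unmap_block rounding (not kernel code).

def roundup_64(x, unit):
    # Round x up to the next multiple of unit, like the kernel helper.
    return ((x + unit - 1) // unit) * unit

blocksize = 4096            # filesystem block size
rtextsize = 256 * 1024      # realtime allocation unit, in bytes
new_size = 4096             # truncate target, as in "truncate 4k"

# Before the patch: round new_size up to the whole rt extent, so every
# byte from new_size to the end of the 256k extent must be zeroed out.
zeroed_end_before = roundup_64(new_size, rtextsize)

# After the patch: round up only to the block size; the rest of the rt
# extent is split off and converted to unwritten by __xfs_bunmapi().
zeroed_end_after = roundup_64(new_size, blocksize)

print(zeroed_end_before - new_size)  # bytes zeroed before: 258048
print(zeroed_end_after - new_size)   # bytes zeroed after: 0
```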

 # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
            -d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
 # mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
 # for i in {1..1000}; \
   do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
 # sync
 # time for i in {1..1000}; \
   do xfs_io -c "truncate 4k" /mnt/scratch/$i; done

 rtextsize       8k      16k      32k      64k     256k     1024k
 before:       9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
 after:        9.710s   9.642s   9.958s   9.441s  10.021s  10.526s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_inode.c | 10 ++++++++--
 fs/xfs/xfs_iops.c  |  9 +++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig June 14, 2024, 6:08 a.m. UTC | #1
On Thu, Jun 13, 2024 at 05:00:32PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> If we truncate down a big realtime inode, zeroing out the entire aligned
> EOF extent can get slow as the rtextsize increases. Fortunately,
> __xfs_bunmapi() aligns the unmapped range to rtextsize, splitting the
> extent and converting the blocks beyond EOF to unwritten. So speed this
> up by adjusting the unitsize to the filesystem blocksize when truncating
> down a large realtime inode, letting __xfs_bunmapi() convert the tail
> blocks to unwritten; this improves performance significantly.
> 
>  # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
>             -d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
>  # mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
>  # for i in {1..1000}; \
>    do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
>  # sync
>  # time for i in {1..1000}; \
>    do xfs_io -c "truncate 4k" /mnt/scratch/$i; done
> 
>  rtextsize       8k      16k      32k      64k     256k     1024k
>  before:       9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
>  after:        9.710s   9.642s   9.958s   9.441s  10.021s  10.526s
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_inode.c | 10 ++++++++--
>  fs/xfs/xfs_iops.c  |  9 +++++++++
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 92daa2279053..5e837ed093b0 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1487,6 +1487,7 @@ xfs_itruncate_extents_flags(
>  	struct xfs_trans	*tp = *tpp;
>  	xfs_fileoff_t		first_unmap_block;
>  	int			error = 0;
> +	unsigned int		unitsize = xfs_inode_alloc_unitsize(ip);
>  
>  	xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>  	if (atomic_read(&VFS_I(ip)->i_count))
> @@ -1510,9 +1511,14 @@ xfs_itruncate_extents_flags(
>  	 *
>  	 * We have to free all the blocks to the bmbt maximum offset, even if
>  	 * the page cache can't scale that far.
> +	 *
> +	 * For a big realtime inode, don't align to the allocation unitsize;
> +	 * __xfs_bunmapi() will split the extent and convert tail blocks to unwritten.
>  	 */
> +	if (xfs_inode_has_bigrtalloc(ip))
> +		unitsize = i_blocksize(VFS_I(ip));
> +	first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));

If you expand what xfs_inode_alloc_unitsize and xfs_inode_has_bigrtalloc
do, this looks a bit silly:

	unsigned int            blocks = 1;

	if (XFS_IS_REALTIME_INODE(ip))
		blocks = ip->i_mount->m_sb.sb_rextsize;

	unitsize = XFS_FSB_TO_B(ip->i_mount, blocks);
	if (XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1)
		unitsize = i_blocksize(inode);

So I think we can simply drop this part now that the variant that zeroes
the entire rtextent is gone.

> @@ -862,6 +862,15 @@ xfs_setattr_truncate_data(
>  	/* Truncate down */
>  	blocksize = xfs_inode_alloc_unitsize(ip);
>  
> +	/*
> +	 * If it's a big realtime inode, zeroing out the entire EOF extent
> +	 * could get slow as the rtextsize increases; speed it up by adjusting
> +	 * the blocksize to the filesystem blocksize, letting __xfs_bunmapi()
> +	 * split the extent and convert the tail blocks to unwritten.
> +	 */
> +	if (xfs_inode_has_bigrtalloc(ip))
> +		blocksize = i_blocksize(inode);

Same here.  And with that probably also the passing of the block size
to the truncate_page helpers.
Zhang Yi June 14, 2024, 7:18 a.m. UTC | #2
On 2024/6/14 14:08, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 05:00:32PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> If we truncate down a big realtime inode, zeroing out the entire aligned
>> EOF extent can get slow as the rtextsize increases. Fortunately,
>> __xfs_bunmapi() aligns the unmapped range to rtextsize, splitting the
>> extent and converting the blocks beyond EOF to unwritten. So speed this
>> up by adjusting the unitsize to the filesystem blocksize when truncating
>> down a large realtime inode, letting __xfs_bunmapi() convert the tail
>> blocks to unwritten; this improves performance significantly.
>>
>>  # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
>>             -d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
>>  # mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
>>  # for i in {1..1000}; \
>>    do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
>>  # sync
>>  # time for i in {1..1000}; \
>>    do xfs_io -c "truncate 4k" /mnt/scratch/$i; done
>>
>>  rtextsize       8k      16k      32k      64k     256k     1024k
>>  before:       9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
>>  after:        9.710s   9.642s   9.958s   9.441s  10.021s  10.526s
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_inode.c | 10 ++++++++--
>>  fs/xfs/xfs_iops.c  |  9 +++++++++
>>  2 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 92daa2279053..5e837ed093b0 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1487,6 +1487,7 @@ xfs_itruncate_extents_flags(
>>  	struct xfs_trans	*tp = *tpp;
>>  	xfs_fileoff_t		first_unmap_block;
>>  	int			error = 0;
>> +	unsigned int		unitsize = xfs_inode_alloc_unitsize(ip);
>>  
>>  	xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>>  	if (atomic_read(&VFS_I(ip)->i_count))
>> @@ -1510,9 +1511,14 @@ xfs_itruncate_extents_flags(
>>  	 *
>>  	 * We have to free all the blocks to the bmbt maximum offset, even if
>>  	 * the page cache can't scale that far.
>> +	 *
>> +	 * For a big realtime inode, don't align to the allocation unitsize;
>> +	 * __xfs_bunmapi() will split the extent and convert tail blocks to unwritten.
>>  	 */
>> +	if (xfs_inode_has_bigrtalloc(ip))
>> +		unitsize = i_blocksize(VFS_I(ip));
>> +	first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));
> 
> If you expand what xfs_inode_alloc_unitsize and xfs_inode_has_bigrtalloc
> do, this looks a bit silly:
> 
> 	unsigned int            blocks = 1;
> 
> 	if (XFS_IS_REALTIME_INODE(ip))
> 		blocks = ip->i_mount->m_sb.sb_rextsize;
> 
> 	unitsize = XFS_FSB_TO_B(ip->i_mount, blocks);
> 	if (XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1)
> 		unitsize = i_blocksize(inode);
> 
> So I think we can simply drop this part now that the variant that zeroes
> the entire rtextent is gone.
> 
Thanks for your suggestion.

Yeah, we could fix the realtime inode problem by just dropping this part,
but for the upcoming forcealign and atomic write features from John, IIUC,
we can't split and convert the tail extent the way an RT inode does; we
should zero out the entire force-aligned tail extent, otherwise atomic
writes could be broken by submitting unaligned bios.

John has already expanded xfs_inode_alloc_unitsize() [1], so I think
we should keep this part for the forcealign feature and deal with
realtime inodes separately, is that right?

[1]
https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/T/#m73ccaa7b6fec9988f24b881e013fc367429405d6
https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/T/#m1a6312428e4addc4d0d260fbf33ad7bcffb98e0d

Thanks,
Yi.

>> @@ -862,6 +862,15 @@ xfs_setattr_truncate_data(
>>  	/* Truncate down */
>>  	blocksize = xfs_inode_alloc_unitsize(ip);
>>  
>> +	/*
>> +	 * If it's a big realtime inode, zeroing out the entire EOF extent
>> +	 * could get slow as the rtextsize increases; speed it up by adjusting
>> +	 * the blocksize to the filesystem blocksize, letting __xfs_bunmapi()
>> +	 * split the extent and convert the tail blocks to unwritten.
>> +	 */
>> +	if (xfs_inode_has_bigrtalloc(ip))
>> +		blocksize = i_blocksize(inode);
> 
> Same here.  And with that probably also the passing of the block size
> to the truncate_page helpers.
>
Christoph Hellwig June 14, 2024, 9:13 a.m. UTC | #3
On Fri, Jun 14, 2024 at 03:18:07PM +0800, Zhang Yi wrote:
> Thanks for your suggestion.
> 
> Yeah, we could fix the realtime inode problem by just dropping this part,
> but for the upcoming forcealign and atomic write features from John, IIUC,
> we can't split and convert the tail extent the way an RT inode does; we
> should zero out the entire force-aligned tail extent, otherwise atomic
> writes could be broken by submitting unaligned bios.

Let's worry about that if/when those actually land.  I also see no
reason why those couldn't just use the partial convert-to-unwritten path
offhand (but without having looked into the details).
Zhang Yi June 15, 2024, 11:44 a.m. UTC | #4
On 2024/6/14 17:13, Christoph Hellwig wrote:
> On Fri, Jun 14, 2024 at 03:18:07PM +0800, Zhang Yi wrote:
>> Thanks for your suggestion.
>>
>> Yeah, we could fix the realtime inode problem by just dropping this part,
>> but for the upcoming forcealign and atomic write features from John, IIUC,
>> we can't split and convert the tail extent the way an RT inode does; we
>> should zero out the entire force-aligned tail extent, otherwise atomic
>> writes could be broken by submitting unaligned bios.
> 
> Let's worry about that if/when those actually land.

OK, if we don't consider the upcoming forcealign and atomic write features,
I think only patch 6 is needed to fix the issue.

> I also see no
> reason why those couldn't just use the partial convert-to-unwritten path
> offhand (but without having looked into the details).
> 

The reason the atomic write feature can't split and convert the tail extent on
truncate down now is that the dio write iter loop will split an atomic dio that
covers the whole allocation unit (extsize), since there are two extents in one
allocation unit.

Please see this code:
__iomap_dio_rw()
{
	...
	while ((ret = iomap_iter(&iomi, ops)) > 0) {
		iomi.processed = iomap_dio_iter(&iomi, dio);
	...
}

The first loop iteration finds and submits the first extent and the second
submits the second extent; this breaks the atomicity property.

For example, say we have a file one extsize in length; if we truncate down
and split the extent, the file becomes as below:

  |   forced extsize (one atomic IO unit)  |
  wwwwwwwwwwwwww+uuuuuuuuuuuuuuuuuuuuuuuuuuu
                ^new size A                ^old size B

Then if we submit a DIO from 0 to B, xfs should issue it as one bio, but it
will issue two bios, since there are two extents. So, unless we find another
way to guarantee issuing one bio even when there are two extents in one
atomic write unit (I guess that may be complicated), we have to zero out the
whole unit when truncating down (I'd prefer this solution); we need to
consider this in the near future.

Thanks,
Yi.
Christoph Hellwig June 17, 2024, 6:59 a.m. UTC | #5
On Sat, Jun 15, 2024 at 07:44:21PM +0800, Zhang Yi wrote:
> The reason why atomic feature can't split and convert the tail extent on truncate
> down now is the dio write iter loop will split an atomic dio which covers the
> whole allocation unit(extsize) since there are two extents in on allocation unit.

We could fix this by merging the two in iomap_begin, as the end_io
handler already deals with multiple ranges.  But let's think of that
when the need actually arises.
Zhang Yi June 17, 2024, 9:11 a.m. UTC | #6
On 2024/6/17 14:59, Christoph Hellwig wrote:
> On Sat, Jun 15, 2024 at 07:44:21PM +0800, Zhang Yi wrote:
>> The reason why atomic feature can't split and convert the tail extent on truncate
>> down now is the dio write iter loop will split an atomic dio which covers the
>> whole allocation unit(extsize) since there are two extents in on allocation unit.
> 
> We could fix this by merging the two in iomap_begin, as the end_io
> handler already deals with multiple ranges.  

Yeah, that's one solution.

> But let's think of that when the need actually arises.
> 

Sure, I will retest and submit only patches 6 & 8 to solve the current
issue in my next version.

Thanks,
Yi.

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 92daa2279053..5e837ed093b0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1487,6 +1487,7 @@  xfs_itruncate_extents_flags(
 	struct xfs_trans	*tp = *tpp;
 	xfs_fileoff_t		first_unmap_block;
 	int			error = 0;
+	unsigned int		unitsize = xfs_inode_alloc_unitsize(ip);
 
 	xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 	if (atomic_read(&VFS_I(ip)->i_count))
@@ -1510,9 +1511,14 @@  xfs_itruncate_extents_flags(
 	 *
 	 * We have to free all the blocks to the bmbt maximum offset, even if
 	 * the page cache can't scale that far.
+	 *
+	 * For a big realtime inode, don't align to the allocation unitsize;
+	 * __xfs_bunmapi() will split the extent and convert tail blocks to unwritten.
 	 */
-	first_unmap_block = XFS_B_TO_FSB(mp,
-			roundup_64(new_size, xfs_inode_alloc_unitsize(ip)));
+	if (xfs_inode_has_bigrtalloc(ip))
+		unitsize = i_blocksize(VFS_I(ip));
+	first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));
+
 	if (!xfs_verify_fileoff(mp, first_unmap_block)) {
 		WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
 		return 0;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8af13fd37f1b..1903c06d39bc 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -862,6 +862,15 @@  xfs_setattr_truncate_data(
 	/* Truncate down */
 	blocksize = xfs_inode_alloc_unitsize(ip);
 
+	/*
+	 * If it's a big realtime inode, zeroing out the entire EOF extent
+	 * could get slow as the rtextsize increases; speed it up by adjusting
+	 * the blocksize to the filesystem blocksize, letting __xfs_bunmapi()
+	 * split the extent and convert the tail blocks to unwritten.
+	 */
+	if (xfs_inode_has_bigrtalloc(ip))
+		blocksize = i_blocksize(inode);
+
 	/*
 	 * iomap won't detect a dirty page over an unwritten block (or a cow
 	 * block over a hole) and subsequently skips zeroing the newly post-EOF