diff mbox series

[RFC,v4,8/8] xfs: improve truncate on a realtime inode with huge extsize

Message ID 20240529095206.2568162-9-yi.zhang@huaweicloud.com (mailing list archive)
State New
Headers show
Series iomap/xfs: fix stale data exposure when truncating realtime inodes | expand

Commit Message

Zhang Yi May 29, 2024, 9:52 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

If we truncate down a realtime inode which extsize is too large, zeroing
out the entire aligned EOF extent could be very slow. Fortunately,
__xfs_bunmapi() would align the unmapped range to rtextsize, split and
convert the extra blocks to unwritten state. So, adjust the blocksize to
the filesystem blocksize if the rtextsize is large enough, let
__xfs_bunmapi() to convert the tail blocks to unwritten, this could
improve the truncate performance significantly.

 # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
            -d rtinherit=1 -r extsize=1G /dev/pmem2s
 # for i in {1..5}; \
   do dd if=/dev/zero of=/mnt/scratch/$i bs=1M count=1024; done
 # sync
 # time for i in {1..5}; \
   do xfs_io -c "truncate 4k" /mnt/scratch/$i; done

Before:
 real    0m16.762s
 user    0m0.008s
 sys     0m16.750s

After:
 real    0m0.076s
 user    0m0.010s
 sys     0m0.069s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_inode.c |  2 +-
 fs/xfs/xfs_inode.h | 12 ++++++++++++
 fs/xfs/xfs_iops.c  |  9 +++++++++
 3 files changed, 22 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig May 31, 2024, 1:46 p.m. UTC | #1
> +/*
> + * Decide if this file is a realtime file whose data allocation unit is larger
> + * than default.
> + */
> +static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
> +{
> +	struct xfs_mount *mp = ip->i_mount;
> +
> +	return XFS_IS_REALTIME_INODE(ip) &&
> +	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
> +}

The default rtextsize is actually a single FSB unless we're on a striped
volume in which case it is increased.

I'll take care of removing the unused and confusing XFS_DFL_RTEXTSIZE,
but for this patch we'd need to know the trade-off of when to just
convert to unwritten.  For single-fsb rtextents we obviously don't need
any special action.  But do you see a slowdown when converting to
unwritten for small > 1 rtextsizes?  Because if not we could just
always use that code path, which would significantly simplify things
and remove yet another different to test code path.
Darrick J. Wong May 31, 2024, 2:12 p.m. UTC | #2
On Fri, May 31, 2024 at 06:46:10AM -0700, Christoph Hellwig wrote:
> > +/*
> > + * Decide if this file is a realtime file whose data allocation unit is larger
> > + * than default.
> > + */
> > +static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
> > +{
> > +	struct xfs_mount *mp = ip->i_mount;
> > +
> > +	return XFS_IS_REALTIME_INODE(ip) &&
> > +	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
> > +}
> 
> The default rtextsize is actually a single FSB unless we're on a striped
> volume in which case it is increased.
> 
> I'll take care of removing the unused and confusing XFS_DFL_RTEXTSIZE,
> but for this patch we'd need to know the trade-off of when to just
> convert to unwritten.  For single-fsb rtextents we obviously don't need
> any special action.  But do you see a slowdown when converting to
> unwritten for small > 1 rtextsizes?  Because if not we could just
> always use that code path, which would significantly simplify things
> and remove yet another different to test code path.

There are <cough> some users that want 1G extents.

For the rest of us who don't live in the stratosphere, it's convenient
for fsdax to have rt extents that match the PMD size, which could be
large on arm64 (e.g. 512M, or two smr sectors).

--D
Christoph Hellwig May 31, 2024, 2:15 p.m. UTC | #3
On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
> There are <cough> some users that want 1G extents.
> 
> For the rest of us who don't live in the stratosphere, it's convenient
> for fsdax to have rt extents that match the PMD size, which could be
> large on arm64 (e.g. 512M, or two smr sectors).

That's fine.  Maybe to rephrase my question.  With this series we
have 3 different truncate path:

 1) unmap all blocks (!rt || rtextsizse == 1)
 2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
 3) converted leftover block in an rtextent to unwritten (large
   rtextsize)

What is the right threshold to switch between 2 and 3?  And do we
really need 2) at all?
Darrick J. Wong May 31, 2024, 3 p.m. UTC | #4
On Fri, May 31, 2024 at 07:15:34AM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
> > There are <cough> some users that want 1G extents.
> > 
> > For the rest of us who don't live in the stratosphere, it's convenient
> > for fsdax to have rt extents that match the PMD size, which could be
> > large on arm64 (e.g. 512M, or two smr sectors).
> 
> That's fine.  Maybe to rephrase my question.  With this series we
> have 3 different truncate path:
> 
>  1) unmap all blocks (!rt || rtextsizse == 1)
>  2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
>  3) converted leftover block in an rtextent to unwritten (large
>    rtextsize)
> 
> What is the right threshold to switch between 2 and 3?  And do we
> really need 2) at all?

I don't think we need (2) at all.

There's likely some threshold below where it's a wash -- compare with
ext4 strategy of trying to write 64k chunks even if that requires
zeroing pagecache to cut down on fragmentation on hdds -- but I don't
know if we care anymore. ;)

--D
Zhang Yi June 4, 2024, 7:09 a.m. UTC | #5
On 2024/5/31 23:00, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 07:15:34AM -0700, Christoph Hellwig wrote:
>> On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
>>> There are <cough> some users that want 1G extents.
>>>
>>> For the rest of us who don't live in the stratosphere, it's convenient
>>> for fsdax to have rt extents that match the PMD size, which could be
>>> large on arm64 (e.g. 512M, or two smr sectors).
>>
>> That's fine.  Maybe to rephrase my question.  With this series we
>> have 3 different truncate path:
>>
>>  1) unmap all blocks (!rt || rtextsizse == 1)
>>  2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
>>  3) converted leftover block in an rtextent to unwritten (large
>>    rtextsize)
>>
>> What is the right threshold to switch between 2 and 3?  And do we
>> really need 2) at all?
> 
> I don't think we need (2) at all.
> 
> There's likely some threshold below where it's a wash -- compare with
> ext4 strategy of trying to write 64k chunks even if that requires
> zeroing pagecache to cut down on fragmentation on hdds -- but I don't
> know if we care anymore. ;)
> 

I supplemented some tests for small > 1 rtextsizes on my ramdisk,

  mkfs.xfs -f -m reflink=0,rmapbt=0, -d rtinherit=1 \
           -r rtdev=/dev/pmem1s,extsize=$rtextsize /dev/pmem2s
  mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
  for i in {1..1000}; \
  do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1; done
  sync
  time for i in {1..1000}; \
  do xfs_io -c "truncate 4k" /mnt/scratch/$i; done

rtextsize            8k      16k      32k      64k     256k     1024k
zero out:          9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
convert unwritten: 9.710s   9.642s   9.958s   9.441s  10.021s  10.526s

The test showed that there is no much difference between (2) and (3)
with small rtextsize, but if the size gets progressively larger, (3)
will be better, so I agree with you that we could just drop (2) for
rt device.

Thanks,
Yi.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index db35167acef6..c0c1ab310aae 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1513,7 +1513,7 @@  xfs_itruncate_extents_flags(
 	 * the page cache can't scale that far.
 	 */
 	first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
-	if (xfs_inode_has_bigrtalloc(ip))
+	if (xfs_inode_has_bigrtalloc(ip) && !xfs_inode_has_hugertalloc(ip))
 		first_unmap_block = xfs_rtb_roundup_rtx(mp, first_unmap_block);
 	if (!xfs_verify_fileoff(mp, first_unmap_block)) {
 		WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 292b90b5f2ac..4eed5b0c57c0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -320,6 +320,18 @@  static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
 	return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
 }
 
+/*
+ * Decide if this file is a realtime file whose data allocation unit is larger
+ * than default.
+ */
+static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
+{
+	struct xfs_mount *mp = ip->i_mount;
+
+	return XFS_IS_REALTIME_INODE(ip) &&
+	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c53de5e6ef66..d5fc84e5a37c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -870,6 +870,15 @@  xfs_setattr_size(
 	if (newsize < oldsize) {
 		unsigned int blocksize = xfs_inode_alloc_unitsize(ip);
 
+		/*
+		 * If the extsize is too large on a realtime inode, zeroing
+		 * out the entire aligned EOF extent could be slow, adjust the
+		 * blocksize to the filesystem blocksize, let __xfs_bunmapi()
+		 * to convert the tail blocks to unwritten.
+		 */
+		if (xfs_inode_has_hugertalloc(ip))
+			blocksize = i_blocksize(inode);
+
 		/*
 		 * Zeroing out the partial EOF block and the rest of the extra
 		 * aligned blocks on a downward truncate.