[RFC,v4,5/8] xfs: refactor the truncating order

Message ID	20240529095206.2568162-6-yi.zhang@huaweicloud.com (mailing list archive)
State	New
Headers	show Received: from dggsgout11.his.huawei.com (dggsgout11.his.huawei.com [45.249.212.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C549D15A4BC; Wed, 29 May 2024 01:59:22 +0000 (UTC) From: Zhang Yi <yi.zhang@huaweicloud.com> To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, djwong@kernel.org, hch@infradead.org, brauner@kernel.org, david@fromorbit.com, chandanbabu@kernel.org, jack@suse.cz, willy@infradead.org, yi.zhang@huawei.com, yi.zhang@huaweicloud.com, chengzhihao1@huawei.com, yukuai3@huawei.com Subject: [RFC PATCH v4 5/8] xfs: refactor the truncating order Date: Wed, 29 May 2024 17:52:03 +0800 Message-Id: <20240529095206.2568162-6-yi.zhang@huaweicloud.com> In-Reply-To: <20240529095206.2568162-1-yi.zhang@huaweicloud.com> References: <20240529095206.2568162-1-yi.zhang@huaweicloud.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	iomap/xfs: fix stale data exposure when truncating realtime inodes \| expand [RFC,v4,0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes [RFC,v4,1/8] iomap: zeroing needs to be pagecache aware [RFC,v4,2/8] math64: add rem_u64() to just return the remainder [RFC,v4,3/8] iomap: pass blocksize to iomap_truncate_page() [RFC,v4,4/8] fsdax: pass blocksize to dax_truncate_page() [RFC,v4,5/8] xfs: refactor the truncating order [RFC,v4,6/8] xfs: correct the truncate blocksize of realtime inode [RFC,v4,7/8] xfs: reserve blocks for truncating realtime inode [RFC,v4,8/8] xfs: improve truncate on a realtime inode with huge extsize

Zhang Yi May 29, 2024, 9:52 a.m. UTC

From: Zhang Yi <yi.zhang@huawei.com>

When truncating down an inode, we call xfs_truncate_page() to zero out
the tail partial block that beyond new EOF, which prevents exposing
stale data. But xfs_truncate_page() always assumes the blocksize is
i_blocksize(inode), it's not always true if we have a large allocation
unit for a file and we should aligned to this unitsize, e.g. realtime
inode should aligned to the rtextsize.

Current xfs_setattr_size() can't support zeroing out a large alignment
size on trucate down since the process order is wrong. We first do zero
out through xfs_truncate_page(), and then update inode size through
truncate_setsize() immediately. If the zeroed range is larger than a
folio, the write back path would not write back zeroed pagecache beyond
the EOF folio, so it doesn't write zeroes to the entire tail extent and
could expose stale data after an appending write into the next aligned
extent.

We need to adjust the order to zero out tail aligned blocks, write back
zeroed or cached data, update i_size and drop cache beyond aligned EOF
block, preparing for the fix of realtime inode and supporting the
upcoming forced alignment feature.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iomap.c |   2 +-
 fs/xfs/xfs_iomap.h |   3 +-
 fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
 3 files changed, 69 insertions(+), 43 deletions(-)

Christoph Hellwig May 31, 2024, 1:31 p.m. UTC | #1

> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;

Maybe need_writeback would be a better name for the variable?  Also no
need to initialize it to false at declaration time if it is
unconditionally set here.

> +		/*
> +		 * Updating i_size after writing back to make sure the zeroed
> +		 * blocks could been written out, and drop all the page cache
> +		 * range that beyond blocksize aligned new EOF block.
> +		 *
> +		 * We've already locked out new page faults, so now we can
> +		 * safely remove pages from the page cache knowing they won't
> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> +		 * extent manipulations are complete.
> +		 */
> +		i_size_write(inode, newsize);
> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));

Any reason this open codes truncate_setsize()?

Darrick J. Wong May 31, 2024, 3:27 p.m. UTC | #2

On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
> > +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> 
> Maybe need_writeback would be a better name for the variable?  Also no
> need to initialize it to false at declaration time if it is
> unconditionally set here.

This variable captures whether or not we need to write dirty file tail
data because we're extending the ondisk EOF, right?

I don't really like long names like any good 1980s C programmer, but
maybe we should name this something like "extending_ondisk_eof"?

	if (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)
		extending_ondisk_eof = true;

	...

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

Hm?

> > +		/*
> > +		 * Updating i_size after writing back to make sure the zeroed
> > +		 * blocks could been written out, and drop all the page cache
> > +		 * range that beyond blocksize aligned new EOF block.
> > +		 *
> > +		 * We've already locked out new page faults, so now we can
> > +		 * safely remove pages from the page cache knowing they won't
> > +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the

And can we correct the comment here too?

"...until we drop XFS_MMAPLOCK_EXCL after the extent manipulations..."

--D

> > +		 * extent manipulations are complete.
> > +		 */
> > +		i_size_write(inode, newsize);
> > +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
> 
> Any reason this open codes truncate_setsize()?
> 
>

Darrick J. Wong May 31, 2024, 3:44 p.m. UTC | #3

On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating down an inode, we call xfs_truncate_page() to zero out
> the tail partial block that beyond new EOF, which prevents exposing
> stale data. But xfs_truncate_page() always assumes the blocksize is
> i_blocksize(inode), it's not always true if we have a large allocation
> unit for a file and we should aligned to this unitsize, e.g. realtime
> inode should aligned to the rtextsize.
> 
> Current xfs_setattr_size() can't support zeroing out a large alignment
> size on trucate down since the process order is wrong. We first do zero
> out through xfs_truncate_page(), and then update inode size through
> truncate_setsize() immediately. If the zeroed range is larger than a
> folio, the write back path would not write back zeroed pagecache beyond
> the EOF folio, so it doesn't write zeroes to the entire tail extent and
> could expose stale data after an appending write into the next aligned
> extent.
> 
> We need to adjust the order to zero out tail aligned blocks, write back
> zeroed or cached data, update i_size and drop cache beyond aligned EOF
> block, preparing for the fix of realtime inode and supporting the
> upcoming forced alignment feature.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c |   2 +-
>  fs/xfs/xfs_iomap.h |   3 +-
>  fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
>  3 files changed, 69 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 8cdfcbb5baa7..0369b64cc3f4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1468,10 +1468,10 @@ int
>  xfs_truncate_page(
>  	struct xfs_inode	*ip,
>  	loff_t			pos,
> +	unsigned int		blocksize,
>  	bool			*did_zero)
>  {
>  	struct inode		*inode = VFS_I(ip);
> -	unsigned int		blocksize = i_blocksize(inode);
>  
>  	if (IS_DAX(inode))
>  		return dax_truncate_page(inode, pos, blocksize, did_zero,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 4da13440bae9..feb1610cb645 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
>  
>  int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
>  		bool *did_zero);
> -int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
> +int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
> +		unsigned int blocksize, bool *did_zero);
>  
>  static inline xfs_filblks_t
>  xfs_aligned_fsb_count(
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index d44508930b67..d24927075022 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -812,6 +812,7 @@ xfs_setattr_size(
>  	int			error;
>  	uint			lock_flags = 0;
>  	bool			did_zeroing = false;
> +	bool			write_back = false;
>  
>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
>  	ASSERT(S_ISREG(inode->i_mode));
> @@ -853,30 +854,7 @@ xfs_setattr_size(
>  	 * the transaction because the inode cannot be unlocked once it is a
>  	 * part of the transaction.
>  	 *
> -	 * Start with zeroing any data beyond EOF that we may expose on file
> -	 * extension, or zeroing out the rest of the block on a downward
> -	 * truncate.
> -	 */
> -	if (newsize > oldsize) {
> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> -				&did_zeroing);
> -	} else if (newsize != oldsize) {
> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> -	}
> -
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * We've already locked out new page faults, so now we can safely remove
> -	 * pages from the page cache knowing they won't get refaulted until we
> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
> -	 * complete. The truncate_setsize() call also cleans partial EOF page
> -	 * PTEs on extending truncates and hence ensures sub-page block size
> -	 * filesystems are correctly handled, too.
> -	 *
> -	 * We have to do all the page cache truncate work outside the
> +	 * And we have to do all the page cache truncate work outside the

Style nit: don't start a paragraph with "and".

>  	 * transaction context as the "lock" order is page lock->log space
>  	 * reservation as defined by extent allocation in the writeback path.
>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
> @@ -884,27 +862,74 @@ xfs_setattr_size(
>  	 * user visible changes). There's not much we can do about this, except
>  	 * to hope that the caller sees ENOMEM and retries the truncate
>  	 * operation.
> -	 *
> -	 * And we update in-core i_size and truncate page cache beyond newsize
> -	 * before writeback the [i_disk_size, newsize] range, so we're
> -	 * guaranteed not to write stale data past the new EOF on truncate down.
>  	 */
> -	truncate_setsize(inode, newsize);
> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> +	if (newsize < oldsize) {
> +		unsigned int blocksize = i_blocksize(inode);
>  
> -	/*
> -	 * We are going to log the inode size change in this transaction so
> -	 * any previous writes that are beyond the on disk EOF and the new
> -	 * EOF that have not been written out need to be written here.  If we
> -	 * do not write the data out, we expose ourselves to the null files
> -	 * problem. Note that this includes any block zeroing we did above;
> -	 * otherwise those blocks may not be zeroed after a crash.
> -	 */
> -	if (did_zeroing ||
> -	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						ip->i_disk_size, newsize - 1);
> +		/*
> +		 * Zeroing out the partial EOF block and the rest of the extra
> +		 * aligned blocks on a downward truncate.
> +		 */
> +		error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
>  		if (error)
>  			return error;
> +
> +		/*
> +		 * We are going to log the inode size change in this transaction
> +		 * so any previous writes that are beyond the on disk EOF and
> +		 * the new EOF that have not been written out need to be written
> +		 * here.  If we do not write the data out, we expose ourselves
> +		 * to the null files problem. Note that this includes any block
> +		 * zeroing we did above; otherwise those blocks may not be
> +		 * zeroed after a crash.
> +		 */
> +		if (did_zeroing || write_back) {
> +			error = filemap_write_and_wait_range(inode->i_mapping,
> +					min_t(loff_t, ip->i_disk_size, newsize),
> +					roundup_64(newsize, blocksize) - 1);
> +			if (error)
> +				return error;
> +		}
> +
> +		/*
> +		 * Updating i_size after writing back to make sure the zeroed

"Update the incore i_size after flushing dirty tail pages to disk, and
drop all the pagecache beyond the allocation unit containing EOF." ?

> +		 * blocks could been written out, and drop all the page cache
> +		 * range that beyond blocksize aligned new EOF block.
> +		 *
> +		 * We've already locked out new page faults, so now we can
> +		 * safely remove pages from the page cache knowing they won't
> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> +		 * extent manipulations are complete.
> +		 */
> +		i_size_write(inode, newsize);
> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));

I'm not sure why we need to preserve the pagecache beyond eof having
zeroed and then written the post-eof blocks out to disk, but I'm
guessing this is why you open-code truncate_setsize?

> +	} else {
> +		/*
> +		 * Start with zeroing any data beyond EOF that we may expose on
> +		 * file extension.
> +		 */
> +		if (newsize > oldsize) {
> +			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> +			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> +					       &did_zeroing);
> +			if (error)
> +				return error;
> +		}
> +
> +		/*
> +		 * The truncate_setsize() call also cleans partial EOF page
> +		 * PTEs on extending truncates and hence ensures sub-page block
> +		 * size filesystems are correctly handled, too.
> +		 */
> +		truncate_setsize(inode, newsize);
> +
> +		if (did_zeroing || write_back) {
> +			error = filemap_write_and_wait_range(inode->i_mapping,
> +					ip->i_disk_size, newsize - 1);
> +			if (error)
> +				return error;
> +		}
>  	}

At this point I wonder if these three truncate cases (down, up, and
unchanged) should just be broken out into three helpers without so much
twisty logic.

xfs_setattr_truncate_down():
	xfs_truncate_page(..., &did_zeroing);

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

	truncate_setsize(...); /* or your opencoded version */

xfs_setattr_truncate_up():
	xfs_zero_range(..., &did_zeroing);

	truncate_setsize(...);

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

xfs_setattr_truncate_unchanged():
	truncate_setsize(...);

	if (extending_ondisk_eof)
		filemap_write_and_wait_range(...);

So then the callsite becomes:

	if (newsize > oldsize)
		xfs_settattr_truncate_up();
	else if (newsize < oldsize)
		xfs_setattr_truncate_down();
	else
		xfs_setattr_truncate_unchanged();

But, I dunno.  Most of the code is really just extensive commenting.

--D

> +			if (error)
> +				return error;
> +		}
> +
> +		/*
> +		 * The truncate_setsize() call also cleans partial EOF page
> +		 * PTEs on extending truncates and hence ensures sub-page block
> +		 * size filesystems are correctly handled, too.
> +		 */
> +		truncate_setsize(inode, newsize);
> +
> +		if (did_zeroing || write_back) {
> +			error = filemap_write_and_wait_range(inode->i_mapping,
> +					ip->i_disk_size, newsize - 1);



>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> -- 
> 2.39.2
> 
>

Christoph Hellwig May 31, 2024, 4:17 p.m. UTC | #4

On Fri, May 31, 2024 at 08:27:32AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
> > > +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> > 
> > Maybe need_writeback would be a better name for the variable?  Also no
> > need to initialize it to false at declaration time if it is
> > unconditionally set here.
> 
> This variable captures whether or not we need to write dirty file tail
> data because we're extending the ondisk EOF, right?

Yes.

> I don't really like long names like any good 1980s C programmer, but
> maybe we should name this something like "extending_ondisk_eof"?

Sure.

Dave Chinner June 2, 2024, 10:46 p.m. UTC | #5

On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating down an inode, we call xfs_truncate_page() to zero out
> the tail partial block that beyond new EOF, which prevents exposing
> stale data. But xfs_truncate_page() always assumes the blocksize is
> i_blocksize(inode), it's not always true if we have a large allocation
> unit for a file and we should aligned to this unitsize, e.g. realtime
> inode should aligned to the rtextsize.
> 
> Current xfs_setattr_size() can't support zeroing out a large alignment
> size on trucate down since the process order is wrong. We first do zero
> out through xfs_truncate_page(), and then update inode size through
> truncate_setsize() immediately. If the zeroed range is larger than a
> folio, the write back path would not write back zeroed pagecache beyond
> the EOF folio, so it doesn't write zeroes to the entire tail extent and
> could expose stale data after an appending write into the next aligned
> extent.
> 
> We need to adjust the order to zero out tail aligned blocks, write back
> zeroed or cached data, update i_size and drop cache beyond aligned EOF
> block, preparing for the fix of realtime inode and supporting the
> upcoming forced alignment feature.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
.....
> @@ -853,30 +854,7 @@ xfs_setattr_size(
>  	 * the transaction because the inode cannot be unlocked once it is a
>  	 * part of the transaction.
>  	 *
> -	 * Start with zeroing any data beyond EOF that we may expose on file
> -	 * extension, or zeroing out the rest of the block on a downward
> -	 * truncate.
> -	 */
> -	if (newsize > oldsize) {
> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> -				&did_zeroing);
> -	} else if (newsize != oldsize) {
> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> -	}
> -
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * We've already locked out new page faults, so now we can safely remove
> -	 * pages from the page cache knowing they won't get refaulted until we
> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
> -	 * complete. The truncate_setsize() call also cleans partial EOF page
> -	 * PTEs on extending truncates and hence ensures sub-page block size
> -	 * filesystems are correctly handled, too.
> -	 *
> -	 * We have to do all the page cache truncate work outside the
> +	 * And we have to do all the page cache truncate work outside the
>  	 * transaction context as the "lock" order is page lock->log space
>  	 * reservation as defined by extent allocation in the writeback path.
>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
......

Lots of new logic for zeroing here. That makes xfs_setattr_size()
even longer than it already is. Can you lift this EOF zeroing logic
into it's own helper function so that it is clear that it is a
completely independent operation to the actual transaction that
changes the inode size. That would also allow the operations to be
broken up into:

	if (newsize >= oldsize) {
		/* do the simple stuff */
		....
		return error;
	}
	/* do the complex size reduction stuff without additional indenting */

-Dave.

Zhang Yi June 3, 2024, 1:51 p.m. UTC | #6

On 2024/5/31 23:27, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
>>> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
>>
>> Maybe need_writeback would be a better name for the variable?  Also no
>> need to initialize it to false at declaration time if it is
>> unconditionally set here.
> 
> This variable captures whether or not we need to write dirty file tail
> data because we're extending the ondisk EOF, right?
> 
> I don't really like long names like any good 1980s C programmer, but
> maybe we should name this something like "extending_ondisk_eof"?
> 
> 	if (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)
> 		extending_ondisk_eof = true;
> 
> 	...
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> Hm?

Sure, this name looks better.

> 
>>> +		/*
>>> +		 * Updating i_size after writing back to make sure the zeroed
>>> +		 * blocks could been written out, and drop all the page cache
>>> +		 * range that beyond blocksize aligned new EOF block.
>>> +		 *
>>> +		 * We've already locked out new page faults, so now we can
>>> +		 * safely remove pages from the page cache knowing they won't
>>> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> 
> And can we correct the comment here too?
> 
> "...until we drop XFS_MMAPLOCK_EXCL after the extent manipulations..."
> 

Sure,

> --D
> 
>>> +		 * extent manipulations are complete.
>>> +		 */
>>> +		i_size_write(inode, newsize);
>>> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
>>
>> Any reason this open codes truncate_setsize()?
>>

It's not equal to open codes truncate_setsize(), please look the truncate
start pos is aligned to rtextsize for realtime inode, we only drop page
cache that beyond the new aligned EOF block.

Thanks,
Yi.

Zhang Yi June 3, 2024, 2:15 p.m. UTC | #7

On 2024/5/31 23:44, Darrick J. Wong wrote:
> On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating down an inode, we call xfs_truncate_page() to zero out
>> the tail partial block that beyond new EOF, which prevents exposing
>> stale data. But xfs_truncate_page() always assumes the blocksize is
>> i_blocksize(inode), it's not always true if we have a large allocation
>> unit for a file and we should aligned to this unitsize, e.g. realtime
>> inode should aligned to the rtextsize.
>>
>> Current xfs_setattr_size() can't support zeroing out a large alignment
>> size on trucate down since the process order is wrong. We first do zero
>> out through xfs_truncate_page(), and then update inode size through
>> truncate_setsize() immediately. If the zeroed range is larger than a
>> folio, the write back path would not write back zeroed pagecache beyond
>> the EOF folio, so it doesn't write zeroes to the entire tail extent and
>> could expose stale data after an appending write into the next aligned
>> extent.
>>
>> We need to adjust the order to zero out tail aligned blocks, write back
>> zeroed or cached data, update i_size and drop cache beyond aligned EOF
>> block, preparing for the fix of realtime inode and supporting the
>> upcoming forced alignment feature.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c |   2 +-
>>  fs/xfs/xfs_iomap.h |   3 +-
>>  fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
>>  3 files changed, 69 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 8cdfcbb5baa7..0369b64cc3f4 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1468,10 +1468,10 @@ int
>>  xfs_truncate_page(
>>  	struct xfs_inode	*ip,
>>  	loff_t			pos,
>> +	unsigned int		blocksize,
>>  	bool			*did_zero)
>>  {
>>  	struct inode		*inode = VFS_I(ip);
>> -	unsigned int		blocksize = i_blocksize(inode);
>>  
>>  	if (IS_DAX(inode))
>>  		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
>> index 4da13440bae9..feb1610cb645 100644
>> --- a/fs/xfs/xfs_iomap.h
>> +++ b/fs/xfs/xfs_iomap.h
>> @@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
>>  
>>  int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
>>  		bool *did_zero);
>> -int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
>> +int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
>> +		unsigned int blocksize, bool *did_zero);
>>  
>>  static inline xfs_filblks_t
>>  xfs_aligned_fsb_count(
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index d44508930b67..d24927075022 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -812,6 +812,7 @@ xfs_setattr_size(
>>  	int			error;
>>  	uint			lock_flags = 0;
>>  	bool			did_zeroing = false;
>> +	bool			write_back = false;
>>  
>>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
>>  	ASSERT(S_ISREG(inode->i_mode));
>> @@ -853,30 +854,7 @@ xfs_setattr_size(
>>  	 * the transaction because the inode cannot be unlocked once it is a
>>  	 * part of the transaction.
>>  	 *
>> -	 * Start with zeroing any data beyond EOF that we may expose on file
>> -	 * extension, or zeroing out the rest of the block on a downward
>> -	 * truncate.
>> -	 */
>> -	if (newsize > oldsize) {
>> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> -				&did_zeroing);
>> -	} else if (newsize != oldsize) {
>> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>> -	}
>> -
>> -	if (error)
>> -		return error;
>> -
>> -	/*
>> -	 * We've already locked out new page faults, so now we can safely remove
>> -	 * pages from the page cache knowing they won't get refaulted until we
>> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
>> -	 * complete. The truncate_setsize() call also cleans partial EOF page
>> -	 * PTEs on extending truncates and hence ensures sub-page block size
>> -	 * filesystems are correctly handled, too.
>> -	 *
>> -	 * We have to do all the page cache truncate work outside the
>> +	 * And we have to do all the page cache truncate work outside the
> 
> Style nit: don't start a paragraph with "and".

Sure, thanks for point this out.

> 
>>  	 * transaction context as the "lock" order is page lock->log space
>>  	 * reservation as defined by extent allocation in the writeback path.
>>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
>> @@ -884,27 +862,74 @@ xfs_setattr_size(
>>  	 * user visible changes). There's not much we can do about this, except
>>  	 * to hope that the caller sees ENOMEM and retries the truncate
>>  	 * operation.
>> -	 *
>> -	 * And we update in-core i_size and truncate page cache beyond newsize
>> -	 * before writeback the [i_disk_size, newsize] range, so we're
>> -	 * guaranteed not to write stale data past the new EOF on truncate down.
>>  	 */
>> -	truncate_setsize(inode, newsize);
>> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
>> +	if (newsize < oldsize) {
>> +		unsigned int blocksize = i_blocksize(inode);
>>  
>> -	/*
>> -	 * We are going to log the inode size change in this transaction so
>> -	 * any previous writes that are beyond the on disk EOF and the new
>> -	 * EOF that have not been written out need to be written here.  If we
>> -	 * do not write the data out, we expose ourselves to the null files
>> -	 * problem. Note that this includes any block zeroing we did above;
>> -	 * otherwise those blocks may not be zeroed after a crash.
>> -	 */
>> -	if (did_zeroing ||
>> -	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
>> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
>> -						ip->i_disk_size, newsize - 1);
>> +		/*
>> +		 * Zeroing out the partial EOF block and the rest of the extra
>> +		 * aligned blocks on a downward truncate.
>> +		 */
>> +		error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
>>  		if (error)
>>  			return error;
>> +
>> +		/*
>> +		 * We are going to log the inode size change in this transaction
>> +		 * so any previous writes that are beyond the on disk EOF and
>> +		 * the new EOF that have not been written out need to be written
>> +		 * here.  If we do not write the data out, we expose ourselves
>> +		 * to the null files problem. Note that this includes any block
>> +		 * zeroing we did above; otherwise those blocks may not be
>> +		 * zeroed after a crash.
>> +		 */
>> +		if (did_zeroing || write_back) {
>> +			error = filemap_write_and_wait_range(inode->i_mapping,
>> +					min_t(loff_t, ip->i_disk_size, newsize),
>> +					roundup_64(newsize, blocksize) - 1);
>> +			if (error)
>> +				return error;
>> +		}
>> +
>> +		/*
>> +		 * Updating i_size after writing back to make sure the zeroed
> 
> "Update the incore i_size after flushing dirty tail pages to disk, and
> drop all the pagecache beyond the allocation unit containing EOF." ?

Yep.

> 
>> +		 * blocks could been written out, and drop all the page cache
>> +		 * range that beyond blocksize aligned new EOF block.
>> +		 *
>> +		 * We've already locked out new page faults, so now we can
>> +		 * safely remove pages from the page cache knowing they won't
>> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
>> +		 * extent manipulations are complete.
>> +		 */
>> +		i_size_write(inode, newsize);
>> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
> 
> I'm not sure why we need to preserve the pagecache beyond eof having
> zeroed and then written the post-eof blocks out to disk, but I'm
> guessing this is why you open-code truncate_setsize?

Yeah, xfs_truncate_page() already done the zero out, if we keep passing the
newsize to truncate_pagecache() through truncate_setsize(), it would zero out
partial folio which cover the already zeroed blocks. What we should do at
this moment is just drop all the page cache beyond aligned EOF block, so I
roundup the newsize, just a small optimization.

> 
>> +	} else {
>> +		/*
>> +		 * Start with zeroing any data beyond EOF that we may expose on
>> +		 * file extension.
>> +		 */
>> +		if (newsize > oldsize) {
>> +			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> +			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> +					       &did_zeroing);
>> +			if (error)
>> +				return error;
>> +		}
>> +
>> +		/*
>> +		 * The truncate_setsize() call also cleans partial EOF page
>> +		 * PTEs on extending truncates and hence ensures sub-page block
>> +		 * size filesystems are correctly handled, too.
>> +		 */
>> +		truncate_setsize(inode, newsize);
>> +
>> +		if (did_zeroing || write_back) {
>> +			error = filemap_write_and_wait_range(inode->i_mapping,
>> +					ip->i_disk_size, newsize - 1);
>> +			if (error)
>> +				return error;
>> +		}
>>  	}
> 
> At this point I wonder if these three truncate cases (down, up, and
> unchanged) should just be broken out into three helpers without so much
> twisty logic.
> 
> xfs_setattr_truncate_down():
> 	xfs_truncate_page(..., &did_zeroing);
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> 	truncate_setsize(...); /* or your opencoded version */
> 
> xfs_setattr_truncate_up():
> 	xfs_zero_range(..., &did_zeroing);
> 
> 	truncate_setsize(...);
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> xfs_setattr_truncate_unchanged():
> 	truncate_setsize(...);
> 
> 	if (extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> So then the callsite becomes:
> 
> 	if (newsize > oldsize)
> 		xfs_settattr_truncate_up();
> 	else if (newsize < oldsize)
> 		xfs_setattr_truncate_down();
> 	else
> 		xfs_setattr_truncate_unchanged();

Sounds good.

> 
> But, I dunno.  Most of the code is really just extensive commenting.
> 

Yeah, the extensive comments also bothers me, too. I will try to make
it more clear in the next iteration, I hope.

Thanks,
Yi.

> --D
> 
>> +			if (error)
>> +				return error;
>> +		}
>> +
>> +		/*
>> +		 * The truncate_setsize() call also cleans partial EOF page
>> +		 * PTEs on extending truncates and hence ensures sub-page block
>> +		 * size filesystems are correctly handled, too.
>> +		 */
>> +		truncate_setsize(inode, newsize);
>> +
>> +		if (did_zeroing || write_back) {
>> +			error = filemap_write_and_wait_range(inode->i_mapping,
>> +					ip->i_disk_size, newsize - 1);
> 
> 
> 
>>  
>>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>> -- 
>> 2.39.2
>>
>>

Zhang Yi June 3, 2024, 2:18 p.m. UTC | #8

On 2024/6/3 6:46, Dave Chinner wrote:
> On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating down an inode, we call xfs_truncate_page() to zero out
>> the tail partial block that beyond new EOF, which prevents exposing
>> stale data. But xfs_truncate_page() always assumes the blocksize is
>> i_blocksize(inode), it's not always true if we have a large allocation
>> unit for a file and we should aligned to this unitsize, e.g. realtime
>> inode should aligned to the rtextsize.
>>
>> Current xfs_setattr_size() can't support zeroing out a large alignment
>> size on trucate down since the process order is wrong. We first do zero
>> out through xfs_truncate_page(), and then update inode size through
>> truncate_setsize() immediately. If the zeroed range is larger than a
>> folio, the write back path would not write back zeroed pagecache beyond
>> the EOF folio, so it doesn't write zeroes to the entire tail extent and
>> could expose stale data after an appending write into the next aligned
>> extent.
>>
>> We need to adjust the order to zero out tail aligned blocks, write back
>> zeroed or cached data, update i_size and drop cache beyond aligned EOF
>> block, preparing for the fix of realtime inode and supporting the
>> upcoming forced alignment feature.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
> .....
>> @@ -853,30 +854,7 @@ xfs_setattr_size(
>>  	 * the transaction because the inode cannot be unlocked once it is a
>>  	 * part of the transaction.
>>  	 *
>> -	 * Start with zeroing any data beyond EOF that we may expose on file
>> -	 * extension, or zeroing out the rest of the block on a downward
>> -	 * truncate.
>> -	 */
>> -	if (newsize > oldsize) {
>> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> -				&did_zeroing);
>> -	} else if (newsize != oldsize) {
>> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>> -	}
>> -
>> -	if (error)
>> -		return error;
>> -
>> -	/*
>> -	 * We've already locked out new page faults, so now we can safely remove
>> -	 * pages from the page cache knowing they won't get refaulted until we
>> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
>> -	 * complete. The truncate_setsize() call also cleans partial EOF page
>> -	 * PTEs on extending truncates and hence ensures sub-page block size
>> -	 * filesystems are correctly handled, too.
>> -	 *
>> -	 * We have to do all the page cache truncate work outside the
>> +	 * And we have to do all the page cache truncate work outside the
>>  	 * transaction context as the "lock" order is page lock->log space
>>  	 * reservation as defined by extent allocation in the writeback path.
>>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
> ......
> 
> Lots of new logic for zeroing here. That makes xfs_setattr_size()
> even longer than it already is. Can you lift this EOF zeroing logic
> into it's own helper function so that it is clear that it is a
> completely independent operation to the actual transaction that
> changes the inode size. That would also allow the operations to be
> broken up into:
> 
> 	if (newsize >= oldsize) {
> 		/* do the simple stuff */
> 		....
> 		return error;
> 	}
> 	/* do the complex size reduction stuff without additional indenting */
> 

Sure, I will try to factor them out.

Thanks,
Yi.

[RFC,v4,5/8] xfs: refactor the truncating order

Commit Message

Comments

Patch