[V5,13/16] xfs: Conditionally upgrade existing inodes to use 64-bit extent counters

Message ID 20220121051857.221105-14-chandan.babu@oracle.com (mailing list archive)
State Superseded, archived
Series xfs: Extend per-inode extent counters

Commit Message

Chandan Babu R Jan. 21, 2022, 5:18 a.m. UTC
This commit upgrades inodes to use 64-bit extent counters when they are read
from disk. Inodes are upgraded only when the filesystem instance has the
XFS_SB_FEAT_INCOMPAT_NREXT64 incompat flag set.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Darrick J. Wong Feb. 1, 2022, 8:01 p.m. UTC | #1
On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
> This commit upgrades inodes to use 64-bit extent counters when they are read
> from disk. Inodes are upgraded only when the filesystem instance has
> XFS_SB_FEAT_INCOMPAT_NREXT64 incompat flag set.
> 
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 2200526bcee0..767189c7c887 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -253,6 +253,12 @@ xfs_inode_from_disk(
>  	}
>  	if (xfs_is_reflink_inode(ip))
>  		xfs_ifork_init_cow(ip);
> +
> +	if ((from->di_version == 3) &&
> +	     xfs_has_nrext64(ip->i_mount) &&
> +	     !xfs_dinode_has_nrext64(from))
> +		ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;

Hmm.  Last time around I asked about the oddness of updating the inode
feature flags outside of a transaction, and then never responded. :(
So to quote you from last time:

> The following is the thought process behind upgrading an inode to
> XFS_DIFLAG2_NREXT64 when it is read from the disk:
>
> 1. With support for dynamic upgrade, the extent count limits of an
> inode need to be determined by checking flags present within the
> inode, i.e. we need to satisfy the self-describing metadata property.
> This helps tools like xfs_repair and scrub to verify an inode's extent
> count limits without having to refer to other metadata objects (e.g.
> superblock feature flags).

I think this makes an even /stronger/ argument for why this update
needs to be transactional.

> 2. An upgrade performed inside xfs_trans_log_inode() may cause
> xfs_iext_count_may_overflow() to return -EFBIG when the inode's
> data/attr extent count is already close to 2^31/2^15 respectively.
> Hence none of the file operations will be able to add new extents to
> the file.

Aha, there's the reason why!  You're right, xfs_iext_count_may_overflow
will abort the operation due to !NREXT64 before we even get a chance to
log the inode.
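
To make the failure mode concrete, here is a minimal user-space sketch of
the limit check. The toy_* names are hypothetical stand-ins, not kernel
symbols. The small limits (2^31 - 1 and 2^15 - 1) follow the figures quoted
above; the large limits (2^48 - 1 and 2^32 - 1) are an assumption about what
the wider counters allow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel's fork identifiers (hypothetical names). */
enum toy_fork { TOY_DATA_FORK, TOY_ATTR_FORK };

/* Toy inode: only the fields this sketch needs. */
struct toy_inode {
	bool		nrext64;	/* models XFS_DIFLAG2_NREXT64 */
	uint64_t	nextents[2];	/* per-fork extent counts */
};

/* Largest extent count the fork's counter can hold (assumed limits). */
static uint64_t toy_max_nextents(const struct toy_inode *ip, enum toy_fork f)
{
	if (ip->nrext64)
		return f == TOY_DATA_FORK ? (1ULL << 48) - 1 : (1ULL << 32) - 1;
	return f == TOY_DATA_FORK ? (1ULL << 31) - 1 : (1ULL << 15) - 1;
}

/* Returns 0 if nr_to_add more extents fit in the fork, -EFBIG otherwise. */
static int toy_iext_count_may_overflow(const struct toy_inode *ip,
				       enum toy_fork f, uint64_t nr_to_add)
{
	uint64_t max = toy_max_nextents(ip, f);

	if (ip->nextents[f] > max || nr_to_add > max - ip->nextents[f])
		return -EFBIG;
	return 0;
}
```

With the flag clear and the data fork already at 2^31 - 1 extents, the check
fails before anything is logged; with the flag set, the same request passes.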

I observe, however, that any time we call that function, we also have a
transaction allocated and we hold the ILOCK on the inode being tested.
*Most* of those call sites have also joined the inode to the transaction
already.  I wonder, is that a more appropriate place to be upgrading the
inodes?  Something like:

/*
 * Ensure that the inode has the ability to add the specified number of
 * extents.  Caller must hold ILOCK_EXCL and have joined the inode to
 * the transaction.  Upon return, the inode will still be in this state
 * and the transaction will be clean.
 */
int
xfs_trans_inode_ensure_nextents(
	struct xfs_trans	**tpp,
	struct xfs_inode	*ip,
	int			whichfork,
	int			nr_to_add)
{
	int			error;

	error = xfs_iext_count_may_overflow(ip, whichfork, nr_to_add);
	if (!error)
		return 0;

	/*
	 * Try to upgrade if the extent count fields aren't large
	 * enough.
	 */
	if (!xfs_has_nrext64(ip->i_mount) ||
	    (ip->i_diflags2 & XFS_DIFLAG2_NREXT64))
		return error;

	ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);

	error = xfs_trans_roll(tpp);
	if (error)
		return error;

	return xfs_iext_count_may_overflow(ip, whichfork, nr_to_add);
}

and then the current call sites become:

	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write,
			dblocks, rblocks, false, &tp);
	if (error)
		return error;

	error = xfs_trans_inode_ensure_nextents(&tp, ip, XFS_DATA_FORK,
			XFS_IEXT_ADD_NOSPLIT_CNT);
	if (error)
		goto out_cancel;

What do you think about that?
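
For anyone who wants to poke at the control flow outside the kernel, the
following is a rough user-space model of the check, upgrade, roll, re-check
sequence. Everything prefixed demo_ is hypothetical; the transaction is
reduced to a dirty bit and a roll counter, only the data fork is modelled,
and the 2^48 - 1 large limit is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_inode {
	bool		nrext64;	/* in-core copy of the inode flag */
	bool		logged_nrext64;	/* flag value last logged in a transaction */
	uint64_t	nextents;	/* data fork extent count */
};

struct demo_trans {
	bool		dirty;		/* something was logged since the last roll */
	unsigned int	rolls;		/* number of completed rolls */
};

static int demo_may_overflow(const struct demo_inode *ip, uint64_t nr_to_add)
{
	uint64_t max = ip->nrext64 ? (1ULL << 48) - 1 : (1ULL << 31) - 1;

	if (ip->nextents > max || nr_to_add > max - ip->nextents)
		return -EFBIG;
	return 0;
}

/* Record the current flag value in the transaction. */
static void demo_log_inode(struct demo_trans *tp, struct demo_inode *ip)
{
	ip->logged_nrext64 = ip->nrext64;
	tp->dirty = true;
}

/* Commit what has been logged and continue with a clean transaction. */
static void demo_roll(struct demo_trans *tp)
{
	tp->dirty = false;
	tp->rolls++;
}

static int demo_ensure_nextents(struct demo_trans *tp, struct demo_inode *ip,
				bool fs_has_nrext64, uint64_t nr_to_add)
{
	int error = demo_may_overflow(ip, nr_to_add);

	if (!error)
		return 0;

	/* Nothing to upgrade to, or already upgraded: the error stands. */
	if (!fs_has_nrext64 || ip->nrext64)
		return error;

	ip->nrext64 = true;
	demo_log_inode(tp, ip);
	demo_roll(tp);

	return demo_may_overflow(ip, nr_to_add);
}
```

The point of the model is that the flag change is always logged and
committed before the caller goes on to add extents.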

--D

> +
>  	return 0;
>  
>  out_destroy_data_fork:
> -- 
> 2.30.2
>
Chandan Babu R Feb. 7, 2022, 4:55 a.m. UTC | #2
On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
> What do you think about that?
>

I went through all the call sites of xfs_iext_count_may_overflow() and I think
that your suggestion can be implemented.

However, wouldn't the current approach suffice in terms of being functionally
and logically correct? XFS_DIFLAG2_NREXT64 is set when the inode is read from
disk, and the first operation to log changes made to the inode will include
the new value of ip->i_diflags2. Hence we never end up in a situation where a
disk inode has more than 2^31 data fork extents without having the
XFS_DIFLAG2_NREXT64 flag set.

But the approach described above does go against the convention of changing
metadata within a transaction. Hence I will try to implement your suggestion
and include it in the next version of the patchset.
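
The argument about the read-time upgrade can be sketched in user space as
follows. This is only a model under simplifying assumptions: the sketch_*
names are hypothetical, only the data fork is tracked, and "logging the
inode" is reduced to copying the in-core state to the on-disk structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_SMALL_DATA_EXTENTS	((1ULL << 31) - 1)

/* On-disk inode, reduced to the fields discussed here. */
struct sketch_dinode {
	uint8_t		version;
	bool		nrext64;
	uint64_t	nextents;
};

/* In-core inode. */
struct sketch_inode {
	bool		nrext64;
	uint64_t	nextents;
};

/* Read-time upgrade: the flag is set in core only, the disk copy is untouched. */
static void sketch_inode_from_disk(struct sketch_inode *ip,
				   const struct sketch_dinode *from,
				   bool fs_has_nrext64)
{
	ip->nrext64 = from->nrext64;
	ip->nextents = from->nextents;
	if (from->version == 3 && fs_has_nrext64 && !from->nrext64)
		ip->nrext64 = true;
}

/* Writing the inode back carries the flag and the count together. */
static void sketch_inode_to_disk(const struct sketch_inode *ip,
				 struct sketch_dinode *to)
{
	to->nrext64 = ip->nrext64;
	to->nextents = ip->nextents;
}

/* A disk inode without the flag must stay within the small limit. */
static bool sketch_dinode_is_consistent(const struct sketch_dinode *dip)
{
	return dip->nrext64 || dip->nextents <= SKETCH_MAX_SMALL_DATA_EXTENTS;
}
```

The model also shows what the objection is about: between the read and the
first write-back, the in-core flag and the on-disk flag disagree, and no
transaction records that change.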
Darrick J. Wong Feb. 7, 2022, 5:11 p.m. UTC | #3
On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
> 
> I went through all the call sites of xfs_iext_count_may_overflow() and I think
> that your suggestion can be implemented.
> 
> However, wouldn't the current approach suffice in terms of being functionally
> and logically correct? XFS_DIFLAG2_NREXT64 is set when inode is read from the
> disk and the first operation to log the changes made to the inode will make
> sure to include the new value of ip->i_diflags2. Hence we never end up in a
> situation where a disk inode has more than 2^31 data fork extents without
> having XFS_DIFLAG2_NREXT64 flag set.
> 
> But the approach described above does go against the convention of changing
> metadata within a transaction. Hence I will try to implement your suggestion
> and include it in the next version of the patchset.

Ok, that sounds good. :)

--D

> -- 
> chandan
Chandan Babu R Feb. 11, 2022, 12:10 p.m. UTC | #4
On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
> On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
>> 
>> I went through all the call sites of xfs_iext_count_may_overflow() and I think
>> that your suggestion can be implemented.

Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
xfs_symlink().

Just after invoking xfs_iext_count_may_overflow(), we execute the following
steps:

1. Allocate inode chunk.
2. Initialize inode chunk.
3. Insert record into inobt/finobt.
4. Roll the transaction.
5. Allocate ondisk inode.
6. Add directory inode to transaction.
7. Allocate blocks to store symbolic link path name.
8. Log symlink's inode (data fork contains block mappings).
9. Log data blocks containing symbolic link path name.
10. Add name to directory and log directory's blocks.
11. Log directory inode.
12. Commit transaction.

xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
occur before step 1 since xfs_trans_roll would unlock the inode by executing
xfs_inode_item_committing().
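
A toy model of the locking problem, assuming an inode joined with non-zero
lock flags is unlocked when the transaction commits. The mock_* names are
hypothetical and the transaction is reduced to a single joined inode.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_ILOCK_EXCL	0x1u

/* Hypothetical stand-in for an in-core inode and its log item state. */
struct mock_inode {
	bool		ilocked;	/* ILOCK currently held */
	bool		joined;		/* attached to the running transaction */
	unsigned int	lock_flags;	/* locks to drop when the transaction commits */
};

static void mock_ilock(struct mock_inode *ip)
{
	ip->ilocked = true;
}

/* Join a locked inode; non-zero lock_flags hands the unlock to commit. */
static void mock_trans_ijoin(struct mock_inode *ip, unsigned int lock_flags)
{
	assert(ip->ilocked);
	ip->joined = true;
	ip->lock_flags = lock_flags;
}

/* Roll: commit the current transaction and continue in a new one. */
static void mock_trans_roll(struct mock_inode *ip)
{
	if (!ip->joined)
		return;
	if (ip->lock_flags)
		ip->ilocked = false;	/* the committing step drops the lock */
	ip->lock_flags = 0;
	ip->joined = false;		/* not carried over to the new transaction */
}
```

In this model, a directory inode joined with MOCK_ILOCK_EXCL before the roll
comes out of the roll unlocked and no longer joined, which is why the join
cannot simply be moved ahead of the inode chunk allocation.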

xfs_create() has a similar flow.

Hence, I think we should retain the current logic of setting
XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.

Darrick J. Wong Feb. 14, 2022, 5:07 p.m. UTC | #5
On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
> >> 
> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
> >> that your suggestion can be implemented.
> 
> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
> xfs_symlink().
> 
> Just after invoking xfs_iext_count_may_overflow(), we execute the following
> steps,
> 
> 1. Allocate inode chunk
> 2. Initialize inode chunk.
> 3. Insert record into inobt/finobt.
> 4. Roll the transaction.
> 5. Allocate ondisk inode.
> 6. Add directory inode to transaction.
> 7. Allocate blocks to store symbolic link path name.
> 8. Log symlink's inode (data fork contains block mappings).
> 9. Log data blocks containing symbolic link path name.
> 10. Add name to directory and log directory's blocks.
> 11. Log directory inode.
> 12. Commit transaction.
> 
> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
> occur before step 1 since xfs_trans_roll would unlock the inode by executing
> xfs_inode_item_committing().
> 
> xfs_create() has a similar flow.
> 
> Hence, I think we should retain the current logic of setting
> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.

File creation shouldn't ever run into problems with
xfs_iext_count_may_overflow because (a) only symlinks get created with
mapped blocks, and never more than two; and (b) we always set NREXT64
(the inode flag) on new files if NREXT64 (the superblock feature bit) is
enabled, so a newly created file will never require upgrading.

--D

> -- 
> chandan
Chandan Babu R Feb. 15, 2022, 6:48 a.m. UTC | #6
On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
>
> File creation shouldn't ever run into problems with
> xfs_iext_count_may_overflow because (a) only symlinks get created with
> mapped blocks, and never more than two; and (b) we always set NREXT64
> (the inode flag) on new files if NREXT64 (the superblock feature bit) is
> enabled, so a newly created file will never require upgrading.
>

The inode representing the symbolic link being created cannot overflow its
data fork extent count field. However, the inode representing the directory
inside which the symbolic link entry is being created might overflow its data
fork extent count field. Similarly, xfs_create() can cause the data fork extent
count field of the parent directory to overflow.

>> >> 
>> >> However, wouldn't the current approach suffice in terms of being functionally
>> >> and logically correct? XFS_DIFLAG2_NREXT64 is set when inode is read from the
>> >> disk and the first operation to log the changes made to the inode will make
>> >> sure to include the new value of ip->i_diflags2. Hence we never end up in a
>> >> situation where a disk inode has more than 2^31 data fork extents without
>> >> having XFS_DIFLAG2_NREXT64 flag set.
>> >> 
>> >> But the approach described above does go against the convention of changing
>> >> metadata within a transaction. Hence I will try to implement your suggestion
>> >> and include it in the next version of the patchset.
>> >
>> > Ok, that sounds good. :)
>> >
Dave Chinner Feb. 15, 2022, 9:33 a.m. UTC | #7
On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
> >> >> that your suggestion can be implemented.
> >> 
> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
> >> xfs_symlink().
> >> 
> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
> >> steps,
> >> 
> >> 1. Allocate inode chunk
> >> 2. Initialize inode chunk.
> >> 3. Insert record into inobt/finobt.
> >> 4. Roll the transaction.
> >> 5. Allocate ondisk inode.
> >> 6. Add directory inode to transaction.
> >> 7. Allocate blocks to store symbolic link path name.
> >> 8. Log symlink's inode (data fork contains block mappings).
> >> 9. Log data blocks containing symbolic link path name.
> >> 10. Add name to directory and log directory's blocks.
> >> 11. Log directory inode.
> >> 12. Commit transaction.
> >> 
> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
> >> xfs_inode_item_committing().
> >> 
> >> xfs_create() has a similar flow.
> >> 
> >> Hence, I think we should retain the current logic of setting
> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
> >
> > File creation shouldn't ever run into problems with
> > xfs_iext_count_may_overflow because (a) only symlinks get created with
> > mapped blocks, and never more than two; and (b) we always set NREXT64
> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
> > enabled, so a newly created file will never require upgrading.
> 
> The inode representing the symbolic link being created cannot overflow its
> data fork extent count field. However, the inode representing the directory
> inside which the symbolic link entry is being created, might overflow its data
> fork extent count field.

I don't think that can happen. A directory is limited in size to 3
segments of 32GB each. In reality, only the data segment can ever
reach 32GB as both the dabtree and free space segments are just
compact indexes of the contents of the 32GB data segment.

Hence a directory is never likely to reach more than about 40GB of
blocks, which is nowhere near large enough to overflow a 32-bit
extent count field.

Cheers,

Dave.
Chandan Babu R Feb. 15, 2022, 11:33 a.m. UTC | #8
On 15 Feb 2022 at 15:03, Dave Chinner wrote:
> On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
>> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
>> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
>> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
>> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
>> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
>> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
>> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
>> >> >> that your suggestion can be implemented.
>> >> 
>> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
>> >> xfs_symlink().
>> >> 
>> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
>> >> steps,
>> >> 
>> >> 1. Allocate inode chunk
>> >> 2. Initialize inode chunk.
>> >> 3. Insert record into inobt/finobt.
>> >> 4. Roll the transaction.
>> >> 5. Allocate ondisk inode.
>> >> 6. Add directory inode to transaction.
>> >> 7. Allocate blocks to store symbolic link path name.
>> >> 8. Log symlink's inode (data fork contains block mappings).
>> >> 9. Log data blocks containing symbolic link path name.
>> >> 10. Add name to directory and log directory's blocks.
>> >> 11. Log directory inode.
>> >> 12. Commit transaction.
>> >> 
>> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
>> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
>> >> xfs_inode_item_committing().
>> >> 
>> >> xfs_create() has a similar flow.
>> >> 
>> >> Hence, I think we should retain the current logic of setting
>> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
>> >
>> > File creation shouldn't ever run into problems with
>> > xfs_iext_count_may_overflow because (a) only symlinks get created with
>> > mapped blocks, and never more than two; and (b) we always set NREXT64
>> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
>> > enabled, so a newly created file will never require upgrading.
>> 
>> The inode representing the symbolic link being created cannot overflow its
>> data fork extent count field. However, the inode representing the directory
>> inside which the symbolic link entry is being created, might overflow its data
>> fork extent count field.
>
> I dont' think that can happen. A directory is limited in size to 3
> segments of 32GB each. In reality, only the data segment can ever
> reach 32GB as both the dabtree and free space segments are just
> compact indexes of the contents of the 32GB data segment.
>
> Hence a directory is never likely to reach more than about 40GB of
> blocks which is nowhere near large enough to overflowing a 32 bit
> extent count field.

I think you are right.

The maximum file size that can be represented by the data fork extent counter
in the worst case occurs when all extents are 1 block in size and each block
is 1k in size.

With 1k byte sized blocks, a file can reach up to,
1k * (2^31) = 2048 GB

This is much larger than the asymptotic maximum size of a directory i.e.
32GB * 3 = 96GB.
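
The comparison above can be checked with a few lines of C. This is only a
sketch of the arithmetic: the SKETCH_* constants and function names are
hypothetical and are not kernel macros, the 2^31 extent limit and the
3 x 32GB directory layout are taken from this thread, and "GB" is treated
as 2^30 bytes.

```c
#include <stdint.h>

/* Illustrative values taken from the discussion, not kernel definitions. */
#define SKETCH_MAX_EXTENTS_32	(UINT64_C(1) << 31)	/* 2^31 data fork extents */
#define SKETCH_DIR_SEGMENT_GB	UINT64_C(32)		/* size of one directory segment */
#define SKETCH_DIR_SEGMENTS	UINT64_C(3)		/* data, dabtree, free space */

/*
 * Worst case for the counter: every extent maps exactly one block.
 * Returns the file size, in GB, at which the 32-bit counter is exhausted.
 */
static uint64_t
sketch_worst_case_file_gb(uint64_t block_size)
{
	return (SKETCH_MAX_EXTENTS_32 * block_size) >> 30;
}

/* Asymptotic upper bound on the size of a directory, in GB. */
static uint64_t
sketch_max_dir_gb(void)
{
	return SKETCH_DIR_SEGMENT_GB * SKETCH_DIR_SEGMENTS;
}

/* Nonzero if a maximally fragmented directory could exhaust the counter. */
static int
sketch_dir_can_overflow(uint64_t block_size)
{
	return sketch_max_dir_gb() > sketch_worst_case_file_gb(block_size);
}
```

With block_size = 1024 the first helper gives 2048 and the directory bound is
96, so sketch_dir_can_overflow(1024) is zero.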
Chandan Babu R Feb. 15, 2022, 1:16 p.m. UTC | #9
On 15 Feb 2022 at 17:03, Chandan Babu R wrote:
> On 15 Feb 2022 at 15:03, Dave Chinner wrote:
>> On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
>>> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
>>> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
>>> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
>>> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
>>> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
>>> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
>>> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
>>> >> >> that your suggestion can be implemented.
>>> >> 
>>> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
>>> >> xfs_symlink().
>>> >> 
>>> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
>>> >> steps,
>>> >> 
>>> >> 1. Allocate inode chunk
>>> >> 2. Initialize inode chunk.
>>> >> 3. Insert record into inobt/finobt.
>>> >> 4. Roll the transaction.
>>> >> 5. Allocate ondisk inode.
>>> >> 6. Add directory inode to transaction.
>>> >> 7. Allocate blocks to store symbolic link path name.
>>> >> 8. Log symlink's inode (data fork contains block mappings).
>>> >> 9. Log data blocks containing symbolic link path name.
>>> >> 10. Add name to directory and log directory's blocks.
>>> >> 11. Log directory inode.
>>> >> 12. Commit transaction.
>>> >> 
>>> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
>>> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
>>> >> xfs_inode_item_committing().
>>> >> 
>>> >> xfs_create() has a similar flow.
>>> >> 
>>> >> Hence, I think we should retain the current logic of setting
>>> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
>>> >
>>> > File creation shouldn't ever run into problems with
>>> > xfs_iext_count_may_overflow because (a) only symlinks get created with
>>> > mapped blocks, and never more than two; and (b) we always set NREXT64
>>> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
>>> > enabled, so a newly created file will never require upgrading.
>>> 
>>> The inode representing the symbolic link being created cannot overflow its
>>> data fork extent count field. However, the inode representing the directory
>>> inside which the symbolic link entry is being created, might overflow its data
>>> fork extent count field.
>>
>> I dont' think that can happen. A directory is limited in size to 3
>> segments of 32GB each. In reality, only the data segment can ever
>> reach 32GB as both the dabtree and free space segments are just
>> compact indexes of the contents of the 32GB data segment.
>>
>> Hence a directory is never likely to reach more than about 40GB of
>> blocks which is nowhere near large enough to overflowing a 32 bit
>> extent count field.
>
> I think you are right.
>
> The maximum file size that can be represented by the data fork extent counter
> in the worst case occurs when all extents are 1 block in size and each block
> is 1k in size.
>
> With 1k byte sized blocks, a file can reach upto,
> 1k * (2^31) = 2048 GB
>
> This is much larger than the asymptotic maximum size of a directory i.e.
> 32GB * 3 = 96GB.

Also, I think I should remove extent count overflow checks performed in the
following functions,

xfs_create()
xfs_rename()
xfs_link()
xfs_symlink()
xfs_bmap_del_extent_real()

... Since they do not accomplish anything.

Please let me know your views on this.
Darrick J. Wong Feb. 16, 2022, 1:16 a.m. UTC | #10
On Tue, Feb 15, 2022 at 06:46:16PM +0530, Chandan Babu R wrote:
> On 15 Feb 2022 at 17:03, Chandan Babu R wrote:
> > On 15 Feb 2022 at 15:03, Dave Chinner wrote:
> >> On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
> >>> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
> >>> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
> >>> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
> >>> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
> >>> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
> >>> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
> >>> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
> >>> >> >> that your suggestion can be implemented.
> >>> >> 
> >>> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
> >>> >> xfs_symlink().
> >>> >> 
> >>> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
> >>> >> steps,
> >>> >> 
> >>> >> 1. Allocate inode chunk
> >>> >> 2. Initialize inode chunk.
> >>> >> 3. Insert record into inobt/finobt.
> >>> >> 4. Roll the transaction.
> >>> >> 5. Allocate ondisk inode.
> >>> >> 6. Add directory inode to transaction.
> >>> >> 7. Allocate blocks to store symbolic link path name.
> >>> >> 8. Log symlink's inode (data fork contains block mappings).
> >>> >> 9. Log data blocks containing symbolic link path name.
> >>> >> 10. Add name to directory and log directory's blocks.
> >>> >> 11. Log directory inode.
> >>> >> 12. Commit transaction.
> >>> >> 
> >>> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
> >>> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
> >>> >> xfs_inode_item_committing().
> >>> >> 
> >>> >> xfs_create() has a similar flow.
> >>> >> 
> >>> >> Hence, I think we should retain the current logic of setting
> >>> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
> >>> >
> >>> > File creation shouldn't ever run into problems with
> >>> > xfs_iext_count_may_overflow because (a) only symlinks get created with
> >>> > mapped blocks, and never more than two; and (b) we always set NREXT64
> >>> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
> >>> > enabled, so a newly created file will never require upgrading.
> >>> 
> >>> The inode representing the symbolic link being created cannot overflow its
> >>> data fork extent count field. However, the inode representing the directory
> >>> inside which the symbolic link entry is being created, might overflow its data
> >>> fork extent count field.
> >>
> >> I dont' think that can happen. A directory is limited in size to 3
> >> segments of 32GB each. In reality, only the data segment can ever
> >> reach 32GB as both the dabtree and free space segments are just
> >> compact indexes of the contents of the 32GB data segment.
> >>
> >> Hence a directory is never likely to reach more than about 40GB of
> >> blocks which is nowhere near large enough to overflowing a 32 bit
> >> extent count field.
> >
> > I think you are right.
> >
> > The maximum file size that can be represented by the data fork extent counter
> > in the worst case occurs when all extents are 1 block in size and each block
> > is 1k in size.
> >
> > With 1k byte sized blocks, a file can reach upto,
> > 1k * (2^31) = 2048 GB
> >
> > This is much larger than the asymptotic maximum size of a directory i.e.
> > 32GB * 3 = 96GB.

The downside of getting rid of the checks for directories is that we
won't be able to use the error injection knob that limits all forks to
10 extents max to see what happens when that part of directory expansion
fails.  But if it makes it easier to handle nrext64, then that's
probably a good enough reason to forego that.

> Also, I think I should remove extent count overflow checks performed in the
> following functions,
> 
> xfs_create()
> xfs_rename()
> xfs_link()
> xfs_symlink()

Those are probably ok to remove the checks.

> xfs_bmap_del_extent_real()

Not sure about this one, since it actually /can/ result in more extents.

--D

> ... Since they do not accomplish anything.
> 
> Please let me know your views on this.
> 
> -- 
> chandan
Dave Chinner Feb. 16, 2022, 3:59 a.m. UTC | #11
On Tue, Feb 15, 2022 at 05:16:33PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 15, 2022 at 06:46:16PM +0530, Chandan Babu R wrote:
> > On 15 Feb 2022 at 17:03, Chandan Babu R wrote:
> > > On 15 Feb 2022 at 15:03, Dave Chinner wrote:
> > >> On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
> > >>> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
> > >>> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
> > >>> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
> > >>> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
> > >>> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
> > >>> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
> > >>> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
> > >>> >> >> that your suggestion can be implemented.
> > >>> >> 
> > >>> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
> > >>> >> xfs_symlink().
> > >>> >> 
> > >>> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
> > >>> >> steps,
> > >>> >> 
> > >>> >> 1. Allocate inode chunk
> > >>> >> 2. Initialize inode chunk.
> > >>> >> 3. Insert record into inobt/finobt.
> > >>> >> 4. Roll the transaction.
> > >>> >> 5. Allocate ondisk inode.
> > >>> >> 6. Add directory inode to transaction.
> > >>> >> 7. Allocate blocks to store symbolic link path name.
> > >>> >> 8. Log symlink's inode (data fork contains block mappings).
> > >>> >> 9. Log data blocks containing symbolic link path name.
> > >>> >> 10. Add name to directory and log directory's blocks.
> > >>> >> 11. Log directory inode.
> > >>> >> 12. Commit transaction.
> > >>> >> 
> > >>> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
> > >>> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
> > >>> >> xfs_inode_item_committing().
> > >>> >> 
> > >>> >> xfs_create() has a similar flow.
> > >>> >> 
> > >>> >> Hence, I think we should retain the current logic of setting
> > >>> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
> > >>> >
> > >>> > File creation shouldn't ever run into problems with
> > >>> > xfs_iext_count_may_overflow because (a) only symlinks get created with
> > >>> > mapped blocks, and never more than two; and (b) we always set NREXT64
> > >>> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
> > >>> > enabled, so a newly created file will never require upgrading.
> > >>> 
> > >>> The inode representing the symbolic link being created cannot overflow its
> > >>> data fork extent count field. However, the inode representing the directory
> > >>> inside which the symbolic link entry is being created, might overflow its data
> > >>> fork extent count field.
> > >>
> > >> I dont' think that can happen. A directory is limited in size to 3
> > >> segments of 32GB each. In reality, only the data segment can ever
> > >> reach 32GB as both the dabtree and free space segments are just
> > >> compact indexes of the contents of the 32GB data segment.
> > >>
> > >> Hence a directory is never likely to reach more than about 40GB of
> > >> blocks which is nowhere near large enough to overflowing a 32 bit
> > >> extent count field.
> > >
> > > I think you are right.
> > >
> > > The maximum file size that can be represented by the data fork extent counter
> > > in the worst case occurs when all extents are 1 block in size and each block
> > > is 1k in size.
> > >
> > > With 1k byte sized blocks, a file can reach upto,
> > > 1k * (2^31) = 2048 GB
> > >
> > > This is much larger than the asymptotic maximum size of a directory i.e.
> > > 32GB * 3 = 96GB.
> 
> The downside of getting rid of the checks for directories is that we
> won't be able to use the error injection knob that limits all forks to
> 10 extents max to see what happens when that part of directory expansion
> fails.  But if it makes it easier to handle nrext64, then that's
> probably a good enough reason to forego that.

If you want error injection to do that, add explicit error injection
to the directory code.

> > xfs_bmap_del_extent_real()
> 
> Not sure about this one, since it actually /can/ result in more extents.

Yup, unlikely to ever trigger, but still necessary for correctness.
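
To make the "can result in more extents" point concrete, here is a toy model
of how unmapping part of a single extent changes the extent count. It is a
sketch only: struct sketch_extent and sketch_del_extent_delta() are
hypothetical names, and this is not the logic of xfs_bmap_del_extent_real()
itself, just the counting argument behind keeping the check there.

```c
#include <stdint.h>

/* A single mapping: 'len' blocks starting at block 'start'. */
struct sketch_extent {
	uint64_t	start;
	uint64_t	len;
};

/*
 * Return the change in extent count caused by unmapping
 * [del_start, del_start + del_len) from 'ext'.  The caller must pass a
 * non-empty range that lies entirely inside 'ext'.
 */
static int
sketch_del_extent_delta(
	const struct sketch_extent	*ext,
	uint64_t			del_start,
	uint64_t			del_len)
{
	uint64_t	ext_end = ext->start + ext->len;
	uint64_t	del_end = del_start + del_len;
	int		left = del_start > ext->start;	/* blocks survive on the left */
	int		right = del_end < ext_end;	/* blocks survive on the right */

	/* One extent goes away; each surviving piece becomes an extent. */
	return left + right - 1;
}
```

Removing the whole extent gives -1, trimming either end gives 0, and punching
out the middle gives +1, which is the case the overflow check has to cover.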

Cheers,

Dave.
Chandan Babu R Feb. 16, 2022, 12:34 p.m. UTC | #12
On 16 Feb 2022 at 09:29, Dave Chinner wrote:
> On Tue, Feb 15, 2022 at 05:16:33PM -0800, Darrick J. Wong wrote:
>> On Tue, Feb 15, 2022 at 06:46:16PM +0530, Chandan Babu R wrote:
>> > On 15 Feb 2022 at 17:03, Chandan Babu R wrote:
>> > > On 15 Feb 2022 at 15:03, Dave Chinner wrote:
>> > >> On Tue, Feb 15, 2022 at 12:18:50PM +0530, Chandan Babu R wrote:
>> > >>> On 14 Feb 2022 at 22:37, Darrick J. Wong wrote:
>> > >>> > On Fri, Feb 11, 2022 at 05:40:30PM +0530, Chandan Babu R wrote:
>> > >>> >> On 07 Feb 2022 at 22:41, Darrick J. Wong wrote:
>> > >>> >> > On Mon, Feb 07, 2022 at 10:25:19AM +0530, Chandan Babu R wrote:
>> > >>> >> >> On 02 Feb 2022 at 01:31, Darrick J. Wong wrote:
>> > >>> >> >> > On Fri, Jan 21, 2022 at 10:48:54AM +0530, Chandan Babu R wrote:
>> > >>> >> >> I went through all the call sites of xfs_iext_count_may_overflow() and I think
>> > >>> >> >> that your suggestion can be implemented.
>> > >>> >> 
>> > >>> >> Sorry, I missed/overlooked the usage of xfs_iext_count_may_overflow() in
>> > >>> >> xfs_symlink().
>> > >>> >> 
>> > >>> >> Just after invoking xfs_iext_count_may_overflow(), we execute the following
>> > >>> >> steps,
>> > >>> >> 
>> > >>> >> 1. Allocate inode chunk
>> > >>> >> 2. Initialize inode chunk.
>> > >>> >> 3. Insert record into inobt/finobt.
>> > >>> >> 4. Roll the transaction.
>> > >>> >> 5. Allocate ondisk inode.
>> > >>> >> 6. Add directory inode to transaction.
>> > >>> >> 7. Allocate blocks to store symbolic link path name.
>> > >>> >> 8. Log symlink's inode (data fork contains block mappings).
>> > >>> >> 9. Log data blocks containing symbolic link path name.
>> > >>> >> 10. Add name to directory and log directory's blocks.
>> > >>> >> 11. Log directory inode.
>> > >>> >> 12. Commit transaction.
>> > >>> >> 
>> > >>> >> xfs_trans_roll() invoked in step 4 would mean that we cannot move step 6 to
>> > >>> >> occur before step 1 since xfs_trans_roll would unlock the inode by executing
>> > >>> >> xfs_inode_item_committing().
>> > >>> >> 
>> > >>> >> xfs_create() has a similar flow.
>> > >>> >> 
>> > >>> >> Hence, I think we should retain the current logic of setting
>> > >>> >> XFS_DIFLAG2_NREXT64 just after reading the inode from the disk.
>> > >>> >
>> > >>> > File creation shouldn't ever run into problems with
>> > >>> > xfs_iext_count_may_overflow because (a) only symlinks get created with
>> > >>> > mapped blocks, and never more than two; and (b) we always set NREXT64
>> > >>> > (the inode flag) on new files if NREXT64 (the superblock feature bit) is
>> > >>> > enabled, so a newly created file will never require upgrading.
>> > >>> 
>> > >>> The inode representing the symbolic link being created cannot overflow its
>> > >>> data fork extent count field. However, the inode representing the directory
>> > >>> inside which the symbolic link entry is being created, might overflow its data
>> > >>> fork extent count field.
>> > >>
>> > >> I dont' think that can happen. A directory is limited in size to 3
>> > >> segments of 32GB each. In reality, only the data segment can ever
>> > >> reach 32GB as both the dabtree and free space segments are just
>> > >> compact indexes of the contents of the 32GB data segment.
>> > >>
>> > >> Hence a directory is never likely to reach more than about 40GB of
>> > >> blocks which is nowhere near large enough to overflowing a 32 bit
>> > >> extent count field.
>> > >
>> > > I think you are right.
>> > >
>> > > The maximum file size that can be represented by the data fork extent counter
>> > > in the worst case occurs when all extents are 1 block in size and each block
>> > > is 1k in size.
>> > >
>> > > With 1k byte sized blocks, a file can reach upto,
>> > > 1k * (2^31) = 2048 GB
>> > >
>> > > This is much larger than the asymptotic maximum size of a directory i.e.
>> > > 32GB * 3 = 96GB.
>> 
>> The downside of getting rid of the checks for directories is that we
>> won't be able to use the error injection knob that limits all forks to
>> 10 extents max to see what happens when that part of directory expansion
>> fails.  But if it makes it easier to handle nrext64, then that's
>> probably a good enough reason to forego that.
>
> If you want error injection to do that, add explicit error injection
> to the directory code.

The transaction might already be dirty before entering the directory code
(e.g. xfs_dir_createname()). In this case, an error return from
xfs_iext_count_may_overflow() will cause the filesystem to be shut down.
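
The concern can be modelled with a toy transaction, shown below. This is a
simplified sketch under the assumption that cancelling a dirty transaction
shuts the filesystem down while cancelling a clean one does not; all sketch_*
names are hypothetical and none of this is the real transaction code.

```c
#include <stdbool.h>

struct sketch_mount {
	bool			shut_down;	/* filesystem has been shut down */
};

struct sketch_trans {
	struct sketch_mount	*mp;
	bool			dirty;		/* something was logged already */
};

/* Logging any item makes the transaction dirty. */
static void
sketch_trans_log_item(struct sketch_trans *tp)
{
	tp->dirty = true;
}

/*
 * Cancelling a clean transaction is harmless.  Cancelling a dirty one
 * cannot undo the in-memory changes, so the model shuts down instead.
 */
static void
sketch_trans_cancel(struct sketch_trans *tp)
{
	if (tp->dirty)
		tp->mp->shut_down = true;
}
```

An overflow check that fails before anything is logged ends in a harmless
cancel. The same failure after sketch_trans_log_item() has been called ends in
a shutdown, which is why the check has to run while the transaction is still
clean.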

On the other hand, removing calls to xfs_iext_count_may_overflow() from the
previously listed directory functions would mean that the error injection knob
no longer works for directories. This would require us to delete the xfs/533
test.

Leaving the current invocations of xfs_iext_count_may_overflow() in their
respective locations would mean that they are essentially no-ops for functions
which manipulate directories. However, with functions like xfs_symlink() and
xfs_create(), I wouldn't be able to add the inode to the transaction before
invoking xfs_iext_count_may_overflow() because this leads to the inode being
unlocked when rolling the transaction.

Therefore I think we should not change the current code flow w.r.t.
functions associated with directory entry manipulation, i.e.
1. Let xfs_iext_count_may_overflow() continue to be no-op w.r.t directory
   manipulation.
2. Since xfs_iext_count_may_overflow() is a no-op, there is no need to move
   "add inode to transaction" code to occur before invoking
   xfs_iext_count_may_overflow().

>
>> > xfs_bmap_del_extent_real()
>> 
>> Not sure about this one, since it actually /can/ result in more extents.
>
> Yup, unlikely to ever trigger, but still necessary for correctness.
>

Patch

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 2200526bcee0..767189c7c887 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -253,6 +253,12 @@  xfs_inode_from_disk(
 	}
 	if (xfs_is_reflink_inode(ip))
 		xfs_ifork_init_cow(ip);
+
+	if ((from->di_version == 3) &&
+	     xfs_has_nrext64(ip->i_mount) &&
+	     !xfs_dinode_has_nrext64(from))
+		ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+
 	return 0;
 
 out_destroy_data_fork: