diff mbox series

[-next,v6,1/2] xfs: reserve blocks for truncating large realtime inode

Message ID 20240618142112.1315279-2-yi.zhang@huaweicloud.com (mailing list archive)
State Accepted, archived
Headers show
Series iomap/xfs: fix stale data exposure when truncating realtime inodes | expand

Commit Message

Zhang Yi June 18, 2024, 2:21 p.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

When unaligned truncate down a big realtime file, xfs_truncate_page()
only zeros out the tail EOF block, __xfs_bunmapi() should split the tail
written extent and convert the later one that beyond EOF block to
unwritten, but it couldn't work as expected now since the reserved block
is zero in xfs_setattr_size(), this could expose stale data just after
commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write
operation")'.

If we truncate file that contains a large enough written extent:

     |<    rxext    >|<    rtext    >|
  ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
        ^ (new EOF)      ^ old EOF

Since we only zeros out the tail of the EOF block, and
xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned
extents, it becomes this state:

     |<    rxext    >|
  ...WWWzWWWWWWWWWWWWW
        ^ new EOF

Then if we do an extending write like this, the blocks in the previous
tail extent becomes stale:

     |<    rxext    >|
  ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
        ^ old EOF               ^ append start  ^ new EOF

Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime
inode.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iops.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Dave Chinner July 1, 2024, 1:16 a.m. UTC | #1
On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When unaligned truncate down a big realtime file, xfs_truncate_page()
> only zeros out the tail EOF block, __xfs_bunmapi() should split the tail
> written extent and convert the later one that beyond EOF block to
> unwritten, but it couldn't work as expected now since the reserved block
> is zero in xfs_setattr_size(), this could expose stale data just after
> commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write
> operation")'.
> 
> If we truncate file that contains a large enough written extent:
> 
>      |<    rxext    >|<    rtext    >|
>   ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
>         ^ (new EOF)      ^ old EOF
> 
> Since we only zeros out the tail of the EOF block, and
> xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned
> extents, it becomes this state:
> 
>      |<    rxext    >|
>   ...WWWzWWWWWWWWWWWWW
>         ^ new EOF
> 
> Then if we do an extending write like this, the blocks in the previous
> tail extent becomes stale:
> 
>      |<    rxext    >|
>   ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
>         ^ old EOF               ^ append start  ^ new EOF
> 
> Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime
> inode.

This same problem is going to happen with force aligned allocations,
right? i.e. it is a result of having a allocation block size larger
than one filesystem block?

If so, then....

> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_iops.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ff222827e550..a00dcbc77e12 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -17,6 +17,8 @@
>  #include "xfs_da_btree.h"
>  #include "xfs_attr.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_space.h"
> +#include "xfs_bmap_btree.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_symlink.h"
> @@ -811,6 +813,7 @@ xfs_setattr_size(
>  	struct xfs_trans	*tp;
>  	int			error;
>  	uint			lock_flags = 0;
> +	uint			resblks = 0;
>  	bool			did_zeroing = false;
>  
>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
> @@ -917,7 +920,17 @@ xfs_setattr_size(
>  			return error;
>  	}
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	/*
> +	 * For realtime inode with more than one block rtextsize, we need the
> +	 * block reservation for bmap btree block allocations/splits that can
> +	 * happen since it could split the tail written extent and convert the
> +	 * right beyond EOF one to unwritten.
> +	 */
> +	if (xfs_inode_has_bigrtalloc(ip))
> +		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);

.... should this be doing this generic check instead:

	if (xfs_inode_alloc_unitsize(ip) > 1)
		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);

-Dave.
Zhang Yi July 1, 2024, 2:26 a.m. UTC | #2
On 2024/7/1 9:16, Dave Chinner wrote:
> On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When unaligned truncate down a big realtime file, xfs_truncate_page()
>> only zeros out the tail EOF block, __xfs_bunmapi() should split the tail
>> written extent and convert the later one that beyond EOF block to
>> unwritten, but it couldn't work as expected now since the reserved block
>> is zero in xfs_setattr_size(), this could expose stale data just after
>> commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write
>> operation")'.
>>
>> If we truncate file that contains a large enough written extent:
>>
>>      |<    rxext    >|<    rtext    >|
>>   ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
>>         ^ (new EOF)      ^ old EOF
>>
>> Since we only zeros out the tail of the EOF block, and
>> xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned
>> extents, it becomes this state:
>>
>>      |<    rxext    >|
>>   ...WWWzWWWWWWWWWWWWW
>>         ^ new EOF
>>
>> Then if we do an extending write like this, the blocks in the previous
>> tail extent becomes stale:
>>
>>      |<    rxext    >|
>>   ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
>>         ^ old EOF               ^ append start  ^ new EOF
>>
>> Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime
>> inode.
> 
> This same problem is going to happen with force aligned allocations,
> right? i.e. it is a result of having a allocation block size larger
> than one filesystem block?
> 
Yeah, right.

+cc John

> If so, then....
> 
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/xfs/xfs_iops.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index ff222827e550..a00dcbc77e12 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -17,6 +17,8 @@
>>  #include "xfs_da_btree.h"
>>  #include "xfs_attr.h"
>>  #include "xfs_trans.h"
>> +#include "xfs_trans_space.h"
>> +#include "xfs_bmap_btree.h"
>>  #include "xfs_trace.h"
>>  #include "xfs_icache.h"
>>  #include "xfs_symlink.h"
>> @@ -811,6 +813,7 @@ xfs_setattr_size(
>>  	struct xfs_trans	*tp;
>>  	int			error;
>>  	uint			lock_flags = 0;
>> +	uint			resblks = 0;
>>  	bool			did_zeroing = false;
>>  
>>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
>> @@ -917,7 +920,17 @@ xfs_setattr_size(
>>  			return error;
>>  	}
>>  
>> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>> +	/*
>> +	 * For realtime inode with more than one block rtextsize, we need the
>> +	 * block reservation for bmap btree block allocations/splits that can
>> +	 * happen since it could split the tail written extent and convert the
>> +	 * right beyond EOF one to unwritten.
>> +	 */
>> +	if (xfs_inode_has_bigrtalloc(ip))
>> +		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> 
> .... should this be doing this generic check instead:
> 
> 	if (xfs_inode_alloc_unitsize(ip) > 1)

        if (xfs_inode_alloc_unitsize(ip) > i_blocksize(inode)) ?

> 		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> 

Yeah, it makes sense to me, but Christoph suggested to think about force
aligned allocations later, so I only dealt with the big RT inode case here.
I can revise it if John and Christoph don't object.

Thanks,
Yi.
Dave Chinner July 1, 2024, 12:35 p.m. UTC | #3
On Mon, Jul 01, 2024 at 10:26:18AM +0800, Zhang Yi wrote:
> On 2024/7/1 9:16, Dave Chinner wrote:
> > On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote:
> >> @@ -917,7 +920,17 @@ xfs_setattr_size(
> >>  			return error;
> >>  	}
> >>  
> >> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> >> +	/*
> >> +	 * For realtime inode with more than one block rtextsize, we need the
> >> +	 * block reservation for bmap btree block allocations/splits that can
> >> +	 * happen since it could split the tail written extent and convert the
> >> +	 * right beyond EOF one to unwritten.
> >> +	 */
> >> +	if (xfs_inode_has_bigrtalloc(ip))
> >> +		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> > 
> > .... should this be doing this generic check instead:
> > 
> > 	if (xfs_inode_alloc_unitsize(ip) > 1)
> 
>         if (xfs_inode_alloc_unitsize(ip) > i_blocksize(inode)) ?
> 
> > 		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> > 
> 
> Yeah, it makes sense to me, but Christoph suggested to think about force
> aligned allocations later, so I only dealt with the big RT inode case here.
> I can revise it if John and Christoph don't object.

Sorry, but I don't really care what either John or Christoph say on
this matter: xfs_inode_has_bigrtalloc() is recently introduced
technical debt that should not be propagated further.

xfs_inode_has_bigrtalloc() needs to be replaced completely with
xfs_inode_alloc_unitsize() and any conditional behaviour needed can
be based on the return value from xfs_inode_alloc_unitsize(). That
works for everything that has an allocation block size larger than
one filesystem block, not just one specific RT case.

Don't force John to have fix all these same RT bugs that are being
fixed with xfs_inode_has_bigrtalloc() just because forced alignment
stuff is not yet merged. Don't make John's life harder than it needs
to be to get that stuff merged, and don't waste my time arguing
about it: just fix the problem the right way the first time.

-Dave.
Christoph Hellwig July 2, 2024, 7:34 a.m. UTC | #4
On Mon, Jul 01, 2024 at 10:35:07PM +1000, Dave Chinner wrote:
> Sorry, but I don't really care what either John or Christoph say on
> this matter: xfs_inode_has_bigrtalloc() is recently introduced
> technical debt that should not be propagated further.

So send a patch to fix it.

> xfs_inode_has_bigrtalloc() needs to be replaced completely with
> xfs_inode_alloc_unitsize() and any conditional behaviour needed can
> be based on the return value from xfs_inode_alloc_unitsize(). That
> works for everything that has an allocation block size larger than
> one filesystem block, not just one specific RT case.

Only assuming we actually get these larger alloc sizes.  Which right
now we don't have, and to be honest I'm not sure they are a good
idea.  The whole larger alloc size thing has been a massive pain
to deal with, and we'll need good argument for furthering that pain.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..a00dcbc77e12 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -17,6 +17,8 @@ 
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_trans_space.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
@@ -811,6 +813,7 @@  xfs_setattr_size(
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
+	uint			resblks = 0;
 	bool			did_zeroing = false;
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
@@ -917,7 +920,17 @@  xfs_setattr_size(
 			return error;
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	/*
+	 * For realtime inode with more than one block rtextsize, we need the
+	 * block reservation for bmap btree block allocations/splits that can
+	 * happen since it could split the tail written extent and convert the
+	 * right beyond EOF one to unwritten.
+	 */
+	if (xfs_inode_has_bigrtalloc(ip))
+		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, 0, &tp);
 	if (error)
 		return error;