Message ID | 20240618142112.1315279-2-yi.zhang@huaweicloud.com (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | iomap/xfs: fix stale data exposure when truncating realtime inodes | expand |
On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > When unaligned truncate down a big realtime file, xfs_truncate_page() > only zeros out the tail EOF block, __xfs_bunmapi() should split the tail > written extent and convert the later one that beyond EOF block to > unwritten, but it couldn't work as expected now since the reserved block > is zero in xfs_setattr_size(), this could expose stale data just after > commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write > operation")'. > > If we truncate file that contains a large enough written extent: > > |< rxext >|< rtext >| > ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW > ^ (new EOF) ^ old EOF > > Since we only zeros out the tail of the EOF block, and > xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned > extents, it becomes this state: > > |< rxext >| > ...WWWzWWWWWWWWWWWWW > ^ new EOF > > Then if we do an extending write like this, the blocks in the previous > tail extent becomes stale: > > |< rxext >| > ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW > ^ old EOF ^ append start ^ new EOF > > Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime > inode. This same problem is going to happen with force aligned allocations, right? i.e. it is a result of having a allocation block size larger than one filesystem block? If so, then.... > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> > --- > fs/xfs/xfs_iops.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index ff222827e550..a00dcbc77e12 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -17,6 +17,8 @@ > #include "xfs_da_btree.h" > #include "xfs_attr.h" > #include "xfs_trans.h" > +#include "xfs_trans_space.h" > +#include "xfs_bmap_btree.h" > #include "xfs_trace.h" > #include "xfs_icache.h" > #include "xfs_symlink.h" > @@ -811,6 +813,7 @@ xfs_setattr_size( > struct xfs_trans *tp; > int error; > uint lock_flags = 0; > + uint resblks = 0; > bool did_zeroing = false; > > xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL); > @@ -917,7 +920,17 @@ xfs_setattr_size( > return error; > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + /* > + * For realtime inode with more than one block rtextsize, we need the > + * block reservation for bmap btree block allocations/splits that can > + * happen since it could split the tail written extent and convert the > + * right beyond EOF one to unwritten. > + */ > + if (xfs_inode_has_bigrtalloc(ip)) > + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); .... should this be doing this generic check instead: if (xfs_inode_alloc_unitsize(ip) > 1) resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); -Dave.
On 2024/7/1 9:16, Dave Chinner wrote: > On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> When unaligned truncate down a big realtime file, xfs_truncate_page() >> only zeros out the tail EOF block, __xfs_bunmapi() should split the tail >> written extent and convert the later one that beyond EOF block to >> unwritten, but it couldn't work as expected now since the reserved block >> is zero in xfs_setattr_size(), this could expose stale data just after >> commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write >> operation")'. >> >> If we truncate file that contains a large enough written extent: >> >> |< rxext >|< rtext >| >> ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >> ^ (new EOF) ^ old EOF >> >> Since we only zeros out the tail of the EOF block, and >> xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned >> extents, it becomes this state: >> >> |< rxext >| >> ...WWWzWWWWWWWWWWWWW >> ^ new EOF >> >> Then if we do an extending write like this, the blocks in the previous >> tail extent becomes stale: >> >> |< rxext >| >> ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW >> ^ old EOF ^ append start ^ new EOF >> >> Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime >> inode. > > This same problem is going to happen with force aligned allocations, > right? i.e. it is a result of having a allocation block size larger > than one filesystem block? > Yeah, right. +cc John > If so, then.... > >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >> Reviewed-by: Christoph Hellwig <hch@lst.de> >> --- >> fs/xfs/xfs_iops.c | 15 ++++++++++++++- >> 1 file changed, 14 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c >> index ff222827e550..a00dcbc77e12 100644 >> --- a/fs/xfs/xfs_iops.c >> +++ b/fs/xfs/xfs_iops.c >> @@ -17,6 +17,8 @@ >> #include "xfs_da_btree.h" >> #include "xfs_attr.h" >> #include "xfs_trans.h" >> +#include "xfs_trans_space.h" >> +#include "xfs_bmap_btree.h" >> #include "xfs_trace.h" >> #include "xfs_icache.h" >> #include "xfs_symlink.h" >> @@ -811,6 +813,7 @@ xfs_setattr_size( >> struct xfs_trans *tp; >> int error; >> uint lock_flags = 0; >> + uint resblks = 0; >> bool did_zeroing = false; >> >> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL); >> @@ -917,7 +920,17 @@ xfs_setattr_size( >> return error; >> } >> >> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); >> + /* >> + * For realtime inode with more than one block rtextsize, we need the >> + * block reservation for bmap btree block allocations/splits that can >> + * happen since it could split the tail written extent and convert the >> + * right beyond EOF one to unwritten. >> + */ >> + if (xfs_inode_has_bigrtalloc(ip)) >> + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > .... should this be doing this generic check instead: > > if (xfs_inode_alloc_unitsize(ip) > 1) if (xfs_inode_alloc_unitsize(ip) > i_blocksize(inode)) ? > resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > Yeah, it makes sense to me, but Christoph suggested to think about force aligned allocations later, so I only dealt with the big RT inode case here. I can revise it if John and Christoph don't object. Thanks, Yi.
On Mon, Jul 01, 2024 at 10:26:18AM +0800, Zhang Yi wrote: > On 2024/7/1 9:16, Dave Chinner wrote: > > On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote: > >> @@ -917,7 +920,17 @@ xfs_setattr_size( > >> return error; > >> } > >> > >> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > >> + /* > >> + * For realtime inode with more than one block rtextsize, we need the > >> + * block reservation for bmap btree block allocations/splits that can > >> + * happen since it could split the tail written extent and convert the > >> + * right beyond EOF one to unwritten. > >> + */ > >> + if (xfs_inode_has_bigrtalloc(ip)) > >> + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > > > .... should this be doing this generic check instead: > > > > if (xfs_inode_alloc_unitsize(ip) > 1) > > if (xfs_inode_alloc_unitsize(ip) > i_blocksize(inode)) ? > > > resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > > > Yeah, it makes sense to me, but Christoph suggested to think about force > aligned allocations later, so I only dealt with the big RT inode case here. > I can revise it if John and Christoph don't object. Sorry, but I don't really care what either John or Christoph say on this matter: xfs_inode_has_bigrtalloc() is recently introduced technical debt that should not be propagated further. xfs_inode_has_bigrtalloc() needs to be replaced completely with xfs_inode_alloc_unitsize() and any conditional behaviour needed can be based on the return value from xfs_inode_alloc_unitsize(). That works for everything that has an allocation block size larger than one filesystem block, not just one specific RT case. Don't force John to have fix all these same RT bugs that are being fixed with xfs_inode_has_bigrtalloc() just because forced alignment stuff is not yet merged. Don't make John's life harder than it needs to be to get that stuff merged, and don't waste my time arguing about it: just fix the problem the right way the first time. -Dave.
On Mon, Jul 01, 2024 at 10:35:07PM +1000, Dave Chinner wrote: > Sorry, but I don't really care what either John or Christoph say on > this matter: xfs_inode_has_bigrtalloc() is recently introduced > technical debt that should not be propagated further. So send a patch to fix it. > xfs_inode_has_bigrtalloc() needs to be replaced completely with > xfs_inode_alloc_unitsize() and any conditional behaviour needed can > be based on the return value from xfs_inode_alloc_unitsize(). That > works for everything that has an allocation block size larger than > one filesystem block, not just one specific RT case. Only assuming we actually get these larger alloc sizes. Which right now we don't have, and to be honest I'm not sure they are a good idea. The whole larger alloc size thing has been a massive pain to deal with, and we'll need good argument for furthering that pain.
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index ff222827e550..a00dcbc77e12 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -17,6 +17,8 @@ #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_trans.h" +#include "xfs_trans_space.h" +#include "xfs_bmap_btree.h" #include "xfs_trace.h" #include "xfs_icache.h" #include "xfs_symlink.h" @@ -811,6 +813,7 @@ xfs_setattr_size( struct xfs_trans *tp; int error; uint lock_flags = 0; + uint resblks = 0; bool did_zeroing = false; xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL); @@ -917,7 +920,17 @@ xfs_setattr_size( return error; } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); + /* + * For realtime inode with more than one block rtextsize, we need the + * block reservation for bmap btree block allocations/splits that can + * happen since it could split the tail written extent and convert the + * right beyond EOF one to unwritten. + */ + if (xfs_inode_has_bigrtalloc(ip)) + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); + + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks, + 0, 0, &tp); if (error) return error;