Message ID | 20240311122255.2637311-3-yi.zhang@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | xfs/iomap: fix non-atomic clone operation and don't update size when zeroing range post eof | expand |
On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > Current clone operation could be non-atomic if the destination of a file > is beyond EOF, user could get a file with corrupted (zeroed) data on > crash. > > The problem is about to pre-alloctions. If you write some data into a > file [A, B) (the position letters are increased one by one), and xfs > could pre-allocate some blocks, then we get a delayed extent [A, D). > Then the writeback path allocate blocks and convert this delayed extent > [A, C) since lack of enough contiguous physical blocks, so the extent > [C, D) is still delayed. After that, both the in-memory and the on-disk > file size are B. If we clone file range into [E, F) from another file, > xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the > range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed > extent, it will be zeroed and the file's in-memory && on-disk size will > be updated to D after flushing and before doing the clone operation. > This is wrong, because user can user can see the size change and read > zeros in the middle of the clone operation. > > We need to keep the in-memory and on-disk size before the clone > operation starts, so instead of writing zeroes through the page cache > for delayed ranges beyond EOF, we convert these ranges to unwritten and > invalidating any cached data over that range beyond EOF. > > Suggested-by: Dave Chinner <david@fromorbit.com> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > --- > fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index ccf83e72d8ca..2b2aace25355 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( > struct xfs_mount *mp = ip->i_mount; > xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); > + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); > struct xfs_bmbt_irec imap, cmap; > struct xfs_iext_cursor icur, ccur; > xfs_fsblock_t prealloc_blocks = 0; > @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( > } > > if (imap.br_startoff <= offset_fsb) { > + /* > + * For zeroing out delayed allocation extent, we trim it if > + * it's partial beyonds EOF block, or convert it to unwritten > + * extent if it's all beyonds EOF block. > + */ > + if ((flags & IOMAP_ZERO) && > + isnullstartblock(imap.br_startblock)) { > + if (offset_fsb > eof_fsb) > + goto convert_delay; > + if (end_fsb > eof_fsb) { > + end_fsb = eof_fsb + 1; > + xfs_trim_extent(&imap, offset_fsb, > + end_fsb - offset_fsb); > + } > + } > + > /* > * For reflink files we may need a delalloc reservation when > * overwriting shared extents. This includes zeroing of > @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( > xfs_iunlock(ip, lockmode); > return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); > > +convert_delay: > + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); > + xfs_iunlock(ip, lockmode); > + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); > + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, > + flags, &imap, &seq); I expected this to be a direct call to xfs_bmapi_convert_delalloc. What was the reason not for using that? --D > + if (error) > + return error; > + > + trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap); > + return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq); > + > found_cow: > seq = xfs_iomap_inode_sequence(ip, 0); > if (imap.br_startoff <= offset_fsb) { > -- > 2.39.2 > >
On Mon, Mar 11, 2024 at 08:37:37AM -0700, Darrick J. Wong wrote: > > +convert_delay: > > + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); > > + xfs_iunlock(ip, lockmode); > > + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); > > + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, > > + flags, &imap, &seq); > > I expected this to be a direct call to xfs_bmapi_convert_delalloc. > What was the reason not for using that? Same here. The fact that we even convert delalloc reservations in xfs_iomap_write_direct is something that doesn't make much sense given that we're punching out delalloc reservations before starting direct I/O.
On 2024/3/11 23:37, Darrick J. Wong wrote: > On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> Current clone operation could be non-atomic if the destination of a file >> is beyond EOF, user could get a file with corrupted (zeroed) data on >> crash. >> >> The problem is about to pre-alloctions. If you write some data into a >> file [A, B) (the position letters are increased one by one), and xfs >> could pre-allocate some blocks, then we get a delayed extent [A, D). >> Then the writeback path allocate blocks and convert this delayed extent >> [A, C) since lack of enough contiguous physical blocks, so the extent >> [C, D) is still delayed. After that, both the in-memory and the on-disk >> file size are B. If we clone file range into [E, F) from another file, >> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the >> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed >> extent, it will be zeroed and the file's in-memory && on-disk size will >> be updated to D after flushing and before doing the clone operation. >> This is wrong, because user can user can see the size change and read >> zeros in the middle of the clone operation. >> >> We need to keep the in-memory and on-disk size before the clone >> operation starts, so instead of writing zeroes through the page cache >> for delayed ranges beyond EOF, we convert these ranges to unwritten and >> invalidating any cached data over that range beyond EOF. >> >> Suggested-by: Dave Chinner <david@fromorbit.com> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >> --- >> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ >> 1 file changed, 29 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> index ccf83e72d8ca..2b2aace25355 100644 >> --- a/fs/xfs/xfs_iomap.c >> +++ b/fs/xfs/xfs_iomap.c >> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( >> struct xfs_mount *mp = ip->i_mount; >> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); >> xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); >> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); >> struct xfs_bmbt_irec imap, cmap; >> struct xfs_iext_cursor icur, ccur; >> xfs_fsblock_t prealloc_blocks = 0; >> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( >> } >> >> if (imap.br_startoff <= offset_fsb) { >> + /* >> + * For zeroing out delayed allocation extent, we trim it if >> + * it's partial beyonds EOF block, or convert it to unwritten >> + * extent if it's all beyonds EOF block. >> + */ >> + if ((flags & IOMAP_ZERO) && >> + isnullstartblock(imap.br_startblock)) { >> + if (offset_fsb > eof_fsb) >> + goto convert_delay; >> + if (end_fsb > eof_fsb) { >> + end_fsb = eof_fsb + 1; >> + xfs_trim_extent(&imap, offset_fsb, >> + end_fsb - offset_fsb); >> + } >> + } >> + >> /* >> * For reflink files we may need a delalloc reservation when >> * overwriting shared extents. This includes zeroing of >> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( >> xfs_iunlock(ip, lockmode); >> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); >> >> +convert_delay: >> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); >> + xfs_iunlock(ip, lockmode); >> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); >> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, >> + flags, &imap, &seq); > > I expected this to be a direct call to xfs_bmapi_convert_delalloc. > What was the reason not for using that? > It's because xfs_bmapi_convert_delalloc() isn't guarantee to convert enough blocks once a time, it may convert insufficient blocks since lack of enough contiguous free physical blocks. If we are going to use it, I suppose we need to introduce a new helper something like xfs_convert_blocks(), add a loop to do the conversion. xfs_iomap_write_direct() has done all the work of converting, but the name of this function is non-obviousness than xfs_bmapi_convert_delalloc(), I can change to use it if you think xfs_bmapi_convert_delalloc() is better. :) Thanks, Yi. > --D > >> + if (error) >> + return error; >> + >> + trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap); >> + return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq); >> + >> found_cow: >> seq = xfs_iomap_inode_sequence(ip, 0); >> if (imap.br_startoff <= offset_fsb) { >> -- >> 2.39.2 >> >>
On 2024/3/12 20:21, Christoph Hellwig wrote: > On Mon, Mar 11, 2024 at 08:37:37AM -0700, Darrick J. Wong wrote: >>> +convert_delay: >>> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); >>> + xfs_iunlock(ip, lockmode); >>> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); >>> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, >>> + flags, &imap, &seq); >> >> I expected this to be a direct call to xfs_bmapi_convert_delalloc. >> What was the reason not for using that? > > Same here. The fact that we even convert delalloc reservations in > xfs_iomap_write_direct is something that doesn't make much sense > given that we're punching out delalloc reservations before starting > direct I/O. > OK, sure, I will use xfs_bmapi_convert_delalloc() in my next iteration. Thanks, Yi.
On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote: > On 2024/3/11 23:37, Darrick J. Wong wrote: > > On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: > >> From: Zhang Yi <yi.zhang@huawei.com> > >> > >> Current clone operation could be non-atomic if the destination of a file > >> is beyond EOF, user could get a file with corrupted (zeroed) data on > >> crash. > >> > >> The problem is about to pre-alloctions. If you write some data into a > >> file [A, B) (the position letters are increased one by one), and xfs > >> could pre-allocate some blocks, then we get a delayed extent [A, D). > >> Then the writeback path allocate blocks and convert this delayed extent > >> [A, C) since lack of enough contiguous physical blocks, so the extent > >> [C, D) is still delayed. After that, both the in-memory and the on-disk > >> file size are B. If we clone file range into [E, F) from another file, > >> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the > >> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed > >> extent, it will be zeroed and the file's in-memory && on-disk size will > >> be updated to D after flushing and before doing the clone operation. > >> This is wrong, because user can user can see the size change and read > >> zeros in the middle of the clone operation. > >> > >> We need to keep the in-memory and on-disk size before the clone > >> operation starts, so instead of writing zeroes through the page cache > >> for delayed ranges beyond EOF, we convert these ranges to unwritten and > >> invalidating any cached data over that range beyond EOF. > >> > >> Suggested-by: Dave Chinner <david@fromorbit.com> > >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > >> --- > >> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ > >> 1 file changed, 29 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >> index ccf83e72d8ca..2b2aace25355 100644 > >> --- a/fs/xfs/xfs_iomap.c > >> +++ b/fs/xfs/xfs_iomap.c > >> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( > >> struct xfs_mount *mp = ip->i_mount; > >> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > >> xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); > >> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); > >> struct xfs_bmbt_irec imap, cmap; > >> struct xfs_iext_cursor icur, ccur; > >> xfs_fsblock_t prealloc_blocks = 0; > >> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( > >> } > >> > >> if (imap.br_startoff <= offset_fsb) { > >> + /* > >> + * For zeroing out delayed allocation extent, we trim it if > >> + * it's partial beyonds EOF block, or convert it to unwritten > >> + * extent if it's all beyonds EOF block. > >> + */ > >> + if ((flags & IOMAP_ZERO) && > >> + isnullstartblock(imap.br_startblock)) { > >> + if (offset_fsb > eof_fsb) > >> + goto convert_delay; > >> + if (end_fsb > eof_fsb) { > >> + end_fsb = eof_fsb + 1; > >> + xfs_trim_extent(&imap, offset_fsb, > >> + end_fsb - offset_fsb); > >> + } > >> + } > >> + > >> /* > >> * For reflink files we may need a delalloc reservation when > >> * overwriting shared extents. This includes zeroing of > >> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( > >> xfs_iunlock(ip, lockmode); > >> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); > >> > >> +convert_delay: > >> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); > >> + xfs_iunlock(ip, lockmode); > >> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); > >> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, > >> + flags, &imap, &seq); > > > > I expected this to be a direct call to xfs_bmapi_convert_delalloc. > > What was the reason not for using that? > > > > It's because xfs_bmapi_convert_delalloc() isn't guarantee to convert > enough blocks once a time, it may convert insufficient blocks since lack > of enough contiguous free physical blocks. If we are going to use it, I > suppose we need to introduce a new helper something like > xfs_convert_blocks(), add a loop to do the conversion. I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent that xfs_bmapi_allocate (or anyone else) allocated (bma.got). If that mapping is shorter, won't xfs_buffered_write_iomap_begin pass the shortened mapping out to the iomap machinery? In which case that iomap_iter loop will call ->iomap_begin on the unfinished delalloc conversion work? > xfs_iomap_write_direct() has done all the work of converting, but the > name of this function is non-obviousness than xfs_bmapi_convert_delalloc(), > I can change to use it if you think xfs_bmapi_convert_delalloc() is > better. :) Yes. --D > Thanks, > Yi. > > > --D > > > >> + if (error) > >> + return error; > >> + > >> + trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap); > >> + return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq); > >> + > >> found_cow: > >> seq = xfs_iomap_inode_sequence(ip, 0); > >> if (imap.br_startoff <= offset_fsb) { > >> -- > >> 2.39.2 > >> > >> > >
On 2024/3/13 0:21, Darrick J. Wong wrote: > On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote: >> On 2024/3/11 23:37, Darrick J. Wong wrote: >>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: >>>> From: Zhang Yi <yi.zhang@huawei.com> >>>> >>>> Current clone operation could be non-atomic if the destination of a file >>>> is beyond EOF, user could get a file with corrupted (zeroed) data on >>>> crash. >>>> >>>> The problem is about to pre-alloctions. If you write some data into a >>>> file [A, B) (the position letters are increased one by one), and xfs >>>> could pre-allocate some blocks, then we get a delayed extent [A, D). >>>> Then the writeback path allocate blocks and convert this delayed extent >>>> [A, C) since lack of enough contiguous physical blocks, so the extent >>>> [C, D) is still delayed. After that, both the in-memory and the on-disk >>>> file size are B. If we clone file range into [E, F) from another file, >>>> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the >>>> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed >>>> extent, it will be zeroed and the file's in-memory && on-disk size will >>>> be updated to D after flushing and before doing the clone operation. >>>> This is wrong, because user can user can see the size change and read >>>> zeros in the middle of the clone operation. >>>> >>>> We need to keep the in-memory and on-disk size before the clone >>>> operation starts, so instead of writing zeroes through the page cache >>>> for delayed ranges beyond EOF, we convert these ranges to unwritten and >>>> invalidating any cached data over that range beyond EOF. >>>> >>>> Suggested-by: Dave Chinner <david@fromorbit.com> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >>>> --- >>>> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ >>>> 1 file changed, 29 insertions(+) >>>> >>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >>>> index ccf83e72d8ca..2b2aace25355 100644 >>>> --- a/fs/xfs/xfs_iomap.c >>>> +++ b/fs/xfs/xfs_iomap.c >>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( >>>> struct xfs_mount *mp = ip->i_mount; >>>> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); >>>> xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); >>>> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); >>>> struct xfs_bmbt_irec imap, cmap; >>>> struct xfs_iext_cursor icur, ccur; >>>> xfs_fsblock_t prealloc_blocks = 0; >>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( >>>> } >>>> >>>> if (imap.br_startoff <= offset_fsb) { >>>> + /* >>>> + * For zeroing out delayed allocation extent, we trim it if >>>> + * it's partial beyonds EOF block, or convert it to unwritten >>>> + * extent if it's all beyonds EOF block. >>>> + */ >>>> + if ((flags & IOMAP_ZERO) && >>>> + isnullstartblock(imap.br_startblock)) { >>>> + if (offset_fsb > eof_fsb) >>>> + goto convert_delay; >>>> + if (end_fsb > eof_fsb) { >>>> + end_fsb = eof_fsb + 1; >>>> + xfs_trim_extent(&imap, offset_fsb, >>>> + end_fsb - offset_fsb); >>>> + } >>>> + } >>>> + >>>> /* >>>> * For reflink files we may need a delalloc reservation when >>>> * overwriting shared extents. This includes zeroing of >>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( >>>> xfs_iunlock(ip, lockmode); >>>> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); >>>> >>>> +convert_delay: >>>> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); >>>> + xfs_iunlock(ip, lockmode); >>>> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); >>>> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, >>>> + flags, &imap, &seq); >>> >>> I expected this to be a direct call to xfs_bmapi_convert_delalloc. >>> What was the reason not for using that? >>> >> >> It's because xfs_bmapi_convert_delalloc() isn't guarantee to convert >> enough blocks once a time, it may convert insufficient blocks since lack >> of enough contiguous free physical blocks. If we are going to use it, I >> suppose we need to introduce a new helper something like >> xfs_convert_blocks(), add a loop to do the conversion. > > I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent > that xfs_bmapi_allocate (or anyone else) allocated (bma.got). If that > mapping is shorter, won't xfs_buffered_write_iomap_begin pass the > shortened mapping out to the iomap machinery? In which case that > iomap_iter loop will call ->iomap_begin on the unfinished delalloc > conversion work? Yeah, make sense, it works, I forgot this loop in iomap_iter(). Thanks, Yi.
On 2024/3/13 15:07, Zhang Yi wrote: > On 2024/3/13 0:21, Darrick J. Wong wrote: >> On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote: >>> On 2024/3/11 23:37, Darrick J. Wong wrote: >>>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: >>>>> From: Zhang Yi <yi.zhang@huawei.com> >>>>> >>>>> Current clone operation could be non-atomic if the destination of a file >>>>> is beyond EOF, user could get a file with corrupted (zeroed) data on >>>>> crash. >>>>> >>>>> The problem is about to pre-alloctions. If you write some data into a >>>>> file [A, B) (the position letters are increased one by one), and xfs >>>>> could pre-allocate some blocks, then we get a delayed extent [A, D). >>>>> Then the writeback path allocate blocks and convert this delayed extent >>>>> [A, C) since lack of enough contiguous physical blocks, so the extent >>>>> [C, D) is still delayed. After that, both the in-memory and the on-disk >>>>> file size are B. If we clone file range into [E, F) from another file, >>>>> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the >>>>> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed >>>>> extent, it will be zeroed and the file's in-memory && on-disk size will >>>>> be updated to D after flushing and before doing the clone operation. >>>>> This is wrong, because user can user can see the size change and read >>>>> zeros in the middle of the clone operation. >>>>> >>>>> We need to keep the in-memory and on-disk size before the clone >>>>> operation starts, so instead of writing zeroes through the page cache >>>>> for delayed ranges beyond EOF, we convert these ranges to unwritten and >>>>> invalidating any cached data over that range beyond EOF. >>>>> >>>>> Suggested-by: Dave Chinner <david@fromorbit.com> >>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >>>>> --- >>>>> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ >>>>> 1 file changed, 29 insertions(+) >>>>> >>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >>>>> index ccf83e72d8ca..2b2aace25355 100644 >>>>> --- a/fs/xfs/xfs_iomap.c >>>>> +++ b/fs/xfs/xfs_iomap.c >>>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( >>>>> struct xfs_mount *mp = ip->i_mount; >>>>> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); >>>>> xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); >>>>> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); >>>>> struct xfs_bmbt_irec imap, cmap; >>>>> struct xfs_iext_cursor icur, ccur; >>>>> xfs_fsblock_t prealloc_blocks = 0; >>>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( >>>>> } >>>>> >>>>> if (imap.br_startoff <= offset_fsb) { >>>>> + /* >>>>> + * For zeroing out delayed allocation extent, we trim it if >>>>> + * it's partial beyonds EOF block, or convert it to unwritten >>>>> + * extent if it's all beyonds EOF block. >>>>> + */ >>>>> + if ((flags & IOMAP_ZERO) && >>>>> + isnullstartblock(imap.br_startblock)) { >>>>> + if (offset_fsb > eof_fsb) >>>>> + goto convert_delay; >>>>> + if (end_fsb > eof_fsb) { >>>>> + end_fsb = eof_fsb + 1; >>>>> + xfs_trim_extent(&imap, offset_fsb, >>>>> + end_fsb - offset_fsb); >>>>> + } >>>>> + } >>>>> + >>>>> /* >>>>> * For reflink files we may need a delalloc reservation when >>>>> * overwriting shared extents. This includes zeroing of >>>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( >>>>> xfs_iunlock(ip, lockmode); >>>>> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); >>>>> >>>>> +convert_delay: >>>>> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); >>>>> + xfs_iunlock(ip, lockmode); >>>>> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); >>>>> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, >>>>> + flags, &imap, &seq); >>>> >>>> I expected this to be a direct call to xfs_bmapi_convert_delalloc. >>>> What was the reason not for using that? >>>> >>> >>> It's because xfs_bmapi_convert_delalloc() isn't guarantee to convert >>> enough blocks once a time, it may convert insufficient blocks since lack >>> of enough contiguous free physical blocks. If we are going to use it, I >>> suppose we need to introduce a new helper something like >>> xfs_convert_blocks(), add a loop to do the conversion. >> >> I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent >> that xfs_bmapi_allocate (or anyone else) allocated (bma.got). If that >> mapping is shorter, won't xfs_buffered_write_iomap_begin pass the >> shortened mapping out to the iomap machinery? In which case that >> iomap_iter loop will call ->iomap_begin on the unfinished delalloc >> conversion work? > > Yeah, make sense, it works, I forgot this loop in iomap_iter(). Sorry, I've found that it doesn't always work. Think about a special case, If we have a file below: A B C D +wwwwwwwwww+DDDDDDDDDDD+dddddddddddddddddddd+ EOF EOF (on disk) (in memory) where 'd' is a delalloc block with no data and 'D' is a delalloc block with dirty folios over it. xfs_bmapi_convert_delalloc() might only convert some blocks from B to B', A B B' C D +wwwwwwwwww+UUU+DDDDDDD+dddddddddddddddddddd+ EOF EOF (on disk) (in memory) After that, it will trigger below warning in iomap_iter_done(): WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos); So I guess the loop is still needed, I plane to revise and use xfs_convert_blocks() here. Yi.
On Wed, Mar 13, 2024 at 09:25:49PM +0800, Zhang Yi wrote: > On 2024/3/13 15:07, Zhang Yi wrote: > > On 2024/3/13 0:21, Darrick J. Wong wrote: > >> On Tue, Mar 12, 2024 at 08:31:58PM +0800, Zhang Yi wrote: > >>> On 2024/3/11 23:37, Darrick J. Wong wrote: > >>>> On Mon, Mar 11, 2024 at 08:22:53PM +0800, Zhang Yi wrote: > >>>>> From: Zhang Yi <yi.zhang@huawei.com> > >>>>> > >>>>> Current clone operation could be non-atomic if the destination of a file > >>>>> is beyond EOF, user could get a file with corrupted (zeroed) data on > >>>>> crash. > >>>>> > >>>>> The problem is about to pre-alloctions. If you write some data into a > >>>>> file [A, B) (the position letters are increased one by one), and xfs > >>>>> could pre-allocate some blocks, then we get a delayed extent [A, D). > >>>>> Then the writeback path allocate blocks and convert this delayed extent > >>>>> [A, C) since lack of enough contiguous physical blocks, so the extent > >>>>> [C, D) is still delayed. After that, both the in-memory and the on-disk > >>>>> file size are B. If we clone file range into [E, F) from another file, > >>>>> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the > >>>>> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed > >>>>> extent, it will be zeroed and the file's in-memory && on-disk size will > >>>>> be updated to D after flushing and before doing the clone operation. > >>>>> This is wrong, because user can user can see the size change and read > >>>>> zeros in the middle of the clone operation. > >>>>> > >>>>> We need to keep the in-memory and on-disk size before the clone > >>>>> operation starts, so instead of writing zeroes through the page cache > >>>>> for delayed ranges beyond EOF, we convert these ranges to unwritten and > >>>>> invalidating any cached data over that range beyond EOF. > >>>>> > >>>>> Suggested-by: Dave Chinner <david@fromorbit.com> > >>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > >>>>> --- > >>>>> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++ > >>>>> 1 file changed, 29 insertions(+) > >>>>> > >>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >>>>> index ccf83e72d8ca..2b2aace25355 100644 > >>>>> --- a/fs/xfs/xfs_iomap.c > >>>>> +++ b/fs/xfs/xfs_iomap.c > >>>>> @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( > >>>>> struct xfs_mount *mp = ip->i_mount; > >>>>> xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > >>>>> xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); > >>>>> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); > >>>>> struct xfs_bmbt_irec imap, cmap; > >>>>> struct xfs_iext_cursor icur, ccur; > >>>>> xfs_fsblock_t prealloc_blocks = 0; > >>>>> @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( > >>>>> } > >>>>> > >>>>> if (imap.br_startoff <= offset_fsb) { > >>>>> + /* > >>>>> + * For zeroing out delayed allocation extent, we trim it if > >>>>> + * it's partial beyonds EOF block, or convert it to unwritten > >>>>> + * extent if it's all beyonds EOF block. > >>>>> + */ > >>>>> + if ((flags & IOMAP_ZERO) && > >>>>> + isnullstartblock(imap.br_startblock)) { > >>>>> + if (offset_fsb > eof_fsb) > >>>>> + goto convert_delay; > >>>>> + if (end_fsb > eof_fsb) { > >>>>> + end_fsb = eof_fsb + 1; > >>>>> + xfs_trim_extent(&imap, offset_fsb, > >>>>> + end_fsb - offset_fsb); > >>>>> + } > >>>>> + } > >>>>> + > >>>>> /* > >>>>> * For reflink files we may need a delalloc reservation when > >>>>> * overwriting shared extents. This includes zeroing of > >>>>> @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( > >>>>> xfs_iunlock(ip, lockmode); > >>>>> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); > >>>>> > >>>>> +convert_delay: > >>>>> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); > >>>>> + xfs_iunlock(ip, lockmode); > >>>>> + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); > >>>>> + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, > >>>>> + flags, &imap, &seq); > >>>> > >>>> I expected this to be a direct call to xfs_bmapi_convert_delalloc. > >>>> What was the reason not for using that? > >>>> > >>> > >>> It's because xfs_bmapi_convert_delalloc() isn't guarantee to convert > >>> enough blocks once a time, it may convert insufficient blocks since lack > >>> of enough contiguous free physical blocks. If we are going to use it, I > >>> suppose we need to introduce a new helper something like > >>> xfs_convert_blocks(), add a loop to do the conversion. > >> > >> I thought xfs_bmapi_convert_delalloc passes out (via @iomap) the extent > >> that xfs_bmapi_allocate (or anyone else) allocated (bma.got). If that > >> mapping is shorter, won't xfs_buffered_write_iomap_begin pass the > >> shortened mapping out to the iomap machinery? In which case that > >> iomap_iter loop will call ->iomap_begin on the unfinished delalloc > >> conversion work? > > > > Yeah, make sense, it works, I forgot this loop in iomap_iter(). > > Sorry, I've found that it doesn't always work. Think about a special case, > If we have a file below: > > A B C D > +wwwwwwwwww+DDDDDDDDDDD+dddddddddddddddddddd+ > EOF EOF > (on disk) (in memory) > > where 'd' is a delalloc block with no data and 'D' is a delalloc > block with dirty folios over it. > > xfs_bmapi_convert_delalloc() might only convert some blocks from B to B', > > A B B' C D > +wwwwwwwwww+UUU+DDDDDDD+dddddddddddddddddddd+ > EOF EOF > (on disk) (in memory) > > After that, it will trigger below warning in iomap_iter_done(): > > WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos); > > So I guess the loop is still needed, I plane to revise and use > xfs_convert_blocks() here. Ah, sounds good to me. Though, I wouldn't work too hard to hammer that writeback helper into a write helper. --D > Yi. > >
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index ccf83e72d8ca..2b2aace25355 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -957,6 +957,7 @@ xfs_buffered_write_iomap_begin( struct xfs_mount *mp = ip->i_mount; xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count); + xfs_fileoff_t eof_fsb = XFS_B_TO_FSBT(mp, XFS_ISIZE(ip)); struct xfs_bmbt_irec imap, cmap; struct xfs_iext_cursor icur, ccur; xfs_fsblock_t prealloc_blocks = 0; @@ -1035,6 +1036,22 @@ xfs_buffered_write_iomap_begin( } if (imap.br_startoff <= offset_fsb) { + /* + * For zeroing out delayed allocation extent, we trim it if + * it's partial beyonds EOF block, or convert it to unwritten + * extent if it's all beyonds EOF block. + */ + if ((flags & IOMAP_ZERO) && + isnullstartblock(imap.br_startblock)) { + if (offset_fsb > eof_fsb) + goto convert_delay; + if (end_fsb > eof_fsb) { + end_fsb = eof_fsb + 1; + xfs_trim_extent(&imap, offset_fsb, + end_fsb - offset_fsb); + } + } + /* * For reflink files we may need a delalloc reservation when * overwriting shared extents. This includes zeroing of @@ -1158,6 +1175,18 @@ xfs_buffered_write_iomap_begin( xfs_iunlock(ip, lockmode); return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq); +convert_delay: + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount); + xfs_iunlock(ip, lockmode); + truncate_pagecache_range(inode, offset, XFS_FSB_TO_B(mp, end_fsb)); + error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, + flags, &imap, &seq); + if (error) + return error; + + trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap); + return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq); + found_cow: seq = xfs_iomap_inode_sequence(ip, 0); if (imap.br_startoff <= offset_fsb) {