Message ID | 20210811054505.186828-1-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] btrfs: prevent falloc to mix inline and regular extents | expand |
On Wed, Aug 11, 2021 at 01:45:05PM +0800, Qu Wenruo wrote: > [PROBLEM] > The following script can create a fs with mixed inline and regular > extents: > > mkfs.btrfs -f -s 4k $dev > mount $dev $mnt -o nospace_cache > > xfs_io -f -c "pwrite 0 1k" -c "sync" \ > -c "falloc 0 4k" -c "pwrite 4k 4k" $mnt/file > umount $mnt > > It will result the following file extents layout: > > item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14 > index 2 namelen 4 name: file > item 6 key (257 EXTENT_DATA 0) itemoff 14824 itemsize 1045 > generation 8 type 0 (inline) > inline extent data size 1024 ram_bytes 1024 compression 0 (none) > item 7 key (257 EXTENT_DATA 4096) itemoff 14771 itemsize 53 > generation 8 type 1 (regular) > extent data disk byte 13631488 nr 4096 > extent data offset 0 nr 4096 ram 4096 > extent compression 0 (none) > > Which mixes inline and regular extents. > > Without above falloc operation, we would get proper regular extent only > layout. > > [CAUSE] > Normally we rely on btrfs_truncate_block() to convert the inline extent > to regular extent. > > For example, if without falloc(), at the 2nd pwrite, we will call > btrfs_truncate_block() to zero the tailing part of the inline extent, > then at writeback time, we find the isize is larger than inline > threshold and will not create inline extent, but write back the first > sector as regular extent. > > While in falloc, although we also call btrfs_truncate_block(), it's > called before we enlarge the inode size. > > And we start writeback of the range immediately, with inode size > unchanged. > > This means, we just re-create an inline extent inside btrfs_fallocate(). > > [FIX] > Only call btrfs_truncate_block() after we have updated inode size inside > btrfs_fallocate(). > > By this later writeback will properly writeback the first sector as > regular extent. > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > Reason for RFC: > > There is a long existing discussion on whether we should allow mixing > inline and regular extents. > > I totally understand the idea that mixing such extents won't cause > anything wrong, the existing kernel can handle them pretty well, and no > data corruption or whatever. > > But since it's really not that simple to create such mixed extents > (except the method mentioned above), I really don't believe that's the > expected behavior. > > Thus even it's not a big deal, I still want to prevent such mixed > extents. I agree, it's rather obscure and I don't think it was intentional. That it works is fine and we'll have to keep it like that but not creating the mixed extents should be the default.
On 2021/8/19 下午7:17, David Sterba wrote: > On Wed, Aug 11, 2021 at 01:45:05PM +0800, Qu Wenruo wrote: >> [PROBLEM] >> The following script can create a fs with mixed inline and regular >> extents: >> >> mkfs.btrfs -f -s 4k $dev >> mount $dev $mnt -o nospace_cache >> >> xfs_io -f -c "pwrite 0 1k" -c "sync" \ >> -c "falloc 0 4k" -c "pwrite 4k 4k" $mnt/file >> umount $mnt >> >> It will result the following file extents layout: >> >> item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14 >> index 2 namelen 4 name: file >> item 6 key (257 EXTENT_DATA 0) itemoff 14824 itemsize 1045 >> generation 8 type 0 (inline) >> inline extent data size 1024 ram_bytes 1024 compression 0 (none) >> item 7 key (257 EXTENT_DATA 4096) itemoff 14771 itemsize 53 >> generation 8 type 1 (regular) >> extent data disk byte 13631488 nr 4096 >> extent data offset 0 nr 4096 ram 4096 >> extent compression 0 (none) >> >> Which mixes inline and regular extents. >> >> Without above falloc operation, we would get proper regular extent only >> layout. >> >> [CAUSE] >> Normally we rely on btrfs_truncate_block() to convert the inline extent >> to regular extent. >> >> For example, if without falloc(), at the 2nd pwrite, we will call >> btrfs_truncate_block() to zero the tailing part of the inline extent, >> then at writeback time, we find the isize is larger than inline >> threshold and will not create inline extent, but write back the first >> sector as regular extent. >> >> While in falloc, although we also call btrfs_truncate_block(), it's >> called before we enlarge the inode size. >> >> And we start writeback of the range immediately, with inode size >> unchanged. >> >> This means, we just re-create an inline extent inside btrfs_fallocate(). >> >> [FIX] >> Only call btrfs_truncate_block() after we have updated inode size inside >> btrfs_fallocate(). >> >> By this later writeback will properly writeback the first sector as >> regular extent. >> >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> --- >> Reason for RFC: >> >> There is a long existing discussion on whether we should allow mixing >> inline and regular extents. >> >> I totally understand the idea that mixing such extents won't cause >> anything wrong, the existing kernel can handle them pretty well, and no >> data corruption or whatever. >> >> But since it's really not that simple to create such mixed extents >> (except the method mentioned above), I really don't believe that's the >> expected behavior. >> >> Thus even it's not a big deal, I still want to prevent such mixed >> extents. > > I agree, it's rather obscure and I don't think it was intentional. That > it works is fine and we'll have to keep it like that but not creating > the mixed extents should be the default. > BTW, the (at least current) root cause is the delayed isize update in falloc. Unlike regular write/truncate, for falloc we don't fill the falloc range with holes, nor enlarge isize. Thus it's super easy to cause the mentioned problem, in falloc. But if we enlarge the inode first, then all later falloc will need to drop the hole extent first, then insert the preallocated extent. I'm not sure how the new way will slow down falloc, but if a little slower falloc is the cost to prevent mixed extents, I'm going to take the cost. Thanks, Qu
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 7ff577005d0f..a09cf02537b4 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3325,6 +3325,7 @@ static long btrfs_fallocate(struct file *file, int mode, struct falloc_range *range; struct falloc_range *tmp; struct list_head reserve_list; + u64 old_isize; u64 cur_offset; u64 last_byte; u64 alloc_start; @@ -3365,6 +3366,7 @@ static long btrfs_fallocate(struct file *file, int mode, } btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); + old_isize = i_size_read(inode); if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { ret = inode_newsize_ok(inode, offset + len); @@ -3373,7 +3375,7 @@ static long btrfs_fallocate(struct file *file, int mode, } /* - * TODO: Move these two operations after we have checked + * TODO: Move this operation after we have checked * accurate reserved space, or fallocate can still fail but * with page truncated or size expanded. * @@ -3384,15 +3386,6 @@ static long btrfs_fallocate(struct file *file, int mode, alloc_start); if (ret) goto out; - } else if (offset + len > inode->i_size) { - /* - * If we are fallocating from the end of the file onward we - * need to zero out the end of the block if i_size lands in the - * middle of a block. - */ - ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0); - if (ret) - goto out; } /* @@ -3515,6 +3508,18 @@ static long btrfs_fallocate(struct file *file, int mode, out_unlock: unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); + + /* + * If we are fallocating from the end of the file onward we + * need to zero out the end of the block if i_size lands in the + * middle of a block. + * + * This needs to be done after isize update, or the truncated sector + * will still be written back as inline extent, resulting mixing + * inline and regular extents. + */ + if (!ret && offset + len > old_isize) + ret = btrfs_truncate_block(BTRFS_I(inode), old_isize, 0, 0); out: btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); /* Let go of our reservation. */
[PROBLEM] The following script can create a fs with mixed inline and regular extents: mkfs.btrfs -f -s 4k $dev mount $dev $mnt -o nospace_cache xfs_io -f -c "pwrite 0 1k" -c "sync" \ -c "falloc 0 4k" -c "pwrite 4k 4k" $mnt/file umount $mnt It will result the following file extents layout: item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14 index 2 namelen 4 name: file item 6 key (257 EXTENT_DATA 0) itemoff 14824 itemsize 1045 generation 8 type 0 (inline) inline extent data size 1024 ram_bytes 1024 compression 0 (none) item 7 key (257 EXTENT_DATA 4096) itemoff 14771 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Which mixes inline and regular extents. Without above falloc operation, we would get proper regular extent only layout. [CAUSE] Normally we rely on btrfs_truncate_block() to convert the inline extent to regular extent. For example, if without falloc(), at the 2nd pwrite, we will call btrfs_truncate_block() to zero the tailing part of the inline extent, then at writeback time, we find the isize is larger than inline threshold and will not create inline extent, but write back the first sector as regular extent. While in falloc, although we also call btrfs_truncate_block(), it's called before we enlarge the inode size. And we start writeback of the range immediately, with inode size unchanged. This means, we just re-create an inline extent inside btrfs_fallocate(). [FIX] Only call btrfs_truncate_block() after we have updated inode size inside btrfs_fallocate(). By this later writeback will properly writeback the first sector as regular extent. Signed-off-by: Qu Wenruo <wqu@suse.com> --- Reason for RFC: There is a long existing discussion on whether we should allow mixing inline and regular extents. I totally understand the idea that mixing such extents won't cause anything wrong, the existing kernel can handle them pretty well, and no data corruption or whatever. But since it's really not that simple to create such mixed extents (except the method mentioned above), I really don't believe that's the expected behavior. Thus even it's not a big deal, I still want to prevent such mixed extents. --- fs/btrfs/file.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-)