Message ID | 20190506154351.20047-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: avoid fallback to transaction commit during fsync of files with holes | expand |
On Mon, May 06, 2019 at 04:43:51PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a > file that has holes and has file extent items spanning two or more leafs, > we can end up falling to back to a full transaction commit due to a logic > bug that leads to failure to insert a duplicate file extent item that is > meant to represent a hole between the last file extent item of a leaf and > the first file extent item in the next leaf. The failure (EEXIST error) > leads to a transaction commit (as most errors when logging an inode do). > > For example, we have the two following leafs: > > Leaf N: > > ----------------------------------------------- > | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | > ----------------------------------------------- > The file extent item at the end of leaf N has a length of 4Kb, > representing the file range from 64K to 68K - 1. > > Leaf N + 1: > > ----------------------------------------------- > | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | > ----------------------------------------------- > The file extent item at the first slot of leaf N + 1 has a length of > 4Kb too, representing the file range from 72K to 76K - 1. > > During the full fsync path, when we are at tree-log.c:copy_items() with > leaf N as a parameter, after processing the last file extent item, that > represents the extent at offset 64K, we take a look at the first file > extent item at the next leaf (leaf N + 1), and notice there's a 4K hole > between the two extents, and therefore we insert a file extent item > representing that hole, starting at file offset 68K and ending at offset > 72K - 1. However we don't update the value of *last_extent, which is used > to represent the end offset (plus 1, non-inclusive end) of the last file > extent item inserted in the log, so it stays with a value of 68K and not > with a value of 72K. > > Then, when copy_items() is called for leaf N + 1, because the value of > *last_extent is smaller then the offset of the first extent item in the > leaf (68K < 72K), we look at the last file extent item in the previous > leaf (leaf N) and see it there's a 4K gap between it and our first file > extent item (again, 68K < 72K), so we decide to insert a file extent item > representing the hole, starting at file offset 68K and ending at offset > 72K - 1, this insertion will fail with -EEXIST being returned from > btrfs_insert_file_extent() because we already inserted a file extent item > representing a hole for this offset (68K) in the previous call to > copy_items(), when processing leaf N. > > The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), > which falls back to a full transaction commit. > > Fix this by adjusting *last_extent after inserting a hole when we had to > look at the next leaf. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Mon, May 06, 2019 at 04:43:51PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a > file that has holes and has file extent items spanning two or more leafs, > we can end up falling to back to a full transaction commit due to a logic > bug that leads to failure to insert a duplicate file extent item that is > meant to represent a hole between the last file extent item of a leaf and > the first file extent item in the next leaf. The failure (EEXIST error) > leads to a transaction commit (as most errors when logging an inode do). > > For example, we have the two following leafs: > > Leaf N: > > ----------------------------------------------- > | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | > ----------------------------------------------- > The file extent item at the end of leaf N has a length of 4Kb, > representing the file range from 64K to 68K - 1. > > Leaf N + 1: > > ----------------------------------------------- > | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | > ----------------------------------------------- > The file extent item at the first slot of leaf N + 1 has a length of > 4Kb too, representing the file range from 72K to 76K - 1. > > During the full fsync path, when we are at tree-log.c:copy_items() with > leaf N as a parameter, after processing the last file extent item, that > represents the extent at offset 64K, we take a look at the first file > extent item at the next leaf (leaf N + 1), and notice there's a 4K hole > between the two extents, and therefore we insert a file extent item > representing that hole, starting at file offset 68K and ending at offset > 72K - 1. However we don't update the value of *last_extent, which is used > to represent the end offset (plus 1, non-inclusive end) of the last file > extent item inserted in the log, so it stays with a value of 68K and not > with a value of 72K. > > Then, when copy_items() is called for leaf N + 1, because the value of > *last_extent is smaller then the offset of the first extent item in the > leaf (68K < 72K), we look at the last file extent item in the previous > leaf (leaf N) and see it there's a 4K gap between it and our first file > extent item (again, 68K < 72K), so we decide to insert a file extent item > representing the hole, starting at file offset 68K and ending at offset > 72K - 1, this insertion will fail with -EEXIST being returned from > btrfs_insert_file_extent() because we already inserted a file extent item > representing a hole for this offset (68K) in the previous call to > copy_items(), when processing leaf N. > > The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), > which falls back to a full transaction commit. > > Fix this by adjusting *last_extent after inserting a hole when we had to > look at the next leaf. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Queued for 5.2-rc, thanks.
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index bb7b27a47537..87e3e4e37606 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4169,6 +4169,7 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, *last_extent, 0, 0, len, 0, len, 0, 0, 0); + *last_extent += len; } } }