Message ID | 20190506154402.20097-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] Btrfs: fix race between ranged fsync and writeback of adjacent ranges | expand |
Hi, [This is an automated email] This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 3.16+ The bot has tested the following trees: v5.0.13, v4.19.40, v4.14.116, v4.9.173, v4.4.179, v3.18.139. v5.0.13: Build OK! v4.19.40: Build OK! v4.14.116: Build OK! v4.9.173: Build OK! v4.4.179: Build OK! v3.18.139: Failed to apply! Possible dependencies: 9dcbeed4d7e1 ("btrfs: fix signed overflows in btrfs_sync_file") How should we proceed with this patch? -- Thanks, Sasha
On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the > inode) that happens to be ranged, which happens during a msync() or writes > for files opened with O_SYNC for example, we can end up with a corrupt log, > due to different file extent items representing ranges that overlap with > each other, or hit some assertion failures. > > When doing a ranged fsync we only flush delalloc and wait for ordered > exents within that range. If while we are logging items from our inode > ordered extents for adjacent ranges complete, we end up in a race that can > make us insert the file extent items that overlap with others we logged > previously and the assertion failures. > > For example, if tree-log.c:copy_items() receives a leaf that has the > following file extents items, all with a length of 4K and therefore there > is an implicit hole in the range 68K to 72K - 1: > > (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... > > It copies them to the log tree. However due to the need to detect implicit > holes, it may release the path, in order to look at the previous leaf to > detect an implicit hole, and then later it will search again in the tree > for the first file extent item key, with the goal of locking again the > leaf (which might have changed due to concurrent changes to other inodes). > > However when it locks again the leaf containing the first key, the key > corresponding to the extent at offset 72K may not be there anymore since > there is an ordered extent for that range that is finishing (that is, > somewhere in the middle of btrfs_finish_ordered_io()), and it just > removed the file extent item but has not yet replaced it with a new file > extent item, so the part of copy_items() that does hole detection will > decide that there is a hole in the range starting from 68K to 76K - 1, > and therefore insert a file extent item to represent that hole, having > a key offset of 68K. After that we now have a log tree with 2 different > extent items that have overlapping ranges: > Well this kind of sucks. I wonder if we should be locking the extent range while we're doing this in order to keep this problem from happening. A fix for another day though Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Tue, May 7, 2019 at 6:44 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the > > inode) that happens to be ranged, which happens during a msync() or writes > > for files opened with O_SYNC for example, we can end up with a corrupt log, > > due to different file extent items representing ranges that overlap with > > each other, or hit some assertion failures. > > > > When doing a ranged fsync we only flush delalloc and wait for ordered > > exents within that range. If while we are logging items from our inode > > ordered extents for adjacent ranges complete, we end up in a race that can > > make us insert the file extent items that overlap with others we logged > > previously and the assertion failures. > > > > For example, if tree-log.c:copy_items() receives a leaf that has the > > following file extents items, all with a length of 4K and therefore there > > is an implicit hole in the range 68K to 72K - 1: > > > > (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... > > > > It copies them to the log tree. However due to the need to detect implicit > > holes, it may release the path, in order to look at the previous leaf to > > detect an implicit hole, and then later it will search again in the tree > > for the first file extent item key, with the goal of locking again the > > leaf (which might have changed due to concurrent changes to other inodes). > > > > However when it locks again the leaf containing the first key, the key > > corresponding to the extent at offset 72K may not be there anymore since > > there is an ordered extent for that range that is finishing (that is, > > somewhere in the middle of btrfs_finish_ordered_io()), and it just > > removed the file extent item but has not yet replaced it with a new file > > extent item, so the part of copy_items() that does hole detection will > > decide that there is a hole in the range starting from 68K to 76K - 1, > > and therefore insert a file extent item to represent that hole, having > > a key offset of 68K. After that we now have a log tree with 2 different > > extent items that have overlapping ranges: > > > > Well this kind of sucks. I wonder if we should be locking the extent range > while we're doing this in order to keep this problem from happening. A fix for > another day though The only reason I have not adopted that, is because it would prevent fsync and readpages() / readpage() from running concurrently (more of a problem when fsync is full ranged). Locking the range would avoid any such eventual performance drop on fsync alone, but it would also allow to drop the inode's dio_sem? Remember it was giving lockdep warnings before and you moved it to btrfs_sync_file() from btrfs_log_changed_extents() sometime ago. However still not enough, as I still get often (random xfstests, brfs/072 and generic/299 for example) similar lockdep warnings due to dio_sem: https://pastebin.com/ar6cLg2Q Thanks. > > Reviewed-by: Josef Bacik <josef@toxicpanda.com> > > Thanks, > > Josef
On Tue, May 07, 2019 at 08:09:02PM +0100, Filipe Manana wrote: > On Tue, May 7, 2019 at 6:44 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote: > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the > > > inode) that happens to be ranged, which happens during a msync() or writes > > > for files opened with O_SYNC for example, we can end up with a corrupt log, > > > due to different file extent items representing ranges that overlap with > > > each other, or hit some assertion failures. > > > > > > When doing a ranged fsync we only flush delalloc and wait for ordered > > > exents within that range. If while we are logging items from our inode > > > ordered extents for adjacent ranges complete, we end up in a race that can > > > make us insert the file extent items that overlap with others we logged > > > previously and the assertion failures. > > > > > > For example, if tree-log.c:copy_items() receives a leaf that has the > > > following file extents items, all with a length of 4K and therefore there > > > is an implicit hole in the range 68K to 72K - 1: > > > > > > (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... > > > > > > It copies them to the log tree. However due to the need to detect implicit > > > holes, it may release the path, in order to look at the previous leaf to > > > detect an implicit hole, and then later it will search again in the tree > > > for the first file extent item key, with the goal of locking again the > > > leaf (which might have changed due to concurrent changes to other inodes). > > > > > > However when it locks again the leaf containing the first key, the key > > > corresponding to the extent at offset 72K may not be there anymore since > > > there is an ordered extent for that range that is finishing (that is, > > > somewhere in the middle of btrfs_finish_ordered_io()), and it just > > > removed the file extent item but has not yet replaced it with a new file > > > extent item, so the part of copy_items() that does hole detection will > > > decide that there is a hole in the range starting from 68K to 76K - 1, > > > and therefore insert a file extent item to represent that hole, having > > > a key offset of 68K. After that we now have a log tree with 2 different > > > extent items that have overlapping ranges: > > > > > > > Well this kind of sucks. I wonder if we should be locking the extent range > > while we're doing this in order to keep this problem from happening. A fix for > > another day though > > The only reason I have not adopted that, is because it would prevent > fsync and readpages() / readpage() from running concurrently (more of > a problem when fsync is full ranged). > Hrm good point. Time to make the extent lock read/write lock! > Locking the range would avoid any such eventual performance drop on > fsync alone, but it would also allow to drop the inode's dio_sem? > Remember it was giving lockdep warnings before and you moved it to > btrfs_sync_file() from btrfs_log_changed_extents() sometime ago. > However still not enough, as I still get often (random xfstests, > brfs/072 and generic/299 for example) similar lockdep warnings due to > dio_sem: > > https://pastebin.com/ar6cLg2Q Hmm that's annoying, but not a real problem because we're just destroying a workqueue that has nothing on it because we raced with somebody else allocating a new workqueue. I think the best thing to do is to export sb_init_dio_done_wq and call it within our direct io before we take all of our locks, that should shut lockdep up. That's kind of shitty though, I'll think about it some more. Thanks, Josef
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 34fe8a58b0e9..db277bde4894 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2059,6 +2059,18 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) u64 len; /* + * If the inode needs a full sync, make sure we use a full range to + * avoid log tree corruption, due to hole detection racing with ordered + * extent completion for adjacent ranges, and assertion failures during + * hole detection. + */ + if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + &BTRFS_I(inode)->runtime_flags)) { + start = 0; + end = LLONG_MAX; + } + + /* * The range length can be represented by u64, we have to do the typecasts * to avoid signed overflow if it's [0, LLONG_MAX] eg. from fsync() */