Message ID | 20200131140607.26923-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix race between using extent maps and merging them | expand |
On 1/31/20 9:06 AM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > We have a few cases where we allow an extent map that is in an extent map > tree to be merged with other extents in the tree. Such cases include the > unpinning of an extent after the respective ordered extent completed or > after logging an extent during a fast fsync. This can lead to subtle and > dangerous problems because when doing the merge some other task might be > using the same extent map and as consequence see an inconsistent state of > the extent map - for example sees the new length but has seen the old start > offset. > > With luck this triggers a BUG_ON(), and not some silent bug, such as the > following one in __do_readpage(): > > $ cat -n fs/btrfs/extent_io.c > 3061 static int __do_readpage(struct extent_io_tree *tree, > 3062 struct page *page, > (...) > 3127 em = __get_extent_map(inode, page, pg_offset, cur, > 3128 end - cur + 1, get_extent, em_cached); > 3129 if (IS_ERR_OR_NULL(em)) { > 3130 SetPageError(page); > 3131 unlock_extent(tree, cur, end); > 3132 break; > 3133 } > 3134 extent_offset = cur - em->start; > 3135 BUG_ON(extent_map_end(em) <= cur); > (...) > > Consider the following example scenario, where we end up hitting the > BUG_ON() in __do_readpage(). > > We have an inode with a size of 8Kb and 2 extent maps: > > extent A: file offset 0, length 4Kb, disk_bytenr = X, persisted on disk by > a previous transaction > > extent B: file offset 4Kb, length 4Kb, disk_bytenr = X + 4Kb, not yet > persisted but writeback started for it already. The extent map > is pinned since there's writeback and an ordered extent in > progress, so it can not be merged with extent map A yet > > The following sequence of steps leads to the BUG_ON(): > > 1) The ordered extent for extent B completes, the respective page gets its > writeback bit cleared and the extent map is unpinned, at that point it > is not yet merged with extent map A because it's in the list of modified > extents; > > 2) Due to memory pressure, or some other reason, the mm subsystem releases > the page corresponding to extent B - btrfs_releasepage() is called and > returns 1, meaning the page can be released as it's not dirty, not under > writeback anymore and the extent range is not locked in the inode's > iotree. However the extent map is not released, either because we are > not in a context that allows memory allocations to block or because the > inode's size is smaller than 16Mb - in this case our inode has a size > of 8Kb; > > 3) Task B needs to read extent B and ends up __do_readpage() through the > btrfs_readpage() callback. At __do_readpage() it gets a reference to > extent map B; > > 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B > while holding the write lock on the inode's extent map tree - this > results in try_merge_map() being called and since it's possible to merge > extent map B with extent map A now (the extent map B was removed from > the list of modified extents), the merging begins - it sets extent map > B's start offset to 0 (was 4Kb), but before it increments the map's > length to 8Kb (4kb + 4Kb), task A is at: > > BUG_ON(extent_map_end(em) <= cur); > > The call to extent_map_end() sees the extent map has a start of 0 > and a length still at 4Kb, so it returns 4Kb and 'cur' is 4Kb, so > the BUG_ON() is triggered. > > So it's dangerous to modify an extent map that is in the tree, because some > other task might have got a reference to it before and still using it, and > needs to see a consistent map while using it. Generally this is very rare > since most paths that lookup and use extent maps also have the file range > locked in the inode's iotree. The fsync path is pretty much the only > exception where we don't do it to avoid serialization with concurrent > reads. > > Fix this by not allowing an extent map do be merged if if it's being used > by tasks other then the one attempting to merge the extent map (when the > reference count of the extent map is greater than 2). > > Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp> > Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 > CC: stable@vger.kernel.org # 4.4+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Eesh that's bad, and I went to look at our statistics and we're hitting this like ~20ish times a day. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
Hi, [This is an automated email] This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 4.4+ The bot has tested the following trees: v5.5.1, v5.4.17, v4.19.101, v4.14.169, v4.9.212, v4.4.212. v5.5.1: Build OK! v5.4.17: Build OK! v4.19.101: Build OK! v4.14.169: Build OK! v4.9.212: Build failed! Errors: fs/btrfs/extent_map.c:238:6: error: implicit declaration of function ‘refcount_read’ [-Werror=implicit-function-declaration] v4.4.212: Build failed! Errors: fs/btrfs/extent_map.c:238:6: error: implicit declaration of function ‘refcount_read’ [-Werror=implicit-function-declaration] NOTE: The patch will not be queued to stable trees until it is upstream. How should we proceed with this patch?
On Fri, Jan 31, 2020 at 02:06:07PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > We have a few cases where we allow an extent map that is in an extent map > tree to be merged with other extents in the tree. Such cases include the > unpinning of an extent after the respective ordered extent completed or > after logging an extent during a fast fsync. This can lead to subtle and > dangerous problems because when doing the merge some other task might be > using the same extent map and as consequence see an inconsistent state of > the extent map - for example sees the new length but has seen the old start > offset. > > With luck this triggers a BUG_ON(), and not some silent bug, such as the > following one in __do_readpage(): > > $ cat -n fs/btrfs/extent_io.c > 3061 static int __do_readpage(struct extent_io_tree *tree, > 3062 struct page *page, > (...) > 3127 em = __get_extent_map(inode, page, pg_offset, cur, > 3128 end - cur + 1, get_extent, em_cached); > 3129 if (IS_ERR_OR_NULL(em)) { > 3130 SetPageError(page); > 3131 unlock_extent(tree, cur, end); > 3132 break; > 3133 } > 3134 extent_offset = cur - em->start; > 3135 BUG_ON(extent_map_end(em) <= cur); > (...) > > Consider the following example scenario, where we end up hitting the > BUG_ON() in __do_readpage(). > > We have an inode with a size of 8Kb and 2 extent maps: > > extent A: file offset 0, length 4Kb, disk_bytenr = X, persisted on disk by > a previous transaction > > extent B: file offset 4Kb, length 4Kb, disk_bytenr = X + 4Kb, not yet > persisted but writeback started for it already. The extent map > is pinned since there's writeback and an ordered extent in > progress, so it can not be merged with extent map A yet > > The following sequence of steps leads to the BUG_ON(): > > 1) The ordered extent for extent B completes, the respective page gets its > writeback bit cleared and the extent map is unpinned, at that point it > is not yet merged with extent map A because it's in the list of modified > extents; > > 2) Due to memory pressure, or some other reason, the mm subsystem releases > the page corresponding to extent B - btrfs_releasepage() is called and > returns 1, meaning the page can be released as it's not dirty, not under > writeback anymore and the extent range is not locked in the inode's > iotree. However the extent map is not released, either because we are > not in a context that allows memory allocations to block or because the > inode's size is smaller than 16Mb - in this case our inode has a size > of 8Kb; > > 3) Task B needs to read extent B and ends up __do_readpage() through the > btrfs_readpage() callback. At __do_readpage() it gets a reference to > extent map B; > > 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B > while holding the write lock on the inode's extent map tree - this > results in try_merge_map() being called and since it's possible to merge > extent map B with extent map A now (the extent map B was removed from > the list of modified extents), the merging begins - it sets extent map > B's start offset to 0 (was 4Kb), but before it increments the map's > length to 8Kb (4kb + 4Kb), task A is at: > > BUG_ON(extent_map_end(em) <= cur); > > The call to extent_map_end() sees the extent map has a start of 0 > and a length still at 4Kb, so it returns 4Kb and 'cur' is 4Kb, so > the BUG_ON() is triggered. > > So it's dangerous to modify an extent map that is in the tree, because some > other task might have got a reference to it before and still using it, and > needs to see a consistent map while using it. Generally this is very rare > since most paths that lookup and use extent maps also have the file range > locked in the inode's iotree. The fsync path is pretty much the only > exception where we don't do it to avoid serialization with concurrent > reads. > > Fix this by not allowing an extent map do be merged if if it's being used > by tasks other then the one attempting to merge the extent map (when the > reference count of the extent map is greater than 2). > > Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp> > Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 > CC: stable@vger.kernel.org # 4.4+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 6f417ff68980..bd6229fb2b6f 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -237,6 +237,17 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em) struct extent_map *merge = NULL; struct rb_node *rb; + /* + * We can't modify an extent map that is in the tree and that is being + * used by another task, as it can cause that other task to see it in + * inconsistent state during the merging. We always have 1 reference for + * the tree and 1 for this task (which is unpinning the extent map or + * clearing the logging flag), so anything > 2 means it's being used by + * other tasks too. + */ + if (refcount_read(&em->refs) > 2) + return; + if (em->start != 0) { rb = rb_prev(&em->rb_node); if (rb)