Message ID | 20190930092025.31118-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix memory leak due to concurrent append writes with fiemap | expand |
Hi, [This is an automated email] This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 4.16+ The bot has tested the following trees: v5.3.1, v5.2.17, v4.19.75. v5.3.1: Build OK! v5.2.17: Build OK! v4.19.75: Failed to apply! Possible dependencies: 7073017aeb98 ("btrfs: use offset_in_page instead of open-coding it") NOTE: The patch will not be queued to stable trees until it is upstream. How should we proceed with this patch? -- Thanks, Sasha
On Mon, Sep 30, 2019 at 10:20:25AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we have a buffered write that starts at an offset greater than or > equals to the file's size happening concurrently with a full ranged > fiemap, we can end up leaking an extent state structure. > > Suppose we have a file with a size of 1Mb, and before the buffered write > and fiemap are performed, it has a single extent state in its io tree > representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set. > > The following sequence diagram shows how the memory leak happens if a > fiemap a buffered write, starting at offset 1Mb and with a length of > 4Kb, are performed concurrently. > > CPU 1 CPU 2 > > extent_fiemap() > --> it's a full ranged fiemap > range from 0 to LLONG_MAX - 1 > (9223372036854775807) > > --> locks range in the inode's > io tree > --> after this we have 2 extent > states in the io tree: > --> 1 for range [0, 1Mb[ with > the bits EXTENT_LOCKED and > EXTENT_DELALLOC_BITS set > --> 1 for the range > [1Mb, LLONG_MAX[ with > the EXTENT_LOCKED bit set > > --> start buffered write at offset > 1Mb with a length of 4Kb > > btrfs_file_write_iter() > > btrfs_buffered_write() > --> cached_state is NULL > > lock_and_cleanup_extent_if_need() > --> returns 0 and does not lock > range because it starts > at current i_size / eof > > --> cached_state remains NULL > > btrfs_dirty_pages() > btrfs_set_extent_delalloc() > (...) > __set_extent_bit() > > --> splits extent state for range > [1Mb, LLONG_MAX[ and now we > have 2 extent states: > > --> one for the range > [1Mb, 1Mb + 4Kb[ with > EXTENT_LOCKED set > --> another one for the range > [1Mb + 4Kb, LLONG_MAX[ with > EXTENT_LOCKED set as well > > --> sets EXTENT_DELALLOC on the > extent state for the range > [1Mb, 1Mb + 4Kb[ > --> caches extent state > [1Mb, 1Mb + 4Kb[ into > @cached_state because it has > the bit EXTENT_LOCKED set > > --> btrfs_buffered_write() ends up > with a non-NULL cached_state and > never calls anything to release its > reference on it, resulting in a > memory leak > > Fix this by calling free_extent_state() on cached_state if the range was > not locked by lock_and_cleanup_extent_if_need(). > > The same issue can happen if anything else other than fiemap locks a range > that covers eof and beyond. > > This could be triggered, sporadically, by test case generic/561 from the > fstests suite, which makes duperemove run concurrently with fsstress, and > duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set > the leak is reported in dmesg/syslog when removing the btrfs module with > a message like the following: > > [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1 > > Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak. > > CC: stable@vger.kernel.org # 4.16+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Mon, Sep 30, 2019 at 10:20:25AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we have a buffered write that starts at an offset greater than or > equals to the file's size happening concurrently with a full ranged > fiemap, we can end up leaking an extent state structure. > > Suppose we have a file with a size of 1Mb, and before the buffered write > and fiemap are performed, it has a single extent state in its io tree > representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set. > > The following sequence diagram shows how the memory leak happens if a > fiemap a buffered write, starting at offset 1Mb and with a length of > 4Kb, are performed concurrently. > > CPU 1 CPU 2 > > extent_fiemap() > --> it's a full ranged fiemap > range from 0 to LLONG_MAX - 1 > (9223372036854775807) > > --> locks range in the inode's > io tree > --> after this we have 2 extent > states in the io tree: > --> 1 for range [0, 1Mb[ with > the bits EXTENT_LOCKED and > EXTENT_DELALLOC_BITS set > --> 1 for the range > [1Mb, LLONG_MAX[ with > the EXTENT_LOCKED bit set > > --> start buffered write at offset > 1Mb with a length of 4Kb > > btrfs_file_write_iter() > > btrfs_buffered_write() > --> cached_state is NULL > > lock_and_cleanup_extent_if_need() > --> returns 0 and does not lock > range because it starts > at current i_size / eof > > --> cached_state remains NULL > > btrfs_dirty_pages() > btrfs_set_extent_delalloc() > (...) > __set_extent_bit() > > --> splits extent state for range > [1Mb, LLONG_MAX[ and now we > have 2 extent states: > > --> one for the range > [1Mb, 1Mb + 4Kb[ with > EXTENT_LOCKED set > --> another one for the range > [1Mb + 4Kb, LLONG_MAX[ with > EXTENT_LOCKED set as well > > --> sets EXTENT_DELALLOC on the > extent state for the range > [1Mb, 1Mb + 4Kb[ > --> caches extent state > [1Mb, 1Mb + 4Kb[ into > @cached_state because it has > the bit EXTENT_LOCKED set > > --> btrfs_buffered_write() ends up > with a non-NULL cached_state and > never calls anything to release its > reference on it, resulting in a > memory leak > > Fix this by calling free_extent_state() on cached_state if the range was > not locked by lock_and_cleanup_extent_if_need(). > > The same issue can happen if anything else other than fiemap locks a range > that covers eof and beyond. > > This could be triggered, sporadically, by test case generic/561 from the > fstests suite, which makes duperemove run concurrently with fsstress, and > duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set > the leak is reported in dmesg/syslog when removing the btrfs module with > a message like the following: > > [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1 > > Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak. > > CC: stable@vger.kernel.org # 4.16+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Nice one, added to 5.4 queue, thanks.
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 35c3499a425a..3d151f788177 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1591,7 +1591,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_root *root = BTRFS_I(inode)->root; struct page **pages = NULL; - struct extent_state *cached_state = NULL; struct extent_changeset *data_reserved = NULL; u64 release_bytes = 0; u64 lockstart; @@ -1611,6 +1610,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, return -ENOMEM; while (iov_iter_count(i) > 0) { + struct extent_state *cached_state = NULL; size_t offset = offset_in_page(pos); size_t sector_offset; size_t write_bytes = min(iov_iter_count(i), @@ -1758,9 +1758,20 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, if (copied > 0) ret = btrfs_dirty_pages(inode, pages, dirty_pages, pos, copied, &cached_state); + + /* + * If we have not locked the extent range, because the range's + * start offset is >= i_size, we might still have a non-NULL + * cached extent state, acquired while marking the extent range + * as delalloc through btrfs_dirty_pages(). Therefore free any + * possible cached extent state to avoid a memory leak. + */ if (extents_locked) unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, &cached_state); + else + free_extent_state(cached_state); + btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); if (ret) {