Message ID | a351fd3b0397b6fe8d94f11a6744e1655349d687.1725356877.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix race setting file private on concurrent lseek using same fd | expand |
On Tue, Sep 03, 2024 at 10:55:36AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When doing concurrent lseek(2) system calls against the same file > descriptor, using multiple threads belonging to the same process, we have > a short time window where a race happens and can result in a memory leak. > > The race happens like this: > > 1) A program opens a file descriptor for a file and then spawns two > threads (with the pthreads library for example), lets call them > task A and task B; > > 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at > file.c:find_desired_extent() while holding a read lock on the inode; > > 3) At the start of find_desired_extent(), it extracts the file's > private_data pointer into a local variable named 'private', which has > a value of NULL; > > 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode > in shared mode and enters file.c:find_desired_extent(), where it also > extracts file->private_data into its local variable 'private', which > has a NULL value; > > 5) Because it saw a NULL file private, task A allocates a private > structure and assigns to the file structure; > > 6) Task B also saw a NULL file private so it also allocates its own file > private and then assigns it to the same file structure, since both > tasks are using the same file descriptor. > > At this point we leak the private structure allocated by task A. > > Besides the memory leak, there's also the detail that both tasks end up > using the same cached state record in the private structure (struct > btrfs_file_private::llseek_cached_state), which can result in a > use-after-free problem since one task can free it while the other is > still using it (only one task took a reference count on it). Also, sharing > the cached state is not a good idea since it could result in incorrect > results in the future - right now it should not be a problem because it > end ups being used only in extent-io-tree.c:count_range_bits() where we do > range validation before using the cached state. > > Fix this by protecting the private assignment and check of a file while > holding the inode's spinlock and keep track of the task that allocated > the private, so that it's used only by that task in order to prevent > user-after-free issues with the cached state record as well as potentially > using it incorrectly in the future. > > Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek") > Signed-off-by: Filipe Manana <fdmanana@suse.com> We should maybe re-look at our useage of btrfs_file_private and see if we can come up with a better solution for this style of problem. I think the directory readdir stuff does the same sort of thing and we might end up with a similar issue if we have two tasks using the same fd to do readdir. But these are just random other thoughts for future work, for this you can add Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Tue, Sep 10, 2024 at 4:36 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Tue, Sep 03, 2024 at 10:55:36AM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When doing concurrent lseek(2) system calls against the same file > > descriptor, using multiple threads belonging to the same process, we have > > a short time window where a race happens and can result in a memory leak. > > > > The race happens like this: > > > > 1) A program opens a file descriptor for a file and then spawns two > > threads (with the pthreads library for example), lets call them > > task A and task B; > > > > 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at > > file.c:find_desired_extent() while holding a read lock on the inode; > > > > 3) At the start of find_desired_extent(), it extracts the file's > > private_data pointer into a local variable named 'private', which has > > a value of NULL; > > > > 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode > > in shared mode and enters file.c:find_desired_extent(), where it also > > extracts file->private_data into its local variable 'private', which > > has a NULL value; > > > > 5) Because it saw a NULL file private, task A allocates a private > > structure and assigns to the file structure; > > > > 6) Task B also saw a NULL file private so it also allocates its own file > > private and then assigns it to the same file structure, since both > > tasks are using the same file descriptor. > > > > At this point we leak the private structure allocated by task A. > > > > Besides the memory leak, there's also the detail that both tasks end up > > using the same cached state record in the private structure (struct > > btrfs_file_private::llseek_cached_state), which can result in a > > use-after-free problem since one task can free it while the other is > > still using it (only one task took a reference count on it). Also, sharing > > the cached state is not a good idea since it could result in incorrect > > results in the future - right now it should not be a problem because it > > end ups being used only in extent-io-tree.c:count_range_bits() where we do > > range validation before using the cached state. > > > > Fix this by protecting the private assignment and check of a file while > > holding the inode's spinlock and keep track of the task that allocated > > the private, so that it's used only by that task in order to prevent > > user-after-free issues with the cached state record as well as potentially > > using it incorrectly in the future. > > > > Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > We should maybe re-look at our useage of btrfs_file_private and see if we can > come up with a better solution for this style of problem. I think the directory > readdir stuff does the same sort of thing and we might end up with a similar > issue if we have two tasks using the same fd to do readdir. For the directory case, it's different because the private is allocated and set when opening the directory/file, so it's not subject to the same problem. I did look at that. Thanks. > > But these are just random other thoughts for future work, for this you can add > > Reviewed-by: Josef Bacik <josef@toxicpanda.com> > > Thanks, > > Josef
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 90e72031c724..5dbcc2ebf399 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -152,6 +152,7 @@ struct btrfs_inode { * logged_trans), to access/update delalloc_bytes, new_delalloc_bytes, * defrag_bytes, disk_i_size, outstanding_extents, csum_bytes and to * update the VFS' inode number of bytes used. + * Also protects setting struct file::private_data. */ spinlock_t lock; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c8568b1a61c4..18554f34f2d3 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -459,6 +459,8 @@ struct btrfs_file_private { void *filldir_buf; u64 last_index; struct extent_state *llseek_cached_state; + /* Task that allocated this structure. */ + struct task_struct *owner_task; }; static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index c5e36f58eb07..4fb521d91b06 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3485,7 +3485,7 @@ static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence, static loff_t find_desired_extent(struct file *file, loff_t offset, int whence) { struct btrfs_inode *inode = BTRFS_I(file->f_mapping->host); - struct btrfs_file_private *private = file->private_data; + struct btrfs_file_private *private; struct btrfs_fs_info *fs_info = inode->root->fs_info; struct extent_state *cached_state = NULL; struct extent_state **delalloc_cached_state; @@ -3513,7 +3513,19 @@ static loff_t find_desired_extent(struct file *file, loff_t offset, int whence) inode_get_bytes(&inode->vfs_inode) == i_size) return i_size; - if (!private) { + spin_lock(&inode->lock); + private = file->private_data; + spin_unlock(&inode->lock); + + if (private && private->owner_task != current) { + /* + * Not allocated by us, don't use it as its cached state is used + * by the task that allocated it and we don't want neither to + * mess with it nor get incorrect results because it reflects an + * invalid state for the current task. + */ + private = NULL; + } else if (!private) { private = kzalloc(sizeof(*private), GFP_KERNEL); /* * No worries if memory allocation failed. @@ -3521,7 +3533,23 @@ static loff_t find_desired_extent(struct file *file, loff_t offset, int whence) * lseek SEEK_HOLE/DATA calls to a file when there's delalloc, * so everything will still be correct. */ - file->private_data = private; + if (private) { + bool free = false; + + private->owner_task = current; + + spin_lock(&inode->lock); + if (file->private_data) + free = true; + else + file->private_data = private; + spin_unlock(&inode->lock); + + if (free) { + kfree(private); + private = NULL; + } + } } if (private)