diff mbox series

btrfs: fix race setting file private on concurrent lseek using same fd

Message ID a351fd3b0397b6fe8d94f11a6744e1655349d687.1725356877.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix race setting file private on concurrent lseek using same fd | expand

Commit Message

Filipe Manana Sept. 3, 2024, 9:55 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two
   threads (with the pthreads library for example), lets call them
   task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
   file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's
   private_data pointer into a local variable named 'private', which has
   a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
   in shared mode and enters file.c:find_desired_extent(), where it also
   extracts file->private_data into its local variable 'private', which
   has a NULL value;

5) Because it saw a NULL file private, task A allocates a private
   structure and assigns to the file structure;

6) Task B also saw a NULL file private so it also allocates its own file
   private and then assigns it to the same file structure, since both
   tasks are using the same file descriptor.

   At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
end ups being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.

Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
user-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.

Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/file.c        | 34 +++++++++++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 3 deletions(-)

Comments

Josef Bacik Sept. 10, 2024, 3:35 p.m. UTC | #1
On Tue, Sep 03, 2024 at 10:55:36AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When doing concurrent lseek(2) system calls against the same file
> descriptor, using multiple threads belonging to the same process, we have
> a short time window where a race happens and can result in a memory leak.
> 
> The race happens like this:
> 
> 1) A program opens a file descriptor for a file and then spawns two
>    threads (with the pthreads library for example), lets call them
>    task A and task B;
> 
> 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
>    file.c:find_desired_extent() while holding a read lock on the inode;
> 
> 3) At the start of find_desired_extent(), it extracts the file's
>    private_data pointer into a local variable named 'private', which has
>    a value of NULL;
> 
> 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
>    in shared mode and enters file.c:find_desired_extent(), where it also
>    extracts file->private_data into its local variable 'private', which
>    has a NULL value;
> 
> 5) Because it saw a NULL file private, task A allocates a private
>    structure and assigns to the file structure;
> 
> 6) Task B also saw a NULL file private so it also allocates its own file
>    private and then assigns it to the same file structure, since both
>    tasks are using the same file descriptor.
> 
>    At this point we leak the private structure allocated by task A.
> 
> Besides the memory leak, there's also the detail that both tasks end up
> using the same cached state record in the private structure (struct
> btrfs_file_private::llseek_cached_state), which can result in a
> use-after-free problem since one task can free it while the other is
> still using it (only one task took a reference count on it). Also, sharing
> the cached state is not a good idea since it could result in incorrect
> results in the future - right now it should not be a problem because it
> end ups being used only in extent-io-tree.c:count_range_bits() where we do
> range validation before using the cached state.
> 
> Fix this by protecting the private assignment and check of a file while
> holding the inode's spinlock and keep track of the task that allocated
> the private, so that it's used only by that task in order to prevent
> user-after-free issues with the cached state record as well as potentially
> using it incorrectly in the future.
> 
> Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

We should maybe re-look at our useage of btrfs_file_private and see if we can
come up with a better solution for this style of problem.  I think the directory
readdir stuff does the same sort of thing and we might end up with a similar
issue if we have two tasks using the same fd to do readdir.

But these are just random other thoughts for future work, for this you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Filipe Manana Sept. 10, 2024, 3:38 p.m. UTC | #2
On Tue, Sep 10, 2024 at 4:36 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 03, 2024 at 10:55:36AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When doing concurrent lseek(2) system calls against the same file
> > descriptor, using multiple threads belonging to the same process, we have
> > a short time window where a race happens and can result in a memory leak.
> >
> > The race happens like this:
> >
> > 1) A program opens a file descriptor for a file and then spawns two
> >    threads (with the pthreads library for example), lets call them
> >    task A and task B;
> >
> > 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
> >    file.c:find_desired_extent() while holding a read lock on the inode;
> >
> > 3) At the start of find_desired_extent(), it extracts the file's
> >    private_data pointer into a local variable named 'private', which has
> >    a value of NULL;
> >
> > 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
> >    in shared mode and enters file.c:find_desired_extent(), where it also
> >    extracts file->private_data into its local variable 'private', which
> >    has a NULL value;
> >
> > 5) Because it saw a NULL file private, task A allocates a private
> >    structure and assigns to the file structure;
> >
> > 6) Task B also saw a NULL file private so it also allocates its own file
> >    private and then assigns it to the same file structure, since both
> >    tasks are using the same file descriptor.
> >
> >    At this point we leak the private structure allocated by task A.
> >
> > Besides the memory leak, there's also the detail that both tasks end up
> > using the same cached state record in the private structure (struct
> > btrfs_file_private::llseek_cached_state), which can result in a
> > use-after-free problem since one task can free it while the other is
> > still using it (only one task took a reference count on it). Also, sharing
> > the cached state is not a good idea since it could result in incorrect
> > results in the future - right now it should not be a problem because it
> > end ups being used only in extent-io-tree.c:count_range_bits() where we do
> > range validation before using the cached state.
> >
> > Fix this by protecting the private assignment and check of a file while
> > holding the inode's spinlock and keep track of the task that allocated
> > the private, so that it's used only by that task in order to prevent
> > user-after-free issues with the cached state record as well as potentially
> > using it incorrectly in the future.
> >
> > Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> We should maybe re-look at our useage of btrfs_file_private and see if we can
> come up with a better solution for this style of problem.  I think the directory
> readdir stuff does the same sort of thing and we might end up with a similar
> issue if we have two tasks using the same fd to do readdir.

For the directory case, it's different because the private is
allocated and set when opening the directory/file, so it's not subject
to the same problem.
I did look at that.

Thanks.

>
> But these are just random other thoughts for future work, for this you can add
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef
diff mbox series

Patch

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 90e72031c724..5dbcc2ebf399 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -152,6 +152,7 @@  struct btrfs_inode {
 	 * logged_trans), to access/update delalloc_bytes, new_delalloc_bytes,
 	 * defrag_bytes, disk_i_size, outstanding_extents, csum_bytes and to
 	 * update the VFS' inode number of bytes used.
+	 * Also protects setting struct file::private_data.
 	 */
 	spinlock_t lock;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c8568b1a61c4..18554f34f2d3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -459,6 +459,8 @@  struct btrfs_file_private {
 	void *filldir_buf;
 	u64 last_index;
 	struct extent_state *llseek_cached_state;
+	/* Task that allocated this structure. */
+	struct task_struct *owner_task;
 };
 
 static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c5e36f58eb07..4fb521d91b06 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3485,7 +3485,7 @@  static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
 static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
 {
 	struct btrfs_inode *inode = BTRFS_I(file->f_mapping->host);
-	struct btrfs_file_private *private = file->private_data;
+	struct btrfs_file_private *private;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_state *cached_state = NULL;
 	struct extent_state **delalloc_cached_state;
@@ -3513,7 +3513,19 @@  static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
 	    inode_get_bytes(&inode->vfs_inode) == i_size)
 		return i_size;
 
-	if (!private) {
+	spin_lock(&inode->lock);
+	private = file->private_data;
+	spin_unlock(&inode->lock);
+
+	if (private && private->owner_task != current) {
+		/*
+		 * Not allocated by us, don't use it as its cached state is used
+		 * by the task that allocated it and we don't want neither to
+		 * mess with it nor get incorrect results because it reflects an
+		 * invalid state for the current task.
+		 */
+		private = NULL;
+	} else if (!private) {
 		private = kzalloc(sizeof(*private), GFP_KERNEL);
 		/*
 		 * No worries if memory allocation failed.
@@ -3521,7 +3533,23 @@  static loff_t find_desired_extent(struct file *file, loff_t offset, int whence)
 		 * lseek SEEK_HOLE/DATA calls to a file when there's delalloc,
 		 * so everything will still be correct.
 		 */
-		file->private_data = private;
+		if (private) {
+			bool free = false;
+
+			private->owner_task = current;
+
+			spin_lock(&inode->lock);
+			if (file->private_data)
+				free = true;
+			else
+				file->private_data = private;
+			spin_unlock(&inode->lock);
+
+			if (free) {
+				kfree(private);
+				private = NULL;
+			}
+		}
 	}
 
 	if (private)