| Message ID | 20240814-dev_replace_rwsem-new-v1-1-c42120994ce6@kernel.org |
| --- | --- |
| State | New, archived |
| Series | btrfs: relax dev_replace rwsem usage on scrub with rst |
On 14.08.24 14:46, Johannes Thumshirn wrote:
> +        if (rst && dev_replace_is_ongoing)
> +                up_read(&dev_replace->rwsem);
>          for (int i = 0; i < io_geom.num_stripes; i++) {
>                  ret = set_io_stripe(fs_info, logical, length,
>                                      &bioc->stripes[i], map, &io_geom);
> +
>                  if (ret < 0)
>                          break;
>                  io_geom.stripe_index++;
>          }

That stray newline was added by accident, I'll remove it if there's a
need for a v2 or when applying.
On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> the RAID stripe-tree, we get the following splat from lockdep:
>
> BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
>
> ============================================
> WARNING: possible recursive locking detected
> 6.11.0-rc3+ #592 Not tainted
> --------------------------------------------
> btrfs/4203 is trying to acquire lock:
> ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> but task is already holding lock:
> ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&fs_info->dev_replace.rwsem);
>   lock(&fs_info->dev_replace.rwsem);
>
>  *** DEADLOCK ***

Is this really the full splat?
There should be a stack trace showing that the problem happens when
btrfs_map_block() is called within the scrub/dev replace code, no?

>
> May be due to missing lock nesting notation
>
> 1 lock held by btrfs/4203:
>  #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> This fixes a deadlock on RAID stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Cc: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>
>
> Signed-off-by: Johannes Thumshirn <jth@kernel.org>
> ---
>  fs/btrfs/volumes.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 4b9b647a7e29..e5bd2bee912d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>          int dev_replace_is_ongoing = 0;
>          u16 num_alloc_stripes;
>          u64 max_len;
> +        bool rst;

The name is a bit confusing, something like "update_rst" is more
meaningful and clearly indicates it's a boolean/condition.

>
>          ASSERT(bioc_ret);
>
> @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>          if (io_geom.mirror_num > num_copies)
>                  return -EINVAL;
>
> +        rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> +
>          map_offset = logical - map->start;
>          io_geom.raid56_full_stripe_start = (u64)-1;
>          max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>           * For all other non-RAID56 profiles, just copy the target
>           * stripe into the bioc.
>           */
> +        if (rst && dev_replace_is_ongoing)
> +                up_read(&dev_replace->rwsem);
>          for (int i = 0; i < io_geom.num_stripes; i++) {
>                  ret = set_io_stripe(fs_info, logical, length,
>                                      &bioc->stripes[i], map, &io_geom);

So, why is this safe? The change log doesn't mention anything about
the chosen fix.

So even if this is called while we are not in the device replace code,
btrfs_need_stripe_tree_update() can return true.
In that case we unlock the device replace semaphore, which can result
in a use-after-free on a device, like this:

1) btrfs_map_block() called while not in the device replace code
callchain, and there's a device replace for device X running in
parallel;

2) btrfs_need_stripe_tree_update() returns true;

3) we unlock device replace semaphore;

4) we call set_io_stripe() which makes bioc point to device X, which
is the old device (the one being replaced);

5) before we read lock the device replace semaphore at
btrfs_map_block(), the device replace finishes and frees device X;

6) the bioc still points to device X... and then it's used for doing IO later

Can't this happen? I don't see why not.
If this is safe we should have an explanation in the changelog about
the details.

Thanks.

> +
>                  if (ret < 0)
>                          break;
>                  io_geom.stripe_index++;
>          }
> +        if (rst && dev_replace_is_ongoing)
> +                down_read(&dev_replace->rwsem);
> +
>          }
>
>          if (ret) {
>
> ---
> base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> change-id: 20240814-dev_replace_rwsem-new-91c160579246
>
> Best regards,
> --
> Johannes Thumshirn <jth@kernel.org>
>
>
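To make that race window easier to follow, here is the proposed hunk once
more with the numbered steps from the scenario marked inline; the comments
are annotations added for illustration, the code itself is exactly what the
patch adds:

        /* Annotated copy of the proposed hunk; comments mark the steps above. */
        if (rst && dev_replace_is_ongoing)
                up_read(&dev_replace->rwsem);           /* step 3: rwsem dropped */
        for (int i = 0; i < io_geom.num_stripes; i++) {
                /* step 4: may store a pointer to the source device X in bioc */
                ret = set_io_stripe(fs_info, logical, length,
                                    &bioc->stripes[i], map, &io_geom);
                if (ret < 0)
                        break;
                io_geom.stripe_index++;
        }
        /*
         * step 5: nothing prevents the replace from finishing and freeing
         * device X right here, before the rwsem is retaken below, ...
         */
        if (rst && dev_replace_is_ongoing)
                down_read(&dev_replace->rwsem);         /* step 6: ...so bioc may hold a stale pointer */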
On Wed, Aug 14, 2024 at 2:31 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@kernel.org> wrote:
> >
> > From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >
> > Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> > the RAID stripe-tree, we get the following splat from lockdep:
> >
> > BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
> >
> > ============================================
> > WARNING: possible recursive locking detected
> > 6.11.0-rc3+ #592 Not tainted
> > --------------------------------------------
> > btrfs/4203 is trying to acquire lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > but task is already holding lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >        ----
> >   lock(&fs_info->dev_replace.rwsem);
> >   lock(&fs_info->dev_replace.rwsem);
> >
> >  *** DEADLOCK ***
>
> Is this really the full splat?
> There should be a stack trace showing that the problem happens when
> btrfs_map_block() is called within the scrub/dev replace code, no?
>
> >
> > May be due to missing lock nesting notation
> >
> > 1 lock held by btrfs/4203:
> >  #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > This fixes a deadlock on RAID stripe-tree where device replace performs a
> > scrub operation, which in turn calls into btrfs_map_block() to find the
> > physical location of the block.
> >
> > Cc: Filipe Manana <fdmanana@suse.com>
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > ---
> >
> >
> > Signed-off-by: Johannes Thumshirn <jth@kernel.org>
> > ---
> >  fs/btrfs/volumes.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 4b9b647a7e29..e5bd2bee912d 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >          int dev_replace_is_ongoing = 0;
> >          u16 num_alloc_stripes;
> >          u64 max_len;
> > +        bool rst;
>
> The name is a bit confusing, something like "update_rst" is more
> meaningful and clearly indicates it's a boolean/condition.
>
> >
> >          ASSERT(bioc_ret);
> >
> > @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >          if (io_geom.mirror_num > num_copies)
> >                  return -EINVAL;
> >
> > +        rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> > +
> >          map_offset = logical - map->start;
> >          io_geom.raid56_full_stripe_start = (u64)-1;
> >          max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> > @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >           * For all other non-RAID56 profiles, just copy the target
> >           * stripe into the bioc.
> >           */
> > +        if (rst && dev_replace_is_ongoing)
> > +                up_read(&dev_replace->rwsem);
> >          for (int i = 0; i < io_geom.num_stripes; i++) {
> >                  ret = set_io_stripe(fs_info, logical, length,
> >                                      &bioc->stripes[i], map, &io_geom);
>
> So, why is this safe? The change log doesn't mention anything about
> the chosen fix.
>
> So even if this is called while we are not in the device replace code,
> btrfs_need_stripe_tree_update() can return true.
> In that case we unlock the device replace semaphore, which can result
> in a use-after-free on a device, like this:
>
> 1) btrfs_map_block() called while not in the device replace code
> callchain, and there's a device replace for device X running in
> parallel;
>
> 2) btrfs_need_stripe_tree_update() returns true;
>
> 3) we unlock device replace semaphore;
>
> 4) we call set_io_stripe() which makes bioc point to device X, which
> is the old device (the one being replaced);
>
> 5) before we read lock the device replace semaphore at
> btrfs_map_block(), the device replace finishes and frees device X;
>
> 6) the bioc still points to device X... and then it's used for doing IO later
>
> Can't this happen? I don't see why not.
> If this is safe we should have an explanation in the changelog about
> the details.

And, how does this patch actually fix the double locking problem?

Isn't the problem that the replace code ends up calling btrfs_map_block()
while holding a read lock on the semaphore and then btrfs_map_block()
does a read lock on it again?

I would suggest a different fix:

Make the device replace code store a pointer (or pid) to the task
running device replace, and at btrfs_map_block() don't take the
semaphore if "current" matches that pointer/pid.

Wouldn't that work? Seems safe and simple to me.

> Thanks.
>
> > +
> >                  if (ret < 0)
> >                          break;
> >                  io_geom.stripe_index++;
> >          }
> > +        if (rst && dev_replace_is_ongoing)
> > +                down_read(&dev_replace->rwsem);
> > +
> >          }
> >
> >          if (ret) {
> >
> > ---
> > base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> > change-id: 20240814-dev_replace_rwsem-new-91c160579246
> >
> > Best regards,
> > --
> > Johannes Thumshirn <jth@kernel.org>
> >
> >
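A minimal sketch of that suggestion, assuming a new "replace_task" field in
struct btrfs_dev_replace and a hypothetical helper around the copy loop
(neither appears in the patch above, and the real replace code takes and
drops the rwsem in more places than shown; the existing locking dance in
btrfs_map_block() is condensed here):

        /* Sketch only; "replace_task" and the helper below are assumed names. */
        struct btrfs_dev_replace {
                /* ... existing members (srcdev, tgtdev, rwsem, ...) ... */
                struct task_struct *replace_task; /* task running the replace, else NULL */
        };

        /* Dev-replace side: record the owning task while the copy runs. */
        static int run_dev_replace_copy(struct btrfs_fs_info *fs_info) /* hypothetical */
        {
                struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
                int ret;

                down_read(&dev_replace->rwsem);
                dev_replace->replace_task = current;
                /* the scrub-based copy ends up calling btrfs_map_block() */
                ret = scrub_copy_source_device(fs_info); /* hypothetical helper */
                dev_replace->replace_task = NULL;
                up_read(&dev_replace->rwsem);
                return ret;
        }

        /* btrfs_map_block() side: skip the read lock on re-entry from that task. */
        dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
        if (dev_replace_is_ongoing &&
            fs_info->dev_replace.replace_task != current)
                down_read(&dev_replace->rwsem);

With this, non-replace callers still hold the rwsem across the whole of
btrfs_map_block() (so the use-after-free window from the first reply never
opens), while the replace task relies on the read lock it already holds
further up its call chain.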
On 14.08.24 16:14, Filipe Manana wrote:
> I would suggest a different fix:
>
> Make the device replace code store a pointer (or pid) to the task
> running device replace, and at btrfs_map_block() don't take the
> semaphore if "current" matches that pointer/pid.
>
> Wouldn't that work? Seems safe and simple to me.

Hmm, that could work, let me see.
On 14.08.24 16:14, Filipe Manana wrote:
>
> And, how does this patch actually fix the double locking problem?
>
> Isn't the problem that the replace code ends up calling btrfs_map_block()
> while holding a read lock on the semaphore and then btrfs_map_block()
> does a read lock on it again?
>
> I would suggest a different fix:
>
> Make the device replace code store a pointer (or pid) to the task
> running device replace, and at btrfs_map_block() don't take the
> semaphore if "current" matches that pointer/pid.
>
> Wouldn't that work? Seems safe and simple to me.

From the first test, this seems to work. At least I don't get a lockdep
splat anymore. I'll give it some more test time and then I'll submit
the patch.

Thanks,
        Johannes
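If a v2 follows this route, the unlock at the end of btrfs_map_block()
would presumably need the mirrored check so that lock and unlock stay
balanced; again a sketch based on the assumed replace_task field, not a
posted patch:

        /* End of btrfs_map_block(): drop the rwsem only if we took it above. */
        if (dev_replace_is_ongoing &&
            fs_info->dev_replace.replace_task != current)
                up_read(&dev_replace->rwsem);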
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4b9b647a7e29..e5bd2bee912d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
         int dev_replace_is_ongoing = 0;
         u16 num_alloc_stripes;
         u64 max_len;
+        bool rst;
 
         ASSERT(bioc_ret);
 
@@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
         if (io_geom.mirror_num > num_copies)
                 return -EINVAL;
 
+        rst = btrfs_need_stripe_tree_update(fs_info, map->type);
+
         map_offset = logical - map->start;
         io_geom.raid56_full_stripe_start = (u64)-1;
         max_len = btrfs_max_io_len(map, map_offset, &io_geom);
@@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
          * For all other non-RAID56 profiles, just copy the target
          * stripe into the bioc.
          */
+        if (rst && dev_replace_is_ongoing)
+                up_read(&dev_replace->rwsem);
         for (int i = 0; i < io_geom.num_stripes; i++) {
                 ret = set_io_stripe(fs_info, logical, length,
                                     &bioc->stripes[i], map, &io_geom);
+
                 if (ret < 0)
                         break;
                 io_geom.stripe_index++;
         }
+        if (rst && dev_replace_is_ongoing)
+                down_read(&dev_replace->rwsem);
+
         }
 
         if (ret) {