
btrfs: relax dev_replace rwsem usage on scrub with rst

Message ID 20240814-dev_replace_rwsem-new-v1-1-c42120994ce6@kernel.org (mailing list archive)
State New, archived

Commit Message

Johannes Thumshirn Aug. 14, 2024, 12:45 p.m. UTC
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
the RAID stripe-tree, we get the following splat from lockdep:

 BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started

 ============================================
 WARNING: possible recursive locking detected
 6.11.0-rc3+ #592 Not tainted
 --------------------------------------------
 btrfs/4203 is trying to acquire lock:
 ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

 but task is already holding lock:
 ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&fs_info->dev_replace.rwsem);
   lock(&fs_info->dev_replace.rwsem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by btrfs/4203:
  #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

This fixes a deadlock on RAID stripe-tree where device replace performs a
scrub operation, which in turn calls into btrfs_map_block() to find the
physical location of the block.
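
The recursion described above can be sketched as follows (an approximate
call chain based on the description; the exact intermediate scrub
functions are not shown in the splat):

```
dev replace starts, scrub runs in its context
    down_read(&dev_replace->rwsem)           first (outer) read lock
    ... scrub reads a block ...
        btrfs_map_block()                    RST physical block lookup
            down_read(&dev_replace->rwsem)   recursive read lock -> splat
```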

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---


Signed-off-by: Johannes Thumshirn <jth@kernel.org>
---
 fs/btrfs/volumes.c | 9 +++++++++
 1 file changed, 9 insertions(+)


---
base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
change-id: 20240814-dev_replace_rwsem-new-91c160579246

Best regards,

Comments

Johannes Thumshirn Aug. 14, 2024, 1:08 p.m. UTC | #1
On 14.08.24 14:46, Johannes Thumshirn wrote:
> +		if (rst && dev_replace_is_ongoing)
> +			up_read(&dev_replace->rwsem);
>   		for (int i = 0; i < io_geom.num_stripes; i++) {
>   			ret = set_io_stripe(fs_info, logical, length,
>   					    &bioc->stripes[i], map, &io_geom);
> +
>   			if (ret < 0)
>   				break;
>   			io_geom.stripe_index++;
>   		}

That stray newline was added by accident; I'll remove it if there's a 
need for a v2 or when applying.
Filipe Manana Aug. 14, 2024, 1:31 p.m. UTC | #2
On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> the RAID stripe-tree, we get the following splat from lockdep:
>
>  BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
>
>  ============================================
>  WARNING: possible recursive locking detected
>  6.11.0-rc3+ #592 Not tainted
>  --------------------------------------------
>  btrfs/4203 is trying to acquire lock:
>  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
>  but task is already holding lock:
>  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
>  other info that might help us debug this:
>   Possible unsafe locking scenario:
>
>         CPU0
>         ----
>    lock(&fs_info->dev_replace.rwsem);
>    lock(&fs_info->dev_replace.rwsem);
>
>   *** DEADLOCK ***

Is this really the full splat?
There should be a stack trace showing that the problem happens when
btrfs_map_block() is called within the scrub/dev replace code, no?


>
>   May be due to missing lock nesting notation
>
>  1 lock held by btrfs/4203:
>   #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> This fixes a deadlock on RAID stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Cc: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>
>
> Signed-off-by: Johannes Thumshirn <jth@kernel.org>
> ---
>  fs/btrfs/volumes.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 4b9b647a7e29..e5bd2bee912d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>         int dev_replace_is_ongoing = 0;
>         u16 num_alloc_stripes;
>         u64 max_len;
> +       bool rst;

The name is a bit confusing; something like "update_rst" would be more
meaningful and clearly indicate it's a boolean/condition.

>
>         ASSERT(bioc_ret);
>
> @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>         if (io_geom.mirror_num > num_copies)
>                 return -EINVAL;
>
> +       rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> +
>         map_offset = logical - map->start;
>         io_geom.raid56_full_stripe_start = (u64)-1;
>         max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>                  * For all other non-RAID56 profiles, just copy the target
>                  * stripe into the bioc.
>                  */
> +               if (rst && dev_replace_is_ongoing)
> +                       up_read(&dev_replace->rwsem);
>                 for (int i = 0; i < io_geom.num_stripes; i++) {
>                         ret = set_io_stripe(fs_info, logical, length,
>                                             &bioc->stripes[i], map, &io_geom);

So, why is this safe? The change log doesn't mention anything about
the chosen fix.

So even if this is called while we are not in the device replace code,
btrfs_need_stripe_tree_update() can return true.
In that case we unlock the device replace semaphore, which can result in
a use-after-free on a device, like this:

1) btrfs_map_block() called while not in the device replace code
callchain, and there's a device replace for device X running in
parallel;

2) btrfs_need_stripe_tree_update() returns true;

3) we unlock device replace semaphore;

4) we call set_io_stripe() which makes bioc point to device X, which
is the old device (the one being replaced);

5) before we read lock the device replace semaphore at
btrfs_map_block(), the device replace finishes and frees device X;

6) the bioc still points to device X... and then it's used for doing IO later

Can't this happen? I don't see why not.
If this is safe we should have an explanation in the changelog about
the details.

Thanks.


> +
>                         if (ret < 0)
>                                 break;
>                         io_geom.stripe_index++;
>                 }
> +               if (rst && dev_replace_is_ongoing)
> +                       down_read(&dev_replace->rwsem);
> +
>         }
>
>         if (ret) {
>
> ---
> base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> change-id: 20240814-dev_replace_rwsem-new-91c160579246
>
> Best regards,
> --
> Johannes Thumshirn <jth@kernel.org>
>
>
Filipe Manana Aug. 14, 2024, 2:12 p.m. UTC | #3
On Wed, Aug 14, 2024 at 2:31 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@kernel.org> wrote:
> >
> > From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >
> > Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> > the RAID stripe-tree, we get the following splat from lockdep:
> >
> >  BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
> >
> >  ============================================
> >  WARNING: possible recursive locking detected
> >  6.11.0-rc3+ #592 Not tainted
> >  --------------------------------------------
> >  btrfs/4203 is trying to acquire lock:
> >  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> >  but task is already holding lock:
> >  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> >  other info that might help us debug this:
> >   Possible unsafe locking scenario:
> >
> >         CPU0
> >         ----
> >    lock(&fs_info->dev_replace.rwsem);
> >    lock(&fs_info->dev_replace.rwsem);
> >
> >   *** DEADLOCK ***
>
> Is this really the full splat?
> There should be a stack trace showing that the problem happens when
> btrfs_map_block() is called within the scrub/dev replace code, no?
>
>
> >
> >   May be due to missing lock nesting notation
> >
> >  1 lock held by btrfs/4203:
> >   #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > This fixes a deadlock on RAID stripe-tree where device replace performs a
> > scrub operation, which in turn calls into btrfs_map_block() to find the
> > physical location of the block.
> >
> > Cc: Filipe Manana <fdmanana@suse.com>
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > ---
> >
> >
> > Signed-off-by: Johannes Thumshirn <jth@kernel.org>
> > ---
> >  fs/btrfs/volumes.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 4b9b647a7e29..e5bd2bee912d 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >         int dev_replace_is_ongoing = 0;
> >         u16 num_alloc_stripes;
> >         u64 max_len;
> > +       bool rst;
>
> The name is a bit confusing; something like "update_rst" would be more
> meaningful and clearly indicate it's a boolean/condition.
>
> >
> >         ASSERT(bioc_ret);
> >
> > @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >         if (io_geom.mirror_num > num_copies)
> >                 return -EINVAL;
> >
> > +       rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> > +
> >         map_offset = logical - map->start;
> >         io_geom.raid56_full_stripe_start = (u64)-1;
> >         max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> > @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >                  * For all other non-RAID56 profiles, just copy the target
> >                  * stripe into the bioc.
> >                  */
> > +               if (rst && dev_replace_is_ongoing)
> > +                       up_read(&dev_replace->rwsem);
> >                 for (int i = 0; i < io_geom.num_stripes; i++) {
> >                         ret = set_io_stripe(fs_info, logical, length,
> >                                             &bioc->stripes[i], map, &io_geom);
>
> So, why is this safe? The change log doesn't mention anything about
> the chosen fix.
>
> So even if this is called while we are not in the device replace code,
> btrfs_need_stripe_tree_update() can return true.
> In that case we unlock the device replace semaphore, which can result in
> a use-after-free on a device, like this:
>
> 1) btrfs_map_block() called while not in the device replace code
> callchain, and there's a device replace for device X running in
> parallel;
>
> 2) btrfs_need_stripe_tree_update() returns true;
>
> 3) we unlock device replace semaphore;
>
> 4) we call set_io_stripe() which makes bioc point to device X, which
> is the old device (the one being replaced);
>
> 5) before we read lock the device replace semaphore at
> btrfs_map_block(), the device replace finishes and frees device X;
>
> 6) the bioc still points to device X... and then it's used for doing IO later
>
> Can't this happen? I don't see why not.
> If this is safe we should have an explanation in the changelog about
> the details.

And how does this patch actually fix the double locking problem?

Isn't the problem that the replace code ends calling btrfs_map_block()
while holding a read lock on the semaphore and then btrfs_map_block()
does a read lock on it again?

I would suggest a different fix:

Make the device replace code store a pointer (or pid) to the task
running device replace, and at btrfs_map_block() don't take the
semaphore if "current" matches that pointer/pid.

Wouldn't that work? Seems safe and simple to me.

>
> Thanks.
>
>
> > +
> >                         if (ret < 0)
> >                                 break;
> >                         io_geom.stripe_index++;
> >                 }
> > +               if (rst && dev_replace_is_ongoing)
> > +                       down_read(&dev_replace->rwsem);
> > +
> >         }
> >
> >         if (ret) {
> >
> > ---
> > base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> > change-id: 20240814-dev_replace_rwsem-new-91c160579246
> >
> > Best regards,
> > --
> > Johannes Thumshirn <jth@kernel.org>
> >
> >
Johannes Thumshirn Aug. 14, 2024, 2:15 p.m. UTC | #4
On 14.08.24 16:14, Filipe Manana wrote:
> I would suggest a different fix:
> 
> Make the device replace code store a pointer (or pid) to the task
> running device replace, and at btrfs_map_block() don't take the
> semaphore if "current" matches that pointer/pid.
> 
> Wouldn't that work? Seems safe and simple to me.
> 


Hmm that could work, let me see.
Johannes Thumshirn Aug. 15, 2024, 10:55 a.m. UTC | #5
On 14.08.24 16:14, Filipe Manana wrote:
> 
> And how does this patch actually fix the double locking problem?
> 
> Isn't the problem that the replace code ends calling btrfs_map_block()
> while holding a read lock on the semaphore and then btrfs_map_block()
> does a read lock on it again?
> 
> I would suggest a different fix:
> 
> Make the device replace code store a pointer (or pid) to the task
> running device replace, and at btrfs_map_block() don't take the
> semaphore if "current" matches that pointer/pid.
> 
> Wouldn't that work? Seems safe and simple to me.

From the first test, this seems to work. At least I don't get a lockdep
splat anymore.

I'll give it some more test time and then I'll submit the patch.

Thanks,
	Johannes

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4b9b647a7e29..e5bd2bee912d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6459,6 +6459,7 @@  int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	int dev_replace_is_ongoing = 0;
 	u16 num_alloc_stripes;
 	u64 max_len;
+	bool rst;
 
 	ASSERT(bioc_ret);
 
@@ -6475,6 +6476,8 @@  int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	if (io_geom.mirror_num > num_copies)
 		return -EINVAL;
 
+	rst = btrfs_need_stripe_tree_update(fs_info, map->type);
+
 	map_offset = logical - map->start;
 	io_geom.raid56_full_stripe_start = (u64)-1;
 	max_len = btrfs_max_io_len(map, map_offset, &io_geom);
@@ -6597,13 +6600,19 @@  int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		 * For all other non-RAID56 profiles, just copy the target
 		 * stripe into the bioc.
 		 */
+		if (rst && dev_replace_is_ongoing)
+			up_read(&dev_replace->rwsem);
 		for (int i = 0; i < io_geom.num_stripes; i++) {
 			ret = set_io_stripe(fs_info, logical, length,
 					    &bioc->stripes[i], map, &io_geom);
+
 			if (ret < 0)
 				break;
 			io_geom.stripe_index++;
 		}
+		if (rst && dev_replace_is_ongoing)
+			down_read(&dev_replace->rwsem);
+
 	}
 
 	if (ret) {