
[RFC] btrfs: make dev-scrub as an exclusive operation

Message ID ef0c22ce3cf2f7941634ed1cb2ca718f04ce675d.1682296794.git.wqu@suse.com (mailing list archive)
State New, archived
Series [RFC] btrfs: make dev-scrub as an exclusive operation

Commit Message

Qu Wenruo April 24, 2023, 12:48 a.m. UTC
[PROBLEMS]
Currently dev-scrub is not an exclusive operation, thus it's possible to
run scrub and balance at the same time.

But there are several possible problems when such concurrency is
involved:

- Scrub can lead to false alerts due to removed block groups
  When balance finishes, it deletes the source block group and may
  even issue a discard.

  In that case, if a scrub is still running for that block group, it
  can raise false alerts about bad checksums.

- Balance already verifies checksums
  Thus we're doing double checksum verification, which in most cases
  is just a waste of IO and CPU time.

- Extra chance of unnecessary -ENOSPC
  Both balance and scrub would try to mark the target block group
  read-only.
  With more block groups marked read-only, we have a higher chance of
  hitting an early -ENOSPC.

[ENHANCEMENT]
Let's make dev-scrub an exclusive operation, but unlike the regular
exclusive operations, allow multiple dev-scrubs to run concurrently.

Thus we introduce a new member, fs_info::exclusive_dev_scrubs, to record
how many dev-scrubs are running, and only set fs_info::exclusive_operation
back to NONE when no dev-scrub is running anymore.
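For readability, here is a condensed sketch of the start/finish logic the
patch implements (see the Patch section at the end of this page for the
full diff); the WRITE_ONCE() and the sysfs notification are omitted for
brevity:

/* Condensed from the patch: counting dev-scrubs inside the exclusive op. */
bool btrfs_exclop_start(struct btrfs_fs_info *fs_info,
			enum btrfs_exclusive_operation type)
{
	bool ret = false;

	spin_lock(&fs_info->super_lock);
	if (fs_info->exclusive_operation == BTRFS_EXCLOP_NONE) {
		/* No exclusive operation running yet: take the slot. */
		fs_info->exclusive_operation = type;
		if (type == BTRFS_EXCLOP_DEV_SCRUB)
			atomic_inc(&fs_info->exclusive_dev_scrubs);
		ret = true;
	} else if (type == BTRFS_EXCLOP_DEV_SCRUB &&
		   fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_SCRUB) {
		/* Another dev-scrub already holds the slot: just add to the count. */
		atomic_inc(&fs_info->exclusive_dev_scrubs);
		ret = true;
	}
	spin_unlock(&fs_info->super_lock);
	return ret;
}

void btrfs_exclop_finish(struct btrfs_fs_info *fs_info)
{
	spin_lock(&fs_info->super_lock);
	if (fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_SCRUB) {
		/* Only the last exiting dev-scrub clears the exclusive op. */
		if (atomic_dec_and_test(&fs_info->exclusive_dev_scrubs))
			fs_info->exclusive_operation = BTRFS_EXCLOP_NONE;
	} else {
		fs_info->exclusive_operation = BTRFS_EXCLOP_NONE;
	}
	spin_unlock(&fs_info->super_lock);
}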

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

This is a behavior change for dev-scrub: the scrub ioctl can now return
an extra error pattern (BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS), as
illustrated by the sketch below.
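For illustration only, a rough sketch of how user space might handle the
new error pattern, assuming the value is surfaced as a positive ioctl
return code like the other exclusive-operation ioctls (this helper is
hypothetical and not part of btrfs-progs):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Hypothetical helper: start a scrub on one device, report exclop conflicts. */
static int try_scrub(int fs_fd, __u64 devid)
{
	struct btrfs_ioctl_scrub_args sa;
	int ret;

	memset(&sa, 0, sizeof(sa));
	sa.devid = devid;
	sa.start = 0;
	sa.end = (__u64)-1;

	ret = ioctl(fs_fd, BTRFS_IOC_SCRUB, &sa);
	if (ret == BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS) {
		/* New with this RFC: another exclusive operation blocks the scrub. */
		fprintf(stderr, "scrub: exclusive operation in progress, try again later\n");
		return ret;
	}
	if (ret < 0)
		perror("BTRFS_IOC_SCRUB");
	return ret;
}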

And some existing test cases would become invalid, as they rely on
running scrub and balance concurrently as background stress.

On the other hand, this makes the later logical-bytenr-based scrub much
easier to implement (it needs to be exclusive with dev-scrub).
---
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/fs.h      |  5 ++++-
 fs/btrfs/ioctl.c   | 29 ++++++++++++++++++++++++++++-
 3 files changed, 33 insertions(+), 2 deletions(-)

Comments

Filipe Manana April 24, 2023, 10:14 a.m. UTC | #1
On Mon, Apr 24, 2023 at 2:48 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [PROBLEMS]
> Currently dev-scrub is not an exclusive operation, thus it's possible to
> run scrub with balance at the same time.
>
> But there are possible several problems when such concurrency is
> involved:
>
> - Scrub can lead to false alerts due to removed block groups
>   When balance is finished, it would delete the source block group
>   and may even do an discard.

So, I haven't taken a look at scrub after the recent rewrite you made.
But some comments on this.

Scrub has to deal with block groups getting removed anyway, due to the
automatic removal of empty block groups. The balance case is no different...

You seem to have missed the fact that before balance relocates a block
group, it pauses scrub, and then after finishing the relocation, it
allows scrub to resume. The pausing happens at btrfs_relocate_chunk().
It only removes the block group when it's empty...
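For reference, the pause/resume sequence mentioned above, paraphrased
from memory of fs/btrfs/volumes.c (not an exact copy of the kernel
source; error handling, locking and the transaction setup for the chunk
removal are omitted):

/* Rough shape of btrfs_relocate_chunk(). */
static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
{
	int ret;

	/* Scrub is paused across the whole relocation of this chunk. */
	btrfs_scrub_pause(fs_info);
	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
	btrfs_scrub_continue(fs_info);
	if (ret)
		return ret;

	/*
	 * Only after the block group has been emptied does balance go on to
	 * remove the chunk and its block group, in a new transaction
	 * (btrfs_remove_chunk()).
	 */
	return ret;
}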

Furthermore, transaction commits pause scrub, do the commit root switches with
scrub paused, and scrub uses (or it used) commit roots to search for
allocated extents.

Also, how is that discard problem possible?

If a block group is deleted, the discard only happens after the
corresponding transaction is committed, and again, the transaction
commit pauses scrub, the discard happens while scrub is paused (at
btrfs_finish_extent_commit()), and scrub uses commit roots.

At least this used to be true before the recent scrub rewrite...

I would like to see the changelog corrected, or at least updated to
detail exactly how these issues can happen.

>
>   In that case if a scrub is still running for that block group, we
>   can lead to false alerts on bad checksum.
>
> - Balance is already checking the checksum
>   Thus we're doing double checksum verification, under most cases it's
>   just a waste of IO and CPU time.

So what? If they're done sequentially, it's the same thing...
Multiple reads can also trigger checksum verification, both direct IO
and buffered IO.

Thanks.

Qu Wenruo April 24, 2023, 10:48 a.m. UTC | #2
On 2023/4/24 18:14, Filipe Manana wrote:
> On Mon, Apr 24, 2023 at 2:48 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [PROBLEMS]
>> Currently dev-scrub is not an exclusive operation, thus it's possible to
>> run scrub with balance at the same time.
>>
>> But there are possible several problems when such concurrency is
>> involved:
>>
>> - Scrub can lead to false alerts due to removed block groups
>>    When balance is finished, it would delete the source block group
>>    and may even do an discard.
> 
> So, I haven't taken a look at scrub after the recent rewrite you made.
> But some comments on this.
> 
> Scrub has to deal with block groups getting removed anyway, due to the
> automatic removal of empty block groups. The balance case is no different...
> 
> You seem to have missed the fact that  before balance relocates a block group,
> it pauses scrub, and then after finishing the relocation, it allows
> scrub to resume.
> The pausing happens at btrfs_relocate_chunk().
> It only removes the block group  when it's empty...
> 
> Furthermore, transaction commits pause scrub, do the commit root switches with
> scrub paused, and scrub uses (or it used) commit roots to search for
> allocated extents.
> 
> Also, how is that discard problem possible?

The whole case is described by this syzbot report:

https://lore.kernel.org/lkml/00000000000041d47405f9f56290@google.com/

Where we have balance and scrub hitting the problem.

The ASSERT() itself is going to be removed by the scrub rework, but if 
the old code can lead to that ASSERT(), the reworked code should still 
lead to the same situation, as the scrub/balance exclusion code is still 
kept as is.

And in that dmesg we got quite a few tree block bytenr mismatches, which
indicates those tree blocks were already zeroed out.

> 
> If a block group is deleted, the discard only happens after the
> corresponding transaction
> is committed, and again, the transaction commit pauses scrub, the
> discard happens while
> scrub is paused (at btrfs_finish_extent_commit()), and scrub uses commit roots.
> 
> At least this used to be true before the recent scrub rewrite...

The syzbot report predates my scrub rework, and the scrub pause code was
untouched by the rework.

Although I do see your point: a deeper look into the existing scrub
pause code indeed shows that a transaction commit pauses scrub before
switching commit roots.

Allow me to dig deeper to find out how this syzbot report can still happen.

Thanks,
Qu

Qu Wenruo April 25, 2023, 12:30 p.m. UTC | #3
On 2023/4/24 18:14, Filipe Manana wrote:
> On Mon, Apr 24, 2023 at 2:48 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [PROBLEMS]
>> Currently dev-scrub is not an exclusive operation, thus it's possible to
>> run scrub with balance at the same time.
>>
>> But there are possible several problems when such concurrency is
>> involved:
>>
>> - Scrub can lead to false alerts due to removed block groups
>>    When balance is finished, it would delete the source block group
>>    and may even do an discard.
> 
> So, I haven't taken a look at scrub after the recent rewrite you made.
> But some comments on this.
> 
> Scrub has to deal with block groups getting removed anyway, due to the
> automatic removal of empty block groups. The balance case is no different...
> 
> You seem to have missed the fact that  before balance relocates a block group,
> it pauses scrub, and then after finishing the relocation, it allows
> scrub to resume.
> The pausing happens at btrfs_relocate_chunk().
> It only removes the block group  when it's empty...
> 
> Furthermore, transaction commits pause scrub, do the commit root switches with
> scrub paused, and scrub uses (or it used) commit roots to search for
> allocated extents.

In this particular case, there seems to be a small race window which can
lead to a triggered ASSERT() (the one from the syzbot report).

             Balance thread        |         Scrub thread
-----------------------------------+---------------------------------------
btrfs_relocate_chunk()             |
|- btrfs_scrub_pause()             |
|- btrfs_relocate_block_group()    |
|- btrfs_scrub_continue()          |
|- btrfs_remove_chunk()            |
   |- btrfs_remove_block_group()   |
      Now the bg is removed from   |
      the bg cache                 | scrub_handle_errored_block()
                                   | |- lock_full_stripe()
                                   |    |- btrfs_lookup_block_group()
                                   |       The bg is removed, thus we get
                                   |       a NULL bg_cache
                                   |    |- ASSERT(!bg_cache)
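For context, the lookup in the old (pre-rework) lock_full_stripe() that
the race hits, reproduced from memory and trimmed, so the exact form may
differ from what is in fs/btrfs/scrub.c:

/* Old scrub code: per-error block group lookup that the race can break. */
static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr,
			    bool *locked_ret)
{
	struct btrfs_block_group *bg_cache;

	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
	if (!bg_cache) {
		/*
		 * The block group was removed (e.g. by balance) between the
		 * scrub read and this error handling, so the assertion fires.
		 */
		ASSERT(0);
		return -ENOENT;
	}
	/* ... take the full stripe lock using bg_cache ... */
	btrfs_put_block_group(bg_cache);
	return 0;
}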

This ASSERT() is thankfully removed completely by the scrub rework.

Instead, the reworked scrub holds the block group cache for the whole
scrub process, so there is no need to call btrfs_lookup_block_group()
anymore. (The lock_full_stripe() function is also removed.)

But this is not the same as the problem I described here...


For the discarded tree blocks, I don't have a clue yet.

Your comments on the timing are correct: the extents can only be
discarded during btrfs_finish_extent_commit(), which runs with scrub paused.

Thanks,
Qu
David Sterba April 28, 2023, 4:55 p.m. UTC | #4
On Mon, Apr 24, 2023 at 08:48:47AM +0800, Qu Wenruo wrote:
> [PROBLEMS]
> Currently dev-scrub is not an exclusive operation, thus it's possible to
> run scrub with balance at the same time.
> 
> But there are possible several problems when such concurrency is
> involved:
> 
> - Scrub can lead to false alerts due to removed block groups
>   When balance is finished, it would delete the source block group
>   and may even do an discard.
> 
>   In that case if a scrub is still running for that block group, we
>   can lead to false alerts on bad checksum.
> 
> - Balance is already checking the checksum
>   Thus we're doing double checksum verification, under most cases it's
>   just a waste of IO and CPU time.
> 
> - Extra chance of unnecessary -ENOSPC
>   Both balance and scrub would try to mark the target block group
>   read-only.
>   With more block groups marked read-only, we have higher chance to
>   hit early -ENOSPC.
> 
> [ENHANCEMENT]
> Let's make dev-scrub as an exclusive operation, but unlike regular
> exclusive operations, we need to allow multiple dev-scrub to run
> concurrently.
> 
> Thus we introduce a new member, fs_info::exclusive_dev_scrubs, to record
> how many dev scrubs are running.
> And only set fs_info::exclusive_operation back to NONE when no more
> dev-scrub is running.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
> 
> This change is a change to the dev-scrub behavior, now we have extra
> error patterns.
> 
> And some existing test cases would be invalid, as they expect
> concurrent scrub and balance as a background stress.
> 
> Although this makes later logical bytenr based scrub much easier to
> implement (needs to be exclusive with dev-scrub).

IIRC there were suggestions in the past to make scrub another exclusive
op, because of the performance impact when it's accidentally started in
the background, which is related to the reasons you mention.

Scrub is a bit different from the other ops: it can be started on
individual devices, so it's not a whole-filesystem op, which is why it
needs the counter, but that could be acceptable in the end.

But I'm not sure we should add this limitation, and it's a bit hard to
guess how much concurrent scrub/balance is used versus how often the
spurious ENOSPC happens (and is considered a problem). We could add a
warning to balance and scrub start if the other one is detected.
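A rough sketch of what such a warning could look like; this helper is
purely hypothetical (the name is made up and it is not part of this
patch), and it only reads fs_info::scrubs_running and
fs_info::balance_ctl as unlocked hints:

/* Hypothetical: warn when balance/scrub is started while the other runs. */
static void btrfs_warn_concurrent_maintenance(struct btrfs_fs_info *fs_info,
					      const char *op)
{
	if (atomic_read(&fs_info->scrubs_running))
		btrfs_warn(fs_info,
	"%s started while scrub is running, expect extra IO and possible early ENOSPC",
			   op);
	if (fs_info->balance_ctl)
		btrfs_warn(fs_info,
	"%s started while balance is running, expect extra IO and possible early ENOSPC",
			   op);
}

Balance start and the scrub ioctl could then call it with "balance" and
"scrub" respectively before kicking off the actual work.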

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59ea049fe7ee..c6323750be18 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2957,6 +2957,7 @@  void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	atomic_set(&fs_info->async_delalloc_pages, 0);
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->nr_delayed_iputs, 0);
+	atomic_set(&fs_info->exclusive_dev_scrubs, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->global_root_tree = RB_ROOT;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 0d98fc5f6f44..ff7c0c1fb145 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -335,7 +335,7 @@  struct btrfs_discard_ctl {
 };
 
 /*
- * Exclusive operations (device replace, resize, device add/remove, balance)
+ * Exclusive operations (device replace, resize, device add/remove, balance, scrub)
  */
 enum btrfs_exclusive_operation {
 	BTRFS_EXCLOP_NONE,
@@ -344,6 +344,7 @@  enum btrfs_exclusive_operation {
 	BTRFS_EXCLOP_DEV_ADD,
 	BTRFS_EXCLOP_DEV_REMOVE,
 	BTRFS_EXCLOP_DEV_REPLACE,
+	BTRFS_EXCLOP_DEV_SCRUB,
 	BTRFS_EXCLOP_RESIZE,
 	BTRFS_EXCLOP_SWAP_ACTIVATE,
 };
@@ -744,6 +745,8 @@  struct btrfs_fs_info {
 
 	/* Type of exclusive operation running, protected by super_lock */
 	enum btrfs_exclusive_operation exclusive_operation;
+	/* Number of running dev_scrubs for exclusive operations. */
+	atomic_t exclusive_dev_scrubs;
 
 	/*
 	 * Zone size > 0 when in ZONED mode, otherwise it's used for a check
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 25833b4eeaf5..4be89198baed 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -403,6 +403,20 @@  bool btrfs_exclop_start(struct btrfs_fs_info *fs_info,
 	spin_lock(&fs_info->super_lock);
 	if (fs_info->exclusive_operation == BTRFS_EXCLOP_NONE) {
 		fs_info->exclusive_operation = type;
+		if (type == BTRFS_EXCLOP_DEV_SCRUB)
+			atomic_inc(&fs_info->exclusive_dev_scrubs);
+		ret = true;
+		spin_unlock(&fs_info->super_lock);
+		return ret;
+	}
+
+	/*
+	 * For dev-scrub, we allow multiple scrubs to be run concurrently
+	 * as it's a per-device operation.
+	 */
+	if (type == BTRFS_EXCLOP_DEV_SCRUB &&
+	    fs_info->exclusive_operation == type) {
+		atomic_inc(&fs_info->exclusive_dev_scrubs);
 		ret = true;
 	}
 	spin_unlock(&fs_info->super_lock);
@@ -442,7 +456,12 @@  void btrfs_exclop_start_unlock(struct btrfs_fs_info *fs_info)
 void btrfs_exclop_finish(struct btrfs_fs_info *fs_info)
 {
 	spin_lock(&fs_info->super_lock);
-	WRITE_ONCE(fs_info->exclusive_operation, BTRFS_EXCLOP_NONE);
+	if (fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_SCRUB) {
+		if (atomic_dec_and_test(&fs_info->exclusive_dev_scrubs))
+			WRITE_ONCE(fs_info->exclusive_operation, BTRFS_EXCLOP_NONE);
+	} else {
+		WRITE_ONCE(fs_info->exclusive_operation, BTRFS_EXCLOP_NONE);
+	}
 	spin_unlock(&fs_info->super_lock);
 	sysfs_notify(&fs_info->fs_devices->fsid_kobj, NULL, "exclusive_operation");
 }
@@ -3172,9 +3191,17 @@  static long btrfs_ioctl_scrub(struct file *file, void __user *arg)
 			goto out;
 	}
 
+	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_DEV_SCRUB)) {
+		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
+		if (!(sa->flags & BTRFS_SCRUB_READONLY))
+			mnt_drop_write_file(file);
+		goto out;
+	}
+
 	ret = btrfs_scrub_dev(fs_info, sa->devid, sa->start, sa->end,
 			      &sa->progress, sa->flags & BTRFS_SCRUB_READONLY,
 			      0);
+	btrfs_exclop_finish(fs_info);
 
 	/*
 	 * Copy scrub args to user space even if btrfs_scrub_dev() returned an