btrfs: scrub: Require mandatory block group RO for dev-replace

Message ID 20200123073759.23535-1-wqu@suse.com
State New

Commit Message

Qu Wenruo Jan. 23, 2020, 7:37 a.m. UTC
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] and btrfs/071,
looped runs can lead to random failures where scrub finds csum errors.

The probability is not high, around 1/20 to 1/100, but it is real data
corruption.

The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO").

[CAUSE]
Dev-replace has two sources of writes:
- Write duplication
  All writes to the source device are also duplicated to the target device.

  Content:	Latest data/meta

- Scrub copy
  Dev-replace reuses the scrub code to iterate through existing extents
  and copy the verified data to the target device.

  Content:	Data/meta in commit root

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content is mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will overwrite the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW never overwrites extents of a previous
transaction (which are in the commit root).

This behavior can be confirmed by disabling all fallocate-related calls
in fsstress (*); with that, all related tests pass a 2000-run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect the resvsp ioctl to fall back to fallocate in VFS...

[FIX]
Make dev-replace require block groups to be marked RO, and wait for
in-flight nocow writes before calling scrub_chunk().

This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for the dev-replace path.

ENOSPC for dev-replace is still much better than data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
RFC->v1:
- Remove the RFC tag
  Since the cause is pinned down and verified, there is no need for RFC.

- Only wait for nocow writes for dev-replace
  CoW writes are safe as they will never overwrite extents in commit
  root.

- Put the wait call into the proper lock context
  The previous wait happened after scrub_pause_off(), which could
  cause a deadlock: one of the wait calls may need to commit a
  transaction, but in the scrub_pause_off() state a transaction commit
  will wait for us to continue, causing a wait-on-self deadlock.
---
 fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

Comments

Filipe Manana Jan. 23, 2020, 12:06 p.m. UTC | #1
On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [...]
>
> [CAUSE]
> Dev-replace has two source of writes:
> - Write duplication
>   All writes to source device will also be duplicated to target device.
>
>   Content:      Latest data/meta

I find the term "latest" a bit confusing, perhaps "not yet persisted
data and metadata" is more clear.

>
> - Scrub copy
>   Dev-replace reused scrub code to iterate through existing extents, and
>   copy the verified data to target device.
>
>   Content:      Data/meta in commit root

And so here "previously persisted data and metadata".

>
> [...]
>
> [FIX]
> Make dev-replace to require mandatory block group RO, and wait for current
> nocow writes before calling scrub_chunk().
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed") for dev-replace path.
>
> ENOSPC for dev-replace is still much better than data corruption.

Technically if we flag the block group RO without being able to
persist that due to ENOSPC, it's still ok.
We just want to prevent nocow writes racing with scrub copying
procedure. But that's something for some other time, and this is fine
to me.

>
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> RFC->v1:
> - Remove the RFC tag
>   Since the cause is pinned and verified, no need for RFC.
>
> - Only wait for nocow writes for dev-replace
>   CoW writes are safe as they will never overwrite extents in commit
>   root.
>
> - Put the wait call into proper lock context
>   Previous wait happens after scrub_pause_off(), which can cause
>   deadlock where we may need to commit transaction in one of the
>   wait calls. But since we are in scrub_pause_off() environment,
>   transaction commit will wait us to continue, causing a wait-on-self
>   deadlock.
> ---
>  fs/btrfs/scrub.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..5aa486cad298 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                * While for dev replace, we need to try our best to mark block
> +                * group RO, to prevent race between:
> +                * - Write duplication
> +                *   Contains latest data
> +                * - Scrub copy
> +                *   Contains data from commit tree
> +                *
> +                * If target block group is not marked RO, nocow writes can
> +                * be overwritten by scrub copy, causing data corruption.
> +                * So for dev-replace, it's not allowed to continue if a block
> +                * group is not RO.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> -               scrub_pause_off(fs_info);
> -
> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3596,9 +3606,19 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                         btrfs_warn(fs_info,
>                                    "failed setting block group ro: %d", ret);
>                         btrfs_put_block_group(cache);
> +                       scrub_pause_off(fs_info);
>                         break;
>                 }
>
> +               /*
> +                * Now the target block is marked RO, wait for nocow writes to
> +                * finish before dev-replace.
> +                * COW is fine, as COW never overwrites extents in commit tree.
> +                */
> +               if (sctx->is_dev_replace)
> +                       btrfs_wait_nocow_writers(cache);

So this only waits for nocow ordered extents to be created - it
doesn't wait for them to complete.
After that you still need to call:

btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);

Other than that, looks good to me.

Thanks.

> +
> +               scrub_pause_off(fs_info);
>                 down_write(&dev_replace->rwsem);
>                 dev_replace->cursor_right = found_key.offset + length;
>                 dev_replace->cursor_left = found_key.offset;
> --
> 2.25.0
>
Qu Wenruo Jan. 23, 2020, 12:28 p.m. UTC | #2
On 2020/1/23 8:06 PM, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
> [...]
> Technically if we flag the block group RO without being able to
> persist that due to ENOSPC, it's still ok.

Right, I will reword it, since this only slightly increases the
possibility of ENOSPC; it is not guaranteed to cause ENOSPC and abort
the replace.

> We just want to prevent nocow writes racing with scrub copying
> procedure. But that's something for some other time, and this is fine
> to me.
> 
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> RFC->v1:
>> - Remove the RFC tag
>>   Since the cause is pinned and verified, no need for RFC.
>>
>> - Only wait for nocow writes for dev-replace
>>   CoW writes are safe as they will never overwrite extents in commit
>>   root.
>>
[...]
>> +               /*
>> +                * Now the target block is marked RO, wait for nocow writes to
>> +                * finish before dev-replace.
>> +                * COW is fine, as COW never overwrites extents in commit tree.
>> +                */
>> +               if (sctx->is_dev_replace)
>> +                       btrfs_wait_nocow_writers(cache);
> 
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.
> After that you still need to call:
> 
> btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);

I forgot that. Although 1000 runs didn't expose any problem, you are
completely right.

I'll update the patch to address all mentioned comments.

Thanks,
Qu

Qu Wenruo Jan. 23, 2020, 1:39 p.m. UTC | #3
On 2020/1/23 8:06 PM, Filipe Manana wrote:
> On Thu, Jan 23, 2020 at 7:38 AM Qu Wenruo <wqu@suse.com> wrote:
> [...]
>> +               /*
>> +                * Now the target block is marked RO, wait for nocow writes to
>> +                * finish before dev-replace.
>> +                * COW is fine, as COW never overwrites extents in commit tree.
>> +                */
>> +               if (sctx->is_dev_replace)
>> +                       btrfs_wait_nocow_writers(cache);
> 
> So this only waits for nocow ordered extents to be created - it
> doesn't wait for them to complete.

Wait a minute.

This btrfs_wait_nocow_writers() is not just triggering ordered extents
for nocow.
It waits for the nocow_writers count of that block group to drop to 0.

Since we have already marked the block group RO, no new nocow writers
can start, thus the counter can only decrease, never increase.

There are several cases involved:
- NoCOW writeback happens before bg RO
  It will increase the cache->nocow_writers counter, and decrease the
  counter after finish_ordered_io().
  btrfs_wait_nocow_writers() will wait for such writes.

- Writeback happens after bg RO
  Then the write will fall back to COW.

- Writeback happens after bg RO is cleared (reverted back to RW)
  No need to bother at all.

Thus this should be enough, no need for btrfs_wait_ordered_roots().

Thanks,
Qu


Filipe Manana Jan. 23, 2020, 1:49 p.m. UTC | #4
On Thu, Jan 23, 2020 at 1:39 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
>
> Wait a minute.
>
> This btrfs_wait_nocow_writers() is not just triggering ordered extents
> for nocow.
> It waits for the nocow_writers count to drop to 0 for that block group.
>
> Since we have already marked the block group RO, no new nocow writers
> can happen, thus that counter can only decrease, no way to increase.
>
> There are several cases involved:
> - NoCOW writeback happens before bg RO
>   It will increase the cache->nocow_writers counter, and decrease the
>   counter after btrfs_finish_ordered_io().

Nope, nocow_writers is decremented after creating the ordered extent
when starting writeback (at run_delalloc_nocow()) - not when completing
the ordered extent (at btrfs_finish_ordered_io()).
The same applies to direct IO.

Thanks


>   btrfs_wait_nocow_writers() will wait for such writes
>
> - Writeback happens after bg RO
>   Then the write will fallback to COW.
>
> - Writeback happens after bg RO cleared (reverted back to RW)
>   No need to bother at all.
>
> Thus this should be enough, no need for btrfs_wait_ordered_roots().
>
> Thanks,
> Qu
>
>
> > After that you still need to call:
> >
> > btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
> >
> > Other than that, looks good to me.
> >
> > Thanks.
> >
> >> +
> >> +               scrub_pause_off(fs_info);
> >>                 down_write(&dev_replace->rwsem);
> >>                 dev_replace->cursor_right = found_key.offset + length;
> >>                 dev_replace->cursor_left = found_key.offset;
> >> --
> >> 2.25.0
> >>
> >
> >
>
Qu Wenruo Jan. 23, 2020, 1:57 p.m. UTC | #5
On 2020/1/23 9:49 PM, Filipe Manana wrote:
[...]
>> This btrfs_wait_nocow_writers() is not just triggering ordered extents
>> for nocow.
>> It waits for the nocow_writers count to drop to 0 for that block group.
>>
>> Since we have already marked the block group RO, no new nocow writers
>> can happen, thus that counter can only decrease, no way to increase.
>>
>> There are several cases involved:
>> - NoCOW writeback happens before bg RO
>>   It will increase the cache->nocow_writers counter, and decrease the
>>   counter after btrfs_finish_ordered_io().
> 
> Nope, nocow_writers is decremented after creating the ordered extent
> when starting writeback (at run_delalloc_nocow()) - not when completing
> the ordered extent (at btrfs_finish_ordered_io()).

Oh, right. The dec part still happens in run_delalloc_nocow().
So the btrfs_wait_ordered_roots() call is still needed.

Thanks,
Qu

> The same applies to direct IO.
> 
> Thanks
> 

David Sterba Jan. 23, 2020, 4:40 p.m. UTC | #6
On Thu, Jan 23, 2020 at 03:37:59PM +0800, Qu Wenruo wrote:
> 
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")

This one is in the 5.5-rc, so I'd like to get it to the final release. I
haven't read the discussion properly so please let me know if the patch
needs another round or fixups I can do. Time for the pull request is in
a day (2 at most as it's too close to the release) but given the type of
fix it's justified. Thanks.
Qu Wenruo Jan. 23, 2020, 11:58 p.m. UTC | #7
On 2020/1/24 12:40 AM, David Sterba wrote:
> On Thu, Jan 23, 2020 at 03:37:59PM +0800, Qu Wenruo wrote:
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> 
> This one is in the 5.5-rc, so I'd like to get it to the final release. I
> haven't read the discussion properly so please let me know if the patch
> needs another round or fixups I can do. Time for the pull request is in
> a day (2 at most as it's too close to the release) but given the type of
> fix it's justified. Thanks.
> 
The v2 version has been CCed to you (and the mailing list).

Thanks,
Qu

Patch
diff mbox series

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..5aa486cad298 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3577,17 +3577,27 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * This can easily boost the amount of SYSTEM chunks if cleaner
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
+		 *
+		 * While for dev replace, we need to try our best to mark block
+		 * group RO, to prevent race between:
+		 * - Write duplication
+		 *   Contains latest data
+		 * - Scrub copy
+		 *   Contains data from commit tree
+		 *
+		 * If target block group is not marked RO, nocow writes can
+		 * be overwritten by scrub copy, causing data corruption.
+		 * So for dev-replace, it's not allowed to continue if a block
+		 * group is not RO.
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
-		scrub_pause_off(fs_info);
-
+		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC) {
+		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub/replace, because
+			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
 			 */
@@ -3596,9 +3606,19 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
 			btrfs_put_block_group(cache);
+			scrub_pause_off(fs_info);
 			break;
 		}
 
+		/*
+		 * Now the target block is marked RO, wait for nocow writes to
+		 * finish before dev-replace.
+		 * COW is fine, as COW never overwrites extents in commit tree.
+		 */
+		if (sctx->is_dev_replace)
+			btrfs_wait_nocow_writers(cache);
+
+		scrub_pause_off(fs_info);
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;