[RFC] btrfs: scrub: Mandatory RO block group for device replace

Message ID 20200122083628.16331-1-wqu@suse.com
State New
Series
  • [RFC] btrfs: scrub: Mandatory RO block group for device replace

Commit Message

Qu Wenruo Jan. 22, 2020, 8:36 a.m. UTC
[BUG]
btrfs/064, btrfs/065 and btrfs/071 can fail with csum errors.
The reproducibility is low, around 1/20 to 1/100, so the tests need to
be run in a loop.

The chunk profile doesn't make much difference either; even
SINGLE/SINGLE can reproduce the problem.

The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Device replace reuses scrub code to iterate existing extents.

It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
that scrub read can write the verified data to target device.

Device replace also utilizes "write duplication" to write new data to
both source and target device.

However those two writes can conflict and may lead to data corruption:
- Scrub writes old data from commit root
  Both extent location and csum are fetched from commit root, which
  is not always the up-to-date data.

- Write duplication is always duplicating latest data

This means there is a race window: "write duplication" writes the
latest data to disk, then scrub writes back the old data, causing data
corruption.

In theory, this should only affect data, not metadata.
Metadata writeback only happens when committing a transaction, thus
it's always after the scrub writes.

[FIX]
Make dev-replace require the target block group to be marked RO.

And to be extra safe, for dev-replace, wait for all existing writes to
finish before scrubbing the chunk.

This patch mostly reverts commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed").
Hitting ENOSPC during dev-replace is still much better than data
corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Not concretely confirmed, mostly a guess, thus the RFC tag.

My first guess was a race at the dev-replace starting point, but the
related code turned out to be quite safe.
---
 fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

Comments

Filipe Manana Jan. 22, 2020, 10:05 a.m. UTC | #1
On Wed, Jan 22, 2020 at 8:37 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> btrfs/06[45] btrfs/071 could fail by finding csum error.
> The reproducibility is not high, around 1/20~1/100, needs to run them in
> loops.
>
> And the profile doesn't make much difference, SINGLE/SINGLE can also
> reproduce the problem.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Device replace reuses scrub code to iterate existing extents.
>
> It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
> that scrub read can write the verified data to target device.
>
> Device replace also utilizes "write duplication" to write new data to
> both source and target device.
>
> However those two write can conflict and may lead to data corruption:
> - Scrub writes old data from commit root
>   Both extent location and csum are fetched from commit root, which
>   is not always the up-to-date data.
>
> - Write duplication is always duplicating latest data
>
> This means there could be a race, that "write duplication" writes the
> latest data to disk, then scrub write back the old data, causing data
> corruption.

Worth mentioning this is for nocow writes only then.
Given that the test cases that fail use fsstress and don't use nocow
files or -o nodatacow, the only possible case is writes into prealloc
extents.
Write duplication writes the new data and then extent iteration writes
zeroes (or whatever is on disk) after that.

>
> In theory, this should only affects data, not metadata.
> Metadata write back only happens when committing transaction, thus it's
> always after scrub writes.

No, not only when committing a transaction.
It can also happen under memory pressure, so tree extents can be
written out earlier. In fact, if you remember the 5.2 corruption and
deadlock, the deadlock case happened precisely when writeback of the
btree inode was triggered before a transaction commit.

>
> [FIX]
> Make dev-replace to require mandatory RO for target block group.
>
> And to be extra safe, for dev-replace, wait for all exiting writes to
> finish before scrubbing the chunk.
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed").
> ENOSPC for dev-replace is still much better than data corruption.
>
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Not concretely confirmed, mostly through guess, thus it has RFC tag.

Well, it's better to confirm...
IIRC, dev-replace does not skip copies for prealloc extents; it copies
what is on disk.
If that's the case, then this explanation is correct. However, if it's
smart and skips copying prealloc extents (which would be pointless),
then the problem must have some other technical explanation.

>
> My first guess is race at the dev-replace starting point, but related
> code is in fact very safe.
> ---
>  fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..69e76a4d1258 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3472,6 +3472,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>         struct btrfs_path *path;
>         struct btrfs_fs_info *fs_info = sctx->fs_info;
>         struct btrfs_root *root = fs_info->dev_root;
> +       bool is_dev_replace = sctx->is_dev_replace;

Not needed, just use sctx->is_dev_replace like everywhere else.

Thanks.

>         u64 length;
>         u64 chunk_offset;
>         int ret = 0;
> @@ -3577,17 +3578,35 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                *
> +                * On the other hand, try our best to mark block group RO for
> +                * dev-replace case.
> +                *
> +                * Dev-replace has two types of write:
> +                * - Write duplication
> +                *   New write will be written to both target and source device
> +                *   The content is always the *newest* data.
> +                * - Scrub write for dev-replace
> +                *   Scrub will write the verified data for dev-replace.
> +                *   The data and its csum are all from *commit* root, which
> +                *   is not the newest version.
> +                *
> +                * If scrub write happens after write duplication, we would
> +                * cause data corruption.
> +                * So we need to try our best to mark block group RO, and exit
> +                * out if we don't have enough space.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> +               ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
>                 scrub_pause_off(fs_info);
>
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3605,6 +3624,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                 dev_replace->item_needs_writeback = 1;
>                 up_write(&dev_replace->rwsem);
>
> +               /*
> +                * Also wait for any existing writes to prevent a race between
> +                * write duplication and scrub writes.
> +                */
> +               if (is_dev_replace) {
> +                       btrfs_wait_block_group_reservations(cache);
> +                       btrfs_wait_nocow_writers(cache);
> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX,
> +                                       cache->start, cache->length);
> +               }
>                 ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
>                                   found_key.offset, cache);
>
> --
> 2.25.0
>
Qu Wenruo Jan. 22, 2020, 10:40 a.m. UTC | #2
On 2020/1/22 at 6:05 PM, Filipe Manana wrote:
> On Wed, Jan 22, 2020 at 8:37 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> btrfs/06[45] btrfs/071 could fail by finding csum error.
>> The reproducibility is not high, around 1/20~1/100, needs to run them in
>> loops.
>>
>> And the profile doesn't make much difference, SINGLE/SINGLE can also
>> reproduce the problem.
>>
>> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
>> check free space before marking a block group RO")
>>
>> [CAUSE]
>> Device replace reuses scrub code to iterate existing extents.
>>
>> It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
>> that scrub read can write the verified data to target device.
>>
>> Device replace also utilizes "write duplication" to write new data to
>> both source and target device.
>>
>> However those two write can conflict and may lead to data corruption:
>> - Scrub writes old data from commit root
>>   Both extent location and csum are fetched from commit root, which
>>   is not always the up-to-date data.
>>
>> - Write duplication is always duplicating latest data
>>
>> This means there could be a race, that "write duplication" writes the
>> latest data to disk, then scrub write back the old data, causing data
>> corruption.
> 
> Worth mentioning this is for nocow writes only then.
> Given that the test cases that fail use fsstress and don't use nocow
> files or -o nodatacow, the only possible case is writes into prealloc
> extents.
> Write duplication writes the new data and then extent iteration writes
> zeroes (or whatever is on disk) after that.

Thank you very much for mentioning prealloc extents, that's exactly
the missing piece!

My original assumption in fact had a hole: extents in the commit tree
won't get re-allocated, as they get pinned down and won't be re-used
until the next transaction.
So that explanation only works for the nodatacow case, and I could not
find a good explanation until now.

And if it's a prealloc extent, then it's indeed a different story.

> 
>>
>> In theory, this should only affects data, not metadata.
>> Metadata write back only happens when committing transaction, thus it's
>> always after scrub writes.
> 
> No, not only when committing transaction.
> It can happen under memory pressure, tree extents can be written
> before. In fact, if you remember the 5.2 corruption and deadlock, the
> deadlock case happened precisely when writeback of the btree inode was
> triggered before a transaction commit.
> 
>>
>> [FIX]
>> Make dev-replace to require mandatory RO for target block group.
>>
>> And to be extra safe, for dev-replace, wait for all exiting writes to
>> finish before scrubbing the chunk.
>>
>> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
>> when set_block_ro failed").
>> ENOSPC for dev-replace is still much better than data corruption.
>>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
>> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Not concretely confirmed, mostly through guess, thus it has RFC tag.
> 
> Well, it's better to confirm...
> IIRC, correctly, dev-replace does not skip copies for prealloc
> extents, it copies what is on disk.

That's true, it doesn't do a backref walk to determine whether an
extent is preallocated or regular.
It just gathers csums, copies pages from disk, verifies them if there
is a csum, then copies the pages back.

So prealloc indeed looks like a very valid cause, and it can be
verified just by disabling prealloc in fsstress.

Thanks again for pointing out the missing piece.
Qu

> If that's the case, then this is correct. However if it's smart and
> skips copying prealloc extents (which is pointless), then the problem
> must have other technical explanation.
> 
>>
>> My first guess is race at the dev-replace starting point, but related
>> code is in fact very safe.
>> ---
>>  fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
>>  1 file changed, 32 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 21de630b0730..69e76a4d1258 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3472,6 +3472,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>         struct btrfs_path *path;
>>         struct btrfs_fs_info *fs_info = sctx->fs_info;
>>         struct btrfs_root *root = fs_info->dev_root;
>> +       bool is_dev_replace = sctx->is_dev_replace;
> 
> Not needed, just use sctx->is_dev_replace like everywhere else.
> 
> Thanks.
> 
>>         u64 length;
>>         u64 chunk_offset;
>>         int ret = 0;
>> @@ -3577,17 +3578,35 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>>                  * thread can't be triggered fast enough, and use up all space
>>                  * of btrfs_super_block::sys_chunk_array
>> +                *
>> +                *
>> +                * On the other hand, try our best to mark block group RO for
>> +                * dev-replace case.
>> +                *
>> +                * Dev-replace has two types of write:
>> +                * - Write duplication
>> +                *   New write will be written to both target and source device
>> +                *   The content is always the *newest* data.
>> +                * - Scrub write for dev-replace
>> +                *   Scrub will write the verified data for dev-replace.
>> +                *   The data and its csum are all from *commit* root, which
>> +                *   is not the newest version.
>> +                *
>> +                * If scrub write happens after write duplication, we would
>> +                * cause data corruption.
>> +                * So we need to try our best to mark block group RO, and exit
>> +                * out if we don't have enough space.
>>                  */
>> -               ret = btrfs_inc_block_group_ro(cache, false);
>> +               ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
>>                 scrub_pause_off(fs_info);
>>
>>                 if (ret == 0) {
>>                         ro_set = 1;
>> -               } else if (ret == -ENOSPC) {
>> +               } else if (ret == -ENOSPC && !is_dev_replace) {
>>                         /*
>>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>>                          * failed in creating new chunk for metadata.
>> -                        * It is not a problem for scrub/replace, because
>> +                        * It is not a problem for scrub, because
>>                          * metadata are always cowed, and our scrub paused
>>                          * commit_transactions.
>>                          */
>> @@ -3605,6 +3624,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>                 dev_replace->item_needs_writeback = 1;
>>                 up_write(&dev_replace->rwsem);
>>
>> +               /*
> >> +                * Also wait for any existing writes to prevent a race between
>> +                * write duplication and scrub writes.
>> +                */
>> +               if (is_dev_replace) {
>> +                       btrfs_wait_block_group_reservations(cache);
>> +                       btrfs_wait_nocow_writers(cache);
>> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX,
>> +                                       cache->start, cache->length);
>> +               }
>>                 ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
>>                                   found_key.offset, cache);
>>
>> --
>> 2.25.0
>>
> 
>
Filipe Manana Jan. 22, 2020, 11:55 a.m. UTC | #3
On Wed, Jan 22, 2020 at 10:40 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/1/22 at 6:05 PM, Filipe Manana wrote:
> > On Wed, Jan 22, 2020 at 8:37 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> btrfs/06[45] btrfs/071 could fail by finding csum error.
> >> The reproducibility is not high, around 1/20~1/100, needs to run them in
> >> loops.
> >>
> >> And the profile doesn't make much difference, SINGLE/SINGLE can also
> >> reproduce the problem.
> >>
> >> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> >> check free space before marking a block group RO")
> >>
> >> [CAUSE]
> >> Device replace reuses scrub code to iterate existing extents.
> >>
> >> It adds scrub_write_block_to_dev_replace() to scrub_block_complete(), so
> >> that scrub read can write the verified data to target device.
> >>
> >> Device replace also utilizes "write duplication" to write new data to
> >> both source and target device.
> >>
> >> However those two write can conflict and may lead to data corruption:
> >> - Scrub writes old data from commit root
> >>   Both extent location and csum are fetched from commit root, which
> >>   is not always the up-to-date data.
> >>
> >> - Write duplication is always duplicating latest data
> >>
> >> This means there could be a race, that "write duplication" writes the
> >> latest data to disk, then scrub write back the old data, causing data
> >> corruption.
> >
> > Worth mentioning this is for nocow writes only then.
> > Given that the test cases that fail use fsstress and don't use nocow
> > files or -o nodatacow, the only possible case is writes into prealloc
> > extents.
> > Write duplication writes the new data and then extent iteration writes
> > zeroes (or whatever is on disk) after that.
>
> Thank you very much for the mentioning of prealloc extents, that's
> exactly the missing piece!
>
> My original assumption in fact has a hole, extents in commit tree won't
> get re-allocated as they will get pinned down, and until next trans
> won't be re-used.
> So the explaination should only work for nodatacow case, and I could not
> find a good explanation until now.
>
> And if it's prealloc extent, then it's indeed a different story.

Just mention in the changelog and the comment that setting the block
group RO and waiting for ongoing writes is only needed because of nocow
writes (which include both files with the nocow bit set and writes into
prealloc extents).

>
> >
> >>
> >> In theory, this should only affects data, not metadata.
> >> Metadata write back only happens when committing transaction, thus it's
> >> always after scrub writes.
> >
> > No, not only when committing transaction.
> > It can happen under memory pressure, tree extents can be written
> > before. In fact, if you remember the 5.2 corruption and deadlock, the
> > deadlock case happened precisely when writeback of the btree inode was
> > triggered before a transaction commit.
> >
> >>
> >> [FIX]
> >> Make dev-replace to require mandatory RO for target block group.
> >>
> >> And to be extra safe, for dev-replace, wait for all exiting writes to
> >> finish before scrubbing the chunk.
> >>
> >> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> >> when set_block_ro failed").
> >> ENOSPC for dev-replace is still much better than data corruption.
> >>
> >> Reported-by: Filipe Manana <fdmanana@suse.com>
> >> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> >> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Not concretely confirmed, mostly through guess, thus it has RFC tag.
> >
> > Well, it's better to confirm...
> > IIRC, correctly, dev-replace does not skip copies for prealloc
> > extents, it copies what is on disk.
>
> That's true, it doesn't do backref walk to determine if it's
> preallocated or regular.
> It just gather csum, copy pages from disk, verify if there is csum, then
> copy the pages back.

Yep, I didn't have the code in front of me when I replied, but I didn't
remember dev-replace/scrub checking extent types.

>
> So prealloc indeed looks like a very valid cause, and it can be verified
> just by disabling prealloc in fsstress.

Yes, disable fallocate and zero range operations in fsstress.
If it passes without this patch for thousands of iterations, then that
was the cause.

Thanks!

>
> Thanks again for pointing out the missing piece.
> Qu
>
> > If that's the case, then this is correct. However if it's smart and
> > skips copying prealloc extents (which is pointless), then the problem
> > must have other technical explanation.
> >
> >>
> >> My first guess is race at the dev-replace starting point, but related
> >> code is in fact very safe.
> >> ---
> >>  fs/btrfs/scrub.c | 35 ++++++++++++++++++++++++++++++++---
> >>  1 file changed, 32 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 21de630b0730..69e76a4d1258 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3472,6 +3472,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>         struct btrfs_path *path;
> >>         struct btrfs_fs_info *fs_info = sctx->fs_info;
> >>         struct btrfs_root *root = fs_info->dev_root;
> >> +       bool is_dev_replace = sctx->is_dev_replace;
> >
> > Not needed, just use sctx->is_dev_replace like everywhere else.
> >
> > Thanks.
> >
> >>         u64 length;
> >>         u64 chunk_offset;
> >>         int ret = 0;
> >> @@ -3577,17 +3578,35 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                  * This can easily boost the amount of SYSTEM chunks if cleaner
> >>                  * thread can't be triggered fast enough, and use up all space
> >>                  * of btrfs_super_block::sys_chunk_array
> >> +                *
> >> +                *
> >> +                * On the other hand, try our best to mark block group RO for
> >> +                * dev-replace case.
> >> +                *
> >> +                * Dev-replace has two types of write:
> >> +                * - Write duplication
> >> +                *   New write will be written to both target and source device
> >> +                *   The content is always the *newest* data.
> >> +                * - Scrub write for dev-replace
> >> +                *   Scrub will write the verified data for dev-replace.
> >> +                *   The data and its csum are all from *commit* root, which
> >> +                *   is not the newest version.
> >> +                *
> >> +                * If scrub write happens after write duplication, we would
> >> +                * cause data corruption.
> >> +                * So we need to try our best to mark block group RO, and exit
> >> +                * out if we don't have enough space.
> >>                  */
> >> -               ret = btrfs_inc_block_group_ro(cache, false);
> >> +               ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
> >>                 scrub_pause_off(fs_info);
> >>
> >>                 if (ret == 0) {
> >>                         ro_set = 1;
> >> -               } else if (ret == -ENOSPC) {
> >> +               } else if (ret == -ENOSPC && !is_dev_replace) {
> >>                         /*
> >>                          * btrfs_inc_block_group_ro return -ENOSPC when it
> >>                          * failed in creating new chunk for metadata.
> >> -                        * It is not a problem for scrub/replace, because
> >> +                        * It is not a problem for scrub, because
> >>                          * metadata are always cowed, and our scrub paused
> >>                          * commit_transactions.
> >>                          */
> >> @@ -3605,6 +3624,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >>                 dev_replace->item_needs_writeback = 1;
> >>                 up_write(&dev_replace->rwsem);
> >>
> >> +               /*
> >> +                * Also wait for any existing writes to prevent a race between
> >> +                * write duplication and scrub writes.
> >> +                */
> >> +               if (is_dev_replace) {
> >> +                       btrfs_wait_block_group_reservations(cache);
> >> +                       btrfs_wait_nocow_writers(cache);
> >> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX,
> >> +                                       cache->start, cache->length);
> >> +               }
> >>                 ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
> >>                                   found_key.offset, cache);
> >>
> >> --
> >> 2.25.0
> >>
> >
> >
>

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..69e76a4d1258 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3472,6 +3472,7 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	struct btrfs_path *path;
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct btrfs_root *root = fs_info->dev_root;
+	bool is_dev_replace = sctx->is_dev_replace;
 	u64 length;
 	u64 chunk_offset;
 	int ret = 0;
@@ -3577,17 +3578,35 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * This can easily boost the amount of SYSTEM chunks if cleaner
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
+		 *
+		 *
+		 * On the other hand, try our best to mark block group RO for
+		 * dev-replace case.
+		 *
+		 * Dev-replace has two types of write:
+		 * - Write duplication
+		 *   New write will be written to both target and source device
+		 *   The content is always the *newest* data.
+		 * - Scrub write for dev-replace
+		 *   Scrub will write the verified data for dev-replace.
+		 *   The data and its csum are all from *commit* root, which
+		 *   is not the newest version.
+		 *
+		 * If scrub write happens after write duplication, we would
+		 * cause data corruption.
+		 * So we need to try our best to mark block group RO, and exit
+		 * out if we don't have enough space.
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
+		ret = btrfs_inc_block_group_ro(cache, is_dev_replace);
 		scrub_pause_off(fs_info);
 
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC) {
+		} else if (ret == -ENOSPC && !is_dev_replace) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub/replace, because
+			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
 			 */
@@ -3605,6 +3624,16 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		dev_replace->item_needs_writeback = 1;
 		up_write(&dev_replace->rwsem);
 
+		/*
+		 * Also wait for any existing writes to prevent a race between
+		 * write duplication and scrub writes.
+		 */
+		if (is_dev_replace) {
+			btrfs_wait_block_group_reservations(cache);
+			btrfs_wait_nocow_writers(cache);
+			btrfs_wait_ordered_roots(fs_info, U64_MAX,
+					cache->start, cache->length);
+		}
 		ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
 				  found_key.offset, cache);