diff mbox series

btrfs: scrub: Require mandatory block group RO for dev-replace

Message ID 20200123235820.20764-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: scrub: Require mandatory block group RO for dev-replace | expand

Commit Message

Qu Wenruo Jan. 23, 2020, 11:58 p.m. UTC
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.

The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.

The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Dev-replace has two source of writes:
- Write duplication
  All writes to source device will also be duplicated to target device.

  Content:	Not yet persisted data/meta

- Scrub copy
  Dev-replace reused scrub code to iterate through existing extents, and
  copy the verified data to target device.

  Content:	Previously persisted data and metadata

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content are mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will over write the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous trans
(in commit root).

This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect resvsp ioctl will fallback to fallocate in VFS...

[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().

This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.

The side effect is, dev-replace can be more strict on avaialble space, but
definitely worthy to avoid data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
RFC->v1:
- Remove the RFC tag
  Since the cause is pinned and verified, no need for RFC.

- Only wait for nocow writes for dev-replace
  CoW writes are safe as they will never overwrite extents in commit
  root.

- Put the wait call into proper lock context
  Previous wait happens after scrub_pause_off(), which can cause
  deadlock where we may need to commit transaction in one of the
  wait calls. But since we are in scrub_pause_off() environment,
  transaction commit will wait us to continue, causing a wait-on-self
  deadlock.

v2:
- Add btrfs_wait_ordered_roots() call before scrub_chunk().
- Commit message change to avoid confusion.
---
 fs/btrfs/scrub.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

Comments

Filipe Manana Jan. 24, 2020, 9:24 a.m. UTC | #1
On Fri, Jan 24, 2020 at 12:07 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
> looped runs can lead to random failure, where scrub finds csum error.
>
> The possibility is not high, around 1/20 to 1/100, but it's causing data
> corruption.
>
> The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
> check free space before marking a block group RO")
>
> [CAUSE]
> Dev-replace has two source of writes:
> - Write duplication
>   All writes to source device will also be duplicated to target device.
>
>   Content:      Not yet persisted data/meta
>
> - Scrub copy
>   Dev-replace reused scrub code to iterate through existing extents, and
>   copy the verified data to target device.
>
>   Content:      Previously persisted data and metadata
>
> The difference in contents makes the following race possible:
>         Regular Writer          |       Dev-replace
> -----------------------------------------------------------------
>   ^                             |
>   | Preallocate one data extent |
>   | at bytenr X, len 1M         |
>   v                             |
>   ^ Commit transaction          |
>   | Now extent [X, X+1M) is in  |
>   v commit root                 |
>  ================== Dev replace starts =========================
>                                 | ^
>                                 | | Scrub extent [X, X+1M)
>                                 | | Read [X, X+1M)
>                                 | | (The content are mostly garbage
>                                 | |  since it's preallocated)
>   ^                             | v
>   | Write back happens for      |
>   | extent [X, X+512K)          |
>   | New data writes to both     |
>   | source and target dev.      |
>   v                             |
>                                 | ^
>                                 | | Scrub writes back extent [X, X+1M)
>                                 | | to target device.
>                                 | | This will over write the new data in
>                                 | | [X, X+512K)
>                                 | v
>
> This race can only happen for nocow writes. Thus metadata and data cow
> writes are safe, as COW will never overwrite extents of previous trans
> (in commit root).
>
> This behavior can be confirmed by disabling all fallocate related calls
> in fsstress (*), then all related tests can pass a 2000 run loop.
>
> *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
>                    -f collapse=0 -f punch=0 -f resvsp=0"
>    I didn't expect resvsp ioctl will fallback to fallocate in VFS...
>
> [FIX]
> Make dev-replace to require mandatory block group RO, and wait for current
> nocow writes before calling scrub_chunk().
>
> This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
> when set_block_ro failed") for dev-replace path.
>
> The side effect is, dev-replace can be more strict on avaialble space, but
> definitely worthy to avoid data corruption.
>
> Reported-by: Filipe Manana <fdmanana@suse.com>
> Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Now it looks good, thanks!

> ---
> Changelog:
> RFC->v1:
> - Remove the RFC tag
>   Since the cause is pinned and verified, no need for RFC.
>
> - Only wait for nocow writes for dev-replace
>   CoW writes are safe as they will never overwrite extents in commit
>   root.
>
> - Put the wait call into proper lock context
>   Previous wait happens after scrub_pause_off(), which can cause
>   deadlock where we may need to commit transaction in one of the
>   wait calls. But since we are in scrub_pause_off() environment,
>   transaction commit will wait us to continue, causing a wait-on-self
>   deadlock.
>
> v2:
> - Add btrfs_wait_ordered_roots() call before scrub_chunk().
> - Commit message change to avoid confusion.
> ---
>  fs/btrfs/scrub.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 21de630b0730..fd266a2d15ec 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                  * This can easily boost the amount of SYSTEM chunks if cleaner
>                  * thread can't be triggered fast enough, and use up all space
>                  * of btrfs_super_block::sys_chunk_array
> +                *
> +                * While for dev replace, we need to try our best to mark block
> +                * group RO, to prevent race between:
> +                * - Write duplication
> +                *   Contains latest data
> +                * - Scrub copy
> +                *   Contains data from commit tree
> +                *
> +                * If target block group is not marked RO, nocow writes can
> +                * be overwritten by scrub copy, causing data corruption.
> +                * So for dev-replace, it's not allowed to continue if a block
> +                * group is not RO.
>                  */
> -               ret = btrfs_inc_block_group_ro(cache, false);
> -               scrub_pause_off(fs_info);
> -
> +               ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
>                 if (ret == 0) {
>                         ro_set = 1;
> -               } else if (ret == -ENOSPC) {
> +               } else if (ret == -ENOSPC && !sctx->is_dev_replace) {
>                         /*
>                          * btrfs_inc_block_group_ro return -ENOSPC when it
>                          * failed in creating new chunk for metadata.
> -                        * It is not a problem for scrub/replace, because
> +                        * It is not a problem for scrub, because
>                          * metadata are always cowed, and our scrub paused
>                          * commit_transactions.
>                          */
> @@ -3596,9 +3606,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                         btrfs_warn(fs_info,
>                                    "failed setting block group ro: %d", ret);
>                         btrfs_put_block_group(cache);
> +                       scrub_pause_off(fs_info);
>                         break;
>                 }
>
> +               /*
> +                * Now the target block is marked RO, wait for nocow writes to
> +                * finish before dev-replace.
> +                * COW is fine, as COW never overwrites extents in commit tree.
> +                */
> +               if (sctx->is_dev_replace) {
> +                       btrfs_wait_nocow_writers(cache);
> +                       btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
> +                                       cache->length);
> +               }
> +
> +               scrub_pause_off(fs_info);
>                 down_write(&dev_replace->rwsem);
>                 dev_replace->cursor_right = found_key.offset + length;
>                 dev_replace->cursor_left = found_key.offset;
> --
> 2.25.0
>
David Sterba Jan. 24, 2020, 1:40 p.m. UTC | #2
On Fri, Jan 24, 2020 at 09:24:50AM +0000, Filipe Manana wrote:
> On Fri, Jan 24, 2020 at 12:07 AM Qu Wenruo <wqu@suse.com> wrote:
> > Reported-by: Filipe Manana <fdmanana@suse.com>
> > Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
> > Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Now it looks good, thanks!

Thank you both, patch on the way to 5.5-rc8 or 5.6 final.
David Sterba Jan. 24, 2020, 2:44 p.m. UTC | #3
On current master branch (4703d9119972bf58) with this patch btrfs/011
prints a warning from this code:

 502         ret = btrfs_dev_replace_finishing(fs_info, ret);
 503         if (ret == -EINPROGRESS) {
 504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
 505         } else if (ret != -ECANCELED) {
 506                 WARN_ON(ret);
 507         }

with ret == -ENOSPC

The purpose seems to be to catch generic error codes other than
EINPROGRESS and ECNACELED, I don't see much point printing a warning in
that case. But it' a new ENOSPC problem, likely caused by the
read-only status switching.

My test devices are 12G, there's full log of the test to see at which
phase it happened.

[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
[ 1892.059993] ------------[ cut here ]------------
[ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
[ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
[ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
[ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
[ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
[ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
[ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
[ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
[ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
[ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
[ 1892.115814] Call Trace:
[ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
[ 1892.120633]  ? __lock_acquire+0x272/0x1320
[ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
[ 1892.123629]  ? do_sigaction+0xf8/0x240
[ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
[ 1892.126418]  ? sched_clock+0x5/0x10
[ 1892.127534]  ? sched_clock_cpu+0x10/0x120
[ 1892.129096]  ? do_sigaction+0xf8/0x240
[ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
[ 1892.131818]  do_vfs_ioctl+0x56e/0x770
[ 1892.133012]  ? do_sigaction+0xf8/0x240
[ 1892.134228]  ksys_ioctl+0x3a/0x70
[ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1892.137148]  __x64_sys_ioctl+0x16/0x20
[ 1892.138702]  do_syscall_64+0x50/0x210
[ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1892.141911] RIP: 0033:0x7fcaed61e387
[ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
[ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
[ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
[ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
[ 1892.162736] irq event stamp: 244626
[ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
[ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
[ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
[ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
[failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
    --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
    +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
    @@ -1,3 +1,4 @@
     QA output created by 011
     *** test btrfs replace
    -*** done
    +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
    +(see /tmp/fstests/results//btrfs/011.full for details)
    ...
    (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)

btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
[  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
[  505.740305] BTRFS info (device vdb): turning on discard
[  505.742954] BTRFS info (device vdb): disk space caching is enabled
[  505.745007] BTRFS info (device vdb): has skinny extents
[  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
[  505.755093] BTRFS info (device vdb): checking UUID tree
[  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  526.798086] BTRFS info (device vdb): scrub: started on devid 1
[  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  528.393736] BTRFS info (device vdc): turning on discard
[  528.396009] BTRFS info (device vdc): disk space caching is enabled
[  528.398144] BTRFS info (device vdc): has skinny extents
[  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
[  528.626562] BTRFS info (device vdb): turning on discard
[  528.629852] BTRFS info (device vdb): disk space caching is enabled
[  528.633442] BTRFS info (device vdb): has skinny extents
[  528.640914] BTRFS info (device vdb): checking UUID tree
[  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  557.620409] BTRFS info (device vdb): scrub: started on devid 1
[  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  559.646237] BTRFS info (device vdc): turning on discard
[  559.649237] BTRFS info (device vdc): disk space caching is enabled
[  559.652122] BTRFS info (device vdc): has skinny extents
[  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
[  561.103870] BTRFS info (device vdb): turning on discard
[  561.106896] BTRFS info (device vdb): disk space caching is enabled
[  561.110432] BTRFS info (device vdb): has skinny extents
[  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
[  561.121762] BTRFS info (device vdb): checking UUID tree
[  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  584.252317] BTRFS info (device vdb): scrub: started on devid 1
[  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  586.225905] BTRFS info (device vdc): turning on discard
[  586.229054] BTRFS info (device vdc): disk space caching is enabled
[  586.231819] BTRFS info (device vdc): has skinny extents
[  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
[  586.628325] BTRFS info (device vdb): turning on discard
[  586.630954] BTRFS info (device vdb): disk space caching is enabled
[  586.634146] BTRFS info (device vdb): has skinny extents
[  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
[  586.645020] BTRFS info (device vdb): checking UUID tree
[  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
[  646.226962] BTRFS info (device vdb): scrub: started on devid 1
[  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  654.148332] BTRFS info (device vdb): turning on discard
[  654.152563] BTRFS info (device vdb): disk space caching is enabled
[  654.157499] BTRFS info (device vdb): has skinny extents
[  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
[  655.341656] BTRFS info (device vdb): turning on discard
[  655.344075] BTRFS info (device vdb): disk space caching is enabled
[  655.346723] BTRFS info (device vdb): has skinny extents
[  655.355368] BTRFS info (device vdb): checking UUID tree
[  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
[  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
[  707.940962] BTRFS info (device vdb): scrub: started on devid 1
[  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  719.263299] BTRFS info (device vdc): turning on discard
[  719.266339] BTRFS info (device vdc): disk space caching is enabled
[  719.269694] BTRFS info (device vdc): has skinny extents
[  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
[  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
[  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
[  722.850081] BTRFS info (device vdb): turning on discard
[  722.853321] BTRFS info (device vdb): disk space caching is enabled
[  722.855710] BTRFS info (device vdb): has skinny extents
[  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
[  722.866879] BTRFS info (device vdb): checking UUID tree
[  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
[  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
[  740.017266] BTRFS info (device vdb): scrub: started on devid 1
[  740.017277] BTRFS info (device vdb): scrub: started on devid 2
[  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
[  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[  741.234158] BTRFS info (device vdd): turning on discard
[  741.237294] BTRFS info (device vdd): disk space caching is enabled
[  741.240088] BTRFS info (device vdd): has skinny extents
[  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
[  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
[  741.517997] BTRFS info (device vdb): turning on discard
[  741.522388] BTRFS info (device vdb): disk space caching is enabled
[  741.527084] BTRFS info (device vdb): has skinny extents
[  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
[  741.536809] BTRFS info (device vdb): checking UUID tree
[  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
[ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
[ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
[ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
[ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
[ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
[ 1869.454605] BTRFS info (device vdd): turning on discard
[ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
[ 1869.461062] BTRFS info (device vdd): has skinny extents
[ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
(stacktrace)
Filipe Manana Jan. 24, 2020, 4:28 p.m. UTC | #4
On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>
> On current master branch (4703d9119972bf58) with this patch btrfs/011
> prints a warning from this code:
>
>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>  503         if (ret == -EINPROGRESS) {
>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>  505         } else if (ret != -ECANCELED) {
>  506                 WARN_ON(ret);
>  507         }
>
> with ret == -ENOSPC
>
> The purpose seems to be to catch generic error codes other than
> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
> that case. But it' a new ENOSPC problem, likely caused by the
> read-only status switching.
>
> My test devices are 12G, there's full log of the test to see at which
> phase it happened.

It passes for me on 20G devices, haven't tested with 12G however
(can't afford to reboot any of my vms now).

I think that happens because before this patch we ignored ENOSPC
errors, when trying to set a block group to RO, for device replace and
scrub.
But with this patch we don't ignore ENOSPC for device replace anymore
- this is actually correct because if we ignore we can corrupt nocow
writes (including writes into prealloc extents).

Now if it's a real ENOSPC situation or just a bug in the space
management code, it's a different thing to look at.

Thanks.

>
> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
> [ 1892.059993] ------------[ cut here ]------------
> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
> [ 1892.115814] Call Trace:
> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
> [ 1892.123629]  ? do_sigaction+0xf8/0x240
> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
> [ 1892.126418]  ? sched_clock+0x5/0x10
> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
> [ 1892.129096]  ? do_sigaction+0xf8/0x240
> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
> [ 1892.133012]  ? do_sigaction+0xf8/0x240
> [ 1892.134228]  ksys_ioctl+0x3a/0x70
> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
> [ 1892.138702]  do_syscall_64+0x50/0x210
> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 1892.141911] RIP: 0033:0x7fcaed61e387
> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
> [ 1892.162736] irq event stamp: 244626
> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>     @@ -1,3 +1,4 @@
>      QA output created by 011
>      *** test btrfs replace
>     -*** done
>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>     +(see /tmp/fstests/results//btrfs/011.full for details)
>     ...
>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>
> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
> [  505.740305] BTRFS info (device vdb): turning on discard
> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
> [  505.745007] BTRFS info (device vdb): has skinny extents
> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
> [  505.755093] BTRFS info (device vdb): checking UUID tree
> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  528.393736] BTRFS info (device vdc): turning on discard
> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
> [  528.398144] BTRFS info (device vdc): has skinny extents
> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
> [  528.626562] BTRFS info (device vdb): turning on discard
> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
> [  528.633442] BTRFS info (device vdb): has skinny extents
> [  528.640914] BTRFS info (device vdb): checking UUID tree
> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  559.646237] BTRFS info (device vdc): turning on discard
> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
> [  559.652122] BTRFS info (device vdc): has skinny extents
> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
> [  561.103870] BTRFS info (device vdb): turning on discard
> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
> [  561.110432] BTRFS info (device vdb): has skinny extents
> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
> [  561.121762] BTRFS info (device vdb): checking UUID tree
> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  586.225905] BTRFS info (device vdc): turning on discard
> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
> [  586.231819] BTRFS info (device vdc): has skinny extents
> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
> [  586.628325] BTRFS info (device vdb): turning on discard
> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
> [  586.634146] BTRFS info (device vdb): has skinny extents
> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
> [  586.645020] BTRFS info (device vdb): checking UUID tree
> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  654.148332] BTRFS info (device vdb): turning on discard
> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
> [  654.157499] BTRFS info (device vdb): has skinny extents
> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
> [  655.341656] BTRFS info (device vdb): turning on discard
> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
> [  655.346723] BTRFS info (device vdb): has skinny extents
> [  655.355368] BTRFS info (device vdb): checking UUID tree
> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  719.263299] BTRFS info (device vdc): turning on discard
> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
> [  719.269694] BTRFS info (device vdc): has skinny extents
> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
> [  722.850081] BTRFS info (device vdb): turning on discard
> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
> [  722.855710] BTRFS info (device vdb): has skinny extents
> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
> [  722.866879] BTRFS info (device vdb): checking UUID tree
> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [  741.234158] BTRFS info (device vdd): turning on discard
> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
> [  741.240088] BTRFS info (device vdd): has skinny extents
> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
> [  741.517997] BTRFS info (device vdb): turning on discard
> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
> [  741.527084] BTRFS info (device vdb): has skinny extents
> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
> [  741.536809] BTRFS info (device vdb): checking UUID tree
> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
> [ 1869.454605] BTRFS info (device vdd): turning on discard
> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
> [ 1869.461062] BTRFS info (device vdd): has skinny extents
> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
> (stacktrace)
Qu Wenruo Jan. 25, 2020, 12:36 a.m. UTC | #5
On 2020/1/25 上午12:28, Filipe Manana wrote:
> On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On current master branch (4703d9119972bf58) with this patch btrfs/011
>> prints a warning from this code:
>>
>>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>>  503         if (ret == -EINPROGRESS) {
>>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>>  505         } else if (ret != -ECANCELED) {
>>  506                 WARN_ON(ret);
>>  507         }
>>
>> with ret == -ENOSPC
>>
>> The purpose seems to be to catch generic error codes other than
>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>> that case. But it' a new ENOSPC problem, likely caused by the
>> read-only status switching.
>>
>> My test devices are 12G, there's full log of the test to see at which
>> phase it happened.
> 
> It passes for me on 20G devices, haven't tested with 12G however
> (can't afford to reboot any of my vms now).

5G for all scratch devices, and failed to reproduce it.
(The full run before submitting the patch also failed to reproduce it)

> 
> I think that happens because before this patch we ignored ENOSPC
> errors, when trying to set a block group to RO, for device replace and
> scrub.
> But with this patch we don't ignore ENOSPC for device replace anymore
> - this is actually correct because if we ignore we can corrupt nocow
> writes (including writes into prealloc extents).
> 
> Now if it's a real ENOSPC situation or just a bug in the space
> management code, it's a different thing to look at.

I tend to take a middle land of the problem.

For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.

But on the other handle, the full RO of a block group for scrub/replace
is also a little overkilled.
Just as Filipe mentioned, we only want to kill nocow writes into a block
group, but still allow COW writes.

It looks like something like mark_block_group_nocow_ro() in the future
could reduce the possibility if not fully kill it.


It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
be needed for this patch as a quick fix.

Thanks,
Qu

> 
> Thanks.
> 
>>
>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>> [ 1892.059993] ------------[ cut here ]------------
>> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
>> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
>> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
>> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
>> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
>> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
>> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
>> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
>> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
>> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
>> [ 1892.115814] Call Trace:
>> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
>> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
>> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
>> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
>> [ 1892.123629]  ? do_sigaction+0xf8/0x240
>> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
>> [ 1892.126418]  ? sched_clock+0x5/0x10
>> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
>> [ 1892.129096]  ? do_sigaction+0xf8/0x240
>> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
>> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
>> [ 1892.133012]  ? do_sigaction+0xf8/0x240
>> [ 1892.134228]  ksys_ioctl+0x3a/0x70
>> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
>> [ 1892.138702]  do_syscall_64+0x50/0x210
>> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 1892.141911] RIP: 0033:0x7fcaed61e387
>> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
>> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
>> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
>> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
>> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
>> [ 1892.162736] irq event stamp: 244626
>> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
>> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
>> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
>> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
>> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
>> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>>     @@ -1,3 +1,4 @@
>>      QA output created by 011
>>      *** test btrfs replace
>>     -*** done
>>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>>     +(see /tmp/fstests/results//btrfs/011.full for details)
>>     ...
>>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>>
>> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
>> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
>> [  505.740305] BTRFS info (device vdb): turning on discard
>> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
>> [  505.745007] BTRFS info (device vdb): has skinny extents
>> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  505.755093] BTRFS info (device vdb): checking UUID tree
>> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
>> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  528.393736] BTRFS info (device vdc): turning on discard
>> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
>> [  528.398144] BTRFS info (device vdc): has skinny extents
>> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
>> [  528.626562] BTRFS info (device vdb): turning on discard
>> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
>> [  528.633442] BTRFS info (device vdb): has skinny extents
>> [  528.640914] BTRFS info (device vdb): checking UUID tree
>> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
>> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  559.646237] BTRFS info (device vdc): turning on discard
>> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
>> [  559.652122] BTRFS info (device vdc): has skinny extents
>> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
>> [  561.103870] BTRFS info (device vdb): turning on discard
>> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
>> [  561.110432] BTRFS info (device vdb): has skinny extents
>> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  561.121762] BTRFS info (device vdb): checking UUID tree
>> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
>> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  586.225905] BTRFS info (device vdc): turning on discard
>> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
>> [  586.231819] BTRFS info (device vdc): has skinny extents
>> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
>> [  586.628325] BTRFS info (device vdb): turning on discard
>> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
>> [  586.634146] BTRFS info (device vdb): has skinny extents
>> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  586.645020] BTRFS info (device vdb): checking UUID tree
>> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
>> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
>> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  654.148332] BTRFS info (device vdb): turning on discard
>> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
>> [  654.157499] BTRFS info (device vdb): has skinny extents
>> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
>> [  655.341656] BTRFS info (device vdb): turning on discard
>> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
>> [  655.346723] BTRFS info (device vdb): has skinny extents
>> [  655.355368] BTRFS info (device vdb): checking UUID tree
>> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
>> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  719.263299] BTRFS info (device vdc): turning on discard
>> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
>> [  719.269694] BTRFS info (device vdc): has skinny extents
>> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>> [  722.850081] BTRFS info (device vdb): turning on discard
>> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
>> [  722.855710] BTRFS info (device vdb): has skinny extents
>> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  722.866879] BTRFS info (device vdb): checking UUID tree
>> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
>> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
>> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [  741.234158] BTRFS info (device vdd): turning on discard
>> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
>> [  741.240088] BTRFS info (device vdd): has skinny extents
>> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
>> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
>> [  741.517997] BTRFS info (device vdb): turning on discard
>> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
>> [  741.527084] BTRFS info (device vdb): has skinny extents
>> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  741.536809] BTRFS info (device vdb): checking UUID tree
>> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
>> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
>> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>> [ 1869.454605] BTRFS info (device vdd): turning on discard
>> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
>> [ 1869.461062] BTRFS info (device vdd): has skinny extents
>> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>> (stacktrace)
> 
> 
>
Qu Wenruo Jan. 25, 2020, 12:48 a.m. UTC | #6
On 2020/1/25 上午8:36, Qu Wenruo wrote:
> 
> 
> On 2020/1/25 上午12:28, Filipe Manana wrote:
>> On Fri, Jan 24, 2020 at 3:12 PM David Sterba <dsterba@suse.cz> wrote:
>>>
>>> On current master branch (4703d9119972bf58) with this patch btrfs/011
>>> prints a warning from this code:
>>>
>>>  502         ret = btrfs_dev_replace_finishing(fs_info, ret);
>>>  503         if (ret == -EINPROGRESS) {
>>>  504                 ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
>>>  505         } else if (ret != -ECANCELED) {
>>>  506                 WARN_ON(ret);
>>>  507         }
>>>
>>> with ret == -ENOSPC
>>>
>>> The purpose seems to be to catch generic error codes other than
>>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>>> that case. But it' a new ENOSPC problem, likely caused by the
>>> read-only status switching.
>>>
>>> My test devices are 12G, there's full log of the test to see at which
>>> phase it happened.
>>
>> It passes for me on 20G devices, haven't tested with 12G however
>> (can't afford to reboot any of my vms now).
> 
> 5G for all scratch devices, and failed to reproduce it.
> (The full run before submitting the patch also failed to reproduce it)
> 
>>
>> I think that happens because before this patch we ignored ENOSPC
>> errors, when trying to set a block group to RO, for device replace and
>> scrub.
>> But with this patch we don't ignore ENOSPC for device replace anymore
>> - this is actually correct because if we ignore we can corrupt nocow
>> writes (including writes into prealloc extents).
>>
>> Now if it's a real ENOSPC situation or just a bug in the space
>> management code, it's a different thing to look at.
> 
> I tend to take a middle land of the problem.
> 
> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
> 
> But on the other handle, the full RO of a block group for scrub/replace
> is also a little overkilled.
> Just as Filipe mentioned, we only want to kill nocow writes into a block
> group, but still allow COW writes.
> 
> It looks like something like mark_block_group_nocow_ro() in the future
> could reduce the possibility if not fully kill it.
> 
> 
> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
> be needed for this patch as a quick fix.

Well, it should be WARN_ON(ret && ret != -ENOSPC);

> 
> Thanks,
> Qu
> 
>>
>> Thanks.
>>
>>>
>>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>>> [ 1892.059993] ------------[ cut here ]------------
>>> [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>>> [ 1892.068628] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
>>> [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
>>> [ 1892.076559] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>>> [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
>>> [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
>>> [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
>>> [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
>>> [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
>>> [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
>>> [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
>>> [ 1892.108642] FS:  00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
>>> [ 1892.111558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
>>> [ 1892.115814] Call Trace:
>>> [ 1892.116896]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
>>> [ 1892.118962]  btrfs_ioctl+0x1d62/0x2550 [btrfs]
>>> [ 1892.120633]  ? __lock_acquire+0x272/0x1320
>>> [ 1892.122177]  ? kvm_sched_clock_read+0x14/0x30
>>> [ 1892.123629]  ? do_sigaction+0xf8/0x240
>>> [ 1892.124919]  ? kvm_sched_clock_read+0x14/0x30
>>> [ 1892.126418]  ? sched_clock+0x5/0x10
>>> [ 1892.127534]  ? sched_clock_cpu+0x10/0x120
>>> [ 1892.129096]  ? do_sigaction+0xf8/0x240
>>> [ 1892.130525]  ? do_vfs_ioctl+0x56e/0x770
>>> [ 1892.131818]  do_vfs_ioctl+0x56e/0x770
>>> [ 1892.133012]  ? do_sigaction+0xf8/0x240
>>> [ 1892.134228]  ksys_ioctl+0x3a/0x70
>>> [ 1892.135447]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 1892.137148]  __x64_sys_ioctl+0x16/0x20
>>> [ 1892.138702]  do_syscall_64+0x50/0x210
>>> [ 1892.140095]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [ 1892.141911] RIP: 0033:0x7fcaed61e387
>>> [ 1892.149247] RSP: 002b:00007ffcb47fc2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [ 1892.151839] RAX: ffffffffffffffda RBX: 00007ffcb47ff12d RCX: 00007fcaed61e387
>>> [ 1892.154060] RDX: 00007ffcb47fd160 RSI: 00000000ca289435 RDI: 0000000000000003
>>> [ 1892.156191] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
>>> [ 1892.158317] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000003
>>> [ 1892.160564] R13: 0000000000000001 R14: 0000000000000000 R15: 00005624ff1602e0
>>> [ 1892.162736] irq event stamp: 244626
>>> [ 1892.164191] hardirqs last  enabled at (244625): [<ffffffff9178860e>] _raw_spin_unlock_irqrestore+0x3e/0x50
>>> [ 1892.167531] hardirqs last disabled at (244626): [<ffffffff91002aeb>] trace_hardirqs_off_thunk+0x1a/0x1c
>>> [ 1892.170875] softirqs last  enabled at (244482): [<ffffffff91a00358>] __do_softirq+0x358/0x52b
>>> [ 1892.174005] softirqs last disabled at (244471): [<ffffffff9108ac1d>] irq_exit+0x9d/0xb0
>>> [ 1892.176555] ---[ end trace d72b653b61eb7d20 ]---
>>> [failed, exit status 1] [14:22:59]- output mismatch (see /tmp/fstests/results//btrfs/011.out.bad)
>>>     --- tests/btrfs/011.out     2018-04-12 16:57:00.608225550 +0000
>>>     +++ /tmp/fstests/results//btrfs/011.out.bad 2020-01-24 14:22:59.248000000 +0000
>>>     @@ -1,3 +1,4 @@
>>>      QA output created by 011
>>>      *** test btrfs replace
>>>     -*** done
>>>     +failed: '/sbin/btrfs replace start -Bf -r /dev/vdd /dev/vdb /tmp/scratch'
>>>     +(see /tmp/fstests/results//btrfs/011.full for details)
>>>     ...
>>>     (Run 'diff -u /tmp/fstests/tests/btrfs/011.out /tmp/fstests/results//btrfs/011.out.bad'  to see the entire diff)
>>>
>>> btrfs/011               [13:59:50][  504.570047] run fstests btrfs/011 at 2020-01-24 13:59:50
>>> [  505.710501] BTRFS: device fsid cd48459b-2332-425b-9d4e-324021eb0f2a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20824)
>>> [  505.740305] BTRFS info (device vdb): turning on discard
>>> [  505.742954] BTRFS info (device vdb): disk space caching is enabled
>>> [  505.745007] BTRFS info (device vdb): has skinny extents
>>> [  505.747096] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  505.755093] BTRFS info (device vdb): checking UUID tree
>>> [  521.548385] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  525.294200] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  526.798086] BTRFS info (device vdb): scrub: started on devid 1
>>> [  528.104425] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  528.393736] BTRFS info (device vdc): turning on discard
>>> [  528.396009] BTRFS info (device vdc): disk space caching is enabled
>>> [  528.398144] BTRFS info (device vdc): has skinny extents
>>> [  528.597381] BTRFS: device fsid 20fd7216-ce75-4761-bb61-a5819aef05b6 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (27027)
>>> [  528.626562] BTRFS info (device vdb): turning on discard
>>> [  528.629852] BTRFS info (device vdb): disk space caching is enabled
>>> [  528.633442] BTRFS info (device vdb): has skinny extents
>>> [  528.640914] BTRFS info (device vdb): checking UUID tree
>>> [  548.785255] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  551.256550] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  557.620409] BTRFS info (device vdb): scrub: started on devid 1
>>> [  559.258036] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  559.646237] BTRFS info (device vdc): turning on discard
>>> [  559.649237] BTRFS info (device vdc): disk space caching is enabled
>>> [  559.652122] BTRFS info (device vdc): has skinny extents
>>> [  561.076431] BTRFS: device fsid e8c0f848-fd83-4401-95ff-503d3bba3c08 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (750)
>>> [  561.103870] BTRFS info (device vdb): turning on discard
>>> [  561.106896] BTRFS info (device vdb): disk space caching is enabled
>>> [  561.110432] BTRFS info (device vdb): has skinny extents
>>> [  561.113446] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  561.121762] BTRFS info (device vdb): checking UUID tree
>>> [  576.690406] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  581.249317] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  584.252317] BTRFS info (device vdb): scrub: started on devid 1
>>> [  585.192576] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  586.225905] BTRFS info (device vdc): turning on discard
>>> [  586.229054] BTRFS info (device vdc): disk space caching is enabled
>>> [  586.231819] BTRFS info (device vdc): has skinny extents
>>> [  586.599308] BTRFS: device fsid dd789d35-4d2f-428c-aa03-88ffd5a734cb devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (6931)
>>> [  586.628325] BTRFS info (device vdb): turning on discard
>>> [  586.630954] BTRFS info (device vdb): disk space caching is enabled
>>> [  586.634146] BTRFS info (device vdb): has skinny extents
>>> [  586.636859] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  586.645020] BTRFS info (device vdb): checking UUID tree
>>> [  608.342513] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  611.869495] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc canceled
>>> [  646.226962] BTRFS info (device vdb): scrub: started on devid 1
>>> [  652.537808] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  654.148332] BTRFS info (device vdb): turning on discard
>>> [  654.152563] BTRFS info (device vdb): disk space caching is enabled
>>> [  654.157499] BTRFS info (device vdb): has skinny extents
>>> [  655.318709] BTRFS: device fsid cc2a56c5-5371-4b14-88af-527515683491 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (14080)
>>> [  655.341656] BTRFS info (device vdb): turning on discard
>>> [  655.344075] BTRFS info (device vdb): disk space caching is enabled
>>> [  655.346723] BTRFS info (device vdb): has skinny extents
>>> [  655.355368] BTRFS info (device vdb): checking UUID tree
>>> [  688.140556] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc started
>>> [  696.723022] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdc finished
>>> [  707.940962] BTRFS info (device vdb): scrub: started on devid 1
>>> [  711.456414] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  719.263299] BTRFS info (device vdc): turning on discard
>>> [  719.266339] BTRFS info (device vdc): disk space caching is enabled
>>> [  719.269694] BTRFS info (device vdc): has skinny extents
>>> [  722.816437] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (20268)
>>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>>> [  722.821079] BTRFS: device fsid 9d18355e-3861-4506-a847-b18a0f36a046 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (20268)
>>> [  722.850081] BTRFS info (device vdb): turning on discard
>>> [  722.853321] BTRFS info (device vdb): disk space caching is enabled
>>> [  722.855710] BTRFS info (device vdb): has skinny extents
>>> [  722.857894] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  722.866879] BTRFS info (device vdb): checking UUID tree
>>> [  738.622392] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>>> [  739.728325] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>>> [  740.017266] BTRFS info (device vdb): scrub: started on devid 1
>>> [  740.017277] BTRFS info (device vdb): scrub: started on devid 2
>>> [  740.714789] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>>> [  740.714965] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [  741.234158] BTRFS info (device vdd): turning on discard
>>> [  741.237294] BTRFS info (device vdd): disk space caching is enabled
>>> [  741.240088] BTRFS info (device vdd): has skinny extents
>>> [  741.485130] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (26509)
>>> [  741.489012] BTRFS: device fsid ad5bec99-e695-4960-b622-18a890a2a566 devid 2 transid 5 /dev/vdc scanned by mkfs.btrfs (26509)
>>> [  741.517997] BTRFS info (device vdb): turning on discard
>>> [  741.522388] BTRFS info (device vdb): disk space caching is enabled
>>> [  741.527084] BTRFS info (device vdb): has skinny extents
>>> [  741.528865] BTRFS info (device vdb): flagging fs with big metadata feature
>>> [  741.536809] BTRFS info (device vdb): checking UUID tree
>>> [  803.614447] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>>> [ 1561.888757] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd finished
>>> [ 1563.133038] BTRFS info (device vdb): scrub: started on devid 2
>>> [ 1563.133043] BTRFS info (device vdb): scrub: started on devid 1
>>> [ 1851.421233] BTRFS info (device vdb): scrub: finished on devid 2 with status: 0
>>> [ 1857.639294] BTRFS info (device vdb): scrub: finished on devid 1 with status: 0
>>> [ 1869.454605] BTRFS info (device vdd): turning on discard
>>> [ 1869.458972] BTRFS info (device vdd): disk space caching is enabled
>>> [ 1869.461062] BTRFS info (device vdd): has skinny extents
>>> [ 1891.972499] BTRFS info (device vdd): dev_replace from /dev/vdd (devid 1) to /dev/vdb started
>>> [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
>>> [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
>>> (stacktrace)
>>
>>
>>
>
David Sterba Jan. 25, 2020, 11:35 a.m. UTC | #7
On Sat, Jan 25, 2020 at 08:36:14AM +0800, Qu Wenruo wrote:
> >> The purpose seems to be to catch generic error codes other than
> >> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
> >> that case. But it' a new ENOSPC problem, likely caused by the
> >> read-only status switching.
> >>
> >> My test devices are 12G, there's full log of the test to see at which
> >> phase it happened.
> > 
> > It passes for me on 20G devices, haven't tested with 12G however
> > (can't afford to reboot any of my vms now).
> 
> 5G for all scratch devices, and failed to reproduce it.
> (The full run before submitting the patch also failed to reproduce it)

5G is not actually enough to run some of the tests that require at least
10G of free space (so the block device needs to be a bit larger).

> > I think that happens because before this patch we ignored ENOSPC
> > errors, when trying to set a block group to RO, for device replace and
> > scrub.
> > But with this patch we don't ignore ENOSPC for device replace anymore
> > - this is actually correct because if we ignore we can corrupt nocow
> > writes (including writes into prealloc extents).
> > 
> > Now if it's a real ENOSPC situation or just a bug in the space
> > management code, it's a different thing to look at.
> 
> I tend to take a middle land of the problem.
> 
> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
> 
> But on the other handle, the full RO of a block group for scrub/replace
> is also a little overkilled.
> Just as Filipe mentioned, we only want to kill nocow writes into a block
> group, but still allow COW writes.
> 
> It looks like something like mark_block_group_nocow_ro() in the future
> could reduce the possibility if not fully kill it.

Yeah this sounds doable.

> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
> be needed for this patch as a quick fix.

I'll remove the warning completely, as a separate patch.
Qu Wenruo Jan. 25, 2020, 12:09 p.m. UTC | #8
On 2020/1/25 下午7:35, David Sterba wrote:
> On Sat, Jan 25, 2020 at 08:36:14AM +0800, Qu Wenruo wrote:
>>>> The purpose seems to be to catch generic error codes other than
>>>> EINPROGRESS and ECNACELED, I don't see much point printing a warning in
>>>> that case. But it' a new ENOSPC problem, likely caused by the
>>>> read-only status switching.
>>>>
>>>> My test devices are 12G, there's full log of the test to see at which
>>>> phase it happened.
>>>
>>> It passes for me on 20G devices, haven't tested with 12G however
>>> (can't afford to reboot any of my vms now).
>>
>> 5G for all scratch devices, and failed to reproduce it.
>> (The full run before submitting the patch also failed to reproduce it)
> 
> 5G is not actually enough to run some of the tests that require at least
> 10G of free space (so the block device needs to be a bit larger).

BTW, that ENOSPC sounds like a metadata over-commit bug, there if we can
allocate 2 or more meta chunks, and meta rsv already needs 2 new chunks,
then btrfs_inc_block_group_ro() will always fail no matter whatever the
parameter we're using for any metadata chunk.
(Since btrfs_block_group_ro() can only pre-alloc one chunk, while our
current meta over-commit allows way more meta rsv than just one chunk)

That one can be solved by my previous per-profile avail space patchset.

> 
>>> I think that happens because before this patch we ignored ENOSPC
>>> errors, when trying to set a block group to RO, for device replace and
>>> scrub.
>>> But with this patch we don't ignore ENOSPC for device replace anymore
>>> - this is actually correct because if we ignore we can corrupt nocow
>>> writes (including writes into prealloc extents).
>>>
>>> Now if it's a real ENOSPC situation or just a bug in the space
>>> management code, it's a different thing to look at.
>>
>> I tend to take a middle land of the problem.
>>
>> For current stage, the WARN_ON() is indeed overkilled, at least for ENOSPC.
>>
>> But on the other handle, the full RO of a block group for scrub/replace
>> is also a little overkilled.
>> Just as Filipe mentioned, we only want to kill nocow writes into a block
>> group, but still allow COW writes.
>>
>> It looks like something like mark_block_group_nocow_ro() in the future
>> could reduce the possibility if not fully kill it.
> 
> Yeah this sounds doable.

But it may need more consideration since the rw -> nocow_ro -> ro state
machine can be more complex than my initial thought.

> 
>> It looks like changing the WARN_ON(ret) to WARN_ON(ret != -ENOSPC) would
>> be needed for this patch as a quick fix.
> 
> I'll remove the warning completely, as a separate patch.

That's great.

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..fd266a2d15ec 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3577,17 +3577,27 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * This can easily boost the amount of SYSTEM chunks if cleaner
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
+		 *
+		 * While for dev replace, we need to try our best to mark block
+		 * group RO, to prevent race between:
+		 * - Write duplication
+		 *   Contains latest data
+		 * - Scrub copy
+		 *   Contains data from commit tree
+		 *
+		 * If target block group is not marked RO, nocow writes can
+		 * be overwritten by scrub copy, causing data corruption.
+		 * So for dev-replace, it's not allowed to continue if a block
+		 * group is not RO.
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
-		scrub_pause_off(fs_info);
-
+		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
 		if (ret == 0) {
 			ro_set = 1;
-		} else if (ret == -ENOSPC) {
+		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
 			/*
 			 * btrfs_inc_block_group_ro return -ENOSPC when it
 			 * failed in creating new chunk for metadata.
-			 * It is not a problem for scrub/replace, because
+			 * It is not a problem for scrub, because
 			 * metadata are always cowed, and our scrub paused
 			 * commit_transactions.
 			 */
@@ -3596,9 +3606,22 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
 			btrfs_put_block_group(cache);
+			scrub_pause_off(fs_info);
 			break;
 		}
 
+		/*
+		 * Now the target block is marked RO, wait for nocow writes to
+		 * finish before dev-replace.
+		 * COW is fine, as COW never overwrites extents in commit tree.
+		 */
+		if (sctx->is_dev_replace) {
+			btrfs_wait_nocow_writers(cache);
+			btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
+					cache->length);
+		}
+
+		scrub_pause_off(fs_info);
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;