[v3,2/2] btrfs: zoned: mark relocation as writing

Message ID 01fa2ddededefc7f03ca4d6df2cccfdbf550aa26.1645157220.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned: mark relocation as writing

Commit Message

Naohiro Aota Feb. 18, 2022, 4:14 a.m. UTC
There is a hung_task issue with running generic/068 on an SMR
device. The hang occurs while a process is trying to thaw the
filesystem. The process is trying to take sb->s_umount to thaw the
FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
waiting for an ordered extent to finish. However, as the FS is frozen,
the ordered extent never finishes.

Having an ordered extent while the FS is frozen is the root cause of
the hang. The ordered extent is initiated from btrfs_relocate_chunk()
which is called from btrfs_reclaim_bgs_work().

This commit adds sb_*_write() around the btrfs_relocate_chunk() call
sites. For the usual "btrfs balance" command, we already call it with
mnt_want_write_file() in btrfs_ioctl_balance().

Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is
properly called.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
Cc: stable@vger.kernel.org # 5.13+
Link: https://github.com/naota/linux/issues/56
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 8 +++++++-
 fs/btrfs/volumes.c     | 6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)
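
To illustrate why every exit path in btrfs_reclaim_bgs_work() has to drop
the write section, here is a minimal userspace sketch of the
freeze-protection pattern described in the commit message. It is a model
under stated assumptions, not kernel code: struct model_sb and all model_*
functions are hypothetical stand-ins, and where the real sb_start_write()
would block on a frozen filesystem, the model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct super_block. */
struct model_sb {
	int writers;	/* number of open write sections */
	bool frozen;	/* set once a freeze has completed */
};

/*
 * Models sb_start_write(). Assumption: the real call sleeps until thaw;
 * this model refuses instead so the outcome can be checked directly.
 */
static bool model_sb_start_write(struct model_sb *sb)
{
	if (sb->frozen)
		return false;
	sb->writers++;
	return true;
}

/* Models sb_end_write(): closes one write section. */
static void model_sb_end_write(struct model_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Models a freeze: it can only complete when no write section is open. */
static bool model_try_freeze(struct model_sb *sb)
{
	if (sb->writers > 0)
		return false;
	sb->frozen = true;
	return true;
}

/*
 * Models the patched reclaim worker: the write section is opened first
 * and released on every return path, including the early exit.
 */
static int model_reclaim_work(struct model_sb *sb, bool exclop_available)
{
	if (!model_sb_start_write(sb))
		return -1;	/* frozen: no relocation, no ordered extent */

	if (!exclop_available) {
		model_sb_end_write(sb);
		return -2;	/* another exclusive operation is running */
	}

	/* Relocation would run here, with freezing held off. */
	assert(sb->writers > 0);

	model_sb_end_write(sb);
	return 0;
}
```

In this model, a freeze attempted while a write section is open does not
complete, and a reclaim attempted after the freeze creates no new work,
which is the ordering the patch is meant to enforce.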

Comments

Johannes Thumshirn Feb. 18, 2022, 6:13 a.m. UTC | #1
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
David Sterba Feb. 23, 2022, 10:31 a.m. UTC | #2
On Fri, Feb 18, 2022 at 01:14:19PM +0900, Naohiro Aota wrote:
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3240,6 +3240,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	u64 length;
>  	int ret;
>  
> +	/* Assert we called sb_start_write(), not to race with FS freezing */
> +	ASSERT(sb_write_started(fs_info->sb));

I see this assertion fail. It's not on all testing VMs, but it has
happened a few times already, so it's probably some race:

[ 2927.013859] BTRFS warning (device vdc): devid 1 uuid 4335c7a6-652c-4389-8ea9-270c00fa9880 is missing
[ 2927.017693] BTRFS warning (device vdc): devid 1 uuid 4335c7a6-652c-4389-8ea9-270c00fa9880 is missing
[ 2927.022921] BTRFS info (device vdc): bdev /dev/vdd errs: wr 0, rd 0, flush 0, corrupt 6000, gen 0
[ 2927.031780] BTRFS info (device vdc): checking UUID tree
[ 2927.045348] BTRFS: error (device vdc: state X) in __btrfs_free_extent:3199: errno=-5 IO failure
[ 2927.049729] BTRFS info (device vdc: state EX): forced readonly
[ 2927.051787] BTRFS: error (device vdc: state EX) in btrfs_run_delayed_refs:2159: errno=-5 IO failure
[ 2927.058758] BTRFS info (device vdc: state EX): balance: resume -dusage=90 -musage=90 -susage=90
[ 2927.062457] assertion failed: sb_write_started(fs_info->sb), in fs/btrfs/volumes.c:3244
[ 2927.066121] ------------[ cut here ]------------
[ 2927.067682] kernel BUG at fs/btrfs/ctree.h:3552!
[ 2927.069214] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2927.070926] CPU: 2 PID: 22817 Comm: btrfs-balance Not tainted 5.17.0-rc5-default+ #1632
[ 2927.075299] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 2927.080897] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[ 2927.092652] RSP: 0018:ffffaed9c610fdc0 EFLAGS: 00010246
[ 2927.095227] RAX: 000000000000004b RBX: ffffa13a873db000 RCX: 0000000000000000
[ 2927.096898] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff
[ 2927.100514] RBP: ffffa13a55324000 R08: 0000000000000003 R09: 0000000000000001
[ 2927.102518] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa13a6922f098
[ 2927.104330] R13: 000000008cfa0000 R14: ffffa13a553262a0 R15: ffffa13a873db000
[ 2927.106025] FS:  0000000000000000(0000) GS:ffffa13abda00000(0000) knlGS:0000000000000000
[ 2927.108652] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2927.110568] CR2: 000055fdf2a94fd0 CR3: 000000005d012005 CR4: 0000000000170ea0
[ 2927.112167] Call Trace:
[ 2927.112801]  <TASK>
[ 2927.113212]  btrfs_relocate_chunk.cold+0x42/0x67 [btrfs]
[ 2927.114328]  __btrfs_balance+0x2ea/0x490 [btrfs]
[ 2927.114871] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 131072 csum 0x7e797e3e expected csum 0x8941f998 mirror 2
[ 2927.115469]  btrfs_balance+0x4ed/0x7e0 [btrfs]
[ 2927.118802] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 139264 csum 0x27df6522 expected csum 0x8941f998 mirror 2
[ 2927.119691]  ? btrfs_balance+0x7e0/0x7e0 [btrfs]
[ 2927.123158] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 143360 csum 0x9f144c35 expected csum 0x8941f998 mirror 2
[ 2927.123965]  balance_kthread+0x37/0x50 [btrfs]
[ 2927.127299] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 147456 csum 0x1027ab9a expected csum 0x8941f998 mirror 2
[ 2927.128016]  kthread+0xea/0x110
[ 2927.128023]  ? kthread_complete_and_exit+0x20/0x20
[ 2927.128027]  ret_from_fork+0x1f/0x30
[ 2927.128031]  </TASK>
[ 2927.128032] Modules linked in:
[ 2927.131390] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 155648 csum 0x428b86d5 expected csum 0x8941f998 mirror 2
[ 2927.131400] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 163840 csum 0x8fff7df2 expected csum 0x8941f998 mirror 2
[ 2927.131401] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 159744 csum 0x9893a835 expected csum 0x8941f998 mirror 2
[ 2927.131416] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 180224 csum 0x83d83877 expected csum 0x8941f998 mirror 2
[ 2927.131832] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 524288 csum 0x1a0c8fd4 expected csum 0x8941f998 mirror 2
[ 2927.132128] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 540672 csum 0xcaaf83cc expected csum 0x8941f998 mirror 2
[ 2927.133105]  dm_flakey dm_mod btrfs blake2b_generic libcrc32c crc32c_intel xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash loop
[ 2927.144290] ---[ end trace 0000000000000000 ]---
[ 2927.145080] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[ 2927.147738] RSP: 0018:ffffaed9c610fdc0 EFLAGS: 00010246
[ 2927.148220] RAX: 000000000000004b RBX: ffffa13a873db000 RCX: 0000000000000000
[ 2927.149126] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff
[ 2927.150057] RBP: ffffa13a55324000 R08: 0000000000000003 R09: 0000000000000001
[ 2927.150676] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa13a6922f098
[ 2927.151297] R13: 000000008cfa0000 R14: ffffa13a553262a0 R15: ffffa13a873db000
[ 2927.152529] FS:  0000000000000000(0000) GS:ffffa13abda00000(0000) knlGS:0000000000000000
[ 2927.153646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2927.154280] CR2: 000055fdf2a94fd0 CR3: 000000005d012005 CR4: 0000000000170ea0
Naohiro Aota Feb. 24, 2022, 2:15 a.m. UTC | #3
On Wed, Feb 23, 2022 at 11:31:07AM +0100, David Sterba wrote:
> On Fri, Feb 18, 2022 at 01:14:19PM +0900, Naohiro Aota wrote:
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -3240,6 +3240,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >  	u64 length;
> >  	int ret;
> >  
> > +	/* Assert we called sb_start_write(), not to race with FS freezing */
> > +	ASSERT(sb_write_started(fs_info->sb));
> 
> I see this assertion fail. It's not on all testing VMs, but it has
> happened a few times already, so it's probably some race:
> 
> [ 2927.013859] BTRFS warning (device vdc): devid 1 uuid 4335c7a6-652c-4389-8ea9-270c00fa9880 is missing
> [ 2927.017693] BTRFS warning (device vdc): devid 1 uuid 4335c7a6-652c-4389-8ea9-270c00fa9880 is missing
> [ 2927.022921] BTRFS info (device vdc): bdev /dev/vdd errs: wr 0, rd 0, flush 0, corrupt 6000, gen 0
> [ 2927.031780] BTRFS info (device vdc): checking UUID tree
> [ 2927.045348] BTRFS: error (device vdc: state X) in __btrfs_free_extent:3199: errno=-5 IO failure
> [ 2927.049729] BTRFS info (device vdc: state EX): forced readonly
> [ 2927.051787] BTRFS: error (device vdc: state EX) in btrfs_run_delayed_refs:2159: errno=-5 IO failure
> [ 2927.058758] BTRFS info (device vdc: state EX): balance: resume -dusage=90 -musage=90 -susage=90
> [ 2927.062457] assertion failed: sb_write_started(fs_info->sb), in fs/btrfs/volumes.c:3244
> [ 2927.066121] ------------[ cut here ]------------
> [ 2927.067682] kernel BUG at fs/btrfs/ctree.h:3552!
> [ 2927.069214] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 2927.070926] CPU: 2 PID: 22817 Comm: btrfs-balance Not tainted 5.17.0-rc5-default+ #1632
> [ 2927.075299] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> [ 2927.080897] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
> [ 2927.092652] RSP: 0018:ffffaed9c610fdc0 EFLAGS: 00010246
> [ 2927.095227] RAX: 000000000000004b RBX: ffffa13a873db000 RCX: 0000000000000000
> [ 2927.096898] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff
> [ 2927.100514] RBP: ffffa13a55324000 R08: 0000000000000003 R09: 0000000000000001
> [ 2927.102518] R10: 0000000000000000 R11: 0000000000000001 R12: ffffa13a6922f098
> [ 2927.104330] R13: 000000008cfa0000 R14: ffffa13a553262a0 R15: ffffa13a873db000
> [ 2927.106025] FS:  0000000000000000(0000) GS:ffffa13abda00000(0000) knlGS:0000000000000000
> [ 2927.108652] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2927.110568] CR2: 000055fdf2a94fd0 CR3: 000000005d012005 CR4: 0000000000170ea0
> [ 2927.112167] Call Trace:
> [ 2927.112801]  <TASK>
> [ 2927.113212]  btrfs_relocate_chunk.cold+0x42/0x67 [btrfs]
> [ 2927.114328]  __btrfs_balance+0x2ea/0x490 [btrfs]
> [ 2927.114871] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 131072 csum 0x7e797e3e expected csum 0x8941f998 mirror 2
> [ 2927.115469]  btrfs_balance+0x4ed/0x7e0 [btrfs]
> [ 2927.118802] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 139264 csum 0x27df6522 expected csum 0x8941f998 mirror 2
> [ 2927.119691]  ? btrfs_balance+0x7e0/0x7e0 [btrfs]
> [ 2927.123158] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 143360 csum 0x9f144c35 expected csum 0x8941f998 mirror 2
> [ 2927.123965]  balance_kthread+0x37/0x50 [btrfs]

It looks like this occurs when the balance is resumed. We also need
sb_{start,end}_write around btrfs_balance() in balance_kthread().

I guess we can cause a hang if we resume the balance and freeze the FS
at the same time.

David Sterba Feb. 24, 2022, 7:12 p.m. UTC | #4
On Thu, Feb 24, 2022 at 02:15:58AM +0000, Naohiro Aota wrote:
> On Wed, Feb 23, 2022 at 11:31:07AM +0100, David Sterba wrote:
> > On Fri, Feb 18, 2022 at 01:14:19PM +0900, Naohiro Aota wrote:
> > [ 2927.114871] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 131072 csum 0x7e797e3e expected csum 0x8941f998 mirror 2
> > [ 2927.115469]  btrfs_balance+0x4ed/0x7e0 [btrfs]
> > [ 2927.118802] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 139264 csum 0x27df6522 expected csum 0x8941f998 mirror 2
> > [ 2927.119691]  ? btrfs_balance+0x7e0/0x7e0 [btrfs]
> > [ 2927.123158] BTRFS warning (device vdc: state EX): csum failed root 5 ino 258 off 143360 csum 0x9f144c35 expected csum 0x8941f998 mirror 2
> > [ 2927.123965]  balance_kthread+0x37/0x50 [btrfs]
> 
> It looks like this occurs when the balance is resumed. We also need
> sb_{start,end}_write around btrfs_balance() in balance_kthread().

Sounds plausible.

> I guess we can cause a hang if we resume the balance and freeze the FS
> at the same time.

The background balance starts only when the filesystem is mounted for
write, so right after the sb_rdonly check in open_ctree, but I think
you're right that freeze during that can lead to a hang.
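
The resume path discussed in this exchange can be modeled in userspace to
show why the assertion fires and what the proposed sb_{start,end}_write
pair changes. This is only a sketch under stated assumptions: struct
model_sb and every model_* function are hypothetical stand-ins for the
kernel objects, and the ASSERT is turned into a return value so the
failing path can be observed without aborting.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct super_block: counts open write sections. */
struct model_sb {
	int writers;
};

static void model_sb_start_write(struct model_sb *sb)
{
	sb->writers++;
}

static void model_sb_end_write(struct model_sb *sb)
{
	sb->writers--;
}

/* Models sb_write_started(): true while at least one write section is open. */
static bool model_sb_write_started(const struct model_sb *sb)
{
	return sb->writers > 0;
}

/*
 * Models btrfs_relocate_chunk() with the new check. Assumption: a
 * negative return stands in for the kernel ASSERT() firing.
 */
static int model_relocate_chunk(struct model_sb *sb)
{
	if (!model_sb_write_started(sb))
		return -1;
	return 0;
}

/* Resume path as in the reported trace: no write section around the balance. */
static int model_balance_kthread_unprotected(struct model_sb *sb)
{
	return model_relocate_chunk(sb);
}

/* Resume path with the proposed write section around the balance. */
static int model_balance_kthread_protected(struct model_sb *sb)
{
	int ret;

	model_sb_start_write(sb);
	ret = model_relocate_chunk(sb);
	model_sb_end_write(sb);
	return ret;
}
```

With a fresh model_sb, the unprotected variant reports the failed check
while the protected one succeeds and leaves no write section open.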
David Sterba Feb. 28, 2022, 8:18 p.m. UTC | #5
On Thu, Feb 24, 2022 at 02:15:58AM +0000, Naohiro Aota wrote:
> On Wed, Feb 23, 2022 at 11:31:07AM +0100, David Sterba wrote:
> > On Fri, Feb 18, 2022 at 01:14:19PM +0900, Naohiro Aota wrote:
> It looks like this occurs when the balance is resumed. We also need
> sb_{start,end}_write around btrfs_balance() in balance_kthread().
> 
> I guess we can cause a hang if we resume the balance and freeze the FS
> at the same time.

We need to fix the missing write protection before the asserts can be
added, so I'll delete them from this patch and will submit the helpers
patch once we have fixed them all.
Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 3113f6d7f335..c22d287e020b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1522,8 +1522,12 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
 		return;
 
-	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
+	sb_start_write(fs_info->sb);
+
+	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
+		sb_end_write(fs_info->sb);
 		return;
+	}
 
 	/*
 	 * Long running balances can keep us blocked here for eternity, so
@@ -1531,6 +1535,7 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 	 */
 	if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
 		btrfs_exclop_finish(fs_info);
+		sb_end_write(fs_info->sb);
 		return;
 	}
 
@@ -1605,6 +1610,7 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 	spin_unlock(&fs_info->unused_bgs_lock);
 	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	btrfs_exclop_finish(fs_info);
+	sb_end_write(fs_info->sb);
 }
 
 void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fa7fee09e39b..74c8024d8f96 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3240,6 +3240,9 @@  int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	u64 length;
 	int ret;
 
+	/* Assert we called sb_start_write(), not to race with FS freezing */
+	ASSERT(sb_write_started(fs_info->sb));
+
 	if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
 		btrfs_err(fs_info,
 			  "relocate: not supported on extent tree v2 yet");
@@ -8304,10 +8307,12 @@  static int relocating_repair_kthread(void *data)
 	target = cache->start;
 	btrfs_put_block_group(cache);
 
+	sb_start_write(fs_info->sb);
 	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
 		btrfs_info(fs_info,
 			   "zoned: skip relocating block group %llu to repair: EBUSY",
 			   target);
+		sb_end_write(fs_info->sb);
 		return -EBUSY;
 	}
 
@@ -8335,6 +8340,7 @@  static int relocating_repair_kthread(void *data)
 		btrfs_put_block_group(cache);
 	mutex_unlock(&fs_info->reclaim_bgs_lock);
 	btrfs_exclop_finish(fs_info);
+	sb_end_write(fs_info->sb);
 
 	return ret;
 }