diff mbox series

btrfs: avoid possible parallel list adding

Message ID 58e8574ccd70645c613e6bc7d328e34c2f52421a.1719549099.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Headers show
Series btrfs: avoid possible parallel list adding | expand

Commit Message

Naohiro Aota June 28, 2024, 4:32 a.m. UTC
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.

Fix it by taking fs_info->unused_bgs_lock.

[27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[27177.529789][T2585409] ------------[ cut here ]------------
[27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
[27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
[27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
e8 ff f2
[27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[27177.742252][T2585409] PKRU: 55555554
[27177.748207][T2585409] Call Trace:
[27177.753850][T2585409]  <TASK>
[27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
[27177.766405][T2585409]  ? die+0x2e/0x50
[27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
[27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
[27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
[27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
[27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
[27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
[27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
[27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
[27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
[27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]

There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAIC. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,resereved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Qu Wenruo June 28, 2024, 5:37 a.m. UTC | #1
在 2024/6/28 14:02, Naohiro Aota 写道:
> There is a potential parallel list adding for retrying in
> btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> group is removed from the reclaim list and it is on a relocation work,
> it can be added into the unused list in parallel. When that happens,
> adding it to the reclaim list will corrupt the list head and trigger
> list corruption like below.
>
> Fix it by taking fs_info->unused_bgs_lock.
>
> [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> [27177.529789][T2585409] ------------[ cut here ]------------
> [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> e8 ff f2
> [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> [27177.742252][T2585409] PKRU: 55555554
> [27177.748207][T2585409] Call Trace:
> [27177.753850][T2585409]  <TASK>
> [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> [27177.766405][T2585409]  ? die+0x2e/0x50
> [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
>
> There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> safe, AFAIC. Since the block group was in the unused list, the used bytes
> should be 0 when it was added to the unused list. Then, it checks
> block_group->{used,resereved,pinned} are still 0 under the
> block_group->lock. So, they should be still eligible for the unused list,
> not the reclaim list.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Just to mention, I find another location which may cause problems:

- btrfs_make_block_group()
   We call list_add_tail() to add the current bg to trans->new_bgs,
   without any extra locking.

   Not sure if it's racy, as it doesn't look that possible to call
   btrfs_make_block_group() on the same trans handler, but maybe we want
   to make sure it's safe.

Thanks,
Qu
> ---
>   fs/btrfs/block-group.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 7cde40fe6a50..498442d0c216 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>   next:
>   		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
>   			/* Refcount held by the reclaim_bgs list after splice. */
> -			btrfs_get_block_group(bg);
> -			list_add_tail(&bg->bg_list, &retry_list);
> +			spin_lock(&fs_info->unused_bgs_lock);
> +			/*
> +			 * This block group might be added to the unused list
> +			 * during the above process. Move it back to the
> +			 * reclaim list otherwise.
> +			 */
> +			if (list_empty(&bg->bg_list)) {
> +				btrfs_get_block_group(bg);
> +				list_add_tail(&bg->bg_list, &retry_list);
> +			}
> +			spin_unlock(&fs_info->unused_bgs_lock);
>   		}
>   		btrfs_put_block_group(bg);
>
Johannes Thumshirn June 28, 2024, 6:02 a.m. UTC | #2
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Filipe Manana June 28, 2024, 9:45 a.m. UTC | #3
On Fri, Jun 28, 2024 at 5:54 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> There is a potential parallel list adding for retrying in
> btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> group is removed from the reclaim list and it is on a relocation work,
> it can be added into the unused list in parallel. When that happens,
> adding it to the reclaim list will corrupt the list head and trigger
> list corruption like below.
>
> Fix it by taking fs_info->unused_bgs_lock.
>
> [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> [27177.529789][T2585409] ------------[ cut here ]------------
> [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> e8 ff f2
> [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> [27177.742252][T2585409] PKRU: 55555554
> [27177.748207][T2585409] Call Trace:
> [27177.753850][T2585409]  <TASK>
> [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> [27177.766405][T2585409]  ? die+0x2e/0x50
> [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
>
> There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> safe, AFAIC. Since the block group was in the unused list, the used bytes
> should be 0 when it was added to the unused list. Then, it checks
> block_group->{used,resereved,pinned} are still 0 under the
> block_group->lock. So, they should be still eligible for the unused list,
> not the reclaim list.

The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.

That means no other task can allocate from the block group, so while
we are at deleted_unused_bgs()
it's not possible for other tasks to allocate and deallocate extents
from the block group,
so it can't be added to the unused list or the reclaim list by anyone else.

>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

The change looks good.
I would however make the subject less generic, maybe something like:

"btrfs: fix adding block group to a reclaim list and the unused list
during reclaim"

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/block-group.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 7cde40fe6a50..498442d0c216 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  next:
>                 if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
>                         /* Refcount held by the reclaim_bgs list after splice. */
> -                       btrfs_get_block_group(bg);
> -                       list_add_tail(&bg->bg_list, &retry_list);
> +                       spin_lock(&fs_info->unused_bgs_lock);
> +                       /*
> +                        * This block group might be added to the unused list
> +                        * during the above process. Move it back to the
> +                        * reclaim list otherwise.
> +                        */
> +                       if (list_empty(&bg->bg_list)) {
> +                               btrfs_get_block_group(bg);
> +                               list_add_tail(&bg->bg_list, &retry_list);
> +                       }
> +                       spin_unlock(&fs_info->unused_bgs_lock);
>                 }
>                 btrfs_put_block_group(bg);
>
> --
> 2.45.2
>
>
Filipe Manana June 28, 2024, 9:56 a.m. UTC | #4
On Fri, Jun 28, 2024 at 6:37 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/6/28 14:02, Naohiro Aota 写道:
> > There is a potential parallel list adding for retrying in
> > btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> > group is removed from the reclaim list and it is on a relocation work,
> > it can be added into the unused list in parallel. When that happens,
> > adding it to the reclaim list will corrupt the list head and trigger
> > list corruption like below.
> >
> > Fix it by taking fs_info->unused_bgs_lock.
> >
> > [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> > [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> > [27177.529789][T2585409] ------------[ cut here ]------------
> > [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> > [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> > [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> > [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> > [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> > 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> > 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> > e8 ff f2
> > [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> > [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> > [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> > [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> > [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> > [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> > [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> > [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> > [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> > [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> > [27177.742252][T2585409] PKRU: 55555554
> > [27177.748207][T2585409] Call Trace:
> > [27177.753850][T2585409]  <TASK>
> > [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> > [27177.766405][T2585409]  ? die+0x2e/0x50
> > [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> > [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> > [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> > [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> > [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> > [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> >
> > There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> > safe, AFAIC. Since the block group was in the unused list, the used bytes
> > should be 0 when it was added to the unused list. Then, it checks
> > block_group->{used,resereved,pinned} are still 0 under the
> > block_group->lock. So, they should be still eligible for the unused list,
> > not the reclaim list.
> >
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Just to mention, I find another location which may cause problems:
>
> - btrfs_make_block_group()
>    We call list_add_tail() to add the current bg to trans->new_bgs,
>    without any extra locking.
>
>    Not sure if it's racy, as it doesn't look that possible to call
>    btrfs_make_block_group() on the same trans handler, but maybe we want
>    to make sure it's safe.

It's safe there, because it happens in the first phase of block
creation before the block group can be found
by any task other than the task that created the block group.

If at that point anyone had already added the block group to the
reclaim list, or unused list, or any other list,
the second phase of block group creation would not happen, and things
would break pretty bad, we would
have extents pointing to block groups without all the items in the
chunk tree and device tree etc (chunk item,
device extent items).

>
> Thanks,
> Qu
> > ---
> >   fs/btrfs/block-group.c | 13 +++++++++++--
> >   1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 7cde40fe6a50..498442d0c216 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >   next:
> >               if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> >                       /* Refcount held by the reclaim_bgs list after splice. */
> > -                     btrfs_get_block_group(bg);
> > -                     list_add_tail(&bg->bg_list, &retry_list);
> > +                     spin_lock(&fs_info->unused_bgs_lock);
> > +                     /*
> > +                      * This block group might be added to the unused list
> > +                      * during the above process. Move it back to the
> > +                      * reclaim list otherwise.
> > +                      */
> > +                     if (list_empty(&bg->bg_list)) {
> > +                             btrfs_get_block_group(bg);
> > +                             list_add_tail(&bg->bg_list, &retry_list);
> > +                     }
> > +                     spin_unlock(&fs_info->unused_bgs_lock);
> >               }
> >               btrfs_put_block_group(bg);
> >
>
Filipe Manana June 28, 2024, 10:03 a.m. UTC | #5
On Fri, Jun 28, 2024 at 5:54 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> There is a potential parallel list adding for retrying in
> btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> group is removed from the reclaim list and it is on a relocation work,
> it can be added into the unused list in parallel. When that happens,
> adding it to the reclaim list will corrupt the list head and trigger
> list corruption like below.
>
> Fix it by taking fs_info->unused_bgs_lock.
>
> [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> [27177.529789][T2585409] ------------[ cut here ]------------
> [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> e8 ff f2
> [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> [27177.742252][T2585409] PKRU: 55555554
> [27177.748207][T2585409] Call Trace:
> [27177.753850][T2585409]  <TASK>
> [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> [27177.766405][T2585409]  ? die+0x2e/0x50
> [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
>
> There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> safe, AFAIC. Since the block group was in the unused list, the used bytes
> should be 0 when it was added to the unused list. Then, it checks
> block_group->{used,resereved,pinned} are still 0 under the
> block_group->lock. So, they should be still eligible for the unused list,
> not the reclaim list.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/block-group.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 7cde40fe6a50..498442d0c216 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  next:
>                 if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
>                         /* Refcount held by the reclaim_bgs list after splice. */

Btw this comment should be moved below otherwise it's confusing.

> -                       btrfs_get_block_group(bg);
> -                       list_add_tail(&bg->bg_list, &retry_list);
> +                       spin_lock(&fs_info->unused_bgs_lock);
> +                       /*
> +                        * This block group might be added to the unused list
> +                        * during the above process. Move it back to the
> +                        * reclaim list otherwise.
> +                        */
> +                       if (list_empty(&bg->bg_list)) {

Here, right before incrementing the ref count.

Thanks.

> +                               btrfs_get_block_group(bg);
> +                               list_add_tail(&bg->bg_list, &retry_list);
> +                       }
> +                       spin_unlock(&fs_info->unused_bgs_lock);
>                 }
>                 btrfs_put_block_group(bg);
>
> --
> 2.45.2
>
>
Naohiro Aota July 1, 2024, 7:14 a.m. UTC | #6
On Fri, Jun 28, 2024 at 10:45:30AM GMT, Filipe Manana wrote:
> On Fri, Jun 28, 2024 at 5:54 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > There is a potential parallel list adding for retrying in
> > btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> > group is removed from the reclaim list and it is on a relocation work,
> > it can be added into the unused list in parallel. When that happens,
> > adding it to the reclaim list will corrupt the list head and trigger
> > list corruption like below.
> >
> > Fix it by taking fs_info->unused_bgs_lock.
> >
> > [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> > [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> > [27177.529789][T2585409] ------------[ cut here ]------------
> > [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> > [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> > [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> > [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> > [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> > 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> > 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> > e8 ff f2
> > [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> > [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> > [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> > [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> > [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> > [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> > [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> > [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> > [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> > [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> > [27177.742252][T2585409] PKRU: 55555554
> > [27177.748207][T2585409] Call Trace:
> > [27177.753850][T2585409]  <TASK>
> > [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> > [27177.766405][T2585409]  ? die+0x2e/0x50
> > [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> > [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> > [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> > [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> > [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> > [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> >
> > There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> > safe, AFAIC. Since the block group was in the unused list, the used bytes
> > should be 0 when it was added to the unused list. Then, it checks
> > block_group->{used,resereved,pinned} are still 0 under the
> > block_group->lock. So, they should be still eligible for the unused list,
> > not the reclaim list.
> 
> The reason it is safe there it's because because we're holding
> space_info->groups_sem in write mode.
> 
> That means no other task can allocate from the block group, so while
> we are at deleted_unused_bgs()
> it's not possible for other tasks to allocate and deallocate extents
> from the block group,
> so it can't be added to the unused list or the reclaim list by anyone else.

Thank you for the description. May I incorporate this to the commit log?

> 
> >
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> The change looks good.
> I would however make the subject less generic, maybe something like:
> 
> "btrfs: fix adding block group to a reclaim list and the unused list
> during reclaim"

Sure.

> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> > ---
> >  fs/btrfs/block-group.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 7cde40fe6a50..498442d0c216 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  next:
> >                 if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> >                         /* Refcount held by the reclaim_bgs list after splice. */
> > -                       btrfs_get_block_group(bg);
> > -                       list_add_tail(&bg->bg_list, &retry_list);
> > +                       spin_lock(&fs_info->unused_bgs_lock);
> > +                       /*
> > +                        * This block group might be added to the unused list
> > +                        * during the above process. Move it back to the
> > +                        * reclaim list otherwise.
> > +                        */
> > +                       if (list_empty(&bg->bg_list)) {
> > +                               btrfs_get_block_group(bg);
> > +                               list_add_tail(&bg->bg_list, &retry_list);
> > +                       }
> > +                       spin_unlock(&fs_info->unused_bgs_lock);
> >                 }
> >                 btrfs_put_block_group(bg);
> >
> > --
> > 2.45.2
> >
> >
Naohiro Aota July 1, 2024, 7:17 a.m. UTC | #7
On Fri, Jun 28, 2024 at 11:03:58AM GMT, Filipe Manana wrote:
> On Fri, Jun 28, 2024 at 5:54 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > There is a potential parallel list adding for retrying in
> > btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> > group is removed from the reclaim list and it is on a relocation work,
> > it can be added into the unused list in parallel. When that happens,
> > adding it to the reclaim list will corrupt the list head and trigger
> > list corruption like below.
> >
> > Fix it by taking fs_info->unused_bgs_lock.
> >
> > [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> > [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> > [27177.529789][T2585409] ------------[ cut here ]------------
> > [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> > [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> > [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> > [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> > [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> > 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> > 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> > e8 ff f2
> > [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> > [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> > [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> > [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> > [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> > [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> > [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> > [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> > [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> > [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> > [27177.742252][T2585409] PKRU: 55555554
> > [27177.748207][T2585409] Call Trace:
> > [27177.753850][T2585409]  <TASK>
> > [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> > [27177.766405][T2585409]  ? die+0x2e/0x50
> > [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> > [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> > [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> > [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> > [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> > [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> >
> > There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> > safe, AFAIC. Since the block group was in the unused list, the used bytes
> > should be 0 when it was added to the unused list. Then, it checks
> > block_group->{used,resereved,pinned} are still 0 under the
> > block_group->lock. So, they should be still eligible for the unused list,
> > not the reclaim list.
> >
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/block-group.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 7cde40fe6a50..498442d0c216 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  next:
> >                 if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> >                         /* Refcount held by the reclaim_bgs list after splice. */
> 
> Btw this comment should be moved below otherwise it's confusing.
> 
> > -                       btrfs_get_block_group(bg);
> > -                       list_add_tail(&bg->bg_list, &retry_list);
> > +                       spin_lock(&fs_info->unused_bgs_lock);
> > +                       /*
> > +                        * This block group might be added to the unused list
> > +                        * during the above process. Move it back to the
> > +                        * reclaim list otherwise.
> > +                        */
> > +                       if (list_empty(&bg->bg_list)) {
> 
> Here, right before incrementing the ref count.

Exactly. I'll move it when I'm going to commit.

> 
> Thanks.
> 
> > +                               btrfs_get_block_group(bg);
> > +                               list_add_tail(&bg->bg_list, &retry_list);
> > +                       }
> > +                       spin_unlock(&fs_info->unused_bgs_lock);
> >                 }
> >                 btrfs_put_block_group(bg);
> >
> > --
> > 2.45.2
> >
> >
Filipe Manana July 1, 2024, 9:14 a.m. UTC | #8
On Mon, Jul 1, 2024 at 8:14 AM Naohiro Aota <Naohiro.Aota@wdc.com> wrote:
>
> On Fri, Jun 28, 2024 at 10:45:30AM GMT, Filipe Manana wrote:
> > On Fri, Jun 28, 2024 at 5:54 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> > >
> > > There is a potential parallel list adding for retrying in
> > > btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> > > group is removed from the reclaim list and it is on a relocation work,
> > > it can be added into the unused list in parallel. When that happens,
> > > adding it to the reclaim list will corrupt the list head and trigger
> > > list corruption like below.
> > >
> > > Fix it by taking fs_info->unused_bgs_lock.
> > >
> > > [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> > > [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> > > [27177.529789][T2585409] ------------[ cut here ]------------
> > > [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> > > [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > > [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> > > [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> > > [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> > > [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> > > [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> > > 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> > > 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> > > e8 ff f2
> > > [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> > > [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> > > [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> > > [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> > > [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> > > [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> > > [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> > > [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> > > [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> > > [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> > > [27177.742252][T2585409] PKRU: 55555554
> > > [27177.748207][T2585409] Call Trace:
> > > [27177.753850][T2585409]  <TASK>
> > > [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> > > [27177.766405][T2585409]  ? die+0x2e/0x50
> > > [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> > > [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > > [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> > > [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > > [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> > > [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > > [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> > > [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> > > [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > > [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> > >
> > > There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> > > safe, AFAIC. Since the block group was in the unused list, the used bytes
> > > should be 0 when it was added to the unused list. Then, it checks
> > > block_group->{used,resereved,pinned} are still 0 under the
> > > block_group->lock. So, they should be still eligible for the unused list,
> > > not the reclaim list.
> >
> > The reason it is safe there it's because because we're holding
> > space_info->groups_sem in write mode.
> >
> > That means no other task can allocate from the block group, so while
> > we are at deleted_unused_bgs()
> > it's not possible for other tasks to allocate and deallocate extents
> > from the block group,
> > so it can't be added to the unused list or the reclaim list by anyone else.
>
> Thank you for the description. May I incorporate this to the commit log?

Sure.
Thanks!

>
> >
> > >
> > > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > > Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > > Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> >
> > The change looks good.
> > I would however make the subject less generic, maybe something like:
> >
> > "btrfs: fix adding block group to a reclaim list and the unused list
> > during reclaim"
>
> Sure.
>
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > > ---
> > >  fs/btrfs/block-group.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > index 7cde40fe6a50..498442d0c216 100644
> > > --- a/fs/btrfs/block-group.c
> > > +++ b/fs/btrfs/block-group.c
> > > @@ -1945,8 +1945,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> > >  next:
> > >                 if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> > >                         /* Refcount held by the reclaim_bgs list after splice. */
> > > -                       btrfs_get_block_group(bg);
> > > -                       list_add_tail(&bg->bg_list, &retry_list);
> > > +                       spin_lock(&fs_info->unused_bgs_lock);
> > > +                       /*
> > > +                        * This block group might be added to the unused list
> > > +                        * during the above process. Move it back to the
> > > +                        * reclaim list otherwise.
> > > +                        */
> > > +                       if (list_empty(&bg->bg_list)) {
> > > +                               btrfs_get_block_group(bg);
> > > +                               list_add_tail(&bg->bg_list, &retry_list);
> > > +                       }
> > > +                       spin_unlock(&fs_info->unused_bgs_lock);
> > >                 }
> > >                 btrfs_put_block_group(bg);
> > >
> > > --
> > > 2.45.2
> > >
> > >
David Sterba July 1, 2024, 1:12 p.m. UTC | #9
On Fri, Jun 28, 2024 at 01:32:24PM +0900, Naohiro Aota wrote:
> There is a potential parallel list adding for retrying in
> btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> group is removed from the reclaim list and it is on a relocation work,
> it can be added into the unused list in parallel. When that happens,
> adding it to the reclaim list will corrupt the list head and trigger
> list corruption like below.
> 
> Fix it by taking fs_info->unused_bgs_lock.
> 
> [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> [27177.529789][T2585409] ------------[ cut here ]------------
> [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> e8 ff f2
> [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> [27177.742252][T2585409] PKRU: 55555554
> [27177.748207][T2585409] Call Trace:
> [27177.753850][T2585409]  <TASK>
> [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> [27177.766405][T2585409]  ? die+0x2e/0x50
> [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> 
> There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> safe, AFAIC. Since the block group was in the unused list, the used bytes
> should be 0 when it was added to the unused list. Then, it checks
> block_group->{used,resereved,pinned} are still 0 under the
> block_group->lock. So, they should be still eligible for the unused list,
> not the reclaim list.
> 
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")

This commit has landed in several stable trees so we need to get this
fixup out quickly. How hard is to hit the problem? It's caught by list
corruption detection but this is most likely not enabled on distro
kernels so it might go unnoticed.
Naohiro Aota July 1, 2024, 1:35 p.m. UTC | #10
On Mon, Jul 01, 2024 at 03:12:44PM GMT, David Sterba wrote:
> On Fri, Jun 28, 2024 at 01:32:24PM +0900, Naohiro Aota wrote:
> > There is a potential parallel list adding for retrying in
> > btrfs_reclaim_bgs_work and adding to the unused list. Since the block
> > group is removed from the reclaim list and it is on a relocation work,
> > it can be added into the unused list in parallel. When that happens,
> > adding it to the reclaim list will corrupt the list head and trigger
> > list corruption like below.
> > 
> > Fix it by taking fs_info->unused_bgs_lock.
> > 
> > [27177.504101][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
> > [27177.514722][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
> > [27177.529789][T2585409] ------------[ cut here ]------------
> > [27177.537690][T2585409] kernel BUG at lib/list_debug.c:65!
> > [27177.545548][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > [27177.555466][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
> > [27177.568502][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
> > [27177.579784][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
> > [27177.589946][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.600088][T2585409] Code: fa ff 0f 0b 4c 89 e2 48 89 de 48 c7 c7 c0
> > 8c 3b 93 e8 ab 8e fa ff 0f 0b 4c 89 e1 48 89 de 48 c7 c7 00 8e 3b 93 e8
> > 97 8e fa ff <0f> 0b 48 63 d1 4c 89 f6 48 c7 c7 e0 56 67 94 89 4c 24 04
> > e8 ff f2
> > [27177.624173][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
> > [27177.633052][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
> > [27177.644059][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
> > [27177.655030][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
> > [27177.665992][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
> > [27177.676964][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
> > [27177.687938][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
> > [27177.700006][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [27177.709431][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
> > [27177.720402][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
> > [27177.731337][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
> > [27177.742252][T2585409] PKRU: 55555554
> > [27177.748207][T2585409] Call Trace:
> > [27177.753850][T2585409]  <TASK>
> > [27177.759103][T2585409]  ? __die_body.cold+0x19/0x27
> > [27177.766405][T2585409]  ? die+0x2e/0x50
> > [27177.772498][T2585409]  ? do_trap+0x1ea/0x2d0
> > [27177.779139][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.788518][T2585409]  ? do_error_trap+0xa3/0x160
> > [27177.795649][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.805045][T2585409]  ? handle_invalid_op+0x2c/0x40
> > [27177.812022][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.820947][T2585409]  ? exc_invalid_op+0x2d/0x40
> > [27177.827608][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
> > [27177.834637][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
> > [27177.843519][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
> > 
> > There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
> > safe, AFAIC. Since the block group was in the unused list, the used bytes
> > should be 0 when it was added to the unused list. Then, it checks
> > block_group->{used,resereved,pinned} are still 0 under the
> > block_group->lock. So, they should be still eligible for the unused list,
> > not the reclaim list.
> > 
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
> 
> This commit has landed in several stable trees so we need to get this
> fixup out quickly. How hard is to hit the problem? It's caught by list
> corruption detection but this is most likely not enabled on distro
> kernels so it might go unnoticed.

It surely occurs when I run fstests generic/166 some (~5) times. I don't
think it's directly related to the workload, though. It always happens when
it fails to relocate a chunk due to near ENOSPC state. So, I believe it can
happen easily on such tight space filesystem.
David Sterba July 1, 2024, 1:50 p.m. UTC | #11
On Mon, Jul 01, 2024 at 01:35:56PM +0000, Naohiro Aota wrote:
> > This commit has landed in several stable trees so we need to get this
> > fixup out quickly. How hard is to hit the problem? It's caught by list
> > corruption detection but this is most likely not enabled on distro
> > kernels so it might go unnoticed.
> 
> It surely occurs when I run fstests generic/166 some (~5) times. I don't
> think it's directly related to the workload, though. It always happens when
> it fails to relocate a chunk due to near ENOSPC state. So, I believe it can
> happen easily on such tight space filesystem.

I see, thanks. This can also happen when profiles are converted and
there's not enough chunk space (like with striped profile conversions).
Sounds serious to me, I'll expedite the fix but it will still take like
a week before it's released.  Unfortunatelly it affects kernels up to
5.15.x.
diff mbox series

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 7cde40fe6a50..498442d0c216 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1945,8 +1945,17 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 next:
 		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
 			/* Refcount held by the reclaim_bgs list after splice. */
-			btrfs_get_block_group(bg);
-			list_add_tail(&bg->bg_list, &retry_list);
+			spin_lock(&fs_info->unused_bgs_lock);
+			/*
+			 * This block group might be added to the unused list
+			 * during the above process. Move it back to the
+			 * reclaim list otherwise.
+			 */
+			if (list_empty(&bg->bg_list)) {
+				btrfs_get_block_group(bg);
+				list_add_tail(&bg->bg_list, &retry_list);
+			}
+			spin_unlock(&fs_info->unused_bgs_lock);
 		}
 		btrfs_put_block_group(bg);