[v2,0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

Message ID 20181108054919.18253-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Nov. 8, 2018, 5:49 a.m. UTC
This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased

Which is based on v4.20-rc1.

This patchset addresses the heavy subtree scan load by delaying the scan
until we're going to modify the swapped tree blocks.

The overall workflow is:

1) Record the subtree root blocks that get swapped.

   During subtree swap:
   O = Old tree blocks
   N = New tree blocks
         reloc tree                         file tree X
            Root                               Root
           /    \                             /    \
         NA     OB                          OA      OB
       /  |     |  \                      /  |      |  \
     NC  ND     OE  OF                   OC  OD     OE  OF

  In this case, NA and OA are going to be swapped, so record (NA, OA) in
  file tree X.

2) After subtree swap.
         reloc tree                         file tree X
            Root                               Root
           /    \                             /    \
         OA     OB                          NA      OB
       /  |     |  \                      /  |      |  \
     OC  OD     OE  OF                   NC  ND     OE  OF

3a) CoW happens for OB
    If we are going to CoW tree block OB, we check OB's bytenr against
    tree X's swapped_blocks structure.
    It doesn't match any record, so nothing will happen.

3b) CoW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit.
    Then we do a subtree scan on both subtrees, OA and NA.
    This results in 6 tree blocks being scanned (OA, OC, OD, NA, NC, ND).

    Then no matter what we do to file tree X, qgroup numbers will
    still be correct.
    Then NA's record gets removed from X's swapped_blocks.

4)  Transaction commit
    Any records left in X's swapped_blocks get removed. Since there was no
    modification to those swapped subtrees, there is no need to trigger a
    heavy qgroup subtree rescan for them.
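
To make the workflow more concrete, below is a minimal userspace-style
sketch of the bookkeeping described above, mirroring steps 1), 3a)/3b) and
4). It is an illustration only: the names (swapped_block,
record_swapped_block, qgroup_check_block_on_cow, clear_swapped_blocks) are
made up for this sketch, and the real patchset keeps its records in
per-root kernel structures, so don't read this as the actual API.

/*
 * Sketch only: the swapped_blocks bookkeeping in plain userspace C.
 * All names are invented for the example and do not match the
 * patchset's real structures or functions.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* One record per swapped subtree root pair (NA, OA), owned by file tree X. */
struct swapped_block {
        unsigned long long new_bytenr;  /* NA, now referenced by file tree X */
        unsigned long long old_bytenr;  /* OA, now referenced by reloc tree */
        struct swapped_block *next;
};

/* Per-root collection of swapped subtree roots. */
struct swapped_blocks {
        struct swapped_block *head;
};

/* Step 1: at subtree swap time, only record the pair, no subtree scan yet. */
static void record_swapped_block(struct swapped_blocks *sb,
                                 unsigned long long new_bytenr,
                                 unsigned long long old_bytenr)
{
        struct swapped_block *blk = malloc(sizeof(*blk));

        if (!blk)
                return;
        blk->new_bytenr = new_bytenr;
        blk->old_bytenr = old_bytenr;
        blk->next = sb->head;
        sb->head = blk;
}

/* Steps 3a/3b: on CoW of @bytenr, do the expensive tracing only on a hit. */
static bool qgroup_check_block_on_cow(struct swapped_blocks *sb,
                                      unsigned long long bytenr)
{
        struct swapped_block **p = &sb->head;

        while (*p) {
                if ((*p)->new_bytenr == bytenr) {
                        struct swapped_block *hit = *p;

                        /*
                         * Here the real code would trace both subtrees
                         * (OA and NA) for qgroup accounting; the record
                         * is then dropped so the scan fires only once.
                         */
                        *p = hit->next;
                        free(hit);
                        return true;
                }
                p = &(*p)->next;
        }
        return false;   /* 3a: no hit, nothing to do */
}

/* Step 4: at transaction commit, untouched records are simply dropped. */
static void clear_swapped_blocks(struct swapped_blocks *sb)
{
        while (sb->head) {
                struct swapped_block *blk = sb->head;

                sb->head = blk->next;
                free(blk);
        }
}

int main(void)
{
        struct swapped_blocks tree_x = { .head = NULL };

        record_swapped_block(&tree_x, 100, 200);        /* (NA, OA) */

        printf("CoW of OB: hit=%d\n", qgroup_check_block_on_cow(&tree_x, 300));
        printf("CoW of NA: hit=%d\n", qgroup_check_block_on_cow(&tree_x, 100));

        clear_swapped_blocks(&tree_x);
        return 0;
}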

[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

And after file system population, there is no other activity, so it
should be the best case scenario.

                     | v4.20-rc1            | w/ patchset    | diff
-----------------------------------------------------------------------
relocated extents    | 22615                | 22457          | -0.7%
qgroup dirty extents | 163457               | 121606         | -25.6%
time (sys)           | 22.884s              | 18.842s        | -17.6%
time (real)          | 27.724s              | 22.884s        | -17.5%

changelog:
v2:
  Rebase to v4.20-rc1.
  Instead of committing the transaction after each reloc tree merge, delay it
  until merge_reloc_roots() finishes.
  This provides more natural behavior and reduces unnecessary transaction
  commits.

Qu Wenruo (6):
  btrfs: qgroup: Allow btrfs_qgroup_extent_record::old_roots unpopulated
    at insert time
  btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()
  btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap()
  btrfs: qgroup: Introduce per-root swapped blocks infrastructure
  btrfs: qgroup: Use delayed subtree rescan for balance
  btrfs: qgroup: Cleanup old subtree swap code

 fs/btrfs/ctree.c       |   8 +
 fs/btrfs/ctree.h       |  14 ++
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/qgroup.c      | 376 +++++++++++++++++++++++++++++++----------
 fs/btrfs/qgroup.h      | 107 +++++++++++-
 fs/btrfs/relocation.c  | 140 ++++++++++++---
 fs/btrfs/transaction.c |   1 +
 7 files changed, 527 insertions(+), 120 deletions(-)

Comments

David Sterba Nov. 12, 2018, 9:33 p.m. UTC | #1
On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> 
> Which is based on v4.20-rc1.

Thanks, I'll add it to for-next soon.
David Sterba Nov. 13, 2018, 5:07 p.m. UTC | #2
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

During test generic/517, the logs were full of the warning below. The reference
test on current master, effectively misc-4.20 which was used as the base of your
branch, did not get the warning.

[11540.167829] BTRFS: end < start 2519039 2519040
[11540.170513] WARNING: CPU: 1 PID: 539 at fs/btrfs/extent_io.c:436 insert_state+0xd8/0x100 [btrfs]
[11540.174411] Modules linked in: dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_mod loop [last unloaded: libcrc32c]
[11540.178279] CPU: 1 PID: 539 Comm: xfs_io Tainted: G      D W         4.20.0-rc1-default+ #329
[11540.180616] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[11540.183754] RIP: 0010:insert_state+0xd8/0x100 [btrfs]
[11540.189173] RSP: 0018:ffffa0d245eafb20 EFLAGS: 00010282
[11540.189885] RAX: 0000000000000000 RBX: ffff9f0bb3267320 RCX: 0000000000000000
[11540.191646] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffffffffa40c400d
[11540.192942] RBP: 0000000000266fff R08: 0000000000000001 R09: 0000000000000000
[11540.193871] R10: 0000000000000000 R11: ffffffffa629da2d R12: ffff9f0ba0281c60
[11540.195527] R13: 0000000000267000 R14: ffffa0d245eafb98 R15: ffffa0d245eafb90
[11540.197026] FS:  00007fa338eb4b80(0000) GS:ffff9f0bbd600000(0000) knlGS:0000000000000000
[11540.198251] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11540.199698] CR2: 00007fa33873bfb8 CR3: 000000006fb6e000 CR4: 00000000000006e0
[11540.201428] Call Trace:
[11540.202164]  __set_extent_bit+0x43b/0x5b0 [btrfs]
[11540.203223]  lock_extent_bits+0x5d/0x210 [btrfs]
[11540.204346]  ? _raw_spin_unlock+0x24/0x40
[11540.205381]  ? test_range_bit+0xdf/0x130 [btrfs]
[11540.206573]  lock_extent_range+0xb8/0x150 [btrfs]
[11540.207696]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
[11540.208988]  btrfs_extent_same_range+0x131/0x4e0 [btrfs]
[11540.210237]  btrfs_remap_file_range+0x337/0x350 [btrfs]
[11540.211448]  vfs_dedupe_file_range_one+0x141/0x150
[11540.212622]  vfs_dedupe_file_range+0x146/0x1a0
[11540.213795]  do_vfs_ioctl+0x520/0x6c0
[11540.214711]  ? __fget+0x109/0x1e0
[11540.215616]  ksys_ioctl+0x3a/0x70
[11540.216233]  __x64_sys_ioctl+0x16/0x20
[11540.216860]  do_syscall_64+0x54/0x180
[11540.217409]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[11540.218126] RIP: 0033:0x7fa338a4daa7
Filipe Manana Nov. 13, 2018, 5:58 p.m. UTC | #3
On Tue, Nov 13, 2018 at 5:08 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> > On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > > This patchset can be fetched from github:
> > > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > >
> > > Which is based on v4.20-rc1.
> >
> > Thanks, I'll add it to for-next soon.
>
> During test generic/517, the logs were full of the warning below. The reference
> test on current master, effectively misc-4.20 which was used as base of your
> branch did not get the warning.
>
> [11540.167829] BTRFS: end < start 2519039 2519040
> [11540.170513] WARNING: CPU: 1 PID: 539 at fs/btrfs/extent_io.c:436 insert_state+0xd8/0x100 [btrfs]
> [11540.174411] Modules linked in: dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_mod loop [last unloaded: libcrc32c]
> [11540.178279] CPU: 1 PID: 539 Comm: xfs_io Tainted: G      D W         4.20.0-rc1-default+ #329
> [11540.180616] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
> [11540.183754] RIP: 0010:insert_state+0xd8/0x100 [btrfs]
> [11540.189173] RSP: 0018:ffffa0d245eafb20 EFLAGS: 00010282
> [11540.189885] RAX: 0000000000000000 RBX: ffff9f0bb3267320 RCX: 0000000000000000
> [11540.191646] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffffffffa40c400d
> [11540.192942] RBP: 0000000000266fff R08: 0000000000000001 R09: 0000000000000000
> [11540.193871] R10: 0000000000000000 R11: ffffffffa629da2d R12: ffff9f0ba0281c60
> [11540.195527] R13: 0000000000267000 R14: ffffa0d245eafb98 R15: ffffa0d245eafb90
> [11540.197026] FS:  00007fa338eb4b80(0000) GS:ffff9f0bbd600000(0000) knlGS:0000000000000000
> [11540.198251] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [11540.199698] CR2: 00007fa33873bfb8 CR3: 000000006fb6e000 CR4: 00000000000006e0
> [11540.201428] Call Trace:
> [11540.202164]  __set_extent_bit+0x43b/0x5b0 [btrfs]
> [11540.203223]  lock_extent_bits+0x5d/0x210 [btrfs]
> [11540.204346]  ? _raw_spin_unlock+0x24/0x40
> [11540.205381]  ? test_range_bit+0xdf/0x130 [btrfs]
> [11540.206573]  lock_extent_range+0xb8/0x150 [btrfs]
> [11540.207696]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
> [11540.208988]  btrfs_extent_same_range+0x131/0x4e0 [btrfs]
> [11540.210237]  btrfs_remap_file_range+0x337/0x350 [btrfs]
> [11540.211448]  vfs_dedupe_file_range_one+0x141/0x150
> [11540.212622]  vfs_dedupe_file_range+0x146/0x1a0
> [11540.213795]  do_vfs_ioctl+0x520/0x6c0
> [11540.214711]  ? __fget+0x109/0x1e0
> [11540.215616]  ksys_ioctl+0x3a/0x70
> [11540.216233]  __x64_sys_ioctl+0x16/0x20
> [11540.216860]  do_syscall_64+0x54/0x180
> [11540.217409]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [11540.218126] RIP: 0033:0x7fa338a4daa7

That's the infinite loop issue fixed by one of the patches submitted
for 4.20-rc2:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587

The branch you used for testing doesn't have that fix?

>
Qu Wenruo Nov. 13, 2018, 11:56 p.m. UTC | #4
On 2018/11/14 1:58 AM, Filipe Manana wrote:
> On Tue, Nov 13, 2018 at 5:08 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>> This patchset can be fetched from github:
>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>
>>>> Which is based on v4.20-rc1.
>>>
>>> Thanks, I'll add it to for-next soon.
>>
>> During test generic/517, the logs were full of the warning below. The reference
>> test on current master, effectively misc-4.20 which was used as base of your
>> branch did not get the warning.
>>
>> [11540.167829] BTRFS: end < start 2519039 2519040
>> [11540.170513] WARNING: CPU: 1 PID: 539 at fs/btrfs/extent_io.c:436 insert_state+0xd8/0x100 [btrfs]
>> [11540.174411] Modules linked in: dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq dm_mod loop [last unloaded: libcrc32c]
>> [11540.178279] CPU: 1 PID: 539 Comm: xfs_io Tainted: G      D W         4.20.0-rc1-default+ #329
>> [11540.180616] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
>> [11540.183754] RIP: 0010:insert_state+0xd8/0x100 [btrfs]
>> [11540.189173] RSP: 0018:ffffa0d245eafb20 EFLAGS: 00010282
>> [11540.189885] RAX: 0000000000000000 RBX: ffff9f0bb3267320 RCX: 0000000000000000
>> [11540.191646] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffffffffa40c400d
>> [11540.192942] RBP: 0000000000266fff R08: 0000000000000001 R09: 0000000000000000
>> [11540.193871] R10: 0000000000000000 R11: ffffffffa629da2d R12: ffff9f0ba0281c60
>> [11540.195527] R13: 0000000000267000 R14: ffffa0d245eafb98 R15: ffffa0d245eafb90
>> [11540.197026] FS:  00007fa338eb4b80(0000) GS:ffff9f0bbd600000(0000) knlGS:0000000000000000
>> [11540.198251] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [11540.199698] CR2: 00007fa33873bfb8 CR3: 000000006fb6e000 CR4: 00000000000006e0
>> [11540.201428] Call Trace:
>> [11540.202164]  __set_extent_bit+0x43b/0x5b0 [btrfs]
>> [11540.203223]  lock_extent_bits+0x5d/0x210 [btrfs]
>> [11540.204346]  ? _raw_spin_unlock+0x24/0x40
>> [11540.205381]  ? test_range_bit+0xdf/0x130 [btrfs]
>> [11540.206573]  lock_extent_range+0xb8/0x150 [btrfs]
>> [11540.207696]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
>> [11540.208988]  btrfs_extent_same_range+0x131/0x4e0 [btrfs]
>> [11540.210237]  btrfs_remap_file_range+0x337/0x350 [btrfs]
>> [11540.211448]  vfs_dedupe_file_range_one+0x141/0x150
>> [11540.212622]  vfs_dedupe_file_range+0x146/0x1a0
>> [11540.213795]  do_vfs_ioctl+0x520/0x6c0
>> [11540.214711]  ? __fget+0x109/0x1e0
>> [11540.215616]  ksys_ioctl+0x3a/0x70
>> [11540.216233]  __x64_sys_ioctl+0x16/0x20
>> [11540.216860]  do_syscall_64+0x54/0x180
>> [11540.217409]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [11540.218126] RIP: 0033:0x7fa338a4daa7
> 
> That's the infinite loop issue fixed by one of the patches submitted
> for 4.20-rc2:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587
> 
> The branch you used for testing doesn't have that fix?

Yep, I tried the v4.20-rc1 tag, which hits tons of such warnings even without
my patchset.

So it shouldn't be my patches causing the problem.

Thanks,
Qu

> 
>>
> 
>
David Sterba Nov. 14, 2018, 7:05 p.m. UTC | #5
On Tue, Nov 13, 2018 at 05:58:14PM +0000, Filipe Manana wrote:
> That's the infinite loop issue fixed by one of the patches submitted
> for 4.20-rc2:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587
> 
> The branch you used for testing doesn't have that fix?

That explains it, thanks.  The branch was based on 4.20-rc1 as I took it
from Qu's repository but did not check the base.
Qu Wenruo Nov. 15, 2018, 5:23 a.m. UTC | #6
On 2018/11/15 3:05 AM, David Sterba wrote:
> On Tue, Nov 13, 2018 at 05:58:14PM +0000, Filipe Manana wrote:
>> That's the infinite loop issue fixed by one of the patches submitted
>> for 4.20-rc2:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587
>>
>> The branch you used for testing doesn't have that fix?
> 
> That explains it, thanks.  The branch was based on 4.20-rc1 as I took it
> from Qu's repository but did not check the base.

BTW should I always rebase my patches to misc-next or misc-4.20?

IMHO basing on -rc tags should make it easier for David to rebase/apply,
but if it's causing problems like this, I could definitely go misc-* based.

Thanks,
Qu
David Sterba Nov. 15, 2018, 10:28 a.m. UTC | #7
On Thu, Nov 15, 2018 at 01:23:25PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/11/15 3:05 AM, David Sterba wrote:
> > On Tue, Nov 13, 2018 at 05:58:14PM +0000, Filipe Manana wrote:
> >> That's the infinite loop issue fixed by one of the patches submitted
> >> for 4.20-rc2:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587
> >>
> >> The branch you used for testing doesn't have that fix?
> > 
> > That explains it, thanks.  The branch was based on 4.20-rc1 as I took it
> > from Qu's repository but did not check the base.
> 
> BTW should I always rebase my patches to misc-next or misc-4.20?

The misc-next branch can get rebased as I add tags or need to remove a patch
etc., so using the last rc or the last pulled branch (ie. the exact commit
from misc-4.20, not the head itself) should be a better option for you, so
you don't need to catch up and rebase constantly.

I can handle rebases of your branches to current misc-next as long as
there are no major conflicts. 

> IMHO basing on -rc tags should make it easier for David to rebase/apply,
> but if it's causing problems like this, I could definitely go misc-* based.

No, that was my fault. I review and rebase all topic branches at each rc
release, but taking the branch from your repo was a bit of a different step
in the workflow.
David Sterba Dec. 6, 2018, 7:35 p.m. UTC | #8
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

The branch was there for some time, but then not for at least a week (my
mistake, I did not notice in time). I've rebased it on top of recent
misc-next, but without the delayed refs patchset from Josef.

At the moment I'm considering it for merge into 4.21; there's still some
time to pull it out in case it turns out to be too problematic. I'm
mostly worried about the unknown interactions with the enospc updates, or
generally about the lack of qgroup and reloc code reviews.

I'm going to do some testing of the rebased branch before I add it to
for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
please check if everything is still ok there. Thanks.
Qu Wenruo Dec. 6, 2018, 10:51 p.m. UTC | #9
On 2018/12/7 3:35 AM, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21, there's still some
> time to pull it out in case it shows up to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would cause any obvious problem for the
enospc updates.

The user-noticeable effect is just the delayed deletion of reloc trees.

Apart from that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However, most of the current qgroup + balance optimization is done inside
qgroup code (to skip certain qgroup records), so if we're going to hit a
problem, this patchset has the highest chance of hitting it.

Later patches will just keep tweaking qgroup, mostly without affecting any
other parts.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> please check if everything is still ok there. Thanks.
>
David Sterba Dec. 8, 2018, 12:47 a.m. UTC | #10
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> > 
> > The branch was there for some time but not for at least a week (my
> > mistake I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> > 
> > At the moment I'm considering it for merge to 4.21, there's still some
> > time to pull it out in case it shows up to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
> 
> For that part, I don't think it would have some obvious problem for
> enospc updates.
> 
> As the user-noticeable effect is the delay of reloc tree deletion.
> 
> Despite that, it's mostly transparent to extent allocation.
> 
> > generally because of lack of qgroup and reloc code reviews.
> 
> That's the biggest problem.
> 
> However most of the current qgroup + balance optimization is done inside
> qgroup code (to skip certain qgroup record), if we're going to hit some
> problem then this patchset would have the highest possibility to hit
> problem.
> 
> Later patches will just keep tweaking qgroup to without affecting any
> other parts mostly.
> 
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots
every 20 seconds, deletes a random snapshot every 50 seconds, deletes
files from the original subvolume, now enhanced with qgroups just for the
new snapshots inheriting the toplevel subvolume. Lockup.

It gets stuck in a snapshot call with the following stacktrace

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:

qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5         865.27MiB      1.66MiB ---
0/257           0.00B        0.00B ---
0/259           0.00B        0.00B ---
0/260       806.58MiB    637.25MiB ---
0/262           0.00B        0.00B ---
0/263           0.00B        0.00B ---
0/264           0.00B        0.00B ---
0/265           0.00B        0.00B ---
0/266           0.00B        0.00B ---
0/267           0.00B        0.00B ---
0/268           0.00B        0.00B ---
0/269           0.00B        0.00B ---
0/270       989.04MiB      1.22MiB ---
0/271           0.00B        0.00B ---
0/272       922.25MiB    416.00KiB ---
0/273       931.02MiB      1.50MiB ---
0/274       910.94MiB      1.52MiB ---
1/1           1.64GiB      1.64GiB
0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or CPU activity at this point; the stacktrace and show output
remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will
keep it in for-next for testing; any fixups or updates will be applied.
Qu Wenruo Dec. 8, 2018, 12:50 a.m. UTC | #11
On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21, there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would have some obvious problem for
>> enospc updates.
>>
>> As the user-noticeable effect is the delay of reloc tree deletion.
>>
>> Despite that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However most of the current qgroup + balance optimization is done inside
>> qgroup code (to skip certain qgroup record), if we're going to hit some
>> problem then this patchset would have the highest possibility to hit
>> problem.
>>
>> Later patches will just keep tweaking qgroup to without affecting any
>> other parts mostly.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

This looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
> 
> qgroupid         rfer         excl parent
> --------         ----         ---- ------
> 0/5         865.27MiB      1.66MiB ---
> 0/257           0.00B        0.00B ---
> 0/259           0.00B        0.00B ---
> 0/260       806.58MiB    637.25MiB ---
> 0/262           0.00B        0.00B ---
> 0/263           0.00B        0.00B ---
> 0/264           0.00B        0.00B ---
> 0/265           0.00B        0.00B ---
> 0/266           0.00B        0.00B ---
> 0/267           0.00B        0.00B ---
> 0/268           0.00B        0.00B ---
> 0/269           0.00B        0.00B ---
> 0/270       989.04MiB      1.22MiB ---
> 0/271           0.00B        0.00B ---
> 0/272       922.25MiB    416.00KiB ---
> 0/273       931.02MiB      1.50MiB ---
> 0/274       910.94MiB      1.52MiB ---
> 1/1           1.64GiB      1.64GiB
> 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or cpu activity at this point, the stacktrace and show output
> remains the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing, any fixups or updates will be applied.
>
David Sterba Dec. 8, 2018, 4:17 p.m. UTC | #12
On Sat, Dec 08, 2018 at 08:50:32AM +0800, Qu Wenruo wrote:
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> > 
> > It gets stuck in a snapshot call with the following stacktrace
> > 
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
> 
> This looks like the original subtree tracing has something wrong.

Yes, I ran the test on current master and it locked up too, so it's not
due to your patchset.

> Thanks for the report, I'll investigate it.

Thanks.
Qu Wenruo Dec. 10, 2018, 5:51 a.m. UTC | #13
On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21, there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would have some obvious problem for
>> enospc updates.
>>
>> As the user-noticeable effect is the delay of reloc tree deletion.
>>
>> Despite that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However most of the current qgroup + balance optimization is done inside
>> qgroup code (to skip certain qgroup record), if we're going to hit some
>> problem then this patchset would have the highest possibility to hit
>> problem.
>>
>> Later patches will just keep tweaking qgroup to without affecting any
>> other parts mostly.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.

Could you please provide the test script?
I can't reproduce it in my environment.

I crafted my own test with some simplifications, namely no qgroup inherit.
However I can't reproduce the problem even with more snapshot
creation/deletion and more data.

In my test script, I created around 35 snapshots and deleted 6 snapshots,
with around 1000 regular data extents and 1000 2K inline extents.

My test script can be found at:
https://gist.github.com/adam900710/4109fa23fc5ba8fc6b37a9c8e52353c1

Thanks,
Qu

> 
> It gets stuck in a snapshot call with the following stacktrace
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
> 
> qgroupid         rfer         excl parent
> --------         ----         ---- ------
> 0/5         865.27MiB      1.66MiB ---
> 0/257           0.00B        0.00B ---
> 0/259           0.00B        0.00B ---
> 0/260       806.58MiB    637.25MiB ---
> 0/262           0.00B        0.00B ---
> 0/263           0.00B        0.00B ---
> 0/264           0.00B        0.00B ---
> 0/265           0.00B        0.00B ---
> 0/266           0.00B        0.00B ---
> 0/267           0.00B        0.00B ---
> 0/268           0.00B        0.00B ---
> 0/269           0.00B        0.00B ---
> 0/270       989.04MiB      1.22MiB ---
> 0/271           0.00B        0.00B ---
> 0/272       922.25MiB    416.00KiB ---
> 0/273       931.02MiB      1.50MiB ---
> 0/274       910.94MiB      1.52MiB ---
> 1/1           1.64GiB      1.64GiB
> 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or cpu activity at this point, the stacktrace and show output
> remains the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing, any fixups or updates will be applied.
>
Filipe Manana Dec. 10, 2018, 10:45 a.m. UTC | #14
On Sat, Dec 8, 2018 at 12:51 AM Qu Wenruo <wqu@suse.de> wrote:
>
>
>
> On 2018/12/8 8:47 AM, David Sterba wrote:
> > On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2018/12/7 3:35 AM, David Sterba wrote:
> >>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>>>> This patchset can be fetched from github:
> >>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>>>
> >>>>> Which is based on v4.20-rc1.
> >>>>
> >>>> Thanks, I'll add it to for-next soon.
> >>>
> >>> The branch was there for some time but not for at least a week (my
> >>> mistake I did not notice in time). I've rebased it on top of recent
> >>> misc-next, but without the delayed refs patchset from Josef.
> >>>
> >>> At the moment I'm considering it for merge to 4.21, there's still some
> >>> time to pull it out in case it shows up to be too problematic. I'm
> >>> mostly worried about the unknown interactions with the enospc updates or
> >>
> >> For that part, I don't think it would have some obvious problem for
> >> enospc updates.
> >>
> >> As the user-noticeable effect is the delay of reloc tree deletion.
> >>
> >> Despite that, it's mostly transparent to extent allocation.
> >>
> >>> generally because of lack of qgroup and reloc code reviews.
> >>
> >> That's the biggest problem.
> >>
> >> However most of the current qgroup + balance optimization is done inside
> >> qgroup code (to skip certain qgroup record), if we're going to hit some
> >> problem then this patchset would have the highest possibility to hit
> >> problem.
> >>
> >> Later patches will just keep tweaking qgroup to without affecting any
> >> other parts mostly.
> >>
> >> So I'm fine if you decide to pull it out for now.
> >
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> >
> > It gets stuck in a snapshot call with the following stacktrace
> >
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
>
> This looks like the original subtree tracing has something wrong.
>
> Thanks for the report, I'll investigate it.

Btw, there's another deadlock with qgroups. I don't recall if I ever
reported it, but I still hit it with fstests (it happens rarely) for at
least a year iirc:

[29845.732448] INFO: task kworker/u8:8:3898 blocked for more than 120 seconds.
[29845.732852]       Not tainted 4.20.0-rc5-btrfs-next-40 #1
[29845.733248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[29845.733558] kworker/u8:8    D    0  3898      2 0x80000000
[29845.733878] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[29845.734183] Call Trace:
[29845.734499]  ? __schedule+0x3d4/0xbc0
[29845.734818]  schedule+0x39/0x90
[29845.735131]  btrfs_tree_read_lock+0xe7/0x140 [btrfs]
[29845.735430]  ? remove_wait_queue+0x60/0x60
[29845.735731]  find_parent_nodes+0x25e/0xe30 [btrfs]
[29845.736037]  btrfs_find_all_roots_safe+0xc6/0x140 [btrfs]
[29845.736342]  btrfs_find_all_roots+0x52/0x70 [btrfs]
[29845.736710]  btrfs_qgroup_trace_extent_post+0x37/0x80 [btrfs]
[29845.737046]  btrfs_add_delayed_data_ref+0x240/0x3d0 [btrfs]
[29845.737362]  btrfs_inc_extent_ref+0xb7/0x140 [btrfs]
[29845.737678]  __btrfs_mod_ref+0x174/0x250 [btrfs]
[29845.737999]  ? add_pinned_bytes+0x60/0x60 [btrfs]
[29845.738298]  update_ref_for_cow+0x26b/0x340 [btrfs]
[29845.738592]  __btrfs_cow_block+0x221/0x5b0 [btrfs]
[29845.738899]  btrfs_cow_block+0xf4/0x210 [btrfs]
[29845.739200]  btrfs_search_slot+0x583/0xa40 [btrfs]
[29845.739527]  ? init_object+0x6b/0x80
[29845.739823]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[29845.740119]  __btrfs_drop_extents+0x157/0xd70 [btrfs]
[29845.740524]  insert_reserved_file_extent.constprop.66+0x97/0x2f0 [btrfs]
[29845.740853]  ? start_transaction+0xa2/0x490 [btrfs]
[29845.741166]  btrfs_finish_ordered_io+0x344/0x810 [btrfs]
[29845.741489]  normal_work_helper+0xea/0x530 [btrfs]
[29845.741880]  process_one_work+0x22f/0x5d0
[29845.742174]  worker_thread+0x4f/0x3b0
[29845.742462]  ? rescuer_thread+0x360/0x360
[29845.742759]  kthread+0x103/0x140
[29845.743044]  ? kthread_create_worker_on_cpu+0x70/0x70
[29845.743336]  ret_from_fork+0x3a/0x50

It happened again last Friday on 4.20-rcX. It's caused by a change
from 2017 (commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7, "btrfs:
qgroup: Move half of the qgroup accounting time out of commit trans").
The task is deadlocking with itself.
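
For reference, the shape of that self-deadlock can be shown with a tiny
userspace sketch. This is not btrfs code, just an analogy (a pthread
rwlock stands in for the tree block lock, backref_walk() for
find_parent_nodes()): the CoW path already holds the block's write lock
when the qgroup tracing code walks backrefs and asks for a blocking read
lock on the very same block, so the task waits on itself.

/*
 * Not btrfs code: a minimal analogy of the self-deadlock.  Taking a
 * read lock while the same thread already holds the write lock blocks
 * forever in practice with glibc (POSIX leaves it undefined).
 * Link with -pthread.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t tree_block_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for find_parent_nodes() wanting a read lock on the block. */
static void backref_walk(void)
{
        pthread_rwlock_rdlock(&tree_block_lock);        /* never returns */
        pthread_rwlock_unlock(&tree_block_lock);
}

int main(void)
{
        /* Stand-in for the CoW path write-locking the tree block... */
        pthread_rwlock_wrlock(&tree_block_lock);

        /* ...then qgroup tracing walks backrefs from the same context. */
        backref_walk();

        pthread_rwlock_unlock(&tree_block_lock);
        printf("never reached\n");
        return 0;
}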

thanks


> Qu
>
> > [<0>] do_walk_down+0x681/0xb20 [btrfs]
> > [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> > [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> > [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> > [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> > [<0>] kthread+0x121/0x140
> > [<0>] ret_from_fork+0x27/0x50
> >
> > and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
> >
> > qgroupid         rfer         excl parent
> > --------         ----         ---- ------
> > 0/5         865.27MiB      1.66MiB ---
> > 0/257           0.00B        0.00B ---
> > 0/259           0.00B        0.00B ---
> > 0/260       806.58MiB    637.25MiB ---
> > 0/262           0.00B        0.00B ---
> > 0/263           0.00B        0.00B ---
> > 0/264           0.00B        0.00B ---
> > 0/265           0.00B        0.00B ---
> > 0/266           0.00B        0.00B ---
> > 0/267           0.00B        0.00B ---
> > 0/268           0.00B        0.00B ---
> > 0/269           0.00B        0.00B ---
> > 0/270       989.04MiB      1.22MiB ---
> > 0/271           0.00B        0.00B ---
> > 0/272       922.25MiB    416.00KiB ---
> > 0/273       931.02MiB      1.50MiB ---
> > 0/274       910.94MiB      1.52MiB ---
> > 1/1           1.64GiB      1.64GiB
> > 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> >
> > No IO or cpu activity at this point, the stacktrace and show output
> > remains the same.
> >
> > So, considering this, I'm not going to add the patchset to 4.21 but will
> > keep it in for-next for testing, any fixups or updates will be applied.
> >
>
Qu Wenruo Dec. 10, 2018, 11:23 a.m. UTC | #15
On 2018/12/10 6:45 PM, Filipe Manana wrote:
> On Sat, Dec 8, 2018 at 12:51 AM Qu Wenruo <wqu@suse.de> wrote:
>>
>>
>>
>> On 2018/12/8 8:47 AM, David Sterba wrote:
>>> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>>>> This patchset can be fetched from github:
>>>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>>>
>>>>>>> Which is based on v4.20-rc1.
>>>>>>
>>>>>> Thanks, I'll add it to for-next soon.
>>>>>
>>>>> The branch was there for some time but not for at least a week (my
>>>>> mistake I did not notice in time). I've rebased it on top of recent
>>>>> misc-next, but without the delayed refs patchset from Josef.
>>>>>
>>>>> At the moment I'm considering it for merge to 4.21, there's still some
>>>>> time to pull it out in case it shows up to be too problematic. I'm
>>>>> mostly worried about the unknown interactions with the enospc updates or
>>>>
>>>> For that part, I don't think it would have some obvious problem for
>>>> enospc updates.
>>>>
>>>> As the user-noticeable effect is the delay of reloc tree deletion.
>>>>
>>>> Despite that, it's mostly transparent to extent allocation.
>>>>
>>>>> generally because of lack of qgroup and reloc code reviews.
>>>>
>>>> That's the biggest problem.
>>>>
>>>> However most of the current qgroup + balance optimization is done inside
>>>> qgroup code (to skip certain qgroup record), if we're going to hit some
>>>> problem then this patchset would have the highest possibility to hit
>>>> problem.
>>>>
>>>> Later patches will just keep tweaking qgroup to without affecting any
>>>> other parts mostly.
>>>>
>>>> So I'm fine if you decide to pull it out for now.
>>>
>>> I've adapted a stress test that unpacks a large tarball, snapshots
>>> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
>>> files from the original subvolume, now enhanced with qgroups just for the
>>> new snapshots inheriting the toplevel subvolume. Lockup.
>>>
>>> It gets stuck in a snapshot call with the following stacktrace
>>>
>>> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
>>> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
>>
>> This looks like the original subtree tracing has something wrong.
>>
>> Thanks for the report, I'll investigate it.
> 
> Btw, there's another deadlock with qgroups. I don't recall if I ever
> reported it, but I still hit it with fstests (rarely happens) for at
> least 1 year iirc:
> 
> [29845.732448] INFO: task kworker/u8:8:3898 blocked for more than 120 seconds.
> [29845.732852]       Not tainted 4.20.0-rc5-btrfs-next-40 #1
> [29845.733248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [29845.733558] kworker/u8:8    D    0  3898      2 0x80000000
> [29845.733878] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [29845.734183] Call Trace:
> [29845.734499]  ? __schedule+0x3d4/0xbc0
> [29845.734818]  schedule+0x39/0x90
> [29845.735131]  btrfs_tree_read_lock+0xe7/0x140 [btrfs]
> [29845.735430]  ? remove_wait_queue+0x60/0x60
> [29845.735731]  find_parent_nodes+0x25e/0xe30 [btrfs]
> [29845.736037]  btrfs_find_all_roots_safe+0xc6/0x140 [btrfs]
> [29845.736342]  btrfs_find_all_roots+0x52/0x70 [btrfs]
> [29845.736710]  btrfs_qgroup_trace_extent_post+0x37/0x80 [btrfs]
> [29845.737046]  btrfs_add_delayed_data_ref+0x240/0x3d0 [btrfs]
> [29845.737362]  btrfs_inc_extent_ref+0xb7/0x140 [btrfs]
> [29845.737678]  __btrfs_mod_ref+0x174/0x250 [btrfs]
> [29845.737999]  ? add_pinned_bytes+0x60/0x60 [btrfs]
> [29845.738298]  update_ref_for_cow+0x26b/0x340 [btrfs]
> [29845.738592]  __btrfs_cow_block+0x221/0x5b0 [btrfs]
> [29845.738899]  btrfs_cow_block+0xf4/0x210 [btrfs]
> [29845.739200]  btrfs_search_slot+0x583/0xa40 [btrfs]
> [29845.739527]  ? init_object+0x6b/0x80
> [29845.739823]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
> [29845.740119]  __btrfs_drop_extents+0x157/0xd70 [btrfs]
> [29845.740524]  insert_reserved_file_extent.constprop.66+0x97/0x2f0 [btrfs]
> [29845.740853]  ? start_transaction+0xa2/0x490 [btrfs]
> [29845.741166]  btrfs_finish_ordered_io+0x344/0x810 [btrfs]
> [29845.741489]  normal_work_helper+0xea/0x530 [btrfs]
> [29845.741880]  process_one_work+0x22f/0x5d0
> [29845.742174]  worker_thread+0x4f/0x3b0
> [29845.742462]  ? rescuer_thread+0x360/0x360
> [29845.742759]  kthread+0x103/0x140
> [29845.743044]  ? kthread_create_worker_on_cpu+0x70/0x70
> [29845.743336]  ret_from_fork+0x3a/0x50
> 
> It happened last friday again on 4.20-rcX. It's caused by a change
> from 2017 (commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7 btrfs:
> qgroup: Move half of the qgroup accounting time out of commit trans).

I have to admit, this commit doesn't really save much critical section
time, but it causes a lot of problems due to its ability to trigger
backward tree locking behavior.

Especially since its original objective was to reduce balance + qgroup
overhead, yet it did a poor job compared to the recent optimizations.

I'll revert it, just as we did in the SLE kernels.

Thanks,
Qu

> The task is deadlocking with itself.
> 
> thanks
> 
> 
>> Qu
>>
>>> [<0>] do_walk_down+0x681/0xb20 [btrfs]
>>> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
>>> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
>>> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
>>> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
>>> [<0>] kthread+0x121/0x140
>>> [<0>] ret_from_fork+0x27/0x50
>>>
>>> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
>>>
>>> qgroupid         rfer         excl parent
>>> --------         ----         ---- ------
>>> 0/5         865.27MiB      1.66MiB ---
>>> 0/257           0.00B        0.00B ---
>>> 0/259           0.00B        0.00B ---
>>> 0/260       806.58MiB    637.25MiB ---
>>> 0/262           0.00B        0.00B ---
>>> 0/263           0.00B        0.00B ---
>>> 0/264           0.00B        0.00B ---
>>> 0/265           0.00B        0.00B ---
>>> 0/266           0.00B        0.00B ---
>>> 0/267           0.00B        0.00B ---
>>> 0/268           0.00B        0.00B ---
>>> 0/269           0.00B        0.00B ---
>>> 0/270       989.04MiB      1.22MiB ---
>>> 0/271           0.00B        0.00B ---
>>> 0/272       922.25MiB    416.00KiB ---
>>> 0/273       931.02MiB      1.50MiB ---
>>> 0/274       910.94MiB      1.52MiB ---
>>> 1/1           1.64GiB      1.64GiB
>>> 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
>>>
>>> No IO or cpu activity at this point, the stacktrace and show output
>>> remains the same.
>>>
>>> So, considering this, I'm not going to add the patchset to 4.21 but will
>>> keep it in for-next for testing, any fixups or updates will be applied.
>>>
>>
> 
>