diff mbox

[2/2] btrfs: Refactor running of delayed inode items during transaction commit

Message ID 1518531409-31625-2-git-send-email-nborisov@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Nikolay Borisov Feb. 13, 2018, 2:16 p.m. UTC
Currently calls to btrfs_run_delayed_inode items have been scattered
around the transaction commit code with no real design argument when they 
should be execute.

We have one call, after transaction writers go to 0. Then we have the delayed
items running as part of creating a snapshot (once per pedning snapshot).
Finally, delayed items are run once more _after_ snapshots have been created.
All in all this amounts to 2 + N (N = number of snapshots slated for creation).
In reality we need to flush delayed items once before
create_pending_snapshots is called to ensure snapshosts are consistent with
inode data and once after snapshots are created so that newly introduced inode 
items during snapshot creation process are correctly persisted on disk. This 
patch brings the total executions of run_delayed_items to just 2. 

This survived multiple xfstest runs. 

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/transaction.c | 31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

Comments

Misono Tomohiro May 28, 2018, 8:21 a.m. UTC | #1
Hello,

I found current misc-next sometimes fails btrfs/152 when the number
of cpu is >= 4 in my vm and git bisect points this commit.
(btrfs/152 performs parallel send/receive.)

The failure is caused by _check_dmesg and sometimes also leads to inconsistent fs.

dmesg looks like:
[    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
[    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
[    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
[    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
[    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
[    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
[    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
[    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
[    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
[    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
[    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
[    6.662651] Call Trace:
[    6.662909]  create_pending_snapshot+0x1ab/0xd00
[    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
[    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
[    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
[    6.665116]  ? create_pending_snapshots+0x81/0xa0
[    6.665597]  create_pending_snapshots+0x81/0xa0
[    6.666068]  btrfs_commit_transaction+0x279/0x860
[    6.666553]  ? start_transaction+0x98/0x3c0
[    6.666989]  btrfs_mksubvol+0x589/0x5a0
[    6.667390]  ? btrfs_opendir+0x39/0x70
[    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
[    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
[    6.668827]  btrfs_ioctl+0x56d/0x25a0
[    6.669205]  ? do_filp_open+0xaa/0x110
[    6.669591]  ? do_vfs_ioctl+0x9f/0x610
[    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
[    6.670550]  do_vfs_ioctl+0x9f/0x610
[    6.670922]  ksys_ioctl+0x6b/0x80
[    6.671261]  __x64_sys_ioctl+0x11/0x20
[    6.671650]  do_syscall_64+0x49/0x100
[    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    6.672545] RIP: 0033:0x7f3b61cbddd7
[    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
[    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
[    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
[    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
[    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
[    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
[    6.679318] ---[ end trace b2378f91e69026c3 ]---

transaction.c:303:
  WARN_ON(!force && root->commit_root != root->node);

And 152.out.full looks like:
  *** fsck.btrfs output ***
  root item for root 261, current bytenr 53870592, current gen 149, current
  level 0, new bytenr 54067200, new gen 149, new level 0
  Found 1 roots with an outdated root item.
  Please run a filesystem check with the option --repair to fix them.

=====
(My observation)

Call chain is:
  btrfs_commit_transaction()
   create_pending_snapshots()
     create_pending_snapshot()
       run_delayed_item() <- removed
       qgroup_account_snapshot()
   run_delayed_item() <- added

This commit removes run_delayed_item() in create_pending_snapshot() and
instead calls it after create_pending_snapshots().

However, as qgroup_account_snapshot() updates commit_root, if there are
more than one pending snapshots under the same subvolume, it may fail
to update root item correctly.

So, I think we cannot remove run_delayed_item() here.

Thanks,
Tomohiro Misono

On 2018/02/13 23:16, Nikolay Borisov wrote:
> Currently calls to btrfs_run_delayed_inode items have been scattered
> around the transaction commit code with no real design argument when they 
> should be execute.
> 
> We have one call, after transaction writers go to 0. Then we have the delayed
> items running as part of creating a snapshot (once per pedning snapshot).
> Finally, delayed items are run once more _after_ snapshots have been created.
> All in all this amounts to 2 + N (N = number of snapshots slated for creation).
> In reality we need to flush delayed items once before
> create_pending_snapshots is called to ensure snapshosts are consistent with
> inode data and once after snapshots are created so that newly introduced inode 
> items during snapshot creation process are correctly persisted on disk. This 
> patch brings the total executions of run_delayed_items to just 2. 
> 
> This survived multiple xfstest runs. 
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>  1 file changed, 10 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 02bc1e6212e6..b32d3136f36c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	}
>  	btrfs_release_path(path);
>  
> -	/*
> -	 * pull in the delayed directory update
> -	 * and the delayed inode item
> -	 * otherwise we corrupt the FS during
> -	 * snapshot
> -	 */
> -	ret = btrfs_run_delayed_items(trans);
> -	if (ret) {	/* Transaction aborted */
> -		btrfs_abort_transaction(trans, ret);
> -		goto fail;
> -	}
> -
>  	record_root_in_trans(trans, root, 0);
>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	wait_event(cur_trans->writer_wait,
>  		   extwriter_counter_read(cur_trans) == 0);
>  
> -	/* some pending stuffs might be added after the previous flush. */
> -	ret = btrfs_run_delayed_items(trans);
> -	if (ret)
> -		goto cleanup_transaction;
>  
>  	btrfs_wait_delalloc_flush(fs_info);
>  
> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		ret = cur_trans->aborted;
>  		goto scrub_continue;
>  	}
> +
> +	/*
> +	 * Run delayed items because we need them to be consistent on-disk
> +	 * so that snapshots created in create_pending_snapshots don't corrupt
> +	 * the filesystem. At this point we are the one doing transaction
> +	 * commit and now one can come and introduce new delayed inode items
> +	 */
> +	ret = btrfs_run_delayed_items(trans);
> +	if (ret)
> +		goto scrub_continue;
>  	/*
>  	 * the reloc mutex makes sure that we stop
>  	 * the balancing code from coming in and moving
> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  	 */
>  	mutex_lock(&fs_info->reloc_mutex);
>  
> -	/*
> -	 * We needn't worry about the delayed items because we will
> -	 * deal with them in create_pending_snapshot(), which is the
> -	 * core function of the snapshot creation.
> -	 */
>  	ret = create_pending_snapshots(trans);
>  	if (ret) {
>  		mutex_unlock(&fs_info->reloc_mutex);
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 28, 2018, 8:35 a.m. UTC | #2
On 28.05.2018 11:21, Misono Tomohiro wrote:
> Hello,
> 
> I found current misc-next sometimes fails btrfs/152 when the number
> of cpu is >= 4 in my vm and git bisect points this commit.
> (btrfs/152 performs parallel send/receive.)
> 
> The failure is caused by _check_dmesg and sometimes also leads to inconsistent fs.
> 
> dmesg looks like:
> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
> [    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
> [    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
> [    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
> [    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
> [    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
> [    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
> [    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
> [    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
> [    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
> [    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
> [    6.662651] Call Trace:
> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
> [    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
> [    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
> [    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
> [    6.665116]  ? create_pending_snapshots+0x81/0xa0
> [    6.665597]  create_pending_snapshots+0x81/0xa0
> [    6.666068]  btrfs_commit_transaction+0x279/0x860
> [    6.666553]  ? start_transaction+0x98/0x3c0
> [    6.666989]  btrfs_mksubvol+0x589/0x5a0
> [    6.667390]  ? btrfs_opendir+0x39/0x70
> [    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
> [    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
> [    6.668827]  btrfs_ioctl+0x56d/0x25a0
> [    6.669205]  ? do_filp_open+0xaa/0x110
> [    6.669591]  ? do_vfs_ioctl+0x9f/0x610
> [    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
> [    6.670550]  do_vfs_ioctl+0x9f/0x610
> [    6.670922]  ksys_ioctl+0x6b/0x80
> [    6.671261]  __x64_sys_ioctl+0x11/0x20
> [    6.671650]  do_syscall_64+0x49/0x100
> [    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [    6.672545] RIP: 0033:0x7f3b61cbddd7
> [    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
> [    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
> [    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
> [    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
> [    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
> [    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
> [    6.679318] ---[ end trace b2378f91e69026c3 ]---
> 
> transaction.c:303:
>   WARN_ON(!force && root->commit_root != root->node);
> 
> And 152.out.full looks like:
>   *** fsck.btrfs output ***
>   root item for root 261, current bytenr 53870592, current gen 149, current
>   level 0, new bytenr 54067200, new gen 149, new level 0
>   Found 1 roots with an outdated root item.
>   Please run a filesystem check with the option --repair to fix them.
> 
> =====
> (My observation)
> 
> Call chain is:
>   btrfs_commit_transaction()
>    create_pending_snapshots()
>      create_pending_snapshot()
>        run_delayed_item() <- removed
>        qgroup_account_snapshot()
>    run_delayed_item() <- added
> 
> This commit removes run_delayed_item() in create_pending_snapshot() and
> instead calls it after create_pending_snapshots().
> 
> However, as qgroup_account_snapshot() updates commit_root, if there are
> more than one pending snapshots under the same subvolume, it may fail
> to update root item correctly.
> 
> So, I think we cannot remove run_delayed_item() here.

So btrfs_run_delayed_items is run once before going into
create_pending_snapshots. This means that when we go into
create_pending_snapshots the filesystem should be consistent from the
POV of delayed items (as per patch's changelog).

I somehow fail to reconcile how the removal of run delayed items (which
really pertain to inode state) have anything to do with an outdated
root?  IMHO you are hitting some race which my patch just made more
visible due to removing code in the critical path.

> 
> Thanks,
> Tomohiro Misono
> 
> On 2018/02/13 23:16, Nikolay Borisov wrote:
>> Currently calls to btrfs_run_delayed_inode items have been scattered
>> around the transaction commit code with no real design argument when they 
>> should be execute.
>>
>> We have one call, after transaction writers go to 0. Then we have the delayed
>> items running as part of creating a snapshot (once per pedning snapshot).
>> Finally, delayed items are run once more _after_ snapshots have been created.
>> All in all this amounts to 2 + N (N = number of snapshots slated for creation).
>> In reality we need to flush delayed items once before
>> create_pending_snapshots is called to ensure snapshosts are consistent with
>> inode data and once after snapshots are created so that newly introduced inode 
>> items during snapshot creation process are correctly persisted on disk. This 
>> patch brings the total executions of run_delayed_items to just 2. 
>>
>> This survived multiple xfstest runs. 
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>>  1 file changed, 10 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 02bc1e6212e6..b32d3136f36c 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>  	}
>>  	btrfs_release_path(path);
>>  
>> -	/*
>> -	 * pull in the delayed directory update
>> -	 * and the delayed inode item
>> -	 * otherwise we corrupt the FS during
>> -	 * snapshot
>> -	 */
>> -	ret = btrfs_run_delayed_items(trans);
>> -	if (ret) {	/* Transaction aborted */
>> -		btrfs_abort_transaction(trans, ret);
>> -		goto fail;
>> -	}
>> -
>>  	record_root_in_trans(trans, root, 0);
>>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  	wait_event(cur_trans->writer_wait,
>>  		   extwriter_counter_read(cur_trans) == 0);
>>  
>> -	/* some pending stuffs might be added after the previous flush. */
>> -	ret = btrfs_run_delayed_items(trans);
>> -	if (ret)
>> -		goto cleanup_transaction;
>>  
>>  	btrfs_wait_delalloc_flush(fs_info);
>>  
>> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  		ret = cur_trans->aborted;
>>  		goto scrub_continue;
>>  	}
>> +
>> +	/*
>> +	 * Run delayed items because we need them to be consistent on-disk
>> +	 * so that snapshots created in create_pending_snapshots don't corrupt
>> +	 * the filesystem. At this point we are the one doing transaction
>> +	 * commit and now one can come and introduce new delayed inode items
>> +	 */
>> +	ret = btrfs_run_delayed_items(trans);
>> +	if (ret)
>> +		goto scrub_continue;
>>  	/*
>>  	 * the reloc mutex makes sure that we stop
>>  	 * the balancing code from coming in and moving
>> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>  	 */
>>  	mutex_lock(&fs_info->reloc_mutex);
>>  
>> -	/*
>> -	 * We needn't worry about the delayed items because we will
>> -	 * deal with them in create_pending_snapshot(), which is the
>> -	 * core function of the snapshot creation.
>> -	 */
>>  	ret = create_pending_snapshots(trans);
>>  	if (ret) {
>>  		mutex_unlock(&fs_info->reloc_mutex);
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 28, 2018, 8:51 a.m. UTC | #3
On 28.05.2018 11:35, Nikolay Borisov wrote:
> 
> 
> On 28.05.2018 11:21, Misono Tomohiro wrote:
>> Hello,
>>
>> I found current misc-next sometimes fails btrfs/152 when the number
>> of cpu is >= 4 in my vm and git bisect points this commit.
>> (btrfs/152 performs parallel send/receive.)
>>
>> The failure is caused by _check_dmesg and sometimes also leads to inconsistent fs.
>>
>> dmesg looks like:
>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
>> [    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
>> [    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>> [    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>> [    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
>> [    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
>> [    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
>> [    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
>> [    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
>> [    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
>> [    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
>> [    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
>> [    6.662651] Call Trace:
>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
>> [    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>> [    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>> [    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>> [    6.665116]  ? create_pending_snapshots+0x81/0xa0
>> [    6.665597]  create_pending_snapshots+0x81/0xa0
>> [    6.666068]  btrfs_commit_transaction+0x279/0x860
>> [    6.666553]  ? start_transaction+0x98/0x3c0
>> [    6.666989]  btrfs_mksubvol+0x589/0x5a0
>> [    6.667390]  ? btrfs_opendir+0x39/0x70
>> [    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>> [    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>> [    6.668827]  btrfs_ioctl+0x56d/0x25a0
>> [    6.669205]  ? do_filp_open+0xaa/0x110
>> [    6.669591]  ? do_vfs_ioctl+0x9f/0x610
>> [    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>> [    6.670550]  do_vfs_ioctl+0x9f/0x610
>> [    6.670922]  ksys_ioctl+0x6b/0x80
>> [    6.671261]  __x64_sys_ioctl+0x11/0x20
>> [    6.671650]  do_syscall_64+0x49/0x100
>> [    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [    6.672545] RIP: 0033:0x7f3b61cbddd7
>> [    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
>> [    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
>> [    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
>> [    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
>> [    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
>> [    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>> [    6.679318] ---[ end trace b2378f91e69026c3 ]---
>>
>> transaction.c:303:
>>   WARN_ON(!force && root->commit_root != root->node);
>>
>> And 152.out.full looks like:
>>   *** fsck.btrfs output ***
>>   root item for root 261, current bytenr 53870592, current gen 149, current
>>   level 0, new bytenr 54067200, new gen 149, new level 0
>>   Found 1 roots with an outdated root item.
>>   Please run a filesystem check with the option --repair to fix them.
>>
>> =====
>> (My observation)
>>
>> Call chain is:
>>   btrfs_commit_transaction()
>>    create_pending_snapshots()
>>      create_pending_snapshot()
>>        run_delayed_item() <- removed
>>        qgroup_account_snapshot()
>>    run_delayed_item() <- added
>>
>> This commit removes run_delayed_item() in create_pending_snapshot() and
>> instead calls it after create_pending_snapshots().
>>
>> However, as qgroup_account_snapshot() updates commit_root, if there are
>> more than one pending snapshots under the same subvolume, it may fail
>> to update root item correctly.
>>
>> So, I think we cannot remove run_delayed_item() here.
> 
> So btrfs_run_delayed_items is run once before going into
> create_pending_snapshots. This means that when we go into
> create_pending_snapshots the filesystem should be consistent from the
> POV of delayed items (as per patch's changelog).
> 
> I somehow fail to reconcile how the removal of run delayed items (which
> really pertain to inode state) have anything to do with an outdated
> root?  IMHO you are hitting some race which my patch just made more
> visible due to removing code in the critical path.

I think my claim can be easily tested by re-instating the
run_delayed_Refs call in create pending snapshot and hacking the
function to return the number of delayed items processed. My expectation
(on which my patch is based) is for this function to always return 0 in
create_pending_snapshot. Will try to test this soon and report back.

In the mean time, I'd like to hear Qu's opinion on your reported
inconsistency since it seems to be the canonical problem he tried to
initially fix in :

4d31778aa2fa ("btrfs: qgroup: Fix root item corruption when multiple
same source snapshots are created with quota enabled")



> 
>>
>> Thanks,
>> Tomohiro Misono
>>
>> On 2018/02/13 23:16, Nikolay Borisov wrote:
>>> Currently calls to btrfs_run_delayed_inode items have been scattered
>>> around the transaction commit code with no real design argument when they 
>>> should be execute.
>>>
>>> We have one call, after transaction writers go to 0. Then we have the delayed
>>> items running as part of creating a snapshot (once per pedning snapshot).
>>> Finally, delayed items are run once more _after_ snapshots have been created.
>>> All in all this amounts to 2 + N (N = number of snapshots slated for creation).
>>> In reality we need to flush delayed items once before
>>> create_pending_snapshots is called to ensure snapshosts are consistent with
>>> inode data and once after snapshots are created so that newly introduced inode 
>>> items during snapshot creation process are correctly persisted on disk. This 
>>> patch brings the total executions of run_delayed_items to just 2. 
>>>
>>> This survived multiple xfstest runs. 
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> ---
>>>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>>>  1 file changed, 10 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 02bc1e6212e6..b32d3136f36c 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>>  	}
>>>  	btrfs_release_path(path);
>>>  
>>> -	/*
>>> -	 * pull in the delayed directory update
>>> -	 * and the delayed inode item
>>> -	 * otherwise we corrupt the FS during
>>> -	 * snapshot
>>> -	 */
>>> -	ret = btrfs_run_delayed_items(trans);
>>> -	if (ret) {	/* Transaction aborted */
>>> -		btrfs_abort_transaction(trans, ret);
>>> -		goto fail;
>>> -	}
>>> -
>>>  	record_root_in_trans(trans, root, 0);
>>>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>>> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>  	wait_event(cur_trans->writer_wait,
>>>  		   extwriter_counter_read(cur_trans) == 0);
>>>  
>>> -	/* some pending stuffs might be added after the previous flush. */
>>> -	ret = btrfs_run_delayed_items(trans);
>>> -	if (ret)
>>> -		goto cleanup_transaction;
>>>  
>>>  	btrfs_wait_delalloc_flush(fs_info);
>>>  
>>> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>  		ret = cur_trans->aborted;
>>>  		goto scrub_continue;
>>>  	}
>>> +
>>> +	/*
>>> +	 * Run delayed items because we need them to be consistent on-disk
>>> +	 * so that snapshots created in create_pending_snapshots don't corrupt
>>> +	 * the filesystem. At this point we are the one doing transaction
>>> +	 * commit and now one can come and introduce new delayed inode items
>>> +	 */
>>> +	ret = btrfs_run_delayed_items(trans);
>>> +	if (ret)
>>> +		goto scrub_continue;
>>>  	/*
>>>  	 * the reloc mutex makes sure that we stop
>>>  	 * the balancing code from coming in and moving
>>> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>  	 */
>>>  	mutex_lock(&fs_info->reloc_mutex);
>>>  
>>> -	/*
>>> -	 * We needn't worry about the delayed items because we will
>>> -	 * deal with them in create_pending_snapshot(), which is the
>>> -	 * core function of the snapshot creation.
>>> -	 */
>>>  	ret = create_pending_snapshots(trans);
>>>  	if (ret) {
>>>  		mutex_unlock(&fs_info->reloc_mutex);
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo May 28, 2018, 11:51 a.m. UTC | #4
On 2018年05月28日 16:51, Nikolay Borisov wrote:
> 
> 
> On 28.05.2018 11:35, Nikolay Borisov wrote:
>>
>>
>> On 28.05.2018 11:21, Misono Tomohiro wrote:
>>> Hello,
>>>
>>> I found current misc-next sometimes fails btrfs/152 when the number
>>> of cpu is >= 4 in my vm and git bisect points this commit.
>>> (btrfs/152 performs parallel send/receive.)
>>>
>>> The failure is caused by _check_dmesg and sometimes also leads to inconsistent fs.
>>>
>>> dmesg looks like:
>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
>>> [    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
>>> [    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>>> [    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>>> [    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
>>> [    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
>>> [    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
>>> [    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
>>> [    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
>>> [    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
>>> [    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
>>> [    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
>>> [    6.662651] Call Trace:
>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
>>> [    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>>> [    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>>> [    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>>> [    6.665116]  ? create_pending_snapshots+0x81/0xa0
>>> [    6.665597]  create_pending_snapshots+0x81/0xa0
>>> [    6.666068]  btrfs_commit_transaction+0x279/0x860
>>> [    6.666553]  ? start_transaction+0x98/0x3c0
>>> [    6.666989]  btrfs_mksubvol+0x589/0x5a0
>>> [    6.667390]  ? btrfs_opendir+0x39/0x70
>>> [    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>>> [    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>>> [    6.668827]  btrfs_ioctl+0x56d/0x25a0
>>> [    6.669205]  ? do_filp_open+0xaa/0x110
>>> [    6.669591]  ? do_vfs_ioctl+0x9f/0x610
>>> [    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>>> [    6.670550]  do_vfs_ioctl+0x9f/0x610
>>> [    6.670922]  ksys_ioctl+0x6b/0x80
>>> [    6.671261]  __x64_sys_ioctl+0x11/0x20
>>> [    6.671650]  do_syscall_64+0x49/0x100
>>> [    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [    6.672545] RIP: 0033:0x7f3b61cbddd7
>>> [    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
>>> [    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
>>> [    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
>>> [    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
>>> [    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
>>> [    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>>> [    6.679318] ---[ end trace b2378f91e69026c3 ]---
>>>
>>> transaction.c:303:
>>>   WARN_ON(!force && root->commit_root != root->node);
>>>
>>> And 152.out.full looks like:
>>>   *** fsck.btrfs output ***
>>>   root item for root 261, current bytenr 53870592, current gen 149, current
>>>   level 0, new bytenr 54067200, new gen 149, new level 0
>>>   Found 1 roots with an outdated root item.
>>>   Please run a filesystem check with the option --repair to fix them.
>>>
>>> =====
>>> (My observation)
>>>
>>> Call chain is:
>>>   btrfs_commit_transaction()
>>>    create_pending_snapshots()
>>>      create_pending_snapshot()
>>>        run_delayed_item() <- removed
>>>        qgroup_account_snapshot()
>>>    run_delayed_item() <- added
>>>
>>> This commit removes run_delayed_item() in create_pending_snapshot() and
>>> instead calls it after create_pending_snapshots().
>>>
>>> However, as qgroup_account_snapshot() updates commit_root, if there are
>>> more than one pending snapshots under the same subvolume, it may fail
>>> to update root item correctly.
>>>
>>> So, I think we cannot remove run_delayed_item() here.
>>
>> So btrfs_run_delayed_items is run once before going into
>> create_pending_snapshots. This means that when we go into
>> create_pending_snapshots the filesystem should be consistent from the
>> POV of delayed items (as per patch's changelog).
>>
>> I somehow fail to reconcile how the removal of run delayed items (which
>> really pertain to inode state) have anything to do with an outdated
>> root?  IMHO you are hitting some race which my patch just made more
>> visible due to removing code in the critical path.

Not pretty sure about delayed items.
But if delayed items can cause delayed refs, then we need them run
before qgroup accounting.

> 
> I think my claim can be easily tested by re-instating the
> run_delayed_Refs call in create pending snapshot and hacking the
> function to return the number of delayed items processed. My expectation
> (on which my patch is based) is for this function to always return 0 in
> create_pending_snapshot. Will try to test this soon and report back.
> 
> In the mean time, I'd like to hear Qu's opinion on your reported
> inconsistency since it seems to be the canonical problem he tried to
> initially fix in :
> 
> 4d31778aa2fa ("btrfs: qgroup: Fix root item corruption when multiple
> same source snapshots are created with quota enabled")

Since it's mostly related to delayed-item, I need some time to get
familiar with it before any really useful advice.

Thanks,
Qu

> 
> 
> 
>>
>>>
>>> Thanks,
>>> Tomohiro Misono
>>>
>>> On 2018/02/13 23:16, Nikolay Borisov wrote:
>>>> Currently calls to btrfs_run_delayed_inode items have been scattered
>>>> around the transaction commit code with no real design argument when they 
>>>> should be execute.
>>>>
>>>> We have one call, after transaction writers go to 0. Then we have the delayed
>>>> items running as part of creating a snapshot (once per pedning snapshot).
>>>> Finally, delayed items are run once more _after_ snapshots have been created.
>>>> All in all this amounts to 2 + N (N = number of snapshots slated for creation).
>>>> In reality we need to flush delayed items once before
>>>> create_pending_snapshots is called to ensure snapshosts are consistent with
>>>> inode data and once after snapshots are created so that newly introduced inode 
>>>> items during snapshot creation process are correctly persisted on disk. This 
>>>> patch brings the total executions of run_delayed_items to just 2. 
>>>>
>>>> This survived multiple xfstest runs. 
>>>>
>>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>>> ---
>>>>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>>>>  1 file changed, 10 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>> index 02bc1e6212e6..b32d3136f36c 100644
>>>> --- a/fs/btrfs/transaction.c
>>>> +++ b/fs/btrfs/transaction.c
>>>> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>>>  	}
>>>>  	btrfs_release_path(path);
>>>>  
>>>> -	/*
>>>> -	 * pull in the delayed directory update
>>>> -	 * and the delayed inode item
>>>> -	 * otherwise we corrupt the FS during
>>>> -	 * snapshot
>>>> -	 */
>>>> -	ret = btrfs_run_delayed_items(trans);
>>>> -	if (ret) {	/* Transaction aborted */
>>>> -		btrfs_abort_transaction(trans, ret);
>>>> -		goto fail;
>>>> -	}
>>>> -
>>>>  	record_root_in_trans(trans, root, 0);
>>>>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>>>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>>>> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>  	wait_event(cur_trans->writer_wait,
>>>>  		   extwriter_counter_read(cur_trans) == 0);
>>>>  
>>>> -	/* some pending stuffs might be added after the previous flush. */
>>>> -	ret = btrfs_run_delayed_items(trans);
>>>> -	if (ret)
>>>> -		goto cleanup_transaction;
>>>>  
>>>>  	btrfs_wait_delalloc_flush(fs_info);
>>>>  
>>>> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>  		ret = cur_trans->aborted;
>>>>  		goto scrub_continue;
>>>>  	}
>>>> +
>>>> +	/*
>>>> +	 * Run delayed items because we need them to be consistent on-disk
>>>> +	 * so that snapshots created in create_pending_snapshots don't corrupt
>>>> +	 * the filesystem. At this point we are the one doing transaction
>>>> +	 * commit and now one can come and introduce new delayed inode items
>>>> +	 */
>>>> +	ret = btrfs_run_delayed_items(trans);
>>>> +	if (ret)
>>>> +		goto scrub_continue;
>>>>  	/*
>>>>  	 * the reloc mutex makes sure that we stop
>>>>  	 * the balancing code from coming in and moving
>>>> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>  	 */
>>>>  	mutex_lock(&fs_info->reloc_mutex);
>>>>  
>>>> -	/*
>>>> -	 * We needn't worry about the delayed items because we will
>>>> -	 * deal with them in create_pending_snapshot(), which is the
>>>> -	 * core function of the snapshot creation.
>>>> -	 */
>>>>  	ret = create_pending_snapshots(trans);
>>>>  	if (ret) {
>>>>  		mutex_unlock(&fs_info->reloc_mutex);
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 28, 2018, 12:26 p.m. UTC | #5
On 28.05.2018 14:51, Qu Wenruo wrote:
> 
> 
> On 2018年05月28日 16:51, Nikolay Borisov wrote:
>>
>>
>> On 28.05.2018 11:35, Nikolay Borisov wrote:
>>>
>>>
>>> On 28.05.2018 11:21, Misono Tomohiro wrote:
>>>> Hello,
>>>>
>>>> I found current misc-next sometimes fails btrfs/152 when the number
>>>> of cpu is >= 4 in my vm and git bisect points this commit.
>>>> (btrfs/152 performs parallel send/receive.)
>>>>
>>>> The failure is caused by _check_dmesg and sometimes also leads to inconsistent fs.
>>>>
>>>> dmesg looks like:
>>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
>>>> [    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
>>>> [    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>>>> [    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>>>> [    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
>>>> [    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
>>>> [    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
>>>> [    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
>>>> [    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
>>>> [    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
>>>> [    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
>>>> [    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
>>>> [    6.662651] Call Trace:
>>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
>>>> [    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>>>> [    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>>>> [    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>>>> [    6.665116]  ? create_pending_snapshots+0x81/0xa0
>>>> [    6.665597]  create_pending_snapshots+0x81/0xa0
>>>> [    6.666068]  btrfs_commit_transaction+0x279/0x860
>>>> [    6.666553]  ? start_transaction+0x98/0x3c0
>>>> [    6.666989]  btrfs_mksubvol+0x589/0x5a0
>>>> [    6.667390]  ? btrfs_opendir+0x39/0x70
>>>> [    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>>>> [    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>>>> [    6.668827]  btrfs_ioctl+0x56d/0x25a0
>>>> [    6.669205]  ? do_filp_open+0xaa/0x110
>>>> [    6.669591]  ? do_vfs_ioctl+0x9f/0x610
>>>> [    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>>>> [    6.670550]  do_vfs_ioctl+0x9f/0x610
>>>> [    6.670922]  ksys_ioctl+0x6b/0x80
>>>> [    6.671261]  __x64_sys_ioctl+0x11/0x20
>>>> [    6.671650]  do_syscall_64+0x49/0x100
>>>> [    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [    6.672545] RIP: 0033:0x7f3b61cbddd7
>>>> [    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> [    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
>>>> [    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
>>>> [    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
>>>> [    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
>>>> [    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
>>>> [    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>>>> [    6.679318] ---[ end trace b2378f91e69026c3 ]---
>>>>
>>>> transaction.c:303:
>>>>   WARN_ON(!force && root->commit_root != root->node);
>>>>
>>>> And 152.out.full looks like:
>>>>   *** fsck.btrfs output ***
>>>>   root item for root 261, current bytenr 53870592, current gen 149, current
>>>>   level 0, new bytenr 54067200, new gen 149, new level 0
>>>>   Found 1 roots with an outdated root item.
>>>>   Please run a filesystem check with the option --repair to fix them.
>>>>
>>>> =====
>>>> (My observation)
>>>>
>>>> Call chain is:
>>>>   btrfs_commit_transaction()
>>>>    create_pending_snapshots()
>>>>      create_pending_snapshot()
>>>>        run_delayed_item() <- removed
>>>>        qgroup_account_snapshot()
>>>>    run_delayed_item() <- added
>>>>
>>>> This commit removes run_delayed_item() in create_pending_snapshot() and
>>>> instead calls it after create_pending_snapshots().
>>>>
>>>> However, as qgroup_account_snapshot() updates commit_root, if there are
>>>> more than one pending snapshots under the same subvolume, it may fail
>>>> to update root item correctly.
>>>>
>>>> So, I think we cannot remove run_delayed_item() here.
>>>
>>> So btrfs_run_delayed_items is run once before going into
>>> create_pending_snapshots. This means that when we go into
>>> create_pending_snapshots the filesystem should be consistent from the
>>> POV of delayed items (as per patch's changelog).
>>>
>>> I somehow fail to reconcile how the removal of run delayed items (which
>>> really pertain to inode state) have anything to do with an outdated
>>> root?  IMHO you are hitting some race which my patch just made more
>>> visible due to removing code in the critical path.
> 
> Not pretty sure about delayed items.
> But if delayed items can cause delayed refs, then we need them run
> before qgroup accounting.
> 
>>
>> I think my claim can be easily tested by re-instating the
>> run_delayed_Refs call in create pending snapshot and hacking the
>> function to return the number of delayed items processed. My expectation
>> (on which my patch is based) is for this function to always return 0 in
>> create_pending_snapshot. Will try to test this soon and report back.
>>
>> In the mean time, I'd like to hear Qu's opinion on your reported
>> inconsistency since it seems to be the canonical problem he tried to
>> initially fix in :
>>
>> 4d31778aa2fa ("btrfs: qgroup: Fix root item corruption when multiple
>> same source snapshots are created with quota enabled")
> 
> Since it's mostly related to delayed-item, I need some time to get
> familiar with it before any really useful advice.

You can use this
https://github.com/btrfs/btrfs-dev-docs/blob/master/delayed-inode.txt as
a starting point, to greatly reduce the learning curve.

On a quick look it seems delayed-inode infrastructure can indeed cause
delayed refs, for example :
__btrfs_run_delayed_items
 __btrfs_commit_inode_delayed_items
   btrfs_insert_delayed_items
     btrfs_insert_delayed_item
       btrfs_insert_empty_item <- goes into the allocator which creates
delayed refs

Or
__btrfs_commit_inode_delayed_items
 btrfs_update_delayed_inode
   __btrfs_update_delayed_inode
    btrfs_lookup_inode  <-- cals btrfs_search_slot with cow so will
create delayed refs


So the answer to your question is "yes", in which case indeed this patch
will have to be reverted.

> 
> Thanks,
> Qu
> 
>>
>>
>>
>>>
>>>>
>>>> Thanks,
>>>> Tomohiro Misono
>>>>
>>>> On 2018/02/13 23:16, Nikolay Borisov wrote:
>>>>> Currently calls to btrfs_run_delayed_inode items have been scattered
>>>>> around the transaction commit code with no real design argument when they 
>>>>> should be execute.
>>>>>
>>>>> We have one call, after transaction writers go to 0. Then we have the delayed
>>>>> items running as part of creating a snapshot (once per pedning snapshot).
>>>>> Finally, delayed items are run once more _after_ snapshots have been created.
>>>>> All in all this amounts to 2 + N (N = number of snapshots slated for creation).
>>>>> In reality we need to flush delayed items once before
>>>>> create_pending_snapshots is called to ensure snapshosts are consistent with
>>>>> inode data and once after snapshots are created so that newly introduced inode 
>>>>> items during snapshot creation process are correctly persisted on disk. This 
>>>>> patch brings the total executions of run_delayed_items to just 2. 
>>>>>
>>>>> This survived multiple xfstest runs. 
>>>>>
>>>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>>>> ---
>>>>>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>>>>>  1 file changed, 10 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index 02bc1e6212e6..b32d3136f36c 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>>>>  	}
>>>>>  	btrfs_release_path(path);
>>>>>  
>>>>> -	/*
>>>>> -	 * pull in the delayed directory update
>>>>> -	 * and the delayed inode item
>>>>> -	 * otherwise we corrupt the FS during
>>>>> -	 * snapshot
>>>>> -	 */
>>>>> -	ret = btrfs_run_delayed_items(trans);
>>>>> -	if (ret) {	/* Transaction aborted */
>>>>> -		btrfs_abort_transaction(trans, ret);
>>>>> -		goto fail;
>>>>> -	}
>>>>> -
>>>>>  	record_root_in_trans(trans, root, 0);
>>>>>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>>>>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>>>>> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  	wait_event(cur_trans->writer_wait,
>>>>>  		   extwriter_counter_read(cur_trans) == 0);
>>>>>  
>>>>> -	/* some pending stuffs might be added after the previous flush. */
>>>>> -	ret = btrfs_run_delayed_items(trans);
>>>>> -	if (ret)
>>>>> -		goto cleanup_transaction;
>>>>>  
>>>>>  	btrfs_wait_delalloc_flush(fs_info);
>>>>>  
>>>>> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  		ret = cur_trans->aborted;
>>>>>  		goto scrub_continue;
>>>>>  	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Run delayed items because we need them to be consistent on-disk
>>>>> +	 * so that snapshots created in create_pending_snapshots don't corrupt
>>>>> +	 * the filesystem. At this point we are the one doing transaction
>>>>> +	 * commit and now one can come and introduce new delayed inode items
>>>>> +	 */
>>>>> +	ret = btrfs_run_delayed_items(trans);
>>>>> +	if (ret)
>>>>> +		goto scrub_continue;
>>>>>  	/*
>>>>>  	 * the reloc mutex makes sure that we stop
>>>>>  	 * the balancing code from coming in and moving
>>>>> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  	 */
>>>>>  	mutex_lock(&fs_info->reloc_mutex);
>>>>>  
>>>>> -	/*
>>>>> -	 * We needn't worry about the delayed items because we will
>>>>> -	 * deal with them in create_pending_snapshot(), which is the
>>>>> -	 * core function of the snapshot creation.
>>>>> -	 */
>>>>>  	ret = create_pending_snapshots(trans);
>>>>>  	if (ret) {
>>>>>  		mutex_unlock(&fs_info->reloc_mutex);
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 28, 2018, 4:12 p.m. UTC | #6
On Mon, May 28, 2018 at 03:26:58PM +0300, Nikolay Borisov wrote:
> >>>> dmesg looks like:
> >>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0

Found in the logs. I reported it to the patch that added the assertion
but I did not suspect your patches.

> >>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00

> So the answer to your question is "yes", in which case indeed this patch
> will have to be reverted.

Both patches removed from misc-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 30, 2018, 9:32 p.m. UTC | #7
On 28.05.2018 19:12, David Sterba wrote:
> On Mon, May 28, 2018 at 03:26:58PM +0300, Nikolay Borisov wrote:
>>>>>> dmesg looks like:
>>>>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
> 
> Found in the logs. I reported it to the patch that added the assertion
> but I did not suspect your patches.
> 
>>>>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
> 
>> So the answer to your question is "yes", in which case indeed this patch
>> will have to be reverted.
> 
> Both patches removed from misc-next.

I think 1/2 is actually safe, it's removing something from the
transaction path without affecting the overall logic which should be a
small win.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 31, 2018, 9:29 a.m. UTC | #8
On Thu, May 31, 2018 at 12:32:03AM +0300, Nikolay Borisov wrote:
> On 28.05.2018 19:12, David Sterba wrote:
> > On Mon, May 28, 2018 at 03:26:58PM +0300, Nikolay Borisov wrote:
> >>>>>> dmesg looks like:
> >>>>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
> > 
> > Found in the logs. I reported it to the patch that added the assertion
> > but I did not suspect your patches.
> > 
> >>>>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
> > 
> >> So the answer to your question is "yes", in which case indeed this patch
> >> will have to be reverted.
> > 
> > Both patches removed from misc-next.
> 
> I think 1/2 is actually safe, it's removing something from the
> transaction path without affecting the overall logic which should be a
> small win.

Yeah, but right now we're too close to merge window, please resend them
for the next cycle.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 02bc1e6212e6..b32d3136f36c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1524,18 +1524,6 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
-	/*
-	 * pull in the delayed directory update
-	 * and the delayed inode item
-	 * otherwise we corrupt the FS during
-	 * snapshot
-	 */
-	ret = btrfs_run_delayed_items(trans);
-	if (ret) {	/* Transaction aborted */
-		btrfs_abort_transaction(trans, ret);
-		goto fail;
-	}
-
 	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
@@ -2069,10 +2057,6 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	wait_event(cur_trans->writer_wait,
 		   extwriter_counter_read(cur_trans) == 0);
 
-	/* some pending stuffs might be added after the previous flush. */
-	ret = btrfs_run_delayed_items(trans);
-	if (ret)
-		goto cleanup_transaction;
 
 	btrfs_wait_delalloc_flush(fs_info);
 
@@ -2095,6 +2079,16 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		ret = cur_trans->aborted;
 		goto scrub_continue;
 	}
+
+	/*
+	 * Run delayed items because we need them to be consistent on-disk
+	 * so that snapshots created in create_pending_snapshots don't corrupt
+	 * the filesystem. At this point we are the one doing transaction
+	 * commit and now one can come and introduce new delayed inode items
+	 */
+	ret = btrfs_run_delayed_items(trans);
+	if (ret)
+		goto scrub_continue;
 	/*
 	 * the reloc mutex makes sure that we stop
 	 * the balancing code from coming in and moving
@@ -2102,11 +2096,6 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	 */
 	mutex_lock(&fs_info->reloc_mutex);
 
-	/*
-	 * We needn't worry about the delayed items because we will
-	 * deal with them in create_pending_snapshot(), which is the
-	 * core function of the snapshot creation.
-	 */
 	ret = create_pending_snapshots(trans);
 	if (ret) {
 		mutex_unlock(&fs_info->reloc_mutex);